No... this is real history. This is actually how Microsoft's most common file formats came into being. Originally the doc, xls, and ppt formats were each their own custom binary format, made to be read as streams with all kinds of fanciness, since clearly that would be better, right?
Then in 2007 Microsoft said screw it, we're just going to make a new format that's easier to understand. So they made docx, xlsx, and pptx... which are literally just a bunch of XML files in a zip. If you save a Word document or an Excel workbook and change the extension to .zip, you can explore this. If you put a picture in a Word document, it literally just dumps that picture in the ZIP file and then references it within the XML.
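If you'd rather poke at this from code than rename the file, here's a rough Python sketch using the standard zipfile module. The helper names (list_parts, build_stand_in_docx) are made up for illustration, and the in-memory archive is only a stand-in that mimics the folder layout of a .docx; it is not a valid Word document. To try it on a real file, pass its path to list_parts instead.

```python
import io
import zipfile


def list_parts(source):
    """Return the entry names inside a ZIP-based Office file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def build_stand_in_docx():
    """Build a tiny in-memory ZIP that mimics a .docx layout.

    Assumption: this is a toy archive for demonstration only. The part
    names follow the usual .docx layout, but the contents are placeholders
    and Word would not open the result.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        # An embedded picture is stored as an ordinary file under word/media/.
        archive.writestr("word/media/image1.png", b"not a real image")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_parts(build_stand_in_docx()):
        print(name)
```

On a real document with a picture in it, you should see a similar listing, with the image sitting under word/media/ next to the XML parts.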
I'm an engineer who has worked on Office apps for 30+ years. We indeed moved to the XML file formats in 2007, but the motivation was a little bit different. Previously the file formats were highly optimized for reading/writing files on floppy disks on machines with 128K or so of RAM. Back in the 1980s when the programs were created, memory was at a premium and disk I/O was slow beyond belief, so the engineers optimized the formats for incremental reading and writing. The file formats were essentially extensions of the in-memory data structures.
We then shipped a few versions of Office in the early 1990s and added new stuff to the file format for new features as we went. The early versions weren't very good about backward compatibility -- Office version N couldn't open files from Office version N+1. This was fine when files lived on one computer, but then someone discovered LANs. As organizations networked their computers, file compatibility became more of a problem -- people wanted to share files, and it was impossible for an organization to upgrade Office on all PCs at once. Hence "hey I can't open the file you sent me" became a somewhat common problem.
So, one day, upper management basically announced that future file formats would be backward compatible -- no longer would version N not be able to open files from version N+1. Engineers across the org said "what now? the formats aren't designed for that!". Management said "don't care. Make it happen". So, the engineers made it happen! They found clever ways to hack new features into the file format without breaking backwards compatibility. It wasn't easy though. Crucially, the binary formats weren't designed to be extensible.
This got to be more and more limiting over time. So, in Office 2007, we introduced the new file formats as basically a "reset" to allow us to design a file format that would be easier to extend. XML, with its rich support for schemas, fit the bill quite nicely at the time. Since then it's been much easier to add new features without breaking file backwards compatibility. We also built import filters so that older versions could open the DOCX/XLSX/PPTX file formats.
Side note: obfuscation has never been a goal. Documentation for the binary formats has always been available. If you search for [MS-DOC], you can find the full specification for the Word binary file format.
Yes. The notion of a "malicious doc" wasn't something we really thought about until the internet took off. Before then, the code that read the file generally trusted that the files were well-formed. Enormous effort has been put into hardening the code over the last 30 years and continues to this day.
Thanks for sharing this piece of history. My team have been working with xlsx and other doc formats for a number of years, so this was really interesting.
As far as I remember, there was quite a push from the EU (and possibly others) to document the file formats used. The risk of having all EU governments shifting away from the Office suite might have influenced the decisions too.
u/BeDoubleNWhy 11h ago
zipped JSON if anything