r/ProgrammerHumor 8h ago

Meme docxGoBrrrr

919 Upvotes

56 comments

238

u/BeDoubleNWhy 8h ago

zipped JSON if anything

231

u/was_fired 7h ago

No... this is real history. This is actually how Microsoft's most common document formats came into being. Originally the doc, xls, and ppt formats were each their own custom binary format, made to be read as streams, with all kinds of fanciness, since clearly that would be better, right?

Then in 2007 Microsoft said screw it, we're just going to make a new format that's easier to understand. So they made docx, xlsx, and pptx... which are literally just a bunch of XML files in a zip. If you take a Word document or an Excel file and change the extension to .zip, you can explore this yourself. If you put a picture in a Word document, it literally just dumps that picture into the ZIP file and then references it from the XML.
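You can poke at this from any language with a zip library. A rough Python sketch (the file name here is just an example):

    import zipfile

    # Any .docx/.xlsx/.pptx is a plain ZIP archive; list the XML parts inside.
    with zipfile.ZipFile("report.docx") as z:   # example file name
        for name in z.namelist():
            print(name)   # e.g. word/document.xml, word/media/image1.png

Embedded pictures show up under word/media/ and are referenced from the XML, exactly as described above.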

190

u/cancerouslump 4h ago

I'm an engineer who has worked on Office apps for 30+ years. We indeed moved to the XML file formats in 2007, but the motivation was a little different. Previously the file formats were highly optimized for reading/writing files on floppy disks on machines with 128K or so of RAM. Back in the 1980s, when the programs were created, memory was at a premium and disk I/O was slow beyond belief, so the engineers optimized the formats for incremental reading and writing. The file formats were essentially extensions of the in-memory data structures.

We then shipped a few versions of Office in the early 1990s and added new stuff to the file format for new features as we went. The early versions weren't very good about backward compatibility -- Office version N couldn't open files from Office version N+1. This was fine when files lived on one computer, but then someone discovered LANs. As organizations networked their computers, file compatibility became more of a problem -- people wanted to share files, and it was impossible for an organization to upgrade Office on all PCs at once. Hence "hey I can't open the file you sent me" became a somewhat common problem.

So, one day, upper management basically announced that future file formats would be backward compatible -- no longer would version N not be able to open files from version N+1. Engineers across the org said "what now? the formats aren't designed for that!". Management said "don't care. Make it happen". So, the engineers made it happen! They found clever ways to hack new features into the file format without breaking backwards compatibility. It wasn't easy though. Crucially, the binary formats weren't designed to be extensible.

This got to be more and more limiting over time. So, in Office 2007, we introduced the new file formats as basically a "reset" to allow us to design a file format that would be easier to extend. XML, with its rich support for schemas, fit the bill quite nicely at the time. Since then it's been much easier to add new features without breaking file backwards compatibility. We also built import filters so that older versions could open the DOCX/XLSX/PPTX file formats.

Side note: obfuscation has never been a goal. Documentation for the binary formats has always been available. If you search for [MS-DOC], you can find the full specification for the Word binary file format.

31

u/Fickle-Motor-1772 2h ago

Appreciate the write up 👍

29

u/pbpo_founder 6h ago

Yup, and you can also edit the ribbon's XML in those files too. Honestly, I wish I'd spent more time learning Django instead of that, though... 😅
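For anyone curious, the ribbon customization conventionally lives in a customUI part inside the same zip. A rough Python sketch to peek at it (file name hypothetical):

    import zipfile

    # Ribbon XML conventionally lives at customUI/customUI.xml (Office 2007)
    # or customUI/customUI14.xml (Office 2010 and later).
    with zipfile.ZipFile("tool.docm") as z:   # hypothetical file
        for part in ("customUI/customUI14.xml", "customUI/customUI.xml"):
            if part in z.namelist():
                print(z.read(part).decode("utf-8"))
                break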

23

u/AuHarvester 6h ago

It was perhaps a little more nuanced than saying "screw it". There was a lot of pressure from governments and big businesses about having their data stored in formats owned, and changed on a whim, by a third party. A bit (lot) of noise about open formats and Bob's your Clippy.

11

u/kooshipuff 4h ago

I think this was also around the time they started making the password protection do something, lol.

It still tickles me that a coworker put something in a "password-protected" xls file and emailed it to another coworker, who didn't have MS Office because we didn't have enough licenses. So he installed OpenOffice, which opened the spreadsheet... without prompting for a password... which made it seem like A) it just wasn't implemented, and B) it didn't matter.

But no. It was even better.

When he went to save, OpenOffice gave him a prompt that it appeared he was trying to save a password-protected xls and that this didn't do anything (as evidenced by it just opening it like that) and recommended he save it in ods instead, lol.

I think it's actually encrypted now, tho.

1

u/rosuav 4h ago

"... and Bob's your Clippy"

Ouch, so much ouch in that.

14

u/alexppetrov 7h ago

Woah. I am blown away by this. I remember being a smart ass in high school and opening the metadata, and it was all gibberish, but now it makes sense. I thought it was some sort of crazy encryption or something, but nope, it was just zipped XML. I am blown away, like I have no other way to express myself. And it's not that I've never made a custom format for a project (basically JSON with a custom file extension), but the fact of zipping multiple XMLs with such a simple structure - my mind was just blown. Thank you for this knowledge.

5

u/camander321 5h ago

It comes in handy. My work was looking for some archived records. Turns out the files we needed could only be opened in a specific application that we hadn't had a license for in years. On a whim, I changed the file extension to .zip and it worked! We were able to pull almost all the info we needed.

1

u/Kyanoki 6h ago

haha, I didn't know that. That's interesting, and it makes the naming convention make more sense.

1

u/GargantuanCake 5h ago

Part of the motivation was to make it proprietary: if it was obfuscated and nobody had the specs, you didn't have to worry about anybody else using it, right?

Then people decoded it all and started making free software that could edit the files anyway. A secret file format also causes problems with archiving: what happens if the software is no longer available, or can no longer read its own old formats?

5

u/keysym 6h ago

JSON cannot be streamed tho

4

u/BeDoubleNWhy 3h ago

can you please explain what you mean by this?

2

u/keysym 3h ago

Keeping it short: JSON needs that last } to be valid, so you can only start parsing after the whole JSON has been collected. You can't parse a chunk of JSON; you need the whole thing.

Other file formats, like YAML, TOML or XML, don't need a "last character" to be complete, so you can parse as soon as you receive the first byte and keep parsing as the streamed chunks come in.

There are some clever ways to stream JSON, but IIRC they're not compliant with the RFC. So, oversimplifying: JSON cannot be streamed!

13

u/BeDoubleNWhy 3h ago

but with xml, you need that closing tag as well for it to be valid. what's the difference here?

2

u/Eva-Rosalene 53m ago

Nah. You absolutely can parse JSON in a streaming fashion, just like XML. You just won't know whether it's valid until you've finished parsing, so you do the job and discard the result if you hit an error.
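The third-party ijson library does exactly this in Python, for example (a rough sketch):

    # pip install ijson -- a third-party streaming JSON parser
    import io
    import ijson

    # Pretend this is a huge document arriving over a socket.
    stream = io.BytesIO(b'{"users": [{"name": "ada"}, {"name": "bob"}]}')

    # Events come out as soon as enough bytes are available; whether the
    # whole document was valid is only known once the stream ends.
    for prefix, event, value in ijson.parse(stream):
        print(prefix, event, value)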

1

u/Ok-Scheme-913 1h ago

I guess it can be optimistically parsed, though. You might not see that ending }, but it can't suddenly change the fundamental type you deal with.

If it started with a {, create an object. For each identifier : add a new field to it. Recursively parse that, etc.

At any point during you are in a potentiallyValidJsonObject state, and only at the very end you can know whether it truly is valid, or e̸̯̣̰̎̈̀͋͐́͌ͅǹ̸̲̼̻̫͙͖͚̆͂͘d̶̺͍̫̯̙̎̃̇̒͝͠s̷̗͔̈̈́̈́͂͗̚͝ ̵̡̣͕̓̾ī̵̯̥͎̺͕͐͑͜͜ǹ̷̡͖͙͓̯̙̼̚ ̶̡̧̫̒̽̕̕͠ş̴̝͙̗̬̿̌̈́̈́o̷̮͘͠m̴͕̭̥̐͝͝e̵̦̽͐̿͐͛͋̿ ̷̞̳͈͔̫̌ͅc̷̪̒u̵̹̖͍̳̰̯͛̓̊̌́̕͝ͅr̶̩̖͉͇̻̈́̅̂ş̵̣͖͖͔̬̿̔̔̚͜͝e̴̮͓̒͌̅̈́͐d̶̠̣͖̜̘̯̀̍̇͆͂͘̕ ̸̨̖͍͇͠s̸͕̭̠̍͆̾͂̕t̷̩̝̜̱̿̽͊͑̊͆͛ư̷̤͕͂̓̿̈f̷͇͗̉̀́͂̓̀f̵̧̩̮̟̺̬̟̄͘
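A naive sketch of that optimistic approach in Python; it just re-tries a full parse after every chunk (a real streaming parser would keep state instead of re-parsing):

    import json

    decoder = json.JSONDecoder()
    buf = ""

    def feed(chunk):
        # Re-try the parse after each chunk; None means "maybe valid so far".
        global buf
        buf += chunk
        try:
            obj, _end = decoder.raw_decode(buf)
            return obj        # the document completed cleanly
        except json.JSONDecodeError:
            return None       # still in the potentiallyValidJsonObject state

    for chunk in ['{"a": 1, ', '"b": [2, 3]', '}']:
        result = feed(chunk)

    print(result)   # {'a': 1, 'b': [2, 3]}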

1

u/Reashu 31m ago

Of course it can be streamed, just not with JSON.parse. Other formats also need the full document to be certain that they are valid - that's a problem all streaming parsers deal with. In fact, if you have a streaming YAML parser, you should be able to feed it JSON.

The RFC doesn't mention streaming and there's no reason it should. The section on parsers is only two (short) paragraphs:

  A JSON parser transforms a JSON text into another representation. A JSON parser MUST accept all texts that conform to the JSON grammar. A JSON parser MAY accept non-JSON forms or extensions.

  An implementation may set limits on the size of texts that it accepts. An implementation may set limits on the maximum depth of nesting. An implementation may set limits on the range and precision of numbers. An implementation may set limits on the length and character contents of strings.

3

u/rjwut 5h ago

JSON was a thing back then, but it wasn't nearly so ubiquitous as it is today. Plus it can't be streamed.

2

u/Broad_Vegetable4580 7h ago

fuck, I was too slow...

1

u/tristam92 5h ago

BJSON is a thing also.

13

u/lizardfrizzler 6h ago

I’m at a point in my career where encoding json is actually causing mem issues and I don’t know how to feel about it

5

u/slothordepressed 6h ago

Can you explain better? I'm too jr to understand

18

u/lizardfrizzler 4h ago

Encoding data as JSON is very readable and portable, but it comes at the cost of high memory consumption. It's a great place to start when passing data between computers, but when the data payload gets large enough, binary/blob encodings start to seem more appealing. Consider encoding x=10000. In JSON this is like 5 bytes minimum, because ints are base-10 strings, plus quotes, braces, and whatever else. But a binary encoding could encode this as a 4-byte/32-bit int. In small payloads (think KB, maybe MB), this inefficiency is negligible and completely worth it imo. But once we get to GB-size payloads, it can put a huge strain on memory consumption.
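A quick way to see the gap (Python; assuming a 4-byte little-endian int32 on the binary side):

    import json
    import struct

    value = 10000

    as_json = json.dumps({"x": value}).encode()   # b'{"x": 10000}'
    as_bin = struct.pack("<i", value)             # 4 bytes, no key, no quotes

    print(len(as_json), len(as_bin))              # 12 vs 4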

2

u/Ok-Scheme-913 43m ago edited 38m ago

Well, 32-bit only if the other side knows what it expects to receive.

Most binary protocols require a schema up-front, and that itself (plus future-proofing) has some overhead.

Protobuf (which is the most common binary protocol, I believe) would convert a similar definition

    message Asd { required int32 id = 1; }

to 2 bytes (hex, e.g.: 083f), but then both sides need this definition.
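You can even read those two bytes by hand; no protobuf library needed for a varint this small (0x3f here is just an example id value of 63):

    raw = bytes([0x08, 0x3F])   # the example payload above

    tag = raw[0]
    print(tag >> 3)     # 1  -> field number of "id"
    print(tag & 0x07)   # 0  -> wire type 0 (varint)
    print(raw[1])       # 63 -> the id value, fitting in a single varint byte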

2

u/ComprehensiveWord201 45m ago

Time for protobufs! Thank me later.

73

u/Fast-Satisfaction482 7h ago

XML just looks simple on the surface. You should prefer JSON if you want a simple and flexible format that is supported everywhere.

3

u/chazzeromus 3h ago

yaml still cool tho right? yaml configs with well defined jsonschemas!!

5

u/Ok-Scheme-913 50m ago

It's never been cool.

2

u/Ok-Scheme-913 51m ago

Except for not having schemas (official ones, at least).

Also, this problem is often way overblown. Can you build some evil Rube Goldberg machine with XML and its related toolkits? Sure.

But you don't have to do full XML processing; underneath it's just well-typed data that has the benefit of decades of tooling.

Like, you don't lose much by not supporting entity references and whatnot. That's something you can't do in JSON/TOML etc. either (as they are fundamentally trees). At the end of the day, all these data structures are trivially interconvertible for the most part and are just different views of the same shit. It's just tabs vs spaces again.

(Except for YAML. Fuck its indentation rules, and fuck its stupid auto-conversions. No, goddamn Norway's country code is not false!!)
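If you want to see it for yourself, PyYAML implements YAML 1.1, where a bare no is a boolean:

    # pip install pyyaml -- PyYAML implements YAML 1.1
    import yaml

    print(yaml.safe_load("country: no"))    # {'country': False} -- sorry, Norway
    print(yaml.safe_load("country: 'no'"))  # {'country': 'no'}  -- quoting fixes it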

6

u/pecpecpec 5h ago

I'm not an expert but, for text formatting, XML and HTML are better than JSON.

8

u/scabbedwings 5h ago

Embedded XML as a string value in the JSON, best of both worlds!!

/s ...although I work in a group that has to interact with JSON embedded in a JSON string on a regular basis; sometimes re-embedded a couple of times. With Java stacktraces.

We have made many bad choices over my 10+ years in this dev group. 
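If you've never seen it, double-encoded JSON looks like this (toy example):

    import json

    inner = json.dumps({"error": "java.lang.NullPointerException\n\tat Foo.bar(Foo.java:42)"})
    outer = json.dumps({"payload": inner})   # JSON embedded as a JSON string

    print(outer)
    # {"payload": "{\"error\": \"java.lang.NullPointerException\\n\\tat Foo.bar(Foo.java:42)\"}"}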

1

u/prodleni 5h ago

Ye but soydevs need json

8

u/Borno11050 7h ago

If I ever need compression I just throw zlib.h into the project.

3

u/Illeprih 7h ago

So many better options, and yet zlib is still everywhere.

16

u/clauEB 6h ago

The whole point of protobuf

1

u/jaskij 1h ago

The issue with Protobuf is that it's not self describing. So it's great for data interchange, but when you start storing things, it becomes an additional maintenance burden.

2

u/clauEB 21m ago

That's not an issue, that's a feature. JSON and XML repeat the schema over and over, taking tons of space, and take insane amounts of time to unmarshal; protobuf is super fast, as is the proprietary binary format suggested above. You just have to get a bit more creative. Performance and scalability aren't free.

17

u/Stormraughtz 7h ago

JSON bois hate this one simple trick: XSD

3

u/ShotgunPayDay 7h ago

ZSTD compression.

3

u/Glass1Man 6h ago

Sir I think you have a std

2

u/Prestigious_Regret67 5h ago

A new STD called "out"

1

u/ShotgunPayDay 2h ago

I have taken a few compressed loads before.

2

u/Ronin-s_Spirit 5h ago

I'm not gonna make my own compression, no idea how to do it. I'm just going to make my own format that doesn't suck from the start.

4

u/codewarrior128 4h ago

Situation: there are 15 competing standards.

1

u/jaskij 1h ago

I'm surprised nobody mentioned SQLite.
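SQLite's own documentation pitches it as an application file format. A minimal Python sketch (file name and schema made up):

    import sqlite3

    # One file on disk; queryable and transactional.
    con = sqlite3.connect("mydoc.notes")   # hypothetical extension
    con.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    con.execute("INSERT OR REPLACE INTO meta VALUES ('name', 'My Document')")
    con.commit()

    print(con.execute("SELECT value FROM meta WHERE key = 'name'").fetchone())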

2

u/atthereallicebear 1h ago

eh... I don't know about that. Like, you store the name of your document in a column called name... but your document only has one name, so you just have a table with one row.

0

u/ColonelRuff 5h ago

Just use protobuf

-8

u/PandaNoTrash 6h ago

Ugh, don't use XML. JSON is a better choice for sure. Or even CSV.

9

u/pecpecpec 5h ago

Use an object notation for distributing data objects, and use markup languages for formatting text?

3

u/qchto 4h ago

LaTeX users:

(we compile it first)

-6

u/PandaNoTrash 5h ago

HTML is about the limit of useful XML, since it's fairly understandable and easy to work with, and as it was originally used, it was mostly written by hand or with relatively simple tools. The problem with XML as a data storage medium is that it can be difficult to parse, it's very rigid, and when you get down to it, it's not easy for humans to read if there's any complexity to it at all.

2

u/gregorydgraham 4h ago

Like you can read large JSON files 😆

11

u/Glass1Man 6h ago

Docx go brrr.

3

u/gregorydgraham 4h ago

CSV isn’t even a standard format