FLAC header UTF-8 bollocks (aka. Stupidcode)

I wish to set this down while it is still fresh in memory. I am currently reimplementing a FLAC decoder from scratch in order to understand how it works, so I can get a handle on how possible it is to implement it on various very small processors, and the point set down here has been pissing me off.

Edit: Since writing this page I have discovered that the FLAC format has been set down in RFC9639. This was only published in December 2024, ie. 3 months ago, so nobody has noticed yet; but it deserves notice. The actual FLAC documentation suffers from being written by someone to whom it is all obvious (because he invented it), so he doesn't notice all the holes where he has left out things that actually aren't obvious. He also assumes that everyone else will be just as familiar with various eclectic, obscure and misleadingly-misnamed references as he is. So you try and read it and keep running into flocks of nasal chickens. The RFC, however, is a rewritten and considerably expanded version which fills in these holes and explains the obscure references: it is a much better description of the FLAC format than the "real" description, and much easier to understand. And as an example of this, it includes - at 9.1.5. Coded Number - an explanation of the exact problem I wrote this page about (which, pleasingly, seems to confirm that I did get it right).

The page about the FLAC headers in the FLAC documentation describes a field which it says can contain anything from 8 to 56 bits if the blocksize is variable, or from 8 to 48 bits otherwise. (It is less than explicit about how you're supposed to tell whether the blocksize is variable, and even says that under some circumstances "the decoder will have to pessimistically guess that it is a variable-blocksize stream"; oh, bloody great.)

It says the data in this field is ""UTF-8" coded", represents a frame or sample number, and decodes to a value of up to 31 or 36 bits. This immediately sets me wondering (a) why the fuck you need international extended character sets to represent a number, (b) why the double fuck you're doing this in the middle of a frame header, and (c) how the triple fuck you're getting an 11-digit number out of only 7 bytes of that. It goes on to add the following unhelpful comment:

The "UTF-8" coding used for the sample/frame number is the same variable length code used to store compressed UCS-2, extended to handle larger input.

Trying to follow that up as it stands finds all kinds of crap about compressed databases in geriatric dinosaur systems, which is basically completely bleeding useless and makes me wonder what the quadruple fuck is going on here anyway. But eventually, by including "flac" among my search terms, I found a link to the following stackoverflow post: http://stackoverflow.com/questions/53267434/cant-understand-flac-frame-header-format

This was quite good, but it seemed to have been more meaningful when it was posted than it is now, because the wikipedia page (http://en.wikipedia.org/wiki/UTF-8) it linked to didn't have a whole lot to say about the matter, and what it did have seemed to indicate that the values quoted in the stackoverflow answer are not valid code, which raised the question of why anyone would include in an answer a wikipedia link which says that answer is wrong. So I decided to look at the history of the wikipedia page and see what it used to say on 10th October 2019, when the post was made: http://en.wikipedia.org/w/index.php?title=UTF-8&oldid=920398196

Turns out it's a fuck of a lot different from what it's like in March 2025. The old version is much longer and has fuck loads of useful and interesting shit about how the format used for encoding the UTF-8 extended character set was developed, including details of the variant encodings devised along the way for things other than UTF-8 and how all this stuff influenced the choices made in the encoding. Moreover, it does include confirmation of the stackoverflow poster's answer. It calls the encoding "FSS-UTF (1992) / UTF-8 (1993)", includes a table describing it of which the stackoverflow post quotes the bottom row, and says it was developed by Rob Pike and Ken Thompson (yes, that one) on the back of a beermat, and was "first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993".

All this, and more, has since been REMOVED by STUPID WHINING WIKIPEDIA NERDS bleating about it being "not relevant". Well it fucking well is relevant you stupid bastards; how the fuck could it possibly not be? And it's also very useful so PUT IT BACK YOU CUNTS.

The only thing that needs doing is to add the obvious extra row to the FSS-UTF (1992) / UTF-8 (1993) table to extend it to 7-byte sequences, and that gets you the FLAC header UTF-8 encoding, which looks like this:

n : i : 1st byte : subsequent bytes...

1  6  0xxxxxxx                                                             1 + 0 = 1
2  5  110xxxxx  10xxxxxx                                                   1 + 1 = 2
3  4  1110xxxx  10xxxxxx 10xxxxxx                                          1 + 2 = 3
4  3  11110xxx  10xxxxxx 10xxxxxx 10xxxxxx                                 1 + 3 = 4
5  2  111110xx  10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx                        1 + 4 = 5
6  1  1111110x  10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx               1 + 5 = 6
7  0  11111110  10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx      1 + 6 = 7
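(Doing the sums on that table answers question (c) from earlier, by the way: a 1-byte sequence carries 7 bits of number, and an n-byte sequence for n >= 2 carries 7 - n bits in the first byte plus 6 in each of the n - 1 continuation bytes, which comes to 5n + 1 bits in total; so 6 bytes hold up to 31 bits and 7 bytes up to 36, which is where the "31 or 36 bits" figures come from, and 2^36 - 1 = 68,719,476,735 is indeed an 11-digit number squeezed out of only 7 bytes.)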

The reason they've done this is of course that it guarantees that there will be at least one 0 in every byte (and, apart from the single 0x00 byte that encodes zero itself, at least one 1 as well); since we're in the middle of a thing that ascribes a special meaning to a string of fourteen 1s, this prevents it fucking up if the number happens to have fourteen consecutive 1s in it. But for fuck's sake. If you're going to do this bloody thing then fucking well EXPLAIN WHAT IT IS. No, it fucking ISN'T obvious what "the same variable length code used to store compressed UCS-2" fucking means. What the fuck is "compressed UCS-2"? Never bloody heard of it. And looking it up finds all sorts of ghastly shite about compressing giant databases, which is three times as obscure as all fuck and instead of helping makes it even more bleeding confusing.
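Since I had to work it out the hard way anyway, here is a rough sketch in C of reading one of these coded numbers. The function name, the flat input buffer and the max_bytes parameter are all my own inventions rather than anything out of libFLAC or the RFC; a real decoder would be pulling the bytes through its own bit-reader and checking the frame CRC as well.

/* Rough sketch of decoding the FLAC frame header coded number, assuming
 * the bytes are already sitting in a flat buffer p[].  max_bytes is 7
 * for a variable-blocksize stream (sample number, up to 36 bits) or 6
 * for a fixed-blocksize stream (frame number, up to 31 bits).  Returns
 * 0 on success, -1 on a malformed sequence; it doesn't bother rejecting
 * over-long encodings. */
#include <stdint.h>

static int read_coded_number(const uint8_t *p, int max_bytes, uint64_t *out)
{
    uint8_t b = p[0];
    int extra;                /* continuation bytes still to read */
    uint64_t val;

    if ((b & 0x80) == 0) {    /* 0xxxxxxx: 7-bit value; a plain 0x00 (zero) is fine */
        *out = b;
        return 0;
    }

    /* The run of leading 1s in the first byte gives the total length.
     * A lone 10xxxxxx is a continuation byte, and 11111111 would imply
     * an 8-byte form; neither is valid as a first byte. */
    if      ((b & 0xE0) == 0xC0) { extra = 1; val = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { extra = 2; val = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { extra = 3; val = b & 0x07; }
    else if ((b & 0xFC) == 0xF8) { extra = 4; val = b & 0x03; }
    else if ((b & 0xFE) == 0xFC) { extra = 5; val = b & 0x01; }
    else if (b == 0xFE)          { extra = 6; val = 0;        }
    else                         return -1;

    if (1 + extra > max_bytes)
        return -1;

    /* Each following byte must be 10xxxxxx and contributes 6 more bits. */
    for (int i = 1; i <= extra; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return -1;
        val = (val << 6) | (uint64_t)(p[i] & 0x3F);
    }
    *out = val;
    return 0;
}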

Indeed, providing examples of how not to explain things seems to be a prevalent feature of the FLAC documentation. As mentioned above, it does not bleeding help to be told that the decoder may have to guess that the blocksize is variable because it may not be able to actually tell. Did you actually design this thing, or did you just eat a stack of old code listings and puke it? (Edit: I now find that the abovementioned RFC9639 subtly implies the latter.) And as a further comment on the current matter, it could be made more obvious that ZERO is actually a valid value for this code. Oh yeah, and it's binary, not character data, so the 7 bytes really do carry enough bits to get the whole 36-bit number out of them.
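For what it's worth, here is what the sketch above makes of those last two points, using example byte sequences of my own construction rather than anything captured from a real stream: zero comes out of the single byte 0x00, and a full 36-bit value needs the whole 7-byte form.

/* Quick checks on read_coded_number() from the sketch above; the byte
 * sequences are made-up examples, not taken from any real FLAC file. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t n;

    /* Zero is a perfectly valid coded number: the single byte 0x00. */
    const uint8_t zero[] = { 0x00 };
    assert(read_coded_number(zero, 7, &n) == 0 && n == 0);

    /* 11111110 followed by six 10111111 bytes: all 36 payload bits set,
     * i.e. 68719476735, the biggest (11-digit) value the field can hold. */
    const uint8_t big[] = { 0xFE, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF };
    assert(read_coded_number(big, 7, &n) == 0 && n == 0xFFFFFFFFFULL);

    printf("coded-number checks pass\n");
    return 0;
}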



