You end up with a bunch of residuals that you then want to compress. I tried using zstd, and it performed pretty poorly, compressing 2097152 B of data (1024 * 1024 16-bit values) down to only 1406321 B. The exact amount depended on the compression level, but didn't get much better.

Since I was curious, I decided to compute the entropy of the data. (See the Wikipedia article Entropy (information theory).)

$$H=\sum_x p(x) \log_2({1 \over p(x)})$$

Strictly speaking, this entropy measure applies to a random variable rather than to a particular finite file. However, in our case, with a reasonably large file of residuals, and assuming each residual is independently and identically distributed, the Shannon entropy gives a lower bound on the number of bits needed to encode the data: the number of symbols times the entropy of the symbol distribution. Any correlation between symbols should only help the compressor.
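As a quick worked example of the formula (this example distribution is my own, not from my residual data): for a source emitting three symbols with probabilities 1/2, 1/4 and 1/4,

$$H = \tfrac{1}{2}\log_2(2) + \tfrac{1}{4}\log_2(4) + \tfrac{1}{4}\log_2(4) = 0.5 + 0.5 + 0.5 = 1.5 \text{ bits per symbol}$$

So an ideal coder would need about 1.5 bits per symbol on average, e.g. by giving the common symbol a 1-bit code and the rare ones 2-bit codes.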

A good 'entropy coder' should be able to get near this limit. There will be some overhead from having to write the symbol encoding/table, but it shouldn't be that big in my case, as there were only around 2000 unique symbols (i.e. 2000 unique 16-bit values). You can compute the entropy in C++ like so:

```cpp
static double computeEntropy(const js::Vector<int16, 16>& filtered_data)
{
	std::map<int16, size_t> count;
	for(size_t i=0; i<filtered_data.size(); ++i)
		count[filtered_data[i]]++;

	double entropy = 0;
	for(auto it = count.begin(); it != count.end(); ++it)
	{
		const double p = (double)it->second / (double)filtered_data.size();
		entropy += -p * std::log2(p);
	}
	return entropy;
}
```

For my data this gives:

```
Entropy: 9.115353669970485 bits
Theoretical min compressed size: 1194767.6362303714 B
```

So the theoretical minimum is ~1.2 MB. Zstd is only achieving ~1.4 MB, which is nowhere near the theoretical limit.

This was quite disappointing to me, as I like zstd and use it a lot, and I know that zstd has some fancy ANS entropy encoding.

The reason for this failure was explained to me by Conor Stokes as being due to the Lempel–Ziv matching stage of zstd compression, which is not well suited for pure entropy coding - apparently this stage introduces a lot of overhead.

Interestingly, 7-zip compresses the data much better than zstd, reaching a size of 1,298,590 B.

So zstd is not the cure-all and panacea of data compression that I had hoped :)

I would also be interested in a good C/C++ entropy coding library, if anyone knows one.

Edit: Comments on Hacker News here.