You end up with a bunch of residuals, which you then want to compress. I tried using zstd, and it performed pretty poorly, giving a compressed size of 1406321 B from 2097152 B of data (1024 * 1024 values at 16 bits each). The exact amount depended on the compression level, but didn't get much better.
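For reference, here is a minimal sketch of how a buffer of 16-bit residuals might be handed to zstd's one-shot API; the container and the compression level here are illustrative, not necessarily what I used:

#include <zstd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Compress a buffer of 16-bit residuals with zstd's one-shot API and
// report the compressed size.
static void compressWithZstd(const std::vector<int16_t>& residuals, int level)
{
    const size_t src_size = residuals.size() * sizeof(int16_t);
    std::vector<char> dst(ZSTD_compressBound(src_size)); // worst-case output size

    const size_t res = ZSTD_compress(dst.data(), dst.size(), residuals.data(), src_size, level);
    if(ZSTD_isError(res))
        std::printf("zstd error: %s\n", ZSTD_getErrorName(res));
    else
        std::printf("Compressed %zu B to %zu B\n", src_size, res);
}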
Since I was curious, I decided to compute the entropy of the data. (See Entropy_(information_theory) on Wikipedia.)
$$H=\sum_x p(x) \log_2({1 \over p(x)})$$
Strictly speaking, this entropy bound applies in the limit of long sequences of symbols. However, in our case, with a reasonably large file of residuals, and assuming each residual is uncorrelated and identically distributed, the Shannon entropy gives a lower bound on the number of bits needed to encode the data: the number of symbols times the entropy of the symbol distribution. Any correlation between symbols should only help the compressor.
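In other words, for N independent, identically distributed symbols drawn from a distribution with entropy H, any lossless coder needs, on average, at least
$$\text{compressed size (bits)} \;\ge\; N \cdot H$$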
A good 'entropy coder' should be able to get near this limit. There will be some overhead due to having to write the symbol encoding/table, but it shouldn't be that big in my case, as there were only around 2000 unique symbols (i.e. 2000 unique 16-bit values). You can compute the entropy in C++ code like so:
#include <map>
#include <cmath>

static double computeEntropy(const js::Vector<int16, 16>& filtered_data)
{
    // Count the occurrences of each 16-bit residual value.
    std::map<int16, size_t> count;
    for(size_t i=0; i<filtered_data.size(); ++i)
        count[filtered_data[i]]++;

    // Sum -p * log2(p) over the observed symbol distribution.
    double entropy = 0;
    for(auto it = count.begin(); it != count.end(); ++it)
    {
        const double p = (double)it->second / (double)filtered_data.size();
        entropy += -p * std::log2(p);
    }
    return entropy;
}

For my data this gives:
Entropy: 9.115353669970485 bits
Theoretical min compressed size: 1194767.6362303714 B

So the theoretical minimum is ~1.2 MB. Zstd is only achieving ~1.4 MB, which is nowhere near the theoretical limit.
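The second number is just the entropy times the number of symbols, converted to bytes. A minimal sketch of how it might be printed, assuming filtered_data holds the 1024 * 1024 residuals (the names here are illustrative):

const double entropy = computeEntropy(filtered_data);
const double min_bytes = entropy * (double)filtered_data.size() / 8.0; // N * H bits -> bytes
std::printf("Entropy: %.16g bits\n", entropy);
std::printf("Theoretical min compressed size: %.16g B\n", min_bytes);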
This was quite disappointing to me, as I like zstd and use it a lot, and I know that zstd has some fancy ANS entropy encoding.
The reason for this failure was explained to me by Conor Stokes as being due to the Lempel–Ziv (LZ77) stage of zstd compression, which is not well suited for pure entropy coding - apparently this stage introduces a lot of overhead.
Interestingly, 7-zip compresses the data much better than zstd, reaching a size of 1,298,590 B.
So zstd is not the cure-all and panacea of data compression that I had hoped :)
I would also be interested in a good C++/C entropy encoding library if anyone knows one.
Edit: Comments on Hacker News here.