While brainstorming ideas for testing the metascheduler I'm building, I thought I'd look into data compressors again, specifically the PAQ family of compressors. The latest update is PAQ8jc (fixed tarball). I whipped up an ebuild and took it to town using Intel's C++ library. I tested it out on a 1.8M XML file:
reference     1.8M
gzip -9       168K
bzip -9       108K
PAQ8jc -5     61K
PAQ8jc -7     61K   (2 bytes smaller, but longer runtime/memusage)
Okay... So this shows that if I feel like getting my hands dirty with C++, there's actually some value in parallelizing this algorithm.
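If you want gzip and bzip2 baseline numbers for your own files, here's a rough sketch using nothing but the Python standard library. The function name baseline_sizes is my own invention, and the sizes may differ by a few bytes from the command-line tools, since the container headers aren't necessarily identical. PAQ isn't covered here because there's no stdlib binding for it.

```python
import bz2
import gzip


def baseline_sizes(data: bytes) -> dict:
    """Return sizes in bytes: the raw input plus gzip and bzip2 at level 9."""
    return {
        "reference": len(data),
        "gzip -9": len(gzip.compress(data, compresslevel=9)),
        "bzip2 -9": len(bz2.compress(data, compresslevel=9)),
    }


def print_table(sizes: dict) -> None:
    """Print one 'name  size' row per entry, sizes in bytes."""
    for name, size in sizes.items():
        print(f"{name:<12}{size:>12}")
```

Usage would be something like print_table(baseline_sizes(open("some.xml", "rb").read())), where some.xml stands in for whatever file you're testing.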
Something that caught my eye while looking into this is XML-WRT. It's a fantastic project which scratches an itch I developed in the middle of a lecture on WebServices some time ago. XML-WRT can be thought of as working in two distinct steps: first, substitute common tag names, attributes, etc. with shortened tokens; second, run the result through zlib or FastPAQ, depending on user preference. I tested its WRTified zlib/FastPAQ targets on the 15M Locations.xml file from gnome-applets (wow, that's big):
reference     15M
gzip -9       2.0M
bzip2 -9      1.2M
xml-wrt -2    1.8M   (zlib default after wrt)
xml-wrt -3    1.7M   (zlib best after wrt)
xml-wrt -10   693K   (FastPAQ normal)
xml-wrt -11   693K   (FastPAQ best)
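To make the two-step idea concrete, here's a toy Python sketch: replace the most frequent tag names with one-byte tokens, then hand the result to zlib. This is not XML-WRT's actual format or algorithm. Every name here (build_dictionary, substitute, restore, wrt_compress, wrt_decompress) is made up for illustration. It only touches tag names, not attributes, and it assumes ASCII-only input so that bytes 0x80 and up are free to use as tokens.

```python
import re
import zlib
from collections import Counter

# Matches "<name" or "</name"; group 1 is the tag name.
TAG_RE = re.compile(rb"</?([A-Za-z_][\w.\-]*)")
# Matches "<" or "</" followed by a one-byte token (0x80..0xFF).
TOKEN_RE = re.compile(rb"(</?)([\x80-\xff])")


def build_dictionary(xml: bytes, max_entries: int = 128) -> list:
    """Return the most frequent tag names, most common first (at most 128)."""
    counts = Counter(m.group(1) for m in TAG_RE.finditer(xml))
    return [name for name, _ in counts.most_common(min(max_entries, 128))]


def substitute(xml: bytes, names: list) -> bytes:
    """Step 1: swap each dictionary tag name for a single high byte."""
    if not xml.isascii():
        raise ValueError("toy sketch assumes ASCII-only input")
    table = {name: bytes([0x80 + i]) for i, name in enumerate(names)}

    def repl(m):
        token = table.get(m.group(1))
        if token is None:
            return m.group(0)  # name not in dictionary: leave untouched
        prefix_len = m.start(1) - m.start(0)  # 1 for "<", 2 for "</"
        return m.group(0)[:prefix_len] + token

    return TAG_RE.sub(repl, xml)


def restore(data: bytes, names: list) -> bytes:
    """Inverse of substitute(): expand tokens back into tag names."""
    return TOKEN_RE.sub(lambda m: m.group(1) + names[m.group(2)[0] - 0x80], data)


def wrt_compress(xml: bytes, level: int = 9):
    """Step 2: compress the substituted text with zlib; keep the dictionary."""
    names = build_dictionary(xml)
    return names, zlib.compress(substitute(xml, names), level)


def wrt_decompress(names: list, blob: bytes) -> bytes:
    return restore(zlib.decompress(blob), names)
```

A real tool has to store the dictionary alongside the compressed data. The sketch just returns it as a separate value to keep things short.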
I also tested it on a 684M XML database (the default buffer size is too small for dictionary generation on this particular file):
reference            684M
gzip -9              102M
bzip2 -9             74M
xml-wrt -l10         ----
xml-wrt -l10 -b100   51M
What I want you to take away from this is that xml-wrt/PAQ is pretty slick and actually quite usable. xml-wrt -10 will actually complete in a sane timeframe. PAQ8jc on the same file, however, will take literally ages and probably won't serve any practical purpose for you...