WRT PAQ XML Compression

While brainstorming ideas for testing the metascheduler I'm building, I thought I'd look into data compressors again, specifically the PAQ family of compressors. The latest update is PAQ8jc (fixed tarball). I whipped up an ebuild and took it to town using Intel's C++ library. I tested it out on a 1.8M XML file:

   reference   1.8M
   gzip -9     168K
   bzip2 -9    108K
   PAQ8jc -5   61K
   PAQ8jc -7   61K (2 bytes smaller, but longer runtime and higher memory usage)

Okay... So this shows that if I feel like getting my hands dirty with C++, there's actually some value in parallelizing this algorithm.

Something that caught my eye while looking into this is XML-WRT. It's a fantastic project which scratches an itch I developed in the middle of a lecture on WebServices some time ago. XML-WRT can be thought of as working in two distinct steps: first substitute common tag names, attributes, etc. with shortened tokens, then run the result through zlib or FastPAQ, depending on user preference. I tested its WRT-transformed zlib/FastPAQ targets on the 15M Locations.xml file from gnome-applets (wow, that's big):

   reference   15M 
   gzip -9     2.0M
   bzip2 -9    1.2M
   xml-wrt -2  1.8M (zlib default after wrt)
   xml-wrt -3  1.7M (zlib best after wrt)
   xml-wrt -10 693K (FastPAQ normal)
   xml-wrt -11 693K (FastPAQ best)
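
To make the two steps concrete, here's a rough Python sketch of the idea. This is not XML-WRT's actual code or file format: the function names (build_dictionary, wrt_encode, wrt_decode), the 128-word dictionary limit and the use of bytes 0x80-0xFF as tokens are all assumptions made up for illustration, and it only handles 7-bit ASCII input. zlib stands in for the back end.

```python
import re
import zlib
from collections import Counter

# Hypothetical tokenizer: anything that looks like a tag or attribute name.
WORD = re.compile(rb"[A-Za-z_][A-Za-z0-9_]*")


def build_dictionary(data, max_words=128, min_len=3):
    """Return the most frequent words, most common first (at most 128)."""
    counts = Counter(w for w in WORD.findall(data) if len(w) >= min_len)
    return [w for w, _ in counts.most_common(min(max_words, 128))]


def wrt_encode(data, words):
    """Step 1: replace each dictionary word with a single byte >= 0x80."""
    if any(b >= 0x80 for b in data):
        raise ValueError("this sketch only handles 7-bit ASCII input")
    token = {w: bytes([0x80 + i]) for i, w in enumerate(words)}
    # Words that are not in the dictionary are left untouched.
    return WORD.sub(lambda m: token.get(m.group(), m.group()), data)


def wrt_decode(data, words):
    """Undo step 1: expand every high byte back into its word."""
    return re.sub(rb"[\x80-\xff]", lambda m: words[m.group()[0] - 0x80], data)


if __name__ == "__main__":
    sample = b'<location name="Oslo"><code>ENGM</code></location>\n' * 500
    words = build_dictionary(sample)
    encoded = wrt_encode(sample, words)
    assert wrt_decode(encoded, words) == sample  # the transform is lossless
    # Step 2: hand the transformed stream to a general-purpose compressor.
    print("zlib only:", len(zlib.compress(sample, 9)), "bytes")
    print("wrt + zlib:", len(zlib.compress(encoded, 9)), "bytes")
```

A real tool also has to store the dictionary alongside the compressed data so the decoder can rebuild it; the sketch just passes the word list around in memory.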
   

I also tested it on a 684M XML database (the default buffer size is too small for dictionary generation on this particular file):

   reference            684M
   gzip -9              102M
   bzip2 -9              74M
   xml-wrt -l10         ---- (default buffer too small)
   xml-wrt -l10 -b100    51M
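
For a quick sense of scale, the sizes in that last table can be turned into percentages of the original file. A tiny sketch, using only the rounded figures reported above (the helper name percent_of_original is just made up for this example):

```python
# Sizes in megabytes, copied from the 684M test above (rounded as reported).
reference_mb = 684
sizes_mb = {
    "gzip -9": 102,
    "bzip2 -9": 74,
    "xml-wrt -l10 -b100": 51,
}


def percent_of_original(size, reference):
    """Compressed size as a percentage of the uncompressed size."""
    return 100.0 * size / reference


for name, size in sizes_mb.items():
    pct = percent_of_original(size, reference_mb)
    print(f"{name:20s} {pct:5.1f}% of original")
```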

What I want you to take away from this is that xml-wrt/PAQ is pretty slick and actually quite usable: xml-wrt -10 will complete in a sane timeframe. PAQ8jc on the same file, however, will take ages and probably won't serve any practical purpose for you.
