20210126, 12:44  #1 
"Composite as Heck"
Oct 2017
1100010101_{2} Posts 
Chest compression game
I'm interested in chess and compression, so the compression of chess games is a natural thing to nerd out on. The PGN format and the algebraic notation it contains is human readable but spaceinefficient, I'm noodling around trying to minimise the footprint of the moves from a set of games. As this is just for fun only standard complete games are considered, and metadata is ignored as that goes into normal compression which is not the point of the exercise.
Here's the representations so far:
Here's comparisons using a corpus of 4096 random games (random as in a standard game where every valid move is chosen at random, ending decisively or in a draw by the 50 move rule), where PGN metadata is minimal boilerplate to pass scid validation: Code:
Format Filesize %ofANA XZsize %ofANA.xz SG4 1722293 17.9% 1289764 44.5% PGN 9996400 103.9% 2907876 100.3% ANA 9623664 100.0% 2897808 100.0% MLA 3321282 34.5% 2139160 73.8% ILA 1664735 17.3% 1015896 35.1% IPA 915582 9.5% 910660 31.4% None of the above representations consider redundancy across games (SG4 unknown), openings being the low hanging fruit (again, for real nonrandom games). There's a few obvious possibilities but the one that seems most suitable is to to store a serialised tree structure efficiently encoding the openings, not the entirety of all games but enough to encode the easy redundancy without incurring too much overhead. Above is pretty much everything considered so far, there may be avenues or existing solutions I'm oblivious to which would be interesting to learn about. How can it be refined further or differently? 
20210126, 12:50  #2 
Undefined
"The unspeakable one"
Jun 2006
My evil lair
13743_{8} Posts 
Perhaps adding Huffman encoding on top of IPA can show some improvement.
And one step further is full algebraic encoding as is used for compression in space probes where every bit sent is expensive. 
20210127, 14:36  #3 
"Composite as Heck"
Oct 2017
3×263 Posts 
I'm not familiar with huffman coding with input symbols of variable width, it has potential particularly when smart ordering is implemented. On the other hand when smart ordering is implemented the ILA representation also benefits, a real compressor can probably benefit more than whatever I can cook up even given the headstart of a packed representation. There is something that might tip the scales in huffmanencoded IPA's favour, and that's the fact that the IPA representation is not optimal. On average a packed index is something like half a bit inefficient, in the example above 4 bits to represent 13 states is wasting 3 states. I was considering using the remaining states to encode multiple moves in a single go, the most viable followups to the most viable current moves, but I don't think the current state of my cranklevelspaghetticode is up to the task. Huffman or similar might work and would be simpler.
A rudimentary ordering might work well enough that stockfish and all the drawbacks associated with using it isn't worth the effort. The ordering could be something like:

20210131, 12:36  #4 
"Composite as Heck"
Oct 2017
3×263 Posts 
Switched to Caissabase for testing, a database of over 4 million real games with strong players. To do that PGN ingest is implemented, although it's far from robust it can import all games from Caissabase excluding FEN, variants, the odd games with  denoting an unknown move and the 5 games with illegal castling moves. It's apparent that the engine needs some optimising to handle this many games quick enough for the current formats let alone the next formats to try, regardless here's rough preliminary data:
Code:
Format Size(MB) %ofANA XzSize %ofANA.xz SG4 508 26.7% 277 60.5% PGN 2946 155.0% 610 133.2% ANA 1901 100.0% 458 100.0% MLA 685 36.0% 329 71.8% ILA 307 16.1% 150 32.8% IPA 234 12.3% 193 42.1% TODO:

20210131, 13:02  #5  
"Robert Gerbicz"
Oct 2005
Hungary
10110101110_{2} Posts 
Quote:
for a move. The idea is that for each p number of possibility maintain a seperate array to extract the moves. Say for 13 possibilities: Code:
unsigned int tab13[100000]; have only 13 different valid chess moves. So you need (32*14)/121=3.7025 bits and not 4 bits. Ofcourse going higher you can get a closer distance to the optimal log(13)/log(2)=3.7004 bits, and optionally you can use only 1D array to compress all these info into a single array instead of using large number of tab2,tab3,tab4,..,tab13,tab14,.. arrays. Last fiddled with by R. Gerbicz on 20210131 at 13:03 Reason: grammar 

20210131, 15:02  #6 
"Robert Gerbicz"
Oct 2005
Hungary
2·727 Posts 
Even easier way if you want to encode only a single/few game:
use a generalized number system, say in the ith move you have only B[i] valid moves you have chosen the r[i]th move from these. And the game had L move then define x=r[0]+r[1]*B[0]+r[2]*B[0]*B[1]+... from this number you can easily extract the r[i] numbers: r[0]=x mod B[0] (and doesn't depend on future moves r[1],r[2],..) r[1]=(xr[0])/B[0] mod B[1] etc. with this you would need to use ceil(log2(B[0])+log2(B[1])+..+log2(B[L1])) bits of memory, hence if you are using this encoding for each game then you'd lose at most 1 bit per game, in average 0.5 bit. 
20210131, 18:57  #7 
"Composite as Heck"
Oct 2017
3×263 Posts 
Thank you, the generalised number solution is nice I'll implement it with GMP. If smart ordering is assumed a standard compressor compressing the ILA form should beat it, but it by far beats the IPA representation. If handling each game individually I think it can be archived as a length byte followed by the number stored in <length> bytes; 99% of games in Caissabase can fit in this form, the remaining few hundred can be encoded with the length as a varint, 0xFF + u16 as the length. The other option is to lean in to the maximal compression goal and encode all games into a single giant number to avoid the ~12 bits of overhead per game (~6MB for Caissabase). That's a lot of mult/div on the multihundred megabyte number that would be Caissabase, although they're all divisions by small numbers so maybe it's viable.
To be uniquely decodable indexed representations need to be altered to encode the end of the game as an option within the index itself. This should be more efficient than more overhead in the form of a turn count per game in the header. 
20210131, 19:40  #8 
"Robert Gerbicz"
Oct 2005
Hungary
2·727 Posts 
Yes, forget that above we need to store also the number of bits of x. Here 15 or 16 bits should be enough maybe for all games.

20210203, 01:22  #9 
"Composite as Heck"
Oct 2017
315_{16} Posts 
Here's another test with 200,000 random games. ~23.2 million moves (200k of those "end of game") with ~115 moves per game. The random game generator works (slightly) more realistically, king vs king waltz's are now not the norm. The General Number System representation is implemented per game as GNA. ISA is a refined ILA, a Cstring representation of a list of indexes with \0 denoting end of game instead of a turn count. MLA is gone (as a format, it's still heavily used in the engine), replaced with what I'm confusingly now calling ILA. ILA is now a 2 byte per move index + possiblemovecount format that allows for quick conversion to the indexed formats, aka everything except PGN and ANA.
Code:
Form Size %ofANA bitspermove  XZSize %ofANA.xz ANA 127480406 100.0% 44.20  41050624 100.0% ISA 23274407 18.3% 8.07  15659384 38.1% IPA 15748818 12.4% 5.46  15543648 37.9% GNA 13948569 10.9% 4.84  13949324 34.0% The maximum length of x assuming the 50 move rule is adhered to can probably fit in 16 bits (~6000 moves from both players at over 5 bits per move, which is a very generous bpm). Probably ~99.998% of Caissabase has a length of x that can fit within 11 bits, the ~70 remaining games might need 12 bits but I haven't tested it yet. 
20210207, 01:35  #10 
"Composite as Heck"
Oct 2017
789_{10} Posts 
Sort version 0 aka no sort, just the natural order from the valid move generator which iterates over the players pieces. Not random but not good.
Code:
Form Size %ofANA XZSize %ofANA.xz ANA 1868403142 100.0% 442580176 100.0% ISA 342966732 18.4% 188721016 42.6% IPA 242932132 13.0% 199165684 45.0% GNA 214021318 11.5% 211957748 47.9% Code:
Form Size %ofANA XZSize %ofANA.xz ANA 1868403142 100.0% 442580176 100.0% ISA 342966732 18.4% 174789504 39.5% IPA 242932132 13.0% 189283260 42.8% GNA 214005548 11.5% 211950820 47.9%
Potential refinement for sort v2 or beyond:
Hard to evaluate how some of these techniques stack up against each other or how they can sanely be mixed into one algorithm. 
20210207, 09:51  #11 
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
3×1,951 Posts 
Move sorting early in a game may be improved by pairing the compressor with an opening book.

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Buddy compression  SELROC  GPU Computing  2  20190308 12:58 
Engineassisted game vs Stockfish, move 36 discussion. Drawn game?  MooMoo2  Other Chess Games  10  20170429 17:39 
GPU optimization over snappy compression  ihavethepotenti  Programming  2  20130215 00:20 
video compression chatter by jasong  jasong  jasong  24  20080113 11:23 
Compression program  Prime95  LMH > 100M  1  20050323 15:18 