mersenneforum.org > Extra Stuff > Hobbies > Chess
Old 2021-01-26, 12:44   #1
M344587487
"Composite as Heck"
Oct 2017
Chess compression game

I'm interested in chess and compression, so the compression of chess games is a natural thing to nerd out on. The PGN format and the algebraic notation it contains are human-readable but space-inefficient, so I'm noodling around trying to minimise the footprint of the moves from a set of games. As this is just for fun, only standard complete games are considered, and metadata is ignored since that goes through normal compression, which is not the point of the exercise.

Here are the representations so far:
  • SG4 - The standard game representation used in SI4, the database format of Scid, included for comparison
  • PGN - Standard PGN format
  • ANA - Newline delimited algebraic notation, PGN without the metadata for a fairer comparison
  • MLA - Move list array - Games are represented as a list of moves where each move is two bytes (source and destination, the upper two bits of destination denote promotion target if promoting)
  • ILA - Index list array - At each position all valid moves are generated, and the index of the move is stored, allowing a single byte to represent the move
  • IPA - Packed index list array - A list of indexes where each index only takes up as many bits as required instead of always a byte. For example if there are only 13 valid moves in a position only 4 bits are required to store the index.


Here are comparisons using a corpus of 4096 random games (random as in standard games where every valid move is chosen at random, ending decisively or in a draw by the 50-move rule), where the PGN metadata is minimal boilerplate to pass Scid validation:
Code:
Format Filesize %-of-ANA XZ-size %-of-ANA.xz
  SG4  1722293    17.9%  1289764      44.5%
  PGN  9996400   103.9%  2907876     100.3%
  ANA  9623664   100.0%  2897808     100.0%
  MLA  3321282    34.5%  2139160      73.8%
  ILA  1664735    17.3%  1015896      35.1%
  IPA   915582     9.5%   910660      31.4%
There's a potential optimisation in the ordering of valid moves in a position: if the more likely moves are listed earlier, chess-unaware compressors can perform better because it skews the index frequency distribution (assuming real games, not random ones). The proper way to do this would be to use Stockfish, probably more trouble than it's worth for noodling, but potentially a strong step, albeit computationally expensive.

None of the above representations consider redundancy across games (SG4 unknown), with openings being the low-hanging fruit (again, for real non-random games). There are a few obvious possibilities, but the one that seems most suitable is to store a serialised tree structure efficiently encoding the openings: not the entirety of all games, but enough to capture the easy redundancy without incurring too much overhead.

Above is pretty much everything considered so far; there may be avenues or existing solutions I'm oblivious to which would be interesting to learn about. How can it be refined further, or differently?
Old 2021-01-26, 12:50   #2
retina
"The unspeakable one"
Jun 2006
My evil lair

Perhaps adding Huffman encoding on top of IPA would show some improvement.

And one step further is full arithmetic encoding, as is used for compression in space probes where every bit sent is expensive.
Old 2021-01-27, 14:36   #3
M344587487
"Composite as Heck"
Oct 2017

I'm not familiar with Huffman coding over input symbols of variable width, but it has potential, particularly once smart ordering is implemented. On the other hand, when smart ordering is implemented the ILA representation also benefits, and a real compressor can probably benefit more than whatever I can cook up, even given the head start of a packed representation. There is something that might tip the scales in Huffman-encoded IPA's favour: the IPA representation is not optimal. On average a packed index is something like half a bit inefficient; in the example above, 4 bits to represent 13 states wastes 3 states. I was considering using the remaining states to encode multiple moves in a single go (the most viable follow-ups to the most viable current moves), but I don't think the current state of my crank-level spaghetti code is up to the task. Huffman or similar might work and would be simpler.

A rudimentary ordering might work well enough that Stockfish, with all the drawbacks associated with using it, isn't worth the effort. The ordering could be something like:
  • Castling, when available it's normally done pretty quickly
  • Taking a piece with no recourse
  • Taking a higher value piece with recourse
  • Taking an equal value piece with recourse
  • Moving a piece to a safe square
  • Moving a piece to a square covered by the opponent
  • Taking a lower value piece with recourse, sacrifice plays are relatively rare
That should give a compressor some nicer value distributions to chew on. I'm open to suggestions for refinement; this is just easily implementable because cover masks have already been implemented for validating moves.
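The bullet ordering above could be implemented as a classification followed by a sort. A sketch, where the per-move facts (capture value, cover information) are assumed to come from the existing cover masks; all names are illustrative, not from the actual engine:

```c
#include <stdlib.h>

/* Assumed per-move facts derived from the engine's cover masks. */
typedef struct {
    int is_castle;        /* kingside or queenside castle             */
    int captured_value;   /* value of the captured piece, 0 if none   */
    int mover_value;      /* value of the piece being moved           */
    int dst_covered;      /* destination covered by the opponent?     */
} MoveInfo;

/* Lower class = listed (and therefore indexed) earlier. */
static int order_class(const MoveInfo *m) {
    if (m->is_castle)                                         return 0;
    if (m->captured_value && !m->dst_covered)                 return 1;
    if (m->captured_value > m->mover_value)                   return 2;
    if (m->captured_value && m->captured_value == m->mover_value)
                                                              return 3;
    if (!m->captured_value && !m->dst_covered)                return 4;
    if (!m->captured_value)                                   return 5;
    return 6; /* covered capture of a lower-value piece (sacrifice) */
}

static int cmp_moves(const void *a, const void *b) {
    return order_class((const MoveInfo *)a) - order_class((const MoveInfo *)b);
}
```

The valid-move list would then be ordered with `qsort(moves, n, sizeof(MoveInfo), cmp_moves)` before indexing, so that low indexes dominate in real games.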
Old 2021-01-31, 12:36   #4
M344587487
"Composite as Heck"
Oct 2017

Switched to Caissabase for testing, a database of over 4 million real games between strong players. To do that, PGN ingest has been implemented; although it's far from robust, it can import all games from Caissabase excluding FEN setups, variants, the odd game with -- denoting an unknown move, and the 5 games with illegal castling moves. It's apparent that the engine needs some optimising to handle this many games quickly enough for the current formats, let alone the next formats to try; regardless, here's rough preliminary data:
Code:
Format  Size(MB) %-of-ANA Xz-Size %-of-ANA.xz
   SG4      508     26.7%     277      60.5%
   PGN     2946    155.0%     610     133.2%
   ANA     1901    100.0%     458     100.0%
   MLA      685     36.0%     329      71.8%
   ILA      307     16.1%     150      32.8%
   IPA      234     12.3%     193      42.1%
xz-compressed ILA beats xz-compressed IPA, showing that the IPA representation is far from optimal and needs work. The file format of ILA has been reworked into a C-string instead of the former representation, which used a short as a move count.

TODO:
  • Optimise masks
  • Smart ordering as described above, doable once masks are optimised
  • Tree format to deduplicate openings. This is largely done, just waiting on some other components
  • FEN? Setting board state to/from FEN is trivial; the part that needs some thought is how to encode it into an archive format
  • Chess960? If FEN is supported it's probably not much effort to support this mode, just some reworking of the castling rules, which are the only thing different from a standard game AFAIK
  • Metadata pre-processor. There's no point re-inventing standard compression, but a simple pre-processor that greatly reduces the input to a standard compressor could make sense. Encode tags, group related strings, etc.
Old 2021-01-31, 13:02   #5
R. Gerbicz
"Robert Gerbicz"
Oct 2005
Hungary

Quote:
Originally Posted by M344587487
IPA - Packed index list array - A list of indexes where each index only takes up as many bits as required instead of always a byte. For example if there are only 13 valid moves in a position only 4 bits are required to store the index.
You can reach almost the optimal log(13)/log(2) bits per move when you have 13 possibilities for a move. The idea is that for each number of possibilities p, maintain a separate array to extract the moves. Say for 13 possibilities:
Code:
unsigned int tab13[100000];
Since 13^121 < (2^32)^14, in only 14 ints you can encode 121 chess moves where each move has only 13 different valid chess moves. So you need (32*14)/121 = 3.7025 bits per move, not 4 bits. Of course, going higher you can get closer to the optimal log(13)/log(2) = 3.7004 bits, and optionally you can use only a 1D array to compress all this info into a single array instead of using a large number of tab2, tab3, tab4, ..., tab13, tab14, ... arrays.
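A table-free sketch of the same packing idea, for the special case where every position happens to have 13 valid moves: since 13^17 < 2^64, seventeen base-13 indices fit in a single 64-bit word, i.e. 64/17 ≈ 3.76 bits per move instead of 4. The function names are mine, not from the thread:

```c
#include <stdint.h>

/* Pack 17 move indices, each in [0,13), into one uint64_t.
   Valid because 13^17 < 2^64. */
static uint64_t pack13(const unsigned idx[17]) {
    uint64_t x = 0;
    for (int i = 16; i >= 0; --i)
        x = x * 13u + idx[i];
    return x;
}

/* Recover the i-th index without unpacking the rest. */
static unsigned unpack13(uint64_t x, int i) {
    while (i-- > 0)
        x /= 13u;
    return (unsigned)(x % 13u);
}
```

The same trick generalises to any fixed number of possibilities p by choosing the largest k with p^k < 2^64.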

Old 2021-01-31, 15:02   #6
R. Gerbicz
"Robert Gerbicz"
Oct 2005
Hungary

An even easier way, if you want to encode only a single game or a few games: use a generalised (mixed-radix) number system. Say in the i-th move there are B[i] valid moves and you have chosen the r[i]-th of these. If the game had L moves, then define
x = r[0] + r[1]*B[0] + r[2]*B[0]*B[1] + ...
From this number you can easily extract the r[i] values:
r[0] = x mod B[0] (and doesn't depend on the future moves r[1], r[2], ...)
r[1] = ((x - r[0]) / B[0]) mod B[1], etc.

With this you would need ceil(log2(B[0]) + log2(B[1]) + ... + log2(B[L-1])) bits of memory, hence if you use this encoding for each game then you lose at most 1 bit per game, on average 0.5 bits.
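For short games the mixed-radix scheme fits in a machine word, so it can be sketched without GMP (a real encoder needs a big integer once the product of the B[i] exceeds 2^64):

```c
#include <stdint.h>

/* Encode choices r[i] in [0, B[i]) as
   x = r[0] + r[1]*B[0] + r[2]*B[0]*B[1] + ...
   Caller must ensure B[0]*B[1]*...*B[L-1] fits in 64 bits;
   a full-game encoder would use GMP instead. */
static uint64_t gns_encode(const unsigned r[], const unsigned B[], int L) {
    uint64_t x = 0;
    for (int i = L - 1; i >= 0; --i)
        x = x * B[i] + r[i];
    return x;
}

/* Invert: r[i] = (x / (B[0]*...*B[i-1])) mod B[i]. */
static void gns_decode(uint64_t x, const unsigned B[], int L, unsigned r[]) {
    for (int i = 0; i < L; ++i) {
        r[i] = (unsigned)(x % B[i]);
        x /= B[i];
    }
}
```

Note that decoding r[0] first means the decoder can replay the game move by move, regenerating each B[i] from the board as it goes.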
Old 2021-01-31, 18:57   #7
M344587487
"Composite as Heck"
Oct 2017

Thank you, the generalised number solution is nice; I'll implement it with GMP. If smart ordering is assumed, a standard compressor compressing the ILA form should beat it, but it far beats the IPA representation. If handling each game individually, I think it can be archived as a length byte followed by the number stored in <length> bytes; 99% of games in Caissabase fit this form, and the remaining few hundred can be encoded with the length as a varint, 0xFF + u16 as the length. The other option is to lean in to the maximal-compression goal and encode all games into a single giant number to avoid the ~12 bits of overhead per game (~6 MB for Caissabase). That's a lot of mult/div on the multi-hundred-megabyte number that Caissabase would become, although they're all divisions by small numbers, so maybe it's viable.

To be uniquely decodable, the indexed representations need to be altered to encode the end of the game as an option within the index itself. This should be more efficient than adding overhead in the form of a per-game turn count in the header.
Old 2021-01-31, 19:40   #8
R. Gerbicz
"Robert Gerbicz"
Oct 2005
Hungary

Yes, I forgot that above: we also need to store the number of bits of x. Here 15 or 16 bits should be enough, maybe for all games.
Old 2021-02-03, 01:22   #9
M344587487
"Composite as Heck"
Oct 2017

Here's another test with 200,000 random games: ~23.2 million moves (200k of those "end of game"), so ~115 moves per game. The random game generator now works (slightly) more realistically; king-vs-king waltzes are no longer the norm. The generalised number system representation is implemented per game as GNA. ISA is a refined ILA: a C-string representation of a list of indexes, with \0 denoting end of game instead of a turn count. MLA is gone as a format (it's still heavily used in the engine), replaced with what I'm confusingly now calling ILA. ILA is now a 2-byte-per-move index + possible-move-count format that allows quick conversion to the indexed formats, aka everything except PGN and ANA.

Code:
Form       Size  %-of-ANA  bits-per-move  |   XZ-Size  %-of-ANA.xz
 ANA  127480406    100.0%          44.20  |  41050624       100.0%
 ISA   23274407     18.3%           8.07  |  15659384        38.1%
 IPA   15748818     12.4%           5.46  |  15543648        37.9%
 GNA   13948569     10.9%           4.84  |  13949324        34.0%
GNA is incompressible, which is a good sign. For now the overhead is implemented as a varint and the numbers are byte-aligned, because that was easy; it costs ~12 bits of overhead per game. As order isn't used, a better solution would be to sort the games by GNS bit count, put all the counts together where they're easily compressed by diff+RLE (or an external compressor), and follow with the numbers unaligned. As the bit count is stored we could omit the leading set bit per game; 1 bit per game doesn't sound like much, but it's 25 KB in this test and ~0.5 MB for Caissabase. End of game has been encoded but not the result; that can be rolled into the GNS for ease at the cost of two bits, although in the header it could potentially be Huffman-encoded to squeeze out a little more juice.
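For reference, the byte-aligned varint used for the per-game overhead could look like the usual LEB128-style encoding: 7 payload bits per byte with the high bit as a continuation flag. This is a generic sketch, not necessarily the exact format used here:

```c
#include <stdint.h>

/* LEB128-style varint: low 7 bits are payload, high bit set means
   another byte follows.  Returns the number of bytes written. */
static int varint_encode(uint64_t v, uint8_t *out) {
    int n = 0;
    do {
        uint8_t b = (uint8_t)(v & 0x7F);
        v >>= 7;
        out[n++] = (uint8_t)(b | (v ? 0x80 : 0));
    } while (v);
    return n;
}

/* Decode a varint; stores the number of bytes consumed in *len. */
static uint64_t varint_decode(const uint8_t *in, int *len) {
    uint64_t v = 0;
    int shift = 0, n = 0;
    uint8_t b;
    do {
        b = in[n++];
        v |= (uint64_t)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    if (len) *len = n;
    return v;
}
```

Bit counts under 128 cost one byte, which covers the vast majority of games; only unusually long games spill into a second byte.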


The maximum length of x, assuming the 50-move rule is adhered to, can probably fit in 16 bits (~6000 moves from both players at over 5 bits per move, which is a very generous bpm). Probably ~99.998% of Caissabase has a length of x that fits within 11 bits; the ~70 remaining games might need 12 bits, but I haven't tested it yet.
Old 2021-02-07, 01:35   #10
M344587487
"Composite as Heck"
Oct 2017

Sort version 0, aka no sort: just the natural order from the valid move generator, which iterates over the player's pieces. Not random, but not good.
Code:
Form       Size %-of-ANA   XZ-Size %-of-ANA.xz
 ANA 1868403142   100.0% 442580176      100.0%
 ISA  342966732    18.4% 188721016       42.6%
 IPA  242932132    13.0% 199165684       45.0%
 GNA  214021318    11.5% 211957748       47.9%
Sort version 1 sorts in this order: kingside castle, queenside castle, then the value of a move considering the opponent's immediate response. The value is calculated by adding the value of a taken piece and subtracting the value of the player's piece if it can potentially be taken. Not particularly smart, but not expensive to calculate; it shifts obviously good and bad material moves up and down respectively.
Code:
Form       Size %-of-ANA   XZ-Size %-of-ANA.xz
 ANA 1868403142   100.0% 442580176      100.0%
 ISA  342966732    18.4% 174789504       39.5%
 IPA  242932132    13.0% 189283260       42.8%
 GNA  214005548    11.5% 211950820       47.9%
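The sort-v1 value described above reduces to a couple of integers per move. A sketch, with the field names assumed rather than taken from the actual engine:

```c
/* Sort v1 value: gain from the captured piece, minus the mover's own
   value if the opponent can immediately recapture.  Field names are
   illustrative, not from the real code. */
typedef struct {
    int captured_value;  /* 0 if the move is not a capture        */
    int mover_value;     /* value of the piece being moved        */
    int recapturable;    /* destination covered by the opponent?  */
} V1Move;

static int v1_value(const V1Move *m) {
    int v = m->captured_value;
    if (m->recapturable)
        v -= m->mover_value;
    return v;
}
```

Castles rank ahead of everything else; the remaining moves sort by this value, descending, so winning captures get the smallest indexes and sacrifices the largest.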
  • GNA is slightly compressible relative to the random-game test, I think thanks to some of the games in Caissabase being repeated
  • GNA with sort v1 is slightly smaller than with v0, thanks to generally having smaller indexes and so sometimes using 1 less bit per game
  • When given a large number of real games IPA beats GNA, as IPA is surprisingly compressible. Each game being byte-aligned may explain some of it, as opening redundancy becomes available to the compressor
  • Sort v1 is better than anticipated: even IPA.xz saved ~10 MB over v0, but ISA.xz was the real winner at ~14 MB saved


Potential refinements for sort v2 or beyond:
  • Checkmate seems obvious, but it's expensive to calculate as the opponent's next moves need to be generated to test for it. Also, by definition, not the most lucrative thing to check for
  • Checks from uncovered squares, particularly when the opponent can't respond by blocking with an immediate threat
  • Detecting forks, particularly when the worst material trade is favourable or a check is involved
  • Even trades when ahead on material are a common tactic; conversely, a player is likely to be trade-averse when down on material
  • Instead of just considering the immediate response, evaluate a trade chain to its logical conclusion. This can be done with individual piece cover masks to count how covered a square is, instead of just using a single mask showing what the opponent has covered

It's hard to evaluate how some of these techniques stack up against each other, or how they can sanely be mixed into one algorithm.
Old 2021-02-07, 09:51   #11
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)

Move sorting early in a game may be improved by pairing the compressor with an opening book.