WordMix learning Russian

The next WordMix and WordMix Pro release will include support for Russian, Portuguese and Dutch as dictionary languages. I had a lot of fun with the Cyrillic character encoding, especially while building the word database, and I learned that a lot of Linux tools still do not handle multibyte character sequences correctly.

The tool tr kept me busy the most, when I tried to convert lower-case letters to upper case. The usual approach of

tr '[:lower:]' '[:upper:]'

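only seems to work for the ASCII character set. Spelling out the letters by hand on UTF-8 data screws everything up even more, because tr (at least the GNU version) translates single bytes while every Cyrillic letter takes two bytes in UTF-8, like in the command: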

tr \
  абвгдеёжзийклмнопрстуфхцчшщъыьэюя \
  АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
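A quick way to see the difference between the two encodings is to count the bytes of a single letter. This is only a small sketch: it assumes that iconv is installed and that the snippet itself is saved as UTF-8. The variable names are made up for illustration.

```shell
# Count how many bytes the single letter я takes in each encoding.
# The $(( )) wrapper strips the padding some wc implementations add.
utf8_bytes=$(( $(printf 'я' | wc -c) ))
koi8_bytes=$(( $(printf 'я' | iconv -f UTF-8 -t KOI8-R | wc -c) ))

echo "UTF-8: $utf8_bytes bytes, KOI8-R: $koi8_bytes byte"
```

A byte-oriented tool can map one byte to one byte, so it has a chance with KOI8-R but not with the two-byte UTF-8 form.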

The trick was to use tr on the original KOI8-R encoded data, which uses one byte per character. For that I also had to pass KOI8-R encoded parameters to the tool, which was a pain inside an otherwise UTF-8 encoded shell script. So I read the KOI8-R encoded parameters from a file and passed them as arguments, so I don’t screw up my shell script.
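Another way to keep raw KOI8-R bytes out of the script, sketched here as an alternative and not as the script actually used for WordMix, is to write the tr parameters as octal escapes. KOI8-R puts the lower-case letters at 0xC0 to 0xDF and the upper-case letters at 0xE0 to 0xFF, with ё at 0xA3 and Ё at 0xB3. The function name koi8r_upper is hypothetical, and the sketch assumes iconv is available and that the input arrives as UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: upper-case Russian text read as UTF-8 from stdin.
# Octal escapes stand in for the KOI8-R bytes, so the script stays ASCII-clean:
#   \300-\337 = lower case (0xC0-0xDF), \243 = ё (0xA3)
#   \340-\377 = upper case (0xE0-0xFF), \263 = Ё (0xB3)
koi8r_upper() {
  iconv -f UTF-8 -t KOI8-R |
    LC_ALL=C tr '\300-\337\243' '\340-\377\263' |
    iconv -f KOI8-R -t UTF-8
}

# Usage: prints ПРИВЕТ, ЁЖ
printf 'привет, ёж\n' | koi8r_upper
```

ASCII characters pass through unchanged, so Latin letters in the input would need a separate tr call if they should be upper-cased too.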

It took me several hours and several attempts to figure that out and to get all the encodings right, but now a working Russian dictionary is available. 🙂 It won’t be shipped by default though, so the game needs to fetch it from the Internet once, on first use.

Of course, the global ranklist is prepared for the new languages as well.