thaicheck is a command-line tool that checks whether Thai text files use a valid letter sequence order and repairs those that do not.

Features

The latest version is: 1.3

Introduction

Many systems (text editors, word processors, keyboard drivers, etc.) allow you to type Thai in an incorrect letter sequence. Some software may fail or even crash when it encounters a bad letter sequence, especially if the person who wrote the font renderer was not a Thai language expert. When you write Thai by hand, the order in which you write the letters is not important as long as the text looks correct. When you type on a computer, however, the font renderer can have problems with "non-standard" letter order. For example, you should always type tone or diacritic markers after typing the consonants and vowels for the same character cell. Most users are not even aware of this; their text editor forces them to type correctly.

Standard letter order has been defined by the Thai API Consortium (TAPIC) in the WTT 2.0 standard. This standard makes it possible to ensure that text is rendered properly and looks correct, consistently, on all systems. In addition to this specification, I have implemented an extra rule for sara am; this is consistent with the implementations used by Pango and Microsoft. Note that ru (Unicode U+0E24) is classified as a "Following Vowel 3" by WTT, but this classification is confusing because ru is not a following vowel.

The purpose of this tool is to find and repair sequence errors in TIS-620 coded text. You can use thaiconv to detect the coding of your file and convert it. My Thai text editor Sontana also features thaicheck code.
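As a rough illustration of what "TIS-620 coded" means in practice, the sketch below re-encodes UTF-8 text as TIS-620 using Python's standard tis_620 codec. It is not thaiconv and does not reproduce its detection logic; the function name utf8_to_tis620 is hypothetical and the input is assumed to already be valid UTF-8.

```python
def utf8_to_tis620(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as TIS-620.

    Raises UnicodeError if the input is not valid UTF-8 or contains
    characters that TIS-620 cannot represent.
    """
    return data.decode("utf-8").encode("tis_620")


# ko kai (U+0E01) followed by sara am (U+0E33)
sample = "\u0e01\u0e33".encode("utf-8")
encoded = utf8_to_tis620(sample)
print([hex(b) for b in encoded])
```

Each Thai character becomes a single byte in the range 0xA1 to 0xFB, which is the form of input this page describes thaicheck as working on.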

Sequence Repair

If the sequence is invalid, thaicheck repairs it by reordering the glyphs into the most likely valid order. A glyph composer is used to model the font renderer that was used when the text was created. For example, if the sequence contains two consecutive above vowels, it is likely that the renderer composed the second one over the first and the user saw only the last vowel.

Cell Composition Model

[Figure: the thaicheck compositional model. Left: a real word composed in the cell model. Right: a fake word showing all subcells occupied.]

In order to model how Thai characters should be combined, thaicheck composes each Thai character/glyph in a cell, as shown in the image above. The six subcells are for: 1. lead vowel, 2. consonant, 3. above vowel, 4. above diacritic, 5. below vowel/diacritic and 6. following vowel. The composer is a logical model that determines which glyphs can be combined; it has nothing to do with rendering fonts nicely. A few exception rules take care of multiple following vowels; in addition, a double sara e is replaced by sara ae, and nikhahit followed by sara aa is replaced by sara am.
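The two replacement rules mentioned above can be sketched as plain string substitutions. This is only an illustration on Unicode strings, not thaicheck's own code: thaicheck works on TIS-620 bytes inside its cell composer, and the function name apply_exception_rules is hypothetical.

```python
SARA_E = "\u0e40"    # lead vowel sara e
SARA_AE = "\u0e41"   # lead vowel sara ae
NIKHAHIT = "\u0e4d"  # above diacritic nikhahit
SARA_AA = "\u0e32"   # following vowel sara aa
SARA_AM = "\u0e33"   # following vowel sara am


def apply_exception_rules(text: str) -> str:
    """Apply the two replacement rules as simple substitutions (assumed behaviour)."""
    # Two sara e typed in a row look like one sara ae on screen.
    text = text.replace(SARA_E + SARA_E, SARA_AE)
    # Nikhahit immediately followed by sara aa looks like sara am.
    text = text.replace(NIKHAHIT + SARA_AA, SARA_AM)
    return text


print(apply_exception_rules("\u0e01\u0e4d\u0e32") == "\u0e01\u0e33")
```

The sketch only handles the case where nikhahit is directly followed by sara aa, as stated in the text; how thaicheck treats a tone mark between the two is not described here.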

During repair, glyphs are read in and composed in the cell. If a glyph cannot be composed, it may be discarded, or the cell may be output and a new cell started. The cell is output in valid sequence order. Repair mode should be treated like a spell checker: it cannot accurately fix every mistake, but thaicheck can catch most of them. You should review the text and verify any repairs.
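A minimal sketch of such a cell composer is shown below, under several assumptions: it works on Unicode strings rather than TIS-620 bytes, the character classes are a simplified grouping rather than the WTT 2.0 tables, the output order (below vowel, above vowel, above diacritic, following vowel) is my reading of "valid sequence order", and the exception rules are left out. All names are hypothetical; this is not thaicheck's source.

```python
LEAD, CONS, ABOVE_V, ABOVE_D, BELOW, FOLLOW = range(6)
# Assumed output order within a cell: vowels before tone marks and diacritics.
OUTPUT_ORDER = (LEAD, CONS, BELOW, ABOVE_V, ABOVE_D, FOLLOW)


def classify(ch):
    """Map a character to a subcell, or None if it is not composed in a cell."""
    cp = ord(ch)
    if 0x0E40 <= cp <= 0x0E44:
        return LEAD
    if 0x0E01 <= cp <= 0x0E2E:
        return CONS
    if cp == 0x0E31 or 0x0E34 <= cp <= 0x0E37:
        return ABOVE_V
    if 0x0E47 <= cp <= 0x0E4E:
        return ABOVE_D
    if 0x0E38 <= cp <= 0x0E3A:
        return BELOW
    if cp in (0x0E30, 0x0E32, 0x0E33, 0x0E45):
        return FOLLOW
    return None


def repair(text):
    out, cell = [], {}

    def flush():
        # Emit the occupied subcells in output order and start a fresh cell.
        out.extend(cell[s] for s in OUTPUT_ORDER if s in cell)
        cell.clear()

    for ch in text:
        slot = classify(ch)
        if slot is None:            # not a composable glyph: pass it through
            flush()
            out.append(ch)
        elif slot == LEAD:          # a lead vowel always opens a new cell
            flush()
            cell[LEAD] = ch
        elif slot == CONS:
            if CONS in cell:        # second consonant: output cell, start anew
                flush()
            cell[CONS] = ch
        elif CONS not in cell:      # mark with no base consonant: discard
            continue
        elif slot == FOLLOW:        # several following vowels may share a cell
            cell[FOLLOW] = cell.get(FOLLOW, "") + ch
        else:                       # a later mark replaces an earlier one
            cell[slot] = ch
    flush()
    return "".join(out)


# Tone mark typed before the below vowel is moved after it.
print(repair("\u0e01\u0e49\u0e39") == "\u0e01\u0e39\u0e49")
```

Replacing an already occupied subcell mimics a renderer that draws the later glyph over the earlier one, which is why only the last of two consecutive above vowels survives in this sketch.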

The idea of Thai letter sequence checking and repair is not new. Other repair algorithms attempt to swap pairs of glyphs in bad sequences to find a correctly composed sequence. My compositional model is compact (it could probably be refactored into fewer than 100 source lines of code). I haven't measured its performance against the other methods. You can download the thaicheck test cases (gzip, TIS-620) that I use to verify the behaviour of my model.

Download

File                         Platform            md5
thaicheck-1_3-ARM.tar.bz2    Linux ARM (Zaurus)  e807a2c6c1b58ec6fcf2248f4a9abd87
thaicheck-1_3-mac.tar.bz2    MacOS X             6a13de4807b98162679a9de945b2f55e
thaicheck-1_3-linux.tar.bz2  Linux x86           18e17300e9ca58b3cf3f745071c4258c
thaicheck-1_3-win.zip        Windows             830a4ed79f7208cc0110f4d02437f629

thaicheck is compiled with vbcc (Amiga/MorphOS) and gcc (Linux/Windows). The Linux and Windows binaries are statically linked. The Mac binary is universal (PPC and x86).

Instructions

Using thaicheck is straightforward; use the -h parameter to display help information. Optional parameters are shown in square brackets. The default strictness is 2 (strict).

Work:Dev/ThaiCheck> thaicheck -h
thaicheck v1.2
Usage: thaicheck -r input-file [-w output-file] [-s strict (0,1,2)] [-f]
        -w write to given filename, default = stdout
        -s strictness level, 0 = passthrough, 1 = basic check, 2 = strict
        -f invoke fix mode

To check a text file with strictest setting:

Work:Dev/ThaiCheck> thaicheck -r /Test/test-tis620.thai
Error at 2,2: Leading Vowel cannot be followed by Following Vowel 1 (211)
Error at 5,3: Leading Vowel cannot be followed by Following Vowel 1 (210)
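If you want to post-process these messages, the lines can be split with a small parser such as the sketch below. Two things are assumptions on my part: that the number in parentheses is the TIS-620 byte value of the offending glyph (211 and 210 would then be sara am and sara aa), and the names parse_error, first and second, which are hypothetical. The meaning of the two numbers after "Error at" is not interpreted here.

```python
import re

# "Error at <n>,<n>: <message>" with an optional trailing "(<code>)".
ERROR_LINE = re.compile(r"^Error at (\d+),(\d+): (.*?)(?: \((\d+)\))?$")


def parse_error(line):
    """Return (first, second, message, glyph) or None if the line does not match."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    first, second, message, code = match.groups()
    glyph = None
    if code is not None and int(code) < 256:
        # Assumption: the code is a TIS-620 byte value.
        glyph = bytes([int(code)]).decode("tis_620", errors="replace")
    return int(first), int(second), message, glyph


line = "Error at 2,2: Leading Vowel cannot be followed by Following Vowel 1 (211)"
print(parse_error(line))
```

Lines without a trailing code, such as "Partially composed cell.", are returned with glyph set to None.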

Some notes:

To repair a TIS-620 coded text file:

Work:Dev/ThaiCheck> thaicheck -r fixtest1.txt -w output.txt -f
Error at 2,5: Above Vowel 3 was composed by replacing a glyph
Error at 2,10: Above Vowel 3 cannot be composed : glyph ignored
Error at 3,11: Partially composed cell.
Error at 1,12: Leading Vowel was composed by replacing a glyph
Error at 3,13: Below Vowel 2 was composed by replacing a glyph
Error at 1,14: Leading Vowel was composed by replacing a glyph
Error at 1,16: Partially composed cell.
Error at 1,17: Partially composed cell.
Error at 2,7: Below Vowel 2 cannot be followed by Above Diacritic 1
Error at 2,14: Below Vowel 2 cannot be followed by Above Diacritic 1

More notes: