thaiconv

thaiconv is a tool for converting Thai text between character codings.

If you have read my Thai email guide then you will know that sending and receiving email in Thai can be beset with problems. To help you, thaiconv will determine the coding of a text file and convert it to a coding that you can read.

People on Un*x systems will probably be using something called iconv. I didn't know about iconv until several versions of thaiconv had already been released; however, thaiconv has features that iconv doesn't have. iconv is a general tool, whereas I created thaiconv to convert between Thai codings painlessly.

Features

thaiconv standard features include:
  1. Small command line executable.
  2. Conversion of TIS-620, UTF-8, HTML Unicode and cross coded UTF-8. See Codings for an explanation.
  3. Analysis mode to try to guess the coding of an input file.
  4. Used in Sontana, my Thai text editor.
thaiconv assumes that the text file you want to process is in Thai of some form. It is not suitable for processing other languages, although Roman letters in the file are fine as long as they have no accents.

Changes in Version 1.6

Improved guessing for the text coding.

Changes in Version 1.5

  • Support for Unicode BOM. In previous versions this caused problems with the scan function. From 1.5 the BOM can be read but will not be output (since BOM is irrelevant to UTF-8).
  • Write to a specified file.
  • Bug fixes.
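
As a rough illustration of the BOM handling described above, the following Python sketch drops a leading UTF-8 byte order mark before any further processing. The function name strip_utf8_bom is made up for this example, and this is not thaiconv's own code, only one way the same behaviour could be reproduced.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file that starts with a BOM followed by Thai text in UTF-8.
sample = codecs.BOM_UTF8 + "ไทย".encode("utf-8")
print(strip_utf8_bom(sample).decode("utf-8"))
```

The BOM is read and discarded, so it never reaches the output, which matches the behaviour listed for version 1.5.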

Download

Choose your platform:
Linux and Windows binaries are statically linked. The Mac binary is universal (PPC and x86). If the idea of typing a command into a shell sounds too technical for you, then use Sontana instead.

Instructions

Input Parameter Explanation

Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] -r input-filename [-w output-filename]
-h
Display useful help information.
-s
Scan input file and report on type.
-sq
Scan quiet: same as above but only output a number according to the coding.
-in informat
Define input format. Optional, default = use scan to determine. Input and output formats are represented as numbers on the command line. See the help information for details.
-out outformat
Define output format. Optional, default = 0 (TIS-620).
-r input-filename
Use input-filename as input file. Required.
-w output-filename
Use output-filename as output file. If the output filename is not specified then output will go to the console (stdout).
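
If you call thaiconv from a Python script, the parameters above can be assembled into an argument list. This is only a sketch: build_command is a hypothetical helper written for this page, it uses just the -in, -out, -r and -w parameters described above, and it assumes that thaiconv is on your PATH when you eventually run the command.

```python
import shlex
from typing import List, Optional


def build_command(infile: str,
                  outfile: Optional[str] = None,
                  in_format: Optional[int] = None,
                  out_format: Optional[int] = None) -> List[str]:
    """Build a thaiconv argument list from the documented parameters."""
    cmd = ["thaiconv"]
    if in_format is not None:
        cmd += ["-in", str(in_format)]    # omit to let thaiconv scan the file
    if out_format is not None:
        cmd += ["-out", str(out_format)]  # omit to get the default, TIS-620
    cmd += ["-r", infile]                 # the only required parameter
    if outfile is not None:
        cmd += ["-w", outfile]            # omit to write to stdout
    return cmd


# Show the command line that would be run for a conversion to UTF-8.
print(shlex.join(build_command("tis620file.txt", "utf8file.txt", out_format=1)))
```

The list can be passed to subprocess.run as it stands. Keeping the command as a list avoids quoting problems with filenames that contain spaces.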

Examples

Using thaiconv is straightforward. Use -h to get comprehensive help information:

Work:Dev/ThaiConv:> thaiconv -h
thaiconv: Thai text transcoding tool. Version 1.6, Build 04062009.
Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] -r infilename -w outfilename

Convert plain text file encoding for Thai to another encoding.
---
 -s    scan file to determine type
 -sq   scan quiet - as above but only output the input file mode number
 -in   input format, see list below
 -out  output format, see list below
 -r    filename to read
 -w    filename to write
 -h    this help
The only required parameter is -r.
If -w is not specified the output will go to stdout.
---
Input/Output Formats:
0 = TIS-620
1 = UTF-8 Thai
2 = HTML
3 = UTF-8 Latin 1 (cross coded Thai)
---
Notes:
If the input format is not specified then it will be determined
automatically. If the result is not obvious TIS-620 will be assumed.
Use scan mode to find automatic result.
Output format defaults to TIS-620 unless specified.

For extended information please see
<http://www.lyndonhill.com/Projects/thaiconv.html>

To convert a file from UTF-8 to TIS-620:

Work:Dev/ThaiConv:> thaiconv -r utf8file.txt -out 0 > tis620file.txt

To convert a file from TIS-620 to UTF-8:

Work:Dev/ThaiConv:> thaiconv -r tis620file.txt -out 1 > utf8file.txt
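
For comparison, the same two conversions can be sketched in Python with the codecs that ship with the standard library. This is not how thaiconv works internally; the function names are invented for the example, and unlike thaiconv the input coding has to be known in advance.

```python
def utf8_to_tis620(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as TIS-620 (one byte per Thai character)."""
    return data.decode("utf-8").encode("tis-620")


def tis620_to_utf8(data: bytes) -> bytes:
    """Re-encode TIS-620 bytes as UTF-8 (three bytes per Thai character)."""
    return data.decode("tis-620").encode("utf-8")


thai = "ภาษาไทย"
as_tis620 = utf8_to_tis620(thai.encode("utf-8"))
# Converting back should give the original text.
print(tis620_to_utf8(as_tis620).decode("utf-8") == thai)
```

Characters that have no TIS-620 equivalent make the encode step raise UnicodeEncodeError, so a real script would need to decide how to handle them.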

To convert a file from HTML Unicode to TIS-620:

Work:Dev/ThaiConv:> thaiconv -r htmlfile.txt -out 0 > tis620file.txt
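
To show what this conversion involves, here is a minimal Python sketch that turns decimal entities of the form &#NNNN; back into characters and then encodes the result as TIS-620. The helper name html_entities_to_text is hypothetical, and the sketch only handles decimal entities, which is an assumption about what the input contains.

```python
import re

# Decimal HTML entities such as &#3585;
_ENTITY = re.compile(r"&#(\d+);")


def html_entities_to_text(text: str) -> str:
    """Replace each decimal entity with the character it stands for."""
    return _ENTITY.sub(lambda m: chr(int(m.group(1))), text)


html = "Thai: &#3652;&#3607;&#3618;"
text = html_entities_to_text(html)
tis620_bytes = text.encode("tis-620")  # plain ASCII passes through unchanged
print(text)
```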

To get thaiconv to tell you about the text file's coding:

Work:Dev/ThaiConv:> thaiconv -s -r testfile.txt
thaiconv Scan Report
--------------------

12      plain ASCII characters.
0       extended ASCII characters.
17      HTML Unicode entities in Thai range.
29      Total characters.

File is probably Thai HTML Unicode
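
The counts in a report like this can be approximated with a few lines of Python. The sketch below is a guess at the kind of counting involved and not thaiconv's actual algorithm: it treats each decimal entity in the Thai range (3584 to 3711) as one character, counts every remaining byte including line endings, and does not attempt the final "probably" verdict.

```python
import re

# Decimal HTML entities such as &#3585; in raw bytes
_ENTITY = re.compile(rb"&#(\d+);")


def scan(data: bytes) -> dict:
    """Count plain ASCII bytes, extended bytes and Thai HTML entities."""
    thai_entities = 0

    def drop_thai_entity(match):
        nonlocal thai_entities
        if 0x0E00 <= int(match.group(1)) <= 0x0E7F:
            thai_entities += 1
            return b""          # counted as one entity, not as ASCII
        return match.group(0)   # leave other entities in place

    rest = _ENTITY.sub(drop_thai_entity, data)
    ascii_count = sum(1 for byte in rest if byte < 0x80)
    extended_count = len(rest) - ascii_count
    return {
        "ascii": ascii_count,
        "extended": extended_count,
        "thai_entities": thai_entities,
        "total": ascii_count + extended_count + thai_entities,
    }


print(scan(b"Hello &#3585;&#3586;\n"))
```

A file with many Thai entities and no extended bytes points towards HTML Unicode, while a file with many extended bytes would need a closer look to tell TIS-620 from UTF-8.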

If you want to use thaiconv in a script and just want to know what coding the file is without parsing a lot of output, use -sq:

Work:Dev/ThaiConv:> thaiconv -sq -r testfile.txt
2
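
A script can turn that number back into a name. The Python sketch below assumes that the number printed by -sq uses the same numbering as the Input/Output Formats list in the help text, which the example above (2 for an HTML file) suggests. The function names are invented, and detect_coding assumes that thaiconv is on your PATH.

```python
import subprocess

# Numbering taken from the Input/Output Formats list in the help text.
FORMATS = {
    0: "TIS-620",
    1: "UTF-8 Thai",
    2: "HTML",
    3: "UTF-8 Latin 1 (cross coded Thai)",
}


def describe_scan_result(output: str) -> str:
    """Map the number printed by 'thaiconv -sq' to a format name."""
    code = int(output.strip())
    return FORMATS.get(code, "unknown ({})".format(code))


def detect_coding(path: str) -> str:
    """Run 'thaiconv -sq -r path' and name the coding it reports."""
    result = subprocess.run(["thaiconv", "-sq", "-r", path],
                            capture_output=True, text=True, check=True)
    return describe_scan_result(result.stdout)


print(describe_scan_result("2\n"))
```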

Codings

thaiconv understands the following formats.

Standard 7 bit ASCII
All alphabetical, numeric and punctuation characters used in standard ASCII. No accents, umlauts, ulls, fancy punctuation or graphics.
TIS-620
Thai characters are stored in the "Latin1 area", i.e. using byte values beyond 7 bit ASCII, thus allowing ASCII and Thai to co-exist.
UTF-8 (Thai range)
Unicode encoded as UTF-8, specifically the range for Thai characters (0xE00 - 0xE7F).
HTML Unicode (Thai range)
Unicode as represented in HTML: an entity of the form &#NNNN; where NNNN is a decimal number.
Cross coded UTF-8
TIS-620 that has been treated as Latin1 (0xA0-0xF0) and converted to UTF-8. For example, the Thai character that has the value 233 in TIS-620 has the Latin1 representation é, and this character gets converted to the Unicode for é. This mode is likely to be converted correctly only if the cross coding and decoding use the same locale.
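
Cross coding can be undone by reversing the wrong step. The Python sketch below assumes that the text was misread as Latin1 (ISO 8859-1) before being converted to UTF-8; if a different Latin code page was involved, the "latin-1" codec name would have to change. The function name repair_cross_coded is invented for this example and this is not thaiconv's own code.

```python
def repair_cross_coded(data: bytes) -> str:
    """Recover Thai text from TIS-620 bytes that were converted to UTF-8 as Latin1."""
    latin_text = data.decode("utf-8")           # e.g. accented Latin characters
    original_bytes = latin_text.encode("latin-1")  # back to the original byte values
    return original_bytes.decode("tis-620")     # read those bytes as Thai


# Simulate the damage: Thai text stored as TIS-620, then misread as Latin1.
damaged = "ไทย".encode("tis-620").decode("latin-1").encode("utf-8")
print(repair_cross_coded(damaged))
```

If the damaged text contains characters outside Latin1, the encode step raises UnicodeEncodeError, which is a useful sign that the file was not cross coded in this way.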