thaiconv | Lyndon Hill

thaiconv is a command line tool for converting between text codings for Thai.

If you have read my Thai email guide then you will know that sending and receiving email in Thai can be beset with problems. To help you, thaiconv will determine the coding of a text file and convert it to a coding that you can read.

People on Un*x systems will be using something called iconv. I didn't know about iconv until after several versions of thaiconv had been released. However, thaiconv has features that iconv doesn't have: I created thaiconv to painlessly convert between Thai codings, whereas iconv is a general tool.

Features

thaiconv standard features include:

Conversion of TIS-620, UTF-8, HTML Unicode and cross coded UTF-8. See Codings for an explanation
Analysis mode to try to guess the coding of an input file
Small command line executable
Used in Sontana, my Thai text editor

thaiconv assumes that the text file you want to process is in Thai of some form. It's not suitable for processing other languages, although it is OK if there are plain Roman letters in the file (i.e. letters without accents).

Changes in Version 1.8

Updated code based on suggestions from static analysis.

Changes in Version 1.7

Option to write Unicode BOM
Option to disable decoding Thai HTML numeric entities
Read hexadecimal HTML numeric entities
Allow codepoints of other languages to pass through when reading UTF-8
Improved detection of coding types

Changes in Version 1.6

Improved guessing for the text coding.

Changes in Version 1.5

Support for Unicode BOM. In previous versions this caused problems with the scan function. From 1.5 the BOM can be read but will not be output (since BOM is irrelevant to UTF-8)
Write to a specified file
Bug fixes
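
As a rough illustration of the BOM behaviour described for version 1.5 (the BOM is read but not written back out), here is a minimal Python sketch. It is not thaiconv's own code, and the function name `strip_utf8_bom` is made up for this example.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# The Thai letter ko kai (U+0E01) encoded as UTF-8, with a BOM in front.
with_bom = codecs.BOM_UTF8 + "\u0e01".encode("utf-8")
print(strip_utf8_bom(with_bom) == "\u0e01".encode("utf-8"))  # prints True
```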

Download

File	Platform	md5
thaiconv-1_8-ARM.tar.bz2	Linux ARM (Zaurus)	ef7abcd9e879ab2f20f18cd36991a692
thaiconv-1_8-mac.tar.bz2	MacOS X	269873120dad1d505237e23238805c5d
thaiconv-1_8-linux.tar.bz2	Linux x86	c165d31faede5f0bbaff6d4a8c71b6d3
thaiconv-1_8-win.zip	Windows	1ebb17f41d59646381fac66dc928f26a
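
The md5 column can be used to check that a download arrived intact. The following Python sketch shows one way to do that with the standard `hashlib` module; the helper names `md5_of_file` and `matches_published_md5` are invented for this example, and it assumes the archive has already been saved locally.

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 with the value listed in the table."""
    return md5_of_file(path) == expected.strip().lower()

# Example call, once the file has been downloaded:
# matches_published_md5("thaiconv-1_8-linux.tar.bz2",
#                       "c165d31faede5f0bbaff6d4a8c71b6d3")
```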

Linux and Windows binaries are statically linked. If the idea of typing a command into a shell sounds too technical for you, then use Sontana instead.

Instructions

Input Parameter Explanation

Usage:

thaiconv [-h] [-s] [-sq] [-in X] [-out Y] [-noent] [-bom] -r input-filename [-w output-filename]

-h: Display useful help information.
-s: Scan input file and report on type.
-sq: Scan quiet: same as above but only output a number according to the coding.
-in informat: Define input format. Optional, default = use scan to determine. Input and output formats are represented as numbers on the command line. See the help information for details.
-out outformat: Define output format. Optional, default = 0 (TIS-620).
-r input-filename: Use input-filename as input file. Required.
-w output-filename: Use output-filename as output file. If the output filename is not specified then output will go to the console (stdout).
-noent: Do not convert HTML numeric entities in TIS-620 or UTF-8 modes
-bom: Write a BOM at the start of Unicode files
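
If thaiconv is called from another program, the options above can be assembled into an argument list. The Python sketch below is only an illustration: the helper names `build_thaiconv_command` and `run_thaiconv` are made up, it assumes thaiconv is on the PATH, and it assumes the options can be given in the order shown in the usage line.

```python
import subprocess
from typing import List, Optional

def build_thaiconv_command(infile: str,
                           outfile: Optional[str] = None,
                           in_format: Optional[int] = None,
                           out_format: int = 0,
                           noent: bool = False,
                           bom: bool = False) -> List[str]:
    """Build a thaiconv argument list from the documented options."""
    cmd = ["thaiconv"]
    if in_format is not None:
        cmd += ["-in", str(in_format)]    # omitted: thaiconv scans the file
    cmd += ["-out", str(out_format)]      # 0 = TIS-620 is the default
    if noent:
        cmd.append("-noent")
    if bom:
        cmd.append("-bom")
    cmd += ["-r", infile]
    if outfile is not None:
        cmd += ["-w", outfile]            # omitted: output goes to stdout
    return cmd

def run_thaiconv(cmd: List[str]) -> bytes:
    """Run the command and return whatever thaiconv wrote to stdout."""
    return subprocess.run(cmd, check=True, capture_output=True).stdout

print(build_thaiconv_command("tis620file.txt", "utf8file.txt", out_format=1))
# ['thaiconv', '-out', '1', '-r', 'tis620file.txt', '-w', 'utf8file.txt']
```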

Examples

Using thaiconv is straightforward. Use -h to get comprehensive help information:

Work:Dev/ThaiConv:> thaiconv -h
thaiconv: Thai text transcoding tool. Version 1.7, Build 15122013.
Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] [-noent] [-bom] -r infilename [-w outfilename]

Convert plain text file encoding for Thai to another encoding.
---
 -s      scan file to determine type
 -sq     scan quiet - as above but only output the input file mode number
 -in     input format, see list below
 -out    output format, see list below
 -r      filename to read (required)
 -w      filename to write (default = stdout)
 -noent  don't convert HTML entities when reading
 -bom    write BOM when writing Unicode
 -h      this help
---
Input/Output Formats:
0 = TIS-620
1 = UTF-8 Thai
2 = HTML
3 = UTF-8 Latin 1 (cross coded Thai)
---
Notes:
If the input format is not specified then it will be determined
automatically. If the result is not obvious TIS-620 will be assumed.
Use scan mode to find automatic result.
Output format defaults to TIS-620 unless specified.

For extended information please see
<http://www.lyndonhill.com/Projects/thaiconv.html>

To convert a file from UTF-8 to TIS-620:

Work:Dev/ThaiConv:> thaiconv -r utf8file.txt -out 0 > tis620file.txt

To convert a file from TIS-620 to UTF-8:

Work:Dev/ThaiConv:> thaiconv -r tis620file.txt -out 1 > utf8file.txt

To convert a file from HTML Unicode to TIS-620:

Work:Dev/ThaiConv:> thaiconv -r htmlfile.txt -out 0 > tis620file.txt

To get thaiconv to tell you about the text file's coding:

Work:Dev/ThaiConv:> thaiconv -s -r testfile.txt
thaiconv Scan Report
--------------------

12      plain ASCII characters.
0       extended ASCII characters.
17      HTML Unicode entities in Thai range.
29      Total characters.

File is probably Thai HTML Unicode
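
The kind of counting shown in the scan report can be sketched in Python as follows. This is a guess at the general idea only, not thaiconv's actual algorithm: the function name `scan_counts`, the rule that each Thai entity counts as one character, and the treatment of line breaks as plain ASCII are all assumptions made for this example.

```python
import re

# Decimal (&#3585;) or hexadecimal (&#x0E01;) HTML numeric entities.
ENTITY = re.compile(rb"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def scan_counts(data: bytes) -> dict:
    """Count plain ASCII bytes, bytes above 127 and Thai-range entities."""
    thai_entities = 0

    def _replace(match):
        nonlocal thai_entities
        hex_part, dec_part = match.group(1), match.group(2)
        code = int(hex_part, 16) if hex_part else int(dec_part)
        if 0x0E00 <= code <= 0x0E7F:
            thai_entities += 1
            return b""             # counted once, so drop it from the bytes
        return match.group(0)      # other entities stay as plain text

    rest = ENTITY.sub(_replace, data)
    plain = sum(1 for b in rest if b < 0x80)
    extended = sum(1 for b in rest if b >= 0x80)
    return {
        "plain_ascii": plain,
        "extended_ascii": extended,
        "thai_entities": thai_entities,
        "total": plain + extended + thai_entities,
    }

print(scan_counts(b"Hi &#3585;&#x0E02;\n"))
```

A file dominated by Thai-range entities, as in the report above, would then be a candidate for "Thai HTML Unicode".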

If you want to use thaiconv in a script and just want to know what coding the file is without parsing a lot of output:

Work:Dev/ThaiConv:> thaiconv -sq -r testfile.txt
2
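
A script could turn that number back into a readable name. The Python sketch below assumes that the number printed by -sq uses the same numbering as the Input/Output Formats list in the help text, and that thaiconv is on the PATH; `coding_name` and `detect_coding` are hypothetical helper names.

```python
import subprocess

# Numbering taken from the "Input/Output Formats" list in the help output.
FORMAT_NAMES = {
    0: "TIS-620",
    1: "UTF-8 Thai",
    2: "HTML",
    3: "UTF-8 Latin 1 (cross coded Thai)",
}

def coding_name(sq_output: str) -> str:
    """Translate the text printed by 'thaiconv -sq' into a format name."""
    try:
        number = int(sq_output.strip())
    except ValueError:
        return "unknown"
    return FORMAT_NAMES.get(number, "unknown")

def detect_coding(path: str) -> str:
    """Run 'thaiconv -sq -r path' and return the name of the coding."""
    result = subprocess.run(["thaiconv", "-sq", "-r", path],
                            capture_output=True, text=True, check=True)
    return coding_name(result.stdout)

print(coding_name("2\n"))  # prints HTML
```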

Codings

The following table lists formats understood by thaiconv.

Standard 7 bit ASCII	All alphabetical, numeric and punctuation characters used in standard ASCII. No accents, umlauts, ulls, fancy punctuation or graphics.
TIS-620	Thai characters are stored in the upper half of an 8 bit character set (byte values above 127), leaving the 7 bit ASCII range unchanged and thus allowing ASCII and Thai to co-exist.
UTF-8 (Thai range)	Unicode encoded as UTF-8, specifically the Thai block (U+0E00 - U+0E7F).
HTML Unicode (Thai range)	Unicode as represented in HTML: An entity of the form `&#NNNN;` or `&#xHHHH;` where NNNN is a decimal number and HHHH is a hexadecimal number.
Cross coded UTF-8	TIS-620 that has been read as Latin 1 and converted to UTF-8 (0xA0-0xF0). For example, the Thai character that has the value 233 in TIS-620 has the Latin 1 representation é, so this character gets converted to the Unicode for é. This mode is likely to be converted correctly only if the cross coding and decoding occur in the same locale.
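
To make the codings in the table concrete, the following Python sketch writes the same Thai word in each of them using the standard `tis_620`, `utf-8` and `latin-1` codecs. It only illustrates the formats and is not how thaiconv is implemented; the variable names and the helper `decode_cross_coded` are made up for this example.

```python
import html

text = "ไทย"  # U+0E44 U+0E17 U+0E22

# TIS-620: one byte per Thai character, all above 127.
tis620 = text.encode("tis_620")

# UTF-8: three bytes per character in the Thai range.
utf8 = text.encode("utf-8")

# HTML Unicode: decimal numeric entities for characters in the Thai range.
html_entities = "".join(
    f"&#{ord(c)};" if 0x0E00 <= ord(c) <= 0x0E7F else c for c in text
)

# Cross coded UTF-8: TIS-620 bytes misread as Latin 1, then stored as UTF-8.
cross = tis620.decode("latin-1").encode("utf-8")

def decode_cross_coded(data: bytes) -> str:
    """Undo cross coding: UTF-8 -> Latin 1 bytes -> TIS-620 text."""
    return data.decode("utf-8").encode("latin-1").decode("tis_620")

print(tis620)                      # b'\xe4\xb7\xc2'
print(html_entities)               # &#3652;&#3607;&#3618;
print(cross.decode("utf-8"))       # ä·Â
print(decode_cross_coded(cross) == text)       # True
print(html.unescape(html_entities) == text)    # True
```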