thaiconv
thaiconv is a tool for converting Thai text from one coding to another.
If you have read my Thai email guide then you will know
that sending and receiving email in Thai can be beset with problems. To help you,
thaiconv will determine the coding of a text file and convert it to a coding
that you can read.
People on Un*x systems will be using something called iconv.
I didn't know about iconv until I had already released several versions of thaiconv.
However, thaiconv has features that iconv doesn't have: I created thaiconv to convert
painlessly between Thai codings, whereas iconv is a general tool.
Features
thaiconv standard features include:
- Small command line executable.
- Conversion between TIS-620, UTF-8, HTML Unicode and cross coded UTF-8.
See Codings for an explanation.
- Analysis mode to try to guess the coding of an input file.
- Used in Sontana, my Thai text editor.
thaiconv assumes that the text file you want to process is in Thai of some form. It's
not suitable for processing other languages, although it is OK if the file also contains
plain Roman letters (i.e. letters without accents).
Changes in Version 1.6
Improved guessing for the text coding.
Changes in Version 1.5
- Support for Unicode BOM. In previous versions this caused problems with the scan function. From 1.5 the BOM can be read but will not be output (since BOM is irrelevant to UTF-8).
- Write to a specified file.
- Bug fixes.
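The BOM handling described above can be pictured with a short Python sketch. This is only an illustration of the idea (read the BOM, do not write it back out), not thaiconv's own code; the function name strip_utf8_bom is hypothetical.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# A UTF-8 file that starts with a BOM followed by three Thai characters.
sample = codecs.BOM_UTF8 + "\u0e44\u0e17\u0e22".encode("utf-8")
print(strip_utf8_bom(sample))
```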
Download
Choose your platform:
Linux and Windows binaries are statically linked. The Mac binary is universal (PPC and x86).
If the idea of typing a command into a shell sounds too technical for you, then
use Sontana instead.
Instructions
Input Parameter Explanation
Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] -r input-filename -w output-filename
- -h
- Display useful help information.
- -s
- Scan input file and report on type.
- -sq
- Scan quiet: same as above but only output a number according to the coding.
- -in informat
- Define input format. Optional, default = use scan to determine. Input and output
formats are represented as numbers on the command line. See the help information for details.
- -out outformat
- Define output format. Optional, default = 0 (TIS-620).
- -r input-filename
- Use input-filename as input file. Required.
- -w output-filename
- Use output-filename as output file. If the output filename is not specified then
output will go to the console (stdout).
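If you want to call thaiconv from a Python script, the parameters above could be assembled as in the following sketch. It uses only the options documented here; the helper name thaiconv_args is hypothetical, and it assumes a thaiconv executable is on your PATH when the list is actually run.

```python
from typing import List, Optional

def thaiconv_args(infile: str,
                  outfile: Optional[str] = None,
                  in_format: Optional[int] = None,
                  out_format: Optional[int] = None) -> List[str]:
    """Build a thaiconv command line from the documented options."""
    args = ["thaiconv"]
    if in_format is not None:      # omitted: thaiconv scans the file itself
        args += ["-in", str(in_format)]
    if out_format is not None:     # omitted: thaiconv defaults to 0 (TIS-620)
        args += ["-out", str(out_format)]
    args += ["-r", infile]         # the only required parameter
    if outfile is not None:        # omitted: output goes to stdout
        args += ["-w", outfile]
    return args

print(thaiconv_args("utf8file.txt", "tis620file.txt", out_format=0))
```

The resulting list can be handed to subprocess.run(args, check=True).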
Examples
Using thaiconv is straightforward. Use -h to get comprehensive help information:
Work:Dev/ThaiConv:> thaiconv -h
thaiconv: Thai text transcoding tool. Version 1.6, Build 04062009.
Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] -r infilename -w outfilename
Convert plain text file encoding for Thai to another encoding.
---
-s scan file to determine type
-sq scan quiet - as above but only output the input file mode number
-in input format, see list below
-out output format, see list below
-r filename to read
-w filename to write
-h this help
The only required parameter is -r.
If -w is not specified the output will go to stdout.
---
Input/Output Formats:
0 = TIS-620
1 = UTF-8 Thai
2 = HTML
3 = UTF-8 Latin 1 (cross coded Thai)
---
Notes:
If the input format is not specified then it will be determined
automatically. If the result is not obvious TIS-620 will be assumed.
Use scan mode to find automatic result.
Output format defaults to TIS-620 unless specified.
For extended information please see
<http://www.lyndonhill.com/Projects/thaiconv.html>
To convert a file from UTF-8 to TIS-620 :
Work:Dev/ThaiConv:> thaiconv -r utf8file.txt -out 0 > tis620file.txt
To convert a file from TIS-620 to UTF-8 :
Work:Dev/ThaiConv:> thaiconv -r tis620file.txt -out 1 > utf8file.txt
To convert a file from HTML Unicode to TIS-620:
Work:Dev/ThaiConv:> thaiconv -r htmlfile.txt -out 0 > tis620file.txt
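To see what these three conversions do to the bytes, here is a rough Python equivalent built on Python's own tis-620 and utf-8 codecs and html.unescape. It is a sketch for comparison only, not how thaiconv is implemented, and the function names are hypothetical; edge cases may well be handled differently by thaiconv.

```python
import html

def utf8_to_tis620(data: bytes) -> bytes:
    """UTF-8 bytes to TIS-620 bytes."""
    return data.decode("utf-8").encode("tis-620")

def tis620_to_utf8(data: bytes) -> bytes:
    """TIS-620 bytes to UTF-8 bytes."""
    return data.decode("tis-620").encode("utf-8")

def html_to_tis620(text: str) -> bytes:
    """Text containing &#NNNN; entities to TIS-620 bytes."""
    return html.unescape(text).encode("tis-620")

# Three Thai characters: U+0E44, U+0E17, U+0E22.
thai = "\u0e44\u0e17\u0e22"
print(utf8_to_tis620(thai.encode("utf-8")))        # one byte per Thai character
print(tis620_to_utf8(thai.encode("tis-620")))      # three bytes per Thai character
print(html_to_tis620("&#3652;&#3607;&#3618;"))     # same bytes as the first line
```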
To get thaiconv to tell you about the text file's coding:
Work:Dev/ThaiConv:> thaiconv -s -r testfile.txt
thaiconv Scan Report
--------------------
12 plain ASCII characters.
0 extended ASCII characters.
17 HTML Unicode entities in Thai range.
29 Total characters.
File is probably Thai HTML Unicode
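The kind of counting shown in the scan report can be approximated in a few lines of Python. This is a guess at the general approach, not thaiconv's actual scan algorithm: it assumes that each Thai-range entity counts as one character, that the entity's own ASCII characters are not counted separately, and that line endings count as plain ASCII. The function name scan is hypothetical.

```python
import re

# Decimal HTML entity such as &#3652;
ENTITY = re.compile(rb"&#(\d+);")

def scan(data: bytes) -> dict:
    """Count plain ASCII bytes, bytes above 127, and Thai-range HTML entities."""
    counts = {"ascii": 0, "extended": 0, "thai_entities": 0}

    def drop_thai_entity(match):
        # Remove entities in the Thai block (0xE00 - 0xE7F) and count them.
        if 0x0E00 <= int(match.group(1)) <= 0x0E7F:
            counts["thai_entities"] += 1
            return b""
        return match.group(0)

    rest = ENTITY.sub(drop_thai_entity, data)
    counts["ascii"] = sum(1 for byte in rest if byte < 0x80)
    counts["extended"] = len(rest) - counts["ascii"]
    counts["total"] = counts["ascii"] + counts["extended"] + counts["thai_entities"]
    return counts

print(scan(b"Hello: &#3652;&#3607;&#3618;\n"))
```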
If you want to use thaiconv in a script and just want to know what coding the file is without
parsing a lot of output:
Work:Dev/ThaiConv:> thaiconv -sq -r testfile.txt
2
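In a Python script, the number printed by -sq could be turned back into a format name as sketched below. The number-to-name table is copied from the help information above; the function names are hypothetical, and detect_format assumes a thaiconv executable is on your PATH.

```python
import subprocess

# Format numbers as listed in the thaiconv help information.
FORMAT_NAMES = {
    0: "TIS-620",
    1: "UTF-8 Thai",
    2: "HTML",
    3: "UTF-8 Latin 1 (cross coded Thai)",
}

def format_name(scan_quiet_output: str) -> str:
    """Map the output of 'thaiconv -sq' to a format name."""
    return FORMAT_NAMES[int(scan_quiet_output.strip())]

def detect_format(path: str) -> str:
    """Run 'thaiconv -sq -r path' and return the name of the detected format."""
    result = subprocess.run(["thaiconv", "-sq", "-r", path],
                            capture_output=True, text=True, check=True)
    return format_name(result.stdout)

print(format_name("2\n"))
```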
Codings
The following table lists formats understood by thaiconv.
| Coding | Description |
| --- | --- |
| Standard 7 bit ASCII | All alphabetical, numeric and punctuation characters used in standard ASCII. No accents, umlauts, ulls, fancy punctuation or graphics. |
| TIS-620 | Thai characters are stored in the "Latin1 area" above 7 bit ASCII, i.e. as byte values beyond 127, so that ASCII and Thai can co-exist. |
| UTF-8 (Thai range) | Unicode encoded as UTF-8, specifically the section on Thai characters (0xE00 - 0xE7F). |
| HTML Unicode (Thai range) | Unicode as represented in HTML: an entity of the form &#NNNN; where NNNN is a decimal number. |
| Cross coded UTF-8 | TIS-620 that has been read as Latin1 (0xA0-0xF0) and converted to UTF-8. For example, the Thai character that has the value 233 in TIS-620 has the Latin1 representation é, so this character gets converted to the Unicode for é. This mode is likely to be converted correctly only if the cross coding and decoding occur in the same locale. |
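Cross coding is easier to follow with a concrete round trip. The Python sketch below assumes the single-byte interpretation was ISO 8859-1 (Latin1); a system that used a different code page would need that code page instead, which is why the result depends on where the cross coding happened. The function names are hypothetical and this is not thaiconv's own code.

```python
def cross_code(tis620_bytes: bytes) -> bytes:
    """Misread TIS-620 bytes as Latin1 and save the result as UTF-8."""
    return tis620_bytes.decode("latin-1").encode("utf-8")

def undo_cross_code(utf8_bytes: bytes) -> str:
    """Recover the Thai text from cross coded UTF-8."""
    return utf8_bytes.decode("utf-8").encode("latin-1").decode("tis-620")

# U+0E49 is byte 0xE9 (233) in TIS-620, which Latin1 reads as e-acute (U+00E9).
mai_tho = "\u0e49"
crossed = cross_code(mai_tho.encode("tis-620"))
print(crossed)                               # the UTF-8 bytes for e-acute
print(undo_cross_code(crossed) == mai_tho)   # the round trip restores the Thai character
```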