thaiconv
thaiconv is a tool for converting text coding for Thai. If you have read my Thai email guide then you will know that sending and receiving email in Thai can be beset with problems. To help, thaiconv determines the coding of a text file and converts it to a coding that you can read.
People on Un*x systems will probably be using a tool called iconv. I didn't know about iconv until after several versions of thaiconv; however, thaiconv has features that iconv doesn't have. I created thaiconv to convert painlessly between Thai codings, whereas iconv is a general-purpose tool.
Features
thaiconv standard features include:
- Small command line executable.
- Conversion of TIS-620, UTF-8, HTML Unicode and cross coded UTF-8. See Codings for an explanation.
- Analysis mode to try to guess the coding of an input file.
- Used in Sontana, my Thai text editor.
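To give a feel for what guessing a coding involves, here is a minimal Python sketch. It is not thaiconv's actual algorithm, which is not documented here; the function name `guess_coding` and the order of the checks are assumptions made for illustration. The returned numbers follow the format numbers shown in thaiconv's help output, and the fallback to TIS-620 mirrors the note in that help text.

```python
import re

# Format numbers as listed in thaiconv's help output.
TIS_620, UTF8_THAI, HTML_UNICODE, CROSS_CODED = 0, 1, 2, 3


def guess_coding(data: bytes) -> int:
    """Hypothetical heuristic, not thaiconv's real implementation."""
    # Decimal HTML entities such as &#3652; that fall in the Thai block.
    for number in re.findall(rb"&#(\d{1,7});", data):
        if 0x0E00 <= int(number) <= 0x0E7F:
            return HTML_UNICODE
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # High bytes that do not form valid UTF-8: treat as TIS-620.
        return TIS_620
    # Valid UTF-8 containing characters from the Thai block.
    if any("\u0e00" <= ch <= "\u0e7f" for ch in text):
        return UTF8_THAI
    # Valid UTF-8 containing Latin1 characters: possibly cross coded Thai.
    if any("\u00a0" <= ch <= "\u00ff" for ch in text):
        return CROSS_CODED
    # Plain ASCII or nothing conclusive: assume TIS-620.
    return TIS_620
```

A real scanner would count characters of each kind, as the scan report does, rather than stop at the first match.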
Changes in Version 1.6
- Improved guessing for the text coding.
Changes in Version 1.5
- Support for Unicode BOM. In previous versions this caused problems with the scan function. From 1.5 the BOM can be read but will not be output (since BOM is irrelevant to UTF-8).
- Write to a specified file.
- Bug fixes.
Download
| File | Platform | Version |
|---|---|---|
| thaiconv-1_6-amiga.lha | AmigaOS 3, AmigaOS 4 and MorphOS | 1.6 |
| thaiconv-1_6-ARM.tar.bz2 | Linux ARM (Zaurus) | 1.6 |
| thaiconv-1_6-PPC.tar.bz2 | Linux PPC | 1.6 |
| thaiconv-1_6-mac.tar.bz2 | MacOS X | 1.6 |
| thaiconv-1_5-x86.tar.bz2 | Linux x86 | 1.5 |
| thaiconv-1_5-win.zip | Windows | 1.5 |
Linux and Windows binaries are statically linked. The Mac binary is universal (PPC and x86). If the idea of typing a command into a shell sounds too technical for you, then use Sontana instead.
Instructions
Input Parameter Explanation
Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] -r input-filename -w output-filename
- -h: Display useful help information.
- -s: Scan the input file and report on its type.
- -sq: Scan quiet. Same as -s, but only output a number according to the coding.
- -in informat: Define the input format. Optional; by default a scan is used to determine it. Input and output formats are represented as numbers on the command line. See the help information for details.
- -out outformat: Define the output format. Optional; default = 0 (TIS-620).
- -r input-filename: Use input-filename as the input file. Required.
- -w output-filename: Use output-filename as the output file. If the output filename is not specified then output will go to the console (stdout).
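If you want to drive thaiconv from a program rather than a shell, the parameters above map naturally onto an argument list. The following Python sketch shows one way to do it. The helper names `build_args` and `convert` are hypothetical, and it assumes an executable called `thaiconv` can be found on your path.

```python
import subprocess
from typing import List, Optional


def build_args(infile: str, outfile: Optional[str] = None,
               informat: Optional[int] = None, outformat: int = 0,
               exe: str = "thaiconv") -> List[str]:
    """Build a thaiconv command line from the documented parameters."""
    args = [exe]
    if informat is not None:
        # Omit -in to let thaiconv scan the file and decide for itself.
        args += ["-in", str(informat)]
    args += ["-out", str(outformat), "-r", infile]
    if outfile is not None:
        # Without -w the converted text goes to stdout.
        args += ["-w", outfile]
    return args


def convert(infile: str, outfile: Optional[str] = None,
            informat: Optional[int] = None, outformat: int = 0) -> bytes:
    """Run thaiconv and return whatever it wrote to stdout."""
    result = subprocess.run(build_args(infile, outfile, informat, outformat),
                            capture_output=True)
    return result.stdout
```

For example, `build_args("tis620file.txt", outformat=1)` gives `["thaiconv", "-out", "1", "-r", "tis620file.txt"]`.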
Examples
Using thaiconv is straightforward; use -h to get comprehensive help information:

    Work:Dev/ThaiConv:> thaiconv -h
    thaiconv: Thai text transcoding tool.
    Version 1.6, Build 04062009.
    Usage: thaiconv [-h] [-s] [-sq] [-in X] [-out Y] -r infilename -w outfilename
    Convert plain text file encoding for Thai to another encoding.
    ---
    -s    scan file to determine type
    -sq   scan quiet - as above but only output the input file mode number
    -in   input format, see list below
    -out  output format, see list below
    -r    filename to read
    -w    filename to write
    -h    this help
    The only required parameter is -r. If -w is not specified the output will go to stdout.
    ---
    Input/Output Formats:
    0 = TIS-620
    1 = UTF-8 Thai
    2 = HTML
    3 = UTF-8 Latin 1 (cross coded Thai)
    ---
    Notes:
    If the input format is not specified then it will be determined automatically.
    If the result is not obvious TIS-620 will be assumed. Use scan mode to find automatic result.
    Output format defaults to TIS-620 unless specified.
    For extended information please see <http://www.lyndonhill.com/Projects/thaiconv.html>

To convert a file from UTF-8 to TIS-620:

    Work:Dev/ThaiConv:> thaiconv -r utf8file.txt -out 0 > tis620file.txt

To convert a file from TIS-620 to UTF-8:

    Work:Dev/ThaiConv:> thaiconv -r tis620file.txt -out 1 > utf8file.txt

To convert a file from HTML Unicode to TIS-620:

    Work:Dev/ThaiConv:> thaiconv -r htmlfile.txt -out 0 > tis620file.txt

To get thaiconv to tell you about the text file's coding:

    Work:Dev/ThaiConv:> thaiconv -s -r testfile.txt
    thaiconv Scan Report
    --------------------
    12 plain ASCII characters.
    0 extended ASCII characters.
    17 HTML Unicode entities in Thai range.
    29 Total characters.
    File is probably Thai HTML Unicode

If you want to use thaiconv in a script and just want to know what coding the file is without parsing a lot of output:

    Work:Dev/ThaiConv:> thaiconv -sq -r testfile.txt
    2
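As an illustration of using -sq from a script, here is a small Python sketch that turns the number into the format name listed in the help output. The function names are hypothetical, and it assumes that -sq prints nothing but the number on stdout, as in the example above.

```python
import subprocess

# Format numbers and names as printed by "thaiconv -h".
FORMAT_NAMES = {
    0: "TIS-620",
    1: "UTF-8 Thai",
    2: "HTML",
    3: "UTF-8 Latin 1 (cross coded Thai)",
}


def parse_scan_quick(output: str) -> str:
    """Map the number printed by -sq to its format name."""
    return FORMAT_NAMES[int(output.strip())]


def scan_quick(path: str, exe: str = "thaiconv") -> str:
    """Run "thaiconv -sq -r path" and return the name of the coding."""
    # Assumes an executable named thaiconv can be found on the path.
    result = subprocess.run([exe, "-sq", "-r", path],
                            capture_output=True, text=True)
    return parse_scan_quick(result.stdout)
```

With the test file from the example above, `parse_scan_quick("2")` gives `"HTML"`.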
Codings
The following table lists the formats understood by thaiconv.
| Coding | Description |
|---|---|
| Standard 7 bit ASCII | All alphabetical, numeric and punctuation characters used in standard ASCII. No accents, umlauts, ulls, fancy punctuation or graphics. |
| TIS-620 | Thai characters are stored in the upper half of the 8 bit range (the area that Latin1 uses for accented characters), i.e. using byte values beyond 7 bit ASCII; this allows ASCII and Thai to co-exist. |
| UTF-8 (Thai range) | The Unicode standard, specifically the section on Thai characters (0xE00 - 0xE7F). |
| HTML Unicode (Thai range) | Unicode as represented in HTML: An entity of the form &#NNNN; where NNNN is a decimal number. |
| Cross coded UTF-8 | TIS-620 bytes that have been treated as Latin1 (0xA0-0xF0) and then converted to UTF-8. For example, the Thai character that has the value 233 in TIS-620 has the Latin1 representation é; this character gets converted to the Unicode for é. This mode is likely to be converted correctly only if the cross coding and decoding occur in the same locality. |
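To make the codings in the table concrete, the following Python sketch writes the same three Thai characters in each of them, using Python's built-in `tis_620` and `latin-1` codecs. It is only an illustration of the codings and does not use thaiconv itself; the helper `from_cross_coded` is a hypothetical name.

```python
import html

text = "\u0e44\u0e17\u0e22"  # the word "ไทย"

# TIS-620: one byte per Thai character, above the 7 bit ASCII range.
tis620 = text.encode("tis_620")

# UTF-8 (Thai range): three bytes per Thai character.
utf8 = text.encode("utf-8")

# HTML Unicode: decimal entities of the form &#NNNN;
html_unicode = "".join(f"&#{ord(ch)};" for ch in text)

# Cross coded UTF-8: the TIS-620 bytes are mistaken for Latin1
# and the resulting Latin1 characters are written out as UTF-8.
cross_coded = tis620.decode("latin-1").encode("utf-8")


def from_cross_coded(data: bytes) -> str:
    """Undo the cross coding: UTF-8 -> Latin1 bytes -> TIS-620 text."""
    return data.decode("utf-8").encode("latin-1").decode("tis_620")


# Each representation can be turned back into the same text.
assert tis620.decode("tis_620") == text
assert utf8.decode("utf-8") == text
assert html.unescape(html_unicode) == text
assert from_cross_coded(cross_coded) == text
```

The reversal in `from_cross_coded` only works because the wrong step used Latin1 throughout. If a different 8 bit code page was involved, some bytes may not survive the round trip.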