Decoding Thai Text in Japanese

If you are in Japan, or you exchange Thai email with friends in Japan, you have probably met this problem: the recipient of a Thai email sometimes receives something that looks like Japanese but isn't. This also happens when Thai email is sent to a Japanese mobile phone. The original message can be recovered, but not perfectly.

If the received text is decoded as Japanese, it looks like Katakana with some Kanji characters. If it is not decoded, the text is a sequence of characters, each line starting with a B and containing a lot of % signs.

What happened is that an email server assumed the original message was Japanese and converted it to a "safer" form of Japanese text coding. I have determined that the message is converted from Shift JIS to ISO-2022-JP. Either the original Thai email did not specify a character set in its MIME type, or the character set was ignored.

The Technical Bit

You may safely skip this boring section and go straight to "How to Decode the Text".

The original email was probably written in Thai TIS-620, an 8-bit character coding that represents ASCII and all Thai characters in one byte each. Alternatively, it could have been Unicode UTF-8, a multibyte coding in which each Thai character takes three 8-bit bytes. Shift JIS represents 7-bit ASCII plus Katakana in a single 8-bit byte; certain byte values instead introduce a two-byte sequence that represents other characters, including Hiragana and Kanji. ISO-2022-JP is 7-bit but can use multiple bytes (7 bits each) by "escaping" to other modes. Because it uses 7-bit bytes throughout, ISO-2022-JP is considered safer for email than 8-bit schemes.

The task is to recover the original message by reversing the coding process.
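To make the byte-level picture concrete, here is a small Python sketch of how the same bytes read differently under each coding. It uses only Python's standard codecs (`tis_620`, `utf-8`, `shift_jis`). The function names are made up for illustration, and the lead-byte ranges are the usual Shift JIS ones, so treat it as a sketch of the idea rather than a description of what any particular mail server does.

```python
def misread_as_shift_jis(thai: str, original_encoding: str = "tis_620") -> str:
    """Encode Thai text, then decode the same bytes as if they were Shift JIS."""
    raw = thai.encode(original_encoding)
    # Byte pairs with no Shift JIS meaning become U+FFFD instead of raising.
    return raw.decode("shift_jis", errors="replace")


def is_shift_jis_lead_byte(byte: int) -> bool:
    """True if Shift JIS treats this byte as the first half of a two-byte character."""
    # 0x81-0x9F and 0xE0-0xEF are standard; 0xF0-0xFC appear in vendor extensions.
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def thai_lead_byte_characters() -> list:
    """Thai characters whose single TIS-620 byte doubles as a Shift JIS lead byte."""
    found = []
    for code_point in range(0x0E01, 0x0E5C):
        try:
            byte = chr(code_point).encode("tis_620")[0]
        except UnicodeEncodeError:
            continue  # code points with no TIS-620 byte (U+0E3B to U+0E3E)
        if is_shift_jis_lead_byte(byte):
            found.append(chr(code_point))
    return found
```

For example, `misread_as_shift_jis("กา")` returns two half-width Katakana characters, because the TIS-620 bytes 0xA1 and 0xD2 both fall in the single-byte Katakana range of Shift JIS. `thai_lead_byte_characters()` returns the characters U+0E40 to U+0E5B, whose bytes Shift JIS reads as the start of a two-byte character; this is where the Kanji in the garbled text come from.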
We can convert from ISO-2022-JP back to Shift JIS; however, we cannot always recover the original message exactly. If a sentence ends in certain characters, they will have been discarded during the original Shift JIS to ISO-2022-JP conversion as incomplete halves of a double-byte sequence. In practice the sentence can usually still be read.

If the original coding was UTF-8, the last character in a sentence may be corrupted. If the original coding was TIS-620, the characters that may be lost at the end of a sentence are: lakkhangyao, maiyamok, all tone markers, thanthakhat, nikhahit, fongman, all Thai numbers, angkhankhu and khomut. Leading vowels would also be lost, but it would be grammatically impossible to end a sentence with a leading vowel. These characters can be summarised as the Unicode range U+0E40 to U+0E5B.

In short, the program can repair most emails, but the last character of each sentence may not be recoverable.

How to Decode the Text

I have written a simple tool called jax2th to recover the original Thai text.
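For readers curious about what such a recovery involves, here is a minimal Python sketch of the idea. It is not the jax2th source, and the names `KANA_TO_BYTE` and `recover_thai` are hypothetical. It makes several assumptions: the original email was TIS-620 (not UTF-8), the received text has already been decoded into the Japanese characters you see on screen, and the server may have widened half-width Katakana to full-width along the way. It does nothing to restore a character that was discarded at the end of a sentence.

```python
import unicodedata

# Reverse table: Katakana or punctuation character -> the single Shift JIS byte
# (0xA1-0xDF) it stands for. Both widths are accepted, on the ASSUMPTION that
# the server may have widened half-width Katakana to full-width.
KANA_TO_BYTE = {}
for byte in range(0xA1, 0xE0):
    half = bytes([byte]).decode("shift_jis")
    KANA_TO_BYTE[half] = byte
    KANA_TO_BYTE[unicodedata.normalize("NFKC", half)] = byte


def recover_thai(japanese: str) -> str:
    """Turn text wrongly displayed as Japanese back into Thai (TIS-620 case only)."""
    raw = bytearray()
    for ch in japanese:
        if ch in KANA_TO_BYTE:
            raw.append(KANA_TO_BYTE[ch])
        else:
            # ASCII stays one byte; Kanji and other symbols become two bytes.
            raw += ch.encode("shift_jis", errors="replace")
    # Bytes that are not valid TIS-620 show up as U+FFFD.
    return raw.decode("tis_620", errors="replace")
```

If you start from the raw ISO-2022-JP bytes rather than displayed text, something like `recover_thai(data.decode("iso2022_jp", errors="replace"))` would be the first thing to try. Python's codec rejects byte pairs that are not assigned Japanese characters, so some Thai sequences may still come out as replacement characters with this approach.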
Download jax2th and use it to convert the character coding of the garbled text. This tool is free.

Download

Choose your platform to download the tool. Linux and Windows binaries are statically linked, which means you don't have to install or download any other software.
