Decoding Thai Text in Japanese

If you are in Japan and send an email written in Thai, the recipient sometimes receives something that looks like Japanese but isn't. This also happens when Thai email is sent to a Japanese mobile phone. The original message can usually be recovered, though not always perfectly.

If you exchange Thai email with friends in Japan, or you are in Japan yourself, you have probably met this problem already: the message was sent in Thai, but when you read it, it looks like Japanese, as in the examples below.

[Figure: Email cross coded from Thai to Japanese. Screenshots of the same cross coded message in Hotmail, Thunderbird and YAM.]

If the received text is decoded as Japanese, it looks like Katakana with some Kanji characters. If it is not decoded, the text is a sequence of characters, each line starting with a B and containing a lot of % signs. What happened is that an email server assumed the original message was Japanese and converted it to a "safer" form of Japanese text coding. I have determined that the message was converted from Shift JIS to ISO-2022-JP. Either the original Thai email was coded without specifying the character set in the MIME type, or the character set was ignored.

In short, the server assumes the text is Japanese and converts it from one Japanese coding (Shift JIS) to another (ISO-2022-JP).

The Technical Bit

You may safely skip this boring section and go straight to "How to Decode the Text".

The original email was probably written in Thai TIS-620, an 8-bit character coding which represents ASCII and all Thai characters in a single byte each. Alternatively, it could have been Unicode UTF-8, a multibyte coding in which each Thai character is represented by three 8-bit bytes.
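As a small illustration of the difference between the two codings, the sketch below encodes the same three-letter Thai word both ways using Python's standard `tis_620` and `utf-8` codecs. The sample word is only an example chosen for this page, not something taken from a real message.

```python
# Encode one Thai word in both candidate source codings and compare sizes.
word = "ไทย"  # three Thai characters

tis_bytes = word.encode("tis_620")   # one byte per Thai character
utf8_bytes = word.encode("utf-8")    # three bytes per Thai character

print(len(word), "characters")
print("TIS-620:", tis_bytes.hex(), "=", len(tis_bytes), "bytes")
print("UTF-8:  ", utf8_bytes.hex(), "=", len(utf8_bytes), "bytes")
```

Every byte in both results is 0x80 or above, which is why the text is at risk when it passes through a server that expects another 8-bit coding.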

Shift JIS represents 7-bit ASCII plus Katakana in a single 8-bit byte; an optional second byte may be used to represent other characters, including Hiragana and Kanji. ISO-2022 is a 7-bit coding but can use multiple bytes (7 bits each) by "escaping" to other modes. Because it uses 7-bit bytes throughout, ISO-2022 is considered safer for email than 8-bit schemes.
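The overlap between the two 8-bit codings can be seen with Python's standard codecs. This is only a sketch of the misreading, not the code a mail server actually runs: it takes a sample Thai word, encodes it as TIS-620, and then decodes the very same bytes as Shift JIS. The second part encodes a short Katakana string as ISO-2022-JP to show that the result contains 7-bit bytes only.

```python
import unicodedata

thai = "สวัสดี"                      # sample word, chosen for illustration
raw = thai.encode("tis_620")          # the bytes that travel in the email
misread = raw.decode("shift_jis")     # the same bytes, read as Japanese

print(raw.hex())
for ch in misread:
    # Each Thai byte lands in the single-byte Katakana range of Shift JIS.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# ISO-2022-JP switches character sets with escape sequences and keeps
# every byte below 0x80.
seven_bit = "ハヌム".encode("iso2022_jp")
print(seven_bit)
```

The printed ISO-2022-JP bytes begin with an escape sequence ending in `$B` and contain `%` signs, which matches the look of the undecoded messages described earlier.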

The task is to recover the original message by reversing the coding process. We can convert from ISO-2022-JP back to Shift JIS, but we cannot always recover the original message exactly. If a sentence ends in certain characters, they will have been discarded during the original Shift JIS to ISO-2022-JP conversion as incomplete halves of a double-byte sequence.
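The reversal can be sketched in Python as follows. This is not the source of jax2th, only a minimal illustration under stated assumptions: that the server widened each single-byte (half-width) Katakana character to its full-width form one character at a time, since ISO-2022-JP has no half-width Katakana, and that it did not merge voiced sound marks into the preceding letter. The function names `simulate_damage` and `recover` are hypothetical, and text ending in a discarded byte is not handled.

```python
import unicodedata

# Half-width Katakana block: the single-byte 0xA1-0xDF range of Shift JIS.
HALF = [chr(cp) for cp in range(0xFF61, 0xFFA0)]

# Assumed server behaviour: each half-width character becomes its
# full-width equivalent. The two sound marks need their spacing forms.
TO_FULL = {h: unicodedata.normalize("NFKC", h) for h in HALF}
TO_FULL["\uff9e"] = "\u309b"
TO_FULL["\uff9f"] = "\u309c"
TO_HALF = {full: half for half, full in TO_FULL.items()}


def simulate_damage(thai: str) -> bytes:
    """Mimic the cross coding: TIS-620 bytes misread as Shift JIS,
    then re-encoded as ISO-2022-JP."""
    sjis_view = thai.encode("tis_620").decode("shift_jis")
    widened = "".join(TO_FULL.get(ch, ch) for ch in sjis_view)
    return widened.encode("iso2022_jp")


def recover(raw: bytes) -> str:
    """Reverse the process: ISO-2022-JP to Shift JIS bytes,
    then read those bytes as TIS-620."""
    widened = raw.decode("iso2022_jp")
    sjis_view = "".join(TO_HALF.get(ch, ch) for ch in widened)
    return sjis_view.encode("shift_jis").decode("tis_620")


sample = "ขอบคุณ"
damaged = simulate_damage(sample)
print(damaged)                      # 7-bit bytes with escape sequences
print(recover(damaged) == sample)   # round trip check
```

A real converter may behave differently from the assumed mapping, so a practical tool has to cope with more cases than this sketch does.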

In practice the sentence can usually be read. If the original coding was UTF-8, the last character in a sentence may be corrupted. If the original coding was TIS-620, the following characters may be lost when they fall at the end of a sentence: lakkhangyao, maiyamok, all tone markers, thanthakhat, nikhahit, fongman, all Thai digits, angkhankhu and khomut. Leading vowels would also be lost, but it would be grammatically impossible to end a sentence with a leading vowel. These characters can be summarised as the Unicode range 0x0E40-0x0E5B.
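To see which characters fall in that range, the sketch below lists each code point from U+0E40 to U+0E5B with its TIS-620 byte and its Unicode name. It relies only on Python's standard `unicodedata` module and `tis_620` codec; the helper name `tis620_byte` is made up for this example. The bytes all come out at 0xE0 or above, which I take to be the values a Shift JIS converter treats as the first half of a double-byte character.

```python
import unicodedata

# Unicode range quoted in the text: U+0E40 to U+0E5B inclusive.
AT_RISK = [chr(cp) for cp in range(0x0E40, 0x0E5C)]


def tis620_byte(ch: str) -> int:
    """Return the single TIS-620 byte for one Thai character."""
    return ch.encode("tis_620")[0]


for ch in AT_RISK:
    print(f"U+{ord(ch):04X}  0x{tis620_byte(ch):02X}  {unicodedata.name(ch)}")
```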

This program can repair most emails, but the last character of each sentence may not be recoverable.

How to Decode the Text

I have written a simple tool called jax2th to recover the original Thai text.

  1. Download the tool below and unarchive it (I assume you know how to do that)
  2. Save the message in raw form as the file myemail.txt in the same directory as the tool
  3. Open a command shell and change directory to the location of the tool and the email
  4. Run jax2th myemail.txt > newemail.txt
  5. Open the file newemail.txt in a text editor that can handle Thai

Download jax2th and use it to convert the character coding as in the example above.

This tool is free.

Download

Choose your platform:

Download it here.

Linux and Windows binaries are statically linked (this means you don't have to worry about installing any other software).

There is nothing else to download.


