This package contains five programs that convert Tamil between several encodings and the ITRANS romanization. Two programs convert between ISCII and ITRANS, two between ISCII and Unicode. The fifth program, tscii2uni, converts the TSCII encoding of Tamil to UTF-8 Unicode.
The pipeline:
u82iTamil | iscii2itransTamil | itrans2isciiTamil | i2u8Tamil
yields output identical to its input. You can test this by typing:
make roundtriptest
in the source directory.
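The same round-trip check can be sketched outside of make. The following is only an illustration, not part of the package: it assumes the four programs are on the PATH and that each reads standard input and writes standard output, as the pipeline notation above implies. The helper names run_pipeline and is_roundtrip_clean are hypothetical.

```python
import subprocess

# Program names are taken from the pipeline above. That each one acts as a
# stdin-to-stdout filter with no required arguments is an assumption.
ROUNDTRIP_STAGES = [
    ["u82iTamil"],
    ["iscii2itransTamil"],
    ["itrans2isciiTamil"],
    ["i2u8Tamil"],
]


def run_pipeline(data, stages):
    """Feed data (bytes) through each command in turn and return the final output."""
    for argv in stages:
        result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = result.stdout
    return data


def is_roundtrip_clean(data, stages=ROUNDTRIP_STAGES):
    """Return True if the pipeline reproduces its input byte for byte."""
    return run_pipeline(data, stages) == data
```

Calling is_roundtrip_clean on the bytes of a UTF-8 Tamil file would then report whether that particular input survives the four conversions unchanged.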
All five programs provide fairly extensive checking of their input for errors and untranslatable codes.
By default, the ITRANS that these programs accept as input and generate as output is extended in two ways. It includes codes for the Tamil digits, and it contains HZ escapes (as defined in RFC 1843, http://www.ietf.org/rfc/rfc1843.txt) that delimit the Tamil portion. The escapes allow processing of mixed Tamil and ASCII text.
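To illustrate how HZ escapes separate the two kinds of text, here is a minimal sketch that splits a string on the RFC 1843 shift sequences ~{ and ~}. It is not the package's own parser: the function name split_hz is hypothetical, and the sketch deliberately ignores the other RFC 1843 sequences (~~ and ~ followed by a newline), since this document does not say whether the programs honor them.

```python
def split_hz(text):
    """Split text into (is_tamil, chunk) pairs on the HZ escapes ~{ and ~}.

    Text outside the escapes is reported as ASCII (is_tamil False); text
    between ~{ and ~} is reported as the Tamil ITRANS portion (is_tamil True).
    """
    segments = []
    pos, in_tamil = 0, False
    while pos < len(text):
        # Only the escape that ends the current mode is significant, so a
        # tilde used by ITRANS itself (for example ~n) is left untouched.
        marker = "~}" if in_tamil else "~{"
        hit = text.find(marker, pos)
        stop = len(text) if hit == -1 else hit
        if stop > pos:
            segments.append((in_tamil, text[pos:stop]))
        if hit == -1:
            break
        pos = hit + len(marker)
        in_tamil = not in_tamil
    return segments
```

For example, split_hz("Hello ~{tamiz~} world") separates the string into an ASCII chunk, a Tamil chunk, and a second ASCII chunk, which a converter could then handle independently.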
INSTALLATION
On a system with GNU autoconf, do:
./configure
make
make install
TESTING
The file tsciitest.tsc contains a little test data in TSCII. The file tsciitest.u contains the equivalent in UTF-8 Unicode. The file tsciitest.asc contains the TSCII data as hexadecimal ASCII strings.
You can run this test by typing:
make tsciitest
in the source directory. Running tscii2uni on tsciitest.tsc should produce the following messages on stderr. The last message will not appear if the -z option is used.
Invalid TSCII code 0xA0 at byte 19 skipped
TSCII 0x82 Grantha character Sri was encountered. There is no such Unicode character so it has been translated to the native Tamil equivalent: 0x0BB8 0x0BB0 0x0BBF
TSCII 0x87 Grantha character Ksha was encountered. There is no such Unicode character so it has been translated to the native Tamil equivalent: 0x0B95 0x0BCD 0x0BB7
TSCII 0x8c Grantha character Ksha with pulli was encountered. There is no such Unicode character so it has been translated to the native Tamil equivalent: 0x0B95 0x0BCD 0x0BB7 0x0B82
TSCII 0x80 Tamil digit zero was encountered. There is no such character in version 4.0 of the Unicode standard so it has been translated to ASCII zero: 0x0030. The codepoint 0x0BE6 has been approved but is not yet available in fonts.
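The Grantha substitutions reported in the messages above amount to a small lookup table from one TSCII byte to a sequence of Unicode codepoints. The sketch below reproduces only the three mappings quoted in those messages and shows the UTF-8 bytes they encode to. The names GRANTHA_FALLBACK and expand_grantha are hypothetical, and nothing here is taken from the tscii2uni source.

```python
# TSCII byte -> Unicode codepoints, copied from the stderr messages above.
GRANTHA_FALLBACK = {
    0x82: (0x0BB8, 0x0BB0, 0x0BBF),          # Sri
    0x87: (0x0B95, 0x0BCD, 0x0BB7),          # Ksha
    0x8C: (0x0B95, 0x0BCD, 0x0BB7, 0x0B82),  # Ksha with pulli
}


def expand_grantha(code):
    """Return the UTF-8 bytes of the native Tamil equivalent of a TSCII Grantha code."""
    codepoints = GRANTHA_FALLBACK[code]
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")
```

Each Tamil codepoint in this range occupies three bytes in UTF-8, so a single TSCII byte such as 0x82 becomes nine bytes of output.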
This software was written by Bill Poser (wjposer@ldc.upenn.edu/billposer@alum.mit.edu) for the Linguistic Data Consortium with funding provided by the US Department of Defense.
