IDX-PKI i18n
This code was developed by IDEALX (http://IDEALX.org/).
Copyright © 2000-2003 IDEALX
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
$Id: README.txt,v 1.8 2002/02/13 15:48:15 sbi Exp $ $Source: /opt/cvs/idxpki/sources/src/php/i18n/doc/README.txt,v $
So you want to correct / complete a language, or translate IDX-PKI to a new language? Here is how to proceed.
Requirements
. Know enough PHP and Perl to be able to copy and adapt the examples. It is important to understand string concatenation and arguments, since the messages use both (mixing text and URLs, for example).
. Know the encoding issues of the particular language you wish to work on: how to declare the encoding in the HTML header, how to type its characters on the terminal, and how to encode them in PHP if needed. Two ready-to-use examples are given: Russian (KOI8-R) and Mandarin (Guo Bao). A URL is supplied with a number of other such examples.
. The default language will always be installed, and will probably always be English. The default language must work in all other encodings, so it should use only plain 7-bit ASCII.
Architecture
. Everything is located under the idxpki/sources/src/php/i18n directory.
This covers all the data and all the messages in all languages. The .php and .inc code has been modified to take this into account. Some messages may have been overlooked and remain hardcoded in the code; in that case the code has to be modified to use the API.
. ``i18n.inc'' is the definition of localization functions.
. ``lang.inc'' is the list of supported languages. Every language must have a unique code (usually the ISO code) made up only of simple characters such as [a-z_].
The $KNOWN_LANGUAGES array maps that code to the name of the language, written in the language itself, in a way that works in all encodings (either 7-bit with HTML entities, or Unicode; www.debian.org has examples for many languages).
The $DEFAULT_LANGUAGE variable should remain "en" for "English".
The $ENCODING_LANGUAGES array gives, when needed, the META-tags for HTML documents served in the preferred character set for that language (character set that will be used to translate the messages in that language).
. There is one directory for every known language, named with the language code.
Every such directory contains all or part of the messages translated to that language, in the right character set.
The Check_translations.pl script, run from the i18n directory, checks that everything is coherent. It needs a rigid syntax (``"key" =>'' on the same line; ``"value",'' can be on the next line). Just copy the English and French examples.
Messages are grouped by families of messages, and each file is only loaded on the fly, when it is needed.
The messages family FooBar.inc of the language qux must define an associative array named ``$MSG["Foobar"]["qux"]''. This array maps message keys (usually English names or phrases), the same for all languages, to their translation in that language. Keep the trailing comma after the last entry: PHP syntax accepts it, and it makes it easier to move lines around.
Messages not translated are typeset in italics (``<em>'')
in the default language (unless msg() was called with the "n" or
"naked" typeset option -- read on).
Messages not found even in the default language display big red error
messages.
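The lookup rules above can be pictured with the following minimal sketch. It is written in Python purely for illustration (the real code is PHP in i18n.inc). The names MSG, DEFAULT_LANGUAGE and lookup, the sample messages, and the markup of the error message are all assumptions made for this example, not the actual implementation.

```python
# Illustrative model only: the real tables are PHP arrays loaded on the fly
# from <language>/<Family>.inc.
DEFAULT_LANGUAGE = "en"

# MSG[family][language][key] -> translated text (hypothetical sample data)
MSG = {
    "Foobar": {
        "en": {"Welcome": "Welcome", "Log out": "Log out"},
        "fr": {"Welcome": "Bienvenue"},
    }
}


def lookup(family, key, lang, naked=False):
    """Return the message for `key`, falling back to the default language."""
    table = MSG.get(family, {})
    if key in table.get(lang, {}):
        return table[lang][key]  # translation found
    if key in table.get(DEFAULT_LANGUAGE, {}):
        text = table[DEFAULT_LANGUAGE][key]
        # Untranslated messages are shown in italics unless "naked" is requested.
        return text if naked else "<em>" + text + "</em>"
    # ASSUMPTION: the exact markup of the "big red error message" is not
    # documented here; this is only a stand-in.
    return '<strong style="color: red">Missing message: %s/%s</strong>' % (family, key)
```

With this sample data, lookup("Foobar", "Log out", "fr") falls back to the English text wrapped in ``<em>'', because the French table has no such key.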
Installing / not installing languages
. Reminder: always install the default language; always have that default language be English!
. Comment out in lang.inc the languages you do not want to install (this is not required, but it avoids attempting to open non-existent directories all the time).
. Just install the language directories you need: they are opened dynamically with each PHP document to check for available languages.
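The directory check described above might look roughly like this. The sketch is in Python for illustration only; the function name, the sample language table, and the idea of filtering the known languages by existing directories are assumptions about how the PHP code behaves, not a copy of it.

```python
import os

# Hypothetical excerpt of the language table declared in lang.inc.
KNOWN_LANGUAGES = {
    "en": "English",
    "fr": "Fran&ccedil;ais",
    "de": "Deutsch",
}


def available_languages(i18n_dir):
    """Return the known language codes whose directory is actually installed."""
    return [
        code
        for code in KNOWN_LANGUAGES
        if os.path.isdir(os.path.join(i18n_dir, code))
    ]
```

Under these assumptions, a language that is listed but not installed is simply skipped, at the cost of one failed directory lookup per request; commenting it out in lang.inc avoids that lookup.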
API
If you need to localize other code, or more generally to understand better what is going on, the important functions of i18n.inc are:
msg(Class, Message_key, Typeset, Extra): Returns the message Message_key of class Class translated into the current language (or in the default language if that translation is not available).
Typeset can be "p" (or "para"), "li" (or "listitem"), or "h2": it will be the bordering HTML tags. "none" or "n" means no bordering tag, not even italics if the translation is not found. "i" (or "inline") means just italics if the translation is not found. Typeset "error" or "e" returns the string as an error message. The default value is "para".
Extra is the extra piece of string to add inside the HTML tags. Can be another msg() or a constant in all languages (like a number). The default value is the empty string "".
do_msg(): same as msg(), but prints it out instead of returning it. Same as ``print msg(<arguments>)''.
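As an illustration of the Typeset and Extra arguments, here is a small Python model of the wrapping step. It is not the PHP implementation: the function name, the position of Extra (appended right after the message text), and the markup used for the "error" typeset are assumptions.

```python
# Illustrative model of the Typeset argument; the real msg() is PHP.
WRAPPING_TAGS = {
    "p": "p", "para": "p",
    "li": "li", "listitem": "li",
    "h2": "h2",
}


def typeset_message(text, typeset="para", extra=""):
    """Wrap an already translated message according to `typeset`."""
    body = text + extra  # ASSUMPTION: Extra is appended right after the message
    if typeset in ("n", "none", "i", "inline"):
        return body  # no bordering tag
    if typeset in ("e", "error"):
        # ASSUMPTION: the real error markup is not documented in this README
        return '<p class="error">' + body + "</p>"
    tag = WRAPPING_TAGS[typeset]
    return "<%s>%s</%s>" % (tag, body, tag)
```

For example, typeset_message("Total: ", "li", "42") gives ``<li>Total: 42</li>'', which shows why Extra suits constants that are the same in every language.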
html_link(Url, Text, Lang, In_para, Attr)
Returns a localized html_link.
Url is the URL to use in the href
Text is the anchor text to use
Lang is a flag: if non empty, the current language will be kept in
that URL (except if it is the default language). Use an empty
string if the URL does not depend on the language. The default
value is the empty string "".
In_para is a flag: if non empty, the link is returned inside <p> HTML tags.
The default value is the empty string "".
Attr are extra attributes to add in the ``<a>'' anchor, after the href.
The default value is the empty string "".
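The behaviour of html_link() can be sketched as follows, in Python for illustration. How the language is really carried in the URL is not stated in this document: the ``lang'' query parameter and the extra current_language argument are assumptions made for the example.

```python
DEFAULT_LANGUAGE = "en"


def html_link(url, text, lang="", in_para="", attr="", current_language="en"):
    """Build an <a> element, optionally keeping the current language in the URL."""
    # ASSUMPTION: the language travels as a "lang" query parameter.
    if lang and current_language != DEFAULT_LANGUAGE:
        separator = "&amp;" if "?" in url else "?"
        url = url + separator + "lang=" + current_language
    extra = " " + attr if attr else ""  # extra attributes go after the href
    link = '<a href="%s"%s>%s</a>' % (url, extra, text)
    return "<p>" + link + "</p>" if in_para else link
```

With these assumptions, html_link("index.php", "Accueil", "1", current_language="fr") points to index.php?lang=fr, while the same call in the default language leaves the URL untouched.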
Note: some strings appear several times in different messages; this is sometimes flagged in FIXME or TODO comments.
Using languages with other fonts
References
* Code for the Representation of the Names of Languages. From ISO 639, revised 1989.
http://www.oasis-open.org/cover/iso639a.html
==> list of codes for the languages
* ISO 8859 Alphabet Soup
http://czyborra.com/charsets/iso8859.html
==> what character set to use for a given language (if available)
* Fonts in XFree86
http://www.xfree86.org/4.2.0/fonts.html
==> how to compile and install fonts in X (for typing/reading translations)
* Sample Pages for Various Character Sets
http://vancouver-webpages.com/multilingual/index.shtml
==> how to write HTML with some more or less exotic languages
* Debian GNU/Linux -- The Universal Operating System
http://www.debian.org/
==> how to code language names in Unicode / HTML entities so they are properly displayed in good browsers in any character set
* A Unicode Test Page
http://www.eleves.ens.fr:8080/home/madore/misc/unitest/
==> testing the support of exotic fonts in your browser
Web browsers do not have full support yet. Internet Explorer works more or less; try it on:
* What is Unicode? in Arabic
http://www.unicode.org/unicode/standard/translations/arabic.html
English uses only ASCII, with no 8-bit characters. If accents are needed, as in foreign words or names, use HTML entities (&eacute; for é).
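A quick way to produce such entities is sketched below in Python (an illustration, not part of IDX-PKI; the function name is hypothetical). It uses the named entity when HTML defines one and a decimal numeric entity otherwise.

```python
from html.entities import codepoint2name


def to_ascii_html(text):
    """Replace every non-ASCII character with an HTML entity."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            pieces.append(ch)  # plain 7-bit ASCII is kept as is
        elif code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")  # named entity
        else:
            pieces.append("&#%d;" % code)  # decimal numeric entity
    return "".join(pieces)


print(to_ascii_html("café"))  # prints caf&eacute;
```

The result is plain 7-bit ASCII, so it can be stored safely in the default-language files whatever the encoding of the page.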
For other languages, the principle is to use a special font for each language, so that we can keep using 7-bit ASCII for English. Only the 8-bit characters will be in the foreign language; they are typed as is in the translation files, with the appropriate terminal and font.
To type an 8-bit character, fire up xfd on the font and activate the X Meta modifier (Alt on PCs running GNU/Linux in vim in an xterm, with default X settings and a window manager not intercepting Meta-h like wmaker does...).
French will use iso-8859-1
X font: 12x24
Note: this does not include \"Y \oe \OE and the EURO sign. They
are available in iso-8859-15
Russian will use koi8-r
X font: -cronyx-courier-bold-r-normal--20-140-100-100-m-120-koi8-r
(read fonts.dir in fonts dirs to map font filenames to X full name)
Debian package: xfonts-cyrillic
File: /usr/X11R6/lib/X11/fonts/cyrillic/crox3cb.pcf.gz
Interesting references:
http://www.geocities.com/Athens/Ithaca/1029/Slovar/Download/Sl_eng-E.zip (freeware dictionary with a data file that I filtered to a text file with English words and KOI8-R words)
Chinese (Mainland, simplified) will use GuoBao
X font: hanzigb24st
Debian package: ???
File: ??? (available on my machine but cannot find the file)
Note: a special program like cxterm is needed to type Chinese
phonetically, since each Chinese character is in fact encoded as 2 bytes.
cxterm is a Debian package, but I patched, recompiled, and
customized it to use bigger fonts than the default. I made macros to
easily run cxterm in GuoBao and Big5 (traditional) modes.
I made a little Perl script to translate from Chinese to English on
a word-by-word basis, starting from
* CEDICT: Chinese-English Dictionary
http://www.mandarintools.com/cedict.html
Also have a look at
* CXTERM's UnOfficial Homepage on SourceForge.net
http://cxterm.sourceforge.net/
I downloaded and installed it from some obscure Taiwanese ftp site
whose address I can no longer find.
Reference: http://www.ibiblio.org/mdw/HOWTO/Chinese-HOWTO.html
Arabic will use iso-8859-6
X font: -lbi-naskhi-medium-r-normal--18-180-75-75-m-110-iso8859-6.8x
-lbi-naskhi-bold-r-normal--18-1-75-75-m-120-iso8859-6.8x
Files: naskhi8XRf18.pcf.Z naskhi8XBf18.pcf.Z
Download: http://www.langbox.com/bidimozilla/fontXFE/
Note: this is a French company that appears to specialize in Arabic fonts; maybe they could help with the translation?
Installation: read the URL above for details, then run
xset +fp <dir where .pcf.Z files are stored>
xlsfonts | grep 8859.6
to check
Also available: in the font directory of AraMosaic
http://www.langbox.com/arabic/download/AraMosaic/AraMosaic_linux_ArabicMotifStatic.tar.gz
Interesting references:
http://www.google.com/search?q=Arabic+English+dictionary
http://dictionary.ajeeb.com/en.htm
Note: this says it produces windows-1256, but we do not seem to be able to
convert it to iso-8859-6... An interesting page is
http://members.aol.com/ArabicLexicons/codepages/test_page_cp-1256.htm
from which we can guess/understand how to map windows-1256 to iso-8859-6
http://faculty.washington.edu/heer/
Problems with Arabic (tested browsers: Mozilla and IE)
. the Arabic alphabet can be found on the web. There are 28 letters,
some having as many as 4 different writings depending on context
(isolated, initial, medial, final)
. we will investigate the left-to-right/right-to-left (bidirectionality)
issue when we can figure out one encoding working with one browser (IE
is the best candidate)
. there seem to be special browsers to deal with Arabic (AraZilla and
others)
. IE asks to install a special Arabic pack for some web pages (TODO: find
one such example) but without the Windows XP CD we cannot proceed
. Mozilla fails to display Arabic letters in all tested pages
. ajeeb.com seems to use XML source documents (which would explain how
the right-to-left issue is achieved, and how the browser knows to use
initial/medial/final letters?) but are the XSL stylesheets available?
Can this work be re-used?
. the AraZilla home page at http://www.langbox.com/AraZilla/index_ar.html
displays as Arabic in IE but as mumbo-jumbo (iso-latin-1) in IE if
copied to another machine! The only difference is the content-type,
"Unknown" in IE for the langbox.com server (as opposed to "Text/HTML"
on the other server)
. Unicode works. The list of Arabic characters can be found in a PDF file
(encrypted!) at http://www.unicode.org/charts/PDF/U0600.pdf
Then typing the number in HTML entities like so: &#1746; (in decimal)
will work in IE (but not in Mozilla). But how to address the
right-to-left issue, the initial/medial/final issue, and the encoding
issue to display Arabic conveniently on the screen of the translator?
(Partial) Solutions
. the "dir='rtl'" attribute in HTML 4.0 (for <p> and <html> tags)
as explained in http://www.weizmann.ac.il/IU/create/danon_excerpt.html
Note: this may be confusing if used with an editor that is not adapted to it.
. http://dictionary.ajeeb.com/en.asp
uses XML/XSLT to display the results (search the word "Arabic" for example)
The XML and corresponding XSLT can be found in the extra/ directory,
under the names Ajeeb_eDic.xsl and Ajeeb_EnAr_Arabic.xml
Problem: the encoding "windows-1256" is not supported by my XSLT
engine, XT.
Note: it was not clear how to choose, when using "View Source", whether
to display the XML source or the HTML produced.
If we transform it into "iso-8859-1", clean up the incorrect leading blanks
in the XML and recompile, we get a resulting pseudo-HTML file all in UTF-8,
with lots of JavaScript included.
If we try to "recode" the original windows-1256 to UTF-8, we get the following error message for both files:
recode: <filename> failed: Untranslatable input in CP1256..ISO-10646-UCS-2
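As a possible alternative to recode, the conversion can be sketched with Python's standard codecs, which include both windows-1256 ("cp1256") and iso-8859-6 ("iso8859_6"). This has not been tried on the Ajeeb files; the function names are hypothetical, and replacing unconvertible characters with numeric HTML entities is a choice made for this example.

```python
def _fits(ch, encoding):
    """Tell whether a single character exists in the target encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def convert_cp1256(data, target="iso8859_6"):
    """Convert windows-1256 bytes to `target`.

    Returns the converted bytes and the sorted list of characters that the
    target cannot represent; those are written as decimal HTML entities.
    """
    text = data.decode("cp1256")
    missing = sorted({ch for ch in text if not _fits(ch, target)})
    converted = text.encode(target, "xmlcharrefreplace")
    return converted, missing
```

windows-1256 also contains Latin accented letters and extra letters that iso-8859-6 lacks, so the returned list shows which characters would be lost. Passing target="utf-8" converts everything without loss.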
TODO
. test that IE can choose the right letters (isolated, initial, medial, final) according to context
. compile a vim with support for :rightleft to type Arabic in an xterm with an iso-8859-6 font
. view the results in IE after conversion to UTF-8 HTML entities
. test different Arabic words, with letters next to one another or distant
