Chapter 7. Language support

Table of Contents
7.1. Character sets
7.2. Making multi-language search pages
7.3. Segmenters for Chinese, Japanese, Korean and Thai languages
7.4. Multilingual servers support

7.1. Character sets

7.1.1. Supported character sets

DataparkSearch supports almost all known 8-bit character sets as well as several multi-byte charsets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, and UTF-8. Some multi-byte character sets are not supported by default, because their conversion tables are rather large and would increase the size of the executable files. See the configure parameters to enable support for these charsets.

DataparkSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.

Table 7-1. Language groups

Language group       Character sets
Arabic               cp864, ISO-8859-6, MacArabic, windows-1256
Armenian             armscii-8
Baltic               ISO-8859-13, ISO-8859-4, windows-1257
Celtic               ISO-8859-14
Central European     cp852, ISO-8859-16, ISO-8859-2, MacCE, MacCroatian, MacRomania, windows-1250
Chinese Simplified   GB2312, GBK
Chinese Traditional  Big5, Big5-HKSCS, cp950
Cyrillic             cp855, cp866, cp866u, ISO-8859-5, KOI-7, KOI8-R, KOI8-U, MacCyrillic, windows-1251
Georgian             geostd8
Greek                cp869, cp875, ISO-8859-7, MacGreek, windows-1253
Hebrew               cp862, ISO-8859-8, MacHebrew, windows-1255
Icelandic            cp861, MacIceland
Indian               MacGujarati, tscii
Iranian              ISIRI3342
Japanese             EUC-JP, ISO-2022-JP, Shift_JIS
Korean               EUC-KR
Lao                  cp1133
Nordic               cp865, ISO-8859-10
South Eur            ISO-8859-3
Thai                 cp874, ISO-8859-11, MacThai
Turkish              cp1026, cp857, ISO-8859-9, MacTurkish, windows-1254
Unicode              sys-int, UTF-16BE, UTF-16LE, UTF-8
Vietnamese           VISCII, windows-1258
Western              cp437, cp500, cp850, cp860, cp863, IBM037, ISO-8859-1, ISO-8859-15, MacRoman, US-ASCII, windows-1252

7.1.2. Character set aliases

Each charset is recognized under a number of aliases, because web servers can report the same charset in different notations. For example, iso-8859-2, iso8859-2 and latin2 all name the same charset. The search engine understands the following charset name aliases:

Table 7-2. Charsets aliases

Charset       Aliases
armscii-8     armscii-8
Big5          big-5, big-five, big5, bigfive, cn-big5, csbig5
Big5-HKSCS    big5-hkscs, big5_hkscs, big5hk, hkscs
cp1026        1026, cp-1026, cp1026, ibm1026
cp1133        1133, cp-1133, cp1133, ibm1133
cp437         437, cp437, ibm437
cp500         500, cp500, ibm500
cp850         850, cp850, cspc850multilingual, ibm850
cp852         852, cp852, ibm852
cp855         855, cp855, ibm855
cp857         857, cp857, ibm857
cp860         860, cp860, ibm860
cp861         861, cp861, ibm861
cp862         862, cp862, ibm862
cp863         863, cp863, ibm863
cp864         864, cp864, ibm864
cp865         865, cp865, ibm865
cp866         866, cp866, csibm866, ibm866
cp866u        866u, cp866u
cp869         869, cp869, csibm869, ibm869
cp874         874, cp874, cs874, ibm874, windows-874
cp875         875, cp875, ibm875, windows-875
cp950         950, cp950, windows-950
EUC-JP        cseucjp, euc-jp, euc_jp, eucjp, x-euc-jp
EUC-KR        cseuckr, euc-kr, euc_kr, euckr
GB2312        chinese, cn-gb, csgb2312, csiso58gb231280, euc-cn, euc_cn, euccn, gb2312, gb_2312-80, iso-ir-58
GBK           cp936, gbk, windows-936
geostd8       geo8-gov, geostd8
IBM037        037, cp037, csibm037, ibm037
ISIRI3342     isiri-3342, isiri3342
ISO-2022-JP   csiso2022jp, iso-2022-jp
ISO-8859-1    cp819, csisolatin1, ibm819, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1, iso_8859-1:1987, l1, latin1
ISO-8859-10   csisolatin6, iso-8859-10, iso-ir-157, iso8859-10, iso_8859-10, iso_8859-10:1992, l6, latin6
ISO-8859-11   iso-8859-11, iso8859-11, iso_8859-11, iso_8859-11:1992, tactis, tis-620, tis620
ISO-8859-13   iso-8859-13, iso-ir-179, iso8859-13, iso_8859-13, l7, latin7
ISO-8859-14   iso-8859-14, iso-ir-199, iso8859-14, iso_8859-14, iso_8859-14:1998, l8, latin8
ISO-8859-15   iso-8859-15, iso-ir-203, iso8859-15, iso_8859-15, iso_8859-15:1998
ISO-8859-16   iso-8859-16, iso-ir-226, iso8859-16, iso_8859-16, iso_8859-16:2000
ISO-8859-2    csisolatin2, iso-8859-2, iso-ir-101, iso8859-2, iso_8859-2, iso_8859-2:1987, l2, latin2
ISO-8859-3    csisolatin3, iso-8859-3, iso-ir-109, iso8859-3, iso_8859-3, iso_8859-3:1988, l3, latin3
ISO-8859-4    csisolatin4, iso-8859-4, iso-ir-110, iso8859-4, iso_8859-4, iso_8859-4:1988, l4, latin4
ISO-8859-5    csisolatincyrillic, cyrillic, iso-8859-5, iso-ir-144, iso8859-5, iso_8859-5, iso_8859-5:1988
ISO-8859-6    arabic, asmo-708, csisolatinarabic, ecma-114, iso-8859-6, iso-ir-127, iso8859-6, iso_8859-6, iso_8859-6:1987
ISO-8859-7    csisolatingreek, ecma-118, elot_928, greek, greek8, iso-8859-7, iso-ir-126, iso8859-7, iso_8859-7, iso_8859-7:1987
ISO-8859-8    csisolatinhebrew, hebrew, iso-8859-8, iso-ir-138, iso8859-8, iso_8859-8, iso_8859-8:1988
ISO-8859-9    csisolatin5, iso-8859-9, iso-ir-148, iso8859-9, iso_8859-9, iso_8859-9:1989, l5, latin5
KOI-7         iso-ir-37, koi-7
KOI8-R        cskoi8r, koi8-r
KOI8-U        koi8-u
MacArabic     macarabic
MacCE         cmac, macce, maccentraleurope, x-mac-ce
MacCroatian   maccroation
MacCyrillic   maccyrillic, x-mac-cyrillic
MacGreek      macgreek
MacGujarati   macgujarati
MacHebrew     machebrew
MacIceland    macisland
MacRoman      csmacintosh, mac, macintosh, macroman
MacRomania    macromania
MacThai       macthai
MacTurkish    macturkish
Shift_JIS     csshiftjis, ms_kanji, s-jis, shift-jis, shift_jis, sjis, x-sjis
sys-int       sys-int
tscii         tscii
US-ASCII      ansi_x3.4-1968, ascii, cp367, csascii, ibm367, iso-ir-6, iso646-us, iso_646.irv:1991, us, us-ascii
UTF-16BE      utf-16, utf-16be, utf16, utf16be
UTF-16LE      utf-16le, utf16le
UTF-8         utf-8, utf8
VISCII        csviscii, viscii, viscii1.1-1
windows-1250  cp-1250, cp1250, ms-ee, windows-1250
windows-1251  cp-1251, cp1251, ms-cyr, win-1251, win1251, windows-1251
windows-1252  cp-1252, cp1252, ms-ansi, windows-1252
windows-1253  cp-1253, cp1253, ms-greek, windows-1253
windows-1254  cp-1254, cp1254, ms-turk, windows-1254
windows-1255  cp-1255, cp1255, ms-hebr, windows-1255
windows-1256  cp-1256, cp1256, ms-arab, windows-1256
windows-1257  cp-1257, cp1257, winbaltrim, windows-1257
windows-1258  cp-1258, cp1258, windows-1258
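
The alias table amounts to a lookup from any alias to one canonical charset name. The Python sketch below illustrates that idea with a few rows copied from Table 7-2. It is not DataparkSearch code: the function name canonical_charset is hypothetical, and the case-insensitive, whitespace-trimming match is an assumption made for the example.

```python
# A few rows copied from Table 7-2: canonical name -> list of aliases.
ALIASES = {
    "ISO-8859-2": ["csisolatin2", "iso-8859-2", "iso-ir-101", "iso8859-2",
                   "iso_8859-2", "iso_8859-2:1987", "l2", "latin2"],
    "KOI8-R": ["cskoi8r", "koi8-r"],
    "UTF-8": ["utf-8", "utf8"],
    "windows-1251": ["cp-1251", "cp1251", "ms-cyr", "win-1251", "win1251",
                     "windows-1251"],
}

# Reverse index: alias -> canonical name.
LOOKUP = {
    alias: canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def canonical_charset(name):
    """Return the canonical charset name for an alias, or None if unknown.

    Assumption for this sketch: matching ignores case and surrounding spaces.
    """
    return LOOKUP.get(name.strip().lower())


print(canonical_charset("Latin2"))     # ISO-8859-2
print(canonical_charset("iso8859-2"))  # ISO-8859-2
print(canonical_charset("klingon"))    # None
```

Building the reverse index once keeps each lookup a single dictionary access, however many aliases a charset has.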

7.1.3. Recoding

indexer recodes all documents to the character set specified by the LocalCharset command in your indexer.conf file. Internally, recoding is implemented using Unicode. Note that if a character cannot be converted directly from one charset to another, DataparkSearch escapes it as an HTML numeric character reference (that is, in the form &#NNN;, where NNN is the character's code in Unicode). Thus no information about indexed documents is lost for any LocalCharset, but the choice of LocalCharset affects the size of the database you get after indexing.
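
The effect of this fallback can be illustrated with Python's standard codecs, which offer the same kind of escaping through the xmlcharrefreplace error handler. This is a sketch of the behavior described above, not DataparkSearch's own code; the function name recode and the sample word are assumptions chosen for the example.

```python
def recode(data, from_charset, to_charset):
    """Recode bytes between charsets by going through Unicode.

    Characters missing from the target charset become &#NNN; references.
    """
    text = data.decode(from_charset)  # source charset -> Unicode
    return text.encode(to_charset, errors="xmlcharrefreplace")


source = "Привет".encode("utf-8")  # six Cyrillic letters

as_koi8 = recode(source, "utf-8", "koi8-r")        # charset that has Cyrillic
as_latin1 = recode(source, "utf-8", "iso-8859-1")  # charset that lacks Cyrillic

print(len(as_koi8))  # 6
print(as_latin1)     # b'&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;'
print(len(as_latin1))  # 42
```

The same six letters take 6 bytes in a charset that contains them and 42 bytes when every letter has to be escaped, which is why a LocalCharset that matches most of the indexed content keeps the database smaller.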

7.1.4. Recoding at search time

Use the BrowserCharset command to choose the charset in which search results are displayed. BrowserCharset may differ from LocalCharset; DataparkSearch recodes all data automatically.

7.1.5. Document charset detection

indexer detects the document character set in this order:

  1. The HTTP header "Content-type: text/html; charset=xxx"

  2. The tag <META NAME="Content-Type" CONTENT="text/html; charset=xxx">

    This variant can be switched off with the command GuesserUseMeta no in your indexer.conf.

  3. Defaults from the "Charset" field in Common Parameters
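
The order above can be sketched as a small Python function. This is an illustration of the precedence only, under stated assumptions: the function name detect_charset, its parameters, and the regular expressions are hypothetical and do not reproduce how indexer parses headers or HTML. The use_meta flag stands in for the GuesserUseMeta setting.

```python
import re

# Matches "charset=xxx", with optional quotes around the value.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Matches a META tag that mentions Content-Type.
META_RE = re.compile(r"<meta\b[^>]*\bcontent-type\b[^>]*>", re.IGNORECASE)


def detect_charset(content_type_header, html_text, default_charset,
                   use_meta=True):
    """Pick a charset using the precedence: HTTP header, META tag, default."""
    # 1. Charset from the HTTP Content-type header, if it carries one.
    if content_type_header:
        match = CHARSET_RE.search(content_type_header)
        if match:
            return match.group(1).lower()
    # 2. Charset from the META tag, unless this step is switched off.
    if use_meta:
        tag = META_RE.search(html_text)
        if tag:
            match = CHARSET_RE.search(tag.group(0))
            if match:
                return match.group(1).lower()
    # 3. Fall back to the configured default.
    return default_charset


page = '<META NAME="Content-Type" CONTENT="text/html; charset=koi8-r">'
print(detect_charset("text/html; charset=ISO-8859-2", page, "utf-8"))  # iso-8859-2
print(detect_charset("text/html", page, "utf-8"))                      # koi8-r
print(detect_charset("text/html", page, "utf-8", use_meta=False))      # utf-8
```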

7.1.6. Automatic charset guesser

DataparkSearch has an automatic charset and language guesser. It currently recognizes more than 100 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. They are installed under the /usr/local/dpsearch/etc/langmap/ directory by default; look there for the list of currently provided charset-language pairs. The guesser works well for texts longer than 500 characters. Shorter texts may not be guessed reliably.
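
To give a feel for the technique, the Python sketch below shows the general rank-order idea behind N-gram-based text categorization: each language gets a profile of its most frequent character n-grams, and a document is assigned to the profile whose ranking differs least from its own. This is a generic illustration, not DataparkSearch's implementation or its language map format. The profile size, the padding character, the tiny sample texts, and all function names are assumptions made for the example.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cut-off; also the penalty for a missing n-gram


def ngram_profile(text, max_n=3):
    """Return the most frequent character n-grams, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by frequency, then alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return ranked[:PROFILE_SIZE]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language cost the maximum."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else PROFILE_SIZE
        for rank, gram in enumerate(doc_profile)
    )


def guess(text, maps):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(maps, key=lambda name: out_of_place(doc, maps[name]))


# Toy samples; real maps are built from far larger texts.
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
DE_SAMPLE = "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund"

maps = {"en": ngram_profile(EN_SAMPLE), "de": ngram_profile(DE_SAMPLE)}
print(guess("the dog and the fox", maps))
```

Short inputs yield few n-grams and therefore unstable rankings, which is consistent with the note above that texts under about 500 characters may be guessed poorly.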

7.1.6.1. Build your own language maps

To build your own language map, use the dpguesser utility. You also need to collect a file with samples of the language in the desired charset. To create a new language map, use the following command:


        dpguesser -p -c charset -l language < FILENAME > language.charset.lm
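
When maps are needed for several language-charset pairs, the command above can be scripted. The following Python sketch wraps it with the standard subprocess module. The helper names are hypothetical, the dpguesser path is the default install location given in this chapter, and only the -p, -c and -l options shown above are used.

```python
import subprocess
from pathlib import Path

# Default install location according to this chapter; adjust for your system.
DPGUESSER = "/usr/local/dpsearch/sbin/dpguesser"


def langmap_name(language, charset):
    """File name following the language.charset.lm pattern shown above."""
    return f"{language}.{charset}.lm"


def langmap_command(language, charset, dpguesser=DPGUESSER):
    """Argument list equivalent to: dpguesser -p -c charset -l language"""
    return [dpguesser, "-p", "-c", charset, "-l", language]


def build_langmap(sample_path, language, charset, out_dir="."):
    """Feed the sample file to dpguesser and save its output as a language map."""
    out_path = Path(out_dir) / langmap_name(language, charset)
    with open(sample_path, "rb") as src, open(out_path, "wb") as dst:
        # stdin/stdout redirection mirrors "< FILENAME > language.charset.lm".
        subprocess.run(langmap_command(language, charset),
                       stdin=src, stdout=dst, check=True)
    return out_path
```

For example, build_langmap("samples/ru.txt", "ru", "koi8-r") would write ru.koi8-r.lm to the current directory; the sample path here is illustrative only.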

You can also use the dpguesser utility to guess a document's language and charset using the existing language maps. To do this, use the following command:


        dpguesser [-n maxhits] < FILENAME

Some languages can be written in several different charsets. To convert from one charset supported by DataparkSearch to another, use the dpconv utility.


        dpconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile

By default, both the dpguesser and dpconv utilities are installed into the /usr/local/dpsearch/sbin/ directory.

DataparkSearch can update language and charset maps automatically while indexing, provided the remote server supplies pages with an explicitly specified language and charset. To enable this function, specify the command


        LangMapUpdate yes

in your indexer.conf file.

7.1.7. Default charset

Use the RemoteCharset command in indexer.conf to choose the default charset of indexed servers.

7.1.8. Default language

You can set the default language for servers with the DefaultLang variable in indexer.conf. This is useful when restricting search by URL language.

7.1.9. Recoding during search

You may display search results in any charset supported by DataparkSearch. Use the BrowserCharset command in search.htm to select the charset for search results. This charset may differ from the specified LocalCharset. All recoding is done automatically.