Research Report

extraction of trigrams

The trigrams were extracted in Python using map-reduce, and extraction was fastest when MPI was used to parallelize the process: about 8 times faster than serial extraction. This was achieved with a map-reduce driver that generates all trigrams simultaneously using n MPI processes. The results are loaded into a database or into a hash map that is serialized for use by the classifiers.
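
The map and reduce steps can be sketched roughly as follows. This is only an illustration, not the driver used for the report: it assumes character trigrams, the function names are hypothetical, and the driver runs serially. In an MPI setting each process would presumably run the map step on its own shard of the corpus before the partial counts are merged.

```python
from collections import Counter
from functools import reduce


def map_trigrams(text, n=3):
    """Map step: count the character n-grams (trigrams by default) of one shard."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def extract_trigrams(shards):
    """Serial stand-in for the driver (assumption: one shard per MPI process)."""
    return reduce(reduce_counts, map(map_trigrams, shards), Counter())


# Two toy shards standing in for pieces of a corpus.
counts = extract_trigrams(["banana", "bandana"])
print(counts.most_common(2))
```

The merged Counter behaves like the hash map mentioned above and could be serialized, for example with pickle, for use by a classifier.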

the original algorithm

This was the first approach to the problem: use two fixed thresholds and, for the trigrams present in a sample, compute their relative frequency for every language registered. If the result for exactly one language is greater than the fixed upper threshold and the result for every other language is below the fixed lower threshold, the algorithm returns that language; otherwise it returns "unable to detect".

The shift-codon-matching process for detecting an LSE is as simple as the following:

Step 1. First, all the shift-codons u1, u2, u3, ... , un from the target text are listed.
Step 2. The values Ti (or T-values) are calculated for every i-th LSE using formula (1), where Ti is the rate at which the shift-codons of the i-th shift-codon signature (or Si) appear in the target text.
\[ T_i = \frac{\left|\{\, j : u_j \in S_i \,\}\right|}{n} \tag{1} \]

Step 3. If exactly one value Ti' is greater than a predetermined value UB, meaning "closer to the value 1," while all the others are less than a predetermined value LB, meaning "not close to the value 1," the program gives the i'-th LSE as the answer. Otherwise, the program returns "unable to detect," which includes "other than registered." UB and LB are predetermined and their standard values are 0.95 and 0.92, respectively.
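
Steps 1 to 3 can be sketched as below. This is a minimal illustration under assumptions, not the original program: shift-codons are treated as plain trigram strings, each signature Si is a set of such trigrams, Ti is read as the fraction of the target's shift-codons found in Si, and the names t_value and detect_lse are hypothetical.

```python
UB = 0.95  # standard upper bound given in the text
LB = 0.92  # standard lower bound given in the text


def t_value(shift_codons, signature):
    """Fraction of the target's shift-codons that belong to one signature (assumed reading of Ti)."""
    if not shift_codons:
        return 0.0
    hits = sum(1 for u in shift_codons if u in signature)
    return hits / len(shift_codons)


def detect_lse(shift_codons, signatures, ub=UB, lb=LB):
    """Return the single LSE whose T-value exceeds ub while all others stay below lb."""
    t = {name: t_value(shift_codons, sig) for name, sig in signatures.items()}
    above = [name for name, value in t.items() if value > ub]
    if len(above) == 1 and all(v < lb for name, v in t.items() if name != above[0]):
        return above[0]
    return "unable to detect"


# Toy signatures, invented for illustration only.
signatures = {"A": {"abc", "bcd", "cde"}, "B": {"xyz", "bcd"}}
print(detect_lse(["abc", "bcd", "cde"], signatures))
print(detect_lse(["bcd"], signatures))
```

In the first call only signature A covers every shift-codon, so A is returned. In the second call both signatures reach a T-value of 1, so the result is "unable to detect".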

results from original shift-codon algorithm

Languages like Portuguese and Spanish were not recognized precisely; the algorithm was very accurate only for Asian languages.

extraction of 5-grams

The 5-grams were extracted with the same method as the trigrams, which shows that this parallel solution scales.

alternative to the original algorithm

The alternative approach was to construct a 5-gram model that could be used more successfully than the trigram approach. One fixed threshold of 0.6 for all languages and two moving thresholds, separated by 0.005, for each language were used, and the recognition accuracy is greater than 90%. If exactly one language is above both the fixed threshold and the higher moving threshold, and all other languages are below their lower moving threshold, that language is recognized; otherwise the classifier returns "unable to detect".
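
The decision rule can be sketched as follows. This is one possible reading, with assumptions stated plainly: the report does not say how the moving thresholds are computed, so the higher moving threshold of each language is taken here as a given input (upper_moving, a hypothetical name) and the lower one is derived by subtracting 0.005. The scores are assumed to be per-language 5-gram match rates.

```python
FIXED_THRESHOLD = 0.6  # fixed threshold for all languages, from the text
MOVING_GAP = 0.005     # separation between the two moving thresholds, from the text


def classify(scores, upper_moving):
    """scores: language -> score; upper_moving: language -> higher moving threshold (assumed input)."""
    candidates = [
        lang for lang, score in scores.items()
        if score > FIXED_THRESHOLD and score > upper_moving[lang]
    ]
    if len(candidates) != 1:
        return "unable to detect"
    winner = candidates[0]
    for lang, score in scores.items():
        # Every other language must stay below its lower moving threshold.
        if lang != winner and score >= upper_moving[lang] - MOVING_GAP:
            return "unable to detect"
    return winner


# Invented scores and thresholds, for illustration only.
thresholds = {"pt": 0.70, "es": 0.65}
print(classify({"pt": 0.80, "es": 0.55}, thresholds))
print(classify({"pt": 0.80, "es": 0.648}, thresholds))
```

In the second call the Spanish score falls between its two moving thresholds, so the classifier declines to answer even though Portuguese clears both of its own thresholds.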

97% Accuracy
English
German
French
Dutch
Italian
Russian
Spanish
Polish
Japanese
Portuguese
Chinese
Swedish
Vietnamese
Ukrainian
Catalan
Norwegian (Bokmål)
Finnish
Czech
Persian
Hungarian
Korean
Romanian
Indonesian
Arabic
Turkish
Kazakh
Slovak
Esperanto
Serbian
Danish
Lithuanian
Malay
Basque
Bulgarian
Hebrew
Slovenian
Volapük
Croatian
Waray-Waray
Hindi
Estonian
Galician
Norwegian (Nynorsk)
Azerbaijani
Simple English
Latin
Greek
Thai
Serbo-Croatian
Occitan
Newar / Nepal Bhasa
Macedonian
Uzbek
Georgian
Tagalog
Piedmontese
Belarusian
Haitian
Telugu
Tamil
Belarusian (Taraškievica)
Latvian
Breton
Albanian
Cebuano
Javanese
Malagasy
Welsh
Marathi
Luxembourgish
Icelandic
Armenian
Bosnian
Burmese
Yoruba
Aragonese
Lombard
Malayalam
Western Panjabi
West Frisian
Afrikaans
Aromanian
Bashkir
Bengali
Bishnupriya Manipuri
Swahili
Ido
Sicilian
Urdu
Nepali
Gujarati
Cantonese
Kirghiz
Low Saxon
Kurdish
Asturian
Irish
Quechua
Sundanese
Tatar
Chuvash
Interlingua
Neapolitan
Samogitian
Alemannic
Banyumasan
Walloon
Scots
Kannada
Amharic
Sorani
Scottish Gaelic
Buginese
Fiji Hindi
Tajik
Mazandarani
Min Nan
95% Accuracy
Yiddish
Venetian
Egyptian Arabic
Tarantino
Nahuatl
Ossetian
Sanskrit
Mongolian
Sakha
Zazaki
Kapampangan
Upper Sorbian
Limburgish
Sinhalese
Maori
Northern Sami
Corsican
Gan
Tibetan
Bavarian
Gilaki
Ilokano
Faroese
Central Bicolano
Hill Mari
Võro
Dutch Low Saxon
Punjabi
Turkmen
Pashto
West Flemish
Manx
Rusyn
Mingrelian
Pangasinan
Divehi
Komi
Zeelandic
Norman
Komi-Permyak
Khmer
Romansh
Oriya
Kashubian
Ladino
Udmurt
Ligurian
Friulian
Meadow Mari
Maltese
Wu
North Frisian
Uyghur
Classical Chinese
Sardinian
Pali
Vepsian
Bihari
Somali
Novial
Ripuarian
Saterland Frisian
Anglo-Saxon
Cornish
Navajo
Aymara
Hakka
Picard
Franco-Provençal/Arpitan
Extremaduran
Guarani
Silesian
Gagauz
Interlingue
Lingala
Emilian-Romagnol
Kalmyk
Hawaiian
Palatinate German
Assamese
Pennsylvania German
Kinyarwanda
Karachay-Balkar
Crimean Tatar
Acehnese
Tongan
Chechen
Greenlandic
Aramaic
Lower Sorbian
Erzya
Banjar
Papiamentu
Shona
Lezgian
Kabyle
Tok Pisin
Lak
Lojban
Moksha
Wolof
Buryat (Russia)
Avar
Sranan
90% Accuracy
Zamboanga Chavacano
Mirandese
Tahitian
Lao
Kabardian Circassian
Abkhazian
Tetum
Latgalian
Nauruan
Kongo
Igbo
Northern Sotho
Zhuang
Karakalpak
Zulu
Cheyenne
Romani
Old Church Slavonic
Tswana
Cherokee
Gothic
Min Dong
Samoan
Bislama
Moldovan
Bambara
Inuktitut
Norfolk
Swati
Pontic
Sindhi
Kikuyu
Ewe
Hausa
Oromo
Fijian
Tigrinya
Tsonga
Kashmiri
Venda
Sango
Kirundi
Cree
Akan
Tumbuka
Dzongkha
Luganda
Chichewa
Inupiak
Chamorro
Fula
Sesotho
Twi
Xhosa