Innovations

This solution was inspired by the 2003 article by Suzuki et al. of Nagaoka University, Japan. In that paper, the classification method uses only trigrams to build a model that classifies text with two fixed thresholds. The problem with that algorithm is that Western languages require more information to disambiguate their samples. The modified algorithm extracts all 5-grams from every sample and uses three thresholds instead of two. One threshold is fixed for all languages; the other two move, but remain separated by a fixed distance for each language. For Portuguese, for example, the trigram model is not sufficient to distinguish its samples from Italian or Spanish, because the structure of these languages is so similar. On the other hand, when each character may represent a whole word, as in Chinese, the trigram model fits better. Nevertheless, using the three-threshold approach with only 5-grams, all languages can be successfully recognized.
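
The following is a minimal sketch of the two ideas described above, not the project's actual code. It assumes 5-grams are taken as overlapping character windows, that a language model is simply the set of 5-grams seen in training text, and that a sample is scored by the fraction of its 5-grams found in that model. The text does not spell out how the three thresholds are applied, so the decision rule, the names (char_ngrams, match_rate, classify, GLOBAL_FLOOR) and every numeric value here are hypothetical.

```python
def char_ngrams(text, n=5):
    """Return every overlapping character n-gram in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def match_rate(sample, model, n=5):
    """Fraction of the sample's n-grams that also appear in the model."""
    grams = char_ngrams(sample, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in model) / len(grams)


# Hypothetical value: the one threshold shared by all languages.
GLOBAL_FLOOR = 0.10


def classify(sample, model, lower, band, n=5):
    """Assumed decision rule, for illustration only.

    lower        -- the moving lower threshold for this language
    lower + band -- the moving upper threshold; band is the fixed
                    distance between the two for this language
    """
    rate = match_rate(sample, model, n)
    if rate < GLOBAL_FLOOR or rate < lower:
        return "reject"
    if rate >= lower + band:
        return "accept"
    return "uncertain"


# Toy model built from a single Portuguese sentence.
model = set(char_ngrams("o rato roeu a roupa do rei de roma"))
print(classify("o rato roeu a roupa", model, lower=0.3, band=0.2))  # accept
```

Passing lower and band per call keeps the two moving thresholds tied together: shifting lower for a given language shifts the upper threshold with it, while the gap between them stays fixed.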