Introduction

This project addresses statistical language recognition using parallel processing. The rarest languages are precisely those that need more accuracy in classification. To address this, 473 MB of samples from 275 languages were collected from the web. MapReduce was used to extract trigrams and 5-grams from every sample, and MPI was used to parallelize the extraction. Classification is done using the indexed model generated by the training modules.
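
As a rough illustration of the extraction step, the sketch below counts trigrams and 5-grams in a map-reduce style. It is not the project's code: it assumes character-level n-grams (the text does not say whether n-grams are taken over characters or words), runs sequentially instead of using MPI, and the names extract_ngrams, map_sample, reduce_counts and build_profile are hypothetical.

```python
from collections import Counter
from functools import reduce


def extract_ngrams(text, n):
    """Return every contiguous character n-gram of `text` (assumed character-level)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def map_sample(text, sizes=(3, 5)):
    """Map step: count the n-grams of one sample for each size in `sizes`."""
    counts = Counter()
    for n in sizes:
        counts.update(extract_ngrams(text, n))
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def build_profile(samples, sizes=(3, 5)):
    """Build one n-gram profile from an iterable of sample strings."""
    partial_counts = (map_sample(sample, sizes) for sample in samples)
    return reduce(reduce_counts, partial_counts, Counter())
```

Because the map step works on one sample at a time and the reduce step only merges counts, the samples could be split across processes and the partial counts merged afterwards, which is the property a parallel extraction relies on.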

Equipment

Computer used in this project:

A Mac Pro with an 8-core Intel Xeon processor with Hyper-Threading, 20 GB of RAM and a 2 TB hard disk.

It also has a Quadro FX 4800 card with 192 CUDA cores, which were not used in this project.

Results
Innovations

The main innovation in this project is the combined use of 5-grams, one fixed threshold shared by all languages, and, for each language, two moving thresholds separated by a fixed distance, in order to achieve the best classification.