Sampark: Indian Language MT System

Indian Language to Indian Language Machine Translation System

Sampark System: Automated Translation among Indian Languages

Sampark is a multi-part machine translation system developed through the combined efforts of 11 institutions in India under the umbrella of the consortium project “Indian Language to Indian Language Machine Translation” (ILMT), funded by the TDIL programme of the Ministry of Electronics and Information Technology (MeitY), Govt. of India.

The ILMT project has developed language technology for 9 Indian languages, resulting in MT systems for 18 language pairs: 14 bi-directional pairs between Hindi and Urdu / Punjabi / Telugu / Bengali / Tamil / Marathi / Kannada, and 4 bi-directional pairs between Tamil and Malayalam / Telugu.

India has more than a hundred languages and dialects, of which 22 are designated as official languages in the constitution. More than 850 million people worldwide speak Hindi, Bengali, Telugu, Marathi, Tamil or Urdu. With the growing availability of e-content and the development of language technology, it has become possible to overcome this language barrier. The complexity and diversity of Indian languages present many interesting computational challenges in building automatic translation systems.

Approach:

Sampark uses the Computational Paninian Grammar (CPG) approach for analyzing language and combines it with machine learning, so it uses traditional rule-based and dictionary-based algorithms alongside statistical machine learning. At present, twelve systems are being released:

- Punjabi to Hindi
- Hindi to Punjabi
- Telugu to Tamil
- Urdu to Hindi
- Tamil to Hindi
- Hindi to Telugu
- Tamil to Telugu
- Marathi to Hindi
- Telugu to Hindi
- Hindi to Urdu
- Hindi to Bengali
- Malayalam to Tamil
The Sampark system is based on the analyze-transfer-generate paradigm. First, the source language is analysed; then a transfer of vocabulary and structure to the target language is carried out; finally, the target language output is generated. Each phase consists of multiple modules, with 13 major ones. An advantage of this approach is that a particular language analyzer can be developed once, independently of other languages, and then paired with generators for other languages; for example, a Punjabi analyzer can be joined with a Hindi generator to yield a Punjabi to Hindi MT system. Because Indian languages are similar and share grammatical structures, only shallow parsing is done, and the transfer grammar component has been kept simple. Domain-specific aspects are handled by building suitable domain dictionaries.
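The reuse of analyzers and generators can be pictured as composing independent pipeline stages. The sketch below is a minimal illustration of that idea; the stage functions (analyze_punjabi, transfer_punjabi_hindi, generate_hindi) and the dict-based record are hypothetical placeholders, not the actual Sampark modules or data format.

```python
from typing import Callable, Dict

# A stage maps one sentence record to an enriched record; the real system
# exchanges Shakti Standard Format between modules rather than plain dicts.
Stage = Callable[[Dict], Dict]

def make_mt_pipeline(analyzer: Stage, transfer: Stage, generator: Stage) -> Stage:
    """Compose analyze -> transfer -> generate into one translation system."""
    def pipeline(sentence: Dict) -> Dict:
        return generator(transfer(analyzer(sentence)))
    return pipeline

# Hypothetical placeholder stages, for illustration only:
def analyze_punjabi(s: Dict) -> Dict:
    return {**s, "analysis": f"shallow parse of {s['source']}"}

def transfer_punjabi_hindi(s: Dict) -> Dict:
    return {**s, "transferred": f"Hindi lexicon and structure for: {s['analysis']}"}

def generate_hindi(s: Dict) -> Dict:
    return {**s, "target": f"Hindi text from: {s['transferred']}"}

# The same Punjabi analyzer could be paired with a different transfer step and
# generator to target another language.
punjabi_to_hindi = make_mt_pipeline(analyze_punjabi, transfer_punjabi_hindi, generate_hindi)
print(punjabi_to_hindi({"source": "Punjabi input sentence"})["target"])
```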

The 13 major modules together form a hybrid system that combines rule-based approaches with statistical methods in which the software in essence discovers rules through "training" on text tagged by human language experts.
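As a toy illustration of what discovering rules through "training" on tagged text can mean (this is not a Sampark module), the sketch below learns each word's preferred tag from a small hand-tagged sample and falls back to a hand-written rule for unseen words, combining a learned component with a rule-based one.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sample; a real system trains on corpora annotated by experts.
TAGGED = [("ghar", "NN"), ("jaata", "VM"), ("ghar", "NN"), ("achchhaa", "JJ")]

def train(tagged_pairs):
    """Learn each word's most frequent tag from human-tagged text."""
    counts = defaultdict(Counter)
    for word, tag in tagged_pairs:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(word, model):
    """Hybrid decision: statistical lookup first, hand-written rule as fallback."""
    if word in model:
        return model[word]
    return "VM" if word.endswith("taa") else "NN"  # illustrative rule only

model = train(TAGGED)
print(tag("ghar", model), tag("sotaa", model))  # NN VM
```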

The second notable attribute of this work is the system's software architecture. Due to the complexity of NLP systems and the heterogeneity of the available modules, it was decided that the ILMT system should be developed using a blackboard architecture to provide interoperability between heterogeneous modules. Hence all the modules operate on a common data representation called Shakti Standard Format (SSF), either in memory or as a text stream.
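To picture what "modules reading and writing a common textual representation" looks like in practice, here is a minimal sketch. The record layout (address, token, category, feature structure) is a simplified stand-in inspired by SSF's tabular node format, not the exact SSF specification, and the parsing and dumping routines are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    address: str             # position of the node in the sentence
    token: str               # word or chunk bracket
    category: str            # POS or chunk tag
    features: Optional[str]  # feature structure; None if left unfilled

def parse_stream(text: str) -> List[Node]:
    """Read a tab-separated, SSF-like text stream into node records."""
    nodes = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        cols += [""] * (4 - len(cols))  # keep unfilled attributes representable
        nodes.append(Node(cols[0], cols[1], cols[2], cols[3] or None))
    return nodes

def dump_stream(nodes: List[Node]) -> str:
    """Write the records back as text so the next module can read them."""
    return "\n".join(f"{n.address}\t{n.token}\t{n.category}\t{n.features or ''}"
                     for n in nodes)
```

Because the in-memory objects and the text stream carry the same information, the stream can be dumped and inspected after any module, which is the transparency described below.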

This approach helps to control the complexity of the overall system and also achieves a high degree of transparency in the input and output of every module. The textual SSF output of a module is not only for human consumption; it is also used as input by the subsequent module in the data stream. The readability of SSF helps in development and debugging, because the input and output of any module can be inspected easily. Even if a module fails to analyze its input, the SSF format allows the remaining modules to run without affecting the normal operation of the system: in such a case the output SSF simply carries an unfilled attribute value, and downstream modules continue to operate on the data stream.
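The failure behaviour described above can be sketched as each module tolerating unfilled attributes rather than aborting. The module and toy lexicon below are hypothetical (reusing the Node record from the previous sketch); they only illustrate the "leave the value unfilled and carry on" convention.

```python
# Toy lexicon, illustrative only; a real morphological analyzer is far richer.
MORPH_TABLE = {"ghar": "af='ghar,n,m,s,3,0,,'"}

def tag_morphology(nodes):
    """Hypothetical module: fill the 'features' attribute where analysis succeeds.

    When a token is not analysable, the attribute is left unfilled (None)
    instead of raising an error, so downstream modules keep operating on the
    same data stream.
    """
    for node in nodes:
        node.features = MORPH_TABLE.get(node.token)  # None == unfilled attribute
    return nodes
```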