

I have used WX notation for rule-based transliteration among Indic scripts. WX notation itself is a transliteration scheme for representing Indian languages in ASCII (Roman). This scheme originated at IIT Kanpur for computational processing of Indian languages, and is widely used among the natural language processing (NLP) community in India.

WX offers three advantages. The first is memory and computational efficiency, since we are working with ASCII rather than Unicode (UTF-8, UTF-16, etc.). The second is readability: WX allows one to read a Tamil or Telugu string without knowing the original script, which helps in analysing the developed system. The third is a common representation: WX tries to give a common representation to all Indic scripts. For example, Hyderabad in Telugu (హైదరాబాద్), Malayalam (ഹൈദരാബാദ്) and Kannada (ಹೈದರಾಬಾದ್) all map to hExarAbAx in WX. This is not the case for all Indic scripts, though: Hyderabad in Tamil (ஹைதெராபாத்) maps to hEweVrApAw in WX.

Internally, WX maps the letters of Indic scripts to a common representation in ISCII (refer to pages 15-17 of iscii91.pdf) and then maps this ISCII to ASCII, which we call WX. There is a single ISCII-to-ASCII table for WX conversion and a separate Unicode-to-ISCII table for each Indic script. Check the WX conversions of various Indic scripts in Indic-WX-Converter. Using WX as a bridge, we convert a string from the source script to the target script: first the source script is converted to WX, and then WX is converted to the target script.
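The two-step conversion through WX can be illustrated with a minimal sketch. It is not the Indic-WX-Converter implementation: the tables are hypothetical and cover only the letters of the Hyderabad example, and offsets inside each script's Unicode block are used as a stand-in for the shared ISCII layer. The names `to_wx`, `from_wx` and `transliterate` are assumptions made for this illustration.

```python
# Toy WX bridge: source script -> WX -> target script.
# Assumption: tiny hand-written tables, enough for one example word only.
BLOCK_BASE = {"telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00}

# Keys are offsets inside a script's Unicode block (stand-in for ISCII codes).
CONSONANTS = {0x39: "h", 0x26: "x", 0x30: "r", 0x2C: "b"}
VOWEL_SIGNS = {0x3E: "A", 0x48: "E"}
VIRAMA = 0x4D  # suppresses the inherent vowel 'a'

WX_CONSONANTS = {wx: off for off, wx in CONSONANTS.items()}
WX_VOWEL_SIGNS = {wx: off for off, wx in VOWEL_SIGNS.items()}


def to_wx(text, script):
    """Convert a string in `script` to WX using the toy tables."""
    base = BLOCK_BASE[script]
    out = []
    pending_a = False  # True while the last consonant may still need 'a'
    for ch in text:
        off = ord(ch) - base
        if off in CONSONANTS:
            if pending_a:
                out.append("a")  # previous consonant kept its inherent vowel
            out.append(CONSONANTS[off])
            pending_a = True
        elif off in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[off])
            pending_a = False
        elif off == VIRAMA:
            pending_a = False
        else:
            raise ValueError(f"character {ch!r} is not in the toy table")
    if pending_a:
        out.append("a")
    return "".join(out)


def from_wx(wx, script):
    """Convert a WX string back to `script` using the toy tables."""
    base = BLOCK_BASE[script]
    out = []
    i = 0
    while i < len(wx):
        if wx[i] not in WX_CONSONANTS:
            raise ValueError(f"WX symbol {wx[i]!r} is not in the toy table")
        out.append(chr(base + WX_CONSONANTS[wx[i]]))
        nxt = wx[i + 1] if i + 1 < len(wx) else ""
        if nxt == "a":
            i += 2  # inherent vowel: nothing is written
        elif nxt in WX_VOWEL_SIGNS:
            out.append(chr(base + WX_VOWEL_SIGNS[nxt]))
            i += 2
        else:
            out.append(chr(base + VIRAMA))  # bare consonant
            i += 1
    return "".join(out)


def transliterate(text, source, target):
    """Source script -> WX -> target script."""
    return from_wx(to_wx(text, source), target)


print(to_wx("హైదరాబాద్", "telugu"))
print(transliterate("హైదరాబాద్", "telugu", "kannada"))
```

A real converter also needs independent vowels, digits, and the multi-character WX symbols such as the `eV` seen in the Tamil example, none of which this sketch handles.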

The transliteration systems I have developed so far do not use any rules to transliterate a word from one script to another; they rely on machine learning (ML) models trained on transliteration pairs. For Indic scripts, however, I have developed both rule-based and ML systems. Indic scripts have a special property: their phonemes are aligned one-to-one across their Unicode tables. It seems, then, that transliteration between any two Indic scripts can be achieved merely by using their Unicode tables. There are various issues with this approach. The first issue is handling the missing phonemes in Indic scripts (already discussed here). Secondly, there are ambiguous character mappings (see example table) in some Indic scripts such as Tamil and Bengali. These ambiguous mappings arise because these scripts lack letters for some phonemes. For example, Bengali has no character for Va, so it is represented by Ba (ব); this makes the Bengali Ba ambiguous when we transliterate it to another Indic script, where it can be either Va or Ba.
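The Unicode-table idea and the missing-letter problem can both be seen in a short sketch. It assumes the standard layout in which each of these scripts occupies a 128-code-point Unicode block, so a character is moved to another script by shifting its code point. The function names `shift_script` and `unmapped` are made up for this illustration and are not taken from any of the systems described here.

```python
import unicodedata

# Start of each script's Unicode block; every block is 0x80 code points wide.
BLOCK_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shift_script(text, source, target):
    """Naive transliteration: keep the block offset, swap the block base."""
    src_base, tgt_base = BLOCK_BASE[source], BLOCK_BASE[target]
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(tgt_base + offset))
        else:
            out.append(ch)  # spaces, punctuation, other scripts: unchanged
    return "".join(out)


def unmapped(text, source, target):
    """Source characters whose shifted code point is unassigned in the target."""
    src_base = BLOCK_BASE[source]
    missing = []
    for ch in text:
        if 0 <= ord(ch) - src_base < BLOCK_SIZE:
            shifted = shift_script(ch, source, target)
            if not unicodedata.name(shifted, ""):  # "" means unassigned
                missing.append(ch)
    return missing


# Telugu -> Kannada works by a plain shift.
print(shift_script("హైదరాబాద్", "telugu", "kannada"))

# Devanagari Va has no Bengali counterpart at the same offset.
print(unmapped("व", "devanagari", "bengali"))
```

The shift alone cannot decide what to do with the characters reported by `unmapped`, nor can it resolve a Bengali Ba back to Va or Ba in the other direction; that is where rules or a trained model are needed.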

Among Indian languages, including English, transliteration has mainly been explored for the following language pairs. Hindi and Urdu are the two most widely used languages in the Indian subcontinent, and English is a global language. This is why Hindi-English and Urdu-English transliteration systems have been extensively researched. Hindi-Urdu is another important language pair, which I have already discussed in my previous blog post. After Hindi-Urdu, the Gurmukhi-Shahmukhi script pair tops the chart.
