fertpodcast.blogg.se - Urdu to english transliteration

I have used WX notation for rule-based transliteration among Indic scripts. WX notation itself is a transliteration scheme for representing Indian languages in ASCII (Roman). This scheme originated at IIT Kanpur for computational processing of Indian languages, and is widely used among the natural language processing (NLP) community in India.

WX has the following advantages. Memory and computational efficiency: we are working with ASCII rather than Unicode (utf-8, utf-16 etc.). Readability: WX allows one to read a Tamil or Telugu string without knowing the original script, which helps in analysis of the developed system. Common representation: WX tries to give a common representation to all Indic scripts. For example, Hyderabad in Telugu (హైదరాబాద్), Malayalam (ഹൈദരാബാദ്) and Kannada (ಹೈದರಾಬಾದ್) all map to the common representation hExarAbAx in WX. But this is not the case with all Indic scripts; for example, Hyderabad in Tamil (ஹைதெராபாத்) maps to hEweVrApAw in WX.

Internally, WX maps the letters of Indic scripts to a common representation in ISCII (refer to pages 15-17 of iscii91.pdf) and then maps this ISCII to ASCII, which we call WX. There is a single ISCII to ASCII table for WX conversion and a separate Unicode to ISCII table for each Indic script. Check WX conversions of various Indic scripts from Indic-WX-Converter.

Using WX as a bridge, we convert a string from the source script to the target script. First the source script is converted to WX and then WX is converted to the target script.
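The WX bridge described above can be illustrated with a small sketch. This is not the Indic-WX-Converter code: the table `SCRIPTS` and the functions `to_wx`, `from_wx` and `transliterate` are hypothetical names, the tables cover only the letters needed for one word, and the ISCII step is skipped (each script maps straight to WX symbols). It only handles consonants, vowel signs and the virama.

```python
# Toy per-script tables (an assumption for illustration): WX symbol -> character.
SCRIPTS = {
    "telugu": {
        "cons": {"h": "\u0c39", "x": "\u0c26", "r": "\u0c30", "b": "\u0c2c"},
        "matra": {"E": "\u0c48", "A": "\u0c3e"},
        "virama": "\u0c4d",
    },
    "kannada": {
        "cons": {"h": "\u0cb9", "x": "\u0ca6", "r": "\u0cb0", "b": "\u0cac"},
        "matra": {"E": "\u0cc8", "A": "\u0cbe"},
        "virama": "\u0ccd",
    },
    "devanagari": {
        "cons": {"h": "\u0939", "x": "\u0926", "r": "\u0930", "b": "\u092c"},
        "matra": {"E": "\u0948", "A": "\u093e"},
        "virama": "\u094d",
    },
}


def to_wx(text, script):
    """Convert a string in the given script to WX (toy coverage only)."""
    table = SCRIPTS[script]
    cons = {ch: wx for wx, ch in table["cons"].items()}
    matra = {ch: wx for wx, ch in table["matra"].items()}
    out = []
    pending = False  # True while the last consonant has no vowel yet
    for ch in text:
        if pending and ch not in matra and ch != table["virama"]:
            out.append("a")  # write the inherent vowel explicitly
        if ch in cons:
            out.append(cons[ch])
            pending = True
        elif ch in matra:
            out.append(matra[ch])
            pending = False
        elif ch == table["virama"]:
            pending = False  # virama suppresses the inherent vowel
        else:
            out.append(ch)  # pass unknown characters through
            pending = False
    if pending:
        out.append("a")
    return "".join(out)


def from_wx(wx, script):
    """Convert a WX string to the given script (toy coverage only)."""
    table = SCRIPTS[script]
    out = []
    pending = False  # True while the last output is a bare consonant
    for sym in wx:
        if pending and sym == "a":
            pending = False  # inherent vowel: nothing to write
        elif pending and sym in table["matra"]:
            out.append(table["matra"][sym])
            pending = False
        else:
            if pending:
                out.append(table["virama"])  # consonant without a vowel
            out.append(table["cons"].get(sym, sym))
            pending = sym in table["cons"]
    if pending:
        out.append(table["virama"])
    return "".join(out)


def transliterate(text, source, target):
    """Source script -> WX -> target script."""
    return from_wx(to_wx(text, source), target)


hyderabad_telugu = "\u0c39\u0c48\u0c26\u0c30\u0c3e\u0c2c\u0c3e\u0c26\u0c4d"
print(to_wx(hyderabad_telugu, "telugu"))
print(transliterate(hyderabad_telugu, "telugu", "devanagari"))
```

Because both directions go through WX, supporting a new script in this sketch means adding one table rather than one converter per script pair.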

The transliteration systems I have developed so far do not use any kind of rules to transliterate a word from one script to another; instead they use ML models trained on transliteration pairs. But for Indic scripts I have developed both rule-based and machine learning (ML) systems. Indic scripts have a special property: their phonemes are aligned one-to-one between their Unicode tables. It therefore seems that transliteration between any two Indic scripts can be achieved by merely using their Unicode tables. There are, however, various issues with this approach. The first issue is handling the missing phonemes in Indic scripts (already discussed here). Secondly, there are ambiguous character mappings (see example table) in some Indic scripts like Tamil, Bengali etc. These ambiguous mappings arise because these scripts are missing letters for some phonemes. For example, Bengali has no character for Va, so it is represented by Ba (ব). This makes the Bengali Ba ambiguous when we transliterate it to another Indic script, as it can be either Va or Ba in that script.
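The Unicode-table idea and its missing-letter problem can be seen in a short sketch. It assumes the standard Unicode block starts for each script and uses a hypothetical helper, `shift_script`, that moves each character to the same offset in the target block. Where the target block has no assigned character at that offset, the helper keeps the source character and reports it. This is only an illustration of the naive approach, not the rule-based system described in this post.

```python
import unicodedata

# Start of each script's Unicode block; each block is 0x80 code points wide.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
}


def shift_script(text, source, target):
    """Naive transliteration: same offset, different block.

    Returns (converted_text, missing), where missing lists the source
    characters that have no assigned counterpart in the target block.
    """
    src_base = BLOCK_START[source]
    tgt_base = BLOCK_START[target]
    out, missing = [], []
    for ch in text:
        offset = ord(ch) - src_base
        if not 0 <= offset < 0x80:
            out.append(ch)  # not in the source block: leave untouched
            continue
        candidate = chr(tgt_base + offset)
        if unicodedata.name(candidate, None) is None:
            # Unassigned code point: the target script lacks this letter.
            missing.append(ch)
            out.append(ch)
        else:
            out.append(candidate)
    return "".join(out), missing


hyderabad_telugu = "\u0c39\u0c48\u0c26\u0c30\u0c3e\u0c2c\u0c3e\u0c26\u0c4d"

# Telugu -> Devanagari: every letter has a counterpart.
print(shift_script(hyderabad_telugu, "telugu", "devanagari"))

# Telugu -> Tamil: the letters for Da and Ba have no Tamil counterpart.
print(shift_script(hyderabad_telugu, "telugu", "tamil"))

# Devanagari Va -> Bengali: the Va slot is unassigned in the Bengali block.
print(shift_script("\u0935", "devanagari", "bengali"))
```

The reported gaps are exactly where a rule is needed to pick a substitute letter, and that substitution is what makes the reverse direction ambiguous.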

Gurmukhi and Shahmukhi are the two scripts used for Punjabi, which makes this pair an important candidate for exploring transliteration. Gurmukhi is mainly used by Punjabi speakers in India, while Shahmukhi, a Perso-Arabic script, is used by Punjabi speakers in Pakistan. Other than these language pairs, transliteration systems have been developed for some Indic-Roman pairs like Malayalam-English, Tamil-English, Telugu-English etc. Transliteration among the rest of the Indic scripts has not been explored much. In this blog I will discuss transliteration among the following languages, and will refer to them as Indic scripts from here on:

Devanagari (Hindi, Marathi, Konkani, Nepali and Bodo).

Among Indian languages including English, transliteration has mainly been explored for a few language pairs. Hindi and Urdu are the two most widely used languages in the Indian subcontinent, and English is a global language, which is why Hindi-English and Urdu-English transliteration systems have been extensively researched. Hindi-Urdu is another important language pair, and I have already discussed it in my previous blog. After Hindi-Urdu, the Gurmukhi-Shahmukhi script pair tops the chart.