Friday, April 28, 2006

As part of my work on Natural Language Processing, I was required to learn how to work comfortably with Indic languages. This meant being able to computationally process Indic scripts, either in standard Unicode or in proprietary encodings, and parse Indian sentences syntactically and semantically. I found the work terribly refreshing, and I've come to believe that the language features most OSes provide have been underutilized for far too long. This post covers part of what I've discovered.

If you want to work seamlessly with your own languages like हिंदी, বাংলা, ગુજરાતી, ਪੰਜਾਬੀ, ಕನ್ನಡ, తెలుగు or தமிழ் on your computer the way you naturally do in your life, or are more fluent in your own language and have always wondered why you were stuck working exclusively in English, you need to read this.

In 1991, the Unicode standard attempted to bring some regularity to the chaos of innumerable independent script encodings that were popping up all over the world. These encodings offered some compatibility with the Roman script, but rarely worked with one another. Unicode supports almost all scripts in use today, from Arabic (العربية) to Zhuyin (注音). Every script has its own place in Unicode space, which means that you can seamlessly integrate several scripts into one document, like I've just done.
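To make "its own place in Unicode space" a little more concrete, here is a minimal Python sketch that prints the code point and official Unicode name of a few characters from different scripts. The helper name `describe` and the sample string are my own choices for illustration.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]

# One letter each from the Latin, Devanagari, Bengali and Arabic scripts,
# written as escapes so the code points are unambiguous.
mixed = "a\u0939\u09ac\u0639"

for ch, code_point, name in describe(mixed):
    print(ch, code_point, name)
```

Each letter lands in a different range of code points, which is why the four scripts can sit side by side in one string without clashing.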

Getting support for Indic and Arabic scripts in Windows XP is rather straightforward, and I'll explain it in brief here. Unfortunately, the rest of this post deals with Windows XP exclusively. Linux and *nix users are requested to click here instead - getting Indic scripts to work in Linux is perhaps a bit more involved. For the rest, the "Regional and Language Options" icon under the Windows XP Control Panel is where you want to go. Once it opens up, click on the "Languages" tab, and under the "Supplemental language support" group, tick the checkbox that says "Install files for complex scripts and right-to-left languages (including Thai)". Click "OK", wait for the installation to complete, and you're done with the preliminary support!

Well, this probably deserves some explanation. In Unicode, a complete syllable like हिं is stored as a sequence of its compositional units, like ह+ि+ं (not really surprising at all, eh!). However, in Roman script a sequence remains a sequence orthographically (c+a+t=cat), whereas in many languages like our own, a sequence can be drawn as a completely different shape (think of the previous example or, say, त्+र=त्र). So Unicode fonts need to accommodate this, and shapes like त्र are stored in the font as well (even though the text itself is still stored internally as a sequence of the Unicode representations of its compositional units). The complex script support you just installed is what lets Windows pick the right shape for each sequence.
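You can check the "stored as a sequence" part yourself. The following Python sketch lists the code points behind हिं and त्र; the helper name `code_points` is just something I made up for this example. Note that त् is itself two code points (त followed by the virama sign), so त्र comes out as three.

```python
import unicodedata

def code_points(text):
    """List each stored code point of text with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Written as escapes so it is clear exactly what is stored.
syllable = "\u0939\u093f\u0902"   # displayed as one syllable
conjunct = "\u0924\u094d\u0930"   # displayed as one conjunct

for text in (syllable, conjunct):
    print(text, len(text), "code points")
    for line in code_points(text):
        print("   ", line)
```

What you see on screen is one unit, but the string length is 3 in both cases: the merging happens in the font and the rendering engine, not in the stored text.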

Secondly, to start typing, you need to install the languages you would like to work with in Windows XP. To do this, go back to the now-familiar "Regional and Language Options" and under the "Languages" tab, click on the "Details..." button in the "Text services and input languages" group. Under "Installed services", click on "Add...". Add any language you wish, along with associated services like the corresponding keyboard. In case your language does not have a supported keyboard, choose "INSCRIPT". For हिंदी, Windows XP provides a "Hindi Traditional" keyboard. Now, under the "Preferences" group of the "Text services and input languages" window, click on "Language Bar...". Tick "Show the Language bar on the Desktop", and click on "Apply". You should now see a new bar floating around, and you can click on the icon that says "EN" to choose between the installed languages. You are now ready to start typing in your own language!

In Windows XP, you simply need to choose a language in the Language bar to start typing. However, learning which characters are mapped to which keys on the keyboard isn't easy. There are in fact many ways to type in non-English languages. They are:

  1. Use the keyboard to type. Get a देवनागरी keyboard or a keyboard for your language, or simply buy a keyboard skin. Learn the mapping of keys to characters yourself. In general, for हिंदी, the consonants are towards the right, and the matras are towards the left.
  2. Install software or use online editors to type. This is much slower than actually using the keyboard since you have to click on each character. A fairly simple editor for हिंदी can be found here.
  3. Use transliteration. UPenn has a very handy webpage that lets you type romanized Hindi and get equivalent transliterated Unicode. So, you can type "bhagawaana", and the webpage gives you भगवान! Find it here.
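To give a feel for how the transliteration option can work, here is a toy Python sketch. It is not the UPenn tool's actual scheme, which I haven't seen the insides of. The romanization table, the function names and the handful of letters covered are all my own assumptions, picked so that the "bhagawaana" example goes through.

```python
# Toy romanization tables (hypothetical, and nowhere near complete).
CONSONANTS = {"bh": "भ", "g": "ग", "w": "व", "n": "न", "m": "म", "r": "र"}
MATRAS = {"aa": "ा", "i": "ि", "a": ""}      # "a" is the inherent vowel: no sign
INDEPENDENT = {"aa": "आ", "i": "इ", "a": "अ"}  # vowels not attached to a consonant
VIRAMA = "्"                                  # marks a consonant with no vowel

def match(table, text, pos):
    """Return the longest key of table found in text at pos, or None."""
    for key in sorted(table, key=len, reverse=True):
        if text.startswith(key, pos):
            return key
    return None

def transliterate(roman):
    """Convert romanized text to Devanagari using the toy tables above."""
    out = []
    pos = 0
    while pos < len(roman):
        cons = match(CONSONANTS, roman, pos)
        if cons is not None:
            out.append(CONSONANTS[cons])
            pos += len(cons)
            vowel = match(MATRAS, roman, pos)
            if vowel is None:
                out.append(VIRAMA)            # bare consonant
            else:
                out.append(MATRAS[vowel])     # attach the vowel sign, if any
                pos += len(vowel)
            continue
        vowel = match(INDEPENDENT, roman, pos)
        if vowel is not None:
            out.append(INDEPENDENT[vowel])
            pos += len(vowel)
            continue
        out.append(roman[pos])                # pass unknown characters through
        pos += 1
    return "".join(out)

print(transliterate("bhagawaana"))
```

The one design choice worth noting is the longest-match rule: "bh" has to be tried before a single letter, and "aa" before "a", otherwise the long vowel in "waa" would be read as two short ones.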

I hope this really basic post will get you to do interesting things with Unicode. Being the samaritan that I am, I volunteer to give you pointers here as well. Some really fun things you can start off with in your own languages are:

  1. Start searching the web with keywords in your own scripts. This will introduce you to a part of the web you haven't seen before. A Google search for 'गरम मसाला' can be found here.
  2. Explore web pages and blogs that have long since adopted the Unicode standard. An example blog can be found here (I have no idea who the author is, it's just something I stumbled upon).
  3. Start contributing to the community. For starters, start blogging in your favourite language, or start adding pages, for example, to the বাংলা, ગુજરાતી, ಕನ್ನಡ or తెలుగు Wikipedias! It's about time we started making our presence felt on the World Wide Web, and asserted our identity!
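If you'd like to build such search links yourself, the one thing to know is that non-ASCII characters in a URL get percent-encoded as UTF-8 bytes. Here is a small Python sketch; the function name `search_url` is my own, and I'm assuming the usual `https://www.google.com/search?q=` form for the address.

```python
from urllib.parse import quote_plus, unquote_plus

def search_url(query, base="https://www.google.com/search"):
    """Build a search URL, percent-encoding the query as UTF-8."""
    return f"{base}?q={quote_plus(query)}"

query = "गरम मसाला"
url = search_url(query)
print(url)

# Decoding the encoded part gives back the original keywords.
encoded = url.split("?q=", 1)[1]
print(unquote_plus(encoded))
```

Your browser does this encoding for you when you type into the search box, so this only matters when you are putting the link together by hand or in a script.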