How to tag documents with multiple languages and scripts.

A happy New Year to all my readers.

This holiday season was unusual in that the Christian festival of Christmas, the Jewish festival Chanukkah (חנכה) and the Islamic New Year, Maal Hijra, all occurred at the same time.

The previous sentence raises the question of how it should be tagged in HTML. It contains three different languages: English, Hebrew (in its native script and in transliteration) and Arabic (in transliteration only). To add to the complication, the Hebrew script should be read from right to left, whilst its transliteration should be read from left to right.

Before I try to answer this question I need to explain briefly why it is important to tag multi-language documents correctly. The reasons include accessibility needs such as:

Screen readers need to know what language they are reading so that they can pronounce it properly, or announce that the text is in a language that they do not recognize.
Screen magnifiers may use the direction of text to decide on how they should move around the screen.

Besides the accessibility needs, other systems can benefit from knowing the language of the text:

Spelling checkers need to know the language of the text so that they can check against the correct dictionary, or ignore the text if they have no dictionary to check against.
Tools that allow you to ask the definition of a highlighted word obviously need to know which language the word is in so that they can give you either a dictionary definition or translation.
Search engines may also be able to use the language of the text to improve their categorisation and search results.

Having set myself this holiday question to investigate, I went straight onto the web. I quickly discovered that there are two attributes related to internationalisation (I18n):

‘dir’, which specifies the direction of content; its values can be ‘ltr’ (left to right) or ‘rtl’ (right to left).
‘xml:lang’ (written simply as ‘lang’ in plain HTML), which specifies the language of the text and can have values such as ‘en’ (English) or ‘fr’ (French).
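
As a minimal sketch of how these two attributes appear in markup, the Python snippet below builds a span element that carries them. The helper name tag_span is hypothetical, and the sketch assumes the plain HTML form of the language attribute (lang); an XHTML page would use xml:lang as well or instead.

```python
from html import escape


def tag_span(text, lang, direction=None):
    """Wrap text in a span declaring its language and, optionally, its direction."""
    if direction not in (None, "ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    attrs = f'lang="{escape(lang, quote=True)}"'
    if direction is not None:
        attrs += f' dir="{direction}"'
    # escape() protects characters such as < and & in the visible text.
    return f"<span {attrs}>{escape(text)}</span>"


print(tag_span("Bonne année", "fr"))
print(tag_span("חנכה", "he", "rtl"))
```

The direction is left optional here because most runs of text simply inherit it from the surrounding page; only the right-to-left Hebrew needs to state it.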

My next discovery was that there is an international standard (ISO 639-1) that specifies two-letter abbreviations for languages, so I found out that Arabic is ‘ar’ and Hebrew is ‘he’. That left me with the problem of how to distinguish between Hebrew in its native script and Hebrew in transliteration.

This led me into the world of the Requests for Comments (RFCs) of the Internet Engineering Task Force (IETF). Being a world of standards, it is by nature very detailed, precise and pedantic. That is as it has to be, but it does make it difficult for a newcomer to comprehend the material and navigate to the relevant area. I found out that a language attribute can be made up of more than one part, and found a list of recognized combinations; this included ‘az-Latn’ for Azerbaijani written in the Latin script. Thus it appeared to me that using ‘he-Latn’ would be a reasonable answer for my Hebrew transliteration. However, the document I was looking at said that I had to formally register it. My attempt to register it failed with a message that suggested that my formatting of the request was incorrect. Luckily I had found an e-mail address of someone who obviously understood the subject, and I decided to use the personal touch rather than talk to a computer again. I am delighted to say that this approach resulted in a very quick response, even though it came in the days between New Year and the return to work the following week.

A few more e-mails from the RFC community explained everything to me. I had been looking at an out-of-date RFC and should have been looking at a newer one. Under the newer rules, the language and script parts can be combined in any reasonable way, which includes ‘he-Latn’ and ‘ar-Latn’.
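
To illustrate how such a combined tag is put together, here is a rough Python sketch that splits a tag into its language part and its optional script part. It covers only the simple language-Script shape discussed here (a two- or three-letter language code followed by an optional four-letter script code), not the full grammar of the standard, and it does not check the codes against any registry. The function name split_tag is hypothetical.

```python
import re

# Simplified pattern: 2-3 letter language code, optional 4-letter script code.
# Assumption: regions, variants and other kinds of subtag are deliberately ignored.
_SIMPLE_TAG = re.compile(r"([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?")


def split_tag(tag):
    """Return (language, script) in conventional casing, or raise ValueError."""
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a simple language[-Script] tag: {tag!r}")
    language, script = match.groups()
    # By convention the language is lowercase and the script is title case.
    return language.lower(), (script.title() if script else None)


for tag in ("he", "he-Latn", "ar-Latn", "az-Latn"):
    print(tag, "->", split_tag(tag))
```

Matching is case-insensitive because the tags themselves are; the casing is only tidied up on output.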

So I now have the answer to my question. If you look at the source of the relevant sentence you will see that it has been tagged correctly.
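
Since the tagging is only visible in the page source, the sketch below shows one way part of the sentence could be marked up, together with a small Python check that lists which language each run of text declares. The markup is an illustrative reconstruction based on the conclusions above, not a copy of the page's actual source, and the class name LangRuns is hypothetical.

```python
from html.parser import HTMLParser

# Illustrative markup (an assumption, not the original page source).
SENTENCE = (
    '<p lang="en">the Jewish festival '
    '<span lang="he-Latn">Chanukkah</span> '
    '(<span lang="he" dir="rtl">חנכה</span>), and the Islamic New Year '
    '<span lang="ar-Latn">Maal Hijra</span></p>'
)


class LangRuns(HTMLParser):
    """Collect (lang, dir, text) for each run of text.

    Values are inherited from the enclosing element when a tag does not
    set them. This sketch assumes every start tag has a matching end tag.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # one (lang, dir) pair per open element
        self.runs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self.stack[-1] if self.stack else (None, None)
        self.stack.append(
            (attrs.get("lang", parent[0]), attrs.get("dir", parent[1]))
        )

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang, direction = self.stack[-1] if self.stack else (None, None)
            self.runs.append((lang, direction, text))


parser = LangRuns()
parser.feed(SENTENCE)
parser.close()
for run in parser.runs:
    print(run)
```

A screen reader or spelling checker would do something similar in spirit: walk the markup and pick, for each run of text, the nearest declared language and direction.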

It is also worth pointing out that, although this article has concentrated on HTML, the language attribute can be used in other forms of documentation, for example tagged PDF.

I would like to thank all those who have helped me on this journey.

It has raised two new questions for me:

My journey was more complex because Google initially pointed me at the older documents on the subject. I assume that this was because there were more references to the older documents. Is there any way we can ensure that old and obsolete documents drop down the Google search list more quickly?

I also found the standards documents difficult to understand as a newcomer. Is there any way to make them easier to understand for relatively casual users like myself? I am hoping that writing this article may help other people who are trying to solve the same or a similar problem.

Wishing everyone an accessible, usable and well-tagged New Year.