W3C India Office

This version:
CSS India Draft Version 2.0
Latest version:
http://w3cindia.in/cssdocument.html
Previous version:
CSS India Draft Version 1.0

Editors:

Ms Swaran Lata, W3C India, Country Manager
Mr Somanth Chandra, W3C India, Dy. Country Manager
Mr Prashant Verma, W3C India, Sr. Software Engineer
Mr Naitik Tyagi, W3C India, Software Engineer

Abstract

This document describes the requirements for Indian-language layout to be realized with CSS technology. It is based mainly on input about the layout requirements of eight Indian languages: Hindi, Bengali, Kannada, Gujarati, Marathi, Tamil, Malayalam and Telugu. The requirements of other Indian languages also need to be considered.

Table of Contents

  1 Introduction
     1.1 Purpose of this document
     1.2 How this document was created
  2 Dependencies on other modules
  3 Basic Composition of Indian Languages characters
  4 Styling of first letter pseudo-element
     4.1 Issues in Hindi script
     4.2 Issues in Bengali script
     4.3 Issues in Malayalam script
     4.4 Issues in Tamil script
  5 Drop Initial overview
  6 Bullets and Numbers
  7 Collation
  8 Vertical arrangements of characters
  9 Horizontal spacing
 10 Styling issues of indentation of character
 11 Underlining of the characters
 12 Overlining of the characters
 13 Unicode Line Breaking Algorithm (UAX#14)
 14 Unicode Text Segmentation (UAX#29)
 15 Formatting issues
     15.1 html.hi extension (HTML file with .hi extension)
     15.2 Horizontal justification using CSS (HTML file)
 16 Some of the specific problems – Language Wise
 17 Consolidated Summary
 18 References
 19 Annexure I. Candidate CSS 3 properties for Internationalization

1. Introduction

1.1 Purpose of This Document

This document presents various CSS styling issues in Indic scripts. It also covers formatting issues and alignment properties.

CSS is the abbreviation for Cascading Style Sheets. A style sheet holds a collection of rules that control the presentation of web pages. CSS can be applied to web pages in several ways; the most powerful is an external style sheet. Used in this manner, CSS controls the design and appearance of a whole site from a single location, which makes it easy to update the site globally.

Each cultural community has its own language, script and writing system. Transferring each writing system into cyberspace is therefore a task of very high importance for information and communication technology. This document describes issues of text composition and layout requirements for eight Indian languages: Hindi, Bengali, Kannada, Gujarati, Marathi, Malayalam, Tamil and Telugu.

1.2 How This Document Was Created

This document was created by W3C India. W3C India discussed many issues and harmonized the requirements from user communities with solutions from technological experts. The following participants were involved:

a. Indian-language text composition experts (formatting rules for Indian documents).

b. Internationalization and standardization experts in India.

c. Linguists.


To support development, the task force held face-to-face meetings with participating Working Groups. The document itself was also developed bilingually, and is published bilingually. We carefully avoided using jargon for technical terms.

2. Dependencies on other modules

This CSS module depends on the following other CSS modules:

• Text

• Selectors

• Line layout

3. Basic Composition of Indic Languages

3.1 Hindi

[Figure: Composition of Indic Languages]

3.2 Kannada

[Figure: Composition of Indic Languages]

3.3 Tamil

[Figure: Composition of Indic Languages]

3.5 Bengali

Bengali Characters

় ঁ ং ঃ ৺ অ আ ই ঈ উ ঊ ঋ ৠ ঌ ৡ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড {ড়} ঢ {ঢ়} ণ ত থ দ ধ ন প ফ ব ভ ম য {য়} র ল শ ষ স হ ঽ া ি ী ু ূ ৃ ৄ ৢ ৣ ে ৈ ো ৌ ্ ৗ

Bengali characters Sorting Order

় ঁ ং ঃ ৺ অ আ ই ঈ উ ঊ ঋ ৠ ঌ ৡ এ ঐ ও ঔ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড {ড়} ঢ {ঢ়} ণ ত থ দ ধ ন প ফ ব ভ ম য {য়} র ল শ ষ স হ ঽ া ি ী ু ূ ৃ ৄ ৢ ৣ ে ৈ ো ৌ ্ ৗ

3.6 Gujarati

Gujarati Characters

ૐ અ આ ઇ ઈ ઉ ઊ ઋ ૠ ઍ એ ઐ ઑ ઓ ઔ ક ખ ગ ઘ ઙ ચ છ જ ઝ ઞ ટ ઠ ડ ઢ ણ ત થ દ ધ ન પ ફ બ ભ મ ય ર લ ળ વ શ ષ સ હ ઼ ઁ ં ઃ ઽ ્ ા િ ી ુ ૂ ૃ ૄ ૅ ે ૈ ૉ ો ૌ

Gujarati characters Sorting Order

ૐ અ આ ઇ ઈ ઉ ઊ ઋ ૠ ઍ એ ઐ ઑ ઓ ઔ ક ખ ગ ઘ ઙ ચ છ જ ઝ ઞ ટ ઠ ડ ઢ ણ ત થ દ ધ ન પ ફ બ ભ મ ય ર લ ળ વ શ ષ સ હ ઼ ઁ ં ઃ ઽ ્ ા િ ી ુ ૂ ૃ ૄ ૅ ે ૈ ૉ ો ૌ

3.7 Malayalam

Malayalam Characters

അ ആ ഇ ഈ ഉ ഊ ഋ ൠ ഌ ൡ എ ഏ ഐ ഒ ഓ ഔ ക ഖ ഗ ഘ ങ ച ഛ ജ ഝ ഞ ട ഠ ഡ ഢ ണ ത ഥ ദ ധ ന പ ഫ ബ ഭ മ യ ര ല വ ശ ഷ സ ഹ ള ഴ റ

Malayalam characters Sorting Order

U+200C U+200D ഃ അ ആ ഇ ഈ ഉ ഊ ഋ ൠ ഌ ൡ എ ഏ ഐ ഒ ഓ ഔ ക ഖ ഗ ഘ ങ ച ഛ ജ ഝ ഞ ട ഠ ഡ ഢ ണ ത ഥ ദ ധ ന പ ഫ ബ ഭ മ ം യ ര ല വ ശ ഷ സ ഹ ള ഴ റ ാ ി ീ ു ൂ ൃ െ േ ൈ ൊ ോ ൗ ൌ ്

3.8 Telugu

Telugu Characters

అ ఆ ఇ ఈ ఉ ఊ ఋ ౠ ఌ ౡ ఎ ఏ ఐ ఒ ఓ ఔ క ఖ గ ఘ ఙ చ ఛ జ ఝ ఞ ట ఠ డ ఢ ణ త థ ద ధ న ప ఫ బ భ మ య ర ఱ ల ళ ఴ వ శ ష స హ

Telugu characters Sorting Order

\u200C \u200D ഃ അ ആ ഇ ഈ ഉ ഊ ഋ ൠ ഌ ൡ എ ഏ ഐ ഒ ഓ ഔ ക ഖ ഗ ഘ ങ ച ഛ ജ ഝ ഞ ട ഠ ഡ ഢ ണ ത ഥ ദ ധ ന പ ഫ ബ ഭ മ ം യ ര ല വ ശ ഷ സ ഹ ള ഴ റ ാ ി ീ \u0D41 \u0D42 \u0D43 െ േ ൈ ൊ ോ ൗ ൌ \u0D4D

4. Styling of first letter pseudo-element

The first-letter pseudo-element represents the first letter of the first line of a block, if it is not preceded by any other content (such as images or inline tables) on its line. It allows that first letter to be styled individually, without markup. It may be used for "initial caps" and "drop caps", which are common typographical effects in text in Latin script.

4.1 Issues in Hindi script

If a style is to be applied to the starting character, it must be decided whether it applies to a single character, a conjunct, a syllable or a grapheme cluster.

Indic script behavior relates to syllables, rather than individual letter forms. In the Hindi word स्थिति ('sthiti') the sequence of characters in the first syllable is as follows in memory:


0938: स DEVANAGARI LETTER SA
094D: ् DEVANAGARI SIGN VIRAMA
0925: थ DEVANAGARI LETTER THA
093F: ि DEVANAGARI VOWEL SIGN I



[Figure: rendering of the Hindi word स्थिति ('sthiti')]


Note how the vowel sign appears to the left of the first character, not the third.

There are two default grapheme clusters here. The first comprises SA+VIRAMA+THA+I. The second is the last two characters, TA+I.
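The stored order of the characters can be inspected programmatically. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is an illustrative assumption and is not taken from any specification.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs in stored (logical) order."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The Hindi word 'sthiti', written with escapes so the stored order is explicit.
word = "\u0938\u094D\u0925\u093F\u0924\u093F"

for code, name in describe(word):
    print(code, name)
```

The listing shows that the vowel sign I (093F) is stored after THA, even though it is rendered to the left of the whole cluster.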

From the feedback we have received it appears that first-letter styling will be needed for Indic scripts. We have examples in the mail archive for such styling in Devanagari, Bengali, and Malayalam, though we have reports that it is needed for other scripts, such as Telugu, Tamil and Kannada.

We see that the styling is done on the basis of the syllable, not the first character. A syllable includes a base consonant and any combination of the following characters in the text stream:

• consonants preceded by virama (i.e. conjuncts)

• vowel signs

• visarga, anusvara or candrabindu
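The syllable description above can be sketched as a small function that finds the span a first-letter style would cover. This is a simplified illustration for Devanagari, not a definitive implementation: the name `first_syllable` is hypothetical, Unicode general categories stand in for proper script analysis, and leading punctuation is not handled.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

def first_syllable(text):
    """Return the leading syllable of `text` under the simplified rule above."""
    end = 0
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        if i == 0:
            # The base character always belongs to the syllable.
            end = 1
        elif category.startswith("M"):
            # Vowel signs, virama, visarga, anusvara and candrabindu are marks.
            end = i + 1
        elif category.startswith("L") and text[i - 1] == VIRAMA:
            # A consonant preceded by virama continues the conjunct.
            end = i + 1
        else:
            break
    return text[:end]

# 'sthiti': the first syllable is SA + VIRAMA + THA + VOWEL SIGN I.
print(first_syllable("\u0938\u094D\u0925\u093F\u0924\u093F"))
```

Under these assumptions the function returns the first four code points of the word, which is the unit the examples suggest should receive the first-letter style.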

4.2 Issues in Bengali script

[Figure: first-letter styling example in Bengali]

We also had some examples of increased font size without the drop letter characteristics.

4.3 Issues in Malayalam script



[Figure: first-letter styling example in Malayalam]

When applying styling, some further combinations have to be taken into consideration in addition to the styling of the first character.

4.3.1 Styling applied to consonant-cluster + vowel

4.3.2 Styling applied to consonant + vowel marker

4.3.3 Styling applied to consonant + anuswara

4.3.4 Styling applied to consonant + visarga

So, it seems that here also the style applies to the grapheme cluster for Malayalam, not the syllable or conjunct or first character only.

4.4 Issues in Tamil script


[Figure: first-letter styling example in Tamil]

An example in Bengali script without the drop letter characteristics is:

tamil

5. Drop Initial overview


Drop initial is a typographic effect emphasizing the initial letter(s) of a block element with a presentation similar to a 'floated' element.
The following figure shows first a simple case of a three-line drop initial, and second a case of a two-line drop initial with an initial letter sized for three lines.

[Figure: three-line drop initial, and two-line drop initial with a three-line-size initial letter]

The examples show a predominance of styling similar to what would be called a 'drop letter' in English. Where an enlarged character belongs to a script that has a headstroke, the headstrokes of the large text and the regular text are typically at approximately the same level, but commonly do not join.

The bottom horizontal line in the first example shows the alignment between the baseline of the initial letter and the baseline of the third line of its block element. The primary connection point occurs at the intersection of these baselines between the two boxes (the third line box and the initial letter box). The top horizontal line shows a connection point between the top of the ink of the initial letter and the text-before-edge baseline of the initial line. Alternatively, instead of the text-before-edge baseline, a new computed alignment could be used which corresponds loosely to the maximum caps-height of the initial line.

The drop initial effect may also be used for writing systems which use different alignment strategies. For example, in Devanagari the hanging baseline may be preferred. In that case the primary connection point connects the text-after-edge of the initial letter with the text-after-edge of the nth line, but the secondary connection point connects the hanging baselines of the initial letter and the initial line. This is shown in the following figure:

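[Figure: drop initial aligned on the hanging baseline]

Example of a drop letter in Hindi

[Figure: drop letter in Hindi]

Example of a drop letter in Malayalam

[Figure: drop letter in Malayalam]

Example of a drop letter in Bengali

[Figure: drop letter in Bengali]

Example of a drop letter in Gujarati

[Figure: drop letter in Gujarati]

Example of a drop letter in Telugu

[Figure: drop letter in Telugu]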

6. Bullets and Numbers

Numbering schemes and bulleting need to be supported in Indian languages as well. Standards need to be provided to those developing CSS so that, by default, users can bullet lists in their own Indic languages. Once these standards exist, developers of authoring tools can adopt them to provide Indic-language bulleting. The current bulleting order, even in popular applications such as Microsoft Word (Version 2003), is not suitable for users. Word processors are sometimes used to develop pages for the web, so standards are a must. In most applications the Devanagari order is followed for all languages sharing the script, which is not correct when deciding on sorting or collation for Indic languages.

6.1 Hindi

[Figure: Hindi bullets]

6.2 Punjabi

[Figure: Punjabi bullets]

For Bangla and Assamese, bulleting should be as follows. For numeric bulleting, the digits of the Bangla script (১, ২, ৩, ৪, ৫, ৬, ৭, ৮, ৯, ০) are to be used.

For alphabetic bulleting, the consonants of the Bangla script are used. The Bangla vowels are not used for bulleting.
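The two Bangla marker schemes described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the consonant sequence is an assumption that takes the plain consonants from the character list in section 3.5 and omits the nukta forms. Which consonants are used in practice is not specified here.

```python
# Bangla digits U+09E6..U+09EF correspond to 0..9.
BANGLA_DIGITS = str.maketrans(
    "0123456789",
    "\u09E6\u09E7\u09E8\u09E9\u09EA\u09EB\u09EC\u09ED\u09EE\u09EF",
)

# Assumed marker sequence: consonants only, vowels deliberately left out.
BANGLA_CONSONANTS = "কখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ"

def numeric_marker(n):
    """Format a list counter with Bangla digits."""
    return str(n).translate(BANGLA_DIGITS)

def alphabetic_marker(n):
    """Return the n-th consonant (1-based) as an alphabetic list marker."""
    if not 1 <= n <= len(BANGLA_CONSONANTS):
        raise ValueError("no marker defined for this position")
    return BANGLA_CONSONANTS[n - 1]

for i in (1, 2, 3):
    print(numeric_marker(i), alphabetic_marker(i))
```

The sketch raises an error beyond the last consonant rather than guessing how the sequence should continue, since that is one of the points a standard would need to settle.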

7. Collation

Collation is a means to search and order data in a way that makes sense in a particular culture. It is a myth that one collation is good enough and that, if the data is Unicode enabled, sorting is already covered. The example below compares Hindi and Marathi, which share the Devanagari script.

[Figure: comparison of Hindi and Marathi sorting order]

The difference is clear: the same bullets, sorting order and so on cannot be used for all languages sharing a script. Standards should be defined for each language and made available for use.
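The point that code point order is not a collation can be illustrated with a small sketch. It does not reproduce the Hindi and Marathi difference in the figure. Instead it assumes a fragment of the Bengali sorting order from section 3.5, in which U+09DC sorts directly after U+09A1 although its code point is much higher. The names `tailored_key` and `BENGALI_FRAGMENT` are hypothetical; a production system would more likely rely on a full collation library.

```python
# Assumed tailoring: a fragment of the Bengali order listed in section 3.5.
BENGALI_FRAGMENT = ["\u09A1", "\u09DC", "\u09A2", "\u09DD"]

def tailored_key(word, order):
    """Sort key that ranks each character by its position in `order`."""
    rank = {ch: i for i, ch in enumerate(order)}
    # Characters outside the tailoring sort after it, in code point order.
    return [(0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in word]

words = ["\u09A2", "\u09DC", "\u09A1"]

by_code_point = sorted(words)
by_tailoring = sorted(words, key=lambda w: tailored_key(w, BENGALI_FRAGMENT))

print(by_code_point)
print(by_tailoring)
```

The two results differ in where U+09DC is placed. That is the kind of per-language difference a shared, script-wide default cannot capture.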

8. Vertical arrangements of characters

Presentation / styling issues:

If a string is written in vertical mode, writing each character on a new line may not be suitable. An example of vertical arrangement of characters in Hindi:

[Figure: vertical arrangement of characters in Hindi]


An example of vertical arrangement of characters in Bengali:



কা
তা


When this issue was first discussed, there were queries as to whether Indic scripts (Devanagari etc.) are written in this fashion and whether it would be of use anywhere.

[Figure: example of vertically arranged Indic text]

9. Horizontal spacing

The same applies to horizontal spacing for Indic languages.

In English, horizontal spacing such as C E R T I F I C A T E places a space between every character. In Indian languages such as Bangla and Assamese, the space may be given not after every character but after a portion of the character sequence, as in the figure below:

[Figure: horizontal spacing in an Indian-language word]

10. Styling issues of indentation of character

Sometimes some of the characters of a word are indented; in the figure below, হা is indented.

Example in Bangla:

[Figure: indentation of হা within a Bangla word]

What should the solution or rule be for this type of styling in Indian languages? It is sometimes said that styling is done on the basis of the syllable, but what is the definition of a syllable? The definition depends on the pronunciation of the word. In the example পাল্টাহাওয়া the syllables are পাল, টা, হা, ও, য়া, but the styling is done as পা ল্টা হা ও য়া, which does not follow the syllables. So the rule should be defined explicitly rather than on a syllable basis.

The rule for Indian-language styling:

The word may be broken for styling at the following points:

a) After a consonant, consonant cluster or consonant with halant, where no vowel ligature follows the consonant or consonant cluster, e.g.
ক ল ম, ধ ন্য বা দ, উ দ্ যা প ন

b) After a consonant or consonant cluster with a vowel ligature, e.g.
কো কি ল, 􀄹া ণ, ভা র ত

c) After an independent vowel, e.g.
অ স মি য়া

d) After a consonant or consonant cluster, with or without a vowel ligature, together with chandrabindu or visarga, e.g.
চ􀆲 দ, গঁ দ, খ􀆲􀂧 দা, 􀆮ঃ খ
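The breaking rules above can be approximated by keeping every combining sign with its base letter and starting a new unit at each new base letter. The following is a rough sketch under that assumption; the names `split_units` and `space_out` and the `keep_conjuncts` switch are hypothetical. The switch exists because the examples above show both behaviours: some conjuncts are kept as one unit, while in উ দ্ যা প ন the break falls after the halant.

```python
import unicodedata

HALANT = "\u09CD"  # BENGALI SIGN VIRAMA

def split_units(word, keep_conjuncts=True):
    """Split `word` into styling units; marks stay with their base letter."""
    units = []
    for i, ch in enumerate(word):
        is_mark = unicodedata.category(ch).startswith("M")
        # Optionally let a letter that follows a halant join the same unit.
        joins_conjunct = keep_conjuncts and i > 0 and word[i - 1] == HALANT
        if units and (is_mark or joins_conjunct):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def space_out(word, keep_conjuncts=True):
    """Insert a space between styling units instead of between code points."""
    return " ".join(split_units(word, keep_conjuncts))

print(space_out("\u09AD\u09BE\u09B0\u09A4"))
print(space_out("\u0989\u09A6\u09CD\u09AF\u09BE\u09AA\u09A8", keep_conjuncts=False))
```

A rule set agreed for each language would replace the category test used here, which is only a stand-in for real script knowledge.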

11. Underlining of the characters

The following are examples from Indian languages in which matras are not readable because the characters are underlined:

Hindi - अन्य भाषाओं में भी अनुवाद
Punjabi - ਗੁਰੂ (matras are not readable)
Bengali - তাই পুরোনো আর্কাইভ একটু ওলট পালট।

Gujarati - સરદાર ગુર્જરી
Marathi - मराठी मुला मुलींची नावे
Tamil - நீரிற்குமிழி யிளமை நிறைசெல்வம்
நீரிற் சுருட்டு நெடுந்திரைகள் - நீரில்
எழுத்தாகும் யாக்கை நமரங்கா ளென்னே
வழுத்தாத தெம்பிரான் மன்று
Telugu - శ్రీశైలం ప్రాజెక్ట్‌పై TV9 ప్రోగ్రాం " డ్యామ్ ఇన్ డేంజర్ " పార్ట్ - 2

When these pages are viewed on the internet, the information is not clearly readable: when text in Indian languages is hyperlinked, some modifiers (matras) are cut off, and in Punjabi the underline coincides with a few matras (small u). This can make it difficult to read the information correctly. Some changes may therefore be required in the CSS standards developed by W3C with respect to Indian languages.

Underlining of the characters (test and snapshots)

underline_test

Internet Explorer (version 8.0.6001.18702)

underline_IE

Google Chrome (version 22.0.1229.94m)

underline_google_chrome

Mozilla Firefox (version 16.0.1)

underline_Mozilla_firefox

Opera (version 12.02)

underline_Opera

12. Overlining of the characters

When text decoration is applied as an overline to text containing special characters, issues appear in almost every Indian language. Issues were found in Hindi, Bangla, Punjabi, Malayalam, Tamil, Oriya, Gujarati and Marathi.

overline_test

Internet Explorer (version 8.0.6001.18702)

overline_IE

Google Chrome (version 22.0.1229.94m)

overline_google_chrome

Mozilla Firefox (version 16.0.1)

overline_Mozilla_firefox

Opera (version 12.02)

overline_Opera

13. Unicode Line Breaking Algorithm (UAX#14)

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. The Unicode Line Breaking Algorithm performs part of this process. Given an input text, it produces a set of positions called "break opportunities" that are appropriate points to begin a new line. The selection of actual line break positions from the set of break opportunities is not covered by the Unicode Line Breaking Algorithm, but is in the domain of higher level software with knowledge of the available width and the display size of the text.
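The division of labour described above, where the algorithm supplies break opportunities and higher-level software picks the actual breaks, can be sketched as follows. This is not an implementation of UAX#14. The sketch assumes that every space is a break opportunity and that the width of a line is its number of code points, and `choose_breaks` is a hypothetical name.

```python
def choose_breaks(text, width):
    """Greedily fill lines, breaking only at break opportunities.

    Assumptions for this sketch: every space is a break opportunity and
    the width of a line is its number of code points.
    """
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if current and len(candidate) > width:
            # The next segment does not fit: close the line at the last opportunity.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines

print(choose_breaks("one two three four", 9))
```

A segment longer than the available width is left on a line of its own and overflows, which is the situation the 'word-wrap' and 'word-break' properties address.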

'word-wrap': This CSS 3 property specifies whether the current rendered line should break if the content exceeds the boundary of the specified rendering box for an element.

Internet Explorer

wrdwrp_ie

Google Chrome

wrdwrp_crom

Mozilla Firefox

wrdwrp_moz

'word-break': This CSS 3 property specifies line-breaking behaviour within words.

Internet Explorer

wrdbrk_ie

Google Chrome

wrdbrk_crom

Mozilla Firefox

wrdbrk_moz

14. Unicode Text Segmentation (UAX#29)

Some special sentence boundaries need to be handled, such as the double poorna virama, possibly with numbers (as in Sanskrit text, shlokas etc.).

A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.
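The special sentence boundary mentioned above, a double poorna virama possibly carrying a verse number, can be sketched with a regular expression. This is an illustrative assumption rather than the UAX#29 rules: it treats U+0964 and U+0965 as terminators and treats a number enclosed between two U+0965 signs as part of the terminator. The name `split_sentences` is hypothetical.

```python
import re

# A numbered verse ending such as "॥१॥", or a single danda / double danda.
TERMINATOR = re.compile(r"\u0965\s*[\u0966-\u096F0-9]+\s*\u0965|[\u0964\u0965]")

def split_sentences(text):
    """Split text after each danda, double danda, or numbered verse ending."""
    sentences, start = [], 0
    for match in TERMINATOR.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        # Text after the last terminator is kept as an unterminated sentence.
        sentences.append(tail)
    return sentences

sample = "\u0915 \u0916\u0964 \u0917 \u0918\u0965\u0967\u0965 \u091A \u091B"
for sentence in split_sentences(sample):
    print(sentence)
```

The verse number stays attached to the sentence it closes instead of being split off as a sentence of its own.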

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection, or “move to next word” control-arrow keys), and “Whole Word Search” for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another.

Example of double-click mouse selection:

Internet Explorer

wrdbrk_ie

Google Chrome

wrdbrk_crom

Mozilla Firefox

wrdbrk_moz

15. Formatting issues

15.1 html.hi extension (HTML file with .hi extension)

A language extension on an HTML file provides primary information about the language of the document (it is obvious from the extension). This works in most browsers for non-Indic languages, such as .ja for Japanese and .fr for French. For Indian languages it works in a few browsers but not all. Behaviour should be uniform, and all browsers should be able to display files with a language extension.

15.2 Horizontal justification using CSS (HTML file)

When CSS is used for horizontal justification of an HTML page in Indic languages, it does not work in some browsers. It works well in IE 6.0 but not in Firefox 2.0.0.3.

[Figure: horizontal justification test results]

16. Some of the specific problems – Language Wise

Styling issues for Indian languages have been tested in a number of browsers. Some of the test results for the above issues are given below.

16.1 Hindi

16.1.1 Matras are not displayed properly.

[Figures: Hindi screenshots]

16.1.2 Overlining over special characters is not displayed properly.

[Figures: Hindi screenshots]

16.1.3 The first letter is not displayed properly across browsers.

[Figures: Hindi screenshots]

16.1.4 Line-through is not drawn at the correct position; it falls on either the upper or the lower part of the character.

[Figure: Hindi screenshot]

16.2 Bangla

16.2.1 Some special characters are not visible.

[Figure: Bangla screenshot]

16.2.2 Matras are not displayed properly.

[Figure: Bangla screenshot]

16.2.3 The first letter is not displayed properly across browsers.

[Figure: Bangla screenshot]

16.2.4 Some characters are displayed above all other characters.

[Figure: Bangla screenshot]

16.3 Punjabi

16.3.1 Overlining over special characters is broken.

[Figure: Punjabi screenshot]

16.3.2 Matras are not displayed properly.

[Figure: Punjabi screenshot]

16.3.3 The first letter is not displayed properly across browsers.

[Figure: Punjabi screenshot]

16.4 Kannada

16.4.1 Overlining over special characters is not displayed properly.

[Figure: Kannada screenshot]

16.4.2 Underlining under special characters is not displayed properly.

[Figure: Kannada screenshot]

16.5 Tamil

16.5.1 Matras are not displayed properly.

[Figure: Tamil screenshot]

16.5.2 Overlining of characters is not displayed properly.

[Figure: Tamil screenshot]

17. Consolidated Summary

A brief summary of this document is given below.

[Figures: summary]

18. References




19. Annexure I. Candidate CSS 3 properties for Internationalization

Name
     Values

':first-letter' (CSS selector)
     The following properties apply to the "first-letter" pseudo-element:
       • font properties
       • text-decoration
       • margin properties
       • vertical-align (only if "float" is "none")
       • text-transform
       • line-height

'list-style-type'
     disc | circle | square | decimal | decimal-leading-zero | lower-roman | upper-roman | lower-greek | lower-latin | upper-latin | armenian | georgian | lower-alpha | upper-alpha | none | inherit

'text-decoration'
     none | [ underline || overline || line-through || blink ] | inherit

'text-indent'
     <length> | <percentage> | inherit

'text-align'
     left | right | center | justify | inherit

'word-break'
     normal | keep-all | break-all

'word-wrap'
     normal | break-word