Indian languages and language technology

PRIYANKOO SARMAH


THE way Indians communicate and access information is changing, and neither the rural nor the urban population of the country has been left untouched by it. Currently, the number of smartphone users in India is about 700 million, and it is estimated to reach about a billion by 2025. As the use of smartphones increases in the country, so do the digital services that are offered as part of the smartphone ecology.

Evidence of the changing ecology was seen in the teaching and learning environment that arose due to the coronavirus pandemic of 2019-20. During the pandemic, learners had to adapt to virtual learning, and teachers had to make themselves and their materials smartphone ready. Both groups needed to become digital or smartphone worthy. This caused many to wonder whether we are witnessing a glimpse of the future of education and society. There was also a concern that these ‘smart’ devices somehow make us lose our individuality and make us behave in a predictable, uniform way to suit the technologies we are using. Naturally, a larger question emerges: are we going to lose out to homogeneity and lose our individual cultures and languages in the digital world?

India is an incredibly complex country with numerous languages and cultures. In the 2011 Census of India, participants reported speaking 19,569 mother tongues. Of course, after scrutiny, this number was reduced to 1369 mother tongues and 1474 ‘unclassified’ mother tongues. While some of these languages and cultures are thriving and are relatively well known, many others, in the nooks and corners of the country, struggle to survive for another generation. Nevertheless, there are hardly any cases where people have abandoned their language of their own will. In all cases, people are seen to be emotionally attached to their own languages and cultures. Therefore, when a piece of information is communicated to people in their own language or mother tongue, it goes beyond serving the intended purpose of the message and becomes a vehicle for emotional connection.

The business world has already recognized the ‘trust’ a person has in information provided in her mother tongue. Several studies on consumer behaviour have shown that people are more likely to trust information when it is provided in their mother tongue. About 68% of internet users in India consider digital content to be more credible when presented in the local language. Hence, in trade and commerce the importance of localization and multilingualism is widely acknowledged. Nonetheless, the question at the end of the day is whether all languages will be given equal importance and patronage by the business world.

 

In the last decade, the way people search for information has changed. Technology used in searching and accessing information on the internet has taken an unexpected but more convenient turn. Worldwide, 7 out of every 10 search engine users prefer searching for information through voice rather than by typing text. It is assumed that very soon voice search will completely replace search by typing text. Moreover, the interactions themselves will be voice interactions in both directions, limiting the need to write or read text, even in the major languages. In India, 28% of search engine users are using voice search, and every year the number of Indians using voice search is increasing exponentially. At the same time, many of these voice searches are not in English but in Indian languages, which has made Hindi the second most popular language worldwide in terms of voice searching.

Multilingual voice search raises several concerns, such as voice based privacy issues and the relegation of the written forms of languages to a secondary medium of interaction on web enabled platforms. However, it does seem like a great idea for languages that do not have a script and for people who lack the ability to read and write the language they speak. It seems like a favourable proposition for Indian users, many of whom are proficient speakers of their mother tongue but are unable to write or read it, either because their formal education was in a language other than their mother tongue or simply because their mother tongues are not written. This seems true for both under-resourced and well-resourced languages in the Indian context.

Building the technology behind a tool such as voice search is time and cost intensive, which leads technology companies to prioritise, not surprisingly, languages that have more revenue potential. Apart from revenue generation, there are other constraints on building such technologies. The speech and text resources required to build speech technologies are voluminous. To put things into perspective, an automatic speech recognition (ASR) system built for English by the Chinese technology giant Baidu used about 11,940 hours of English speech data along with text transcriptions. The computational power required to process such an amount of data is exorbitantly costly.

 

The resulting Baidu ASR system, when tested on an evaluation speech database popularly known as the Wall Street Journal Evaluation Set of 1992, yielded an accuracy of 96.9%. In such a test, accuracy is calculated from how many words the ASR system recognizes correctly. Interestingly, when human subjects were asked to listen to the same speech, they made more mistakes in recognizing the words than the deep neural network (DNN) based speech recognition system. Human accuracy in recognizing the words was 94.97%, about two percentage points lower than that of the DNN based system. Hence, in several tasks involving speech or image classification, deep neural network based systems can perform better than humans, provided the systems are trained on large amounts of data.
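To make the idea of word level accuracy concrete, here is a minimal sketch in Python. It assumes the common convention of scoring a recognizer by word error rate (the number of substituted, deleted and inserted words divided by the number of words in the reference transcript) and reporting accuracy as one minus that rate. The function names and the sample sentences are hypothetical illustrations, not the evaluation code or data used for the Baidu system.

```python
def word_errors(reference, hypothesis):
    """Return (errors, reference_word_count) for one utterance.

    Errors are the minimum number of word substitutions, deletions and
    insertions needed to turn the reference transcript into the
    recognizer's output (word level edit distance).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions are needed to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(pairs):
    """Accuracy over (reference, hypothesis) pairs, as 1 - word error rate."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, words = word_errors(reference, hypothesis)
        total_errors += errors
        total_words += words
    if total_words == 0:
        raise ValueError("reference transcripts contain no words")
    return 1 - total_errors / total_words


# Hypothetical example: one word dropped ("for"), one misrecognized ("rise").
sample = [("the farmer asked for the price of rice",
           "the farmer asked the price of rise")]
print(f"word accuracy: {word_accuracy(sample):.2%}")
```

Under this convention, an accuracy of 96.9% corresponds to roughly three word errors for every hundred words spoken, and the human figure of 94.97% to roughly five.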

For most Indian languages, speech and text resources on such a scale are unavailable. Hence, even though languages such as Bangla or Kannada have voice search abilities on several popular internet search engines, languages such as Assamese or Bodo, in spite of being among the 22 scheduled languages and having sizable populations, are not likely to find themselves in the league of voice search enabled languages very soon. With the inevitable growth in voice based technologies in the next few years, the speakers of Assamese or Bodo will possibly perform voice searches in one of the few languages these technologies are available in. In other words, there will probably be a new cyber lingua franca for many of these smaller, low-resource languages, pushing them to the brink of marginalization in the digital world. Without relevant language technology for these minor or low-resource languages, speakers may begin switching to the major languages and start considering their own languages less useful.

 

I strongly believe that these divides created by technology may also be bridged by technology itself. Over the last few years, the problem of the digital divide in language technology development has been taken up with utmost seriousness all around the world. Several research paradigms have developed which aid language technology development in the absence of large speech and text databases. Methodologies such as transfer learning, low-resource language technology development, no-resource language technology development and zero-shot learning all aim at developing language technologies for low or zero resource languages.

The top 10 languages used on the internet account for 77% of the users, whereas all the other languages in the world account for only the remaining 23%. This inequality in the digital world was addressed at a conference in Paris organized by UNESCO in December 2019. Language Technology for All (LT4All), organized as part of UNESCO’s celebration of 2019 as the International Year of Indigenous Languages, was aimed at encouraging a truly multilingual internet and multilingual language technologies, with a special focus on indigenous languages.

The use of language technology built with local languages has been very effective in achieving last mile connectivity in several domains, such as public health and agriculture. One example of effective multilingual language technology implementation is from Africa. Viamo, formerly known as VOTO Mobile, started in 2012 and implemented Interactive Voice Response Systems (IVRS) in various African languages to disseminate public health information. It is to be noted that such systems needed a basic phone connection, but did not require any internet connection. However, the language technology working in the background to recognize the voice inputs of the users was as sophisticated as that mentioned earlier for voice user interfaces.

 

Similarly, in India, the Technology Development for Indian Languages initiative of the Ministry of Information Technology has funded the Mandi project. A farmer can dial a phone number and receive information about the prices of agricultural commodities in the local markets, along with weather information for the farmer’s district. The farmer can speak into the system in any of nine scheduled Indian languages (namely, Assamese, Bengali, Hindi, Gujarati, Marathi, Tamil, Telugu, Kannada and Oriya) and receive automatically spoken information in the same language.

Viamo and the Mandi project are two examples where language technology has made information accessible to the end users with the use of a simple feature phone and without the need for an internet connection. The technology built in the Mandi project is extensively used in the Sufal Bangla Agri Price Information Service, a joint initiative of the Government of West Bengal and CDAC Kolkata. I am told that this system receives an average of 450 queries every day.

 

In the Indian scenario, language technology development is further complicated by the fact that India has considerable linguistic diversity and, at the same time, too little is known about many of its languages. The official language reports in India, which divide languages into several categories such as scheduled languages, non-scheduled languages and mother tongues, are themselves a hindrance to the uniform development of language technology across all Indian languages. Funds for language technology development are usually allocated to the scheduled languages, while the non-scheduled languages and mother tongues are often left unfunded.

However, the recent National Education Policy (NEP) of India seems to treat the linguistic diversity of India differently, moving away from the previously established hierarchies. The policy insists on imparting children’s education in their home or mother languages. It also emphasizes the need to create quality text resources for teaching in the home or mother languages. The NEP’s emphasis on education in mother tongues will also lead to resource creation in the minority languages of India, which may finally be the stepping stone for the creation of digital resources for these languages. Here, the state offices of the State Council of Educational Research and Training (SCERT) can play a pivotal role.

 

The SCERTs in the states are responsible for developing textbooks for the various languages spoken in each state. A more streamlined and systematic effort is needed, in mission mode, to identify the linguistic communities in each state that need educational resources and to facilitate community initiatives in developing them.

One striking feature of almost all the linguistic communities in India is the existence of literary organizations. At least from my experience in North East India, I have observed that a linguistic community usually has a literary society that takes a call on how the language will be promoted and used in the community. While doing so, these organizations ensure community participation and chalk out plans for the promotion of literacy in their own languages with their own writing systems. I strongly believe that these organizations have a key role to play in informing policymakers about their desire to get support for the creation of necessary resources in their languages. If these organizations can facilitate community driven creation of text and speech resources, those resources can be shared for further technology development. The outcomes of such initiatives will be beneficial for both the potential users and the stakeholders.

The digital divide is very much prevalent on the internet, and most languages in the world are not well represented in cyberspace. However, this may not be the only reason why a language could become extinct. There are more potent reasons why languages could become endangered and even extinct, such as political reasons, unfavourable language policies and pressure from a dominant lingua franca. The adoption of a lingua franca, in the Indian scenario, determines whether the users of minority languages will keep using their original language or not.

For example, though English has been popular in India, its role is restricted to that of a language of knowledge. Hence, English may not threaten local languages and cultures. On the contrary, the minority languages in India may actually be threatened by the major Indian languages spoken in their vicinity, which are used as lingua francas and may displace minority languages in their speakers’ daily use. Some of these major languages may serve only as a cyber lingua franca, with their use restricted to information retrieval from the internet.

The future digital world is likely to be dominated by visual and auditory media. While the population with internet access and smartphones will access information using voice user interfaces, those without will possibly access information through alternative arrangements, such as connecting to the same information server through a simple phone. However, whether such technology will be developed uniformly for all languages will depend on several factors. At the level of government policy, changes are needed to do away with hierarchies that prioritize technology development in only a handful of languages. A national level policy on language technology development with exact milestones will also help achieve the goal of resource generation mentioned in the NEP.

 

At the local level, community language organizations need to upgrade their way of functioning and prepare exhaustive language documentation. If needed, such organizations can approach experts who can advise them on resource creation and technology development. These organizations can also approach local governmental bodies to make them aware of the need to incorporate language technology development programmes for their communities. Such community based spoken and textual resource generation in digital form will be helpful in creating textbooks and other educational materials to be used by the next generation of speakers. Moreover, many linguists will vouch that resource creation efforts become unsustainable after a period if there is no continuous and active participation from the linguistic community.

 

History can testify that any technology introduced without thoughtful planning and implementation may be disadvantageous for a section of society. Language technology may not be able to make a minority language go extinct. Nevertheless, it may push minority languages to the brink of marginalization. Such marginalization will prevent members of these communities from accessing crucial information on health, education and civil and legal rights in their own language. In such situations, community level initiatives, empowered by policy level decisions, may be the best way for these languages to survive in the digital world without being digitally marginalized.

 

References:

D. Amodei, et al., ‘Deep Speech 2: End-to-end Speech Recognition in English and Mandarin’, Proceedings of Machine Learning Research 48, June 2016, pp. 173-182.

Gilles Adda, et al., Proceedings of the 1st International Conference on Language Technologies for All. European Language Resources Association (ELRA), Paris, 2019.

India Cellular and Electronics Association (ICEA), Study Report on Contribution of Smart Phones to Digital Governance in India. ICEA, New Delhi, 2020.

S.L. Chelliah, Why Language Documentation Matters. Springer Nature, Switzerland, 2021.

Office of the Registrar General, Paper 1 of 2018: Language. Census of India, 2018. Available online at: https://censusindia.gov.in/2011Census/Language_MTs.html, last accessed on 20 May 2021.

Lara D. Vogel, et al., ‘A Mobile-based Healthcare Utilization Assessment in Rural Ghana’, Procedia Engineering 159, September 2016, pp. 366-368.
