Preserving linguistic diversity


THE linguistic diversity of India is one of its crowning glories. We face a challenge in preserving this diversity because the languages in the four core language families, Indo-Aryan, Dravidian, Tibeto-Burman, and Austro-Asiatic, are not equally resourced. Most widely resourced are English, Sanskrit, and the 22 languages on the 8th Schedule of the Indian Constitution.

But from here, resources drop off steeply for the dialects of scheduled languages, non-scheduled languages, their dialects, and the hundreds of languages that have fewer than 10,000 speakers. For these languages, there may be little in print, no newspapers or journals, little to nothing taught at schools or studied at universities in or about the language, and only the lightest digital footprint, if any. And indeed, many of these languages lack a writing system. Languages that are low-resourced are in danger of lowered vitality and eventual loss.

Only a small percentage of the approximately 6000 languages of the world have entered the digital world and are used in multiple domains of interaction: in the family, in school, at work, and so on. Hundreds of other languages are being lost as transmission from elders to children is disrupted through the breakdown of traditional societal structures. This disruption comes in many guises, including migration for economic reasons. Think of speakers of Sora from Odisha moving to Delhi to find work – children of the second- and third-generation diaspora community may speak some Sora, but will most likely have lost the language associated with rituals and feasts, since traditional rituals and feasts are most likely eschewed in Delhi as children assimilate to Delhi culture, Delhi language, and Delhi ways of being.

Languages can stop being spoken in situ as well, as elders encourage their children to use a majority language because of its higher prestige and as a pathway to economic security. We need only look at the thousands of English language shops in Delhi to see the power of language prestige.

When a low-resource language is no longer spoken, because there is little recorded about the language in audio, video, or text, that language can be silenced forever. A question we may well ask is, so what? Why is it of interest that less than 5% of India’s languages are well resourced? The answer to this question goes to the heart of a new subdiscipline of linguistics called Documentary Linguistics. Documentary linguists take up the challenge of preserving linguistic diversity, working side by side with language communities to stem the tide of language loss. These language documenters recognize the urgency in preserving language information because it holds irreplaceable knowledge including information on cultural practices, community history, weather patterns, flora and fauna, and linguistic complexity.


When a language is lost, the connection of communities with their past histories and culture is broken. It appears that such connections are needed for the mental health and societal well-being of the community, especially for the young. In fact, it has been observed that preserving and celebrating mother tongues can bring healing and renewed confidence. Linguists find that language information from smaller languages is often needed to build an accurate picture of language pre-history and development. In addition, the structures of smaller languages can be novel and complex, and can add to our theories of the possibilities and limits of human language. Finally, languages spoken by communities with close connections to the land hold a wealth of information about that land, and environmentalists and agriculturalists are now seeing the value of partnering with indigenous communities to better understand the environment, weather, and land.

Documentary linguists and, increasingly, as I discuss below, community language documenters, use technologies and traditional linguistic analytics to create preservable audio and video recordings, transcriptions, and translations of language samples. In doing so, they create a record of languages that can be easily accessed and used for language learning, culture and language dissemination, linguistics, and other sciences. Access is made possible through archiving in open-access language archives with high-quality metadata. The documentary linguist does not produce materials for consumption by just linguists. Rather, the products of language documentation are meant to have broad and meaningful societal impact.


I now turn to the specifics of language documentation in India. The first thing to note is that the urgency and desire to document under-resourced languages come from communities of language users themselves. As I detail in Why Language Documentation Matters (Springer 2021), many community documenters are linguists, but many are not. Let me highlight a few cases here to show the reader that language preservation and revitalization in India is a movement with momentum and not an occasional activity.

Mosyel Syelsaangthyel Khaling from the Uipo (Khoibu) tribe of Manipur has worked since the age of 17 to collect the remembrances of Uipo elders – to record Uipo traditional stories and other speech events. Khaling is currently a student of linguistics and supports younger Uipo scholars, who are studying linguistics so they can create documentation of Uipo. Consider also Chikari Tisso, a speaker of Karbi, which is spoken in Assam, Meghalaya, and Arunachal Pradesh. Tisso has published books and articles in and about Karbi and created audio and video documentation recordings of speech events for Karbi, many of which have been transcribed and archived.1 Tisso has worked with linguists to create a grammar of Karbi and is now compiling Karbi songs and lullabies. The items collected by Khaling and Tisso are of great value in unpacking the linguistic structure of Uipo and Karbi. Each additional speech sample adds to the picture of the sound system (phonology), word structure (morphology), sentence structure (syntax), and word meaning (semantics). Each time we learn more about these languages, we can confirm our theories of how languages work or see how we must revise them. There would be no progress in language science without the additional language samples from languages like Uipo and Karbi. No language is so small that it cannot have an impact on language science.


Sometimes, in our pride of status, we ignore the complexity that resides in languages spoken by smaller groups. Consider, for example, the way that verbs are conjugated in Lamkang, a language of the South Central subgroup of the Tibeto-Burman language family spoken by somewhere between 5 and 10 thousand speakers in the Chandel district of Manipur state. It took me and my team several years to unpack the Lamkang verb to discover how speakers indicate who did what to whom. It is done using prefixes and suffixes which pattern differently depending on whether the situation being talked about is in the past or future. In fact, the very shape of the verb stem changes depending on whether the event is in the past or future (think of English ‘be’ versus ‘is’). These and other facts about Lamkang can be found in the journal Himalayan Linguistics.

In addition to community language documenters, we have several high-profile linguists and interested academics engaged in language documentation. Ganesh Devi, for example, has been the conceptual and practical lead for the People’s Linguistic Survey of India, which is organized under the nonprofit Bhasha Research and Publication Centre. Devi worked with 3500 volunteers (linguists, historians, native speakers, speakers of related languages) to collect information on languages state by state. Kavita Rastogi founded the Society for Endangered and Lesser-known Languages to revitalize endangered languages by encouraging language documentation through the creation of grammars, dictionaries, and material for teaching the language in formal and informal classroom settings. The society also works to build capacity for language documentation through training workshops.

Finally, we point to the work of field linguists such as Anvita Abbi, who has, along with her students, created substantial descriptions of language samples, dictionaries, and teaching materials for many languages of India.


There is also support from the Indian government for the documentation of less-resourced languages. For example, a Centre for Oral and Tribal Literature was recently added to the Sahitya Akademi organization. The Indira Gandhi National Centre for the Arts (IGNCA) has sponsored Language Documentation and Archiving workshops. The Office of the Registrar General and Census Commissioner, through its Language Division, oversees the Linguistic Survey of India (LSI) to create databases of demographic and linguistic information on languages. The Ministry of Human Resource Development funded the Scheme for Protection and Preservation of Endangered Languages (SPPEL). The University Grants Commission funded centres for endangered languages at major universities such as Tezpur University in Assam and the University of Hyderabad in Telangana.


So, there is plenty of goodwill, interest, and even money supporting the preservation of linguistic diversity in India. To truly move the dial from interest to action and sustainable results, language documenters are supplementing traditional pen and paper records of how a language sounds, what words are in the language, and how people talk to each other with audio and video recordings. These audio and video recordings provide primary data for further analysis such as instrumental acoustic analysis. Audio and video also allow for an exponential increase in the quality and quantity of language samples used for language analysis, and this has made language documentation of interest to computational linguists interested in machine learning with smaller datasets. Digital language collections are also of interest to information professionals as they explore best practices for organizing digital files in digital language archives.

Digital language data is needed for maximum social impact of language documentation. Why? Because languages with a digital footprint have a higher chance of survival than languages with no digital presence. When only a few speakers remain and memories of words and language have faded, speakers can reclaim and revitalize languages from digital sources. Digital language data is also needed for improved linguistic analysis. Why? Because digital data allows for a deep dive into structure, including instrumental analysis and the creation of testable and falsifiable analyses.

So, this is the aspirational goal for language documentation in India: to make the curation of digital language data a habit, and open access archives a reality. We already see some success in this area with the Sikkim-Darjeeling Himalayas Endangered Languages Archive (SiDELHA), the first open access language archive in India using international archiving standards.2 UNESCO has affirmed 2019 as the start of an international decade of Indigenous languages and, by doing so, has renewed a long-standing clarion call that language rights are a core aspect of human rights. As we continue to respect and affirm the human rights of all populations in India, we create pathways for language preservation that respect links to tradition, family, land, and ways of being. It is a curious juxtaposition, but so accurate, that to access tradition, we must rely on technology.


Let me end this essay with an observation. It seems to me that the road to preserving linguistic diversity in India cannot be a top-down one. There are simply too many languages and too many disparate ecologies of language use for one methodology and one set of scientists to effectively gather language samples.

There is a groundswell of interest from young speakers of indigenous languages who are seeking training to document their languages. With appropriate capacity building, respecting individual students’ ambitions to be successful linguists, information technologists, and computational linguists, we can work with India’s upcoming students to document many of these languages. The approach would result in a win-win: language documentation created by speakers, for speakers, and a generation of linguistic experts creating digital records of the languages we so collectively treasure.



1. Karbi Language Resource on the Computational Resource of South Asian Languages Archive: