Build, Curate and Explore a massive dataset of public content in indigenous languages. The objective is to identify and enumerate data sources from which content in an indigenous language can be retrieved, and to create an open archive that can be leveraged in a variety of activities, including training translation models to promote national languages and building voice synthesizers to help distribute news content to illiterate citizens.
Initiate a research roadmap on translation and voice synthesis to promote indigenous languages through content sharing. Preserving indigenous languages is a challenging endeavor that first requires closing the information gap that may exist between official (mainly colonial) languages and indigenous languages. For example, news content is abundant in official languages, while rural areas receive only sparse summaries in indigenous languages. Artificial intelligence can help close the gap through automatic translation of texts and voice synthesis (to account for illiteracy). The project will initiate a survey of the state of the art, identifying which components are available and which are still missing for realizing this endeavor.
The long-term vision of the preservation project is to ensure that indigenous languages, and hence indigenous cultures, are sustained. To that end, the project investigates:
- the means to systematize the collection and archiving of content. This will ensure that all data are made openly available in readily processable formats and through a single repository endpoint.
- the opportunity to perform automatic translation to ensure a back-and-forth exchange of viewpoints in official and indigenous languages
- the democratization of information, extending it from the elite to rural citizens who speak only indigenous languages.
This last point is the ultimate goal: preserving indigenous languages by closing the information gap, and thereby realizing one objective of open data, namely increasing democratic participation through information.