Bengali.AI: Democratizing Bengali Language Technology, 71 years after 1952 

  • January 5, 2023

What is Bengali.AI? 

Bengali.AI is a non-profit research initiative established to democratize Bengali language technology research. Formed in December 2017 by a group of graduates from BUET, KUET, and BRACU, Bengali.AI aims to accelerate Bengali language research by creating large, standardized, open-source datasets, the absence of which has been a major bottleneck for Bangla research. Bengali.AI operates with a two-pronged approach. First, they crowdsource language data through community-driven campaigns and curate the data with rigorous validation standards. Second, they crowdsource algorithms and solutions built on those datasets through international competitions. For example, in 2019, the Bengali.AI team crowdsourced and curated a dataset of over 500,000 handwritten Bengali graphemes. Following that, in 2020, they launched an international Kaggle competition in collaboration with Google. Over 2,000 teams from all over the world took part in this competition, regardless of their native tongue. These teams included some of the biggest names in artificial intelligence, such as NVIDIA and H2O.ai, all joining forces to build algorithms for Bengali optical character recognition.

Current projects: 

Bengali.AI is asking people to submit their voice samples for a public-domain research dataset. They are running a campaign called ‘৫২ এর ৭১ বছর পূর্তি’ (roughly, "The 71st Anniversary of '52"). Since February 21st of this year, they have accumulated around 2,000 hours of speech data from 23,000 users through online campaigns, a major milestone that took English a significant amount of time to achieve. The objective of this campaign is to assist in the development of powerful, publicly available automatic speech recognition systems, paving the way for our own Bengali Siri or Alexa.

The team is also working towards a domain study on the diversity of Bangladeshi Sign Language (BdSL). They aim to target 20 major regions in Bangladesh and survey sign language users from these regions. This will provide policymakers with much-needed context for sign language dissemination in the country.

On the global stage, Bengali.AI will be co-hosting the first-ever Bangla NLP workshop at arguably the largest NLP conference in the world: Empirical Methods in Natural Language Processing (EMNLP) 2023. Bangla NLP practitioners and enthusiasts will be invited to present their research in Singapore and to participate in research competitions on Bangla language technology.

The importance of working with Bengali:

Bengali is one of the world’s richest languages, yet existing technology for it is often lacking, earning Bengali the unfortunate tag of a low-resource language. Moreover, there is a large demographic for whom access to technology is limited by a lack of proficiency in English. Bengali.AI dreams of bridging this gap and making technology accessible to everyone in Bangladesh via Bengali voice, text, and signs.