
Preserving the Estonian language with open source

Preserving Estonian with open source

Published on: 23/08/2023 News
Estonia open source data portal logo
Source: https://avaandmed.eesti.ee/

Estonia's newly launched "Donate a speech" campaign, which aims to improve language-processing technology for Estonian, is both hosted on free and open source software and intended to advance the speech-processing capabilities of the country's free and open source AI virtual assistant, Bürokratt (see OSOR's 2022 case study).

The project is organised as part of the Estonian Language Strategy 2021-2035. Its first step is to gather speech samples from as many people as possible through Estonia's Open Data Portal. The speech materials are stored in a database that is itself free and open source, which highlights how FOSS enables social improvements by providing software building blocks for a wide variety of purposes. Over 100 hours of speech have already been collected.

These samples will be used to improve the speech recognition of the Bürokratt virtual assistant, making Estonia's e-government services accessible to more people. The project also aims to help preserve the language by recording how it is spoken, both now and as time goes by.

The collected speech materials will also be made available through the Open Data Portal, with the aim of supporting both public and private sector software development.

The materials are expected to support automatic subtitles for public and private broadcasters, meeting transcriptions, voice-controlled software, service-driven phone calls, and general services for people with hearing impairments. It is hoped that this will help native and non-native speakers alike.

While speech production can focus on a single pronunciation, speech recognition must deal with the complexities of accents and dialects as well as unclear speech and speech impediments. The initial goal is to increase recognition accuracy for spontaneous speech from 85% to 91%, but those working on the project also note that language and pronunciation are constantly evolving, so staying accurate will require continuous work.
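
To put the 85% to 91% target in perspective, the short Python sketch below works out how many recognition errors would disappear if it is met. It assumes the percentages refer to the share of spontaneous speech recognised correctly; the campaign does not specify the exact metric, and the function name is purely illustrative.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Return the fraction of the original errors removed when accuracy rises.

    Assumption: accuracy is the share of speech recognised correctly,
    so the error rate is simply 1 - accuracy.
    """
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# Figures quoted in the article: 85% today, 91% as the initial goal.
reduction = relative_error_reduction(0.85, 0.91)
print(f"Errors removed: {reduction:.0%}")
```

Under that assumption, a six-point gain in accuracy means the error rate falls from 15% to 9%, which is a considerably larger step than the headline figures suggest.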