Linguists and artificial intelligence specialists are embarking on an ambitious programme designed to help Scottish Gaelic flourish in the digital age.
Experts at the Universities of Edinburgh and Glasgow have been awarded £225,000 by the Scottish Government to produce a Gaelic subtitling system suitable for the BBC.
Funding will also enable the team to start working towards production of a large language model – similar to ChatGPT – for Scottish Gaelic speakers.
Efforts to create the system are part of a wider initiative to counter the threat of digital extinction faced by Scottish Gaelic and other minority languages.
The phenomenon occurs when speakers of a language are unable to participate in digital communication because of inadequate language technology.
The research team is also helping to develop a speech recognition system for Ojibwe, one of the indigenous languages of Canada.
Gaelic faces significant obstacles in developing and maintaining its language technology because of a scarcity of data.
AI programs are typically trained on large data sets, which are often scraped from the internet and then fine-tuned using human feedback.
Researchers will assemble a large body of Gaelic language data and use it to generate a high-quality automatic speech recognition (ASR) system for media, education and research.
The project will provide desperately needed Gaelic subtitling technology and jump-start the development of state-of-the-art Gaelic language models.
Researchers say this will help to safeguard the language in digital domains and contribute substantially to national revitalisation efforts.
So-called deep learning approaches are ravenous for training data, and for minority languages like Gaelic, the lack of data at scale is a significant obstacle.
The project aims to tackle this obstacle head-on, by generating a substantial body of colloquial Gaelic training data.
The team achieved a significant milestone in 2021 by developing the first publicly available speech recognition system for Gaelic.
Although it requires additional development to improve its accuracy, the technology has already helped to create subtitles for teaching videos.
A number of existing digital projects will lay the foundation for the knowledge base needed to support the scaling-up effort proposed here.
Among them are 15,000 pages of transcribed Gaelic narrative sourced from the School of Scottish Studies Archives, based at the University of Edinburgh.
The team will also access material from the Digital Archive of Scottish Gaelic (DASG). This will include some 30 million words of text from the University of Glasgow’s Corpas na Gàidhlig and vernacular recordings from the DASG’s Cluas ri Claisneachd audio archive.
Lead researcher Professor William Lamb, of the University of Edinburgh’s School of Literatures, Languages and Cultures, said: “This is about compiling large amounts of knowledge – gleaned from Gaelic speakers in the past – and returning it to Gaelic speakers, in various forms, in the present.”
Fellow researcher Professor Roibeard Ó Maolalaigh, of the University of Glasgow – who is also DASG Director – said: “This will add substantially to the development of language technology for Gaelic. It is gratifying that DASG’s resources are being deployed in this way and being further developed.”
Education Secretary Jenny Gilruth said: “The Scottish Government is proud to support this cutting-edge project, which will help Gaelic to thrive in the digital age and safeguard our country’s rich linguistic and cultural heritage.”
Also taking part are two other University of Edinburgh researchers – Dr Beatrice Alex, a senior lecturer in text mining, and Dr Peter Bell, a reader in speech technology.
The project is being carried out in tandem with BBC Alba. It also involves DASG, the historical dictionary Faclair na Gàidhlig, Gaelic media service MG ALBA and Tobar an Dualchais/Kist o Riches – a unique online record of Scotland’s rich oral heritage.