As the use of technology continues to grow, Machine Learning and Artificial Intelligence are increasingly being used to solve problems in many areas, including education, health and even communication.
Kathleen Siminyu, a machine learning fellow and NLP (Natural Language Processing) researcher at Mozilla Foundation, is working on an interesting project at Mozilla called Common Voice, a data set platform that enables language communities to build language data sets.
A data set for speech recognition is essentially a collection of texts, each accompanied by an audio recording of what the text says. This is the data you feed into a machine learning algorithm so that it can learn to transcribe speech into text in Kiswahili or any other language. From it, the algorithm learns to map words to their sounds, which are broken down into smaller units of speech. A good example is the captions you see on TV in a language different from the one being spoken, or the auto-generated captions in teleconferencing tools like Zoom.
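To make the idea of text paired with audio concrete, here is a minimal Python sketch of how such pairs might be represented before training. The class name, field names, file paths and example sentences are illustrative assumptions and are not taken from Common Voice itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechSample:
    """One training example: an audio clip and the text spoken in it."""
    audio_path: str  # hypothetical path to a recording
    sentence: str    # the text that the speaker read aloud


def load_samples(rows):
    """Build samples from (audio_path, sentence) pairs.

    Pairs with a missing path or an empty sentence are skipped,
    because a model needs both halves to learn from.
    """
    samples = []
    for audio_path, sentence in rows:
        if audio_path and sentence.strip():
            samples.append(SpeechSample(audio_path, sentence.strip()))
    return samples


# Hypothetical rows; the paths and sentences are made up for illustration.
rows = [
    ("clips/sample_0001.mp3", "Habari ya asubuhi"),
    ("clips/sample_0002.mp3", "Karibu nyumbani"),
    ("clips/sample_0003.mp3", "   "),  # no usable text, so it is skipped
]

samples = load_samples(rows)
for sample in samples:
    print(sample.audio_path, "->", sample.sentence)
```

A speech recognition model would then be trained on many such pairs, with the audio as input and the sentence as the expected output.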
More about Common Voice
Kathleen explains that the Kiswahili Common Voice data set supports the development of speech transcription models that can be used in domains such as agriculture and finance. These models help ensure that more languages, including their various dialects, are represented in those domains so that everyone can understand.
She is currently working on ensuring that the diversity of Kiswahili speakers, in terms of age, gender, accent and language variant or dialect, is catered for in the data set and the models created from it.
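One simple way to see whether a data set covers that diversity is to count how the recordings are spread across each speaker attribute. The sketch below is only an illustration under stated assumptions: the record layout, the field names and the placeholder labels are hypothetical, not the project's actual schema or tooling.

```python
from collections import Counter


def coverage(records, field):
    """Return the share of recordings for each value of a speaker attribute.

    Missing or empty values are grouped under "unspecified" so that gaps
    in the metadata stay visible instead of being silently dropped.
    """
    counts = Counter(record.get(field) or "unspecified" for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical metadata for four recordings; all labels are placeholders.
records = [
    {"age": "twenties", "gender": "female", "variant": "variant_a"},
    {"age": "thirties", "gender": "female", "variant": "variant_b"},
    {"age": "twenties", "gender": "male", "variant": "variant_a"},
    {"age": "fifties", "gender": None, "variant": "variant_a"},
]

for field in ("age", "gender", "variant"):
    print(field, coverage(records, field))
```

A group with a very small share would suggest that more recordings from those speakers are needed before the resulting models can serve them well.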
Why is this so important?
Well, the main reason is how diverse we are, not just in Kenya but across Africa. Even within a single culture or tribe, a language can have several dialects, so a word or term can mean different things depending on the dialect. This is why these data sets are vital: they help ensure that the correct message reaches everyone.
She is particularly excited to be working on this project with Mozilla Foundation, a non-profit, because the resource is open to everyone who wants to use it to better their communities, and it will help more developers apply these data sets in sectors that benefit those communities. Furthermore, since the resource benefits everyone, building it requires the collective effort of communities to help collect the data across different languages and dialects.