ML for Translating Dysarthria Speech (Pre-Part 1)
What is Dysarthria?
Per the Mayo Clinic, dysarthria occurs when the muscles you use for speech are weak or you have difficulty controlling them. Dysarthria often causes slurred or slow speech that can be difficult to understand.
Dysarthria typically develops as a result of another underlying condition, such as:
- Amyotrophic lateral sclerosis (ALS, or Lou Gehrig’s disease)
- Brain injury
- Brain tumor
- Cerebral palsy
- Guillain-Barre syndrome
- Head injury
- Huntington’s disease
- Lyme disease
- Multiple sclerosis
- Muscular dystrophy
- Myasthenia gravis
- Parkinson’s disease
- Wilson’s disease
This list is not meant to be exhaustive but to indicate the most common reasons why an individual develops dysarthria.
I hope to create a workflow that listens to an individual’s voice and trains a machine learning model that is unique to them. This model would receive voice data and produce text output. Google has already done something very similar with former NFL player Tim Shaw, who announced he had ALS back in 2014. Google recently helped develop an application that listens to Tim’s voice and produces text for others to read. Along with this, they used audio recordings to reproduce his original voice.
I hope to take that research one step further by introducing speech translation from one language to another prior to displaying the translated speech text.
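To make the intended flow concrete, here is a minimal sketch of the two stages described above: a personalized speech-to-text step followed by a translation step. Everything in it is an assumption for illustration. The class name, the language codes, and the two placeholder functions are hypothetical and stand in for models or APIs that have not been chosen yet.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type aliases: a transcriber maps audio samples to text,
# a translator maps (text, source language, target language) to text.
Transcriber = Callable[[Sequence[float]], str]
Translator = Callable[[str, str, str], str]


@dataclass
class SpeechTranslationPipeline:
    """Chains a speaker-specific transcriber with a text translator."""
    transcribe: Transcriber
    translate: Translator
    source_lang: str = "fr-CA"  # assumed code for Canadian French
    target_lang: str = "en"

    def run(self, audio: Sequence[float]) -> str:
        transcript = self.transcribe(audio)
        if not transcript.strip():
            # Nothing was recognized, so there is nothing to translate.
            return ""
        return self.translate(transcript, self.source_lang, self.target_lang)


# Placeholder stages so the sketch runs end to end. A real system would
# replace these with a trained model and a translation model or API.
def placeholder_transcribe(audio: Sequence[float]) -> str:
    return "bonjour" if len(audio) > 0 else ""


def placeholder_translate(text: str, source: str, target: str) -> str:
    return {"bonjour": "hello"}.get(text, text)


pipeline = SpeechTranslationPipeline(placeholder_transcribe, placeholder_translate)
print(pipeline.run([0.0, 0.1, -0.1]))
```

Keeping the two stages behind plain callables means either one could later be swapped for a cloud API or a custom model without touching the rest of the code.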
Why is this important to me?
My grandmother is an immigrant from Quebec, Canada. She came to the United States with my grandfather many years ago. When I was very young she suffered two strokes. In every interaction with my grandmother that I can remember, it was a struggle to understand her. Phone calls have always been difficult. The combination of poor audio quality and my grandmother’s condition made it almost impossible to communicate. On top of all that, my grandmother’s English is limited. This isn’t her fault, since she is fluent in Canadian French. Because she lives in the United States, most people around her don’t speak Canadian French, which makes communicating with her even more difficult. All of this, together with my own inability to speak Canadian French, led me to want to use machine learning to address the problem.
Due to my grandmother’s age and lack of interaction with technology, recordings of her voice prior to her strokes are pretty much non-existent. That only limits my ability to recreate my grandmother’s original voice. I should still be able to create something that can listen to her voice and produce a text translation. If audio is required, an alternative voice can be used with the text that is produced.
Technology Stack: GCP vs. AWS vs. Azure
I plan to create my own model using data that I collect. Unfortunately, I haven’t collected any data from prior conversations, but I plan to start collecting going forward. I hope to make the code open-source on my GitHub so that others can collect their own data and build a similar model. I don’t plan to release my model or data, since the data is sensitive. I haven’t decided which cloud platform I will use. The platform will need to provide fast API inference at a reasonable price. There shouldn’t need to be much data processing, if any at all, between training and inference. My initial thought is to use GCP, but this may change as the project moves forward. I will follow a traditional way of framing the ML problem, but with a slightly different weighting of the tasks.
There are five major areas where effort should be allocated:
- Defining KPIs – 5%
- Collecting Data – 50%
- Building Infrastructure – 25%
- Optimizing ML Algorithm – 10%
- Integration – 10%
My KPI is the accuracy of the translation. I hope to achieve as accurate a translation as possible. This is a relatively straightforward KPI, so not much time will be spent here.
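One way to put a number on accuracy is word error rate (WER), a common measure for speech recognition output. Using WER here is my assumption, not a settled choice: it scores the transcription step against a reference transcript, and the translation step may need a separate measure. The function name below is hypothetical. It is a minimal sketch using a word-level edit distance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of five reference words is missing from the hypothesis.
print(word_error_rate("i would like some water", "i would like water"))
```

A lower value is better, and a value of 0 means the output matched the reference word for word. Because insertions count as errors, WER can exceed 1.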
Collection of Data
This is literally half of the work. Since I have no data, it is difficult to build a model. I am searching for an existing model to potentially perform transfer learning on, but I haven’t been able to find one. The initial collection will come from conversations I have with my grandmother. The more conversations I have, the more data I can train the model on. Once I have a model deployed in production, I hope to have a separate pipeline that collects audio submitted to the platform. That pipeline will save the audio to help train future models. As time goes on, the models should hopefully improve to the point where serious analysis is needed to determine which model is better.
Building Infrastructure
This section will change as time goes on. The choice mostly depends on which platform has the cheapest storage for the audio recordings, models, and other data, and on which platform offers the fastest inference at the lowest cost. Another determining factor is the available APIs and how well they perform. Both GCP and AWS have amazing ML APIs. I plan to test both sets of NLP APIs for my specific needs.
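As a rough idea of what the collection step could look like, here is a minimal sketch that stores each recorded utterance as a WAV file and appends a line to a manifest that a training job could read later. The function name, the file layout, the manifest fields, and the 16 kHz mono 16-bit format are all assumptions on my part. It uses only the Python standard library.

```python
import json
import uuid
import wave
from pathlib import Path


def save_utterance(pcm_bytes: bytes, transcript: str, out_dir: str,
                   sample_rate: int = 16000) -> Path:
    """Write 16-bit mono PCM audio to a WAV file and log it in manifest.jsonl."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)

    wav_path = folder / f"utt_{uuid.uuid4().hex}.wav"
    with wave.open(str(wav_path), "wb") as wav:
        wav.setnchannels(1)           # mono (assumed)
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)

    record = {
        "audio": wav_path.name,
        "transcript": transcript,     # what was actually said, typed by hand
        "sample_rate": sample_rate,
        "duration_s": len(pcm_bytes) / (2 * sample_rate),
    }
    # One JSON object per line keeps the manifest easy to append to.
    with open(folder / "manifest.jsonl", "a", encoding="utf-8") as manifest:
        manifest.write(json.dumps(record, ensure_ascii=False) + "\n")
    return wav_path
```

For example, `save_utterance(audio_bytes, "bonjour", "data/grandma")` would add one WAV file and one manifest line under `data/grandma`. The same function could back the production pipeline that saves submitted audio for future training, provided consent and privacy are handled first.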
Optimizing ML Algorithm
This section is more of a future conversation. Since we have no data, we can’t really optimize an ML algorithm yet. The initial model will most likely use a recurrent architecture such as a GRU or LSTM, or something similar. I will most likely use some form of transfer learning or an existing language translation API. It all depends on whether a translation model for Canadian French already exists. Initially, I may also train an additional model to handle her English audio.
Integration
Since no other app or system will be interfacing with this API/model, there aren’t many considerations here yet. I hope to create an app that integrates this API so that my immediate relatives who interact with my grandmother can use it as well. I will also have to figure out an easy way for my grandmother to use this with other people. My grandmother is very smart but still struggles with newer technology, so this may be a hurdle I will need to face. Another possibility is using TensorFlow Lite and creating an application that runs the model for speech prediction on the device. Since I typically communicate with my grandmother via FaceTime, I hope to make the app integrate with video chat applications for ease of use.
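For readers who haven’t seen a GRU before, here is a minimal sketch of a single GRU step written in plain Python, with no ML framework. It is only meant to show the idea of a recurrent cell that reads one audio feature frame at a time and updates a hidden state. The parameter names, the dictionary layout, and the update convention (new state = (1 - z) * old state + z * candidate) are assumptions for illustration. A real model would use a framework implementation and trained weights.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate(W: Matrix, U: Matrix, b: Vector, x: Vector, h: Vector,
         activation: Callable[[float], float]) -> Vector:
    """Compute activation(W x + U h + b) element by element."""
    return [activation(wx + uh + bias)
            for wx, uh, bias in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x: Vector, h: Vector, p: Dict[str, list]) -> Vector:
    """Advance the hidden state h by one input frame x."""
    z = gate(p["W_z"], p["U_z"], p["b_z"], x, h, sigmoid)  # update gate
    r = gate(p["W_r"], p["U_r"], p["b_r"], x, h, sigmoid)  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    candidate = gate(p["W_h"], p["U_h"], p["b_h"], x, reset_h, math.tanh)
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, candidate)]


def encode(frames: List[Vector], p: Dict[str, list], hidden_size: int) -> Vector:
    """Run the GRU over a sequence of frames, starting from a zero state."""
    h = [0.0] * hidden_size
    for frame in frames:
        h = gru_step(frame, h, p)
    return h
```

Here `W_*` matrices have shape hidden size by input size, `U_*` matrices are hidden size by hidden size, and `b_*` are bias vectors of length hidden size. The final hidden state from `encode` is the kind of summary a later layer could turn into text.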
The future goal is to have an app that can listen to any voice and to make this tool available to anyone. If there is someone in your life who suffers from a similar condition, I hope that this work will eventually help make life easier. As the project nears production readiness, I hope to create an easy way to adapt the model to other individuals. This can hopefully be done using transfer learning.
This is the first part of a multi-part series that will hopefully be fruitful and produce an open-source method for translating speech from individuals who suffer from dysarthria. This pre-part 1 was meant to discuss the problem, the approach, and the end goal. Thank you for reading, and please be on the lookout for the official release of part 1.