Several generations of Neuralink neural implants.
For people with limited use of their limbs, speech recognition can be critical to their ability to use a computer. But for many, the same problems that limit limb movement also affect the muscles that enable speech, which can make every form of communication a challenge, as the physicist Stephen Hawking famously demonstrated. Ideally, we'd like to bypass physical activity entirely and find ways to translate nerve impulses themselves into speech.
Brain-computer interfaces were making impressive progress even before Elon Musk decided to get involved, but the brain-to-text problem hasn't been one of the field's successes. We've been able to recognize speech in brain activity for a decade, but the accuracy and speed of that recognition remain quite low. Some researchers at the University of California, San Francisco suggest the problem may be that we haven't been thinking about the challenge in terms of the overall process of speaking. And they have a brain-to-speech system to back up their suggestion.
Lost in translation
Speaking is a complicated process, and it's not necessarily obvious where in that process it's best to intervene. At some point, your brain decides on the meaning it wants conveyed, although that often gets revised as the process proceeds. Then word choices have to be made, although once speech is mastered these rarely require conscious thought; even some grammatical decisions, such as when an article is called for and which one to use, can happen automatically. Finally, the brain has to coordinate a collection of muscles to actually produce the appropriate sounds.
There's also the question of what exactly to track. Individual sound units are built up into words, and words into sentences. Both are subject to issues such as accents, mispronunciations, and other audio problems. How do you decide what your system should focus on?
The researchers behind the new work were inspired by the steadily improving capabilities of automated translation systems. These typically work at the sentence level, which likely helps them pin down the identity of ambiguous words using the context and inferred meaning of the sentence.
Typically, these systems process written text into an intermediate representation and then extract the meaning from that representation to identify the corresponding words in the second language. The researchers recognized that the intermediate representation doesn't necessarily have to come from processed text. Instead, they decided to derive it by processing neural activity.
In this case, they had access to four people who had electrodes implanted to monitor for seizures, and the electrodes happened to be in parts of the brain involved in speech. The participants were asked to read a set of 50 sentences, which contained a total of 250 unique words, while their neural activity was recorded by the implants. Some of the participants read additional sets of sentences, but this first set provided the primary experimental data.
The recordings, along with audio recordings of the actual speech, were then fed into a recurrent neural network, which after training processed them into an intermediate representation that captured their key features. That representation was then passed to a second neural network, which attempted to identify the full text of the spoken sentence.
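To make that architecture concrete, here's a minimal sketch of the encoder-decoder pattern in PyTorch. The layer types, sizes, and names are illustrative assumptions on my part; the paper's actual network differs in its details.

```python
# A minimal encoder-decoder sketch: neural recordings in, word logits out.
# All names, sizes, and the choice of GRUs are assumptions for illustration.
import torch
import torch.nn as nn

class BrainToText(nn.Module):
    def __init__(self, n_electrodes=256, hidden=128, vocab_size=250):
        super().__init__()
        # Encoder: compresses a sequence of neural-activity samples into a
        # single intermediate representation (its final hidden state).
        self.encoder = nn.GRU(n_electrodes, hidden, batch_first=True)
        # Decoder: unrolls that representation into a sequence of word logits.
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab_size)

    def forward(self, neural_seq, out_len=10):
        _, state = self.encoder(neural_seq)       # state: (1, batch, hidden)
        # Feed the intermediate representation to the decoder at every step.
        dec_in = state.transpose(0, 1).repeat(1, out_len, 1)
        dec_out, _ = self.decoder(dec_in)
        return self.to_vocab(dec_out)             # (batch, out_len, vocab)

# One fake recording: 100 time samples from 256 electrodes.
model = BrainToText()
logits = model(torch.randn(1, 100, 256))
print(logits.shape)  # torch.Size([1, 10, 250])
```

The key design choice is the bottleneck: everything the decoder knows about the sentence has to pass through that fixed-size intermediate representation, which is what lets the second network be swapped or retrained independently of the first.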
How did it work?
The main limitation here is the extremely small number of sentences available for training; even the participant with the most spoken sentences provided less than 40 minutes of speaking time. The data was so limited that the researchers worried the system might simply be figuring out what was said by tracking how long a person spoke. This did cause some problems, as some of the system's errors involved swapping in the words of an entirely different sentence from the training set.
Those mistakes aside, the system did fairly well given the limited training. The authors used a performance measure called the "word error rate," based on the minimum number of changes needed to convert the translated sentence into the sentence that was actually spoken. For two of the participants, the word error rate after running through the full training set was below eight percent, which is comparable to the error rate of professional human transcription.
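For reference, here's a minimal sketch of how a word error rate is computed: a standard edit distance (Levenshtein) over words rather than characters, divided by the length of the reference sentence. The function and example sentences are my own, not the paper's.

```python
# Word error rate: minimum word insertions, deletions, and substitutions
# needed to turn the decoded sentence into the reference, divided by the
# reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic Levenshtein dynamic program over words instead of characters.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution/match
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the doctor examined the patient",
                      "a doctor examined that patient"))  # 0.4
```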
To learn more about what was going on, the researchers systematically disabled parts of the system. This confirmed that the neural representation was critical to the system's success. They could disable the portion of the system that handled audio, and error rates rose but stayed within a range considered usable. That's quite important for potential applications, which include people who cannot speak.
Disabling various parts of the electrode input confirmed that the key brain areas the system was paying attention to were those involved in speech production and processing. One important contribution came from an area of the brain that tracks the sound of one's own voice, providing feedback on whether the spoken words matched the speaker's intent.
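Ablation studies like these generally amount to silencing one part of the input or network and re-measuring performance. Here's a minimal, self-contained sketch of the idea; the tiny stand-in model and the electrode groupings are invented purely for illustration.

```python
# Ablation sketch: zero out one group of input channels at a time and see
# how much the model's output drifts from the unablated baseline.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.GRU(128, 32, batch_first=True)   # stand-in for the full decoder
recording = torch.randn(1, 100, 128)        # 100 samples, 128 electrodes
baseline, _ = model(recording)

for name, chans in {"speech_motor": slice(0, 64),
                    "auditory_feedback": slice(64, 128)}.items():
    masked = recording.clone()
    masked[:, :, chans] = 0.0               # silence this electrode group
    out, _ = model(masked)
    drift = (out - baseline).abs().mean().item()
    print(f"ablating {name}: mean output change {drift:.3f}")
```

In the real study, the equivalent measurement is how much the word error rate degrades when a given electrode group is removed, which is what localized the contribution of the auditory-feedback region.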
Transfer tech
Finally, the researchers tested various forms of transfer learning. For example, one of the participants spoke an additional set of sentences that wasn't used in the testing. Training the system on those as well reduced the error rate by 30 percent. Similarly, training the system with data from two users improved performance for both. These tests showed that the system really was extracting generalizable features of the sentences.
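A rough sketch of that kind of transfer learning, under the assumption of a shared network pretrained on one participant's recordings and then fine-tuned on a second participant's, might look like this; the model, data shapes, and training loop are all invented for illustration.

```python
# Transfer-learning sketch: pretrain on participant A, fine-tune on B.
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, n_electrodes=128, hidden=64, vocab=250):
        super().__init__()
        self.rnn = nn.GRU(n_electrodes, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.out(h)          # per-timestep word logits

def train(model, recordings, labels, epochs=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = model(recordings)
        loss = loss_fn(logits.flatten(0, 1), labels.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()

model = Decoder()
# Fake data standing in for two participants' recording sessions.
rec_a, lab_a = torch.randn(8, 50, 128), torch.randint(0, 250, (8, 50))
rec_b, lab_b = torch.randn(8, 50, 128), torch.randint(0, 250, (8, 50))
train(model, rec_a, lab_a)            # pretrain on participant A
train(model, rec_b, lab_b, lr=1e-4)   # fine-tune on participant B, lower lr
```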
The transfer learning has two important implications. For one, it indicates that the system's modularity could allow it to be trained on intermediate representations derived from text, rather than requiring neural recordings for everything. That would naturally open it up to more general use, although it might initially increase the error rate.
The second is that a significant portion of the training could likely be done with people other than the eventual user of a given system. That would be critical for those who have lost the ability to vocalize, and it would significantly cut down on the training time any individual user has to put into the system.
Obviously, none of this will happen until such implants are safe and routine. But there's a bit of a chicken-and-egg problem here, as there's no justification for giving people implants without demonstrating the potential benefits first. Even if it takes decades before a system like this is useful, simply demonstrating that it could be useful can help push the field forward.
Nature Neuroscience, 2020. DOI: 10.1038/s41593-020-0608-8 (About DOIs).