Wearable Architecture: Silent Speech Recognition
- Lyu Theresa
- Nov 2, 2019
- 3 min read
Internal vocalization is the characteristic inner voice in humans that is usually noticeable while reading and can be voluntarily triggered by speaking to oneself, without deliberate lip movement or any discernible muscle movement. It is accompanied by subtle movements of the internal speech articulators.

Speech Synthesis and Electrophysiology
Producing speech is among the most complex actions a human performs.
Once an expression arises in the brain, it is encoded into a linguistic form by Broca's area, and the motor area then maps it onto the muscular movements needed for vocal articulation. This voluntary control of articulation is enabled by the sensorimotor cortex, which governs the activation and firing rate of motor units via projections of the corticobulbar tract to the face, laryngeal cavity, pharynx and oral cavity. Motor neurons receive nerve impulses from anterior horn cells, which in turn trigger action potentials that propagate along the muscle fibres.
This ion movement along the muscle fibres produces time-varying potential difference patterns in the facial and neck muscles when the user intends to speak, which allows the system to detect the corresponding speech-related EMG signals from the skin surface without any vocalization or audible sound.
Among the various muscle regions involved in speech production, the research focuses on the laryngeal and hyoid areas as well as the buccal, mental, oral and infraorbital areas, where signal characteristics can be detected non-invasively.
To determine the spatial positions of the detection points, the researchers narrowed an initial 30-point grid covering these regions down to seven target areas on the skin.
In the current iteration of the device, the signals are sourced as 7 channels from the following areas: the laryngeal region, hyoid region, levator anguli oris, orbicularis oris, platysma, anterior belly of the digastric, and mentum. The finer positions of the electrodes within the selected regions were then adjusted empirically.
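For reference in the sketches below, the seven channels can be kept in a simple mapping from channel index to region. The index order here is an assumption; the post does not specify how the channels are numbered.

```python
# Hypothetical channel-to-region mapping for the 7-channel setup described above.
# The index ordering is an assumption; the post does not specify it.
CHANNEL_REGIONS = {
    0: "laryngeal region",
    1: "hyoid region",
    2: "levator anguli oris",
    3: "orbicularis oris",
    4: "platysma",
    5: "anterior belly of the digastric",
    6: "mentum",
}
```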
The researchers then ranked the 7 candidate target locations using filter-based feature ranking, evaluating how well signals sourced from each target differentiated between the word labels in the dataset (a sketch of one possible ranking follows the table below).
| Ranking | Region |
| --- | --- |
| 1 | Mental |
| 2 | Inner laryngeal |
| 3 | Outer laryngeal |
| 4 | Hyoid |
| 5 | Inner infra-orbital |
| 6 | Outer infra-orbital |
| 7 | Buccal |
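The post does not say which filter criterion was used for the ranking. As one plausible example, channels could be scored by the mutual information between a simple per-channel feature (RMS energy per recording, an assumption on my part) and the word labels:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_channels(recordings, labels):
    """Rank channels by how well they separate word labels.

    recordings: array of shape (n_samples, n_channels, n_timesteps)
    labels:     array of shape (n_samples,) with word labels
    Uses RMS energy per channel as a stand-in feature; the actual
    features and filter criterion used in the study are not specified here.
    """
    rms = np.sqrt((recordings ** 2).mean(axis=2))   # (n_samples, n_channels)
    scores = mutual_info_classif(rms, labels)       # one score per channel
    order = np.argsort(scores)[::-1]                # best channel first
    return [(int(ch), float(scores[ch])) for ch in order]
```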
Signal Capture, Processing and Hardware
Signals are captured using electrodes from the abovementioned target areas.
The team used bias-based signal cancellation to cancel the ∼60 Hz line interference and achieve a higher SNR. The signals are sampled at 250 Hz and differentially amplified at 24× gain. They also built an optoisolated external trigger for marking the start and end events of a silent phrase.
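The line-noise cancellation above is done in hardware (bias-based cancellation). Purely as a software illustration of suppressing ∼60 Hz interference in a 250 Hz stream, a digital notch filter could look like the following; the quality factor is an assumption.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

FS = 250.0        # sampling rate from the post
LINE_FREQ = 60.0  # mains interference to suppress
Q = 30.0          # notch quality factor (assumed)

def remove_line_noise(signal):
    """Zero-phase 60 Hz notch filter applied to a single-channel stream."""
    b, a = iirnotch(LINE_FREQ, Q, fs=FS)
    return filtfilt(b, a, signal)
```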
The signal streams are wirelessly sent to an external computing device, where they go through multiple preprocessing stages: the signals are digitally rectified, normalized to the range 0 to 1 and concatenated as integer streams, and then sent to the server hosting the recognition model that classifies the silent words.
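A minimal sketch of the rectification and 0-to-1 normalization stage described above might look like this; the integer scaling factor is an assumption, since the post does not specify the integer format.

```python
import numpy as np

def preprocess_channel(raw, int_scale=1000):
    """Digitally rectify and normalize one channel to [0, 1].

    The conversion to an integer stream via `int_scale` is an assumed
    detail; the post only says the normalized signals are concatenated
    as integer streams before being sent to the server.
    """
    rectified = np.abs(raw)
    span = rectified.max() - rectified.min()
    normalized = (rectified - rectified.min()) / (span + 1e-12)
    return (normalized * int_scale).astype(np.int32)

def preprocess_frame(channels):
    """Preprocess and concatenate all channels into a single stream."""
    return np.concatenate([preprocess_channel(ch) for ch in channels])
```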
Data Collection and Corpus
The researchers generated datasets of varied vocabulary size, collected in two steps. In the first step, they invited three participants to investigate signal detection and to determine the electrode positions. The dataset recorded with these participants was binary, with words labelled yes or no, and the vocabulary set was gradually augmented to accommodate more words. The team collected 5 hours of internally vocalized text in total.
In the second step, they created another dataset for training a classifier. This dataset contains 31 hours of silently spoken text recorded across different sessions to regularize the recognition model for session independence, with words and numbers (and arithmetic operations) labelled in categories. The categories were expanded in later experiments, and the external trigger signal was used to slice the data into word instances (a sketch of this slicing follows below). In each recording session, signals were recorded for randomly chosen words from a specific vocabulary set.
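Given start and end event markers from the external trigger, slicing the continuous stream into word instances is straightforward. The marker format here (pairs of sample indices) is an assumption; the post only mentions start and end events.

```python
def slice_word_instances(stream, trigger_events):
    """Cut a continuous multi-channel stream into word instances.

    stream:         array of shape (n_channels, n_timesteps)
    trigger_events: list of (start_sample, end_sample) pairs marking the
                    beginning and end of each silently spoken word
                    (format assumed from the description above).
    """
    return [stream[:, start:end] for start, end in trigger_events]
```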
Silent Speech Recognition Model
(Figure: the signal silhouette before being input to the recognition model.)
A windowed average is used to detect and remove single spikes in the stream, i.e. samples that exceed the average of the four closest points. The team then used mel-frequency cepstral coefficients (MFCCs) to closely characterize human speech; MFCCs are commonly used as features in speech recognition systems, such as systems that automatically recognize numbers spoken into a telephone. The signal stream is framed into 0.025 s windows with a 0.01 s step between successive windows, followed by a periodogram estimate of the power spectrum for each frame. A DCT is then applied to the log of the mel filterbank applied to the power spectra, so the model can effectively learn directly from the processed signal without hand-picked features.
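Here is a sketch of this feature-extraction step, using the python_speech_features library as one convenient implementation of the framing → power spectrum → mel filterbank → log → DCT pipeline. The library choice, the spike-removal threshold, the number of coefficients and the FFT size are all assumptions; only the 0.025 s window and 0.01 s step come from the post.

```python
import numpy as np
from python_speech_features import mfcc

FS = 250  # sampling rate from the post

def remove_spikes(signal, factor=3.0):
    """Suppress single-sample spikes relative to the average of the 4 closest
    points (the threshold `factor` is an assumption; the post gives no value)."""
    cleaned = signal.copy()
    for i in range(2, len(signal) - 2):
        local_avg = (signal[i - 2] + signal[i - 1] + signal[i + 1] + signal[i + 2]) / 4.0
        if abs(signal[i]) > factor * abs(local_avg):
            cleaned[i] = local_avg
    return cleaned

def extract_features(signal):
    """0.025 s windows, 0.01 s step; 13 cepstral coefficients per frame (assumed)."""
    return mfcc(remove_spikes(signal), samplerate=FS,
                winlen=0.025, winstep=0.01, numcep=13, nfilt=26, nfft=256)
```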
This feature representation is passed through a 1-dimensional convolutional neural network that classifies it into word labels; the architecture will be described in detail in a future post.
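Since the architecture itself is deferred to a future post, the following is only a generic 1-dimensional CNN classifier over MFCC frames, written with PyTorch. The layer sizes, number of classes and all other hyperparameters are assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class SilentSpeechCNN(nn.Module):
    """Generic 1-D CNN over a sequence of MFCC frames (not the authors' architecture)."""

    def __init__(self, n_coeffs=13, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_coeffs, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, n_coeffs, n_frames) -- MFCC frames laid out along the time axis
        return self.classifier(self.features(x).squeeze(-1))

# Usage sketch: a batch of 8 utterances, 13 MFCCs per frame, 50 frames each.
logits = SilentSpeechCNN()(torch.randn(8, 13, 50))
```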