
Internal vocalization is the characteristic inner voice in humans that is usually noticeable while reading and can be voluntarily produced by speaking to oneself, without deliberate lip movement or any discernible muscle movement. It is accompanied by subtle movements of the internal speech articulators.


Speech Synthesis and Electrophysiology

The production of speech is among the most complex actions a human performs.

Once an expression arises in the brain, it is encoded into a linguistic representation by Broca's area, and the motor cortex subsequently maps it onto the muscular movements needed for vocal articulation. This voluntary articulatory control is enabled by the sensorimotor cortex, which governs the activation and firing rate of motor units via projections of the corticobulbar tract to the face, laryngeal cavity, pharynx and oral cavity. Motor neurons receive nerve impulses from anterior horn cells, which then trigger action potentials that propagate along the muscle fibres.

Ion motion within the muscle fibres produces time-varying potential differences that appear in the facial and neck muscles when the user intends to speak. These EMG signals, corresponding to silent speech, can be detected by the system from the skin surface without any vocalization or discernible facial movement.

Among the various muscle regions involved in speech generation, the research focuses on the laryngeal and hyoid areas as well as the buccal, mental, oral and infraorbital areas to detect signal characteristics in a non-invasive manner.

To determine the spatial positions of the detection points, the researchers selected seven target areas on the skin from an initial 30-point grid that spatially covers the regions mentioned above.

In the current iteration of the device, the signals are sourced as 7 channels from the following areas: the laryngeal region, hyoid region, levator anguli oris, orbicularis oris, platysma, anterior belly of the digastric, and mentum. The finer positions of the electrodes on the skin, within the selected regions, were then adjusted empirically.


The researchers then ranked the 7 potential target locations using filter-based feature ranking, evaluating how well signals sourced from each target differentiated between word labels in the dataset; a sketch of this kind of ranking follows the table below.


Ranking  Region
1  Mental
2  Inner laryngeal
3  Outer laryngeal
4  Hyoid
5  Inner infra-orbital
6  Outer infra-orbital
7  Buccal
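
The post does not say which filter criterion was used for the ranking. As a rough sketch of this kind of filter-based feature ranking, the snippet below scores each electrode region with a univariate ANOVA F-test (scikit-learn's f_classif) against the word labels and sorts the regions by score; the feature matrix, the labels and the choice of F-test are all placeholder assumptions, not the team's actual procedure.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Hypothetical example: one aggregate feature (e.g. mean rectified amplitude)
# per channel per word instance, plus the word label of each instance.
rng = np.random.default_rng(0)
n_instances = 200
channels = ["mental", "inner laryngeal", "outer laryngeal", "hyoid",
            "inner infra-orbital", "outer infra-orbital", "buccal"]

X = rng.normal(size=(n_instances, len(channels)))  # placeholder channel features
y = rng.integers(0, 10, size=n_instances)          # placeholder labels (10 words)

# Filter-based ranking: score each channel independently of any classifier.
scores, _ = f_classif(X, y)
ranking = sorted(zip(channels, scores), key=lambda t: t[1], reverse=True)

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}  (F-score: {score:.2f})")
```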


Signal Capture, Processing and Hardware

Signals are captured using electrodes placed at the above-mentioned target areas.

The team used bias-based signal cancellation to cancel the ~60 Hz line interference and achieve a higher SNR. The signals are sampled at 250 Hz and differentially amplified at 24× gain. They also created an optoisolated external trigger to mark the start and end events of a silent phrase.
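
The post names bias-based signal cancellation as the team's approach but gives no details, so the sketch below instead shows a generic way to suppress ~60 Hz line interference in a 250 Hz-sampled channel with a standard IIR notch filter from SciPy. It is only an illustration of the interference-removal step, not the team's method.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

FS = 250.0        # sampling rate reported in the post (Hz)
LINE_FREQ = 60.0  # mains interference to suppress (Hz)

def remove_line_noise(emg: np.ndarray, q: float = 30.0) -> np.ndarray:
    """Suppress ~60 Hz line interference in a single EMG channel.

    Note: the post describes bias-based signal cancellation; this standard
    IIR notch filter is only a generic stand-in for that step.
    """
    b, a = iirnotch(LINE_FREQ, q, fs=FS)
    return filtfilt(b, a, emg)  # zero-phase filtering avoids a temporal shift

# Example on synthetic data: 1 s of noise plus injected 60 Hz interference.
t = np.arange(0, 1, 1 / FS)
channel = np.random.randn(t.size) + 0.5 * np.sin(2 * np.pi * LINE_FREQ * t)
clean = remove_line_noise(channel)
```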

The signal streams are wirelessly sent to an external computing device for further processing and go through multiple preprocessing stages. The signals are digitally rectified, normalized to a range of 0 to 1, concatenated as integer streams, and subsequently sent to the server hosting the recognition model to classify the silent words.
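
A minimal sketch of the rectification, 0-to-1 normalization and concatenation steps described above, assuming a simple per-channel min-max scaling and a 16-bit integer encoding; the post does not specify either choice.

```python
import numpy as np

def rectify_and_normalize(channel: np.ndarray) -> np.ndarray:
    """Full-wave rectify a channel and min-max scale it to the range [0, 1]."""
    rectified = np.abs(channel)
    lo, hi = rectified.min(), rectified.max()
    if hi == lo:                      # guard against a flat (all-equal) signal
        return np.zeros_like(rectified)
    return (rectified - lo) / (hi - lo)

def pack_channels(channels: list[np.ndarray]) -> np.ndarray:
    """Normalize each of the 7 streams and concatenate them for transmission.

    The post says the streams are concatenated as integer streams; the 16-bit
    quantization here is an assumed encoding, chosen only for illustration.
    """
    normalized = [rectify_and_normalize(ch) for ch in channels]
    packed = np.concatenate(normalized)
    return np.round(packed * 65535).astype(np.uint16)

# Example with random stand-ins for the 7 electrode channels.
streams = [np.random.randn(250) for _ in range(7)]
payload = pack_channels(streams)
```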


Data Collection and Corpus

The researchers generated datasets of varied vocabulary size. Data was collected in two steps. In the first step, they invited three participants to investigate signal detection and to determine electrode positions. The dataset recorded with these participants was binary, with each word labelled yes or no. The vocabulary set was gradually augmented to accommodate more words. In total, the team collected 5 hours of internally vocalized text.

In the second step, they created another dataset to train a classifier. This dataset contains 31 hours of silently spoken text recorded across different sessions to regularize the recognition model for session independence. The dataset labels words and numbers (and operations) in categories.

They expanded the categories in later experiments and used the external trigger signal to slice the data into word instances. In each recording session, signals were recorded for randomly chosen words from a specific vocabulary set.
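
The trigger marks the start and end of each silently spoken phrase. Below is a sketch of how such trigger events might be used to slice a continuous recording into word instances, under the assumption that the trigger is available as a binary channel aligned with the signal; the post does not describe the actual trigger encoding.

```python
import numpy as np

def slice_by_trigger(signal: np.ndarray, trigger: np.ndarray) -> list[np.ndarray]:
    """Cut a continuous recording into word instances using a trigger channel.

    Assumes the trigger is a binary channel that is high for the duration of
    each silently spoken word (an assumption made for this sketch).
    """
    edges = np.diff(trigger.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    return [signal[s:e] for s, e in zip(starts, ends)]

# Example: two marked word instances in a 10-second, 250 Hz recording.
fs = 250
recording = np.random.randn(10 * fs)
trig = np.zeros(10 * fs)
trig[2 * fs:3 * fs] = 1   # first word instance
trig[5 * fs:7 * fs] = 1   # second word instance
instances = slice_by_trigger(recording, trig)
print([len(w) for w in instances])   # [250, 500]
```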


Silent Speech Recognition Model

The signal goes through several processing steps before being input to the recognition model.

A window average is used to detect and remove single spikes in the stream, i.e. samples greater than the average of the closest four points. The team used mel-frequency cepstral coefficients (MFCCs) to closely characterize human speech; MFCCs are commonly used as features in speech recognition systems, such as systems that automatically recognize numbers spoken into a telephone. The signal stream is framed into 0.025 s windows, with a 0.01 s step between successive windows, followed by a periodogram estimate of the power spectrum for each frame. A DCT is then applied to the log of the mel filterbank applied to the power spectra, so that the model can effectively learn directly from the processed signal without hand-picking any features. This feature representation is passed through a 1-dimensional convolutional neural network that classifies it into word labels; the architecture will be described in detail in a future post.
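
Putting the described steps together, here is a minimal sketch of the spike removal and MFCC-style feature extraction (0.025 s frames, 0.01 s step, periodogram, log mel filterbank, DCT). The python_speech_features package is used only as a convenient stand-in, since the post does not name an implementation, and the spike-removal rule is a literal reading of the description.

```python
import numpy as np
from python_speech_features import mfcc   # assumed MFCC implementation

FS = 250  # sampling rate of the EMG streams (Hz)

def remove_spikes(x: np.ndarray) -> np.ndarray:
    """Remove single spikes: samples whose magnitude exceeds the average
    magnitude of the four closest points are replaced by that local average
    (a literal reading of the described rule; the exact rule may differ)."""
    y = x.copy()
    for i in range(2, len(x) - 2):
        local = np.mean(np.abs(x[[i - 2, i - 1, i + 1, i + 2]]))
        if abs(x[i]) > local:
            y[i] = np.sign(x[i]) * local
    return y

def emg_features(channel: np.ndarray) -> np.ndarray:
    """Frame the stream into 0.025 s windows with a 0.01 s step and compute
    MFCC-style coefficients (periodogram -> mel filterbank -> log -> DCT)."""
    cleaned = remove_spikes(channel)
    return mfcc(cleaned, samplerate=FS, winlen=0.025, winstep=0.01,
                numcep=13, nfilt=26, nfft=256)

# Example: features for one 2-second channel recording.
channel = np.random.randn(2 * FS)
feats = emg_features(channel)          # shape: (n_frames, 13)
```

The post defers the network architecture to a future post, so the following is only a minimal 1-dimensional convolutional classifier in PyTorch to show the general shape of such a model; the layer sizes and the stacking of 7 channels × 13 coefficients along the input-channel axis are assumptions.

```python
import torch
import torch.nn as nn

class SilentSpeechCNN(nn.Module):
    """Minimal 1-D CNN mapping a (features x frames) sequence to a word label.
    Layer sizes are illustrative, not the authors' architecture."""
    def __init__(self, in_channels: int = 7 * 13, n_words: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        )
        self.classifier = nn.Linear(128, n_words)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 7 channels * 13 MFCC coefficients, n_frames)
        return self.classifier(self.features(x).squeeze(-1))

# Example: a batch of 4 word instances, each 40 feature frames long.
model = SilentSpeechCNN()
logits = model(torch.randn(4, 7 * 13, 40))
print(logits.shape)   # torch.Size([4, 10])
```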


 
 
 

AlterEgo is a wearable silent speech interface that enables discreet, seamless, bi-directional communication with a computing device in natural language, without discernible movements or voice input.


Five points distinguish the AlterEgo system:


1. The system captures neuromuscular signals from the surface of the user's skin via a wearable mask.


2. AlterEgo's system performs robustly even when the user does not open their mouth or make any sound, and without the deliberate, coded muscle articulation often required when using surface EMG to detect silent speech. Existing non-invasive real-time methods with robust accuracies require the user to explicitly mouth their speech with pronounced, apparent facial movements, and this is the key difference between AlterEgo and existing systems. The modality of natural language communication without any discernible movement is key because it allows for a seamless and discreet interface.


3. AlterEgo's system achieves a median accuracy of 92%, outperforming other existing systems, and it does so on silent input that requires no facial muscle movement.

4. The device is a portable, ambulatory wearable: a user simply needs to wear it for it to function, and it connects wirelessly over Bluetooth to any external computing device.


5. The platform does not have access to private information or thoughts; inputs are voluntary on the user's part, unlike with traditional brain-computer interfaces. The system is a peripheral nerve interface that takes measurements from the facial and neck area, which allows silent speech signals to be distilled without being accompanied by electrical noise from the frontal lobe of the cerebral cortex.

 
 
 

Silent Speech Interface:

A silent speech interface is a device that allows communication without using the sound made when people vocalize their speech. The main goal of a silent speech interface is to accurately capture speech without the need for vocalization. The end result is similar to “reading someone’s mind”. It is an exciting and growing technology, as it is well suited to human-machine interaction.

Not all silent speech interfaces are created for one general purpose. The goal of a silent speech interface can be generating actual sound (e.g. for larynx cancer patients), generating text, or serving as an interface between humans and computer systems. As the goal differs, the methods and even the tools differ as well.

Such devices are created as aids for those unable to produce the phonation needed for audible speech, such as after a laryngectomy. Another use is for communication when speech is masked by background noise or distorted by a self-contained breathing apparatus. A further practical use is where a need exists for silent communication, such as when privacy is required in a public place, or when hands-free silent data transmission is needed during a military or security operation.

There have been several previous attempts at achieving silent speech communication. These systems can be categorized under two primary approaches: invasive and non-invasive systems.


Invasive Systems:


Brumberg et al. 2010 used direct brain implants in the speech motor cortex to achieve silent speech recognition, demonstrating reasonable accuracies on limited-vocabulary datasets. There have also been explorations into measuring the movement of internal speech articulators by placing sensors on or inside those articulators. Hueber et al. 2008 used sensors placed on the tongue to measure tongue movements. Hofe et al. 2013 and Fagan et al. 2008 used permanent magnetic articulography (PMA) sensors to capture the movement of specific points on muscles used in speech articulation; this approach requires invasively fixing permanent magnetic beads, which does not scale well in a real-world setting. Florescu et al. 2010 proposed characterizing the vocal tract using ultrasound to achieve silent speech, but the system only achieves good results when combined with a video camera looking directly at the user’s mouth. The invasiveness, obtrusiveness and immobility of the apparatus impede the scalability of these solutions in real-world settings beyond clinical scenarios.


Non-Invasive Systems:

There have been multiple approaches proposed to detect and recognize silent speech in a non-invasive manner. Porbadnik et al. 2009 used EEG sensors for silent speech recognition but suffered from a low signal-to-noise ratio, which prevented robust detection of speech formation and led to poor performance. Wand et al. 2016 used deep learning on video without acoustic vocalization, but this requires externally placed cameras to decode language from the movement of the lips. Hirahara et al. used a Non-Audible Murmur microphone to digitally transform signals. There have also been instances of decoding speech from facial muscle movements using surface electromyography. Wand and Schultz 2011 demonstrated surface EMG silent speech using a phoneme-based acoustic model, but the user has to explicitly mouth the words with pronounced facial movements. Jorgensen et al. used surface EMG to detect subvocal words, with accuracy fluctuating down to 33% and with the system also unable to recognize alveolar consonants with high accuracy, which is a significant obstruction to actual usage as a speech interface.

 
 
 