Systems that can ignore silence and non-speech sound can save a considerable amount of power compared to always-on wake word detection.
The ability to control machines using speech alone has become a popular feature in many commercial and consumer systems. But the problem with speech control is that the device must always be listening, which means it must always be powered. New options are coming forward, though, that can help designers curtail the power usage of their voice-activated designs.
Getting a machine to respond appropriately to spoken commands is a substantial processing challenge. The system needs a microphone to pick up sounds, a digitizer to convert the sound into a form a processor can work with, and then a lot of digital signal processing to extract speech information from the sound. The amount of processing involved depends on the number of command words that require recognition. Systems with limited vocabulary can use a structure like that in Figure 1 for local processing to do word spotting, while systems requiring natural speech understanding can use cloud computing resources for further processing.
Figure 1 A typical voice control system must continuously be processing sound to look for command words. Source: Aspinity
Unfortunately, most of the time there is no speech occurring, and the processing and the power it consumes are wasted effort. The waste can be avoided by requiring a user to first press a button or the like to activate the speech processing. But if the system is to be activated by speech alone, it must always be capturing and processing sound to avoid missing a command. This poses a particular concern for battery-powered applications because the “always on” nature of the speech processing can be a significant battery drain.
To reduce the wasted effort and conserve power, voice processing systems typically will make use of a “wake” word for activation. This approach requires less power because most of the time the voice processing only needs to be able to identify a single, specific word rather than its full functional vocabulary. The system can thus run a much simpler, less power-hungry processing algorithm while listening for the wake word, suspending the full voice processing effort until after the wake word is detected.
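The gating idea can be illustrated with a minimal sketch. It is not any vendor's implementation: the class `TwoStageListener`, its method names, and the use of already-segmented words in place of audio are all assumptions made for illustration. The point it shows is that the costly full-vocabulary stage runs only for the word that follows the wake word.

```python
from enum import Enum, auto


class Mode(Enum):
    LISTENING = auto()  # only the lightweight wake-word check runs
    AWAKE = auto()      # the full command recognizer runs


class TwoStageListener:
    """Gate an expensive recognizer behind a cheap wake-word check.

    Hypothetical illustration: plain strings stand in for audio.
    """

    def __init__(self, wake_word, commands):
        self.wake_word = wake_word
        self.commands = set(commands)
        self.mode = Mode.LISTENING
        self.full_recognizer_calls = 0  # how often the costly stage ran

    def _full_recognize(self, token):
        # Stand-in for the power-hungry, full-vocabulary stage.
        self.full_recognizer_calls += 1
        return token if token in self.commands else None

    def feed(self, token):
        """Process one word; return a recognized command or None."""
        if self.mode is Mode.LISTENING:
            # Cheap path: a single comparison against one word.
            if token == self.wake_word:
                self.mode = Mode.AWAKE
            return None
        # Expensive path: runs only for the word after the wake word.
        command = self._full_recognize(token)
        self.mode = Mode.LISTENING
        return command


def run(listener, tokens):
    """Feed a stream of words and collect the recognized commands."""
    results = (listener.feed(token) for token in tokens)
    return [command for command in results if command is not None]
```

In this sketch a command word spoken without the wake word is simply ignored, and the counter `full_recognizer_calls` makes it easy to see how rarely the expensive stage is exercised.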
In pursuit of this approach, the industry has invested substantial effort into developing wake word engines requiring minimal power. Often these engines can recognize only a few words to give the user a choice of possible wake options. Some engines, though, can recognize enough words to provide a limited form of voice control offering multiple commands. For more complex voice control, the wake word engine’s purpose is simply to activate more powerful, and more power-consuming, processing in time to receive and interpret the voice commands that will follow the wake word.
These wake word engines are continually evolving. One recent introduction is the pairing of Retune’s VoiceSpot word-spotting algorithm with CEVA’s family of low-power DSPs. In addition to wake-word identification, the combination can perform beamforming and acoustic echo cancellation to improve the reliability of word identification in the presence of noise. The total memory footprint of the algorithm is under 80 Kbytes, targeting smaller, battery-powered applications such as earbuds, smartwatches, and action cameras.
Another recent introduction pairs Cyberon’s CSpotter algorithm with RA6-series microcontrollers from Renesas. The algorithm uses phoneme-based modelling that supports more than 30 languages. It can serve as a wake-word engine or provide local voice control using several different command sets. The processor offers an I2S (inter-IC sound) interface to a digital microphone, eliminating the need for an ADC.
Both approaches, although they have minimized the voice identification task, still rely upon digital signal processing for their wake-word identification. This sets a lower limit for the always-on power demand, which may still be burdensome in a battery-powered application. There is another technology available, though, that can save even more power for always-on wake-word identification.
Analog machine learning technology is the key. Aspinity has developed the RAMP (reconfigurable analog modular processor) chip to identify a sound as voice before any voice processing takes place, including the check for whether the voice is speaking the wake word. This predetermination allows even the wake word engine to remain dormant when no one is speaking, as shown in Figure 2.
Figure 2 By first determining whether a sound is speech, the RAMP chip allows voice processing to safely ignore other types of sound. Source: Aspinity
The chip achieves this end using an analog neural network trained to distinguish human voice from other sounds. When it detects voice, it sends an activation signal to the voice processing system, which then determines whether the voice is speaking the wake word. To ensure that the voice processing has the full speech pattern to work with, the chip buffers 500 msec of the captured sound in a pre-roll cache. When the chip identifies the sound as voice, it directs the incoming sound, starting with the pre-roll data, to the voice processing system for interpretation.
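The pre-roll behavior can be mimicked in software with a fixed-length ring buffer. The sketch below is only an analogy for what the chip does in hardware: the class name `PreRollGate`, the 10 msec frame size, and the per-frame `is_voice` flag are assumptions, while the 500 msec window comes from the description above.

```python
from collections import deque


class PreRollGate:
    """Hold recent audio frames and release them once voice is detected.

    Hypothetical software analogy; the frame size is an assumption.
    """

    def __init__(self, preroll_ms=500, frame_ms=10):
        # Number of frames needed to cover the pre-roll window.
        self.preroll = deque(maxlen=preroll_ms // frame_ms)
        self.forwarding = False

    def push(self, frame, is_voice):
        """Accept one frame; return the list of frames to send downstream."""
        if self.forwarding:
            if is_voice:
                return [frame]
            # Voice ended: stop forwarding and start buffering again.
            self.forwarding = False
            self.preroll.clear()
            self.preroll.append(frame)
            return []
        # Not forwarding: keep only the most recent window of frames.
        self.preroll.append(frame)
        if is_voice:
            # Voice started: flush the buffered history, oldest first.
            self.forwarding = True
            history = list(self.preroll)
            self.preroll.clear()
            return history
        return []
```

Because the deque has a fixed maximum length, older frames fall off automatically, so the downstream processing receives at most the last 500 msec of sound ahead of the detection point. A practical design would probably also keep forwarding for a short time after voice stops, which this sketch omits.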
This approach allows the voice-controlled system to keep only the RAMP chip powered continuously. The voice processing hardware, including the wake word engine, can remain dormant whenever no one is speaking. In most cases, the periods with no speech present represent the bulk of the time the system is operating. The RAMP chip and host microcontroller require only about 25 μA of current, compared to the typical tens of milliamps needed for wake-word detection. Thus, the ability to ignore both silence and non-speech sound can save a considerable amount of power compared to always-on wake word detection.
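A rough duty-cycle calculation shows how the saving scales. This is a simplified model, not a measurement: the function `average_current_ma` is hypothetical, the 10 mA engine current is an assumed value at the low end of "tens of milliamps," the 5% speech fraction is an assumption, and the cost of waking the engine is ignored. Only the 25 μA figure comes from the text.

```python
def average_current_ma(gate_ma, engine_ma, engine_duty):
    """Average current when an always-on stage wakes the engine part-time.

    gate_ma: current of the always-on stage, in mA
    engine_ma: current of the wake-word engine while it runs, in mA
    engine_duty: fraction of time the engine runs (0.0 to 1.0)
    """
    if not 0.0 <= engine_duty <= 1.0:
        raise ValueError("engine_duty must be between 0 and 1")
    return gate_ma + engine_ma * engine_duty


GATE_MA = 0.025          # about 25 uA, the figure quoted in the text
ENGINE_MA = 10.0         # assumed: low end of "tens of milliamps"
SPEECH_FRACTION = 0.05   # assumed: speech is present 5% of the time

# Always-on wake-word detection: the engine runs all the time, no gate.
always_on = average_current_ma(0.0, ENGINE_MA, 1.0)

# Gated detection: the engine runs only while speech is present.
gated = average_current_ma(GATE_MA, ENGINE_MA, SPEECH_FRACTION)

print(f"always-on: {always_on:.3f} mA, gated: {gated:.3f} mA")
```

Under these assumed numbers the gated design averages about 0.5 mA against 10 mA for always-on detection. The result is dominated by how often speech is actually present, so the benefit grows as the environment gets quieter.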
Such power-reduction innovations in voice control will likely continue, extending the potential for voice-activated operation from line-powered to battery-powered designs. Whether or not controlling a given device by voice is a good idea, it is becoming a practical option regardless of the device's power source.
This article was originally published on EDN.
Rich Quinnell is a retired engineer and writer, and former Editor-in-Chief at EDN.