Key design considerations for voice command systems

Article By : Raj Senguttuvan

This article provides design considerations for low-power, always-on voice command systems using voice activity detection (VAD).

Voice assistants and voice integration are being built into many of the products, appliances, and technologies introduced to the market. It is no secret that these assistants are always on, listening for activation or wake words (such as “okay Google” or “Alexa”), and that this always-on listening often consumes a significant amount of power. In a world where technology is advancing rapidly, it is imperative to consider the impact this has on energy consumption.

This article explores the trade-offs and considerations involved in choosing the components required to create an easy-to-use, energy-efficient voice user interface (VUI) built around voice activity detection.

The VAD function detects human voice in the environment before the system listens for a wake word, meaning that when nobody is home, the voice assistant does not waste energy. It is estimated that 4.2 billion digital voice assistants are in use around the world, and this number is expected to double by 2024. Implementing VAD in voice assistants and other products that rely on voice integration would drastically lower the energy these devices consume.

There are several hardware architectures for implementing a VUI system. In general, a typical voice user interface implementation consists of a single microphone or a microphone array connected to an audio processor that captures and processes voice.

The incoming audio stream can be processed on an audio edge processor, a smart microphone with a built-in audio edge processor, or a standard applications processor (AP). Audio edge processors are optimized for low-power and low-latency processing of audio signals. In addition to providing specialized processing of the input audio, an audio edge processor can also be used for post-processing audio output signals. If the VUI system is cloud connected, the audio edge processor can also communicate with the cloud VUI interface through the main system-on-a-chip (SoC) with wireless connectivity. Two different implementations for VUI systems are presented in this article along with their respective trade-offs.

Ultra-low-power VAD (voice activity detection)

The architecture shown in figure 1 supports an ultra-low-power VUI using an analog signal path, including an analog microphone and an analog comparator, to provide a wake trigger. When acoustic activity is detected, the analog signal chain generates an interrupt to wake up the audio processor for voice capture. The device could also include a “push-to-talk” feature, whereby the user pushes a button to wake up the audio processor.

Figure 1. Ultra-low power, always-on VUI hardware signal chain for remote control without pre-roll buffering.
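
The wake path described above can be illustrated in software. The following is a minimal sketch, not the behavior of any particular part: it assumes the comparator sees an already rectified and smoothed signal envelope, and the function name, threshold, and hysteresis values are hypothetical. It shows that a level comparator fires on any sufficiently loud sound, whether or not it is speech.

```python
def wake_events(envelope, threshold, hysteresis=0.0):
    """Return the sample indices at which a level comparator would raise a wake interrupt.

    `envelope` is assumed to be a rectified, smoothed signal level (hypothetical
    input for this sketch). The output goes high when the level reaches
    `threshold` and returns low only after the level falls below
    `threshold - hysteresis`, which avoids repeated triggers near the threshold.
    """
    events = []
    output_high = False
    for index, level in enumerate(envelope):
        if not output_high and level >= threshold:
            output_high = True
            events.append(index)  # rising edge: interrupt to the audio processor
        elif output_high and level < threshold - hysteresis:
            output_high = False   # comparator re-arms for the next sound
    return events


# Illustrative envelope: quiet, a loud burst, quiet, another loud burst.
levels = [0.0, 0.01, 0.5, 0.6, 0.02, 0.0, 0.7]
triggers = wake_events(levels, threshold=0.3, hysteresis=0.1)
```

Nothing in this logic distinguishes a voice from a television or a dishwasher, which is the limitation that motivates voice activity detection.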

The analog wake microphone must always be listening to the environment, so this microphone, along with the comparator, must consume very little power. One example of an efficient audio processor is the Knowles IA8201, which consumes less than 1 mW in its simplest wake-up trigger mode and has 1 MB of memory for advanced audio processing. While the approach illustrated in figure 1 provides a simple, low-power acoustic activity detection (AAD) approach for always-on VUI in devices such as remote controls and wearables, it has limitations. This implementation wakes up the audio processor for any acoustic signal and can lead to high overall system power consumption in noisy situations. Also, voice user interface systems that are cloud connected require audio data from a period just prior to the wake word to be captured for increased accuracy of wake-word detection. This is commonly referred to as pre-roll and is a must-have requirement for Alexa-enabled devices and other smart speaker devices.

Figure 2. Architecture supporting pre-roll buffering for devices such as smart speakers.

Figure 2 shows an architecture that supports pre-roll buffering for devices such as smart speakers. These devices typically have bigger batteries and/or may not need multiple months of battery life on a single charge. The VUI system is always on, listening to the environment and recording pre-roll in a circular buffer. The pre-roll is typically on the order of 500 ms of audio data and is used to calibrate the ambient noise level.
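
The circular pre-roll buffer can be sketched in a few lines. This is an illustrative model rather than firmware for any specific device: the class name and the 16 kHz sample rate are assumptions, while the 500 ms length follows the typical figure quoted above. Once the buffer is full, each new sample displaces the oldest one, so the buffer always holds the most recent audio.

```python
from collections import deque


class PreRollBuffer:
    """Fixed-length circular buffer holding the most recent audio samples."""

    def __init__(self, sample_rate_hz=16000, preroll_ms=500):
        # Number of samples needed to cover the requested pre-roll duration.
        self.capacity = sample_rate_hz * preroll_ms // 1000
        # A deque with maxlen silently drops the oldest item when full.
        self._samples = deque(maxlen=self.capacity)

    def push(self, samples):
        """Append newly captured samples, overwriting the oldest if full."""
        self._samples.extend(samples)

    def snapshot(self):
        """Return the buffered audio, oldest sample first."""
        return list(self._samples)


# Usage: on a wake-word hit, the snapshot is sent ahead of the live audio.
buffer = PreRollBuffer()
buffer.push([0] * 20000)          # more than 500 ms of (silent) audio
preroll = buffer.snapshot()       # only the latest 8000 samples are kept
```

In an embedded implementation the same idea is usually a fixed array with a wrapping write index; the deque is used here only to keep the sketch short.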

There are a few different approaches to design the always-on, front-end architecture. The choice of the audio processor depends on the number of microphones used, and whether they are analog or digital.

The architecture shown in figure 2 uses a Knowles IA611 for voice activity detection, SPH0655LM4H-1 Cornell II digital microphones for beamforming, and a Knowles IA8201 for audio processing. The Knowles IA611 is a smart microphone that offers benefits to a system designer, as discussed in the following section.

Microphone selection

For the architecture shown in figure 1, a single analog microphone and a comparator are used as a trigger input to wake up the audio processor when acoustic activity is detected. The wake microphone should be a low-power analog microphone with a signal-to-noise ratio (SNR) preferably higher than 62 dB. The Knowles SiSonic MEMS microphone portfolio offers several choices for the wake microphone. For example, the SPV1840LR5H-B Kaskade analog microphone is a good choice, consuming only 45 µA when ON. The always-on analog path, including the microphone, amplifier, and comparator, consumes less than 67 µA. There are piezoelectric microphones available in the market with very low always-on current (10 µA), but they typically have low SNR, which can affect system performance.

For the pre-roll-capable architecture shown in figure 2, microphones with an embedded audio processor and sufficient memory to continuously capture voice data in a circular buffer of 2 seconds, such as the Knowles IA611, are viable options for always-on voice activity detection. The IA611 also comes with an ecosystem of ported voice triggers and commands, such as Amazon’s Alexa. When a keyword is detected, both the pre-roll buffer and the uttered voice audio are sent to the cloud automatic speech recognition (ASR) engine. The IA611’s always-on, voice-wake current is 0.39 mA at a battery voltage of 1.8 V and 90 percent efficiency, making it a good choice for voice user interfaces in battery-operated devices such as Bluetooth speakers. The device also accepts PDM input from a digital microphone and can be used to support beamforming on the host BT-SoC processor by passing audio through once the system wakes up.

While this always-on power is acceptable for pre-roll applications, it is also worth considering for a non-pre-roll architecture such as the one illustrated in figure 1. As described earlier, an analog wake microphone will trigger on any incoming sound and turn on the audio processor. This can be problematic in a noisy environment, such as when the TV is ON, where there will be many spurious wakes leading to significant wastage of power. If voice activity detection is used instead of the low-power analog wake microphone, the system would turn on only when a keyword is detected. It is therefore easy to see why a voice activity detection microphone might be more efficient than a simple analog wake microphone in a noisy environment.

Figure 3 shows simulation data that compares the number of days of battery life for a typical TV remote control using VAD on the IA611 vs. a competing piezoelectric low-power AAD microphone and an audio processor, for varying durations of acoustic activity ON time. Acoustic activity can be present when the TV or other household appliances are ON, or in other situations, such as when there is background babble. As seen in figure 3, there is a crossover point at about 3 hours, beyond which the power advantage of analog AAD on a competitor’s microphone over voice activity detection on the IA611 disappears.

At five hours of acoustic activity ON time, the voice activity detection solution offers eight extra days of battery life over the competing AAD-based solution. To put this advantage in context, U.S. adults watched nearly eight hours of TV per day, according to a Nielsen study published in 2017. With the increasing demand for internet-connected devices, such as smart TVs, game consoles, and other multimedia devices, the hours of acoustic activity in a typical U.S. household will likely continue to rise as well. Using an intelligent VAD-based wake-up will help system designers develop more power-efficient VUI systems.

Figure 3. Remote control battery life with VAD vs AAD.
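
The reason a crossover exists can be seen with a simple average-current model. The sketch below is not the simulation behind figure 3 and does not attempt to reproduce its numbers. It assumes that the AAD design keeps the audio processor awake for the whole period of acoustic activity, that the VAD design's processor wake time is negligible, and it uses a hypothetical battery capacity and a hypothetical processor-awake current. Only the 10 µA and 0.39 mA always-on figures come from the text above.

```python
def average_current_ma(always_on_ma, awake_ma, awake_hours_per_day):
    """Average current: always-on draw plus duty-cycled processor draw."""
    return always_on_ma + awake_ma * (awake_hours_per_day / 24.0)


def battery_life_days(capacity_mah, avg_ma):
    """Days of operation for a battery of the given capacity."""
    return capacity_mah / (avg_ma * 24.0)


def crossover_hours(aad_always_on_ma, vad_always_on_ma, awake_ma):
    """Daily acoustic-activity hours at which both designs draw equal average current.

    Derived from: aad_always_on + awake * h / 24 == vad_always_on.
    """
    return 24.0 * (vad_always_on_ma - aad_always_on_ma) / awake_ma


CAPACITY_MAH = 1000.0     # hypothetical battery capacity
AAD_ALWAYS_ON_MA = 0.010  # piezoelectric microphone figure quoted in the article
VAD_ALWAYS_ON_MA = 0.39   # IA611 voice-wake figure quoted in the article
AWAKE_MA = 3.0            # hypothetical processor-awake current, not a measured value

for hours in (1, 3, 5, 8):
    aad_days = battery_life_days(
        CAPACITY_MAH, average_current_ma(AAD_ALWAYS_ON_MA, AWAKE_MA, hours))
    vad_days = battery_life_days(
        CAPACITY_MAH, average_current_ma(VAD_ALWAYS_ON_MA, AWAKE_MA, 0.0))
    print(f"{hours} h of activity: AAD {aad_days:.0f} days, VAD {vad_days:.0f} days")
```

In this model the AAD design wins when the room is mostly quiet, because its always-on current is much lower, and loses once spurious wakes dominate the budget. The exact crossover and the size of the advantage depend entirely on the processor-awake current and battery capacity, which is why measured or simulated data for the actual components is needed.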

Conclusion

From smart home, hospitality, digital workplaces, voice payments, intelligent energy management, voice at the edge, and healthcare, all the way to industrial IoT applications changing the plant floor, voice adds flexibility, efficiency, and sustainability to new technologies and eases their adoption.

The various hardware architectures for the design of a voice user interface, along with microphone selection, each serve a slightly different need depending on the end device’s applications and the designer’s preferences. For example, Alexa-enabled devices and smart speakers require an architecture capable of pre-roll buffering.

It is important that electronics engineers and designers carefully evaluate how the end device will leverage voice and which capabilities they wish to access, and from there determine the appropriate architecture and microphone components.

This article was originally published on Embedded.


Raj Senguttuvan - Knowles

Raj Senguttuvan has over 15 years of experience in new technology development for consumer and industrial applications, early stage business development, and project management for companies including Analog Devices and Texas Instruments. In his role as director, strategic marketing for Knowles, he directs system-level development, drives venture investments and partnerships, and marketing strategy for IoT and consumer technologies including audio processors, algorithms, microphones, sensors, and receivers. Raj holds an MBA from Cornell University and a PhD in electrical engineering from Georgia Institute of Technology.
