Audio of Things encompasses audio technologies like voice control, communications, playback, and sensing serving smart devices and machines.
The Internet of Things (IoT) is not a one-trick pony; it spans a diverse array of use cases. One offshoot, Audio of Things or AoT, encompasses audio technologies like voice control, communications, playback, and sensing, and their evolving relationships with smart devices and machines.
The term was coined by DSP Concepts, a design house that provides chipmakers and OEMs with real-time workflows for embedding sound and voice features into audio-enabled designs. The company claims that its AoT building blocks are processor-agnostic and fully customizable.
According to Simon Forrest, principal technology analyst at Futuresource Consulting, Audio of Things is an ever-expanding space, driven by a global appetite for audio-centric products. “The addressable market for audio products stood at just under 3.5 billion devices in 2021, of which 2.1 billion integrate voice processing.”
Figure 1 Audio of Things employs a wide span of technologies, including voice control, communications, playback and sensing. Source: DSP Concepts
While humans have traditionally interfaced with machines through a complex series of button pushes that the machine can understand, smarter devices are now adopting speech-based human-machine interfaces (HMIs) by leveraging the audio processing power available on edge devices. Below is an outline of the key design trends that help developers overcome the unique challenges of sound-based interface and intent design.
Smart devices or machines with voice interfaces use a two-pronged approach to discern user intent. An automatic speech recognition (ASR) module converts speech to text, and intent is then determined by analyzing that text with a natural language understanding (NLU) engine. Typically, these processes employ cloud computing platforms such as Amazon’s Alexa Voice Service and Google Assistant, which use artificial intelligence (AI) and machine learning to handle an extremely broad set of queries and commands and to generate an equally broad set of replies and actions.
However, while cloud-based technologies like ASR and NLU have improved continually, it’s largely been “edge-based” designs that have lowered the barriers to the spread of voice-based HMIs. At the edge, a new breed of embedded speech-to-intent engines runs entirely on the device itself, albeit with a limited vocabulary and a limited set of actionable intents.
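To make the edge approach concrete, below is a minimal sketch of a speech-to-intent lookup, assuming an on-device ASR front end has already produced a text transcript. The phrase table, intent IDs, and function names are hypothetical and only illustrate the “limited vocabulary, limited intents” idea; they are not any vendor’s actual engine.

```c
/* Hypothetical sketch: map a small set of recognized phrases to
 * actionable intents entirely on the device. */
#include <stdio.h>
#include <string.h>

typedef enum { INTENT_NONE, INTENT_LIGHT_ON, INTENT_LIGHT_OFF, INTENT_VOLUME_UP } intent_t;

typedef struct {
    const char *phrase;   /* transcript produced by the on-device ASR */
    intent_t    intent;   /* actionable intent on the device */
} phrase_map_t;

static const phrase_map_t k_phrases[] = {
    { "turn on the light",  INTENT_LIGHT_ON  },
    { "turn off the light", INTENT_LIGHT_OFF },
    { "volume up",          INTENT_VOLUME_UP },
};

/* Unknown phrases fall back to INTENT_NONE; a hybrid design could
 * escalate those to a cloud NLU service instead. */
static intent_t resolve_intent(const char *transcript)
{
    for (size_t i = 0; i < sizeof(k_phrases) / sizeof(k_phrases[0]); ++i) {
        if (strcmp(transcript, k_phrases[i].phrase) == 0)
            return k_phrases[i].intent;
    }
    return INTENT_NONE;
}

int main(void)
{
    printf("intent = %d\n", resolve_intent("turn on the light"));
    return 0;
}
```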
Audio design engineers have traditionally added a stand-alone DSP to the board as an audio coprocessor, mainly because MCUs lacked the required horsepower. Now, embedded processors employ instruction-set enhancements such as Arm’s Neon for Cortex-A and Helium for Cortex-M to support the floating-point and SIMD operations needed for efficient audio processing.
That brings the compute power necessary to integrate voice control into a design. Moreover, unlike a 200-MHz DSP consuming 7,000 µW/MHz, a power-optimized MCU with audio-processing capabilities can operate at around 20 µW/MHz.
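The kind of workload these SIMD extensions accelerate is the dense float multiply-accumulate loop at the heart of most audio processing. The block FIR filter below is a plain-C sketch of such a loop; tap count, block size, and the function name are illustrative only, and in practice the inner loop would be vectorized by the compiler, hand-written Neon/Helium intrinsics, or a DSP library.

```c
/* Illustrative block FIR filter: the float multiply-accumulate pattern
 * that Neon (Cortex-A) and Helium (Cortex-M) SIMD extensions speed up. */
#include <stddef.h>

#define NUM_TAPS   32
#define BLOCK_SIZE 64

/* history[] holds the NUM_TAPS most recent input samples; the caller
 * carries it over between blocks. */
void fir_block_f32(const float coeffs[NUM_TAPS], float history[NUM_TAPS],
                   const float *in, float *out, size_t block)
{
    for (size_t n = 0; n < block; ++n) {
        history[NUM_TAPS - 1] = in[n];               /* newest sample */

        float acc = 0.0f;
        for (size_t k = 0; k < NUM_TAPS; ++k)        /* SIMD-friendly MAC loop */
            acc += coeffs[k] * history[NUM_TAPS - 1 - k];
        out[n] = acc;

        for (size_t k = 0; k < NUM_TAPS - 1; ++k)    /* shift for next sample */
            history[k] = history[k + 1];
    }
}
```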
Figure 2 Audio Weaver, a graphical design environment optimized for embedded audio software, offers software components to bolster audio processing in chips serving the Audio of Things. Source: DSP Concepts
The extra compute cycles also benefit designs that use an array of multiple microphones instead of a single mic. However, designing an array comprising two to seven microphones demands acoustical, electrical, and mechanical expertise. Design engineers must choose the appropriate mics, decide on the optimum number and array geometry, and ensure they are properly mounted and gasketed. Furthermore, engineers must design the overall acoustics and product chassis so there is no mechanical coupling between the microphones and loudspeakers.
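To show why array geometry matters to the signal processing as well as the mechanics, here is a minimal delay-and-sum beamformer sketch. Each channel is delayed by a precomputed integer number of samples, derived from the array geometry and the desired look direction, and the channels are averaged. The mic count, names, and integer delays are assumptions for illustration; real products use fractional delays, calibration, and adaptive weighting.

```c
/* Minimal delay-and-sum beamformer for a small microphone array. */
#include <stddef.h>

#define NUM_MICS 4

/* mics[m] points to the sample buffer of microphone m; delays[m] is that
 * channel's steering delay in samples; n is the output sample index.
 * The caller must ensure n >= delays[m] for every channel. */
float delay_and_sum(const float *mics[NUM_MICS],
                    const size_t delays[NUM_MICS], size_t n)
{
    float acc = 0.0f;
    for (size_t m = 0; m < NUM_MICS; ++m)
        acc += mics[m][n - delays[m]];   /* align each channel to the look direction */
    return acc / (float)NUM_MICS;        /* average reinforces the steered direction */
}
```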
Another roadblock to the widespread adoption of voice-based user interfaces has been poor speech-recognition performance, and that’s where design innovations in the audio front end (AFE) come into play. The AFE, the functional block between a device’s microphones and the rest of the voice-processing design, takes the raw audio from the microphones and attempts to create a single audio output stream of the user’s voice commands.
Fixed-function, hardware-based AFEs are hard to integrate into small form factors, and their performance has mostly been underwhelming. Now, however, the availability of software-based AFEs like TalkTo enables machines to match humans’ ability to understand speech in noisy environments.
Figure 3 Qualcomm’s system-on-chip (SoC) has incorporated TalkTo for supporting an always-listening design. Source: Qualcomm
In TalkTo, a multichannel acoustic echo canceller (AEC) first cancels the “known” sounds made by the device’s own speakers. Next, adaptive interference canceller (AIC) technology uses machine learning and advanced microphone-processing techniques to continually map and characterize the ambient sound field.
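For readers unfamiliar with how an AEC subtracts a “known” signal, the sketch below shows a single-channel normalized-LMS adaptive filter, the textbook core of acoustic echo cancellation. It is not TalkTo’s proprietary multichannel implementation; the tap count, step size, and names are assumptions chosen only to illustrate the principle.

```c
/* Generic NLMS echo canceller sketch: estimate the echo path from the
 * loudspeaker reference and subtract the estimated echo from the mic. */
#include <stddef.h>

#define AEC_TAPS 128

typedef struct {
    float w[AEC_TAPS];   /* adaptive weights (echo-path estimate) */
    float x[AEC_TAPS];   /* recent loudspeaker (reference) samples */
} aec_t;

/* ref: loudspeaker sample, mic: microphone sample; returns the
 * echo-reduced residual and adapts the weights toward the echo path. */
float aec_process(aec_t *s, float ref, float mic)
{
    /* shift in the newest reference sample */
    for (size_t k = AEC_TAPS - 1; k > 0; --k)
        s->x[k] = s->x[k - 1];
    s->x[0] = ref;

    /* estimate the echo and compute the residual */
    float est = 0.0f, energy = 1e-6f;    /* small bias avoids divide-by-zero */
    for (size_t k = 0; k < AEC_TAPS; ++k) {
        est    += s->w[k] * s->x[k];
        energy += s->x[k] * s->x[k];
    }
    float err = mic - est;

    /* NLMS update, normalized by the reference energy */
    const float mu = 0.1f;
    for (size_t k = 0; k < AEC_TAPS; ++k)
        s->w[k] += (mu * err / energy) * s->x[k];

    return err;   /* microphone signal with the estimated echo removed */
}
```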
This article was originally published on EDN.
Majeed Ahmad, editor-in-chief of EDN and Planet Analog, has covered the electronics design industry for more than two decades.