XMOS just introduced its XVF3510 next-generation voice processor, which can pluck an individual voice out of a crowded audio landscape using just two microphones.
Those clever folks at XMOS have just brought us one step closer to embedding “ears” for voice control in just about every device with which we interact.
As a reminder, XMOS is a fabless semiconductor company that develops voice solutions, audio products, and multicore microcontrollers capable of concurrently executing real-time tasks, extreme digital signal processing (DSP), and control flow. XMOS microcontrollers are distinguished by their deterministic (predictable) behavior.
Let’s start with the underlying xCORE multicore microcontroller technology, which comprises multiple “processor tiles” connected by a high-speed switch. Each processor tile is a conventional RISC processor that can execute up to eight tasks concurrently. Tasks can communicate with each other over channels (which can connect to tasks on the local tile or to tasks on remote tiles) or via shared memory (for tasks running on the same tile only).
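To make the tasks-and-channels model concrete, here is a minimal sketch in Python using threads and a queue. This is purely an analogy: actual xCORE firmware is written in XC or C, and on the silicon the scheduling and channel communication are handled in hardware rather than by an operating system.

```python
# Analogy only: two concurrent "tasks" exchanging data over a blocking
# "channel", loosely mimicking how xCORE tasks communicate.
import queue
import threading

def producer(chan):
    # A task that sends samples over the channel.
    for sample in [1, 2, 3]:
        chan.put(sample)
    chan.put(None)  # end-of-stream marker

def consumer(chan, results):
    # A task (on the same or another "tile") receiving over the channel.
    while True:
        sample = chan.get()
        if sample is None:
            break
        results.append(sample * 2)

chan = queue.Queue()  # stands in for an xCORE channel
results = []
t1 = threading.Thread(target=producer, args=(chan,))
t2 = threading.Thread(target=consumer, args=(chan, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [2, 4, 6]
```

The key point the analogy captures is that each task blocks on the channel until data arrives; the crucial difference is that on an xCORE device this blocking and wake-up happens in hardware with deterministic timing.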
The xCORE architecture delivers, in hardware, many of the elements that are usually seen in a real-time operating system (RTOS), including the task scheduler, timers, I/O operations, and channel communication. By eliminating sources of timing uncertainty (interrupts, caches, buses, and other shared resources), xCORE devices can provide deterministic, predictable performance for many applications. A task can typically respond within nanoseconds to events such as external I/O or timers, which makes it possible to program xCORE devices to perform hard real-time tasks that would otherwise require dedicated hardware.
In 2017, XMOS acquired Setem Technologies. As I wrote in my column “XMOS + Setem Could Be a Game-Changer for Embedded Speech”: “The chaps and chapesses at Setem are the pioneers of Advanced Blind Source Signal Separation technology. Their patented algorithms enable consumer devices to focus on a specific voice or conversation within a crowded audio environment to achieve optimized input into speech-recognition systems.”
I have two Amazon Echo/Dot devices at home and one in my office (I asked my wife, Gina the Gorgeous, why she was whispering. “I heard that the folks at Amazon might be listening to us,” she said. I laughed, Gina laughed, Alexa laughed…). I think that these devices are awesome, but they do require an array of seven microphones, which increases both the cost and the physical footprint of the overall solution.
Having multiple microphones allows the system to better detect and remove noise, perform echo cancellation, and determine the location of sound sources, such as a person speaking. Of course, when you think about it, we manage to do all of this with just two ears (I don’t know about you, but I don’t think I have enough room on my head to accommodate seven ears without at least one of them getting in the way).
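As a back-of-the-envelope illustration of how even a two-microphone array can tell where a sound is coming from, here’s a hedged Python sketch that converts the time-difference-of-arrival (TDOA) between the two mics into an arrival angle. The mic spacing and sample rate below are illustrative assumptions on my part, not XVF3510 specifications.

```python
# Illustrative sketch: direction of arrival from the delay between two mics.
# The numbers (mic spacing, sample rate) are assumptions, not chip specs.
import math

SPEED_OF_SOUND = 343.0   # m/s, at roughly room temperature
MIC_SPACING = 0.07       # 7 cm between the two mics (assumed)
SAMPLE_RATE = 16000      # Hz (assumed)

def angle_from_delay(delay_samples):
    """Convert an inter-mic delay (in samples) to an arrival angle.

    0 radians = broadside, i.e., the source is directly in front of the pair.
    """
    delay_s = delay_samples / SAMPLE_RATE
    # sin(theta) = (delay * c) / d, clamped to the valid [-1, 1] range
    s = max(-1.0, min(1.0, delay_s * SPEED_OF_SOUND / MIC_SPACING))
    return math.asin(s)

print(round(math.degrees(angle_from_delay(0)), 1))  # 0.0 -> straight ahead
print(round(math.degrees(angle_from_delay(2)), 1))  # roughly 38 degrees
```

A sound arriving off to one side reaches one microphone slightly before the other; with a known mic spacing, that tiny delay is enough to recover the angle, which is essentially what our two ears do.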
Not surprisingly, the folks at XMOS spotted this too, which is why their new XVF3510 next-generation voice processor can pluck an individual voice out of a crowded audio landscape using just two microphones.
The algorithms running on the XVF3510 include interference cancellation (which nulls point noise sources to cancel out unwanted background noise), stereo acoustic echo cancellation (which suppresses unwanted speaker echo and enables barge-in), and adaptive delay estimation (which dynamically adjusts audio reference signal latency, thereby ensuring that the echo-cancellation algorithms deliver a smooth, real-time experience).
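To give a feel for what delay estimation involves, here is a simplified Python sketch that finds the lag at which the microphone signal best matches the loudspeaker reference by brute-force cross-correlation. This illustrates the general technique only; it is not XMOS’s algorithm, which must track the delay adaptively and in real time.

```python
# Simplified sketch of delay estimation by cross-correlation: find the lag
# at which the mic signal best matches the loudspeaker reference, so an
# echo canceller can align the two. Not XMOS's actual adaptive algorithm.
def estimate_delay(reference, mic, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate the reference with the mic signal shifted by `lag`.
        score = sum(r * m for r, m in zip(reference, mic[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# The mic "hears" the reference 3 samples late, at half the amplitude.
ref = [0.0, 1.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]
mic = [0.0, 0.0, 0.0] + [0.5 * x for x in ref]
print(estimate_delay(ref, mic, max_lag=5))  # 3
```

Once the delay is known, the echo canceller can subtract a time-aligned estimate of the speaker output from the microphone signal, which is what makes barge-in (talking over the device’s own audio) possible.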
These and other algorithms enable the XVF3510 to work intelligently to analyze the acoustic environment and identify and isolate voice commands from every other sound in the room (including any media streaming through the device itself), thereby enabling far-field voice capture with close-range precision.
As part of this announcement, the guys at XMOS also unveiled a new XVF3510-based VocalFusion development kit for Alexa Voice Service (AVS), Amazon’s suite of services built around its voice-controlled AI assistant, Alexa (check out this video to see the development kit in action).
On the right, we see a small, two-microphone array connected to a board carrying the XVF3510. Meanwhile, the shield seen on the left (top) is plugged into a Raspberry Pi (bottom). (Note that the Raspberry Pi is not included with the VocalFusion dev kit for AVS.)
Costing only $0.99 a chip for orders over 1 million units a year (prices start at $1.39 for smaller quantities), the XVF3510 voice processor will enable manufacturers to economically embed a voice interface into mass-market consumer products like smart TVs and set-top boxes.
I don’t know about you, but I’m both thrilled and filled with trepidation with regard to an AI-enabled voice-controlled future. On the one hand, I can easily visualize myself ambling around, telling devices and systems what I want them to do to make my life easier and more enjoyable. On the other hand, I can also imagine a world in which I’m surrounded by appliances, contraptions, gadgets, and gizmos that are all clamoring for my attention and I cannot get them to stop talking to me. How about you? Are you excited or scared?