Voice control and voice interfaces have begun their inexorable infiltration of nearly every consumer edge device category. Advances in both voice recognition algorithms and AI accelerator hardware mean the technology is accessible even to power- and cost-constrained applications such as smart home devices (and even some dumb ones).
From the user's side, the drivers behind voice control in smart home devices are clear.
“Ease of use and convenience are the main drivers at this time,” Alireza Kenarsari-Anhari, CEO of PicoVoice, told EE Times. It’s easy to imagine shouting to the coffee maker from your desk in your home office when you want a coffee, or dictating orders to a tumble dryer while holding a basket of wet laundry.
We assume that smart devices like these, which are not portable, have permanent access to the home’s WiFi connection — so why not do this voice processing in the cloud?
The trend towards edge AI in this situation is primarily driven by privacy, which Kenarsari-Anhari says is a concern for consumers but a must-have for some enterprises. Reliability is another driver: “Does it make sense for your laundry machine to stop working if your WiFi is not working?” he said.
Latency also matters in certain situations; some applications, such as gaming, need real-time guarantees for voice processing.
Cost is another big driver for edge processing of voice, since processing voice data in the cloud costs money. A business model of paying every time you use a cloud API doesn’t work for use cases such as home appliances and consumer electronics, which sell at low price points and may be used many times each day.
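A rough back-of-envelope calculation makes the point. The figures below are illustrative assumptions rather than real vendor pricing, but they show how per-request cloud fees compound over an appliance’s lifetime while on-device silicon is a one-time cost.

```c
/* Back-of-envelope comparison of per-request cloud processing versus one-time
 * on-device silicon. All figures are illustrative assumptions, not vendor pricing. */
#include <stdio.h>

int main(void) {
    const double cloud_cost_per_request = 0.004;  /* assumed $ per cloud speech request */
    const double requests_per_day       = 20.0;   /* assumed daily uses of a voice appliance */
    const double lifetime_years         = 7.0;    /* assumed appliance lifetime */
    const double on_device_mcu_cost     = 1.0;    /* sub-$1 MCU plus margin, assumed */

    double cloud_lifetime_cost =
        cloud_cost_per_request * requests_per_day * 365.0 * lifetime_years;

    printf("Cloud processing over the appliance's life: $%.2f\n", cloud_lifetime_cost);
    printf("One-time on-device silicon:                 $%.2f\n", on_device_mcu_cost);
    return 0;
}
```

Under these assumptions the cloud route costs a couple of hundred dollars per device over its life, far more than the appliance's voice hardware itself.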
PicoVoice, whose AI speech-to-text inference engine is designed to run independently of the cloud on sub-$1 microcontrollers, aims to enable voice control in applications where it otherwise wouldn’t be feasible. This could include consumer wearables and hearables, which need both the power efficiency and the cost efficiency that a microcontroller-based voice solution can provide. A power- and cost-optimized solution could also unlock opportunities in industrial, security and medical applications, Kenarsari-Anhari says.
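To give a sense of how such an engine sits in firmware, the sketch below shows the frame-based loop that on-device voice engines typically run: capture a short frame of audio, run inference on it locally, and act on any recognized command without leaving the device. The function names, frame size and "detection" are hypothetical stand-ins, not PicoVoice’s actual API, and audio capture is stubbed so the example is self-contained.

```c
/* Minimal sketch of the frame-based processing loop an on-device voice engine
 * typically runs on a microcontroller. Names and sizes are hypothetical, not
 * PicoVoice's API; audio capture is stubbed so this compiles and runs on a host. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define FRAME_LENGTH 512                      /* samples per frame at 16 kHz, assumed */

/* Stand-in for pulling one frame of PCM audio from an I2S/PDM microphone driver. */
static bool read_audio_frame(int16_t *frame) {
    memset(frame, 0, FRAME_LENGTH * sizeof(int16_t));   /* silence, for the demo */
    return true;
}

/* Stand-in for the on-device inference call: a real engine would run its
 * neural model here and report a recognized wake word or command. */
static bool voice_engine_process(const int16_t *frame) {
    (void)frame;
    static int calls = 0;
    return ++calls == 42;                     /* pretend a command ends on frame 42 */
}

int main(void) {
    int16_t frame[FRAME_LENGTH];
    for (int i = 0; i < 100; ++i) {           /* real firmware would loop forever */
        if (read_audio_frame(frame) && voice_engine_process(frame)) {
            printf("command recognized: start brewing\n");  /* act locally, no cloud */
        }
    }
    return 0;
}
```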
The company recently launched Shepherd, a no-code platform for building voice applications on microcontrollers, which works with the company’s model creation software, PicoVoice Console. Shepherd currently supports popular Arm Cortex-M microcontrollers from ST and NXP, with more devices on the way.
“I think of voice as an interface — if you can build your GUI or website without coding, maybe using WordPress, building voice interfaces in a similar way is the next logical step,” Kenarsari-Anhari said. “Shepherd is empowering product managers and UX designers to build prototypes and iterate fast but we do aim to widen its target user base. What if everyone could build their own assistant? Name it what they want — not Alexa! — and give it the personality they want.”
While it is perfectly possible to develop natural language processing models and implement them without specialist software, this route is not for everyone.
“One certainly can — Apple, Amazon, Google and Microsoft did it,” he said. “It is really about whether an enterprise has the resources, is committed to building an organization around it, and can afford to wait for a few years.”
Future trends
Voice is becoming the preferred interface for the next generation of technology users, Kurt Busch, CEO of Syntiant, told EE Times in an interview last summer.
Busch described how his youngest child, who could read but was still a bit too young for writing and spelling, could text message with his friends using the voice interface on a smartphone.
“His older siblings text, but his generation got phones a few years earlier than they did,” Busch said. “As time has gone by, for his generation and younger, their default interface is to talk to it.”
Busch’s view is that voice will become “the touch screen of the future,” with in-device processing providing quick, responsive interfaces at first in devices that have a keyboard or mouse, and then in white goods.
Syntiant’s chips are specialist AI accelerators designed to handle voice AI workloads in consumer electronics devices with low to extremely low power budgets. The startup has shipped more than 10 million of its chips globally to date, most of which have gone into mobile phones to enable always-on keyword detection. The latest Syntiant chip, the NDP120, can recognize hot words such as “OK Google” to activate Google Assistant while consuming under 280 µW.
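The pattern this enables is an always-on listener that costs almost nothing in power: the accelerator monitors the microphone continuously and only wakes the power-hungry application processor when it hears the hot word. The sketch below illustrates that interrupt-driven handoff with hypothetical stand-in functions; it is not Syntiant’s firmware interface.

```c
/* Sketch of the always-on wake-word pattern: a micro-watt accelerator listens
 * continuously and only wakes the host processor when it detects a hot word.
 * All names are hypothetical stand-ins, not Syntiant's interface. */
#include <stdbool.h>
#include <stdio.h>

static volatile bool wake_word_irq = false;   /* set by the accelerator's interrupt pin */

/* Handler the accelerator's GPIO line would trigger on a hot word. */
static void on_accelerator_interrupt(void) {
    wake_word_irq = true;
}

static void host_enter_deep_sleep(void) { /* platform-specific sleep call */ }

static void start_full_assistant_pipeline(void) {
    printf("hot word detected: streaming audio to the assistant\n");
}

int main(void) {
    on_accelerator_interrupt();               /* simulate a single detection event */
    for (int i = 0; i < 3; ++i) {
        if (wake_word_irq) {
            wake_word_irq = false;
            start_full_assistant_pipeline();  /* host wakes only on demand */
        } else {
            host_enter_deep_sleep();          /* otherwise the host stays asleep */
        }
    }
    return 0;
}
```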
In the future, Busch also sees voice control enabling connectivity and access to technology for everyone.
“We see voice as the great democratizer for technology,” Busch said. “There are 3 billion people in the world that live on $2 a day. My assumption is those people do not have internet access and may not have been through the education system. The natural interface here is [speech]. This is how you get technology into the third of the world that is not interacting with technology today. We’ve seen a lot of interest in developing countries about voice first applications, to get those segments of society that maybe did not have access before, not only from an expense point of view but also from a comfort point of view.”
Market fragmentation
The danger with a market growing as fast as voice is that it can quickly become extremely fragmented, Vikram Shirastava, senior director of IoT at Knowles, told EE Times – and not just along hardware lines.
“The market gets fragmented based on, say, what speech recognition engine is being used,” Shirastava said. “The market gets fragmented depending upon whether you’re integrating with a TV SoC or whether it’s a simple MCU inside, say, a microwave. You get fragmentation based on operating systems, or based on the acoustical environment – is it just the home? Is it a doorbell outside? There cannot be a one-size-fits-all solution. You have to kind of find what the common denominators are in each of these verticals, and try to address integration of voice accordingly.”
Knowles has a DSP-based voice control solution and intends to introduce versions of it for different verticals. Its approach is to group fragments of the market by a common denominator – home controls, TV soundbars and remote controls might fall into the same group, for example – and then develop a solution optimized for that group of applications. Shirastava calls this approach “one level below turnkey”: it offers turnkey’s scalability with some added flexibility.
“We have to have a few different releases that address a certain aspect of that fragmentation to allow us to cover the verticals we want to go after,” he said.
Knowles’ recent release, the AISonic Bluetooth Standard Solution, is a development kit for voice recognition in Bluetooth-connected devices such as smart speakers, smart home devices, wearables and in-vehicle voice assistants. The kit is based on Knowles’ IA8201 dual-core DSP silicon, which is designed specifically for neural network processing at far lower power than an application processor. For example, the chip can handle separate AI models for keyword spotting, source classification, beamforming, acoustic echo cancellation (AEC) and source direction estimation concurrently, while consuming under 50 mW. This is enabled by an instruction set extension of almost 400 custom instructions for audio and AI processing on the Tensilica DSP cores, which in turn allows the clock frequency to be reduced to save power.
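As a rough illustration of what “concurrently” means here, the sketch below runs several voice tasks over the same audio frame in a single per-frame pipeline, the way a DSP audio front end typically chains them. The stage functions are empty placeholders, not Knowles firmware, and the frame size is assumed.

```c
/* Illustrative per-frame pipeline running several voice tasks over the same
 * audio frame. The stage functions are empty placeholders, not Knowles
 * firmware; the frame size is assumed. */
#include <stdint.h>
#include <stdio.h>

#define FRAME 256                              /* samples per processing frame, assumed */

static void acoustic_echo_cancel(int16_t *f)  { (void)f; }              /* remove speaker playback */
static void beamform(int16_t *f)              { (void)f; }              /* steer toward the talker */
static int  classify_source(const int16_t *f) { (void)f; return 0; }    /* speech vs. noise */
static int  spot_keyword(const int16_t *f)    { (void)f; return 0; }    /* wake-word model */

int main(void) {
    int16_t frame[FRAME] = {0};                /* one frame of microphone audio */

    /* Every stage runs on every frame within the frame period, so audio never
     * needs to be buffered up or shipped off-device. */
    acoustic_echo_cancel(frame);
    beamform(frame);
    int source  = classify_source(frame);
    int keyword = spot_keyword(frame);

    printf("source class %d, keyword id %d\n", source, keyword);
    return 0;
}
```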
Will voice eventually become the default user interface for most classes of consumer electronics? It certainly looks that way. A combination of advanced, efficient AI voice control algorithms, development environments that enable developers to easily integrate voice, and a growing ecosystem of energy- and cost-efficient hardware solutions has emerged to make it all possible.
This article was originally published on EE Times.