Achieve reliable hands-free voice control (Part 2)
11 Feb 2013 | Bernie Brafman
Figure 1: The role of voice activation.
However, this important voice-activation step requires a few critical characteristics.
Extremely fast response time. Since voice activation effectively competes with a button press, it has to respond as fast or faster. Because the hands-free system uses a probabilistic approach, it can respond without waiting for the recogniser to determine whether the word has even finished. Slow response times lead users to speak before the Step 2 recogniser is ready to listen, which is a major cause of failure.
Low power consumption. This technology can deliver "always listening" wake-up triggers with as few as 7 MIPS, and current draw in the 1-10 mA range on today's devices.
Highly accurate even in low SNR environments. This means several things:
Works in high noise: Truly Handsfree Voice Control performs virtually flawlessly in extremely loud environments, including with music playing in the background, in a car, or even outdoors.
Works without a microphone in close proximity: it is responsive even at distances of 20 feet (in a relatively quiet environment) and at arm's length in noise. This is critical because VUI-based applications will become commonplace in a wide variety of consumer electronics devices, and users won't want to get up and walk over to their devices to control them.
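The fast-response characteristic above can be illustrated with a toy sketch. It assumes a hypothetical stream of per-frame scores (the probability that the audio heard so far matches the trigger phrase) and fires as soon as a smoothed score crosses a threshold, rather than waiting for an end-of-word decision. The function name, window size, threshold and scores are all illustrative assumptions, not the algorithm of any actual product.

```python
from collections import deque
from typing import Iterable, Optional


def first_trigger_frame(
    frame_scores: Iterable[float],
    threshold: float = 0.8,
    window: int = 3,
) -> Optional[int]:
    """Return the index of the first frame at which the trigger fires.

    frame_scores: hypothetical per-frame probabilities (0..1) that the
    trigger phrase is being spoken. The trigger fires as soon as the mean
    of the last `window` scores reaches `threshold`, without waiting for
    the end of the word. Returns None if it never fires.
    """
    recent = deque(maxlen=window)
    for index, score in enumerate(frame_scores):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= threshold:
            return index
    return None


# Made-up scores: confidence rises while the phrase is still being spoken.
scores = [0.1, 0.2, 0.6, 0.9, 0.95, 0.97, 0.9, 0.3]
print(first_trigger_frame(scores))  # fires at frame 4, before the stream ends
```

Smoothing over a few frames is one simple way to trade a little latency for fewer false alarms; a larger window or higher threshold makes the trigger more conservative.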
Such companies as Nuance, Google and Microsoft are prominent in the second step, which is a powerful (often cloud-based) recognition system.
The third step, "Understanding Meaning", is what the original Siri was all about. This was an AI component developed under DARPA funding at SRI and later spun off and acquired by Apple. Nuance's Vlingo does a really nice job of implementing Steps 1-3. It's very likely that Google, Microsoft, Apple and Nuance all have efforts underway in the area of AI and natural language understanding.
The search in Step 4 is done via typical search engines (Google, Microsoft, Apple), and the independent players have likely developed partnerships in these areas.
Step 5 represents a good-quality Text-to-Speech (TTS) engine. Providers like Nuance, Ivona, AT&T, NeoSpeech, and Acapela all have quality TTS engines, and no doubt Apple, Microsoft and Google all have in-house solutions as well.
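As a rough illustration of how the five steps described above chain together, the following sketch wires stub functions into a single pipeline. Plain text stands in for audio, and every function here is a hypothetical placeholder; none corresponds to a real vendor API.

```python
from typing import Callable, Optional

TRIGGER = "hello computer"  # example trigger phrase; any phrase would do


def run_pipeline(
    audio: str,
    recognise: Callable[[str], str],
    understand: Callable[[str], dict],
    search: Callable[[dict], str],
    speak: Callable[[str], str],
) -> Optional[str]:
    """Chain the five steps; `audio` is plain text standing in for sound."""
    # Step 1: voice activation. Anything without the trigger is ignored.
    if not audio.lower().startswith(TRIGGER):
        return None
    request_audio = audio[len(TRIGGER):].strip()
    text = recognise(request_audio)  # Step 2: speech recognition
    meaning = understand(text)       # Step 3: understanding meaning
    answer = search(meaning)         # Step 4: search
    return speak(answer)             # Step 5: text-to-speech


# Stubs standing in for the real engines (all assumptions).
def stub_recognise(audio: str) -> str:
    return audio


def stub_understand(text: str) -> dict:
    intent = "time_query" if "time" in text.lower() else "unknown"
    return {"intent": intent, "query": text}


def stub_search(meaning: dict) -> str:
    return f"results for '{meaning['query']}' ({meaning['intent']})"


def stub_speak(answer: str) -> str:
    return f"[spoken] {answer}"


result = run_pipeline(
    "hello computer what time is it in Tokyo?",
    stub_recognise, stub_understand, stub_search, stub_speak,
)
print(result)
```

Passing each stage in as a function mirrors the point that different providers can supply different steps.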
Mobile applications for smartphones, tablets and ultrabooks benefit from hands-free voice control in both safety and convenience. Applications can wake up and be controlled without touching the handset, whether in the car or across the room. Delivered as a component of a medium-vocabulary recogniser with SDKs for iOS and Android, voice triggers and extensive command menus can be combined with cloud-based recognisers, creating a rich hybrid user experience when connected and extensive control capabilities when not connected. Response time is so fast that no pause between the trigger and the command is necessary; for example, "hello computer what time is it in Tokyo?"
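A minimal sketch of this hybrid arrangement follows, again using text in place of audio. It assumes a small on-device command menu that is checked first, with a cloud recogniser used for everything else when a connection is available. The command names, the `cloud_recognise` callback and the routing order are illustrative assumptions, not the behaviour of any particular SDK.

```python
from typing import Callable, Optional, Tuple

TRIGGER = "hello computer"

# Hypothetical on-device command menu, available with or without a connection.
LOCAL_COMMANDS = {
    "play music": "PLAY",
    "stop": "STOP",
    "call home": "CALL_HOME",
}


def handle_utterance(
    utterance: str,
    connected: bool,
    cloud_recognise: Callable[[str], str],
) -> Optional[Tuple[str, str]]:
    """Route an utterance to the local menu or the cloud recogniser.

    Returns (source, result), or None if the trigger phrase is absent.
    """
    text = utterance.lower().strip()
    if not text.startswith(TRIGGER):
        return None
    # No pause is needed: the command follows the trigger directly.
    command = text[len(TRIGGER):].strip(" ,?.!")
    if command in LOCAL_COMMANDS:
        return ("local", LOCAL_COMMANDS[command])
    if connected:
        return ("cloud", cloud_recognise(command))
    return ("local", "UNRECOGNISED")


def fake_cloud(command: str) -> str:
    # Stand-in for a network call to a cloud recogniser.
    return f"cloud:{command}"


print(handle_utterance("Hello computer stop", False, fake_cloud))
print(handle_utterance("hello computer what time is it in Tokyo?", True, fake_cloud))
print(handle_utterance("hello computer what time is it in Tokyo?", False, fake_cloud))
```

Checking the local menu first keeps common commands fast and available offline; an application could equally prefer the cloud result whenever a connection exists.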
Triggers can be made contextual; for example, if a phone number is included in a text message or email, a trigger such as "Dial the number" can be activated. Uniquely, these SDKs also support using voice triggers as Speaker Verification or Speaker Identification phrases. In these scenarios, a single user or multiple users enroll themselves by speaking the phrase a few times. Once enrolled, the trigger can be used as a voice password in the case of Speaker Verification, rejecting any other speaker, or to identify one of a group of enrolled users in the case of Speaker Identification, so that the user's preferences can be retrieved. Both predefined fixed "hard coded" triggers and User Defined Triggering systems can be implemented on the device for further personalisation (and combined with Speaker Verification/Identification).
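The enrol, verify and identify flow can be sketched as follows. This is a deliberately simplified stand-in: it assumes each spoken phrase has already been reduced to a fixed-length feature vector (the numbers below are made up), averages the enrolment samples into a template, and compares attempts by cosine similarity against an arbitrary threshold. Real speaker verification systems use far more sophisticated models; the function names and threshold here are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def enrol(samples: List[Vector]) -> List[float]:
    """Average a few repetitions of the phrase into one template."""
    count = len(samples)
    return [sum(values) / count for values in zip(*samples)]


def verify(template: Vector, attempt: Vector, threshold: float = 0.9) -> bool:
    """Speaker Verification: accept only a close match to one template."""
    return cosine_similarity(template, attempt) >= threshold


def identify(
    templates: Dict[str, Vector], attempt: Vector, threshold: float = 0.9
) -> Optional[str]:
    """Speaker Identification: best-matching enrolled user, or None."""
    best_name, best_score = None, threshold
    for name, template in templates.items():
        score = cosine_similarity(template, attempt)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Made-up feature vectors for two users speaking the phrase three times each.
templates = {
    "alice": enrol([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]),
    "bob": enrol([[0.0, 1.0, 0.2], [0.1, 0.9, 0.1], [0.0, 1.0, 0.0]]),
}
attempt = [0.95, 0.1, 0.05]

print(verify(templates["alice"], attempt))   # close to alice's template
print(verify(templates["bob"], attempt))     # far from bob's template
print(identify(templates, attempt))          # best match above the threshold
print(identify(templates, [0.5, 0.5, 0.5]))  # unknown speaker is rejected
```

The identified name could then be used as a key to look up that user's stored preferences.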