The dream of the digital assistant, that perfect robotic helper, has long been with us. The dumb-ish systems that currently guide us in paying bills and getting directions can understand speech but not the nuances of language. The next-level tool must be able to “listen.”
That day is almost here, asserts David Pierce in a new Wired piece: a perfect convergence, with technology progressing to meet need at just the right moment. He believes that “your assistant will know every corner of every app on your phone and will glide between them at your spoken command,” reducing or eliminating the reliance on pointing and clicking (a blessing for many users, especially the visually impaired).
The downside to this innovation (which goes unmentioned in the article) is that a “servant” that improves as it gets to know more about you will know more about you. Your privacy won’t be just yours, and a warm, friendly voice may get you to reveal more than the cool hum of a search engine ever could.
In surveying what he believes to be the near-future landscape, Pierce describes a new SoundHound app prototype whose sophistication dazzled him. An excerpt:
The prototype is called Hound, and it’s pretty incredible. Holding a black Nexus 5 smartphone, [SoundHound CEO Keyvan] Mohajer taps a blue and white microphone icon and begins asking questions. He starts simply, asking for the time in Berlin and the population of Japan. Basic search-result stuff—followed by a twist: “What is the distance between them?” The app understands the context and fires back, “About 5,536 miles.”
Then Mohajer gets rolling, smiling as he rattles off a barrage of questions that keep escalating in complexity. He asks Hound to calculate the monthly mortgage payments on a million-dollar home, and the app immediately asks him for the interest rate and the term of the loan before dishing out its answer: $4,270.84.
“What is the population of the capital of the country in which the Space Needle is located?” he asks. Hound figures out that Mohajer is fishing for the population of Washington, DC, faster than I do and spits out the correct answer in its rapid-fire robotic voice. “What is the population and capital for Japan and China, and their areas in square miles and square kilometers? And also tell me how many people live in India, and what is the area code for Germany, France, and Italy?” Mohajer would keep on adding questions, but he runs out of breath. I’ll spare you the minute-long response, but Hound answers every question. Correctly.
Hound, which is now in beta, is probably the fastest and most versatile voice recognition system unveiled thus far. It has an edge for now because it can do speech recognition and natural language processing simultaneously. But really, it’s only a matter of time before other systems catch up.
After all, the underlying ingredients—what Kaplan calls the “gating technologies” necessary for a strong conversational interface—are all pretty much available now to whoever’s buying. It’s a classic story of technological convergence: Advances in processing power, speech recognition, mobile connectivity, cloud computing, and neural networks have all surged to a critical mass at roughly the same time. These tools are finally good enough, cheap enough, and accessible enough to make the conversational interface real—and ubiquitous.•
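(For the curious: the mortgage figure in the excerpt is what the standard fixed-rate amortization formula produces. The article doesn’t say what interest rate or term Mohajer supplied, so the numbers below are only a sketch; a hypothetical 30-year loan at about 3.1 percent lands near the quoted payment.)

    # Standard fixed-rate amortization formula (not SoundHound's code; the
    # 3.1% rate and 30-year term are assumptions the article does not state).
    def monthly_payment(principal, annual_rate, years):
        r = annual_rate / 12            # monthly interest rate
        n = years * 12                  # number of monthly payments
        return principal * r / (1 - (1 + r) ** -n)

    # A $1,000,000 loan at 3.1% over 30 years comes out near the quoted $4,270.
    print(round(monthly_payment(1_000_000, 0.031, 30), 2))  # ~4270.16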