Is voice the new battlefield as companies (Apple, Google, Microsoft, Nuance) jockey for position to control the voice interface and our personal search and assistance experiences? Roberto Pieraccini, industry veteran and author of the new book The Voice in the Machine, looks at the evolution of voice in computing and communications and maps out the future of mobile voice and beyond.
* * *
If you watch Kubrick’s 1968 classic movie “2001: A Space Odyssey” today, you will certainly notice that many of the technologies we enjoyed in 2001, not to mention those we enjoy today, are absent from the movie. Everything looks so dated, so sixties: green alphanumeric terminals, no display windows, primitive graphics, no mice, no portable tablets or wireless devices, no Web. People take notes with pencil and paper; they don’t consult Wikipedia for advice.
On the other hand, the intelligent, murderous, omnipresent computer that—I’m tempted to say “who”—controls the spaceship, HAL-9000, speaks and converses like a human. Its voice is soothing, it appears kind but determined, and it seems to have emotion.
While we have advanced way beyond most of the technologies portrayed in the movie, with gadgets that even visionaries such as Arthur C. Clarke and Stanley Kubrick could not dream up, the conversational and general intelligence of today’s computers seems lame at best when compared with HAL’s. And this is even more striking because we have been trying to build “computers that understand speech” for 60 years.
Yes, 60 years. Why is that? Is speech technology so far behind our expectations of a HAL-like machine that we should give up, or is it advancing along a different path of useful, but not human-like, applications that make our life easier?
This question is at the core of my book The Voice in the Machine, the first book to make the science and technology of speech-understanding computers accessible to a general audience and to track the 60-year journey that brought us to today’s applications.
The mechanics of speech
Speech and language are complex phenomena that are not completely understood. Their complexity, as I explain in the book, is one of the reasons why machines cannot yet engage in spoken conversation as well as humans. We understand, to a certain extent, the mechanics of speech production and perception. But as we move closer to the inner workings of the brain, our knowledge moves from the realm of facts to that of hypotheses. We can hypothesize different levels of abstract cognition in humans — from the lower acoustic and phonetic, to the lexical, morphological, syntactic, and semantic, up to human knowledge of the world — and we can try to mimic them in computers.
But we don’t know exactly how the human brain processes the complexity and infinite variety of language. That lack of precise knowledge is one of the reasons we can’t quite build a perfect replica, in a machine, of the human ability to talk and understand. However, even with limited knowledge of how humans work, we have built useful machines that interact with speech.
From voice dialing in your car to automated telephone customer care, people can issue spoken commands and get a response. In the case of customer care centers, callers can get the right answers right away, without waiting in a queue for the ‘next available agent.’
Siri changes the rules
The 60-year journey that culminated in current speech technology is marked by alternate phases of success, failure, hype, and disillusion. The tipping point that enabled modern speech technology was the development, in the late 1970s, of a mathematical model called the Hidden Markov Model, or HMM, and its application to the problem of speech recognition.
While the methods experimented with before required laborious accumulation of knowledge by expert linguists, HMMs allowed systems to learn the characteristics of speech, and thus to recognize it, directly and automatically from large numbers of transcribed utterances. All modern speech recognition engines, both in research and in the commercial world, are based on the incremental evolution of the HMM paradigm.
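To give a feel for what “learning from transcribed utterances” buys you, here is a minimal, purely illustrative sketch of the decoding step at the heart of HMM-based recognition: the Viterbi algorithm, which finds the most likely sequence of hidden states for a sequence of observed acoustic symbols. The states, symbols, and probabilities below are toy inventions for illustration, not values from any real recognizer (in practice, these parameters are what gets learned from the training data).

```python
# Toy sketch of HMM decoding with the Viterbi algorithm.
# All states, symbols, and probabilities are illustrative inventions.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Pick the best predecessor state for s at time t
            prev, p = max(
                ((ps, V[t - 1][ps] * trans_p[ps][s]) for ps in states),
                key=lambda x: x[1],
            )
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace the best path backward from the most likely final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Two hypothetical phoneme-like states emitting coarse acoustic symbols.
states = ["AH", "S"]
start_p = {"AH": 0.6, "S": 0.4}
trans_p = {"AH": {"AH": 0.7, "S": 0.3}, "S": {"AH": 0.4, "S": 0.6}}
emit_p = {"AH": {"lo": 0.8, "hi": 0.2}, "S": {"lo": 0.1, "hi": 0.9}}

print(viterbi(["lo", "lo", "hi"], states, start_p, trans_p, emit_p))
# → ['AH', 'AH', 'S']
```

A real recognizer works on continuous acoustic features rather than two discrete symbols, and its probabilities are estimated from thousands of hours of transcribed speech, but the dynamic-programming search shown here is the same idea.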
The large corporate research centers, the giant ivory towers of IBM, AT&T Bell Labs, and Microsoft, were among the first to apply the HMM models to bring us huge advances in speech recognition technology. But it was the adventurous entrepreneurial spirit of young and agile startups like Dragon, SpeechWorks and Nuance that moved speech technology to a new and exciting level.
Why were these nimble newcomers able to compete head-on with the R&D giants? Their agility, focus, and low costs let them win small but market-shaping deals and pursue a strategy the giants—AT&T, IBM, and Microsoft—lacked: developing the solutions that drove the adoption of speech recognition while simultaneously working to improve its core technology, rather than relying on third parties to pick it up and use it.
Siri stakes its turf
The rest is history. By exploiting an aggressive M&A strategy and relentlessly incorporating all the smaller players, Nuance became the 800-pound gorilla we have today in the speech technology arena.
However, the game is changing. The use of speech recognition in smartphone applications has created a new consumer environment and has brought new players to the speech arena. For several years, Google has been investing in mobile voice search, which allows smartphone users to search the Web by voice. Apple, which had been dormant in speech technology for decades, licensed technology from Nuance and introduced Siri, one of the first commercial, mass-marketed realizations of the vision of an omniscient personal assistant.
These new paradigms have created a wave of excitement and interest in the field of speech recognition. There is a new energy not only in the commercial space, but also in research. We know that speech understanding is not a solved problem. While sophisticated design and engineering can make it work well in some limited, although remarkable, applications, its capabilities are still far from those of humans.
Whether we will ever be able to converse with a machine like HAL-9000, or merely witness the evolution of incrementally better “please tell me the reason you are calling” applications, is an open question that will, hopefully, find some answers during our lifetime. In the meantime, Siri sets new expectations that the industry can’t ignore. More consumers will expect—even demand—to use speech with a wide variety of devices: not just mobile, but also home appliances, kiosks, TVs, and everything in between.
As a result, speech will soon become the universal and de facto interface bridging our digital and physical worlds, and opening up a world of opportunity for existing and new companies. The pace of change will be phenomenal, matched only by the speed at which research works to identify and solve the fundamental problems of speech and language in order to build better “computers that understand speech.”
Roberto Pieraccini has always been interested in computers that understand speech. He started work at CSELT, a telecommunications research center in Italy, in 1981, when he built his first speech recognition machine. In 1988 Roberto was invited to visit Bell Laboratories in Murray Hill, NJ. His one-year visit became a permanent job and a life-long passion. He enjoyed the fervent scientific atmosphere of Bell Labs, and then AT&T Shannon Labs, until 1999. But Roberto wanted to experience firsthand how a small startup can put technology into practice at a faster pace than large corporate research centers, and so joined SpeechWorks, which contributed hugely to the modern speech industry and is now known as Nuance. In 2003 he joined IBM Research, and then in 2005 SpeechCycle as its CTO. In 2012 Roberto had a unique opportunity to expand his horizons beyond speech technology, and accepted the position of director of the International Computer Science Institute (ICSI) in Berkeley, CA. ICSI is an independent research organization, close to the University of California at Berkeley, that pursues advanced research in many areas of computer science, such as networking, security, speech, vision, artificial intelligence, linguistics, computational biology, computer architectures, and theoretical computer science. In addition to his work, Roberto is a photographer, storyteller and writer. Follow Roberto on Twitter (@RobertoPieracc).