A Time Capsule Predicting The Voice First Revolution
Over a year ago I answered this Quora Knowledge Prize question:
This answer caused quite a stir. It was well ahead of its time, when most of the deep thinkers in tech thought Voice First was at best a novelty. I think the posting is still quite informative today, and it is an interesting read in light of all the advancements of the Voice First revolution.
March 5th, 2016: Is Amazon Echo (and/or Siri and other voice assistants) actually useful, or is it just a novelty? Are usage and retention of these products growing?
Speech Is the Ultimate Invisible Computer Interface
In the next 10 years more than 50% of computer interactions will be via voice. The computer, the device and the legacy interface will disappear; all that will persist is the volition, intention, interaction and results.
In the summer of 1952 Bell Laboratories actively tested Audrey (Automatic Digit Recognizer), the first speaker independent voice recognition system, which decoded phone number digits spoken over a telephone for automated operator assisted calls.
Schematic of Audrey, the first speaker independent voice recognition system.
In 1962 IBM demonstrated its “Shoebox” machine at the World’s Fair. It could understand 16 words spoken in English and was designed to be a voice calculator.
Demonstration of IBM’s “Shoebox” at the 1962 World’s Fair.
Moving forward in time there were hundreds of advancements. Most of the history of speech recognition was mired in speaker dependent systems that required the user to read a very long story or grouping of words. Even with this training, accuracy was quite poor. There were many reasons for this, most of them tied to the limits of the software algorithms and processor power. Additionally, continuous speech recognition, where you just talk naturally, has only been refined to a great extent in the last 5 years.
In the last 10 years there has been more advancement than in the previous 50. The line from 1952 to 2016 has made speech recognition one of the most important technology advancements in computer history.
Speech Requires Less Mechanical Load And Cognitive Load
The most powerful and efficient interface for communication is the human voice. It sounds obvious in this context and it has had a few million years of evolutionary development. Yet we take speech quite for granted as we only recently took to a mechanical system (typing, clicking, pointing) to interact with computers.
Human speech is a far more refined tool that can convey densely packed instructions and requests in-situ more effectively. The mechanical load and cognitive load on the human is far lower when we can utter a phrase like “Alexa, what does my commute look like?” as compared to the 30+ cognitive and mechanical steps using the best smartphone and best apps. The alternative to speech imposes the cognitive and mechanical load of typing, plus the cognitive load of interpreting what a map is showing. Simply asking a question is far superior.
Speech based interactions fundamentally have three advantages over current systems:
- Speech is an ambient medium rather than an intentional one (typing, clicking, etc). Visual activity requires singular focused attention (a cognitive load) while speech allows us to do something else.
- Speech is descriptive rather than referential. When we speak we describe objects in terms of their roles and attributes. Most of our interactions with computers are referential.
- Speech requires more modest physical resources. Speech-based interaction can be scaled down to much smaller and much cheaper form-factors than visual or manual modalities.
Speech based systems have grown profoundly powerful with the addition of always on systems combined with machine learning (Artificial Intelligence), cloud based computing power and highly optimized algorithms. Speech recognition is combined with almost pristine Text To Speech voices that so closely resemble human speech that many trained dogs will take commands from the best systems. Siri, Google Voice and Amazon Echo Alexa are the best consumer friendly examples of the combination of Speech Recognition and Text To Speech products today.
We take for granted the mechanical processes we have all adopted to use computers. We will be able to eliminate many if not all of these steps with just a simple question. This process can be broken out into 3 basic conceptual modes of voice interface operations:
- Does Things For You – Task completion:
– Multiple Criteria Vertical and Horizontal searches
– On the fly combining of multiple information sources
– Real time editing of information based on dynamic criteria
– Integrated endpoints, like ticket purchases, etc.
- Gets What You Say – Conversational intent:
– Location context
– Time context
– Task context
– Dialog context
- Gets To Know You – Learns and acts on personal information:
– Who are your friends
– Where do you live
– What is your age
– What do you like
In the cloud, quite a bit of heavy lifting goes into producing an acceptable result. This encompasses:
- Location Awareness
- Time Awareness
- Task Awareness
- Semantic Data
- Out Bound Cloud API Connections
- Task And Domain Models
- Conversational Interface
- Text To Intent
- Speech To Text
- Text To Speech
- Dialog Flow
- Access To Personal Information And Demographics
- Social Graph
- Social Data
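To give a feel for how a few of these pieces fit together, here is a minimal Python sketch of a Text To Intent step that also uses location and time awareness. Everything in it is hypothetical and invented for illustration: the `Context` and `Intent` types, the `text_to_intent` function, the keyword rules and the intent names. It is not how Alexa, Siri or Google actually work, and it assumes Speech To Text has already produced the transcript.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Context:
    """Hypothetical per-request context (location and time awareness)."""
    home: str
    work: str
    now: datetime


@dataclass
class Intent:
    """Hypothetical structured result of the Text To Intent step."""
    name: str
    slots: dict = field(default_factory=dict)


def text_to_intent(transcript: str, ctx: Context) -> Intent:
    """Map a transcript to an intent using toy keyword rules."""
    words = transcript.lower().replace(",", "").replace("?", "").split()
    if "commute" in words:
        # Time awareness: before noon assume home -> work, otherwise the reverse.
        morning = ctx.now.hour < 12
        origin, destination = (ctx.home, ctx.work) if morning else (ctx.work, ctx.home)
        return Intent("GetCommute", {"origin": origin, "destination": destination})
    if "add" in words and "list" in words:
        # Everything between "add" and the next "to" is treated as the item.
        start = words.index("add") + 1
        end = words.index("to", start) if "to" in words[start:] else len(words)
        return Intent("AddToShoppingList", {"item": " ".join(words[start:end])})
    return Intent("Unknown")


# Made-up addresses and a morning timestamp, purely for the demonstration.
ctx = Context(home="12 Elm St", work="500 Main St", now=datetime(2016, 3, 5, 8, 30))
print(text_to_intent("Alexa, what does my commute look like?", ctx))
print(text_to_intent("Alexa, add olive oil to the shopping list", ctx))
```

The point of the sketch is that the spoken phrase never mentions an origin, a destination or a time. The back end fills those in from context, which is where the saved mechanical and cognitive steps come from.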
Voice based computers seem to have limits on what can be accomplished. However, when one truly analyzes the exact results we are looking for, the vast majority of the time they can simply be answered by a “Yes” or “No”. When the back end systems correctly analyze your volition and intent, countless steps of mechanical and cognitive load are eliminated. We have just recently entered an epoch where all the right technologies have converged to make the full promise of an advanced voice interface truly arrive.
The Secret “Project Doppler”, Yap, Evi And IVONA
Amazon surprised just about everyone in technology when the secret ‘Project Doppler’ or ‘Project D’ from the Lab126 offices in Silicon Valley and Cambridge, Mass. was announced on November 6, 2014. This was an outgrowth of a Kindle e-book reader project that began in 2010 and of the voice platforms Amazon acquired from Yap, Evi, and IVONA.
The original premise of Echo was to be a portable book reader built around a very well designed and powerful omnidirectional microphone and a surprisingly good WiFi/Bluetooth speaker. This humble mission soon morphed into a far more robust solution that is just now taking form for most people.
Beyond the power of the Echo hardware is the power of Amazon Web Services (AWS). AWS is one of the largest virtual computer platforms in the world. Echo simply would not work without this platform, because the local electronics in Echo are not powerful enough to parse and respond to voice commands without the millions of processors AWS has at its disposal.
Since the 2014 limited release of Echo, Amazon has added the Echo Dot, a hockey puck sized version of the Echo designed to connect to existing speakers since it only has a small speaker, and the Amazon Tap, a smaller, portable version of the Echo with dual stereo speakers. They all basically work the same.
The evolution of Echo has been measured, with new features added slowly. Today (Feb 4, 2016 version 3077 software update) Echo can:
- Order items from Amazon from both previous orders and the creation of new orders.
- Create shopping lists for use at other stores, not just Amazon.
- Read books from your Kindle library using Text To Speech.
- Play Audio books from your Audible library.
- Give sports updates with details such as scores and upcoming schedules for NFL, NBA, MLS, MLB, NHL, WNBA, NCAA, and other American sports.
- Present weather and news from a variety of sources, including local radio stations, NPR, ESPN, TuneIn.
- Play music from the owner’s Amazon Music account, with built in support for the Pandora and Spotify streaming music services, and play streaming services such as Apple Music and Google Play Music from a phone or tablet.
- Support IFTTT (If This, Then That), voice-controlled alarms, timers, and shopping and to-do lists.
- Act as a personal workout trainer using Skills settings.
- Access Wikipedia articles.
- Respond to your questions about items in your Google calendar.
- Integrate with Philips Hue, Belkin Wemo, SmartThings, Insteon, and Wink, with anticipated support for Countertop by Orange Chef, Scout Alarm, Garageio, Toymail, MARA, and Mojio.
- Give traffic reports.
- Hail an Uber car.
- Tune a guitar.
- Run a growing set of developer “Skills” built on the ASK (Alexa Skills Kit) quasi-API.
For many this is more than enough to justify a ~$150 purchase of the original Echo. These features were enough for me to have an Echo in the Kitchen, Master Bathroom (not in the toilet area) and in a car. I was fortunate to have one in early December 2014 and found it useful in ways I could not have predicted. In each setting there are unique and sometimes unexpected use cases.
Echo In The Kitchen
In the kitchen Echo has become indispensable in creating family shopping lists. There is no way I would go back to the haphazard method it replaced. We all simply call out to Alexa to add X to the shopping list throughout the week, with a sort of frenzy of scouring the refrigerator, freezer and cabinets in a group effort just before we go shopping. Echo forms an unexpected connection with non-Amazon physical grocery stores. It is one of the largest oversights from Amazon not to have an easy way to convert all or part of a shopping list into an Amazon order. I am certain this deficit will be addressed soon.
Echo also is very useful in cooking situations. Timers, timers and timers, I never used so many timers and frankly should have. Measurement conversions and recipe adjustments and recommendations have also been very useful.
Echo in the kitchen also is a centerpiece for the family, with my two boys asking just as many questions of Alexa as of me: “Alexa, why is the sky blue?”. We have a sort of game where we see who can answer a question faster than Alexa. I win quite a bit but my children have caught up. I see Echo as being as important as any encyclopedia or school text book for education. This extends to the books that I have narrated during breakfast and some other meals, which seems to captivate all of us and promote questions and ideas from the minds of curious boys.
Echo In The Bathroom
Let’s face it, even the most carefree person spends quite some time in the bathroom getting “ready” for the day. The majority of us (56%) take between 11 and 30 minutes getting ready. That means 30% of Americans spend over a week getting ready in the bathroom each year. My wife and I use this time to set to-do lists and shopping lists, listen to books and music and, most indispensable to me, take notes for ideas and send an occasional Tweet. Since December 2014, Alexa and Audible have read me ~45 books while I was getting ready. It is a powerful learning tool. These are 45 books that I would likely have had to read at other times, perhaps conflicting with other things I wanted to do.
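The “over a week each year” figure is easy to sanity check. Here is a small Python sketch, assuming the same routine every day of a 365 day year. The `days_per_year` helper is just a name made up for this illustration.

```python
MINUTES_PER_DAY = 24 * 60


def days_per_year(minutes_per_day: float, days: int = 365) -> float:
    """Convert a daily routine in minutes into full 24-hour days per year."""
    return minutes_per_day * days / MINUTES_PER_DAY


for minutes in (11, 30):
    print(f"{minutes} min/day -> {days_per_year(minutes):.1f} days per year")

# Daily minutes needed to reach one full week (7 days) per year.
threshold = 7 * MINUTES_PER_DAY / 365
print(f"one week per year = {threshold:.1f} min/day")
```

Under those assumptions, 11 minutes a day adds up to about 2.8 days a year and 30 minutes a day to about 7.6 days. Anyone spending roughly 28 minutes or more a day passes the one week mark.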
I have been able to hack together an inelegant way to use Echo to read Quora postings, and I can say this has really extended my consumption of the work of the most amazing minds in the world on Quora. My method is an ugly hack that I hope will be made pretty in the future.
I also set the Nest thermostat via the IF (IFTTT) app to a pleasant temperature in the morning, as well as setting the final overnight temperature. Although I do not yet have an Echo in the master bedroom, I do set light music in the evening that fills the room with more than adequate sound.
Echo In The Car
I am a researcher and this does compel me to try the unexpected and extreme. Thus I wanted to test just how effective and useful Echo would be in the car. This was in January 2015 and Echo was still not in wide distribution and I am rather certain I was one of the pioneers here.
Echo in the car became absolutely indispensable, perhaps even more so than in the Kitchen. For obvious reasons driving requires a minimum of distractions. I use Echo for the same things I do at home but in many ways it is more effective. I use Echo to read a lot of Quora, news and books while on the road. The hack I use to post to Twitter is useful when a stream of ideas prompts me.
The few hours a week I spend on the road have allowed me to access 1000s of Quora postings, daily headlines, a few hundred Tweets and about 31 books since December 2014.
The experiment became permanent the moment I was able to get another unit to install in the car. The car has an AT&T hotspot built in, and Echo has only added about $15 per month with all of my use. I also have a built in 120 VAC plug and found a sort of OK location for Echo in the car.
Alexa, Google Or Siri? I Say Yes To All
After living and working with Echo in three primary locations for over a year, I am fully convinced that Echo and the many products I see coming in the future will dominate our homes and vehicles. I think it is important to add that I am also a very heavy user of Apple’s Siri and wrote quite a bit about it here on Quora. I see Echo and Siri as similar but quite different on a few fundamental levels. Siri is very useful and quite indispensable to me for answering text messages and composing short to medium amounts of dictated text. In fact, about 40% of this posting is being composed using Siri. To me it will never be an either/or situation but a rich mixture of uses that each system does better. I also use Google Voice to some degree, mostly for searches. In fact, all of the searches I used for this posting were conducted using Google Voice.
As I mentioned above, even Amazon is in early days with Echo, given the inability to convert a shopping list to a live Amazon order. When Siri was released I wrote quite a bit about the commerce aspect of Siri and speech in general. I wrote of the prospect of Siri (or any voice based system) becoming a transaction completion system. In 2011 I wrote about a future that I was certain Apple would adopt far faster than it has. I was also writing about Apple getting into payments at the time and knew that the product that became Apple Pay had to be released first. This finally took place in October 2014. Apple has made many improvements and updates to Siri but thus far has fallen behind Amazon Echo in some ways. I am rather certain that with Apple Pay 4.0 and the really large changes to Siri that I am predicting, Apple will perhaps surpass Amazon. We already see a hint of this with the newest version of Apple TV.
APIs Are The Future Of Voice Interfaces
The speech technology as it stands in March 2016 is quite rich and useful, and if the evolution were to stop today, it would have already made a permanent place in my life and the life of my family. But of course the innovation will not stop here. There is a huge future ahead with the possibility of open APIs from Amazon, Apple and Google that will extend the usability of Speech to far more extended use cases. I wrote about the prospects of how APIs can be the most defining element of a voice interface in 2011 with the release of Siri. The Ontology of information accessed by a voice interface will continue to expand with higher momentum in 2016. Additionally, access to controlling everything from lights to coffee makers via a voice interface will also gain higher momentum in 2016. Thus far none of the three large voice interfaces have open and useful APIs yet, but this will change, and Amazon is well on the way with the Alexa Skills Kit.
Specimen of the Alexa Skills flow pattern.
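To make the flow a little more concrete, here is a minimal Python sketch of the receiving end of a skill: a JSON request naming an intent comes in, and a JSON response carrying the text to be spoken goes back. The envelope follows the general shape of the JSON request and response described for custom skills, but treat the details as an approximation. The `OrderPizzaIntent` name, the `Size` slot and the `handle_request` and `speak` functions are hypothetical and made up for this example. No real order is placed.

```python
import json


def handle_request(event: dict) -> dict:
    """Turn a skill request (already parsed from JSON) into a spoken reply."""
    request = event.get("request", {})
    if request.get("type") == "IntentRequest":
        intent = request.get("intent", {})
        if intent.get("name") == "OrderPizzaIntent":  # hypothetical intent name
            slots = intent.get("slots", {})
            size = slots.get("Size", {}).get("value", "medium")  # hypothetical slot
            return speak(f"Okay, ordering a {size} pizza.")
    return speak("Sorry, I did not understand that.")


def speak(text: str) -> dict:
    """Wrap plain text in a response envelope for the voice service."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }


# A hand-written example request, as the voice service might send it.
event = {
    "version": "1.0",
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "OrderPizzaIntent",
            "slots": {"Size": {"name": "Size", "value": "large"}},
        },
    },
}
print(json.dumps(handle_request(event), indent=2))
```

The speech recognition and intent matching happen in the cloud before the request arrives. The developer only deals with structured data, which is why an API of this kind can extend a voice interface to almost any service.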
Education, Commerce And Advertising Are The “Killer Apps” For Amazon Echo
Education: It is very clear to me after a year with Echo that Education is a fundamental but yet to be discovered aspect of this technology. So much so that I think in the next five years many students will find this type of voice interface to be almost common as a study guide. Amazon and Google are in a unique position to leverage their huge inventory of indexed information into a speech based expert system powered by advanced machine learning technology.
Commerce: Commerce seems like the logical reason for the inception of Echo, but as I mentioned, it was not the fundamental driving force during the development of “Project Doppler”. It was a Kindle extension taken to an extreme. Thus it is not surprising that Amazon is just now catching up to the commerce element of Echo. You can order items of course today, but the experience needs to evolve. The external connection to Domino’s Pizza shows just how deeply APIs can go.
Advertising: The Domino’s Pizza relationship presents a new advertising model that could very well change the entire industry. Very much like the Pay Per Click model Google refined in the early 2000s, the Pay Per Voice Order model can become a dominant platform. Amazon, Google and Apple can control this future where APIs and integrated payment systems from these companies complete transactions with just about any merchant for just about any product. Over 75% of sales on Amazon.com are from marketplace sellers and not Amazon. Amazon has great experience with being an advertising and payments platform and I am certain Echo will define this new advertising model.
“Alexa, order a large cheese pizza”
These are the base foundations of why voice interfaces will thrive moving forward. If even one dominates short term it will be a revolution. Clearly the commerce and payments aspect is perhaps the largest element short term. Today you can sit in your kitchen and say “Alexa, order a large pizza from Domino’s”. It will be delivered in about 30 minutes and it will have already been paid for. Imagine how many mechanical and cognitive steps this six word command replaces. I have studied it; there are over 200 “Deming” steps.
It may seem that Amazon will dominate this space. I firmly assert (and have for the last 3 years to clients) that after Apple Pay, voice commerce is the largest payments opportunity in this epoch. Simply put, it may very well be someone in a garage right now building the foundation for this new payments and commerce revolution. The “winners” today have no guarantee of being “winners” in this new voice commerce world. This is not retail payments, nor is it in-app or web based payments, for many fundamental reasons.
Voice commerce, history will show, is an entirely new and unique paradigm. I have studied every aspect of voice interfaces with attention to commerce and payments for over 20 years and we are at the precipice of something revolutionary. I have identified a road map of over 200 points on how new payment paradigms, new non Amazon hardware and new businesses will develop in this ecosystem. Thus far not a single payments startup or legacy company is positioned to truly understand this opportunity, and some may even be edging away from it.
Echo, A Novelty?
I am certain that in the era of punch cards being the primary user interface to a computer, a keyboard seemed like a novelty. I can fully attest to the fact that in the era when the keyboard was the primary user interface, the mouse was considered a novelty. Finally, the touch screen was certainly seen as a novelty in the era of the BlackBerry micro keyboard.
Demonstration of IBM’s punch card interface at a time when the keyboard and display were considered a novelty.
The keyboard trumped the punch card. The mouse coexisted with the keyboard. The touchscreen made the mechanical keyboard redundant. In the future, voice interfaces will make all of these things redundant for a growing number of tasks. We will extract ourselves from the mechanical and cognitive processes of tasks we perform today and use our voices to command these very powerful systems in the cloud. These systems will do all of the work completing the tasks and report to us when they are done.
Self-driving cars will require a voice interface for control and interaction. There is little doubt this technology will be critically important. Very much like how I use Echo in the car today, I am certain this too will become a popular way to consume information in this setting.
Speech based interfaces allow you to multitask and do other things. Unlike the paradigm of using a device and reading a screen, using your voice is liberating and increases productivity to an extent simply not possible with mechanical interfaces alone.
The computer as we know it has been shrinking and in many ways will disappear and become a nexus connecting us via speech. There will still be touch screens and perhaps VR headsets, even perhaps holographic ephemeral displays in the next 10 years. However voice interfaces will continue to grow and supplement these experiences.
One year of Echo use has taught me that its unabashed confidence in being a stand alone, always on voice interface is its greatest strength. Unlike an appendage to a TV, smartphone or web browser, Echo truly defines the physical space where it lives. It is remarkable how quickly I have become used to entering a room and instantly addressing Echo with a request. I can see clearly where this will lead us all.
The ultimate destination for the voice interface will be an autonomous humanoid robotic system that, very much like in a science fiction movie, will fundamentally interact with us via voice. This feature will not be an afterthought but the centerpiece of the humanoid robots we will certainly have in the future.
What we call a computer will fundamentally and profoundly change. Our great-grandchildren will marvel at the keyboard and mouse and perhaps even the touch screen. They will see these user interfaces as a historic novelty.
We have traveled so very far since Audrey in 1952. Echo, a novelty? No, Alexa is us just getting started.