Voice Recognition

From New Media Business Blog


Speech recognition is the process of capturing spoken words using a microphone or telephone and converting them into a digitally stored set of words. Speech recognition is gradually becoming more widely used for a range of activities, such as controlling a vehicle, sports, and health care, and it makes direct, routine tasks more efficient. Year by year, the accuracy of this technology continues to improve for both individual and business use.

Speech recognition is often used interchangeably with voice recognition, which works by analyzing the features of speech that differ between individuals and is aimed at identifying the person who is speaking. Although this article focuses mainly on speech recognition technology, we also cover voice recognition technology, since it is a growing technology with many functions. This article contains more information regarding the differences between speech and voice recognition technology.


Origins and Evolution of Voice Recognition Technology

Speech Recognition in the 1950s

Can you guess when speech/voice recognition technology was first introduced to the world? Back in the 1950s, Bell Laboratories designed the "Audrey" system, which recognized digits spoken by a single voice.

Since the technology was first introduced, the greatest barriers to the speed and accuracy of speech recognition have been limited computer speed and power. With the average CPU now at or above a Pentium III and RAM levels of over 500 MB, accuracy levels have reached 95% and transcription speeds have increased to over 160 words per minute.

Evolution of Speech Recognition Technology

This image shows the evolution of speech/voice recognition technology:

1960s - Introduction of IBM's "Shoebox" machine, which could understand 16 words spoken in English.

1970s - Introduction of Carnegie Mellon's "Harpy" speech-understanding system, which could understand 1011 words. This is the approximate vocabulary of an average three-year-old.

1980s - Speech recognition vocabulary jumped to several thousand words, and the technology had the potential to recognize an unlimited number of words. One major reason was a new statistical method known as the "Hidden Markov model" (discussed below). Rather than using templates for words and looking for sound patterns, the HMM considers the probability of unknown sounds being words. However, the devices were still slow. Speech recognition started being used in areas such as medical care and in the home (e.g. dolls that would speak and seem to "understand" the child).

1990s - Dragon entered the scene and introduced its first product, Dragon Dictate, which was followed by Dragon Naturally Speaking seven years later. However, both of these still required the speaker to pause after every single word, so it wasn't ideal from a speed and usability standpoint.

2000s - Speech recognition capabilities were built into Windows Vista and Mac OS X.

2010s - Apple launched Siri, which marked a revolution in the world of speech recognition, especially due to the increasing trend towards the use of smartphones.

To read more about the evolution of speech recognition technology, check out this article: Speech Recognition Through the Decades: How We Ended Up With Siri.

The Technical Aspect

Types of Programs

Small Vocabulary / Many Users These systems are ideal for automated telephone answering. Even if users speak with a great deal of variation in accent and speech patterns, the system will understand them most of the time. However, usage is limited to a small number of predetermined commands and inputs, such as basic menu options or numbers.

Large Vocabulary / Limited Users These systems work best in a business environment where a small number of users will work with the program. While these systems work with a good degree of accuracy (85% or higher with an expert user) and have vocabularies in the tens of thousands, you must train them to work best with a small number of primary users. The accuracy rate will fall drastically with any other user.

Large Vocabulary Continuous Speech Recognition / Many Users With all the development and advancement in speech recognition, technology is moving toward systems that can maintain a large vocabulary database and work with a large number of users. The systems are also trained to identify accents with a relatively high level of accuracy. Examples of such applications include Google Voice Search, Siri, and name dialing. Systems are now also being trained and tested for dynamic translation from one language to another.

How It Works

Hidden Markov Model

When we speak, we create vibrations (analog waves) in the air. An analog-to-digital converter (ADC) translates this analog wave into digital data that computers can understand. To do this, the system digitizes the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes separates it into different bands of frequency. Next, the signal is divided into small segments as short as a few hundredths of a second. The program then matches these segments to known phonemes in the appropriate language. A phoneme is the smallest element of a language; phonemes represent the sounds we make and put together to form meaningful expressions. The program examines phonemes in the context of the other phonemes around them. It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases, and sentences. The program then determines what the user was probably saying and either outputs it as text or issues a computer command.
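The front end of this pipeline (digitize, then segment into frames a few hundredths of a second long) can be sketched in Python. This is a toy illustration under assumed values: the sample rate, frame length, and the synthetic sine-wave "speech" are all invented here, and real recognizers go on to compute spectral features rather than working with raw samples.

```python
# Minimal sketch of the front-end steps described above: digitize
# (here: a synthetic signal standing in for ADC output) and segment
# the signal into short fixed-length frames for phoneme matching.
import math

SAMPLE_RATE = 8000          # samples per second (assumed ADC rate)
FRAME_MS = 20               # frame length: a few hundredths of a second

def frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Split a digitized signal into short fixed-length segments."""
    size = sample_rate * frame_ms // 1000
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]

# A one-second 440 Hz tone standing in for digitized speech.
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]

segments = frames(signal)
print(len(segments))        # 50 frames of 20 ms each
```

Each of these 160-sample frames is what the recognizer would then score against its phoneme models.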

Today's speech recognition systems use powerful and complicated statistical modelling systems. These systems use probability and mathematical functions to determine the most likely outcome. The two models that dominate the field today are the Hidden Markov Model and Neural Networks.


Hidden Markov Model

  • Each phoneme is like a link in a chain, and the completed chain is a word.
  • The chain branches off in different directions as the program attempts to match the digital sound with the phoneme that's most likely to come next.
  • The program assigns a probability score to each phoneme, which is based on its built-in dictionary and user training.
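The chain-of-phonemes idea above can be illustrated with a toy HMM decode. Everything here is invented for illustration (three phonemes for "cat", made-up transition and emission probabilities, and uppercase frame labels standing in for acoustic observations); real systems use thousands of context-dependent states learned from data.

```python
# Toy Hidden Markov Model decode: states are phonemes, observations
# are per-frame acoustic labels, and Viterbi search finds the most
# probable phoneme chain. All probabilities are made up.

states = ["k", "ae", "t"]                       # phonemes of "cat"
start_p = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans_p = {                                      # chain mostly moves forward
    "k":  {"k": 0.3, "ae": 0.7, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.4, "t": 0.6},
    "t":  {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit_p = {                                       # P(frame label | phoneme)
    "k":  {"K": 0.8, "A": 0.1, "T": 0.1},
    "ae": {"K": 0.1, "A": 0.8, "T": 0.1},
    "t":  {"K": 0.1, "A": 0.1, "T": 0.8},
}

def viterbi(obs):
    """Most likely phoneme chain for a sequence of frame labels."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            p, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-1][prev][1])
                for prev in states
            )
            layer[s] = (p, path + [s])
        V.append(layer)
    return max(V[-1].values())[1]

print(viterbi(["K", "A", "A", "T"]))   # ['k', 'ae', 'ae', 't']
```

Note how the "chain branches off" at each frame: every phoneme keeps only its single best-scoring predecessor, which is what makes the search tractable.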

Neural Network Model

  • Digitalize the speech that you want to recognize.
  • Compute features that represent the spectral-domain content of the speech.
  • A neural network (also called an ANN, multi-layer perceptron, or MLP) is used to classify a set of these features into phonetic-based categories at each frame.
  • Viterbi search is used to match the neural-network output scores to the target words. This determines the word that was most likely uttered.
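The per-frame classification step (step 3 above) can be sketched as a tiny one-layer network producing scores over phonetic categories. The two features, the category names, and the hand-picked weights below are all assumptions for illustration; a real MLP has hidden layers and trained weights.

```python
# Sketch of frame classification: a linear layer plus softmax turns a
# per-frame feature vector into scores over phonetic categories.
# Weights are hand-picked for illustration, not trained.
import math

# One weight vector per category over two assumed features:
# (frame energy, zero-crossing rate).
WEIGHTS = {
    "vowel":     [4.0, -2.0],
    "fricative": [1.0,  3.0],
    "silence":   [-3.0, -1.0],
}

def classify(frame_features):
    """Softmax scores over phonetic categories for one frame."""
    logits = {c: sum(w * x for w, x in zip(ws, frame_features))
              for c, ws in WEIGHTS.items()}
    z = max(logits.values())                      # for numerical stability
    exps = {c: math.exp(v - z) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# High energy, low zero-crossing rate: looks like a vowel.
scores = classify([0.9, 0.1])
print(max(scores, key=scores.get))   # vowel
```

The resulting per-frame score vectors are what the Viterbi search in step 4 stitches together into whole words.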

Latest Trends

Nuance Communications

Nuance Communications[1]

Nuance Communications is an American multinational computer software technology corporation, headquartered in Burlington, Massachusetts, United States, that provides speech and imaging applications.

Nuance was founded in 1994 as a spin-off of SRI International's Speech Technology and Research (STAR) Laboratory to commercialize the speaker-independent speech recognition technology developed for the US government at SRI. Based in Menlo Park, California, Nuance deployed its first commercial large-scale speech application in 1996. Many years later, Nuance partnered with Apple Inc. to create the Siri application.

Today, Nuance is the company behind a wide range of products. Its current business products focus on server and embedded speech recognition, telephone call steering systems, automated telephone directory services, medical transcription software and systems, optical character recognition software, and desktop imaging software.

Dragon Naturally Speaking

Dragon Naturally Speaking Software [2]

Dragon Naturally Speaking[3] is the world's best-selling speech recognition software. It turns your talk into text and can make virtually any computer task easier and faster, from capturing ideas and creating documents, to email and searching the web, to using simple voice commands to control many of the popular programs you use every day at home, at work, and beyond.

Dragon Naturally Speaking, created by Nuance, can impact many areas of life, from healthcare applications to driving. It creates efficiency by replacing manual input with quick voice dictation. People with visual impairments are able to dictate ideas, create documents, surf the internet, and issue commands to the computer solely through their speech. This software is the base of many existing devices, and will be the base for many more applications and software soon to be developed for social benefit and daily use.

Dragon Medical

Dragon Medical Demo [4]

Dragon Medical provides clinical documentation solutions for over 300,000 physicians. This product captures the physician narrative to document care in the EHR, anywhere, any time, and on any device. It spans both front-end and back-end speech recognition solutions for clinical narrative, and enables documentation at the point of care. The stored voice file can then be accessed from the database for future reference or analysis by physicians, patients, and other medical staff. Dragon Medical allows physicians to spend more time with patients and to accurately document the patient's story at the point of care using just their voice. It also allows doctors to speed up the consultation process, letting them see more patients.

This is a good example of pre-packaged software that allows healthcare providers (medical clinics and hospitals) to implement an integrated system that improves the efficiency of medical care. This pre-packaged software would likely not need to be customized, so many clinics and hospitals would not have to develop such a documentation system in-house or by outsourcing the development of a new application. One foreseeable challenge is the software's ability to process complex medical terms.

Dragon Drive

Dragon Drive - by Nuance Mobile [1]

Nuance integrates its technology with the Cloud and vehicles' on-board capabilities to create distraction-free driving with Dragon Drive voice command[1]. Over 90 car models are currently equipped with Nuance Dragon Drive, and BMW has already chosen Dragon Drive as its voice recognition software.

Dragon Drive provides the convenience of smartphone connectivity and lets drivers focus on the road by keeping their eyes up and their hands on the wheel. This can reduce the number of accidents caused by distracted driving (e.g. texting while driving). Dragon Drive also makes search and navigation easier: it proves more beneficial than most GPS units, since directions can be pre-loaded into the device from your smartphone, and requests can be voiced instead of manually input. Dragon Drive can also manage multiple suppliers and devices; it is not tied to one service. Another benefit is that you can access other content, such as the weather, networking sites, and Facebook, and send text messages, using just your voice.

Of course, Dragon Drive comes at a high cost for the software, as well as the cost of installing the device in your vehicle. Some brands of cars, such as BMW, already have it installed in some of their models, and more cars will adopt it soon. There is another thing to consider with Dragon Drive: although you use your voice for controls, there is still a screen. Without a doubt, drivers will still have their eyes drawn to the screen once in a while, which can impact their safety to an extent.

VoicePod - Completely Hands Free Voice Control for Your Home [2]

Voice Pod

Voice Pod[1] allows you to use your smartphone or tablet devices to control your home systems, hands-free, using just your voice. This app can understand everyday words and interpret sentences. This is a great example of using speech recognition technology to make your life more convenient and comfortable. There is a tabletop version, which is in the form of a device the size of a baby monitor. There is also a mobile app version, available only for iOS (Android version coming soon).

Voice Pod's features include letting you control lights, shades, temperatures, AV systems, security systems, and more. Its benefits include:

  • Simplicity: It does not take a very tech-savvy person to learn how to use this technology.
  • Wireless integration: It currently uses Zigbee wireless communications (Wi-Fi version coming soon) to communicate with your home's systems.
  • Hands-free: Although this product may not be a necessity for able-bodied individuals, it can be of great benefit for seniors or people with disabilities, who may not be able to use their hands or might find it easier to use their voice to control their home's systems.

Coding by Voice - Using Python

Coding by Voice Using Python - Python Conference 2013 [2]

Tavis Rudd is an example of an individual who discovered how to code by voice using the programming language Python. It started when he developed "Emacs Pinkie", a repetitive strain injury (RSI) that caused his hands to go numb. After several months of vocabulary tweaking and duct-tape coding in Python and Emacs Lisp, he developed a program that allowed him to code using speech. The system enabled him to code faster and more efficiently using his voice than he ever had by hand.

Tavis attended the Python Conference 2013 and demonstrated the software he created. He uses obscure and odd sounds and words, such as "erp" and "blurp". The demonstration shows off a few of the commands he created, but in reality there are over 2,000 personal commands that Tavis made up. He used Dragon Naturally Speaking voice recognition software on Microsoft Windows. Since Dragon can be scripted with Python, he hacked it together with Dragonfly, a Python speech recognition extension library by Christo Butcher. The code is released for download. You can follow Tavis Rudd on GitHub: Tavis Rudd GitHub, or Twitter: @tavisrudd.
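The core idea behind voice coding, mapping short, distinctive spoken tokens to emitted text, can be sketched in plain Python. The command words and snippets below are invented for illustration (Rudd's actual setup uses Dragon plus the Dragonfly library, and his real command set is far larger):

```python
# Plain-Python stand-in for a voice-coding command table: short,
# acoustically distinctive spoken tokens (like Rudd's "erp"-style
# words) map to code text that gets "typed" into the editor.
# All command names and snippets here are hypothetical.

COMMANDS = {
    "deffo":  "def ():\n    pass",     # start a function definition
    "forp":   "for  in :",             # start a for loop
    "imp os": "import os",             # import a module
}

def dictate(utterance):
    """Translate a sequence of recognized command words into code text."""
    out = []
    for token in utterance:
        out.append(COMMANDS.get(token, token))  # unknown words pass through
    return "\n".join(out)

print(dictate(["imp os", "deffo"]))
```

Odd-sounding tokens are chosen deliberately: they are unlikely to collide with ordinary English words, which keeps the recognizer's error rate low.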

Skype: Live Voice Translation

Skype: Live Voice Translation Demo [1]

Microsoft released the first beta of real-time Skype Translator for Windows 8[1] before the end of 2014, coming close to near real-time translation of multiple languages in a Skype call. Currently there is functional instant translation between English and German and between English and Chinese. Microsoft demonstrated the English/German translation at the Code Conference in 2014, where a presenter interacted in English with a German speaker with minimal difficulty.

Skype [2]

There are numerous benefits to this kind of real-time technology. First, it is a great advantage for business. When individuals participate in a conference call with a business contact who speaks a foreign language, Skype will be able to help reduce the language barrier and bridge geographic and linguistic boundaries, giving businesses global and cultural opportunities. Furthermore, live translation will support long-distance learning as more languages are developed and implemented. Educating individuals over Skype will feel almost identical to interacting with someone face-to-face. By eliminating distance, location costs, and traveling time, this option will become much more appealing.

However, one cannot ignore the possibility of errors concerning slang, language structure, and possibly voice recognition itself, due to microphone quality and background noise. The software will continuously need to be upgraded to minimize these errors. As well, Skype Live Translator will most likely not be a free service, but one you will need to invest in by paying a premium service fee.


Most smart device users try their device's built-in speech recognition features. However, the majority of these users do not use speech recognition as the primary mode of operating their devices. In this section, we will explore the reasons why people are currently not using speech recognition technology, which is actually designed to make people's lives more convenient. Furthermore, we will discuss privacy issues raised by speech recognition technology.

Proportion of information workers using speech recognition [3]

This is one of the reasons why people are not using speech recognition: Burnistoun S1E1 - Voice Recognition Elevator - ELEVEN! (youtube)[4]

Accuracy & Performance

The unsatisfactory accuracy and performance of current speech recognition systems limit their effectiveness. Voice or speech recognition systems sometimes even complicate users' operation of their devices.


The performance of a speech recognition system is determined by its accuracy and speed. Accuracy is determined by the rate of successful responses to commands, or other similar measurements, and will be discussed in the next section. The speed of a speech recognition system depends on the speed of its processors (i.e. clock rate), the size of its memory, and the algorithm adopted to interpret and execute commands.
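For transcription systems in particular, one common accuracy measurement (not named in the article, but widely used in the field) is word error rate: the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A small sketch:

```python
# Word error rate (WER): edit distance over words between a reference
# transcript and the recognizer's hypothesis, normalized by reference
# length. Lower is better; 0.0 means a perfect transcription.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A classic misrecognition: every reference word before "speech today"
# is wrong, giving a WER of 1.0.
print(wer("recognize speech today", "wreck a nice speech today"))
```

A claimed "95% accuracy" figure roughly corresponds to a WER of 0.05 on the test material used.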


iPhone 4S Siri does not understand Japanese English [5]

As mentioned above, speech recognition technology has developed rapidly, which raised concerns about the accuracy of its systems. As the vocabulary and range of operations that voice recognition systems can perform grow significantly, the technology to handle this growth is also progressing.

High accuracy in interpreting the user's voice is one of the main attributes of effective speech recognition applications and software. Some factors associated with accuracy are internal, relating to the system's software and hardware:

  • program efficiency
  • choice of algorithm
  • noise reduction efficiency
  • size of vocabulary
  • range of responses
  • microphone quality
  • processing units
  • speaker-dependent or speaker-independent systems

Meanwhile, accuracy is also attributed to external factors that are uncontrollable by the voice recognition system itself:

  • speed of speech (isolated, discontinued or continuous speech)
  • type of speech (read or spontaneous speech)
  • syntax
  • punctuation
  • clarity of speech
  • volume of speech
  • language and accent
  • background noise, distortion, overlapped speech, number of speakers, etc.

How to Increase Accuracy

iPhone4S Siri (Japanese English)

Do you notice the differences between the video above and the video in the previous section?

To increase or ensure higher accuracy, a user can try some or all of the following:

  • minimize background noise and number of speakers
  • speak slowly and clearly
  • speak sentences with fewer syntax, grammar, and punctuation errors
  • use a package (or system) that can interpret your language or accent
  • update the voice recognition software when available

Users will choose to use speech recognition systems if their performance is more satisfying, i.e. faster and more accurate.

Please complete this survey to help us gain some insight into BUS 466 students' preferences and usage of speech recognition applications and software: Voice Recognition Software Usage Survey and view the results afterwards.


Speech recognition technology is designed to enable humans to communicate with computers in our natural language. The technology itself does no harm to our privacy, but it can be used in ways that infringe upon our privacy without our knowledge. A few cases have revealed that speech recognition technology helped businesses keep track of their customers or employees.

Google Chrome

Google Chrome [1]

Google has included speech recognition functions on its websites and apps since 2008, which has eased the work required of our fingers on phone virtual keypads and keyboards. Google Voice Search works in real time by receiving the user's voice and processing it on the server before sending a response back to the user.

However, a speech recognition specialist in Israel, who is also a Google Chrome user, revealed that Google Chrome may turn on your microphone in the background. If this is the case, sounds would be sent to Google's highly efficient voice recognition servers, allowing Google to process the recordings, including your conversations, music, TV shows, and any other activity around your computer. Once you have permitted Google Chrome to use your microphone, sounds around you may be picked up by Google's speech recognition server, and no one except Google knows how the sound data are used. After Google investigated this bug, it announced that it had been fixed and claimed that users will be clearly notified whenever their microphone is connected to Google. Google Chrome's "bug" may be threatening to businesses that require their information to remain confidential in order to stay competitive. [2]

Facebook Music and TV Recognition

An update of the mobile Facebook app announced around June and July added a feature that allows the app to listen to your surroundings and determine what music you are listening to or what TV show you are watching. Users can enable this feature if they want after receiving the update. Although Facebook said that the feature will be activated only if the user enables it, and only while the user is updating their status, it is still possible that there will be "bugs" like the one discovered in Google Chrome.

Wearable Tech in Workplace and Outside

Google Glass being used while golfing

Wearable technology is becoming more popular in the workplace, especially in the manufacturing industry. These wearable devices are designed for more efficient and effective communication between supervisors/managers and production-line workers by allowing easy broadcasting and quick responses from workers. They are not designed for monitoring employees' behaviour or invading their privacy. Click here to read more: Wearable Technology In The Manufacturing Workplace.

By incorporating speech recognition technology, these wearable devices can be used to track workers' performance and virtually every single action they take while working. However, employers should not use wearable technology to monitor employees or intrude on their privacy; they should use it solely for enhancing communication.

This wearable technology can also be used outside the office, for example during field assignments. Wearable devices such as wristbands or Google Glass can improve people's ability to do physical activities or their performance when playing sports. Google Glass, for example, can aid with range and swing calculations when playing golf. It can also keep track of progress when running or riding a bike.[3] The voice recognition functionality in these devices will become more widely used as more functions are discovered.

The Future of Speech Recognition Technology

Ed Grabianowski in his article How Speech Recognition Works sums it up quite nicely: "At some point in the future, speech recognition may become speech understanding. The statistical models that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words. Although it is a huge leap in terms of computational power and software sophistication, some researchers argue that speech recognition development offers the most direct line from the computers of today to true artificial intelligence. We can talk to our computers today. In 25 years, they may very well talk back."

Although the Skype translator function has made strides in areas such as universal translation (see "Latest Trends" section above), the technology still has a long ways to go. Seamless translation between different languages is difficult because of the existence of dialects, accents and slang.

When it comes to cyber security, speech recognition, and specifically voice recognition technology, is growing rapidly. The MIT professor who invented typed passwords now thinks they are a nightmare, and recognizes the need for a move towards a "pass-voice". This would mean that access to information, restricted areas, or your personal accounts (e.g. email) would be through voice recognition technology that detects your unique voice. In fact, Nuance (no surprise there) has software for this called Nuance VoicePassword.

Furthermore, Biometrics Research Group Inc. predicts that the voice recognition market will reach $2.5 billion by 2015, mainly in the banking sector, as the use of biometrics to enhance banking security fuels this growth. Read more in this article by Biometric Update. The group also predicts that speech recognition technology will gross revenues of USD $20.1 billion. In conclusion, there is a lot of potential in speech recognition (and voice recognition) technology, and the performance of software, applications, and devices that use it is improving rapidly!


Estes, C.E. (2014, May 3). The Guy Who Invented Computer Passwords Thinks They're A Nightmare. Retrieved from http://www.gizmodo.com.au/2014/05/the-guy-who-invented-computer-passwords-thinks-theyre-a-nightmare/.

Grabianowski, E. (n.d). How Speech Recognition Works. Retrieved from http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm.

Pinola, M. (2011, Nov 2). Speech Recognition Through the Decades. Retrieved from http://www.techhive.com/article/243060/speech_recognition_through_the_decades_how_we_ended_up_with_siri.html.

Tobias, M.W. (2014, Jan 26). Here's How Easy It Is For Google Chrome To Eavesdrop On Your PC Microphone. Retrieved from http://www.forbes.com/sites/marcwebertobias/2014/01/26/heres-how-easy-it-is-for-google-chrome-to-eavesdrop-on-your-pc-microphone/.

  1. Google Chrome
  2. Here's How Easy It Is For Google Chrome To Eavesdrop On Your PC Microphone
  3. Google Glass