Speech Recognition

From New Media Business Blog


Speech recognition is the conversion of spoken words into text. It is based on statistical models of spoken language, which store the words themselves and the ways in which they can be strung together; richer models increase the accuracy of a speech recognition system. These models have become more accurate over the years for both speaker-independent and speaker-dependent systems, driven especially by proponents such as Google and Nuance Communications.

Speech recognition systems are widely used for personal and business purposes despite recurring weaknesses that have kept the technology from reaching critical mass. But with computing power and data becoming less costly and more widespread, speech recognition is expected to reach the Plateau of Productivity on Gartner's Hype Cycle within 2 to 5 years.



Advancements in speech recognition have accelerated dramatically over the past decade. The field began with systems that could only understand digits and has led to a world where computers can understand billions of words from various languages.

1950s – 1960s: Recognition of Digits

The first speech recognition systems can be traced back to the early 1950s. These systems could only understand digits. In 1952, Bell Laboratories created the Audrey system, which recognized digits spoken by a single voice. In 1962, IBM created the Shoebox machine, which could understand 16 spoken English words.

1970s – 1980s: Early Advances of Speech Recognition

Apple's Knowledge Navigator Introduction Video

Major funding from the United States Department of Defense supported the DARPA Speech Understanding Research (SUR) program from 1971 to 1976. One of its major projects was Carnegie Mellon's Harpy speech-understanding system.[1] Harpy could understand 1,011 words, approximately the vocabulary of an average three-year-old. Harpy was significant because it introduced a more efficient search approach called beam search, which processes the utterance part by part, generates different interpretations, evaluates them according to programmed criteria, keeps only the best, and finally chooses the closest interpretation.

In the 1980s, the Hidden Markov Model (HMM) was applied to speech recognition. There was also a growing number of commercial applications that could learn 1,000 to 5,000 words. For one, Dragon Dictate was released by Dragon Systems in the 1980s as the first speech recognition program for DOS-based PCs. It could recognize only individual words, spoken one at a time.[2]

In 1987, Apple CEO John Sculley introduced a concept considered futuristic for its time. Referred to as the Knowledge Navigator, the device focused on communication between a person (the user) and a personal intelligence agent. In the introduction video, the Knowledge Navigator was shown recognizing speech perfectly.

1990s – Early 2000s: Increased Accessibility

During this period, speech recognition systems became more accessible to the general public. Faster processors and cheaper computing power made speech recognition software viable for ordinary consumers.

In 1990, Dragon Systems launched the first consumer speech recognition product, Dragon Dictate, for $9,000. Seven years later it evolved into Dragon NaturallySpeaking, which could recognize continuous, natural speech at about 100 words per minute.[1] The program had to be trained for 45 minutes, and the new system cost $695.

In 1996, BellSouth introduced VAL, a dial-in interactive voice recognition system that provided information to the user based on commands given over the phone.[1]

Mid 2000s – Present

Apple's Siri

During this period, Google emerged as the main proponent of speech recognition. The Google Voice Search app addressed past issues by relying on Google’s extensive data infrastructure.[3] It also used data from search queries to predict what the user is saying. Google succeeded for two reasons. Firstly, cellphones and mobile devices had very small keyboards, which created an incentive for better input methods such as speech recognition. Secondly, Google had the cloud data centers needed to run large-scale data analysis and to store a large number of human-speech examples.

In 2010, Google updated Voice Search on Android phones so that the software could record users' voice searches and produce a more accurate speech model. Today, Google systems can understand more than 230 billion words.[1]

Apple’s Siri is an intelligent personal assistant that uses natural language processing and relies on cloud-based processing to give a contextual reply. The Knowledge Navigator video, which debuted in 1987, is believed to depict technology as it would be in September 2011; Siri launched in October 2011, only a month after the predicted date.

Speech Recognition Systems

The following types of Speech Recognition Systems are based on the Speech Recognition Models:

  • Speaker dependent systems are designed around a specific speaker.[4] They are generally more accurate for the principal speaker, but much less accurate for other speakers, because they are trained on the principal speaker's voice.
  • Speaker independent systems are designed to accommodate a multitude of speakers.[4] Without training, adaptive systems usually start as speaker independent systems. However, training techniques may eventually be used to adapt to a specific speaker and increase overall recognition accuracy.
  • Isolated word recognizers require single utterances to be made. This type of system was very prominent in the early development of speech recognition, primarily because of the limited computer processing power available then.
  • Connected word systems minimize the long pauses between utterances and even allow separate utterances to be processed together.
  • Continuous recognition systems recognize more difficult and complex speech. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content.
  • Spontaneous speech systems handle truly natural-sounding speech. They can detect a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters.

Speech Recognition Models

Beam Search

Beam search was first seen in the Harpy speech recognition system in 1976. It is a variant of best-first search that keeps memory use manageable in large systems that cannot store every possible result. Beam search processes language part by part: at each step it generates different interpretations, evaluates them according to pre-programmed criteria, and keeps only a predetermined number of the best-matched results. At the end, it chooses the closest interpretation.
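The pruning step described above can be sketched in a few lines. This is a minimal illustration, not Harpy's actual implementation: the candidate words, the probability tables `ACOUSTIC` and `BIGRAM`, and the `FLOOR` value are all made-up assumptions.

```python
import math

# Hypothetical acoustic probabilities for each candidate word at each step.
ACOUSTIC = [
    {"recognize": 0.6, "wreck": 0.4},
    {"speech": 0.5, "beach": 0.5},
]
# Hypothetical word-pair probabilities; unseen pairs fall back to FLOOR.
BIGRAM = {("recognize", "speech"): 0.9, ("wreck", "beach"): 0.3}
FLOOR = 0.05


def beam_search(acoustic, bigram, beam_width=2):
    """Return the best word sequence, keeping only beam_width candidates per step."""
    beams = [((), 0.0)]  # (words so far, log-probability)
    for candidates in acoustic:
        expanded = []
        for words, logp in beams:
            for word, p_acoustic in candidates.items():
                p_pair = bigram.get((words[-1], word), FLOOR) if words else 1.0
                score = logp + math.log(p_acoustic) + math.log(p_pair)
                expanded.append((words + (word,), score))
        # Prune: only the best-scoring partial interpretations survive.
        expanded.sort(key=lambda beam: beam[1], reverse=True)
        beams = expanded[:beam_width]
    return list(beams[0][0])


print(beam_search(ACOUSTIC, BIGRAM))
```

A smaller `beam_width` saves memory but risks discarding an interpretation that would have scored best later on.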

Hidden Markov Model

The Hidden Markov Model (HMM) is a statistical model in which the state is not directly visible but its output is. In speech recognition, the hidden states are the sounds or words the speaker intended, and the visible output is the audio that the system observes. Rather than simply using templates for words and looking for sound patterns as earlier speech recognition systems did, an HMM considers the probability that unknown sounds are particular words, an approach that has largely carried on to today. It estimates the joint probability of the entire sequence of speech that the user generates.
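As a rough illustration of how an HMM assigns a probability to a whole sequence, the sketch below runs the standard forward algorithm on a toy model. The two hidden states, the observation labels "hiss" and "tone", and every probability are invented for the example.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of the observation sequence, summed over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model: hidden states are sounds, observations are coarse audio labels.
STATES = ("s", "iy")
START_P = {"s": 0.8, "iy": 0.2}
TRANS_P = {"s": {"s": 0.5, "iy": 0.5}, "iy": {"s": 0.1, "iy": 0.9}}
EMIT_P = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

print(forward_probability(["hiss", "tone"], STATES, START_P, TRANS_P, EMIT_P))
```

A recognizer would compare such probabilities across competing word models and prefer the model that best explains the audio.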

Three-Tiered Integration Model

Speech recognition operates by obtaining data from a statistical model, and there are statistical models for each spoken language. The input speech from a user is first matched against the statistical models of the spoken language, and the recognized text is then returned to the user.

Based on this premise, a statistical model requires the following three levels to function:

  1. Acoustic Model: An acoustic model is formed by taking audio recordings of speech and the transcriptions of what was said. By combining the two sources, a statistical representation of each sound is created.
  2. Language Model: The language model tries to improve recognition further by compiling data about the probable sequence of words. It resembles the way Google Search estimates the probable sequence of words as a user types.
  3. Lexicon: The lexicon contains the mapping between the written representations and the pronunciations of words and phrases. Each speech recognition system has an internal lexicon that specifies which words in a language can be recognized or spoken. The lexicon specifies how words are expected to be pronounced, using characters from a single phonetic alphabet.

Google’s speech recognition systems have widely adopted the Three-Tiered Integration Model.[2]
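How the three levels cooperate can be shown with a toy decoder. This is a sketch only: it assumes the acoustic model has already turned the audio into phoneme groups, and the `LEXICON` entries, `BIGRAM` probabilities and the `decode` helper are hypothetical, not part of any real system.

```python
# Lexicon: pronunciation (tuple of phonemes) -> words that share it.
LEXICON = {
    ("k", "ao", "l"): ["call"],
    ("m", "aa", "m"): ["mom"],
    ("t", "uw"): ["two", "too", "to"],
}
# Language model: hypothetical probability of a word given the previous word.
BIGRAM = {
    ("<s>", "call"): 0.5,
    ("call", "mom"): 0.4,
    ("mom", "too"): 0.3,
    ("mom", "to"): 0.1,
    ("mom", "two"): 0.05,
}


def decode(phoneme_groups, lexicon=LEXICON, bigram=BIGRAM, floor=0.01):
    """Pick, for each pronunciation, the word the language model finds most likely."""
    previous, words = "<s>", []
    for phonemes in phoneme_groups:
        candidates = lexicon[tuple(phonemes)]  # lexicon lookup
        best = max(candidates, key=lambda w: bigram.get((previous, w), floor))
        words.append(best)
        previous = best
    return words


print(decode([("k", "ao", "l"), ("m", "aa", "m"), ("t", "uw")]))
```

The lexicon alone cannot choose between "two", "too" and "to"; the language model breaks the tie using the preceding word.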

Underlying Technologies

Voice Biometric Authentication

Voice biometric authentication, or voice recognition, recognizes a specific voice and responds only to that user. It aims to determine who is speaking rather than to interpret what a user is saying. The most common use of voice biometrics is in security, for verification and identification purposes. Verification matches, or authenticates, a particular voice against a voice that was previously programmed into the system. Identification, on the other hand, determines the identity of an unknown speaker. Voice recognition is actually a subset of speech recognition, but many people make the mistake of using the two terms interchangeably.
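The difference between verification and identification can be sketched as follows. The example assumes each voice has already been reduced to a short numeric "voiceprint" vector; the vectors in `ENROLLED`, the 0.95 threshold and the use of cosine similarity are illustrative assumptions rather than how any particular product works.

```python
import math

# Hypothetical enrolled voiceprints (feature vectors) per user.
ENROLLED = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def verify(claimed_user, sample, threshold=0.95):
    """Verification: does the sample match the one voice the speaker claims to be?"""
    return cosine(ENROLLED[claimed_user], sample) >= threshold


def identify(sample, threshold=0.95):
    """Identification: which enrolled voice, if any, is the unknown speaker?"""
    best = max(ENROLLED, key=lambda name: cosine(ENROLLED[name], sample))
    return best if cosine(ENROLLED[best], sample) >= threshold else None


sample = [0.88, 0.12, 0.28]
print(verify("alice", sample), verify("bob", sample), identify(sample))
```

Verification is a one-to-one comparison, while identification compares the sample against every enrolled voice.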

Speech Sampling

Similar to data sampling, speech sampling is the conversion of speech and speech signals into data. Speech signals, which carry only human speech, have recently been used for various purposes, such as company analytics or research. An MIT researcher, Deb Roy, used speech sampling to analyze how his infant son learned to speak English, as presented in his TED talk.
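To make "converting a signal into data" concrete, the sketch below samples a continuous signal at regular intervals and quantizes each sample to a 16-bit integer. A pure 440 Hz tone stands in for speech, and the 16 kHz sample rate is an assumed, typical value.

```python
import math


def sample_signal(signal, duration_s, sample_rate_hz):
    """Read the continuous signal at evenly spaced instants."""
    count = round(duration_s * sample_rate_hz)
    return [signal(i / sample_rate_hz) for i in range(count)]


def quantize_16bit(samples):
    """Map amplitudes in [-1.0, 1.0] to signed 16-bit integers."""
    return [max(-32768, min(32767, round(s * 32767))) for s in samples]


def tone(t):
    """Stand-in for a speech waveform: a 440 Hz sine wave."""
    return math.sin(2 * math.pi * 440 * t)


pcm = quantize_16bit(sample_signal(tone, 0.01, 16000))
print(len(pcm), pcm[:5])
```

The resulting list of integers is the kind of data that later stages, such as an acoustic model, would analyze.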

Speech-to-Speech Translation

Speech translation is a process in which natural speech is recognized, processed and translated. This technology is currently present in many smartphone applications. Scientists are working on perfecting a universal speech recognition translator: any language can be spoken into the translator, and the words can be translated into any other language in both speech and text formats. Though the technology is far from perfect, it may become possible for computers not only to recognize voice commands, but also to understand what the user is saying and communicate back.

Speech-to-Text

Speech-to-text is a major aspect of speech recognition, most commonly referred to in the mobile phone market. Speech-to-text processing, also known as transcription, is a very common application of the technology. It is largely built upon the acoustic and language models as well as the HMM.

Language-Based Functions: Functions such as “dial” and “search” are most common among mobile phones. Language models have been built around recognizing speech and executing an action such as “call”, instead of relying on very precise voice commands matched to specific functions. For example, a user can use the phrase “Call Mom” rather than “Dial 604-333-8999”. This requires larger language models, since voice commands can be less strict, but it adds convenience, since users can make more natural requests.
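A loose command like “Call Mom” can be mapped to the same action as a strict one with a small lookup. This is a minimal sketch: `ACTION_WORDS`, `CONTACTS` and `parse_request` are hypothetical names, and a real system would use a statistical language model rather than keyword matching.

```python
# Hypothetical address book; the number is the one used in the example above.
CONTACTS = {"mom": "604-333-8999"}
# Several natural phrasings map to the same underlying function.
ACTION_WORDS = {
    "call": "dial",
    "dial": "dial",
    "phone": "dial",
    "search": "search",
    "find": "search",
}


def parse_request(utterance):
    """Return (action, argument) for the first action word found, else None."""
    words = utterance.lower().split()
    for index, word in enumerate(words):
        if word in ACTION_WORDS:
            action = ACTION_WORDS[word]
            argument = " ".join(words[index + 1:])
            if action == "dial":
                # Resolve a contact name to a number when one is known.
                argument = CONTACTS.get(argument, argument)
            return action, argument
    return None


print(parse_request("Call Mom"))
print(parse_request("Dial 604-333-8999"))
```

Both requests resolve to the same dial action, which is the convenience the paragraph above describes.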

Hands-Free Voice Control

Dragon Drive Interface
Kinect Voice Commands

Voice control is the use of specific commands to trigger a system to perform programmed actions. As such, speech recognition technology is a mandatory component used to recognize and confirm a user’s selections. Voice control is often used to improve multi-tasking abilities. Examples that employ voice control are:

  • Nuance Communications: Nuance’s Dragon Drive! allows users to manage in-car options by voice. Connecting with the main server, the device provides location-based services, entertainment and music search, speech-to-text, as well as weather and traffic applications.
  • Xbox 360 Kinect: The game console allows users to connect with the device through key words, such as “Xbox”, and recognizes which area of the interface the user would like to go to. For certain games, it also allows in-game control of various functions using specific voice commands.

Audio Mining

Audio mining is software that converts unstructured conversations into structured output, or metadata. Output files can be categorized, analyzed and used by the company. Audio mining is most commonly used at call centres or customer service centres. Analysis of recorded conversations used to be the most common approach, but thanks to technological advances, real-time analysis is now growing in popularity. The six most common purposes of audio mining are:[5]

  1. Root Cause Analysis: Aims to understand the reasons why people call.
  2. Trend Analysis: Identifies large trends of both anticipated and unexpected reasons, positive and negative reactions or responses for customer calls.
  3. Emotion Detection: Understands a caller’s emotional state.
  4. Talk Analysis: Measures basic statistical aspects of a contact centre, including caller and agent talk periods, the amount of silent time, hold and transfer times, and talking over.
  5. Script Adherence: Monitors whether company agents are following company scripts, communicating the required information and avoiding inappropriate statements to customers.
  6. Quality Assurance: Identifies whether agents are following the Company Code and conversations that require management’s attention.
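As an illustration of the talk analysis item above, the following sketch computes talk time per speaker, silent time and talk-over time from a list of speech segments. The segment format `(speaker, start_s, end_s)` and the sample call are assumptions made for the example; the talk-over figure assumes at most two people speak at once.

```python
def talk_analysis(segments, call_length):
    """Summarize a call from (speaker, start_s, end_s) speech segments."""
    talk_time = {}
    for speaker, start, end in segments:
        talk_time[speaker] = talk_time.get(speaker, 0) + (end - start)

    # Sweep the segments in start order to measure time covered by any speech.
    covered, cursor = 0, 0
    for _, start, end in sorted(segments, key=lambda seg: seg[1]):
        if end > cursor:
            covered += end - max(start, cursor)
            cursor = end

    return {
        "talk_time": talk_time,
        "silence": call_length - covered,
        # Seconds counted for both speakers at once.
        "talk_over": sum(talk_time.values()) - covered,
    }


call = [("caller", 0, 10), ("agent", 8, 20), ("caller", 25, 30)]
print(talk_analysis(call, call_length=30))
```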

Current Uses

Industrial and Business Applications


Many software programs and applications with speech recognition functionality have been developed. Most of these cater to people living with physical, visual, hearing, or learning disabilities. A well-known software package in this industry is Nuance Communications’ Dragon NaturallySpeaking for PC (or Dragon Dictate for Mac).[6] The package adds a voice interface to the computer so that users can do hands-free computing, controlling the computer through simple voice commands. This is particularly useful for people who are physically challenged or have learning disabilities that make it difficult for them to form words and text. Another program is J-Say Pro, which reads words from PC applications and repeats them as speech.[6]


Voxtech Phraselator

The military has made extensive use of speech recognition technology. Soldiers use a mobile device called the Phraselator to communicate with tourists, local citizens, and prisoners of war. The Phraselator is a PDA-sized, one-way communicator that translates full English phrases (not individual words) into another language.[1] Phrases are pre-recorded by native speakers and stored as MP3 files on the device. When the user speaks an English phrase into the microphone, the device plays the matching voice file. Unlike most speech recognition applications, the Phraselator is not speaker dependent, meaning it does not have to be trained to a specific person’s voice. The Phraselator is also being adopted by police departments as a tool for crowd control situations.

Simulation training for pilots and air traffic controllers of the military has also been integrated with speech recognition. Originally, the training would require another person to role-play in order to complete realistic simulations. This limitation on training times and costs is eliminated through use of speech recognition.[2] The training software is able to detect voice commands given by the trainees and accordingly generate simulations and performance assessments.

Helicopters and fighter jets are now flown with voice-controlled aircraft systems.[3] The speech recognition systems are speaker independent, meaning they do not require pilots to pre-train their voices. Through voice commands, pilots can use communication and navigation functions, such as setting radio frequencies and showing the flight display. On fighter jets, selecting weapons and setting their release parameters can also be done through voice commands.[4] These systems were implemented to simplify piloting, giving pilots more time to observe the external environment and less time on the panels or instruments. As a result, pilot safety and efficiency increased.


In the healthcare industry, speech recognition is most widely used for medical documentation.[5] Physicians record medical data through front-end or back-end speech-to-text technologies. With front-end applications, words appear on screen immediately, and the physician edits the report while dictating. Back-end speech recognition records the physician’s dictation in an audio file while simultaneously generating a text file of the words. The files are sent to an editor or transcriptionist for editing before being returned to the physician for a final review.

GeekoSystems International recently developed voice-powered wheelchairs.[6] The company combined its SafePath wheelchair with its GeekoChat verbal interaction software. This product gives people who are physically unable to control regular wheelchairs, but who can still speak, the freedom to move around on their own.


Ford SYNC's Voice Control System

Some automobile manufacturers have developed in-vehicle systems that integrate multiple functions and make them accessible through voice control. One such example is Ford SYNC, which runs on an operating system designed by Microsoft. Using voice commands, the driver can make hands-free phone calls, control the volume, select music, control the temperature, use GPS, ask for the weather, ask for the vehicle health report, and even control select mobile applications.[1]

Television Broadcast

Closed captioning technology is constantly being developed to improve its accuracy. It is most extensively used for television broadcasting. Many pre-recorded shows now provide closed captioning; however, broadcasters struggle to do the same for live shows.[2] Speech recognition comes into play through software that provides speech-to-text functionality. The software is trained to a specific person’s voice and relays his or her words as text while they are spoken. Because this speaker dependent system recognizes only one voice, it is difficult to apply across all television programs, where more than one voice is often heard at a time. To work around this limitation, broadcasters may have someone in a quiet room repeat what is said on a television show so that captions can be made; this is especially useful for live shows.[3] The same technology is used in the justice system for real-time court reporting.[4]

Call Centres

On the consumer level, speech recognition is most widely used in call centres. Interactive Voice Response (IVR) technology is used to interact with callers.[5] When a call reaches the call centre, the IVR application will ask the customer for his or her information. If the IVR application is able to retrieve all required information, the customer may not need to interact with a representative. Otherwise, the application will transfer the caller to an appropriate representative and provide the employee with the information previously gathered from the caller. In more advanced uses, such as the system employed by Ceridian,[6] all phone calls are automatically classified and categorized according to the subtopics discussed within the conversation. It also identifies the customer service skills that the employee used. Additionally, difficult customers or underachieving employees can be determined through patterns. With this information, Ceridian was able to generate analytics and improve its call centre processes. The system can also monitor how well employees adhered to the new processes.
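The routing decision an IVR application makes can be sketched as follows. This is a simplified illustration: the required fields and the `handle_call` helper are hypothetical, and the recognized answers are assumed to come from a speech recognizer that returns `None` when it could not understand the caller.

```python
# Hypothetical information the IVR must collect before it can help on its own.
REQUIRED_FIELDS = ("account_number", "reason")


def handle_call(recognized_answers):
    """Decide whether the caller can be served automatically or needs a person."""
    gathered = {
        field: recognized_answers[field]
        for field in REQUIRED_FIELDS
        if recognized_answers.get(field)
    }
    if len(gathered) == len(REQUIRED_FIELDS):
        return {"route": "self_service", "info": gathered}
    # Hand over whatever was gathered so the caller need not repeat it.
    return {"route": "representative", "info": gathered}


print(handle_call({"account_number": "12345", "reason": "billing"}))
print(handle_call({"account_number": "12345", "reason": None}))
```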

Personal Applications and Entertainment

Personal Intelligence Agents

Vlingo's Statistics on Speech Recognition Usage

Speech recognition is heavily used in personal intelligence agents. Examples include Apple's Siri, Google Now, and Samsung S-voice. A natural language user interface is used to understand what the user says, including verbs, phrases, and clauses, and to delegate requests to a set of web services.[7] These programs are also known as personal assistants because they simplify day-to-day activities. Some of their functions include:

  • Instant searches
  • Looking up nearby activities
  • Posting to social media accounts
  • Launching an application
  • Sending texts and emails
  • Setting meetings, reminders, alarms
  • Finding directions

Getting an answer is as simple as saying a question like, “Is United Airlines flight 318 on time?” Your words will appear as you speak, and for short and easy requests, like the status and departure time of your flight or what the weather is like, the answer will be received almost immediately.[8] Siri is the most popular among smartphones and tablets; as of the iOS 6 release in 2012, it can understand and speak nine different languages.

According to research conducted by Vlingo, speech recognition technology on the two major smartphone operating systems is used most often for voice dialing on Android devices and for text messaging on iPhones. Other popular uses include web searching and navigation.

Smart Homes

With the advent of smart devices and even home appliances, it makes sense that homes are becoming “smarter” as well. A smart home is a complex environment where heterogeneous smart devices and appliances are connected to each other to provide various “smart” services.[9]

With the ability to recognize vocal commands, motions, gestures, and living preferences, smart homes can directly or indirectly interact with people to provide a more personalized living space. Intelligent home appliances will be automated to better fit the consumer’s needs by identifying their personality, emotion, and language through speech and voice recognition technology.[10] Also, individual profiles and preferences will be stored and applied automatically upon voice or touch commands. Voice recognition can identify and authenticate users, so home security can be enhanced by granting access and control only to specified users. Moreover, voice control can be used to trigger services such as setting the temperature of the air conditioner, initiating a grocery order online or locking the doors. Also, smart home devices can self-diagnose malfunctions and notify users when maintenance is required. Imagine a home with no more overcooked steaks, no more dirty dishes, and no more wet clothes out of the dryer. Life with smart devices will revolutionize the way a home operates.

Smart home technology is still in its early stages but will soon have a tremendous impact on our daily lives. The need for an effective and secure management mechanism is increasing as smart homes get closer to reality. Because the home can predict user actions, services may be triggered without user consent through unintentional gestures or verbal expressions. This may be the greatest obstacle to user acceptance of smart home technology.

Smart TV

Samsung's Smart TV with Voice Control feature

Smart TV is a multifunction television created by Samsung. Essentially, it is like a tablet computer, with access to social media, Skype, the internet, music apps and more. With speech recognition technology integrated in the Smart TV, users can control the TV using voice commands.[11] This allows users to activate the TV and access certain menus and functions with simply the sound of their voice. Internet-based services and applications, such as Facebook or Twitter, can also be activated by voice commands. Voice control works best when only one person is speaking to the TV in a loud, clear voice. To address privacy issues, Samsung employs industry-standard security safeguards and practices (including data encryption) to secure customers’ personal information and prevent its unauthorized collection or use. Should the TV owner choose not to use these features, the microphone can be disabled.[12]

Examples of Smart TV Commands:

  • “Hi TV”: Activates the voice control feature
  • “Hi TV Turn On”: Turns on the television set
  • “Hi TV Change to Channel ___”: Changes the channel to the requested channel
  • “Hi TV Web Browser”: Opens browsing software
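A command set like this one is typically handled by checking for the wake phrase first and then matching the rest of the utterance. The sketch below illustrates that idea only; the `interpret` function and the action names it returns are hypothetical and do not reflect Samsung's software.

```python
import re

WAKE_PHRASE = "hi tv"


def interpret(utterance):
    """Map recognized text to an (action, argument) pair, or None without the wake phrase."""
    text = utterance.lower().strip()
    if not text.startswith(WAKE_PHRASE):
        return None  # ignore speech not addressed to the TV
    command = text[len(WAKE_PHRASE):].strip()
    if command == "":
        return ("activate_voice_control", None)
    if command == "turn on":
        return ("power_on", None)
    if command == "web browser":
        return ("open_browser", None)
    match = re.fullmatch(r"change to channel (\d+)", command)
    if match:
        return ("change_channel", int(match.group(1)))
    return ("unknown", command)


print(interpret("Hi TV Change to Channel 7"))
```

Requiring the wake phrase keeps ordinary conversation in the room from triggering commands by accident.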

Computer Gaming

In the computer gaming industry, speech recognition technology is integrated with voice control in industry-leading computer games. Gamers can command and control their characters or reload weapons by voice.[13] Another use of the technology is to interact with other players through speech-to-text: a gamer’s speech is transcribed and sent as messages to other players. This provides a more realistic experience and allows gamers to be truly immersed in the game. Examples of computer games that enable this functionality include Call of Duty, World of Warcraft, and Mass Effect.


Weaknesses are the focal point of many programmers, who try to utilize advanced computing and processing power to counteract common issues. Weaknesses include: noise and accuracy, overlapping speech, convenience and time, as well as capacity and space.

Noise & Accuracy

Trouble with Understanding Accents

One of the greatest challenges in speech recognition today is filtering out noise while accurately recording the intended message.[1] Because the microphone transmits every sound it captures, background noise must be absent for the recorded data to be accurate. This is problematic, as it limits the use of speech recognition to areas that are free of noise. Homonyms are another source of accuracy problems: where two words sound alike, speech recognition software cannot always determine which word the user intended, resulting in misrecognition. Accents are another cause of error; thick accents may be difficult for the system to recognize.

An example of a flawed interpretation by speech recognition:

User: I really admire your analysis.
Speech Recognition: I really admire Urinalysis.
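Errors like this one are commonly quantified with the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognized text into the intended text, divided by the number of intended words. WER is a standard metric brought in here for illustration; the source does not name it. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                       # reference word missing
                current[j - 1] + 1,                    # extra hypothesis word
                previous[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("I really admire your analysis",
                      "I really admire urinalysis"))
```

For the example above, one word is substituted and one is lost, so two of the five intended words are wrong.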

Overlapping Speech

In many real-life situations and environments, multiple conversations happen simultaneously. However, speech recognition technology is not yet capable of recognizing different voices and users at one time. Thus, when speech recognition is used in the presence of multiple users, interruptions in each other’s speech result in poor-quality outcomes.

Convenience & Time

Even when speech recognition accurately recognizes user inputs, proper punctuation must also be ensured. A considerable amount of time is required to proofread the input and add proper punctuation. Existing software, like Dragon Dictation Systems, requires users to voice their punctuation, such as “period” or “comma”, as part of their speech.[1] In addition, the setup of a voice recognition system can be time consuming. Before use, speaker dependent systems require a one-hour enrollment process. Users must also be trained to use the system before adoption.
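Voiced punctuation can be handled by replacing the spoken command words with the marks they name. The sketch below is illustrative only; `SPOKEN_PUNCTUATION` and `apply_spoken_punctuation` are hypothetical names, and only single-word commands are covered.

```python
# Hypothetical mapping from spoken command words to punctuation marks.
SPOKEN_PUNCTUATION = {"period": ".", "comma": ","}


def apply_spoken_punctuation(words):
    """Attach voiced punctuation to the preceding word."""
    output = []
    for word in words:
        mark = SPOKEN_PUNCTUATION.get(word.lower())
        if mark and output:
            output[-1] += mark
        else:
            output.append(word)
    return " ".join(output)


print(apply_spoken_punctuation("Hello comma world period".split()))
```

A simple mapping like this cannot tell a command from ordinary use of the same word (for example "the Jurassic period"), which is one reason proofreading remains necessary.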

Capacity and Space

Speech recognition takes up a significant amount of hardware capacity.[1] The software requires a large stored vocabulary, which in turn requires fast computer processing. Error rates naturally increase as the vocabulary grows. Running statistical models, backtracking and generating word searches are just some of the necessary functions of speech recognition. These require quick processing speeds and may slow down the computer. In addition to operating space, speech recognition systems may also take up physical space in the office.


Because speech recognition requires interaction between users and computers, the technology's role in deciphering user input introduces greater uncertainty. A number of risks contribute to the current challenges in the widespread adoption of speech recognition and prevent the technology from reaching critical mass.

Privacy Risk

Vocal input is a mandatory requirement of speech recognition. This gives rise to privacy issues, as voices can be easily overheard, a problem that does not exist with typing. Additionally, large organizations that maintain a vast database of customer “voiceprints” put these consumers in a vulnerable situation, especially with the increased adoption of biometric authentication. Siri was found to store users’ “voiceprints” at its offsite speech-to-text processing facility, while Dragon Dictation scans and uploads files to Nuance’s servers.[2] Google also has within its system an electronic key that tracks and links user utterances to their Google account.[3] Since laws regarding the Internet are hardly stringent, the police and other government departments may gain easy access to stored “voiceprints” in the near future.

Bill C-30

Also termed the Protecting Children from Internet Predators Act, Bill C-30 would allow telecom providers in Canada to monitor user data and allow authorities to access personal information without warrants or court documents. Since telecom companies would be able to gather information and transfer it to other departments, companies or people, personal information would pass through many different systems, thereby increasing vulnerability for potential information breaches. Due to high opposition, the government has pulled it off the House of Commons agenda for further review.

Adoption over Old Technology

Improvement in productivity depends on both the user group and the group’s ability to use new technology effectively. However, speech recognition technology may not always increase productivity, even once the user has become familiar with it. For example, in a study of native English speakers with good typing skills, the fastest users spoke an average of 107 uncorrected words per minute, which fell to approximately 25 words per minute once errors were corrected.[4] A different group with similar skills, using a keyboard and mouse, completed almost three times as many words per minute as the voice-only group. Participants observed that they were usually aware when a typing error occurred, but were much less confident about noticing when a speech recognition error occurred. As a result, users must check speech recognition output more diligently for errors. It is a risk for organizations to adopt unfamiliar technology based solely on expected improvement.

Cost of Investment

Speech recognition technology is costly, especially in its initial investment. It is a risk for companies to invest so heavily in a system that may still have prominent accuracy issues. Maintenance costs, as well as future software and hardware upgrades, must be evaluated, since the technology is still being fine-tuned. Speech recognition systems increasingly require improved networking, and running the technology wirelessly would incur higher expenses. Enhancements to meet the needs of the end user will also be expensive. Indirect costs include the cost of matching the technology to current company infrastructure and human capital. If the technology does not align with the company’s current infrastructure and strategic goals, it will fail. A company must consult with professionals before implementing speech recognition, rather than installing it because of hype and trends. All stakeholders must commit to putting in the time and effort to use the system effectively.

Future of Speech Recognition

The mass availability of voice recognition apps alone is an indicator of the potential reach of the technology in the future. The future will not only entail controlling systems by voice commands; ultimately, programmers would like to transcribe the underlying request of a user’s verbal language, a goal which has only been sparsely met with the advent of Siri and similar technologies. With the constant improvements in data and computing power, it is expected that accuracy will increase as statistical models increase in size, and that sound systems will better adapt to background noise levels by discarding irrelevant material.

Gartner's Hype Cycle

2012 Gartner Hype Cycle

The Gartner Hype Cycle represents the emergence, maturity and expected stability of various technologies.[5] It predicts the level of hype or adoption that certain technologies will attain in the near or distant future. The Hype Cycle can be a helpful tool for determining whether a company should begin implementing a technology based on future popularity.

According to the 2012 Gartner Hype Cycle, speech recognition sits on the Slope of Enlightenment and is nearing the Plateau of Productivity. Nuance Communications has been one of the main drivers of speech recognition in the past ten years, becoming largely successful in applying the technology to various business and personal uses without any mass marketing of its speech recognition systems. With Nuance’s recent work on widely available personal assistant technologies such as Apple’s Siri, speech recognition has moved closer to its plateau. In 2 to 5 years, speech recognition is expected to be widely used and accepted; by then, most, if not all, of the issues plaguing the technology today may have been overcome. The benefits of speech recognition are beginning to crystallize and become more widely understood. Better methodologies and practices are developing, though some companies remain conservative and cautious about the technology.

Natural Language Processing

Natural language processing deals with the interactions between computers and human languages. It involves natural input speech from a speaker and the computer’s ability to infer the meaning and words of the user. Since natural language has minimal pauses between words, speech recognition technology has to first segment the words, then process the speech and finally generate understanding of the language.
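The segmentation step can be illustrated with a toy example. The Python sketch below works on unspaced text rather than audio, and uses a simple dictionary-based dynamic programme; it is not how any particular speech recognizer works, and the `segment` function and the small `VOCABULARY` set are hypothetical names invented for this illustration.

```python
from typing import List, Optional, Set


def segment(text: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split `text` into vocabulary words, or return None if no split exists."""
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                best[end] = best[start] + [word]
                break
    return best[len(text)]


# Hypothetical vocabulary; a real system would use a far larger language model.
VOCABULARY = {"set", "a", "reminder", "for", "noon"}

words = segment("setareminderfornoon", VOCABULARY)
print(words)
```

Only after the word boundaries are recovered can later stages attempt to interpret the request, which is why segmentation comes first in the sequence described above.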

Recognition of Meaning

Currently, many speech recognition systems focus on the actual words that are voiced. The future, though, offers the opportunity to build on misspoken words and to use intelligent systems so that a user’s speech can be accurately predicted before it is actually spoken. This Semantic Web will understand the meaning behind a request rather than just the words of a request. Conversational interaction models are used to understand user speech without any constraints. However, even the most advanced speech recognition systems have trouble with this, since the data within statistical and language models does not account for underlying meanings and interpretations perfectly. Fortunately, since devices can house thousands of applications, many systems have been built to draw on resources from other services or applications. Through this combination of services, the recognition of meaning has been greatly enhanced. Natural language processing can benefit by taking data from calendars or maps to plot important events or set reminders. Using the intelligence gained through other companies can also create more relevant results for a user.

Personalized Recognition

Through more natural language processing, companies such as Nuance have tried to develop systems that remember and recognize past conversations. Tailoring recognition to specific speakers allows for the creation of speaker profiles that the systems can refer back to. A link is made between a system and user utterances based on personalized recognition models. Google and Nuance Communications have experimented with and even implemented the technology in some of their voice services. With increasing data speeds and processing power, personalized recognition could lead to less stringent, more natural commands used in an everyday fashion. By recalling past conversations, users will be able to use commands such as “Let’s take a look at the research article I found yesterday” instead of having to request the article by its specific name and author.
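As a rough illustration of how a remembered history could resolve such a command, the Python sketch below looks up items by kind and relative day. Everything in it is hypothetical: the `HistoryItem` record, the `resolve` function, the sample titles and the fixed "today" date are assumptions for the example and do not describe any vendor's implementation.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class HistoryItem:
    title: str
    kind: str        # e.g. "research article", "song"
    found_on: date   # day the user came across the item


# Assumption: only two relative-day expressions are understood.
DAY_OFFSETS = {"today": 0, "yesterday": 1}


def resolve(history: List[HistoryItem], kind: str,
            relative_day: str, today: date) -> List[HistoryItem]:
    """Return the items of `kind` that the user found on the referenced day."""
    target = today - timedelta(days=DAY_OFFSETS[relative_day])
    return [item for item in history
            if item.kind == kind and item.found_on == target]


today = date(2012, 11, 20)  # fixed so that the example is repeatable
history = [
    HistoryItem("Beam search in speech decoding", "research article", date(2012, 11, 19)),
    HistoryItem("Weekly playlist", "song", date(2012, 11, 19)),
    HistoryItem("Acoustic models overview", "research article", date(2012, 11, 12)),
]

matches = resolve(history, "research article", "yesterday", today)
print([item.title for item in matches])
```

The speaker profile does the work here: the command itself never names the article, so the system can only answer because it kept a record of what this user did and when.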

Emotional Speech Recognition

Emotion is a normal and fundamental component of verbal communication, which makes understanding emotion a large part of natural language processing. Systems today can usually recognize three to five emotions, ranging from dissatisfaction and frustration to stress and anxiety.[6] However, the future brings an opportunity for natural conversation interpretation which could recognize many more than five emotions.

An emotional speech recognition system aims to extract three features:

  • Pitch – Males and females present different voice pitches when reacting with different emotions
  • Speech Energy – The output level of sound in every speaker utterance
  • Formant Frequencies – The resonant frequencies of the speech signal, which a speaker modifies by changing the relative position of the tongue and lips.

These three features form the basis of a computer’s understanding of emotion.[7] As pitch, speech energy and formant frequencies may differ between males and females, the speech recognition system must also identify the gender of the speaker. By adding emotion to the mix, speech recognition will become a viable part of Web 3.0 and allow for more personalized recommendations based on emotion and gender.
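Two of the three features listed above, speech energy and pitch, can be sketched with the Python standard library alone. The example below is a minimal illustration under stated assumptions: it analyses a synthetic 200 Hz tone instead of recorded speech, the 8000 Hz sample rate and the 60 to 400 Hz search range are arbitrary choices, and the autocorrelation method is one common way to estimate pitch rather than the method used by the cited systems. The third feature is omitted because estimating it requires spectral analysis that does not fit in a short sketch.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this example)


def synth_tone(freq_hz, seconds=0.05, amplitude=0.5):
    """Generate a pure sine tone standing in for a voiced speech frame."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n)]


def speech_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def estimate_pitch(frame, min_hz=60, max_hz=400):
    """Estimate pitch as the lag with the strongest autocorrelation."""
    min_lag = SAMPLE_RATE // max_hz
    max_lag = SAMPLE_RATE // min_hz

    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return SAMPLE_RATE / best_lag


frame = synth_tone(200)
energy = speech_energy(frame)
pitch = estimate_pitch(frame)
print(f"energy={energy:.3f}, pitch={pitch:.1f} Hz")
```

A system would compute such values frame by frame and pass them, together with the speaker's gender, to a classifier that maps them to an emotion label; that classification step is not shown here.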

Recognition of Emotion[8]

Web 3.0

Web 3.0 is one vision of the future that promises true personalization of responses based on user-spoken requests. It envisions a reality in which users can sit back, simply dictate their requests to a device and be given an accurate response within moments. Completely hands-free control over computerized systems will allow the ongoing efforts to increase the efficiency and effectiveness of speech recognition to extend even to mass consumers.

Of course, speech recognition is only one aspect of Web 3.0, which will integrate different resources to create a holistic experience for users. To create a true Web 3.0 experience, speech recognition will have to be used in conjunction with other technologies or concepts, such as:

  • Sentiment Analysis – The extraction of information from source material to determine the user’s attitude. This process also ties in with emotional speech recognition, as it tries to infer whether the user’s speech is positive, negative or simply neutral. Requesting actions or information by speech will require some form of sentiment analysis to provide a holistic personal experience.
  • Artificial Intelligence – AI software such as a neural network (which is responsible for many processes, such as facial detection) can absorb data and become better at processing it. This procedure can also allow the software to recognize relationships between statistical models and language models of speech.
  • Location-Based Services – Map and location services will operate alongside speech recognition systems to offer users geographically localized responses. The service, already widely used today, will integrate GPS signals and track favourite locations (similar to how Google Now does it today) to complement voice requests.
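The positive, negative or neutral classification described under Sentiment Analysis can be sketched with a very small word-list approach. The Python example below is a deliberately naive illustration: the `POSITIVE` and `NEGATIVE` word sets and the `sentiment` function are hypothetical, and practical systems rely on much larger lexicons or trained models.

```python
# Hypothetical word lists, chosen only for this illustration.
POSITIVE = {"great", "love", "thanks", "good", "perfect"}
NEGATIVE = {"terrible", "hate", "wrong", "bad", "useless"}


def sentiment(utterance: str) -> str:
    """Label a transcribed utterance as positive, negative or neutral."""
    words = [w.strip(".,!?") for w in utterance.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for request in ("Thanks, that route is perfect!",
                "That is the wrong address.",
                "Call the office at noon."):
    print(f"{request!r} -> {sentiment(request)}")
```

Counting words ignores negation and tone of voice, which is one reason the text pairs sentiment analysis with emotional speech recognition.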

Speech recognition will be at the forefront of Web 3.0. The ability to dictate requests and add voice as an element of the Web will make speech recognition a necessity for future generations. Although the technologies listed above will all contribute to the Web 3.0 experience, speech recognition will play a part in all of them.


  1. Speech Recognition
  2. Dragon dictation app embroiled by privacy confusion
  3. Google Account
  4. Native English speakers study
  5. Gartner’s 2012 Hype Cycle
  6. Emotional Speech Recognition: Resources, Features and Methods
  7. Emotional Speech Recognition
  8. Recognition of Emotion