Gesture Recognition

From New Media Business Blog


The central topic of this paper is a specific type of human-computer interaction (HCI) called gesture recognition. Before we can get into the details, we must first define the term “human-computer interaction” and revisit the history that led to the technology. According to Webster’s Dictionary, human-computer interaction is the study of how humans interact with computers, and of how to design computer systems that are easy, quick and productive for humans to use[1].

Gesture Recognition Interface

John Underkoffler: Pointing to the future of UI


History of Human-Computer Interaction

Basic and Intermediate HCI

Starting from the basics, there are the keyboard and mouse. This combination is undeniably the most common HCI in use today in the computing world. The computer keyboard was derived from a typewriter invented by Christopher Latham Sholes and patented in 1868[1]. While this was not the very first keyboard-like device invented, it was the first commercially available keyboard to use the QWERTY layout. The mouse was invented by Douglas Engelbart in 1964, and he patented the design in 1970 as an “X-Y Position Indicator for a Display System”[2]. At around the same time the mouse was invented, the development of touch-screen devices had already begun. The first known touch-screen device was invented by E.A. Johnson in 1967 at the Royal Radar Establishment in Malvern, UK[3], and was designed for use with air traffic control screens. The 1990s saw more consumer electronics utilizing the touch interface. Apple released the Newton PDA in 1993, equipped with features such as handwriting recognition and contact storage[4]. In 1994, IBM released Simon, widely considered the first smartphone. It had functions similar to today’s smartphones, such as a calendar, a notepad and a touch-screen interface for dialing phone numbers. Other notable devices include Palm’s Pilot lineup in the PDA market.

Typewriter invented by Sholes utilizing the QWERTY layout
Douglas Engelbart and his mouse prototype
Apple's Newton PDA

Advanced HCI: Peripheral and Vision-based

With Moore’s law in effect, computer processors and their computing power continued to rise while prices dropped. This gave developers the ability to create more complex programs and applications that had not been economically feasible before. What some developers[5] noticed were the constraints imposed by the mouse and keyboard, which made complex human-computer interactions virtually impossible. The first attempts to solve this problem took the form of mechanical devices that measured hand movements. One example was VPL’s Data Glove in 1989; the device attaches to a user’s hand and uses sensors and fibre-optic cables to measure movements[6], which are then computed to deliver commands through an application. While this was a breakthrough in HCI technology, developers and researchers still saw room for improvement. Sturman and Zeltzer[7] noted that glove-based gestural interfaces require the user to wear a cumbersome device and generally carry a load of cables connecting it to a computer, which hinders the ease and naturalness with which the user can interact with the computer-controlled environment. This is why vision-based gesture recognition is advantageous: it does not require any device to be attached to the body. All motions are tracked by a camera or camera-like device able to create a virtual representation of human gestures, and these gestures are then mapped to specific commands to suit the user’s needs.

According to a group of researchers[8], “factors that may have contributed to this increased interest [in gesture recognition] include the availability of fast computing that made real-time vision processing feasible, and recent advances in computer vision techniques.” Another reason is the growth of the gaming industry. When the Nintendo Wii was released in 2006, it took the world by surprise by utilizing gesture-recognition devices in its game play; for many consumers this was their first contact with the technology. Later, Microsoft released its Kinect device, which was even more advanced than Nintendo’s offering. The Kinect requires no peripherals to be held by the user; it simply tracks the user’s movements through multiple sensors and cameras[9].

What Is Gesture Recognition?

According to a definition from Webopedia[10], gesture recognition is the interface with computers using gestures of the human body, where a camera reads the movement and communicates the data to a computer that uses the gestures as input to control devices or applications. No physical devices are required for the user to interact with the computer; the camera is the integral piece of the technology that converts gestures into commands.

How Does Gesture Recognition Work?

The mechanism of early-stage gesture-recognition systems is very similar to 3D recognition through human eyes in nature. For instance, a light source such as the sun bathes an object in a full spectrum of light. The eye senses reflected light, but only a limited portion of the spectrum. The brain compares a series of these reflections and computes movement and relative location[11].
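
The comparison of successive reflections described above can be sketched in code. The toy frames, threshold and centroid rule below are illustrative assumptions, not any particular product's algorithm; a real system would operate on camera images.

```python
def frame_difference(prev, curr, threshold=10):
    """Return the set of (row, col) pixels whose brightness changed."""
    moved = set()
    for r, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for c, (a, b) in enumerate(zip(row_prev, row_curr)):
            if abs(a - b) > threshold:
                moved.add((r, c))
    return moved

def centroid(pixels):
    """Average position of changed pixels: a crude estimate of where motion is."""
    if not pixels:
        return None
    rs = sum(r for r, _ in pixels) / len(pixels)
    cs = sum(c for _, c in pixels) / len(pixels)
    return (rs, cs)

# Two toy "frames" of brightness values: a bright object moves one column right.
frame1 = [[0, 0, 0, 0],
          [0, 99, 0, 0],
          [0, 0, 0, 0]]
frame2 = [[0, 0, 0, 0],
          [0, 0, 99, 0],
          [0, 0, 0, 0]]

changed = frame_difference(frame1, frame2)
print(sorted(changed))      # pixels (1,1) and (1,2) changed
print(centroid(changed))    # motion centred between them
```

In practice, the changed-pixel set would feed a tracker that follows the centroid over time, turning a sequence of positions into a recognized gesture.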

Basic Gesture-Recognition System Components

Gesture-Recognition System Components

There are plenty of different technologies supporting gesture-recognition systems nowadays; however, they all share a fundamental set of components[12]:

  1. Light Source: An LED or laser diode typically generates infrared or near-infrared light. This light isn’t normally noticeable to users and is often optically modulated to improve the resolution performance of the system.
  2. Controlling Optics: Optical lenses help optimally illuminate the environment and focus reflected light onto the detector surface. A bandpass filter lets only reflected light that matches the illuminating light frequency reach the light sensor, eliminating ambient and other stray light that would degrade performance.
  3. Image Sensor: A high performance optical receiver detects the reflected, filtered light and turns it into an electrical signal for processing by the firmware.
  4. Firmware: Very-high-speed ASIC or DSP chips process the received information and turn it into a format which can be understood by the end-user application such as video game software.
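
One way to picture how these components cooperate is ambient-light rejection: the light source is pulsed, and the firmware subtracts a reading taken with the source off from one taken with it on, so steady ambient light cancels out. The sensor model and numbers below are invented for illustration.

```python
AMBIENT = 40  # constant background light hitting the sensor (arbitrary units)

def sensor_reading(source_on, reflectivity):
    """Toy image sensor: ambient light plus reflected source light (if on)."""
    return AMBIENT + (100 * reflectivity if source_on else 0)

def firmware_measure(reflectivity):
    """Take one reading with the light source off and one with it on;
    the difference isolates the reflected signal from ambient light."""
    dark = sensor_reading(False, reflectivity)
    lit = sensor_reading(True, reflectivity)
    return lit - dark

print(firmware_measure(0.8))  # reflected signal only, ambient removed
print(firmware_measure(0.0))  # nothing in front of the sensor: zero
```

The optical bandpass filter in a real system plays the same role continuously, passing only light at the source's wavelength to the sensor.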

Evolution of Gesture Recognition Technology

In recent years, gesture-recognition technologies have been gaining increasing attention because they let humans interact with machines through natural movements, without any mechanical devices, in situations where the user is physically separated from a device, the user's hands are wet or dirty, or the user does not want to touch a publicly accessible device due to hygiene concerns[13]. In attempts to use hand gestures as a substitute for mouse operations, it has been possible to detect horizontal and vertical hand motions analogous to cursor movement. However, there was no mouse-operation equivalent for forward and backward hand motions; detecting this depth information has been a longstanding research problem[14].

Limitations of (x, y) Coordinate-based 2D Vision

Because traditional computer vision represents scenes in only two dimensions, computers struggle to analyze spatial information, including segmentation, recognition and object representation[15]. As a result, a traditional gesture-recognition system has to rely on changes in the area of the moving object. This approach suffers a significant decline in detection precision when the background colour is similar to the moving object's colour, because the two become difficult to distinguish, making it impossible to accurately extract the object from the background[16].
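
The failure mode described above can be shown with a toy segmentation routine: pixels are kept as "object" only if their colour is far enough from the background colour. The colours and threshold below are made up for illustration.

```python
def colour_distance(a, b):
    """Euclidean distance between two RGB colours."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def extract_object(frame, background_colour, threshold=60):
    """Return pixel positions considered 'object' (far from the background colour)."""
    return {(r, c)
            for r, row in enumerate(frame)
            for c, px in enumerate(row)
            if colour_distance(px, background_colour) > threshold}

white_bg = (255, 255, 255)
red_hand = (200, 30, 30)
pink_hand = (250, 230, 230)   # close in colour to the white background

frame = [[white_bg, red_hand],
         [white_bg, pink_hand]]

print(extract_object(frame, white_bg))
# only (0, 1): the red hand is found, the pink hand is lost in the background
```

A depth-aware system sidesteps this entirely, since the hand differs from the background in distance even when it matches in colour.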

Virtual Keyboard

Z-coordinate (depth) Innovation

The addition of the z axis not only raises the precision with which motions and exact spatial coordinates are detected, but also allows displays and images to look more natural and familiar, reflecting what people see with their own eyes. In addition, this third coordinate changes the types of applications available to users, including products that provide immersive digital experiences. At present, three common technologies can acquire 3D images: stereoscopic vision, structured light pattern and time of flight (TOF). Each has its own unique strengths and appropriate use cases[17].

  • Stereoscopic Vision systems use two cameras to obtain left and right stereo images, offset by roughly the same distance as human eyes. As the computer compares the two images, it develops a disparity image that relates the displacement of objects between them. This technology is commonly used in 3D movies and in mobile devices, including smartphones and tablets.
  • Structured Light Pattern technology exploits the same triangulation as a stereoscopic system to acquire the 3D coordinates of an object, but replaces one of the two sensors with a light source. A single 2D camera system with an RGB-based sensor can then measure the displacement of any single stripe of visible light. The coordinates are obtained through software analysis and used to create a digital 3D image of the shape.
  • Time of Flight (TOF) is relatively new among depth information systems. TOF systems are not scanners, as they do not measure point to point. Instead, TOF systems perceive the entire scene simultaneously to determine the 3D range image. With the measured coordinates of an object, a 3D image can be generated and used in systems such as device control in areas like manufacturing, robotics, medical technologies and digital photography.

WiSee: Wi-Fi signals enable gesture recognition throughout entire home

Pattie Maes: Unveiling game-changing wearable tech

Emerging Technology

Other forms of gesture recognition are emerging beyond the video camera. WiSee, a system developed by computer scientists at the University of Washington, uses subtle changes in Wi-Fi signals to detect human movements. By exploiting the Doppler frequency shift, it enables gesture recognition throughout an entire home, recognizing and responding to gestures and hand motions. With WiSee, an employee can turn on a printer by pointing at it with a forefinger, or turn off a computer by touching thumb and forefinger together[1]. While this is an interesting technology, it is beyond the scope of this document, which targets vision-based gesture recognition. It is possible, however, that the two will combine in the near future for even more complex gesture-recognition capabilities.

Current Applications of Gesture Recognition

As the authors of “Gesture recognition: Enabling natural interactions with electronics” claim[2], “the ability to process and display the ‘z’ coordinate is enabling new applications far beyond entertainment and gaming, including manufacturing control, security, interactive digital signage, remote medical care, automotive safety and robotic vision.” The gesture-recognition market is estimated to grow from $322.30 million in 2012 to $7.15 billion in 2018[3]. The following are areas in which the technology is currently being utilized.

Human Gesture Recognition for Consumer Applications

Human gesture recognition has already been a popular input method in gaming and mobile devices including smartphones and tablets. It allows users to interact with the device in a much more natural and intuitive way, leading to greater acceptance and approval of the products. These products contain various resolutions of 3D data, which is analyzed by software modules such as raw-to-depth conversion, two-hand tracking and full-body tracking to deliver gaming and tracking in real time[4].

Interactive Digital Signage as a Pinpoint Marketing Tool

Interactive digital signage takes public advertising to a whole new level. Companies will increasingly be able to use pinpoint marketing tools to deliver the most applicable content to each consumer. For example, as someone walks past a digital sign, an extra message may appear on the sign to acknowledge the customer. If the customer stops to read the message, the sign can interpret that as interest in the product and deliver a more targeted message. Microphones allow the billboard to recognize significant phrases and pinpoint the delivered message even more strategically[5].

Fault-free Virtual or Remote Medical Care

Gesture-recognition technology also brings significant benefits to the medical field. It can help make the best medical care available to everyone, no matter where in the world they are located[6]. For instance, doctors can virtually treat patients across geographical boundaries using medical robotic vision, or even perform operations through a remotely controlled robotic arm.

Gesture Recognition Market
Gesture Recognition Applications


Nintendo Wii

The Nintendo Wii, released in 2006, won that generation’s console war on sales volume, largely because it opened a new market for gaming. The gamer market had typically been tailored to younger males and young adults, but the Wii became popular with both older and younger gamers thanks to its innovative game design and ease of use. Users control games through two controllers, the Wii Remote and the Nunchuk, which track movement through accelerometers and infrared sensing: the remote sights an infrared sensor bar mounted above or below the display[7]. The controller can thus track user movements in 3D and also act as a pointing device. This innovative and simple premise allowed for many different applications and games that gamers of all ages could easily pick up and play.

Xbox Kinect

The Xbox Kinect was Microsoft’s response to the Wii and the first commercially successful gesture-recognition system for gaming that required no peripherals attached to the body. Released in November 2010, it sold 8 million units in its first 60 days, breaking the record for the fastest-selling consumer electronics device. The Kinect detects movement through a camera system designed by PrimeSense, processed by software created by Rare, a Microsoft subsidiary. The sensor itself comprises an infrared sensor and a camera, and the data is processed by a dedicated microchip that relays information to the console. Through this combination of hardware and software, the Kinect can simultaneously track six people, two of them active players, with 20 separate joints actively tracked per player[8]. While the Kinect found critical success in most markets, it could not match that success in Japan, mainly because of the wide-open space it requires, which is not readily available in most Japanese homes, where living units are quite small due to population density.
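
Per-joint tracking like the Kinect's lends itself to simple rule-based gestures over joint positions. The joint names, coordinates and "hand raised" rule below are invented for illustration and are not the real Kinect SDK.

```python
def hand_raised(skeleton):
    """True if either hand is tracked above the head (y grows upward here)."""
    head_y = skeleton["head"][1]
    return (skeleton["hand_left"][1] > head_y or
            skeleton["hand_right"][1] > head_y)

# Toy skeleton: (x, y, z) in metres relative to the sensor.
player = {
    "head":       (0.0, 1.60, 2.0),
    "hand_left":  (-0.3, 1.10, 2.0),
    "hand_right": (0.3, 1.85, 2.0),   # raised above the head
}

print(hand_raised(player))  # True
```

A real application would evaluate rules like this on every frame for each of the tracked joints, combining them into richer gestures.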

Kinect Retail

There have been many innovative uses for the Kinect. One example is a Japanese company that used the Kinect to monitor customers’ purchase-decision processes in a clothing store. The Kinect collects data such as which items a customer compares when making purchase decisions, by analyzing body language and time spent on each item[9]. This also shows the scalability and flexibility of the technology, with possible RFID and other sensor integrations for future uses.


Athletics and Kinesiology

Gesture recognition is also becoming applicable in athletics and kinesiology studies. It is used in postural and gait analysis, where a camera can track movements without the need for attached sensors[10]. This is especially important because attached sensors may impede or change the natural movement of the subject being analyzed. It is now used to identify movements that cause strain or injury, and to point out areas where movement can be optimized to improve athletic performance.

My Automated Conversation coacH

Skill Spector

An example of gesture recognition in athletic analysis is a freeware tool called Skill Spector. Skill Spector performs video-based motion analysis of sports movements and supports both 2D and 3D analysis. It can compute inertia and angular or linear kinematic data in an easy-to-use interface. Athletes can then use these representations, relayed in a 3D representation of the movement, to figure out where to change their technique[11].
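
One basic kinematic quantity such tools derive from tracked points is a joint angle, e.g. knee flexion from hip, knee and ankle coordinates. The point values below are invented; a real analysis would read digitised video coordinates frame by frame.

```python
import math

def joint_angle(a, b, c):
    """Angle at point b (degrees) formed by segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.degrees(math.acos(dot / (n1 * n2)))

# Toy 2D coordinates (metres) from one video frame:
hip, knee, ankle = (0.0, 1.0), (0.1, 0.5), (0.1, 0.0)
print(round(joint_angle(hip, knee, ankle), 1))  # about 169 degrees: a nearly straight leg
```

Differencing such angles across frames gives the angular velocities and accelerations the tool reports.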

My Automated Conversation coacH (MACH)

A new use for gesture recognition is an innovative system called My Automated Conversation coacH, or MACH for short. MACH was designed by MIT student M. Ehsan Hoque. It uses a camera and software to generate a computer avatar that converses with the user, then analyzes facial expressions, eye and head movements, and speech patterns and intonation to generate feedback on speaking skills. It relays the data in graphical form and can display it alongside a recording of the session, allowing detailed feedback and instant analysis. In pilots and testing, the software was shown to improve performance in job interviews and public speaking[12].

Harman Car System

Gesture recognition can also be used in cars, as in the Harman car system. Hand gestures control different functions in the car and are detected by infrared sensors that watch for predetermined motions. The driver can change the music volume by tilting their head, skip to the next track by tapping the steering wheel, and even control the climate system by raising or lowering a hand above the signal-light switch. The system claims to distinguish between accidental and intentional gestures[13].
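
Harman has not published how the accidental-versus-intentional distinction works; one plausible, purely illustrative rule is to accept a gesture only if its pose is held steadily for a minimum time. The constants and per-frame booleans below are assumptions for the sketch.

```python
HOLD_SECONDS = 0.5  # assumed minimum hold time for an intentional gesture
FRAME_DT = 0.1      # assumed seconds between sensor samples

def is_intentional(samples):
    """samples: per-frame booleans saying whether the gesture pose was seen.
    Accept only if the pose is held for HOLD_SECONDS of consecutive frames."""
    needed = round(HOLD_SECONDS / FRAME_DT)
    run = 0
    for seen in samples:
        run = run + 1 if seen else 0
        if run >= needed:
            return True
    return False

print(is_intentional([True, True, False, True, True]))        # False: brief flickers
print(is_intentional([False, True, True, True, True, True]))  # True: held for 0.5 s
```

Tuning HOLD_SECONDS trades responsiveness against false triggers, which is exactly the balance an in-car system must strike.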

Leap Motion Device

Leap Motion

Perhaps one of the most innovative products so far is the Leap Motion controller. Leap Motion, the company, aims to replace the keyboard and mouse with a specialized sensor; believing that molding virtual clay should be just as easy as molding real clay, it created the Leap Motion with this in mind[14]. The sensor combines two cameras and three infrared LEDs to create a roughly hemispherical interaction space of about one metre in which the user can freely manipulate the computer, tracking movement to a claimed precision of 0.01 mm[15]. The tool is extremely useful for 3D modeling, which is cumbersome to manipulate with a traditional mouse and keyboard, and it can also be used to navigate the OS and different screens, as well as for high-precision drawing.

Business Applications

Inition's interactive advertising with Ford's CMAX

As mentioned earlier in this document, the healthcare industry is already experimenting with gesture recognition to improve its processes. Washington Hospital Center, for example, implemented[1] the technology to allow doctors to manipulate pictures and diagrams during surgery without any physical input device. The technology improves efficiency by eliminating the need to travel between the operating table and the input device; the shortened operation time could allow more surgeries to be booked with the same level of human resources. In addition, patient risk during surgery is greatly reduced because the surgeon no longer needs to touch another source of potential contamination. To a patient, this could be a key factor in deciding where to have an operation performed, giving a hospital with the technology a competitive advantage over other establishments.

For business, it is increasingly important to create innovative modeling and prototyping techniques to increase efficiency. Gesture recognition is an important step in that direction, as it allows users to manipulate virtual prototypes in a natural fashion instead of with the traditional mouse and keyboard, which are cumbersome to navigate with. Navigating in 3D lets users feel and perceive an object with detail that cannot be attained in a 2D environment, and notice specifics that would otherwise be missed. This can improve not only the efficiency but also the quality of work produced, especially in the engineering and architecture fields.

Another way for businesses to capitalize on gesture recognition is through interactive advertising. Interactive advertising engages viewers through gesture-controlled ads displayed on a screen set up at a promotional event or other public location, or directly through smart TVs in the comfort of their own homes. This not only improves viewer interest and value, but also lets businesses collect data on which parts of an ad are most popular, allowing them to target specific points in future ads. Viewer reactions can also be recorded to gauge an ad’s effectiveness. This approach lets businesses fully utilize the screen most consumers view the longest, and lets viewers directly interact with the ad and find the information they want from it, increasing brand recognition. Brainient created the first interactive ad, for the movie The Hobbit, using the Kinect[2]; viewers can interact with the ad and view character biographies and photo galleries with swipe and wave gestures.

Pranav Mistry: The thrilling potential of SixthSense technology

Business Applications beyond the Technology Itself

The following is a list of potential applications when gesture recognition is used in conjunction with other available technologies.

Augmented Reality

Inition, a UK-based tech company, incorporated both augmented reality and gesture recognition to create an interactive advertisement for Ford’s C-MAX vehicle[1]. The interactive ad can detect a user’s hand and position a virtual car on top of it on a display. The user is then presented with virtual on-screen buttons for manipulating the virtual vehicle. This combination creates an entirely different way for customers to interact with products virtually, and gives viewers the opportunity to investigate further and engage with the product. We believe it very likely that this type of advertising will become available online, especially in the garment industry, where customers will be able to “try on” clothing as if they were in a physical store.

Another example of pairing this technology with augmented reality is a product called SixthSense. Created by a group of MIT students in 2009, it is defined as “a wearable gestural interface that augments the physical world around us with digital information and lets us use natural hand gestures to interact with that information”[2]. It can also communicate via the Internet to gather resources and information in real time and display customized content in its augmented displays; for example, it was able to recognize products on a store shelf and reveal product ratings from multiple online sources. While the prototype device is rather large and cumbersome, we believe it will become commercially viable in the near future. Companies will be able to capitalize on such a device on many levels. For example, a company like Consumer Reports could sell subscriptions to device owners, giving them access to its ratings and product test results. Businesses could also use the opportunity to further educate consumers in real time as they view products, potentially increasing purchase rates. Consumers would benefit by predefining their product preferences for added efficiency in the shopping process. Lastly, valuable data on product preferences could be collected, revealing changing consumer tastes and requirements; in turn, companies could use this data to shape future product development and marketing efforts.

Combining with Haptic Technologies

Sometimes, certain complex behaviors and movements cannot be tracked accurately by vision-based gesture recognition alone. Furthermore, it is often hard to get feedback on your movements when you are not in contact with any physical surface. To address this problem, researchers from the University of Calgary[3] experimented with a prototype that combines gesture recognition with haptic devices, which in this context are devices that allow users to touch, feel and manipulate three-dimensional objects in virtual environments and tele-operated systems. This leads to better-designed user interfaces for work in which movement feedback is crucial. 3D modeling is one example: the feedback component can help users pinpoint the area they wish to manipulate in the virtual environment, shortening the design phase and reducing costs for a range of projects.

Drawbacks of Gesture Recognition

User reviews on Twitter about the Xbox “always listening” feature

Privacy Component

One of the key concerns with vision-based gesture-recognition technology is the camera component in these devices. Users are sensitive about what data is captured and about who will gain access to it. Using the Xbox Kinect as an example, we will explore the extent of these concerns.

Xbox Kinect

In addition to physical images, the device is capable of recording all kinds of personal information about the user’s reaction rates, learning state, and emotional state while operating the device. This data could potentially be passed on to an unauthorized party, or posted to a user’s Facebook or other social media without notifying the user or obtaining permission[4]. The Kinect is also in a constant “always listening” mode, a standby state that listens for user input to turn on the device even when the system is off. There was even a dispute in which Germany’s federal data protection commissioner referred to the video game console as a monitoring device[5].

Constant Network Connection

Microsoft stated that the new console would require a recurring Internet connection to function properly. This requirement is quite common among devices utilizing gesture-recognition technology. SixthSense is an example: recall that the device requires a network connection to deliver real-time content to its user, meaning the content-generating businesses can potentially view the gestures created by the device owner. There are other concerns, such as a user’s GPS location, but these are excluded from this discussion as they fall outside the scope of gesture-recognition technology.

User Controls and Fatigue

Another drawback to gesture recognition is the general physicality required to operate it. While users can see the results of their manipulations, they receive no physical feedback from their actions, losing the extension of the sense of touch. Since gesture recognition does not let the user physically manipulate a real object, it lacks the natural feel of handling objects with your hands. In addition to the lack of feedback, extended use of gesture recognition leads to what is called “gorilla arm”[6]: the fatigue, cramping and soreness that come from holding the arms up for long periods, making the user feel like a gorilla. The term was initially used to describe fatigue from early touchscreen devices, and has since been extended to the fatigue felt after prolonged use of gesture recognition. Since traditional mouse and keyboard operation requires little effort from a resting position, the effects of fatigue are less apparent there.


  1. Inition. (n.d.). Ford C-MAX Campaign: AR with Gestural Interface from Inition. Inition | Everything in 3D. Retrieved July 29, 2013, from
  2. Mistry, P. (n.d.). SixthSense - a wearable gestural interface (MIT Media Lab). Pranav Mistry. Retrieved July 29, 2013, from
  3. Menelas, B., Hu, Y., Lahamy, H., & Lichti, D. (2011). Haptic and gesture-based interactions for manipulating geological datasets. Systems, Man, and Cybernetics (SMC), pp. 2051-2055.
  4. Weber, R. (May 2013). German commissioner highlights Xbox One privacy concerns. GamesIndustry International. Retrieved from
  5. Sapieha, C. (May 2013). Xbox One: Five reasons to fear for Microsoft’s new console. Financial Post. Retrieved from
  6. Pogue, D. (n.d.). Why Touch Screens Will Not Take Over. Scientific American. Retrieved July 29, 2013, from