Time invariant hand gesture recognition for human-computer interaction

A hand-motion-driven human-computer interface based on a novel time-invariant gesture description is proposed. The description is represented as a sequence of overthreshold motion distribution histograms and captures both the spatial configuration and the motion dynamics of a gesture. A k-nearest-neighbour classifier was trained on six gesture types. An application for remote slideshow control was developed based on the proposed algorithm.


Introduction
The popularity of natural interfaces for desktop and mobile computer control has grown rapidly within the last decade. Common human-computer interfaces (such as the keyboard) are gradually being replaced by natural control interfaces driven by gestures, voice, finger motion or full-body motion. These new methods are widely used in entertainment applications and in fields where contact between a human and an input device is impossible or unwanted because of sterility requirements, or where a device has to be controlled by a group of people simultaneously. The hand motion recognition task involves several fundamental computer vision problems, in particular the detection and recognition of dynamic patterns. The standard pipeline for single image analysis is a sequence of procedures: preprocessing, segmentation, classification. This pipeline is adequate for video analysis only if the task is to detect and classify objects whose movements are not of interest. Otherwise, additional information about object movement or transformation has to be considered. In contrast to a single image, video contains such additional information, which has to be utilized for object detection and recognition.
The main goal of our project is to develop a robust descriptor for dynamic objects that is invariant to object deformations and perspective transformations during movement. To develop such a descriptor, modifications of the standard single image analysis pipeline are proposed. The new dynamic gesture recognition method is described below. It utilizes information about the duration, direction and amplitude of a motion along with spatial and intensity-based feature descriptions of the images in a video sequence. This method was used to develop a human-computer interface and an application for remote presentation control.

Research background
Gesture-based human-computer interfaces can utilize a stationary hand configuration (configuration is relevant), like ''open palm'' or ''thumb up'', or hand motion (dynamics is relevant), like the ''from palm to fist'' motion, the ''hands up'' motion, etc. A variety of gesture recognition algorithms has been developed. They can be divided into two groups according to whether configuration or motion is relevant: (1) single image analysis algorithms, which detect and recognize the hand configuration in each frame of a video; (2) image sequence analysis algorithms, which detect the pattern of hand configuration change over the whole gesture video sequence. The single image analysis pipeline for gesture recognition is similar to commonly used analysis procedures: preprocessing, segmentation, classification. The following steps for gesture analysis were proposed in [1]: hand contour extraction, tracking, and recognition based on selected features. The image sequence analysis pipeline differs from the single image approach and contains the following steps: background subtraction, description of a gesture, and classification. Although every gesture detection and recognition method is unique and combines different approaches and algorithms, some steps are common to most methods, such as background subtraction, feature extraction and gesture classification. Gesture detection and recognition methods use different background subtraction algorithms, from very simple ones like frame difference [2-4] to more complex methods such as Adaptive Mixture of Gaussians [5,6] and frame difference enhanced with a Gaussian filter [7]. Various feature extraction methods for gesture recognition have been described in [3,7,8,9]. Methods based on the calculation of histograms of oriented gradients (HOG) are used in [9] and [3]. In [3] HOG is used to extract features from a motion history image, which is created from several consecutive frames. Fourier analysis based methods are used for different kinds of movements in [7]. Hand shapes within the single image analysis approach can be described by the shape context descriptor [8]. Gesture classification is performed by different algorithms based on fuzzy logic [8], SVM for a large feature vector (3780 elements) [9], Euclidean distances [3], and spectrum analysis [7].
Each of the above-mentioned gesture analysis methods has its own advantages and disadvantages. For instance, single image based analysis provides a precise estimate of the hand location in the image, but it is limited to a fixed hand configuration because of the non-rigid nature of the human hand. On the other hand, image sequence based methods are generally invariant to hand configuration but depend on gesture duration and completeness. Gesture analysis methods also share some limitations: for example, most of the methods based on skin color segmentation [10-11] or background subtraction [2-4, 7, 9] are highly dependent on scene lighting conditions, camera sensor quality, etc. Along with recognition quality, processing speed is one of the key factors for the end user experience, so it has to be taken into account when comparing gesture recognition methods.

Gesture detection and recognition
A new method of hand gesture detection and recognition is presented. The following requirements were set during problem formulation: the method must be invariant to gesture duration and to the initial hand position, and it must be able to detect and recognize transformable hand configurations. Given these requirements and the research background, the method should describe a gesture in terms of integral motion characteristics. The algorithm workflow is presented in Fig. 1. Two main components can be highlighted in this workflow: background subtraction and feature extraction.

Background subtraction
In our approach background subtraction is used for gesture duration estimation and as a preprocessing step for gesture description. The following requirements for the background subtraction method were formulated: (1) estimation of object contours; (2) no contour traces; (3) real-time performance. Several popular background subtraction methods were reviewed within the research. Each of them was evaluated according to the following parameters: performance, contour continuity, and length of contour traces. Background subtraction quality was evaluated qualitatively, and algorithm performance was estimated from the video processing frame rate. Algorithm performance is considered ''realtime'' when the frame rate is at least 30 frames per second. All algorithms (except ViBe) were evaluated using their current implementation in BGSLibrary [12] on a PC with a 3rd-generation Intel Core i5 processor. The results of the performance and quality estimations are presented in Table 1.
The presented background subtraction algorithm demonstrates real-time performance and sufficient quality of contour separation without long traces. However, this method depends on lighting conditions. Parameter N regulates the update speed of the background model. Binarized variance maps of two adjacent frames with different N are presented in Fig. 2.
The binarized variance map is prepared by median filtering, which reduces noise and merges nearby contours.
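The median filtering step can be illustrated with a minimal sketch. The text does not state the kernel size or the border handling, so the 3x3 window and the zero padding below are assumptions made for illustration only, and the function name `median_filter_3x3` is hypothetical.

```python
def median_filter_3x3(binary_map):
    """Apply a 3x3 median filter to a binarized map given as a list of rows.

    Assumptions (not stated in the text): 3x3 window, zero padding at borders.
    """
    height = len(binary_map)
    width = len(binary_map[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    window.append(binary_map[ny][nx] if inside else 0)
            window.sort()
            result[y][x] = window[4]  # middle of the 9 sorted values
    return result


# An isolated noise pixel is removed by the filter.
noisy = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
print(median_filter_3x3(noisy))
```

For a binary map the median of nine values is 1 only when at least five pixels of the window are set, which is why isolated pixels vanish while small gaps inside dense regions are filled.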
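The motion rate of the current frame is estimated by counting all non-zero elements in the binarized variance map. Motion is considered to be a gesture candidate when the following condition is met:

$$\sum_{x=1}^{w} \sum_{y=1}^{h} \mathbb{1}\left[B(x, y) \neq 0\right] > \mathit{thresh} \qquad (3)$$

where $x$, $y$ are the pixel coordinates in the binarized variance map $B$, $w$ and $h$ are the width and height of the variance map respectively, and $\mathit{thresh}$ is the threshold for estimating a gesture candidate in the frame.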
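The gesture candidate test can be sketched in a few lines. This is only an illustration: the map is represented as a plain list of rows, the comparison is assumed to be strict, and the names `motion_rate`, `is_gesture_candidate` and `thresh` are hypothetical.

```python
def motion_rate(binary_map):
    """Count the non-zero pixels of a binarized variance map (list of rows)."""
    return sum(1 for row in binary_map for value in row if value != 0)


def is_gesture_candidate(binary_map, thresh):
    """Return True when the motion rate exceeds the threshold.

    The strict inequality is an assumption; the text only says the
    motion must be over the threshold.
    """
    return motion_rate(binary_map) > thresh


frame = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(motion_rate(frame))              # 4
print(is_gesture_candidate(frame, 3))  # True
```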

Fig. 2. Binarized variance maps of two adjacent frames with different N.
If the current frame satisfies the above-mentioned condition, it is added to the current gesture sequence. The current gesture sequence is considered complete when the number of frames in the sequence has reached the required minimum and the condition (eq. 3) is not met in the frame that follows the sequence. As soon as the gesture sequence is considered complete, it is described and classified.
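The accumulation of frames into gesture sequences can be sketched as a simple scan over per-frame candidate flags. This is a hedged illustration: `segment_gestures` and `min_frames` are hypothetical names, and discarding runs that are too short is an assumption about what happens to incomplete sequences.

```python
def segment_gestures(candidate_flags, min_frames):
    """Group consecutive candidate frames into complete gesture sequences.

    candidate_flags: one boolean per frame (True if the frame passed
                     the gesture candidate condition).
    min_frames:      hypothetical minimum sequence length.
    Returns a list of (start, end) frame index pairs, end exclusive.
    """
    sequences = []
    start = None
    for index, is_candidate in enumerate(candidate_flags):
        if is_candidate:
            if start is None:
                start = index  # first frame of a new run
        elif start is not None:
            # The run ended at a non-candidate frame: keep it if long enough.
            if index - start >= min_frames:
                sequences.append((start, index))
            start = None
    return sequences


flags = [False, True, True, True, False, True, False]
print(segment_gestures(flags, min_frames=3))  # [(1, 4)]
```

A run that is still open when the input ends is not reported, because no following frame has yet failed the condition.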

Gesture description and classification
Any gesture sequence containing more than the required minimum number of frames can be described by a feature vector. The Motion Distribution Histogram is proposed as the integral description of a frame in the gesture sequence. It describes the distribution of overthreshold variance (eq. 3) over the frame. Each motion distribution histogram is calculated relative to the mean center of mass of all variance maps in the sequence, which provides invariance to the initial hand position. Each variance map is divided into 16 sectors; the non-zero elements in each sector are counted and stored in the corresponding element of the motion distribution histogram. Sectors are numbered clockwise starting from the horizontal axis. Each element of the motion distribution histogram is normalized by the area of the binarized variance map of the frame. Examples of variance maps and the corresponding motion distribution histograms are shown in Fig. 3. The variance map and the motion distribution histogram are calculated for all frames in the sequence.
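A minimal sketch of the motion distribution histogram computation is given below. It rests on several assumptions that the text leaves open: the map area is taken as width times height, sector 0 starts at the positive horizontal axis, and the angle grows clockwise because the image y axis points down. All function names are hypothetical.

```python
import math

SECTORS = 16


def center_of_mass(binary_map):
    """Return (x, y) of the mean position of non-zero pixels, or None if empty."""
    sum_x = sum_y = count = 0
    for y, row in enumerate(binary_map):
        for x, value in enumerate(row):
            if value:
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return sum_x / count, sum_y / count


def mean_center(binary_maps):
    """Average the per-frame centers of mass over the whole sequence."""
    centers = [c for c in map(center_of_mass, binary_maps) if c is not None]
    mean_x = sum(c[0] for c in centers) / len(centers)
    mean_y = sum(c[1] for c in centers) / len(centers)
    return mean_x, mean_y


def motion_histogram(binary_map, center):
    """Count non-zero pixels per angular sector around center, normalized by area."""
    cx, cy = center
    height = len(binary_map)
    width = len(binary_map[0])
    histogram = [0.0] * SECTORS
    step = 2 * math.pi / SECTORS
    for y, row in enumerate(binary_map):
        for x, value in enumerate(row):
            if value:
                # With y pointing down, a growing angle runs clockwise on screen.
                angle = math.atan2(y - cy, x - cx) % (2 * math.pi)
                sector = min(int(angle / step), SECTORS - 1)
                histogram[sector] += 1
    area = width * height  # assumption: area means the full map area
    return [count / area for count in histogram]


frame = [[0] * 5 for _ in range(5)]
frame[3][4] = 1  # pixel to the right of and slightly below the center
frame[1][0] = 1  # pixel to the left of and slightly above the center
center = mean_center([frame])
print(center)
print(motion_histogram(frame, center))
```

Using one center for the whole sequence, rather than a per-frame center, keeps the displacement of the hand over time visible in the histograms while removing the dependence on where the gesture started.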
The sequence of motion distribution histograms is divided into 4 subsequences over intervals determined by the number of histograms in the sequence.
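One way to obtain a fixed-length, duration-independent description from such a division is sketched below. The exact interval boundaries and the way each subsequence is reduced are not recoverable from the text, so equal quarters and element-wise averaging are assumptions made purely for illustration; `split_into_four` and `feature_vector` are hypothetical names.

```python
def split_into_four(histograms):
    """Split a list of histograms into 4 consecutive parts of near-equal length.

    Assumption: equal quarters; the original interval definition is not
    available in the text.
    """
    n = len(histograms)
    bounds = [(k * n) // 4 for k in range(5)]
    return [histograms[bounds[k]:bounds[k + 1]] for k in range(4)]


def average(histograms):
    """Element-wise mean of a non-empty list of equal-length histograms."""
    count = len(histograms)
    return [sum(column) / count for column in zip(*histograms)]


def feature_vector(histograms):
    """Concatenate the averaged histograms of the four subsequences."""
    if len(histograms) < 4:
        raise ValueError("at least 4 histograms are needed")
    vector = []
    for part in split_into_four(histograms):
        vector.extend(average(part))
    return vector


# Eight 16-bin histograms give a 64-element vector regardless of duration.
sequence = [[float(i)] * 16 for i in range(8)]
vector = feature_vector(sequence)
print(len(vector))  # 64
```

Under these assumptions a slow and a fast performance of the same gesture produce vectors of the same length, which is what makes them directly comparable by a distance-based classifier.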

Experimental results
The proposed method has advantages and disadvantages compared to methods that are based on object detection and recognition in each frame of a video. The main advantages of this approach are invariance to hand configuration transformations during the movement, invariance to the speed of the gesture, and invariance to the initial hand position in the camera field of view. However, this method is highly dependent on background movement (an effect that can be eliminated with a depth map), the distance between the human and the camera, and lighting conditions.
Six gesture types were selected for detection and recognition: hand movement from left to right, right to left, hand up, hand down, both hands from the center of the screen and both hands to the center of the screen. A training set containing samples of each gesture type was collected. A set containing seven examples of each of 2 classes (''hand down'' and ''hand up'') is displayed in Fig. 4, where the ''hand down'' movement is shown in black and the ''hand up'' movement is shown in gray.
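Classification of a described gesture against such a training set can be sketched with a plain k-nearest-neighbour vote over Euclidean distances. The value of k, the majority vote rule and the toy two-element feature vectors below are assumptions for illustration; `knn_classify` is a hypothetical name.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(sample, training_set, k=3):
    """Return the majority label among the k training samples nearest to sample.

    training_set: list of (feature_vector, label) pairs.
    k:            hypothetical neighbourhood size; the text does not state it.
    """
    neighbours = sorted(training_set, key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training data with 2-element vectors (real vectors would be longer).
training = [
    ([0.0, 0.0], "hand up"),
    ([0.1, 0.0], "hand up"),
    ([1.0, 1.0], "hand down"),
    ([0.9, 1.0], "hand down"),
]
print(knn_classify([0.05, 0.1], training))  # hand up
print(knn_classify([1.0, 0.9], training))   # hand down
```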

Application
A new human-computer interface was implemented based on the gesture recognition method described above. An application for hand-driven slideshow control was developed as an example of this interface. A client-server application architecture was chosen to achieve remote slideshow control. The architecture of the demonstration application is presented in Fig. 6.
Fig. 6. The architecture of the application.

Conclusion
A hand-motion-guided human-computer interface based on a new descriptor of dynamic patterns is presented. The distinctiveness of the proposed gesture description was demonstrated by measuring cross-class Euclidean distances between training samples.
Hand motion is described by a sequence of motion distribution histograms. The method demonstrates processing speed and classification accuracy that are sufficient, in terms of end user experience, for gesture sequences to be used for remote slideshow control. Further research within the proposed approach aims to support additional gesture types and to filter out the motion of non-relevant objects using a skin color map, a depth map and a motion map.

Fig. 3. Binarized variance maps for different frames of ''from left to right'' gesture sequence and corresponding motion distribution histograms.

Fig. 4. Plot of 14 histograms containing samples from two classes: ''hand up'' and ''hand down''. Black corresponds to the ''hand down'' movement, gray corresponds to the ''hand up'' movement. Feature vector element numbers are displayed on the horizontal axis and values on the vertical axis.

Fig. 5. Mean Euclidean distances and standard deviations (error bars) between the feature vectors of each class and those of the other classes: a - "left to right" gesture, b - "right to left", c - "hand up", d - "hand down", e - "hands away", f - "hands together".

Table 1. Overview of background subtraction methods.