A Template Matching Method for Recognition of Surveyor’s Gestures in Images

Artur Wilkowski
Faculty of Geodesy and Cartography & Institute of Control and Computation Engineering, Poland

Abstract - During field work, surveyors communicate not only by electronic means such as radio but also through a special sign language. Automatic understanding of this language by machines (computers or robots) could be the foundation for an automatic surveyor’s assistant, as well as a possible replacement for the expensive optical target-tracking systems used in high-end surveying instruments. The paper presents a template matching algorithm based on the Generalized Hough Transform for locating the surveyor’s silhouette in the image and recognizing the gestures performed by the surveyor. Color analysis is used in both the pre-processing and post-processing phases to improve the results.

Key words - recognition of surveyor’s gestures in images

Introduction

Field work is one of the most common activities of surveyors. One typical task is the determination of points on the ground. In their work, surveyors are supported by a range of measuring instruments. For determining points on the ground, depending on the measurement conditions and the precision required, surveyors may use GPS measurements or more traditional means such as a theodolite, a telemeter (or tacheometer) and a prism. In the latter case, a surveying team of at least two persons is required to perform measurements efficiently. One person operates the measuring instrument while the other moves around with a prism. A key factor for efficient cooperation in the team is robust communication between its members. Most of the communication takes place by radio, but a sign language is also widely used. The choice of the best communication system depends on work conditions such as visibility of other team members, environmental noise (agricultural field, motorway or construction site), distance between team members, as well as

convenience and personal preferences.

This article proposes an image processing algorithm that can be used both for automatically locating the person performing gestures and for recognizing the gestures themselves. One possible application of the algorithm is an automatic surveyor’s assistant. Another is the use of person positioning as a preliminary step for prism positioning in an automatic measuring instrument. Both applications make it possible to reduce the number of persons required in the surveying team.

Previous work

Our previous work and experiments on automatic recognition of surveyor’s gestures are presented in [1]. In that article, gesture recognition is based on a segmentation technique founded on color analysis. The color parameters of the surveyor’s coat are memorized and then used to split the image into separate objects. Subsequently, feature vectors are created for each object distinguished in the image, and a Support Vector Machine classifier is used to perform classification. Since most objects detected in the scene are not the object of interest, the pattern rejection scheme for hierarchical SVMs described in [2] is used to discard spurious objects.

The factor with the most negative impact on the effectiveness of the solution proposed in that article is bad segmentation. Whereas the applied pattern rejection scheme is very effective when dealing with well-segmented spurious objects, it cannot deal with situations where the object of interest is “merged” into a clutter of other objects. In the scenes used for testing, which contained complicated real-world urban environments, this situation is quite common. This negative effect can be only partially overcome by subtraction of a thresholded edge image, which was proposed as a means of improving the results of purely color-based segmentation.

The alphabet of gestures recognized

According to the author’s investigations, there is no single, precisely defined code for communication between surveyors in the field. However, it is possible to specify a common baseline of frequently used, basic gestures. This dictionary of gestures encompasses several simple commands such as “left”, “right”, “forward”, “backward” and “stop” that help to guide a partner in the field. For the purposes of the experiments, gestures were assumed to be static (which is not always the case in real work). Examples of the gestures processed are given in Fig. 1.

The Romanian Review Precision Mechanics, Optics & Mecatronics, 2008 (18), No. 33

Methodology

Hough Transform and Generalized Hough Transform

The surveyor’s silhouette positioning and gesture recognition algorithm is based on the Generalized Hough Transform, a method that aims at retrieving the position of a specific object given an edge image of the scene as input. The original Hough transform was proposed in [5] as a method for finding straight lines in the image. The basic method assumes that each line can be parametrized in the form y = mx + b. The search uses two spaces: the original image space (x, y) and the parameter space (m, b). It is easy to show that each point (x, y) in the original space corresponds to a line in the parameter space: b = -xm + y. Moreover, this line represents the pencil of lines passing through the point (x, y) in the original space. Therefore, a simple algorithm for finding a line in the original image is to find the common intersection of the parameter-space lines corresponding to the points on the line searched for.

To implement the search efficiently, the parameter space is first discretized. The resulting data structure is called the Hough accumulator. In the search phase, each point in the original image votes (increases the accumulator value) for every point on the corresponding line in the parameter space. After all points have been processed, the positions of peaks in the accumulator indicate the lines in the original image.
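The voting scheme described above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the function name `hough_lines_mb`, the list of candidate slopes and the intercept bin width `b_step` are assumptions made only for this example.

```python
from collections import Counter

def hough_lines_mb(points, m_values, b_step=1.0):
    """Vote in a discretized (m, b) parameter space for lines y = m*x + b.

    m_values and b_step are illustrative discretization choices.
    """
    accumulator = Counter()
    for x, y in points:
        for m in m_values:
            b = y - m * x                # the parameter-space line b = -x*m + y
            b_bin = round(b / b_step)    # discretize the intercept
            accumulator[(m, b_bin)] += 1
    return accumulator

# Five points on y = 2x + 1 plus one outlier.
points = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (2, 0)]
m_values = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
acc = hough_lines_mb(points, m_values)
(best_m, best_b_bin), votes = acc.most_common(1)[0]
print(best_m, best_b_bin, votes)  # prints: 2.0 1 5
```

The five collinear points all vote for the same cell (m = 2, b = 1), while the outlier scatters its votes over other cells, which is why the peak identifies the line.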

Figure 1: Examples of gestures performed in communication between members of the surveying team. Starting from the top-left picture, the gestures have the following meanings: “left”, “right”, “backward”, “stop”, “forward”.

The Hough transform can also be used for detection of other parametric shapes such as circles and ellipses. In [6], Ballard proposed a generalization of the Hough transform that makes it possible to encode and detect arbitrary shapes. This method is known as the Generalized Hough Transform (GHT). In this method, the edges or points belonging to the shape are stored in a structure called the R-table. To encode a shape, a reference point is chosen first. Then, for each point of the shape, the vector connecting the point being processed and the reference point is stored in the R-table. For convenience, this vector can be encoded in polar form as a pair (r, β), where β denotes the azimuth and r denotes the vector length. This representation is especially useful when rotations of the shape are allowed. To increase efficiency and reduce noise, the entries in the R-table are additionally grouped according to the gradient direction (φ) at the given point of the shape. An example of an R-table is given in Table 1.
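The encoding step just described can be sketched as follows. This is only an illustration under stated assumptions: the names `build_r_table` and `reconstruct_point`, the number of gradient-direction bins and the sample shape are hypothetical, and the stored vector is assumed to point from the shape point to the reference point.

```python
import math
from collections import defaultdict

def build_r_table(shape_points, reference, n_bins=4):
    """Group polar (r, beta) entries by gradient-direction interval.

    shape_points: iterable of (x, y, phi), phi = gradient direction in radians.
    reference:    (x_r, y_r), the chosen reference point.
    """
    x_r, y_r = reference
    delta_phi = 2 * math.pi / n_bins
    table = defaultdict(list)
    for x, y, phi in shape_points:
        dx, dy = x_r - x, y_r - y        # vector from the shape point to the reference
        r = math.hypot(dx, dy)           # vector length
        beta = math.atan2(dy, dx)        # azimuth
        row = int((phi % (2 * math.pi)) / delta_phi) % n_bins
        table[row].append((r, beta))
    return table

def reconstruct_point(reference, r, beta):
    """Recover a shape point from the reference point and one (r, beta) entry."""
    x_r, y_r = reference
    return (x_r - r * math.cos(beta), y_r - r * math.sin(beta))

# Corners of a square with made-up gradient directions, reference at the centre.
shape_points = [(0, 0, 0.1), (4, 0, 1.7), (4, 4, 3.3), (0, 4, 4.9)]
reference = (2, 2)
table = build_r_table(shape_points, reference)
rebuilt = [reconstruct_point(reference, r, beta)
           for entries in table.values() for r, beta in entries]
print(sorted(table), rebuilt)
```

Each of the four corners falls into a different gradient-direction row here, and reconstructing from the stored entries returns the original corner coordinates up to floating-point error.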


Table 1. Example of an R-table

Point   φ             r
0       [0; ∆φ)       (r, β)00; (r, β)01; (r, β)02 ...
1       [∆φ; 2∆φ)     (r, β)10; (r, β)11; (r, β)12 ...
2       [2∆φ; 3∆φ)    (r, β)20; (r, β)21; (r, β)22 ...
...     ...           ...

Reconstruction of every shape point (x_s, y_s) from the R-table, given a reference point (x_r, y_r), can be performed using the following equation:

$$
\begin{aligned}
x_s &= x_r - r\cos\beta \\
y_s &= y_r - r\sin\beta
\end{aligned}
\qquad (1)
$$

During the shape search phase, a 2D Hough accumulator of possible locations of the shape reference point is created. In the next step, each edge point in the image votes for different locations of the reference point according to the entries of the R-table. The number of votes produced by a single image point can be limited by choosing only the single row of the R-table that corresponds to the edge orientation. As in the previous case, the positions in the accumulator with the maximum number of votes become the preferred locations of the shape reference point.

The GHT can be used for detecting scaled and rotated versions of the model by modifying equation (1) with a scale factor and a rotation angle and compiling separate accumulators for each discretized rotation and scale value. In the task of gesture recognition, the rotation coefficient can be neglected and only the change in scale needs to be taken into consideration. In the algorithm used in this work, R-table entries are not encoded in polar form but in (x, y) coordinates. To avoid angle quantization problems (we deal with non-rigid objects), only one gradient direction interval is used in the experiments.

Preparation of Edge Image

Before the GHT can be applied to the image, a reliable set of edges that represents the shapes of objects present in the image must be available. The method used in the approach described here is the Canny edge detector, proposed by Canny in [7]. In the first stage of detection, the image is filtered with a Gaussian kernel to eliminate noise. Then image gradients in two different directions are obtained using Sobel operators. As a result, both edge strength and edge direction can be computed for every pixel of the input image. Next, edge thinning is applied by suppressing weaker pixels not aligned with the local edge direction. Finally, hysteresis thresholding is used to define the edges and at the same time eliminate the effect of broken lines. In this approach two thresholds, high and low, are used. Surpassing the high threshold is required to initiate an edge chain, but when the edge is tracked the low threshold is used instead. Both the original algorithm and the one used in this paper work on grayscale images, although other means of obtaining image gradients from color images may also be used.

The edge image of a natural outdoor scene obtained in this way may still contain many unwanted edges that may impair both the efficiency and the accuracy of model location and gesture recognition. However, in this work, as in [1], it is assumed that the object of interest used for recognition is the surveyor’s coat. It is also assumed that the surveyor’s coat is uniformly colored. Therefore, as in [1], color analysis can be used in image pre-processing in the approach presented here.

The traditional RGB color space is not suitable for this task, as all three of its values depend on the lighting conditions. Much more convenient are color spaces that separate chrominance and luminance values. The color of the surveyor’s coat is detected based primarily on chrominance values, so a change in light intensity (e.g. shadows) affects the results of segmentation much less.

The color space used for segmentation in this work is YCbCr. The matrix that can be used for the transformation between RGB and YCbCr is given in equation (2) (all color values are assumed to be in the range 0..255).

$$
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix}
=
\begin{bmatrix}
0.299 & 0.587 & 0.114 \\
-0.1687 & -0.3313 & 0.5 \\
0.5 & -0.4187 & -0.0813
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
+
\begin{bmatrix} 0 \\ 128 \\ 128 \end{bmatrix}
\qquad (2)
$$

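As a small worked illustration of the RGB to YCbCr transformation given above, the following Python sketch applies the coefficients directly to single pixels. The function name `rgb_to_ycbcr` and the sample colors are assumptions for this example only.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel (values in 0..255) to (Y, Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.4187 * g - 0.0813 * b + 128
    return y, cb, cr

# Two neutral greys of different brightness: only the luminance Y differs,
# while both chrominance components stay at the neutral value 128.
dark_grey = rgb_to_ycbcr(64, 64, 64)
light_grey = rgb_to_ycbcr(192, 192, 192)
# A saturated red moves the chrominance values away from 128.
red = rgb_to_ycbcr(255, 0, 0)
print(dark_grey, light_grey, red)
```

For neutral colors the chrominance rows of the matrix sum to zero, which is why a pure brightness change leaves Cb and Cr at 128 in this example.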
In the proposed solution, static thresholds on the Y, Cb and Cr values were established experimentally. In the recognition phase, every image is transformed into the YCbCr space. The static thresholds are then used to produce a binary image containing the regions of interest. To remove small objects, which are mainly the result of noise, the area of each connected object in the binary image is measured, and objects with an area smaller than a pre-specified threshold are removed from the binary image. A flood-fill algorithm is used in this step. The binary image is then subjected to several repeated morphological dilation operations in order to compensate for inaccuracies in color detection and thresholding and to reduce the chance that important parts of the picture are left out of the final image. The binary image thus obtained is used as a mask for filtering the edge image produced by Canny edge detection. A sample result of edge image filtering using the calculated color of the surveyor’s coat is given in Fig. 2.
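The mask clean-up and edge filtering steps described above can be sketched in plain Python as below. This is an illustrative reading of the text rather than the author's code: the function names, the 4-connectivity of the flood fill, the 3x3 structuring element and the toy images are all assumptions.

```python
from collections import deque

def remove_small_regions(mask, min_area):
    """Clear 4-connected regions smaller than min_area (flood fill via BFS)."""
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            region, queue = [], deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for y, x in region:
                    out[y][x] = 0
    return out

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        grown = [row[:] for row in mask]
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    for ny in range(max(0, y - 1), min(h, y + 2)):
                        for nx in range(max(0, x - 1), min(w, x + 2)):
                            grown[ny][nx] = 1
        mask = grown
    return mask

def filter_edges(edges, mask):
    """Keep only the edge pixels that lie inside the mask."""
    return [[e if m else 0 for e, m in zip(erow, mrow)]
            for erow, mrow in zip(edges, mask)]

color_mask = [[0, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0, 1],   # the lone pixel on the right is noise
              [0, 0, 0, 0, 0, 0]]
edges = [[1, 0, 0, 1, 0, 0],
         [0, 1, 0, 0, 0, 1],
         [0, 0, 1, 0, 0, 1],
         [1, 0, 0, 1, 0, 0]]
cleaned = remove_small_regions(color_mask, min_area=2)
grown = dilate(cleaned, iterations=1)
filtered = filter_edges(edges, grown)
```

In this toy case the isolated mask pixel is discarded, the remaining 2x2 region grows by one pixel in every direction, and the two edge pixels in the rightmost column are removed because they fall outside the dilated mask.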

Figure 2: Edge image filtering results using the analysis of the expected color of the object of interest (surveyor’s coat). Left: original image; right: image after filtering.

Templates for Model Objects

To prepare the model objects from which the GHT R-tables for the 5 distinct gestures are created, a procedure similar to the one presented in the two previous sections is adopted. This time, however, a manual selection of an area in the binary image (created as a result of color analysis) is required to locate the object of interest. Afterwards, the Canny-generated edge image within the chosen area is encoded into the appropriate R-table. Together with the R-table, the corresponding area of the binary image is also stored, forming a binary template (in this case morphological dilation is not applied). In this way, the training data for the positioning and recognition module consist of only 5 selected pictures showing different gestures. The generated GHT edge templates and binary templates are shown in Fig. 3.

Figure 3: GHT edge templates (top) and binary templates (bottom) generated for the training set.

Interpretation of Hough Transform Results

In total, 15 different GHT R-tables are used in the recognition process. Each of the 5 gestures corresponds to 3 R-tables, one for each scaling of the model object (scales used: 1.2, 1.0, 0.8). Each of the 15 accumulators used for recognition has a resolution of 5 pixels. In the discovery phase, after the accumulator has been filled using the methods described earlier, the cell values are normalized by the number of voting elements in the given R-table, so that the results can be compared between different R-tables. In the solution described here, the voting result is also multiplied by the value of the scale used. Without this step, objects at smaller scales tend to be unjustifiably preferred over larger ones because of the greater concentration of votes in a smaller area.

Our experiments reveal that image regions where large and dense clutters of edges are detected (for instance bushes, tree branches or fences) can dominate the response in the accumulator even in the absence of the object of interest. One possible solution is to use very small and therefore precise accumulator cells (e.g. one pixel in size), however at the expense of the ability to

generalize. Another possible solution is to look for local peaks in the accumulator, disregarding the absolute values of the cells. In the algorithm presented, the second option is chosen, and a high-pass filter is applied to the accumulator before any further processing. In this work the filter is approximated by convolving the accumulator with the following kernel:

$$
K = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{bmatrix}
\qquad (3)
$$

This approach also helps to some extent to overcome the effect of maxima blurring in the accumulator. An example of shape detection with and without filtering is given in Fig. 4.
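The accumulator handling described in this section (voting into 5-pixel cells, normalization by the R-table size, weighting by scale and high-pass filtering) might be sketched as below. The function names, the zero padding at the accumulator border and the toy data are assumptions; the R-table is taken to hold (dx, dy) displacements from an edge point to the reference point, with a single gradient-direction interval.

```python
def vote(edge_points, r_table, shape, cell=5):
    """Fill a 2D accumulator of candidate reference-point positions."""
    height, width = shape
    acc = [[0.0] * (width // cell + 1) for _ in range(height // cell + 1)]
    for x, y in edge_points:
        for dx, dy in r_table:          # displacements already scaled
            rx, ry = x + dx, y + dy
            if 0 <= rx < width and 0 <= ry < height:
                acc[int(ry) // cell][int(rx) // cell] += 1.0
    return acc

def normalize(acc, n_entries, scale):
    """Divide by the R-table size and weight by the model scale."""
    return [[v / n_entries * scale for v in row] for row in acc]

KERNEL = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]

def high_pass(acc):
    """Convolve the accumulator with KERNEL (cells outside count as zero)."""
    rows, cols = len(acc), len(acc[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        out[i][j] += KERNEL[di + 1][dj + 1] * acc[ii][jj]
    return out

def best_cell(acc):
    """Return (row, col) of the largest accumulator value."""
    return max((v, i, j) for i, row in enumerate(acc) for j, v in enumerate(row))[1:]

# A cross-shaped model placed with its reference point at (32, 22).
r_table = [(2, 0), (-2, 0), (0, 2), (0, -2)]
edge_points = [(30, 22), (34, 22), (32, 20), (32, 24)]
acc = normalize(vote(edge_points, r_table, shape=(40, 60)), len(r_table), scale=1.0)
hit = best_cell(acc)

# A broad plateau (dense clutter) versus a weaker but isolated peak.
clutter = [[0.0] * 7 for _ in range(7)]
for i in range(1, 4):
    for j in range(1, 4):
        clutter[i][j] = 5.0
clutter[5][5] = 4.0
raw_hit = best_cell(clutter)
filtered_hit = best_cell(high_pass(clutter))
```

In the second example the unfiltered maximum lies on the plateau, whereas after filtering the isolated peak at cell (5, 5) wins, which is the effect the high-pass step is meant to achieve.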

Figure 4: Detected shapes corresponding to the 3 best hits from the Hough accumulator. Top: the edge image processed (after color-based filtering). Middle-left: votes in the non-filtered Hough accumulator for the presented gesture (scale 1.0). Middle-right: votes in the high-pass filtered Hough accumulator for the presented gesture (scale 1.0). Bottom-left: detected shapes for the 3 best hits in the non-filtered Hough accumulator (located very close to one another). Bottom-right: detected shapes for the 3 best hits in the high-pass filtered Hough accumulator.

Post-processing of Hough Transform Results

Although in many situations the best hit provided by the GHT accumulator gives acceptable results, the correct answer may appear in the second or third position with regard to the Hough accumulator response. On many occasions, wrong hits can be easily rejected by resorting again to color analysis. One method is to position and scale the binary template of a gesture according to the data obtained from the GHT and compare it against the corresponding area in the binary image created as a result of color analysis (specifically, thresholding in the YCbCr space) of the currently processed image. As a result, one obtains the correspondence between the binary template and the area of pixels in the image that are likely to belong to the surveyor’s coat. The algorithm may be implemented as follows:

    Let I be the binary image after initial processing in YCbCr space
    Let T[] be the map of binary templates for different gestures
    best_coverage := -1 ;
    best_hit := ∅ ;
    Compute and order n best hits in the Hough accumulators
    for each hit do
    begin
        translate and scale T[hit] according to the data specified in the accumulator
        coverage := number_of_pixels(I ∩ T[hit]) / number_of_pixels(T[hit]) * 100 ;
        if coverage > best_coverage then
        begin
            best_coverage := coverage ;
            best_hit := hit ;
        end
        if coverage > threshold then exit loop ;
    end
    best_hit specifies the best template type, scale and position found in the image

As can be seen, this simple algorithm looks for the best match in the Hough accumulators that has an acceptable (larger than a specified threshold) correspondence between the binary gesture template and the binary image. Only when no such match is found is the one with the highest correspondence chosen. In the experiments performed, typically the 10 best hits from the Hough accumulators were analyzed and a pixel coverage of over 82% was required. An example of matching a binary template against the image is given in Fig. 5.
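The selection procedure above can be expressed as a short runnable Python sketch. It follows the structure of the pseudocode, but the data layout is an assumption: templates are lists of (dx, dy) pixel offsets, each hit is a hypothetical (gesture, scale, position) tuple already ordered by accumulator response, and scaling is done by simple rounding of the offsets.

```python
def coverage(image_mask, template_pixels):
    """Percentage of template pixels that fall on foreground pixels of image_mask."""
    if not template_pixels:
        return 0.0
    h, w = len(image_mask), len(image_mask[0])
    hits = sum(1 for x, y in template_pixels
               if 0 <= y < h and 0 <= x < w and image_mask[y][x])
    return 100.0 * hits / len(template_pixels)

def place_template(template, scale, position):
    """Scale the template offsets and translate them to the hit position."""
    px, py = position
    return {(px + round(dx * scale), py + round(dy * scale)) for dx, dy in template}

def select_hit(image_mask, templates, hits, threshold=82.0):
    """Return the first hit above the threshold, else the hit with best coverage."""
    best_coverage, best_hit = -1.0, None
    for hit in hits:
        gesture, scale, position = hit
        placed = place_template(templates[gesture], scale, position)
        c = coverage(image_mask, placed)
        if c > best_coverage:
            best_coverage, best_hit = c, hit
        if c > threshold:
            break                       # acceptable match found, stop searching
    return best_hit, best_coverage

# Binary image with a 2x2 foreground block at x = 2..3, y = 2..3.
image_mask = [[1 if 2 <= x <= 3 and 2 <= y <= 3 else 0 for x in range(6)]
              for y in range(6)]
templates = {"stop": [(0, 0), (1, 0), (0, 1), (1, 1)]}
hits = [("stop", 1.0, (0, 0)), ("stop", 1.0, (2, 2)), ("stop", 1.0, (3, 3))]
best_hit, best_coverage = select_hit(image_mask, templates, hits)
print(best_hit, best_coverage)
```

Here the first hit from the accumulator has zero coverage and is passed over, the second covers the foreground block completely and ends the search, and the third hit is never examined.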

Figure 5: Results of matching the binary template against the image. Green denotes well-matched pixels of the template, red denotes badly matched pixels of the template. Left image: good template match; right image: bad template match.

Experimental results

The experiments were performed on 101 test images. In each image there was a person performing one of the 5 gestures. 5 of the images were used to generate the Hough R-tables and binary templates. They were not, however, excluded from the test set. 3 versions of the algorithm were tested. In the first, simplest version, the recognition result was based only on the best hit found in the original GHT accumulator. In the second, the high-pass filter was applied to the accumulators before extracting the best result. In the third version, the full algorithm described in the previous sections was used: multiple best matches from the GHT accumulators were verified by the binary template correspondence. The results of the experiments are given in Table 2.

During the experiments, both the recognition accuracy and the person positioning accuracy were tested. A sample was treated as recognized if the correct gesture was observed on the output. A sample was treated as recognized and well positioned if the output silhouette corresponded to the expected gesture and was correctly positioned over the surveyor’s body (some variations in scale were accepted).

Table 2. Results of recognition experiments

Algorithm                                                   Recognition accuracy   Recognition + positioning accuracy
Best hit GHT match                                          60%                    50%
Best hit GHT match with high-pass filtering                 68%                    64%
Incl. post-processing with binary template correspondence   81%                    79%

It can be clearly seen that the best results were obtained using the full proposed algorithm. It was observed, however, that smaller scale values were preferred over larger ones more often in this case. This behavior can be attributed to the last processing step, because a good correspondence between a smaller binary template and the image is more probable. It should also be noted that the recognition result for the basic GHT algorithm is slightly overestimated, since a significant number of positive matches were accidental. This fact is well reflected in the “Recognition + positioning accuracy” column of the table.

In comparison with the solution proposed in [1], the algorithm presented in this paper is quite slow. Processing a single image of slightly below 1 million pixels takes several seconds. The most time-consuming operations are edge detection by the Canny operator and, above all, template matching performed for 15 R-tables. The problem is aggravated by the large number of strong edges present in real-world test images. Although very fast real-time processing is not crucial for this type of application, some improvement should be sought.

In terms of accuracy, the approach to surveyor’s gesture recognition proposed in this paper can match or even surpass the results obtained in [1]. The completely different philosophy applied here makes it possible to correctly interpret some scenes that could not be interpreted properly using color analysis alone (see Fig. 6). However, it must be admitted that the samples causing problems for the method described in [1] are also more likely to be problematic for the algorithm described here. These samples are usually characterized by poor color contrast between the object of interest and the background, which causes difficulties both in segmentation using color analysis and in obtaining high-quality edges for recognition using the GHT.

Figure 6: Left image: example of bad segmentation using exclusively color analysis as described in [1]; right image: gesture recognition using the methods described in this paper.

Conclusion and future works

This paper presented a template matching algorithm for recognition and positioning of gestures performed by surveyors. The proposed method gives acceptable accuracy in recognizing gestures and locating the silhouette of the surveyor. More importantly, the algorithms described in this article prove in some situations to be complementary to the methods proposed in [1], which are based on the interpretation of local color features and the use of an SVM classifier. It would therefore be advisable in the future to devise a scheme that binds both methods together and uses the two strategies alternately: color features and SVM classification when the color contrast in the image is good, and the edge-based template matching proposed here in other cases. Future development should also focus on the efficiency of the proposed algorithm.

References

[1] Wilkowski, A., M. Luckner. “Automatyczne Rozpoznawanie Gestów Wykonywanych Przez Geodetów w Pracach Polowych.” Mlodzi Naukowcy Wobec Wyzwan Wspolczesnej Techniki. Warsaw. 2008. To be published.
[2] Luckner, M. “Comparison of Hierarchical SVM structures in Letters Recognition Task.” IEEE CIS - Poland Chapter Edited Volume, ICAISC’2008. Zakopane. 2008.
[3]
[4] Kukharev, G., A. Nowosielski. “Visitor Identification – Elaborating Real Time Face Recognition System.” WSCG'2004 Short Communications, 157-164. Plzen, Czech Republic. 2004.
[5] Hough, P.V.C. “Methods and means for recognizing complex patterns.” US patent 3069654. 1962.
[6] Ballard, D.H. “Generalizing the Hough transform to detect arbitrary shapes.” Pattern Recognition, 13(2): 111-122. 1981.
[7] Canny, J. “A Computational Approach to Edge Detection.” IEEE Trans. Pattern Analysis and Machine Intelligence, 8(6): 679-698. 1986.