
Human Hand Event Detection for Ripley the Robot

Advanced Undergraduate Project Report

Marjorie Cheng, MIT, May 2006


Table of Contents

Abstract
1. Introduction
2. Ripley and his Vision Pipeline
2.1 The “segmenter” module (color model-based segmentation)
2.2 The “trainmodel” module (training for color models)
2.3 The “stereobjecter” module (binocular match and depth calculation)
2.4 The “stm” module (short term memory and event detection)
3. Previous Work
3.1 Conversational Robotics
3.2 Hand Detection
4. Further Theory and Comments on the Methods Used
4.1 Hand Detection
4.2 Depth Detection
4.3 Hand Endpoint Detection
4.4 Inclination Detection
4.5 Creation of GSM events for “Touch/Pickup/Putdown”
5. Results
5.1 Hand Detection
5.2 Hand/Object depth estimation
5.3 Hand Inclination estimation
5.4 Mental Model Touch and Pickup and Putdown
6. Conclusions
7. Future Work
Acknowledgements
References


Abstract

For a robot to cooperate with humans, it is not enough for it to be able to manipulate objects itself – it should also be able to recognize what humans are doing with objects, respond to their actions, and ultimately interact with humans and learn from them. Ripley is a conversational robot that can see objects, detect human faces, answer questions about the present or the past, and manipulate and imagine objects. Ripley communicates with the human user through speech and pointing gestures.

To achieve our vision of cooperative robots, the next step for Ripley is the identification of human hands and of the events surrounding them. The objective of this project was the development of human hand detection and human hand-object interaction event recognition capabilities for Ripley. This was accomplished through a mixture of tools: training of stochastic color models, stereo vision, and object permanence techniques. The implemented system consists of several new modules that augment Ripley’s existing modular architecture, plus modifications to existing Ripley modules.

1. Introduction

Technology has been advancing to a point where people can envision truly cooperative conversational robots. Ripley aspires to be such a robot. Ripley is a seven-degree-of-freedom robotic arm supplemented with computer vision, speech recognition, and speech synthesis. Its software architecture is based on the concept of “Grounded Situation Models”, created by N. Mavridis and D. Roy [1]. A grounded situation model, as proposed in [1], is a representation of the situation the robot is experiencing, remembering, or imagining. One important future landmark for the Ripley project is for Ripley to pass the “Token Test”, a test given to children to evaluate whether a child has difficulty acquiring speech skills [2].

Currently, Ripley is able to perceive or imagine objects, measure their properties, answer questions about the present or the past, and interact with the objects when told to by a user. My research takes up the challenge of enabling Ripley to better interact with people by allowing it to recognize and interact with human hands, and to recognize hand-related events such as: “your hand started moving”, “your hand touched the green object”, and “your hand pointed towards the small blue ball”.

This is quite an important capability: for a robot to cooperate with humans, it is not enough for the robot to be able to manipulate objects itself – the robot should also be able to recognize what humans are doing to objects. For example, Ripley can currently point towards objects, touch them, pick them up, put them down, and hand them to a human. However, before this project, the robot could not recognize when humans interacted with objects in these ways – and thus a fundamental asymmetry existed. The new capabilities could support many extensions to practical functionality: with appropriate behavioral / linguistic extensions, the robot can now be envisioned as capable of understanding indexicals accompanied by human pointing (“Give me this one!”), handling conditional execution commands (“when I touch the blue one, you give me the red one”), marking the timeline on the basis of human hand events (“remember when my hand started moving”), learning procedures by imitating humans (“the human first touched the green one and then picked up the blue one – that’s what I will do too”), etc.


2. Ripley and his Vision Pipeline

Ripley’s manipulator arm has seven degrees of freedom. Ripley’s environment consists of a table, a human user, and assorted objects placed on the table. His software architecture consists of several modules; an overview of the architecture prior to this project can be seen in Figure 1, which is taken from [3].

Figure 1. The modules in Ripley’s system before the current work was done.


Previously, Ripley utilized monocular vision, which has now been extended to object-level stereo functionality. This was accomplished through augmentations to its vision system, which consists of modules arranged in a five-stage pipeline, as shown in Figure 2. At the front end of the vision system, there is a capture module for each camera (“capture”), supplying 720x480 images at a frame rate of more than 15 fps. At the second stage of the pipeline, there are three species-specific detector/segmenter modules: one for objects on the table, one for human faces, and one for human hands (“segmenter” and “facedetect”). At the third stage reside apparent 2D region permanence modules (“objecter”), followed by stereo matching modules at the fourth stage (“stereobjecter”). Finally, stereo and mono results are pushed forward to one of the four modality-specific situation model update modules – namely the “visor”, responsible for suggesting changes to the current situation model on the basis of visual information [3].

In the remainder of this section, I briefly touch upon the modules that were created or modified during this work, namely the “stereobjecter” (created on the basis of the existing “objecters”) and the “segmenter”, “visor” and “stm” (slightly modified), as well as “trainmodel”, the training module for the segmenter1.

1 The “visor” modifications as well as the language-related “stm” modifications were carried out by N. Mavridis.


Figure 2. Diagram of the modules in Ripley’s current vision system: for each camera, a capture module feeds the segmenters (objects, hands, face) and facedetect; their region and ellipse streams pass through per-type objecters, and the per-eye results are merged by the stereobjecters (objects, hands, faces), whose outputs go to the visor.


2.1 The “segmenter” module (color model-based segmentation)

To identify objects, the current segmenter uses stochastic color models to distinguish various objects from the tablecloth background. Several processing stages follow, including sub-sampling, contour detection, ellipse fitting, etc. In more detail: the images from the cameras are decomposed into blocks of five by five pixels, and each block is assigned to either object/hand or background according to where the majority of its pixels belongs. Then contours are detected and the image is effectively broken into connected regions, after filtering out some spurious regions. For each of these regions, a best-fitting ellipse is computed. Two output streams emanate from the segmenter: the ellipses and the regions (pixel-by-pixel). The ellipses are translated to spherical blobs in the internal situation model (in “visor”). Each ellipse contains a center, a major axis, a minor axis, the rotation angle of the ellipse, and other information. The regions are used for 3D voxel-based shape model acquisition, which optionally also takes place in “visor”. Thus, in the internal situation model, each object is represented as a spherical blob, with an optional detailed voxel-based 3D model attached to it. It is important to note that the second output stream of the segmenter, the regions, is also used for human hand endpoint detection, as described in section 4.3.
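To make the block decision concrete, here is a minimal sketch of the five-by-five majority vote, assuming a numpy boolean mask produced by the per-pixel color-model decision (illustrative code, not the actual segmenter implementation):

```python
import numpy as np

def block_majority(mask: np.ndarray, block: int = 5) -> np.ndarray:
    """Label each block x block tile as object/hand (True) when the majority
    of its pixels are foreground in the per-pixel color-model mask."""
    h, w = mask.shape
    h2, w2 = h - h % block, w - w % block                 # drop incomplete edge blocks
    tiles = mask[:h2, :w2].reshape(h2 // block, block, w2 // block, block)
    votes = tiles.sum(axis=(1, 3))                        # foreground pixels per tile
    return votes > (block * block) // 2                   # strict majority

# Example on a synthetic 480x720 mask containing a 100x100 "hand" patch.
mask = np.zeros((480, 720), dtype=bool)
mask[100:200, 300:400] = True
blocks = block_majority(mask)
print(blocks.shape, blocks.sum())                         # (96, 144) 400
```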

2.2 The “trainmodel” module (training for color models)

The stochastic color model used by the segmenter is trained (using “trainmodel”) in the following manner. First, samples of objects or hands are consecutively placed at the center of the field of view. The middle of the image is then sampled, and a color histogram is acquired. A similar procedure is used to acquire color histograms of the background (tablecloth, some of the surrounding environment, and “opposites” – objects when training for hands and vice-versa). Finally, the two histograms are weighted by priors, and a decision table is built from which Ripley can determine, on the basis of pixel color alone, whether a pixel belongs to the object/hand or to the background, by comparing the weighted probabilities of the pixel under the two histograms.
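The following sketch illustrates how such a decision table could be built, assuming colors quantized into a coarse RGB histogram and a simple prior-weighted comparison of the two class histograms (the binning and priors are assumptions for illustration, not the actual “trainmodel” parameters):

```python
import numpy as np

BINS = 16  # quantize each RGB channel into 16 bins -> a 16x16x16 histogram

def color_histogram(pixels: np.ndarray) -> np.ndarray:
    """pixels: (N, 3) uint8 RGB samples -> normalized 3D histogram."""
    idx = (pixels // (256 // BINS)).astype(int)
    hist = np.zeros((BINS, BINS, BINS))
    np.add.at(hist, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return hist / max(hist.sum(), 1)

def decision_table(fg_pixels, bg_pixels, prior_fg=0.3):
    """True where a quantized color is more likely foreground than background."""
    p_fg = color_histogram(fg_pixels) * prior_fg
    p_bg = color_histogram(bg_pixels) * (1.0 - prior_fg)
    return p_fg > p_bg   # boolean lookup table indexed by quantized (r, g, b)

def classify(pixel, table):
    r, g, b = (np.asarray(pixel) // (256 // BINS)).astype(int)
    return table[r, g, b]   # True -> object/hand, False -> background

# Tiny synthetic example: skin-like foreground vs. blue-ish background samples.
rng = np.random.default_rng(0)
fg = rng.normal([200, 160, 130], 10, (5000, 3)).clip(0, 255).astype(np.uint8)
bg = rng.normal([30, 40, 120], 10, (5000, 3)).clip(0, 255).astype(np.uint8)
table = decision_table(fg, bg)
print(classify([205, 158, 128], table), classify([25, 45, 125], table))  # True False
```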

2.3 The “stereobjecter” module (binocular match and depth calculation)

After having considered “segmenter” and “trainmodel” briefly, let us move on to the “stereobjecter”. Prior to this project, Ripley only used one of the two cameras attached to the sides of its head. In order to give Ripley depth perception, which is invaluable for hand event detection as well as for 3D modeling, we decided to use both cameras. The ellipses corresponding to objects/hands from the left and right eyes enter the stereobjecter. Objects are then matched across eyes on the basis of a composite positional/color/size-based distance metric. Finally, for each resulting matched pair, a virtual “middle eye” (in between the left and right) apparent position is calculated, as well as an apparent depth, given the inter-eye distance. The new “middle-eye” objects, augmented by depth estimates, are the outputs passed to the “visor”. In the “visor”, these are resolved from head-centered to absolute coordinates, and are placed in the situation model after dealing with 3D object permanence, as described in [4].

2.4 The “stm” module (short term memory and event detection)


Ripley’s internal model of the world is designed after the “Grounded Situation Models” (GSM) proposal [1]. In the GSM, each temporal slice of a situation is called a “moment”. Multiple moments build up detailed “histories”. In order to quantize the quasi-continuum of histories, the histories are parsed into sequences of standardized “events”, implemented by event-specific detector routines. Such events include changes of the kinetic state of objects, appearances/disappearances, etc. (“when the green one appeared”, “when my head started moving”, etc.). New event types and recognizers were created for “hand starting to / stopped / is touching object” as well as “pick up”, “put down” and “move up/down”. In terms of implementation, both the keeping of the history and the event detection take place in the “stm” module (termed the “rememberer” in [1]). Later, these events can be accessed through language, in order to facilitate “remembering”, answering questions about the past, etc.
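As a concrete illustration of parsing a history of moments into standardized events, here is a minimal sketch that detects “started moving” / “stopped moving” events for one tracked item from timestamped positions (hypothetical data layout and threshold; not the actual “stm” code):

```python
import math

def motion_events(history, speed_threshold=0.02):
    """history: list of (t, (x, y, z)) moments for one tracked item.
    Returns a list of (t, event_name) marking kinetic-state changes."""
    events, moving = [], False
    for (t0, p0), (t1, p1) in zip(history, history[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue
        speed = math.dist(p0, p1) / dt
        if speed > speed_threshold and not moving:
            events.append((t1, "started moving"))
            moving = True
        elif speed <= speed_threshold and moving:
            events.append((t1, "stopped moving"))
            moving = False
    return events

history = [(0.0, (0.0, 0.0, 0.0)), (0.1, (0.0, 0.0, 0.0)),
           (0.2, (0.05, 0.0, 0.0)), (0.3, (0.10, 0.0, 0.0)),
           (0.4, (0.10, 0.0, 0.0)), (0.5, (0.10, 0.0, 0.0))]
print(motion_events(history))  # [(0.2, 'started moving'), (0.4, 'stopped moving')]
```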


3. Previous Work

Related previous work can be found in the area of hand detection, but also more generally in conversational robotics. A very short introduction follows.

3.1 Conversational Robotics

A number of systems and designs have appeared in the literature; some examples can be found in [5]. Ripley the robot has three novel abilities compared to these systems: he can verbalize confidence and uncertainty, refer to past events, and, most importantly, imagine situations through linguistic descriptions.

3.2 Hand Detection

Hand detection and tracking are well studied in the literature. For hand detection, color segmentation and shape matching are common methods. Hand color models can also be trained using different methods, such as Bayesian networks and AdaBoost. Hands are also tracked by their shape, usually for recognizing specific pointing gestures. For example, in [6], gesture recognition is based upon color segmentation, a neural gas network, finger identification, and so forth.

For my project, I used color segmentation methods for hand detection and adapted them to Ripley’s vision system. One difference between the previous works and my work is that my goal was to integrate hand detection into Ripley’s modular vision system rather than to study hand detection in isolation. Also, I focused on the detection of specific hand-related events in a uniform framework, which enables easy integration with language and other behaviors.


4. Further Theory and Comments on the Methods Used

4.1 Hand Detection

To detect human hands, we used stochastic color models, as discussed in section 2. In practice, for the training of hand models, we used samples of the hands of two individuals under artificially varied lighting conditions. The training of the background consisted mainly of the tablecloth surface, but also included the edges and sides of the table, as well as “opposites”, i.e. samples of objects when training for hands and vice-versa. The output of “trainmodel” is a three-dimensional decision table which, when given a pixel color value, decides whether a pixel having this value most likely belongs to {object/hand} or {background}.

Because the training sessions were short, not all possible pixel values were encountered in the training samples. As a result, some form of “smoothing” of the histograms (which had “holes”) was required. Thus, a special diffusion code was used to smooth the histograms. The amount of smoothing required was then tuned on the basis of signal-detection-theoretic measures (False/True Positives/Negatives), after a suitable hand-segmented testing set was created. In other words, the histograms were progressively smoothed until we got satisfactory False Positive / False Negative ratios.
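The sketch below shows one plausible form of such diffusion smoothing together with the false-negative / false-positive check used for tuning; the neighborhood averaging, acceptance thresholds and priors are illustrative assumptions, not the actual diffusion code:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def diffuse(hist: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Simple diffusion: repeatedly average each bin of the 3D color
    histogram with its 3x3x3 neighborhood to fill in 'holes'."""
    h = hist.astype(float)
    for _ in range(iterations):
        h = uniform_filter(h, size=3, mode="nearest")
    return h / h.sum()

def error_rates(table: np.ndarray, fg_idx: np.ndarray, bg_idx: np.ndarray):
    """False-negative / false-positive rates on a hand-labeled test set,
    given a boolean decision table and quantized (r, g, b) test indices."""
    fn = 1.0 - table[fg_idx[:, 0], fg_idx[:, 1], fg_idx[:, 2]].mean()
    fp = table[bg_idx[:, 0], bg_idx[:, 1], bg_idx[:, 2]].mean()
    return fn, fp

def tune(fg_hist, bg_hist, fg_idx, bg_idx, prior_fg=0.3, max_iters=10):
    """Smooth a little more each round until the rates are acceptable.
    fg_hist/bg_hist and the test-set indices are assumed to be built as in
    the earlier decision-table sketch."""
    for i in range(max_iters + 1):
        p_fg = diffuse(fg_hist, i) * prior_fg
        p_bg = diffuse(bg_hist, i) * (1.0 - prior_fg)
        fn, fp = error_rates(p_fg > p_bg, fg_idx, bg_idx)
        if fn < 0.05 and fp < 0.05:     # illustrative acceptance thresholds
            return i, fn, fp
    return max_iters, fn, fp
```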

4.2 Depth Detection

Depth detection was the second capability to be added. To calculate depth, there were two options: using the input from 1) two cameras, or 2) one moving camera. The first option, using both cameras, is the simpler and more stable method for Ripley. Since Ripley’s motor controls are not especially accurate, the use of one moving camera would lead to erroneous results from misjudged camera displacement; such a method would also require static objects, which is not the case for hand interactions.

In object-level stereo, we initially do not know which items in one camera correspond to which in the other. As mentioned above and discussed below, a matching algorithm based on a composite position-color-size distance measure was used in the stereobjecter. There are three possible scenarios in which items may appear to Ripley’s vision system:

1) The item is seen by the left camera but not the right camera.

2) The item is seen by the right camera but not the left camera.

3) The item is seen by both cameras.

We now move on to the calculation of the depth of an object, assuming that we have matched the two items corresponding to it across the cameras. We calculate the depth of an object for case three, as shown in Figure 3. The two cameras’ fields of view overlap and both cameras can see the object. Since the cameras are identical, their field-of-view angle (2Θ) is the same. Thus, from the diagram, we get the following equations:

(1) z = l1 cos(α)

(2) z = l2 cos(β)

(3) x1 = l1 sin(α)

(4) x2 = l2 sin(β)

(5) d = x1 - x2 = l1 sin(α) - l2 sin(β)

We scale x1 and x2 to normalized image coordinates x1′ and x2′, which lie between {-1, 1}. Combining x1′ and x2′ with the above equations, the equation for the depth is:


(6) z = (-d / (2 sin(Θ/2))) * (1 / (x2′ - x1′))

Figure 3: Diagram of the variables used for finding the depth of an object from the two cameras.
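As a worked example of equation (6), the following sketch computes depth from the normalized horizontal positions of a matched item; the baseline and field-of-view values are illustrative assumptions, not Ripley’s calibration:

```python
import math

def stereo_depth(x1n: float, x2n: float, baseline: float, theta: float) -> float:
    """Depth from equation (6): x1n, x2n are the normalized horizontal
    positions (in [-1, 1]) of the matched item in the left and right images;
    baseline is the inter-camera distance d; 2*theta is the field-of-view
    angle of each (identical) camera."""
    disparity = x2n - x1n
    if abs(disparity) < 1e-9:
        raise ValueError("zero disparity: item at the same position in both eyes")
    return -baseline / (2.0 * math.sin(theta / 2.0)) / disparity

# Illustrative numbers: 12 cm baseline, 60-degree field of view (theta = 30 deg).
d, theta = 0.12, math.radians(30.0)
print(round(stereo_depth(0.10, -0.15, d, theta), 3))  # ~0.927 (meters)
```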

4.3 Hand Endpoint Detection

Hand endpoints were detected in the following manner: the major axis of the ellipse of the hand was used as the axis of a new coordinate system, and the hand region’s pixels were projected onto this axis. The direction of this axis was decided as follows. The two endpoints of the major axis of the ellipse are classified as either the “positive” or the “negative” endpoint. The “negative” endpoint is the one which is closer to the boundary of the camera window; the “positive” endpoint is the other one. The axis chosen for the projection of the hand region’s pixels points in the direction of the “positive” endpoint. After projecting the region’s pixels onto the new coordinate system, the pixel within the region with the maximal projection coordinate was chosen as the endpoint, as seen in Figure 4. See the results section for examples.



Figure 4. The ellipse of the hand is drawn over the original hand region. The dark blue arrow shows the projection of the coordinate furthest away in the object’s coordinate space. The dark blue arrow points to the cyan color dot, which is the endpoint found. The light blue arrow points to the normal x coordinate of the endpoint.
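Here is a minimal sketch of this projection step, assuming the hand region is available as an array of pixel coordinates together with the fitted ellipse’s center, orientation and major-axis half-length (illustrative code, not the actual segmenter routine):

```python
import numpy as np

def hand_endpoint(region_xy, center, angle, half_major, image_size):
    """Pick the fingertip-like endpoint of a hand region.

    region_xy : (N, 2) array of (x, y) pixel coordinates of the hand region.
    center, angle, half_major : fitted-ellipse center, major-axis orientation
        (radians) and half-length.
    image_size : (width, height) of the camera image.
    """
    c = np.asarray(center, dtype=float)
    u = np.array([np.cos(angle), np.sin(angle)])
    e1, e2 = c + half_major * u, c - half_major * u      # major-axis endpoints

    w, h = image_size
    def border_dist(p):                                   # distance to nearest image edge
        return min(p[0], w - p[0], p[1], h - p[1])

    # The endpoint closer to the border is "negative" (the side where the arm
    # leaves the frame); project along the axis toward the "positive" endpoint.
    positive = e1 if border_dist(e1) >= border_dist(e2) else e2
    direction = (positive - c) / np.linalg.norm(positive - c)

    proj = (np.asarray(region_xy, dtype=float) - c) @ direction
    best = region_xy[int(np.argmax(proj))]                # maximal projection coordinate
    return int(best[0]), int(best[1])

# Toy example: a horizontal "hand" strip entering from the left image edge.
ys, xs = np.mgrid[230:250, 0:200]
region = np.column_stack([xs.ravel(), ys.ravel()])
print(hand_endpoint(region, center=(100, 240), angle=0.0, half_major=100,
                    image_size=(720, 480)))               # -> (199, 230)
```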

4.4 Inclination Detection

Combining the capability to detect hands and to perceive their depth, one might also want to treat the hand not just as being parallel to the table, but as possibly having a z-axis inclination. This might, for example, be useful in cases where a human is pointing towards something with an inclined hand. Hand inclination can be calculated by taking the depth at two different points on the hand and computing the angle of inclination from them. Experiments towards this were made, as discussed in the results section of this report.
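For illustration, if the depths at two hand landmarks and their in-image separation were available, the inclination would follow from simple trigonometry, roughly as in this sketch (an assumption-laden illustration, not an implemented Ripley routine):

```python
import math

def hand_inclination(p1, z1, p2, z2):
    """Inclination of the hand relative to the table plane, given two
    landmarks p1, p2 (x, y in table-plane coordinates) and their depths."""
    planar = math.dist(p1, p2)          # separation in the table plane
    return math.degrees(math.atan2(z2 - z1, planar))

# Example: fingertip landmark 3 cm deeper than the wrist landmark, 10 cm apart.
print(round(hand_inclination((0.00, 0.00), 0.00, (0.10, 0.00), 0.03), 1))  # 16.7
```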

4.5 Creation of GSM events for “Touch/Pickup/Putdown”

To integrate the vision system outputs with Ripley’s behavioral and linguistic subsystems, event models for touch and pickup were added to the mental model. The situation model contains a model of Ripley, a table, objects, and a human head and body. Once hand detection and depth were working, N. Mavridis added a simple human hand to the human model. This model represents where Ripley thinks the human hand is, given what he is seeing, and also shows the human user and audience a visualization of what occurs in Ripley’s situation model, which resides in his mind.

For touch event triggering, we take the x, y, and depth coordinates of the objects and calculate the distances between them to see which objects are close enough to be touching. Our model currently only accounts for human hands touching objects, but can easily be extended to objects touching other objects, for the detection of generic {{thing1} touches {thing2}} events. Picking up and putting down of objects is based upon other events being triggered: if two objects are both moving up (or both moving down) and are touching, the pickup (or putdown) event will be triggered. Pickup is thus a secondary event, based on the primary events of moving and touching. This architecture makes it easy to trigger events whose fundamentals are other, already triggered events.
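A minimal sketch of this kind of primary/secondary event triggering is shown below; the thresholds and state layout are illustrative assumptions rather than the actual “stm” detector code:

```python
import math

def is_touching(hand_pos, obj_pos, obj_radius, slack=0.02):
    """Primary event test: hand close enough to the object's surface.
    Threshold = object radius + an error slack (meters)."""
    return math.dist(hand_pos, obj_pos) <= obj_radius + slack

def vertical_motion(prev_z, z, eps=0.005):
    """Primary event test: 'up', 'down' or None for a tracked item."""
    if z - prev_z > eps:
        return "up"
    if prev_z - z > eps:
        return "down"
    return None

def pickup_or_putdown(hand_prev, hand_now, obj_prev, obj_now, obj_radius):
    """Secondary event: both items moving the same vertical way while touching."""
    if not is_touching(hand_now, obj_now, obj_radius):
        return None
    hand_dir = vertical_motion(hand_prev[2], hand_now[2])
    obj_dir = vertical_motion(obj_prev[2], obj_now[2])
    if hand_dir and hand_dir == obj_dir:
        return "pickup" if hand_dir == "up" else "putdown"
    return None

# Example: hand and object rise together by 2 cm between two moments.
print(pickup_or_putdown((0.30, 0.10, 0.05), (0.30, 0.10, 0.07),
                        (0.31, 0.10, 0.05), (0.31, 0.10, 0.07), 0.03))  # pickup
```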

5. Results

5.1 Hand Detection

The implementation of hand detection within Ripley’s vision pipeline was described in section 2. Numerous unforeseen difficulties were encountered. For example, common color models could not easily be shared across cameras, because the two cameras had different color tunings: one camera had a redder tinge than the other, so the two cameras were trained separately. Another issue was caused by the original tablecloth being used as the background. The tablecloth had a tan color, and since tan and hand colors are quite similar, the system had difficulty separating the two. As a quick and simple solution, we changed the tablecloth to a deep navy blue one.

As discussed before, the segmenter module has two output streams: one for pixel-by-pixel connected regions (the region stream), and one for fitted ellipses (the ellipse stream). In Figure 5, we can see an example of both streams:

Figure 5. Output of the segmenter for a hand. The left is the pixel-by-pixel region stream. The right is the corresponding ellipse in the ellipse stream. Later we will comment upon the cyan and orange points.

5.2 Hand/Object depth estimation

For depth perception, as mentioned before, there are a few issues that must be dealt with. The first problem is reconciling the images we receive from the two cameras – i.e. matching regions across cameras. For simplicity, the assumption is made that for an object to be considered present, it must appear in both cameras. Objects are matched by their color, size, and position, using modifications of the “objecter” code described in [4].

The choice of a suitable distance metric for matching objects across cameras was an important decision. My graduate student supervisor and I decided to put more emphasis on color than on size and position, and, within position, on the vertical difference rather than the horizontal one, since we expect a horizontal apparent displacement due to the displacement between the cameras. For the same reason, the horizontal displacement should only be positive and never negative in sign – thus, an asymmetrical term was introduced in the composite distance metric, punishing negative horizontal displacements of the candidate object pairs much more heavily than positive ones. In other words, there is a heavy penalty for pairs of objects which are not in the correct relative position to each other, given the geometry of the camera positions.
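The sketch below shows one way such an asymmetric composite metric and a simple matching pass could look; the weights and the penalty factor are illustrative assumptions, not the values used in the stereobjecter:

```python
def match_cost(left, right,
               w_color=3.0, w_size=1.0, w_vert=2.0, w_horiz=0.5,
               negative_penalty=10.0):
    """Composite distance between a left-eye and a right-eye item.

    Each item is a dict with 'x', 'y', 'size' and 'color' (an RGB tuple).
    Horizontal displacement dx = x_left - x_right is expected to be positive
    because of the camera geometry, so negative dx is punished much harder."""
    dcolor = sum(abs(a - b) for a, b in zip(left["color"], right["color"])) / 255.0
    dsize = abs(left["size"] - right["size"])
    dvert = abs(left["y"] - right["y"])
    dx = left["x"] - right["x"]
    dhoriz = dx if dx >= 0 else -dx * negative_penalty   # asymmetric term
    return (w_color * dcolor + w_size * dsize +
            w_vert * dvert + w_horiz * dhoriz)

def best_matches(left_items, right_items):
    """Greedy matching: each left item takes the cheapest remaining right item."""
    unmatched = list(range(len(right_items)))
    pairs = []
    for i, li in enumerate(left_items):
        if not unmatched:
            break
        j = min(unmatched, key=lambda k: match_cost(li, right_items[k]))
        pairs.append((i, j))
        unmatched.remove(j)
    return pairs

left = [{"x": 0.40, "y": 0.20, "size": 0.05, "color": (200, 160, 130)}]
right = [{"x": 0.35, "y": 0.21, "size": 0.05, "color": (198, 158, 128)},
         {"x": 0.70, "y": 0.60, "size": 0.10, "color": (30, 40, 120)}]
print(best_matches(left, right))  # [(0, 0)]
```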

One central issue in our system concerns the calculation of the relative depth of the hand, as seen from the two cameras. Here, we approached the problem through three successive attempts, each an increasingly better approximation to what we were trying to determine.

Approximation 1: Using the centers of the ellipses for hand depth estimation

To reconcile the two images and figure out the depth, we decided to initially work with the ellipse output stream of the segmenter rather than the region stream. At first, we computed depth from the centers of the ellipses. For objects such as squares, cans and balls, the depth can be calculated fairly easily, since the center of each ellipse should lie approximately at the center of the object, because such objects usually fit completely within the camera’s field of view. For this scenario, we found the depth as described in section 4.2.

Hands are different from objects, because the whole hand is usually not visible in the camera. Furthermore, in most cases a smaller part of the hand will be visible in one camera than in the other.

Let us consider three cases:

1) The hand region is bigger in the right camera than in the left.

2) The hand region is bigger in the left camera than in the right.

3) The hand comes in from the top or bottom: same size, but different placement.


Figure 6: Combinations of differing hand lengths across the two cameras. Case one is shown in (A) the right camera and (B) the left camera. Case two is shown in (C) the right camera and (D) the left camera. Case three is shown in (E) the right camera, and (F) the left camera.

Thus, if we were to use the centers of the ellipses, with high probability the two centers would correspond to two different parts of the hand instead of the same part. In Figure 6, the centers are marked with an orange dot. In all the cases except the third, the centers of the ellipses mark different parts of the hand. We cannot detect precisely where the hand becomes the wrist (i.e. how the wrist becomes narrower), because if a user is pointing at a vertically slanted angle, the wrist may actually appear larger than the hand.

Approximation 2: Using a selected endpoint of the ellipses for hand depth estimation

Thus, as a second attempt, instead of using the center, we decided to use an endpoint of the ellipses, though there are a few issues involved with the idea. One problem is deciding which of the endpoints of each ellipse to choose and how to match up the endpoints across cameras. We match up the ellipses major axis to major axis and minor axis to minor axis, under the assumption that a pointing hand is usually longer in the pointing direction than it is wide. Though this assumption is occasionally violated, it holds in most cases. Next, to determine which endpoint not to use, we calculate which of the two endpoints is closer to the edge of the screen. Since the hand is connected to the human body, it will remain connected to the arm, and thus will extend off the screen. The requirement is that the user’s hand and arm show his skin color under the cameras’ views.

Thus, the endpoints of the major axis are taken, and the x and y distance of each endpoint to the image border is calculated to see which is closer to the edge of the screen. We take the opposite endpoint, which should be the end of the hand, and compute its depth. There is a problem when the hand extends completely across one camera’s field of view: that camera may then select the wrong endpoint as the one connected to the arm. However, if we assume that the hand extends completely across only one camera’s field of view, we can use the other camera’s information about the endpoint and translate it onto the first camera’s image to find the correct endpoint.

To test and evaluate hand recognition and depth perception, we reconfigured some video channels to output specific important points. For example, with depth perception it was difficult to figure out the positions of the endpoints and the centers of the ellipses from printed numbers alone, since the images of the hands are constantly moving. Thus, we overlay the center of the ellipse and the endpoint of the ellipse on the video output, as seen in Figure 5 and Figure 6.


Approximation 3: Using a selected region endpoint for hand depth estimation

We realized that the endpoint of the ellipse may not be the end of the actual hand, because the calculation of the ellipse’s major axis can be inaccurate, and because the hand may not be exactly straight. To find the real endpoint of the hand, we filtered through each region in the image and calculated the point farthest along the x axis in the object’s space coordinates, as shown previously in section 4.3.

5.3 Hand Inclination estimation

Calculating the inclination of the hand proved more difficult than expected, because it was hard to pinpoint the exact location of a landmark other than the hand endpoint. Candidates were the wrist or the rounded palm of the hand, but either is unreliable in many cases because of vertically slanted hands. This problem is left as a future extension. However, if we assume that the hand is parallel to the ground, or that it simply points towards objects on the table, what we have built is already sufficient to figure out the general vicinity the finger is pointing at, and to resolve indexicals.

5.4 Mental Model Touch and Pickup and Putdown

The mental-model touch event detector routine was implemented first, followed later by the pickup detector. The touch event consists of three parts: start touching, is touching, and stop touching. The event allows Ripley to easily remember when a touch event happens, and allows various questions concerning time to be asked. The touch event is triggered by checking whether the distance between the hand and the object falls within a certain threshold; the threshold consists of the object’s radius plus an error margin. The pickup event triggering depends on the human hand and the object both moving, as well as on the human hand touching the object.

Testing of the touch event triggering gave good results as long as the hand stayed between the two cameras. The picture of the hand moving in the mental model is a little jumpy, because the frame rate is slower than the motion of the human hand. If the user moves sufficiently slowly and stays in view of the cameras, the system is able to track the hand and the object quite well. An overview of the running system is shown below.

Figure 7. (A) A display of the original input to one of the cameras. (B) The object segmentation. (C) The hand segmentation. (D) The mental model of the human, the human hand, Ripley, and the objects. (E) Ripley’s view of the hand.


6. Conclusions

Overall, my project was a success in two ways: in terms of what I built, and in terms of what I learned.

What I built seems useful because of the reliability and functionality of the hand detection, depth detection, and touch event detection. Ripley was able to demonstrate hand movement and depth perception at demos during the Media Lab DL and TTT Open Houses in May 2006.

The second aspect of the project’s success lies in the useful experience acquired. Apart from learning to persist in a real-world, long-term engineering endeavor, I acquired useful skills in dealing with and hacking long pieces of not-optimally-written real-world code, in working with multi-module distributed systems across a network of eight machines, and in familiarizing myself with the basic components of a conversational robot. Furthermore, I got practical experience in computer vision and pattern recognition techniques, and in report writing and the presentation of my work.

7. Future Work

Many further improvements can be envisioned for Ripley’s vision system. One improvement is the ability for Ripley to adjust its position so that both cameras can view an object when the object is in view of only one camera and not the other. This feature would give the user the impression that Ripley is curious and knows when something is not in its stereo view. Another improvement is in the detection of objects in general: currently, objects in the situation model are occasionally unstable and will sometimes disappear instantly. This could be partly overcome by better tuning of the system. Furthermore, active vision techniques could be used to counteract the current wasteful push-forward-everything architecture, and generic object recognizers could also be added. An area of improvement with significant practical effect would be easy online training/tuning of the vision system, as well as auto-adaptation to new conditions. Another problem arises when an object is on the table and the hand blocks the object from the cameras: the system currently believes the object has disappeared, when in actuality it is merely occluded by the hand. An improvement would be to allow objects to stay in Ripley’s mental model when a hand is above where the object was placed.

In terms of Ripley and conversational robots in general, many things are under preparation. Last but not least, I hope that the hand detector and my contribution have made Ripley and his descendants more capable, more fun to interact with, and ultimately more useful to humans.


Acknowledgements

I would like to thank Nikolaos Mavridis for the opportunity to work with him on Ripley and Deb Roy for supervising this project.

I would like to thank my parents and my friends for supporting me through my time at MIT.


References

1. N. Mavridis, “Grounded Situation Models for Embodied Conversational Assistants”, Thesis Proposal, December 2005.

2. F. DiSimoni, “The Token Test for Children”, DLM Teaching Resources, USA, 1978.

3. N. Mavridis and D. Roy, “Grounded Situation Models for Robots: Where Words and Percepts Meet”, draft in preparation for IROS 2006.

4. D. Roy, K. Hsiao and N. Mavridis, “Mental Imagery for a Conversational Robot”, IEEE Transactions on Systems, Man, and Cybernetics, Part B, June 2004.

5. C. Breazeal et al., “Humanoid Robots as Cooperative Partners for People”, International Journal of Humanoid Robotics (IJHR), 2004.

6. E. Stergiopoulou, N. Papamarkos and A. Atsalakis, “Hand Gesture Recognition Via a New Self-Organized Neural Network”.