Human Hand Event Detection for Ripley the Robot
Advanced Undergraduate Project Report
Marjorie Cheng, MIT, May 2006
Table of Contents
Abstract
1. Introduction
2. Ripley and his Vision Pipeline
2.1 The “segmenter” module (color model-based segmentation)
2.2 The “trainmodel” module (training for color models)
2.3 The “stereobjecter” module (binocular match and depth calculation)
2.4 The “stm” module (short term memory and event detection)
3. Previous Work
3.1 Conversational Robotics
3.2 Hand Detection
4. Further Theory and Comments on the methods used
4.1 Hand Detection
4.2 Depth Detection
4.3 Hand Endpoint Detection
4.4 Inclination Detection
4.5 Creation of GSM events for “Touch/Pickup/Putdown”
5. Results
5.1 Hand Detection
5.2 Hand/Object depth estimation
5.3 Hand Inclination estimation
5.4 Mental Model Touch and Pickup and Putdown
6. Conclusions
7. Future Work
Acknowledgements
References
Abstract
For a robot to cooperate with humans, it is not enough for it to be able to manipulate objects by itself – it should also be able to recognize what humans are doing with objects, respond to their actions, and ultimately interact with humans and learn from them. Ripley is a conversational robot that can see objects, detect human faces, answer questions about the present or the past, and manipulate and imagine objects. Ripley communicates with the human user through speech and pointing gestures.
To achieve our vision of cooperative robots, the next step for Ripley is the identification of human hands and of the events surrounding them. The objective of this project was the development of human hand detection and human hand-object interaction event recognition capabilities for Ripley. This was accomplished through a mixture of tools: training of stochastic color models, stereo vision, and object permanence techniques. The implemented system consists of several new modules that augment Ripley’s existing modular architecture, plus modifications to existing Ripley modules.
1. Introduction
Technology has been advancing to a point where people can envision truly cooperative conversational robots. Ripley aspires to be such a robot. Ripley is a seven-degree-of-freedom robotic arm supplemented with computer vision, speech recognition, and speech synthesis. Its software architecture is based on the concept of “Grounded Situation Models”, created by N. Mavridis and D. Roy [1]. A grounded situation model, as proposed in [1], is a representation of the situation the robot is experiencing, remembering, or imagining. One important future milestone for the Ripley project is for Ripley to be able to pass the “Token Test”, a test given to children to evaluate whether a child has difficulty acquiring speech skills [2].
Currently, Ripley is able to perceive or imagine objects, measure their properties, answer questions about the present or the past, and interact with the objects when told to by a user. My research takes up the challenge of enabling Ripley to interact better with people by allowing it to recognize and interact with human hands, and to recognize hand-related events such as: “your hand started moving”, “your hand touched the green object”, and “your hand pointed towards the small blue ball”.
This is quite an important capability: for a robot to cooperate with humans, it is
not enough for the robot to be able to manipulate objects itself – the robot should also be
able to recognize what humans are doing to objects. For example, Ripley can currently
point towards objects, touch them, pick them up, put them down, and hand them to a
human. However, before this project, the robot could not recognize when humans
interacted with objects in the above way – and thus a fundamental asymmetry existed.
The new capabilities could support many extensions to practical functionality: with appropriate behavioral / linguistic extensions, the robot can now be envisioned as capable of understanding indexicals accompanied by human pointing (“Give me this one!”), handling conditional execution commands (“when I touch the blue one, you give me the red one”), marking the timeline on the basis of human hand events (“remember when my hand started moving”), learning procedures by imitating humans (the human first touched the green one and then picked up the blue one – that’s what I will do too), etc.
2. Ripley and his Vision Pipeline
Ripley’s manipulator arm has seven degrees of freedom. Ripley’s environment consists of a table, a human user, and assorted objects placed on the table. His software architecture consists of several modules; an overview of the architecture prior to this project can be seen in Figure 1, which is taken from [3].
Figure 1. The modules in Ripley’s system before the current work was done.
Previously, Ripley utilized monocular vision; this has now been extended to object-level stereo functionality. This was accomplished through augmentations to its vision system, which consists of modules in a five-stage pipeline as
shown in Figure 2. At the front end of the vision system, there is a capture module for
each camera (“capture”), supplying 720x480 images at a frame rate of more than 15 fps.
At the second stage of the pipeline, there are three species-specific detector/segmenter
modules: one for objects on the table, one for human faces, and one for human hands
(“segmentation” and “facedetect”). At the third stage, apparent 2D region permanence
modules reside (“objecter”), followed by stereo matching modules at the fourth stage
(“stereobjecter”). Finally, stereo and mono results are pushed forward to one of the four
modality-specific situation model update modules – namely the “visor” responsible for
suggesting changes to the current situation model on the basis of visual information [3].
In the remainder of this section, I briefly touch upon the modules that were created or modified during the work, namely the “stereobjecter” (created on the basis of the existing “objecters”) and the “segmenter”, “visor” and “stm” (slightly modified), as well as “trainmodel”, the training module for the segmenter.*
* The “visor” modifications as well as the language-related “stm” modifications were carried out by N. Mavridis.
[Figure 2 diagram: two parallel per-camera pipelines (capture 1 and capture 2), each feeding segmenters for hands, objects and faces plus a facedetect module; the segmenters’ region and ellipse outputs pass through per-class objecters (objects, hands); the two pipelines then merge in the stereobjecters (objects, hands, faces), whose outputs feed the visor.]
Figure 2. Diagram of the modules in Ripley’s current vision system.
2.1 The “segmenter” module (color model-based segmentation)
To identify objects, the current segmenter uses stochastic color models, trained on various objects lying on a tablecloth background, to classify Ripley’s visual input. Various processing stages follow, including sub-sampling, contour detection, ellipse fitting etc. In more detail: the images from the cameras are decomposed into blocks of five by five pixels, and each block is assigned to either object/hand or background according to which class the majority of its pixels belongs to. Then, contours are detected, and the image is effectively broken into connected regions, after filtering out some spurious regions. For each of these regions, an ellipse is created to best fit the region. Two output streams emanate from the segmenter: the ellipses and the regions (pixel-by-pixel). The ellipses are translated to spherical blobs in the internal situation model (in “visor”). Each ellipse contains a center, a major axis, a minor axis, the degree of rotation of the ellipse, and other information. The regions are used for 3D voxel-based shape model acquisition, which optionally also takes place in “visor”. Thus, in the internal situation model, each object is represented as a spherical blob, with an optional detailed voxel-based 3D model attached to it. It is important to note that the second output stream of the segmenter, the regions, is also used for human hand endpoint detection, as shown in section 4.3.
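To make the block-wise classification step concrete, here is a minimal Python/NumPy sketch written for this report (the function name, the 32-bin color quantization, and the use of a boolean lookup table are illustrative assumptions, not the actual segmenter code):

    import numpy as np

    def classify_blocks(image, decision_table, block=5):
        """Label each block x block patch as foreground (object/hand) or background.

        decision_table is assumed to be a 3-D boolean array indexed by quantized
        R, G, B values (True = foreground), loosely mirroring the table that
        "trainmodel" produces."""
        h, w, _ = image.shape
        # Per-pixel decision via table lookup on quantized colors (32 bins/channel).
        q = (image // 8).astype(int)
        fg = decision_table[q[..., 0], q[..., 1], q[..., 2]]
        # Majority vote within each block.
        mask = np.zeros((h // block, w // block), dtype=bool)
        for by in range(h // block):
            for bx in range(w // block):
                patch = fg[by*block:(by+1)*block, bx*block:(bx+1)*block]
                mask[by, bx] = patch.sum() > (block * block) // 2
        return mask

Connected regions and best-fit ellipses could then be obtained from such a mask with standard tools (e.g. OpenCV’s findContours and fitEllipse), although the segmenter’s own implementation differs.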
2.2 The “trainmodel” module (training for color models)
The stochastic model of the segmenter is trained (using “trainmodel”) in the
following manner: First, samples of objects or hands are consecutively placed at the
center of the field of view. Then, the middle of the image is sampled, and a color
histogram is acquired. Then, a similar procedure is used to acquire color histograms of
the background (tablecloth, some surrounding environment, objects when training for
hands and vice-versa). Finally, the two histograms are weighted by priors, and a decision table is built with which Ripley can determine, on the basis of pixel color, whether a pixel belongs to the object/hand or to the background, by comparing the prior-weighted probabilities of the pixel under the two histograms.
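A rough sketch of this training procedure follows; it is only illustrative (the bin count, the default prior, and all names are assumptions rather than the actual “trainmodel” implementation):

    import numpy as np

    def build_decision_table(fg_pixels, bg_pixels, prior_fg=0.5, bins=32):
        """Return a boolean table: True where a color is more likely foreground.

        fg_pixels / bg_pixels: (N, 3) integer arrays of RGB samples gathered
        from the training sessions (hand/object vs. background)."""
        def hist(pixels):
            q = pixels // (256 // bins)                 # quantize each channel
            h = np.zeros((bins, bins, bins))
            np.add.at(h, (q[:, 0], q[:, 1], q[:, 2]), 1)
            return h / max(h.sum(), 1)                  # normalize to a pmf
        h_fg, h_bg = hist(fg_pixels), hist(bg_pixels)
        # Prior-weighted comparison: P(fg) P(color|fg) vs. P(bg) P(color|bg).
        return prior_fg * h_fg > (1.0 - prior_fg) * h_bg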
2.3 The “stereobjecter” module (binocular match and depth calculation)
After having considered “segmenter” and “trainmodel” briefly, let us move on to “stereobjecter”. Previous to this project, Ripley only used one of the two cameras attached to the sides of its head. In order to give Ripley depth perception, which is invaluable for hand event detection as well as 3D modeling, we decided to use both of the cameras. The ellipses corresponding to objects/hands from the left and right eyes enter the stereobjecter. Then, objects are matched across eyes, on the basis of a composite positional/color/size-based distance metric. Finally, for the resulting matched pairs, a virtual “middle eye” (in between the left and right) apparent position is calculated, as well as an apparent depth, given the inter-eye distance. The new “middle-eye” objects, augmented by depth estimates, are output to the “visor”. In the “visor”, these are resolved from head-centered to absolute coordinates, and are placed in the situation model after dealing with 3D object permanence, as described in [4].
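The following sketch illustrates the general idea of cross-eye matching and “middle eye” fusion; the greedy matching, the dictionary-based items, and the simplified depth-from-disparity relation are assumptions made for this example, not the stereobjecter’s actual code:

    def match_and_fuse(left_items, right_items, distance, inter_eye_dist, max_dist=1.0):
        """Greedy cross-eye matching followed by "middle eye" fusion.

        Each item is assumed to be a dict with at least 'x', 'y' (normalized
        image coordinates); 'distance' is any composite position/color/size metric."""
        fused, used_right = [], set()
        for l in left_items:
            # Pick the closest unused right-eye item under the composite metric.
            candidates = [(distance(l, r), j) for j, r in enumerate(right_items)
                          if j not in used_right]
            if not candidates:
                continue
            d, j = min(candidates)
            if d > max_dist:
                continue                      # no acceptable match for this item
            used_right.add(j)
            r = right_items[j]
            # Virtual middle-eye position: average of the two apparent positions.
            mid = {'x': (l['x'] + r['x']) / 2, 'y': (l['y'] + r['y']) / 2}
            # Apparent depth shrinks as horizontal disparity grows; the constant
            # factor set by the camera geometry is omitted in this sketch.
            disparity = l['x'] - r['x']
            mid['z'] = inter_eye_dist / disparity if disparity != 0 else float('inf')
            fused.append(mid)
        return fused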
2.4 The “stm” module (short term memory and event detection)
Ripley’s internal model of the world is designed after the “Grounded Situation
Models” (GSM) proposal [1]. In the GSM, each temporal slice of a situation is called a
“moment”. Multiple moments build up detailed “histories”. In order to quantize the quasi-continuum of “histories”, the histories are parsed into sequences of standardized “events”, implemented by event-specific detector routines. Such events include changes of kinetic state of objects, appearances/disappearances etc. (“when the green one appeared”, “when my head started moving” etc.). New event types and recognizers were created for “hand started / stopped / is touching object” as well as “pick up”, “put down” and “move up/down”. In terms of implementation, the keeping of the history as well as the event detection takes place in the “stm” module (termed “rememberer” in
“remembering”, answering questions about the past etc.
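As an illustration of how a history of moments might be parsed into such events, consider the following sketch; the data layout and the threshold are hypothetical, and the real “stm” module is organized differently around the GSM data structures:

    def detect_motion_events(history, speed_threshold=0.01):
        """Turn a quasi-continuous history into discrete motion events.

        'history' is assumed to be a time-ordered list of (time, position) pairs
        for one tracked entity (e.g. the human hand); the event names follow the
        report's examples, everything else is illustrative."""
        events, moving = [], False
        for (t0, p0), (t1, p1) in zip(history, history[1:]):
            speed = (sum((a - b) ** 2 for a, b in zip(p0, p1)) ** 0.5) / (t1 - t0)
            if speed > speed_threshold and not moving:
                events.append((t1, "hand started moving"))
                moving = True
            elif speed <= speed_threshold and moving:
                events.append((t1, "hand stopped moving"))
                moving = False
        return events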
3. Previous Work
Related previous work can be found in the area of hand detection, but also more generally
in conversational robotics. A very short introduction follows.
3.1 Conversational Robotics
A number of systems and designs have appeared in the literature. Some such
examples can be found in [5]. Ripley the robot has three novel abilities as compared to
these systems: he can verbalize confidence and uncertainty, refer to past events, and most
importantly imagine situations through linguistic descriptions.
3.2 Hand Detection
Hand detection and tracking are well studied in the literature. For hand detection, color segmentation and shape matching are common methods. Hand color models can also be trained using different methods such as Bayesian networks and AdaBoost. Hands are also tracked by their shape, which is usually used for recognizing specific pointing gestures. For example, in [6], gesture recognition is based upon color
segmentation, a neural gas network, finger identification and so forth.
For my project, I used methods of color segmentation for hand detection and
adapted them to Ripley’s vision system. One difference between the previous works and
my work is that my goal was to integrate hand detection into Ripley’s modular vision
system rather than just studying hand detection itself. Also, I focused on the detection of
specific hand-related events in a uniform framework, which enables easy integration with
language and other behaviors.
4. Further Theory and Comments on the methods used
4.1 Hand Detection
To detect human hands, we used stochastic color models, as discussed in section 2.
In practice, for the training of hand models, we used samples of the hands of two individuals under artificially varied lighting conditions. The training of the background consisted mainly of the tablecloth surface but also included the edges and sides of the table, as well as “opposites”, i.e. samples of objects when training for hands and vice-versa. The output of “trainmodel” is a three-dimensional decision table which, when given a pixel color value, decides whether a pixel having this value most likely belongs to {object/hand} or {background}.
Because the training sessions were short, not all possible pixel values were
encountered in the training samples. As a result, some form of “smoothing” of the
histograms (which had “holes”) was required. Thus, a special diffusion code was used to
smooth the histograms. The amount of smoothing required was then tuned on the basis of
signal-detection theoretic measures (False/True Positives/Negatives), after a suitable
hand-segmented testing set was created. In other words, the histograms were progressively smoothed until we obtained satisfactory false positive/false negative rates.
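A simple way to picture this smoothing-and-tuning loop is the following sketch; the neighbor-averaging scheme and the evaluation helper are illustrative stand-ins for the actual diffusion code and test harness:

    import numpy as np

    def smooth_histogram(hist, iterations):
        """Fill "holes" in a 3-D color histogram by repeated local averaging."""
        h = hist.copy()
        for _ in range(iterations):
            padded = np.pad(h, 1, mode='edge')
            neighbors = (padded[:-2, 1:-1, 1:-1] + padded[2:, 1:-1, 1:-1] +
                         padded[1:-1, :-2, 1:-1] + padded[1:-1, 2:, 1:-1] +
                         padded[1:-1, 1:-1, :-2] + padded[1:-1, 1:-1, 2:])
            h = 0.5 * h + 0.5 * neighbors / 6.0      # blend with neighbor average
        return h

    def false_rates(decision_table, labeled_pixels):
        """Evaluate a decision table on a hand-segmented test set.

        labeled_pixels: list of (quantized_rgb_tuple, is_hand) pairs."""
        fp = sum(1 for rgb, y in labeled_pixels if decision_table[rgb] and not y)
        fn = sum(1 for rgb, y in labeled_pixels if not decision_table[rgb] and y)
        n_neg = sum(1 for _, y in labeled_pixels if not y)
        n_pos = sum(1 for _, y in labeled_pixels if y)
        return fp / max(n_neg, 1), fn / max(n_pos, 1)

The amount of smoothing (here, the number of iterations) would then be increased until the measured false positive and false negative rates on the hand-segmented test set are acceptable.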
4.2 Depth Detection
Depth detection was the second feature to be added. To calculate depth, there were two options: using the input from 1) two cameras, or 2) one moving camera. The first option, using both cameras, is the simplest and most stable method for Ripley. Ripley’s motor controls are not very accurate; thus, the use of one moving camera would lead to erroneous results from misjudged displacement of the camera, and such a method would also require static objects, which is not the case for hand interactions.
In object-level stereo, we initially do not know which items in one camera
correspond to which in the other. As mentioned above and discussed below, a matching
algorithm based on a composite position-color-size distance measure was used in the
stereobjecter. There are three possible scenarios in which items may appear to Ripley’s
vision system.
1) The item is seen by the left camera but not the right camera.
2) The item is seen by the right camera but not the left camera.
3) The item is seen by both cameras.
We move on to the calculation of the depth of an object, assuming that we have
matched the two items corresponding to it across the cameras. We will be calculating the
depth of an object for case three as shown in Figure 3. The two cameras’ field of views
overlap and both cameras can see the object. Since the cameras are identical, their field
of view angle (2Θ) is the same. Thus, from the diagram, we get the following equations:
(1) z = l1 cos(α)
(2) z = l2 cos(β)
(3) x1 = l1 sin(α)
(4) x2 = l2 sin(β)
(5) d = x1 − x2 = l1 sin(α) − l2 sin(β)
We scale x1 and x2 to normalized coordinates x1′ and x2′, which lie between [-1, 1]. Combining the x1′ and x2′ variables with the above equations, the equation for the depth is:
(6) z = (−d / (2 sin(Θ/2))) · (1 / (x2′ − x1′))
Figure 3: Diagram of the variables used for finding the depth of an object from the two cameras.
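As a purely numerical illustration of equation (6), the snippet below computes a depth from made-up values (a 10 cm baseline and a 60-degree field of view, i.e. Θ = 30 degrees, are assumptions chosen only for the example):

    import math

    def stereo_depth(d, theta, x1p, x2p):
        """Equation (6): depth from the normalized horizontal positions x1', x2'
        in the two cameras, the baseline d, and the half-field-of-view theta.
        A direct transcription of the formula; units and signs follow the text."""
        return (-d / (2.0 * math.sin(theta / 2.0))) / (x2p - x1p)

    # Made-up example: 10 cm baseline, 60-degree field of view, disparity of 0.35
    # in normalized coordinates; prints roughly 0.55 (meters).
    print(stereo_depth(d=0.10, theta=math.radians(30.0), x1p=0.20, x2p=-0.15))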
4.3 Hand Endpoint Detection
Hand endpoints were detected in the following manner: the major axis of the
ellipse of the hand was used as the axis of a new coordinate system. The hand region’s
pixels were projected onto this axis. The direction of this axis was decided as follows. The two endpoints of the major axis of the ellipse are classified as either the “positive” or the “negative” endpoint. The “negative” endpoint is the one which is closer to the boundary of the camera window; the “positive” endpoint is the other one. The axis chosen for the projection of the hand region pixels points in the direction of the “positive” endpoint. After the projection of the region’s pixels onto the new coordinate system, the pixel within the region with the maximal projection coordinate was chosen as the endpoint, as seen in Figure 4. See the results section for examples.
Figure 4. The ellipse of the hand is drawn over the original hand region. The dark blue arrow shows the projection of the coordinate furthest away in the object’s coordinate space; it points to the cyan dot, which is the endpoint found. The light blue arrow points to the normal x coordinate of the endpoint.
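The endpoint selection just described can be sketched as follows; the ellipse representation and the helper names are hypothetical, and only the logic of picking the “positive” direction and the maximal projection follows the text:

    import math

    def hand_endpoint(region_pixels, ellipse, image_size):
        """Pick the hand-tip pixel: project region pixels onto the ellipse's major
        axis, oriented away from the image border (the "positive" direction).

        region_pixels: list of (x, y); ellipse: dict with 'cx', 'cy', 'angle'
        (major-axis angle in radians) and 'a' (major semi-axis length)."""
        w, h = image_size
        ux, uy = math.cos(ellipse['angle']), math.sin(ellipse['angle'])
        a = ellipse['a']
        # The two endpoints of the major axis.
        e1 = (ellipse['cx'] + a * ux, ellipse['cy'] + a * uy)
        e2 = (ellipse['cx'] - a * ux, ellipse['cy'] - a * uy)
        def border_dist(p):
            return min(p[0], p[1], w - p[0], h - p[1])
        # The "negative" endpoint is the one closer to the border (toward the arm);
        # flip the axis so it points toward the "positive" endpoint.
        if border_dist(e1) < border_dist(e2):
            ux, uy = -ux, -uy
        # Region pixel with maximal projection along the positive direction.
        return max(region_pixels,
                   key=lambda p: (p[0] - ellipse['cx']) * ux + (p[1] - ellipse['cy']) * uy)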
4.4 Inclination Detection
Combining the capability to detect hands and perceive the depth of hands, one
might also want to treat the hand as not just being parallel to the table, but also possibly
having a z-axis inclination. This might for example be useful in cases where a human is
pointing towards something with an inclined hand. Hand inclination can be calculated by
taking the depth of two different points on the hand and finding the angle of inclination
from there. Experiments towards this were made, as discussed in the results section of
this report.
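The geometric calculation suggested here is straightforward; a minimal sketch, assuming two hand landmarks with known depth in a common metric frame, could be:

    import math

    def hand_inclination(p_near, p_far):
        """Angle of the hand relative to the table plane, from two landmarks.

        Each point is (x, y, z) with z the depth; e.g. the hand endpoint and a
        second landmark such as the wrist. Returns the angle in degrees."""
        dx = p_far[0] - p_near[0]
        dy = p_far[1] - p_near[1]
        dz = p_far[2] - p_near[2]
        horizontal = math.hypot(dx, dy)
        return math.degrees(math.atan2(dz, horizontal))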
4.5 Creation of GSM events for “Touch/Pickup/Putdown”
To integrate the vision system outputs with Ripley’s behavioral and linguistic subsystems, event models of touch and pickup were added to the mental model. The situation model contains a model of Ripley, a table, objects, and a human head and body. Once hand detection and depth were working, N. Mavridis added a simple human hand to the human model. This model represents where Ripley thinks the human hand is, given what he is seeing, and also shows the human user and audience a visualization of what occurs in Ripley’s situation model, which resides in his mind.
For touch event triggering, we take the x, y, and depth coordinates of the objects and calculate the distances between objects to see which are close enough to be touching. Our model only accounts for human hands touching objects, but can easily be extended to objects touching other objects, for the detection of generic {{thing1} touches {thing2}} events. Picking up and putting down of objects is based upon other events being triggered: if the hand and an object are touching and both are moving up (or both moving down), the pickup (respectively putdown) event is triggered. Pickup is thus a secondary event that is based on the primary events of moving and touching. This architecture makes it easy to define events that are built on top of other triggered events.
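A compact sketch of this triggering logic follows; the data layout, the vertical-velocity field 'vz', and the return convention are illustrative assumptions, and the actual event detectors in the “stm” module are structured differently:

    def touch_events(hand, obj, error_threshold, prev_touching):
        """Touch test: hand and object centers closer than the object's radius
        plus an error threshold, with pickup/putdown as secondary events.

        'hand' and 'obj' are dicts with 'pos' = (x, y, z) and 'vz' (vertical
        velocity); 'obj' additionally carries 'radius'."""
        d = sum((a - b) ** 2 for a, b in zip(hand['pos'], obj['pos'])) ** 0.5
        touching = d < obj['radius'] + error_threshold
        events = []
        if touching and not prev_touching:
            events.append("hand started touching object")
        elif not touching and prev_touching:
            events.append("hand stopped touching object")
        elif touching:
            events.append("hand is touching object")
        # Secondary events built on top of the primary touch/move events.
        if touching and hand['vz'] > 0 and obj['vz'] > 0:
            events.append("pickup")
        if touching and hand['vz'] < 0 and obj['vz'] < 0:
            events.append("putdown")
        return events, touching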
5. Results
5.1 Hand Detection
The implementation of hand detection within Ripley’s vision pipeline was
described in section 2. Numerous unforeseen difficulties were encountered: for example, common color models could not easily be shared across cameras because the two cameras had different color tunings. One camera had a redder tinge than the other; thus, the two cameras were trained separately. Another issue arose from the original tablecloth that was used as the background. The tablecloth was tan, and since tan and hand colors are quite similar, the system had difficulty separating the two. As a quick and simple solution, we changed the tablecloth to a deep navy blue one.
As discussed before, the segmenter module has two output streams: one for pixel-
by-pixel connected regions (region stream), and one for fitted ellipsoids (ellipse stream).
In Figure 5, we can see an example of both streams:
Figure 5. Output of the segmenter for a hand. The left is the pixel-by-pixel regions stream. The right is the corresponding ellipse in the ellipse stream. Later we will comment upon the cyan and orange points.
5.2 Hand/Object depth estimation
For depth perception, as mentioned before, there are a few issues that must be
dealt with. The first problem is reconciling the images we receive from the two cameras
– i.e. matching regions across cameras. For simplicity, the assumption is made that for
the object to be present, it must appear in both cameras. Objects are matched by their
color, size, and position, by using modifications of the “objecter” code described in [4].
The choice of a suitable distance metric for the matching of objects across
cameras was an important decision. My graduate student supervisor and I decided to put more emphasis on color than on size and position, and, within position, more on the vertical difference than on the horizontal. We expect a horizontal apparent displacement due to the displacement between the cameras. For the same reason, the horizontal displacement should only be positive, never negative in sign – thus, an asymmetric term was introduced in the composite distance metric, punishing negative horizontal displacements of the candidate object pairs much more heavily than positive ones. In other words, there is a heavy penalty for pairs of objects that are not in the correct relative position to each other, given the geometry of the camera placement. A sketch of such a metric is given below.
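The following sketch illustrates the shape of such an asymmetric composite metric; all weights and the penalty factor are made-up values for illustration, not the ones used in the stereobjecter:

    def cross_camera_distance(left, right,
                              w_color=3.0, w_size=1.0, w_vert=2.0, w_horiz=0.5,
                              negative_penalty=10.0):
        """Composite position/color/size distance for matching a left-eye item
        to a right-eye item. Structure follows the text: color weighted most,
        vertical offset weighted more than horizontal, and negative horizontal
        disparity punished heavily."""
        d_color = sum((a - b) ** 2 for a, b in zip(left['color'], right['color'])) ** 0.5
        d_size = abs(left['size'] - right['size'])
        d_vert = abs(left['y'] - right['y'])
        disparity = left['x'] - right['x']        # expected to be positive
        d_horiz = abs(disparity)
        if disparity < 0:                          # wrong relative geometry
            d_horiz *= negative_penalty
        return (w_color * d_color + w_size * d_size +
                w_vert * d_vert + w_horiz * d_horiz)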
One central issue in our system is concerned with the calculation of the relative
depth of the hand, as seen from the two cameras. Here, we approached the problem
through three successive attempts that worked as increasingly better approximations to
what we were trying to determine.
Approximation 1: Using the centers of the ellipses for hand depth estimation
To reconcile two images and figure out the depth, we decided to initially work with the
ellipse output stream of the segmenter, and not the region stream. At first, we took the
depth by calculating it from the centers of the ellipses. For objects such as squares, cans and balls, the depth can be calculated fairly easily, since the center of the ellipse is approximately the center of the object, because such objects usually fit completely within the cameras’ fields of view. With this scenario, we found the depth as described in
section 4.2.
Hands are different from objects because the whole hand is typically not visible in the camera. Furthermore, in most cases a smaller part of the hand will be visible in one camera than in the other.
Let us consider three cases:
1) The hand region is bigger in the right camera than in the left.
2) The hand region is bigger in the left camera than in the right.
3) The hand comes in from the top or the bottom; the regions are of the same size but differently placed.
Figure 6: Combinations of differing hand lengths across the two cameras. Case one is shown in (A) the right camera and (B) the left camera. Case two is shown in (C) the right camera and (D) the left camera. Case three is shown in (E) the right camera, and (F) the left camera.
Thus, if we were to use the centers of the ellipses, it would turn out, with high probability, that the two centers correspond to two different parts of the hand instead of the same part. In Figure 6, the centers are marked with an orange dot. In all cases except the third, the centers of the ellipses mark different parts of the hand. We cannot detect precisely where the hand becomes the wrist (i.e. where the wrist becomes narrower), because if a user is pointing at a slanted vertical angle, the wrist may actually appear larger than the hand.
Approximation 2: Using a selected endpoint of the ellipses for hand depth estimation
Thus, as a second attempt, instead of using the center, we decided to use an endpoint of the ellipses, though there are a few issues involved with this idea. One problem is deciding which endpoint of each ellipse to choose and how to match up the endpoints across the two cameras. We match the ellipses axis by axis: the major axis with the major axis and the minor axis with the minor axis. The assumption is made that a pointing hand is usually longer in the pointing direction than it is wide. Though this assumption is occasionally violated, it should hold in most cases. Next, to determine which endpoint not to use, we calculate which of the two endpoints is closer to the edge of the image. Since the hand is connected to the human body, the hand region will extend toward, and go off, the edge of the image. The requirement is that the user’s hand and arm show bare skin under the cameras’ views.
Thus, the endpoints of the major axis are taken, and the x and y distance of each endpoint from the image border is calculated to see which is closer to the edge. We take the opposite endpoint, which should be the tip of the hand, and find its depth. A problem arises when a hand passes completely through one camera’s field of view: that camera may then select the wrong endpoint as the one connected to the arm. However, if we assume that the hand passes completely through only one camera’s field of view, we can use the other camera’s information about the endpoint and translate it onto the first camera’s image to find the correct endpoint.
To test and evaluate hand recognition and depth perception, we reconfigured some video channels to output specific important points. For example, with depth perception, it was difficult to judge the positions of the endpoints and the centers of the ellipses from printed numbers alone, since the images of the hands are constantly moving. Thus, we overlay the center of the ellipse and the endpoint of the ellipse on the video, as seen in Figure 5 and Figure 6.
Approximation 3: Using a selected region endpoint for hand depth estimation
We realized that the end of the ellipse may not be the end of the actual hand, because the calculation of the ellipse’s major axis can be inaccurate, and because the hand may not be exactly straight. To find the real endpoint of the hand, we filtered through each region in the image and calculated the point with the farthest projection along the axis in the object’s own coordinate space, as shown previously in section 4.3.
5.3 Hand Inclination estimation
Calculating the inclination of the hand proved more difficult than expected
because it was difficult to pinpoint the exact location of a landmark other than the hand endpoint. Candidates include the wrist or the rounded palm of the hand, but both are unreliable in many cases, for example when the hand is slanted vertically. This problem is left as a future extension. However, if we assume that the hand is parallel to the ground, or that it just points towards objects on the table, what we have built is already sufficient to figure out the general vicinity the finger is pointing at, and to resolve indexicals.
5.4 Mental Model Touch and Pickup and Putdown
The mental model touch event detector routine was implemented first, followed later by pickup. The touch event consists of three parts: start touching, is touching, and stop touching. The event allows Ripley to easily remember when a touch happened and enables various questions concerning time to be asked. The touch event is triggered by checking whether the distance between the hand and the object is within a certain threshold. The threshold consists of the object’s radius plus an error margin. The pickup event triggering depends on the human hand and the object moving, as well as the human hand touching the object.
The result of testing the touch event triggering was good as long as the hand was within the shared field of view of the two cameras. The motion of the hand in the mental model is a little disconnected, because the frame rate is slower than the motion of the human hand. If the user moves sufficiently slowly and stays in view of the cameras, the system is able to track the hand and the object quite well. An example overview of the system is shown below.
Figure 7. (A) A display of the original input to one of the cameras. (B) The object segmentation. (C) The hand segmentation. (D) The mental model of the human, the human hand, Ripley, and the objects. (E) Ripley’s view of the hand.
6. Conclusions
Overall, my project was a success in two ways: in terms of what I built, and in
terms of what I learned.
What I built seems useful because of the reliability and the functionality of the
hand detection, depth detection, and touch event detection. Ripley was able to
demonstrate hand movement and depth perception at demos in the Media Lab DL and
TTT Open Houses in May 2006.
The second aspect of the success of the project lies in the useful experience
acquired. Apart from learning to persist in a real-world long-term engineering endeavor, I
acquired useful skills in dealing with and hacking long pieces of not-optimally-written
real-world code, in working with multi-module distributed systems across a network of 8
machines, and in familiarizing myself with the basic components of a conversational
robot. Furthermore, I got practical experience in computer vision as well as pattern
recognition techniques, and in report writing / presentation of my work.
7. Future Work
Many further improvements can be envisioned for the vision system within
Ripley. One improvement is the ability for Ripley to adjust its position so that both
cameras can view an object if the object is in the view of only one camera, and not the
other. This feature would give the user the impression that Ripley is curious and knows
when something is not in its stereo vision. Another improvement is in the detection of
objects in general. Currently, objects in the situation model are occasionally unstable and will sometimes disappear abruptly. This could also be overcome, to some extent, by better tuning of the system. Furthermore, active vision techniques could be used in order to counteract the current wasteful push-forward-everything architecture, and generic object recognizers could also be added. An area of improvement with significant practical effect would be easy online training/tuning of the vision system, as well as auto-adaptation to new conditions. Another problem occurs when an object is on the table and the hand blocks the object from the cameras. The system currently believes the object has disappeared when in actuality it is occluded by the hand. An improvement would be to allow objects to stay in Ripley’s mental model when a hand is above where the object was placed.
In terms of Ripley and conversational robots in general, many things are under
preparation. Last but not least, I hope that the hand detector and my contribution have made Ripley and his descendants more capable, more fun to interact with, and ultimately more useful for humans.
Acknowledgements
I would like to thank Nikolaos Mavridis for the opportunity to work with him on Ripley and Deb Roy for supervising this project.
I would like to thank my parents and my friends for supporting me through my time at MIT.
References
1. N. Mavridis, "Grounded Situation Models for Embodied Conversational Assistants", Thesis Proposal, December 2005.
2. F. DiSimoni, The Token Test for Children, DLM Teaching Resources, USA, 1978.
3. N. Mavridis and D. Roy, "Grounded Situation Models for Robots: Where words and percepts meet", draft in preparation for IROS 2006.
4. D. Roy, K. Hsiao and N. Mavridis, "Mental Imagery for a Conversational Robot", IEEE Transactions on Systems, Man, and Cybernetics, Part B, June 2004.
5. C. Breazeal et al., "Humanoid Robots as Cooperative Partners for People", International Journal of Humanoid Robotics (IJHR), 2004.
6. E. Stergiopoulou, N. Papamarkos and A. Atsalakis, "Hand Gesture Recognition Via a New Self-Organized Neural Network".