on indexicality, direction of arrival of sound sources, and human-robot interaction

Research ArticleOn Indexicality Direction of Arrival of Sound Sourcesand Human-Robot Interaction

Ivan Meza Caleb Rascon Gibran Fuentes and Luis A Pineda

Instituto de Investigaciones en Matematicas Aplicadas y en Systemas Universidad Nacional Autonoma de MexicoCiudad Universitaria Coyocan 04510 Mexico City MEX Mexico

Correspondence should be addressed to Ivan Meza ivanvladimirturingiimasunammx

Received 30 November 2015 Revised 7 March 2016 Accepted 10 April 2016

Academic Editor Gordon R Pennock

Copyright copy 2016 Ivan Meza et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

We present the use of direction of arrival (DOA) of sound sources as an index during the interaction between humans and servicerobots These indices follow the notion defined by the theory of interpretation of signs by Peirce This notion establishes a strongphysical relation between signs (DOAs) and objects being signified in specific contexts With this in mind we have modeled thecall at a distance to a robot as indexical in nature These indices can be later interpreted as the position of the user and the userherselfhimself The relation between the call and the emitter is formalized in our framework of development of service robotsbased on the SitLog programming language In particular we create a set of behaviours based on direction of arrival information tobe used in the programming of tasks for service robots Based on these behaviours we have implemented four tasks which heavilyrely on them following a person taking attendance of a class playing Marco-Polo and acting as a waiter in a restaurant

1 Introduction

Knowing the origin of a sound source is an important skillfor a robot Usually this skill is associated with survivalbehaviours in which this knowledge is used to react to naturalpredators or eminent danger However this skill also plays akey role during interaction for instance in calling a waiterover At first knowing the direction of arrival (DOA) of asound source may seem too basic to be an element in theinteraction however as we will show this information canbe meaningful so that the robot can react in a proper mannerand enhance the interaction

We consider the direction of arrival information as anindex in the sense proposed by Peircersquos sign theory [1 2]In this theory there are three types of signs given theirrelation with the represented object These types are iconindex and symbol Icons reflect qualitative features of theobject for instance pictures or drawings of the objectIndices have an existential or physical connection with theobjects in a specific context for instance a dark cloud isa sign for rain On the other hand symbols have a stableinterpretation based on a convention that connects them tothe object for instance a logo of a productThis also holds for

spoken communication onomatopoeic words for instanceare iconic since they resemblewhat they represent (eg chop)pronouns are indices since they ldquopointrdquo to the object theyrepresent andnouns are symbolic since they are conventionaland detached from the object they represent In the context ofthis classification DOA information has an indexical qualityit is an index of the position of the object which emits thesound and also of the object itself since it holds a strongconnection between the emitter of a sound (object) and thedirection of arrival (sign) In this work we exploit the use ofthis index to support the interaction between a robot and theusers

The present work of indexical DOAs is formalized andimplemented in the SitLog programming language that wehave developed and used to program our service robot [3 4]SitLog defines a set of behaviours which range from simple(one skill) to composed (more than one skill) behavioursThese behaviours are the basic blocks used to program ourrobot at a higher level Examples of simple behaviours arewalk which takes the robot from its current position to adestination and ask which prompts the user with a questionand waits for the spoken answer An example of a composedbehaviour is searching for an object in different locations

Hindawi Publishing CorporationJournal of RoboticsVolume 2016 Article ID 3081048 13 pageshttpdxdoiorg10115520163081048

2 Journal of Robotics

since it relies on other basic behaviours such as walk and seeobject In essence a behaviour has to be (1) portable so it canbe used in different tasks (2) compositional so by couplingit with different behaviours we can create more complexbehaviours or program a task and (3) able to handle potentialfailures (eg not arriving to its destination not listening to ananswer from the user or not finding the object)

In order to model the indexicality of the DOAs andincorporate it into the interaction capabilities of our robotwe propose a simple behaviour to handle DOAs and interpretthem as indices This supports the interaction by allowingcalls at a distance In this case the user or users canbring the attention of the robot to a specific area in whichthe sound originates This behaviour can also be used incombination with the walk behaviour to allow interruptionsduring walking such as calling a waiter over Additionallythe behaviour also supports the verification of the number ofsources during a conversation In particular this can be usedin combination with the ask behaviour to ensure that whenthe robot is listening to an answer there is only one persontalking

This paper is organized as follows Section 2 presentsprevious work in which sound information is part of a taskfor a service robot Section 3 reviews indexicality as interpre-tation procedure Section 4 presents the framework we useto interpret DOAs signs as indices Section 5 describes themanner in which the DOA behaviour resolves a DOA indexSection 6 shows the four tasks on which the indices fromsound sources are used to direct the interaction Section 7presents a discussion about our proposal and findings

2 Previous Work

The use of sound source localization (SSL) in robotics isa blooming field Okuno and Nakadai and Argentieri etal present reviews of the main methods and their usewith different types of robots [6ndash8] Since its application torobotics SSL has been promoted as a main skill for robotsBrooks et al proposed it as a basic skill for interaction inthe Cog project [9] The robot SIG was proposed as anexperimental setting for the RoboCup competition [10] andthe robot Spartacus participated in the 2005 AAAI MobileRobot Challenge implementing SSL as a part of its skill set[11]

Although the interaction goal is a driving force in thefield there is a great effort on developing a robust SSLmoduleWith this goal in mind the preferred setting is the use ofan array of microphones on board of the robot A many-microphone solution provides good performance given itsredundancy In [12] Valin et al propose an 8-microphone 3Darray to enhance speech recognition and as a preamble forDOA estimation More microphones are possible Hara et aluse two arrays of 8 microphones (16 in total) [13] Howeverminimal systems with 2 microphones are also possible suchas the one presented in [14] A common setting which wefollow is to use 3microphones [15 16]The best approach andconfiguration are still an open question Badali et al presentan evaluation of the main approaches over an 8-microphonearray for mobile robots [17]

Source localization can be reactive as a modality of theinteraction as presented in works that deal with followinga conversation [18ndash21] or it can be an essential part ofthe interaction such as using sound localization to controlthe navigation of the robot In particular Teachasrisaksakulet al propose a system that follows a person through aroom by her or his voice [22] On the other hand Li et aldemonstrated a robot which plays hide and seek using visualand audio modalities in the interaction [23] These worksfollow a similar approach onhow to incorporate sound sourcelocation as a part of the interaction as the one we proposeHowever since only one user is present in their work itstreatment as an index is trivial The DOA sign is directlyinterpreted as the user there are no conflicts regarding whoor what the sign could represent

Given the progress in the field there has been an effortdirected at capturing a more complicated interaction Forinstance Fransen et al present a multimodal robot whichfollows the instructions among two users to reach a target[24]Nakadai et al present a robotwhich is able to judge rock-paper-scissors sound games [25]TheQuizmaster robot playsa game in which it asks a question to up to four participantsand uses SSL and source separation to decide who answeredit first [26] Do et al present a robot which together witha caregiver log and detect the origin of certain sounds [27]At this level of complexity of the interaction DOA signs arebeing indirectly used as indices the DOAs become indiceswhich signify directly the users and this information is usedto disambiguate who calls or wins In [27] the DOAs anda category associated with them signify types of events Atthis level the multiple sources make a method to assign theDOA to the right entity necessary As we will see in thiswork the benefit of considering a DOA as an index in theright context allows us to use indexical resolution to performthis disambiguation in the cases that more than one DOAcompetes In this work we formalize this resolution similarlyto other reference resolution mechanisms on robots such asdeictic references [28 29]This differs from other approachesin which DOAs and multimodal information complementeach other for disambiguation [30]

3 Indexical Expressions

An interactive agent such as a robot has to understand theintention of its users or have expectations about the eventsthat can occur in the environment in order to produce anadequate response Considering the theory of signs by Piercewe can formulate the goal of ldquounderstandingrdquo as providinga sign or a set of signs to identify the signified object orobjects [1 2 31] In order to denote the object the robot buildsa representation of it For example if the user commandsrobot go to the kitchen the sequence of symbols robot goto the and kitchen get translated into the representation119892119900(119903119900119887119900119905 119896119894119905119888ℎ119890119899) with which the robot can establish thesignified objects (itself the kitchen and the action) andact accordingly A similar mechanism can be used for theinterpretation of iconic signs for instance when the robotsees a table and identifies on it a bottle of juice it will

Journal of Robotics 3

Table 1 Examples of indexical expressions and their resolution

Sign Representation Constraint ResolvedThis is juice + pointing gesture 120582119909119899119886119898119890(119909 119901119900119904119890 119895119906119894119888119890) Pose of object 119899119886119898119890(119900119887119895119890119888119905 119895119906119894119888119890)

Take it 120582119909119905119886119896119890(119909 119900119887119895119890119888119905) Object 119905119886119896119890(119895119906119894119888119890)

generate the representation 119900119899(119895119906119894119888119890 119905119886119887119897119890) which denotesthe corresponding juice object fully

When the set of signs of the interaction includes indexicalsigns the representation is not fully specified by the signssince the index has to be linked in an extra resolutionstep to the denoted object For instance if the user statesthe command robot come here the representation for thisutterance is underspecified as 120582ℎ119892119900(119903119900119887119900119905 ℎ) In order tofully specify this expression it has to be analysed in relationto the context of the task in which the position of the user canbe inferred For example if the user is located in the kitchenthe expression should be resolved to 119892119900(119903119900119887119900119905 119896119894119905119888ℎ119890119899)

In this scheme indexical signs produce an underspecifiedrepresentation at first which should be later resolved in thepresence of contextual information Furthermore indicesprovide additional constraints on the type of object which canresolve the expression for instance in the previous examplethe destination of the robot has to be resolved into a validposition In order to account for this phenomenon inspiredby the treatment of indices in [32] we propose the followingformulation for an index in lambda calculus in which we usethe symbol ldquordquo to signal a constraint 120572 over a certain variable119909

120582120572119876119909 (119876119909 120572) (1)

Indices defined in this form receive as parameters the con-straint 120572 and the underspecified predicate 119876 the effect ofapplying such arguments is the binding of the constrainedvariable 119909 120572 into the expression 119876 In this case thesymbol signifies a functional application (ie 120573 reductionin lambda calculus) If we apply the index in our previousexample 119876 = 120582ℎ119892119900(119903119900119887119900119905 ℎ) and 120572 = 119906119904119890119903119901119900119904 the followingrepresentation is produced 120582ℎ119892119900(119903119900119887119900119905 ℎ 119906119904119890119903119901119900119904) It isstill underspecified but it has constrained the variable ℎ tobe the position of the user A second process of resolutionhas to identify from the context the fact that the user is in thekitchen

Table 1 shows two examples of indexical signs using thisformulation with a robot For these examples consider thefollowing context the user and the robot are in front of atable with an orange juice placed on it The first illustratedexample is a multimodal expression the user says somethingand points to an object The word this in conjunction withthe pointing gesture define an index which imposes a spatialconstraint on the object being signified by both In thiscase the constraint is the pose of the object (position andorientation extracted from the visual system)This constraintis used to identify the object in the scene which in theexample resolves into the bottle of juice The second exampleshows an indexical reference with the word it which also addsa constraint [33] In this case the underspecified variablehas to be resolved to an object in the real world the

bottle of juice again Notice that we have invoked a spatialand deictic resolution however we have not specified aparticularmechanism to perform this type of resolutionsThetopic of effective mechanisms for the resolution of indexicalexpressions continues to be studied in different fields [34ndash36] and which mechanism is applied depends on the typeof expression and the type of task being solved In this workwe define themechanism for the resolution of indexical DOAexpressions and the conditions in which it can be applied

4 Dialogue Models and Robot

In our robot the coordination among modules during theexecution of a task is done by the execution of dialoguemodels These dialogue models are written using the SitLogprogramming language for service robots Dialogue modelsare composed of a set of situations which have associated atuple of expectations actions and next situations In a typicalinteraction cycle the robot is in a given situation in whichit has certain expectations if the arriving interpretationmatches some expectation the robot proceeds to performthe associated actions and moves to another situation Inthis framework performing a task consists in traversing adialogue model while the modules provide the interpre-tations and perform the actions defined by the dialoguemodel Additionally in SitLog it is possible to define themain elements as functions that dynamically construct thestructure of the dialogue model It is also possible to callanother dialogue model from a particular situation Figure 1depicts these cases

(a) It shows a typical SitLog arc in which a situation 119878119894 hasone expectation 120572 if this is satisfied the set of actions120573 are performed and the arc arrives to the situation 119878119895

(b) It exemplifies the case in which the elements aredefined by functions and the expectation will bedefined by the evaluation of function 119891 actions by119892 and the next situation by ℎ In particular theseproperties make it possible to program dynamicdialoguemodels which change as the task is executed

(c) It shows the case for recursive situations in which asituation calls an entire dialoguemodel for executionand when finished it uses the name of the lastsituation as its own expectation Algorithm 1 presentsthe SitLog code for these cases

SitLog at its core is framework-agnostic one could imple-ment different frameworks such as subsumption [37] reactive[38] or simple state machine architectures At our currentlaboratory we have developed a full library of behavioursand tasks based on our interaction cycle which is a highlevel cognitive architecture (IOCA [4 39]) Associated with


[ id ==gt s i

arcs ==gt [ alpha betha =gt s j]

]

[ id ==gt s i

arcs ==gt [ f(x) g(y) =gt h(z) ]

]

[ id ==gt s i

embeded dm ==gt r(m)

arcs ==gt [ alpha betha =gt s j ]

]

Algorithm 1 Example of code for SitLog (a) typical SitLog arc (b)functional arc and (c) recursive situation

Si Sj120572 120573

(a)

Si

hf g

(b)

Si

SjRM

middot middot middot

120572 120573

120572M

(c)

Figure 1 Examples of situations expectations and actions (a)Typical SitLog arc (b) functional arc and (c) recursive situation

the architecture we have implemented several skills such asvision language [40] sound [5] navigation manipulationand movement Using this framework we have programmedthe Golem-II+ robot (depicted in Figure 2 [41]) Table 2summarizes the main capabilities modules and hardwarethat compose the robot Within this framework and thecurrent version of the hardware we have implemented severaltasks such as following a person introducing itself searchingfor objects acting as a waiter in a dinner party or a restaurant[42] guarding amuseum and playing amatchmemory game(for examples of the tasks visit httpgolemiimasunammx)

41 Multiple-DOA Estimation System The DOA skill isperformed by the multiple-DOA estimation system This isbased on previous work that focuses on a small lightweighthardware setup that is able to estimate more DOAs than theamount of microphones employed [5 43]

From the hardware standpoint (see Table 2 soundperception section) the system uses a 2-dimensional 3-microphone array (see Figure 4) going through a 4-channelUSB interface From the software standpoint the architecture

Figure 2 The Golem-II+ robot

of this can be seen in Figure 3The system is divided into threeparts

Audio Acquisition This is the underlying module that feedsaudio data to the next part of the system It is based on theJack Audio Connection Toolkit [44] which provides highresolution data (48 kHz in our case) at real-time speedswith a very high tolerance in the amount of microphonesit manages and a relative small resource requirement Thiscaptured audio passes through a VAD which activates thelocalization

Initial Single-DOA Estimation For each time window eachmicrophone pair is used to estimate a signal delay using across-correlation vector (CCV) method (see Figure 3 single-DOA estimation block) Each delay is used to calculate twopreliminary DOAs (to consider the back-to-front mirroringissue of 1-dimensional arrays pose) Using a basic searchmethod the most coherent set of 3 DOAs (1 from eachmicrophone pair) is found using a proposed coherencymetric (see Figure 5) If the coherency of that set is above acertain threshold the DOA of the most perpendicular pairtowards the source is proposed as its DOA this pair is chosento avoid nonlinearities in the relation between the delay andthe resulting DOA Figure 5 illustrates this stage

Multiple-DOA Tracking Because the previous part carriesDOA estimation near real-time speeds it is able to estimatethe DOA from one source in small time windows even ininstances where there are two or more simultaneous soundsources (see Figure 3 multi-DOA tracking block) Howeverthe occurrence of 1-sound source time windows in suchinstances is stochastic To this effect when aDOA is proposedfrom the previous part of the system a clustering method is


Table 2 Capabilities modules and hardware used on the Golem-II+ robot (IH in-house developed)

Module Hardware Software librariesDialogue management

Dialogue manager mdash SitLogKnowledge-base mdash Prolog

VisionObject recognition Flea 3 camera MOPEDPerson recognition WebCam OpenCVPerson tracking Kinect OpenNIGestures recognition Kinect OpenNI

LanguageSpeech recognition Directional Mic PocketSphinxSpeech synthesiser Speakers Festival TTSUnderstanding mdash GF grammar IH parser

Sound perceptionDOA system 3 omnidirectional mics mdashVolume monitor Directional mic Jack IH libraryVolume monitor M-Audio Fast Track interface Jack IH library

Navigation manipulation and neckNavigation Robotic base laser PlayerManipulation IH robotic arm Dynamixel RoboPlusNeck movement IH neck Dynamixel RoboPlus

Audio Single-DOAestimation

Multi-DOAtracking

VAD

ITDcalc

Redundancycheck

Single-DOAcalculation

Clusteringmethod

ClusterDOA

averaging

Multiple-DOAestimation

PairwiseDOA

estimation

Filter(1ndash4kHz)

IRL

ILF

IFR

DfrontRL

DbackRL

DfrontLF

DbackLF

DfrontFR

DbackFR

Figure 3 Architecture of the multiple-DOA estimation system

employed to group several nearby DOAs into one or moreestimated sound sources directions

Figure 6 shows the tracking of three speakers during 30seconds each of the speakers is separated by 120 degreesAs it can be seen this tracking is quite effective as it couldlocalize each of the found sources In this example the systemhad a 69 precision and 60 recall performance at the framelevel Table 3 presents an evaluation of the whole systemAlthough it could be thought that this performance wouldnot be enough for HRI interactions since it is at the frame

label (ie 100ms) in an interaction setting when turns arebeing taken this performance is more than enough to catchthe interaction of a user An extensive evaluation of thismodule can be consulted in [43]

5 DOA Behaviour

The goal of the DOA behaviour is to transform DOAmeasurements into a reference to an object in the contextof the task When the DOA system detects a source this


Mic 2 (right)

Mic 1 (left)

Mic 3 (back)

Figure 4Microphones of the audio localization system in the robotGolem-II+

Table 3 Multiple-DOA estimation system performance

Minimum distance of speaker 30 cmMaximum distance of speaker 5mMaximum number of simultaneous speakers 4Response time sRange (minimum angle between speakers) 10Coverage 3601198651-119904119888119900119903119890 1 speaker 100001198651-119904119888119900119903119890 2 speakers 78791198651-119904119888119900119903119890 3 speakers 72341198651-119904119888119900119903119890 4 speakers 6102

sign has a strong relation with the emitter of the sound TheDOA behaviour takes advantage of this strong relation toresolve the object being ldquopointed atrdquo by imposing a spatialconstraint on the possible sources The spatial constraintconsists of the relative position 119886 of the source 119909 given thecurrent robot position (ie 120582119886 119909119903119890119897 119901119900119904(119909 119886)) At this pointwe only have the information that something is at position119886 however since we consider the DOA an index we can useour formulation (1) for indices so that we obtain the followingreduction

120582119909119903119890119897 119901119900119904 (119909 119886 119886) (2)

Although we are able to constrain the object so far we havenot resolved the index to the right referent to achieve thiswe need to use the contextual information The angle of theDOA is used to identify potential objects given the contextAlgorithm 2 is in charge of this resolution stage by analysingwhich objects on the context satisfy the DOA constraint

The whole mechanism is encapsulated in a behaviourwhich is programmed as a simple dialogue model (seeFigure 7) This dialogue model can process multiple DOAsat once for each one it will try to resolve the object beingrefered If successful it creates an 119900119896 situation associatedwith a list of referred objects otherwise it reaches an errorsituationThe proposedDOA behaviour supports interactionin two modalities

(i) Unrestricted call(ii) Contextualized call

Require Pred DOAs ContextEnsure 119875119903119890119889(119897119894119904119905119874119887119895119890119888119905119904 119897119894119904119905119863119874119860119904)119897119894119904119905119874119887119895119890119888119905119904 larr []

119897119894119904119905119863119874119860119904 larr []

for 119889119900119886 larr 119863119874119860119904 dofor 119894119889 120572 larr 119862119900119899119905119890119909119905 doif 119889119900119886 120572 then119897119894119904119905119874119887119895119890119888119905119904119901119906119904ℎ(119894119889)

119897119894119904119905119863119874119860119904119901119906119904ℎ(119889119900119886)

end ifend for

end for

Algorithm 2 Implementation algorithm for the interpretation ofDOAS indices

The first case corresponds to the scenario in which the posi-tion of the user cannot be expectedpredicted beforehand inthis case the description of the context 119862 is empty meaningthat only one user is present (Figure 8(a) depicts the robot insuch situation) that is the indices are directly interpreted asthe user the robot is having the interactionwith In the secondcase the robot has an expectation of the direction fromwhichit will be called and it uses this expectation to identify thecaller For instance Figure 8(b) depicts the robot having aconversation with two users with the information of theirposition it can discard a third user which is not part of theconversation

So far we have assumed that the robot is not moving butthe user should be able to call the robot while it is movingIn order to tackle this situation we created a composedbehaviour which allows the robot to be walking and listeningfor possible calls at a distance Figure 9 shows the dialoguemodel of this behaviour The call for such behaviour is119908119886119897119896119889119900119886(119863 119862) 119863 represents the destination and 119862 thedescription of the spatial context First the behaviour startswith the action of walking to 119863 and while doing so therobot polls for possible DOAs sources in 119862 If there are noneit continues walking otherwise it ends the dialogue with asituation called 119894119899119905119890119903119903119906119901119905119890119889 which contains the informationrelevant to such interruptionThe spatial context gets updatedin each call with the current position of the robot Notice thatthis behaviour reuses the simple DOA behaviour (situation119877119889119900119886119904(119862))

For the alternative case when the user moves the DOAsignal imposes an extra constraint into the index Thisconstraint is that the DOA at a relative position comes fromthe same sound source as the previous DOA and this extrainformation is provided by the tracking stage of ourmultiple-DOA estimation system Thanks to this constraint we canuse the same mechanism we have outlined so far to handlethis case In combination with the walking and listeningbehaviour they can candle the scenario in which both userand robot move

Additionally we also propose to use the DOA behaviourto verify the number of sound sources in a given moment Inparticular we have created a composed behaviour between


Speakerq = 1

q = 0

r = 1

r = 0p = 1

p = 0 RL

F

(a)

Speaker

q = 1

q = 0

r = 1 r = 0

p = 1p = 0 RL

F

(b)

Figure 5 (a) Incoherent set of measures (b) Coherent set of measures (taken from [5])

Ang

les

150

100

50

0

minus50

minus100

minus150

Samples50 100 150 200 250 300 350 400

Figure 6 DOAs for three speakers at 90 minus30 and minus150 degrees

doa fildoas(X) 120576

ok(U Z)

err

doas(C)

interpretCX

120576 120576

Figure 7 Dialogue model for simple DOA behaviour

the ask and DOA behaviours which take advantage of thissituation Figure 10 shows the dialogue model for suchbehaviour After asking a question 119875 and listening to theanswer from the user the robot checks how many indiceswere recovered (ie the number of elements in119885) if it is onlyone it could proceed with the interaction if they are morethan one the robot infers that more than one user is talkingso it produces an error which has to be handled by the robotSimilarly to the walking behaviour this behaviour reuses thesimple DOA behaviour

6 Implemented Tasks

The DOA behaviours have been used to program the fol-lowing tasks with our robot Golem-II+ (videos of these

tasks can be watched at httpsyoutubeQ6prwIjoDnElist=PL4EiERt u4faJoxHF1M5EMwhEoNH4NSc)

Following a Person In this task the robot follows a personby using the visual tracking system (kinect based) The robottries to keep a distance of 1m while the user is moving therobot tries to catch it at a safe speed In case the robot loseshim it asks the user to call it so the robot could infer whichdirection to look for Once positioned on the right directionit uses vision to continue the tracking it uses When lost theuser can be in any direction and inside a ratio of 5m andthe robot would identify where it is If the user is not foundit will insist to be called and try again to locate him (for ademo of the full system see supplementary video followinga personmp4 in Supplementary Material available online athttpdxdoiorg10115520163081048)

Taking Attendance In this task the robot takes attendancein a class by calling the names of the members of the classand for each call it expects an auditory response When thedirection of the response is identified it faces towards thatdirection checks if a person is there by asking to waveand gets closer to verify the identity of the student Thenumber of students is limited by the vision system whichonly can see up to four different persons They have tobe located in a line at the moment the task does notconsider the case of multiple rows We take advantageof this setting and we specify a spatial context in whichstudents are only in front of the robot (for a demo of thefull system see supplementary video taking attendance of aclassmp4)

Playing Marco-Polo In this task the robot plays the role oflooking for the users by enunciatingMarco and expecting oneor more Polo responses When the direction of a response isidentified it moves towards that direction if possible Whenadvancing to this direction if the laser sensor recognizes thatthere is someone in front at close distance it assumes it haswon the game otherwise it continues callingMarco until theuser respondsThe robot has 119899 tries to catch the user if not itgives up and loses the game The laser hypotheses that thereis someone in front if it detects a discontinuity in the readingIn this setting the players can be located at any direction in aratio of 5m (for a demo of the full system see supplementaryvideo playing Marco-Polomp4)


(a) (b)

Figure 8 Examples of calls at a distance unrestricted one user not at predictable position and contextualized multiple users at predictableposition

s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576

ok(U Z) 120576interrupted(U Z)

no C998400= updateCRdoas(C998400)

Figure 9 Dialogue model for walk DOA behaviour

s l120576 said(P)

askdoas(P C)

said(X) 120576

nothing said([repeat P])

err said([repeat P])

ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)

Figure 10 Dialogue model for ask DOA behaviour

Waiter In the previous examples the DOA behaviour relieson the robotrsquos initiative This task pushes the behaviour intoa mixed-initiative strategy In this case the robot is a waiterand it waits for calls from the users located in the restauranttables Once a table calls at a distance the robot will walkto the table to take the orders of the clients However whilewalking it will listen for other calls from different tables ifthis happens it will let the users in the corresponding tableknow that it will be there as soon as it finishes with the tableit is currently walking to While taking the order the robotdirects the interaction if more than one client talks at thesame time Once it has the order it proceeds to bring it bytaking the drinksmeals from a designated pickup location inthe kitchen area or by asking the chef directly for them Thistask is limited by themaximum number of sources that couldbe simultaneously detected (4) In this case the maximumnumber of tables to attend would be 4 with a maximumnumber of 4 clients in each table this have to be in a ratioof 5m (for a demo of the whole system see supplementaryvideo waiter in a restaurantmp4)

We hypothesize that the success of the interaction ineach of the tasks is possible because of the use of theDOA information as an index which is fully interpreted inthe context of each task These contexts make the physicalrelation between the sign (the call) and the object (the user

or users) evident See Table 4 for a summary of the mainelements involved in the resolution of the DOA indices ineach of the tasks and detailed as follows

In the case of the following a person task it is until therobot has lost the user and addresses her or him that theresponse is interpreted as the user and the direction whereit could look for its user this is a case of unconstrained call inwhich the spatial context is emptyWithout context the robotis able to establish the relative position of the user

In the case of the taking attendance task the protocol forthe interaction is quite standard The robot calls the name ofthe student and she or he has to respond to such call Suchresponse is interpreted as the presence of the student and itsrelative position is established Here we can constrain thecalls to be in front of the robot

In the case of the playing Marco-Polo task the rulesdefine the context on which the call is interpreted Theserules specify that when the robot says Marco the users haveto respond with Polo the index of such response can beinterpreted as the relative position of a specific player and itcan direct the robot to look in that areaThis task also uses anunconstrained call In fact the user is not even required to sayPolo since the objective of the game is that the robot ldquocatchesrdquoone of the users via sound alone therefore the DOA of anysound may be considered as an index this DOA will refer tothe user

In the case of thewaiter task the spatial context is definedby the arrangement of the tables in the restaurant This taskuses this spatial context to interpret which table calls forthe attention of the robot The constraint triggered by theDOA information helps identify the calling table and it alsoestablishes its relative position The task uses both the simpleversion and the walking version of the DOA behaviour toresolve the indices Additionally it uses the ask version ofthe DOA behaviour when taking the order The informationprovided by the ask version is used by the robot to face theclients while taking their orders and to direct the interactionby asking the users to talk one at the time

61 Application of the DOA Behaviour in Tasks Figure 11shows excerpts from the dialogue model of the followinga person taking attendance and playing Marco-Polo tasksGiven the information provided by the DOA behaviourthe following a person and playing Marco-Polo tasks cannot


Table 4 Characteristics of the tasks using DOA behaviours

Following a person Taking attendanceAgents Person being followed StudentsRobot goal Track a person Verify students presentUsers goals Being followed Be in the classDOA behaviour Call at a distance Call at a distanceSituation User lost Student name calledSign Here PresentExpression 120582119909119903119890119897 119901119900119904(119909 119886 119886) 120582119909119903119890119897119901119900119904(119909 119886 119886)

Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)

Interpretation Target and its position Presence of student and positionPlaying Marco-Polo Waiter

Agents Players ClientsRobot goal Catch a player Take and deliver ordersUsers goals Not being caught Receive ordersDOA behaviour Call at a distance Call at a distance VerificationSituation SayMarco Robot doing nothing or asking orderSign Polo WaiterExpression 120582119909119903119890119897 119901119900119904(119909 119886 119886) 120582119909119886119903119890119897119901119900119904(119909 119886 119886)

Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)

Interpretation Player and its position Table to order and client talking

lostε said(where are you)

Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)

studentε said(Name)

Checking assistance

ok([Name | _] [Z | _]) turn(Z)check(U)

Robotturn

ε said(marco)

Marco-Polo

ok([U | _] [Z | _]) turn(Z) walklaser(u)

Rdoas([])

Rdoas([])

Rdoas([])

(1m)

Figure 11 The DOA behaviour used on the following a person taking attendance and playing Marco-Polo

identify the user it is as if he or she would have only saidI am here The response of the robot to this interaction isnot verbal but an action in both cases it turns towards thedirection of the call In case of an error in the detection of theDOA the robot will try to recover In the following a persontask the person tracker will trigger an error when no one isfound and it will ask again for a call by the user In the playingMarco-Polo task the robot will repeat Marco and wait forthe answer continuing with the game Because of the natureof the take attendance task the interpretation of the index

signals a particular student even in an underspecified contextbecause the robot knows the name of the student being calledand the response is attributed to her or him It is as if he or shewould have respondedme X I am here In this task the robottries to prevent an error in the DOA location and it checks ifthere is an actual person in that direction but in case it is notit will keep calling the name of the student

Figure 12 shows the DOA behaviours for the waiter taskThis task uses the three behaviours based on the DOA skillFirst it uses the DOA behaviour but with a defined context


Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)

interrupted([U998400 | _] [Z998400 | _]) turn(Z998400) said(be there)

Rwalkdoas(TTs)

Raskdoas(order Ts)

Figure 12 The DOA behaviours used on the waiter task

The interpretation of the index will depend on the spatialcontext and it will define which table calls the robot In thiscase it is as if the client would have said here table X needssomething In the task an error has a larger effect on theinteraction if the robot mistakenly approaches a table thatdid not call for it it will offer to take the order to which theclients there could refuse or if the table is empty the robotwill leave However the table which really wanted to ask foran order could call the robot while walking so that whenthe robot is done with the erroneous table it will approachthe correct one During such movements the task uses thewalking DOA behaviour Finally when arriving to a table toask an order the task uses the ask DOA behaviour and ifmore than one index is returned during the response it willlet the clients know that only one user is supposed to talk at atime If an error on the detection of DOAs occurs during thispart of the task it will create a confusing situation on whichthe robot will face clients which are not there however if noclient answers the robot will assume the ldquoghostrdquo client doesnot want anything and will carry on with the rest of the ordertaking and delivery

62 History of Demonstration and Evaluation of the TasksThe following a person playing Marco-Polo and waiter taskshave been repetitively demonstrated in our lab to the generalpublic and in the RoboCupHome competition [45 46]The experience with these demonstrations during the com-petition has been positive In 2012 the following a persontask was demonstrated in the German Open competitionwhere the demonstration allowed the team to obtain thethird place and in the Mexican Tournament of Roboticsnational competition where the robot won the first placeIn 2013 the waiter task was presented in the first stage ofthe RoboCup competition and it was awarded the InnovationAward of the Home league that year Additionally thatsame year a demonstration of the robot playing Marco-Polowith another robot was attempted but technical problemsprevented the execution of the demo during competition butit was successfully carried out to the general public In allthese cases we have observed a positive appreciation of the

integration of sound localization in the interaction betweenhumans and robots In particular the waiter task has beenextensively evaluated [42] This evaluation was done withfinal users (30) which were asked to place an order to therobot 60 of the users did not have previous experiencewith a service robot We found that when asking for anorder clients repeatedly called at a distance (100) andfully completed the task (90) And even though it was notcommon that two users talked at the same time (40) whenthey did the robot was able to direct the interaction

In the case of the taking attendance task this wasprogrammed by students during a three-week summerinternship The students had no previous knowledge of ourdevelopment framework During the internship they learnedthe framework developed the idea and implemented it onthe robot Other four teams of students developed other fourdifferent tasksThe abstraction of the skill as an element of theSitLog language made it relatively easy to directly implementthe task and reach an appropriate performance in a short timespan

7 Discussion

In this work we proposed the use of direction of arrival ofsound sources as indices In many Human-Robot Interactiontasks the position of objects is essential for the executionof the task for instance losing someone while followinghim or her Under this perspective DOAs are second-classcitizens since they do not provide a position but a directionHowever treating DOAs as indices allows the robot to linksuch directions to objects in the spatial context that leadsto the correct interpretation and continuation of the taskguessing that the lost one is a direction This considerationpermits modeling calls at a distance to a robot which isa desirable way to interact with an interactive agent Inorder to demonstrate this concept we implemented a set ofbehaviours that depend on DOAs to be interpreted as indicesand tested them in four tasks following a person taking theattendance of a class playing Marco-Polo and acting as awaiter in a restaurant


The spatial context or a lack of it defines aspects of thetype of interaction on the task When there is no availablespatial context such as when playing Marco-Polo any indexis considered by the robot as a possible participant Howeverwhen a spatial context is defined for instance in attendingthe tables in a restaurant it is possible to keep track of morethan one user It is also possible to have a mixed-initiativeinteraction in which the robot waits for the call rather thanvisiting each table and asking if they want anything Theindexicality of the DOAs allows the robot to react to thedirection of the user and continue with the interaction notnecessarily in the same modality in playing Marco-Polo therobot moves in the direction of the index in taking theattendance and following a person tasks the robot confirmsthe presence of the user by visual methods in the direction ofthe index in the waiter task the robot faces the tables whichcorrespond to the indices These three possible aspects of theinteraction when using indexical DOAs (multiuser mixed-initiative and multimodal) are an example of the richness ofthe use of calls at a distance in an interaction

Additionally we have showed that the DOA behaviourcomplements the verbal interaction In particular knowingthat there is more than one person talking when the robotis listening is a good indication that the interaction maynot be good With the indexical information the robot cantake control of the interaction and direct it by asking eachuser to speak one at a time preventing an error in thecommunication

Wewere able to implement these tasks since themultiple-DOA estimation system is light and robust up to four soundsources However this system and other robotic systemsimpose some restrictions on the interaction In the caseof following a person it can only follow one person at aclose distance (1m) in the case of taking attendance thestudents have to be aligned In our test there were threestudents in the case of playing Marco-Polo and waitingtables the users player and tables have to be in a ratio of5m In all these cases the maximum number of speakerscould have been four However under this condition theperformance of the system was the poorest and had an effecton the interaction our evaluation shows that one user forplaying Marco-Polo and two users for table in the waitercase were the limit to achieve good performance At thispoint another major problem is possibility of hijacking theattention of the robot This is related to the fact that whenthe spatial context is not defined the robot automaticallyinterprets the DOA as its target This is exemplified with theMarco-Polo game At the moment our implementation ofthe task takes advantage of this situation and an index isinterpreted as a possible user which becomes the target toapproach However this situation can provoke the fact thatthe robot switches targets constantly rather than targetingone which might be a better strategy to win the game Inorder to account for this situation we have identified thatthe DOA behaviour could be complemented with a speakerverification stage in which the identity of who is respondingwith Polo would be verified This complement is persistentwith our proposal since it becomes an extra constraint in thereferred object Other tasks can also take advantage of this as

such verification can be part of the following a person takingattendance and waiter tasks In the future we will explorecomplementing the DOA behaviour with such verificationand its consequences on the interaction

Competing Interests

The authors declare that there are no competing interestsregarding the publication of this paper

Acknowledgments

The authors thank the support of CONACYT throughProjects 81965 and 178673 PAPIIT-UNAM through ProjectIN107513 and ICYTDF through Project PICCO12-024

References

[1] C Peirce N Houser and C Kloesel The Essential PeirceSelected Philosophical Writings vol 1 of The Essential PeirceIndiana University Press 1992

[2] C Peirce N Houser C Kloesel and P E ProjectThe EssentialPeirce Selected Philosophical Writings vol 2 of The EssentialPeirce Indiana University Press 1998

[3] L A Pineda L Salinas I V Meza C Rascon and G FuentesldquoSit log A programming language for service robot taskrdquoInternational Journal of Advanced Robotic Systems vol 10 pp1ndash12 2013

[4] L Pineda A Rodrguez G Fuentes C Rascn and I MezaldquoConcept and functional structure of a service robotrdquo Interna-tional Journal of Advanced Robotic Systems vol 6 pp 1ndash15 2015

[5] C Rascon and L Pineda ldquoMultiple direction-of-arrival estima-tion for a mobile robotic platform with small hardware setuprdquoin IAENG Transactions on Engineering Technologies H K KimS-I Ao M A Amouzegar and B B Rieger Eds vol 247 ofLecture Notes in Electrical Engineering pp 209ndash223 SpringerAmsterdam Netherlands 2014

[6] S Argentieri A Portello M Bernard P Danes and B GasldquoBinaural systems in roboticsrdquo in The Technology of BinauralListening pp 225ndash253 Springer Berlin Germany 2013

[7] H G Okuno and K Nakadai ldquoRobot audition its rise andperspectivesrdquo in Proceedings of the IEEE International Confer-ence on Acoustics Speech and Signal Processing (ICASSP rsquo15) pp5610ndash5614 South Brisbane Australia April 2015

[8] S Argentieri P Danes and P Soueres ldquoA survey on soundsource localization in robotics from binaural to array process-ing methodsrdquo Computer Speech amp Language vol 34 no 1 pp87ndash112 2015

[9] R A Brooks C Breazeal M Marjanovic B Scassellati and MM Williamson ldquoThe cog project building a humanoid robotrdquoin Computation for Metaphors Analogy and Agents vol 1562 ofLecture Notes in Computer Science pp 52ndash87 Springer BerlinGermany 1999

[10] H Kitano H G Okuno K Nakadai T Sabisch and T MatsuildquoDesign and architecture of sig the humanoid an experimentalplatform for integrated perception in robocup humanoid chal-lengerdquo in Proceedings of the IEEERSJ International Conferenceon Intelligent Robots and Systems (IROS rsquo00) vol 1 pp 181ndash190IEEE Takamatsu Japan 2000


[11] F Michaud C Cote D Letourneau et al ldquoSpartacus attendingthe 2005 AAAI conferencerdquo Autonomous Robots vol 22 no 4pp 369ndash383 2007

[12] J-M Valin S Yamamoto J Rouat F Michaud K Nakadai andH G Okuno ldquoRobust recognition of simultaneous speech by amobile robotrdquo IEEE Transactions on Robotics vol 23 no 4 pp742ndash752 2007

[13] I Hara F Asano H Asoh et al ldquoRobust speech interface basedon audio and video information fusion for humanoid HRP-2rdquo in Proceedings of the IEEERSJ International Conference onIntelligent Robots and Systems (IROS rsquo04) vol 3 pp 2404ndash2410September-October 2004

[14] J C Murray H R Erwin and S Wermter ldquoRobotic sound-source localisation architecture using cross-correlation andrecurrent neural networksrdquo Neural Networks vol 22 no 2 pp173ndash189 2009

[15] H-D Kim J-S Choi and M Kim ldquoHuman-robot interactionin real environments by audio-visual integrationrdquo InternationalJournal of Control Automation and Systems vol 5 no 1 pp 61ndash69 2007

[16] B-C Park K-D Ban K-C Kwak and H-S Yoon ldquoSoundsource localization based on audio-visual information for intel-ligent service robotsrdquo in Proceedings of the 8th InternationalSymposium on Advanced Intelligent Systems (ISIS rsquo07) pp 364ndash367 Sokcho South Korea 2007

[17] A Badali J-M Valin F Michaud and P Aarabi ldquoEvaluatingreal-time audio localization algorithms for artificial auditionin roboticsrdquo in Proceedings of the IEEERSJ International Con-ference on Intelligent Robots and Systems (IROS rsquo09) pp 2033ndash2038 St Louis Miss USA October 2009

[18] D Bohus and E Horvitz ldquoFacilitating multiparty dialog withgaze gesture and speechrdquo in Proceedings of the InternationalConference on Multimodal Interfaces and the Workshop onMachine Learning forMultimodal Interaction (ICMI-MLMI rsquo10)pp 51ndash58 Beijing China November 2010

[19] K Nakadai H G Okuno and H Kitano ldquoReal-time soundsource localization and separation for robot auditionrdquo inProceedings of the IEEE International Conference on SpokenLanguage Processing (ICSLP rsquo02) pp 193ndash196 Denver ColoUSA September 2002

[20] V M Trifa A Koene J Moren and G Cheng ldquoReal-timeacoustic source localization in noisy environments for human-robot multimodal interactionrdquo in Proceedings of the 16th IEEEInternational Conference on Robot and Human Interactive Com-munication (RO-MAN rsquo07) pp 393ndash398 August 2007

[21] J G Trafton M D Bugajska B R Fransen and R M RatwanildquoIntegrating vision and audition within a cognitive architectureto track conversationsrdquo in Proceedings of the 3rd ACMIEEEInternational Conference on Human-Robot Interaction (HRIrsquo08) pp 201ndash208 ACM March 2008

[22] K Teachasrisaksakul N Iemcha-Od SThiemjarus and C Pol-prasert ldquoSpeaker trackingmodule for indoor robot navigationrdquoin Proceedings of the 9th International Conference on Electri-cal EngineeringElectronics Computer Telecommunications andInformation Technology (ECTI-CON rsquo12) 4 1 pages May 2012

[23] X Li M Shen W Wang and H Liu ldquoReal-time Sound sourcelocalization for a mobile robot based on the guided spectral-temporal position methodrdquo International Journal of AdvancedRobotic Systems vol 9 article 78 2012

[24] B FransenVMorariu EMartinson et al ldquoUsing vision acous-tics and natural language for disambiguationrdquo in Proceedingsof the ACMIEEE International Conference on Human-Robot

Interaction ((HRI rsquo07) pp 73ndash80 Arlington Va USA March2007

[25] K Nakadai S Yamamoto H G Okuno H Nakajima YHasegawa and H Tsujino ldquoA robot referee for rock-paper-scissors sound gamesrdquo in Proceedings of the IEEE InternationalConference on Robotics and Automation (ICRA rsquo08) pp 3469ndash3474 IEEE Pasadena Calif USA May 2008

[26] I Nishimuta K Itoyama K Yoshii and H G Okuno ldquoTowarda quizmaster robot for speech-based multiparty interactionrdquoAdvanced Robotics vol 29 no 18 pp 1205ndash1219 2015

[27] H M DoW Sheng andM Liu ldquoAn open platform of auditoryperception for home service robotsrdquo in PRoceedings of theIEEERSJ International Conference on Intelligent Robots andSystems (IROS rsquo15) pp 6161ndash6166 IEEE Hamburg GermanySeptember-October 2015

[28] A G Brooks and C Breazeal ldquoWorking with robots andobjects revisiting deictic reference for achieving spatial com-mon groundrdquo in Proceedings of the 1st ACM SIGCHISIGARTConference on Human-robot Interaction (HRI rsquo06) pp 297ndash304Salt Lake City UT USA March 2006

[29] O Sugiyama T Kanda M Imai H Ishiguro and N Hagita ldquoAmodel of natural deictic interactionrdquo Human-Robot Interactionin Social Robotics vol 104 2012

[30] H G Okuno K Nakadai K-I Hidai H Mizoguchi andH Kitano ldquoHumanndashrobot non-verbal interaction empoweredby real-time auditory and visual multiple-talker trackingrdquoAdvanced Robotics vol 17 no 2 pp 115ndash130 2003

[31] A Atkin ldquoPeircersquos theory of signsrdquo inThe Stanford Encyclopediaof Philosophy E N Zalta Ed 2013

[32] D Kaplan ldquoDthatrdquo in Syntax and Semantics P Cole Ed vol 9pp 221ndash243 Academic Press New York NY USA 1978

[33] L Pineda and G Garza ldquoA model for multimodal referenceresolutionrdquo Computational Linguistics vol 26 no 2 pp 139ndash193 2000

[34] R Mitkov Anaphora Resolution vol 134 Longman LondonUK 2002

[35] J R Tetreault ldquoA corpus-based evaluation of centering andpronoun resolutionrdquo Computational Linguistics vol 27 no 4pp 507ndash520 2001

[36] V Ng and C Cardie ldquoImproving machine learning approachesto coreference resolutionrdquo in Proceedings of the 40th AnnualMeeting on Association for Computational Linguistics pp 104ndash111 Association for Computational Linguistics 2002

[37] R A Brooks ldquoHow to build complete creatures rather thanisolated cognitive simulatorsrdquo in Architectures for IntelligenceK VanLehn Ed pp 225ndash239 1991

[38] R P Bonasso R J Firby E Gat D Kortenkamp D PMiller and M G Slack ldquoExperiences with an architecturefor intelligent reactive agentsrdquo Journal of Experimental andTheoretical Artificial Intelligence vol 9 no 2-3 pp 237ndash2561997

[39] L A Pineda I Meza H H Aviles et al ldquoIOCA an interaction-oriented cognitive architecturerdquo Research in Computer Sciencevol 54 pp 273ndash284 2011

[40] I Meza C Rascon and L A Pineda ldquoPractical speechrecognition for contextualized service robotsrdquo in Advances inSoft Computing and Its Applications 12th Mexican InternationalConference on Artificial Intelligence MICAI 2013 Mexico CityMexico November 24ndash30 2013 Proceedings Part II vol 8266of Lecture Notes in Computer Science pp 423ndash434 SpringerBerlin Germany 2013


[41] L Pineda and G Golme ldquoGrupo Golem robocuphomerdquoin Proceedings of the RoboCup 8 pages RoboCup FederationEindhoven The Netherlands June 2013

[42] C Rascon I Meza G Fuentes L Salinas and L A PinedaldquoIntegration of the multi-doa estimation functionality tohuman-robot interactionrdquo International Journal of AdvancedRobotic Systems vol 12 no 8 2015

[43] C Rascon G Fuentes and I Meza ldquoLightweight multi-DOAtracking of mobile speech sourcesrdquo EURASIP Journal on AudioSpeech and Music Processing vol 2015 no 1 pp 1ndash16 2015

[44] P Davis JACK Connecting a World of Audio httpjackaudioorg

[45] T van der Zant and T Wisspeintner ldquoRoboCup X a proposalfor a new league where RoboCup goes real worldrdquo in RoboCup2005 Robot Soccer World Cup IX A Bredenfeld A Jacoff INoda and Y Takahashi Eds vol 4020 of Lecture Notes inComputer Science pp 166ndash172 2006

[46] T Wisspeintner T van der Zant L Iocchi and S SchifferldquoRoboCupHome scientific competition and benchmarkingfor domestic service robotsrdquo Interaction Studies vol 10 no 3pp 392ndash426 2009

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Active and Passive Electronic Components

Control Scienceand Engineering

Journal of



RotatingMachinery


Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics


Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation



DistributedSensor Networks



since it relies on other basic behaviours such as walk and seeobject In essence a behaviour has to be (1) portable so it canbe used in different tasks (2) compositional so by couplingit with different behaviours we can create more complexbehaviours or program a task and (3) able to handle potentialfailures (eg not arriving to its destination not listening to ananswer from the user or not finding the object)

In order to model the indexicality of the DOAs andincorporate it into the interaction capabilities of our robotwe propose a simple behaviour to handle DOAs and interpretthem as indices This supports the interaction by allowingcalls at a distance In this case the user or users canbring the attention of the robot to a specific area in whichthe sound originates This behaviour can also be used incombination with the walk behaviour to allow interruptionsduring walking such as calling a waiter over Additionallythe behaviour also supports the verification of the number ofsources during a conversation In particular this can be usedin combination with the ask behaviour to ensure that whenthe robot is listening to an answer there is only one persontalking

This paper is organized as follows Section 2 presentsprevious work in which sound information is part of a taskfor a service robot Section 3 reviews indexicality as interpre-tation procedure Section 4 presents the framework we useto interpret DOAs signs as indices Section 5 describes themanner in which the DOA behaviour resolves a DOA indexSection 6 shows the four tasks on which the indices fromsound sources are used to direct the interaction Section 7presents a discussion about our proposal and findings

2 Previous Work

The use of sound source localization (SSL) in robotics isa blooming field Okuno and Nakadai and Argentieri etal present reviews of the main methods and their usewith different types of robots [6ndash8] Since its application torobotics SSL has been promoted as a main skill for robotsBrooks et al proposed it as a basic skill for interaction inthe Cog project [9] The robot SIG was proposed as anexperimental setting for the RoboCup competition [10] andthe robot Spartacus participated in the 2005 AAAI MobileRobot Challenge implementing SSL as a part of its skill set[11]

Although the interaction goal is a driving force in thefield there is a great effort on developing a robust SSLmoduleWith this goal in mind the preferred setting is the use ofan array of microphones on board of the robot A many-microphone solution provides good performance given itsredundancy In [12] Valin et al propose an 8-microphone 3Darray to enhance speech recognition and as a preamble forDOA estimation More microphones are possible Hara et aluse two arrays of 8 microphones (16 in total) [13] Howeverminimal systems with 2 microphones are also possible suchas the one presented in [14] A common setting which wefollow is to use 3microphones [15 16]The best approach andconfiguration are still an open question Badali et al presentan evaluation of the main approaches over an 8-microphonearray for mobile robots [17]

Source localization can be reactive as a modality of theinteraction as presented in works that deal with followinga conversation [18ndash21] or it can be an essential part ofthe interaction such as using sound localization to controlthe navigation of the robot In particular Teachasrisaksakulet al propose a system that follows a person through aroom by her or his voice [22] On the other hand Li et aldemonstrated a robot which plays hide and seek using visualand audio modalities in the interaction [23] These worksfollow a similar approach onhow to incorporate sound sourcelocation as a part of the interaction as the one we proposeHowever since only one user is present in their work itstreatment as an index is trivial The DOA sign is directlyinterpreted as the user there are no conflicts regarding whoor what the sign could represent

Given the progress in the field there has been an effortdirected at capturing a more complicated interaction Forinstance Fransen et al present a multimodal robot whichfollows the instructions among two users to reach a target[24]Nakadai et al present a robotwhich is able to judge rock-paper-scissors sound games [25]TheQuizmaster robot playsa game in which it asks a question to up to four participantsand uses SSL and source separation to decide who answeredit first [26] Do et al present a robot which together witha caregiver log and detect the origin of certain sounds [27]At this level of complexity of the interaction DOA signs arebeing indirectly used as indices the DOAs become indiceswhich signify directly the users and this information is usedto disambiguate who calls or wins In [27] the DOAs anda category associated with them signify types of events Atthis level the multiple sources make a method to assign theDOA to the right entity necessary As we will see in thiswork the benefit of considering a DOA as an index in theright context allows us to use indexical resolution to performthis disambiguation in the cases that more than one DOAcompetes In this work we formalize this resolution similarlyto other reference resolution mechanisms on robots such asdeictic references [28 29]This differs from other approachesin which DOAs and multimodal information complementeach other for disambiguation [30]

3 Indexical Expressions

An interactive agent such as a robot has to understand theintention of its users or have expectations about the eventsthat can occur in the environment in order to produce anadequate response Considering the theory of signs by Piercewe can formulate the goal of ldquounderstandingrdquo as providinga sign or a set of signs to identify the signified object orobjects [1 2 31] In order to denote the object the robot buildsa representation of it For example if the user commandsrobot go to the kitchen the sequence of symbols robot goto the and kitchen get translated into the representation119892119900(119903119900119887119900119905 119896119894119905119888ℎ119890119899) with which the robot can establish thesignified objects (itself the kitchen and the action) andact accordingly A similar mechanism can be used for theinterpretation of iconic signs for instance when the robotsees a table and identifies on it a bottle of juice it will




Take it 120582119909119905119886119896119890(119909 119900119887119895119890119888119905) Object 119905119886119896119890(119895119906119894119888119890)




120582120572119876119909 (119876119909 120572) (1)











[ id ==gt s i


]

[ id ==gt s i


]

[ id ==gt s i



]


Si Sj120572 120573

(a)

Si

hf g

(b)

Si

SjRM


120572 120573

120572M

(c)



















Multi-DOAtracking

VAD

ITDcalc

Redundancycheck


Clusteringmethod

ClusterDOA

averaging


PairwiseDOA

estimation

Filter(1ndash4kHz)

IRL

ILF

IFR

DfrontRL

DbackRL

DfrontLF

DbackLF

DfrontFR

DbackFR





5 DOA Behaviour



Mic 2 (right)

Mic 1 (left)

Mic 3 (back)





120582119909119903119890119897 119901119900119904 (119909 119886 119886) (2)





119897119894119904119905119863119874119860119904 larr []


119897119894119904119905119863119874119860119904119901119906119904ℎ(119889119900119886)

end ifend for

end for







Speakerq = 1

q = 0

r = 1

r = 0p = 1

p = 0 RL

F

(a)

Speaker

q = 1

q = 0

r = 1 r = 0

p = 1p = 0 RL

F

(b)


Ang

les

150

100

50

0

minus50

minus100

minus150

Samples50 100 150 200 250 300 350 400



ok(U Z)

err

doas(C)

interpretCX

120576 120576



6 Implemented Tasks







(a) (b)


s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576




s l120576 said(P)

askdoas(P C)

said(X) 120576



ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)













Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation












Take it 120582119909119905119886119896119890(119909 119900119887119895119890119888119905) Object 119905119886119896119890(119895119906119894119888119890)




120582120572119876119909 (119876119909 120572) (1)











[ id ==gt s i


]

[ id ==gt s i


]

[ id ==gt s i



]


Si Sj120572 120573

(a)

Si

hf g

(b)

Si

SjRM


120572 120573

120572M

(c)



















Multi-DOAtracking

VAD

ITDcalc

Redundancycheck


Clusteringmethod

ClusterDOA

averaging


PairwiseDOA

estimation

Filter(1ndash4kHz)

IRL

ILF

IFR

DfrontRL

DbackRL

DfrontLF

DbackLF

DfrontFR

DbackFR





5 DOA Behaviour



Mic 2 (right)

Mic 1 (left)

Mic 3 (back)





120582119909119903119890119897 119901119900119904 (119909 119886 119886) (2)





119897119894119904119905119863119874119860119904 larr []


119897119894119904119905119863119874119860119904119901119906119904ℎ(119889119900119886)

end ifend for

end for







Speakerq = 1

q = 0

r = 1

r = 0p = 1

p = 0 RL

F

(a)

Speaker

q = 1

q = 0

r = 1 r = 0

p = 1p = 0 RL

F

(b)


Ang

les

150

100

50

0

minus50

minus100

minus150

Samples50 100 150 200 250 300 350 400



ok(U Z)

err

doas(C)

interpretCX

120576 120576



6 Implemented Tasks







(a) (b)


s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576




s l120576 said(P)

askdoas(P C)

said(X) 120576



ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)













Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










[ id ==gt s i


]

[ id ==gt s i


]

[ id ==gt s i



]


Si Sj120572 120573

(a)

Si

hf g

(b)

Si

SjRM


120572 120573

120572M

(c)



















Multi-DOAtracking

VAD

ITDcalc

Redundancycheck


Clusteringmethod

ClusterDOA

averaging


PairwiseDOA

estimation

Filter(1ndash4kHz)

IRL

ILF

IFR

DfrontRL

DbackRL

DfrontLF

DbackLF

DfrontFR

DbackFR





5 DOA Behaviour



Mic 2 (right)

Mic 1 (left)

Mic 3 (back)





120582119909119903119890119897 119901119900119904 (119909 119886 119886) (2)





119897119894119904119905119863119874119860119904 larr []


119897119894119904119905119863119874119860119904119901119906119904ℎ(119889119900119886)

end ifend for

end for







Speakerq = 1

q = 0

r = 1

r = 0p = 1

p = 0 RL

F

(a)

Speaker

q = 1

q = 0

r = 1 r = 0

p = 1p = 0 RL

F

(b)


Ang

les

150

100

50

0

minus50

minus100

minus150

Samples50 100 150 200 250 300 350 400



ok(U Z)

err

doas(C)

interpretCX

120576 120576



6 Implemented Tasks







(a) (b)


s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576




s l120576 said(P)

askdoas(P C)

said(X) 120576



ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)













Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation


















Multi-DOAtracking

VAD

ITDcalc

Redundancycheck


Clusteringmethod

ClusterDOA

averaging


PairwiseDOA

estimation

Filter(1ndash4kHz)

IRL

ILF

IFR

DfrontRL

DbackRL

DfrontLF

DbackLF

DfrontFR

DbackFR





5 DOA Behaviour



Mic 2 (right)

Mic 1 (left)

Mic 3 (back)





120582119909119903119890119897 119901119900119904 (119909 119886 119886) (2)





119897119894119904119905119863119874119860119904 larr []


119897119894119904119905119863119874119860119904119901119906119904ℎ(119889119900119886)

end ifend for

end for







Speakerq = 1

q = 0

r = 1

r = 0p = 1

p = 0 RL

F

(a)

Speaker

q = 1

q = 0

r = 1 r = 0

p = 1p = 0 RL

F

(b)


Ang

les

150

100

50

0

minus50

minus100

minus150

Samples50 100 150 200 250 300 350 400



ok(U Z)

err

doas(C)

interpretCX

120576 120576



6 Implemented Tasks







(a) (b)


s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576




s l120576 said(P)

askdoas(P C)

said(X) 120576



ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)













Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










Mic 2 (right)

Mic 1 (left)

Mic 3 (back)





120582119909119903119890119897 119901119900119904 (119909 119886 119886) (2)





119897119894119904119905119863119874119860119904 larr []


119897119894119904119905119863119874119860119904119901119906119904ℎ(119889119900119886)

end ifend for

end for







Speakerq = 1

q = 0

r = 1

r = 0p = 1

p = 0 RL

F

(a)

Speaker

q = 1

q = 0

r = 1 r = 0

p = 1p = 0 RL

F

(b)


Ang

les

150

100

50

0

minus50

minus100

minus150

Samples50 100 150 200 250 300 350 400



ok(U Z)

err

doas(C)

interpretCX

120576 120576



6 Implemented Tasks







(a) (b)


s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576




s l120576 said(P)

askdoas(P C)

said(X) 120576



ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)













Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










Speakerq = 1

q = 0

r = 1

r = 0p = 1

p = 0 RL

F

(a)

Speaker

q = 1

q = 0

r = 1 r = 0

p = 1p = 0 RL

F

(b)


Ang

les

150

100

50

0

minus50

minus100

minus150

Samples50 100 150 200 250 300 350 400



ok(U Z)

err

doas(C)

interpretCX

120576 120576



6 Implemented Tasks







(a) (b)


s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576




s l120576 said(P)

askdoas(P C)

said(X) 120576



ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)













Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










(a) (b)


s w120576 go(D)

walkdoas(D C)

arrived 120576ok

err 120576




s l120576 said(P)

askdoas(P C)

said(X) 120576



ok(Z U) 120576ok(X)

err

checkZ

Rdoas(C)













Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation












Resolution 119903119890119897119901119900119904(119906119904119890119903 119886) 119903119890119897119901119900119904(119899119886119898119890 119886)



Resolution 119903119890119897 119901119900119904(119901119897119886119910119890119903 119886) 119903119890119897119901119900119904(119905119886119887119897119890 119886)



Following

err ε

err ε

err ε

ok([U | _] [Z | _]) turn(Z)see(U)


Checking assistance


Robotturn

ε said(marco)

Marco-Polo


Rdoas([])

Rdoas([])

Rdoas([])

(1m)






Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation










Wait ε ε

Waitererr ε

ok([T | _] [Z | _]) turn(Z)

ok ε

ok(Order) ε

err

Bring

Rdoas(Ts)


Rwalkdoas(TTs)

Raskdoas(order Ts)






7 Discussion







Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation














Competing Interests


Acknowledgments


References




















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation


















































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation









on indexicality, direction of arrival of sound sources, and human-robot interaction

Documents