
Bring Your Body into Action
Body Gesture Detection, Tracking, and Analysis for Natural Interaction

FARID ABEDAN KONDORI

Doctoral Thesis
Department of Applied Physics and Electronics
Umeå University
Umeå, Sweden 2014


Doctoral Thesis
Digital Media Lab
Department of Applied Physics and Electronics
Umeå University
SE-901 87 Umeå, Sweden

Copyright © 2014 by Farid Abedan Kondori
ISSN: 1652-6295:19
ISBN: 978-91-7601-067-9
Author e-mail: [email protected]
Typeset in LaTeX by Farid Abedan Kondori
E-version available at http://umu.diva-portal.org/
Printed by Print & Media, Umeå University, Umeå, Sweden 2014


Abstract

Due to the large influx of computers into our daily lives, human-computer interaction has become crucially important. For a long time, focusing on what users need has been critical for designing interaction methods. However, a new perspective tends to extend this attitude to encompass how human desires, interests, and ambitions can be met and supported. This implies that the way we interact with computers should be revisited. Centralizing human values rather than user needs is of the utmost importance for providing new interaction techniques. These values drive our decisions and actions, and are essential to what makes us human. This motivated us to introduce new interaction methods that will support human values, particularly human well-being.

The aim of this thesis is to design new interaction methods that will empower humans to have a healthy, intuitive, and pleasurable interaction with tomorrow's digital world. To achieve this aim, this research is concerned with developing theories and techniques for exploring interaction methods beyond the keyboard and mouse, utilizing the human body. Therefore, this thesis addresses a very fundamental problem, human motion analysis. The technical contributions of this thesis introduce computer vision-based, marker-less systems to estimate and analyze body motion. The main focus of this research work is on head and hand motion analysis, since these are the most frequently used body parts for interacting with computers. This thesis gives an insight into the technical challenges and provides new perspectives and robust techniques for solving the problem.

Keywords: Human Well-Being, Bodily Interaction, Natural Interaction, Human Motion Analysis, Active Motion Estimation, Direct Motion Estimation, Head Pose Estimation, Hand Pose Estimation.


Acknowledgments

I am grateful to the following people, who have directly or indirectly contributed to the work in this thesis and deserve acknowledgment.

First of all, I would like to thank my supervisors Assoc. Prof. Li Liu and Prof. Haibo Li, not only for providing this research opportunity, but also for the positive environment they created in our research group. I am truly indebted and thankful for their valuable guidance in the research world. Thanks for all the inspiration and encouragement during these years. Without your help and support I would not have been able to pursue my research.

I would also like to thank my best friend and best colleague, Shahrouz Yousefi. We had many collaborations, interesting discussions, and enjoyable moments. Special gratitude goes to my friends and colleagues Jean-Paul Kouma, Shafiq Ur Rehman, Alaa Halawani, and Ulrik Söderström for their helpful suggestions and discussions. Thanks to all the staff of the Department of Applied Physics and Electronics (TFE) for creating an enjoyable and pleasant working environment. I would like to thank the Center for Medical Technology and Physics (CMTF) at Umeå University for financing part of my research.

Most importantly, I owe sincere and earnest thankfulness to my parents and my brothers, without whom none of this would be possible, for all the love and support they provide. Finally, I would especially like to thank my dear fiancée, Zeynab. Thanks for bringing endless peace and happiness to my life.

Thank you all.

Farid Abedan Kondori
Umeå, June 2014


Contents

Abstract
Acknowledgments

1 Introduction
  1.1 Aim of the thesis
  1.2 Research problem
  1.3 Thesis outline

I WHY

2 Design for Values
  2.1 From user experience to human values
  2.2 Human well-being

3 Human Engagement with Technology
  3.1 Integrating surgery with imaging technology
  3.2 From clinical labs to individual homes

4 The Growth of Technology
  4.1 Mobile computers
  4.2 Wearable computers
  4.3 New sensors

5 Introducing New Interaction Methods

II WHAT

6 Interaction through Human Body
  6.1 Problem definition
  6.2 Human motion analysis
    6.2.1 Non-vision-based systems
    6.2.2 Vision-based systems
  6.3 Technical challenges

III HOW

7 Revisiting the Problem of Human Motion Analysis
  7.1 New strategy; right problem, right tool
  7.2 New setting; active vs. passive
  7.3 New tool; 3D vs. 2D
  7.4 New perspective; search-based motion estimation

8 Comparative Evaluation of Developed Systems
  8.1 3D interaction with mobile devices
  8.2 Active motion estimation
  8.3 Direct motion estimation
  8.4 Depth-based gestural interaction

9 Summary of the Selected Articles
  9.1 List of all publications

10 Contributions, Conclusions, and Future Directions
  10.1 Contributions
    10.1.1 Body motion analysis
    10.1.2 Bodily interaction
  10.2 Conclusions
  10.3 Future directions

Bibliography


Chapter 1

Introduction

The emergence of computers has reshaped our world. From the late 20th century, when there was one computer for thousands of users, to the 21st century, where there are thousands of computers for one user, technology has changed the way we grow up, live together, and grow older.

Our interaction with computerized devices has also changed. The shift from typing commands as the only input channel to running applications by simply clicking on icons in graphical user interfaces was the major step in the evolution of human-computer interaction. As digital technologies advanced, there was a turn towards touch-based interfaces, where users perform different hand gestures to operate their devices.

In the near future, computers will continue to proliferate, from inside our bodies to roaming Mars. They will also look quite different from the PCs, laptops, and handheld electronic gadgets of today. Now the question is: what should our interaction with computers be like in the coming years? Will current methods enable us to have an enjoyable life with computers? And if not, what kinds of interfaces would be more useful for improving the quality of human life in the future?

Obviously, existing interaction methods will face many challenges, largely because they are designed to satisfy current user needs. However, in the next decade, or even in the next few years, users will demand more from computers, and technology will affect all aspects of human life. Therefore, in order to explore interaction methods for the future world, we need to anticipate future human needs. To do so, detailed observation of human engagement with digital technologies will be of great importance.


Figure 1.1: Three areas that need to be taken into account to introduce new interaction methods.

Nevertheless, a good understanding of human values, and centralizing them in the design and development process, will be the key factor for success. And of course, all of this is possible only when we are provided with recent technologies. Consequently, to design and develop new interaction techniques, three key areas must be taken into account: human values, human engagement, and new technologies (Fig. 1.1).

1.1 Aim of the thesis

The aim of this thesis is to understand, study, and analyze the three aforementioned factors, and to provide new interaction methods that will empower humans to have a healthy, intuitive, and pleasurable interaction with tomorrow's digital world. A natural and intuitive interface enables users to operate through intuitive actions related to natural, everyday human behavior.

1.2 Research problem

In order to achieve this aim, this research is concerned with developing theories and techniques for exploring interaction methods beyond the keyboard and mouse, utilizing the human body. To accomplish this, we address a very fundamental problem: human motion analysis. More specifically, this thesis attempts to develop vision-based, marker-less systems for analyzing head and hand motion.

1.3 Thesis outline

The thesis is carefully organized in four parts. In order to help readers follow the thesis smoothly, three principal questions will be answered throughout the thesis:

1. Why is this research done?

2. What is done to achieve the goal?

3. How is the research goal achieved?

Part I provides a concrete rationale for why this research is done. Three main aspects will be addressed, and the need to introduce new interaction methods will be discussed.

What we have done to achieve our goal is explained in Part II. In this part we define a scientific problem, how to provide new interaction methods, and address the technical challenges.

Part III discusses how we have accomplished our goal. New perspectives and robust techniques for solving the problem are presented there. Then, contributions, concluding remarks, and future directions are highlighted.

Finally, all publications that provide the basis for this thesis are listed in the last part.


Part I

WHY


Chapter 2

Design for Values

Computers have reshaped our world and changed the way we live. A new generation of technologies is providing us with more creative uses of computing than ever before. Computers will continue to proliferate and affect all aspects of our lives in tomorrow's digital world. They will also look quite different from the computerized devices of today. The world we live in is being changed by new technologies, and the extent of these changes will increase in the near future.

However, transformations in computing technologies are not the main concern about the future world. The main issue is how technologies can support the things that matter to people in their lives: human values. These values drive our decisions and actions, and are essential to what makes us human. Being active and healthy, self-expression, creativity, and taking care of loved ones are such values. When considering the future digital world, we should ensure that the attributes that make us human are manifest in our interaction with technology.

2.1 From user experience to human values

For a long time, what was crucial for human-computer interaction (HCI) researchers was user-centered design, centralizing what users need and want from technology. Nevertheless, a new perspective in HCI tends to extend this attitude to encompass how human desires, interests, and ambitions can be met and supported through technology.


Figure 2.1: Instagram enables its users to share their life events, show their affection, and express their creativity.

Currently, technologies do not necessarily solve users' problems as they used to, but are increasingly enabling users to achieve other kinds of interests and aspirations. For instance, Instagram, a photo and video sharing application, is designed specifically to support certain values, such as enabling people to express themselves, to show their affection to others, and to demonstrate their creativity (Fig. 2.1).

In short, recognizing human values and centralizing them in the design and development of interaction methods is critical.

2.2 Human well-being

Human well-being is one of the core values that should be recognized and supported by future technology [1]. The motivation for this work is that being active is central to human well-being in many situations and over time [2]. Indeed, movement is part of the essence of being human. The first abilities developed by humans were physical skills. Our ancestors learned how to use their hands to make tools for hunting animals, protecting their families, and building their communities. We, as human beings, move. From early childhood to our last step, we are designed to move.

When computer technology grew into our workplaces and other areas of our everyday life, many of the opportunities to use our physical abilities diminished. We have utilized computers in many situations to relieve ourselves from physically demanding tasks. For many years, the main physical interaction between humans and computer technology has been our fingertips on the keyboard and some clicking with the mouse, while the rest of our body has remained seated. Passive interaction with technology changed the macro-monotony of human movements into the micro-monotony in front of computers.

We need to remember that physical activity is fundamental to human well-being. It has long been known that regular exercise is good for our physical health: it can reduce the risk of cancer, heart disease, and strokes. In recent years, studies have shown that regular physical activity also benefits our mental and emotional health [3]. Emotional well-being and mental health concerns can dramatically affect physical health. Stress, depression, and anxiety can contribute to a host of physical ailments, including digestive disorders, sleep disturbances, and lack of energy. Exercise can help people with depression and prevent them from becoming depressed in the first place.

Physical, mental, and emotional health can improve through a balanced level of physical activity in everyday life. It turns out that two-thirds of the adult population (people aged 15 years or more) in the European Union do not achieve recommended levels of activity [4]. We need to move; otherwise, we risk developing physical and mental health problems.

Another aspect of human well-being encompasses our interpersonal relationships, social skills, and community engagement. At the soul of every human is the need to feel a sense of togetherness. This is an extension of our dependency on and our care for our fellow human beings. From early childhood we are taught to care for friends, family members, neighbors, and relatives, which leads to a need for togetherness and helps build community. This can be seen when we come together at festivals or any social ritual: there is always this sense of community. When we do not have this, we feel alone, with no one to talk to or share with. At a higher level, this can affect the unity of societies. We need to be together to build and develop our societies.


In our digital world, we are surrounded by computers. We spend a great deal of time working with laptops, smartphones, and tablets in our daily lives. With the emergence of teleworking, today we do not even need to physically interact with our employers and customers. Many people stay at home and work remotely. This increasing tendency has led to the isolation of individuals in our societies. Despite using technology to communicate with others, many people feel alone. Although social networks have made it easier to connect with friends, loneliness is still evident. There is a classical belief that technology can be employed to remedy loneliness, to connect people to each other, yet today we feel more alone. According to a study conducted in Australia, the more we use technology to connect with others, the more alone we feel [5].

Social networks and communication technologies are meant to connect us, so why are they making us feel alone? The answer to this question is beyond the scope of this thesis; however, there is a clue that might help. Research reports reveal that only 20 to 40 percent of interpersonal communication involves the explicit meaning of words and verbal cues, while 60 to 80 percent is conveyed through non-verbal signals [6]. When we communicate, non-verbal cues can be as important as, or in some cases even more important than, what we say. Non-verbal communication includes tone of voice, facial expressions, and body language or body movements. Clearly, body movements have a great impact when we communicate, an element that is missing when we interact with technology.

Passive interaction with social networks might be one of the reasons that make us feel lonelier. Imagine a bodily active social network where users interact and communicate with each other through body movements. Active interaction through body movements can convey closeness and togetherness, the things we need as human beings in our lives. Perhaps a new generation of social networks will incorporate body movements to deliver more humanistic experiences.


Chapter 3

Human Engagement with Technology

To design our interaction with tomorrow's digital world, a comprehensive study of today's human engagement with technology is crucial. Not only will consideration of human involvement with digital technologies expose the limitations of existing interaction methods, but it will also provide us with vital clues about the extent to which future technology will empower humans.

Several areas could be reviewed; here, we consider two creative human engagements with digital technologies in medical applications.

3.1 Integrating surgery with imaging technology

Reviewing patients' medical images and records is one of the crucial steps during surgery. Surgeons have to gather and analyze additional information on the state of a patient in surgery. Incorporating computers into the operating room enables a wide range of patient-related data to be collected and analyzed. However, optimal viewing of this information becomes problematic due to the sterility restrictions in the operating room.

Since maintaining a pristine sterile field is substantially important in the operating room, using non-sterile objects to interact with computers causes a problem for surgeons. For instance, a surgeon may be required to step out of the surgical field to operate the computer with keyboard and mouse to view MRI or CT images. The surgeon then has to scrub up to the elbows once again, and then re-enter the surgical field. Due to the time involved in re-scrubbing, surgeons may not use computers as often as needed, and instead rely on the data from their last observations. This problem can be addressed by developing new interaction systems beyond the keyboard and mouse, utilizing natural hand gestures. Surgeons can control computers and manipulate screen contents simply by moving their hands in the air, without ever having to sacrifice the sterile field. The need for new interaction systems is even more evident with the emergence of robotic nurses in tomorrow's operating rooms [8]. Growing techno-dependency is a common feature of the future operating room. As a result, designing natural and easy-to-learn interaction methods that meet the required sterility conditions in the operating room will be of great importance.

Figure 3.1: A modern operating room equipped with computers and imaging technology [7].

3.2 From clinical labs to individual homes

Until recently, patients had to visit their therapists to monitor and evaluate serious health issues, movement disorders for instance. Clinicians and physicians could monitor a patient's improvement after surgical interventions by tracking and analyzing the patient's movements. Clinical gait analysis labs are the best-known existing technology for this task. Several markers and identifiers are mounted on the human body, which are detected and tracked by special sensors to monitor and analyze the patient's motion.

Figure 3.2: Using a rehabilitation gaming system for home-based therapy [9].

But healthcare systems are changing fast. Technological progress in telecommunications has led to the development of telerehabilitation systems for home-based therapy. Telerehabilitation offers systems to manage the rehabilitation process at a distance. Communication between patients and therapists via the Internet enables telerehabilitation: therapists can observe patients' status and send commands remotely, so patients do not need to visit a rehabilitation facility for every session.

Telerehabilitation systems provide the possibility of early diagnosis, increased intensity and duration of rehabilitation, the start of therapy under acute conditions, and a shortened in-hospital period [10]. However, the change in healthcare systems is mainly due to the effectiveness of the home setting in rehabilitation [10], i.e., taking the patient's treatment and observation process from clinical labs out to daily life activities. One of the drawbacks associated with clinical gait analysis labs is the patient's unnatural behavior. It has turned out that patients are biased towards their physicians [11]. When it is time for a follow-up check, patients might be affected or even intimidated by the specific circumstances of the lab environment. Therefore, they may not present their natural movement pattern, and instead try to present the best possible performance, pretending that everything is under control, though this is not true. In fact, the best way to assess a patient's improvement is to observe the patient in daily life activities. Not only will the physical treatment process be altered, but the way movement disorders are diagnosed will also change.

There will be a shift towards ubiquitous diagnostic systems, where computers will be utilized to analyze human daily life activities literally everywhere. Humans will be monitored as they work, walk, play, etc. This new perspective calls for intelligent diagnostic systems, where every human movement can be observed and analyzed to distinguish abnormal movement patterns.


Chapter 4

The Growth of Technology

This chapter addresses the rapid development of digital technologies and visual sensors, and the opportunity it provides to design, develop, and implement new interaction techniques.

4.1 Mobile computers

Since their inception, mobile phones have gone through a lot of changes. Without doubt, the first stage of mobile phone evolution was reducing their size. Users could proudly show their cellphone to other people and then put it back into their pocket. Adding a colorful display and a digital camera made cellphones look much better, and by then users had started to realize that cellphones could enable new functionalities, such as capturing and recording life events and sending/receiving multimedia messages.

Nevertheless, Apple's revolutionary product, the iPhone, completely changed the concept of the cellphone and introduced a new term: smartphone. The iPhone's touch display gave users a large visualization space, letting them use their fingers to open and run different applications. Equipped with a high-quality camera and a strong processor, it was able to run various applications, from maps and navigation to games. Apple's success in the market drove competitors to manufacture similar products. And now, thanks to the intense competition in the market, smartphones are affordably priced while equipped with high-resolution cameras and strong CPUs.

The emergence of smartphones was largely due to the invention of touch screens. Replacing physical buttons with multi-touch displays provided a new input channel for users, and it was surprisingly successful. Although designers and manufacturers have explored various nice features of smartphones, they have not yet utilized smartphones' capacities to introduce new means of interaction. Since smartphones have high-resolution cameras and powerful CPUs, they provide us with opportunities to employ computer vision algorithms for developing new interaction methods and delivering new experiences (Fig. 4.1).

Figure 4.1: Moving from the 2D touch screen interaction space towards the 3D space behind smartphones provides a great opportunity to design and develop natural user interfaces.

4.2 Wearable computers

Wearable computers are electronic devices that are worn by the bearer under, with, or on top of clothing [12]. This class of wearable technology has been developed for general- or special-purpose information technology and media applications. One of the best-known applications of wearable computers is augmented reality [12]. Augmented reality superimposes an extra layer on the real-world environment. The extra layer's elements are computer-generated inputs, such as sound, graphics, and GPS data, that let users receive more information on top of the real-world view. Head-mounted displays are such wearable computers.

Figure 4.2: 3D gestural interaction for wearable computers. The user is looking at a 3D model augmented on the display as he manipulates the model in 3D space.

Recently, Google launched its optical head-mounted display, Google Glass [13], for a certain group of developers. Google Glass is operated through a touchpad and voice commands, and displays information in a smartphone-like, hands-free format. Google Glass perfectly shows us the potential of future wearable computers, whose hardware includes a powerful processing unit, Internet connectivity, a high-resolution camera, a display, and motion capture sensors. What else do we need to create a totally new interaction method for this kind of future device (Fig. 4.2)?


Figure 4.3: 3D sensors; top: Microsoft Kinect [15], left: Creative Senz3D [14], right: Leap Motion [17].

4.3 New sensors

New visual sensors have opened a new angle for researchers to make use of image analysis algorithms to design new interaction techniques. Commercial high-resolution cameras are becoming smaller and can therefore be conveniently mounted on the human body to analyze human activities. They are also capable of recording human daily life activities for further analysis.

More recently, sensor manufacturers have attempted to add one more dimension to conventional 2D RGB cameras. The Creative Senz3D [14], Microsoft Kinect [15], Structure [16], and Leap Motion [17] are such sensors (Fig. 4.3). The Kinect, for example, provides a robust solution for inferring 3D scene information from continuously projected infrared structured light.

The growing availability of 3D sensors reflects the trend towards using 3D information to create innovative applications never experienced before. 3D sensors have quite recently been integrated into smartphones and tablets, opening a completely new world for the creation of novel experiences (Figs. 4.4 and 4.5).


Figure 4.4: The Structure sensor simply mounts around the side of the iPad, turning the tablet into a portable 3D scanner [18].

Figure 4.5: Google's Project Tango, an Android prototype phone with a 3D sensor [19].


Chapter 5

Introducing New Interaction Methods

This chapter answers the first question: why is this research done? To this end, motivation, need, and technical feasibility will be addressed (Fig. 5.1).

• Motivation

In the second chapter, the importance of human values in tomorrow's digital world was emphasized. Centralizing human values rather than user needs in future interaction systems is essential. Human well-being is one of the core values that should be supported by future technology. The motivation for this work is to design and develop bodily interaction methods that promote human well-being.

• Need

Human involvement with new technologies is growing rapidly. The new generation of technologies is providing us with more creative uses of computing than ever before. The growth of human engagement illustrates the extent to which future technology will empower humans. Computers will definitely be employed for all kinds of novel engagements. A variety of people will use computing technologies, from children to the elderly, and from the non-computer-literate to professionals. Although rapid advancements in computing science will make all of these engagements possible, there is a need to introduce natural interaction methods that facilitate future human engagements.


Figure 5.1: Why this research is done.

• Feasibility

Digital technologies are rapidly becoming inexpensive, everyday commodities. Mobile phones have evolved from handsets into the world in our hands. Wearable computers are no longer cumbersome, but are widely considered fashionable accessories. The world we live in is being changed by new technologies, and the extent of these changes will increase in the near future. All of this means that the way we interact with future technologies can be changed. Recent advancements in sensing technologies are strong evidence that we are provided with great opportunities.

Now it should be easy to answer why this work is done:

"To introduce new interaction methods that will empower human to have ahealthy, intuitive, and pleasurable interaction with tomorrow’s digital world."


Part II

WHAT


Chapter 6

Interaction through Human Body

This chapter will discuss what needs to be done to accomplish the aim of the thesis.

6.1 Problem definition

In order to provide a healthy, intuitive, and pleasurable interaction, this research is concerned with developing theories and techniques for exploring interaction methods beyond the keyboard and mouse, utilizing the human body. In this thesis, the human body is regarded as a functional surface for interacting with computers. It will be shown that human body position, orientation, and 3D motion can be utilized to facilitate natural and immersive interaction with computers. Therefore, to enable new interaction methods, we address a very fundamental problem: human motion analysis.

Human body motion can be described in terms of a global rotation and translation G(R, T), and local motion L(β1, β2, · · · , βn). Thus, we formulate body motion as a function of the two variables G and L:

Body motion = M(G(R, T), L(β1, β2, · · · , βn))

Considering the head, the global motion G(R, T) describes the six degree-of-freedom (DOF) head motion with respect to the real-world coordinate system, while L(β1, β2, · · · , βn) can represent lip motion, eye motion, etc. The focus of the thesis is on estimating global body motion to provide new interaction methods.

Figure 6.1: Taxonomy of human motion analysis systems.
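To make the formulation concrete, the following minimal sketch shows one way this parameterization could be represented in code. The names, and the choice of a rotation matrix plus translation vector for G(R, T), are illustrative assumptions rather than an implementation from the thesis.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class GlobalMotion:
    """G(R, T): the 6-DOF rigid motion of a body part."""
    R: np.ndarray  # 3x3 rotation matrix
    T: np.ndarray  # 3-vector translation

@dataclass
class BodyMotion:
    """M(G, L): global rigid motion plus local motion parameters."""
    G: GlobalMotion
    L: np.ndarray = field(default_factory=lambda: np.zeros(0))  # beta_1..beta_n

# Example: a pure head translation along the optical axis, ignoring
# local motion such as lip or eye movement (L is empty).
head_motion = BodyMotion(G=GlobalMotion(R=np.eye(3), T=np.array([0.0, 0.0, 0.1])))
```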

6.2 Human motion analysis

The study of human motion dates back to the 1870s, when Muybridge [20] started his work. Since then, the field of human motion analysis has grown in many directions. However, research and results involving the recovery of human motion are still far from satisfactory. The science of human motion analysis is fascinating because of its highly interdisciplinary nature and wide range of applications. The modeling, tracking, and understanding of human motion have gained more and more attention, particularly in the last decade, with the emergence of applications in sports science, human-machine interaction, medicine, biomechanics, entertainment, surveillance, etc.

Fig. 6.1 illustrates different approaches to human motion analysis. Existing human motion tracking and analysis systems can be divided into two main groups: non-vision-based and vision-based systems.


6.2.1 Non-vision-based systems

In these systems, non-vision-based sensors are mounted on the human body to collect motion information and detect changes in body position. Several different types of sensors have been considered; inertial and magnetic sensors are widely used examples. Well-known types of inertial sensors are accelerometers and gyroscopes. An accelerometer is a device used to measure the physical acceleration experienced by the user [21] [22]. One of the drawbacks of accelerometers is the lack of information about rotation around the global Z-axis [23]. Hence, gyroscopes, which are capable of measuring angular velocity, can be used in combination with accelerometers to give a complete description of orientation [24]. However, the major disadvantage of inertial sensors is the drift problem: new positions are calculated from previous positions, meaning that any error in the measurements accumulates over time.
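A minimal simulation makes the drift problem tangible: double-integrating even small, zero-mean accelerometer noise yields a position estimate that wanders without bound. The sampling rate and noise level below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
dt, n = 0.01, 6000                              # 100 Hz samples over 60 s
true_acc = np.zeros(n)                          # the sensor is actually at rest
meas_acc = true_acc + rng.normal(0.0, 0.02, n)  # 0.02 m/s^2 sensor noise

vel = np.cumsum(meas_acc) * dt                  # first integration: velocity drift
pos = np.cumsum(vel) * dt                       # second integration: position drift

print(f"apparent displacement after 60 s at rest: {pos[-1]:+.3f} m")
```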

In [25], the authors show that adaptive filtering techniques can be used to handle the drift problem. They also suggest that utilizing magnetometers can minimize the drift in some situations. The use of magnetometers, or magnetic sensors, is reported in several works [26] [27]. A magnetometer is a device that measures the strength and direction of a magnetic field. The performance of magnetometers is affected by the presence of ferromagnetic materials in the surrounding environment [28].

6.2.2 Vision-based systems

Vision-based motion capture systems rely on cameras as optical sensors. Two different types can be identified: marker-based and marker-less systems.

The idea behind marker-based systems is to place some type of visual identifier on the joints to be tracked. Stereo cameras are then used to detect these markers and estimate the motion between consecutive frames. These systems are accurate and have been used successfully in biomedical applications [29] [30] [31]. However, many difficulties are associated with such a configuration. For instance, scale changes (user distance from the camera) and lighting conditions seriously affect performance. Additionally, marker-based systems suffer from occlusion (line-of-sight) problems whenever a required light path is blocked. Interference from other light sources or reflections may also be a problem, which can result in so-called ghost markers. The most important limitation of such systems is the need to attach special markers to the human body; furthermore, human motion can only be analyzed in a predefined area covered by fixed, expensive cameras.

Marker-less systems employ computer vision algorithms to estimate human motion without using markers. The use of cheap cameras is possible in such systems [32] [33] [34]. Nevertheless, removing the markers comes at the price of complicating the estimation of 3D non-rigid human motion. These systems can be divided into two groups: model-based and appearance-based systems.

Model-based methods

Model-based approaches use human body models and evaluate them against the available visual cues [35] [36] [37] [38] [39] [40] [41] [42] [43]. A matching function between the visual input and the generated appearance of the human body model is needed to evaluate how well a model instantiation explains the visual input. This is performed by formulating an optimization problem whose objective function measures the discrepancy between the visual observations and those expected from the generated model. The goal is to find a set of body pose parameters that minimizes the error. These pose (position and orientation) parameters define the configuration of the body parts.

One drawback of these methods is that initialization in the first frame of a sequence is needed, since each estimate is based on the previous one. Another problem is the high computational cost associated with model-based approaches. The employed optimization method must be able to evaluate the objective function at arbitrary points in the multidimensional model parameter space, and most of the computations need to be performed online. The resulting computational complexity is therefore the main drawback of these methods. On the positive side, such methods are more easily extendable.
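The following sketch illustrates this fitting loop under stated assumptions: `render` is a hypothetical stand-in for the model's appearance generator, and the derivative-free optimizer is one arbitrary choice among many.

```python
import numpy as np
from scipy.optimize import minimize

def fit_pose(observation, render, pose_prev):
    """Model-based pose estimation: minimize the discrepancy between the
    observed image and the rendered appearance of the body model.

    render(pose) -> synthetic image of the model at `pose` (hypothetical).
    pose_prev    -> the previous frame's estimate, used as the starting
                    point (hence the initialization problem noted above).
    """
    def objective(pose):
        return float(np.sum((render(pose) - observation) ** 2))

    return minimize(objective, pose_prev, method="Nelder-Mead").x
```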

Appearance-based methods

Appearance-based approaches aim to establish a direct relation between image observations and pose parameters. Three main classes can be identified in this group: learning-based, example-based, and feature-based approaches. Learning-based and example-based methods do not explicitly use human body models, but implicitly model variations in pose configuration, body shape, and appearance. Feature-based methods estimate pose parameters by extracting low-level and/or high-level features. Appearance-based methods do not need an initialization step and can be used to initialize model-based algorithms.

• Learning-based

These methods typically establish a mapping from image feature space to pose space using training data [44] [45] [46] [47] [48] [49] [50] [51]. The performance of these methods depends on the invariance properties of the extracted features, the number and diversity of the body postures to be analyzed, and the method used to derive the mapping. Due to their nature, these methods are well suited for problems such as body posture recognition, where a small set of known configurations needs to be recognized. Conversely, such methods are less suited for problems that require an accurate estimate of the human body pose. Additionally, generalization for such methods is achieved only through adequate training. The main advantage of these methods is that training is performed offline and online execution is typically computationally efficient.

• Example-based

Instead of learning a mapping from image space to pose space, example-based methods use a database of exemplars annotated with their corresponding position and orientation data [52] [53] [54] [55] [56] [57] [58]. For a given input image, a similarity search is performed and the best match is obtained from the database images. The corresponding pose information is then retrieved as the best estimate of the actual pose. The discriminative power of these algorithms depends on the database entries; in practice, this implies that body pose configurations must be densely sampled in the database. One drawback of these methods is the large amount of space needed to store the database. Another issue is employing an efficient search algorithm to estimate the similarity between input images and database entries. A brute-force version of this retrieval step is sketched below.
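In this sketch, the feature representation and the Euclidean similarity metric are illustrative assumptions; a system with a densely sampled database would replace the linear scan with an approximate nearest-neighbor index.

```python
import numpy as np

def retrieve_pose(query_feat, db_feats, db_poses):
    """Example-based pose estimation by nearest-neighbor retrieval.

    query_feat: (d,)   feature vector extracted from the input image
    db_feats:   (N, d) features of the annotated exemplar images
    db_poses:   (N, 6) position/orientation annotation per exemplar
    """
    dists = np.linalg.norm(db_feats - query_feat, axis=1)  # similarity search
    return db_poses[np.argmin(dists)]                      # pose of best match
```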

• Feature-based

Some methods estimate motion parameters by extracting only image features. These methods are usually developed for specific applications and cannot be generalized. Low-level features such as contours, edges, the centroid of the body region, and optical flow are utilized in various reported works [59] [60]. Low-level features are fairly robust to noise and can be extracted quickly. Another set of methods uses the location of high-level features such as fingers, fingertips, the nose tip, the eyes, and the mouth to infer head and hand motion parameters [61] [62] [63]. The major difficulty associated with feature-based methods is that body parts have to be localized prior to feature extraction.

6.3 Technical challenges

This thesis attempts to develop vision-based, marker-less methods to estimate and analyze human motion. Human motion analysis is a broad concept. Indeed, one could estimate as many details as the human body can exhibit, such as facial movements and changes in the skin surface caused by muscle tightening. In this research work, human motion analysis is limited to estimating the global motion of the head and hand, since they are the most frequently used body parts for interacting with computers. Note that we are only interested in estimating the motion parameters, not in interpreting the movements. To design and develop a robust motion estimation system, several difficulties need to be addressed:

Motion complexity: human motion analysis is a challenging problem due to the large variations in body motion and appearance. Although the head can be considered a rigid object with six-DOF motion, the hand is an articulated object with 27 DOF, i.e., 6 DOF for the wrist (global motion) and 21 DOF for the fingers (local motion). Natural hand motion does not exhibit all 27 DOF due to the interdependencies between fingers [64]; still, together with the location and orientation of the hand itself, a large number of parameters must be estimated.

Motion resolution: since the visual resolution of the observations is limited, small changes in the body pose can go unnoticed. Body motion results in changes in a small region of the scene, and as a consequence, small movements cannot be detected.

Rapid motion: we humans have very fast motion capabilities. Human hand motion, for instance, can reach speeds of up to 5 m/s in translation and 300°/s in wrist rotation [64]. Commercial cameras currently support 30-60 Hz frame rates, and it is a non-trivial task for many algorithms to achieve even a 30 Hz estimation speed. At 5 m/s and 30 Hz, for example, the hand can travel roughly 17 cm between consecutive frames. The combination of high-speed body motion and low sampling rates thus introduces extra difficulties for motion analysis algorithms, since consecutive frames become more uncorrelated as the speed of motion increases.

Self-occlusions: this problem dramatically affects the performance of hand pose estimation. Since the hand is an articulated object, its projection results in a large variety of shapes with many self-occlusions. This makes it difficult to segment different parts of the hand and extract high-level features.

Uncontrolled environments: for widespread use, many interactive systems would be expected to operate under non-uniform backgrounds and a wide range of lighting conditions. This poses many difficulties for vision-based motion analysis algorithms.

Processing speed: since the goal is to develop motion estimation systems that facilitate natural and enjoyable interaction with computers, these systems should be able to operate in real time.


Part III

HOW


Chapter 7

Revisiting the Problem of Human Motion Analysis

In this chapter we will explain how our goal is achieved. Technical contributions of the thesis are highlighted with respect to four important aspects of the research work. We review the contributions and refer to the papers for more detailed descriptions.

7.1 New strategy; right problem, right tool

There are already many powerful tools and advanced algorithms in the field for resolving the problem. Therefore, it is not really necessary to reinvent complex techniques. What we need is innovation: to define the right problem (or redefine the problem), and then select the right tools or integrate algorithms to form a solution.

In Paper I and Paper II we show that, in a particular case, the positions of the hand and fingertips are all that is needed to enable natural interaction with mobile devices. In this case, we do not set out to solve a complicated hand gesture recognition problem. We simply need to "see" the gesture as a large rotational symmetry feature. This is what we call recognition without intelligence.

Figure 7.1: Rotational symmetry patterns. From left: linear, curvature, and circular patterns with four different phases.

This method is based on low-level operators that detect natural features, without any attempt to build an intelligent system. For this reason, the approach is computationally efficient and can be used in real-time applications. We propose to take advantage of rotational symmetry patterns, i.e., the lines, curvatures, and circular patterns associated with the hand and fingertips, to detect the gesture. Rotational symmetries are specific curvature patterns detected from local orientation [65]. This theory was developed in the 1980s by Granlund and Knutsson [66]. The main idea is to use local orientation to detect complex curvatures in the double-angle representation. Fig. 7.1 illustrates three orders of rotational symmetry patterns with four different phases.
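As a rough sketch of the double-angle idea, the code below forms the double-angle orientation image from image gradients and correlates it with an n-th order symmetry basis exp(inφ). The gradient operator, window size, and absence of radial weighting or normalization are simplifying assumptions, so this is not the exact detector of Papers I and II.

```python
import numpy as np
from scipy.ndimage import sobel
from scipy.signal import convolve2d

def rotational_symmetry_response(image, order=2, size=15):
    """Local evidence for n-th order rotational symmetry patterns in the
    double-angle orientation field (after Granlund and Knutsson).

    order 0 ~ linear patterns, 1 ~ curvature patterns, 2 ~ circular patterns.
    """
    img = image.astype(float)
    gx = sobel(img, axis=1)                 # horizontal gradient
    gy = sobel(img, axis=0)                 # vertical gradient
    z = (gx + 1j * gy) ** 2                 # double-angle orientation image

    r = size // 2
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    phi = np.arctan2(ys, xs)
    basis = np.exp(1j * order * phi)        # n-th order symmetry pattern
    basis[r, r] = 0                         # angle undefined at the center

    # correlation magnitude: high where the pattern matches, e.g. at a
    # fingertip-like circular blob for order=2
    return np.abs(convolve2d(z, np.conj(basis), mode="same"))
```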

7.2 New setting; active vs. passive

In traditional system settings, all elements of the system have been arranged in a passive way: input devices and sensors have been placed at fixed positions, and likewise, users have sat still, acting passively in a very limited physical space. But we want to remove all physical constraints and allow users to move freely. This is what we aimed for: letting people move, and be active and healthy. Therefore, we introduce a new setting, the active setup, in which sensors are instead mounted on the human body. This concept is presented and discussed in Paper III, and also used in Paper VII. The technical question is whether or not the problem becomes easier with this kind of active thinking and acting. The answer is yes.

Figure 7.2: Top view of a head and a fixed camera. The head turns by angle θ, causing a change in the captured image. The amount of change depends on the camera location (A or B).

Several issues degrade system performance in the passive configuration. The most essential drawback is the resolution problem: human motion results in changes in only a small region of the scene, which increases the burden of detecting small movements accurately. We believe this problem can be easily resolved by employing the active setup. The active configuration can dramatically improve resolution; based on experiments in our lab, mounting the camera on the human body can enhance the resolution by an order of 10 times compared to the passive setup [67].

To illustrate the idea, consider a simple rotation around the Y-axis, as shown in Fig. 7.2. The figure shows a top view of an abstract human head and a camera. Two possible configurations are presented: placing the camera at point A, in front of the user (the passive setup), and mounting the camera on the head at point B (the active setup). Assume that the head rotates by angle θ. This causes a horizontal displacement of ∆x pixels of the projection of a world point P = (X, Y, Z)ᵀ in the captured image. Assuming the perspective camera model, ∆x is given by:

∆x = (f/k) · (∆X/Z)    (7.1)

where f represents the focal length and k the pixel size. If the camera is located at point A, then we have:

∆x1 = (f/k) · r1 sin θ / (r2 − r1 cos θ).    (7.2)

Since r1 cos θ is very small compared to r2, ∆x1 can be written as:

∆x1 ≈ (f/k) · r1 sin θ / r2.    (7.3)

Multiplying and dividing by cos θ yields:

∆x1 ≈ (f/k) · r1 sin θ cos θ / (r2 cos θ) = (f/k) · tan θ · (r1 cos θ / r2).    (7.4)

For the active setup, where the camera is mounted on the head at point B, we have:

∆x2 = (f/k) · r2 sin θ / (r2 cos θ) = (f/k) · tan θ.    (7.5)

Since r1 ≪ r2:

r1 cos θ / r2 ≪ 1  ⇒  ∆x1 ≪ ∆x2.    (7.6)

For example, if f = 3 mm, k = 10 µm, r1 = 10 cm, r2 = 100 cm, and θ = 45°, then the displacements for the two cases will be:

∆x1 = (3×10⁻³ / 10×10⁻⁶) × (1 × 10/100) × (1/√2) ≈ 21 pixels,    (7.7)

∆x2 = (3×10⁻³ / 10×10⁻⁶) × 1 = 300 pixels.    (7.8)

This indicates that motion detection is much easier when mounting the camera on the head, since the active camera configuration causes changes in the entire image, while the passive setup often affects only a small region of the image.
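The comparison is easy to reproduce; the snippet below simply evaluates Eqs. (7.4) and (7.5) with the example's camera parameters.

```python
import numpy as np

f = 3e-3                  # focal length: 3 mm
k = 10e-6                 # pixel size: 10 um
r1, r2 = 0.10, 1.00       # head radius and head-to-camera distance (m)
theta = np.deg2rad(45)

dx1 = (f / k) * np.tan(theta) * (r1 * np.cos(theta)) / r2  # passive, Eq. (7.4)
dx2 = (f / k) * np.tan(theta)                              # active,  Eq. (7.5)

print(f"passive setup: {dx1:.0f} px, active setup: {dx2:.0f} px")  # ~21 vs 300
```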


7.3 New tool; 3D vs. 2D

New sensors provide a new dimension and totally change the nature of the problem. This section investigates the problem of non-linearity inherent in classical motion estimation techniques using only 2D color images, and demonstrates how new 3D sensors empower us to tackle this issue.

Here the same notations used by Horn [68] are employed to explore the nonlinearity associated with 2D RGB image-based motion estimation techniques. First, we review the equations describing the relation between the motion of a camera and the optical flow generated by the motion. We can assume either a fixed camera in a changing environment or a moving camera in a static environment. Let us assume a moving camera in a static environment. A coordinate system can be fixed with respect to the camera, with the Z-axis pointing along the optical axis. The camera motion can be separated into two components: a translation and a rotation about an axis through the origin. The translational component is denoted by t and the angular velocity of the camera by ω. Let the instantaneous coordinates of a point P in the 3D environment be (X, Y, Z)^T. (Here Z > 0 for points in front of the imaging system.)

Let r be the column vector (X, Y, Z)^T, where T denotes the transpose. Then the velocity of P with respect to the XYZ coordinate system is

$$\mathbf{V} = -\mathbf{t} - \boldsymbol{\omega}\times\mathbf{r}. \qquad (7.9)$$

If we define the components of V, t, and ω as

$$\mathbf{V} = (\dot{X}, \dot{Y}, \dot{Z})^T,\quad \mathbf{t} = (U, V, W)^T,\quad \boldsymbol{\omega} = (A, B, C)^T,$$

we can rewrite the equation in component form as

$$\dot{X} = -U - BZ + CY$$
$$\dot{Y} = -V - CX + AZ$$
$$\dot{Z} = -W - AY + BX \qquad (7.10)$$

where the dot denotes differentiation with respect to time. The optical flow at each point in the image plane is the instantaneous velocity of the brightness pattern at that point [68]. Let (x, y) denote the coordinates of a point in the image plane. We assume perspective projection between an object point P and the corresponding image point p. Thus, the coordinates of p are

$$x = \frac{X}{Z} \quad\text{and}\quad y = \frac{Y}{Z}.$$


The optical flow at a point (x, y), denoted by (u, v), is

$$u = \dot{x} \quad\text{and}\quad v = \dot{y}.$$

Differentiating the equations for x and y with respect to time and using the derivatives of X, Y, and Z, we obtain the following equations for the optical flow [68]:

$$u = \frac{\dot{X}}{Z} - \frac{X\dot{Z}}{Z^2} = \left(-\frac{U}{Z} - B + Cy\right) - x\left(-\frac{W}{Z} - Ay + Bx\right), \qquad (7.11)$$

$$v = \frac{\dot{Y}}{Z} - \frac{Y\dot{Z}}{Z^2} = \left(-\frac{V}{Z} - Cx + A\right) - y\left(-\frac{W}{Z} - Ay + Bx\right). \qquad (7.12)$$

The resulting optical flow equations are inversely proportional to the distance of P from the camera (Z). Unlike the motion parameters (A, B, C, U, V, W), which are global and point-independent, Z varies from point to point. Therefore, Z should be eliminated from the optical flow equations.
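To make the pointwise dependence on Z concrete, the following Python sketch simply evaluates Eqs. (7.11) and (7.12) at a single image point; the function and its arguments are illustrative only.

```python
def optical_flow(x, y, Z, t, omega):
    """Evaluate the flow equations (7.11)-(7.12) at image point (x, y)
    with depth Z, translation t = (U, V, W), and rotation omega = (A, B, C)."""
    U, V, W = t
    A, B, C = omega
    u = (-U / Z - B + C * y) - x * (-W / Z - A * y + B * x)
    v = (-V / Z - C * x + A) - y * (-W / Z - A * y + B * x)
    return u, v

# Pure translation along the optical axis produces a radial flow field whose
# magnitude shrinks as the depth Z grows: the flow is pointwise in Z.
print(optical_flow(0.1, 0.2, Z=1.0, t=(0, 0, 1), omega=(0, 0, 0)))  # (0.1, 0.2)
print(optical_flow(0.1, 0.2, Z=4.0, t=(0, 0, 1), omega=(0, 0, 0)))  # (0.025, 0.05)
```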

After removing Z from the equations, we eventually obtain the following equation at each point:

$$x(UC - Wv + AW) + y(VC + Wu + BW) + xy(BU + VA) - y^2(CW + AU) - x^2(VB + CW) - V(B + u) + U(v - A) = 0. \qquad (7.13)$$

Here the problem arises, since the final equation is nonlinear and pointwise in terms of u and v. However, this issue can be resolved simply by utilizing depth information acquired by new 3D sensors. In Paper IV, Paper V, and Paper VI we show how new 3D sensors enable us to develop a linear 3D motion estimation method.

7.4 New perspective; search-based motion estimation

Due to the increasing capability of computers to store and process extremely large databases, we introduce a novel approach and propose to shift the complexity from computer vision and pattern recognition algorithms to large-scale image retrieval methods. The idea is to redefine the motion estimation problem as an image search problem. The new search-based method is based on storing an extremely large database of body gesture images and retrieving the best match from the database. The most important point is that the database should include body gesture images annotated with their corresponding position and orientation information.

Figure 7.3: Search-based motion estimation block diagram [69].

The block diagram of the search-based method is illustrated in Fig. 7.3. It includes four main components: an annotated and indexed gesture database, an image query processing unit, a search engine, and the interface level. The query image is processed and its visual features are extracted for similarity analysis. The search engine retrieves the best image based on the similarity of the query image to the database entries. Eventually, the best match and its position and orientation information are used to facilitate different applications. This approach might seem similar to example-based methods; however, there is a systematic difference between them. The main idea is to form a large database, hundreds of thousands of images, and store it in a rather small space, which is one of the challenging issues in example-based methods. In addition, in the proposed method, searching the large-scale database can be performed more efficiently, in real time.
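As a rough illustration of the retrieval idea (not the patented method, whose details are withheld below), a nearest-neighbour lookup over annotated descriptors might look as follows; the descriptors, similarity measure, and pose labels here are all placeholders.

```python
import numpy as np

def retrieve_pose(db_features, db_poses, query_feature):
    """Return the pose annotation of the database entry most similar
    to the query descriptor (cosine similarity, brute force)."""
    sims = db_features @ query_feature / (
        np.linalg.norm(db_features, axis=1) * np.linalg.norm(query_feature))
    best = int(np.argmax(sims))
    return db_poses[best], float(sims[best])

# Stand-in database: random 128-D gesture descriptors, each annotated with
# (position, orientation) labels as the text requires.
db_features = np.random.rand(100_000, 128)
db_poses = [{"position": i, "orientation": i} for i in range(100_000)]
pose, score = retrieve_pose(db_features, db_poses, np.random.rand(128))
```

A real system of this scale would replace the brute-force scan with an approximate index so that retrieval stays real-time as the database grows.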

This method has been filed as a U.S. patent application. It has successfully passed the novelty analysis step and is currently undergoing the necessary processing. Due to patent application restrictions, a detailed description of this method cannot be included in this thesis.


Chapter 8

Comparative Evaluation of Developed Systems

Technical contributions of this research work have resulted in the development of the following systems:

• 3D interaction with mobile devices (Paper I and Paper II)

• Active motion estimation system (Paper III)

• Direct motion estimation system (Paper IV)

• Depth-based gestural interaction system (Paper VI)

In this chapter, the developed systems are quantitatively and qualitatively compared to the state of the art, and the results are presented.

8.1 3D interaction with mobile devices

This system provides natural interaction with mobile devices. The system was developed for finger tracking in Paper I, and improved to extract 3D hand motion in Paper II. Extensive experiments have been carried out to evaluate the performance of the system, both on desktop and mobile platforms. The experimental results are discussed in Paper I and Paper II. Here we provide a comparative evaluation.

Due to the high computational cost, most available marker-less systems cannot perform real-time gesture analysis on mobile devices [70] [71] [72] [73].

Table 8.1: Comparison between the proposed mobile interaction system and other vision-based works (NA: not available).

| Ref. | Approach | Real-time | Detection rate / error | Stationary / mobile | Tracking / 3D motion |
|------|----------|-----------|------------------------|---------------------|----------------------|
| [70] | Model-based | No | NA / NA | Yes / No | Yes / Yes |
| [71] | Model-based | No | NA / NA | Yes / No | Yes / Yes |
| [72] | Learning-based | Yes | NA / NA | Yes / No | Yes / No |
| [74] | Feature-based | Yes | ≈ 98% / NA | Yes / No | Yes / No |
| [73] | Feature-based | Yes | ≈ 84% / NA | Yes / No | Yes / No |
| [75] | Feature-based | Yes | NA / NA | Yes / Yes | Yes / No |
| Our system | Feature-based | Yes | ≈ 100% / 6.6 pixels | Yes / Yes | Yes / Yes |

However, we can evaluate our technical contribution from the different aspects of detection, tracking, and 3D motion analysis. Our algorithm always detects the gesture, with a flexibility of ±45 degrees of rotation around the x, y, and z axes within the interaction space. This detection rate is quite promising in comparison with the 84.4% and 98.42% reported by [73] and [74] (both algorithms only work on stationary systems). Moreover, the mean error of the localized gesture in our algorithm is 6.59 pixels, which is a substantial improvement (a reduction by half) over the system proposed by [72]. In addition, our system provides 3D motion analysis by retrieving the rotation parameters between consecutive image frames. Since the relative distance between the user's gesture and the device's camera is rather small, this configuration is almost the same as the active configuration proposed in Paper III. This accurate system can guarantee effective 3D manipulation in virtual and augmented reality environments. In Table 8.1, characteristics of the proposed system and other vision-based related works are compared from different aspects.


Table 8.2: Comparative evaluation of the proposed active motion estimation approach (NA: not available, MAE: mean absolute error).

| Ref. | Approach | Features | Camera | MAE | Real-time |
|------|----------|----------|--------|------|-----------|
| [76] | Active configuration | SIFT | Multi-camera | 3.0° | No |
| Our system | Active configuration | SIFT | Single | 0.8° | Yes |

8.2 Active motion estimation

Active configuration for human motion analysis is introduced in Paper III. By introducing a new setting, we propose to mount optical sensors onto the human body to measure and analyze body movements. Our system is camera-based and relies on the rich data in a detailed view of the environment. This system recovers motion parameters by extracting stable features in the scene and finding corresponding points in consecutive frames.
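For orientation, a minimal OpenCV sketch of such a feature-based pipeline is given below: SIFT features are matched between consecutive frames and the relative camera motion is recovered up to scale. This is a generic visual-odometry skeleton under our own assumptions, not the implementation of Paper III.

```python
import cv2
import numpy as np

def relative_motion(frame_prev, frame_next, K):
    """Recover relative rotation R and (unit-scale) translation t between
    two grayscale frames from matched SIFT features; K is the 3x3 intrinsics."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(frame_prev, None)
    kp2, des2 = sift.detectAndCompute(frame_next, None)
    # Lowe's ratio test keeps only distinctive correspondences
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    # Essential matrix with RANSAC, then cheirality check to pick R, t
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t
```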

While this approach has previously been attempted in visual odometry for localization in GPS-blind environments [77] [78] [79] and in Simultaneous Localization and Mapping (SLAM) for estimating the motion of moving platforms [80] [81] [82], we propose to take advantage of the active configuration for human motion estimation.

A closely related work is that of Shiratori et al. [76]. In both works, an active setup is used to estimate human motion, and a color camera is the only sensor employed. However, Shiratori et al. utilize a set of cameras attached to the body to estimate the pose parameters. Thus, instead of recovering the independent ego-motion of individual cameras, they try to reconstruct the 3D motion of a set of cameras related by an underlying articulated structure.

To quantify the angular accuracy of the system, we have built an electronic measuring device, which is presented in Paper III. Outputs of this measuring device are considered as the Ground Truth to evaluate the system performance. The device consists of a protractor, a servo motor with an indicator, and a control board connected to a power supply. A normal webcam is also fixed on the servo motor, so its rotation is synchronized with the servo. The servo can move in two different directions at a specified speed, and its actual rotation value (the Ground Truth) is indicated on the protractor. Whenever the servo motor rotates, the camera is rotated and the video frames are fed to the system to estimate the camera rotation parameters. These values are used to measure the accuracy of the system. Three different setups are used to analyze the system performance in rotations around the X, Y, and Z axes. The servo motor can be operated through the Lynx SSC-32 terminal [83]. This servo controller has a range of 180° with a high resolution of 0.09°, and can be used for accurate positioning. The performance of the system is compared to that of [76] and is presented in Table 8.2.
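For completeness, commanding such a servo rig could look like the hedged pyserial sketch below; the port, baud rate, and pulse-width mapping are assumptions based on common SSC-32 usage, not values from Paper III.

```python
import serial  # pyserial

def set_servo_angle(port, channel, angle_deg, move_time_ms=1000):
    """Drive one SSC-32 channel: pulse widths of 500-2500 us are assumed
    to map linearly onto 0-180 degrees (roughly 0.09 degrees per step)."""
    pulse_us = int(500 + (angle_deg / 180.0) * 2000)
    command = "#%dP%dT%d\r" % (channel, pulse_us, move_time_ms)
    port.write(command.encode("ascii"))

with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as port:  # assumed port
    set_servo_angle(port, channel=0, angle_deg=45.0)
```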

8.3 Direct motion estimation

One of the objectives of this thesis is to explore the opportunity provided by new 3D sensors. Range sensors provide depth information and introduce new possibilities for developing motion estimation techniques. By employing depth images, we introduce a direct method for 3D rigid motion estimation in Paper IV.
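In the spirit of treating range images as intensity images (the direct method itself is described in Paper IV), one could, for instance, feed normalized depth frames to a standard dense optical flow routine; the sketch below is such an approximation under our own assumptions, not the method's actual formulation.

```python
import cv2
import numpy as np

def depth_flow(depth_prev, depth_next):
    """Dense optical flow computed directly on two range images by
    normalizing them to 8-bit and applying the Farneback algorithm."""
    prev8 = cv2.normalize(depth_prev, None, 0, 255,
                          cv2.NORM_MINMAX).astype(np.uint8)
    next8 = cv2.normalize(depth_next, None, 0, 255,
                          cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.calcOpticalFlowFarneback(prev8, next8, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
```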

Due to the availability of Ground-Truth data, the performance of the direct motion estimation system can be evaluated in comparison with other methods. Two different approaches, model-based [84] and learning-based [85], are used for the comparative evaluation. In [84], the authors used a 3D head model to estimate the pose parameters: the head pose estimation problem is formulated as an optimization problem solved with Particle Swarm Optimization. Instead of using a head model, [85] utilizes random regression forests to establish a mapping between depth features and head pose parameters.

The Biwi database [86] has been used to evaluate the methods. The database contains over 15K depth images (640 x 480 pixels) of 20 people. For each depth image, the Ground-Truth head motion parameters are provided. Table 8.3 presents the mean and standard deviation of the errors for the 3D head localization task and the individual rotation angles, together with the required processing time. Although the model-based approach provides higher accuracy, it is computationally costly and is not favorable for real-time applications. On the other hand, the learning-based method performs faster than real time, though offering low accuracy. In contrast to these systems, our approach provides acceptable accuracy with low processing time, which makes this method suitable for real-time applications.

Table 8.3: Properties of the model-based and learning-based methods for the head pose estimation problem are compared with the proposed approach of this thesis.

| Ref. | Approach | Localization error (mm) | Pitch error | Yaw error | Roll error | Time |
|------|----------|-------------------------|-------------|-----------|------------|------|
| [84] | Model-based | 2.8 ± 1.8 | 1.3 ± 1.1° | 1.1 ± 1.0° | 1.7 ± 1.7° | 100 ms |
| [85] | Learning-based | 14.6 ± 22.3 | 8.5 ± 9.9° | 8.9 ± 13.0° | 7.9 ± 8.3° | 25 ms |
| Our method | Feature-based | 10.2 ± 17.6 | 6.4 ± 7.9° | 6.3 ± 8.1° | 6.5 ± 8.8° | 40 ms |

8.4 Depth-based gestural interaction

In Paper VI, a novel approach for performing gestural interaction is presented. Using the direct motion estimation technique, we developed a system that is able to extract the global hand motion parameters. The hand pose parameters can be employed to provide natural interaction with computers. Several experiments were conducted to demonstrate the system's accuracy and its capacity to accommodate different interaction procedures, which are discussed in more detail in Paper VI.

In this part, the proposed depth-based gestural interaction system is compared to other depth-based approaches with respect to different aspects. Three model-based [41] [42] [87] and two example-based [88] [89] systems were selected for the evaluation. The results of the comparative analysis are presented in Table 8.4. Model-based approaches estimate global hand motion and full joint angles. Note that in [42] the pose parameters are estimated for two hands.

Model-based methods are computationally costly and cannot be used in real-time applications. However, they provide continuous solutions and can be generalized. Although example-based methods are well suited for real-time applications, they only estimate hand position or orientation, and are not extendible. In comparison with these systems, our system is able to estimate global hand motion parameters continuously and fast, which is desirable for real-time applications. Like model-based systems, the proposed system can be generalized to other manipulative gestures.


Table 8.4: Comparison between the proposed depth-based gestural interaction system and other approaches.

| Ref. | Approach | Estimated parameters | Continuous estimation | Complexity | Extendible | Time |
|------|----------|----------------------|-----------------------|------------|------------|------|
| [41] | Model-based | Global hand motion + joint angles | Yes | High | Yes | 40 s |
| [42] | Model-based | Global hand motion + joint angles | Yes | High | Yes | 250 ms |
| [87] | Model-based | Global hand motion + joint angles | Yes | High | Yes | 66 ms |
| [88] | Example-based | Hand orientation | No | Low | No | 34 ms |
| [89] | Example-based | Hand position | No | Low | No | 34 ms |
| Our system | Feature-based | Global hand motion | Yes | Low | Yes | 36 ms |


Chapter 9

Summary of the Selected Articles

This thesis is based on the contributions of seven papers, which are referred to in the text and presented in the last part of the thesis. Paper I results from an equal collaboration between the first author and Farid Abedan Kondori. In the other papers Farid Abedan Kondori is the main author and has made major contributions to developing theories, implementations, experiments, and writing. Associate Professor Li Liu and Professor Haibo Li have inspired the discussions and have supervised Farid Abedan Kondori during the PhD study. Shahrouz Yousefi, PhD candidate, and Jean-Paul Kouma, licentiate of engineering in computer vision, have participated in the discussions and have assisted Farid Abedan Kondori in some experiments.

Paper I
Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li
Camera-based gesture tracking for 3D interaction behind mobile devices.
International Journal of Pattern Recognition and Artificial Intelligence, Vol. 26, No. 8, 2012.
In this paper we demonstrate how the position of fingertips can be utilized to provide natural interaction with mobile devices. This paper suggests the use of first order rotational symmetry patterns, curvatures, to detect fingertips. The method is based on low-level operators, detecting natural features associated with fingertips without any attempt to build an intelligent system, i.e. recognition without intelligence. The proposed approach is computationally efficient and can be used in real-time applications.

Paper II
Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li
Real 3D interaction behind mobile phones for augmented environments.
In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME2011), Barcelona, Spain, July 2011.
This paper suggests employing 3D hand gesture motion to enable a new way to interact with mobile phones in the 3D space behind the device. Hand gesture recognition is achieved by detecting second order rotational symmetry patterns, i.e. circular patterns. The relative 3D motion of the hand gesture between consecutive frames is estimated by extracting stable features in the scene. We demonstrate how 3D hand motion recovery provides a novel way to manipulate objects in augmented reality applications.

Paper III
Farid Abedan Kondori, Li Liu
3D active human motion estimation for biomedical applications.
In Proceedings of the World Congress on Medical Physics and Biomedical Engineering (WC2012), Beijing, China, May 2012.
In this paper, a vision-based active human motion estimation system is developed to diagnose and treat movement disorders. One of the important factors in healthcare is to observe and monitor patients in their daily life activities. However, current motion analysis systems are confined to lab environments and are spatially limited. This hinders the possibility of observing patients while moving naturally and freely, a fact that can affect the quality of the diagnosis. By introducing a new setting, we propose to mount cameras on the human body to measure and analyze body movements. The system extracts stable features in the scene and finds corresponding points in consecutive frames. Afterwards it recovers the 3D motion parameters.

Paper IV
Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li
Direct 3D head pose estimation from Kinect-type sensors.
Electronics Letters, Vol. 50, No. 4, 2014.
In this paper, it is shown how to take advantage of new 3D sensors to introduce a direct method for estimating 3D motion parameters. Based on the range images, we derive a new version of the optical flow constraint equation to directly estimate 3D motion parameters without any need to impose other constraints. In this work, the range images are treated as classical intensity images to obtain the new constraint equation. The new optical flow constraint equation is employed to recover the 3D motion of a moving head from sequences of range images. Furthermore, a closed-loop feedback system is proposed to handle the case when the optical flow is large.

Paper V
Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li
Head operated electric wheelchair.
In Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI2014), San Diego, California, USA, April 2014.
In order to steer and control an electric wheelchair, this paper proposes to utilize the user's head orientation. Most of the electric wheelchairs available on the market are joystick-driven, and therefore assume that users are able to use their fine motor skills to drive the wheelchair. Unfortunately, this is not valid for many patients, such as those suffering from quadriplegia. These patients are usually unable to control an electric wheelchair using conventional methods, such as a joystick or chin stick. This paper introduces a novel assistive technology to help these patients. By acquiring depth information from a Kinect, the user's head orientation is estimated and mapped onto control commands to steer the wheelchair.
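As a toy illustration of the orientation-to-command mapping described above, one might threshold the estimated yaw and pitch as below; the command set and dead-zone value are invented for the example, not taken from Paper V.

```python
def head_to_command(yaw_deg, pitch_deg, dead_zone=10.0):
    """Map an estimated head orientation onto a discrete wheelchair command."""
    if abs(yaw_deg) > abs(pitch_deg):          # turning dominates
        if yaw_deg > dead_zone:
            return "TURN_RIGHT"
        if yaw_deg < -dead_zone:
            return "TURN_LEFT"
    if pitch_deg > dead_zone:                  # head tilted forward
        return "FORWARD"
    if pitch_deg < -dead_zone:                 # head tilted back
        return "STOP"
    return "IDLE"
```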

Paper VI
Farid Abedan Kondori, Shahrouz Yousefi, Jean-Paul Kouma, Li Liu, Haibo Li
Direct hand pose estimation for immersive gestural interaction.
Submitted to Pattern Recognition Letters, February 2014.
This paper presents a novel approach for performing intuitive gestural interaction. The main issue, i.e. dynamic hand gesture recognition, is addressed as a combination of two tasks: gesture recognition and gesture pose estimation. The former initiates the gestural interaction by recognizing a gesture as belonging to a predefined set of gestures, while the latter provides the information required for recognition of the gesture over the course of interaction. To ease dynamic gesture recognition, this paper proposes a direct method for hand pose recovery. We demonstrate the system performance in 3D object manipulation on two different setups: desktop computing and mobile platforms. This reveals the system's capability to accommodate different interaction procedures. Additionally, a usability test is conducted to evaluate learnability, user experience, and interaction quality in 3D gestural interaction in comparison to 2D touch-screen interaction.

Paper VII
Farid Abedan Kondori, Li Liu, Haibo Li
Telelife; a new concept for future rehabilitation systems?
Submitted to the IEEE Communications Magazine, April 2014.
In this paper we introduce telelife, a new concept for future rehabilitation systems. Recently, telerehabilitation systems for home-based therapy have altered healthcare systems. The main focus of these systems is on helping patients recover physical health. Although telerehabilitation provides great opportunities, two major issues affect its effectiveness: the relegation of the patient at home, and the loss of direct supervision by the therapist. To complement telerehabilitation, this paper proposes telelife. Telelife can enhance patients' physical, mental, and emotional health. Moreover, it improves patients' social skills and also enables direct supervision by therapists. In this work we introduce telelife to enhance telerehabilitation, and investigate technical challenges and possible solutions to achieve it.

9.1 List of all publications

All the contributions that have been published during the PhD study are listed here.

Journal articles

• Farid Abedan Kondori, Li Liu, Haibo Li. Telelife; a new concept for future rehabilitation systems? Submitted to the IEEE Communications Magazine, 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Jean-Paul Kouma, Li Liu, Haibo Li. Direct hand pose estimation for immersive gestural interaction. Submitted to Pattern Recognition Letters, 2014.


• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li. Direct 3D head pose estimation from Kinect-type sensors. Electronics Letters, Vol. 50, No. 4, 2014.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. Experiencing real 3D gestural interaction with mobile devices. Pattern Recognition Letters, Vol. 34, No. 8, 2013.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. Camera-based gesture tracking for 3D interaction behind mobile devices. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 26, No. 8, 2012.

Conference papers

• Farid Abedan Kondori, Shahrouz Yousefi, Ahmad Ostovar, Li Liu, Haibo Li. A direct method for 3D hand pose recovery. Accepted in the 22nd International Conference on Pattern Recognition (ICPR2014), August 24-28, 2014, Stockholm, Sweden.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. Interactive 3D visualization on a 4K wall-sized display. Submitted to the IEEE International Conference on Image Processing (ICIP2014), Paris, France, 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li. Head operated electric wheelchair. In Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI2014), April 6-8, 2014, San Diego, California, USA.

• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu. Active human gesture capture for diagnosing and treating movement disorders. In Proceedings of The Swedish Symposium on Image Analysis (SSBA2013), Gothenburg, Sweden, March 2013.

• Farid Abedan Kondori, Li Liu. 3D active human motion estimation for biomedical applications. In Proceedings of The World Congress on Medical Physics and Biomedical Engineering (WC2012), Beijing, China, May 26-31, 2012.


• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Samuel Sonning, Sabina Sonning. 3D head pose estimation using the Kinect. In Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP2011), Nanjing, China, November 9-11, 2011, pp.1-4.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li. Real 3D interaction behind mobile phones for augmented environments. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME2011), Barcelona, Spain, July 11-15, 2011, pp.1-6.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li. Gesture tracking for 3D interaction in augmented environments. In Proceedings of The Swedish Symposium on Image Analysis (SSBA2011), Linköping, Sweden, March 17-18, 2011.

• Farid Abedan Kondori, Shahrouz Yousefi. Smart baggage in aviation. In Proceedings of the IEEE International Conference on Internet of Things and International Conference on Cyber, Physical and Social Computing, Dalian, China, October 19-22, 2011, pp.620-623.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. Robust correction of 3D geo-metadata in photo collections by forming a photo grid. In Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP2011), Nanjing, China, November 9-11, 2011, pp.1-5.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. 3D gestural interaction for stereoscopic visualization on mobile devices. In Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns (CAIP'11), Seville, Spain, August 29-31, 2011, pp.555-562.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. 3D visualization of single images using patch level depth. In Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP2011), Seville, Spain, July 18-21, 2011, pp.61-66.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. Stereoscopic visualization of monocular images in photo collections. In Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP2011), Nanjing, China, November 9-11, 2011, pp.1-5.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. 3D visualization of monocular images in photo collections. In Proceedings of The Swedish Symposium on Image Analysis (SSBA2011), Linköping, Sweden, March 17-18, 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. Tracking fingers in 3D space for mobile interaction. In Proceedings of The 20th International Conference on Pattern Recognition (ICPR), The Second International Workshop on Mobile Multimedia Processing (WMMP), Istanbul, Turkey, August 2010, pp.72-79.

Licentiate thesis

• Farid Abedan Kondori. Human motion analysis for creating immersive experiences. Umeå University, Umeå, Sweden, April 2012.

Patents

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li. Real-time 3D gesture recognition and tracking system for mobile devices. U.S. Patent Application, filed January 2014, Patent Pending.


Chapter 10

Contributions, Conclusions, and Future Directions

10.1 Contributions

Contributions of this thesis can be divided into two parts: technical contributions towards body motion analysis, and implementation of the developed methods for providing bodily interaction in various application scenarios.

10.1.1 Body motion analysis

Technical contributions of this thesis are focused on body gesture detection, tracking, and analysis. These contributions can be listed as:

• Hand gesture recognition and tracking based on low-level operations.

• Relative motion estimation between consecutive frames by extracting and tracking stable landmarks in the environment.

• Adopting an active configuration for human motion analysis.

• Taking advantage of 3D sensors to develop a direct and linear motion estimation method.

• Introducing a novel approach for body gesture analysis based on a large-scale search framework.


10.1.2 Bodily interaction

The proposed motion analysis systems can be utilized to facilitate a healthy, intuitive, and pleasurable interaction in different applications. In these applications, gesture motion is captured, and detailed analysis of the motion is used to provide different functionalities. The application scenarios are summarized in the following:

• 3D gestural interaction with mobile devices.

• Bare-hand interaction with computers for 3D object manipulation.

• Gestural interaction for browsing medical images in the operating rooms.

• Assistive technology for controlling electric wheelchairs.

• Analysis of human daily life activities for diagnosing and treating movement disorders.

10.2 Conclusions

This thesis aimed to propose bodily interaction methods that empower humans to have a healthy, natural, and enjoyable interaction with computers. In order to achieve our aim, we have developed computer vision-based, marker-less systems that use low-level features to detect, track, and analyze body gestures.

• We presented a novel approach for recovering the six degrees of freedom of hand motion. This approach is computationally efficient and relies on low-level operations, detecting natural features with the minimum level of intelligence. Due to its low complexity, this method can be used in real-time applications to estimate the hand motion. However, rapid hand motions affect the system performance.

• This thesis proposed an active configuration to capture and analyze body motion. Mounting the camera on the body, we have built a convenient, low-cost motion capture system that can be used even in daily-life activities. This system provides high resolution and can be used with high accuracy in real-time applications. Detecting a sufficient number of interest points in the scene is the only requirement of the proposed method.


• Moreover, we derived a new version of the optical flow constraint equation for range images. This new equation not only enables a direct recovery of 3D motion, but also offers an opportunity to re-use the techniques, tools, and experience developed and accumulated around optical flow computation in the computer vision field. Simply by treating range images as conventional intensity images, we can apply the proven techniques and tools of optical flow computation to recovering 3D motion parameters. Obviously, more experiments need to be done to verify whether the new equation could act as a fundamental equation in the coming era of range data, just as the normal optical flow equation does in motion estimation from 2D intensity images.

• In addition, the main challenge in providing bare-hand gestural interaction, i.e. dynamic gesture recognition, has been discussed in this thesis. This issue is regarded as a combination of two tasks: gesture recognition and gesture motion estimation. We proposed to incorporate a fast and robust hand pose estimation method to ease the problem.

• We also demonstrated that advanced algorithms from other research areas can be employed to tackle the problem of human motion analysis. In order to handle the diversity and flexibility of body movements, we decided to shift the complexity from the classical motion estimation problem to a large-scale image retrieval system. Due to the availability of large-scale image databases, the new problem is to find the best match for the query among the whole database. In fact, for a query image or video representing a unique body gesture with a specific position and orientation, the challenging task is to retrieve the most similar image from the database, i.e. one that represents the same gesture with maximum similarity in position and orientation. In this approach, annotating the database images with their actual pose parameters is crucially important.


10.3 Future directions

Our vision is to improve human well-being by means of a natural and enjoyable interaction. Vision-driven design¹ can result in quantum leaps in the way we interact with computers. It complements needs-driven and technology-driven design by looking beyond today's limits. Focusing on user needs and centralizing what the user wants from technology has been essential for designing interaction methods. However, the new vision extends this view and concerns how human values can be supported through new technology.

The motivation for this work was to introduce bodily interaction methods that will support human well-being. The main reason why we focus on this vision-driven approach is its life span. Technologies can become outdated in a few years; user needs can change quickly, and applications become obsolete within a decade. However, vision-driven approaches can last for a long time. Improving human physical, mental, emotional, and social well-being will be central for future technologies. Bodily interaction with computers can enhance human well-being. As a result, utilizing the human body in different situations will be a common approach in the design and development of future interaction methods.

Although we showed the high potential of computer vision-based systems, using other sensing modalities can facilitate bodily interaction systems. Indeed, bodily interaction methods can benefit from the fusion of vision- and non-vision-based systems.

¹Note the difference between vision-driven design and vision-based motion estimation. Vision-driven design refers to our vision, i.e. improving human well-being.


Bibliography

[1] R. Harper, T. Rodden, Y. Rogers, and A. Sellen, eds., Being Human: Human-Computer Interaction in the Year 2020. Cambridge, England: Microsoft Research Ltd, 2008.

[2] H. Tobiasson, There's More to Movement than Meets the Eye: Perspectives on Physical Interaction. No. 2010:17 in Trita-CSC-A, KTH, 2010.

[3] M. C. Miller, Understanding Depression. Harvard Health Publications, 2013.

[4] N. Cavill, S. Kahlmeier, and F. Racioppi, Physical Activity and Health in Europe: Evidence for Action. World Health Organization, 2006.

[5] Issues and Concerns for Australian Relationships Today. Relationships Australia Inc, 2011.

[6] A. Pease and B. Pease, The Definitive Book of Body Language. Bantam Dell, 2006.

[7] https://www.djc.com/news/ae/12019115.html/.

[8] J. Wachs, "Robot, pass me the scissors! How robots can assist us in the operating room," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, vol. 7441 of Lecture Notes in Computer Science, pp. 46-57, Springer Berlin Heidelberg, 2012.

[9] http://www.youtube.com/watch?v=eVngyL_6TUY.

[10] M. Zampolini, E. Todeschini, M. B. Guitart, H. Hermens, S. Ilsbroukx, V. Macellari, R. Magni, M. Rogante, S. S. Marchese, M. Vollenbroek, and C. Giacomozzi, "Tele-rehabilitation: present and future," Ann Ist Super Sanita, vol. 44, no. 2, pp. 125-134, 2008.

[11] F. Kondori and L. Liu, "3D active human motion estimation for biomedical applications," in World Congress on Medical Physics and Biomedical Engineering May 26-31, 2012, Beijing, China (M. Long, ed.), vol. 39 of IFMBE Proceedings, pp. 1014-1017, Springer Berlin Heidelberg, 2013.

[12] S. Mann, Wearable Computing. Aarhus, Denmark: The Interaction Design Foundation, 2013.

[13] http://www.google.com/glass/start/.

[14] http://se.creative.com/p/web-cameras/creative senz3d/.

[15] http://research.microsoft.com/en us/um/redmond/projects/kinectsdk/default.aspx/.

[16] http://structure.io/#/.

[17] https://www.leapmotion.com/.

[18] https://www.digitaltrends.com/computing/structure-sensor-shrinks-3d-scanning-down-to-an-ipad-attachment/#ixzz2uR5FiXd1 /.

[19] https://www.google.com/atap/projecttango/.

[20] E. Muybridge, Muybridge's Complete Human and Animal Locomotion: All 781 Plates from the 1887 Animal Locomotion. Dover Publications, 1979.

[21] F. Foerster, "Detection of posture and motion by accelerometry: a validation study in ambulatory monitoring," Computers in Human Behavior, vol. 15, no. 5, pp. 571-583, 1999.

[22] G. Kamen, C. Patten, C. D. Du, and S. Sison, "An accelerometry-based system for the assessment of balance and postural sway," Gerontologia (Basel), vol. 44, no. 1, pp. 40-45, 1998.

[23] H. J. Luinge and P. H. Veltink, "Measuring orientation of human body segments using miniature gyroscopes and accelerometers," Medical & Biological Engineering & Computing, vol. 43, pp. 273-282, Mar. 2005.

[24] B. Kemp, A. J. Janssen, and B. van der Kamp, "Body position can be monitored in 3D using miniature accelerometers and earth-magnetic field sensors," Electroencephalography and Clinical Neurophysiology/Electromyography and Motor Control, vol. 109, no. 6, pp. 484-488, 1998.

[25] F. Öhberg, R. Lundström, and H. Grip, "Comparative analysis of different adaptive filters for tracking lower segments of a human body using inertial motion sensors," Measurement Science and Technology, vol. 24, no. 8, pp. 085703-, 2013.

[26] O. Suess, S. Suess, S. Mularski, B. Kuhn, T. Picht, S. Hammersen, R. Stendel, M. Brock, and T. Kombos, "Study on the clinical application of pulsed DC magnetic technology for tracking of intraoperative head motion during frameless stereotaxy," Head & Face Medicine, vol. 2, no. 1, p. 10, 2006.

[27] E. R. Bachmann, R. B. McGhee, X. Yun, and M. J. Zyda, "Inertial and magnetic posture tracking for inserting humans into networked virtual environments," in ACM Symposium on Virtual Reality Software and Technology (VRST), pp. 9-16, ACM, 2001.

[28] H. Zheng, N. Black, and N. Harris, "Position-sensing technologies for movement analysis in stroke rehabilitation," Medical and Biological Engineering and Computing, vol. 43, pp. 413-420, 2005. 10.1007/BF02344720.

[29] I. Davis, S. Ounpuu, D. Tyburski, and J. R. Gage, "A gait analysis data collection and reduction technique," Human Movement Science, vol. 10, pp. 575-587, Oct. 1991.

[30] I. Charlton, P. Tate, P. Smyth, and L. Roren, "Repeatability of an optimised lower body model," Gait & Posture, vol. 20, no. 2, pp. 213-221, 2004.

[31] E. Delahunt, K. Monaghan, and B. Caulfield, "Ankle function during hopping in subjects with functional instability of the ankle joint," Scandinavian Journal of Medicine & Science in Sports, vol. 17, no. 6, pp. 641-648, 2007.

[32] R. Poppe, "Vision-based human motion analysis: An overview," Comput. Vis. Image Underst., vol. 108, pp. 4-18, Oct. 2007.

[33] T. B. Moeslund, A. Hilton, and V. Kruger, "A survey of advances in vision-based human motion capture and analysis," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 90-126, 2006. Special Issue on Modeling People: Vision-based understanding of a person's shape, appearance, movement and behaviour.

[34] J. J. Wang and S. Singh, "Video analysis of human dynamics: a survey," Real-Time Imaging, vol. 9, no. 5, pp. 321-346, 2003.

[35] A. Agarwal and B. Triggs, "Recovering 3D human pose from monocular images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 44-58, 2006.

[36] C. BenAbdelkader and L. S. Davis, "Estimation of anthropomeasures from a single calibrated camera," in FG, pp. 499-504, 2006.

[37] J. Deutscher and I. Reid, "Articulated body motion capture by stochastic search," International Journal of Computer Vision, vol. 61, no. 2, pp. 185-205, 2005.

[38] R. Kehl and L. Van Gool, "Markerless tracking of complex human motions from multiple views," Comput. Vis. Image Underst., vol. 104, pp. 190-209, Nov. 2006.

[39] R. Navaratnam, A. Thayananthan, P. H. S. Torr, and R. Cipolla, "Hierarchical part-based human body pose estimation," in BMVC, 2005.

[40] H. Hamer, K. Schindler, E. Koller-Meier, and L. V. Gool, "Tracking a hand manipulating an object," in IEEE International Conference on Computer Vision (ICCV), pp. 1475-1482, October 2009.

[41] M. de La Gorce, D. Fleet, and N. Paragios, "Model-based 3D hand pose estimation from monocular video," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 9, pp. 1793-1805, 2011.

[42] I. Oikonomidis, N. Kyriazis, and A. Argyros, "Tracking the articulated motion of two strongly interacting hands," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1862-1869, June 2012.

[43] C. Keskin, F. Kirac, Y. Kara, and L. Akarun, "Real time hand pose estimation using depth sensors," in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 1228-1234, 2011.

[44] A. Agarwal and B. Triggs, "A local basis representation for estimating human pose from cluttered images," in Computer Vision - ACCV 2006, vol. 3851 of Lecture Notes in Computer Science, pp. 50-59, Springer Berlin Heidelberg, 2006.

[45] K. Grauman, G. Shakhnarovich, and T. Darrell, "Inferring 3D structure with a statistical image-based shape model," in ICCV, pp. 641-648, 2003.

[46] R. Navaratnam, A. W. Fitzgibbon, and R. Cipolla, "Semi-supervised learning of joint density models for human pose estimation," in BMVC, pp. 679-688, 2006.

[47] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, "Discriminative density propagation for 3D human motion estimation," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 390-397, June 2005.

[48] L. Taycher, G. Shakhnarovich, D. Demirdjian, and T. Darrell, "Conditional random people: Tracking humans with CRFs and grid filters," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 222-229, 2006.

[49] J. Romero, H. Kjellstrom, and D. Kragic, "Monocular real-time 3D articulated hand pose estimation," in Humanoid Robots, 2009. Humanoids 2009. 9th IEEE-RAS International Conference on, pp. 87-92, 2009.

[50] J. Shotton, A. W. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in CVPR, pp. 1297-1304, 2011.

[51] Z. Ren, J. Yuan, J. Meng, and Z. Zhang, "Robust part-based hand gesture recognition using Kinect sensor," Multimedia, IEEE Transactions on, vol. 15, pp. 1110-1120, Aug 2013.

[52] G. Mori and J. Malik, "Recovering 3D human body configurations using shape contexts," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, pp. 1052-1062, July 2006.

[53] E.-J. Ong, A. S. Micilotta, R. Bowden, and A. Hilton, "Viewpoint invariant exemplar-based 3D human tracking," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 178-189, 2006. Special Issue on Modeling People: Vision-based understanding of a person's shape, appearance, movement and behaviour.

[54] C. Orrite-Urunuela, J. del Rincon, J. Herrero-Jaraba, and G. Rogez, "2D silhouette and 3D skeletal models for human detection and tracking," in Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, vol. 4, pp. 244-247, Aug 2004.

[55] L. Ren, G. Shakhnarovich, J. K. Hodgins, H. Pfister, and P. Viola, "Learning silhouette features for control of human motion," ACM Trans. Graph., vol. 24, pp. 1303-1331, Oct. 2005.

[56] P. Suryanarayan, A. Subramanian, and D. Mandalapu, "Dynamic hand pose recognition using depth data," in Pattern Recognition (ICPR), 2010 20th International Conference on, pp. 3105-3108, 2010.

[57] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 1290-1297, June 2012.

[58] X. Chen and M. Koskela, "Online RGB-D gesture recognition with extreme learning machines," in Proceedings of the 15th ACM International Conference on Multimodal Interaction, ICMI '13, (New York, NY, USA), pp. 467-474, ACM, 2013.

[59] M.-H. Yang, N. Ahuja, and M. Tabb, "Extraction of 2D motion trajectories and its application to hand gesture recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 24, pp. 1061-1074, Aug 2002.

[60] T. Starner, J. Weaver, and A. Pentland, "Real-time American Sign Language recognition using desk and wearable computer based video," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 20, pp. 1371-1375, Dec 1998.

[61] J.-G. Wang and E. Sung, "EM enhancement of 3D head pose estimated by point at infinity," Image Vision Comput., vol. 25, pp. 1864-1874, Dec. 2007.

[62] S. Ohayon and E. Rivlin, "Robust 3D head tracking using camera pose estimation," in Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, vol. 1, pp. 1063-1066, 2006.

[63] G. Zhao, L. Chen, J. Song, and G. Chen, "Large head movement tracking using SIFT-based registration," in Proceedings of the 15th International Conference on Multimedia, MULTIMEDIA '07, (New York, NY, USA), pp. 807-810, ACM, 2007.

[64] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly, "Vision-based hand pose estimation: A review," Computer Vision and Image Understanding, vol. 108, no. 1-2, pp. 52-73, 2007. Special Issue on Vision for Human-Computer Interaction.

[65] B. Johansson, Low Level Operations and Learning in Computer Vision. PhD thesis, Linköping University, Sweden, December 2004. Dissertation No. 912, ISBN 91-85295-93-0.

[66] H. Knutsson and G. H. Granlund, "Apparatus for determining the degree of variation of a feature in a region of an image that is divided into discrete picture elements." US Patent 4,747,151, 1988. (Swedish patent 1986).

[67] Z. Yao, Model Based Coding: Initialization, Parameter Extraction and Evaluation. PhD thesis, Umeå University, Applied Physics and Electronics, 2005.

[68] B. Horn, Robot Vision. MIT Electrical Engineering and Computer Science Series, MIT Press, 1986.

[69] S. Yousefi, F. A. Kondori, and H. Li, "Real-time 3D gesture recognition and tracking system for mobile devices." U.S. Patent Application, filed January 2014, Patent Pending.

[70] B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, "Model-based hand tracking using a hierarchical Bayesian filter," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 28, pp. 1372-1384, Sept 2006.

[71] M. de La Gorce, N. Paragios, and D. Fleet, "Model-based hand tracking with texture, shading and self-occlusions," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1-8, June 2008.

[72] C. Kerdvibulvechi and H. Saito, "Markerless guitarist fingertip detection using a Bayesian classifier and a template matching for supporting guitarists," in 10th International Conference on Virtual Reality, 2008.

[73] D. Lee and S. Lee, "Vision-based finger action recognition by angle detection and contour analysis," ETRI Journal, vol. 33, pp. 415-422, 2011.

[74] D. Lee and Y. Park, "Vision-based remote control system by motion detection and open finger counting," Consumer Electronics, IEEE Transactions on, vol. 55, pp. 2308-2313, November 2009.

[75] M. Baldauf, S. Zambanini, P. Fröhlich, and P. Reichl, "Markerless visual fingertip detection for natural mobile device interaction," in Proceedings of the 13th International Conference on Human Computer Interaction with Mobile Devices and Services, MobileHCI '11, (New York, NY, USA), pp. 539-544, ACM, 2011.

[76] T. Shiratori, H. S. Park, L. Sigal, Y. Sheikh, and J. K. Hodgins, "Motion capture from body-mounted cameras," ACM Trans. Graph., vol. 30, pp. 31:1-31:10, July 2011.

[77] T. Oskiper, Z. Zhu, S. Samarasekera, and R. Kumar, "Visual odometry system using multiple stereo cameras and inertial measurement unit," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, pp. 1-8, June 2007.

[78] Z. Zhu, T. Oskiper, S. Samarasekera, R. Kumar, and H. Sawhney, "Ten-fold improvement in visual odometry using landmark matching," in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1-8, Oct 2007.

[79] Z. Zhu, T. Oskiper, S. Samarasekera, R. Kumar, and H. Sawhney, "Real-time global localization with a pre-built visual landmark database," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1-8, June 2008.

[80] L. Ballan, G. J. Brostow, J. Puwein, and M. Pollefeys, "Unstructured video-based rendering: Interactive exploration of casually captured videos," pp. 1-11, July 2010.

[81] A. Davison, I. Reid, N. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 29, pp. 1052-1067, June 2007.

[82] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry for ground vehicle applications," Journal of Field Robotics, vol. 23, no. 1, pp. 3-20, 2006.

[83] http://www.lynxmotion.com/p-395-ssc-32-servo controller.aspx/.

[84] P. Padeleris, X. Zabulis, and A. Argyros, "Head pose estimation on depth data based on particle swarm optimization," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pp. 42-49, June 2012.

[85] G. Fanelli, T. Weise, J. Gall, and L. Van Gool, "Real time head pose estimation from consumer depth cameras," in 33rd Annual Symposium of the German Association for Pattern Recognition (DAGM'11), September 2011.

[86] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool, "Random forests for real time 3D face analysis," Int. J. Comput. Vision, vol. 101, pp. 437-458, February 2013.

[87] I. Oikonomidis, N. Kyriazis, and A. Argyros, "Efficient model-based 3D tracking of hand articulations using Kinect," in Proceedings of the British Machine Vision Conference, pp. 101.1-101.11, BMVA Press, 2011. http://dx.doi.org/10.5244/C.25.101.

[88] M. Van den Bergh, D. Carton, R. de Nijs, N. Mitsou, C. Landsiedel, K. Kuehnlenz, D. Wollherr, L. Van Gool, and M. Buss, "Real-time 3D hand gesture interaction with a robot for understanding directions from humans," in RO-MAN, 2011 IEEE, pp. 357-362, July 2011.

[89] D. Ferstl, M. Ruther, and H. Bischof, "Real-time hand gesture recognition in a virtual 3D environment," in DAGM-OAGM 2012.