A vision-enhanced multisensor LBS suitable for augmented reality applications


Journal of Location Based Services. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tlbs20

A vision-enhanced multisensor LBS suitable for augmented reality applications. Antonio J. Ruiz-Ruiz, Oscar Canovas & Pedro E. Lopez-de-Teruel, Department of Computer Engineering, University of Murcia, Murcia, Spain. Published online: 15 Jul 2013.

To cite this article: Antonio J. Ruiz-Ruiz, Oscar Canovas & Pedro E. Lopez-de-Teruel (2013) A vision-enhanced multisensor LBS suitable for augmented reality applications, Journal of Location Based Services, 7:3, 145-164, DOI: 10.1080/17489725.2013.807026

To link to this article: http://dx.doi.org/10.1080/17489725.2013.807026


A vision-enhanced multisensor LBS suitable for augmented reality applications

Antonio J. Ruiz-Ruiz*, Oscar Canovas and Pedro E. Lopez-de-Teruel

Department of Computer Engineering, University of Murcia, Murcia, Spain

(Received 9 March 2013; final version received 18 April 2013; accepted 7 May 2013)

This paper presents an augmented reality system for indoor environments, which does not require special tagging or intrusive landmarks. We have designed a localisation architecture that acquires data from different sensors available in commodity smartphones to provide location estimations. The accuracy of those estimations is very high, which enables the information overlay on a camera-phone display image. During a training phase, we used visual structure from motion techniques to run offline 3D reconstructions of the environment from the correspondences among the scale invariant feature transform descriptors of the training images. To determine the position of the smartphones, we first obtained a coarse-grained estimation based on WiFi signals, digital compasses and built-in accelerometers, making use of fingerprinting methods, probabilistic techniques and motion estimators. Then, using images captured by the camera, we performed a matching process to determine correspondences between 2D pixels and model 3D points, but only analysing a subset of the 3D model delimited by the coarse-grained estimation. This multisensor approach achieves a good balance between accuracy and performance. Finally, a resection process was implemented, providing high localisation accuracy when the camera has been previously calibrated, that is, when we know intrinsic parameters such as the focal length, but it is also accurate if an auto-calibration process is required. Our experimental tests showed that this proposal is suitable for applications combining location-based services and augmented reality, since we are able to provide high accuracy, with an average error down to 15 cm in under 0.5 s of response time.

Keywords: augmented reality; image processing; multisensor; SIFT; smartphones

1. Introduction

Several augmented reality applications do not require a continuous real-time response,

such as those designed to display location-aware reminders or notes, to obtain information

about who is behind the door of a particular laboratory or to visualise additional objects of

interest virtually added to the scene. However, those applications require a good trade-off

between fine-grained accuracy and an acceptable response time.

The wide adoption of smartphones enables new scenarios to offer overlay information

and augmented reality to an increasing number of users. Smartphones are equipped with

several sensors, which have simplified the process of obtaining the required context

information. For a diverse set of areas including tracking, geographic routing or


entertainment, location-sensing systems are currently a very active research field. There

are several widespread applications that make use of the images that users obtain from the

camera to display augmented reality. Indeed, those images can be used to provide a better

estimation of the users’ position, offering a seamless integration of the sensor and the

display. We presented a previous version of this work in Ruiz-Ruiz, Lopez-de-Teruel, and Canovas (2012), and in this article,

we provide additional details about the suitability of this proposal to develop augmented

reality applications. We have designed an image-based location-based service (LBS),

which provides not only very fine-grained position estimations (up to a few cm of

maximum error) but also an equally precise estimation of the absolute orientation of the

imaging device. Therefore, hereinafter, we assume that the users are capturing images,

typically with a tablet or a smartphone, for which they want to obtain the precise

estimation of the full 6 degrees of freedom (dof) – three for the $(X, Y, Z)$ position and three additional for the $(\alpha, \beta, \gamma)$ Euler angles of the rotation matrix – of the device with

respect to some conveniently chosen world reference coordinate system. In this article, we

focus on how to develop augmented reality applications for indoor environments using

data from multiple sensors.

Image processing is a very time-consuming task, so we decided to follow a multisensor

design to limit the amount of analysed images. To determine the position of the

smartphones, we first obtained a coarse-grained estimation based on WiFi signals, digital

compasses and built-in accelerometers, making use of fingerprinting methods, probabilistic

techniques and motion estimators. Then, using images captured by the camera, we

performed a matching process to determine correspondences between 2D pixels and model

3D points, but only considering a subset of the 3D model delimited by the coarse-grained

estimation. The keystone in our approach is the use of the scale invariant feature transform

(SIFT) to run offline 3D map reconstructions of the environment from the features extracted

using training images. Several resection techniques were also used to estimate the full 6 dof,

depending on whether we know some intrinsic camera parameters, such as the focal length. The

system can currently complete an entire cycle, from obtaining data from sensors to

displaying augmented reality based on accurate estimations, in under 400 ms.

2. Related work

Indoor positioning is an active research field. Many types of signals (radio, images and

sound) and methods have been used to infer location.

Paying attention to radio signals, pattern recognition methods, among many others,

estimate locations by recognising position-related patterns. Fingerprinting is based on

radio maps containing patterns of received signal strength indicator (RSSI) values, which are

obtained using 802.11, Zigbee, Bluetooth or any other widespread wireless technology.

The analysis of RSSI patterns is a technique that has been examined by several authors

(Chandrasekaran et al. 2009), obtaining an accuracy ranging from 0.5 to 3 m.

However, better results can be obtained by integrating the information captured by

multiple sensors. For example, Hattori et al. (2009) and Mulloni et al. (2009) used WiFi

signals and image recognition to estimate positions. However, they are based on 2D

landmarks that must be placed in the scenario of interest, which involves the inclusion of

obtrusive elements.

Arai, Horimi, and Nishio (2010) performed a preliminary analysis of how image

processing techniques such as SIFT (Lowe 2004) can be used to improve the accuracy in

landmark-free location-based systems. SIFT features provide a powerful mechanism to


perform robust matching between different images of a given scene, which is fundamental

not only for visual recognition of objects or places (Frauendorfer et al. 2008), but also –

and what is more relevant for this work – in the context of simultaneous localisation and

3D scene reconstruction (Se, Lowe, and Little 2005).

Alternatively, some other works on localisation have used other kinds of features and

associated local image descriptors instead of SIFT, such as speeded up robust features

(Murillo, Guerrero, and Sagues 2007) or affine-SIFT (Li and Wang 2012). Nevertheless,

the goal is always essentially the same: to extract a collection of visual features from input

images, which are reasonably invariant to changes such as translation, scaling, rotation and

illumination, and which can be used to perform matching against a database of features

extracted from a large collection of images previously obtained from the application

environment.

As we show, our system goes further: we build a full 3D model containing

hundreds of thousands of 3D points obtained using these features and state-of-the-art

structure from motion (SfM) computer vision techniques (Hartley and Zisserman 2011).

The computation of this 3D model not only allows us to provide a much higher accuracy in

the localisation, but also reduces the required time in the offline training process.

3. A multisensor design

In recent years, the wide adoption of smartphones equipped with several sensors has

simplified the process of obtaining context information. Although our solution uses the

images from the camera as the main source of information to infer accurate positions, we

can take advantage from other sensors to limit the amount of information to be examined,

to simplify the training process or to reduce the computational load of the smartphones.

We believe that this multisensor approach is crucial to accomplish a feasible image-based

location-based service.

The main drawback of most of the indoor positioning systems is the scalability issues

that arise when modelling large scenarios. However, several solutions can be envisioned

following a multisensor approach to reduce the necessary time effort and the required

processing during the online phase.

Our solution follows a methodology based on a continuous video recording and data

acquisition process. Then, using some fingerprinting techniques during the training phase,

we divide the scenario into different zones according to the characterisation of the 802.11

signals. As a consequence, we can obtain clusters of physical points, where signals show a

similar behaviour, and they will be used to partition the 3D model into different submodels

(Ruiz-Ruiz et al. 2011). In addition, while we are recording the training video, we also

save information related to the current orientation of the training device using the data

provided by the built-in digital compass, so that the different video frames are tagged

with their corresponding orientation.

During the online phase, we determine a coarse-grained geographical zone, in which

the device is located using the obtained RSSI. That zone is related to different 3D

submodels, and one of them is selected according to the orientation of the device to find

correspondences between the features of the current image and those contained in the

submodel. Consequently, we reduce the amount of information (image features) to check

against during the online phase, improving performance.
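To make this coarse-grained stage concrete, the following minimal Python sketch (not the authors' code; the simple Euclidean fingerprint distance and the zone data layout are assumptions, whereas the paper combines fingerprinting with probabilistic techniques) picks the RSSI cluster closest to the current WiFi observation and then the orientation-tagged submodel that best matches the compass heading:

```python
import math

def select_submodel(rssi_sample, heading_deg, zones):
    """Coarse-grained localisation sketch: choose the RSSI zone whose fingerprint
    is closest to the current observation, then the submodel whose training
    orientation best matches the compass heading.
    `zones` is a hypothetical list of dicts:
      {"fingerprint": {bssid: mean_rssi}, "submodels": {orientation_deg: model}}"""
    def distance(fingerprint):
        shared = set(fingerprint) & set(rssi_sample)
        if not shared:
            return float("inf")
        return math.sqrt(sum((fingerprint[b] - rssi_sample[b]) ** 2 for b in shared))

    zone = min(zones, key=lambda z: distance(z["fingerprint"]))
    # pick the submodel trained with the closest orientation (circular distance)
    orientation = min(zone["submodels"],
                      key=lambda o: min(abs(o - heading_deg), 360 - abs(o - heading_deg)))
    return zone, zone["submodels"][orientation]
```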

However, the information from the different sensors is also used to reduce the

computational cost on the smartphone, to minimise the amount of information transmitted


and therefore, to preserve the smartphone battery life. For example, we have developed a

motion estimator based on accelerometer readings to determine whether the user location

has to be recalculated because the user is moving fast. Our augmented reality applications,

apart from making use of our localisation service, rely on the information provided by the

gyroscope to adapt the overlaid information to the device’s rotations.

4. Generation of 3D models based on SIFT

Each SIFT feature is characterised by its position in the image and its natural size and

dominant orientation. It is also attached to a highly distinctive 128D description vector,

which is computed from an image patch in the corresponding position, scale and

orientation (Lowe 2004). By matching these descriptors against a geo-tagged database of

images previously extracted from the application environment, we were already able to

compute a good approximation of the current location in a previous work (Ruiz-Ruiz et al.

2011). Here, the images in the database were captured in a discretised grid of the

environment in all plausible directions of movement, so the precision was clearly limited

by both the resolution of the grid and the set of orientations used during training.

Furthermore, the training phase was a cumbersome manual process, which could become

clearly a problem when modelling large scenarios such as a university campus or an

airport.

To overcome these issues, instead of just storing features in the database along with the

location from which they were taken, in this work, we obtain an accurate 3D model of the

environment, which is built by getting the 3D points in the scene that generated the

corresponding image features. In order to do that, we combine the use of SIFT features

with state-of-the-art techniques in 3D computer vision, which, using the now well

established field of applied projective geometry (Hartley and Zisserman 2011), are able to

fully estimate a 3D reconstruction of the scene using only a sufficiently large number of

images taken from different viewpoints. A brief overview of this reconstruction process,

known generically as SfM, is given in Figure 1, using our own test environment as the

illustrating example (see Note 1). The input is a set of images extracted from a video recorded during

the training phase, trying to capture all the details of the scene. The output is an accurate

3D reconstruction, which includes both the 3D points and the full 6 dof position of all the

cameras.
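As an illustration of the per-frame feature step that feeds this reconstruction pipeline, the sketch below extracts SIFT keypoints and their 128-D descriptors with OpenCV. The paper itself relies on the SIFTGPU library inside VisualSfM, so this is only an equivalent CPU stand-in, not the authors' implementation:

```python
import cv2

def extract_sift_features(image_bgr):
    """Extract SIFT keypoints and 128-D descriptors from one training frame.
    cv2.SIFT_create() is used here only as a CPU illustration of the step
    performed by SIFTGPU in the paper."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    # each keypoint carries position, scale and dominant orientation,
    # and each descriptor row is the 128-D vector described in Section 4
    return keypoints, descriptors
```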

5. Matching

Once we have obtained the 3D representation of our scenario, we need a mechanism to

look for coincidences between the features extracted from the 2D images captured during

the online phase and those features that made up the 3D model. Several techniques based

on nearest-neighbour algorithms to perform the matching process efficiently were analysed

in previous works (Ruiz-Ruiz, Lopez-de-Teruel, and Canovas 2012).

As we can take advantage of our multisensor design, we divide our search space

according to the different zones determined after analysing additional sensors (RSSI

signals and orientations). Therefore, the amount of features that we have to compare with

is considerably reduced. The moderately small size of an individual zone in our case

(approximately 75K descriptors) led us to test a brute-force algorithm. Keeping in mind that the

matching process will usually be performed in the cloud, we made use of an exhaustive

exact nearest neighbour (ENN) implementation of the brute force search algorithm


running on a graphics processing unit (GPU). To take advantage of the enormous power of

these relatively inexpensive computing platforms, we slightly modified the algorithm

proposed in Garcia, Debreuve, and Barlaud (2008) to adapt its functionality to our zone-based search space. In addition to better accuracy, the use of this algorithm resulted in increased matching performance compared with the previously analysed nearest-neighbour techniques.
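The sketch below shows the idea of the exhaustive exact nearest-neighbour search on a CPU with numpy; the paper runs the equivalent brute-force search on a GPU using a modified version of the CUDA code by Garcia, Debreuve, and Barlaud (2008), and the 0.8 ratio threshold used here is an assumption, not a value reported in the paper:

```python
import numpy as np

def match_exact_nn(query_desc, zone_desc, ratio=0.8):
    """Exhaustive exact nearest-neighbour matching of query SIFT descriptors
    (N_q x 128) against the descriptors of a single zone (N_z x 128).
    Returns (query_idx, zone_idx) pairs passing Lowe's ratio test."""
    # squared Euclidean distances between all query/zone descriptor pairs
    d2 = (np.sum(query_desc ** 2, axis=1)[:, None]
          + np.sum(zone_desc ** 2, axis=1)[None, :]
          - 2.0 * query_desc @ zone_desc.T)
    order = np.argsort(d2, axis=1)[:, :2]               # two closest zone descriptors
    best, second = np.take_along_axis(d2, order, axis=1).T
    keep = best < (ratio ** 2) * second                 # ratio test on squared distances
    return [(i, order[i, 0]) for i in np.flatnonzero(keep)]
```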

6. Camera resection

Camera resection (sometimes generically referred to as camera calibration) is the process by which we can determine the $P_{3\times 4}$ projection matrix from a given set of matching 3D scene points and the corresponding 2D pixels in an image. This matrix defines the algebraic relationship $P X_i = x_i$ for every $X_i \leftrightarrow x_i$ correspondence when both the pixels $x_i$ and the 3D points $X_i$ are given in homogeneous coordinates (Hartley and Zisserman 2011). $P$ can be factorised as $P = K_{3\times 3} R_{3\times 3} [I \mid -C]_{3\times 4}$, in a way that the internally encoded relationship with the rotation $R$ and the 3D position $C = (c_x, c_y, c_z)^\top$ of the camera in the reference 3D coordinate system is made explicit. This factorisation depends on the camera intrinsic calibration matrix $K$, which contains internal camera parameters such as the focal length and the aspect ratio (e.g. a very simple and commonly used calibration matrix is $K = \mathrm{diag}(f, f, 1)$, with $f$ the focal length in pixel units). In this section,

Figure 1. Illustration of the SfM technique (pipeline steps: 0. input images; 1. SIFT feature extraction and description; 2. pairwise 2D matching; 3. stereo 3D reconstructions; 4. matching matrix for every pair of views; 5. growing the 3D structure by resectioning additional views and triangulating new points; 6. global (refined) 3D structure and motion reconstruction; 7. final textured reconstruction, not used). Input images are preprocessed by extracting and matching SIFT features among them to get relevant stereo pairs from which the 3D reconstruction can be bootstrapped. These partial reconstructions are then augmented with new cameras by an incremental resectioning/triangulation procedure. The final step, known as bundle adjustment (Triggs et al. 2000), is a global nonlinear optimisation which minimises the reprojection error of all the 3D points with respect to their corresponding 2D image projections. All figures were obtained using Changchang Wu's VisualSfM software (Wu 2001) on our test environment.


we detail the entire resectioning process that we follow in this work, which is summarised

in Algorithm 1.

The overall process starts by running a robust implementation of the so-called direct

linear transformation (DLT) algorithm (Hartley and Zisserman 2011), which allows us to

obtain the camera matrix P that minimises the reprojection error for the maximum number

of inlier correspondences. In this context, inliers are matching pairs of 3D↔2D points,

which can be fitted to the model up to a given error tolerance. Then, if the camera had been

precalibrated (i.e. we knew in advance the calibration matrix K), we would discard P and use only the obtained inlier set to get the desired camera pose (R, C) with an exterior

orientation algorithm. Otherwise, we still can perform a camera autocalibration using the

previously estimated P to solve the factorisation problem by carrying out the simple and

well-known matrix algebra operation of QR decomposition (Golub and Van Loan 1996).

These two alternatives give us the possibility to treat both the cases of knowing the camera

internal parameters in advance (feasible for most common devices), as well as the

completely uncalibrated case, which is solved at the price of slightly lower precision.

The final pose is obtained by refining the former solution using some standard nonlinear

optimisation procedure (Nocedal and Wright 1999).
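A minimal sketch of the autocalibration branch is shown below: it factors $P = K R [I \mid -C]$ by RQ-decomposing the left 3×3 block of $P$ and recovers the camera centre from the null space of $P$. It is an illustration under stated assumptions (scipy's RQ routine, sign normalisation), not the authors' implementation, which uses the QVision library:

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(P):
    """Factor P = K R [I | -C]: RQ decomposition of the left 3x3 block gives
    K (upper triangular) and R (rotation); C comes from P (C, 1)^T = 0."""
    M = P[:, :3]
    K, R = rq(M)                          # M = K R
    # RQ is only unique up to sign flips: force a positive diagonal on K
    S = np.diag(np.sign(np.diag(K)))
    K, R = K @ S, S @ R
    if np.linalg.det(R) < 0:
        R = -R                            # P is homogeneous, a global sign flip is allowed
    K = K / K[2, 2]                       # normalise so that K[2, 2] = 1
    C = -np.linalg.inv(M) @ P[:, 3]       # camera centre in world coordinates
    return K, R, C
```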

6.1. Direct linear transformation

The $P$ matrix that we must first estimate is a homogeneous entity with 12 elements, so that it has, in general, 11 dof. Each 3D↔2D correspondence $(x, y, z) \leftrightarrow (q, r)$ sets up the homogeneous restriction $P(x, y, z, 1)^\top = \lambda (q, r, 1)^\top$, which is equivalent to $(q, r, 1) \times P(x, y, z, 1)^\top = (0, 0, 0)^\top$. Namely,

$$\begin{bmatrix} 0^\top & -X_i^\top & r_i X_i^\top \\ X_i^\top & 0^\top & -q_i X_i^\top \end{bmatrix} \begin{bmatrix} P^1 \\ P^2 \\ P^3 \end{bmatrix} = 0, \qquad (1)$$

where $X_i = (x_i, y_i, z_i, 1)^\top$ and $x_i = (q_i, r_i, 1)^\top$ $\forall i$ are the homogeneous coordinates of the corresponding pair of points, and $P^k = (p_{4k-3}, p_{4k-2}, p_{4k-1}, p_{4k})^\top$ for $k = 1, 2, 3$ are the three rows of $P$. Taking into account that we must solve for 11 unknowns (the dof of $P$), and since each 2D↔3D correspondence gives rise to two independent equations over the elements of $P$, a minimum of 6 of these correspondences is necessary to solve the complete system. From the solution for the 12 unknowns $p_1 \cdots p_{12}$, we can

Algorithm 1. Resection algorithm
1: Require: 2D↔3D correspondences
2: P, inliers ← DLT(correspondences)                 ▷ RANSAC
3: if camera intrinsics are known then               ▷ Precalibrated case
4:     K, R, C ← ExteriorOrientation(K_fixed, inliers)
5: else                                              ▷ Autocalibration
6:     K, R, C ← QRDecomposition(P)
7: end if
8: R_opt, C_opt ← NonLinearOptimization(K, R, C)
9: Ensure: R_opt and C_opt (camera pose in world coordinates)


easily reconstruct a canonical version of the homogeneous matrix P:

$$P = \begin{bmatrix} \dfrac{p_1}{p_{12}} & \dfrac{p_2}{p_{12}} & \dfrac{p_3}{p_{12}} & \dfrac{p_4}{p_{12}} \\ \dfrac{p_5}{p_{12}} & \dfrac{p_6}{p_{12}} & \dfrac{p_7}{p_{12}} & \dfrac{p_8}{p_{12}} \\ \dfrac{p_9}{p_{12}} & \dfrac{p_{10}}{p_{12}} & \dfrac{p_{11}}{p_{12}} & 1 \end{bmatrix}. \qquad (2)$$

The former procedure can be easily adapted to the overdetermined case (more than six

correspondences). As points are measured inexactly due to noise, some of these

correspondences may not be fully compatible with any projective transformation, and

therefore, the best possible P has to be determined, which is found by minimising the reprojection error in the least-squares sense.

The SIFT matching process frequently ‘contaminates’ the set of real correspondences

with some subset of outliers (wrong correspondences), which would seriously distort the

obtained solution if not adequately filtered. To solve this problem, a robust estimation

procedure is needed, and in this work, we use the well-known RANSAC algorithm

(Fischler and Bolles 1981). This algorithm works by finding the best transformation matrix

that minimises the reprojection error taking into account only inlier (i.e. correct)

correspondences. To filter the outlier (incorrect) correspondences, RANSAC iteratively

selects a random subset of six elements from the original set of correspondences, from

which it computes a tentative P. Because the number of available correspondences is

notably higher than this required minimum of six, at each RANSAC iteration, we perform

an incremental process that iteratively adds the correspondences identified as inliers and

recalculates P until no more inliers were detected. One correspondence is identified as an

inlier if its reprojection error does not exceed a specified threshold. This iterative

procedure, which will try several tentative P matrices before succeeding, completes when it either obtains a sufficient percentage of inliers over the total number of input

correspondences or the number of iterations exceeds a given threshold. In Section 7.3, we

show the influence of all these parameters in both the accuracy and performance of the

results.
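The following numpy sketch puts the DLT of Equation (1) and the RANSAC loop together. It is a simplified illustration of the procedure described above (the sample size of 6, the incremental inlier refit and the early exit follow the text, but the specific thresholds and random-sampling details are assumptions):

```python
import numpy as np

def dlt(X, x):
    """Estimate P (3x4, up to scale) from >= 6 correspondences via Eq. (1).
    X: Nx3 world points, x: Nx2 pixel coordinates (q, r)."""
    rows = []
    for Xw, (q, r) in zip(X, x):
        Xh = np.append(Xw, 1.0)
        rows.append(np.hstack([np.zeros(4), -Xh, r * Xh]))
        rows.append(np.hstack([Xh, np.zeros(4), -q * Xh]))
    A = np.asarray(rows)
    p = np.linalg.svd(A)[2][-1]          # null-space / least-squares solution
    return p.reshape(3, 4)

def reprojection_errors(P, X, x):
    Xh = np.hstack([X, np.ones((len(X), 1))])
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(proj - x, axis=1)

def ransac_dlt(X, x, iters=100, thresh_px=5.0, min_inliers=10, rng=None):
    """Robust DLT: fit P from 6 random correspondences, grow the consensus set
    incrementally, and keep the model with the most inliers."""
    rng = rng or np.random.default_rng(0)
    best_P, best_inl = None, np.zeros(len(X), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(X), size=6, replace=False)
        P = dlt(X[sample], x[sample])
        inl = reprojection_errors(P, X, x) < thresh_px
        if inl.sum() < 6:
            continue
        while True:                       # refit on current inliers until stable
            P = dlt(X[inl], x[inl])
            new_inl = reprojection_errors(P, X, x) < thresh_px
            if new_inl.sum() <= inl.sum():
                break
            inl = new_inl
        if inl.sum() > best_inl.sum():
            best_P, best_inl = P, inl
        if best_inl.sum() >= min_inliers:
            break                         # early exit once enough inliers are found
    return best_P, best_inl
```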

The K, R and C values can be easily obtained from the QR decomposition of the first

three columns of P. Therefore, at least theoretically, this generic procedure could be

applied without a previous knowledge of the calibration matrix of the camera. However,

the main problem with this approach is that the solution values that we obtain (mainly K

and C) can be severely coupled, that is they are not mutually independent from each other.

Particularly, in some ill-conditioned cases such as scenes with a dominant plane (a not so

uncommon case in typical indoor scenarios, where large walls can cover most of the

image), the autocalibration process can fail to give an accurate solution. As Section 7.3

shows, the QR decomposition can still be a good alternative when the camera intrinsic

parameters are completely unknown, except under the aforementioned ill-posed

conditions. In the following section, we show how this a priori knowledge of calibration

largely improves the quality of the results.

6.2. Exterior orientation

In many cases, the on-board camera specifications (mainly focal length and aspect ratio) of

typical smartphones and tablets are known, so the calibration matrix Kfixed can be

determined. Fiore (2001) developed an alternative linear method to obtain only the


exterior orientation of the camera, which, using this knowledge of the intrinsic parameters,

results in a much more precise estimation of R and C. Moreover, it performs remarkably

well even in the planar case, which is an ill-conditioned problem for the DLT algorithm,

as commented in the previous section. Fiore’s algorithm works in two stages, first,

estimating the unknown depths $z_i$ for each homogeneous 2D point $x_i$ (that is, first we get an intermediate set of inhomogeneous 3D points $z_i K_{fixed}^{-1} x_i$), and then what is left is an absolute 3D orientation problem, whose solution yields the desired values of $R$ and $C$.

More specifically, given a number of input 2D↔3D point correspondences, $x_i \leftrightarrow X_i$, and the intrinsic camera matrix $K = K_{fixed}$, we are required to find a rotation $R$, a vector $C$ (attitude and position of the camera) and a subsidiary depth vector $z = (z_1, \ldots, z_i, \ldots, z_n)^\top$ (where $n$ is the number of correspondences) such that:

$$z_i x_i = K R [I \mid -C] X_i \;\Leftrightarrow\; K^{-1} z_i x_i = R [I \mid -C] X_i \quad \forall i. \qquad (3)$$

In order to solve for the unknown values of $z_i$, we first compose the matrix $M$, whose columns are formed with the homogeneous coordinates of the 3D points,

$$M_{4\times n} = [X_1 \cdots X_n], \qquad (4)$$

and another matrix $N$, arranging in this case the homogeneous coordinates $x_i = (q_i, r_i, 1)^\top$ of the 2D points in the following way:

$$N_{3n\times n} = \begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & \vdots \\ \vdots & & \ddots & 0 \\ 0 & 0 & \cdots & x_n \end{bmatrix}. \qquad (5)$$

Then, taking the singular value decomposition (SVD) (Golub and Van Loan 1996) of $M$, we obtain the factorisation $M = U_{4\times 4} D_{4\times n} V_{n\times n}^\top$, where $U$ and $V$ are orthogonal and $D$ is diagonal. Being $k$ the rank of $M$, we denote by $V_2$ the submatrix of size $n \times (n-k)$ which results from taking only the last $n-k$ columns of $V$. This submatrix spans the null space of $M$, so that $M V_2 = 0$.

Given the former definitions, it can be shown (Fiore 2001) that the depth vector $z$ can be recovered linearly (up to a scale factor), just by solving the following null-space problem:

$$\bigl((V_2^\top \otimes K^{-1})\, N\bigr)\, z = 0, \qquad (6)$$

where the $\otimes$ symbol stands for the Kronecker product (Horn and Johnson 1991) of matrices. Once the depths have been obtained, our initial Equation (3) is reduced to finding

find the correct alignment of two sets of 3D points, an instance of the well-known absolute

3D orientation problem.
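A minimal numpy sketch of this depth-recovery stage, transcribing Equation (6) directly, is shown below; the rank tolerance and the sign normalisation of the recovered depths are implementation assumptions, not values from the paper:

```python
import numpy as np

def fiore_depths(X, x, K):
    """Recover projective depths z_i (up to scale) for N 2D<->3D matches
    following the null-space formulation of Eq. (6) (Fiore 2001).
    X: Nx3 world points, x: Nx2 pixels, K: 3x3 intrinsic matrix."""
    n = X.shape[0]
    M = np.vstack([X.T, np.ones((1, n))])              # 4 x n homogeneous 3D points
    _, s, Vt = np.linalg.svd(M)
    k = int(np.sum(s > 1e-9))                          # numerical rank of M
    V2 = Vt[k:].T                                      # last n - k columns of V
    # N is block diagonal with the homogeneous pixels x_i on its diagonal
    xh = np.hstack([x, np.ones((n, 1))])
    N = np.zeros((3 * n, n))
    for i in range(n):
        N[3 * i:3 * i + 3, i] = xh[i]
    A = np.kron(V2.T, np.linalg.inv(K)) @ N            # (3(n-k)) x n system of Eq. (6)
    z = np.linalg.svd(A)[2][-1]                        # right null vector of A
    return z * np.sign(z[np.argmax(np.abs(z))])        # fix the overall sign heuristically
```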

This last stage of the exterior orientation algorithm proceeds as follows. Let $W_i$ be the depth-recovered 3D points, $W_i = (z_i \cdot (q_i/f),\; z_i \cdot (r_i/f),\; z_i)^\top$ (where we have used the $K_{fixed} = \mathrm{diag}(f, f, 1)$ assumption), and $Y_i$ the likewise inhomogeneous original 3D points, $Y_i = (x_i, y_i, z_i)^\top$. Then, given these two sets of 3D points $W_i$ and $Y_i$, we are required to find


the rotation matrix R, the vector C and the scalar s, such that:

$$W_i = s (R Y_i + C) \quad \forall i = 1 \cdots n. \qquad (7)$$

Now, centring the point clouds on their respective means, that is, $\bar{W}_i = W_i - (1/n)\sum_{j=1}^{n} W_j$ and $\bar{Y}_i = Y_i - (1/n)\sum_{j=1}^{n} Y_j$, the scale $s$ is very easy to obtain as $s = \lVert \bar{W}_i \rVert / \lVert \bar{Y}_i \rVert$ for every $i$. In practice, $s$ can be averaged over the set of points $\forall i = 1 \cdots n$, and then used to scale accordingly the values of $\bar{Y}_i$.

With the sets of points adequately scaled and centred, we are left with the problem of

estimating the unknown rotation between $\bar{W}_i$ and $s\bar{Y}_i$. This is known as the orthogonal Procrustes problem (Gower and Dijksterhuis 2004). Being $W_{3\times n}$ and $Y_{3\times n}$ the matrices formed by stacking the points $\bar{W}_i$ and $s\bar{Y}_i$, respectively, the solution is found using again the SVD. The sought rotation is given by:

$$R = V' \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det(U' V'^\top) \end{bmatrix} U'^\top, \qquad (8)$$

where $U'_{3\times 3} D'_{3\times 3} V'^\top_{3\times 3}$ stands for the SVD of the matrix $Y W^\top$, and the determinant of $U' V'^\top$ is always $1$ or $-1$.

Finally, once we have calculated the rotation matrix R, we can obtain the translation

vector $C$ from (7) simply as:

$$C = \frac{1}{s} \left( \frac{1}{n} \sum_{i=1}^{n} W_i \right) - R \left( \frac{1}{n} \sum_{i=1}^{n} Y_i \right). \qquad (9)$$

As we demonstrate in Section 7.3, this precalibrated exterior orientation procedure

ensures both robustness and accuracy in the estimated camera pose, even in environments

where we have to deal with essentially planar scenes.
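The absolute 3D orientation stage of Equations (7)-(9) can be summarised in a few lines of numpy, as sketched below. This is an illustrative transcription of the equations, not the authors' code:

```python
import numpy as np

def absolute_orientation(W, Y):
    """Solve W_i = s (R Y_i + C) for scale, rotation and translation
    (Eqs. (7)-(9)). W and Y are 3xN arrays of corresponding 3D points."""
    Wc = W - W.mean(axis=1, keepdims=True)              # centred point clouds
    Yc = Y - Y.mean(axis=1, keepdims=True)
    # per-point norm ratios, averaged over all points as suggested in the text
    s = np.mean(np.linalg.norm(Wc, axis=0) / np.linalg.norm(Yc, axis=0))
    # orthogonal Procrustes step: SVD of Y W^T gives the rotation of Eq. (8)
    U, _, Vt = np.linalg.svd((s * Yc) @ Wc.T)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])      # enforce det(R) = +1
    R = Vt.T @ D @ U.T
    # translation from Eq. (9)
    C = W.mean(axis=1) / s - R @ Y.mean(axis=1)
    return s, R, C
```

The same routine can also be reused for the ground-control-point alignment of Section 6.4, which solves an analogous absolute orientation problem.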

6.3. Nonlinear optimisation

The methods described in the previous subsections are linear, so in fact they minimise an

algebraic instead of a geometric error (Hartley and Zisserman 2011). Thus, although

essentially correct, the obtained solutions for the ðK;R;CÞ decomposition of the camera

matrix are slightly suboptimal. Especially when we know the camera intrinsic parameters

(Kfixed) with a high degree of accuracy, a final polishing of the solution by any standard

nonlinear procedure, which aims to minimise the geometric reprojection error is

recommended (line 8 of Algorithm 1). Several alternatives are also available for this task

(Nocedal and Wright 1999); in this work, we simply use a straightforward iterative

procedure implemented in the GNU scientific library. Our experiments show that this final

optimisation, to a greater or lesser extent, always improves the linear estimation.
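As an illustration of this final polishing step, the sketch below minimises the geometric reprojection error over the pose with fixed intrinsics. The paper uses a routine from the GNU scientific library; scipy's Levenberg-Marquardt solver and the rotation-vector parameterisation used here are substitutions chosen only for the example:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def refine_pose(K, R0, C0, X, x):
    """Polish a linear pose estimate by minimising the reprojection error
    (Section 6.3). X: Nx3 inlier world points, x: Nx2 observed pixels."""
    def residuals(params):
        R = Rotation.from_rotvec(params[:3]).as_matrix()
        C = params[3:]
        proj = (K @ (R @ (X - C).T)).T          # project with P = K R [I | -C]
        proj = proj[:, :2] / proj[:, 2:3]
        return (proj - x).ravel()
    p0 = np.concatenate([Rotation.from_matrix(R0).as_rotvec(), C0])
    sol = least_squares(residuals, p0, method="lm")
    R_opt = Rotation.from_rotvec(sol.x[:3]).as_matrix()
    return R_opt, sol.x[3:]
```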

6.4. From 3D model to real-world coordinates

Obviously, in the absence of any real-world reference, the SfM techniques can determine

the 3D reconstruction only up to an arbitrary global euclidean transform. Nevertheless, our

LBS must of course attach this 3D model to some convenient coordinate system defined


for our physical environment, so that we obtain the localisation results in real-world

coordinates. We can also define the desired augmented reality structures on this world

coordinate system. Given a minimum of three ground control points (although, for

numerical stability, it is convenient to use a few more), it is possible to estimate the

corresponding parameters $s_0$, $R_0$ and $C_0$ of the needed 3D-reconstruction-to-world euclidean transformation, by solving again an absolute 3D orientation problem analogous to the problem

mentioned in Section 6.2. Those parameters will be used to globally transform the whole

3D model to the desired final real-world coordinates, where all the posterior resectioning

process will take place.

7. Experimental analysis

This section describes the experimental environment where we have carried out a set of

tests with the aim of validating the proposal described in this paper. We also outline the

software and hardware elements as well as the methodology to perform the training

process. Finally, we discuss the details of one specific validation (among the several tests

we performed) and their related results.

7.1. Experimental environment

The testbed where our experiments were conducted is located on the ground floor of our Faculty. As shown in Figure 2, it is a 220 m² open-space area. To build the fingerprinting map based on RSSI, we took advantage of the access points deployed throughout the

dependencies.

The training video and RSSI observations were captured using a Samsung Galaxy Tab

Plus 7.0 with Android OS 3.2, and experiments were carried out making use of an ASUS

Transformer TF-101 with Android OS 4.0.3. We made use of our own application,

LOCUM-Trainer, to perform a fast training process. As a result, we obtained a lightweight

fingerprinting map of RSSIs and a set of images (frames extracted from the recording) geo-

tagged with the corresponding zone and orientation. Then, using a clustering technique

described in Lemelson et al. (2009), we determined the geographical areas dividing the

search space for image analysis, as mentioned in Section 5.

During the online phase, we extracted SIFT features to look for correspondences with

the points of the 3D model. For image processing purposes, we used the SIFTGPU library,

a GPU implementation of SIFT developed by Wu (2007). SIFTGPU requires a high-end

GPU which is not available in smartphones, so we use a server with an nVidia GeForce

Figure 2. Experimental environment. Differently shaped dots indicate the RSSI clusters. Rectangles enclosing dots define the 3D submodels, defined including orientations.


GTX580 with 512 cores, supporting both GLSL (OpenGL Shading Language) and CUDA

(Compute Unified Device Architecture). This GPU was installed in an Intel(R) Core(TM)

i7-2600K CPU 3.40GHz. This server will act as the cloud, which is responsible for

estimating the location of the different devices. In addition, we have also used a modified

version of the CUDA ENN software developed by Garcia, Debreuve, and Barlaud (2008)

to perform the matching process, and our own implementation of the resection algorithm

described in Section 6, making use of the QVision library (see Note 2), available on SourceForge.

7.2. Building 3D maps

As we mentioned earlier, our proposal is mainly focused on the image analysis. Thus, we

needed to obtain a set of images to build our 3D model. To minimise the training time, we

recorded a 7′24″-long video, capturing every detail of our environment. As there are not

many objects in the centre of the hall, the video was recorded paying attention to the four

main walls with the camera situated at shoulder level. During our tests, we learnt that a

better reconstruction is obtained when the scene is recorded with the camera oriented to

the walls and also including a second round using a 45° angle of view. Once the video was recorded, and after several tests varying the frame extraction rate, we extracted one image

per 24 frames, thus obtaining 524 different images. Those images were good enough to

build the 3D model, which was used for the following tests.
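A minimal OpenCV sketch of this frame-sampling step is given below (the output filename pattern is an assumption; the 1-out-of-24 sampling rate is the one reported above):

```python
import cv2

def extract_training_frames(video_path, out_pattern="frame_%05d.jpg", step=24):
    """Sample one image every `step` frames from the training video (the paper
    keeps 1 out of every 24 frames, obtaining 524 images for VisualSFM)."""
    cap, idx, kept = cv2.VideoCapture(video_path), 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(out_pattern % kept, frame)   # written for the SfM tool
            kept += 1
        idx += 1
    cap.release()
    return kept
```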

Our 3D model was built using the VisualSFM software (Wu 2001), a visual structure

from motion application that makes use of a set of input images to run 3D reconstructions.

VisualSFM relies on the SIFTGPU library to extract the SIFT features and to perform the

pairwise image matching process. The bundle adjustment is implemented using a GPU

version that exploits the hardware parallelism.

Finally, using the information extracted from the 3D model and according to the zones

dividing our map, we built the search structures that contain the descriptors of the 3D

points belonging to each submodel.

7.3. Validation results

In this section, we present the experimental results of our proposal. A set of 20 images

(640×480 pixels) is our validation data, and they were obtained while walking along our

scenario using the rear camera of an ASUS EEEPad Transformer TF-101 tablet. As the

camera was previously calibrated, we knew its intrinsic parameters (focal length and

aspect ratio). Results shown in Table 1 were obtained by performing the resection process

making use of the Exterior Orientation technique.

We have evaluated different camera resection parameters varying their values in order

to select the best configuration according to the characteristics of our environment. The

evaluated parameters included: (1) # Correspondences refers to the number of 2D↔3D

pairs used as input for the resection process; (2) Minimum Inliers is the minimum required

number of correspondences that must be considered as inliers to stop the RANSAC process (Early-Detection); (3) Iterations is the maximum number of repetitions that the RANSAC

process is executed to obtain the transformation matrix that minimises the reprojection

error; and (4) Reprojection Error represents the upper threshold to consider a 2D↔3D

correspondence as an inlier. The values tested for each parameter were selected

after carrying out additional experiments guided by the information obtained from vision-

based literature and the study of the goodness of the correspondences we are working with.


Table 1. Results of the resection process using Exterior Orientation and different configuration parameters. Each cell shows estimation error (cm) – successful camera resections (%). Columns are grouped by number of correspondences (40, 50, 60), each evaluated with a minimum of 10, 20 and 30 inliers (e.g. "40/10" denotes 40 correspondences and a minimum of 10 inliers).

| Iterations | Reproj. error (px) | 40/10 | 40/20 | 40/30 | 50/10 | 50/20 | 50/30 | 60/10 | 60/20 | 60/30 |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 4 | 19.5–100 | 10.11–96 | 9.0–90 | 25.2–100 | 11.60–97 | 9.6–90 | 39.9–99 | 15.6–96 | 9.4–90 |
| 100 | 5 | 14.4–100 (*) | 15.1–98 | 10.8–90 | 25.8–100 | 19.0–98 | 9.8–90 | 32.5–99 | 17.0–95 | 10.6–91 |
| 100 | 6 | 16.3–100 | 13.9–99 | 10.5–90 | 27.2–100 | 19.3–98 | 10.1–90 | 37.0–100 | 15.0–95 | 15.0–91 |
| 200 | 4 | 10.8–100 | 9.5–97 | 8.6–90 (**) | 20.0–100 | 17.9–99 | 10.1–90 | 29.1–99 | 13.6–98 | 9.2–90 |
| 200 | 5 | 14.6–100 | 12.3–99 | 10.7–90 | 15.9–100 | 17.3–99 | 9.7–90 | 18.1–99 | 22.0–97 | 10.9–92 |
| 200 | 6 | 14.0–100 | 12.9–100 | 10.0–90 | 15.9–100 | 12.0–100 | 10.2–90 | 32.5–100 | 17.6–98 | 12.4–92 |
| 300 | 4 | 12.4–100 | 10.4–97 | 9.1–90 | 15.3–100 | 11.5–99 | 10.2–90 | 19.3–100 | 14.4–98 | 10.0–90 |
| 300 | 5 | 14.1–100 | 11.9–99 | 10.6–90 | 11.6–100 | 15.3–100 | 9.5–90 | 23.4–100 | 13.2–98 | 11.0–92 |
| 300 | 6 | 12.3–100 | 11.6–100 | 9.9–90 | 14.7–100 | 16.1–100 | 10.0–90 | 19.8–100 | 13.0–99 | 12.5–93 |


As we can see, there are several sets of parameters providing, on average, an

estimation error under 15 cm. The choice of the most appropriate configuration depends on

the associated processing time, which depends on the number of iterations executed. For

example, we consider that the set of parameters that offers the best trade-off between

accuracy and performance is the one marked as (*) in Table 1. Although this option

entails a slight loss of accuracy in relation to other combinations, the time involved for the resection process is 127 ms on average, which represents a reduction of around 37% with respect to the more accurate options. Increasing the number of iterations to 200 or 300, the

execution time rises to 201 and 267ms, respectively. Moreover, this choice is less

restrictive than the one providing the best accuracy results: by requiring fewer inliers, we were able to perform a valid camera resection in all the tests, instead of the 90%

of success obtained when using the most accurate option (**). It is worth noting that

although this selected set of parameters works properly for our environment, it may differ

from one scenario to another.

Once we have selected the best configuration of parameters, we have also evaluated

how the number of descriptors in the search space influences on the accuracy and

performance of the resection process. We have reduced the number of descriptors by

filtering those defining the same 3D point and whose camera poses are closer than a

specified threshold (100 cm). This reduction is based on the idea that each 3D point is

formed by a set of SIFT descriptors, but some of them were obtained from nearby cameras,

thus introducing some redundancy to the matching process because of their visual

similarity. In this way, we have reduced the number of descriptors from 254,631 to 86,088.

Figure 3 shows a comparison of the accuracy obtained when we use all the descriptor or

only the filtered descriptors. As a conclusion, we can notice that accuracy is barely

decreased when a filtered set of descriptors is used. In those cases, the mean estimation

error is 16.36 cm; however, the required matching time (Figure 4) is reduced by 67% in relation to the whole set of descriptors.
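A possible simple reading of this filtering criterion is sketched below in Python: for each 3D point, keep only one descriptor per cluster of training cameras closer than the 100 cm threshold. The greedy strategy and the data layout (`point_descriptors` as a list of descriptor/camera-centre pairs) are assumptions made for the example:

```python
import numpy as np

def filter_redundant_descriptors(point_descriptors, min_camera_dist_cm=100.0):
    """For one 3D point, drop descriptors whose contributing camera poses are
    closer than `min_camera_dist_cm` to an already kept one (Section 7.3).
    `point_descriptors` is a hypothetical list of (descriptor_128d, camera_centre_cm)."""
    kept = []
    for desc, cam in point_descriptors:
        cam = np.asarray(cam, dtype=float)
        if all(np.linalg.norm(cam - kept_cam) >= min_camera_dist_cm
               for _, kept_cam in kept):
            kept.append((desc, cam))
    return [desc for desc, _ in kept]
```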

A zone-based search space (the multisensor approach) reduces the time required to carry out the matching process by 62%, as Figure 4 shows. In this case, the time needed to

complete the whole cycle, from sensor data acquisition to 3D estimation, is around 345 ms.

We have detailed the transmission time of sensor data from the smartphone, the required

time at the cloud to perform the SIFT extraction, to run the matching process and to apply

the resection technique. The time required to perform the RSSI analysis is 1.5 ms, so it has been omitted because of its negligible influence.

Figure 3. Accuracy results for each validation image (1–20) using different sets of descriptors (all of them or a filtered subset); the vertical axis shows the estimation error in cm.


However, we have also evaluated the use of an Early-Detection (ED) version of the

RANSAC process (Table 2). It is based on the idea that once the threshold of inliers has

been exceeded, it might not be necessary to execute the rest of RANSAC iterations

because the obtained camera matrix P might be good enough to estimate the camera pose

accurately. Despite its better performance, which may save up to 63% of processing time

as shown in Table 3, ED shows less accuracy than the complete version of RANSAC.

Table 2 shows the resection results obtained using Exterior Orientation and an ED version

of RANSAC. After analysing these results, we can conclude that despite the high accuracy

obtained with some configurations, there is a noticeable loss of reliability, because none of the configurations guarantees 100% of successful cases.

It is worth mentioning that in those cases where the intrinsic camera parameters are

unknown, we can make use of the DLT algorithm described in Section 6 to carry out the

resection process. A mean estimation error around 78 cm is obtained and the elapsed time

is similar to that obtained with Exterior Orientation. Therefore, our proposal is able to

address reasonably well even those situations where there is no previous knowledge of the

camera configuration. Figure 5 shows a comparison between DLT and Exterior

Orientation, in terms of accuracy.

8. Augmented reality prototype

Throughout the preceding sections, we have laid the foundations to design an efficient

system, which is able to provide accurate location estimations, using data from several

sensors and without requiring intrusive landmarks. We envision that these types of

systems are really useful to build augmented reality services requiring very precise

overlaid information. Although we can think of many mobile apps that do not need freshly

updated 3D position and orientation of the device at video input frame rate, many other

applications would certainly benefit from a location service providing continuous, always-on, real-time localisation. For example, an application providing information about where your friends are inside a particular building could be just based on a 2D map showing mobile

icons referring to your friends, but it would be very useful if it were also able to display

Figure 4. Performance (time in ms) using different alternatives to build the search space (all descriptors, filtered descriptors and zone-based), broken down into sensor transmission, SIFT extraction, matching and resection.


Table 2. Results of the resection process using Exterior Orientation (with the Early-Detection version of RANSAC) and different configuration parameters. Each cell shows estimation error (cm) – successful camera resections (%). Column headings follow the same convention as Table 1.

| Iterations | Reproj. error (px) | 40/10 | 40/20 | 40/30 | 50/10 | 50/20 | 50/30 | 60/10 | 60/20 | 60/30 |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 4 | 39.2–68 | 22.7–79 | 10.8–80 | 48.4–66 | 31.1–75 | 13.4–75 | 45.5–60 | 28.2–72 | 15.4–72 |
| 100 | 5 | 35.4–67 | 20.8–81 | 11.1–78 | 38.8–63 | 17.2–76 | 16.9–75 | 29.4–62 | 24.2–70 | 17.8–71 |
| 100 | 6 | 33.5–66 | 18.6–76 | 11.3–76 | 36.2–62 | 21.2–73 | 16.1–72 | 46.4–57 | 27.2–69 | 12.7–70 |
| 200 | 4 | 43.6–67 | 25.8–80 | 10.9–78 | 41.6–65 | 22.7–78 | 13.1–76 | 47.8–61 | 24.2–73 | 15.3–71 |
| 200 | 5 | 32.1–67 | 22.0–79 | 10.9–78 | 29.3–64 | 18.5–75 | 14.2–75 | 59.0–58 | 22.5–69 | 17.4–71 |
| 200 | 6 | 48.8–67 | 20.8–80 | 12.4–77 | 33.7–64 | 32.7–74 | 13.1–72 | 32.6–57 | 22.7–70 | 17.1–72 |
| 300 | 4 | 36.6–68 | 26.4–82 | 11.3–78 | 46.3–65 | 26.8–77 | 12.1–76 | 46.4–60 | 25.1–72 | 14.6–72 |
| 300 | 5 | 31.7–68 | 24.6–81 | 11.0–78 | 34.8–61 | 22.4–75 | 21.3–74 | 33.4–60 | 30.0–71 | 20.2–73 |
| 300 | 6 | 32.3–66 | 26.2–79 | 11.2–76 | 35.7–62 | 24.5–73 | 20.1–72 | 43.8–59 | 21.3–70 | 14.3–71 |


continuously updated augmented reality information on top of the video input captured by

the camera device as the user keeps on moving. Users could even switch from one

interface to another, depending on the kind of information they are interested in at that

moment.

Although, as stated in the previous experimental section, the response time of our

resection service in the cloud is typically limited to around 200 ms, this refresh rate is still

insufficient to fully cope with video input rates of mobile devices (25 frames per second in

typical cameras). Moreover, the robustness and overall behaviour of the application could

certainly suffer if the location system tried to update the location information by making

continuous queries to the cloud, due to both the communication overhead and the

occasionally unavoidable resectioning failures. To avoid such degradation of the user

experience, while also allowing a smooth and real-time image refreshment rate, we make

use of some additional sensors provided by the device, such as the accelerometer and the

gyroscope, which allow us to detect different types of motion and reduce the needed

communication with the cloud.

In order to test this hybrid approach, we developed a prototype app for Android

devices based on OpenGL ES 2.0 (see Note 3), which is able to display augmented reality based on

precise estimations of the full 6 dof: the $(X, Y, Z)$ position and the $(\alpha, \beta, \gamma)$ Euler angles of the rotation matrix. We display a wire-frame model on top of the real world to visually check

the accuracy of our estimations. In addition, we add virtual banners to the doors in order to

display information of interest. Figure 6 shows a snapshot of the prototype.

Table 3. Time required (ms) at each step of the resection process.

| Step | Non-ED | ED |
|---|---|---|
| RANSAC | 75.1 | 2.1 |
| Exterior Orientation | 2.6 | 2.8 |
| Optimisation | 41.7 | 39.6 |
| Total | 119.4 | 44.5 |

Figure 5. Comparison of DLT and Exterior Orientation in terms of estimation error (cm) for each validation image (1–20).


Every time a new rotation of the device is read from the sensors (typically at a 50 Hz rate

in most Android devices), the original rotation matrix estimated by the last successfully

resectioned image is modified accordingly, and the augmented reality is updated. The

camera centre C is not modified at all. Only when a translation movement is detected by

the accelerometer is the device forced to request a fully new estimation of position and

rotation from the cloud. The continuous update of the current device rotation is performed

using the gyroscope sensor according to the following formula:

$$R_{new}(t_{cur}) = R_{comp} \cdot R_{gyr}(t_{cur})^\top \cdot R_{gyr}(t_0) \cdot R_{comp} \cdot R_{cam}(t_0). \qquad (10)$$

Here, $R_{cam}(t_0)$ stands for the last rotation estimation provided by the cloud using strictly visual information, and is therefore only refreshed when the user clearly moves to a new location, as stated above. $R_{new}(t)$, on the contrary, is updated at the system's sensor reading rate using the most current gyroscope information. This provides continuous, smooth visual feedback to the user. The update is based on the fact that the matrix product $R_{gyr}(t_{cur})^\top \cdot R_{gyr}(t_0)$, with $R_{gyr}(t)$ the absolute rotation read by the gyroscope of the device at time $t$, gives at every moment the relative motion between the moment the picture corresponding to the last camera resection was taken ($t = t_0$) and the current instant of time ($t = t_{cur}$), assuming that the movement performed by the user is a pure rotation (i.e. there is no physical translation of the camera centre). Finally, the matrix $R_{comp} = \mathrm{diag}(1, -1, -1)$ is a simple diagonal matrix that just compensates for the different handedness of the camera and world coordinate systems.
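The following sketch transcribes Equation (10) and the projection matrix used for the overlay into Python/numpy for clarity; the prototype itself runs on Android with OpenGL ES, so this is only an illustration of the matrix algebra, not the app code:

```python
import numpy as np

R_COMP = np.diag([1.0, -1.0, -1.0])   # handedness compensation matrix of Eq. (10)

def updated_rotation(R_cam_t0, R_gyr_t0, R_gyr_tcur):
    """Hybrid visual/inertial rotation update of Eq. (10): compose the last
    visually estimated rotation with the relative gyroscope motion since t0."""
    return R_COMP @ R_gyr_tcur.T @ R_gyr_t0 @ R_COMP @ R_cam_t0

def ar_projection_matrix(K, R_cam_t0, R_gyr_t0, R_gyr_tcur, C_t0):
    """Projection used to overlay virtual structures:
    P = K R_new(t_cur) [I | -C(t_0)], with C fixed between cloud queries."""
    R_new = updated_rotation(R_cam_t0, R_gyr_t0, R_gyr_tcur)
    return K @ R_new @ np.hstack([np.eye(3), -C_t0.reshape(3, 1)])
```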

In fact, gyroscope readings measure angular velocities, which should be integrated in

time to obtain the corresponding real rotation matrices. Nevertheless, most Android systems do in fact integrate compass, gravity and gyroscope readings to get a virtual sensor

that provides a filtered output rotation which, although essentially based on the gyroscope

estimation, is certainly much more stable and prevents typical drifting of the gyroscope

estimation alone in the long term. In our system we use this second type of virtual sensor,

Figure 6. Snapshot of the Android AR app.


which is more robust in practice, although, given the typically short elapsed time between

successive visual estimations (a few seconds in real-application cases), the raw integrated

gyro sensor readings could also be used without the risk of drifting too much in the

estimation.

Given this freshly updated rotation, the final projection matrix, which is used at every

moment to perform the projection of virtual structures for augmented reality is

$P = K R_{new}(t_{cur}) [I \mid -C(t_0)]$, with the value of the camera position $C(t_0)$ as estimated in the last query to the resection service in the cloud, and the rotation matrix $R$ substituted by $R_{new}(t_{cur})$, computed as stated in Equation (10). Observe that this is a very fast operation that only depends on the most recent $R_{gyr}(t_{cur})$ reading, and can thus be performed at a very

high refreshment rate even in low-end mobile devices. At the end of the video mentioned

in Section 4, one can also observe the visually appealing effect accomplished by this

hybrid visual/inertial estimation in action.

9. Conclusions

As emphasised during the paper, the multisensor design is crucial to accomplish a robust

and scalable image-based location-based service. We have introduced an approach that not

only eliminates the need to discretise the environment at a given granularity, but also

improves the quality of the location estimations. We have presented how to build an

accurate 3D model of the environment ready to perform the desired full 6 dof accurate

localisation process for any new input image. In addition, the multisensor design provides

mechanisms to work in difficult scenarios such as those presenting similar areas from a

visual perspective (floors and rooms). In those cases, SIFT features can fail to

disambiguate, but other signals such as RSSIs can be crucial to perform an accurate

resection process.

We envision that our proposal constitutes a good starting point to develop augmented

reality applications, since locations are obtained with a mean error of 15 cm, and with the

additional bonus of estimating the rotation with a similar precision. Moreover, the mean

time required to complete an entire cycle is 345 ms, making this proposal able to support

applications requiring high accuracy and rapid response. An augmented reality prototype

was developed to visually check that our estimations are correct and that the integration of

additional sensors (like the gyroscope) is especially useful to incorporate the user's movements.

As a statement of direction, we are designing a new training process based on inertial

sensors in order to build 3D models in a more robust and unsupervised way. The main goal

of this training is to facilitate the fully automatic and more reliable reconstruction of large

indoor environments, since it has a direct influence on the system accuracy. Finally, we are also working on the use of a more powerful camera resection technique that takes advantage

of the information provided by inertial sensors available in smartphones. Hopefully, this

technique will allow us to use fewer 2D↔3D correspondences in the RANSAC procedure to obtain an equally valid solution for the camera position, a fact that we think will

improve both the robustness and the speed of the resection process. We expect these

improvements to be especially useful in less textured scenes also typically found in indoor

environments, such as aisles or rooms with bare walls and mostly without decorations.

Acknowledgements

This work was supported by the Seneca Foundation, the Science and Technology Agency of the Region of Murcia, under the Seneca Program 2009.


Notes

1. A short illustration video of the overall system, which also explains in more detail the 3D reconstruction process, is available at the URL http://www.youtube.com/watch?v=qycdlROrtXE.

2. http://qvision.sourceforge.net.
3. http://www.khronos.org/opengles/.

References

Arai, I., S. Horimi, and N. Nishio. 2010. "Wi-Foto 2: Heterogeneous Device Controller Using Wi-Fi Positioning and Template Matching." Paper presented at Pervasive, Helsinki, Finland, May 17–20.
Chandrasekaran, G., M. A. Ergin, J. Yang, S. Liu, Y. Chen, M. Gruteser, and R. P. Martin. 2009. "Empirical Evaluation of the Limits on Localization Using Signal Strength." Paper presented at SECON, Rome, Italy, June 22–26.
Fiore, P. 2001. "Efficient Linear Solution of Exterior Orientation." IEEE Transactions on Pattern Analysis and Machine Intelligence 23: 140–148.
Fischler, M. A., and R. C. Bolles. 1981. "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography." Communications of the ACM 24: 381–395.
Frauendorfer, F., C. Wu, J. M. Frahm, and M. Pollefeys. 2008. "Visual Word Based Location Recognition in 3D Models Using Distance Augmented Weighting." Paper presented at the 4th 3DPVT, Atlanta, GA, USA, June 18–20.
Garcia, V., E. Debreuve, and M. Barlaud. 2008. "Fast k Nearest Neighbor Search Using GPU." Paper presented at the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 24–26.
Golub, Gene H., and Charles F. Van Loan. 1996. Matrix Computations. Baltimore, MD: Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press.
Gower, J. C., and G. B. Dijksterhuis. 2004. Procrustes Problems. Oxford: Oxford University Press.
Hartley, R., and A. Zisserman. 2011. Multiple View Geometry in Computer Vision. 2nd ed. Cambridge: Cambridge University Press.
Hattori, K., R. Kimura, N. Nakajima, T. Fujii, Y. Kado, B. Zhang, T. Hazugawa, and K. Takadama. 2009. "Hybrid Indoor Location Estimation System Using Image Processing and WiFi Strength." Paper presented at WNIS, Washington, DC, USA, December 28–29.
Horn, R. A., and C. R. Johnson. 1991. Topics in Matrix Analysis. Cambridge: Cambridge University Press.
Lemelson, H., M. B. Kjaergaard, R. Hansen, and T. King. 2009. "Error Estimation for Indoor 802.11 Location Fingerprinting." Paper presented at LoCA, Tokyo, Japan, May 7–8.
Li, X., and J. Wang. 2012. "Image Matching Techniques for Vision-Based Indoor Navigation Systems: Performance Analysis for 3D Map Based Approach." Paper presented at IPIN, Sydney, Australia, November 13–15.
Lowe, D. G. 2004. "Distinctive Image Features from Scale-Invariant Keypoints." International Journal of Computer Vision 60: 91–110.
Mulloni, A., D. Wagner, I. Barakonyi, and D. Schmalstieg. 2009. "Indoor Positioning and Navigation with Camera Phones." IEEE Pervasive Computing 8: 22–31.
Murillo, A., J. Guerrero, and C. Sagues. 2007. "SURF Features for Efficient Robot Localization with Omnidirectional Images." Paper presented at the IEEE International Conference on Robotics and Automation, Rome, Italy, April 10–14.
Nocedal, J., and S. Wright. 1999. Numerical Optimization. New York: Springer.
Ruiz-Ruiz, A. J., O. Canovas, R. A. Rubio, and P. E. Lopez-de-Teruel. 2011. "Using SIFT and WiFi Signals to Provide Location-Based Services for Smartphones." Paper presented at MobiQuitous, Copenhagen, Denmark, December 6–9.
Ruiz-Ruiz, A. J., P. E. Lopez-de-Teruel, and O. Canovas. 2012. "A Multisensor LBS Using SIFT-Based 3D Models." Paper presented at IPIN, Sydney, Australia, November 13–15.
Se, S., D. G. Lowe, and J. J. Little. 2005. "Vision Based Global Localization and Mapping for Mobile Robots." IEEE Transactions on Robotics 21 (3): 364–375.


Triggs, B., P. F. Mclauchlan, R. I. Hartley, and A. W. Fitzgibbon. 2000. "Bundle Adjustment – A Modern Synthesis." Lecture Notes in Computer Science: 298–372.
Wu, C. 2001. "VisualSFM: A Visual Structure from Motion System." Accessed June 5, 2013. http://www.cs.washington.edu/homes/ccwu/vsfm/
Wu, C. 2007. "SiftGPU: A GPU Implementation of Scale Invariant Feature Transform (SIFT)." Accessed June 5, 2013. http://cs.unc.edu/~ccwu/siftgpu
