a vision-enhanced multisensor lbs suitable for augmented reality applications
This article was downloaded by: [North Dakota State University] on: 28 October 2014, at: 04:21. Publisher: Taylor & Francis. Informa Ltd, registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.
Journal of Location Based Services. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tlbs20
A vision-enhanced multisensor LBS suitable for augmented reality applications. Antonio J. Ruiz-Ruiz, Oscar Canovas & Pedro E. Lopez-de-Teruel, Department of Computer Engineering, University of Murcia, Murcia, Spain. Published online: 15 Jul 2013.
To cite this article: Antonio J. Ruiz-Ruiz, Oscar Canovas & Pedro E. Lopez-de-Teruel (2013) A vision-enhanced multisensor LBS suitable for augmented reality applications, Journal of Location Based Services, 7:3, 145-164, DOI: 10.1080/17489725.2013.807026
To link to this article: http://dx.doi.org/10.1080/17489725.2013.807026
A vision-enhanced multisensor LBS suitable for augmented realityapplications
Antonio J. Ruiz-Ruiz*, Oscar Canovas and Pedro E. Lopez-de-Teruel
Department of Computer Engineering, University of Murcia, Murcia, Spain
(Received 9 March 2013; final version received 18 April 2013; accepted 7 May 2013)
This paper presents an augmented reality system for indoor environments, which does not require special tagging or intrusive landmarks. We have designed a localisation architecture that acquires data from different sensors available in commodity smartphones to provide location estimations. The accuracy of those estimations is very high, which enables the information overlay on a camera-phone display image. During a training phase, we used visual structure from motion techniques to run offline 3D reconstructions of the environment from the correspondences among the scale invariant feature transform descriptors of the training images. To determine the position of the smartphones, we first obtained a coarse-grained estimation based on WiFi signals, digital compasses and built-in accelerometers, making use of fingerprinting methods, probabilistic techniques and motion estimators. Then, using images captured by the camera, we performed a matching process to determine correspondences between 2D pixels and model 3D points, but only analysing a subset of the 3D model delimited by the coarse-grained estimation. This multisensor approach achieves a good balance between accuracy and performance. Finally, a resection process was implemented providing high localisation accuracy when the camera has been previously calibrated, that is, we know intrinsic parameters such as focal length, but it is also accurate if an auto-calibration process is required. Our experimental tests showed that this proposal is suitable for applications combining location-based services and augmented reality, since we are able to provide high accuracy with an average error down to 15 cm in <0.5 s of response time.
Keywords: augmented reality; image processing; multisensor; SIFT; smartphones
1. Introduction
Several augmented reality applications do not require a continuous real-time response,
such as those designed to display location-aware reminders or notes, to obtain information
about who is behind the door of a particular laboratory or to visualise additional objects of
interest virtually added to the scene. However, those applications require a good trade-off
between fine-grained accuracy and an acceptable response time.
The wide adoption of smartphones enables new scenarios to offer overlay information
and augmented reality to an increasing number of users. Smartphones are equipped with
several sensors, which have simplified the process of obtaining the required context
information. For a diverse set of areas including tracking, geographic routing or
© 2013 Taylor & Francis
*Corresponding author. Email: [email protected]
Journal of Location Based Services, 2013
Vol. 7, No. 3, 145–164, http://dx.doi.org/10.1080/17489725.2013.807026
entertainment, location-sensing systems are currently a very active research field. There
are several widespread applications that make use of the images that users obtain from the
camera to display augmented reality. Indeed, those images can be used to provide a better
estimation of the users’ position, offering a seamless integration of the sensor and the
display. We presented a previous version of this work at IPIN, and in this article,
we provide additional details about the suitability of this proposal for developing augmented
reality applications. We have designed a location-based service (LBS) based on images,
which provides not only very fine-grained position estimations (up to a few cm of
maximum error) but also an equally precise estimation of the absolute orientation of the
imaging device. Therefore, hereinafter, we assume that the users are capturing images,
typically with a tablet or a smartphone, for which they want to obtain a precise
estimation of the full 6 degrees of freedom (dof) – three for the (X, Y, Z) position and three
additional for the (α, β, γ) Euler angles of the rotation matrix – of the device with
respect to some conveniently chosen world reference coordinate system. In this article, we
focus on how to develop augmented reality applications for indoor environments using
data from multiple sensors.
Image processing is a very time-consuming task, so we decided to follow a multisensor
design to limit the amount of analysed images. To determine the position of the
smartphones, we first obtained a coarse-grained estimation based on WiFi signals, digital
compasses and built-in accelerometers, making use of fingerprinting methods, probabilistic
techniques and motion estimators. Then, using images captured by the camera, we
performed a matching process to determine correspondences between 2D pixels and model
3D points, but only considering a subset of the 3D model delimited by the coarse-grained
estimation. The keystone of our approach is the use of the scale invariant feature transform
(SIFT) to run offline 3D map reconstructions of the environment from the features extracted
from training images. Several resection techniques were also used to estimate the full 6 dof,
depending on whether we know some intrinsic camera parameters, such as the focal length.
The system can currently complete an entire cycle, from obtaining data from sensors to
displaying augmented reality based on accurate estimations, in <400 ms.
2. Related work
Indoor positioning is an active research field. Many types of signals (radio, images and
sound) and methods have been used to infer location.
Paying attention to radio signals, pattern recognition methods, among many others,
estimate locations by recognising position-related patterns. Fingerprinting is based on
radio maps containing received signal strength indicator (RSSI) patterns, which are
obtained using 802.11, Zigbee, Bluetooth or any other widespread wireless technology.
The analysis of RSSI patterns is a technique that has been examined by several authors
(Chandrasekaran et al. 2009), obtaining an accuracy ranging from 0.5 to 3 m.
However, better results can be obtained by integrating the information captured by
multiple sensors. For example, Hattori et al. (2009) and Mulloni et al. (2009) used WiFi
signals and image recognition to estimate positions. However, they are based on 2D
landmarks that must be placed in the scenario of interest, which involves the inclusion of
obtrusive elements.
Arai, Horimi, and Nishio (2010) performed a preliminary analysis of how image
processing techniques such as SIFT (Lowe 2004) can be used to improve the accuracy in
landmark-free location-based systems. SIFT features provide a powerful mechanism to
perform robust matching between different images of a given scene, which is fundamental
not only for visual recognition of objects or places (Frauendorfer et al. 2008), but also –
and what is more relevant for this work – in the context of simultaneous localisation and
3D scene reconstruction (Se, Lowe, and Little 2005).
Alternatively, some other works on localisation have used other kinds of features and
associated local image descriptors instead of SIFT, such as speeded up robust features
(Murillo, Guerrero, and Sagues 2007) or affine-SIFT (Li and Wang 2012). Nevertheless,
the goal is always essentially the same: to extract a collection of visual features from input
images, which are reasonably invariant to changes such as translation, scaling, rotation and
illumination, and which can be used to perform matching against a database of features
extracted from a large collection of images previously obtained from the application
environment.
As we show, our system goes further: we build a full 3D model containing
hundreds of thousands of 3D points obtained using these features and state-of-the-art
structure from motion (SfM) computer vision techniques (Hartley and Zisserman 2011).
The computation of this 3D model not only allows us to provide much higher
localisation accuracy, but also reduces the time required by the offline training process.
3. A multisensor design
In recent years, the wide adoption of smartphones equipped with several sensors has
simplified the process of obtaining context information. Although our solution uses the
images from the camera as the main source of information to infer accurate positions, we
can take advantage of other sensors to limit the amount of information to be examined,
to simplify the training process or to reduce the computational load of the smartphones.
We believe that this multisensor approach is crucial to accomplish a feasible image-based
location-based service.
The main drawback of most indoor positioning systems is the scalability issues
that arise when modelling large scenarios. However, several solutions can be envisioned
following a multisensor approach to reduce the necessary time effort and the required
processing during the online phase.
Our solution follows a methodology based on a continuous video recording and data
acquisition process. Then, using some fingerprinting techniques during the training phase,
we divide the scenario into different zones according to the characterisation of the 802.11
signals. As a consequence, we can obtain clusters of physical points where the signals show a
similar behaviour, and these clusters will be used to partition the 3D model into different
submodels (Ruiz-Ruiz et al. 2011). In addition, while we are recording the training video, we
also save information about the current orientation of the training device using the data
provided by the built-in digital compass, so that the different video frames are tagged
with their corresponding orientation.
During the online phase, we determine the coarse-grained geographical zone in which
the device is located using the obtained RSSIs. That zone is related to different 3D
submodels, and one of them is selected according to the orientation of the device to find
correspondences between the features of the current image and those contained in the
submodel. Consequently, we reduce the amount of information (image features) to check
against during the online phase, improving performance.
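As a rough illustration of this online zone-selection step, the Python sketch below picks the RSSI cluster whose fingerprint is closest to a live scan, and then the submodel whose tagged orientation best matches the compass heading. All names, fingerprint layouts and values here are hypothetical, not taken from the paper's implementation.

```python
import math

def nearest_zone(scan, zone_fingerprints):
    """scan / fingerprints: dict mapping AP identifier -> mean RSSI (dBm)."""
    def distance(fp):
        shared = set(scan) & set(fp)
        if not shared:
            return float("inf")
        return math.sqrt(sum((scan[m] - fp[m]) ** 2 for m in shared) / len(shared))
    return min(zone_fingerprints, key=lambda z: distance(zone_fingerprints[z]))

def nearest_submodel(heading_deg, submodel_headings):
    """Pick the submodel whose tagged orientation is closest (circular distance)."""
    def ang(a, b):
        d = abs(a - b) % 360
        return min(d, 360 - d)
    return min(submodel_headings, key=lambda s: ang(heading_deg, submodel_headings[s]))

# Invented training data: two zones, two orientation-tagged submodels.
zones = {
    "zone-A": {"aa:01": -48, "aa:02": -71},
    "zone-B": {"aa:01": -80, "aa:02": -50},
}
submodels = {"A-north": 0.0, "A-east": 90.0}

zone = nearest_zone({"aa:01": -50, "aa:02": -69}, zones)   # live scan
sub = nearest_submodel(75.0, submodels)                    # compass heading
```

The real system would of course use the probabilistic fingerprinting techniques cited in the text rather than this plain Euclidean distance.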
However, the information from the different sensors is also used to reduce the
computational cost on the smartphone, to minimise the amount of information transmitted
and therefore, to preserve the smartphone battery life. For example, we have developed a
motion estimator based on accelerometer readings to determine whether the user location
has to be recalculated because the user is moving fast. Our augmented reality applications,
apart from making use of our localisation service, rely on the information provided by the
gyroscope to adapt the overlaid information to the device’s rotations.
4. Generation of 3D models based on SIFT
Each SIFT feature is characterised by its position in the image and its natural size and
dominant orientation. It is also attached to a highly distinctive 128D description vector,
which is computed from an image patch in the corresponding position, scale and
orientation (Lowe 2004). By matching these descriptors against a geo-tagged database of
images previously extracted from the application environment, we were already able to
compute a good approximation of the current location in a previous work (Ruiz-Ruiz et al.
2011). In that work, the images in the database were captured in a discretised grid of the
environment in all plausible directions of movement, so the precision was clearly limited
by both the resolution of the grid and the set of orientations used during training.
Furthermore, the training phase was a cumbersome manual process, which could become
clearly a problem when modelling large scenarios such as a university campus or an
airport.
To overcome these issues, instead of just storing features in the database along with the
location from which they were taken, in this work, we obtain an accurate 3D model of the
environment, which is built by getting the 3D points in the scene that generated the
corresponding image features. In order to do that, we combine the use of SIFT features
with state-of-the-art techniques in 3D computer vision, which, using the now well-established
field of applied projective geometry (Hartley and Zisserman 2011), are able to
fully estimate a 3D reconstruction of the scene using only a sufficiently large number of
images taken from different viewpoints. A brief overview of this reconstruction process,
known generically as SfM, is given in Figure 1, using our own test environment as the
illustrating example.1 The input is a set of images extracted from a video recorded during
the training phase, trying to capture all the details of the scene. The output is an accurate
3D reconstruction, which includes both the 3D points and the full 6 dof position of all the
cameras.
5. Matching
Once we have obtained the 3D representation of our scenario, we need a mechanism to
look for coincidences between the features extracted from the 2D images captured during
the online phase and those features that made up the 3D model. Several techniques based
on nearest neighbour algorithms to perform the matching process efficiently were analysed
in previous works (Ruiz-Ruiz, Lopez-de-Teruel, and Canovas 2012).
As we can take advantage of our multisensor design, we divide our space search
according to the different zones determined after analysing additional sensors (RSSI
signals and orientations). Therefore, the amount of features that we have to compare against
is considerably reduced. The moderately small size of an individual zone in our case
(<75K descriptors) led us to test a brute force algorithm. Keeping in mind that the
matching process will be usually performed by the cloud, we made use of an exhaustive
exact nearest neighbour (ENN) implementation of the brute force search algorithm
running on a graphics processing unit (GPU). To take advantage of the enormous power of
these relatively inexpensive computing platforms, we slightly modified the algorithm
proposed in Garcia, Debreuve, and Barlaud (2008) to adapt its functionality to our zone-
based space search. In addition to better accuracy, the use of this algorithm resulted in
increased matching performance compared with the previously analysed techniques based
on nearest neighbour search.
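The brute-force search itself is easy to sketch on the CPU with numpy (the paper runs it on a GPU; the descriptors below are random stand-ins for a zone's submodel and for a query image):

```python
import numpy as np

# Exhaustive exact nearest-neighbour (ENN) matching between query SIFT
# descriptors and the descriptors of one zone's submodel. Synthetic data:
# the two query rows are noisy copies of known submodel rows.
rng = np.random.default_rng(0)
zone_desc = rng.random((1000, 128)).astype(np.float32)    # submodel descriptors
query_desc = zone_desc[[10, 500]] + 0.01 * rng.random((2, 128)).astype(np.float32)

# All pairwise squared distances at once via (a - b)^2 = a^2 - 2ab + b^2.
d2 = (
    (query_desc ** 2).sum(1)[:, None]
    - 2.0 * query_desc @ zone_desc.T
    + (zone_desc ** 2).sum(1)[None, :]
)
nearest = d2.argmin(axis=1)     # exact NN index for each query descriptor
```

The GPU version cited in the text parallelises exactly this distance matrix computation over the zone's descriptor set.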
6. Camera resection
Camera resection (sometimes generically referred to as camera calibration) is the process
by which we can determine the $P_{3\times4}$ projection matrix from a given set of matching 3D
scene points and the corresponding 2D pixels in an image. This matrix defines the
algebraic relationship $PX_i = x_i$ for every $X_i \leftrightarrow x_i$ correspondence when both the pixels $x_i$
and the 3D points $X_i$ are given in homogeneous coordinates (Hartley and Zisserman
2011). $P$ can be factorised as $P = K_{3\times3} R_{3\times3} [I \mid -C]_{3\times4}$, in a way that the internally
encoded relationship with the rotation $R$ and the 3D position $C = (c_x, c_y, c_z)^\top$ of the camera in
the reference 3D coordinate system is made explicit. This factorisation depends on the
camera intrinsic calibration matrix $K$, which contains internal camera parameters such as
the focal length and the aspect ratio (e.g. a very simple calibration matrix which is
commonly used is $K = \mathrm{diag}(f, f, 1)$, with $f$ the focal length in pixel units). In this section,
Figure 1. Illustration of the SfM technique. Pipeline stages: (0) input images; (1) SIFT feature
extraction and description; (2) pairwise 2D matching; (3) stereo 3D reconstructions; (4) matching
matrix for every pair of views; (5) growing the 3D structure by resectioning additional views and
triangulating new points; (6) global (refined) 3D structure and motion reconstruction; (7) final
textured reconstruction (not used). Input images are preprocessed by extracting and matching SIFT
features among them to get relevant stereo pairs from which the 3D reconstruction can be
bootstrapped. These partial reconstructions are then augmented with new cameras by an incremental
resectioning/triangulation procedure. The final step, known as bundle adjustment (Triggs et al.
2000), is a global nonlinear optimisation, which minimises the reprojection error of all the 3D points
to their corresponding 2D image projections. All figures obtained using Changchang Wu's
VisualSfM software (Wu 2011) on our test environment.
we detail the entire resectioning process that we follow in this work, which is summarised
in Algorithm 1.
The overall process starts by running a robust implementation of the so-called direct
linear transformation (DLT) algorithm (Hartley and Zisserman 2011), which allows us to
obtain the camera matrix P that minimises the reprojection error for the maximum number
of inlier correspondences. In this context, inliers are matching pairs of 3D↔2D points
which can be fitted to the model up to a given error tolerance. Then, if the camera had been
precalibrated (i.e. we knew in advance the calibration matrix K), we would discard P and
use only the obtained inlier set to get the desired camera pose (R, C) through an exterior
orientation algorithm. Otherwise, we can still perform a camera autocalibration using the
previously estimated P to solve the factorisation problem by carrying out the simple and
well-known matrix algebra operation of QR decomposition (Golub and Van Loan 1996).
These two alternatives give us the possibility to treat both the cases of knowing the camera
internal parameters in advance (feasible for most common devices), as well as the
completely uncalibrated case, which is solved at the price of a slightly lesser precision.
The final pose is obtained by refining the former solution using some standard nonlinear
optimisation procedure (Nocedal and Wright 1999).
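The factorisation $P = K R [I \mid -C]$ and the projection $PX_i = x_i$ can be checked numerically. The sketch below builds $P$ for a made-up calibrated camera and projects one homogeneous scene point; all numbers are illustrative.

```python
import numpy as np

f = 800.0                               # focal length in pixels (made up)
K = np.diag([f, f, 1.0])                # simple calibration K = diag(f, f, 1)
R = np.eye(3)                           # camera aligned with the world axes
C = np.array([1.0, 2.0, 3.0])           # camera centre in world coordinates

P = K @ R @ np.hstack([np.eye(3), -C[:, None]])   # P = K R [I | -C], 3x4

X = np.append(C + [0.5, 0.25, 5.0], 1.0)          # homogeneous scene point
x = P @ X                                          # homogeneous pixel
pixel = x[:2] / x[2]                               # -> (80.0, 40.0)
```

Resection is the inverse problem: given several such $X_i \leftrightarrow x_i$ pairs, recover $P$ (and hence $K$, $R$, $C$).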
6.1. Direct linear transformation
The P matrix that we must first estimate is a homogeneous entity with 12 elements, so
that it has, in general, 11 dof. Each 3D↔2D correspondence $(x, y, z) \leftrightarrow (q, r)$ sets up the
homogeneous restriction $P(x, y, z, 1)^\top = \lambda (q, r, 1)^\top$, which is equivalent to
$(q, r, 1)^\top \times P(x, y, z, 1)^\top = (0, 0, 0)^\top$. Namely,
$$\begin{bmatrix} 0^\top & -X_i^\top & r_i X_i^\top \\ X_i^\top & 0^\top & -q_i X_i^\top \end{bmatrix} \begin{bmatrix} P^1 \\ P^2 \\ P^3 \end{bmatrix} = 0, \qquad (1)$$
where $X_i = (x_i, y_i, z_i, 1)^\top$ and $x_i = (q_i, r_i, 1)^\top$, $\forall i$, are the homogeneous coordinates of the
corresponding pair of points, and $P^k = (p_{4k-3}, p_{4k-2}, p_{4k-1}, p_{4k})^\top$ for $k = 1, 2, 3$ are the
three rows of P. Taking into account that we must solve for 11 unknowns (the dof of P), and
since each 2D↔3D correspondence gives rise to two independent equations over the
elements of P, it is necessary to make use of a minimum of 6 of these correspondences to
solve the complete system. From the solution for the 12 unknowns $p_1 \cdots p_{12}$, we can
Algorithm 1. Resection algorithm

1: Require: 2D↔3D correspondences
2: P, inliers ← DLT(correspondences)  ▷ RANSAC
3: if camera intrinsics known then  ▷ Precalibrated case
4:   K, R, C ← ExteriorOrientation(K_fixed, inliers)
5: else  ▷ Autocalibration
6:   K, R, C ← QRDecomposition(P)
7: end if
8: R_opt, C_opt ← NonLinearOptimization(K, R, C)
9: Ensure: R_opt and C_opt (camera pose in world coordinates)
easily reconstruct a canonical version of the homogeneous matrix P:

$$P = \begin{bmatrix} p_1/p_{12} & p_2/p_{12} & p_3/p_{12} & p_4/p_{12} \\ p_5/p_{12} & p_6/p_{12} & p_7/p_{12} & p_8/p_{12} \\ p_9/p_{12} & p_{10}/p_{12} & p_{11}/p_{12} & 1 \end{bmatrix}. \qquad (2)$$
The former procedure can be easily adapted to the overdetermined case (more than six
correspondences). As points are measured inexactly due to noise, some of these
correspondences may not be fully compatible with any projective transformation;
therefore, the best possible P has to be determined, which is found by minimising the
reprojection error in the least-squares sense.
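A minimal, non-robust DLT along the lines of Equations (1) and (2) fits in a few lines of numpy; the calibrated camera and random points below are invented for the round-trip check.

```python
import numpy as np

def dlt(X, x):
    """X: (n, 3) scene points, x: (n, 2) pixels -> 3x4 camera matrix P."""
    A = []
    for Xw, (q, r) in zip(X, x):
        Xh = np.append(Xw, 1.0)                        # homogeneous X_i
        A.append(np.concatenate([np.zeros(4), -Xh, r * Xh]))
        A.append(np.concatenate([Xh, np.zeros(4), -q * Xh]))
    # least-squares solution: smallest right singular vector of A
    return np.linalg.svd(np.asarray(A))[2][-1].reshape(3, 4)

# Round-trip check with an invented calibrated camera and random points.
f = 500.0
K = np.diag([f, f, 1.0])
C = np.array([0.0, 0.0, -5.0])
P_true = K @ np.hstack([np.eye(3), -C[:, None]])       # R = I for brevity

rng = np.random.default_rng(1)
X = rng.random((10, 3)) * 4.0                          # non-coplanar points
xh = (P_true @ np.vstack([X.T, np.ones(10)])).T
pix = xh[:, :2] / xh[:, 2:3]

P_est = dlt(X, pix)
xh2 = (P_est @ np.vstack([X.T, np.ones(10)])).T
err = np.abs(xh2[:, :2] / xh2[:, 2:3] - pix).max()     # ~0 on noise-free data
```

On real, noisy correspondences this estimate is only the starting point; the robust RANSAC wrapper described next filters the outliers.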
The SIFT matching process frequently ‘contaminates’ the set of real correspondences
with some subset of outliers (wrong correspondences), which would seriously distort the
obtained solution if not adequately filtered. To solve this problem, a robust estimation
procedure is needed, and in this work, we use the well-known RANSAC algorithm
(Fischler and Bolles 1981). This algorithm works by finding the best transformation matrix
that minimises the reprojection error taking into account only inlier (i.e. correct)
correspondences. To filter the outlier (incorrect) correspondences, RANSAC iteratively
selects a random subset of six elements from the original set of correspondences, from
which it computes a tentative P. Because the number of available correspondences is
notably higher than this required minimum of six, at each RANSAC iteration, we perform
an incremental process that iteratively adds the correspondences identified as inliers and
recalculates P until no more inliers are detected. One correspondence is identified as an
inlier if its reprojection error does not exceed a specified threshold. This iterative
procedure, which will try several tentative P matrices before succeeding, completes when
it either obtains a sufficient percentage of inliers over the total number of input
correspondences or the number of iterations exceeds a given threshold. In Section 7.3, we
show the influence of all these parameters on both the accuracy and performance of the
results.
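To keep the sketch self-contained, the following Python shows the same RANSAC scheme (random minimal sample, consensus set, incremental refit until the inlier set stabilises) on a toy 2D line-fitting model; in the paper the model is the camera matrix P and the minimal sample contains six correspondences instead of two points. All parameter values here are illustrative.

```python
import numpy as np

def fit(pts):                       # least-squares line y = a*x + b
    A = np.column_stack([pts[:, 0], np.ones(len(pts))])
    return np.linalg.lstsq(A, pts[:, 1], rcond=None)[0]

def residuals(model, pts):
    a, b = model
    return np.abs(a * pts[:, 0] + b - pts[:, 1])

def ransac(pts, n_min=2, thresh=0.1, min_inlier_ratio=0.8, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(pts), dtype=bool)
    for _ in range(max_iter):
        sample = pts[rng.choice(len(pts), size=n_min, replace=False)]
        inliers = residuals(fit(sample), pts) < thresh
        for _ in range(10):         # incremental consensus refit rounds
            model = fit(pts[inliers])
            new = residuals(model, pts) < thresh
            if (new == inliers).all():
                break
            inliers = new
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
        if best_inliers.mean() >= min_inlier_ratio:
            break                   # enough consensus: stop early
    return best_model, best_inliers

# Synthetic data: 50 points on y = 2x + 1, the first 10 grossly corrupted.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
pts = np.column_stack([x, 2.0 * x + 1.0 + 0.01 * rng.standard_normal(50)])
pts[:10, 1] += 5.0
model, inliers = ransac(pts)
```

The early-exit threshold and the inner refit loop mirror the stopping conditions described in the text.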
The K, R and C values can be easily obtained from the QR decomposition of the first
three columns of P. Therefore, at least theoretically, this generic procedure could be
applied without a previous knowledge of the calibration matrix of the camera. However,
the main problem with this approach is that the solution values that we obtain (mainly K
and C) can be severely coupled, that is, they are not mutually independent of each other.
Particularly, in some ill-conditioned cases such as scenes with a dominant plane (a not so
uncommon case in typical indoor scenarios, where large walls can cover most of the
image), the autocalibration process can fail to give an accurate solution. As Section 3
shows, the QR decomposition can still be a good alternative when the camera intrinsic
parameters are completely unknown, except under the aforementioned ill-posed
conditions. In the following section, we show how this a priori knowledge of calibration
largely improves the quality of the results.
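A sketch of this autocalibration route in numpy, which provides QR only, so the needed RQ decomposition is emulated by flipping rows and columns; the camera below is invented for the round-trip test.

```python
import numpy as np

def factor_camera(P):
    """Split P = K R [I | -C]: RQ on the left 3x3 block, then the centre."""
    M = P[:, :3]
    Q, R_ = np.linalg.qr(np.flipud(M).T)     # QR of the "flipped" matrix
    K = np.flipud(R_.T)[:, ::-1]             # upper-triangular factor
    R = np.flipud(Q.T)                       # orthogonal factor, M = K R
    S = np.diag(np.sign(np.diag(K)))         # force positive focal lengths
    K, R = K @ S, S @ R                      # S*S = I keeps K R = M
    C = -np.linalg.solve(M, P[:, 3])         # camera centre: P (C,1)^T = 0
    return K / K[2, 2], R, C

# Round-trip check with a made-up calibrated camera.
theta = 0.1
R_true = np.array([[np.cos(theta), 0, np.sin(theta)],
                   [0, 1, 0],
                   [-np.sin(theta), 0, np.cos(theta)]])
K_true = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
C_true = np.array([1.0, -2.0, 0.5])
P = K_true @ R_true @ np.hstack([np.eye(3), -C_true[:, None]])

K, R, C = factor_camera(P)
```

With noise-free, well-conditioned data the factors are recovered exactly; the coupling and planar-scene failure modes discussed above show up only when P itself is poorly estimated.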
6.2. Exterior orientation
In many cases, the on-board camera specifications (mainly focal length and aspect ratio) of
typical smartphones and tablets are known, so the calibration matrix Kfixed can be
determined. Fiore (2001) developed an alternative linear method to obtain only the
exterior orientation of the camera, which, using this knowledge of the intrinsic parameters,
results in a much more precise estimation of R and C. Moreover, it performs remarkably
well even in the planar case, which poorly conditions the problem for the DLT algorithm,
as commented in the previous section. Fiore's algorithm works in two stages: first, it
estimates the unknown depth $z_i$ of each homogeneous 2D point $x_i$, that is, we first get an
intermediate set of (inhomogeneous) 3D points $z_i K_{fixed}^{-1} x_i$; what is then left is an
absolute 3D orientation problem, whose solution yields the desired values of R and C.
More specifically, given a number of input 2D↔3D point correspondences, $x_i \leftrightarrow X_i$,
and the intrinsic camera matrix $K = K_{fixed}$, we are required to find a rotation R, a vector C
(attitude and position of the camera) and a subsidiary depth vector $z = (z_1, \ldots, z_i, \ldots, z_n)^\top$
(where n is the number of correspondences) such that:

$$z_i x_i = K R [I \mid -C] X_i \iff z_i K^{-1} x_i = R [I \mid -C] X_i, \quad \forall i. \qquad (3)$$
In order to solve for the unknown values of $z_i$, we first compose the matrix M, whose
columns are formed with the homogeneous coordinates of the 3D points,

$$M_{4\times n} = [X_1 \cdots X_n], \qquad (4)$$
and another matrix N, arranging in this case the homogeneous coordinates $x_i = (q_i, r_i, 1)^\top$
of the 2D points in the following way:

$$N_{3n\times n} = \begin{bmatrix} x_1 & 0 & \cdots & 0 \\ 0 & x_2 & \cdots & \vdots \\ \vdots & & \ddots & 0 \\ 0 & 0 & \cdots & x_n \end{bmatrix}. \qquad (5)$$
Then, taking the singular value decomposition (SVD) (Golub and Van Loan 1996) of
M, we obtain the factorisation $M = U_{4\times4} D_{4\times n} V^\top_{n\times n}$, where U and V are orthogonal and D is
diagonal. Being k the rank of M, we denote by $V_2$ the submatrix of size $n \times (n-k)$ which
results from taking only the last $n-k$ columns of V. This submatrix spans the null space
of M, so that $MV_2 = 0$.
Given the former definitions, it can be shown (Fiore 2001) that the depth vector z can
be recovered linearly (up to a scale factor), just by solving the following null-space
problem:

$$\left( (V_2^\top \otimes K^{-1})\, N \right) z = 0, \qquad (6)$$
where the $\otimes$ symbol stands for the Kronecker product (Horn and Johnson 1991) of
matrices. Once the depths have been obtained, our initial Equation (3) is reduced to
finding the correct alignment of two sets of 3D points, an instance of the well-known
absolute 3D orientation problem.
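Equation (6) can be exercised end-to-end on synthetic data. The sketch below follows the construction of M, N and $V_2$ literally (identity attitude and made-up intrinsics, chosen only for brevity) and recovers the depths up to the expected scale factor.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
f = 600.0
K = np.diag([f, f, 1.0])
R = np.eye(3)                            # identity attitude for simplicity
C = np.array([0.0, 0.0, -10.0])

X = rng.random((n, 3)) * 2.0             # scene points (world frame)
cam = (X - C) @ R.T                      # points in the camera frame
z_true = cam[:, 2]                       # true depths
x = (cam @ K.T) / z_true[:, None]        # homogeneous pixels (q, r, 1)

M = np.vstack([X.T, np.ones(n)])         # 4 x n, columns are homogeneous X_i
V2 = np.linalg.svd(M)[2][4:].T           # n x (n-4): null space of M

N = np.zeros((3 * n, n))                 # block-diagonal pixel matrix, Eq. (5)
for i in range(n):
    N[3 * i:3 * i + 3, i] = x[i]

A = np.kron(V2.T, np.linalg.inv(K)) @ N  # (V2^T (x) K^-1) N, Eq. (6)
z = np.linalg.svd(A)[2][-1]              # null vector: depths up to scale
z = z / z[0] * z_true[0]                 # fix the arbitrary scale
```

The recovered vector matches the true depths, confirming that the true $z$ lies in the null space of the Equation (6) operator.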
This last stage of the exterior orientation algorithm proceeds as follows. Let $W_i$ be the
depth-recovered 3D points, $W_i = (z_i(q_i/f),\, z_i(r_i/f),\, z_i)^\top$ (where we have used the
$K_{fixed} = \mathrm{diag}(f, f, 1)$ assumption), and $Y_i$ the likewise inhomogeneous original 3D points,
$Y_i = (x_i, y_i, z_i)^\top$. Then, given these two sets of 3D points $W_i$ and $Y_i$, we are required to find
the rotation matrix R, the vector C and the scalar s, such that:

$$W_i = s(R Y_i + C), \quad \forall i = 1 \cdots n. \qquad (7)$$

Now, centring the point clouds on their respective means, that is, $\bar{W}_i = W_i - (1/n)\sum_{j=1}^{n} W_j$ and $\bar{Y}_i = Y_i - (1/n)\sum_{j=1}^{n} Y_j$, the scale s is very easy to obtain as $s = \|\bar{W}_i\| / \|\bar{Y}_i\|$, for every i. In practice, s can be averaged over the set of points $\forall i = 1 \cdots n$, and then used to scale the values of $\bar{Y}_i$ accordingly.
With the sets of points adequately scaled and centred, we are left with the problem of
estimating the unknown rotation between $\bar{W}_i$ and $s\bar{Y}_i$. This is known as the orthogonal
Procrustes problem (Gower and Dijksterhuis 2004). Being $W_{3\times n}$ and $Y_{3\times n}$ the matrices
formed by stacking the points $\bar{W}_i$ and $s\bar{Y}_i$, respectively, the solution is found using again
the SVD. The sought rotation is given by:
$$R = V' \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det(U'V'^\top) \end{bmatrix} U'^\top, \qquad (8)$$

where $U'_{3\times3} D'_{3\times3} V'^\top_{3\times3}$ stands for the SVD of the matrix $YW^\top$, and the determinant of $U'V'^\top$ is always 1 or $-1$.
Finally, once we have calculated the rotation matrix R, we can obtain the translation
vector C from (7) simply this way:

$$C = \frac{1}{s}\left(\frac{1}{n}\sum_{i=1}^{n} W_i\right) - R\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right). \qquad (9)$$
As we demonstrate in Section 7.3, this precalibrated exterior orientation procedure
ensures both robustness and accuracy in the estimated camera pose, even in environments
where we have to deal with essentially planar scenes.
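The whole absolute-orientation stage, Equations (7)-(9), fits in a short numpy function; the similarity transform below is invented purely for the round-trip check.

```python
import numpy as np

def absolute_orientation(W, Y):
    """Given rows W_i = s (R Y_i + C), recover s, R and C (Eqs. (7)-(9))."""
    Wc = W - W.mean(axis=0)                      # centre both point clouds
    Yc = Y - Y.mean(axis=0)
    s = (np.linalg.norm(Wc, axis=1) / np.linalg.norm(Yc, axis=1)).mean()
    H = (s * Yc).T @ Wc                          # Y W^T in the text's notation
    U_, _, Vt_ = np.linalg.svd(H)
    d = np.linalg.det(Vt_.T @ U_.T)              # +1 or -1, Equation (8)
    R = Vt_.T @ np.diag([1.0, 1.0, d]) @ U_.T
    C = W.mean(axis=0) / s - R @ Y.mean(axis=0)  # Equation (9)
    return s, R, C

# Round-trip check with an invented similarity transform.
th = 0.3
R_true = np.array([[np.cos(th), -np.sin(th), 0],
                   [np.sin(th),  np.cos(th), 0],
                   [0, 0, 1]])
C_true = np.array([1.0, 0.0, -2.0])
s_true = 2.0
Y = np.random.default_rng(5).random((6, 3))
W = s_true * (Y @ R_true.T + C_true)

s_est, R_est, C_est = absolute_orientation(W, Y)
```

The same routine solves the model-to-world alignment of Section 6.4, where the $Y_i$ are reconstruction coordinates and the $W_i$ are ground control points.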
6.3. Nonlinear optimisation
The methods described in the previous subsections are linear, so in fact they minimise an
algebraic instead of a geometric error (Hartley and Zisserman 2011). Thus, although
essentially correct, the obtained solutions for the $(K, R, C)$ decomposition of the camera
matrix are slightly suboptimal. Especially when we know the camera intrinsic parameters
($K_{fixed}$) with a high degree of accuracy, a final polishing of the solution by a standard
nonlinear procedure that aims to minimise the geometric reprojection error is
recommended (line 8 of Algorithm 1). Several alternatives are available for this task
(Nocedal and Wright 1999); in this work, we simply use a straightforward iterative
procedure implemented in the GNU Scientific Library. Our experiments show that this final
optimisation, to a greater or lesser extent, always improves the linear estimation.
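As a rough illustration of this polishing step, the sketch below runs a small Gauss-Newton loop with a numeric Jacobian, refining only the camera centre C with K and R held fixed; the paper's GSL-based routine refines the full pose, so this is just the idea on invented data.

```python
import numpy as np

def project(C, K, R, X):
    cam = (X - C) @ R.T                  # world -> camera frame
    xh = cam @ K.T
    return xh[:, :2] / xh[:, 2:3]        # pixel coordinates

K = np.diag([500.0, 500.0, 1.0])
R = np.eye(3)
C_true = np.array([0.0, 0.0, -5.0])
rng = np.random.default_rng(7)
X = rng.random((12, 3)) * 2.0            # scene points, depths in [5, 7]
pix = project(C_true, K, R, X)           # observed pixels (noise-free here)

C = C_true + np.array([0.2, -0.1, 0.3])  # "linear" estimate to be polished
err0 = np.abs(project(C, K, R, X) - pix).max()
eps = 1e-6
for _ in range(15):                      # Gauss-Newton iterations
    r = (project(C, K, R, X) - pix).ravel()
    J = np.empty((r.size, 3))
    for j in range(3):                   # central-difference Jacobian
        e = np.zeros(3); e[j] = eps
        J[:, j] = (project(C + e, K, R, X) - project(C - e, K, R, X)).ravel() / (2 * eps)
    C = C - np.linalg.lstsq(J, r, rcond=None)[0]
err1 = np.abs(project(C, K, R, X) - pix).max()
```

Minimising this geometric reprojection error, rather than the algebraic residual of the linear methods, is exactly what the final optimisation stage buys.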
6.4. From 3D model to real-world coordinates
Obviously, in the absence of any real-world reference, the SfM techniques can determine
the 3D reconstruction only up to an arbitrary global Euclidean transform. Nevertheless, our
LBS must of course attach this 3D model to some convenient coordinate system defined
for our physical environment, so that we obtain the localisation results in real-world
coordinates. We can also define the desired augmented reality structures on this world
coordinate system. Given a minimum of three ground control points (although, for
numerical stability, it is convenient to use a few more), it is possible to estimate the
corresponding parameters $s'$, $R'$ and $C'$ of the needed 3D reconstruction → world Euclidean
transformation by solving again an absolute 3D orientation problem, analogous to the problem
mentioned in Section 6.2. Those parameters will be used to globally transform the whole
3D model to the desired final real-world coordinates, where all the posterior resectioning
process will take place.
7. Experimental analysis
This section describes the experimental environment where we have carried out a set of
tests with the aim of validating the proposal described in this paper. We also outline the
software and hardware elements, as well as the methodology used to perform the training
process. Finally, we discuss the details of one specific validation (among the several tests
we performed) and the related results.
7.1. Experimental environment
The testbed where our experiments were conducted is located on the ground floor of our
Faculty. As shown in Figure 2, it is a 220 m² open-space area. To build the RSSI-based
fingerprinting map, we took advantage of the access points deployed throughout the
premises.
The training video and RSSI observations were captured using a Samsung Galaxy Tab
Plus 7.0 with Android OS 3.2, and the experiments were carried out using an ASUS
Transformer TF-101 with Android OS 4.0.3. We used our own application,
LOCUM-Trainer, to perform a fast training process. As a result, we obtained a lightweight
fingerprinting map of RSSIs and a set of images (frames extracted from the recording) geo-
tagged with the corresponding zone and orientation. Then, using the clustering technique
described in Lemelson et al. (2009), we determined the geographical areas dividing the
search space for image analysis, as mentioned in Section 5.
During the online phase, we extracted SIFT features to look for correspondences with
the points of the 3D model. For image processing purposes, we used the SIFTGPU library,
a GPU implementation of SIFT developed by Wu (2007). SIFTGPU requires a high-end
GPU, which is not available in smartphones, so we used a server with an nVidia GeForce
Figure 2. Experimental environment. Dots of different shapes indicate the RSSI clusters. Rectangles enclosing the dots define the 3D submodels, including orientations.
GTX580 with 512 cores, supporting both GLSL (OpenGL Shading Language) and CUDA
(Compute Unified Device Architecture). This GPU was installed in a machine with an Intel
Core i7-2600K CPU at 3.40 GHz. This server acts as the cloud, which is responsible for
estimating the location of the different devices. In addition, we also used a modified
version of the CUDA kNN software developed by Garcia, Debreuve, and Barlaud (2008)
to perform the matching process, and our own implementation of the resection algorithm
described in Section 6, built on the QVision library,2 available on SourceForge.
7.2. Building 3D maps
As mentioned earlier, our proposal is mainly focused on image analysis. Thus, we
needed a set of images to build our 3D model. To minimise the training time, we
recorded a 7′24″-long video, capturing every detail of our environment. As there are not
many objects in the centre of the hall, the video was recorded paying attention to the four
main walls, with the camera held at shoulder level. During our tests, we learnt that a
better reconstruction is obtained when the scene is recorded with the camera oriented towards
the walls, including a second pass at a 45° angle of view. Once the video was
recorded, and after several tests varying the frame extraction rate, we extracted one image
per 24 frames, thus obtaining 524 different images. Those images were good enough to
build the 3D model used in the following tests.
Our 3D model was built using the VisualSFM software (Wu 2001), a visual structure-
from-motion application that uses a set of input images to run 3D reconstructions.
VisualSFM relies on the SIFTGPU library to extract the SIFT features and to perform the
pairwise image matching process. The bundle adjustment is implemented using a GPU
version that exploits the hardware parallelism.
Finally, using the information extracted from the 3D model and according to the zones
dividing our map, we built the search structures that contain the descriptors of the 3D
points belonging to each submodel.
7.3. Validation results
In this section, we present the experimental results of our proposal. A set of 20 images
(640×480 pixels) constitutes our validation data; they were obtained while walking through our
scenario using the rear camera of an ASUS EEE Pad Transformer TF-101 tablet. As the
camera had been previously calibrated, we knew its intrinsic parameters (focal length and
aspect ratio). The results shown in Table 1 were obtained by performing the resection process
using the Exterior Orientation technique.
We have evaluated different camera resection parameters, varying their values in order
to select the best configuration according to the characteristics of our environment. The
evaluated parameters are: (1) # Correspondences, the number of 2D–3D
pairs used as input for the resection process; (2) Minimum Inliers, the minimum
number of correspondences that must be considered inliers to stop the RANSAC process
(Early-Detection); (3) Iterations, the maximum number of times the RANSAC
process is executed to obtain the transformation matrix that minimises the reprojection
error; and (4) Reprojection Error, the upper threshold for considering a 2D–3D
correspondence an inlier. The values for testing each parameter were selected
after carrying out additional experiments, guided by the vision-based literature and by a
study of the goodness of the correspondences we are working with.
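The four parameters above map directly onto a generic RANSAC loop. The sketch below shows that structure, with pluggable `fit`/`residual` callbacks standing in for the Exterior Orientation solver and the reprojection test; all names are illustrative, not the authors' API.

```python
import numpy as np

def ransac(data, fit, residual, n_sample, iterations, reproj_thresh,
           min_inliers, rng=None):
    """Generic RANSAC loop exposing the four parameters evaluated in Table 1.
    fit(subset) -> model; residual(model, data) -> per-datum error (same units
    as reproj_thresh, e.g. pixels for a reprojection test)."""
    if rng is None:
        rng = np.random.default_rng()
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    for _ in range(iterations):
        idx = rng.choice(len(data), n_sample, replace=False)  # minimal sample
        model = fit(data[idx])
        inliers = residual(model, data) < reproj_thresh
        if inliers.sum() > best_inliers.sum():
            # refit the model on the whole consensus set
            best_model, best_inliers = fit(data[inliers]), inliers
    if best_inliers.sum() < min_inliers:
        return None, best_inliers  # not enough support: resection fails
    return best_model, best_inliers
```

A 2D line-fitting model is enough to exercise the loop; in the paper's setting `fit` would be the Exterior Orientation solver over 2D–3D correspondences and `residual` the pixel reprojection error.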
Table 1. Results of the resection process using Exterior Orientation and different configuration parameters.

                                Number of correspondences / Minimum inliers
Iterations  Repr. err. (px)   40/10       40/20      40/30       50/10      50/20      50/30     60/10      60/20     60/30
100         4                 19.5–100    10.11–96   9.0–90      25.2–100   11.60–97   9.6–90    39.9–99    15.6–96   9.4–90
            5                 14.4–100(*) 15.1–98    10.8–90     25.8–100   19.0–98    9.8–90    32.5–99    17.0–95   10.6–91
            6                 16.3–100    13.9–99    10.5–90     27.2–100   19.3–98    10.1–90   37.0–100   15.0–95   15.0–91
200         4                 10.8–100    9.5–97     8.6–90(**)  20.0–100   17.9–99    10.1–90   29.1–99    13.6–98   9.2–90
            5                 14.6–100    12.3–99    10.7–90     15.9–100   17.3–99    9.7–90    18.1–99    22.0–97   10.9–92
            6                 14.0–100    12.9–100   10.0–90     15.9–100   12.0–100   10.2–90   32.5–100   17.6–98   12.4–92
300         4                 12.4–100    10.4–97    9.1–90      15.3–100   11.5–99    10.2–90   19.3–100   14.4–98   10.0–90
            5                 14.1–100    11.9–99    10.6–90     11.6–100   15.3–100   9.5–90    23.4–100   13.2–98   11.0–92
            6                 12.3–100    11.6–100   9.9–90      14.7–100   16.1–100   10.0–90   19.8–100   13.0–99   12.5–93

Cell entries: estimation error (cm) – successful camera resections (%). (*) and (**) mark the configurations discussed in the text.
As we can see, several sets of parameters provide, on average, an
estimation error under 15 cm. The choice of the most appropriate configuration depends on
the associated processing time, which in turn depends on the number of iterations executed. For
example, we consider that the set of parameters offering the best trade-off between
accuracy and performance is the one marked (*) in Table 1. Although this option
entails a slight loss of accuracy with respect to other combinations, the time involved in
the resection process is 127 ms on average, a 37% reduction with respect to more accurate
options. Increasing the number of iterations to 200 or 300 raises the
execution time to 201 and 267 ms, respectively. Moreover, this choice is less
restrictive than the one providing the best accuracy results: since it requires fewer
inliers, we were able to perform a valid camera resection in all the tests, instead of the 90%
success rate obtained with the most accurate option (**). It is worth noting that
although this selected set of parameters works properly for our environment, it may differ
from one scenario to another.
Having selected the best configuration of parameters, we also evaluated
how the number of descriptors in the search space influences the accuracy and
performance of the resection process. We reduced the number of descriptors by
filtering those that define the same 3D point and whose camera poses are closer than a
specified threshold (100 cm). This reduction is based on the idea that each 3D point is
formed by a set of SIFT descriptors, but some of them were obtained from nearby cameras,
thus introducing redundancy into the matching process because of their visual
similarity. In this way, we reduced the number of descriptors from 254,631 to 86,088.
Figure 3 shows a comparison of the accuracy obtained when using all the descriptors or
only the filtered subset. As a conclusion, we can see that accuracy barely
decreases when the filtered set of descriptors is used. In those cases, the mean estimation
error is 16.36 cm; however, the required matching time (Figure 4) is reduced by 67%
with respect to the whole set of descriptors.
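This filtering idea can be sketched as a greedy pass over each 3D point's descriptors, keeping a descriptor only when its source camera is farther than the threshold from every camera already kept. This is our reading of the procedure, not the authors' code; the names and the greedy strategy are assumptions.

```python
import numpy as np

def filter_descriptors(entries, min_cam_dist=1.0):
    """Drop redundant SIFT descriptors of one 3D point.

    entries: list of (descriptor, camera_centre) pairs, where camera_centre is
    the (3,) position (in metres) of the camera that produced the descriptor.
    A descriptor is kept only if its camera is at least min_cam_dist away from
    every previously kept camera (greedy), removing visually similar views."""
    kept = []
    for desc, cam in entries:
        if all(np.linalg.norm(cam - c) >= min_cam_dist for _, c in kept):
            kept.append((desc, cam))
    return kept
```

With a 1 m (100 cm) threshold, two descriptors taken 0.5 m apart collapse into one, mirroring the reported reduction from 254,631 to 86,088 descriptors.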
A zone-based search space (the multisensor approach) reduces by 62% the time required
to carry out the matching process, as Figure 4 shows. In this case, the time needed to
complete the whole cycle, from sensor data acquisition to 3D estimation, is around 345 ms.
We have detailed the transmission time of sensor data from the smartphone and the
time required at the cloud to perform the SIFT extraction, run the matching process and apply
the resection technique. The time required to perform the RSSI analysis is 1.5 ms, so it has
been omitted because of its negligible influence.
Figure 3. Accuracy results for each validation image using different sets of descriptors (using all of them or a filtered subset). [Log-scale estimation error (cm) per validation image 1–20; series: all descriptors vs. filtered descriptors.]
However, we have also evaluated the use of an Early-Detection (ED) version of the
RANSAC process (Table 2). It is based on the idea that, once the threshold of inliers has
been exceeded, it might not be necessary to execute the remaining RANSAC iterations,
because the obtained camera matrix P might already be good enough to estimate the camera pose
accurately. Despite its better performance, which may save up to 63% of processing time
as shown in Table 3, ED is less accurate than the complete version of RANSAC.
Table 2 shows the resection results obtained using Exterior Orientation and the ED version
of RANSAC. After analysing these results, we can conclude that, despite the high accuracy
obtained with some configurations, there is an important loss of reliability, because
none of the configuration parameters guarantees 100% of successful cases.
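The Early-Detection variant only changes the loop-exit condition: return as soon as a hypothesis gathers the minimum number of inliers, instead of exhausting all iterations. A hedged sketch follows (illustrative names; a 2D line model stands in for the Exterior Orientation solver), which also makes the reliability trade-off visible: if no sample ever reaches the threshold, the process fails.

```python
import numpy as np

def ransac_early_detection(data, fit, residual, n_sample, iterations,
                           reproj_thresh, min_inliers, rng=None):
    """Early-Detection RANSAC: stop as soon as a sample reaches min_inliers."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(iterations):
        idx = rng.choice(len(data), n_sample, replace=False)
        model = fit(data[idx])
        inliers = residual(model, data) < reproj_thresh
        if inliers.sum() >= min_inliers:
            # good enough: refit on the consensus set and skip the rest
            return fit(data[inliers]), inliers
    return None, None  # never reached the threshold: resection fails
```

Because the loop accepts the first sufficiently supported hypothesis rather than the best one, it saves iterations but may settle for a worse model, which is consistent with the accuracy and reliability losses reported in Table 2.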
It is worth mentioning that in those cases where the intrinsic camera parameters are
unknown, we can use the DLT algorithm described in Section 6 to carry out the
resection process. A mean estimation error of around 78 cm is obtained, and the elapsed time
is similar to that obtained with Exterior Orientation. Therefore, our proposal is able to
address reasonably well even those situations where there is no previous knowledge of the
camera configuration. Figure 5 compares DLT and Exterior Orientation in terms of accuracy.
8. Augmented reality prototype
Throughout the preceding sections, we have laid the foundations to design an efficient
system that is able to provide accurate location estimations using data from several
sensors and without requiring intrusive landmarks. We envision that these types of
systems are really useful for building augmented reality services requiring very precise
overlaid information. Although we can think of many mobile apps that do not need a freshly
updated 3D position and orientation of the device at video input frame rate, many other
applications would certainly benefit from a location service providing continuous, always-on,
real-time localisation. For example, an application providing information about where
your friends are inside a particular building could be based on just a 2D map showing mobile
icons referring to them, but it would be very useful if it were also able to display
Figure 4. Performance using different alternatives to build the search space. [Stacked time (ms) for Sensor Tx, SIFT, Matching and Resection under the All, Filtered and Zone-based search spaces.]
Table 2. Results of the resection process using Exterior Orientation (with the Early-Detection version of RANSAC) and different configuration parameters.

                                Number of correspondences / Minimum inliers
Iterations  Repr. err. (px)   40/10      40/20     40/30     50/10     50/20     50/30     60/10     60/20     60/30
100         4                 39.2–68    22.7–79   10.8–80   48.4–66   31.1–75   13.4–75   45.5–60   28.2–72   15.4–72
            5                 35.4–67    20.8–81   11.1–78   38.8–63   17.2–76   16.9–75   29.4–62   24.2–70   17.8–71
            6                 33.5–66    18.6–76   11.3–76   36.2–62   21.2–73   16.1–72   46.4–57   27.2–69   12.7–70
200         4                 43.6–67    25.8–80   10.9–78   41.6–65   22.7–78   13.1–76   47.8–61   24.2–73   15.3–71
            5                 32.1–67    22.0–79   10.9–78   29.3–64   18.5–75   14.2–75   59.0–58   22.5–69   17.4–71
            6                 48.8–67    20.8–80   12.4–77   33.7–64   32.7–74   13.1–72   32.6–57   22.7–70   17.1–72
300         4                 36.6–68    26.4–82   11.3–78   46.3–65   26.8–77   12.1–76   46.4–60   25.1–72   14.6–72
            5                 31.7–68    24.6–81   11.0–78   34.8–61   22.4–75   21.3–74   33.4–60   30.0–71   20.2–73
            6                 32.3–66    26.2–79   11.2–76   35.7–62   24.5–73   20.1–72   43.8–59   21.3–70   14.3–71

Cell entries: estimation error (cm) – successful camera resections (%).
continuously updated augmented reality information on top of the video input captured by
the camera device as the user keeps moving. Users could even switch from one
interface to another, depending on the kind of information they are interested in at any
given moment.
Although, as stated in the previous experimental section, the response time of our
resection service in the cloud is typically limited to ~200 ms, this refresh rate is still
insufficient to fully cope with the video input rates of mobile devices (25 frames per second in
typical cameras). Moreover, the robustness and overall behaviour of the application could
certainly suffer if the location system tried to update the location information by making
continuous queries to the cloud, due to both the communication overhead and the
occasionally unavoidable resectioning failures. To avoid such degradation of the user
experience, while also allowing a smooth, real-time image refresh rate, we make
use of additional sensors provided by the device, such as the accelerometer and the
gyroscope, which allow us to detect different types of motion and reduce the needed
communication with the cloud.
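The decision logic of this hybrid scheme can be sketched as follows: reuse the cached cloud pose while only rotations are observed, and trigger a new resection query only when the accelerometer reports a translation. The class, callback and threshold below are illustrative assumptions, not the prototype's actual API.

```python
import numpy as np

ACC_THRESH = 0.8  # m/s^2 on the gravity-free signal; tuning value is an assumption

class HybridLocalizer:
    """Decide, per sensor reading, whether the cached cloud pose can be reused
    (pure rotation: updated locally from the gyroscope) or a fresh visual
    resection must be requested from the cloud."""

    def __init__(self, query_cloud):
        self.query_cloud = query_cloud  # callback: camera frame -> (R_cam, C)
        self.pose = None                # last pose returned by the cloud

    def on_sensors(self, lin_accel, frame):
        moving = np.linalg.norm(lin_accel) > ACC_THRESH  # translation detected?
        if self.pose is None or moving:
            # full visual resection in the cloud (expensive, ~hundreds of ms)
            self.pose = self.query_cloud(frame)
        return self.pose
```

A stationary or purely rotating user therefore generates no cloud traffic at all; only translations (or the very first frame) cost a round trip.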
In order to test this hybrid approach, we developed a prototype app for Android
devices based on OpenGL ES 2.0 (Note 3), which is able to display augmented reality based on
precise estimations of the full 6 dof: (X, Y, Z) position and (α, β, γ) Euler angles for the
rotation matrix. We display a wire-frame model on top of the real world to visually check
the accuracy of our estimations. In addition, we add virtual banners to the doors in order to
display information of interest. Figure 6 shows a snapshot of the prototype.
Table 3. Time required (ms) at each step of the resection process.

                       Non-ED   ED
RANSAC                 75.1     2.1
Exterior Orientation   2.6      2.8
Optimisation           41.7     39.6
Total                  119.4    44.5
Figure 5. Comparison of DLT and Exterior Orientation. [Estimation error (cm) per validation image 1–20.]
Every time a new rotation of the device is read from the sensors (typically at a 50 Hz rate
in most Android devices), the original rotation matrix estimated from the last successfully
resectioned image is modified accordingly, and the augmented reality is updated. The
camera centre C is not modified at all. Only when a translation movement is detected by
the accelerometer is the device forced to request a fully new estimation of position and
rotation from the cloud. The continuous update of the current device rotation is performed
using the gyroscope sensor according to the following formula:

R_new(t_cur) = R_comp · R_gyr(t_cur)ᵀ · R_gyr(t_0) · R_comp · R_cam(t_0).   (10)
Here, R_cam(t_0) stands for the last rotation estimation provided by the cloud using
strictly visual information, and is therefore only refreshed when the user clearly moves to a
new location, as stated above. R_new(t), on the contrary, is updated at the system's sensor-
reading rate using the most current gyroscope information. This provides continuous,
smooth visual feedback to the user. The update is based on the fact that the matrix product
R_gyr(t_cur)ᵀ · R_gyr(t_0), with R_gyr(t) the absolute rotation read by the gyroscope of the device at
time t, gives at every moment the relative motion between the instant the picture
corresponding to the last camera resection was taken (t = t_0) and the current instant
(t = t_cur), assuming that the movement performed by the user is a pure rotation (i.e.
there is no physical translation of the camera centre). Finally, the matrix R_comp =
diag(1, −1, −1) is a simple diagonal matrix that just compensates for the different
handedness of the camera and world systems.
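Equation (10) translates directly into a few matrix products. A minimal sketch (NumPy, illustrative names) makes the key property easy to check: when the gyroscope reports no motion, R_new collapses back to R_cam(t_0), since R_comp is its own inverse.

```python
import numpy as np

# Handedness compensation between camera and world systems (Eq. 10).
R_comp = np.diag([1.0, -1.0, -1.0])

def updated_rotation(R_gyr_cur, R_gyr_0, R_cam_0):
    """Eq. (10): propagate the last vision-based rotation R_cam(t_0) with the
    relative gyroscope motion between t_0 and t_cur (pure-rotation assumption)."""
    return R_comp @ R_gyr_cur.T @ R_gyr_0 @ R_comp @ R_cam_0
```

Because every factor is orthonormal, the result is a valid rotation matrix, and the whole update costs four 3×3 multiplications, cheap enough for a 50 Hz sensor loop.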
Strictly speaking, gyroscope readings measure angular velocities, which must be integrated
over time to obtain the corresponding rotation matrices. Nevertheless, most Android
systems integrate compass, gravity and gyroscope readings into a virtual sensor
that provides a filtered output rotation which, although essentially based on the gyroscope
estimation, is much more stable and prevents the typical long-term drift of the gyroscope
estimation alone. In our system, we use this second type of virtual sensor,
Figure 6. Snapshot of the Android AR app.
which is more robust in practice, although, given the typically short elapsed time between
successive visual estimations (a few seconds in real-application cases), the raw integrated
gyro readings could also be used without the risk of drifting too much in the
estimation.

Given this freshly updated rotation, the final projection matrix used at every
moment to project the virtual structures for augmented reality is
P = K · R_new(t_cur) · [I | −C(t_0)], with the camera position C(t_0) as estimated in the
last query to the resection service in the cloud, and the rotation matrix R substituted by
R_new(t_cur), computed as stated in Equation (10). Observe that this is a very fast operation
that only depends on the most current R_gyr(t_cur) reading, and can thus be performed at a very
high refresh rate even on low-end mobile devices. At the end of the video mentioned
in Section 4, one can also observe the visually appealing effect accomplished by this
hybrid visual/inertial estimation in action.
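Rebuilding P from the cached centre and the freshly updated rotation is indeed just a couple of matrix products, as the following minimal sketch (illustrative names) shows:

```python
import numpy as np

def projection_matrix(K, R_new, C0):
    """P = K · R_new(t_cur) · [I | -C(t_0)]: cheap to rebuild per gyro reading."""
    return K @ R_new @ np.hstack([np.eye(3), -C0.reshape(3, 1)])

def project_point(P, Xw):
    """Project a 3D world point to pixel coordinates via homogeneous division."""
    x = P @ np.append(Xw, 1.0)
    return x[:2] / x[2]
```

Only R_new changes between sensor readings, so the per-frame cost of refreshing the overlay is one 3×4 matrix rebuild plus one matrix-vector product per virtual vertex.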
9. Conclusions
As emphasised throughout the paper, the multisensor design is crucial to accomplish a robust
and scalable image-based location-based service. We have introduced an approach that not
only eliminates the need to discretise the environment at a given granularity, but also
improves the quality of the location estimations. We have presented how to build an
accurate 3D model of the environment ready to perform the desired full 6 dof
localisation process for any new input image. In addition, the multisensor design provides
mechanisms to work in difficult scenarios, such as those presenting areas that are similar from a
visual perspective (floors and rooms). In those cases, SIFT features can fail to
disambiguate, but other signals such as RSSIs can be crucial to perform an accurate
resection process.
We envision that our proposal constitutes a good starting point to develop augmented
reality applications, since locations are obtained with a mean error of 15 cm, with the
additional bonus of estimating the rotation with a similar precision. Moreover, the mean
time required to complete an entire cycle is 345 ms, making this proposal able to support
applications requiring high accuracy and rapid response. An augmented reality prototype
was developed to visually check that our estimations are correct and that the integration of
additional sensors (like the gyroscope) is especially useful to integrate the user's movements.

As a statement of direction, we are designing a new training process based on inertial
sensors in order to build 3D models in a more robust and unsupervised way. The main goal
of this training is to facilitate the fully automatic and more reliable reconstruction of large
indoor environments, since it has a direct influence on the system accuracy. Finally, we are
also working on the use of a more powerful camera resection technique that takes advantage
of the information provided by the inertial sensors available in smartphones. Hopefully, this
technique will allow us to use fewer 2D–3D correspondences in the RANSAC procedure to
obtain an equally valid solution for the camera position, which we think will
improve both the robustness and the speed of the resection process. We expect these
improvements to be especially useful in the less textured scenes also typically found in indoor
environments, such as aisles or rooms with bare walls and mostly without decoration.
Acknowledgements
This work was supported by the Seneca Foundation, the Science and Technology Agency of the Region of Murcia, under the Seneca Program 2009.
Notes
1. A short illustration video of the overall system, which also explains the 3D reconstruction process in more detail, is available at http://www.youtube.com/watch?v=qycdlROrtXE.
2. http://qvision.sourceforge.net.
3. http://www.khronos.org/opengles/.
References
Arai, I., S. Horimi, and N. Nishio. 2010. "Wi-Foto 2: Heterogeneous Device Controller Using Wi-Fi Positioning and Template Matching." Paper presented at Pervasive, Helsinki, Finland, May 17–20.
Chandrasekaran, G., M. A. Ergin, J. Yang, S. Liu, Y. Chen, M. Gruteser, and R. P. Martin. 2009. "Empirical Evaluation of the Limits on Localization Using Signal Strength." Paper presented at SECON, Rome, Italy, June 22–26.
Fiore, P. 2001. "Efficient Linear Solution of Exterior Orientation." IEEE Transactions on Pattern Analysis and Machine Intelligence 23: 140–148.
Fischler, M. A., and R. C. Bolles. 1981. "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography." Communications of the ACM 24: 381–395.
Frauendorfer, F., C. Wu, J. M. Frahm, and M. Pollefeys. 2008. "Visual Word Based Location Recognition in 3D Models Using Distance Augmented Weighting." Paper presented at the 4th 3DPVT, Atlanta, GA, USA, June 18–20.
Garcia, V., E. Debreuve, and M. Barlaud. 2008. "Fast k Nearest Neighbor Search Using GPU." Paper presented at the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 24–26.
Golub, Gene H., and Charles F. Van Loan. 1996. Matrix Computations. Baltimore, MD: Johns Hopkins Studies in Mathematical Sciences, The Johns Hopkins University Press.
Gower, J. C., and G. B. Dijksterhuis. 2004. Procrustes Problems. Oxford: Oxford University Press.
Hartley, R., and A. Zisserman. 2011. Multiple View Geometry in Computer Vision. 2nd ed. Cambridge: Cambridge University Press.
Hattori, K., R. Kimura, N. Nakajima, T. Fujii, Y. Kado, B. Zhang, T. Hazugawa, and K. Takadama. 2009. "Hybrid Indoor Location Estimation System Using Image Processing and WiFi Strength." Paper presented at WNIS, Washington, DC, USA, December 28–29.
Horn, R. A., and C. R. Johnson. 1991. Topics in Matrix Analysis. Cambridge: Cambridge University Press.
Lemelson, H., M. B. Kjaergaard, R. Hansen, and T. King. 2009. "Error Estimation for Indoor 802.11 Location Fingerprinting." Paper presented at LoCA, Tokyo, Japan, May 7–8.
Li, X., and J. Wang. 2012. "Image Matching Techniques for Vision-Based Indoor Navigation Systems: Performance Analysis for 3D Map Based Approach." Paper presented at IPIN, Sydney, Australia, November 13–15.
Lowe, D. G. 2004. "Distinctive Image Features from Scale-Invariant Keypoints." International Journal of Computer Vision 60: 91–110.
Mulloni, A., D. Wagner, I. Barakonyi, and D. Schmalstieg. 2009. "Indoor Positioning and Navigation with Camera Phones." IEEE Pervasive Computing 8: 22–31.
Murillo, A., J. Guerrero, and C. Sagues. 2007. "SURF Features for Efficient Robot Localization with Omnidirectional Images." Paper presented at the IEEE International Conference on Robotics and Automation, Rome, Italy, April 10–14.
Nocedal, J., and S. Wright. 1999. Numerical Optimization. New York: Springer.
Ruiz-Ruiz, A. J., O. Canovas, R. A. Rubio, and P. E. Lopez-de-Teruel. 2011. "Using SIFT and WiFi Signals to Provide Location-Based Services for Smartphones." Paper presented at MobiQuitous, Copenhagen, Denmark, December 6–9.
Ruiz-Ruiz, A. J., P. E. Lopez-de-Teruel, and O. Canovas. 2012. "A Multisensor LBS Using SIFT-Based 3D Models." Paper presented at IPIN, Sydney, Australia, November 13–15.
Se, S., D. G. Lowe, and J. J. Little. 2005. "Vision Based Global Localization and Mapping for Mobile Robots." IEEE Transactions on Robotics 21 (3): 364–375.
Triggs, B., P. F. Mclauchlan, R. I. Hartley, and A. W. Fitzgibbon. 2000. "Bundle Adjustment – A Modern Synthesis." Lecture Notes in Computer Science: 298–372.
Wu, C. 2001. "VisualSFM: A Visual Structure from Motion System." Accessed June 5, 2013. http://www.cs.washington.edu/homes/ccwu/vsfm/
Wu, C. 2007. "SiftGPU: A GPU Implementation of Scale Invariant Feature Transform (SIFT)." Accessed June 5, 2013. http://cs.unc.edu/~ccwu/siftgpu