MACHINE LEARNING FOR GEOLOGICAL MAPPING: ALGORITHMS AND APPLICATIONS
MATTHEW J. CRACKNELL
BSc (Hons)
ARC Centre of Excellence in Ore Deposits (CODES)
School of Physical Sciences (Earth Sciences)
Submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
University of Tasmania
May, 2014
Did you ever fly a kite in bed?
Did you ever walk with ten cats on your head?
Did you ever milk this kind of cow?
Well, we can do it.
We know how.
If you never did you should.
These things are fun and fun is good.
Dr. Seuss
DECLARATION OF ORIGINALITY
This thesis contains no material which has been accepted for a degree or diploma by the
University or any other institution, except by way of background information and duly
acknowledged in the thesis, and to the best of my knowledge and belief no material
previously published or written by another person except where due acknowledgement is
made in the text of the thesis, nor does the thesis contain any material that infringes
copyright.
AUTHORITY OF ACCESS
The non-published content of the thesis (see below) may be made available for loan and
limited copying and communication in accordance with the Copyright Act 1968.
STATEMENT REGARDING PUBLISHED WORK CONTAINED IN THESIS
Chapter 4 of this thesis is published under a Creative Commons Attribution (CC BY)
licence. You are free to copy, communicate and adapt the work, so long as you attribute
the authors. To view a copy of this licence, visit http://creativecommons.org/licenses/. The
publishers of the papers comprising Chapters 5 and 6 hold the copyright for that content, and
access to the material should be sought from the respective journals.
Matthew J. Cracknell
May 2014
STATEMENT OF CO-AUTHORSHIP
The following people and institutions contributed to the publication of work undertaken as
part of this thesis:
Matthew James Cracknell, ARC Centre of Excellence in Ore Deposits (CODES), School of
Earth Sciences, University of Tasmania = Candidate
Anya Marie Reading, ARC Centre of Excellence in Ore Deposits (CODES), School of
Earth Sciences, University of Tasmania = Author 1
Andrew William McNeill, Mineral Resources Tasmania, Department of Infrastructure,
Energy & Resources (DIER) = Author 2
Author details and their roles:
Paper 1, ‘Geological mapping using remote sensing data: A comparison of five
machine learning algorithms, their response to variations in the spatial distribution of
training data and the use of explicit spatial information’:
Located in Chapter 4
Candidate was the primary author, with Author 1 contributing to its development,
refinement and presentation.
Paper 2, ‘The upside of uncertainty: Identification of lithology contact zones from
airborne geophysics and satellite data using Random Forests and Support Vector
Machines’:
Located in Chapter 5
Candidate was the primary author, with Author 1 contributing to its development,
refinement and presentation.
Paper 3, ‘Mapping geology and volcanic-hosted massive sulfide alteration in the
Hellyer–Mt Charter region, Tasmania, using Random Forests™ and Self-Organising
Maps’:
Located in Chapter 6
Candidate was the primary author, with Author 1 contributing to its refinement and
presentation and Author 2 contributing to its formalisation and development.
We the undersigned agree with the above stated “proportion of work undertaken” for each
of the above published (or submitted) peer-reviewed manuscripts contributing to this
thesis:
Signed:
Anya M. Reading, Supervisor, School of Earth Sciences, University of Tasmania
Jocelyn McPhie, Head of School, School of Earth Sciences, University of Tasmania
Date:
ABSTRACT
Machine learning algorithms are designed to efficiently identify and accurately predict
patterns within multivariate data. They provide analysts with computational tools to aid
predictive modelling and the interpretation of interactions between data and the
phenomena under investigation. The analysis of large volumes of disparate multivariate
geospatial data using machine learning algorithms therefore offers great promise to
industry and research in the geosciences. Geoscience data are frequently characterised by a
restriction in the number and distribution of direct observations, irreducible noise in these
data and a high degree of intraclass variability and interclass similarity. The choice of
machine learning algorithm or algorithms, and the details of how algorithms are applied,
must therefore be appropriate to the context of geoscience data. With this knowledge, I aim
to employ machine learning as a means of understanding the spatial distribution of
complex geological phenomena.
I conduct a rigorous and comprehensive comparison of machine learning algorithms,
representing the five general machine learning strategies, for supervised lithology
classification applications. I also develop and test a novel method for obtaining robust
estimates of the uncertainty associated with machine learning algorithm categorical
predictions. The insights gained from these experiments lead to the further development
and comparison of new methods for the incorporation of spatial-contextual information
into machine learning supervised classifiers.
In using machine learning algorithms for geoscience applications, I have developed best-
practice methodologies that address the challenges facing geoscientists for geospatial
supervised classification. Guidelines are established that detail the preparation and
integration of disparate spatial data, the optimisation of trained classifiers for a given
application and the robust statistical and spatial evaluation of outputs. I demonstrate,
through a case study in a region that is prospective for economic mineralisation, the
combination of supervised and unsupervised machine learning algorithms for the critical
appraisal of pre-existing geological maps and formulation of meaningful interpretations of
geological phenomena.
The experiments conducted as part of my research confirm the efficacy of machine
learning algorithms to generate accurate geological maps representing a variety of terranes.
I identify and explore key aspects of the spatial and statistical distributions of geoscience
data that affect machine learning algorithm performance. My research clearly identifies
Random Forests™ as a good first-choice algorithm for the prediction of classes
representing lithologies using commonly available multivariate geological and geophysical
data. Furthermore, Random Forests prediction uncertainty is shown to be closely related to
ambiguous and/or erroneous classifications and thus provides a practical means of
indicating variable levels of confidence. Spatial-contextual information is best incorporated
into machine learning supervised classifiers via the pre-processing of input variables
and/or the post-regularisation of classifications. My findings indicate that a trade-off
between optimal predictive models and interpretable explanatory models exists, whereby
intuitively interpretable models are not necessarily the most accurate.
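As an illustration of how prediction uncertainty can flag ambiguous classifications, the following is a minimal sketch, not the method developed in this thesis. It assumes that a classifier returns a vector of class membership fractions (for example, the share of ensemble votes per class) and summarises that vector with normalised Shannon entropy. The function name and the example vote fractions are hypothetical.

```python
import math

def normalised_entropy(probs):
    """Shannon entropy of a class-membership vector, scaled to [0, 1].

    0 means all weight is on one class (confident prediction);
    1 means weight is spread evenly over all classes (maximally ambiguous).
    """
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two classes")
    total = sum(probs)
    if total <= 0:
        raise ValueError("class memberships must sum to a positive value")
    h = 0.0
    for p in probs:
        q = p / total          # renormalise in case the inputs are raw vote counts
        if q > 0:              # treat 0 * log(0) as 0
            h -= q * math.log(q)
    return h / math.log(k)     # divide by the maximum possible entropy, log(k)

# Hypothetical vote fractions for three lithology classes at two map locations
confident = [0.90, 0.05, 0.05]
ambiguous = [0.40, 0.35, 0.25]

print(normalised_entropy(confident))
print(normalised_entropy(ambiguous))
```

Mapping such a score across a study area would be expected to highlight locations where class membership is contested, which is the kind of variable confidence described above.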
The practical application of machine learning algorithms requires the implementation of
three key stages: (1) data pre-processing; (2) algorithm training; and (3) prediction
evaluation. This methodology provides the foundation for generating accurate and
geologically meaningful predictions with minimal user intervention and assists in the
formulation of robust interpretations of complex geological phenomena. For example,
classifications obtained by Random Forests are useful for critically appraising interpreted
geological maps. Clusters produced by Self-Organising Maps indicate the presence of
discrete, spatially contiguous and geologically significant sub-classes within individual
lithological units, which represent regions of contrasting primary composition and
alteration styles. My results may be widely applied to a broad range of practical geoscience
challenges such as ore deposit targeting, geo-hazard risk assessment, engineering and
construction projects, hydrological and environmental modelling and ecological studies.
The applications of machine learning algorithms detailed in this thesis align well with
state-of-the-art Big Data online infrastructure and virtual laboratories currently emerging in
Australia.
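The three stages named above (data pre-processing, algorithm training and prediction evaluation) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the workflow implemented in the thesis: the data are synthetic, the classifier is a simple k-nearest neighbours vote, and all function and class names are hypothetical.

```python
import math
import random

def fit_scaler(rows):
    """Stage 1 (pre-processing): learn per-variable mean and standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def apply_scaler(rows, means, stds):
    """Standardise each variable using statistics learned from the training data."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def knn_predict(train_x, train_y, x, k=3):
    """Stage 2 (training/prediction): majority vote among the k nearest training samples."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)

def accuracy(y_true, y_pred):
    """Stage 3 (evaluation): fraction of held-out samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic two-variable samples for two made-up classes (assumption, not thesis data)
rng = random.Random(0)

def sample(centre, label, n):
    return [([rng.gauss(centre[0], 0.5), rng.gauss(centre[1], 0.5)], label)
            for _ in range(n)]

data = sample((0.0, 0.0), "unit_A", 40) + sample((5.0, 5.0), "unit_B", 40)
rng.shuffle(data)
train, test = data[:60], data[60:]
train_x, train_y = [x for x, _ in train], [y for _, y in train]
test_x, test_y = [x for x, _ in test], [y for _, y in test]

# Scaling statistics come from the training samples only, then are reused on the test set
means, stds = fit_scaler(train_x)
train_z = apply_scaler(train_x, means, stds)
test_z = apply_scaler(test_x, means, stds)

pred = [knn_predict(train_z, train_y, z) for z in test_z]
acc = accuracy(test_y, pred)
print(f"held-out accuracy: {acc:.2f}")
```

Keeping the held-out samples out of both the scaling and the training steps is the design choice that makes the accuracy estimate meaningful.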
CONTENTS
DECLARATION OF ORIGINALITY
AUTHORITY OF ACCESS
STATEMENT REGARDING PUBLISHED WORK CONTAINED IN THESIS
STATEMENT OF CO-AUTHORSHIP
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGEMENTS
CHAPTER 1 – INTRODUCTION
1.1. Machine learning
1.2. Geological maps
1.3. Research scope and hypothesis
1.3.1. Major research questions to be addressed
1.4. Thesis structure
CHAPTER 2 – MACHINE LEARNING THEORY AND IMPLEMENTATION
2.1. Machine learning
2.1.1. Supervised versus unsupervised learning
2.2. Supervised classification
2.2.1. Classification strategies
2.2.1.1. Statistical learning algorithms
2.2.1.2. Instance-based learners
2.2.1.3. Logic-based learners
2.2.1.4. Support Vector Machines
2.2.1.5. Perceptrons
2.2.2. Supervised classifier implementation
2.2.2.1. Data pre-processing
2.2.2.2. Classifier training
2.2.2.3. Prediction evaluation
2.3. Unsupervised clustering
2.3.1. Clustering strategies
2.3.1.1. Partitioning algorithms
2.3.1.2. Hierarchical algorithms
2.3.1.3. Self-Organising Maps
2.3.2. Unsupervised clustering implementation
2.4. Conclusions
CHAPTER 3 – A REVIEW OF MACHINE LEARNING FOR GEOSCIENCE CLASSIFICATION APPLICATIONS
3.1. Machine learning non-geoscience applications
3.2. Machine learning geoscience applications
3.2.1. Classification of 0D data
3.2.1. Classification of 1D data
3.2.1.1. One temporal dimension
3.2.1.2. One spatial dimension
3.2.1. Classification of 2D data
3.2.1.3. Land cover/vegetation mapping
3.2.1.4. Geological mapping
Supervised classification
Unsupervised clustering
Combined supervised and unsupervised methods
3.3. Practical machine learning implementation
3.3.1. Data
3.3.2. Data pre-processing
3.3.3. Prediction evaluation
3.3.4. Integrated workflow
3.4. Conclusions
CHAPTER 4 – GEOLOGICAL MAPPING USING REMOTE SENSING DATA: A COMPARISON OF FIVE MACHINE LEARNING ALGORITHMS, THEIR RESPONSE TO VARIATIONS IN THE SPATIAL DISTRIBUTION OF TRAINING DATA AND THE USE OF EXPLICIT SPATIAL INFORMATION
4.0. Abstract
4.1. Introduction
4.1.1. Machine learning for supervised classification
4.1.2. Machine learning algorithm theory
4.1.2.1. Naïve Bayes
4.1.2.2. k-Nearest Neighbours
4.1.2.3. Random Forests
4.1.2.4. Support Vector Machines
4.1.2.5. Artificial Neural Networks
4.1.3. Geology and tectonic setting
4.2. Data
4.3. Methods
4.3.1. Pre-processing
4.3.2. Classification model training
4.3.3. Prediction evaluation
4.4. Results
4.5. Discussion
4.5.1. Machine learning algorithms compared
4.5.2. Influence of training data spatial distribution
4.5.3. Using spatially constrained data
4.6. Conclusions
4.7. Acknowledgements
4.8. Description of supplementary information
CHAPTER 5 – THE UPSIDE OF UNCERTAINTY: IDENTIFICATION OF LITHOLOGY CONTACT ZONES FROM AIRBORNE GEOPHYSICS AND SATELLITE DATA USING RANDOM FORESTS AND SUPPORT VECTOR MACHINES
5.0. Abstract
5.1. Introduction
5.1.1. The lithology prediction problem
5.1.2. Random Forests
5.1.3. Support Vector Machines
5.2. Data
5.2.1. Tectonic setting and history
5.2.2. Data sources
5.2.3. Data pre-processing
5.3. Methods
5.3.1. Training and evaluating algorithms
5.3.2. Variance
5.4. Results
5.5. Discussion
5.6. Conclusions
5.7. Acknowledgements
CHAPTER 6 – MAPPING GEOLOGY AND VOLCANIC-HOSTED MASSIVE SULFIDE ALTERATION IN THE HELLYER–MT CHARTER REGION, TASMANIA, USING RANDOM FORESTS™ AND SELF-ORGANISING MAPS
6.0. Abstract
6.1. Introduction
6.1.1. Geological setting
6.1.2. Random Forests
6.1.3. Self-Organising Maps
6.2. Data and Methods
6.2.1. Source data
6.2.2. Data sampling
6.2.3. Training Random Forests and variable selection
6.2.4. Implementing Self-Organising Maps
6.3. Results
6.3.1. Geological classification using Random Forests
6.3.2. Discrimination of geological sub-classes using Self-Organising Maps
6.4. Discussion
6.5. Conclusions
6.6. Acknowledgements
CHAPTER 7 – SPATIAL-CONTEXTUAL MACHINE LEARNING SUPERVISED CLASSIFIERS: LITHOSTRATIGRAPHY CLASSIFICATION EXAMPLE
7.0. Abstract
7.1. Introduction
7.1.1. Pre-processing methods
7.1.1.1. Focal operators
7.1.1.2. Image segmentation
7.1.2. Training data selection
7.1.3. Post-processing methods
7.1.4. Combination methods
7.1.5. Study aims
7.2. Data
7.2.1. Lithostratigraphy – classification target
7.2.2. Geophysical data – input variables
7.2.2.1. Pre-processing
7.3. Methods
7.3.1. Data sampling
7.3.2. Global pixel-based classifiers
7.3.3. Spatial-contextual classifiers
7.3.3.1. Pre-processing
7.3.3.2. Algorithm training
7.3.3.3. Post-processing
7.3.4. Prediction evaluation
7.4. Results
7.5. Discussion
7.5.1. Spatial-contextual classifiers compared
7.5.2. Issues of spatial scale
7.5.3. Geological interpretations
7.6. Conclusions
CHAPTER 8 – SYNTHESIS AND DISCUSSION
8.1. Algorithms
8.1.1. Supervised classification
8.1.1.1. Implementation
8.1.1.2. Decision structures
8.1.1.3. Accuracy comparison
8.1.1.4. Spatial-contextual classifiers
8.1.1.5. Prediction uncertainty
8.1.2. Unsupervised clustering
8.2. Applications
8.2.1. Data pre-processing
8.2.1.1. Data preparation
8.2.1.2. Variable extraction
8.2.1.3. Variable selection
8.2.2. Classifier training
8.2.2.1. Training and test data
8.2.2.2. Classifier induction
8.2.2.3. Classification post-processing
8.2.3. Evaluation and interpretation
8.2.3.1. Statistical evaluation
8.2.3.2. Interrogating decision structures
8.2.3.3. Complementary interpretation
8.3. Extended research implications
8.3.1. Integrated workflow using R
8.3.2. Wider geoscience applications
8.3.3. Big Data
CHAPTER 9 – CONCLUSIONS
REFERENCES
APPENDIX A – MACHINE LEARNING ALGORITHM SENSITIVITY TO IMBALANCED CLASS DISTRIBUTIONS
A.1. Introduction
A.2. Methods
A.3. Results
A.4. Discussion and Conclusions
APPENDIX B – VARIANCE AND ENTROPY FOR MULTICLASS CLASSIFICATION UNCERTAINTY
APPENDIX C – SUPPLEMENTARY INFORMATION
C.1. Data
C.2. MLA software and parameters
APPENDIX D – R PACKAGES
APPENDIX E – DATA SOURCES AND PRE-PROCESSING
APPENDIX F – R CODE AND SCRIPTS
README.txt