TRANSCRIPT
How to Handle and Analyse Large Datasets
BENVGEE7 'Methods of Environmental Analysis'
Ed Sharp, 21st February 2012
Introduction
• BSc Geography
• Worked at SABSCO Ltd, a niche power station construction contractor
• MSc GIS
• MRes Energy Demand Studies
• PhD: The spatiotemporal patterns of energy demand and supply in the UK
• Recent interest and research into large datasets, including a major piece of research into the effects of disparate, inaccurate datasets on energy demand forecast models
• Email: [email protected]
• Web:
  – LinkedIn: http://www.linkedin.com/pub/ed-sharp/43/2b4/b1b
  – UCL: http://www.bartlett.ucl.ac.uk/energy/people/students/ed-sharp
  – LoLo: http://www.lolo.ac.uk/profilepreview/view/id/102
Today's Lecture
• Three distinct sections
1. Theory: Describe how to handle and analyse large datasets
2. Practice: Run an exercise outlining some pervasive issues
3. Case Study: Demonstrate these within the context of some existing research
• Slides available on Moodle with web and literature references in full, colour denotes section.
Part 1: What is a large dataset?
• Large volumes of data
  – Millions of entries
  – Many terabytes
  – Computationally intensive
  – Past 10 years: ×1m
• Varied sources of data
  – Same variables
  – Different sources
  – A separate set of issues
• Both types cause problems with handling and analysis
• There are issues common to the two as well as issues specific to each
Examples…
• Volumes
  – Census (http://census.ac.uk/)
  – Home Energy Efficiency Database (HEED: http://www.energysavingtrust.org.uk/Professional-resources/Existing-Housing/Homes-Energy-Efficiency-Database)
  – Time-series datasets, e.g. energy production/consumption
  – Remotely sensed data
  – Geographic datasets
  – Climate reanalyses
• Sources
  – Population
  – Economic variables (GDP, GVA etc.)
  – Socio-demographic variables (population, employment etc.)
Sources, including repositories and search engines:
• Data.gov: www.data.gov.uk
• GoGeo: www.gogeo.ac.uk
• ShareGeo: www.sharegeo.ac.uk
• Eurostat: http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/
• IEA: www.iea.org
• National Statistics: www.statistics.gov.uk
• Odyssee: http://www.odyssee-indicators.org/
• OECD: www.oecd.org
• UNECE: www.unece.org
• World Bank: www.worldbank.org
• ADS, Archaeology Data Service: archaeologydataservice.ac.uk
• BADC, British Atmospheric Data Centre: badc.nerc.ac.uk
• BODC (oceanographic): www.bodc.ac.uk
• CDS, Chemical Database Service: cds.dl.ac.uk
• EBI, European Bioinformatics Institute: www.ebi.ac.uk
• ESDS, Economic and Social Data Service: www.esds.ac.uk
• NCDR, National Cancer Data Repository: www.ncin.org
• NGDC, National Geoscience Data Centre: www.ngdc.noaa.gov
• UKSSDC, UK Solar System Data Centre: www.ukssdc.ac.uk
• Office for National Statistics: www.ons.gov.uk
• UK Data Archive (UKDA): www.data-archive.ac.uk
• Casweb (census): casweb.mimas.ac.uk
• DFT: www.dft.gov.uk
• EEA: www.eea.europe.eu
• World Energy Council: www.worldenergy.org
• Florida Solar Energy Center: www.fsec.ucf.edu/
• EDINA: edina.ac.uk
• Mapcruzin: www.mapcruzin.com
• Guardian Datastore: www.guardian.co.uk/data
• London Air Quality Network: www.londonair.org.uk
• OpenStreetMap: www.openstreetmap.org
• UK Borders: edina.ac.uk/ukborders
• Met Office: www.metoffice.gov.uk
• DECC: www.decc.gov.uk
• Etc.
• Highlighted examples should be the most relevant to EDE
Has anyone used "large datasets" before?
1. Yes – 88%
2. No – 12%
Does anyone think they will use them in the future?
1. Yes – 44%
2. No – 19%
3. Don't know – 38%
Likely encounters
• Access is predominantly through the web
• Some datasets may require signing in through the university
• Fees are sometimes waived for academic use (always worth asking)
• Verify copyright and licensing
• Used in:
  – Research
  – Modelling
  – Property
  – Finance
  – Pervasive in the environmental domain
• Volume and complexity are increasing (e.g. Facebook, Flickr)
• McKinsey concluded that analysis of this kind of dataset will become increasingly important in influencing business decisions, and that skills in this area will therefore be valuable

McKinsey: "Big data: The next frontier for innovation, competition, and productivity". Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
Storage
• Very large datasets require their own servers, especially those which require security, e.g. HEED and OpenStreetMap
• Parallel storage allows download simultaneously with simulation, visualisation and analysis
• Hardware development means all but the very biggest can be stored and transported on portable hard drives
• Most can be downloaded via the internet or, in special cases, requested on a CD (e.g. Ordnance Survey MasterMap)
• Effective backup is necessary, especially once analysis begins
• Bespoke data architecture exists (e.g. financial databases); working with it requires knowledge primarily of SQL
• Most data that you encounter will be accessible through some sort of graphical interface (example on next slide)
Graphical interface vs. SQL script
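To make the SQL route concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table, column names and figures are invented for illustration. The point is that the aggregation happens inside the database, so only the summary, not millions of rows, comes back to you.

```python
import sqlite3

# Build a small in-memory database standing in for a large remote one.
# Table and column names here are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (region TEXT, year INTEGER, demand_gwh REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("London", 2010, 40123.5), ("London", 2011, 39870.2),
     ("Wales", 2010, 9120.7), ("Wales", 2011, 9001.3)],
)

# The point of SQL: ask the database for the aggregate, rather than
# downloading every row and summarising locally.
query = """
    SELECT region, AVG(demand_gwh) AS mean_demand
    FROM readings
    GROUP BY region
    ORDER BY region
"""
for region, mean_demand in conn.execute(query):
    print(region, round(mean_demand, 1))
```

The same pattern scales from this toy example to a server-hosted database: the `GROUP BY` query looks identical whether the table has four rows or forty million.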
Software and data format
• Use whatever you are comfortable with
• Excel is fine for the majority of operations, and good graphically
  – Limited to 1,048,576 rows and 16,384 columns (beware when importing data)
• For larger datasets or more sophisticated operations, consider a statistical package
  – SAS is very good for large datasets but requires programming skill
  – SPSS is almost as powerful, with a better interface; works well in conjunction with Field (2009)
• Microsoft Access allows handling of large, complicated databases
• All of these are available through cluster machines, or for home use from http://www.ucl.ac.uk/isd/common/software
• Alternatives include: R, Mathematica, Statistica and RapidMiner
Formats
• Excel (.xls, .xlsx)
• Access (.mdb, .accdb)
• SAS and SPSS have proprietary formats, but files can be exported to Excel
• A common format used for exchange is comma-separated values (.csv, .txt)
• Others include: XML (machine readable), CDF (NASA), NeXus, OpenMath, PDS, SAIF, SDTS, VICAR etc. (these require some kind of specialist knowledge)
Field, A. P. 2009. Discovering statistics using SPSS, SAGE publications Ltd.
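Since CSV is the usual exchange format and Excel silently truncates anything beyond its row limit, it is worth counting rows before opening a file. A minimal sketch with Python's standard csv module; the file name in the comment is hypothetical:

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # Excel worksheet row limit (including the header row)

def count_csv_rows(handle):
    """Count data rows in a CSV without loading it all into memory."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)

# A tiny in-memory stand-in; in practice you would pass
# open("survey.csv", newline="") for a real file (hypothetical name).
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n")
n = count_csv_rows(sample)
if n >= EXCEL_MAX_ROWS:
    print("Too big for a single Excel sheet:", n, "rows")
else:
    print("Fits in Excel:", n, "rows")
```

Because the reader streams line by line, this works on files far larger than available memory.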
Data Handling: First steps
1. Metadata
   – Data about data
   – Attached in different ways
   – Varies in form and content
   – Should follow standards, e.g. INSPIRE (http://inspire.jrc.ec.europa.eu/)
2. Identify methods of collection
   – Are these uniform across data sources?
   – May require reading supporting documentation
3. Identify contributors
   – Are they reliable?
4. Identify alternative sources
   – The case study will show that divergence is possible
5. Identify data gaps
   – First do this visually
   – Genuine gaps should not skew subsequent analysis
   – If a gap has been replaced by, for example, NULL or 0.0 it may cause problems and should be investigated
   – If several datasets are used, gap handling should be harmonised
   – Follow a convention that is obvious to you and acceptable to the software
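The placeholder check in step 5 can be automated. A sketch in Python; the placeholder list is illustrative and should be adapted to your own dataset's conventions:

```python
# Tokens that often stand in for genuinely missing values.
# The list is illustrative; real datasets use their own conventions.
SUSPECT_MISSING = {"", "NULL", "null", "NA", "N/A", "-", "0.0"}

def flag_missing(values):
    """Return indices whose value looks like a missing-data placeholder."""
    return [i for i, v in enumerate(values)
            if str(v).strip() in SUSPECT_MISSING]

# A made-up column: note the NULL, the empty cell and the suspicious 0.0
column = ["12.4", "NULL", "13.1", "", "0.0", "14.9"]
print(flag_missing(column))  # → [1, 3, 4]
```

Whether 0.0 is a genuine reading or a disguised gap cannot be decided by code alone; flagging it simply tells you where to investigate.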
Data Handling: Second steps
6. Identify duplicates
   – More than one value for a data point
   – Possibly valid
   – E.g. shortened labels falsely grouping values

Data Handling: Second steps continued…
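The duplicate check in step 6 is easy to automate once the data is in memory. A sketch in Python with hypothetical meter readings; note how the shortened label "LDN" could be falsely grouping two distinct sites:

```python
from collections import Counter

def find_duplicates(records, key):
    """Count records by key and return the keys that appear more than once."""
    counts = Counter(key(r) for r in records)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical readings; 'LDN' appears twice, which may be two genuine
# observations or two different sites collapsed by a shortened label.
readings = [("LDN", 40123.5), ("WAL", 9120.7), ("LDN", 39870.2)]
print(find_duplicates(readings, key=lambda r: r[0]))  # → {'LDN': 2}
```

As the slide says, a duplicate is not automatically an error; the output is a list of points to inspect, not to delete.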
7. Note precision
   – Data should be stored at a reasonable precision
   – For example, beware of the dataset that tries to depict population to the nearest person
   – Harmonise precision between datasets
   – It can affect comparability with other data
8. Identify spurious data
   – Many rows and columns may not be needed
   – Discard them to make analysis simple
   – Note changes and keep copies of the original
9. Harmonise headings
   – Ensure that they make sense to you and to the software
Graphical representation and statistical analysis
• The above steps can be carried out by looking through the data by eye
• However, techniques exist to automate them and therefore reduce time
• The first step in any analysis should be to create graphs
• These can reveal patterns, alongside highlighting duplicates, gaps and errors
• After this is done it may be useful to clean your data again
• Excel is fine, but more complex and repeatable operations are available with other software and some programming
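As a taste of automating the graphing step, here is a crude text histogram in Python, enough to spot gaps and outliers before reaching for proper charting software. The data values are invented:

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Crude console histogram: one '#' per value falling in each bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(bins):
        lines.append(f"{start:>6}-{start + bin_width:<6} {'#' * bins[start]}")
    return "\n".join(lines)

# Illustrative values only; note the empty 20-30 bin, which a table of
# raw numbers would not make obvious.
data = [3, 7, 12, 14, 15, 18, 33, 35]
print(text_histogram(data, bin_width=10))
```

A real analysis would use Excel charts or a plotting library, but even this crude view answers the slide's question: are there gaps, clusters or stray values?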
Some examples…
• A simple graph
• Something more complex
• Some better-looking examples (see Tufte (1983) and McCandless (2009))
Statistical tests
• Another automated analysis technique is statistical
• Simple metrics such as mean, median, mode and standard deviation are useful, as is looking at the distribution
• These can be combined in a box plot, conveying the statistics graphically
• So can tests such as the t-test
• More sophisticated analysis is available through e.g. SPSS, GIS…
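The simple metrics above are one import away in Python, and for the t-test, Welch's statistic can be computed by hand; for p-values and the full battery of tests you would still turn to SPSS, SAS or similar. The two samples are invented:

```python
import math
import statistics as st

# Two illustrative samples (hypothetical meter readings, say).
a = [102.0, 98.5, 101.2, 99.9, 100.4]
b = [104.1, 103.2, 105.0, 102.8, 104.6]

print("mean:", st.mean(a), "median:", st.median(a), "stdev:", round(st.stdev(a), 3))

def welch_t(x, y):
    """Welch's t statistic for two independent samples, unequal variances."""
    vx, vy = st.variance(x), st.variance(y)
    return (st.mean(x) - st.mean(y)) / math.sqrt(vx / len(x) + vy / len(y))

print("t:", round(welch_t(a, b), 2))
```

A large |t| suggests the two samples differ by more than chance would explain, though interpreting it properly needs degrees of freedom and a p-value, which is exactly where a statistical package earns its keep.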
Advanced analysis, simulation and visualisation
• These methods vary based on purpose and available data
• If you have purely statistical intentions then something like SPSS or SAS is ideal, especially in conjunction with Field (2009)
• A multitude of tests exist to suit your needs; beware that the right choice depends on data type, method of collection etc.
• The internet, along with books and lecturers, is a good source of guidance on which to choose
• GIS is a good tool for visualisation, provided that you have spatially related data
• Some examples of output that I have produced are on the next slide; again there is an abundance of web and literature resources
Part 2: Exercise
• Attempt, in pairs, to calculate the floor area of Central House (this building)
• Stay in the room, but use whatever techniques you have at your disposal
• No use of the internet (it will be obvious)
• Write your answer down on a piece of paper
• 10 minutes
• Be prepared to answer some questions using the poll system
• We will declare a floor area champion at the end
What units did you use?
1. Acres – 0%
2. Hectares – 0%
3. Square mile – 0%
4. Square kilometre – 100%
5. Square metre – 0%
6. Square foot – 0%
Why?
• Although the standard is m², you should not assume that data you are given uses this standard
• Always check the metadata to ensure that the units have been recorded correctly
• Remember that Americans do not use the metric system, and a large volume of data originates from the US
• Other units could well be correct, but ensure that you use the data properly
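A small sketch of doing the unit conversion explicitly rather than trusting that everyone used m²; the conversion factors are the standard definitions:

```python
# Conversion factors to square metres (standard definitions).
TO_M2 = {
    "m2": 1.0,
    "ft2": 0.09290304,        # based on the international foot
    "acre": 4046.8564224,
    "hectare": 10000.0,
    "km2": 1_000_000.0,
    "mile2": 2_589_988.110336,
}

def convert_area(value, from_unit, to_unit):
    """Convert an area between any two units in the table above."""
    return value * TO_M2[from_unit] / TO_M2[to_unit]

# The floor area from the conclusion slide, expressed in other units:
area_m2 = 3658.0
print(round(convert_area(area_m2, "m2", "ft2")))       # → 39374
print(round(convert_area(area_m2, "m2", "hectare"), 4))  # → 0.3658
```

Note that a rounded 3,658 m² converts to 39,374 sq ft, slightly off the 39,376 sq ft quoted later, which is itself a small lesson in precision: the two published figures were evidently derived from an unrounded source value.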
Did you include the basement in your calculations?
1. Yes – 0%
2. No – 100%
Why?
• Floor area can be defined as usable area; in this case the basement is used, but someone creating a larger database would not have this information
• This can cause divergence between real data and that which you are provided with
• Check the metadata
• And, if necessary, check at source
Did you attempt to subtract the floor area of interior walls?
1. Yes – 100%
2. No – 0%
Why?
• Alongside different ways of defining floor area (semantics), there are different ways of calculating it
• A dataset may have been formed from an Ordnance Survey outline, which would include interior walls
• A building survey would not include them
• Neither is wrong, but transparency is essential
How many floors did you allow for?
1. 3 – 0%
2. 4 – 0%
3. 5 – 0%
4. 6 – 16%
5. 7 – 0%
6. 8 – 42%
7. 9 – 16%
8. More – 26%
Why?
• The correct number is eight, but this may not be clear from plans
• Is the basement included in this?
Did you allow for the light well in the centre of the building?
1. Yes – 71%
2. No – 29%
Why?
• One method of calculating this would be to work out the area of the bottom floor and multiply it by the number of floors
• If you were unaware of the gap, this may skew the result
• This type of error is common, not only in floor area calculation but in others that you may come across
• It is important to investigate and understand these sources of error
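The footprint-times-floors shortcut and its light-well correction can be written out directly; all figures here are made up for illustration, not measurements of Central House:

```python
def gross_area(footprint_m2, floors):
    """Naive estimate: bottom-floor footprint multiplied up every floor."""
    return footprint_m2 * floors

def net_area(footprint_m2, floors, void_m2=0.0):
    """Same estimate, but subtracting a central void (light well) per floor."""
    return (footprint_m2 - void_m2) * floors

# All numbers invented for illustration only.
footprint, floors, light_well = 520.0, 8, 60.0
naive = gross_area(footprint, floors)
corrected = net_area(footprint, floors, light_well)
print(naive, corrected, f"{(naive - corrected) / corrected:.1%} overestimate")
```

With these made-up numbers the naive method overestimates by about 13%; the size of the error depends entirely on how large the void is relative to the footprint, which is why the source of a floor-area figure matters.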
What was your final answer in metres squared?
1. 0 – 750 – 11%
2. 750 – 1500 – 0%
3. 1500 – 2250 – 0%
4. 2250 – 3000 – 0%
5. 3000 – 3500 – 21%
6. 3500 – 4000 – 26%
7. 4000 – 4500 – 11%
8. 4500 – 5000 – 5%
9. More – 26%
Conclusion:
• The "real" answer was 3,658 m²
  – 39,376 sq ft
  – 0.003658 km²
  – 0.903949 acres
  – 0.365815 hectares
  – 0.001412 mile²
• Interestingly there is no DEC (Display Energy Certificate) here, so the figure is from the internet
• Different ways of defining the floor area have been used here, as is the case for real datasets
• The reality is that the data you have created is probably as good an estimate of the floor area as is publicly available
• Errors would be multiplied if the method were applied to, for example, the whole country, which is "a large dataset"
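A one-line check of how that multiplication works: a relative error on one building survives unchanged when the same method is scaled up to a (hypothetical) national stock:

```python
# Hypothetical: a 5% overestimate of one building's floor area carries
# straight through when the same method is applied to a national stock.
true_building = 3658.0                     # m2, one building
measured_building = true_building * 1.05   # 5% overestimate

n_buildings = 1_800_000                    # made-up national count
true_stock = true_building * n_buildings
measured_stock = measured_building * n_buildings

rel_error = (measured_stock - true_stock) / true_stock
print(f"{rel_error:.1%}")  # still 5.0%: scaling multiplies the absolute
                           # error, not the relative one
```

The absolute error, however, grows from a few hundred m² to hundreds of millions of m², which is what makes systematic definition errors so costly in national datasets.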
Data Sources (UK only)
Part 3: Research Case study: Assessing the availability and quality of data for tertiary sector energy demand forecast models
• A large number of separate datasets were required
• Divergence between them was responsible for errors of up to 100%
Results – Classification schemes

NACE (Tertiary) | ISIC (Commercial)
• Wholesale and retail trade; repair of motor vehicles and motorcycles | Wholesale and retail trade; repair of motor vehicles, motorcycles and personal and household goods
• Accommodation and food service activities | Hotels and restaurants
• Financial, insurance and real estate activities | Real estate, renting and business activities
• Administrative and support service activities | Post and telecommunication; financial intermediation
• Education | Education
• Human health and social work activities | Health
• Other NACE activities; public administration and defence | Miscellaneous; agriculture, forestry and fishery (as separate sub-sectors)

NACE: Nomenclature statistique des activités économiques dans la Communauté européenne (Eurostat, 2008)
ISIC: United Nations International Standard Industrial Classification (UNIDO, 2010)
Results - Floor space in the sector
[Chart: floor space for the entire non-domestic stock, the "tertiary sector" (two definitions) and all commercial and public buildings, showing questionable differences between them]
Results - Energy consumption in the sector
[Chart: values from the ISIC scheme vs. values from the NACE scheme, with a declining range between them]
Results - Population
Results - Employee numbers in the sector
[Chart: values from the ISIC scheme vs. values from the NACE scheme, again with a declining range]
Same patterns as seen with the energy consumption data
Results - Gross Domestic Product
[Chart: one series clearly wrong; would this be obvious in isolation?]
Results - Gross value added
[Chart: values from the ISIC scheme vs. values from the NACE scheme]
Research Case Study Conclusions
• The majority of error is caused by the lack of a standard classification methodology
• Semantic differences exist but can be resolved
• Artefacts of harmonisation require care to eradicate
• Lack of transparency is pervasive
• Precision inextricably varies
• Variables with an associated established methodology can be relied upon
• Many issues could be resolved through the setting up of a centralised repository
• Data is dangerous
Theory conclusions:
• Data exists in many and varied forms
• Handling and analysis skills will become increasingly important
• There is a set of standard steps which should be followed in an initial exploration of any dataset
• Foremost in your mind should be viewing a dataset critically
• Visualisation is key to understanding
• Graphs etc. are generally the best way of communicating information
References:
– Field, A. P. 2009. Discovering Statistics Using SPSS. SAGE Publications Ltd.
– Witten, I. H. & Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
– McCandless, D. 2009. Information is Beautiful. Collins.
– Tufte, E. R. 1983. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
– McKinsey. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
– Nebert, D. (ed.) 2000. Developing Spatial Data Infrastructures: The SDI Cookbook. GSDI. (for those interested in data infrastructure)
– See also the slide detailing data sources