TRANSCRIPT
How to Handle and Analyse Large Datasets
BENVGEE7 'Methods of Environmental Analysis'
Ed Sharp, 21st February 2012
Introduction
• BSc Geography
• Worked at SABSCO Ltd, a niche power station construction contractor
• MSc GIS
• MRes Energy Demand Studies
• PhD: The spatiotemporal patterns of energy demand and supply in the UK
• Recent interest and research into large datasets, including a major piece of research into the effects of disparate, inaccurate datasets on energy demand forecast models
• Email: [email protected]
• Web:
  – LinkedIn: http://www.linkedin.com/pub/ed-sharp/43/2b4/b1b
  – UCL: http://www.bartlett.ucl.ac.uk/energy/people/students/ed-sharp
  – LoLo: http://www.lolo.ac.uk/profilepreview/view/id/102
Today's Lecture
• Three distinct sections
1. Theory: Describe how to handle and analyse large datasets
2. Practice: Run an exercise outlining some pervasive issues
3. Case Study: Demonstrate these within the context of some existing research
• Slides available on Moodle with web and literature references in full, colour denotes section.
Part 1: What is a large dataset?
• Large volumes of data
  – Millions of entries
  – Many terabytes
  – Computationally intensive
  – Past 10 years: ×1m
• Varied sources of data
  – Same variables
  – Different sources
  – A separate set of issues
• Both types cause problems with handling and analysis
• There are issues common to the two as well as issues specific to each
Examples…
• Volumes
  – Census (http://census.ac.uk/)
  – Home Energy Efficiency Database (HEED: http://www.energysavingtrust.org.uk/Professional-resources/Existing-Housing/Homes-Energy-Efficiency-Database)
  – Time-series datasets, e.g. energy production/consumption
  – Remotely sensed data
  – Geographic datasets
  – Climate reanalyses
• Sources
  – Population
  – Economic variables (GDP, GVA etc.)
  – Socio-demographic variables (population, employment etc.)
Sources, including repositories and search engines:
• Data.gov: www.data.gov.uk
• GoGeo: www.gogeo.ac.uk
• ShareGeo: www.sharegeo.ac.uk
• Eurostat: http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/
• IEA: www.iea.org
• National Statistics: www.statistics.gov.uk
• Odyssee: http://www.odyssee-indicators.org/
• OECD: www.oecd.org
• UNECE: www.unece.org
• World Bank: www.worldbank.org
• ADS, Archaeology Data Service: archaeologydataservice.ac.uk
• BADC, British Atmospheric Data Centre: badc.nerc.ac.uk
• BODC (oceanographic): www.bodc.ac.uk
• CDS, Chemical Database Service: cds.dl.ac.uk
• EBI, European Bioinformatics Institute: www.ebi.ac.uk
• ESDS, Economic and Social Data Service: www.esds.ac.uk
• NCDR, National Cancer Data Repository: www.ncin.org
• NGDC, National Geoscience Data Centre: www.ngdc.noaa.gov
• UKSSDC, UK Solar System Data Centre: www.ukssdc.ac.uk
• Office for National Statistics: www.ons.gov.uk
• UK Data Archive (UKDA): www.data-archive.ac.uk
• Casweb (census): casweb.mimas.ac.uk
• DFT: www.dft.gov.uk
• EEA: www.eea.europe.eu
• World Energy Council: www.worldenergy.org
• Florida Solar Energy Center: www.fsec.ucf.edu/
• EDINA: edina.ac.uk
• Mapcruzin: www.mapcruzin.com
• Guardian Datastore: www.guardian.co.uk/data
• London Air Quality Network: www.londonair.org.uk
• OpenStreetMap: www.openstreetmap.org
• UK Borders: edina.ac.uk/ukborders
• Met Office: www.metoffice.gov.uk
• DECC: www.decc.gov.uk
• Etc.
• Highlighted examples should be the most relevant to EDE
Has anyone used "large datasets" before?
1. Yes – 88%
2. No – 12%
Does anyone think they will use them in the future?
1. Yes – 44%
2. No – 19%
3. Don't know – 38%
Likely encounters
• Access is predominantly through the web
• Some datasets may require signing in through the university
• Fees are sometimes waived for academic use (always worth asking)
• Verify copyright and licensing
• Used in:
  – Research
  – Modelling
  – Property
  – Finance
  – Pervasive in the environmental domain
• Volume and complexity are increasing (e.g. Facebook, Flickr)
• McKinsey concluded that analysis of this kind of dataset will become increasingly important in influencing business decisions, and that skills in this area will therefore be valuable

McKinsey: "Big data: The next frontier for innovation, competition, and productivity". Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
Storage
• Very large datasets require their own servers, especially those which require security, e.g. HEED and OpenStreetMap
• Parallel storage allows download simultaneously with simulation, visualisation and analysis
• Hardware development means all but the very biggest can be stored and transported on portable hard drives
• Most can be downloaded via the internet or, in special cases, requested on a CD (e.g. Ordnance Survey MasterMap)
• Effective backup is necessary, especially once analysis begins
• Bespoke data architecture exists (e.g. financial databases); working with it requires knowledge primarily of SQL
• Most data that you encounter will be accessible through some sort of graphical interface (example on next slide)
Graphical interface vs. SQL script
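To make the SQL route concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table, column names and figures are invented for illustration. The point is that the aggregation happens inside the database, so only the summary, not millions of rows, comes back to you.

```python
import sqlite3

# Build a small in-memory database standing in for a large remote one.
# Table and column names here are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (region TEXT, year INTEGER, demand_gwh REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("London", 2010, 40123.5), ("London", 2011, 39870.2),
     ("Wales", 2010, 9120.7), ("Wales", 2011, 9001.3)],
)

# The point of SQL: ask the database for the aggregate, rather than
# downloading every row and summarising locally.
query = """
    SELECT region, AVG(demand_gwh) AS mean_demand
    FROM readings
    GROUP BY region
    ORDER BY region
"""
for region, mean_demand in conn.execute(query):
    print(region, round(mean_demand, 1))
```

The same pattern scales from this toy example to a server-hosted database: the `GROUP BY` query looks identical whether the table has four rows or forty million.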
Software and data format
• Use whatever you are comfortable with
• Excel is fine for the majority of operations, and good graphically
  – Limited to 1,048,576 rows and 16,384 columns (beware when importing data)
• For larger datasets or more sophisticated operations, consider a statistical package
  – SAS is very good for large datasets but requires programming skill
  – SPSS is almost as powerful, with a better interface; works well in conjunction with Field (2009)
• Microsoft Access allows handling of large, complicated databases
• All of these are available through cluster machines, or for home use from http://www.ucl.ac.uk/isd/common/software
• Alternatives include: R, Mathematica, Statistica and RapidMiner
Formats
• Excel (.xls, .xlsx)
• Access (.mdb, .accdb)
• SAS and SPSS have proprietary formats, but files can be exported to Excel
• A common format used for exchange is comma-separated values (.csv, .txt)
• Others include: XML (machine readable), CDF (NASA), NeXus, OpenMath, PDS, SAIF, SDTS, VICAR etc. (these require some kind of specialist knowledge)
Field, A. P. 2009. Discovering statistics using SPSS, SAGE publications Ltd.
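Since CSV is the usual exchange format and Excel silently truncates anything beyond its row limit, it is worth counting rows before opening a file. A minimal sketch with Python's standard csv module; the file name in the comment is hypothetical:

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # Excel worksheet row limit (including the header row)

def count_csv_rows(handle):
    """Count data rows in a CSV without loading it all into memory."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)

# A tiny in-memory stand-in; in practice you would pass
# open("survey.csv", newline="") for a real file (hypothetical name).
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n")
n = count_csv_rows(sample)
if n >= EXCEL_MAX_ROWS:
    print("Too big for a single Excel sheet:", n, "rows")
else:
    print("Fits in Excel:", n, "rows")
```

Because the reader streams line by line, this works on files far larger than available memory.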
Data Handling: First steps
1. Metadata
   – Data about data
   – Attached in different ways
   – Varies in form and content
   – Should follow standards, e.g. INSPIRE (http://inspire.jrc.ec.europa.eu/)
2. Identify methods of collection
   – Are these uniform across data sources?
   – May require reading supporting documentation
3. Identify contributors
   – Are they reliable?
4. Identify alternative sources
   – The case study will show that divergence is possible
5. Identify data gaps
   – First do this visually
   – Genuine gaps should not skew subsequent analysis
   – If a gap has been replaced by, for example, NULL or 0.0 it may cause problems and should be investigated
   – If several datasets are used, gap handling should be harmonised
   – Follow a convention that is obvious to you and acceptable to the software
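The placeholder check in step 5 can be automated. A sketch in Python; the placeholder list is illustrative and should be adapted to your own dataset's conventions:

```python
# Tokens that often stand in for genuinely missing values.
# The list is illustrative; real datasets use their own conventions.
SUSPECT_MISSING = {"", "NULL", "null", "NA", "N/A", "-", "0.0"}

def flag_missing(values):
    """Return indices whose value looks like a missing-data placeholder."""
    return [i for i, v in enumerate(values)
            if str(v).strip() in SUSPECT_MISSING]

# A made-up column: note the NULL, the empty cell and the suspicious 0.0
column = ["12.4", "NULL", "13.1", "", "0.0", "14.9"]
print(flag_missing(column))  # → [1, 3, 4]
```

Whether 0.0 is a genuine reading or a disguised gap cannot be decided by code alone; flagging it simply tells you where to investigate.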
Data Handling: Second steps
6. Identify duplicates
   – More than one value for a data point
   – Possibly valid
   – E.g. shortened labels falsely grouping values

Data Handling: Second steps continued…
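The duplicate check in step 6 is easy to automate once the data is in memory. A sketch in Python with hypothetical meter readings; note how the shortened label "LDN" could be falsely grouping two distinct sites:

```python
from collections import Counter

def find_duplicates(records, key):
    """Count records by key and return the keys that appear more than once."""
    counts = Counter(key(r) for r in records)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical readings; 'LDN' appears twice, which may be two genuine
# observations or two different sites collapsed by a shortened label.
readings = [("LDN", 40123.5), ("WAL", 9120.7), ("LDN", 39870.2)]
print(find_duplicates(readings, key=lambda r: r[0]))  # → {'LDN': 2}
```

As the slide says, a duplicate is not automatically an error; the output is a list of points to inspect, not to delete.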
7. Note precision
   – Data should be stored at a reasonable precision
   – For example, beware of the dataset that tries to depict population to the nearest person
   – Harmonise precision between datasets
   – It can affect comparability with other data
8. Identify spurious data
   – Many rows and columns may not be needed
   – Discard them to make analysis simple
   – Note changes and keep copies of the original
9. Harmonise headings
   – Ensure that they make sense to you and to the software
Graphical representation and statistical analysis
• The above steps can be carried out by looking through the data by eye
• However, techniques exist to automate them and therefore reduce time
• The first step in any analysis should be to create graphs
• These can reveal patterns, alongside highlighting duplicates, gaps and errors
• After this is done it may be useful to clean your data again
• Excel is fine, but more complex and repeatable operations are available with other software and some programming
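As a taste of automating the graphing step, here is a crude text histogram in Python, enough to spot gaps and outliers before reaching for proper charting software. The data values are invented:

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Crude console histogram: one '#' per value falling in each bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(bins):
        lines.append(f"{start:>6}-{start + bin_width:<6} {'#' * bins[start]}")
    return "\n".join(lines)

# Illustrative values only; note the empty 20-30 bin, which a table of
# raw numbers would not make obvious.
data = [3, 7, 12, 14, 15, 18, 33, 35]
print(text_histogram(data, bin_width=10))
```

A real analysis would use Excel charts or a plotting library, but even this crude view answers the slide's question: are there gaps, clusters or stray values?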
Some examples…
• A simple graph
• Something more complex
• Some better-looking examples (see Tufte (1983) and McCandless (2009))
Statistical tests
• Another automated analysis technique is statistical
• Simple metrics such as mean, median, mode and standard deviation are useful, as is looking at the distribution
• These can be combined in a box plot, conveying the statistics graphically
• So can tests such as the t-test
• More sophisticated analysis is available through e.g. SPSS, GIS…
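The simple metrics above are one import away in Python, and for the t-test, Welch's statistic can be computed by hand; for p-values and the full battery of tests you would still turn to SPSS, SAS or similar. The two samples are invented:

```python
import math
import statistics as st

# Two illustrative samples (hypothetical meter readings, say).
a = [102.0, 98.5, 101.2, 99.9, 100.4]
b = [104.1, 103.2, 105.0, 102.8, 104.6]

print("mean:", st.mean(a), "median:", st.median(a), "stdev:", round(st.stdev(a), 3))

def welch_t(x, y):
    """Welch's t statistic for two independent samples, unequal variances."""
    vx, vy = st.variance(x), st.variance(y)
    return (st.mean(x) - st.mean(y)) / math.sqrt(vx / len(x) + vy / len(y))

print("t:", round(welch_t(a, b), 2))
```

A large |t| suggests the two samples differ by more than chance would explain, though interpreting it properly needs degrees of freedom and a p-value, which is exactly where a statistical package earns its keep.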
Advanced analysis, simulation and visualisation
• These methods vary based on purpose and available data
• If you have purely statistical intentions then something like SPSS or SAS is ideal, especially in conjunction with Field (2009)
• A multitude of tests exist to suit your needs; beware that the right choice depends on data type, method of collection etc.
• The internet, along with books and lecturers, is a good source of guidance on which to choose
• GIS is a good tool for visualisation, provided that you have spatially related data
• Some examples of output that I have produced are on the next slide; again there is an abundance of web and literature resources
Part 2: Exercise
• Attempt, in pairs, to calculate the floor area of Central House (this building)
• Stay in the room, but use whatever techniques you have at your disposal
• No use of the internet (it will be obvious)
• Write your answer down on a piece of paper
• 10 minutes
• Be prepared to answer some questions using the poll system
• We will declare a floor area champion at the end
What units did you use?
1. Acres – 0%
2. Hectares – 0%
3. Square mile – 0%
4. Square kilometre – 100%
5. Square metre – 0%
6. Square foot – 0%
Why?
• Although the standard is m², you should not assume that data you are given uses this standard
• Always check the metadata to ensure that the units have been recorded correctly
• Remember that Americans do not use the metric system, and a large volume of data originates from the US
• Other units could well be correct, but ensure that you use the data properly
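A small sketch of doing the unit conversion explicitly rather than trusting that everyone used m²; the conversion factors are the standard definitions:

```python
# Conversion factors to square metres (standard definitions).
TO_M2 = {
    "m2": 1.0,
    "ft2": 0.09290304,        # based on the international foot
    "acre": 4046.8564224,
    "hectare": 10000.0,
    "km2": 1_000_000.0,
    "mile2": 2_589_988.110336,
}

def convert_area(value, from_unit, to_unit):
    """Convert an area between any two units in the table above."""
    return value * TO_M2[from_unit] / TO_M2[to_unit]

# The floor area from the conclusion slide, expressed in other units:
area_m2 = 3658.0
print(round(convert_area(area_m2, "m2", "ft2")))       # → 39374
print(round(convert_area(area_m2, "m2", "hectare"), 4))  # → 0.3658
```

Note that a rounded 3,658 m² converts to 39,374 sq ft, slightly off the 39,376 sq ft quoted later, which is itself a small lesson in precision: the two published figures were evidently derived from an unrounded source value.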
Did you include the basement in your calculations?
1. Yes – 0%
2. No – 100%
Why?
• Floor area can be defined as usable area; in this case the basement is used, but someone creating a larger database would not have this information
• This can cause divergence between real data and that which you are provided with
• Check the metadata
• And, if necessary, check at source
Did you attempt to subtract the floor area of interior walls?
1. Yes – 100%
2. No – 0%
Why?
• Alongside different ways of defining floor area (semantics), there are different ways of calculating it
• A dataset may have been formed from an Ordnance Survey outline, which would include interior walls
• A building survey would not include them
• Neither is wrong, but transparency is essential
How many floors did you allow for?
1. 3 – 0%
2. 4 – 0%
3. 5 – 0%
4. 6 – 16%
5. 7 – 0%
6. 8 – 42%
7. 9 – 16%
8. More – 26%
Why?
• The correct number is eight, but this may not be clear from plans
• Is the basement included in this?
Did you allow for the light well in the centre of the building?
1. Yes – 71%
2. No – 29%
Why?
• One method of calculating this would be to work out the area of the bottom floor and multiply it by the number of floors
• If you were unaware of the gap, this may skew the result
• This type of error is common, not only in floor area calculation but in others that you may come across
• It is important to investigate and understand these sources of error
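The footprint-times-floors shortcut and its light-well correction can be written out directly; all figures here are made up for illustration, not measurements of Central House:

```python
def gross_area(footprint_m2, floors):
    """Naive estimate: bottom-floor footprint multiplied up every floor."""
    return footprint_m2 * floors

def net_area(footprint_m2, floors, void_m2=0.0):
    """Same estimate, but subtracting a central void (light well) per floor."""
    return (footprint_m2 - void_m2) * floors

# All numbers invented for illustration only.
footprint, floors, light_well = 520.0, 8, 60.0
naive = gross_area(footprint, floors)
corrected = net_area(footprint, floors, light_well)
print(naive, corrected, f"{(naive - corrected) / corrected:.1%} overestimate")
```

With these made-up numbers the naive method overestimates by about 13%; the size of the error depends entirely on how large the void is relative to the footprint, which is why the source of a floor-area figure matters.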
What was your final answer in metres squared?
1. 0 – 750 – 11%
2. 750 – 1500 – 0%
3. 1500 – 2250 – 0%
4. 2250 – 3000 – 0%
5. 3000 – 3500 – 21%
6. 3500 – 4000 – 26%
7. 4000 – 4500 – 11%
8. 4500 – 5000 – 5%
9. More – 26%
Conclusion:
• The "real" answer was 3,658 m²
  – 39,376 sq ft
  – 0.003658 km²
  – 0.903949 acres
  – 0.365815 hectares
  – 0.001412 mile²
• Interestingly there is no DEC (Display Energy Certificate) here, so the figure is from the internet
• Different ways of defining the floor area have been used here, as is the case for real datasets
• The reality is that the data you have created is probably as good an estimate of the floor area as is publicly available
• Errors would be multiplied if the method were applied to, for example, the whole country, which is "a large dataset"
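A one-line check of how that multiplication works: a relative error on one building survives unchanged when the same method is scaled up to a (hypothetical) national stock:

```python
# Hypothetical: a 5% overestimate of one building's floor area carries
# straight through when the same method is applied to a national stock.
true_building = 3658.0                     # m2, one building
measured_building = true_building * 1.05   # 5% overestimate

n_buildings = 1_800_000                    # made-up national count
true_stock = true_building * n_buildings
measured_stock = measured_building * n_buildings

rel_error = (measured_stock - true_stock) / true_stock
print(f"{rel_error:.1%}")  # still 5.0%: scaling multiplies the absolute
                           # error, not the relative one
```

The absolute error, however, grows from a few hundred m² to hundreds of millions of m², which is what makes systematic definition errors so costly in national datasets.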
Data Sources (UK only)
Part 3: Research Case study: Assessing the availability and quality of data for tertiary sector energy demand forecast models
• A large number of separate datasets were required
• Divergence between them was responsible for errors of up to 100%
Results – Classification schemes

NACE (Tertiary) | ISIC (Commercial)
• Wholesale and retail trade; repair of motor vehicles and motorcycles | Wholesale and retail trade; repair of motor vehicles, motorcycles and personal and household goods
• Accommodation and food service activities | Hotels and restaurants
• Financial, insurance and real estate activities | Real estate, renting and business activities
• Administrative and support service activities | Post and telecommunication; financial intermediation
• Education | Education
• Human health and social work activities | Health
• Other NACE activities; public administration and defence | Miscellaneous; agriculture, forestry and fishery (as separate sub-sectors)

NACE: Nomenclature statistique des activités économiques dans la Communauté européenne (Eurostat, 2008)
ISIC: United Nations International Standard Industrial Classification (UNIDO, 2010)
Results - Floor space in the sector
[Chart: floor space for the entire non-domestic stock, the "tertiary sector" (two definitions) and all commercial and public buildings, showing questionable differences between them]
Results - Energy consumption in the sector
[Chart: values from the ISIC scheme vs. values from the NACE scheme, with a declining range between them]
Results - Population
Results - Employee numbers in the sector
[Chart: values from the ISIC scheme vs. values from the NACE scheme, again with a declining range]
Same patterns as seen with the energy consumption data
Results - Gross Domestic Product
[Chart: one series clearly wrong; would this be obvious in isolation?]
Results - Gross value added
[Chart: values from the ISIC scheme vs. values from the NACE scheme]
Research Case Study Conclusions
• The majority of error is caused by the lack of a standard classification methodology
• Semantic differences exist but can be resolved
• Artefacts of harmonisation require care to eradicate
• Lack of transparency is pervasive
• Precision inextricably varies
• Variables with an associated established methodology can be relied upon
• Many issues could be resolved through the setting up of a centralised repository
• Data is dangerous
Theory conclusions:
• Data exists in many and varied forms
• Handling and analysis skills will become increasingly important
• There is a set of standard steps which should be followed in an initial exploration of any dataset
• Foremost in your mind should be viewing a dataset critically
• Visualisation is key to understanding
• Graphs etc. are generally the best way of communicating information
References:
– Field, A. P. 2009. Discovering Statistics Using SPSS. SAGE Publications Ltd.
– Witten, I. H. & Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
– McCandless, D. 2009. Information is Beautiful. Collins.
– Tufte, E. R. 1983. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
– McKinsey. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
– Nebert, D. (ed.) 2000. Developing Spatial Data Infrastructures: The SDI Cookbook. GSDI. (for those interested in data infrastructure)
– See also the slide detailing data sources