Statistics and computer science for a data-rich world

Upload: maude

Post on 22-Feb-2016

Page 1: Statistics and computer science for a data-rich world

Data mining and statistical learning: lecture 1a

Statistics and computer science for a data-rich world

Page 2: Statistics and computer science for a data-rich world

2020 Computing: Everything everywhere

Declan Butler, Nature, Vol 440, Issue no. 7083, 23 March 2006

Computing is getting exponentially cheaper

Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head

Page 3: Statistics and computer science for a data-rich world

2020 Computing: Everything everywhere

Declan Butler, Nature, Vol 440, Issue no. 7083, 23 March 2006

Science of the future: researchers can keep a constant eye on the flow of a Norwegian glacier by tracking miniature sensors buried beneath the ice.

Page 4: Statistics and computer science for a data-rich world

Examples of huge databases

Transaction databases

Customer relations databases

Electronic health records (patient information)

Records of phone calls and website visits

Security information

Weather and climate data

Astrophysics data

Particle accelerator data

Page 5: Statistics and computer science for a data-rich world

Emerging Database Infrastructure

2001: The National Virtual Observatory project gets under way in the United States, developing methods for mining huge astronomical data sets.

2001: The US National Institutes of Health launches the Biomedical Informatics Research Network (BIRN), a grid of supercomputers designed to let multiple institutions share data.

2007: INSPIRE (INfrastructure for SPatial InfoRmation in Europe). The INSPIRE initiative aims to trigger the creation of a European spatial information infrastructure that delivers integrated spatial information services to its users.

2007: CERN's Large Hadron Collider in Switzerland, the world's largest particle accelerator, is slated to come online. The flood of data it delivers will demand more processing power than ever before.

Page 6: Statistics and computer science for a data-rich world

The future of scientific computing

Nature, Vol 440, Issue no. 7083, 23 March 2006

Science will increasingly be done directly in the database, finding relationships among existing data, while someone else performs the data collecting role

This means that scientists will have to understand computer science much the same way as they previously had to understand mathematics, as a basic tool with which to do their jobs

Page 7: Statistics and computer science for a data-rich world

2020 Computing: Everything everywhere

Declan Butler, Nature, Vol 440, Issue no. 7083, 23 March 2006

In the medical sciences, researchers will be able to mine up-to-the-minute databases instead of painstakingly collecting their own data

The understanding of diseases and the efficacy of treatments will be dissected by ceaselessly monitoring huge clinical populations

It will be a very different way of thinking, sifting through the data to find patterns.

Page 8: Statistics and computer science for a data-rich world

A two-way street to science’s future

Ian Foster, Nature, Vol 440, Issue no. 7083, 23 March 2006

Science is increasingly about information: its collection, organization and transformation

George Djorgovski: “Applied computer science is now playing the role which mathematics did from the seventeenth through the twentieth centuries: providing an orderly, formal framework and exploratory apparatus for other sciences”

Science is becoming less reductionist and more integrative

Page 9: Statistics and computer science for a data-rich world

Science in an exponential world

Alexander Szalay and Jim Gray, Nature, Vol 440, Issue no. 7083, 23 March 2006

Increasingly, scientists are analysing complex systems that require data to be combined from several groups and even several disciplines.

Important discoveries are made by scientists and teams who combine different skill sets – not just biologists, physicists and chemists, but also computer scientists, statisticians and data-visualization experts.

Page 10: Statistics and computer science for a data-rich world

Exceeding human limits

Stephen H. Muggleton, Nature, Vol 440, Issue no. 7083, 23 March 2006

A single high-throughput experiment in biology can easily generate more than a gigabyte of data per day.

It is clear that the future of science involves the expansion of automation in all its aspects: data collection, storage of information, hypothesis formation and experimentation.

We are seeing a range of techniques from mathematics, statistics and computer science being used to create scientific models from empirical data in an increasingly automated way.

But there is a severe danger that increases in the speed and volume of data generation could lead to decreases in comprehensibility!

Page 11: Statistics and computer science for a data-rich world

Visual Analytics

Visual analytics integrates new computational and theory-based tools with innovative interactive techniques and visual representations to enable human-information discourse.

The design of the tools and the techniques is based on cognitive, design, and perceptual principles.

Illuminating the Path: The Research and Development Agenda for Visual Analytics

Page 12: Statistics and computer science for a data-rich world

Organizing Undergraduate and Graduate Training

It is important to realize that today’s graduate students need formal training in areas beyond their central discipline:

they need to know some data management, computational concepts and statistical techniques.

Page 13: Statistics and computer science for a data-rich world

Key competences

Artificial intelligence and machine learning

Databases and data warehousing

Statistics for prediction, classification, and assessment of data quality

Visual analytics

Scientific computing

Page 14: Statistics and computer science for a data-rich world

The science of statistics in a data-rich world

Decreasing interest                 Increasing interest

Hypothesis testing                  Description and visualization
                                    Prediction and classification

Theoretically derived estimators    Resampling techniques
                                    Simulation (MC, MCMC)

Classical linear models             Generalized linear models
                                    Generalized additive models
                                    Neural networks
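
To make one row of the comparison above concrete, the following is a minimal sketch (not part of the original slides) that contrasts a theoretically derived estimator with a resampling technique for the same quantity, the standard error of a sample mean. The function names, the synthetic data, and the number of resamples are illustrative assumptions only.

```python
import random
import statistics


def analytic_se_of_mean(sample):
    """Theoretically derived estimator: sample standard deviation / sqrt(n)."""
    return statistics.stdev(sample) / len(sample) ** 0.5


def bootstrap_se_of_mean(sample, n_resamples=2000, seed=0):
    """Resampling estimate: spread of the mean across bootstrap resamples.

    Each resample draws len(sample) values from the data with replacement.
    n_resamples and seed are illustrative defaults, not recommendations.
    """
    rng = random.Random(seed)
    n = len(sample)
    means = [
        statistics.fmean(rng.choices(sample, k=n))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)


if __name__ == "__main__":
    # Synthetic data for illustration only (assumed normal, mean 10, sd 2).
    data_rng = random.Random(42)
    data = [data_rng.gauss(10.0, 2.0) for _ in range(200)]
    print(f"analytic SE:  {analytic_se_of_mean(data):.3f}")
    print(f"bootstrap SE: {bootstrap_se_of_mean(data):.3f}")
```

For the mean, the two estimates should agree closely, because the formula is known. The appeal of resampling is that the same loop works for statistics such as the median, where a closed-form standard error is harder to derive.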