The Structure of (Computer) Scientific Revolutions Dow Jones Enterprise Ventures May 2006 Michael Franklin UC Berkeley & Amalgamated Insight

Page 1: The  Structure  of (Computer) Scientific Revolutions

The Structure of (Computer) Scientific Revolutions

Dow Jones Enterprise Ventures, May 2006

Michael Franklin

UC Berkeley & Amalgamated Insight

Page 2: The  Structure  of (Computer) Scientific Revolutions

Michael Franklin, Dow Jones EV Summit, May 2006

Data Management: Then

Structured Data Processing

Page 3: The  Structure  of (Computer) Scientific Revolutions

Data Management: Now

Page 4: The  Structure  of (Computer) Scientific Revolutions

The Structure Spectrum

• Structured data (schema-first)
  • regular, known, conforming, …
  • e.g., relational database

• Unstructured data (schema-never)
  • freeform, irregular
  • e.g., plain text, images, audio, …

• Semi-structured data (schema-later)
  • Provides structural information, but less constrained
  • e.g., XML, tagged text/media
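To make the three points on the spectrum concrete, here is a minimal sketch that holds roughly the same fact in each form. The record, table name, and XML element names are invented for illustration and do not come from the talk.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured (schema-first): the schema is declared before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (tag_id TEXT, shelf INTEGER, temp_c REAL)")
conn.execute("INSERT INTO readings VALUES ('tag-17', 3, 21.5)")
structured_temp = conn.execute(
    "SELECT temp_c FROM readings WHERE tag_id = 'tag-17'"
).fetchone()[0]

# Semi-structured (schema-later): tags carry structure, but nothing enforces
# which elements appear; the reader decides what to look for.
doc = ET.fromstring(
    "<reading><tag>tag-17</tag><temp unit='C'>21.5</temp>"
    "<note>near door</note></reading>"
)
semi_temp = float(doc.findtext("temp"))

# Unstructured (schema-never): without further processing, only string
# operations such as substring matching are available.
text = "Tag 17 on the third shelf read about 21.5 degrees this morning."
mentions_tag = "Tag 17" in text
```

The structured form answers a typed query directly, the semi-structured form needs the reader to know which element to ask for, and the free text supports little more than matching.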

Page 5: The  Structure  of (Computer) Scientific Revolutions

Whither Structured Data?

• Conventional Wisdom: ~20% of data is structured currently.

• Consumer apps, enterprise search, media apps are placing downward pressure on this.

Page 6: The  Structure  of (Computer) Scientific Revolutions

A Contrarian View?

Two reasons why structured data is where the action will be:

• The “Data Industrial Revolution”: Data used to be “hand-crafted”; now it’s generated by computers!

• The Data Integration quagmire: structure provides crucial cues for making data usable.

Page 7: The  Structure  of (Computer) Scientific Revolutions

The New Landscape

Bell’s Law: Every decade, a new, lower-cost class of computers emerges, defined by platform, interface, and interconnect.

• Mainframes: 1960s
• Minicomputers: 1970s
• Microcomputers/PCs: 1980s
• Web-based computing: 1990s
• Devices (cell phones, PDAs, wireless sensors, RFID): 2000s

Enabling a new generation of applications for Operational Visibility, monitoring, and alerting.

Page 8: The  Structure  of (Computer) Scientific Revolutions

Data Streams Data Flood

[Figure: data sources: clickstream, barcodes, PoS systems, sensors, RFID, telematics, inventory, phones, transactional systems]

• Exponential data growth

• New challenges: continuous, inter-connected, distributed, physical

• Shrinking business cycles

• More complex decisions

Page 9: The  Structure  of (Computer) Scientific Revolutions

State of the Art

• Custom-coded implementations that are expensive and often unsuccessful.

• Can we develop the right infrastructure to support large-scale data streaming apps?

Page 10: The  Structure  of (Computer) Scientific Revolutions

High Fan In Systems

• A data management infrastructure for large-scale data streaming environments.

• Uniform Declarative Framework
• Every node is a data stream processor that speaks SQL-ese stream-oriented queries at all levels
• Hierarchical, stream-based views as an organizing principle.
• Can impose a “view” over messy devices.

Page 11: The  Structure  of (Computer) Scientific Revolutions

HiFi - Taming the Data Flood

[Figure: levels of the hierarchy: receptors; dock doors, shelves; warehouses, stores; regional centers; headquarters]

Hierarchical Aggregation
• Spatial
• Temporal

In-network Stream Query Processing and Storage

Fast Data Path vs. Slow Data Path
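The spatial side of hierarchical aggregation can be sketched as a tree in which each node passes only a summary upward. This is an illustrative toy, not the HiFi implementation: the `Node` class, its fields, and the sample tags are all assumptions, and temporal aggregation is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One level of the hierarchy (hypothetical structure)."""
    name: str
    children: list["Node"] = field(default_factory=list)
    tags_seen: set[str] = field(default_factory=set)  # only receptors fill this

    def item_count(self) -> int:
        # Each node reports a single number upward instead of raw readings.
        local = len(self.tags_seen)
        return local + sum(child.item_count() for child in self.children)


shelf = Node("shelf receptor", tags_seen={"t1", "t2", "t3"})
dock = Node("dock-door receptor", tags_seen={"t4"})
store = Node("store", children=[shelf, dock])
region = Node("regional center", children=[store])
hq = Node("headquarters", children=[region])

total_at_hq = hq.item_count()
total_at_store = store.item_count()
```

Because every level exposes the same operation, the same question can be asked at a store, a region, or headquarters.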

Page 12: The  Structure  of (Computer) Scientific Revolutions

Device Issues: example

Shelf RFID Test - Ground Truth

Page 13: The  Structure  of (Computer) Scientific Revolutions

Actual RFID Readings

“Restock every time inventory goes below 5”

Page 14: The  Structure  of (Computer) Scientific Revolutions

Query-based Data Cleaning

Point

Smooth

CREATE VIEW smoothed_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM cleaned_rfid_stream [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= count_T)
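For readers without a stream engine at hand, the Smooth step can be approximated in plain Python. This is a sketch under stated assumptions, not the system's implementation: readings are taken to be `(timestamp_sec, receptor_id, tag_id)` tuples, the 5-second range with a 5-second slide is treated as non-overlapping windows, and the threshold value of 2 stands in for `count_T`, which the slide does not specify.

```python
from collections import Counter


def smooth(readings, window_sec=5, count_threshold=2):
    """Keep (receptor_id, tag_id) pairs read at least count_threshold
    times within a window. Returns {window_index: set of pairs}."""
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        window = int(ts // window_sec)
        counts[(window, receptor_id, tag_id)] += 1

    result = {}
    for (window, receptor_id, tag_id), n in counts.items():
        if n >= count_threshold:
            result.setdefault(window, set()).add((receptor_id, tag_id))
    return result


# Invented sample readings.
readings = [
    (0.5, "r1", "tagA"), (1.5, "r1", "tagA"), (3.0, "r1", "tagA"),
    (2.0, "r1", "tagB"),  # a single stray read, dropped by the threshold
    (6.0, "r1", "tagA"), (9.0, "r1", "tagA"),
]
smoothed = smooth(readings)
```

The one-off `tagB` read falls below the threshold and disappears, while `tagA` survives in both windows.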

Page 15: The  Structure  of (Computer) Scientific Revolutions

Query-based Data Cleaning

Point

Smooth

Arbitrate

CREATE VIEW arbitrated_rfid_stream AS
(SELECT receptor_id, tag_id
 FROM smoothed_rfid_stream rs [range by '5 sec', slide by '5 sec']
 GROUP BY receptor_id, tag_id
 HAVING count(*) >= ALL (SELECT count(*)
                         FROM smoothed_rfid_stream [range by '5 sec', slide by '5 sec']
                         WHERE tag_id = rs.tag_id
                         GROUP BY receptor_id))
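The Arbitrate step assigns each tag to the receptor that saw it most often in a window. A rough Python equivalent follows; the tuple layout `(timestamp_sec, receptor_id, tag_id)`, the non-overlapping windows, and the sample readings are assumptions made for the example.

```python
from collections import Counter, defaultdict


def arbitrate(readings, window_sec=5):
    """Return the (window, receptor_id, tag_id) triples where the receptor's
    read count for the tag is at least that of every other receptor."""
    counts = Counter()
    for ts, receptor_id, tag_id in readings:
        counts[(int(ts // window_sec), receptor_id, tag_id)] += 1

    # Highest per-receptor count for each (window, tag).
    best = defaultdict(int)
    for (window, _receptor, tag_id), n in counts.items():
        best[(window, tag_id)] = max(best[(window, tag_id)], n)

    # Ties are kept, mirroring the ">= ALL" comparison in the query.
    return {
        (window, receptor_id, tag_id)
        for (window, receptor_id, tag_id), n in counts.items()
        if n == best[(window, tag_id)]
    }


# Invented sample readings.
readings = [
    (0.2, "shelf1", "tagA"), (1.1, "shelf1", "tagA"), (2.4, "shelf1", "tagA"),
    (1.7, "shelf2", "tagA"),  # a neighbouring shelf picks up the same tag once
    (3.0, "shelf2", "tagB"),
]
assigned = arbitrate(readings)
```

Here `tagA` is attributed to `shelf1` only, so it is not counted twice when inventory is tallied per shelf.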

Page 16: The  Structure  of (Computer) Scientific Revolutions

After Query-based Cleaning

“Restock every time inventory goes below 5”

Page 17: The  Structure  of (Computer) Scientific Revolutions

Once you have the right abstractions…

• “Soft Sensors”
• Quality and lineage
• Optimization (power, etc.)
• Pushdown of external validation information
• Data archiving
• Model-based sensing
• Imperative processing
• …

Page 18: The  Structure  of (Computer) Scientific Revolutions

Data Integration

• Integration is the ultimate schema-first problem.

• Structure is both a key enabler and a key impediment here.

Page 19: The  Structure  of (Computer) Scientific Revolutions

Search vs. Query

What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

Page 20: The  Structure  of (Computer) Scientific Revolutions

Search vs. Query

Page 21: The  Structure  of (Computer) Scientific Revolutions

Search vs. Query

What if you wanted to find out which actors donated to John Kerry’s presidential campaign?

Page 22: The  Structure  of (Computer) Scientific Revolutions

Search vs. Query

• “Search” can return only what’s been previously “stored”.

Page 23: The  Structure  of (Computer) Scientific Revolutions

Also…

• What if you wanted to find out the average donation of actors to each candidate?

• What if you wanted to compare actor donations this campaign to the last one?

• What if you wanted to find out who gave the most to each candidate?

• What if you wanted to know where the information came from, and how old it was?
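Questions like these are aggregations, which a query language can express once the data has structure. The sketch below uses Python's built-in `sqlite3` with a hypothetical `donations` table; the table layout and every row are placeholders invented for the example, not real contribution data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE donations (
        donor TEXT, occupation TEXT, candidate TEXT,
        amount REAL, source TEXT)
""")
conn.executemany("INSERT INTO donations VALUES (?, ?, ?, ?, ?)", [
    ("Donor A", "actor", "Candidate X", 1000.0, "source-1"),
    ("Donor B", "actor", "Candidate X", 2000.0, "source-1"),
    ("Donor C", "actor", "Candidate Y", 500.0, "source-2"),
    ("Donor D", "lawyer", "Candidate X", 9000.0, "source-1"),
])

# Average donation of actors to each candidate.
avg_by_candidate = dict(conn.execute("""
    SELECT candidate, AVG(amount)
    FROM donations
    WHERE occupation = 'actor'
    GROUP BY candidate
"""))

# Who gave the most to each candidate.
top_donor = dict(conn.execute("""
    SELECT d.candidate, d.donor
    FROM donations d
    WHERE d.amount = (SELECT MAX(amount)
                      FROM donations
                      WHERE candidate = d.candidate)
"""))
```

The `source` column hints at how provenance could be carried along with each row, so that "where did this come from" becomes just another query.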

Page 24: The  Structure  of (Computer) Scientific Revolutions

A “Deep-Web” Query Approach

SELECT y.name, f.occupation, …
FROM Yahoo_Actors y, FECInfo f
WHERE y.name = f.name
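A small local stand-in for this join shows one thing to watch for: an equality join only matches names that are spelled identically in both sources. The table and column names follow the query above; the rows are invented placeholders, and running against in-memory `sqlite3` tables rather than live web sources is an assumption of the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Yahoo_Actors (name TEXT)")
conn.execute("CREATE TABLE FECInfo (name TEXT, occupation TEXT)")

conn.executemany(
    "INSERT INTO Yahoo_Actors VALUES (?)",
    [("Pat Example",), ("Sam Sample",)],
)
conn.executemany("INSERT INTO FECInfo VALUES (?, ?)", [
    ("Pat Example", "Actor"),
    ("SAMPLE, SAM", "Actor"),  # same hypothetical person, different formatting
])

rows = conn.execute("""
    SELECT y.name, f.occupation
    FROM Yahoo_Actors y, FECInfo f
    WHERE y.name = f.name
""").fetchall()
```

Only the identically formatted name joins; the second person is silently dropped unless the names are normalized first.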

Page 25: The  Structure  of (Computer) Scientific Revolutions

“Yahoo Actors” JOIN “FECInfo”

Q: Did it Work?

Page 26: The  Structure  of (Computer) Scientific Revolutions

The Fundamental Tradeoff

[Figure: level of functionality plotted against time (and cost), with curves for structured (schema-first), semi-structured (schema-later), and unstructured (schema-less) data]

Structure enables computers to help users manipulate and maintain the data.

Page 27: The  Structure  of (Computer) Scientific Revolutions

Dataspaces*

• Deal with all the data from an enterprise – in whatever form

• Data co-existence: no integrated schema, no single warehouse

• Pay-as-you-go services
  • Keyword search is the bare minimum.
  • Data manipulation and increased consistency as you add work.
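The "bare minimum" service can be pictured as keyword search that runs over every source as it is, with no integrated schema. The following is a toy sketch: the `keyword_search` helper, the source names, and the sample items are all hypothetical.

```python
def keyword_search(sources, keyword):
    """Return (source_name, item) pairs whose text form mentions the keyword.

    Works on any item that can be rendered as a string, so rows and
    free text can sit side by side without a shared schema.
    """
    keyword = keyword.lower()
    hits = []
    for name, items in sources.items():
        for item in items:
            if keyword in str(item).lower():
                hits.append((name, item))
    return hits


# Two co-existing sources: structured rows and free-text messages.
sources = {
    "orders_table": [
        {"order_id": 1, "customer": "Acme"},
        {"order_id": 2, "customer": "Globex"},
    ],
    "email_archive": [
        "Reminder: the Acme shipment is late",
        "Lunch on Friday?",
    ],
}
hits = keyword_search(sources, "acme")
```

Richer services, such as joining the order row to the email that mentions it, would require the extra integration work that the pay-as-you-go model defers.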

* “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.

Page 28: The  Structure  of (Computer) Scientific Revolutions

Dataspaces vs. Databases

Dataspaces:
• Data Coexistence
• Autonomous Sources
• Search, Browse, Approximate Answer
• Best Effort Guarantees

Databases:
• Single Schema
• Centralized Administration
• Structured Query
• Strict Integrity Constraints

Page 29: The  Structure  of (Computer) Scientific Revolutions

The World of Dataspaces

[Figure: systems placed by semantic integration (high to low) and administrative proximity (near to far): DBMS, federated DBMS, virtual organization, desktop search, web search]

Page 30: The  Structure  of (Computer) Scientific Revolutions

Conclusions

• Structured data is not going away.
  • In fact, there will be lots more of it,
  • and it must be processed as fast as it is created.

• Structure is crucial for successful data integration and manipulation.
  • Much effort will be expended to add structural information to text and media.

• Traditional (structured) database technology is not up to the task.

• Great opportunities for innovation.
  • HiFi and Dataspaces are examples.