Life and Work of Jim Gray | Turing100@Persistent
DESCRIPTION
Dr. Anand Deshpande, Chairman, Managing Director & CEO, Persistent Systems Ltd, talks about the Life and Work of Jim Gray (1998 Turing Award recipient) during the 6th Turing Session.
TRANSCRIPT
1
Life and Work of Jim Gray
January 5, 2013
Turing 100 @Persistent
2
3
JAMES ("JIM") NICHOLAS GRAY
United States – 1998
CITATION
For fundamental contributions to database and transaction processing research and technical leadership in system implementation from research prototypes to commercial products. The transaction is the fundamental abstraction underlying database system concurrency and failure recovery. Gray's work defined the key transaction properties: atomicity, consistency, isolation and durability, and his locking and recovery work demonstrated how to build … systems that exhibit these properties.
E. F. Codd invented the relational database model in 1970, creating what is today a $100+ billion/year industry.
5
Codd's Relational Model
● Simple model
● Data stored in relational tables
● Data independence – separation of data storage and data access
● Declarative queries
● An algebra to mathematically reason about data objects – made query optimization possible
● Ad-hoc queries through SQL
● Embedded in operational systems
6
●Jim Gray defined ACID properties to guarantee database transactions are processed reliably.
ACID properties are fundamental to relational systems and necessary for on-line transaction processing (OLTP) systems.
Atomicity Consistency Isolation Durability
7
From Transactions to Transaction Processing Systems - II
(Diagram: a change in reality is mirrored in the abstraction – a transaction transforms database DB into DB'; a query against the database returns an answer.)
The real state is represented by an abstraction, called the database, and the transformation of the real state is mirrored by the execution of a program, called a transaction, that transforms the database.
8
Gray classified Data Manipulation Actions as:
● Unprotected – transient and internal state
● Protected – grouped into transactions and reflected in the state of the transaction outcome
● Real – involve sensors, actuators, etc.; they cannot be undone, they can only be compensated
9
Definitions
● A transaction is a sequence of operations that form a single unit of work
● A transaction is often initiated by an application program
– begin a transaction: START TRANSACTION
– end a transaction: COMMIT (if successful) or ROLLBACK (if errors)
● Either the whole transaction must succeed or the effect of all operations has to be undone (rollback)
● To achieve durable transaction atomicity, the transition to the "committed" state must be accomplished by a single write to non-volatile storage.
10
Structure of a Transaction Program
BEGIN WORK()
    WORK
COMMIT WORK()   or   ROLLBACK WORK()
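As an illustrative sketch (not from the talk), the BEGIN/COMMIT/ROLLBACK structure can be exercised with Python's sqlite3 module. The table, account names, and the overdraft rule are made up:

```python
import sqlite3

# In-memory bank: a transfer must be atomic -- both legs or neither.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account(name TEXT PRIMARY KEY, balance INT)")
con.executemany("INSERT INTO account VALUES(?, ?)",
                [("alice", 100), ("bob", 0)])
con.commit()

def transfer(con, src, dst, amount):
    """BEGIN WORK ... COMMIT WORK, or ROLLBACK WORK on any error."""
    try:
        with con:  # opens a transaction; commits on success, rolls back on exception
            con.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                        (amount, src))
            con.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                        (amount, dst))
            # a consistency constraint: no overdrafts
            (bal,) = con.execute("SELECT balance FROM account WHERE name = ?",
                                 (src,)).fetchone()
            if bal < 0:
                raise ValueError("overdraft")
        return True
    except Exception:
        return False

assert transfer(con, "alice", "bob", 60)      # commits
assert not transfer(con, "alice", "bob", 60)  # overdraft -> rolled back
balances = dict(con.execute("SELECT name, balance FROM account"))
```

The failed second transfer leaves no trace: both its UPDATEs are undone together.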
11
While at IBM San Jose Research Laboratory, October 1972 to December 1980
● Jim Gray developed three key ideas related to transaction concurrency control:
– the notion of transaction
– serializability and degrees of consistency
– multi-granularity locking
● There are two main transaction issues:
– concurrent execution of multiple transactions
– recovery after hardware failures and system crashes
12
Write Ahead Log (WAL) protocol
● The WAL protocol records the old and new states induced by protected actions separately from the actual state changes.
● The logged changes are written to stable storage before the actual changes are written back to stable storage (that's the "Write Ahead" part).
● Transactions are committed by simply appending and writing a 'commit' record to the recovery log. Logged changes are used to undo protected actions of aborted transactions and of transactions in progress at the time of a system failure.
13
Write Ahead Log (WAL) protocol
● Log records are also used to redo committed actions whose actual changes have not been written back to stable storage at the time of a system failure.
● The WAL protocol allows changed data to be written to their stable storage home at any time after the log records describing the changes have been written into the stable log.
● This gives the Database Manager great flexibility in managing the contents of its volatile data buffer pools.
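A toy model may make the protocol concrete; everything here (names, record layout) is my own illustration, not Gray's code. Log records carrying the old and new state are appended before the page is touched, a single appended record commits a transaction, and recovery redoes only committed work:

```python
# Toy write-ahead log: log records are "written" before the data pages,
# so after a crash we can REDO committed work; uncommitted work that never
# reached stable storage simply disappears.
log = []                   # stable, append-only log
pages = {"A": 10}          # volatile buffer pool
stable_pages = {"A": 10}   # "disk" copy, possibly stale

def update(txn, page, new):
    log.append(("update", txn, page, pages[page], new))  # old & new state first
    pages[page] = new                                    # then the actual change

def commit(txn):
    log.append(("commit", txn))  # one appended record commits the transaction

# T1 commits; T2 is still in progress when the system "crashes".
update("T1", "A", 20); commit("T1")
update("T2", "A", 99)

# Crash: the buffer pool is lost; only stable_pages and the log survive.
pages = dict(stable_pages)

committed = {rec[1] for rec in log if rec[0] == "commit"}
for rec in log:                                  # REDO committed updates only
    if rec[0] == "update" and rec[1] in committed:
        pages[rec[2]] = rec[4]                   # reinstall the logged new state
```

T1's change survives the crash; T2's never takes effect.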
14
ACID Properties: First Definition
● Atomicity: A transaction's changes to the state are atomic: either all happen or none happen. These changes include database changes, messages, and actions on transducers.
● Consistency: A transaction is a correct transformation of the state. The actions taken as a group do not violate any of the integrity constraints associated with the state. This requires that the transaction be a correct program.
● Isolation: Even though transactions execute concurrently, it appears to each transaction T, that others executed either before T or after T, but not both.
● Durability: Once a transaction completes successfully (commits), its changes to the state survive failures.
15
[Gray 1993] Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann, San Mateo, CA (1993).
16
In 1985, Jim and a number of other senior leaders in the field of transaction processing started the HPTS (High Performance Transaction Systems) Workshop [HPTS]. This is a biennial gathering of folks interested in transaction systems (and things related to scalable systems). It includes people from competing companies in industry and also from academia. Over the last 22 years, it has evolved to include many different topics as high-end computing morphed from the mainframe to the Internet.
17
The early years …
● Born January 12, 1944
● 1961 graduated from Westmoor High School in San Francisco.
● 1966 graduated from the University of California at Berkeley with a bachelor's degree in mathematics and engineering.
18
James Nicholas Gray was born in San Francisco, California on 12 January 1944.
● In 1961 Gray graduated from Westmoor High School in San Francisco.
● He graduated from the University of California at Berkeley with a bachelor's degree in mathematics and engineering in 1966.
● After spending a year in New Jersey working at Bell Laboratories in Murray Hill and attending classes at the Courant Institute in New York City, he returned to Berkeley and enrolled in the newly-formed computer science department, earning a Ph.D. in 1969 for work on context-free grammars and formal language theory.
19
5-minute rule for Memory vs. Disk Access (1987)
When does it make economic sense to hold pages in memory versus doing IO every time data from the page is accessed?
THE FIVE MINUTE RULE
Pages referenced every five minutes should be memory resident.
20
From the 1987 Tandem report by Jim Gray and Gianfranco Putzolu:
● The argument goes as follows: A Tandem disc, and half a controller comfortably deliver 15 accesses per second and are priced at 15K$ for a small disc and 20K$ for a large disc (180Mb and 540Mb respectively).
● So the price per access per second is about 1K$. The extra CPU and channel cost for supporting a disc is 1K$/a/s. So one disc access per second costs about 2K$ on a Tandem system.
● A megabyte of Tandem main memory costs 5K$, so a kilobyte costs 5$.
21
● If making a 1KB record resident saves 1 a/s, then it saves about 2K$ worth of disc accesses at a cost of 5$, a good deal. If it saves 0.1 a/s then it saves about 200$, still a good deal. Continuing this, the break-even point is an access every 2000/5 = 400 seconds.
● So, any 1KB record accessed more frequently than every 400 seconds should live in main memory. 400 seconds is "about" 5 minutes, hence the name: the Five Minute Rule.
22
5-minute rule
● The five-minute rule is based on the tradeoff between the cost of RAM and the cost of disk accesses.

BreakEvenInterval (seconds) = (PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM)
23
5-minute rule
● The five-minute rule is based on the tradeoff between the cost of RAM and the cost of disk accesses.

BreakEvenInterval = (PagesPerMBofRAM / AccessesPerSecondPerDisk) × (PricePerDiskDrive / PricePerMBofRAM)
                  = Technology Ratio × Economic Ratio
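Plugging the Tandem numbers quoted on the earlier slides into this trade-off reproduces the 400-second break-even point. A small sketch; the variable names are mine:

```python
# The 1987 Tandem numbers from the slides, plugged into the rule:
# break-even = (cost of one disk access per second) / (cost of keeping the page in RAM)
drive_price = 15_000          # $ for a small disc plus half a controller
accesses_per_sec = 15         # accesses/second the disc comfortably delivers
cpu_channel_per_aps = 1_000   # $ of extra CPU + channel per access/second
ram_price_per_mb = 5_000      # $ per MB of Tandem main memory (so ~5 $/KB)
page_kb = 1                   # 1KB record

cost_per_aps = drive_price / accesses_per_sec + cpu_channel_per_aps  # ~2000 $
cost_of_page_in_ram = ram_price_per_mb / 1000 * page_kb              # ~5 $

break_even_seconds = cost_per_aps / cost_of_page_in_ram
```

400 seconds is "about" 5 minutes, hence the name.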
24
1997 – Ten years later
25
New Storage Metrics: Kaps, Maps, SCAN
● Kaps: how many kilobyte objects served per second
– the file server / transaction processing metric – this is the OLD metric
● Maps: how many megabyte objects served per second
– the multimedia metric
● SCAN: how long to scan all the data
– the data mining and utility metric
● And: Kaps/$, Maps/$, TBscan/$
26
Disk Changes
● Disks got cheaper: 20K$ -> 1K$ (or even 200$)
– $/Kaps etc. improved 100x (Moore's law!) (or even 500x)
– a one-time event (went from mainframe prices to PC prices)
● Disk data got cooler (10x per decade):
– 1990 disk ~ 1GB, 50 Kaps, 5-minute scan
– 2000 disk ~ 70GB, 120 Kaps, 45-minute scan
● So:
– 1990: 1 Kaps per 20 MB
– 2000: 1 Kaps per 500 MB
– disk scans take longer (10x per decade)
● Backup/restore takes a long time (too long)
27
Storage Ratios Changed
● 10x better access time
● 10x more bandwidth
● 100x more capacity
● Data 25x cooler (1 Kaps/20MB vs 1 Kaps/500MB)
● 4,000x lower media price
● 20x to 100x lower disk price
● Scan takes 10x longer (3 min vs 45 min)
● DRAM/disk media price ratio changed:
– 1970-1990: 100:1
– 1990-1995: 10:1
– 1995-1997: 50:1
– today: ~0.03$/MB disk vs 3$/MB DRAM, i.e. 100:1
28
The Five Minute Rule
● Trade DRAM for disk accesses
● Cost of an access: DriveCost / Accesses_per_second
● Cost of a DRAM page: ($/MB) / pages_per_MB
● Break-even has two terms: a technology term and an economic term
● Grew page size to compensate for changing ratios
● Still at 5 minutes for random access, 1 minute for sequential
From his presentations in 2000
29
Data on Disk Can Move to RAM in 10 years
(Chart: "Storage Price vs Time", megabytes per kilo-dollar on a log scale from 0.1 to 10,000, plotted 1980–2000. Disk MB/k$ improved about 100:1 per 10 years, so data on disk can move to RAM in about 10 years.)
30
Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
(Two charts over the hierarchy Cache, Main, Secondary, Disc, Online Tape, Nearline Tape, Offline Tape:
– "Size vs Speed": typical system size in bytes, 10^3 to 10^15, against access time from 10^-9 to 10^3 seconds.
– "Price vs Speed": $/MB, 10^-6 to 10^2, against the same access-time range.)
31
5-minute rule holds in 1997
● In summary, the five-minute rule still seems to apply to randomly accessed pages, primarily because page sizes have grown from 1KB to 8KB to compensate for changing technology ratios.
32
Storage Latency: How Far Away is the Data?
(Slide shows access time in clock ticks, with an analogy for how far away the data "feels":
– Registers: 1 (my head, 1 min)
– On-chip cache: 2 (this room)
– On-board cache: 10 (this hotel, 10 min)
– Memory: 100 (Olympia, 1.5 hr)
– Disk: 10^6 (Pluto, 2 years)
– Tape / optical robot: 10^9 (Andromeda, 2,000 years))
From Jim Gray’s Rules of Thumb in Data Engineering Presentation
33
What's a Terabyte?
● 1 terabyte:
– 1,000,000,000 business letters (150 miles of bookshelf)
– 100,000,000 book pages (15 miles of bookshelf)
– 50,000,000 FAX images (7 miles of bookshelf)
– 10,000,000 TV pictures (MPEG) (10 days of video)
– 4,000 LandSat images (16 earth images at 100m)
– 100,000,000 web pages (10 copies of the web HTML)
● Library of Congress (in ASCII) is 25 TB
– 1980: $200 million of disc (10,000 discs); $5 million of tape silo (10,000 tapes)
– 1997: 200 K$ of magnetic disc (48 discs); 30 K$ of nearline tape (20 tapes)
Terror Byte!
Jim Gray’s presentations 1995
34
How Much Information Is There?
● Soon everything can be recorded and indexed
● Most data will never be seen by humans
● Precious resource: human attention
– auto-summarization and auto-search are the key technologies
http://www.lesk.com/mlesk/ksg97/ksg.html
(Figure: scale of bytes, with examples: Kilo – a book; Mega – a photo; Giga – a movie; Tera – all LoC books (words); Peta – all books, multimedia; Exa – everything recorded!; Zetta; Yotta.
The small prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli.)
35
2007: Twenty Years Later
36
The 5-minute rule holds in 2007
● The old five-minute rule for RAM and disk now applies to 64KB page sizes (334 seconds).
– Five minutes had been the approximate break-even interval for 1KB pages in 1987 and for 8KB pages in 1997.
● The five-minute break-even interval also applies to RAM and the expensive flash memory of 2007 for page sizes of 64KB and above (365 seconds and 339 seconds).
– As the price premium for flash memory decreases, so does the break-even interval (146 seconds and 136 seconds).
37
Flash memory falls between traditional RAM and persistent mass storage based on rotating disks in terms of acquisition cost, access latency, transfer bandwidth, spatial density, power consumption, and cooling costs.
38
20 years out:Summary and Conclusion
● The 20-year-old five-minute rule for RAM and disks still holds, but for ever-larger disk pages.
● It should be augmented by two new five-minute rules:
– for small pages moving between RAM and flash memory, and
– for large pages moving between flash memory and traditional disks.
● For small pages moving between RAM and disk, Gray and Putzolu were amazingly accurate in predicting a five-hour break-even point 20 years into the future.
39
40
41
Data Cube
42
Aggregates in SQL
● The SQL standard [Melton, Simon] provides five aggregate functions: COUNT, SUM, MIN, MAX, AVG
SELECT [DISTINCT] AVG(Temp)
FROM Weather;
● Aggregate functions return a single value. In addition, SQL allows aggregation over distinct values.
● Using GROUP BY, SQL can create a table of aggregate values indexed by a set of attributes.
SELECT Time, Altitude, AVG(Temp)
FROM Weather
GROUP BY Time, Altitude;
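These two queries can be run as-is against an in-memory SQLite database; the sample rows below are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Weather(Time TEXT, Altitude INT, Temp REAL)")
con.executemany("INSERT INTO Weather VALUES(?,?,?)", [
    ("06:00", 0, 10.0), ("06:00", 0, 12.0),
    ("06:00", 1000, 4.0), ("12:00", 0, 20.0),
])

# Scalar aggregate: one value for the whole table.
(avg_all,) = con.execute("SELECT AVG(Temp) FROM Weather").fetchone()

# GROUP BY: a table of aggregates indexed by (Time, Altitude).
rows = con.execute(
    "SELECT Time, Altitude, AVG(Temp) FROM Weather GROUP BY Time, Altitude"
).fetchall()
```

The scalar form collapses all four rows to one average; the grouped form produces one row per distinct (Time, Altitude) pair.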
(Figure: SUM() applied to a whole table yields a single value; applied with GROUP BY on an attribute with values A, B, C, D it yields one SUM() per group.)
43
Problems With This Design
● Users want histograms
● Users want sub-totals and totals
– drill-down & roll-up reports
● Users want cross-tabs
● Conventional wisdom:
– these are not relational operators
– they are in many report writers and query engines
(Figure: a cross-tab of expenses – AIR, HOTEL, FOOD, MISC – by day of week, M through Su, with row and column sums produced by aggregate functions F(), G(), H().)
44
Other Variants – Illustra
● init(&handle): allocates the handle and initializes the aggregate computation.
● iter(&handle, value): aggregates the next value into the current aggregate.
● value = final(&handle): computes and returns the resulting aggregate using data saved in the handle. This invocation deallocates the handle.
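Python's sqlite3 exposes essentially this same protocol for user-defined aggregates: the constructor plays the role of init, step() of iter, and finalize() of final. A sketch with a hypothetical geometric-mean aggregate (my example, not Illustra's):

```python
import math
import sqlite3

class GeometricMean:
    """User-defined aggregate in the init/iter/final style."""
    def __init__(self):               # init(&handle): allocate state
        self.log_sum, self.n = 0.0, 0
    def step(self, value):            # iter(&handle, value): fold in one value
        self.log_sum += math.log(value)
        self.n += 1
    def finalize(self):               # value = final(&handle): produce result
        return math.exp(self.log_sum / self.n) if self.n else None

con = sqlite3.connect(":memory:")
con.create_aggregate("gmean", 1, GeometricMean)
con.execute("CREATE TABLE t(x REAL)")
con.executemany("INSERT INTO t VALUES(?)", [(2.0,), (8.0,)])
(g,) = con.execute("SELECT gmean(x) FROM t").fetchone()
```

The engine allocates one handle per group, calls step once per row, and finalize once at the end, exactly the contract the slide describes.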
45
DATA CUBE and ROLLUP
SELECT Model, Year, Color,
       SUM(Sales) AS total,
       SUM(Sales) / total(ALL, ALL, ALL)
FROM Sales
WHERE Model IN ('Ford', 'Chevy') AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
(Figure: "The Data Cube and The Sub-Space Aggregates" for Model (Chevy, Ford) × Year (1990–1993) × Color (red, white, blue). A plain aggregate gives one sum; GROUP BY (with total) gives sums by color; a cross-tab gives sums by make and color; the full 3D cube adds by make, by year, by make & year, by color & year, and the grand total.)
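SQLite has no CUBE operator, but the cube's semantics can be spelled out directly: a UNION of GROUP BYs over every subset of the grouping columns, with ALL (here NULL) filling the collapsed columns. The Sales rows below are invented for illustration:

```python
import itertools
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Sales(Model TEXT, Year INT, Units INT)")
con.executemany("INSERT INTO Sales VALUES(?,?,?)", [
    ("Ford", 1990, 3), ("Ford", 1991, 5), ("Chevy", 1990, 2),
])

dims = ["Model", "Year"]
cube = []
# One GROUP BY per subset of the grouping columns, largest first.
for r in range(len(dims), -1, -1):
    for keep in itertools.combinations(dims, r):
        select = ", ".join(c if c in keep else f"NULL AS {c}" for c in dims)
        group = (" GROUP BY " + ", ".join(keep)) if keep else ""
        cube += con.execute(
            f"SELECT {select}, SUM(Units) FROM Sales{group}").fetchall()
```

With two dimensions this emits 2^2 grouping sets: the finest-grain groups, the two one-dimensional roll-ups, and the grand total.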
46
47
48
A Dozen Information Technology Research Goals
1. Scalability: Devise a software and hardware architecture that scales up by a factor of 10^6. That is, an application's storage and processing capacity can automatically grow by a factor of a million, doing jobs faster (10^6× speed-up) or doing larger jobs in the same time (10^6× scale-up), just by adding more resources.
2. The Turing Test: Build a computer system that wins the imitation game at least 30% of the time.
3. Speech to text: Hear as well as a native speaker.
4. Text to speech: Speak as well as a native speaker.
5. See as well as a person: Recognize objects and motion.
49
A Dozen Information Technology Research Goals
6. Personal Memex: Record everything a person sees and hears, and quickly retrieve any item on request.
7. World Memex: Build a system that given a text corpus, can answer questions about and summarize the text as precisely and quickly as a human expert in that field. Do the same for music, images, art and cinema.
8. Telepresence: Simulate being some other place retrospectively as an observer (TeleObserver): hear and see as well as actually being there, and as well as a participant. Simulate being some other place as a participant (TelePresent): interacting with others and with the environment as though you are actually there.
50
A Dozen Information Technology Research Goals
9. Trouble-Free Systems: Build a system used by millions of people each day and yet administered and managed by a single part-time person.
10. Secure System: Assure that the system of problem 9 serves only authorized users, that service cannot be denied by unauthorized users, and that information cannot be stolen (and prove it).
11. Always Up: Assure that the system is unavailable for less than one second per hundred years – eight 9s of availability (and prove it).
51
A Dozen Information Technology Research Goals
12. Automatic Programmer: Devise a specification language or user interface that:
– makes it easy for people to express designs (1,000x easier),
– computers can compile, and
– can describe all applications (is complete).
The system should reason about the application, asking questions about exception cases and incomplete specifications. But it should not be onerous to use.
52
Computer Industry Laws (Rules of thumb)
● Metcalfe's law
● Moore's first law
● Bell's computer classes (7 price tiers)
● Bell's platform evolution
● Bell's platform economics
● Bill's law
● Software economics
● Grove's law
● Moore's second law
● Is info-demand infinite?
● The death of Grosch's law
53
Gordon Bell’s Seven Price Tiers
10$: wrist-watch computers
100$: pocket/palm computers
1,000$: portable computers
10,000$: personal computers (desktop)
100,000$: departmental computers (closet)
1,000,000$: site computers (glass house)
10,000,000$: regional computers (glass castle)

Super server: costs more than $100,000
"Mainframe": costs more than $1 million
Must be an array of processors, disks, tapes, comm ports
54
Information at your fingertips.
Bill Gates is known for his long-standing belief that, as he once put it, "any piece of information you want should be available to you" – putting information at your fingertips.
Gates championed it as early as 1989, and he was in a position to do something about it. It remained his overriding goal for the next two decades.
55
Federation
The Vision: Global Data Federation
● Massive datasets live near their owners:
– near the instrument's software pipeline
– near the applications
– near data knowledge and curation
● Each archive publishes a (web) service
– schema: documents the data
– methods on objects (queries)
● Scientists get "personalized" extracts
● Uniform access to multiple archives
– a common global schema
56
Gray and Bell worked closely together at Digital and, from 1994, at Microsoft's Bay Area Research Center.
● MyLifeBits
● TerraServer
57
Gordon Bell’s: MyLifeBits
● MyLifeBits is a lifetime store of everything. It is the fulfillment of Vannevar Bush's 1945 Memex vision, including full-text search, text and audio annotations, and hyperlinks.
● The experiment: Gordon Bell has captured a lifetime's worth of articles, books, cards, CDs, letters, memos, papers, photos, pictures, presentations, home movies, videotaped lectures, and voice recordings and stored them digitally. He is now paperless, and is beginning to capture phone calls, IM transcripts, television, and radio.
58
59
TerraServer
In late spring of 1996, Paul Flessner, the General Manager of the SQL Server team, asked our lab to build a database application that would test and demonstrate the scalability of the next release of SQL Server, code-named "Sphinx".
One of Jim's greatest abilities was to clearly define and articulate the problem. The SQL team gave us two goals:
1. Test SQL's ability to scale up to support a database of one terabyte or larger.
2. An internet application where SQL marketing could demonstrate Windows and SQL Server's scalability.
60
About moving research to production
“ideas don’t transfer, people transfer…”
61
TerraServer Requirements
● BIG – 1 TB of data including catalog, temporary space, etc.
● PUBLIC – available on the world wide web
● INTERESTING – to a wide audience
● ACCESSIBLE – using standard browsers (IE, Netscape)
● REAL – a LOB application (users can buy imagery)
● FREE – cannot require an NDA or money from a user for access
● FAST – usable at low speed (56 kbps) and high speed (T-1+)
● EASY – we do not want a large group to develop, deploy, or maintain the application
● CHEAP – an unwritten requirement, (1) because TerraServer was only a prototype, test, and free demonstration; and (2) Jim Gray was a very frugal person!
62
An Interesting Internet Server
– United States Geological Survey (USGS)
– SOVINFORMSPUTNIK (the Russian Space Agency) and Aerial Images
http://msdn.microsoft.com/en-us/library/aa226316(v=sql.70).aspx
63
Thesis: Scaleable Servers
● Scaleable servers
– commodity hardware allows new applications
– new applications need huge servers
– clients and servers are built of the same "stuff": commodity software and commodity hardware
● Servers should be able to:
– scale up (grow a node by adding CPUs, disks, networks)
– scale out (grow by adding nodes)
– scale down (can start small)
● Key software technologies: objects, transactions, clusters, parallelism
65
Scaleable Servers: BOTH SMP and Cluster
● Grow up with SMP; 4xP6 is now standard
● Grow out with cluster
● Cluster has inexpensive parts
(Figure: spectrum from personal system to departmental server to SMP super server to cluster of PCs.)
66
SMPs Have Advantages
● Single system image: easier to manage, easier to program; threads in shared memory, disk, net
● 4x SMP is commodity
● Software capable of 16x
● Problems:
– >4x not commodity
– scale-down problem (starter systems expensive)
● There is a BIGGEST one
(Figure: personal system, departmental server, SMP super server.)
67
Grow UP and OUT
● 1 billion transactions per day
● 1 Terabyte DB
● Cluster: a collection of nodes, as easy to program and manage as a single node
(Figure: growth path from personal system to departmental server to SMP super server.)
68
Clusters Have Advantages
● Clients and servers made from the same stuff
● Inexpensive: built with commodity components
● Fault tolerance: spare modules mask failures
● Modular growth: grow by adding small modules
● Unlimited growth: no biggest one
69
Windows NT Clusters
● Microsoft & 60 vendors defining NT clusters
– almost all big hardware and software vendors involved
● No special hardware needed, but it may help
● Fault-tolerant first, scaleable second
– Microsoft, Oracle, SAP giving demos today
● Enables:
– commodity fault-tolerance
– commodity parallelism (data mining, virtual reality…)
– also great for workgroups!
70
Parallelism: The OTHER Aspect of Clusters
● Clusters of machines allow two kinds of parallelism:
– many little jobs: online transaction processing (TPC-A, B, C…)
– a few big jobs: data search and analysis (TPC-D, DSS, OLAP)
● Both give automatic parallelism
71
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Kinds of Parallel Execution
(Figure: two forms. Pipeline parallelism: any sequential program feeds its output to the next sequential program. Partition parallelism: inputs split N ways and outputs merge M ways across copies of the same sequential program.)
72
Data Rivers: Split + Merge Streams
(Figure: N producers feed a "river" of N × M data streams, drained by M consumers.)
● Producers add records to the river; consumers consume records from the river.
● Purely sequential programming: the river does flow control and buffering, and does the partition and merge of data records.
● River = Split/Merge in Gamma = Exchange operator in Volcano.
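A minimal sketch of the river idea, with threads standing in for cluster nodes and a bounded queue providing the flow control and buffering (all names are mine):

```python
import queue
import threading

river = queue.Queue(maxsize=8)   # bounded -> back-pressure / flow control
N_PRODUCERS, M_CONSUMERS = 3, 2
consumed = []
lock = threading.Lock()

def producer(pid):
    for i in range(5):
        river.put((pid, i))      # add records to the river

def consumer():
    while True:
        rec = river.get()
        if rec is None:          # poison pill: the river has run dry
            break
        with lock:
            consumed.append(rec)

ps = [threading.Thread(target=producer, args=(p,)) for p in range(N_PRODUCERS)]
cs = [threading.Thread(target=consumer) for _ in range(M_CONSUMERS)]
for t in ps + cs:
    t.start()
for t in ps:
    t.join()
for _ in cs:
    river.put(None)              # one pill per consumer
for t in cs:
    t.join()
```

Each producer and consumer is a purely sequential program; only the river knows there are N writers and M readers.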
73
Partitioned Execution
(Figure: a table partitioned by key range A…E, F…J, K…N, O…S, T…Z; a Count operator runs on each partition and the partial counts merge into a single Count.)
● Spreads computation and IO among processors
● Partitioned data gives NATURAL parallelism
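The partitioned Count in the figure can be sketched in a few lines; the name table and key ranges are illustrative:

```python
# Partitioned execution sketch: split a "table" by key range (A..E, F..J, ...),
# run the same sequential Count operator on each partition, then merge.
names = ["Adams", "Garcia", "Kim", "Olsen", "Tanaka", "Baker", "Lopez"]
ranges = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]

# Partition step: route each row to the partition owning its key range.
partitions = {r: [n for n in names if r[0] <= n[0] <= r[1]] for r in ranges}

# Parallel step: each partition counts independently (here, sequentially).
partial_counts = {r: len(rows) for r, rows in partitions.items()}

# Merge step: combine the partial counts into the final answer.
total = sum(partial_counts.values())
```

In a real system each partition's count would run on its own processor; the merge reproduces exactly the answer a single sequential Count would give.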
74
N x M way Parallelism
(Figure: partitioned and pipelined data flows: five Sort-then-Join pipelines over partitions A…E, F…J, K…N, O…S, T…Z feed three Merge operators. N inputs, M outputs, no bottlenecks.)
75
Year 2000 4B Machine
The Year 2000 commodity PC:
● a billion instructions/sec
● .1 billion bytes RAM
● a billion bits/s net (LAN/WAN)
● 10 billion bytes disk
● a billion-pixel display (3000 x 3000 x 24)
● 1,000 $
(Figure: the 4B machine as a box with a 1 Bips processor, .1 B bytes RAM, a 10 GB disk, and a 1 Bbits/sec LAN/WAN link.)
Jim Gray & Gordon Bell: 1997 presentations
76
Super Server: 4T Machine
● Array of 1,000 4B machines:
– 1 Bips processors
– 1 BB DRAM
– 10 BB disks
– 1 Bbps comm lines
– 1 TB tape robot
● A few megabucks
● Challenge: manageability, programmability, security, availability, scaleability, affordability
● As easy as a single system
● Future servers are CLUSTERS of processors and discs; distributed database techniques make clusters work
(Figure: a "Cyber Brick", a 4B machine: CPU, 5 GB RAM, 50 GB disc.)
Jim Gray & Gordon Bell: 1997 presentations
77
Jim Gray’s quest for real problems and real data … led to a collaboration with Astronomers.
Alex Szalay
Why Astronomy Data?
● It has no commercial value
– no privacy concerns
– can freely share results with others
– great for experimenting with algorithms
● It is real and well documented
– high-dimensional data (with confidence intervals)
– spatial data
– temporal data
● Many different instruments from many different places and many different times
● Federation is a goal
● There is a lot of it (petabytes)
78
The availability and ability to handle very large volumes of storage and complex computing is redefining how we do Science.
79
First Paradigm: For thousands of years, science was about empirically describing natural phenomena.
Galileo and his telescope
80
Second Paradigm: Theoretical science, using models and generalization
Newton
Kepler
Maxwell
81
Third Paradigm: Computational science, simulating complex phenomena
Over the last 25 years, scientists have used computer simulation to validate theories. (Image: a hurricane computer simulation.)
82
Fourth Paradigm: Data-Intensive Science
The scientific method was traditionally driven by hypothesis: scientists first make a prediction, then collect experimental data to validate it against that prediction. In the new data-driven approach, researchers start by collecting data and analyze it later.
83
Scientists are collecting data. How do we codify data and extract insights and knowledge?
(Figure: experiments and instruments, simulations, literature, and other archives all feed a question-and-answer loop.)
Astronomy
● Help build the world-wide telescope
– all astronomy data and literature online and cross-indexed
– tools to analyze the data
● Built SkyServer.SDSS.org
● Built the analysis system
– MyDB
– CasJobs (batch jobs)
● Results:
– it works and is used every day
– spatial extensions in SQL 2005
– a good example of a Data Grid
– good examples of web services
World Wide Telescope / Virtual Observatory
http://www.us-vo.org/ http://www.ivoa.net/
● Premise: most data is (or could be) online
● So, the Internet is the world's best telescope:
– it has data on every part of the sky
– in every measured spectral band: optical, x-ray, radio…
– as deep as the best instruments (2 years ago)
– it is up when you are up; the "seeing" is always great (no working at night, no clouds, no moons, no…)
– it's a smart telescope: links objects and data to the literature on them
SkyServer.SDSS.org
● A modern archive
– access to the Sloan Digital Sky Survey spectroscopic and optical surveys
– raw pixel data lives in file servers
– catalog data (derived objects) lives in a database
– online query to any and all
● Also used for education
– 150 hours of online astronomy
– implicitly teaches data analysis
● Interesting things
– spatial data search
– client query interface via Java applet
– query from Emacs, Python, …
– cloned by other surveys (a template design)
– web services are the core of it
SkyServer (SkyServer.SDSS.org)
● Like the TerraServer, but looking the other way: a picture of ¼ of the universe
● Sloan Digital Sky Survey data: pixels + data mining
● About 400 attributes per "object"
● Spectrograms for 1% of objects
88
SkyQuery (http://skyquery.net/)
● Distributed query tool using a set of web services
● Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)
● Has grown from 4 to 15 archives, now becoming an international standard
● Web service poster child
● Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o, t) < 3.5
  AND AREA(181.3, -0.76, 6.5)
  AND o.type = 3 AND (o.I - t.m_j) > 2
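For illustration only, here is a toy cross-match in the spirit of XMATCH: pair objects from two catalogs whose angular separation is below a threshold in arcseconds. The positions, object names, and haversine-based separation are my assumptions, not SkyQuery's implementation:

```python
import math

def ang_sep_arcsec(ra1, dec1, ra2, dec2):
    """Angular separation between two (RA, Dec) points in degrees, as arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # haversine formula on the celestial sphere
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a))) * 3600

# Made-up mini-catalogs: (name, RA, Dec) in degrees.
sdss = [("o1", 181.3000, -0.7600), ("o2", 181.4000, -0.7000)]
twomass = [("t1", 181.3001, -0.7601), ("t2", 182.0000, -0.5000)]

# XMATCH(o, t) < 3.5: keep pairs closer than 3.5 arcseconds.
matches = [(o, t) for o, ora, odec in sdss
                  for t, tra, tdec in twomass
                  if ang_sep_arcsec(ora, odec, tra, tdec) < 3.5]
```

A real federated cross-match pushes this predicate out to the archives' web services instead of joining locally, but the spatial test is the same idea.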
SkyServer/SkyQuery Evolution: MyDB and Batch Jobs
● Problem: need multi-step data analysis (not just a single query).
  Solution: allow personal databases on the portal.
● Problem: some queries are monsters.
  Solution: "batch schedule" on the portal; the answer is deposited in the personal database.
Ecosystem Sensor Net: LifeUnderYourFeet.Org
● Small sensor net monitoring soil
● Sensors feed a database
● Helping build a system to collect & organize the data
● Working on data analysis tools
● Prototype for other LIMS (Laboratory Information Management Systems)
RNA Structural Genomics
● Goal: predict secondary and tertiary structure from sequence; deduce the tree of life.
● Technique: analyze sequence variations sharing a common structure across the tree of life
● Representing structurally aligned sequences is a key challenge
● Creating a database-driven alignment workbench accessing public and private sequence data
VHA Health Informatics
● VHA: the largest standardized electronic medical records system in the US
● Design, populate and tune a ~20 TB data warehouse and analytics environment
● Evaluate population health and treatment outcomes
● Support epidemiological studies
– 7 million enrollees
– 5 million patients
● Example milestones:
– 1 billionth vital sign loaded in April '06
– 30 minutes to a population-wide obesity analysis (next slide)
– discovered seasonality in blood pressure (NEJM, fall '06)
(Table: VHA patient counts by height, 5ft 0in to 6ft 5in, and weight, 100 to 300 lb, color-coded by BMI category: BMI < 18 underweight, 18–24.9 healthy weight, 25–29.9 overweight, 30+ obese.)
VHA Patients in BMI Categories (based upon vitals from FY04)
HDR vitals-based Body Mass Index calculation on the VHA FY04 population. Source: VHA Corporate Data Warehouse. DRAFT.
Total patients: 3,249,156 (100%) – underweight 23,876 (0.7%), healthy weight 701,089 (21.6%), overweight 1,177,093 (36.2%), obese 1,347,098 (41.5%).
95
Jim Gray’s work on Fourth Paradigm and eScience has had a profound impact on the scientific community.
This work continues …
96
Jim Gray eScience Award
Each year, Microsoft Research presents the Jim Gray eScience Award to a researcher who has made an outstanding contribution to the field of data-intensive computing. The award recognizes innovators whose work truly makes science easier for scientists.
97
98
Jim Gray’s Legacy
● The Prolific Writer
– Jim Gray's two rules for authorship:
  • the person who types puts their name first, and
  • it's easier to add a name to the list of authors than to deal with someone's hurt feelings
● The Masterful Presenter
● The Sense of Community
● The Patient Listener
(Figure: Ideas, People, Community.)
99
Jim's Life was a Textbook on Mentoring
● Making time
● Simply listening
● Inspiring self-confidence
● Lighting the way
● Nurturing and pushing
● Following the muse
● Connecting good people and good ideas without boundaries
● Promoting the young
● Sharing knowledge selflessly
● Displaying professional integrity
● Advocating for the field
● Keeping things in perspective
● Being a friend
100
101
Lost at Sea … January 28, 2007
102
The Search for Jim Gray
103
The University of California, Berkeley and Gray's family hosted a tribute to him on May 31, 2008.
http://www.youtube.com/user/UCBerkeleyEvents/videos?query=jim+gray
104
105
Good references
● Microsoft Faculty Summit 2011
– http://research.microsoft.com/en-us/events/fs2011/
– Tony Hey's presentations at the event
– http://research.microsoft.com/en-us/events/fs2011/welcome_introduction_hey_faculitysummit_071811.pdf
● The Fourth Paradigm book
– http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf
● Jim Gray's work
– http://research.microsoft.com/en-us/um/people/gray/
● Alex Szalay's work on Large Databases and Science
– http://www.sdss.jhu.edu/~szalay/servers.html