schema-on-read vs schema-on-write

Schema-on-Read vs Schema-on-Write
Amr Awadallah, CTO, Cloudera, Inc.

Upload: amr-awadallah

Post on 03-Dec-2014


DESCRIPTION

This is the first time I introduced the concept of Schema-on-Read vs Schema-on-Write to the public. It was at the Berkeley EECS RAD Lab retreat Open Mic Session on May 28th, 2009, in Santa Cruz, California.

TRANSCRIPT

Page 1: Schema-on-Read vs Schema-on-Write

Schema-on-Read vs Schema-on-Write

Amr Awadallah, CTO, Cloudera, Inc.

Page 2: Schema-on-Read vs Schema-on-Write

Schema-on-Read

Traditional data systems require users to create a schema before loading any data into the system. This lets such systems tightly control the placement of data at load time, which in turn enables them to answer interactive queries very quickly. However, it comes at the cost of agility. In this talk I will demonstrate Hadoop's schema-on-read capability. With this approach, data can start flowing into the system in its original form, and the schema is applied when the data is read (each user can apply their own "data-lens" to interpret the data). This allows for extreme agility when dealing with complex, evolving data structures.
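The "data-lens" idea can be illustrated with a minimal Python sketch. This is not Hadoop code: the log format, the sample lines, and the names `status_lens`, `request_lens`, and `read_with` are all assumptions made up for illustration. The point is only that the stored lines never change, while each reader applies a different schema at read time.

```python
# Raw lines are stored exactly as they arrived; no schema is enforced at load time.
# The log format below is hypothetical.
RAW_LINES = [
    "2009-05-28 10:00:01 GET /index.html 200",
    "2009-05-28 10:00:02 GET /missing 404",
    "2009-05-28 10:00:05 POST /login 200",
]


def status_lens(line):
    """One user's lens: only the trailing HTTP status code matters."""
    return {"status": int(line.split()[-1])}


def request_lens(line):
    """Another user's lens: date, method, and path; time and status are ignored."""
    date, _time, method, path, _status = line.split()
    return {"date": date, "method": method, "path": path}


def read_with(lens, lines):
    """Apply a schema (lens) while reading; the stored lines are left untouched."""
    return [lens(line) for line in lines]


# Two different interpretations of the same raw data.
errors = [r for r in read_with(status_lens, RAW_LINES) if r["status"] >= 400]
posts = [r["path"] for r in read_with(request_lens, RAW_LINES) if r["method"] == "POST"]
```

Neither lens had to exist when the lines were stored, which is where the agility comes from; the cost is that parsing is repeated on every read.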

Page 3: Schema-on-Read vs Schema-on-Write


Agility/Flexibility

Schema-on-Write (RDBMS):

• Prescriptive Data Modeling:
  • Create static DB schema
  • Transform data into RDBMS
  • Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)

Schema-on-Read (Hadoop):

• Descriptive Data Modeling:
  • Copy data in its native format
  • Create schema + parser
  • Query data in its native format (does ETL on the fly)
• New data can start flowing at any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
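The contrast between the two columns, and in particular the "appears retroactively" point, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not an RDBMS or Hadoop: the JSON events, the `region` field, and all function names are hypothetical.

```python
import json

# Hypothetical event stream; a new "region" field starts appearing in later events.
EVENTS = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": 5, "region": "EU"}',
]

# Schema-on-write: the schema is fixed before loading.
WRITE_SCHEMA = ("user", "clicks")


def load_schema_on_write(raw_events, schema):
    """Transform at load time; fields outside the schema are dropped, not stored."""
    table = []
    for raw in raw_events:
        record = json.loads(raw)
        table.append(tuple(record.get(col) for col in schema))
    return table


def load_schema_on_read(raw_events):
    """Copy the data in its native format; nothing is parsed yet."""
    return list(raw_events)


def query_schema_on_read(store, schema):
    """Parse at query time, so a newer schema also covers older stored data."""
    return [tuple(json.loads(raw).get(col) for col in schema) for raw in store]


table = load_schema_on_write(EVENTS, WRITE_SCHEMA)
store = load_schema_on_read(EVENTS)

# Later, someone extends the schema with the new field.
NEW_SCHEMA = ("user", "clicks", "region")

# The written table never captured "region", so it cannot be recovered from it.
# The raw store yields it retroactively once the schema describes it:
retro = query_schema_on_read(store, NEW_SCHEMA)
```

In this sketch the schema-on-write path silently drops the unknown field; a real RDBMS load would more likely reject the row or require an explicit column addition first, but either way the data does not reach the table until the schema is changed.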

Page 4: Schema-on-Read vs Schema-on-Write

Traditional Data Stack

[Diagram. Components: Instrumentation; Log Collection; File Server Farm; Grid Processing System (1st stage ETL); Extract-Transform-Load; Foundational Warehouse; Datamart Database; Business Intelligence Software (OLAP, etc.). Data volumes shown: 20TB/day and 200GB/day.]