Schema-on-Read vs Schema-on-Write
DESCRIPTION
This is the first time I introduced the concept of Schema-on-Read vs Schema-on-Write to the public. It was at the Berkeley EECS RAD Lab retreat Open Mic Session on May 28th, 2009, in Santa Cruz, California.
TRANSCRIPT
![Page 2: Schema-on-Read vs Schema-on-Write](https://reader035.vdocuments.mx/reader035/viewer/2022081123/547ebacdb4af9fb9478b456a/html5/thumbnails/2.jpg)
Schema-on-Read
Traditional data systems require users to create a schema before loading any data into the system. This lets such systems tightly control the placement of the data at load time, which enables them to answer interactive queries very fast. However, it also costs agility. In this talk I will demonstrate Hadoop's schema-on-read capability. With this approach, data can start flowing into the system in its original form, and the schema is applied when the data is parsed at read time (each user can apply their own "data-lens" to interpret the data). This allows for extreme agility when dealing with complex, evolving data structures.
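The "data-lens" idea can be sketched in a few lines. This is only an illustration in plain Python, not Hadoop code: the sample events, the JSON encoding, and the names `read_with_lens`, `latency_lens`, and `referrer_lens` are all assumptions made up for the example. The point it tries to show is that the stored lines never change, and each reader decides at read time which structure to impose on them.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"user": "alice", "action": "click", "ms": 120}',
    '{"user": "bob", "action": "view", "ms": 340, "referrer": "search"}',
    '{"user": "alice", "action": "view", "ms": 95}',
]


def read_with_lens(lines, lens):
    """Parse each raw line at read time and apply a caller-supplied lens."""
    for line in lines:
        yield lens(json.loads(line))


def latency_lens(record):
    """Lens for a reader who only cares about latency per user."""
    return {"user": record["user"], "ms": record["ms"]}


def referrer_lens(record):
    """Lens for a reader who cares about traffic sources.

    Records that predate the 'referrer' field still parse; they simply
    report 'unknown'.
    """
    return {
        "action": record["action"],
        "referrer": record.get("referrer", "unknown"),
    }


# Two readers, two schemas, one unchanged copy of the data.
latencies = list(read_with_lens(RAW_LINES, latency_lens))
referrers = list(read_with_lens(RAW_LINES, referrer_lens))
```

Neither lens required a load step or a change to the stored lines, which is the agility the abstract refers to. The cost is that parsing work is repeated on every read.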
![Page 3: Schema-on-Read vs Schema-on-Write](https://reader035.vdocuments.mx/reader035/viewer/2022081123/547ebacdb4af9fb9478b456a/html5/thumbnails/3.jpg)
Agility/Flexibility
Schema-on-Write (RDBMS):
• Prescriptive Data Modeling:
  • Create static DB schema
  • Transform data into RDBMS
  • Query data in RDBMS format
• New columns must be added explicitly before new data can propagate into the system.
• Good for Known Unknowns (Repetition)
Schema-on-Read (Hadoop):
• Descriptive Data Modeling:
  • Copy data in its native format
  • Create schema + parser
  • Query data in its native format (does ETL on the fly)
• New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
• Good for Unknown Unknowns (Exploration)
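The "new columns" contrast above can be made concrete with a small sketch. It assumes an in-memory SQLite table as a stand-in for the RDBMS and a list of JSON lines as a stand-in for files kept in their native format. The table layout, the `browser` field, and the `parser_v1`/`parser_v2` names are hypothetical.

```python
import json
import sqlite3

# Incoming events; 'browser' is a field the original schema did not anticipate.
EVENTS = [
    '{"username": "alice", "ms": 120, "browser": "firefox"}',
    '{"username": "bob", "ms": 340, "browser": "safari"}',
]

# --- Schema-on-write: the schema exists before any data is loaded. ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (username TEXT, ms INTEGER)")
for line in EVENTS:
    rec = json.loads(line)
    # The load step keeps only the columns the schema already knows about.
    db.execute("INSERT INTO events VALUES (?, ?)", (rec["username"], rec["ms"]))

# Adding the column later does not recover values dropped at load time.
db.execute("ALTER TABLE events ADD COLUMN browser TEXT")
written = db.execute(
    "SELECT username, browser FROM events ORDER BY username"
).fetchall()

# --- Schema-on-read: raw lines are kept; only the parser changes. ---
def parser_v1(line):
    """Original parser: knows nothing about 'browser'."""
    rec = json.loads(line)
    return (rec["username"], rec["ms"])


def parser_v2(line):
    """Updated parser: 'browser' now appears for old and new lines alike."""
    rec = json.loads(line)
    return (rec["username"], rec["ms"], rec.get("browser"))


read_v1 = [parser_v1(line) for line in EVENTS]
read_v2 = [parser_v2(line) for line in EVENTS]
```

In this sketch the rows loaded before the `ALTER TABLE` carry `NULL` for `browser`, while the same historical lines read through `parser_v2` expose the field retroactively. A real warehouse could reload from source to backfill the column, so the difference is one of effort and agility, not of what is possible in principle.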
![Page 4: Schema-on-Read vs Schema-on-Write](https://reader035.vdocuments.mx/reader035/viewer/2022081123/547ebacdb4af9fb9478b456a/html5/thumbnails/4.jpg)
Traditional Data Stack
[Diagram; labels in the order extracted: Foundational Warehouse; Grid Processing System (1st stage ETL); Instrumentation; Log Collection; Extract-Transform-Load; Datamart Database; Business Intelligence Software (OLAP, etc.); File Server Farm. Data volumes shown: 20TB/day and 200GB/day.]