Big Data and the Database Community

Daniel Abadi, Yale University

Post on 22-Feb-2016

Page 1: Big Data and the Database Community

Daniel Abadi, Yale University

Big Data and the Database Community

Page 2: Big Data and the Database Community

Big Data

* The Big Data phenomenon is the best thing that could have happened to the database community
* Despite other definitions related to the ‘3 Vs’ --- Big Data means BIG Data
  * Which means we need scalable database systems
* Still two main components of Big Data
  * Performing data analysis at scale
  * Performing requests on data at scale

Page 3: Big Data and the Database Community

Performing Data Analysis at Scale

* Database community has won the battle
* Some thought that MapReduce might replace traditional database technology as the primary means to perform analysis at scale
  * Just about every MapReduce vendor has abandoned this goal
* Hadapt, Impala, Tez, and several others are in a race to see who can add the most traditional database execution technology to Hadoop fastest
* Everyone is going in the direction of cost-based optimizers, traditional database operators, and push-based query execution
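
To make "push-based query execution" concrete, here is a minimal sketch, not taken from the talk or from any of the systems named above. In a push-based plan the scan drives the pipeline and hands each row to the operator downstream, rather than the consumer pulling rows from its child. All names (`make_filter`, `make_project`, `scan`, `run_query`) and the sample columns are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, int]
Sink = Callable[[Row], None]  # an operator is just "something rows are pushed into"


def make_filter(predicate: Callable[[Row], bool], downstream: Sink) -> Sink:
    """Return an operator that forwards only the rows matching the predicate."""
    def push(row: Row) -> None:
        if predicate(row):
            downstream(row)
    return push


def make_project(columns: List[str], downstream: Sink) -> Sink:
    """Return an operator that keeps only the listed columns."""
    def push(row: Row) -> None:
        downstream({c: row[c] for c in columns})
    return push


def scan(rows: Iterable[Row], downstream: Sink) -> None:
    """The source drives execution: it pushes every row into the pipeline."""
    for row in rows:
        downstream(row)


def run_query(rows: Iterable[Row]) -> List[Row]:
    """Roughly: SELECT id, amount FROM rows WHERE amount > 100 (illustrative only)."""
    results: List[Row] = []
    # Build the pipeline from the sink backwards: scan -> filter -> project -> results.
    pipeline = make_filter(
        lambda r: r["amount"] > 100,
        make_project(["id", "amount"], results.append),
    )
    scan(rows, pipeline)
    return results
```

Calling `run_query` on a list of dictionaries with `id` and `amount` keys returns the projected rows that pass the filter. A real engine would push batches of rows and compile the pipeline, which this sketch leaves out.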

Page 4: Big Data and the Database Community

Performing Requests on Data at Scale

* The database community is losing the battle
* NoSQL systems still have very little traditional database technology inside (despite adding SQL interfaces)
* No race to add DB technology --- why?
  * Don’t blame CAP --- CAP is only relevant when there’s a network partition
  * We never figured out how to do ACID and active replication at scale
  * Many new proposals make simplifying assumptions in order to handle scale
  * It’s been 30 years --- why can’t we build a distributed database that can handle distributed transactions over actively replicated data at scale?
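
As a rough illustration of what "active replication" asks for, the sketch below has every replica execute the same transactions in the same agreed order, so all copies reach the same state without shipping results between them. It is a toy, single-process model under strong assumptions: the order is already agreed, transactions are deterministic, and there is no network or failure. It is not the speaker's proposal, and the names (`Replica`, `replicate`, `run_demo`) and the account data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Transfer = Tuple[str, str, int]  # (source account, destination account, amount)


@dataclass
class Replica:
    balances: Dict[str, int] = field(default_factory=dict)

    def apply(self, txn: Transfer) -> bool:
        """Execute one transfer deterministically; return False if it aborts."""
        src, dst, amount = txn
        if self.balances.get(src, 0) < amount:
            return False  # insufficient funds: every replica aborts identically
        self.balances[src] = self.balances.get(src, 0) - amount
        self.balances[dst] = self.balances.get(dst, 0) + amount
        return True


def replicate(log: List[Transfer], replicas: List[Replica]) -> None:
    """Apply the same ordered log on every replica (active replication)."""
    for txn in log:
        outcomes = {replica.apply(txn) for replica in replicas}
        if len(outcomes) != 1:
            raise RuntimeError("replicas diverged")  # cannot happen if apply is deterministic


def run_demo() -> List[Dict[str, int]]:
    replicas = [Replica({"a": 100, "b": 0}) for _ in range(3)]
    log = [("a", "b", 30), ("a", "b", 100), ("b", "a", 10)]
    replicate(log, replicas)
    return [replica.balances for replica in replicas]
```

The hard part the slide points at is everything this sketch assumes away: agreeing on the order across machines, keeping execution deterministic, and doing so when a transaction touches data on several partitions.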