Introduction to Big Data
Amir H. Payberah
Swedish Institute of Computer Science
[email protected]
April 8, 2014
Amir H. Payberah (SICS) Introduction April 8, 2014 1 / 36
Data are not much use without human intuition ...
Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.
- Clifford Stoll
... analyzing data gives power.
Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.
- Geoffrey Moore
Analyzing data is worth the cost ...
The price of light is less than the cost of darkness.
- Arthur C. Nielsen
..., but there are problems with relying on data too much.
Not everything that can be counted counts, and not everything that counts can be counted.
- Albert Einstein
Data is a treasure ..., except when it is not.
Getting information off the Internet is like taking a drink from a fire hose.
- Mitchell Kapor
However, any data is better than none.
An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.
- John Tukey
Big Data, the Noam Chomsky Way
Big data is a step forward. But, our problems are not lack of access to data,
but understanding them. [Big data] is very useful if I want to find out something
without going to the library, but I have to understand it, and that’s the problem.
Hmmm, not very much Chomsky-ish ..., but wait!
We can be confident that any system of power - whether it’s the state, Google, or
whatever - is going to use the best available technology to control, to dominate,
and to maximize their power. And they’ll want to do it in secret.
Now that’s sounding more like Chomsky.
They Want to Do It In Secret ...
The truth cannot stay hidden forever!
A Brief History of Data Management!
4000 B.C.
- Manual recording
- From tablets to papyrus, to parchment, and then to paper
1450
- Gutenberg’s printing press
1800’s - 1940’s
- Punched cards (no fault-tolerance)
- Binary data
- 1890: US census
- 1911: IBM founded (as the Computing-Tabulating-Recording Company)
1940’s - 1970’s
- Magnetic tapes
- Batch transaction processing
- File-oriented record processing model (e.g., COBOL)
- Hierarchical DBMS (one-to-many)
- Network DBMS (many-to-many)
1980’s
- Relational DBMS (tables) and SQL
- ACID
- Client-server computing
- Parallel processing
1990’s - 2000’s
- The Internet...
2010’s
- NoSQL: BASE instead of ACID
- Big Data
Big Data
- In recent years we have witnessed a dramatic increase in available data.
- For example, the number of web pages indexed by Google, which was around one million in 1998, exceeded one trillion in 2008, and this growth has been accelerated by the appearance of social networks.
Big Data Definition
- Big Data refers to datasets and data flows large enough to have outpaced our capability to store, process, analyze, and understand them.
The Four Dimensions of Big Data
- Volume: data size
- Velocity: data generation rate
- Variety: data heterogeneity
- Veracity: uncertainty of accuracy and authenticity of data
Big Data Market Driving Factors
- Mobile devices
- Internet of Things (IoT)
- Cloud computing
- Open source initiatives
The Big Data Stack!
Big Data Analytics Stack
Big Data - Storage (Filesystem)
- Traditional filesystems are not well designed for large-scale data processing systems.
- Efficiency has a higher priority than other features, e.g., directory service.
- Massive datasets tend to be stored across multiple machines in a distributed way.
- HDFS, Amazon S3, ...
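As a rough illustration of storing one large file across multiple machines, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The function name `place_blocks`, the round-robin policy, and all sizes are assumptions made for this example; they are not taken from the slides and do not describe how HDFS or S3 actually place data.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` nodes (hypothetical round-robin placement)."""
    # Assumes replication <= len(nodes), so replicas land on distinct nodes.
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement
```

With replicas on different machines, losing a single node still leaves a copy of every block available elsewhere.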
Big Data - Database
- Relational Database Management Systems (RDBMS) were not designed to be distributed.
- NoSQL databases relax one or more of the ACID properties: BASE.
- Different data models: key/value, column-family, graph, document.
- Dynamo, Scalaris, BigTable, HBase, Cassandra, MongoDB, Voldemort, Riak, Neo4j, ...
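To make the four data models more concrete, here is a minimal sketch that represents the same made-up record in each of them using plain Python structures. The record, the key names, and the helper `city_of` are hypothetical and only illustrate the shape of the data, not the API of any listed system.

```python
# Hypothetical record used only for illustration; not taken from the slides.

# Key/value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ada", "city": "Stockholm"}'}

# Document: the store understands the nested structure of each value.
document = {"_id": 42, "name": "Ada", "address": {"city": "Stockholm"}}

# Column-family: row key -> column family -> column -> value.
column_family = {
    "42": {"profile": {"name": "Ada"}, "location": {"city": "Stockholm"}}
}

# Graph: nodes plus labeled edges between them.
nodes = {42: {"name": "Ada"}, 7: {"name": "Stockholm"}}
edges = [(42, "LIVES_IN", 7)]


def city_of(user_id):
    """Follow the LIVES_IN edge from a user node to a city name."""
    for src, label, dst in edges:
        if src == user_id and label == "LIVES_IN":
            return nodes[dst]["name"]
    return None
```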
Big Data - Resource Management
- Different frameworks require different computing resources.
- Large organizations need the ability to share data and resources between multiple frameworks.
- Resource managers share the resources of a cluster between multiple frameworks while providing resource isolation.
- Mesos, YARN, Quincy, ...
Big Data - Execution Engine
- Scalable and fault-tolerant parallel data processing on clusters of unreliable machines.
- Data-parallel programming model for clusters of commodity machines.
- MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...
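The data-parallel model can be sketched with the classic word count, written here as a single-process Python imitation of the map, shuffle, and reduce phases. This is an illustrative sketch only: the function names are assumptions, and a real engine would run the map and reduce calls on many machines and handle failures.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each `map_phase` call sees one line and each `reduce_phase` call sees one key, both phases can be spread over machines without coordination beyond the shuffle.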
Big Data - Query/Scripting Language
- Low-level programming of execution engines, e.g., MapReduce, is not easy for end users.
- A high-level language is needed to improve the query capabilities of execution engines.
- It translates user-defined functions to the low-level API of the execution engines.
- Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...
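The translation step can be pictured with a toy example: a high-level "group by a field and count" request is compiled into a mapper and a reducer, which a tiny stand-in engine then executes. The names `compile_group_count` and `run` are hypothetical, and the sketch does not reflect how Pig, Hive, or the other listed systems are implemented.

```python
from itertools import groupby
from operator import itemgetter


def compile_group_count(field):
    """Translate 'GROUP BY field, COUNT(*)' into a (mapper, reducer) pair."""
    def mapper(record):
        return (record[field], 1)

    def reducer(key, values):
        return (key, sum(values))

    return mapper, reducer


def run(records, mapper, reducer):
    """Tiny stand-in for an execution engine: map, sort by key, reduce."""
    mapped = sorted(map(mapper, records), key=itemgetter(0))
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]
```

The end user only names the field; the mapper and reducer are generated for them.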
Big Data - Stream Processing
- Providing users with fresh, low-latency results.
- Database Management Systems (DBMS) vs. Stream Processing Systems (SPS)
- Storm, S4, SEEP, D-Stream, Naiad, ...
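One way to see the difference from a store-then-query DBMS is a sliding-window aggregate that is updated on every incoming event. The class below is a minimal single-process sketch; `SlidingWindowMean` is a made-up name, and it assumes a positive window length and events arriving in non-decreasing timestamp order.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the values seen in the last `window` time units."""

    def __init__(self, window):
        self.window = window      # assumed to be > 0
        self.events = deque()     # (timestamp, value), oldest first
        self.total = 0.0

    def push(self, timestamp, value):
        """Ingest one event and return the up-to-date mean."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)
```

Each `push` does a small amount of incremental work, so a fresh result is available as soon as an event arrives rather than after a batch query.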
Big Data - Graph Processing
- Many problems are expressed using graphs: sparse computational dependencies and multiple iterations to converge.
- Data-parallel frameworks, such as MapReduce, are not ideal for these problems: they are slow.
- Graph processing frameworks are optimized for graph-based problems.
- Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...
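The iterate-until-convergence pattern can be sketched as a vertex-centric single-source shortest path computation, loosely in the spirit of superstep-based frameworks. This is a single-process illustration under stated assumptions: every vertex appears as a key in `graph`, edge weights are non-negative, and the function name `sssp` is invented for this example.

```python
def sssp(graph, source):
    """graph: {vertex: [(neighbor, weight), ...]}.
    Returns (shortest distances from source, number of supersteps)."""
    dist = {vertex: float("inf") for vertex in graph}
    messages = {source: [0]}      # inbox for the first superstep
    supersteps = 0
    while messages:               # halt when no vertex receives a message
        outbox = {}
        for vertex, inbox in messages.items():
            best = min(inbox)
            if best < dist[vertex]:
                # The vertex improved, so it notifies its neighbors.
                dist[vertex] = best
                for neighbor, weight in graph[vertex]:
                    outbox.setdefault(neighbor, []).append(best + weight)
        messages = outbox
        supersteps += 1
    return dist, supersteps
```

Only vertices that received a message do any work in a superstep, which is how this style exploits the sparse dependencies of graph problems.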
Big Data - Machine Learning
- Implementing and consuming machine learning techniques at scale are difficult tasks for developers and end users.
- There exist platforms that address this by providing scalable machine-learning and data mining libraries.
- Mahout, MLBase, SystemML, Ricardo, Presto, ...
Hadoop Big Data Analytics Stack
Stratosphere Big Data Analytics Stack
Spark Big Data Analytics Stack
Summary
Questions?