data for action talk - 2016-02-22
Post on 12-Feb-2017
What is Big Data in a Nutshell?: An Introduction to Problems and Bottlenecks in Data Systems
Zach Gazak
David E Drummond
Insight Data Science & Engineering
Goals
• Understand what can be done with “Big Data” and the scale of the data.
• Understand the hardware bottlenecks that dictate the technology “stack”.
• Understand the different stacks used by different types of companies, and why.
Types of Data
• Audio / Visual: Images and Videos
• Text: Comments, Notes, Profile Content
• Interactions: Likes, Friendships, Groups
• Site usage: Log in, Scroll, Click, Post, etc.
(The list runs from unstructured at the top to structured at the bottom.)
• Various ports (I/O): up to ~10 GB/s
• RAM (memory): ~8 GB
• CPU (processor): ~1 GHz
• Hard drive (storage): ~250 GB
Network, Processing, Storage
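To get a feel for these figures, here is a back-of-the-envelope sketch in Python. It uses only the rough numbers from the slide and assumes decimal units and a port that actually sustains its peak ~10 GB/s, which real hardware rarely does. The function names are illustrative, not from the talk.

```python
# Rough figures from the slide; treating them as exact is an assumption.
PORT_GB_PER_S = 10   # I/O ports: "up to ~10 GB/s"
RAM_GB = 8           # memory
DISK_GB = 250        # hard drive


def seconds_to_transfer(size_gb, rate_gb_per_s=PORT_GB_PER_S):
    """Best-case time to push size_gb through a port at its peak rate."""
    return size_gb / rate_gb_per_s


def fraction_in_memory(ram_gb=RAM_GB, disk_gb=DISK_GB):
    """Share of a full disk that fits in RAM at any one time."""
    return ram_gb / disk_gb


if __name__ == "__main__":
    print("Seconds to move a full disk over I/O:", seconds_to_transfer(DISK_GB))
    print("Fraction of the disk that fits in RAM:", fraction_in_memory())
```

Even in this best case, only a few percent of the disk fits in memory at once, which is why the bottlenecks that follow matter.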
Bottlenecks in Data Systems
Proper data system design should consider these limiting bottlenecks:
• Processing time by the CPU
• Loading data into the CPU and memory
• Finding data on the disk
• Reading data from the disk
• Moving data across the network
Bottlenecks: Processing Data
• All data that is processed must be loaded into the CPU
• Disk Storage → Memory → CPU: price and speed both rise toward the CPU
• Solution: Storage Hierarchy, Supercomputers, Distributed Systems
Bottlenecks: Finding Data
• Finding a new file on disk (known as random seeks)
• Solution: SSD and structured databases for specific use cases
[Figure: actuator arm with head that reads from disk; markers show the beginning and end of the desired file]
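Why random seeks hurt can be sketched with a simple cost model: total read time is the number of seeks times the seek latency, plus the bytes read divided by the sequential throughput. The 10 ms seek time and 100 MB/s throughput below are round, hypothetical values chosen for illustration, not measurements from the talk.

```python
def read_time_s(total_bytes, n_seeks, seek_s=0.010, throughput_bytes_per_s=100e6):
    """Estimated read time: seek cost plus sequential transfer cost.

    seek_s and throughput_bytes_per_s are assumed, illustrative defaults
    for a spinning disk; substitute measured values for a real system.
    """
    return n_seeks * seek_s + total_bytes / throughput_bytes_per_s


TOTAL_BYTES = 100 * 10**6  # the same 100 MB in both cases

# One contiguous file: a single seek, then a sequential read.
contiguous_s = read_time_s(TOTAL_BYTES, n_seeks=1)

# The same data scattered across 10,000 small pieces: one seek per piece.
fragmented_s = read_time_s(TOTAL_BYTES, n_seeks=10_000)

if __name__ == "__main__":
    print(f"contiguous: {contiguous_s:.2f} s")
    print(f"fragmented: {fragmented_s:.2f} s")
```

Under these assumptions the scattered layout is dominated by seek time rather than transfer time, which is the case SSDs and purpose-built databases are meant to address.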
Bottlenecks: Moving Data
• Moving data from machine to machine over a network
• Solution: Keeping data close to the processors (MapReduce)
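The idea of keeping data close to the processors can be illustrated with a toy, single-process version of the map/reduce pattern. This is only a sketch of the pattern, not the Hadoop or any other framework's API. The `partitions` list and the function names are hypothetical stand-ins for data stored on separate machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each inner list stands in for the records
# stored on one machine.
partitions = [
    ["like", "post", "like"],
    ["click", "like", "post"],
    ["click", "click"],
]


def map_partition(records):
    """Runs where the data lives: reduce raw records to a small summary."""
    return Counter(records)


def reduce_counts(left, right):
    """Merge two summaries; only these small summaries would cross the network."""
    return left + right


# Map step: one local summary per partition.
local_summaries = [map_partition(p) for p in partitions]

# Reduce step: combine the summaries into global totals.
totals = reduce(reduce_counts, local_summaries, Counter())

if __name__ == "__main__":
    print(dict(totals))
```

The raw records never leave their partition. Only the per-partition counts are combined, which is what keeps network traffic small.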
Bottlenecks: Example
• Processing a 2 kB transaction in memory, sequentially and randomly on disk, or across the network
• Ratios shown on the slide: 100:1, 200:1, 50:1
Open Questions
• Will processors continue to improve?
• Are there new types of processing?
• What if memory replaced hard disks?
Tech Stacks for Companies
Depending on your growth plans:
• Single system with small data
• Distributed data center with large data
• Renting computers for flexibility (cloud)
Small Firms with Small Data
• Example: Small medical firm with slow growth
• Pros: Easy to maintain, data locality, inexpensive
• Cons: Difficult to grow quickly, risky, not ideal for analysis
Large Firms with Stable Growth
• Example: Facebook with steadily growing data centers
• Pros: Economies of scale, redundancy, innovative design
• Cons: Upfront capital, dedicated maintenance
• >100 PB of Data
• 7 PB / Day
• 1 kW / TB
• ~$20 / TB / Month
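Taken at face value, the slide's figures can be combined into rough totals. The sketch below assumes decimal units (1 PB = 1000 TB), uses 100 PB as a lower bound for ">100 PB", and assumes the per-TB power and cost figures apply uniformly to all stored data. The talk states none of these assumptions, so the results are order-of-magnitude estimates only.

```python
TB_PER_PB = 1000  # decimal units (assumption)

# Figures from the slide.
STORED_PB = 100          # ">100 PB", used here as a lower bound
INGEST_PB_PER_DAY = 7
KW_PER_TB = 1
USD_PER_TB_MONTH = 20

stored_tb = STORED_PB * TB_PER_PB

# Power draw if every stored TB costs 1 kW, converted to megawatts.
power_mw = stored_tb * KW_PER_TB / 1000

# Monthly cost if every stored TB costs about $20.
monthly_cost_usd = stored_tb * USD_PER_TB_MONTH

# Days of ingest needed to add another 100 PB, assuming nothing is deleted.
days_to_add_100_pb = STORED_PB / INGEST_PB_PER_DAY

if __name__ == "__main__":
    print(f"power: {power_mw:.0f} MW")
    print(f"cost: ${monthly_cost_usd:,} per month")
    print(f"days to add another 100 PB: {days_to_add_100_pb:.1f}")
```

At this scale the upfront capital and dedicated maintenance listed under Cons become easier to picture.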
Start-Ups with Exponential Growth
• Example: AirBnB - rent processing and storage from AWS
• Pros: Scales easily, no maintenance, no upfront capital
• Cons: Expensive in the long run, depend on data provider
• 50 GB / Day
• $20-50 / TB / Month
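How a rented-storage bill grows can be sketched from the two figures on the slide. The model assumes 30-day months, decimal units, a constant 50 GB / day with nothing ever deleted, and a flat per-TB rate. These are all simplifications, and the function names are illustrative.

```python
GB_PER_TB = 1000       # decimal units (assumption)
DAYS_PER_MONTH = 30    # simplification


def stored_tb(months, gb_per_day=50):
    """Data accumulated after a number of months at a constant daily rate."""
    return gb_per_day * DAYS_PER_MONTH * months / GB_PER_TB


def monthly_bill_usd(months, usd_per_tb_month):
    """Storage bill in a given month, for everything stored so far."""
    return stored_tb(months) * usd_per_tb_month


def cumulative_bill_usd(months, usd_per_tb_month):
    """Total paid from month 1 through the given month."""
    return sum(monthly_bill_usd(m, usd_per_tb_month) for m in range(1, months + 1))


if __name__ == "__main__":
    for rate in (20, 50):  # the slide's $20-50 / TB / month range
        print(
            f"${rate}/TB/month: month-12 bill ${monthly_bill_usd(12, rate):,.0f}, "
            f"first-year total ${cumulative_bill_usd(12, rate):,.0f}"
        )
```

Because the bill covers everything stored so far, the monthly charge keeps rising even at a constant ingest rate. That is one way to read "expensive in the long run".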
Start-Ups with Exponential Growth
• Example: Netflix - AWS fails on Christmas Eve
• Con: You can rent the computers, but you own the failure