Balogh György: Big Data

TRANSCRIPT

1. BigData with brawn or brain?

2. What is BigData? Data volumes that cannot be handled by traditional solutions (e.g. relational databases): more than 100 million data rows, typically multiple billions.

3. Global rate of data production (per second): 30 TB/sec (22,000 films). Digital media: 2 hours of YouTube video; communication: 3,000 business emails and 300,000 SMS; web: half a million page views; logs: billions.

4. The BigData market.

5. Why now? Long-term trends: the size of stored data has doubled every 40 months since the 1980s; Moore's law: the number of transistors on integrated circuits doubles every 18 months.

6. Different exponential trends.

7. Hard drives in 1991 and in 2012:
   1991: 40 MB, 3,500 RPM, 0.7 MB/sec, full scan in 1 minute
   2012: 4 TB (x 100,000), 7,200 RPM, 120 MB/sec (x 170), full scan in 8 hours (x 480)

8. Data access is becoming a scarce resource! -> Paradigm shift.

9. Google's hardware in 1998.

10. Google's hardware in 2013: 12 data centers worldwide, more than a million nodes. A data center costs $600 million to build; the Oregon data center covers 15,000 m2 and uses the power of 30,000 homes.

11. Google's hardware in 2013: cheap commodity hardware, each node with its own battery! Modular data centers built from standard containers, 1,160 servers per container. Efficiency: 11% overhead (power transformation, cooling).

12. Google cannot afford inefficiency. Thought experiment: a 3% improvement in data compression and data processing speed would save Google a whole data center (on the order of a billion dollars in capital cost plus operating costs). Optimal code is essential, since we multiply everything by a million!

13. Distributed storage and processing: data is distributed and replicated; process data where it is stored (moving data is costly); increase data access speed by increasing the number of nodes.

14. Example: BigQuery. SQL queries on terabytes of data in seconds; data is distributed over thousands of nodes; each node processes one part of the dataset; thousands of nodes work for us for a few milliseconds.

    select year,
           SUM(mother_age * record_weight) / SUM(record_weight) as age
    from publicdata:samples.natality
    where ever_born = 1
    group by year
    order by year;

15. Hadoop.

16. Inefficiency can waste huge resources: a 300-node Hadoop Hive cluster = one Vectorwise node. Vectorwise holds the world speed record for analytical database queries on a single node.

17. Clever ways to improve efficiency: lossless data compression (even 50x!); clever lossy compression of data (e.g. OLAP cubes); cache-aware implementations (asymmetric trends, the memory access bottleneck).

18. Lossless data compression: compression can boost sequential data access by as much as 50 times (100 MB/sec -> 5 GB/sec); less data -> fewer I/O operations; one CPU can decompress data at up to 5 GB/sec. gzip decompression is very slow; snappy, lzo and lz4 can reach 1 GB/sec decompression speed; the decompression used by column-oriented databases can reach 5 GB/sec (PFOR), two billion integers per second (almost one integer per clock cycle!).

19. Example: clever lossy compression (LogDrill). Raw web-server log rows:

    2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
    2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
    2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
    2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
    2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
    2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134

    Aggregated to per-minute counts by method and status code:

    2011-01-08 00:00 GET 200 2
    2011-01-08 00:01 GET 200 2
    2011-01-08 00:02 GET 404 1
    2011-01-08 00:02 POST 200 1
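
To make the roll-up on slide 19 concrete, here is a minimal sketch of the aggregation step: each request is keyed by minute, HTTP method and status code, and occurrences are counted. The field positions and the plain in-memory map are assumptions made for illustration, not LogDrill's actual implementation.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LogRollup {
        public static void main(String[] args) {
            // The raw log rows from slide 19 (in practice these would be streamed from disk).
            String[] rows = {
                "2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562",
                "2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321",
                "2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522",
                "2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425",
                "2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432",
                "2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134"
            };

            // Key: "minute method status"; value: request count.
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String row : rows) {
                String[] f = row.split(" ");
                String minute = f[0] + " " + f[1].substring(0, 5); // drop the seconds
                String key = minute + " " + f[5] + " " + f[11];    // HTTP method and status code
                counts.merge(key, 1, Integer::sum);
            }

            // Prints the four aggregated rows shown on the slide.
            counts.forEach((key, n) -> System.out.println(key + " " + n));
        }
    }

The six raw rows collapse into four aggregate rows; applied to billions of requests, the same idea shrinks the dataset by orders of magnitude while still answering the questions that matter (traffic per minute, error rates, method mix).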

20. Cache-aware programming: CPU speed has been increasing about 60% a year, memory speed only about 10% a year; the widening gap is bridged with multi-level cache memories. The cache is under-exploited: up to 100x speed-up! (A traversal-order sketch follows slide 21.)

21. Lessons learned: put effort into deeply understanding your problem before Hadoop-ing it! Modern analytical databases with multi-node scaling give orders of magnitude better performance. Clever aggregation can make the big data problem go away. If you do use Hadoop, be cost effective (cheap commodity hardware, not expensive servers!).
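
As a small illustration of the cache-awareness point on slide 20, the sketch below sums the same matrix twice: once walking memory sequentially, once jumping between rows on every access. The matrix size and the simple timing harness are choices made for the demo; the measured gap on a typical machine is several-fold rather than the full 100x the slide attributes to cache exploitation in general.

    public class CacheTraversal {
        static final int N = 4096; // 4096 x 4096 ints = 64 MB, far larger than any cache level

        public static void main(String[] args) {
            int[][] m = new int[N][N];

            long t0 = System.nanoTime();
            long rowSum = 0;
            for (int i = 0; i < N; i++)       // row-wise: each row is read sequentially,
                for (int j = 0; j < N; j++)   // so every loaded cache line is fully used
                    rowSum += m[i][j];
            long t1 = System.nanoTime();

            long colSum = 0;
            for (int j = 0; j < N; j++)       // column-wise: consecutive accesses land in
                for (int i = 0; i < N; i++)   // different rows, so most of them miss the cache
                    colSum += m[i][j];
            long t2 = System.nanoTime();

            System.out.printf("row-wise:    %d ms (sum=%d)%n", (t1 - t0) / 1_000_000, rowSum);
            System.out.printf("column-wise: %d ms (sum=%d)%n", (t2 - t1) / 1_000_000, colSum);
        }
    }

Column-oriented layouts and the PFOR-style decompression from slide 18 exploit the same effect: the values a scan needs next are already in the cache line the CPU has paid to load.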