Hybrid MySQL/Hadoop Datawarehouse


Uploaded by laine-campbell, 18-Dec-2014


TRANSCRIPT

  • 1-4. Percona Live NYC 2012, MySQL/Hadoop Hybrid Datawarehouse: Who are Palomino?
    - Bespoke Services: we work with and like you.
    - Production Experienced: senior DBAs, admins, engineers.
    - 24x7: globally-distributed on-call staff.
    - One-Month Contracts: not more.
    - Professional Services: ETLs, Cluster tooling.
    - Configuration management (DevOps): Chef, Puppet, Ansible.
    - Big Data Cluster Administration (OpsDev): MySQL, PostgreSQL, Cassandra, HBase, MongoDB, Couchbase.
  • 5-8. Percona Live NYC 2012, MySQL/Hadoop Hybrid Datawarehouse: Who am I?
    Tim Ellis, CTO/Principal Architect, Palomino.
    Achievements:
    - Palomino Big Data Strategy.
    - Datawarehouse Cluster at Riot Games.
    - Designed/built back-end for Firefox Sync.
    - Led DB team at Digg.com.
    - Harassed the Reddit team at a party.
    Ensured successful business for: Digg, Friendster, Mozilla, StumbleUpon, Riot Games (League of Legends).
  • 9-12. What Is This Talk? Experiences of a High-Volume DBA
    I've built high-volume Datawarehouses, but am not well-versed in traditional
    Datawarehouse theory. Cube? Snowflake? Star? I'll win a bar bet, but would be
    fired from Oracle.
    I've administered high-volume Datawarehouses and managed a large ETL rollout,
    but haven't written extensive ETLs or reports.
    A high-volume Datawarehouse is of a different design than a low-volume
    Datawarehouse by necessity. Typically simpler schemas, more complex queries.
  • 13-17. Why OSS? Freedom at Scale == Economical Sense
    Selling OSS to Management used to be hard...
    - My query tools are limited.
    - The business users know DBMS X.
    - The documentation is lacking.
    ...but then terascale happened one day.
    - Adding 20TB costs HOW MUCH?!
    - Adding 30 machines costs HOW MUCH?!
    - How many sales calls before I push the release?
    - I'll hire an entire team and still be more efficient.
  • 18-21. How to begin? Take stock of the current system
    Establish a data flow:
    - Who's sending me data? How much?
    - What are the bottlenecks? What's the current ETL process?
    We're looking for typical data flow characteristics:
    - Log data, write-mostly, free-form.
    - Looks tabular, select * from table.
    - Size: MB, GB or TB per hour? (A quick sizing sketch follows below.)
    - Who queries this data? How often?
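
    A quick way to get a rough per-hour volume for a log-style feed is a one-liner
    over the files you already have. This is a minimal sketch; the path under
    /var/log/app and the "characters 1-13 == YYYY-MM-DD HH" timestamp layout are
    assumptions, so adjust them to your feed:

      # Rough size of an incoming log-style feed, per hour.
      du -sh /var/log/app/                              # on-disk size of the feed so far
      wc -l /var/log/app/*.log                          # total rows
      cut -c 1-13 /var/log/app/*.log | sort | uniq -c   # rows per hour bucket
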
  • 22-28. What is Hadoop? The Hadoop Ecosystem
    Hadoop Components:
    - HDFS: A filesystem across the whole cluster.
    - Hadoop: A map/reduce implementation.
    - Hive: SQL → Map/Reduce converter.
    - HBase: A column store (and more).
    Most-interesting bits:
    - Hive lets business users formulate SQL! (A sample query follows below.)
    - HBase provides a distributed column store!
    - HDFS provides massive I/O and redundancy.
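
    To make the Hive point concrete, here is a minimal sketch of the kind of query
    a business user might run from the Hive CLI; the table and column names
    (pageviews, url, views, dt) are hypothetical:

      hive -e "
        SELECT url, SUM(views) AS total_views
        FROM   pageviews
        WHERE  dt = '2012-09-30'
        GROUP  BY url
        ORDER  BY total_views DESC
        LIMIT  20;"
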
  • 29-36. Should You Use Hadoop? Hadoop Strengths and Weaknesses
    Hadoop/HBase is good for:
    - Scan large chunks of your data every time.
    - Apply a lot of cluster resource to a query.
    - Very large datasets, multiple tera/petabytes.
    - With HBase, column store engine.
    Where Hadoop/HBase falls short:
    - Query iteration is typically minutes.
    - Administration is new and unusual.
    - Hadoop still immature (some say beta).
    - Documentation is bad or non-existent.
  • 37-40. Should You Use MySQL? MySQL Strengths and Weaknesses
    MySQL is good for:
    - Smaller datasets, typically gigabytes.
    - Indexing data automatically and quickly.
    - Short query iteration, even milliseconds.
    - Quick dataloads and processing with MyISAM.
    Where MySQL falls short:
    - Has no column store engine.
    - Documentation for datawarehousing is minimal.
    You probably know better than I. Trust the DBA. Be honest with management.
    If Vertica is better...
  • 41-47. MySQL/Hadoop Hybrid: Common Weaknesses
    So if you combine the weaknesses of these two technologies... what have you got?
    - No built-in end-user-friendly query tools.
    - Immature technology can crash sometimes.
    - Not too much documentation.
    You'll need buy-in, savvy, and resilience from:
    - ETL/Datawarehouse developers,
    - Business Users,
    - Systems Administrators,
    - Management.
  • 48-52. Building a Hadoop Cluster: The NameNode
    Typical Reasons Clusters Fail:
    - Cascading failure (distributed fail)
    - Network outage (distributed fail)
    - Bad query executed (distributed fail)
    - NameNode dies? (single point of failure)
    NameNode failing is not a common failure case. Still, it's good to plan for it:
    - All critical filesystems on RAID 1+0
    - Redundant PSU
    - Redundant NICs to independent routers
  • 53-59. Building a Hadoop Cluster: Basic Cluster Node Configuration
    So much for the specialised hardware. All non-NameNode nodes in your cluster:
    - RAID-0 or even JBOD.
    - More spindles: linux-1u.net has 8 HDD in 1U.
    - 7200rpm SATA nice, 15Krpm overkill.
    - Multiple TB of storage (lots of this!!!).
    - 8-24GB RAM.
    - Good/fast network cards!
    A DBA thinks Database == RAM. Likewise, Hadoop Node == disk spindles, disk
    storage, and network. You lose 2-3x storage to data replication.
  • 60-63. Building a Hadoop Cluster: Network and Rack Layout
    Network within a rack (top-of-rack switching):
    - Bandwidth for 30 machines going full-tilt.
    - Multiple TOR switches for redundancy. Consider bridging.
    Network between racks (datacentre switching):
    - Inter-rack switches: better than 2Gbit desirable.
    - Hadoop rack awareness reduces inter-rack traffic (a topology-script sketch follows below).
    Need sharp Networking employees on-board to help build the cluster. Network
    instability can cause crashes.
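
    Rack awareness is driven by a small admin-supplied script that maps each node's
    IP or hostname to a rack path; Hadoop calls it and uses the output for replica
    placement (wired up via the topology script property in core-site.xml,
    topology.script.file.name in 1.x-era Hadoop). A minimal sketch, assuming a
    one-/24-per-rack addressing scheme; the subnets and rack names are invented:

      #!/bin/sh
      # rack-topology.sh: print a rack path for each host/IP Hadoop passes in.
      for node in "$@"; do
        case "$node" in
          10.1.1.*) echo "/dc1/rack1" ;;
          10.1.2.*) echo "/dc1/rack2" ;;
          *)        echo "/default-rack" ;;
        esac
      done
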
  • 64-67. Building a Hadoop Cluster: Monitoring, Trending and Alerting
    Pick your graphing solution, and put stats into it. In doubt about which stats
    to graph? Try all of them.
    - Every Hadoop stat exposed via JMX.
    - Every HBase stat exposed via JMX.
    - All disk, CPU, RAM, network stats.
    A possible solution:
    - Use collectd's JMX plugin to collect stats.
    - Put stats into Graphite. Or Ganglia if you know how. (A quick Graphite smoke
      test follows below.)
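
    If you want to confirm the Graphite side is wired up before wrestling with
    collectd, Graphite's plaintext protocol is easy to exercise by hand. A minimal
    sketch, assuming the carbon listener is on graphite.example.com:2003; the
    hostname and metric name are made up:

      # Push one datapoint into Graphite: "<metric> <value> <unix-timestamp>"
      echo "hadoop.dn42.hdfs.bytes_written 123456 $(date +%s)" | nc graphite.example.com 2003
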
  • 68-72. Building a Hadoop Cluster: Palomino Cluster Tool
    Use Configuration Management to build your cluster:
    - Ansible: easiest and quickest.
    - Opscode Chef: most popular, must love Ruby.
    - Puppet: most mature.
    The Palomino Cluster Tool (open source on Github) uses the above tools to build
    a cluster for you:
    - Pre-written Configuration Management scripts.
    - Sets up HDFS, Hadoop, HBase, Monitoring.
    - In the future, will also set up alerting and backups.
    - Also sets up MySQL+MHA, may be relevant?
  • 73-80. Running the Hadoop Cluster: Typical Problems
    Hadoop Clusters are Distributed Systems.
    - Network stressed? Reduce-heavy workload.
    - CPUs stressed? Map-heavy workload.
    - Disks stressed? Map-heavy workload.
    - RAM stressed? This is a DBMS after all!
    Watch your storage subsystems (a couple of stock commands follow below).
    - 120TB is a lot of disk space. Until you put in 120TB of data.
    - 400 spindles is a lot of IOPS. Until you query everything. Ten times.
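
    Keeping an eye on those storage subsystems mostly takes stock commands; a
    minimal sketch using 1.x-era command names (the grep filter is just one way to
    slice the output):

      # Cluster-wide capacity and per-DataNode usage.
      hadoop dfsadmin -report | grep -E 'Configured Capacity|DFS Used|DFS Remaining'
      # Block health: the summary includes under-replicated and corrupt block counts.
      hadoop fsck / | tail -20
      # Local view of each DataNode's disks.
      df -h
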
  • 81-88. Running the Hadoop Cluster: Administration by Scientific Method
    What did we just learn...? Hadoop Clusters are Distributed Systems!
    - Instability on system X? Could be Y's fault.
    - Temporal correlation of ERRORs across nodes.
    - Correlation of WARNINGs and ERRORs.
    - Do log events correlate to graph anomalies?
    The Procedure:
    1. Problems occurring on the cluster?
    2. Formulate hypothesis from input (graphs/logs).
    3. Test hypothesis (tweak configurations).
    4. Go to 1. You're graphing EVERYTHING, right?
  • 89-91. Running the Hadoop Cluster: Graphing your Logs
    You need to graph everything. How about graphing your logs?
      grep ERROR | cut -c 1-13 | uniq -c
      2012-07-29 06 15692
      2012-07-29 07 30432
      2012-07-29 08 76943
      2012-07-29 09 54955
      2012-07-29 10 15652
    That's close, but what if that's hundreds of lines? You can put the data into
    LibreOffice Calc, but that slows down the iteration cycle.
  • 92-94. Running the Hadoop Cluster: Graphing your Logs
    Graphing logs (terminal output) is easier with Palomino's terminal tool
    "distribution", OSS on Github:
      grep ERROR | cut -c 1-13 | distribution
      2012-07-29 06|15692 ++++++++++
      2012-07-29 07|30432 +++++++++++++++++++
      2012-07-29 08|76943 ++++++++++++++++++++++++++++++++++++++++++++++++
      2012-07-29 09|54955 ++++++++++++++++++++++++++++++++++
      2012-07-29 10|15652 ++++++++++
    On a quick iteration cycle in the terminal, this is very useful. For
    presentation to the suits later you can import the data into another, prettier
    tool.
  • 95-97. Running the Hadoop Cluster: Graphing your Logs
    A real-life (MySQL) example:
      root@db49:/var/log/mysql# grep -i error error.log | cut -c 1-9 | distribution | sort -n
    (This file was about 2.5GB in size; the cut keeps just the date/hour portion.
    distribution sorts by key frequency by default, but we'll want date/hour
    ordering, hence the sort -n.)
      Val      |Ct (Pct)    Histogram
      120601 12|60 (46.15%)
      120601 17|10 (7.69%)
      120601 14|4  (3.08%)
      120602 14|2  (1.54%)
      120602 21|4  (3.08%)
      120610 13|2  (1.54%)
      120610 14|4  (3.08%)
      120611 14|2  (1.54%)
      120612 14|2  (1.54%)
      120613 14|2  (1.54%)
      120616 13|2  (1.54%)
      120630 14|5  (3.85%)
    Obvious: Noon on June 1st was ugly. But also: What keeps happening at 2pm?
  • 98-99. Building the MySQL Datawarehouse: Hardware Spec and Layout
    This is a typical OLAP role.
    - Fast non-transactional engine: MyISAM.
    - Data typically time-related: partition by date.
    - Data write-only or read-all? Archive engine.
    - Index-everything schemas.
    Typically beefier hardware is better.
    - Many spindles, many CPUs, much RAM.
    - Reasonably-fast network cards.
    (A sketch of such a table follows below.)
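
    As a concrete illustration of the "MyISAM, partitioned by date, index
    everything" style, a fact table might be declared roughly like this. The
    database, table, and column names are hypothetical, and the partition list
    would normally be generated by the ETL rather than written by hand:

      mysql -e "
      CREATE TABLE fact_pageviews (
        view_date DATE NOT NULL,
        url_id    INT UNSIGNED NOT NULL,
        user_id   INT UNSIGNED NOT NULL,
        views     INT UNSIGNED NOT NULL,
        KEY (view_date), KEY (url_id), KEY (user_id)
      ) ENGINE=MyISAM
        PARTITION BY RANGE (TO_DAYS(view_date)) (
          PARTITION p20120929 VALUES LESS THAN (TO_DAYS('2012-09-30')),
          PARTITION p20120930 VALUES LESS THAN (TO_DAYS('2012-10-01')),
          PARTITION pmax      VALUES LESS THAN MAXVALUE
        );" mydw
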
  • 100-105. ETL Framework: Getting Data into Hadoop
    Hadoop HDFS at its core is simply a filesystem.
    - Copy straight in: cat file | hdfs put
    - From the network: scp file | hdfs put
    - Streaming: (Logs?) Flume → HDFS.
    - Table loads: Sqoop (select * into ...).
    HBase is not as simple, but can be worth it.
    - Flume → HBase.
    - HBase column family == columnar scans.
    - Beware: no secondary indexes.
    (Real-command equivalents for the shorthands above follow below.)
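
    For reference, the shorthand above corresponds roughly to these real commands;
    the paths, JDBC URL, and table names are placeholders, so treat this as a
    sketch rather than a recipe:

      # Copy a local file into HDFS (the "-" form reads from stdin).
      hadoop fs -put access.log /warehouse/raw/access.log
      cat access.log | hadoop fs -put - /warehouse/raw/access.log

      # Bulk-load a MySQL table into HDFS with Sqoop.
      sqoop import --connect jdbc:mysql://dbhost/mydw --username etl -P \
        --table fact_pageviews --target-dir /warehouse/raw/fact_pageviews
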
  • 106-113. ETL Framework: Notice when something is wrong
    Don't skimp on ETL alerting! Start with the obvious:
    - Yesterday TableX delta == 150k rows. Today 5k.
    - Yesterday data loads were 120GB. Today 15GB.
    - Yesterday grep -ci error == 1k. Today 20k.
    - Yesterday wc -l etllogs == 700k. Today 10k.
    - Yesterday ETL process == 8hrs. Today 1hr.
    If you have time, get a bit more sophisticated:
    - Yesterday TableX.ColY was int. Today varchar.
    - Yesterday TableX.ColY compressed at 8x, today it compresses at 2x (or 32x?).
    (A sanity-check sketch follows below.)
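
    The obvious checks fit in a few lines of shell run at the end of each load. A
    minimal sketch; the etl-YYYY-MM-DD.log naming, the 3x threshold, and the alert
    address are all assumptions:

      #!/bin/sh
      # Compare today's ETL log volume and error count against yesterday's; mail on big swings.
      today=$(date +%F); yesterday=$(date -d yesterday +%F)
      t_lines=$(wc -l < "etl-$today.log");       y_lines=$(wc -l < "etl-$yesterday.log")
      t_errs=$(grep -ci error "etl-$today.log"); y_errs=$(grep -ci error "etl-$yesterday.log")
      # Flag a >3x drop in log volume or a >3x jump in errors.
      if [ $((t_lines * 3)) -lt "$y_lines" ] || [ "$t_errs" -gt $((y_errs * 3)) ]; then
        echo "ETL anomaly: lines $y_lines -> $t_lines, errors $y_errs -> $t_errs" |
          mail -s "ETL sanity check failed" dba@example.com
      fi
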
  • 114-118. Getting Data Out: Hadoop Reporting Tools
    The oldschool method of retrieving data:
      select f(col) from table where ... group by ...
    The NoSQL method of retrieving data:
      select f(col) from table where ... group by ...
    Hadoop includes Hive (SQL → Map/Reduce converter). In my experience, dedicated
    business users can learn to use Hive with little extra training. But there is
    extra training!
  • 119-122. Getting Data Out: Hadoop Reporting Tools
    It's best if your business users have analytical mindsets, technical
    backgrounds, and no fear of the command line. Hadoop reporting:
    - Tools that submit SQL and receive tabular data. Tableau has a Hadoop connector.
    Most of Hadoop's power is in Map/Reduce:
    - Hive == SQL → Map/Reduce.
    - RHadoop == R → Map/Reduce.
    - HadoopStreaming == Anything → Map/Reduce (a streaming sketch follows below).
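
    Hadoop Streaming is the lowest-friction of these: any pair of programs that
    read stdin and write stdout can serve as mapper and reducer. A minimal sketch,
    assuming tab-separated input with the URL in field 7 (an assumption) and a
    1.x-era streaming jar whose path varies by install:

      # Count hits per URL using plain shell tools as mapper and reducer.
      hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
        -input  /warehouse/raw/access.log \
        -output /warehouse/out/url_counts \
        -mapper  'cut -f 7' \
        -reducer 'uniq -c'

    The framework sorts the mapper output before it reaches the reducer, which is
    why a bare uniq -c is enough to count occurrences of each key.
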
  • 123-129. The Hybrid Datawarehouse: Putting it All Together
    The Way I've Always Done It:
    1. Identify a data flow overloading the current DW. Typical == raw data into
       DW, then summarised.
    2. New parallel ETL into Hadoop.
    3. Build ETLs Hadoop → current DW. Typical == equivalent summaries from #1.
       Once that works, shut off the old data flow. (A sketch follows below.)
    4. Give everyone access to Hadoop. They will think of cool new uses for the data.
    5. Work through The Pain of #4. It doesn't come free, but is worth the price.
    6. Go to #1.
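
    Step 3 can be as plain as a nightly Hive query whose output is loaded into the
    existing MySQL DW (sqoop export is another common route). A minimal sketch with
    hypothetical table names, paths, and database name:

      # Nightly: summarise raw events in Hive, merge the output, load it into MySQL.
      hive -e "INSERT OVERWRITE DIRECTORY '/warehouse/out/daily_summary'
               SELECT dt, url, COUNT(*) FROM raw_events GROUP BY dt, url;"
      hadoop fs -getmerge /warehouse/out/daily_summary /tmp/daily_summary.txt
      # Hive writes Ctrl-A (0x01) separated fields by default.
      mysql mydw -e "LOAD DATA LOCAL INFILE '/tmp/daily_summary.txt'
                     INTO TABLE daily_summary FIELDS TERMINATED BY X'01';"
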
  • 130-134. The Hybrid Datawarehouse: Q&A
    Questions? Some suggestions:
    - What is the average airspeed of a laden sparrow?
    - How can I hire you? No really, I have money, you have skills. Let's make this
      happen.
    - Where's the coffee? I never thought I could be so sleepy.
    Thank you! Email me if you desire.
    domain: palominodb.com   username: time
    Percona Live NYC 2012