Real time analytics using Hadoop and Elasticsearch


Post on 01-Jul-2015








  • 1. Real time analytics using Hadoop and Elasticsearch, by Abhishek Andhavarapu

  • 2. Thank you Sponsors!
  • 3. About Me: Currently working as Software Engineer (Data Platform) at Allegiance Software Inc. Passion for distributed systems and data visualizations. Masters in Distributed Systems.
  • 4. Agenda: Use case. Architecture. Elasticsearch 101. Demo. Lessons learnt.
  • 5. Legacy Architecture
  • 6. Current Architecture
  • 7. Why Hadoop?
  • 8. Elasticsearch 101: Document-oriented search engine. JSON based, Apache Lucene under the covers. Schema free. It's distributed and supports aggregations, similar to group by. Uses bit sets to cache efficiently. It's fast. Super fast. It has REST and Java based APIs.
  • 9. Elasticsearch CRUD
      Index a person: curl -XPUT localhost:9200/person/1 -d '{"first_name" : "Abhishek","last_name" : "Andhavarapu"}'
      Get a person: curl -XGET 'localhost:9200/person/1'
      Delete a person: curl -XDELETE localhost:9200/person/1
      Update a person: curl -XPOST 'localhost:9200/person/1/_update' -d '{"doc" : {"first_name" : "Abhi"}}'
  • 10. Elasticsearch data (diagram: Node1 holds shard S0, Node2 holds shard S1)
  • 11. Replicas (diagram: Node1 and Node2 each hold a copy of S0 and S1; red = primary, blue = replica)
  • 12. More nodes (diagram: Node1 holds S0, Node2 holds S1, Node3 holds S1, Node4 holds S0; red = primary, blue = replica)
  • 13. Node down (diagram: the same four-node layout as the previous slide; red = primary, blue = replica)
  • 14. Node down (diagram: Node1, Node3 and Node4 remain; one surviving shard copy is labelled "Promoted to Primary" and another "Re-replicated"; red = primary, blue = replica)
  • 15. Elasticsearch 101: Lucene is under the covers. Each index (like a database) is made up of multiple shards (each a Lucene instance). Shards are distributed amongst all nodes in the cluster. In case of failure or the addition of new nodes, shards are automatically moved from one node to another.
  • 16. How is it fast? Distributed execution (diagram: a client query is sent to shards S0 and S1 on Node 1 and Node 2; red = primary, blue = replica)
  • 17. DEMO: Import data from the SQL database into Hive (Extract). Run the necessary computations using Hadoop/Hive (Transform). Push the data into Elasticsearch (Load). Run queries against Elasticsearch.
  • 18. Current Elasticsearch Cluster: 9 bare metal boxes. 128 GB RAM. 2X SSD. 10 GB Ethernet. 2X 10 core Xeon processors. 2X 30 GB Elasticsearch instances per box. 1 Elasticsearch load balancing instance to handle index requests.
  • 19. Zabbix. What's slow? Any request that takes more than 300 ms is slow.
  • 20. Lessons Learnt
  • 21. Concurrency: More replicas for more concurrency. Updates are costly. More shards, much faster. SQL: 3 to 5k per minute.
  • 22. Filter Cache: All filters have a cache flag that controls whether they are cached or not. Once the filter cache is warmed, all requests are served from memory. Default: 10% for the filter cache. LRU. Bit sets.
  • 23. Field Data: For sorting, aggregation, etc., all the field values are loaded into memory, called field data. By default it is unbounded. It is expensive to build, so it is recommended to hold it in memory. There are circuit breakers to protect against this: if a query is going to use more than 60% of the JVM heap, the breaker kills the query.
  • 24. JVM memory: friend or foe? to replicate which are still serving requests causing additional heap
  • 25. Getting bad. Solution? More memory. Not necessarily more boxes.
  • 26. Elasticsearch Cons: Not commodity hardware: 6K (Hadoop) vs 10K (SSD). GC issues. Circuit breakers don't protect you against everything. No built-in security; use an nginx proxy with authentication. Learning curve. Lots of updates hurt: the filter cache has to be rebuilt, merges, etc.
  • 27. Thank you. Twitter: abhishek376. We are hiring!
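
The memory figures from the Filter Cache and Field Data slides can be worked through as a back-of-the-envelope sketch. It assumes the 30 GB per Elasticsearch instance quoted on the cluster slide is the JVM heap, which the slides do not state explicitly. `HeapBudget` and its methods are hypothetical helpers for illustration, not part of any Elasticsearch API; the 10% and 60% values are the ones quoted on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeapBudget:
    """Hypothetical helper: sizes derived from one instance's JVM heap."""

    heap_gb: int                     # assumed JVM heap per Elasticsearch instance
    filter_cache_pct: int = 10       # filter cache default quoted on the slides
    fielddata_breaker_pct: int = 60  # field data breaker limit quoted on the slides

    @property
    def heap_mb(self) -> int:
        return self.heap_gb * 1024

    def filter_cache_mb(self) -> int:
        # Share of the heap available to cached filters.
        return self.heap_mb * self.filter_cache_pct // 100

    def breaker_limit_mb(self) -> int:
        # Heap usage above which the slides say the query is killed.
        return self.heap_mb * self.fielddata_breaker_pct // 100

    def would_trip(self, estimated_fielddata_mb: int) -> bool:
        # True if the estimated field data exceeds the breaker limit.
        return estimated_fielddata_mb > self.breaker_limit_mb()


budget = HeapBudget(heap_gb=30)
print(budget.filter_cache_mb())   # 3072
print(budget.breaker_limit_mb())  # 18432
print(budget.would_trip(20_000))  # True
```

Under these assumptions, each instance would have roughly 3 GB of filter cache, and a query estimated to need more than about 18 GB of field data would be rejected. That fits the "more memory, not necessarily more boxes" lesson.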