Hadoop Summit - Interactive Big Data Analysis with Solr, Spark and Hue
TRANSCRIPT
INTERACTIVELY QUERY AND SEARCH YOUR BIG DATA
Romain Rigaux
GOALS
Build a Web app
Quickly explore data
… with Solr
make Solr / Hadoop easier to use
+
ARCHITECTURE
“Just a view” on top of the standard Solr API
REST
HISTORY V1 USER
HISTORY V1 ADMIN
ARCHITECTURE NEXT!
Lots of learning, UX boost needed
Simple: users don’t need to know it is Solr
HISTORY V2 USER
HISTORY V2 ADMIN
HISTORY V2 BETTER UX
ARCHITECTURE
/select /admin/collections /get /luke...
/add_widget /zoom_in /select_facet /select_range...
REST AJAX Templates
+ JS Model
www….
ARCHITECTURE UI FOR FACETS
Query: search terms, selected facets (q, fqs)
Collection: dashboard, fields, template, widgets (ids)
Layout: all the 2D positioning (cell ids), visual, drag & drop
ADDING A WIDGET LIFECYCLE
Load the initial page
Edit mode and drag & drop
/solr/zookeeper/clusterstate.json /solr/admin/luke…
/get_collection
ADDING A WIDGET LIFECYCLE
/solr/select?stats=true /new_facet
Select the field
Guess ranges (number or dates)
Rounding (number or dates)
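The "guess ranges" and "rounding" steps can be sketched as follows, assuming the minimum and maximum of the field come from the stats=true call. The function name, the target of 10 buckets and the 1/2/5 rounding rule are illustrative assumptions, not Hue's actual heuristic (the example later in this deck uses a gap of 900000, which this sketch would not produce).

```python
import math


def guess_range(min_value, max_value, buckets=10):
    """Return (start, end, gap) for a numeric range facet.

    Hypothetical helper: the gap is rounded up to 1, 2, 5 or 10 times a
    power of ten so that bucket boundaries are easy to read.
    """
    span = max_value - min_value
    if span <= 0:
        # Degenerate field (single value): fall back to one unit-wide bucket.
        return min_value, min_value + 1, 1

    raw_gap = span / buckets
    magnitude = 10 ** math.floor(math.log10(raw_gap))
    for step in (1, 2, 5, 10):
        gap = step * magnitude
        if gap >= raw_gap:
            break

    # Snap both ends outward to multiples of the gap.
    start = math.floor(min_value / gap) * gap
    end = math.ceil(max_value / gap) * gap
    return start, end, gap
```

Date fields would need the same idea expressed in calendar units (day, month, year) rather than powers of ten.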
ADDING A WIDGET LIFECYCLE
Query part 1
Query part 2
Augment Solr response
facet.range={!ex=bytes}bytes&f.bytes.facet.range.start=0&f.bytes.facet.range.end=9000000&f.bytes.facet.range.gap=900000&f.bytes.facet.mincount=0&f.bytes.facet.limit=10
q=Chrome&fq={!tag=bytes}bytes:[900000+TO+1800000]
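A minimal sketch of how the two query parts above could be assembled, using only the Python standard library. The function name and its arguments are hypothetical; the parameter names are the ones shown on the slide, plus facet=true, which is added here on the assumption that faceting has to be enabled explicitly.

```python
from urllib.parse import parse_qs, urlencode


def range_facet_query(field, start, end, gap, query, selected=None):
    """Build a Solr query string with one range facet and an optional selection."""
    params = [
        ("q", query),
        ("facet", "true"),  # assumption: not shown on the slide
        # {!ex=<tag>} excludes the filter tagged <tag> when counting buckets,
        # so the other buckets stay visible after one of them is selected.
        ("facet.range", "{!ex=%s}%s" % (field, field)),
        ("f.%s.facet.range.start" % field, start),
        ("f.%s.facet.range.end" % field, end),
        ("f.%s.facet.range.gap" % field, gap),
        ("f.%s.facet.mincount" % field, 0),
        ("f.%s.facet.limit" % field, 10),
    ]
    if selected is not None:
        low, high = selected
        # {!tag=<tag>} names the filter so the facet above can exclude it.
        params.append(("fq", "{!tag=%s}%s:[%s TO %s]" % (field, field, low, high)))
    return urlencode(params)


query_string = range_facet_query(
    "bytes", 0, 9000000, 900000, "Chrome", selected=(900000, 1800000)
)
decoded = parse_qs(query_string)  # decode again to inspect the parameters
```

The tag/exclude pair is what lets a selected bucket filter the results without collapsing the widget to that single bucket.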
{ 'facet_counts':{ 'facet_ranges':{ 'bytes':{ 'start':10000, 'counts':[ '900000', 3423, '1800000', 339,
... ] } }}
{ ..., 'normalized_facets':[ { 'extraSeries':[
], 'label':'bytes', 'field':'bytes', 'counts':[ { 'from':'900000', 'to':'1800000', 'selected':True, 'value':3423, 'field':'bytes', 'exclude':False } ], ... } }}
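The "augment Solr response" step can be sketched as below: Solr returns range counts as a flat list alternating lower bound and count, and the widget wants one object per bucket. The function name and arguments are hypothetical, the gap is passed in explicitly because the slide does not show where it is read from, and 'exclude' is left as a placeholder since its semantics are not explained here.

```python
def normalize_range_facet(field, solr_range, gap, selected_from=None):
    """Turn Solr's flat ['900000', 3423, '1800000', 339] into bucket dicts."""
    flat = solr_range["counts"]
    counts = []
    # Even positions hold the lower bound, odd positions hold the count.
    for lower, value in zip(flat[0::2], flat[1::2]):
        counts.append({
            "from": lower,
            "to": str(int(lower) + gap),
            "selected": lower == selected_from,
            "value": value,
            "field": field,
            "exclude": False,  # placeholder, see lead-in
        })
    return {"extraSeries": [], "label": field, "field": field, "counts": counts}


solr_range = {"start": 0, "counts": ["900000", 3423, "1800000", 339]}
normalized = normalize_range_facet("bytes", solr_range, 900000, selected_from="900000")
```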
JSON TO WIDGET
{ "field":"rate_code","counts":[ { "count":97797, "exclude":true, "selected":false, "value":"1", "cat":"rate_code" } ...
{ "field":"medallion","counts":[ { "count":159, "exclude":true, "selected":false, "value":"6CA28FC49A4C49A9A96", "cat":"medallion" } ….
{ "extraSeries":[
],"label":"trip_time_in_secs","field":"trip_time_in_secs","counts":[ { "from":"0", "to":"10", "selected":false, "value":527, "field":"trip_time_in_secs", "exclude":true } ...
{ "field":"passenger_count","counts":[ { "count":74766, "exclude":true, "selected":false, "value":"1", "cat":"passenger_count" } ...
REPEAT UNTIL…
FACET FUNCTIONS
Count Sum Avg Percentile Max ...
Count(id) Sum(bytes) Avg(mul(price, quantity)) Percentile(salary, 50, 90) Max(temperature) ...
FACET FUNCTIONS
SUB “NESTED” FACETS
top_os { type: term, field: os, limit: 5 }
top_os { type: term, field: os, limit: 5, facet : { by_country: { type: term, field: country } } }
FUNCTION + NESTED = ANALYTICS
states { type: term, field: state, facet : { by_month : { type: range, field: time, start: "TODAY-6MONTHS", end: "TODAY", gap: "MONTH", facet : { avg_sal: "avg(salary)" } } } }
states { type: term, field: state, facet : { avg_sal: "avg(salary)" } }
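The nested facet definitions above can be built programmatically, as in this sketch. The helper names are hypothetical, and the dictionaries simply mirror the slide notation (including the `term` type and the `TODAY` date strings, which are kept as written; check the JSON Facet API reference of your Solr version for the exact spellings it accepts).

```python
import json


def term_facet(field, limit=None, sub_facets=None):
    """Bucket documents by the values of a field."""
    facet = {"type": "term", "field": field}
    if limit is not None:
        facet["limit"] = limit
    if sub_facets:
        facet["facet"] = sub_facets  # nested facets or functions per bucket
    return facet


def range_facet(field, start, end, gap, sub_facets=None):
    """Bucket documents by ranges of a numeric or date field."""
    facet = {"type": "range", "field": field, "start": start, "end": end, "gap": gap}
    if sub_facets:
        facet["facet"] = sub_facets
    return facet


# Average salary per state and per month: a function inside two nested facets.
request = {
    "states": term_facet("state", sub_facets={
        "by_month": range_facet(
            "time", "TODAY-6MONTHS", "TODAY", "MONTH",
            sub_facets={"avg_sal": "avg(salary)"},
        ),
    }),
}
json_facet_param = json.dumps(request)  # serialized facet definition
```

Each level of nesting adds one dimension to the result, which is what the following slides call nD functions.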
OPERATIONS ON BUCKETS OF DATA
Counts → Functions
OPERATIONS ON BUCKETS OF DATA
Nested → nD functions
ENTERPRISE FEATURES
- Access to Search App configurable, LDAP/SAML auths
- Share by link
- Solr Cloud (or non Cloud)
- Proxy user
  /solr/jobs_demo/select?user.name=hue&doAs=romain&q=
- Security
  Kerberos
- Sentry
  Collection level, Solr calls like /admin, /query, Solr UI, ZooKeeper
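The proxy-user call listed above can be illustrated with a small URL builder. The function name is hypothetical; the `user.name` and `doAs` parameters and the example values are the ones from the slide.

```python
from urllib.parse import urlencode


def proxied_select_url(collection, service_user, end_user, query):
    """Build a /select URL where service_user acts on behalf of end_user."""
    params = urlencode({
        "user.name": service_user,  # the account Hue itself connects as
        "doAs": end_user,           # the logged-in user being impersonated
        "q": query,
    })
    return "/solr/%s/select?%s" % (collection, params)


url = proxied_select_url("jobs_demo", "hue", "romain", "")
```

With this pattern the collection-level permissions are evaluated for the end user, not for the shared service account.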
SEARCH AS ONLY APP IN HUE
gethue.com/solr-search-ui-only/
• Spark in your browser
• Notebooks
• New REST Server
SPARK INDEXING WHAT
• Open source REST for Spark Shell
• Runs locally or inside YARN
• Spark Scala, PySpark and jar/py submission
SPARK INDEXING WHAT
https://github.com/cloudera/hue/tree/master/apps/spark/java
• Python
• Scala
• Charts
NOTEBOOKS / SHELL
WHAT
DEMO TIME
• Analyze Bay Area bike share
• Visualize one year of data
• Know your users, predict behavior
• Full Analytics
• Easier indexing
• Geo
• Export/Share results
• “More like this”
• Solr Joins, Solr SQL
• Spark, SQL... integration, Hue 4
WHAT’S NEXT
NEW FEATURES
@gethue
USER GROUP
hue-user@
WEBSITE
http://gethue.com
LEARN
http://learn.gethue.com
THANKS!