geoffrey hendrey @geoffhendrey architecture for real-time ad-hoc query on distributed filesystems
TRANSCRIPT
Motivation
• Big Data is more opaque than small data– Spreadsheets choke– BI tools can’t scale– Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:– Faster “time to answer” on Big Data– Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”• This is NOT about looking up items in a
product catalog (i.e. not a consumer search problem)
Classic “side system” approach• Definition of KLUDGE: “a system and
especially a computer system made up of poorly matched components” –Merriam-Webster
Hadoop SearchCluster
?????
Classic “search toolkit”
• Built around fulltext use case• Inverted Indexes optimized for on-the-fly
ranking of results– TF-IDF– Okapi BM-25
• Yet never able to fully realize google-style search capability
• Issues:– Phrase detection– Pseudo synonymy– Open loop architecture
Big data ad-hoc query
• Not typically a fulltext “document search” problem• Data is structured, mixed structured, and
denormalized– Log lines– Json records– CSV files– Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance based
• Sometimes “needle in haystack” (support, debugging)
• Sometimes “haystack in haystack” (summary analytics, segmentation)
Finer points of Dremel architecture• MapReduce friendly• In-Situ approach is DFS friendly• Excels at aggregation. Not so much for needle-in-
haystack.• Column storage format accelerates mapreduce
(less extraneous data pushed through)• But in some regards still a “side system”• Applications must explicitly store their data in a
columnar format• “massive” is both a benefit and a hazard
– Complex (operationally and WRT query execution)– Queries can execute quickly…on huge clusters
Crawled In-Situ Index Architecture
HDFSMapReduce
Data Crawl
In-situ Index
SimpleSearch
Application
Hadoop
Benefits to crawled In-Situ index• No changes to application data format
– CSV– JSON– SequenceFile
• Clear “separation of concerns” between data and index
• Indexes become “disposable”: easily built, easily thrown away
• There is no “side system” that needs to be maintained
• Use the mapreduce “hammer” to pound a nail
Architect for Elasticity
AWS S3
Elastic MapReduce
JetS3tEC2
M1.large
ApplicationCrawl
Index
HTTP
Interesting: you don’t actually need to have hadoop installed…
Declarative Crawl Indexing
HDFSMapReduce
Data Crawl
In-situ Index
SimpleSearch
Application
Hadoop
{"filter”:"column[4]==\"athens\"" }
Parse.json
• Indexer reads declarative instructions from in-situ file• “pull” vs. traditional “push” indexing approach
Thin index
• Index size is small because data is a holistic part of the system
• data does not need to be “put into” the search system and repicated in the index.
HDFSMapReduce
Data Crawl
In-situ Index
Data
Index
Lazy data loading
HDFSMapReduce
Data Crawl
ExecutionRuntime
Data
IndexLRU
IndexCache
Lazy Pull
Lazy Pull