NoSQL Matters 2013 - Introduction to Map Reduce with Couchbase 2.0
DESCRIPTION
Introduction to Map Reduce and how it is used in Couchbase Server 2.0 to query documents
TRANSCRIPT
Friday, April 26, 13
Introduction to Map Reduce with Couchbase
Tugdual Grall / @tgrall
NoSQL Matters '13 - Cologne - April 25th 2013
About Me
• Tugdual "Tug" Grall
Couchbase
• Technical Evangelist
eXo
• CTO
Oracle
• Developer/Product Manager
• Mainly Java/SOA
Developer in consulting firms
• Web
• @tgrall
• http://blog.grallandco.com
• tgrall
• NantesJUG co-founder
• Pet project: http://www.resultri.com
What's the Problem?
• Big Data: lots of data
• SaaS/Cloud computing
• Big Users
Solution
Distribute:
• the data
• the processing of the data
Map Reduce
MapReduce is a programming model for processing large data sets, and the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.
http://research.google.com/archive/mapreduce.html
In detail
• Developer specifies 2 methods:
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes input data
• Produces key/value pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Combines all intermediate values for a particular key
• Produces a set of merged output values
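The two signatures above can be illustrated with the classic word count. This is a minimal single-process sketch, not how Google's implementation or Couchbase executes things: the `mapReduce` driver, the file names and the input strings are made up for illustration, and a real framework would distribute the map calls, the grouping step and the reduce calls across machines.

```javascript
// map(inKey, inValue) -> list of [outKey, intermediateValue]
// Here: inKey is a (hypothetical) file name, inValue is its text.
function map(inKey, inValue) {
  return inValue
    .toLowerCase()
    .split(/\W+/)
    .filter((word) => word.length > 0)
    .map((word) => [word, 1]);
}

// reduce(outKey, list(intermediateValue)) -> list(outValue)
// Here: sums the 1s emitted for a given word.
function reduce(outKey, intermediateValues) {
  return [intermediateValues.reduce((sum, n) => sum + n, 0)];
}

// Hypothetical single-process driver: runs map on every input,
// groups intermediate values by key, then runs reduce once per key.
function mapReduce(inputs) {
  const groups = new Map();
  for (const [inKey, inValue] of Object.entries(inputs)) {
    for (const [outKey, value] of map(inKey, inValue)) {
      if (!groups.has(outKey)) groups.set(outKey, []);
      groups.get(outKey).push(value);
    }
  }
  const output = {};
  for (const [outKey, values] of groups) {
    output[outKey] = reduce(outKey, values);
  }
  return output;
}

const wordCounts = mapReduce({
  "doc1.txt": "to be or not to be",
  "doc2.txt": "to do",
});
console.log(wordCounts);
```

The grouping step in the middle is what distributed frameworks usually call the shuffle.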
Execution
Most common use case
© Yahoo inc.
What about Couchbase?
Couchbase Open Source Project
• Leading NoSQL database project focused on distributed database technology and surrounding ecosystem
• Supports both key-value and document-oriented use cases
• All components are available under the Apache 2.0 Public License
• Obtained as packaged software in both enterprise and community editions.
Couchbase Server Core Principles
• Easy Scalability
Grow cluster without application changes, without downtime, with a single click
• Consistent High Performance
Consistent sub-millisecond read and write response times with consistent high throughput
• Always On 24x365
No downtime for software upgrades, hardware maintenance, etc.
• Flexible Data Model
JSON document model with no fixed schema.
Additional Couchbase Server Features
• Built-in clustering: all nodes equal
• Data replication with auto-failover
• Zero-downtime maintenance
• Built-in managed cache
• Append-only storage layer
• Online compaction
• Monitoring and admin API & UI
• SDK for a variety of languages
Couchbase Server 2.0 Architecture
[Architecture diagram; labels:]
Data Manager
• Query API (port 8092), Query Engine
• Memcapable 2.0 (port 11210)
• Moxi
• Memcapable 1.0 (port 11211)
• Memcached
• Couchbase EP Engine
• Storage interface
• New Persistence Layer
Cluster Manager (Erlang/OTP)
• HTTP (port 8091): REST management API / Web UI
• Erlang port mapper (port 4369)
• Distributed Erlang (ports 21100 - 21199)
• Heartbeat
• Process monitor
• Global singleton supervisor
• Configuration manager
• Rebalance orchestrator
• Node health monitor
• vBucket state and replication manager
• Groupings shown: "on each node", "one per cluster"
Couchbase Server 2.0 Architecture
[Same architecture diagram, annotated; additional labels:]
• Object-level Cache
• Disk Persistence
• Server/Cluster Management & Communication (Erlang)
• RAM Cache, Indexing & Persistence Management (C & V8)
• The Unreasonable Effectiveness of C, by Damien Katz
Basic Operation
• Docs distributed evenly across servers
• Each server stores both active and replica docs
Only one server active at a time
• Client library provides app with simple interface to database
• Cluster map provides map to which server doc is on
App never needs to know
• App reads, writes, updates docs
• Multiple app servers can access same document at same time
[Diagram: COUCHBASE SERVER CLUSTER. APP SERVER 1 and APP SERVER 2, each with a Couchbase client library and cluster map, send READ/WRITE/UPDATE requests to SERVER 1, SERVER 2 and SERVER 3. Each server holds ACTIVE docs and REPLICA docs. User-configured replica count = 1.]
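The role of the cluster map can be sketched as a lookup from document key to partition to server. Everything here is an assumption for illustration: the partition count, the `hashKey` function (a simple stand-in, not the hash used by real Couchbase client libraries), the round-robin layout and the server names.

```javascript
// Assumption: documents are spread over a fixed number of partitions (vBuckets).
const NUM_VBUCKETS = 1024;

// Stand-in hash, NOT the one a real client library uses.
function hashKey(key) {
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Hypothetical cluster map: for each partition, which server is active
// and which one holds the replica (replica count = 1).
function buildClusterMap(servers) {
  const map = [];
  for (let vb = 0; vb < NUM_VBUCKETS; vb++) {
    map.push({
      active: servers[vb % servers.length],
      replica: servers[(vb + 1) % servers.length],
    });
  }
  return map;
}

// The client library does this lookup; the application only passes the key.
function locate(clusterMap, key) {
  const vBucket = hashKey(key) % NUM_VBUCKETS;
  return { vBucket, ...clusterMap[vBucket] };
}

const clusterMap = buildClusterMap(["server1", "server2", "server3"]);
const docLocation = locate(clusterMap, "my-key");
console.log(docLocation);
```

Because the lookup depends only on the key and the map, every app server holding the same cluster map sends requests for a given document to the same active server.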
How to access the data?
Couchbase.get("my-key");
Look at a document
Key -> JSON object ("document"):
{ "string" : "string", "string" : value, "string" : { "string" : "string", "string" : value }, "string" : [ array ]}
• How to find a document based on its attributes?
get employee by email
get products by type
...
• You need to look "into" the document/value
Create an index !
How to?
Create the index
Document:
{ "name": "Aventinus", "abv": 8.2, "ibu": 0, "srm": 0, "upc": 0, "type": "beer", "brewery_id": "110f1f2012", "updated": "2010-07-22 20:00:20", "description": "Dark-ruby, ... Weizenbock", "category": "German Ale"}
Metadata:
{ "id": "110f37fa30", "rev": "1-000000000", "expiration": 0, "flags": 0, "type": "json"}
Index built from all documents:
Key Value
Aventinus 8.2
Avenue Ale 4.1
... ...
Concrete Example
• The map function receives the document and its metadata
As a developer, you just have to emit the key and value (K,V)
Map Function
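A map function of the kind described here might look like the sketch below, which indexes beers by name with their `abv` as value. The `emit` harness, `currentId` and the driver loop are stand-ins for the view engine so the example runs on its own; the first document is trimmed from the one shown earlier, while the other two documents and their ids are made up.

```javascript
// Hypothetical harness standing in for the view engine: it supplies emit()
// and remembers which document each emitted row came from.
const indexRows = [];
let currentId = null;
function emit(key, value) {
  indexRows.push({ key, value, id: currentId });
}

// Map function in the (doc, meta) shape: look into the document,
// then emit the key and value that should appear in the index.
function mapBeerByName(doc, meta) {
  if (meta.type === "json" && doc.type === "beer" && doc.name) {
    emit(doc.name, doc.abv);
  }
}

// Sample input: the second and third documents are invented for illustration.
const documents = [
  { meta: { id: "110f37fa30", type: "json" }, doc: { name: "Aventinus", abv: 8.2, type: "beer" } },
  { meta: { id: "hypothetical-1", type: "json" }, doc: { name: "Avenue Ale", abv: 4.1, type: "beer" } },
  { meta: { id: "hypothetical-2", type: "json" }, doc: { name: "Some Brewery", type: "brewery" } },
];

for (const { doc, meta } of documents) {
  currentId = meta.id;
  mapBeerByName(doc, meta);
}

// An index keeps its rows sorted by key.
indexRows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
console.log(indexRows);
```

Only the body of `mapBeerByName` corresponds to what a developer would write in a view; the rest simulates the server side.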
doc.email meta.id
[email protected] u::1
[email protected] u::7
[email protected] u::2
[email protected] u::5
[email protected] u::6
yeti@couchbase.com u::4
[email protected] u::3
?startkey="b1"&endkey="zz"
?startkey="bz"&endkey="zn"
Pulls the index keys within the UTF-8 range specified by startkey and endkey.
?key="[email protected]"
Match a single index key
?keys=["[email protected]","[email protected]"]
Query multiple keys in the set (array notation)
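The three query forms (`startkey`/`endkey`, `key`, `keys`) can be sketched against a small in-memory index. The e-mail addresses below are made up, `queryIndex` is a hypothetical helper, and plain JavaScript string comparison is used as a simplification of the collation a real view index applies.

```javascript
// Hypothetical index rows, already sorted by key as an index would be.
const emailIndex = [
  { key: "alice@example.com", id: "u::1" },
  { key: "bob@example.com", id: "u::7" },
  { key: "carol@example.com", id: "u::2" },
  { key: "dave@example.com", id: "u::5" },
  { key: "zoe@example.com", id: "u::3" },
];

// Simplified query: exact key, list of keys, or inclusive key range.
function queryIndex(rows, params) {
  if (params.key !== undefined) {
    return rows.filter((row) => row.key === params.key);
  }
  if (params.keys !== undefined) {
    return rows.filter((row) => params.keys.includes(row.key));
  }
  return rows.filter(
    (row) =>
      (params.startkey === undefined || row.key >= params.startkey) &&
      (params.endkey === undefined || row.key <= params.endkey)
  );
}

const rangeResult = queryIndex(emailIndex, { startkey: "b", endkey: "d" });
const singleResult = queryIndex(emailIndex, { key: "carol@example.com" });
const multiResult = queryIndex(emailIndex, {
  keys: ["alice@example.com", "zoe@example.com"],
});
console.log(rangeResult, singleResult, multiResult);
```

In the range query, "dave@example.com" is excluded because it sorts after the end key "d".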
How does it work?
Indexing and Querying
• Indexing work is distributed amongst nodes
• Large data set possible
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
[Diagram: COUCHBASE SERVER CLUSTER. APP SERVER 1 and APP SERVER 2, each with a Couchbase client library and cluster map, send a query to SERVER 1, SERVER 2 and SERVER 3. Each server holds ACTIVE docs and REPLICA docs. User-configured replica count = 1.]
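The "each node has its own index, queries combine the results" idea can be sketched as a scatter/gather. The per-node data, the server names and the helper functions are hypothetical, and the combined result is simply concatenated and re-sorted here, whereas a real engine would merge already-sorted streams.

```javascript
// Hypothetical per-node indexes: each node indexes only the documents it stores.
const nodeIndexes = {
  server1: [{ key: "Aventinus", value: 8.2 }, { key: "Pale Ale", value: 5.0 }],
  server2: [{ key: "Avenue Ale", value: 4.1 }],
  server3: [{ key: "Amber", value: 4.5 }, { key: "Stout", value: 6.0 }],
};

// Runs on one node: inclusive range scan over that node's sorted index.
function queryNode(rows, startkey, endkey) {
  return rows.filter((row) => row.key >= startkey && row.key <= endkey);
}

// Scatter the query to every node, then gather and order the partial results.
function queryCluster(indexes, startkey, endkey) {
  const partialResults = Object.values(indexes).map((rows) =>
    queryNode(rows, startkey, endkey)
  );
  return partialResults
    .flat()
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

const clusterResult = queryCluster(nodeIndexes, "A", "B");
console.log(clusterResult);
```

Each node only scans its own, smaller index, which is where the parallelism comes from.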
Couchbase Server 2.0: Views
• Views can cover a few different use cases
Primary index
Simple secondary indexes (the most common)
Complex secondary, tertiary and composite indexes
Aggregation functions (reduction)
• Example: count the number of "North American Ales"
Organizing related data
• Built using Map/Reduce
Map function creates a matrix from document fields
Reduce function summarizes (reduces) information
Distributed Index Build Phase
• Optimized for lookups, in-order access and aggregations
• All view reads come from disk (different performance profile)
• A view is built against every document on every node
This is why you should group views in a design document
• Automatically kept up to date
"Incremental" Map Reduce
Dynamic Range Queries with Optional Aggregation
• Efficiently fetch a row or group of related rows.
• Queries use cached values from B-tree inner nodes when possible
• Take advantage of in-order tree traversal with group_level queries
[Diagram: SERVER 1, SERVER 2 and SERVER 3, each holding active docs and replica docs, answer the query below.]
?startkey="J"&endkey="K"
{"rows":[{"key":"Juneau","value":null}]}
Append Only Index
• Disk activity is slow
• Updating disk blocks is very slow
• Appending new data to the end of the current file is fast
• Overhead of reverse reading is small
• Because existing blocks are not re-used, this can lead to fragmentation
Couchbase will compact the index automatically
[Diagram: Doc -> View Processor -> Disk. Changed documents are appended after the original data.]
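The trade-off listed above (fast appends, reads from the end, fragmentation, compaction) can be sketched with a flat in-memory log. This is a deliberate simplification: the `AppendOnlyIndex` class and its methods are hypothetical, and the real index is an append-only B-tree on disk rather than a flat array.

```javascript
class AppendOnlyIndex {
  constructor() {
    this.log = []; // stands in for the index file on disk
  }

  // Writes never modify existing entries; they only append.
  put(key, value) {
    this.log.push({ key, value });
  }

  // Reads scan from the end, so the most recent entry for a key wins.
  get(key) {
    for (let i = this.log.length - 1; i >= 0; i--) {
      if (this.log[i].key === key) return this.log[i].value;
    }
    return undefined;
  }

  // Share of the log occupied by stale (superseded) entries.
  fragmentation() {
    if (this.log.length === 0) return 0;
    const liveKeys = new Set(this.log.map((entry) => entry.key)).size;
    return 1 - liveKeys / this.log.length;
  }

  // Compaction rewrites the log keeping only the latest entry per key.
  compact() {
    const latest = new Map();
    for (const entry of this.log) latest.set(entry.key, entry.value);
    this.log = Array.from(latest, ([key, value]) => ({ key, value }));
  }
}

const index = new AppendOnlyIndex();
index.put("Aventinus", 8.2);
index.put("Avenue Ale", 4.1);
index.put("Aventinus", 8.5); // update: appended, the old entry becomes stale
console.log(index.get("Aventinus"), index.fragmentation());
index.compact();
console.log(index.log.length, index.fragmentation());
```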
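Adding a new Document
[Diagram: B-tree index in which each node stores a key range and its reduction value.]
Leaf keys: A B C D F G H I K L N O Q R
Lower nodes: A-C 3, D-F 2, G-H 2, I-L 3, N-R 4
Inner nodes: A-H 7, I-R 7
Root: A-R 14
New key: M
New reductions: M-R 5, I-R 8
New root: A-R 15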
What about Reduce?
• Out-of-the-box functions:
_count()
_sum()
_stats()
• Create your own if needed:
function(key, values, rereduce) {
  if (rereduce) {
    var result = 0;
    for (var i = 0; i < values.length; i++) {
      result += values[i];
    }
    return result;
  } else {
    return values.length;
  }
}
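The `rereduce` flag exists because the engine may reduce the mapped values in chunks and then reduce the partial results again. The sketch below runs the custom function above by hand in those two phases; the name `countReduce`, the chunk sizes and the input array are made up for illustration.

```javascript
// Same logic as the custom reduce function above, given a name so it can be called.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier reduce calls: add them up
    var result = 0;
    for (var i = 0; i < values.length; i++) {
      result += values[i];
    }
    return result;
  } else {
    // values come straight from the map function: count them
    return values.length;
  }
}

// Seven rows emitted by a map function, reduced in two chunks.
const mappedValues = [1, 1, 1, 1, 1, 1, 1];
const partialCounts = [
  countReduce(null, mappedValues.slice(0, 3), false),
  countReduce(null, mappedValues.slice(3), false),
];

// Second phase: combine the partial counts.
const totalCount = countReduce(null, partialCounts, true);
console.log(partialCounts, totalCount);
```

Without the `rereduce` branch, the second phase would return the number of partial results (2) instead of the number of rows (7).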
Reduce Function
• Key and arrays of values as parameters
• Written in JavaScript
• Called after the map function
• Used to reduce the result of a map to single values
• Used with grouping
• Can be ignored when querying, to reuse the index
Reduce in Action
• Map() result
Key Value
Belgian-Style Dubbel 1
Belgian-Style Dubbel 1
Belgian-Style Dubbel 1
Belgian-Style Pale Ale 1
Belgian-Style White 1
Belgian-Style White 1
... ...
• Reduce(): _count()
• Result
Key Value
Belgian-Style Dubbel 3
Belgian-Style Pale Ale 1
Belgian-Style White 2
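The table above can be reproduced with a small grouped reduce. This is a sketch: `reduceByGroup` and `countValues` are hypothetical helpers that imitate a count applied per distinct key, not the engine's actual implementation.

```javascript
// Rows as produced by the map function in the table above.
const mapRows = [
  { key: "Belgian-Style Dubbel", value: 1 },
  { key: "Belgian-Style Dubbel", value: 1 },
  { key: "Belgian-Style Dubbel", value: 1 },
  { key: "Belgian-Style Pale Ale", value: 1 },
  { key: "Belgian-Style White", value: 1 },
  { key: "Belgian-Style White", value: 1 },
];

// Counting reduce: number of values on first pass, sum of partial counts on rereduce.
function countValues(key, values, rereduce) {
  return rereduce ? values.reduce((sum, n) => sum + n, 0) : values.length;
}

// Hypothetical grouping step: collect the values of each distinct key,
// then apply the reduce function once per group.
function reduceByGroup(rows, reduceFn) {
  const groups = new Map();
  for (const { key, value } of rows) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return Array.from(groups, ([key, values]) => ({
    key,
    value: reduceFn(key, values, false),
  }));
}

const groupedCounts = reduceByGroup(mapRows, countValues);
console.log(groupedCounts);
```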
How to use it?
• Use client SDK to call the view:
View view = client.getView("beer", "by_name");
Query query = new Query();
query.setIncludeDocs(true)
     .setLimit(20)
     .setRangeStart(ComplexKey.of(startKey))
     .setRangeEnd(ComplexKey.of(startKey + "\uefff"));
ViewResponse result = client.query(view, query);
for (ViewRow row : result) {
  ....
}
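The same range query can also be expressed as a URL against the Query API port (8092) shown in the architecture slide. The sketch below only builds the URL; the host, the bucket name `beer-sample` and the `buildViewUrl` helper are assumptions, while the design document `beer` and view `by_name` are taken from the SDK call above.

```javascript
// Builds a view query URL; parameter values are JSON-encoded, then URL-encoded.
function buildViewUrl({ host, bucket, designDoc, view, params }) {
  const query = Object.entries(params)
    .map(([name, value]) => `${name}=${encodeURIComponent(JSON.stringify(value))}`)
    .join("&");
  return `http://${host}:8092/${bucket}/_design/${designDoc}/_view/${view}?${query}`;
}

const startKey = "A";
const viewUrl = buildViewUrl({
  host: "localhost",      // assumption: local single-node cluster
  bucket: "beer-sample",  // assumption: name of the bucket holding the beers
  designDoc: "beer",
  view: "by_name",
  params: {
    startkey: startKey,
    endkey: startKey + "\uefff", // same "everything starting with startKey" trick as above
    limit: 20,
  },
});
console.log(viewUrl);
```

Appending a high code point such as "\uefff" to the start key is a common way to turn a range query into a prefix query.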
Demonstration
Hadoop ≠ Couchbase
Hadoop
• Deals with "Big Data"
• "More" is better than "Faster"
• Batch oriented
• Usually used to "extract/transform" data
• Fully distributed: Map, Shuffle, Reduce
Couchbase
• Distributed
• Executed where the document is
• Deals with "indexing" data
• As fast as possible
• Used to query the data in the database
Map Reduce in Couchbase
• As in many other NoSQL databases: used for queries!
• Indexes are distributed on each node of the cluster
• Indexes are updated incrementally
• Write your Map Reduce in JavaScript
Thank you
[email protected]
@tgrall
Get Couchbase Server at http://www.couchbase.com/download