How-To NoSQL 3.0 Webinar Series: Couchbase 104 - Views and Indexing
DESCRIPTION
In Couchbase 104 for 3.0, explore the power of creating views and indexes in Couchbase. Learn the underlying view architecture for how views and indexes are built in Couchbase.
TRANSCRIPT
Couchbase 104
Justin Michaels
@justindmichaels
Views and Indexes Overview
Indexes are “views” into Data
• A shortcut derived from, and pointing into, a greater volume of values, data, information or knowledge
Traditional Index Examples
• Table of Contents
• Card Catalog
Indexes and Views
©2014 Couchbase, Inc. 3
In Couchbase Map-Reduce is used to maintain Indexes
Map functions are applied to JSON documents and they output or "emit" data that is organized in an Index form
Each emit() call produces a row in the index
Couchbase Views - Map-Reduce Indexes
Map-Reduce is a technique designed for dealing with semi-structured data by parallel processing across a distributed system
Different than Hadoop Map/Reduce
• Map functions identify data with collections, process them, and output transformed values
• Reduce functions take the output of Map functions and perform numeric aggregate calculations on them
What is Map Reduce?
Map inputs:
• Document – Application data
• Metadata – Couchbase data
Map outputs:
• Document ID
• View Key: User configurable based on JSON fields
• View Value: Only needed when reducing, use ‘null’ otherwise
Produces Index:
• B-tree Structure
• Sorted Alphabetically
Map Functions
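The inputs and outputs listed above can be sketched as a toy index build. This is a minimal model, not Couchbase code: `emit`, `buildIndex`, `mapByUsername` and the sample documents are stand-ins invented for illustration, and plain JavaScript string comparison is used in place of the view engine's collation.

```javascript
// Local stand-ins for what the view engine provides (assumptions, not Couchbase code).
let indexRows = [];
let currentMeta = null;
function emit(key, value) {
  // Every emitted row carries the ID of the document that produced it.
  indexRows.push({ id: currentMeta.id, key: key, value: value });
}

// A map function written the way a view definition would contain it.
function mapByUsername(doc, meta) {
  if (doc.username) {
    emit(doc.username, null); // value is null because no reduce is used
  }
}

// Order rows by key, then by document ID (simplified collation).
function compareRows(a, b) {
  if (a.key !== b.key) return a.key < b.key ? -1 : 1;
  if (a.id !== b.id) return a.id < b.id ? -1 : 1;
  return 0;
}

// Run the map function over every document and return the sorted rows.
function buildIndex(docs, mapFn) {
  indexRows = [];
  for (const entry of docs) {
    currentMeta = entry.meta;
    mapFn(entry.doc, entry.meta);
  }
  return indexRows.sort(compareRows);
}

// Hypothetical sample data.
const sampleDocs = [
  { meta: { id: "u::1" }, doc: { username: "bob" } },
  { meta: { id: "u::35" }, doc: { username: "alice" } },
  { meta: { id: "p::7" }, doc: { type: "product" } }, // emits nothing
  { meta: { id: "u::20" }, doc: { username: "carol" } },
];
const index = buildIndex(sampleDocs, mapByUsername);
console.log(index);
```

Documents for which the map function never calls `emit` simply do not appear in the index.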
Built-in reduce functions (Optional)
• _count – provides a count of rows (per unique key when grouped)
• _sum – provides a sum total of values
• _stats – provides statistics (max, min, avg, etc.) of values
Operate on results emitted by map function
Results stored pre-computed for fast access
Custom reductions are possible
Reduce Functions
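As a sketch of what a custom reduction can look like, the functions below re-implement counting and summing by hand in the `function (key, values, rereduce)` shape that view reduce functions take. The sample values and the way the input is split into two partial passes are assumptions for illustration; the `key` argument is unused here and passed as `null`.

```javascript
// Hand-written count: on the first pass it counts values; on a rereduce
// pass the values are partial counts, so they are added together.
function countReduce(key, values, rereduce) {
  if (rereduce) {
    return values.reduce((total, partial) => total + partial, 0);
  }
  return values.length;
}

// Hand-written sum: partial sums are summed the same way as raw values,
// so the rereduce flag does not change the logic.
function sumReduce(key, values, rereduce) {
  return values.reduce((total, v) => total + v, 0);
}

// Hypothetical emitted values, reduced in two chunks and then combined.
const emitted = [1000, 1200, 900];
const chunkA = emitted.slice(0, 2);
const chunkB = emitted.slice(2);

const totalCount = countReduce(
  null,
  [countReduce(null, chunkA, false), countReduce(null, chunkB, false)],
  true
);
const totalSum = sumReduce(
  null,
  [sumReduce(null, chunkA, false), sumReduce(null, chunkB, false)],
  true
);
console.log(totalCount, totalSum);
```

Where a built-in function covers the need, it is the simpler choice; a hand-written reduce has to handle the rereduce pass correctly itself.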
Architecture
Architecture - Couchbase View Engine
[Diagram: an App Server sends Doc 1 to a Couchbase Server Node. Labeled components: Managed Cache, Disk Queue, Disk, Replication Queue (to other node), View engine.]
[Diagram: Couchbase Server cluster of three nodes (Server 1, Server 2, Server 3), each holding ACTIVE and REPLICA documents, with user configured replica count = 1. App Server 1 and App Server 2 each use the Couchbase client library with a cluster map; a Query is sent to the cluster.]
• Indexing is distributed across nodes
• Parallelize the effort
• Each node has index for data stored on it
• Queries combine the results from required nodes
Architecture - Couchbase View Engine
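The idea that a query combines per-node results can be illustrated with a small merge over already-sorted row lists. This is a toy sketch under the assumption that each node returns its rows sorted by key; `mergeSortedRows` and the sample rows are invented names and data, not part of any Couchbase API.

```javascript
// Merge several key-sorted row lists (one per node) into one sorted list.
function mergeSortedRows(perNodeRows) {
  const cursors = perNodeRows.map(() => 0); // next unread row on each node
  const merged = [];
  for (;;) {
    let best = -1;
    for (let n = 0; n < perNodeRows.length; n++) {
      if (cursors[n] >= perNodeRows[n].length) continue; // node exhausted
      if (
        best === -1 ||
        perNodeRows[n][cursors[n]].key < perNodeRows[best][cursors[best]].key
      ) {
        best = n;
      }
    }
    if (best === -1) return merged; // every node exhausted
    merged.push(perNodeRows[best][cursors[best]]);
    cursors[best]++;
  }
}

// Hypothetical partial indexes, one per server.
const server1 = [{ id: "u::35", key: "alice" }, { id: "u::9", key: "dave" }];
const server2 = [{ id: "u::1", key: "bob" }];
const server3 = [{ id: "u::20", key: "carol" }, { id: "u::4", key: "erin" }];

const combined = mergeSortedRows([server1, server2, server3]);
console.log(combined.map((row) => row.key).join(","));
```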
Buckets have one or more DESIGN DOCUMENTS
• Distributed across cluster when created
DESIGN DOCUMENTS contain one or more VIEW definitions
• Design Documents are processed in parallel
• All the views in a single design document are processed sequentially
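A design document is a JSON object whose `views` member maps view names to a `map` function (stored as a string) and an optional `reduce`. The sketch below builds one as a plain JavaScript object; the view names and map bodies are hypothetical examples, and the step of uploading the JSON to the cluster is left out.

```javascript
// Hypothetical design document holding two view definitions.
const designDoc = {
  views: {
    by_username: {
      map: "function (doc, meta) { if (doc.username) { emit(doc.username, null); } }",
    },
    points_by_username: {
      map: "function (doc, meta) { if (doc.username) { emit(doc.username, doc.points); } }",
      reduce: "_sum", // built-in reduce, referenced by name
    },
  },
};

// Serialized form, as it would be sent to the server.
const designDocJson = JSON.stringify(designDoc, null, 2);
console.log(designDocJson);
```

Because both views live in the same design document, they are updated together, which is why grouping views by update frequency matters.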
Architecture – Design Document
[Diagram: BUCKET A contains Design document 1 (View 1, View 2, View 3), Design document 2 (View 4, View 5) and Design document 3 (View 6, View 7); BUCKET B is shown alongside.]
Architecture – Couchbase Map Reduce
Individual document operations are atomic
Views are eventually consistent in relation to documents
Incremental Map-Reduce
• Spread load across nodes
• Each node indexes its data
Map: process, filter, map and emit a row
Reduce: aggregate mapped data
• Default: _count, _sum, _stats
Architecture - Index Building Details
Views are maintained directly from managed cache
• The entire view is recreated if the view definition has changed
• All the views within a design document are incrementally updated
Views are updated automatically according to:
• Update Interval (time period); default 5000 milliseconds
OR (as of 3.x)
• Update Documents (number of changes); default 5000 changes
Update Controlled by:
• Configured globally, or via REST for an individual design document
• Manual updates provide application control
stale = UPDATE_AFTER (default if nothing is specified)
• fast response
• can take two operations to read your own writes
stale = OK (most likely to be used)
• auto update only
• might not see your own writes
• least frequent updates -> least resource impact -> highest performance
stale = FALSE (only when TRULY required)
• use with persistTo on the set when a write must be reflected in the view update
• BUT be aware of the delay it adds to set and query operations
Architecture - Index Building Details
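The three stale settings can be contrasted with a toy model. This is a deliberately simplified sketch, not how the view engine is implemented: `ToyView` is an invented class, the "index" is just a sorted list of document IDs, and persistence to disk is ignored.

```javascript
// Toy model: documents are written immediately, the index only when updated.
class ToyView {
  constructor() {
    this.data = new Map(); // documents written so far
    this.index = [];       // document IDs currently visible to queries
  }
  set(id, doc) {
    this.data.set(id, doc);
  }
  updateIndex() {
    this.index = [...this.data.keys()].sort();
  }
  query(stale) {
    if (stale === "false") this.updateIndex();        // update first, then answer
    const result = this.index.slice();                // answer from the index as it stands
    if (stale === "update_after") this.updateIndex(); // answer first, update afterwards
    return result;                                    // "ok" never triggers an update here
  }
}

const view = new ToyView();
view.set("u::1", { points: 1000 });
const first = view.query("update_after");  // write not visible yet
const second = view.query("update_after"); // visible on the second query
view.set("u::2", { points: 1200 });
const okResult = view.query("ok");         // still the older index
const fresh = view.query("false");         // index updated before answering
console.log(first, second, okResult, fresh);
```

The first two queries show why update_after can take two operations to read your own writes, and the last two show the trade-off between stale = OK and stale = FALSE.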
In addition to data replicas, optionally create replica for indexes
• Build an index using the data in replica vBuckets
Enabled per bucket (Bucket Config) or per design document (REST API)
• Each node must maintain index for active and replica data
• Implies additional CPU and I/O overhead
Failover and Failures
• Without replica indexes complete view is rebuilt
• Replica indexes enabled if present and queries remain consistent
Architecture - Index Building Details (Replicas)
Architecture - Disk Structure
Each design document creates its own set of index files
Index data is always read from disk
• File format allows for successful I/O caching by operating system
Separate disk devices for view versus data files
• Both are append-only
• Both are compacted in parallel
• Better use of IO and caching
• Possible to use SSDs for improved performance on one or other (or both)
Development vs Production Views
Development Views
• Can be edited
• Can be tested on full/partial dataset
• Not automatically maintained
Production Views
• Always operate on full document set
• Cannot be modified
• Automatically updated
Development Views are ‘published’ to Production
Simple creation of the view definition NOT a move to new cluster
[Diagram: Development View (Create, Edit/Refine; Sample Index over a Subset of Bucket Content) → Execute Development View on Entire Cluster (Full Data) → Promote to Production → Production View (Full Index over Full Data).]
Writing Views
Map() Function => Index
function(doc, meta) {                // inputs: JSON doc, doc metadata
  emit(doc.username, doc.email);     // creates a row: indexed key, output value(s)
}
Every Document passes through View Map() functions
View Anatomy
Single Element Keys (Text Key)
function(doc, meta) {
  emit(doc.email, doc.points);       // text key
}
meta.id   doc.email         doc.points
u::1      [email protected]   1000
u::35     [email protected]   1200
u::20     [email protected]   900
View Anatomy
Compound Keys (Array)
function(doc, meta) {
  emit(dateToArray(doc.timestamp), 1);   // array key
}
Array-based index keys get sorted as strings, but can be grouped by array elements
meta.id   dateToArray(doc.timestamp)   value
u::20     [2012,10,9,18,45]            1
u::1      [2012,9,26,11,15]            1
u::35     [2012,8,13,2,12]             1
View Anatomy
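Grouping by array elements can be sketched outside the server as follows. Everything here is an assumption for illustration: `dateToArrayLocal` is a local stand-in for the built-in dateToArray() helper (it returns five elements to match the table above), `countByGroupLevel` mimics summing the emitted 1s per key prefix, the timestamps are assumed to be ISO 8601 strings in UTC, and document `u::7` is an extra hypothetical row.

```javascript
// Local stand-in for the view engine's dateToArray() helper (assumption).
function dateToArrayLocal(isoString) {
  const d = new Date(isoString);
  return [
    d.getUTCFullYear(),
    d.getUTCMonth() + 1, // JavaScript months are zero-based
    d.getUTCDate(),
    d.getUTCHours(),
    d.getUTCMinutes(),
  ];
}

// Sum row values per key prefix of the given length (the grouping level).
function countByGroupLevel(rows, level) {
  const totals = new Map();
  for (const row of rows) {
    const groupKey = JSON.stringify(row.key.slice(0, level));
    totals.set(groupKey, (totals.get(groupKey) || 0) + row.value);
  }
  return [...totals].map(([key, value]) => ({ key: JSON.parse(key), value }));
}

// Rows as the map function above would emit them.
const timestamps = {
  "u::20": "2012-10-09T18:45:00Z",
  "u::1": "2012-09-26T11:15:00Z",
  "u::35": "2012-08-13T02:12:00Z",
  "u::7": "2012-10-02T08:30:00Z", // hypothetical extra document
};
const dateRows = Object.entries(timestamps).map(([id, ts]) => ({
  id,
  key: dateToArrayLocal(ts),
  value: 1,
}));

const perYear = countByGroupLevel(dateRows, 1);  // grouped by [year]
const perMonth = countByGroupLevel(dateRows, 2); // grouped by [year, month]
console.log(perYear, perMonth);
```

Choosing the array order (year, then month, then day) is what makes coarser or finer grouping possible from the same index.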
key = “” (exact match)
keys = [ ] (set of keys match)
startkey/endkey = “” (range queries on view key)
startkey_docid/endkey_docid = “” (range queries on meta.id)
stale (false, update_after, ok)
group/group_level (aggregate with grouping)
View Anatomy - Parameters
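As a rough sketch of how these parameters end up in a view query over REST, the helper below builds a query URL. The host, bucket, design document and view names are hypothetical, and `buildViewUrl` is an invented helper; the one detail it is meant to show is that key-like parameters carry JSON-encoded values, while the others are plain tokens.

```javascript
// Parameters whose values are JSON (strings need quotes, arrays need brackets).
const JSON_VALUED = ["key", "keys", "startkey", "endkey"];

// Build a view query URL from a parameter object (illustrative helper).
function buildViewUrl(base, bucket, designDoc, view, params) {
  const query = new URLSearchParams();
  for (const [name, value] of Object.entries(params)) {
    query.set(name, JSON_VALUED.includes(name) ? JSON.stringify(value) : String(value));
  }
  return `${base}/${bucket}/_design/${designDoc}/_view/${view}?${query.toString()}`;
}

// Hypothetical range query over compound date keys.
const viewUrl = buildViewUrl("http://localhost:8092", "default", "users", "by_date", {
  startkey: [2012, 8],
  endkey: [2012, 10],
  stale: "false",
});
console.log(viewUrl);
```

An exact-match query on a text key would pass, for example, `{ key: "alice" }`, which is sent as the JSON string `"alice"` including its quotes.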
View Anatomy - Collation
Unicode Collation
1234567890 < aAbBcCdDeEfFgGhHiIjJkKlLmM...
a < á < A < Á < b
Byte Order
1234567890 < A-Z < a-z
View Anatomy - Sample Document
[Screenshot: sample document, with its Document ID labeled]
View Anatomy - Sample Index
[Screenshot: sample index, with Key and Value columns labeled]
View Anatomy - Examples
View Anatomy - Querying
• Simple View Access
• Exact Match
• Range
• With Reduction
• With Grouping
Best Practices
View size is determined by key and value contents
• Emit as little as possible … not full document
• Only use values when required by a reduce function
• Only emit either null or the secondary key (doc ID included with each row)
View distribution:
• More views per designdoc require more time to update all views in group
• Single views per designdoc may require more CPU
• Group views in designdocs by update frequency, rather than subject/topic
View Best Practices
Queries should have consistent response times
• Indexes are pre-materialized
• Expect to use stale=ok
File system cache availability for the index has a big impact on performance
• Indexes are disk based
• Reduce cluster quota to give more system cache
In-house performance results show that doubling system cache availability:
• query latency reduces by half
• throughput increases by 50%
View Best Practices
View Best Practices
Avoid computing too many things in a single View
Select (filter) data to avoid unnecessary entries in the View
• Use document types to make Views more selective
Project (map) only necessary data and emit it as value
• When possible emit a null value and perform additional Get to retrieve the whole document
Use the built in reduce functions if possible
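The selection and projection advice above can be sketched with a map function that filters on a document type and emits a null value, followed by a lookup of the full documents by ID. The `type` field, the sample bucket contents and the local `emit` stand-in are assumptions for illustration, not Couchbase code.

```javascript
// Local stand-ins for the view engine (assumptions).
let emittedRows = [];
let currentDocId = null;
function emit(key, value) {
  emittedRows.push({ id: currentDocId, key: key, value: value });
}

// Selective map: only "user" documents with a username produce a row,
// and the value is null so the index stays small.
function mapUsersOnly(doc, meta) {
  if (doc.type === "user" && doc.username) {
    emit(doc.username, null);
  }
}

// Hypothetical bucket contents keyed by document ID.
const bucketDocs = new Map([
  ["u::1", { type: "user", username: "bob", points: 1000 }],
  ["o::17", { type: "order", username: "bob", total: 42 }],
  ["u::35", { type: "user", username: "alice", points: 1200 }],
]);

for (const [id, doc] of bucketDocs) {
  currentDocId = id;
  mapUsersOnly(doc, { id: id });
}

// Additional get by document ID to retrieve the whole document for each row.
const fullDocs = emittedRows.map((row) => bucketDocs.get(row.id));
console.log(emittedRows, fullDocs);
```

The order document shares a username with a user document but never enters the index, which keeps both the index size and the update work down.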
Couchbase Query Language
Querying with N1QL (“Nickel”)
JSON can model our complex world
N1QL can query that world
N1QL Developer Preview and Tutorial
http://docs.couchbase.com/developer/n1ql-dp3/n1ql-intro.html
http://query.pub.couchbase.com/tutorial/#1
Thank You!
Next Session:
Couchbase 105 | December 3, 2014 | 10am Pacific
Cross Data Center Replication (aka XDCR)