Building a relevance platform with Couchbase and Elasticsearch
DESCRIPTION
These slides are from my GOTO Amsterdam presentation. In the talk I went into detail about how we are building a high-performance relevance platform at Hippo with Couchbase and Elasticsearch. I also covered why we chose Couchbase for storage and how Elasticsearch can be used for search and analytics, and I showed how we integrated both products into our Hippo CMS product and use them together.
TRANSCRIPT
OneHippo @ Goto
follow the Hippo trail
Building a relevance platform with Couchbase and Elasticsearch
@jreijn | Hippo
#gotoams, June 18
About me
• Architect @ Hippo
• DevOps guy
• Blogger @ http://blog.jeroenreijn.com
About Hippo
Relevance?
“The capability of a search engine or function to
retrieve data appropriate to a user's needs.”
http://www.thefreedictionary.com/relevance
How we deliver relevant content
@Hippo
Registration
Visitor - entity making HTTP requests
Collector - records data about a visitor or his behavior
Example: location collector (GeoIPCollector)
Targeting Data - all data about a specific visitor
Example: IP address is located in Amsterdam
Matching
Characteristic - a type of fact about visitors
Example: "comes from a city", "experiences a type of weather"
Target Group - the specification of a Characteristic
Example: "comes from a European city", "comes from Amsterdam"
Persona - one or more target groups that describe a certain type of visitor
Example: "Jim, the European urban consumer",
"Alice, the Pet owner"
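The terms defined above (targeting data, target group, persona) can be made concrete with a small sketch. It is an illustration only, not Hippo's implementation: the class names, the `matches` method, the list of countries and the scoring rule (the share of a persona's target groups that a visitor matches) are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Targeting data: everything the collectors recorded about one visitor,
# keyed by collector id (for example "geo").
TargetingData = Dict[str, dict]


@dataclass
class TargetGroup:
    """A characteristic (e.g. "comes from a city") narrowed to concrete values."""
    name: str
    collector_id: str
    predicate: Callable[[dict], bool]

    def matches(self, data: TargetingData) -> bool:
        collected = data.get(self.collector_id)
        return collected is not None and self.predicate(collected)


@dataclass
class Persona:
    """One or more target groups that describe a type of visitor."""
    name: str
    target_groups: List[TargetGroup] = field(default_factory=list)

    def score(self, data: TargetingData) -> float:
        # Hypothetical rule: share of target groups the visitor matches.
        if not self.target_groups:
            return 0.0
        hits = sum(1 for group in self.target_groups if group.matches(data))
        return hits / len(self.target_groups)


# Illustrative target groups; the country list is made up for the example.
from_amsterdam = TargetGroup(
    "comes from Amsterdam", "geo",
    lambda geo: geo.get("city") == "Amsterdam")
european = TargetGroup(
    "comes from a European city", "geo",
    lambda geo: geo.get("country") in {"Netherlands", "Germany", "France"})

jim = Persona("Jim, the European urban consumer", [from_amsterdam, european])

visitor = {"geo": {"city": "Amsterdam", "country": "Netherlands"}}
print(jim.name, jim.score(visitor))
```

A visitor from another European city would match only one of the two target groups and so receive a lower score under this made-up rule.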
What do we store?
Request log
Targeting data
Statistics
Averages, e.g. how many visitors became which persona
Real-time analysis
Architecture
RDBMS
Hippo Delivery Tier
Hippo Repository
App server
XML, JSON, (X)HTML
Delivery Tier
Request -> URL Matching -> Fetch content -> Compose output -> Response
Delivery Tier
Request -> URL Matching -> Targeting Data Collection -> Scoring -> Fetch content -> Compose output -> Response
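The request flow on this slide can be read as a chain of stages that each enrich a shared request context. The sketch below is hypothetical: the stage order is inferred from the slide labels, and the function names, the `/site` context path handling and the single "urban" persona rule are assumptions for illustration, not the Hippo delivery tier API.

```python
from urllib.parse import urlparse


def url_matching(ctx):
    # Strip the (assumed) "/site" context path to get the path info.
    path = urlparse(ctx["url"]).path
    ctx["path_info"] = path[len("/site"):] if path.startswith("/site") else path
    return ctx


def collect_targeting_data(ctx):
    # Stand-in for collectors such as a GeoIP lookup.
    ctx["targeting_data"] = {"geo": {"city": ctx.get("city", "")}}
    return ctx


def score(ctx):
    # Toy scoring: one hard-coded persona rule.
    city = ctx["targeting_data"]["geo"]["city"]
    ctx["persona"] = "urban" if city == "Amsterdam" else None
    return ctx


def fetch_content(ctx):
    # Pick a content variant based on the persona, if any.
    variant = ctx["persona"] or "default"
    ctx["content"] = f"{ctx['path_info']} ({variant} variant)"
    return ctx


def compose_output(ctx):
    ctx["response"] = f"<html>{ctx['content']}</html>"
    return ctx


PIPELINE = [url_matching, collect_targeting_data, score,
            fetch_content, compose_output]


def handle(request):
    """Run every stage in order and return the composed response."""
    ctx = dict(request)
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx["response"]


print(handle({"url": "http://localhost:8080/site/news", "city": "Amsterdam"}))
```

Keeping the stages as plain functions in a list makes it easy to see what the targeting steps add to the original three-stage flow.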
Scaling
Scaling out
RDBMS
App server 1: Hippo Delivery Tier, Hippo Repository
App server 2: Hippo Delivery Tier, Hippo Repository
Scaling out
RDBMS
App server 1: Delivery Tier, Repository
App server 2: Delivery Tier, Repository
Targeting Datastore
What kind of ‘storage’?
Distributed Cache?
We have a winner!
Requirements change!
NoSQL to the rescue
Suitable types
• Key-value store
• Document database
Assessment Criteria
• Maturity
• Data model
• Consistency model
• Performance
• Replication
• Caching model
• Query model
• Monitoring
• Scalability
• Reliability
• Support
Selection Criteria
• Performance!
• Scalability
• Schema flexibility
• Simplicity
• Monitoring
• Support
Performance!!
Scalability
Schema flexibility
{
  "visitorId": "7a1c7e75-8539-40",
  "pageUrl": "http://localhost:8080/site/news",
  "pathInfo": "/news",
  "remoteAddr": "127.0.0.1",
  "referer": "http://localhost:8080/site/",
  "timestamp": 1371419505909,
  "collectorData": {
    "geo": { "country": "", "city": "", "latitude": 0, "longitude": 0 },
    "returningvisitor": false,
    "channel": "English Website"
  },
  "personaIdScores": [],
  "globalPersonaIdScores": []
}
Request log document
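Because the request log is plain JSON, any client can read it without a schema. The sketch below parses a trimmed copy of the document above and pulls out a few fields. The `summarize` helper and its output keys are hypothetical, and treating `timestamp` as milliseconds since the Unix epoch is an assumption based on the size of the value.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the request log document shown on the slide.
RAW = """
{
  "visitorId": "7a1c7e75-8539-40",
  "pathInfo": "/news",
  "timestamp": 1371419505909,
  "collectorData": {
    "geo": {"country": "", "city": "", "latitude": 0, "longitude": 0},
    "returningvisitor": false,
    "channel": "English Website"
  },
  "personaIdScores": []
}
"""


def summarize(doc):
    """Reduce one request log document to a few fields of interest."""
    # Assumption: the timestamp is in milliseconds since the Unix epoch.
    when = datetime.fromtimestamp(doc["timestamp"] // 1000, tz=timezone.utc)
    collected = doc.get("collectorData", {})
    geo = collected.get("geo", {})
    return {
        "visitor": doc["visitorId"],
        "path": doc["pathInfo"],
        "when": when.isoformat(),
        "channel": collected.get("channel"),
        # Empty strings mean the geo collector found no location.
        "located": bool(geo.get("city") or geo.get("country")),
        "scored": bool(doc.get("personaIdScores")),
    }


summary = summarize(json.loads(RAW))
print(summary)
```

Using `.get` with defaults for the collector data reflects the schema flexibility point: a document written before a collector existed simply lacks that key.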
{
  "geo": {
    "collectorId": "geo",
    "city": "",
    "country": "",
    "latitude": 0,
    "longitude": 0
  },
  "channel": {
    "collectorId": "channel",
    "channels": [ "English Website" ],
    "lastVisitedChannel": "English Website"
  }
}
Visitor document
Simplicity
Monitoring
Support
Couchbase
Why Couchbase?
• Drop-in replacement for memcached
• Read/Write-through cache
• High throughput
• Easy scalability
• Schema flexibility
• Low latency
Couchbase
• Open Source
• Document-oriented
• Easily scalable
• Consistently high performance
Performance
• Object managed cache
• Write Queue to disk
• Avoids Cold Cache
Easily scalable
• Auto sharding
• Cross datacenter replication (XDCR)
• Master-master replication
Flexible data model
• Native JSON support
• Incremental Map Reduce
• Gives power to the developer
How we run Couchbase @Hippo
Load Balancer
Database cluster
Hippo Delivery Tier
Couchbase cluster
• Request log data
• Targeting data
• Statistics data
Query capabilities
• Querying via views
• Secondary indexes via views
• Views based on MapReduce
• Lacks some advanced query capabilities
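To show what a MapReduce-based view does, here is a pure-Python imitation of the idea. Real Couchbase views are defined as JavaScript map and reduce functions that run on the server; the names `map_doc`, `reduce_count` and `run_view` below are hypothetical, and the example (counting request log entries per channel) is an assumption chosen to match the documents shown earlier.

```python
from collections import defaultdict


def map_doc(doc):
    """Yield (key, value) pairs, the role emit() plays in a view's map function."""
    channel = doc.get("collectorData", {}).get("channel")
    if channel:
        yield channel, 1


def reduce_count(values):
    """Combine all values emitted for one key."""
    return sum(values)


def run_view(docs):
    """Group the mapped pairs by key and reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    # Sort by key so the result has a predictable order.
    return {key: reduce_count(values) for key, values in sorted(grouped.items())}


request_log = [
    {"collectorData": {"channel": "English Website"}},
    {"collectorData": {"channel": "Dutch Website"}},
    {"collectorData": {"channel": "English Website"}},
    {"collectorData": {}},  # no channel recorded, emits nothing
]

print(run_view(request_log))
```

A view like this answers one fixed question well. Ad hoc questions, such as free-text or distance-based filters, need a new view each time, which is the gap the last bullet refers to.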
Elasticsearch
• Apache Lucene
• Designed to be distributed
• Schema free
• Apache 2 licensed
• RESTful API
Added value of ES
• Full-text search
• Faceted search
• Geospatial search
• All in (near) real-time
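The three search types can be combined in a single request. The sketch below only builds the JSON request body and does not contact a server. It uses the current Elasticsearch query DSL (`bool`, `match`, `geo_distance`, and a `terms` aggregation, which replaced the older facets API). The field names are assumptions: `location` is a hypothetical field mapped as `geo_point`, and `collectorData.channel` is assumed to be indexed as a keyword so that it can be aggregated.

```python
import json


def build_query(text, lat, lon, distance_km):
    """Build a search body: full-text match, geo filter, per-channel counts."""
    return {
        "query": {
            "bool": {
                # Full-text search on the requested path.
                "must": [{"match": {"pathInfo": text}}],
                # Geospatial filter; "location" is a hypothetical geo_point field.
                "filter": [{
                    "geo_distance": {
                        "distance": f"{distance_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }
                }],
            }
        },
        # Faceted search, expressed as a terms aggregation.
        "aggs": {
            "by_channel": {"terms": {"field": "collectorData.channel"}}
        },
        "size": 10,
    }


body = build_query("news", 52.37, 4.89, 50)
print(json.dumps(body, indent=2))
```

The resulting body could be sent to an index's `_search` endpoint over the RESTful API mentioned above.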
Replicating to ES
Hippo Delivery Tier (Java API): Write, Read
Couchbase Server Cluster -> XDCR -> Couchbase ES Transport plugin -> Elasticsearch Server Cluster
Demo time!
What’s Next?
Advanced analytics
Thank you!
Questions?
@jreijn
ps. We’re hiring!