No SQL at The Guardian
DESCRIPTION
Presentation given at the No:SQL EU conference describing architectures past, present & future for guardian.co.uk
TRANSCRIPT
![Page 1: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/1.jpg)
NoSQL at guardian.co.uk
Matthew Wall
Simon Willison
![Page 2: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/2.jpg)
![Page 3: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/3.jpg)
!
![Page 4: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/4.jpg)
SQL
![Page 5: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/5.jpg)
![Page 6: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/6.jpg)
![Page 7: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/7.jpg)
![Page 8: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/8.jpg)
Not only SQL
![Page 9: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/9.jpg)
Guardian journalism online: 1995
![Page 10: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/10.jpg)
Guardian journalism online: 1999
![Page 11: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/11.jpg)
Guardian journalism online: 2000
![Page 12: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/12.jpg)
Guardian journalism online: 2010
![Page 13: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/13.jpg)
Read all about it!
![Page 14: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/14.jpg)
I bring you NEWS!!!
App server App server App server
Web server Web server Web server
CMS Data feeds
Oracle
Memcached (20Gb)
![Page 15: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/15.jpg)
I bring you NEWS!!!
App server App server App server
Web server Web server Web server
CMS Data feeds
Oracle
Memcached
Why RDBMS?
5 years ago, fewer alternatives
Understand operations procedures
Can easily recruit DBAs / devs
Developer/ops tools
Business critical system: a safe choice
![Page 16: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/16.jpg)
![Page 17: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/17.jpg)
![Page 18: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/18.jpg)
![Page 19: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/19.jpg)
![Page 20: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/20.jpg)
Related content from search engine
![Page 21: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/21.jpg)
Introduction of memcached
Related content from search engine
![Page 22: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/22.jpg)
Introduction of memcached
Big traffic spike
Related content from search engine
![Page 23: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/23.jpg)
Distributed memcached
Protects database from peak load
Entities explicitly decached
Queries given TTL
memcached = database supercharger
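The two caching rules on this slide can be sketched as follows. This is a minimal illustration, not the Guardian's code: `TTLCache` is an in-process stand-in for a memcached client, and the `database` dict, key names and function names are all hypothetical.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcached client (illustration only)."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry timestamp or None)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._store[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl is not None else None
        self._store[key] = (value, expires_at)

    def delete(self, key):
        self._store.pop(key, None)


cache = TTLCache()
database = {"article:1": {"headline": "Old headline"}}  # hypothetical store
stats = {"db_reads": 0}  # counts how often the "database" is actually hit


def load_article(article_id):
    """Entity read: cached without expiry, removed explicitly on update."""
    key = f"article:{article_id}"
    article = cache.get(key)
    if article is None:
        stats["db_reads"] += 1
        article = dict(database[key])
        cache.set(key, article)
    return article


def save_article(article_id, fields):
    """Entity write: update the database, then decache the stale copy."""
    key = f"article:{article_id}"
    database[key].update(fields)
    cache.delete(key)


def latest_headlines(ttl=60):
    """Query result: cached with a TTL rather than tracked for invalidation."""
    key = "query:latest_headlines"
    result = cache.get(key)
    if result is None:
        result = sorted(a["headline"] for a in database.values())
        cache.set(key, result, ttl=ttl)
    return result
```

Entities have a known key, so they can be decached the moment they change. Query results depend on many entities, so in this sketch they are simply allowed to go stale until their TTL runs out.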
![Page 24: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/24.jpg)
Now we have a stable “broadcast” platform
We know how to scale it
SQL running effectively at core
We’ve finished, right?
![Page 25: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/25.jpg)
Digital journalism is changing
We can’t cover everything
We can’t compete with everyone
Need to be “part of the web” not just “on the web”
![Page 26: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/26.jpg)
Mutualise the news!
![Page 27: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/27.jpg)
Mutualised news!
Mutualisation of journalism
No longer only broadcasting content
User engagement & contribution: journalism, data, software
Data curation / linked data
Support engaged developers with data and APIs
![Page 28: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/28.jpg)
Be a part of the data fabric of the internet
![Page 29: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/29.jpg)
Platform strategy
Out: Release our data to the world via APIs
In: Rapidly build new functionality outside the core
Write: Ingest, store & present arbitrary data
![Page 30: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/30.jpg)
Data Out
Content API
![Page 31: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/31.jpg)
Content API
Delivered using Apache Solr
Document oriented search engine
Loose schema: records, fields, facets
Fields can be multi-value
Supports dynamic field generation
Can apply multiple facets in queries faster than RDBMS
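A faceted Solr request of the kind described above might be assembled like this. The parameters `q`, `fq`, `facet` and `facet.field` are standard Solr query parameters; the endpoint URL and the field names `section` and `tag` are assumptions made for illustration and are not taken from the Content API.

```python
from urllib.parse import urlencode

SOLR_SELECT_URL = "http://localhost:8983/solr/select"  # hypothetical endpoint


def build_faceted_query(text, filters, facet_fields, rows=10):
    """Build a Solr select URL with filter queries and facet counts."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    # Each filter query (fq) narrows the result set, much like a WHERE clause.
    for field, value in filters.items():
        params.append(("fq", f'{field}:"{value}"'))
    # Ask Solr to return value counts for each facet field.
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_faceted_query(
    "election",
    filters={"section": "politics"},   # hypothetical field and value
    facet_fields=["section", "tag"],   # hypothetical multi-value fields
)
print(url)
```

Several facet fields can be requested in one query by repeating `facet.field`, which is the "multiple facets" case the slide refers to.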
![Page 32: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/32.jpg)
![Page 33: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/33.jpg)
![Page 34: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/34.jpg)
![Page 35: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/35.jpg)
Is Solr a database?
![Page 36: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/36.jpg)
Can perform complex queries, including full text search
Can filter results with facets (WHERE clause)
ANYTHING can be a facet. Very powerful.
On our dataset most queries are of a similar cost
Scales very well horizontally
Handles millions of documents
![Page 37: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/37.jpg)
No transactions
Excellent for certain types of queries
Not truly general purpose
Schema design very important
Search index not really persistence
![Page 38: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/38.jpg)
Architecture diagram: web servers, app server, CMS, RDBMS, Memcached (20Gb), core, Solr (six instances), message queue (M/Q), cloud (EC2), API
![Page 39: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/39.jpg)
API
Currently powering iPad app
Site components
External applications
Editors tools
More to follow
![Page 40: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/40.jpg)
Data In
Application framework
![Page 41: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/41.jpg)
Application framework
Simple REST/HTTP framework allows lightweight development
Applications proxied for performance
Apps generally hosted in the cloud, hot deployment into production
No RDBMS provided for storage
Can develop on a news timeline
![Page 42: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/42.jpg)
Architecture diagram: web servers, app server, CMS, RDBMS, Memcached (20Gb), core, message queue (M/Q), proxy, apps (six instances) on external hosting (App Engine etc.)
![Page 43: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/43.jpg)
NoSQL for journalism
![Page 44: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/44.jpg)
Some useful characteristics
• Scale down as well as up
• Support rapid production-ready prototyping: turn projects around in hours or days
• Handle massive traffic spikes
![Page 45: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/45.jpg)
Desktop analysis
• Leaked BNP membership list
• Load postcodes to constituencies mapping into Redis
• Generate heatmaps by looking up all 12,000 postcodes
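The lookup step above could look roughly like this. A plain dict stands in for the Redis mapping here, and the postcodes and constituency names are placeholders invented for the example, not real data.

```python
from collections import Counter

# Placeholder mapping; in the talk this lived in Redis as postcode -> constituency.
POSTCODE_TO_CONSTITUENCY = {
    "AA11AA": "Constituency A",
    "BB22BB": "Constituency B",
}


def normalise(postcode):
    """Strip whitespace and upper-case so 'aa1 1aa' matches 'AA11AA'."""
    return "".join(postcode.split()).upper()


def constituency_counts(postcodes, lookup):
    """Count postcodes per constituency; also report how many had no match."""
    counts = Counter()
    unmatched = 0
    for postcode in postcodes:
        constituency = lookup.get(normalise(postcode))
        if constituency is None:
            unmatched += 1
        else:
            counts[constituency] += 1
    return counts, unmatched


counts, unmatched = constituency_counts(
    ["aa1 1aa", "AA11AA", "BB2 2BB", "ZZ9 9ZZ"], POSTCODE_TO_CONSTITUENCY
)
```

The per-constituency counts are the numbers a heatmap would then be coloured by.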
![Page 46: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/46.jpg)
MPs’ expenses
![Page 47: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/47.jpg)
MPs’ expenses
SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()
![Page 48: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/48.jpg)
v2 used Redis
![Page 49: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/49.jpg)
v2 used Redis
Set difference: Labour MP pages - reviewed pages
SRANDMEMBER
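The selection logic can be sketched with Python sets standing in for Redis sets. The set difference corresponds to Redis `SDIFF` (or `SDIFFSTORE`) and the random pick to `SRANDMEMBER`; how the two were actually combined in v2 is an assumption, as are the page identifiers.

```python
import random

# Hypothetical page identifiers; in Redis these would be members of two sets.
labour_mp_pages = {"page:1", "page:2", "page:3", "page:4"}
reviewed_pages = {"page:1", "page:3"}


def next_page_to_review(candidate_pages, reviewed, rng=random):
    """Pick a random page that is in the candidate set but not yet reviewed."""
    unreviewed = candidate_pages - reviewed  # set difference, like SDIFF
    if not unreviewed:
        return None
    # Random member of the resulting set, like SRANDMEMBER.
    return rng.choice(sorted(unreviewed))


page = next_page_to_review(labour_mp_pages, reviewed_pages)
```

Unlike `ORDER BY RAND()`, this never has to sort the whole table of pages to hand out one random unreviewed item.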
![Page 50: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/50.jpg)
BigTable: Zeitgeist
![Page 51: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/51.jpg)
Zeitgeist stores pre-calculated results in BigTable
• Data comes in from stats system, comments system and OneRiot real-time search API
• AppEngine cron tasks populate task queues
• Task queues recalculate hotness levels
• “Live” BigTable queries are simple SELECT / SORT
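The split between background recalculation and cheap live reads might be sketched like this. The weighting in `hotness` is invented purely for illustration, since the talk does not give the real formula, and the dict stands in for the datastore.

```python
precomputed = {}  # stands in for the datastore: article id -> hotness score


def hotness(page_views, comments, mentions):
    """Invented weighting, for illustration only."""
    return page_views + 5 * comments + 10 * mentions


def recalculate(article_id, signals):
    """Background task: fold the latest signals into a stored score."""
    precomputed[article_id] = hotness(**signals)


def hottest(limit=10):
    """Live read: no calculation, just a sort over stored scores."""
    return sorted(precomputed, key=precomputed.get, reverse=True)[:limit]


recalculate("article-a", {"page_views": 100, "comments": 2, "mentions": 0})
recalculate("article-b", {"page_views": 10, "comments": 0, "mentions": 20})
```

All the expensive work happens in `recalculate`, so the request path stays a simple select-and-sort.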
![Page 52: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/52.jpg)
Live debate poll
• Over a million votes cast in an hour
• Stretched limits of BigTable / AppEngine
• Sharded counter pattern to handle writes
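A sharded counter spreads writes across several sub-counters and sums them on read. The sketch below shows the idea in plain Python; on App Engine each shard would be a separate datastore entity so that concurrent writes rarely contend for the same one. The class name and shard count are assumptions.

```python
import random


class ShardedCounter:
    """Counter split into shards; each write touches one random shard."""

    def __init__(self, num_shards=20, rng=None):
        self._shards = [0] * num_shards
        self._rng = rng or random.Random()

    def increment(self, amount=1):
        # Writes are spread randomly, so no single shard becomes a hot spot.
        index = self._rng.randrange(len(self._shards))
        self._shards[index] += amount

    def total(self):
        # Reads pay the cost instead: every shard is summed.
        return sum(self._shards)


votes = {"yes": ShardedCounter(), "no": ShardedCounter()}  # hypothetical poll options
for _ in range(1000):
    votes["yes"].increment()
votes["no"].increment(3)
```

The trade-off is cheap, contention-free writes in exchange for slightly more expensive reads, which suits a poll taking a burst of votes.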
![Page 53: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/53.jpg)
Spreadsheets are NoSQL too...
![Page 54: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/54.jpg)
Google Docs powered infographics
![Page 55: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/55.jpg)
The Datablog
![Page 56: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/56.jpg)
• Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets
• Retrieve data as CSV, XLS, JSON, Atom...
• “Make a copy” and run your own analysis
![Page 57: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/57.jpg)
Write
Arbitrary data
![Page 58: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/58.jpg)
Create schema-free database alongside RDBMS
Index in Solr
Provide access in API
Investigating: CouchDB
![Page 59: No SQL at The Guardian](https://reader038.vdocuments.mx/reader038/viewer/2022102823/5483e0c5b4af9f1b5b8b4629/html5/thumbnails/59.jpg)
Architecture diagram: web servers, app server, CMS, data feeds, RDBMS, CouchDB?, Memcached (20Gb), core, message queue (M/Q); Out: Solr (six instances), cloud (EC2); In: proxy, apps (six instances) on external hosting (App Engine etc.)