Big iron 2 (published)
![Page 1: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/1.jpg)
The Return of Big Iron?
Ben Stopford
Distinguished Engineer
RBS Markets
![Page 2: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/2.jpg)
Much diversity
![Page 3: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/3.jpg)
What does this mean?
• A change in what customers (we) value
• The mainstream is not serving customers (us) sufficiently
![Page 4: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/4.jpg)
The Database field has problems
![Page 5: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/5.jpg)
We Lose: Joe Hellerstein (Berkeley) 2001
“Databases are commoditised and cornered to slow-moving, evolving, structure intensive, applications that require schema evolution.” … “The internet companies are lost and we will remain in the doldrums of the enterprise space.” … “As databases are black boxes which require a lot of coaxing to get maximum performance”
![Page 6: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/6.jpg)
His question was: how do we win them back?
![Page 7: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/7.jpg)
These new technologies also caused frustration
![Page 8: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/8.jpg)
Backlash (2009)
• Not novel (dates back to the 80s)
• Physical level, not the logical level (messy?)
• Incompatible with tooling
• Lack of integrity (referential) & ACID
• MapReduce (MR) is brute force, ignoring indexing and skew
![Page 9: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/9.jpg)
All points are reasonable
![Page 10: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/10.jpg)
And they proved it too!
“A Comparison of Approaches to Large-Scale Data Analysis” – SIGMOD 2009
• Vertica vs. DBMS-X vs. Hadoop
• Vertica up to 7x faster than Hadoop across the benchmarks
Databases faster than Hadoop
![Page 11: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/11.jpg)
But possibly missed the point?
![Page 12: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/12.jpg)
Databases were traditionally designed to keep data safe
![Page 13: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/13.jpg)
NoSQL grew from a need to scale
![Page 14: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/14.jpg)
![Page 15: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/15.jpg)
It’s more than just scale: they facilitate different practices
![Page 16: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/16.jpg)
A Better Fit
They better match the way software is engineered today.
– Iterative development
– Fast feedback
– Frequent releases
![Page 17: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/17.jpg)
Is NoSQL a Disruptive Technology?
Christensen’s observation:
Market leaders are displaced when markets shift in ways that the incumbent leaders are not prepared for.
![Page 18: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/18.jpg)
Aside: MongoDB
• Impressive trajectory
• Slightly crappy product (from a traditional database standpoint)
• Most closely related to relational DB (of the NoSQLs)
• Plays to the agile mindset
![Page 19: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/19.jpg)
Yet the NoSQL market is relatively small
• Currently around $600 but projected to grow strongly
• Database and systems management market is worth around $34 billion
![Page 20: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/20.jpg)
Key Point
There is more to NoSQL than just scale: it sits better with the way we build software today
![Page 21: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/21.jpg)
We have new building blocks to play with!
![Page 22: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/22.jpg)
My Problem
• Sprawling application space, built over many years, grouped into both vertical and horizontal silos
• Duplication of effort
• Data corruption & preventative measures
• Consolidation is costly, time consuming and technically challenging.
![Page 23: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/23.jpg)
Traditional solutions (in chronological order)
– Messaging
– SOA
– Enterprise Data Warehouse
– Data virtualisation
![Page 24: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/24.jpg)
Bringing data, applications, people together is hard
![Page 25: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/25.jpg)
A popular choice is an EDW
![Page 26: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/26.jpg)
EDW pattern is workable, but tough
– As soon as you take a ‘view’ on what the shape of the data is, it becomes harder to change.
  • Leave ‘taking a view’ to the last responsible moment
– Multifaceted: shape, diversity of source, diversity of population, temporal change
![Page 27: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/27.jpg)
Harder to do iteratively
![Page 28: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/28.jpg)
Is this the only way?
![Page 29: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/29.jpg)
The Google Approach
MapReduce
Google Filesystem
BigTable
Tenzing
Megastore
F1
Dremel
Spanner
![Page 30: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/30.jpg)
And just one code base!
So no enterprise schema secret society!
![Page 31: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/31.jpg)
The eBay Approach
![Page 32: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/32.jpg)
The Partial-Schematic Approach
Often termed Clobs & Cracking
![Page 33: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/33.jpg)
Problems with solidifying a schematic representation
• Risk of throwing information away, keeping only what you think you need.
  – OK if you create data
  – Bad if you got data from elsewhere
• Data tends to be poly-structured in programs and on the wire
• Early-binding slows down development
![Page 34: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/34.jpg)
But schemas are good
• They guarantee a contract
• That contract spans the whole dataset
  – Similar to static typing in programming languages.
![Page 35: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/35.jpg)
Compromise positions
• Query schema can be a subset of data schema.
• Use schemaless databases to capture diversity early and evolve it as you build.
![Page 36: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/36.jpg)
Common solutions today use multiple technologies
![Page 37: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/37.jpg)
We use a late-bound schema, sitting over a schemaless store
Late Bound Schema
![Page 38: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/38.jpg)
Evolutionary Approach
• Late-binding makes consolidation incremental
  – Schematic representation delivered at the ‘last responsible moment’ (schema on demand)
  – A trade in this model has 4 mandatory nodes. A fully modeled trade has around 800.
• The system of record is raw data, not our ‘view’ of it
• No schema migration! But this comes at a price.
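The late-binding idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system described in the talk: `raw_store`, `record`, `bind` and the two trade views are hypothetical names, and the message layout is invented for the example. Raw messages are stored untouched, and a view is applied only when the data is read.

```python
from typing import Any, Callable, Dict, List

# Hypothetical raw store: the system of record keeps each message as received.
raw_store: List[Dict[str, Any]] = []


def record(message: Dict[str, Any]) -> None:
    """Append the raw message untouched; no schema is applied on write."""
    raw_store.append(message)


# A late-bound view maps an output field name to an extractor function
# that runs at read time.
View = Dict[str, Callable[[Dict[str, Any]], Any]]


def bind(view: View, message: Dict[str, Any]) -> Dict[str, Any]:
    """Apply a view to one raw message, producing a flat row."""
    return {name: extract(message) for name, extract in view.items()}


# First version of the view models only two fields (invented for illustration).
trade_view_v1: View = {
    "trade_id": lambda m: str(m["id"]),
    "notional": lambda m: float(m["economics"]["notional"]),
}

# A later version adds a field. The stored messages are not migrated.
trade_view_v2: View = {
    **trade_view_v1,
    "currency": lambda m: m["economics"].get("ccy", "UNKNOWN"),
}

record({"id": 1, "economics": {"notional": "1000000", "ccy": "GBP"},
        "extra": {"desk": "rates"}})
record({"id": 2, "economics": {"notional": 250000.0}})

rows_v1 = [bind(trade_view_v1, m) for m in raw_store]
rows_v2 = [bind(trade_view_v2, m) for m in raw_store]
```

In this sketch the unmodelled `extra` node stays in the raw store, so a later view can still reach it. One cost visible here is that the extraction work is repeated on every read.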
![Page 39: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/39.jpg)
Scaling
![Page 40: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/40.jpg)
Key-based access always scales
![Page 41: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/41.jpg)
But queries (without the sharding key) always broadcast
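A minimal sketch of the difference between the two access paths, assuming a hash-partitioned store with four shards. The shard count, the `shard_for`, `put`, `get_by_key` and `query` helpers, and the trade records are all hypothetical and exist only to show how many shards each kind of request has to contact.

```python
import zlib
from typing import Any, Callable, Dict, List, Optional, Tuple

NUM_SHARDS = 4  # assumed shard count for the illustration

# Each shard is modelled as an in-memory dict of key -> record.
shards: List[Dict[str, Dict[str, Any]]] = [{} for _ in range(NUM_SHARDS)]


def shard_for(key: str) -> int:
    """A stable hash of the sharding key selects exactly one shard."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def put(key: str, rec: Dict[str, Any]) -> None:
    shards[shard_for(key)][key] = rec


def get_by_key(key: str) -> Tuple[Optional[Dict[str, Any]], int]:
    """Key-based read: returns the record and the number of shards contacted."""
    return shards[shard_for(key)].get(key), 1


def query(predicate: Callable[[Dict[str, Any]], bool]) -> Tuple[List[Dict[str, Any]], int]:
    """Read without the sharding key: every shard has to be asked."""
    results: List[Dict[str, Any]] = []
    contacted = 0
    for shard in shards:
        contacted += 1
        results.extend(r for r in shard.values() if predicate(r))
    return results, contacted


put("T1", {"id": "T1", "book": "rates"})
put("T2", {"id": "T2", "book": "fx"})
put("T3", {"id": "T3", "book": "rates"})

one, key_cost = get_by_key("T2")
rates, query_cost = query(lambda r: r["book"] == "rates")
```

Here the key-based read contacts one shard however many shards exist, while the predicate query contacts all of them, so its cost grows with the cluster.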
![Page 42: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/42.jpg)
As query complexity increases so does the overhead
![Page 43: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/43.jpg)
Coarse-grained shards
![Page 44: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/44.jpg)
Data Replicas provide hardware isolation
![Page 45: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/45.jpg)
Scaling
• Key-based sharding is only sufficient for very simple workloads
• Coarse-grained shards help (but suffer from skew)
• Replication provides useful, if expensive, hardware isolation
• Workload management is less useful in my experience
![Page 46: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/46.jpg)
Weak consistency forces the problem onto the developer
Particularly bad for banks!
![Page 47: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/47.jpg)
Scaling two-phase commit is hard to do efficiently
• Requires distributed lock/clock/counter
• Requires synchronisation of all readers & writers
![Page 48: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/48.jpg)
Alternatives to traditional 2PC
• MVCC over explicit locking
• Timestamp-based strong consistency
  – E.g. Granola
• Optimistic concurrency control
  – Leverage short-running transactions (avoid cross-network transactions)
  – Tolerate different temporal viewpoints to reduce synchronization costs.
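As an illustration of the optimistic concurrency control bullet, here is a minimal single-process sketch. `VersionedStore` and `update_with_retry` are hypothetical names, and the lock guards only the brief validate-and-write step. It is not a distributed implementation: readers take no long-lived locks, and a writer commits only if the version it read is still current.

```python
import threading
from typing import Any, Callable, Dict, Tuple


class VersionedStore:
    """Hypothetical store keeping (version, value) per key."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, Any]] = {}
        self._lock = threading.Lock()  # held only for the short commit check

    def read(self, key: str) -> Tuple[int, Any]:
        """Return (version, value); a missing key reads as version 0."""
        with self._lock:
            return self._data.get(key, (0, None))

    def commit(self, key: str, expected_version: int, new_value: Any) -> bool:
        """Write only if nobody else committed since our read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # conflict: the caller must re-read and retry
            self._data[key] = (current_version + 1, new_value)
            return True


def update_with_retry(store: VersionedStore, key: str,
                      fn: Callable[[Any], Any], max_attempts: int = 5) -> bool:
    """Short read-compute-commit loop that retries on conflict."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        if store.commit(key, version, fn(value)):
            return True
    return False


store = VersionedStore()
first = store.commit("position", 0, 100)      # creates version 1
stale = store.commit("position", 0, 50)       # rejected: version 0 is out of date
bumped = update_with_retry(store, "position", lambda v: v + 1)
```

The approach suits the short-running transactions the slide mentions, because a conflict costs one cheap retry and no locks are held across the network.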
![Page 49: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/49.jpg)
Immutable Data
• Safety
• ‘As was’ view
• Sits well with MVCC
• Efficiency problems
• Gaining popularity (e.g. Datomic)
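A small sketch of what immutability buys, assuming an append-only store with a logical clock. `AppendOnlyStore` and its methods are hypothetical and do not reflect Datomic's API. Writes never overwrite, so any earlier state can be read back as an ‘as was’ view.

```python
import bisect
from typing import Any, Dict, List, Optional, Tuple


class AppendOnlyStore:
    """Hypothetical store in which every write appends a new version."""

    def __init__(self) -> None:
        self._clock = 0  # logical timestamp, incremented on each write
        self._versions: Dict[str, List[Tuple[int, Any]]] = {}

    def put(self, key: str, value: Any) -> int:
        """Append a new version and return its timestamp."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def get(self, key: str, as_of: Optional[int] = None) -> Any:
        """Return the latest value, or the value as it was at `as_of`."""
        history = self._versions.get(key, [])
        if not history:
            return None
        if as_of is None:
            return history[-1][1]
        timestamps = [ts for ts, _ in history]
        # Index of the first version written strictly after `as_of`.
        i = bisect.bisect_right(timestamps, as_of)
        return history[i - 1][1] if i > 0 else None


s = AppendOnlyStore()
t1 = s.put("trade-1", {"qty": 10})
t2 = s.put("trade-1", {"qty": 12})

latest = s.get("trade-1")
as_was = s.get("trade-1", as_of=t1)
before_any = s.get("trade-1", as_of=0)
```

The history list grows with every write, which is one form of the efficiency problem listed above.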
![Page 50: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/50.jpg)
Use joins to avoid ‘over aggregating’
Joins are OK, so long as they are
– Local
– Via a unique key
Diagram: Trade, Party, Trader
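The join rule above might look like the following in practice. The sketch assumes, as an illustration only, that the small Party and Trader datasets are replicated in full to every node, so each trade can be resolved with local lookups by unique key. All names and records are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: reference data is replicated to every node and keyed by a unique id.
parties: Dict[str, Dict[str, Any]] = {
    "P1": {"name": "Acme Corp"},
    "P2": {"name": "Globex"},
}
traders: Dict[str, Dict[str, Any]] = {
    "R1": {"name": "Alice"},
}

# This node's partition of trades. Each trade stores keys, not embedded copies,
# which is how the join avoids 'over aggregating' the trade document.
local_trades: List[Dict[str, Any]] = [
    {"trade_id": "T1", "party_id": "P1", "trader_id": "R1", "notional": 5000000},
    {"trade_id": "T2", "party_id": "P2", "trader_id": "R1", "notional": 1250000},
]


def join_trade(trade: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve party and trader with one local lookup each (no network hop)."""
    return {
        "trade_id": trade["trade_id"],
        "notional": trade["notional"],
        "party": parties[trade["party_id"]]["name"],
        "trader": traders[trade["trader_id"]]["name"],
    }


joined = [join_trade(t) for t in local_trades]
```

Because each lookup is by unique key, every trade matches at most one party and one trader, so the join cannot fan out.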
![Page 51: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/51.jpg)
Memory/Disk Tradeoff
• Memory only (possibly overplayed)
• Pinned indexes (generally good idea if you can afford the RAM)
• Disk resident (best general purpose solution and for very large datasets)
![Page 52: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/52.jpg)
Balance flexibility and complexity
![Page 53: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/53.jpg)
Supple at the front, more rigid at the back
![Page 54: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/54.jpg)
Principles
• Record everything
• Grow a schema, don’t do it upfront
• Avoid using a ‘view’ as your system of record.
• Differentiate between sourced data (out of your control) and generated data (in your control).
• Use automated replication (for isolation) as well as sharding (for scale)
• Leverage asynchronicity to reduce transaction overheads
![Page 55: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/55.jpg)
Consolidation means more trust, fewer impedance mismatches and managing tighter couplings
![Page 56: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/56.jpg)
Target architectures are starting to look more like large applications of cloud enabled services than heterogeneous application conglomerates
![Page 57: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/57.jpg)
Are we going back to the mainframe?
![Page 58: Big iron 2 (published)](https://reader034.vdocuments.mx/reader034/viewer/2022052619/5565493ed8b42a9b4c8b4c2b/html5/thumbnails/58.jpg)
Thanks
http://www.benstopford.com