stream collections - scala days
Streams as Scala Collections
S3 Scala Client with Play Iteratees and Composable Operations
Greg Silin
Platform [email protected]/nitro/streamcollections
ScalaDays 2015
Agenda
• Reactive at Nitro
• Smart Documents at Scale
• Motivation for Streaming Collections
• Building Streams with Iteratees
• Streams as Scala Collections
• Applications
• Questions
Knowledge workers spend approximately 11+ hours a week creating and managing documents
The New Way
Create Prepare Sign
(Anywhere)
Nitro accelerates the way businesses create, prepare, and
sign documents.
Anytime and anywhere.
Smarter Documents for Everyone™
Reactive Systems at Nitro
react to user expectations <- responsive
react to state changes <- message driven
react to variable load <- elastic
react to failure <- resilient
Smart Documents at Scale
documents / second *
versions / document *
pages / version =
billions of objects in S3
Smart Documents at Scale
millions of new document uploads a day
100M+ document state changes a day, resulting in 10x as many messages
billions of objects in S3
Motivation for Streaming Collections
counting
copying
extracting
cleanup
become non-trivial at scale
Motivation for Streaming Collections
1 percent error margin = 10M objects
That’s money for the business
How?
Can’t load everything in memory
Need some batched solution
Command line tools don’t provide flexibility / scale
How?
Streaming is a natural fit
Amazon SDK has a Java key iterator
Thus asynchronous streams
We are reactive
Can’t over-parallelize
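The batched, bounded-parallelism idea from the "How?" slides can be pictured with the sketch below. It is not the Nitro client: `fakeKeys` is a hypothetical stand-in for the SDK's key iterator, and `countMatching`, `batchSize`, and `parallelism` are names invented for illustration. It uses only the Scala standard library and blocks once per batch for simplicity, which a fully reactive version would avoid.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedBatches {
  // Hypothetical stand-in for an S3 key listing: a lazy iterator of keys.
  def fakeKeys(n: Int): Iterator[String] =
    Iterator.range(0, n).map(i => s"doc-$i")

  // Counts keys matching `p` without loading all keys in memory.
  // Keys are pulled `batchSize` at a time, and a fixed thread pool caps
  // how many checks run concurrently (the "can't over-parallelize" point).
  def countMatching(keys: Iterator[String], batchSize: Int, parallelism: Int)(
      p: String => Boolean): Long = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      keys.grouped(batchSize).map { batch =>
        val checks = batch.map(k => Future(if (p(k)) 1L else 0L))
        // Blocking here keeps only one batch in flight at a time.
        Await.result(Future.sequence(checks), 1.minute).sum
      }.sum
    } finally pool.shutdown()
  }
}
```

For example, `BoundedBatches.countMatching(BoundedBatches.fakeKeys(100), 10, 4)(_.endsWith("0"))` walks 100 keys in batches of 10 using at most 4 threads.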
What Streams?
Enter Play Iteratees
Enumerator - Source
Enumeratee - Transformer
Iteratee - Consumer / Sink
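The three roles can be seen together in a small sketch. It assumes the Play Iteratees library (package `play.api.libs.iteratee`) is on the classpath; the object name `IterateeSketch` and the sample numbers are made up for illustration and have nothing to do with S3.

```scala
import play.api.libs.iteratee.{Enumeratee, Enumerator, Iteratee}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object IterateeSketch {
  // Enumerator: the source that pushes elements downstream.
  val source: Enumerator[Int] = Enumerator(1, 2, 3, 4, 5)

  // Enumeratees: transformers that sit between source and sink.
  val evens: Enumeratee[Int, Int] = Enumeratee.filter[Int](_ % 2 == 0)
  val doubled: Enumeratee[Int, Int] = Enumeratee.map[Int](_ * 2)

  // Iteratee: the consumer / sink; here it folds the stream into a sum.
  val sum: Iteratee[Int, Int] = Iteratee.fold[Int, Int](0)(_ + _)

  // `&>` pipes the source through a transformer;
  // `|>>>` runs the stream into the sink and yields a Future of the result.
  def run(): Future[Int] = (source &> evens &> doubled) |>>> sum
}
```

The result arrives as a `Future`, so the caller decides when (or whether) to block on it.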
Building Streams with Iteratees
Why Play Iteratees?
Most mature technology at the time
Production Experience
Streams as Scala Collections
We are all familiar with Scala collections
map
filter
foreach
grouped
count
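To make the idea concrete, here is a minimal sketch of a collection-like facade over an Enumerator. The class name `StreamView` is hypothetical and this is not the streamcollections API; it only illustrates how familiar method names could delegate to Play Iteratees (assumed to be on the classpath). `grouped` is left out to keep the sketch short.

```scala
import play.api.libs.iteratee.{Enumeratee, Enumerator, Iteratee}
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical facade: collection-style methods over an asynchronous stream.
final class StreamView[A](val source: Enumerator[A])(implicit ec: ExecutionContext) {

  // Transformations stay lazy: they only attach an Enumeratee to the source.
  def map[B](f: A => B): StreamView[B] =
    new StreamView(source &> Enumeratee.map[A](f))

  def filter(p: A => Boolean): StreamView[A] =
    new StreamView(source &> Enumeratee.filter[A](p))

  // Terminal operations run the stream into an Iteratee and return a Future.
  def foreach(f: A => Unit): Future[Unit] =
    source |>>> Iteratee.foreach[A](f)

  def count: Future[Long] =
    source |>>> Iteratee.fold[A, Long](0L)((n, _) => n + 1)
}
```

With an implicit `ExecutionContext` in scope, `new StreamView(Enumerator(1, 2, 3, 4, 5, 6)).filter(_ % 2 == 0).map(_ * 10).count` reads like ordinary collection code while the work happens asynchronously.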
Streams as Scala Collections - Applications
Can extend this model onto other data sources
We don’t have to stop at S3
➔ Relational DB
➔ ElasticSearch
➔ HBase / Cassandra
➔ Spark
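Many of these sources expose some form of paged listing (a cursor, a scroll, a continuation token), which is what makes the model portable. The sketch below is an assumption-laden illustration using only the standard library: `Page`, `stream`, and the integer token are hypothetical names, not part of any of the listed systems' APIs.

```scala
object PagedSourceSketch {
  // Hypothetical page shape: some items plus an optional continuation token.
  final case class Page[A](items: Seq[A], next: Option[Int])

  // Flattens a paged source into a single iterator. The first page is
  // fetched on construction; later pages are fetched only when the
  // current one is exhausted, so at most one page is held in memory.
  def stream[A](fetch: Option[Int] => Page[A]): Iterator[A] = new Iterator[A] {
    private var page: Page[A] = fetch(None)
    private var buffer: Iterator[A] = page.items.iterator

    // Skip forward over empty pages until an item or the end is found.
    private def advance(): Unit =
      while (!buffer.hasNext && page.next.isDefined) {
        page = fetch(page.next)
        buffer = page.items.iterator
      }

    def hasNext: Boolean = { advance(); buffer.hasNext }
    def next(): A = { advance(); buffer.next() }
  }
}
```

Once a source is adapted this way, `map`, `filter`, `foreach`, `grouped`, and `count` come for free from `Iterator`, which is the same convenience the stream collections abstraction aims for on asynchronous streams.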
"Much of my work has come from being lazy." - John Backus
Quoted in the IBM employee magazine Think in 1979 (http://en.wikiquote.org/wiki/John_Backus)
What We Learned
Iteratees are good for traversing large volumes of data
Programming iteratees can get a bit tricky
Scaling ain’t easy
Stream Collections abstraction makes streams simple
Future of Streams as Scala Collections
Continue developing a reactive S3 Client
In use in Nitro Production
Introduce other stream implementations (Akka Streams, etc.)
www.github.com/nitro/streamcollections
Contributors:
www.github.com/gregsilin / @gregsbriefs
www.github.com/mkolod / @marekinfo
Open Sourcing
Are you interested? We welcome collaborators!
San Francisco Scala Days 2015
• Nitro is a Gold sponsor
• Meet us at our community booth
sfscala.org:
• Wed: Scala D’Ehs meetup @ Stock in Trade
• Thu: unconference @ Galvanize
• Thu evening: Spark Notebook & Rapture @ Nitro
• Fri: free Shapeless training @ Nitro