how to manage large amounts of data with iteratee - scaladays berlin 2014
DESCRIPTION
How to manage a large file with HTTP? Is it possible to manage data stream when you are in a HTTP POST request? How to be relax when managing data stream (ie : not use while(true) hack or something I/O block)? This talk is about Play! iteratee and how to use it on real project : what is an iteratee? why it's useful? how to use it? how to manage it? is it clean? is your code readable or not? http://scaladays.org/#schedule/How-to-manage-large-amounts-of-data-with-IterateeTRANSCRIPT
HOW TO MANAGE LARGE AMOUNTS OF DATA WITH ITERATEEBY @WAXZCE – QUENTIN ADAM
SCALA DAYS 2014 BERLIN
MY DAY TO DAY WORK : CLEVER CLOUD, MAKE YOUR APP RUN ALL THE TIME
And learn a lot of things about your code, apps, and good/bad design…
KEEP YOUR APPS ONLINE. MADE WITH NODE.JS, SCALA, JAVA, RUBY, PHP, PYTHON, GO…
AND LEARN A LOT OF THINGS ABOUT YOUR CODE, APPS, AND GOOD/BAD DESIGN…
WHY STREAMS ARE SO TRENDY THESE DAYS?
2 MAIN REASONS
DATA ARE COMING BIGGER AND BIGGER
WE NEED TO ACT WHILE DATA TRANSFERRING IS OCCURRING
&
WE ARE NOW USING LOTS OF PERMANENT CONNECTION
EXAMPLE
WHAT IS INSIDE AN HTTP REQUEST ?
Verb• The action
Resource• The object of the action
Headers • The context of the action
Body • Optional• The datas
IN MANY CASE THE REQUEST IS MANIPULATE ALL FROM MEMORY
UPLOADING A 20 GB FILE ON A POST REQUEST AND STORE IT
EXAMPLE
IT’S A BAD IDEA TO PUT THE BODY PART IN MEMORY
CREATE A TEMP FILE TO STORE THE DATA
Client upload
file
Create a temp file
Send a file in a
backend
Then answer to the client
The file is transferring two
times
NOT SO GOOD
Client upload file
Directly stream it to the backend
Then answer to the client
LET’S ACT ON STREAM!
CLASSIC JAVA STREAM MANAGEMENT
CLASSIC JAVA STREAM MANAGEMENT
• Buffers• Buffer management• Buffer exception• Buffer overflow
CLASSIC JAVA STREAM MANAGEMENT • Low performances if not buffered
• Not modular
• Thread blocking
• Code is ugly
• No back pressure
• Error handling is bad
• i/o management and business code are mixed
I CAN’T SLEEP IN THE NIGHT
WE HAVE TO FIND A BETTER WAY
THAT’S WHY WE NEED ITERATEE
DATA STRUCTURES TO EXPRESS DATA STREAM MANAGEMENT
Like a recipe
Consume the data
ITERATEE : HOW TO MANAGE A STREAM
Produce the data
ENUMERATOR : DATA STREAM
Set of tools to do cool things with Iteratee and Enumerator
ENUMERATEE
SIMPLE EXAMPLE TO START
AN ENUMERATOR
AN ITERATEE
INPUTS
• EOF
• Empty
• El
ITERATEE STATES
• Done
• Error
• Cont
AN ITERATEE
AN ITERATEE TO MANAGE THE STREAM
AN ITERATEE
ITERATEE COMPOSITION
PURE FUNCTIONAL
TYPE SAFE
COMPOSITION
SO WE WILL JUST MANAGE THE BODY STREAM
JUST DO NOT REWRITE HTTP PARSER
CREATE TEMP FILE
Built in on play with
CREATE TEMP FILE
Built in on play with
BODY PARSERSREQUEST HEADERS -> ITERATEE[ARRAY[BYTE], EITHER[SIMPLERESULT, ?]]
GET FILE AND CALCULATE HASH FROM CHUNK
DEMO
I WANT TO USE FORMS
SO PARTHANDLER ALLOW YOU TO COMPOSE A BODYPARSER
ITERATEE <3 COMPOSITION
NOW WE CAN ACCESS TO STREAMS
DATA STORAGE
I HATE FILE SYSTEM
I HATE FILE SYSTEM
DO NOT USE THE FILE SYSTEM AS A DATASTORE
File system are POSIX compliant
• POSIX is ACID• POSIX is powerful but is a bottleneck • File System is the nightmare of ops • File System creates coupling (host provider/OS/language)• SPOF-free multi tenant File System is a unicorn
• Key = hash of the chunk
STORE FILE IN CHUNK INSIDE K/V STORE
STORE AS META DATA THE LIST OF CHUNKS
PROCESS
File is post by browser
Stream is chunk
nomalized
Each chunk is async
store in kv
List of chunk hash is store
with file id
No overhead for data store
NEW DEMO• Play 2 + iteratee
• Datastore is riak
ARE YOU CONVINCE TO USE STREAM ?
DATA STREAMING PROGRAMMING IS NOW TRENDING
THX TO @CLEMENTDCLEMENT DELAFARGUE
I’m @waxzce on twitter
I’m the CEO of
A PaaS provider, give it a try ;-)
THX FOR LISTENING & QUESTIONS TIME
Coupon for Clever Cloud trial :
berlin_scala_days