streams
Streams of social consciousness
Real-time data transformation
Who am I?
Psycholinguist: Research/Data analysis
Flex Programmer: OO, Enterprise
Interactive Developer: Browser + Server
2000 2008 2013
Marielle Lange @widged
Stream expertise
Fairly recent and rather limited.
๏ Gulp -> custom modules written by adapting other modules.
๏ Data analysis -> using streams to process large data sets.
➡ I will attempt to provide the minimal orientation to get started, staying clear of complex topics like back-pressure handling.
Streams for data analysis
Garden Data. Aggregating data scraped from a large number of websites. Parsing them. Normalizing them (Fahrenheit vs Celsius, March in the Northern or Southern Hemisphere). Reducing them (converting [55-65] to 55 #1, 60 #1, 65 #1). Rendering them (average vs visualisation).
Streams?
Streams manage a data flow.
Sources. Where data pour from.
Sinks. Where results pour to.
Throughs. Where data gets manipulated and transformed.
ReadStream.
WriteStream.
What are they good for?
๏ Gulp - writing your own modules.
๏ Real-time data obtained from remote servers that would be impractical to buffer on a device with limited memory.
๏ Map-reduce types of computations - a programming model for processing and generating large data sets. A map function generates a set of intermediate key/value pairs ({word: 'hello', length: 5}) and a reduce function merges all intermediate values associated with the same intermediate key (['agile', 'greet', 'hello'] - list of words of length 5). Great if you want to run computations on distributed systems.
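The map and reduce steps described above can be sketched with plain array functions, before any stream is involved. The names `mapWord` and `reduceByLength` and the word list are illustrative assumptions, not taken from the slides.

```javascript
// Map step: turn each word into an intermediate key/value record.
function mapWord(word) {
  return { word: word, length: word.length };
}

// Reduce step: merge all records that share the same key (the length).
function reduceByLength(groups, record) {
  var key = record.length;
  groups[key] = (groups[key] || []).concat(record.word);
  return groups;
}

// Illustrative input; in a real program the words would arrive through a stream.
var words = ['agile', 'greet', 'hello', 'stream'];
var byLength = words.map(mapWord).reduce(reduceByLength, {});
console.log(byLength);
```

Because each record is mapped independently and groups are merged by key, the same two functions could in principle be applied chunk by chunk as data flows through a pipeline.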
Streams 101
Readable Streams
Abstraction for a source of data that you are reading data from.
‣ http responses, on the client
‣ http requests, on the server
‣ fs read streams
‣ zlib streams
‣ crypto streams
‣ tcp sockets
‣ child process stdout and stderr
‣ process.stdin
Notes
๏ A readable stream will not start emitting data until you indicate that you are ready to receive it.
๏ Readable streams have two "modes": a flowing mode and a non-flowing mode.
var flappyStream = readable.read();
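A minimal sketch of the two modes, assuming a recent Node.js and using only the core `stream` module. The in-memory source built by `makeSource` is a hypothetical stand-in for a file or socket.

```javascript
var Readable = require('stream').Readable;

// Hypothetical in-memory source; read() is a no-op because we push by hand.
function makeSource() {
  var source = new Readable({ objectMode: true, read: function () {} });
  source.push('flappy');
  source.push('bird');
  source.push(null); // signals the end of the data
  return source;
}

// Non-flowing mode: nothing is emitted, the consumer pulls with read().
var paused = makeSource();
var first = paused.read();  // 'flappy'
var second = paused.read(); // 'bird'
console.log(first, second);

// Flowing mode: attaching a 'data' listener tells the stream we are ready.
var received = [];
makeSource()
  .on('data', function (chunk) { received.push(chunk); })
  .on('end', function () { console.log('flowing mode received', received); });
```

In both cases the source stays silent until the consumer asks for data, either by calling `read()` or by attaching a listener.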
Writable Streams
Abstraction for a destination that you are writing data to.
‣ http requests, on the client
‣ http responses, on the server
‣ fs write streams
‣ zlib streams
‣ crypto streams
‣ tcp sockets
‣ child process stdin
‣ process.stdout, process.stderr
writable.write(flappyBird);
Transforms
Compressing a file using gzip
var fs = require("fs"), zlib = require("zlib");
var readable = fs.createReadStream("foo.txt"),
    gzip = zlib.createGzip(), // the transform: plain data in, gzipped data out
    writable = fs.createWriteStream("foo.txt.gz");
readable.pipe(gzip).pipe(writable);
var evilStream = transform.output.read();
transform.input.write(flappyBird);
Abstraction for a stream that is both readable and writable, where the input is related to the output (map or filter step).
Dominic Tarr's `through` module provides similar functionality.
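As a rough illustration of a map-style transform, here is a sketch built on the core `stream.Transform` class rather than on `through`. The `upperCase` helper is a hypothetical name chosen for this example.

```javascript
var Transform = require('stream').Transform;

// A through step that upper-cases every chunk passing through it.
function upperCase() {
  return new Transform({
    transform: function (chunk, encoding, done) {
      // First argument is an error (none here), second is the output chunk.
      done(null, chunk.toString().toUpperCase());
    }
  });
}

var shout = upperCase();
shout.on('data', function (chunk) { console.log(chunk.toString()); });
shout.write('flappy bird'); // written on the input side
shout.end();                // the upper-cased text comes out of the output side
```

A filter step would follow the same shape, calling `done()` with no output for the chunks it wants to drop.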
Basic API
Readable stream
var fs = require('fs');
var readable = fs.createReadStream('foo.txt');
// this is the classic api
readable
  .on('data', function (data) { console.log('Data!', data); })
  .on('error', function (err) { console.error('Error', err); })
  .on('end', function () { console.log('All done!'); });
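Writable stream
var fs = require('fs');
var readable = fs.createReadStream('foo.txt'),
    writable = fs.createWriteStream('copy.txt');
// { end: false } keeps the destination open once the source ends,
// so the extra line can still be written without a "write after end" error.
readable.pipe(writable, { end: false });
readable.on('end', function () { writable.end('an extra line'); });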
Toolbox
event-stream (D. Tarr)
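var fs = require("fs"), JSONStream = require('JSONStream'), map = require('map-stream');
var input = fs.createReadStream("twitter-feed.json"),
    output = fs.createWriteStream("twitter-sentiments.json");
// computeSentiments(tweet, callback) is assumed to be defined elsewhere in the program
input
  .pipe(JSONStream.parse("*"))
  .pipe(map(computeSentiments))
  .pipe(JSONStream.stringify()) // turn objects back into text before writing
  .pipe(output);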
Stream playground (J. Resig)
Stream handbook (@Substack)
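Vinyl
Rapidly define a list of files to read from with glob strings
var vinyl = require('vinyl-fs'), map = require('map-stream'), JSONStream = require('JSONStream');
vinyl.src('./data/*/quad/*.comp.json', { buffer: false })
  .pipe(map(mapSource))
  .pipe(vinyl.dest("./out"));
function mapSource(file, asyncReturn) {
  // with { buffer: false }, file.contents is a stream
  file.contents = file.contents
    .pipe(JSONStream.parse("*"))
    .pipe(SomeAnalysis) // placeholder for a transform stream defined elsewhere
    .pipe(JSONStream.stringify());
  asyncReturn(null, file); // hand the file on so that vinyl.dest can write it
}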
Example
Twitter Sentiments
Register an application with the Twitter API – https://dev.twitter.com/
Create an access token.
In your project, add a file “secret_keys.js” with:
module.exports = {
  twitter: {
    consumer_key: "YOUR_CONSUMER_KEY",
    consumer_secret: "YOUR_CONSUMER_SECRET",
    access_token_key: "USER_ACCESS_TOKEN",
    access_token_secret: "USER_ACCESS_TOKEN_SECRET"
  }
};
Takes advantage of the sentiment module:
https://github.com/thisandagain/sentiment
Programming Style
Separation of concerns
The #1 reason to use streams, for me, is that the piping structure encourages writing programs as bite-size modules that are highly interchangeable.
In the early stages of writing the example program, I had:
tweets
  .pipe(map(englishOnly))
  .pipe(map(addSentiment))
Then I found out that the API gave you the option to specify a language filter. All I had to do was drop one line of code.
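To make the "drop one line" point concrete, here is a self-contained sketch using only core Node.js streams. The `step` wrapper, the sample tweets, and the constant sentiment score are assumptions for illustration; the original program used `map-stream` and a real sentiment library.

```javascript
var stream = require('stream');

// Wrap a plain function into an object-mode through step.
// Returning undefined from fn drops the item.
function step(fn) {
  return new stream.Transform({
    objectMode: true,
    transform: function (item, encoding, done) {
      var result = fn(item);
      if (result === undefined) { done(); } else { done(null, result); }
    }
  });
}

function englishOnly(tweet) {
  return tweet.language === 'en' ? tweet : undefined;
}

function addSentiment(tweet) {
  // Placeholder score; a real program would call a sentiment library here.
  return Object.assign({}, tweet, { sentiment: 0 });
}

// Hypothetical sample data standing in for the Twitter feed.
var tweets = stream.Readable.from([
  { language: 'en', text: 'hello' },
  { language: 'fr', text: 'bonjour' }
]);

tweets
  .pipe(step(englishOnly)) // delete this line once the API filters by language
  .pipe(step(addSentiment))
  .on('data', function (tweet) { console.log(tweet); });
```

Neither step knows about the other, so removing or reordering a line of the pipeline requires no change elsewhere.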
Functional Programming
A more functional style of programming encourages the avoidance of side effects or state mutation.
var fs = require("fs"), JSONStream = require("JSONStream"), map = require("map-stream");
var readable = fs.createReadStream("foo.txt");
readable
  .pipe(JSONStream.parse("*")) // assumes foo.txt holds a JSON array of objects
  .pipe(map(filterEnglish));
function filterEnglish(data, asyncReturn) {
  if (data.language === "en") {
    // write these data to the output stream
    asyncReturn(null, data);
  } else {
    // but don't write these.
    asyncReturn();
  }
}
๏ Single Responsibility Principle: "A function should do one thing, and do it well."
๏ Pure functions. No knowledge of the external world whatsoever. Every bit of information required to run the function is explicitly passed as a parameter.
๏ Immutable data. A function returns new data that captures the transformation rather than a reference to the old data.
๏ Higher Order Functions. Functions that return functions (partials, currying). A way to capture local context through closure.
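The principles listed above can be sketched in a few lines. The function names `languageIs` and `withLength` and the sample tweet are hypothetical, chosen only to illustrate the ideas.

```javascript
// Higher-order function: returns a filter bound to one language via closure.
function languageIs(language) {
  return function (tweet) {
    return tweet.language === language;
  };
}

// Pure and non-mutating: everything it needs is passed in,
// and it returns a new object instead of changing its input.
function withLength(tweet) {
  return Object.assign({}, tweet, { length: tweet.text.length });
}

// Freezing the sample makes any accidental mutation visible.
var original = Object.freeze({ language: 'en', text: 'hello' });
var isEnglish = languageIs('en');
var measured = withLength(original);

console.log(isEnglish(original), measured, original);
```

Functions written this way can be dropped into a `map` step of a pipeline without worrying about what ran before or after them.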