GOOGLE CLOUD DATAFLOW & APACHE FLINK
IVAN FERNANDEZ PEREA
GOOGLE CLOUD DATAFLOW DEFINITION
“A fully-managed cloud service and programming model for batch and streaming big data processing”
• Main features
– Fully managed
– Unified programming model
– Integrated & open source
GOOGLE CLOUD DATAFLOW USE CASES
• Both batch and streaming data processing
• ETL (extract, transform, load) approach
• Excels at high-volume computation
• High parallelism factor (“embarrassingly parallel”)
• Cost effective
DATAFLOW PROGRAMMING MODEL
• Designed to simplify the mechanics of large-scale data processing
• It creates an optimized job to be executed as a unit by one of the Cloud Dataflow runner services
• You can focus on the logical composition of your data processing job, rather than the physical orchestration of parallel processing
GOOGLE CLOUD DATAFLOW COMPONENTS
• Two main components:
– A set of SDKs used to define data processing jobs:
  • Unified programming model. “One size fits all” approach
  • Data programming model (pipelines, collections, transformations, sources and sinks)
– A managed service that ties in with other Google Cloud Platform services: Google Compute Engine, Google Cloud Storage, BigQuery, …
DATAFLOW SDKS
• Each pipeline is an independent entity that reads some input data, transforms it, and generates some output data. A pipeline represents a directed graph of data processing transformations
• Simple data representation
– Specialized collections called PCollections
– A PCollection can represent a dataset of unlimited size
– PCollections are the inputs and the outputs for each step in your pipeline
• Dataflow provides abstractions to manipulate data
– Transformations over data are known as PTransforms
– Transformations can be linear or not
• I/O APIs for a variety of data formats such as text or Avro files, BigQuery tables, Google Pub/Sub, …
• Dataflow SDK for Java available on GitHub. https://github.com/GoogleCloudPlatform/DataflowJavaSDK
PIPELINE DESIGN PRINCIPLES
• Some questions to ask before building your pipeline:
– Where is your input data stored? This determines your Read transforms
– What does your data look like? This defines your PCollections
– What do you want to do with your data? This determines your core or pre-written transforms
– What does your output data look like, and where should it go? This determines your Write transforms
PIPELINE SHAPES
[Figures: Linear Pipeline, Branching Pipeline]
PIPELINE EXAMPLE
    public static interface Options extends PipelineOptions { ... }

    public static void main(String[] args) {
      // Parse and validate command-line flags,
      // then create pipeline, passing it a user-defined Options object.
      Options options = PipelineOptionsFactory.fromArgs(args)
          .withValidation()
          .as(Options.class);

      Pipeline p = Pipeline.create(options);

      p.apply(TextIO.Read.from(input))   // SDK-provided PTransform for reading text data
       .apply(new CountWords())          // User-written subclass of PTransform for counting words
       .apply(TextIO.Write.to(output));  // SDK-provided PTransform for writing text data

      p.run();
    }
PIPELINE COLLECTIONS: PCOLLECTIONS
• A PCollection represents a potentially large, immutable “bag” of same-type elements
• A PCollection can hold elements of any type; they are encoded using either the Dataflow SDK's data encoding or your own
• PCollection requirements
– Immutable. Once created, you cannot add, remove or change individual elements
– Does not support random access
– A PCollection belongs to one Pipeline (collections cannot be shared between pipelines)
• Bounded vs. unbounded collections
– Which one you get depends on your source dataset
– Bounded collections can be processed using batch jobs
– Unbounded collections must be processed using streaming jobs (Windowing and Timestamps)
PIPELINE COLLECTIONS EXAMPLE
• A collection created from individual lines of text
    // Create a Java Collection, in this case a List of Strings.
    static final List<String> LINES = Arrays.asList(
        "To be, or not to be: that is the question: ",
        "Whether 'tis nobler in the mind to suffer ",
        "The slings and arrows of outrageous fortune, ",
        "Or to take arms against a sea of troubles, ");

    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of());  // create the PCollection
PIPELINE COLLECTIONS TYPES
• Bounded PCollections represent a fixed data set, from data sources/sinks such as:
– TextIO
– BigQueryIO
– DatastoreIO
– Custom data sources using the Custom Source/Sink API
• Unbounded PCollections represent a continuously updating data set, from streaming data sources/sinks such as:
– PubsubIO
– BigQueryIO (only as a sink)
• Each element in a PCollection has an associated timestamp. Note: not every source assigns one (e.g., TextIO)
• Unbounded PCollections are processed as finite logical windows (Windowing). Windowing can also be applied to bounded PCollections, as a single global window
PCOLLECTIONS WINDOWING
• Subdivides PCollection processing according to element timestamps
• Uses Triggers to determine when to close each finite window as unbounded data arrives
• Windowing functions
– Fixed Time Windows
– Sliding Time Windows. Two parameters: window size and period
– Per-Session Windows. Based on when actions are performed (e.g., mouse interactions)
– Single Global Window. The default window
– Other windowing functions, such as calendar-based ones, are found in com.google.cloud.dataflow.sdk.transforms.windowing
• Time Skew, Data Lag, and Late Data. As each element is marked with a timestamp, it is possible to tell whether data arrives with some lag
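The window assignment described on this slide can be sketched in plain Java. This is only an illustration under stated assumptions, not the Dataflow SDK implementation: the class `WindowSketch` and its methods are hypothetical names, timestamps are plain milliseconds, each window is identified by its start time, and `watermarkMs` is an assumed input standing for how far the pipeline believes event time has progressed.

```java
import java.util.ArrayList;
import java.util.List;

class WindowSketch {
    /** Start of the fixed window of length sizeMs that contains timestampMs. */
    public static long fixedWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Starts of all sliding windows (length sizeMs, a new one every periodMs) that contain timestampMs. */
    public static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window that starts at or before the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, periodMs);
        // Walk backwards while the window [start, start + sizeMs) still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    /** True if the element's fixed window ended more than allowedLatenessMs before the (assumed) watermark. */
    public static boolean isDroppedAsLate(long timestampMs, long sizeMs,
                                          long watermarkMs, long allowedLatenessMs) {
        long windowEnd = fixedWindowStart(timestampMs, sizeMs) + sizeMs;
        return watermarkMs > windowEnd + allowedLatenessMs;
    }

    public static void main(String[] args) {
        // One-minute fixed windows: 125 s falls in the window starting at 120 s.
        System.out.println(fixedWindowStart(125_000L, 60_000L));             // 120000
        // One-minute windows every 30 s: 125 s belongs to two overlapping windows.
        System.out.println(slidingWindowStarts(125_000L, 60_000L, 30_000L)); // [120000, 90000]
    }
}
```

With sliding windows an element lands in size/period windows (two in the example), which is why a short period multiplies the amount of data processed downstream.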
PCOLLECTIONS WINDOWING II
• Adding timestamps
    PCollection<LogEntry> unstampedLogs = ...;
    PCollection<LogEntry> stampedLogs =
        unstampedLogs.apply(ParDo.of(new DoFn<LogEntry, LogEntry>() {
          @Override
          public void processElement(ProcessContext c) {
            // Extract the timestamp from log entry we're currently processing.
            Instant logTimeStamp = extractTimeStampFromLogEntry(c.element());
            // Use outputWithTimestamp to emit the log entry with timestamp attached.
            c.outputWithTimestamp(c.element(), logTimeStamp);
          }
        }));
• Time Skew and Late Data
    PCollection<String> items = ...;
    PCollection<String> fixed_windowed_items = items.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .withAllowedLateness(Duration.standardDays(2)));
• Sliding window
    PCollection<String> items = ...;
    PCollection<String> sliding_windowed_items = items.apply(
        Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(60))
                .every(Duration.standardSeconds(30))));
PCOLLECTION SLIDING TIME WINDOWS
PIPELINE TRANSFORMS: PTRANSFORMS
• Represent the logic of a processing operation in a pipeline as a function object
• Processing operations
– Mathematical computations on data
– Converting data from one format to another
– Grouping data together
– Filtering data
– Combining data elements into single values
• PTransform requirements
– Serializable
– Thread-compatible. Each function instance is accessed by a single thread on a worker instance
– Idempotent functions are recommended: for any given input, a function always produces the same output
• How does it work? Call the apply method on the PCollection
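How `apply` chains transforms over immutable collections can be illustrated without the SDK. In this sketch, `Coll` and `Transform` are hypothetical stand-ins for `PCollection` and `PTransform`, and `lengths` and `onlyEven` are made-up example transforms. They run eagerly in memory, whereas the real SDK builds a pipeline graph that a runner executes later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ApplySketch {
    /** Hypothetical stand-in for a PTransform: a function from one collection to another. */
    interface Transform<I, O> {
        Coll<O> expand(Coll<I> input);
    }

    /** Hypothetical stand-in for a PCollection: an immutable bag of same-type elements. */
    static final class Coll<T> {
        private final List<T> elements;

        Coll(List<T> elements) {
            // Defensive copy plus read-only view: nobody can modify the bag afterwards.
            this.elements = Collections.unmodifiableList(new ArrayList<>(elements));
        }

        List<T> elements() {
            return elements;
        }

        /** Applying a transform never changes this collection; it returns a new one. */
        <O> Coll<O> apply(Transform<T, O> transform) {
            return transform.expand(this);
        }
    }

    /** Example transform: map each string to its length. */
    static Transform<String, Integer> lengths() {
        return input -> {
            List<Integer> out = new ArrayList<>();
            for (String s : input.elements()) {
                out.add(s.length());
            }
            return new Coll<>(out);
        };
    }

    /** Example transform: keep only even numbers. */
    static Transform<Integer, Integer> onlyEven() {
        return input -> {
            List<Integer> out = new ArrayList<>();
            for (Integer n : input.elements()) {
                if (n % 2 == 0) {
                    out.add(n);
                }
            }
            return new Coll<>(out);
        };
    }

    public static void main(String[] args) {
        Coll<String> words = new Coll<String>(Arrays.asList("to", "be", "or", "not"));
        Coll<Integer> result = words.apply(lengths()).apply(onlyEven());
        System.out.println(result.elements());       // [2, 2, 2]
        System.out.println(words.elements().size()); // 4, the input is untouched
    }
}
```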
PIPELINE TRANSFORMS TYPES
• Core transforms
– ParDo for generic parallel processing
– GroupByKey for grouping key/value pairs by key
– Combine for combining collections or grouped values
– Flatten for merging collections
• Composite transforms
– Built from multiple sub-transforms in a modular way
– Examples: the Count and Top composite transforms
• Pre-written transforms
– Processing logic such as combining, splitting, manipulating and performing statistical analysis is already written
– They are found in the com.google.cloud.dataflow.sdk.transforms package
• Root transforms for reading and writing data
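As a rough analogy for the four core transforms, the sketch below uses the JDK's `java.util.stream` API rather than the Dataflow SDK. The mapping in the comments (Flatten, ParDo, GroupByKey, Combine) is an informal assumption made for illustration; the class and method names are hypothetical, and a local stream is neither distributed nor windowed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CoreTransformsSketch {
    /** Counts words across two input collections, returning the counts sorted by word. */
    public static Map<String, Long> countWords(List<String> linesA, List<String> linesB) {
        return Stream.concat(linesA.stream(), linesB.stream())   // like Flatten: merge collections
                // like ParDo: element-wise processing, zero or more outputs per input
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(                  // like GroupByKey: group by word
                        word -> word,
                        TreeMap::new,
                        Collectors.counting()));                 // like Combine: one value per group
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(
                Arrays.asList("to be or not"),
                Arrays.asList("to be"));
        System.out.println(counts); // {be=2, not=1, or=1, to=2}
    }
}
```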
PIPELINE TRANSFORM EXAMPLE
• A composite transform that counts words
    static class CountWords
        extends PTransform<PCollection<String>, PCollection<String>> {
      @Override
      public PCollection<String> apply(PCollection<String> lines) {
        // Split each line into words (ExtractWordsFn is a user-defined DoFn, not shown here).
        PCollection<String> words = lines.apply(
            ParDo
                .named("ExtractWords")
                .of(new ExtractWordsFn()));

        // Count.perElement() emits one KV<word, count> per distinct word; the count is a Long.
        PCollection<KV<String, Long>> wordCounts =
            words.apply(Count.<String>perElement());

        // Format each pair as "word: count".
        PCollection<String> results = wordCounts.apply(
            ParDo
                .named("FormatCounts")
                .of(new DoFn<KV<String, Long>, String>() {
                  @Override
                  public void processElement(ProcessContext c) {
                    c.output(c.element().getKey() + ": " + c.element().getValue());
                  }
                }));

        return results;
      }
    }
PIPELINE I/O
• We need to read/write data from/to external sources such as Google Cloud Storage or BigQuery tables
• Read transforms gather data from a source; Write transforms send data to a sink
• Read/write data from Cloud Storage
    p.apply(AvroIO.Read.named("ReadFromAvro")
        .from("gs://my_bucket/path/to/records-*.avro")
        .withSchema(schema));

    records.apply(AvroIO.Write.named("WriteToAvro")
        .to("gs://my_bucket/path/to/numbers")
        .withSchema(schema)
        .withSuffix(".avro"));
• Read and write transforms in the Dataflow SDKs
– Text files
– BigQuery tables
– Avro files
– Pub/Sub
• Custom I/O sources and sinks can be created
GETTING STARTED
ENVIRONMENT SETUP
• JDK 1.7 or higher
• Install the Google Cloud SDK. The gcloud tool is required to run the examples in the Dataflow SDK. https://cloud.google.com/sdk/?hl=es#nix
• Download the SDK examples from GitHub. https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples
LOG INTO GOOGLE DEV CONSOLE
• Enable billing (free for 60 days/$300)
• Enable Services & APIs
• Create a project for the example
• More info:
– https://cloud.google.com/dataflow/getting-started?hl=es#DevEnv
RUN LOCALLY
• Run the Dataflow SDK WordCount example locally
    mvn compile exec:java \
      -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
      -Dexec.args="--inputFile=/home/ubuntu/.bashrc --output=/tmp/output/"
RUN IN THE CLOUD - CREATE A PROJECT
RUN IN THE CLOUD - INSTALL GOOGLE CLOUD SDK
    curl https://sdk.cloud.google.com | bash
    gcloud init
RUN IN THE CLOUD - EXECUTE WORDCOUNT
• Compile & execute the WordCount example in the cloud:
    mvn compile exec:java \
      -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
      -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
      --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
      --runner=BlockingDataflowPipelineRunner"
– project is the ID of the project you just created
– stagingLocation is a Google Cloud Storage location of the form gs://bucket/path/to/staging/directory
– runner associates your code with a specific Dataflow pipeline runner
– Note: in Europe, you can only open a Google Cloud Platform account for business purposes (i.e., for economic benefit)
MANAGE YOUR POM
• The Google Cloud Dataflow artifact needs to be added to your POM:
    <dependency>
      <groupId>com.google.cloud.dataflow</groupId>
      <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
      <version>${project.version}</version>
    </dependency>
• Google services that are also used in the project need to be added too. E.g., BigQuery:
    <dependency>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-bigquery</artifactId>
      <!-- If updating version, please update the javadoc offlineLink -->
      <version>v2-rev238-1.20.0</version>
    </dependency>
GOOGLE CLOUD PLATFORM
• Google Compute Engine VMs, to provide job workers
• Google Cloud Storage, for reading and writing data
• Google BigQuery, for reading and writing data
APACHE FLINK
“Apache Flink is an open source platform for distributed stream and batch data processing.”
• Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
• Flink includes several APIs for creating applications that use the Flink engine:
– DataSet API for static data, embedded in Java, Scala, and Python
– DataStream API for unbounded streams, embedded in Java and Scala
– Table API with a SQL-like expression language, embedded in Java and Scala
• Flink also bundles libraries for domain-specific use cases:
– Machine Learning library
– Gelly, a graph processing API and library
APACHE FLINK: BACKGROUND
• 2010: "Stratosphere: Information Management on the Cloud" (funded by the German Research Foundation (DFG)) was started as a collaboration of Technical University Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam.
• March 2014: Flink, a fork of Stratosphere, became an Apache Incubator project.
• December 2014: Flink was accepted as an Apache top-level project.
APACHE FLINK COMPONENTS
APACHE FLINK FEATURES
• Streaming first (Kappa approach)
– High performance & low latency with little configuration
– Flows (events) vs. batches
– Exactly-once semantics for stateful computations
– Continuous streaming model with flow control and long-lived operators (no need to run new tasks as in Spark; ‘similar’ to Storm)
– Fault tolerance via lightweight distributed snapshots
• One runtime for streaming and batch processing
– Batch processing runs as a special case of streaming
– Own memory management (the goal of Spark's Tungsten project)
– Iterations and delta iterations
– Program optimizer
• APIs and libraries
– Batch processing applications (DataSet API)
– Streaming applications (DataStream API)
– Library ecosystem: machine learning, graph analytics and relational data processing
APACHE FLINK FEATURES
• DataSet: abstract representation of a finite, immutable collection of data of the same type, which may contain duplicates.
• DataStream: a possibly unbounded, immutable collection of data items of the same type.
• Transformation: data transformations turn one or more DataSets/DataStreams into a new DataSet/DataStream.
– Common: Map, FlatMap, MapPartition, Filter, Reduce, Union
– DataSets: Aggregate, Join, CoGroup
– DataStreams: window* transformations (Window, Window Reduce, …)
APACHE FLINK DATA SOURCES AND SINKS
• Data sources
– File-based
  • readTextFile(path), readTextFileWithValue(path), readFile(path), …
– Socket-based
  • socketTextStream (streaming)
– Collection-based
  • fromCollection(Seq), fromCollection(iterator), fromElements(elements: _*)
– Custom
  • addSource, e.g. from Kafka, …
• Data sinks (similar to Spark actions):
– writeAsText()
– writeAsCsv()
– print() / printToErr()
– write()
– writeToSocket()
– addSink, e.g. to Kafka
ENGINE COMPARISON
APACHE FLINK PROCESS MODEL
• Processes
– JobManager: coordinator of the Flink system
– TaskManagers: workers that execute parts of the parallel programs
APACHE FLINK EXECUTION MODEL
• As a software stack, Flink is a layered system. The different layers of the stack build on top of each other and raise the abstraction level of the program representations they accept:
– The runtime layer receives a program in the form of a JobGraph. A JobGraph is a generic parallel data flow with arbitrary tasks that consume and produce data streams.
– Both the DataStream API and the DataSet API generate JobGraphs through separate compilation processes. The DataSet API uses an optimizer to determine the optimal plan for the program, while the DataStream API uses a stream builder.
– The JobGraph is executed lazily, according to one of the deployment options available in Flink (e.g., local, remote, YARN)
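Lazy execution can be illustrated with a toy plan builder. This is a sketch of the general idea only and does not reflect Flink's actual JobGraph classes: `Plan`, `map`, `describe` and `execute` are hypothetical names. Calling `map` merely records a step; user functions run only when `execute` is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class LazyPlanSketch {
    /** Hypothetical, heavily simplified "job graph": an ordered chain of named steps. */
    static final class Plan {
        private final List<String> stepNames = new ArrayList<>();
        private Function<List<String>, List<String>> program = Function.identity();

        /** Records a step; the user function fn is NOT called here. */
        Plan map(String name, Function<String, String> fn) {
            stepNames.add(name);
            program = program.andThen(in -> {
                List<String> out = new ArrayList<>();
                for (String s : in) {
                    out.add(fn.apply(s));
                }
                return out;
            });
            return this;
        }

        /** The recorded steps, e.g. for inspection or optimization before running. */
        List<String> describe() {
            return new ArrayList<>(stepNames);
        }

        /** Runs all recorded steps over the input; only now are user functions invoked. */
        List<String> execute(List<String> input) {
            return program.apply(input);
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Plan plan = new Plan()
                .map("trim", String::trim)
                .map("upper", s -> { calls.incrementAndGet(); return s.toUpperCase(); });

        System.out.println(plan.describe());                          // [trim, upper]
        System.out.println(calls.get());                              // 0, nothing has run yet
        System.out.println(plan.execute(Arrays.asList(" a ", "b")));  // [A, B]
        System.out.println(calls.get());                              // 2
    }
}
```

Separating plan construction from execution is what gives an engine the chance to optimize the whole program and to choose where it runs before any data is touched.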
THE 8 REQUIREMENTS OF REAL-TIME STREAM PROCESSING
• Coined by Michael Stonebraker and others in http://cs.brown.edu/~ugur/8rulesSigRec.pdf
– Pipelining: Flink is built upon pipelining
– Replay: Flink acknowledges batches of records
– Operator state: flows pass by different operators
– State backup: Flink operators can keep state
– High-level language(s): Java, Scala, Python (beta)
– Integration with static sources
– High availability
FLINK STREAMING NOTES
• Hybrid runtime architecture
– Intermediate results are a handle to the data produced by an operator
– They coordinate the “handshake” between the data producer and the consumer
• The current DataStream API supports flexible windows
• Apache SAMOA on Flink for machine learning on streams
• Google Dataflow (stream functionality upcoming)
• Table API (window definition upcoming)
FLINK STREAMING NOTES II
• Flink supports different kinds of streaming windows
– Instant event-at-a-time
– Arrival time windows
– Event time windows
K. Tzoumas & S. Ewen – Flink Forward Keynote
http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote?qid=ced740f4-8af3-4bc7-8d7c-388eb26f463f&v=qf1&b=&from_search=5
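The difference between arrival time windows and event time windows shows up as soon as an event is delayed. The sketch below is plain Java rather than Flink code; `Event`, `sample` and `countPerWindow` are hypothetical names, and the three sample events are made up for illustration. It counts the same events per 10-second window twice, once by event time and once by arrival time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TimeWindowsSketch {
    /** Hypothetical event: when it happened (eventTimeMs) and when the system received it (arrivalTimeMs). */
    static final class Event {
        final String id;
        final long eventTimeMs;
        final long arrivalTimeMs;

        Event(String id, long eventTimeMs, long arrivalTimeMs) {
            this.id = id;
            this.eventTimeMs = eventTimeMs;
            this.arrivalTimeMs = arrivalTimeMs;
        }
    }

    /** Made-up sample data: event "b" happened at 9 s but only arrived at 12 s. */
    static List<Event> sample() {
        return Arrays.asList(
                new Event("a", 1_000L, 1_500L),
                new Event("b", 9_000L, 12_000L),
                new Event("c", 11_000L, 11_500L));
    }

    /** Counts events per fixed window (keyed by window start), by event time or by arrival time. */
    static Map<Long, Integer> countPerWindow(List<Event> events, long sizeMs, boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long t = useEventTime ? e.eventTimeMs : e.arrivalTimeMs;
            long windowStart = t - Math.floorMod(t, sizeMs);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPerWindow(sample(), 10_000L, true));  // {0=2, 10000=1}
        System.out.println(countPerWindow(sample(), 10_000L, false)); // {0=1, 10000=2}
    }
}
```

Event time windows give the same result no matter how late "b" arrives, at the cost of having to wait for or otherwise handle late data; arrival time windows are simpler but their results depend on delivery delays.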
GETTING STARTED (LOCALLY)
• Download the latest Apache Flink release and unzip it. https://flink.apache.org/downloads.html
– You don’t need to install Hadoop beforehand, unless you want to use HDFS
• Run a JobManager
– $FLINK_HOME/bin/start-local.sh
• Run an example (e.g. WordCount)
– $FLINK_HOME/bin/flink run ./examples/WordCount.jar /path/input_data /path/output_data
• Setup guide. https://ci.apache.org/projects/flink/flink-docs-release-0.10/quickstart/setup_quickstart.html
• If you want to develop with Flink, you need to add its dependencies to your build tool. E.g., Maven:
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-java</artifactId>
      <version>0.10.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-java</artifactId>
      <version>0.10.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients</artifactId>
      <version>0.10.0</version>
    </dependency>
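WORDCOUNT EXAMPLE
SCALA
    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    // get input data
    val text = env.readTextFile("/path/to/file")
    // split lines into lowercase words, pair each with 1, then sum per word
    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
    counts.writeAsCsv(outputPath, "\n", " ")
    // file sinks are lazy: the job only runs once execute() is called
    env.execute("WordCount")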
JAVA
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> text = env.readTextFile("/path/to/file");
    DataSet<Tuple2<String, Integer>> counts =
        // split up the lines in pairs (2-tuples) containing: (word,1)
        text.flatMap(new Tokenizer())
        // group by the tuple field "0" and sum up tuple field "1"
        .groupBy(0)
        .sum(1);
    counts.writeAsCsv(outputPath, "\n", " ");
    // file sinks are lazy: the job only runs once execute() is called
    env.execute("WordCount");

    // User-defined functions
    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
      @Override
      public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
        // normalize and split the line
        String[] tokens = value.toLowerCase().split("\\W+");
        // emit the pairs
        for (String token : tokens) {
          if (token.length() > 0) {
            out.collect(new Tuple2<String, Integer>(token, 1));
          }
        }
      }
    }
DEPLOYING
• Local
• Cluster (standalone)
• YARN
• Google Cloud
• Flink on Tez
• JobManager High Availability