Apache Beam: portable and evolutive data-intensive applications
Ismaël Mejía - @iemejia
Talend
Who am I?
@iemejia
Software Engineer
Apache Beam PMC / Committer
ASF member
Integration Software
Big Data / Real-Time
Open Source / Enterprise
We are hiring!
New products
Introduction: Big data state of affairs
Before Big Data (early 2000s)
The web pushed data analysis / infrastructure boundaries
● Huge data analysis needs (Google, Yahoo, etc.)
● Scaling DBs for the web (most companies)
DBs (and in particular RDBMSs) had too many constraints and were hard to operate at scale.
Solution: We need to go back to basics but in a distributed fashion
MapReduce, Distributed Filesystems and Hadoop
● Use distributed file systems (HDFS) to scale data storage horizontally
● Use MapReduce to execute tasks in parallel (performance)
● Ignore strict models (loosen the data representation to ease scaling, e.g. KV stores).
Great for huge dataset analysis / transformation
but…
● Too low-level for many tasks (early frameworks)
● Not suited for latency-dependent analysis
[Diagram: (Produce), (Prepare), Map, (Shuffle), Reduce]
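The Map, Shuffle and Reduce phases named above can be sketched on a single machine in plain Python. This is only an illustration of the phases, not Hadoop or Beam code; the word-count task and all function names are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(lines):
    # Map: emit one (word, 1) pair per word.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    # Shuffle: bring together all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: collapse the values of each key into one result.
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["big data", "Big Data state of affairs"])
```

In a real cluster the map and reduce steps run in parallel on many workers and the shuffle moves data between them over the network; here everything runs in one process.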
The distributed database Cambrian explosion
… and MANY others, all of them with different properties, utilities and APIs
Distributed databases API cycle
NoSQL, because SQL is too limited
NewSQL, let's reinvent our own thing
SQL is back, because it is awesome
(yes, it is an over-simplification, but you get it)
The fundamental problems are still the same
or worse (because of heterogeneity) …
● Data analysis / processing from systems with different semantics
● Data integration from heterogeneous sources
● Data infrastructure operational issues
Good old Extract-Transform-Load (ETL) is still an important need
The fundamental problems are still the same
"Data preparation accounts for about 80% of the work of data scientists" [1]
[2]
[1] Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task
[2] Sculley et al.: Hidden Technical Debt in Machine Learning Systems
and evolution continues ...
● Latency needs: Pseudo real-time needs, distributed logs.
● Multiple platforms: On-premise, cloud, cloud-native (also multi-cloud).
● Multiple languages and ecosystems: To integrate with ML tools
Software issues: New APIs, new clusters, different semantics,
… and of course MORE data stores!
Apache Beam
Apache Beam origin
[Diagram labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, Apache Beam]
What is Apache Beam?
Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
Beam Model: Generations Beyond MapReduce
Improved abstractions let you focus on your application logic
Batch and stream processing are both first-class citizens -- no need to choose.
Clearly separates event time from processing time.
Streaming - late data
[Figure: timeline from 8:00 to 14:00 with elements stamped 8:00]
Processing Time vs. Event Time
Beam Model: Asking the Right Questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
Beam Pipelines
PTransform
PCollection
The Beam Model: What is Being Computed?
// Java
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());
# Python
scores = (input
    | beam.CombinePerKey(sum))
The Beam Model: What is Being Computed?
Event Time: Timestamp when the event happened
Processing Time: Absolute program time (wall clock)
The Beam Model: Where in Event Time?
// Java
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
# Python
scores = (input
    | beam.WindowInto(FixedWindows(2 * 60))
    | beam.CombinePerKey(sum))
The Beam Model: Where in Event Time?
[Figure: input and output plotted by event time and processing time, 12:00 to 12:10]
● Split infinite data into finite chunks
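To make the idea of fixed windows concrete, here is a minimal plain-Python sketch (not the Beam API) that assigns each element to a two-minute window by its event time and sums values per key and window. The key names, values and timestamps are made up for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 2 * 60  # two-minute fixed windows


def window_start(event_time, size=WINDOW_SECONDS):
    # A fixed window covers [start, start + size); it is identified by its start.
    return event_time - (event_time % size)


def sum_per_key_and_window(events, size=WINDOW_SECONDS):
    # events: iterable of (key, value, event_time_in_seconds)
    totals = defaultdict(int)
    for key, value, event_time in events:
        totals[(key, window_start(event_time, size))] += value
    return dict(totals)


events = [
    ("team1", 5, 10),   # falls in window [0, 120)
    ("team1", 3, 119),  # falls in window [0, 120)
    ("team1", 4, 120),  # falls in window [120, 240)
    ("team2", 7, 130),  # falls in window [120, 240)
]
totals = sum_per_key_and_window(events)
```

Because the window is derived from the event time carried by each element, the arrival order of the elements does not change the result.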
The Beam Model: When in Processing Time?
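// Java (the SDK requires an allowed lateness and an accumulation mode once a trigger is set; the values here are assumptions)
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(Sum.integersPerKey());
# Python
scores = (input
    | beam.WindowInto(FixedWindows(2 * 60),
                      trigger=AfterWatermark(),
                      accumulation_mode=AccumulationMode.DISCARDING)
    | beam.CombinePerKey(sum))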
The Beam Model: How Do Refinements Relate?
// Java
PCollection<KV<String, Integer>> scores = input
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(10))  // assumption: required by the SDK, value not on the slide
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
# Python
scores = (input
    | beam.WindowInto(FixedWindows(2 * 60),
                      trigger=AfterWatermark(early=AfterProcessingTime(1 * 60),
                                             late=AfterCount(1)),
                      accumulation_mode=AccumulationMode.ACCUMULATING)
    | beam.CombinePerKey(sum))
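The difference between accumulating and discarding fired panes can be sketched in plain Python (this is not the Beam API). The assumption here is that one window fires three times (early, on time, late) and that each firing sees only the values that arrived since the previous one; the numbers are invented.

```python
def fire_panes(panes, accumulating):
    # panes: one list of newly arrived values per trigger firing of a window.
    emitted = []
    running = 0
    for new_values in panes:
        pane_sum = sum(new_values)
        running += pane_sum
        # Accumulating: each firing refines the total so far.
        # Discarding: each firing reports only what is new.
        emitted.append(running if accumulating else pane_sum)
    return emitted


panes = [[5, 3], [4], [9]]  # early firing, on-time firing, late firing
accumulated = fire_panes(panes, accumulating=True)
discarded = fire_panes(panes, accumulating=False)
```

With accumulation, a consumer can simply overwrite the previous result with the latest pane; with discarding, the consumer has to add the panes up itself.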
Customizing What Where When How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
Apache Beam - Programming Model
Element-wise: ParDo -> DoFn, MapElements, FlatMapElements, Filter, WithKeys, Keys, Values
Grouping: GroupByKey, CoGroupByKey, Combine -> Reduce, Sum, Count, Min / Max, Mean, ...
Windowing/Triggers
Windows: FixedWindows, GlobalWindows, SlidingWindows, Sessions
Triggers: AfterWatermark, AfterProcessingTime, Repeatedly, ...
The Apache Beam Vision
1. End users: who want to write pipelines in a language that’s familiar.
2. Library / IO connectors: who want to create generic transforms.
3. SDK writers: who want to make Beam concepts available in new languages.
4. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Beam Model: Fn Runners; Execution on Apache Flink, Apache Spark, Cloud Dataflow]
Runners
Runners “translate” the code into the target runtime
* Same code, different runners & runtimes
[Logos: Google Cloud Dataflow, Apache Flink, Apache Spark, Apache Apex, Alibaba JStorm, Apache Beam Direct Runner, Apache Storm, Apache Gearpump, Hadoop MapReduce, IBM Streams, Apache Samza; some marked WIP]
Beam IO (Data store connectors)
Filesystems: Google Cloud Storage, Hadoop FileSystem, AWS S3, Azure Storage (in progress)
File support: Text, Avro, Parquet, Tensorflow
Cloud databases: Google BigQuery, BigTable, DataStore, Spanner, AWS Redshift (in progress)
Messaging: Google Pubsub, Kafka, JMS, AMQP, MQTT, AWS Kinesis, AWS SNS, AWS SQS
Cache: Redis, Memcached (in progress)
Databases: Apache HBase, Cassandra, Hive (HCatalog), Mongo, JDBC
Indexing: Apache Solr, Elasticsearch
And other nice ecosystem tools / libraries:
Scio: Scala API by Spotify
Euphoria: Alternative Java API closer to Java 8 collections
Extensions: joins, sorting, probabilistic data structures, etc.
A simple evolution example
A log analysis simple example
Logs rotated and stored in HDFS and analyzed daily to measure user engagement.
Running on an on-premise Hadoop cluster with Spark
Data:
64.242.88.10 user01 07/Mar/2018:16:05:49 /news/abfg6f
64.242.88.10 user01 07/Mar/2018:16:05:49 /news/de0aff
...
Output:
user01, 32 urls, 2018/03/07
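As a rough illustration of the daily analysis, the following plain-Python sketch (not Beam code) parses log lines shaped like the sample and counts URL visits per user and day. The field layout (IP, user, timestamp, URL) is inferred from the two sample lines, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def parse_log(line):
    # Assumed layout, taken from the sample lines: ip user timestamp url
    ip, user, stamp, url = line.split()
    day = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S").strftime("%Y/%m/%d")
    return user, day, url


def visits_per_user_and_day(lines):
    # Count one visit per log line, keyed by (user, day).
    return Counter(parse_log(line)[:2] for line in lines)


def format_row(user, day, count):
    return f"{user}, {count} urls, {day}"


lines = [
    "64.242.88.10 user01 07/Mar/2018:16:05:49 /news/abfg6f",
    "64.242.88.10 user01 07/Mar/2018:16:05:49 /news/de0aff",
]
visits = visits_per_user_and_day(lines)
rows = [format_row(user, day, n) for (user, day), n in visits.items()]
```

With only the two sample lines as input, the single output row reports 2 URLs for user01 rather than the 32 shown on the slide.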
A log analysis simple example
PCollection<KV<User, Long>> numVisits =
    pipeline
        .apply(TextIO.read().from("hdfs://..."))
        .apply(MapElements.via(new ParseLog()))
        .apply(Count.perKey());
$ mvn exec:java -Dexec.mainClass=beam.example.loganalysis.Main -Pspark-runner \
    -Dexec.args="--runner=SparkRunner --sparkMaster=tbd-bench"
A log analysis simple example
Remember the software engineering maxim:
Requirements always change
We want to identify user sessions and calculate the number of URL visits per session
and we need quicker updates from a different source, a Kafka topic
and we will run this in a new Flink cluster
* Session = a sustained burst of activity
A log analysis simple example
PCollection<KV<User, Long>> numVisitsPerSession =
    pipeline
        .apply(
            KafkaIO.<Long, String>read()
                .withBootstrapServers("hostname")
                .withTopic("visits")
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())  // yields KV<Long, String> instead of KafkaRecord
        .apply(Values.<String>create())
        .apply(MapElements.via(new ParseLog()))
        .apply(Window.into(Sessions.withGapDuration(Duration.standardMinutes(10))))
        .apply(Count.perKey());
$ mvn exec:java -Dexec.mainClass=beam.example.loganalysis.Main -Pflink-runner \
    -Dexec.args="--runner=FlinkRunner --flinkMaster=realtime-cluster-master"
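What session windows with a 10-minute gap compute can be sketched in plain Python (again, not the Beam API): visits of the same user belong to one session as long as consecutive visits are less than the gap apart. The visit data and function name are assumptions for illustration.

```python
from collections import defaultdict

GAP_SECONDS = 10 * 60  # same gap as in the pipeline above


def visits_per_session(visits, gap=GAP_SECONDS):
    # visits: iterable of (user, event_time_in_seconds)
    by_user = defaultdict(list)
    for user, ts in visits:
        by_user[user].append(ts)

    result = {}
    for user, stamps in by_user.items():
        stamps.sort()
        counts = [1]  # the first visit opens the first session
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev < gap:
                counts[-1] += 1   # still inside the current session
            else:
                counts.append(1)  # gap reached: a new session starts
        result[user] = counts
    return result


visits = [("user01", 0), ("user01", 300), ("user01", 1000), ("user02", 50)]
sessions = visits_per_session(visits)
```

Unlike fixed windows, the session boundaries depend on the data itself, so each user ends up with a different set of windows.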
Apache Beam Summary
Expresses data-parallel batch and streaming algorithms with one unified API.
Cleanly separates data processing logic from runtime requirements.
Supports execution on multiple distributed processing runtime environments.
Integrates with the larger data processing ecosystem.
Current status and upcoming features
Beam is evolving too...
● Streaming SQL support via Apache Calcite
● Schema-aware PCollections: friendlier APIs
● Composable IO Connectors: Splittable DoFn (SDF) (New API)
● Portability: Open source runners support for language portability
● Go SDK: finally gophers become first-class citizens in Big Data
IO connector APIs are too strict
"Source" "Transform" "Sink"
A B
"Source": InputFormat / Receiver / SourceFunction / ...
Configuration: Filepattern, Query string, Topic name, …
"Sink": OutputFormat / Sink / SinkFunction / ...
Configuration: Directory, Table name, Topic name, …
SDF - Enable composable IO APIs
"Source" "Transform" "Sink"
A B
● My filenames come on a Kafka topic.
● I want to know which records failed to write
● I want to kick off another transform after writing
● I have a table per client + table of clients
Narrow APIs are not hackable
Splittable DoFn (SDF): Partial work via restrictions
Element: what work
Restriction: what part of the work
DoFn: Element
SDF: (Element, Restriction)
Dynamically Splittable
Design: s.apache.org/splittable-do-fn
* More details in this video by Eugene Kirpichov
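The (Element, Restriction) idea can be illustrated with a small plain-Python sketch. It is not the Splittable DoFn API: it assumes the element is a string standing in for a file and the restriction is an offset range, and it only shows how one element's work can be split into independent parts.

```python
def split_restriction(start, end, parts):
    # Split the offset range [start, end) into up to `parts` contiguous sub-ranges.
    size = end - start
    bounds = [start + (size * i) // parts for i in range(parts + 1)]
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def process(element, restriction):
    # Do only the part of the work on `element` described by `restriction`.
    lo, hi = restriction
    return element[lo:hi]


element = "abcdefghij"  # stands in for a file's content
restrictions = split_restriction(0, len(element), 3)
pieces = [process(element, r) for r in restrictions]
```

Each (element, restriction) pair could be handed to a different worker; putting the pieces back together in order yields the whole element.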
Language portability
● If I run a Beam Python pipeline on the Spark runner, is it translated to PySpark?
● Wait, can I execute Python on a Java-based runner?
● Can I use the Python TensorFlow transform from a Java pipeline?
● I want to connect to Kafka from Python but there is no connector; can I use the Java one?
No
[Diagram: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Beam Model: Fn Runners; Execution on Apache Flink, Apache Spark, Cloud Dataflow]
How do Java-based runners do work today?
[Diagram labels: SDK, Runner, Client, JobMaster, Cluster, Executor (Runner), Worker, Executor / Fn API, Pipeline, UDF]
Portability Framework
[Diagram labels: SDK, Job Server, Artifact Staging, Staging Location, DFS, Client, Job Master, Cluster, Executor (Runner), Docker Container, Worker, Executor / Fn API (Provision, Control, Data, Artifact Retrieval, State, Logging), Artifacts, SDK Harness, Pipeline protobuf, UDF]
Language portability advantages
● Isolation of user code
● Isolated configuration of user environment
● Multiple language execution
● Mix user code in different languages
● Makes creating new SDKs easier (homogeneous)
Issues
● Performance overhead (15% in early evaluation) via extra RPC + container
● Extra component (Docker)
● A bit more complex, but it is the price of reuse and consistent environments
Go SDK
func main() {
	// The input/output flags, CountWords and formatFn are defined elsewhere in the program.
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()

	lines := textio.Read(s, *input)
	counted := CountWords(s, lines)
	formatted := beam.ParDo(s, formatFn, counted)
	textio.Write(s, *output, formatted)

	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatalf("Failed to execute job: %v", err)
	}
}
First user SDK completely based on Portability API.
Contribute
A vibrant community of contributors + companies: Google, data Artisans, Lyft, Talend, yours?
● Try it and help us report (and fix) issues.
● Multiple Jiras that need to be taken care of.
● New feature requests, new ideas, more documentation.
● More SDKs (more languages): .NET, anyone? etc.
● More runners, improve existing ones, a native Go one maybe?
Beam is in perfect shape to jump in.
First stable release: 2.0.0, API stability contract (May 2017)
Current: 2.6.0
Learn More!
Apache Beam https://beam.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the mailing lists! [email protected]@beam.apache.org
Follow @ApacheBeam on Twitter
* The nice slides with animations were created by Tyler Akidau and Frances Perry and used with authorization.
Special thanks too to Eugene Kirpichov, Dan Halperin and Alexey Romanenko for ideas for this presentation.
Thanks