Embulk - The Evolving Bulk Data Loader
![Page 1: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/1.jpg)
Embulk - The Evolving Bulk Data Loader
Sadayuki Furuhashi Founder & Software Architect
Embulk Meetup Tokyo #2
![Page 2: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/2.jpg)
A little about me…
Sadayuki Furuhashi
github: @frsyuki
Fluentd - Unified log collection infrastructure
Embulk - Plugin-based parallel ETL
Founder & Software Architect
![Page 3: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/3.jpg)
What’s Embulk?
> An open-source parallel bulk data loader
  > loads records from “A” to “B”
> using plugins
  > for various kinds of “A” and “B” (storage, RDBMS, NoSQL, cloud services, etc.)
> to make data integration easy.
  > which was very painful… (broken records, transactions (idempotency), performance, …)
![Page 4: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/4.jpg)
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 1. First attempt → fails
> 2. Write a script to clean the records
  • Convert “2015-01-27T19:05:00Z” → “2015-01-27 19:05:00 UTC”
  • Convert “\N” → “”
  • many more cleanings…
> 3. Second attempt → another error
  • Convert “Inf” → “Infinity”
> 4. Fix the script, retry, retry, retry…
> 5. Oh, some data got loaded twice!?
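The cleaning steps listed on this slide can be sketched as a small script. This is only an illustration of the kind of hand-written cleaner the slide describes, not Embulk code; the function names and the choice of which columns hold timestamps are assumptions made for the example.

```python
from datetime import datetime, timezone


def clean_timestamp(value):
    """Convert '2015-01-27T19:05:00Z' into '2015-01-27 19:05:00 UTC'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%d %H:%M:%S UTC")


def clean_field(value):
    """Apply the two literal replacements mentioned on the slide."""
    if value == r"\N":      # NULL marker in the source file
        return ""
    if value == "Inf":      # spelling the target database rejects
        return "Infinity"
    return value


def clean_row(row, timestamp_columns):
    """Clean one CSV row; timestamp_columns is a set of column indexes
    (a hypothetical parameter for this sketch)."""
    cleaned = []
    for index, value in enumerate(row):
        value = clean_field(value)
        if index in timestamp_columns and value:
            value = clean_timestamp(value)
        cleaned.append(value)
    return cleaned
```

Every new kind of broken record means another branch in a script like this, which is the maintenance burden the slide is pointing at.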
![Page 5: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/5.jpg)
The pains of bulk data loading
Example: load a 10GB CSV file to PostgreSQL
> 6. OK, the script worked.
> 7. Register it with cron to sync data every day.
> 8. One day… it fails with another error
  • Convert invalid UTF-8 byte sequences to U+FFFD
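Replacing invalid UTF-8 byte sequences with U+FFFD is what Python's built-in `replace` error handler does, so a minimal sketch of that last fix could look like this (the function name is made up for the example):

```python
def decode_lossy(raw):
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")
```

For example, `decode_lossy(b"caf\xff")` keeps `caf` and turns the stray `0xFF` byte into the replacement character instead of raising `UnicodeDecodeError`.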
![Page 6: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/6.jpg)
The pains of bulk data loading
Example: load 10GB CSV × 720 files
> Most scripts are slow.
  • People have little time to optimize bulk load scripts
> One file takes 1 hour → 720 files take 1 month (!?)
A lot of integration effort for each storage system and format:
> XML, JSON, Apache log format (+some custom), …
> SAM, BED, BAI2, HDF5, TDE, SequenceFile, RCFile…
> MongoDB, Elasticsearch, Redshift, Salesforce, …
![Page 7: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/7.jpg)
The problems:
> Data cleaning (normalization)
  > How to normalize broken records?
> Error handling
  > How to remove broken records?
> Idempotent retrying
  > How to retry without duplicated loading?
> Performance optimization
  > How to optimize the code or parallelize?
![Page 8: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/8.jpg)
[Diagram: Embulk bulk-loads data between systems through plugins on both sides. Systems shown: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis]
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Resuming
![Page 9: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/9.jpg)
Embulk’s Plugin Architecture
[Diagram: Embulk Core with plugin types Input, Filter (×2), Output, Executor, and Guess]
![Page 10: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/10.jpg)
Embulk’s Plugin Architecture
[Diagram: Embulk Core with plugin types FileInput, Decoder, Parser, Filter (×2), Output, Executor, and Guess]
![Page 11: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/11.jpg)
Embulk’s Plugin Architecture
[Diagram: Embulk Core with plugin types FileInput, Decoder, Parser, Filter (×2), Formatter, Encoder, FileOutput, Executor, and Guess]
![Page 12: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/12.jpg)
Execution overview
[Diagram: a Transaction, which runs on a single thread, fans out into taskCount Tasks, which run on multiple threads (or machines)]
Each Task receives its index and the shared task definition:
{ taskIndex: 0, task: {…} }
…
{ taskIndex: 2, task: {…} }
![Page 13: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/13.jpg)
Parallel execution
[Diagram: Tasks are taken from a task queue and run in parallel by threads]
(embulk-executor-local-thread)
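The task-queue-plus-threads model on this slide can be sketched with a thread pool. This is a conceptual illustration only, not how embulk-executor-local-thread is implemented; `run_task`, the `files` key, and `max_threads` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_index, task):
    """Hypothetical task body: pretend to load one input file."""
    return {"taskIndex": task_index, "file": task["files"][task_index]}


def run_in_parallel(task, task_count, max_threads=4):
    """Queue task_count tasks and let a fixed pool of threads drain the queue."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(run_task, i, task) for i in range(task_count)]
        # Results are collected in submission order, not completion order.
        return [future.result() for future in futures]
```

Every task gets the same shared `task` definition and differs only by its `taskIndex`, which is what lets the tasks run independently of each other.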
![Page 14: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/14.jpg)
Distributed execution
[Diagram: Tasks are taken from a task queue and run as map tasks on Hadoop]
(embulk-executor-mapreduce)
![Page 15: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/15.jpg)
Distributed execution (w/ partitioning)
[Diagram: Tasks are taken from a task queue and run on Hadoop as Map - Shuffle - Reduce]
(embulk-executor-mapreduce)
![Page 16: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/16.jpg)
Transaction control
fileInput.transaction {                 // file input plugin
  parser.transaction {                  // parser plugin
    filters.transaction {               // filter plugins
      formatter.transaction {           // formatter plugin
        fileOutput.transaction {        // file output plugin
          executor.transaction {        // executor plugin
            … (runs Task, Task, …)
          }
        }
      }
    }
  }
}
![Page 17: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/17.jpg)
Task configuration
fileInput.transaction { fileInputTask, taskCount →
  parser.transaction { parserTask, schema →
    filters.transaction { filterTasks, schema →
      formatter.transaction { formatterTask →
        fileOutput.transaction { fileOutputTask →
          executor.transaction { →
            task = { fileInputTask, parserTask, filterTasks, formatterTask, fileOutputTask }
            taskCount.times.inParallel { taskIndex →
              run(taskIndex, task)
            }
          }
        }
      }
    }
  }
}
taskCount is decided by the input.
schema is decided by the input, and may be modified by filters.
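The nested-callback shape of this pseudocode can be made concrete with a small runnable sketch. It is a simplified model, not Embulk's plugin API: only the input, parser, and one filter stage are shown, and every function name and config key here is hypothetical.

```python
def file_input_transaction(config, control):
    task = {"files": list(config["files"])}
    task_count = len(task["files"])        # taskCount is decided by the input
    return control(task, task_count)


def parser_transaction(config, control):
    task = {"delimiter": config.get("delimiter", ",")}
    schema = list(config["columns"])       # schema is decided by the input side
    return control(task, schema)


def filter_transaction(config, schema, control):
    dropped = config.get("drop", [])
    out_schema = [c for c in schema if c not in dropped]  # filters may modify the schema
    return control({"drop": dropped}, out_schema)


def run_transaction(config, run):
    """Nest the transactions, then run one task per index inside the innermost one."""
    def with_input(input_task, task_count):
        def with_parser(parser_task, schema):
            def with_filter(filter_task, out_schema):
                task = {"input": input_task, "parser": parser_task,
                        "filter": filter_task, "schema": out_schema}
                return [run(index, task) for index in range(task_count)]
            return filter_transaction(config, schema, with_filter)
        return parser_transaction(config, with_parser)
    return file_input_transaction(config, with_input)
```

Because each stage calls the next one from inside its own transaction, an outer stage only finishes after everything nested inside it has finished, which gives each plugin a single place to commit or clean up.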
![Page 18: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/18.jpg)
Task execution
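[Diagram: within each Task, data flows through the file input plugin, parser plugin, filter plugins, formatter plugin, and file output plugin]
fileInput.open()                      (file input plugin)
parser.run(fileInput, pageOutput)     (parser plugin)
formatter.open(fileOutput)            (formatter plugin)
fileOutput.open()                     (file output plugin)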
![Page 19: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/19.jpg)
Type conversion
Input type system → Embulk type system → Output type system
Input type system (e.g. PostgreSQL): boolean, integer, bigint, double precision, text, varchar, date, timestamp, timestamp with zone, …
Embulk type system: boolean, long, double, string, timestamp
Output type system (e.g. Elasticsearch): boolean, integer, long, float, double, string, array, geo point, geo shape, …
Input → Embulk conversion: input plugin (parser plugin if input is file-based)
Embulk → Output conversion: output plugin (formatter plugin if output is file-based)
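As an illustration of the input-side conversion, the sketch below maps PostgreSQL column types onto the five Embulk types named on the slide. The mapping table is an assumption chosen for the example, not the behavior of any actual Embulk plugin, and falling back to `string` for unknown types is likewise just one possible policy.

```python
# Hypothetical mapping from PostgreSQL types to the five Embulk types.
POSTGRESQL_TO_EMBULK = {
    "boolean": "boolean",
    "integer": "long",
    "bigint": "long",
    "double precision": "double",
    "text": "string",
    "varchar": "string",
    "date": "timestamp",
    "timestamp": "timestamp",
    "timestamp with time zone": "timestamp",
}


def to_embulk_type(postgresql_type):
    """Return the Embulk type for a PostgreSQL type, defaulting to string."""
    return POSTGRESQL_TO_EMBULK.get(postgresql_type.lower(), "string")


def to_embulk_schema(columns):
    """Convert a list of (name, postgresql_type) pairs into (name, embulk_type) pairs."""
    return [(name, to_embulk_type(pg_type)) for name, pg_type in columns]
```

The point of the small intermediate type system is that each plugin only converts to or from these five types, rather than every input having to know about every output.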
![Page 20: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/20.jpg)
What’s added since the first release?
• v0.3
  • Resuming
  • Filter plugin type
• v0.4
  • Plugin template generator
  • Incremental execution (ConfigDiff)
  • Isolated ClassLoaders for Java plugins
  • Polyglot command launcher
![Page 21: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/21.jpg)
What’s added since the first release?
• v0.6
  • Executor plugin type
  • Liquid template engine
• v0.7
  • EmbulkEmbed & Embulk::Runner
  • Plugin bundle (embulk-mkbundle)
  • JRuby 9000
  • Gradle v2.6
![Page 22: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/22.jpg)
Resuming
• Retries a failed transaction without retrying everything.
• Skips successful tasks by using information stored in a file by the previous transaction.
• embulk run config.yml -r resume-state.yml
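The idea of skipping tasks that already succeeded can be sketched as follows. This is a conceptual model only: the JSON state format, the function name, and the `done` key are assumptions for the example and do not reflect the actual contents of `resume-state.yml`.

```python
import json
import os


def run_with_resume(task_count, run_task, state_path):
    """Run tasks 0..task_count-1, skipping those recorded as done in state_path."""
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])
    try:
        for task_index in range(task_count):
            if task_index in done:
                continue              # succeeded in a previous attempt
            run_task(task_index)
            done.add(task_index)
    finally:
        # Persist progress even when a task raises, so the next run can resume.
        with open(state_path, "w") as f:
            json.dump({"done": sorted(done)}, f)
    return sorted(done)
```

If the third of four tasks fails, a second call with the same state file reruns only the third and fourth tasks.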
![Page 23: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/23.jpg)
Filter plugin type
• Filters rows out, filters columns out, or enriches the data.
• 18 plugins released.
![Page 24: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/24.jpg)
Plugin template generator
• Generates a template of a plugin.
• Generated code is ready to compile.
  > You modify & compile it to do your work.
• embulk new <category> <name>
![Page 25: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/25.jpg)
Incremental execution
• Stores the last file name or row in a file, and the next execution starts from there.
• Use case: sync new files on S3 to Elasticsearch every day.
• embulk run config.yml -o next-config.yml
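The "start from where the last run stopped" logic can be sketched like this. It assumes file names sort lexicographically in arrival order (for example, date-stamped names); the function name and the `last_path` parameter are chosen for the illustration and are not taken from Embulk's code.

```python
def select_new_files(files, last_path=None):
    """Return the files that sort after last_path, plus the new last_path to store."""
    new_files = sorted(f for f in files if last_path is None or f > last_path)
    next_last_path = new_files[-1] if new_files else last_path
    return new_files, next_last_path
```

A daily job would load `new_files`, then write `next_last_path` into the configuration used by the next run, so files that were already loaded are never picked up again.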
![Page 26: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/26.jpg)
Isolated ClassLoaders for Java plugins
• Embulk can load multiple versions of Java plugins.
![Page 27: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/27.jpg)
Plugin Version Conflicts
[Diagram: on a single Java Runtime under Embulk Core, embulk-input-s3.jar uses aws-sdk.jar v1.9 while embulk-output-redshift.jar uses aws-sdk.jar v1.10]
Version conflicts!
![Page 28: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/28.jpg)
Multiple Classloaders in JVM
[Diagram: on a single Java Runtime under Embulk Core, Class Loader 1 holds embulk-input-s3.jar with aws-sdk.jar v1.9 and Class Loader 2 holds embulk-output-redshift.jar with aws-sdk.jar v1.10]
Isolated environments
![Page 29: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/29.jpg)
Polyglot launcher script
• embulk.jar is a jar file.
• embulk.jar is a shell script.
• embulk.jar is a bat script.
• It sets JVM options to improve performance.
• ./embulk run abc
![Page 30: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/30.jpg)
Executor plugin type
• embulk-executor-mapreduce executes tasks in a distributed environment.
![Page 31: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/31.jpg)
Liquid template engine
• A config file can include variables.
![Page 32: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/32.jpg)
EmbulkEmbed & Embulk::Runner
• Embed Embulk in an application.
![Page 33: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/33.jpg)
Plugin bundle
• Uses fixed versions of plugins.
• embulk mkbundle my-project
• embulk run -b my-project config.yml
![Page 34: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/34.jpg)
Gradle v2.6
• Continuous compiling.
• “embulk migrate .” upgrades the Gradle version of your plugin project.
• ./gradlew -t build
![Page 35: Embulk - 進化するバルクデータローダ](https://reader031.vdocuments.mx/reader031/viewer/2022012308/5873237b1a28ab673e8b7ea9/html5/thumbnails/35.jpg)
Future plan
• v0.8
  • JSON type (issue #306)
  • Error plugin type (#27, #124)
  • More (or less) concurrency for output (#231)
• v0.9
  • More Guess (#242, #235)
  • Multiple jobs using a single config file (#167)