An Elephant In The Room Nikolay Markov, Aligned Research Group CEBIT 2018 Hannover June 13, 2018

Upload: others

Post on 05-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

An Elephant In The Room

Nikolay Markov, Aligned Research Group
CEBIT 2018 Hannover

June 13, 2018

• Senior Data Engineer at Aligned Research Group
• Building analytical architectures, coding in Python, Go, C++ and some other languages
• Reading lectures, writing articles, organizing PyData events
• Don’t like:
  • “Who am I” slides
  • Irony
  • Bullet lists

An elephant in the room / Nikolay Markov

Who am I?

Every cook praises his own broth

Business

Engineering

Analytics

Data Science

Meet the elephant

MapReduce is an awesome paradigm

HDFS is nice for storing data infinitely

You can use Python!

You can even glue some SQL over it!
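For readers who have not met the paradigm praised above, here is a minimal single-process sketch of MapReduce as a word count. It only illustrates the map, shuffle and reduce steps; the function names are made up for this example and are not Hadoop's API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in memory (hypothetical helper, no cluster involved)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data between them, which is where most of the operational cost comes from.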

[Slide graphic: the word “elephant” rendered in distorted letters]

You write MapReduce instead of writing your business logic

Write data on disk at all times!

Ten layers of abstract logic on top of every operation

Python isn’t really fast, and you also need full-blown configuration management to deploy DS packages

Command line

https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html

“You should give a second computer to a person only after he or she learns how to use the first one” (c) Someone

[Here was a link to slides on data analysis in the CLI, but they are in Russian, so take a look at this amazing book instead]: https://www.datascienceatthecommandline.com/

Ad-hoc tasks can be solved 10 times faster

Command line

zcat file.tar.gz |
  csvjson --stream |
  jq -c '(if .createdDate != ""
          then .createdDate = (.standardRegCreatedDate | split(" ") | .[0:2] | join("T") + "Z")
          else .createdDate = "9999-01-01T00:00:00Z"
          end)
         | to_entries | map(select(.key | contains("rawText") | not)) | from_entries' |
  awk '{ print "{\"index\": {}}"; print $0 }' |
  parallel -j8 --pipe -N500 curl -s -XPOST localhost:9200/items/entry/_bulk --data-binary @- > /dev/null
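To make the pipeline easier to follow, here is a rough Python sketch of what its jq and awk stages do to each record. It assumes every record carries both date fields as strings and that standardRegCreatedDate looks like "2018-06-13 10:00:00 UTC"; that sample format and the function names are assumptions for illustration, not something the slides state.

```python
import json

PLACEHOLDER_DATE = "9999-01-01T00:00:00Z"


def normalize(record):
    """Mirror the jq filter: rebuild createdDate, then drop rawText* keys."""
    if record["createdDate"] != "":
        # Keep the first two space-separated parts (date, time) and join with "T".
        date_and_time = record["standardRegCreatedDate"].split(" ")[:2]
        created = "T".join(date_and_time) + "Z"
    else:
        created = PLACEHOLDER_DATE
    cleaned = {k: v for k, v in record.items() if "rawText" not in k}
    cleaned["createdDate"] = created
    return cleaned


def bulk_payload(records):
    """Mirror the awk step: put an action line before each document."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(normalize(record)))
    # The Elasticsearch bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The shell version still wins for ad-hoc work because parallel spreads the uploads over eight processes with no extra code; the sketch only spells out the per-record logic.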

Once a day?

Data bus

What do we know?

We built an analytic stack from scratch based on Apache Kafka; it processes over 1M binary structures per second

We wrote several analytic tools in Python, C++, Go and Java/Scala, and they run on a dozen machines instead of a hundred

We have our own graph analytics processing system that is about 5 times faster than the most advanced solutions available on the market

We implemented a number of pipelines following the “Infrastructure as Code” approach, using Kubernetes and NVIDIA-powered containers

To SQL or to NoSQL?

• Are you sure you need SQL? I mean, are you REALLY sure?

• Do you need full text search and scalability?

• What about integration with streaming data processing systems?

• How about graphs, column stores and the document-oriented approach?

But I do want SQL!

Presto is a distributed SQL processing engine that works with many data sources. And you also have SparkSQL!

CI/CD

Enough people?

How to reproduce an environment for testing?

Who is going to build these packages?

Which OS to support?

How to provide HA, zero downtime and easy upgrades, and handle the CAP trade-off?

Let’s hide it under the rug

• Just a thin wrapper on top of cgroups
• Resource limiting
• Monitoring
• Snapshots
• “Layering” architecture (AUFS/OverlayFS)
• Linux network namespaces
• Private Docker Registry and public Docker Hub

Every company for itself

Kubernetes

Task Flow and CI

• Airflow is kinda heavy and overloaded with features, but still doesn’t have a declarative API (some guys from Rambler actually tried to fix that in https://github.com/rambler-digital-solutions/airflow-declarative)

• Luigi is too simple and you can write a similar tool in an evening if you have enough beer/coffee/other fuel

• Mistral is mostly designed for OpenStack and its code isn’t really “self-explanatory”

https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/

https://jenkins.io/doc/book/pipeline/
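As a rough illustration of the "write a similar tool in an evening" remark about Luigi, here is a minimal dependency-driven task runner. The Task class and run function are a hypothetical mini-API invented for this sketch, not Luigi's; in particular, completion is tracked with an in-memory flag, whereas Luigi checks whether a task's output target exists.

```python
class Task:
    """A unit of work with upstream dependencies (hypothetical mini-API)."""

    def __init__(self, name, action, requires=()):
        self.name = name
        self.action = action            # zero-argument callable doing the work
        self.requires = tuple(requires)  # tasks that must finish first
        self.done = False


def run(task, _visiting=None):
    """Run dependencies depth-first, then the task itself, each at most once."""
    visiting = _visiting if _visiting is not None else set()
    if task.done:
        return
    if task.name in visiting:
        raise ValueError(f"dependency cycle at {task.name!r}")
    visiting.add(task.name)
    for dep in task.requires:
        run(dep, visiting)
    task.action()
    task.done = True
    visiting.discard(task.name)
```

A three-step pipeline would be wired as Task("load", ..., requires=[transform, extract]) and started with run(load). Everything that makes a production scheduler heavy (retries, persistence, distributed workers, a UI) is deliberately left out.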

ChatOps

Links and questions

We also run some really cool Deep Learning projects on top of all this!

http://alignedresearch.co/
http://alignedresearch.com/

https://www.linkedin.com/in/nickmarkov
https://twitter.com/enchantner

https://angel.co/nikolay-markov-4