An elephant in the room
files.messe.de/abstracts/87488_uni_cebit2018_nikolay_markov_alig…
TRANSCRIPT
An elephant in the room / Nikolay Markov
Who am I?
• Senior Data Engineer at Aligned Research Group
• Building analytical architectures, coding in Python, Go, C++ and some other languages
• Giving lectures, writing articles, organizing PyData events
• Don’t like:
  • “Who am I” slides
  • Irony
  • Bullet lists
Every cook praises his own broth
Business
Engineering
Analytics
Data Science
Meet the elephant
MapReduce is an awesome paradigm
HDFS is nice for storing data infinitely
You can use Python!
You can even glue some SQL over it!
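For readers who have not met the paradigm, here is a minimal single-process sketch of the map, shuffle and reduce phases as a word count. It is plain Python, not Hadoop code; the function names are illustrative assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["an elephant in the room", "the elephant"]))
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data between them; the sketch only shows the shape of the computation.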
Meet an elephant
You write MapReduce instead of writing your business logic
Data is written to disk at all times!
Ten layers of abstract logic on top of every operation
Python isn’t really fast, and you also need full-blown configuration management to deploy DS packages
Command line
https://aadrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
“You should give a second computer to a person only after he or she learns how to use the first one” (c) Someone
[Here was a link to slides on data analysis in CLI, but they are in Russian, so take a look at this amazing book instead]: https://www.datascienceatthecommandline.com/
Ad-hoc tasks can be solved 10x faster
Command line
# Convert CSV rows to JSON, normalize createdDate, drop rawText fields, bulk-index into Elasticsearch 500 lines at a time
zcat file.tar.gz \
  | csvjson --stream \
  | jq -c 'if .createdDate != "" then .createdDate = (.standardRegCreatedDate | split(" ") | .[0:2] | join("T") + "Z") else .createdDate = "9999-01-01T00:00:00Z" end | to_entries | map(select(.key | contains("rawText") | not)) | from_entries' \
  | awk '{ print "{\"index\": {} }","\n" $0 }' \
  | parallel -j8 --pipe -N500 curl -s -XPOST localhost:9200/items/entry/_bulk --data-binary @- > /dev/null
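The jq and awk steps of the pipeline above are dense, so here is a rough Python sketch of the same per-record logic. The field names come from the slide; the helper names, the sample record and the handling of a missing createdDate (treated as empty) are assumptions made for illustration.

```python
import json

FALLBACK_DATE = "9999-01-01T00:00:00Z"


def transform(record):
    """Normalize createdDate and drop every key whose name contains 'rawText'."""
    record = dict(record)  # work on a copy, leave the input untouched
    # Assumption: a missing createdDate is handled like an empty one.
    if record.get("createdDate", "") != "":
        # "2018-06-11 10:20:30 UTC" -> ["2018-06-11", "10:20:30"] -> "2018-06-11T10:20:30Z"
        parts = record["standardRegCreatedDate"].split(" ")[:2]
        record["createdDate"] = "T".join(parts) + "Z"
    else:
        record["createdDate"] = FALLBACK_DATE
    return {key: value for key, value in record.items() if "rawText" not in key}


def to_bulk_lines(records):
    """Yield an action line followed by the document line for each record."""
    for record in records:
        yield json.dumps({"index": {}})
        yield json.dumps(transform(record))


if __name__ == "__main__":
    # Made-up sample record, not real data from the talk.
    sample = {
        "createdDate": "11.06.2018",
        "standardRegCreatedDate": "2018-06-11 10:20:30 UTC",
        "rawText": "...",
        "id": "1",
    }
    print("\n".join(to_bulk_lines([sample])))
```

The shell version still wins for ad-hoc work because every stage runs as its own process and `parallel` fans out the uploads; the sketch is only meant to make the transformation readable.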
What do we know?
We built an analytic stack from scratch based on Apache Kafka; it processes over 1M binary structures per second
We wrote several analytic tools in Python, C++, Go and Java/Scala, and they work on a dozen machines instead of a hundred
We have our own graph analytic processing system that is about 5 times faster than the most advanced solutions available on the market
We implemented a number of pipelines following the “Infrastructure as Code” approach, using Kubernetes and NVIDIA-powered containers
To SQL or to NoSQL?
• Are you sure you need SQL? I mean, are you REALLY sure?
• Do you need full text search and scalability?
• What about integration with streaming data processing systems?
• How about graphs, column stores and the document-oriented approach?
But I do want SQL!
Presto is a distributed SQL processing engine that works with many data sources. And you also have SparkSQL!
Enough people?
How to reproduce an environment for testing?
Who is going to build these packages?
Which OS to support?
How to provide HA, zero downtime and easy upgrades, and handle the CAP tradeoff?
Let’s hide it under the rug
• Just a thin wrapper on top of cgroups
• Resource limiting
• Monitoring
• Snapshots
• “Layering” architecture (AUFS/OverlayFS)
• Linux network namespaces
• Private Docker Registry and public Docker Hub
Task Flow and CI
• Airflow is kinda heavy and overloaded with features, but still doesn’t have a declarative API (some guys from Rambler actually tried to fix that in https://github.com/rambler-digital-solutions/airflow-declarative)
• Luigi is too simple and you can write a similar tool in an evening if you have enough beer/coffee/other fuel
• Mistral is mostly designed for OpenStack and its code isn’t really “self-explanatory”
https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
https://jenkins.io/doc/book/pipeline/
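To give a feel for the "write it in an evening" remark about Luigi, here is a toy dependency runner: a task counts as done once its output file exists, and dependencies run first. The method names loosely echo Luigi's, but this is a hypothetical sketch, not Luigi's API, and the Extract and Sort tasks are invented for the example.

```python
import os
import tempfile


class Task:
    """A unit of work that is complete once its output file exists."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


def build(task, log=None):
    """Run dependencies depth-first, skipping tasks whose output already exists."""
    log = [] if log is None else log
    if task.complete():
        return log
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)
    return log


class Extract(Task):
    """Hypothetical first step: write some unsorted rows."""

    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return os.path.join(self.workdir, "raw.txt")

    def run(self):
        with open(self.output(), "w") as target:
            target.write("3\n1\n2\n")


class Sort(Task):
    """Hypothetical second step: sort the rows produced by Extract."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sorted.txt")

    def run(self):
        with open(self.requires()[0].output()) as source:
            rows = sorted(source.read().split())
        with open(self.output(), "w") as target:
            target.write("\n".join(rows) + "\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(build(Sort(workdir)))  # first call runs both tasks
        print(build(Sort(workdir)))  # second call finds the output and runs nothing
```

What such a toy lacks is exactly what the heavier tools add: scheduling, retries, parallel workers and a UI.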
Links and questions
We also run some really cool Deep Learning projects on top of all this!
http://alignedresearch.co/
http://alignedresearch.com/
https://www.linkedin.com/in/nickmarkov
https://twitter.com/enchantner
https://angel.co/nikolay-markov-4