Scaling Marketplaces at Thumbtack QCon SF 2017 Nate Kupp Technical Infrastructure Data Eng, Experimentation, Platform Infrastructure, Security, Dev Tools


Post on 06-Jul-2020


Scaling Marketplaces at Thumbtack
QCon SF 2017

Nate Kupp, Technical Infrastructure
Data Eng, Experimentation, Platform Infrastructure, Security, Dev Tools


Infrastructure from early beginnings


“You see that?” my dad would say,

“That won’t last more than a few years.

They’re going to have to rip the whole thing out within a decade.”


You have choices. A lot of them.


There’s no right answer. Sorry.


ME: HARDWARE SOFTWARE

Semiconductor manufacturing research

Apple iOS battery life

Data Infrastructure @ Thumbtack

Technical Infrastructure @ Thumbtack: Data, core platform, A/B testing & dev tools

Yup, those are working transistors


Growth Hacking with Team Philippines

Customer: “I need a plumber!”

Pro A, Pro B, Pro C

- Is this a spam request? SPAM DETECTION
- Can we automate quoting? “QUOTING SERVICE”
- Which pros do we invite to quote? MATCHING


Hacky (especially in the early days) is fine, even necessary


Hacky is going to happen. With or without you.


“It has long been an axiom of mine that the little things are infinitely the most important.”

- Sherlock Holmes

[Diagram: Analysts; “small data” CSV files; upload web UI; “big data” in Hive/Impala]


Change doesn’t happen in a vacuum. Give people a way to get things done while you build or migrate.


Growth @ Thumbtack

2009: Thumbtack founded
2014: Raise $130mm
Dec. 2014: I join Thumbtack
Late 2015: Starting to hit “real” scale
Now: Continued growth


The Evolution of Thumbtack’s Infrastructure

2009 - 2014: The monolith
2015: Finally on AWS! Started using DynamoDB, Golang microservices
Mid-2015: Early data infrastructure

[Diagram: Website, Go services, Event Logging, Sqoop ETL, HDFS]


The Evolution of Thumbtack’s Infrastructure

Today: Mixed clouds, fully-containerized infrastructure, Terraform

Production Serving Infrastructure
- Application Layer (Docker on ECS): PHP, Go, Scala
- Storage Layer: PostgreSQL, DynamoDB, Elasticsearch

Data Infrastructure
- Processing: Scala/Spark on Dataproc
- Storage: GCS
- SQL: BigQuery


Production Serving Infrastructure

[Diagram. Application Layer: Website, Go services and other services on ECS. Storage Layer: pgbouncer; master + hot standbys; Latency Cluster, Throughput Cluster, ..., Sqoop Cluster; EBS + ZFS; WAL-E backups. Offline Data Processing: Sqoop ETL, DynamoDB ETL, Indexing Pipelines, Event Logging, PG Logging.]


Data Infrastructure

[Diagram. Application Layer: Thrift/JSON events. Ingest: Cloud Pub/Sub. Storage: Cloud Storage. Analytics: BigQuery. Spark ETL on Cloud Dataproc; full table Sqoop ETL (Sqoop on Cloud Dataproc); pipeline orchestration with Airflow; Search Indexing Pipelines; ML Pipelines.]

Batch ETL
- Cloud Dataproc: Job-scoped clusters on Dataproc 1.1 / Spark 2.0.2, Scala
- Data Retention: Data retained forever on GCS / BigQuery
- Custom DynamoDB ETL: Internal DynamoDB ETL pipelines, scales up/down read capacity
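The “scales up/down read capacity” point for the DynamoDB ETL can be illustrated with a context manager that raises a table's provisioned read capacity for the duration of an export and restores it afterwards. This is a minimal sketch, not Thumbtack's pipeline: `boosted_read_capacity` and its arguments are hypothetical, and `client` is only assumed to behave like a boto3 DynamoDB client (`describe_table`, `update_table`).

```python
from contextlib import contextmanager


@contextmanager
def boosted_read_capacity(client, table_name, etl_read_units):
    """Temporarily raise a table's provisioned read capacity for an ETL scan.

    `client` is assumed to expose describe_table/update_table like a boto3
    DynamoDB client; the function name and arguments are hypothetical.
    """
    throughput = client.describe_table(TableName=table_name)["Table"][
        "ProvisionedThroughput"
    ]
    original_read = throughput["ReadCapacityUnits"]
    write_units = throughput["WriteCapacityUnits"]

    def set_read_units(read_units):
        client.update_table(
            TableName=table_name,
            ProvisionedThroughput={
                "ReadCapacityUnits": read_units,
                "WriteCapacityUnits": write_units,
            },
        )

    # Only scale up if the ETL needs more than is already provisioned.
    scaled = etl_read_units > original_read
    if scaled:
        set_read_units(etl_read_units)
    try:
        yield
    finally:
        # Always scale back down, even if the scan fails part-way.
        if scaled:
            set_read_units(original_read)
```

A real pipeline would also need to wait for the table to become ACTIVE again after each update and respect DynamoDB's limits on capacity decreases; both are left out here.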


And it was great!

- Costs comparable to monolithic cluster

- Creating and destroying > 600 clusters per day

- > 12,000 BigQuery queries per workday


But, getting there took some real work

[Diagram: Storage moves from HDFS to Cloud Storage; Analytics on BigQuery]

Code Changes
- Add retries everywhere (many 500/503 error codes)
- Rewrite HDFS utilities
- Change all Parquet to Avro
- Educate team (100s of users) on new SQL dialect and web interface
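“Add retries everywhere” usually means wrapping each remote call in a retry loop with exponential backoff. The sketch below shows one way to do that; it is an illustration, not the team's code. `TransientError` is a hypothetical stand-in for a 500/503 response, and the attempt count and delays are assumed values.

```python
import random
import time


class TransientError(Exception):
    """Stands in for an HTTP 500/503 response from a cloud API (hypothetical)."""


def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.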


And, still needed to build interfaces to managed services

[Diagram. Website, Go services and other services. Ingest: Cloud Pub/Sub. Spark Streaming on Cloud Dataproc. Analytics: BigQuery.]

STREAMING PIPELINE / DATA INGEST
- Spark Streaming: Wrote custom Spark receiver for Cloud Pub/Sub, checkpointing WAL to local HDFS, no backpressure (yet)
- End-to-end Durability: Built internal solutions; achieving end-to-end durability guarantees here is hard
- BigQuery Streaming APIs: Streaming writes of O(100s of millions) of events / day into BigQuery tables
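For the BigQuery streaming writes, one piece that helps with retries is attaching a stable `insertId` to every row, which BigQuery's streaming API uses for best-effort de-duplication. The sketch below only builds request bodies in the shape of the `tabledata.insertAll` REST payload and sends nothing. The `event_id` field and the batch size of 500 are assumptions for illustration, not details from the talk.

```python
def insert_all_batches(events, batch_size=500):
    """Yield request bodies shaped like BigQuery's tabledata.insertAll payload.

    Each event is assumed to be a dict with a unique "event_id" (hypothetical
    field name), reused as insertId so a retried batch can be de-duplicated.
    """
    rows = []
    for event in events:
        rows.append({"insertId": str(event["event_id"]), "json": event})
        if len(rows) == batch_size:
            yield {"rows": rows}
            rows = []
    if rows:
        yield {"rows": rows}  # Flush the final partial batch.
```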


“Here be dragons”

+ Uptime: GCP uptime has actually been great
- Unexpected BigQuery API Changes: Google occasionally makes production changes which break us. When these happen, we can’t really do anything until fixed by Google or a workaround is identified.

April 2017

BigQuery Avro Ingest API Changes

Previously, a field marked as required by the Avro schema could be loaded into a table with the field marked nullable; this started failing.
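A pre-load check can surface this kind of mismatch before a load job fails. The sketch below compares an Avro record schema with a BigQuery table schema and lists fields that are required in Avro but NULLABLE in the table. The function name and the dict shapes are assumptions for illustration; the exact rule BigQuery began enforcing is not spelled out in the talk.

```python
def required_into_nullable(avro_schema, bq_fields):
    """Return names of fields that are required in Avro but NULLABLE in BigQuery."""

    def avro_required(field_type):
        # In Avro, a field is optional when its type is a union that includes "null".
        return not (isinstance(field_type, list) and "null" in field_type)

    # BigQuery treats a column with no explicit mode as NULLABLE.
    bq_modes = {f["name"]: f.get("mode", "NULLABLE") for f in bq_fields}
    return [
        f["name"]
        for f in avro_schema["fields"]
        if avro_required(f["type"]) and bq_modes.get(f["name"]) == "NULLABLE"
    ]
```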

October 2017

BigQuery Sharded Export Changes

Noticed many hung Dataproc clusters. Team identified workaround to disable BQ sharded export by setting mapred.bq.input.sharded.export.enable to false.
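The talk does not say where the team set this property. One common route, assumed here, is to pass Hadoop settings to a Spark job as `spark.hadoop.*` properties, which Spark copies into the job's Hadoop configuration. The helper below (a hypothetical name) just formats such settings as a comma-separated string of the kind accepted by the `--properties` flag of `gcloud dataproc jobs submit spark`.

```python
def dataproc_properties_flag(hadoop_props):
    """Format Hadoop settings as a comma-separated Spark properties string.

    Spark copies any `spark.hadoop.*` property into the job's Hadoop
    configuration, so each key gets that prefix here.
    """
    pairs = [
        f"spark.hadoop.{key}={value}" for key, value in sorted(hadoop_props.items())
    ]
    return ",".join(pairs)


# The property name and value come from the workaround described above.
flag = dataproc_properties_flag({"mapred.bq.input.sharded.export.enable": "false"})
```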


Managed services make some things easier… But only some things.

Go in with eyes wide open.


Our first managed service: DynamoDB

Go services

Website


DynamoDB table proliferation


GCP Migration & BigQuery

Why? A few reasons...
1. Fully-managed, serverless, petabyte-scale data warehouse
2. Nested/complex data types
3. Security & access controls
4. Self-serve interface to Google Sheets


Surprise dependencies

Exception: BigQuery job failed. Final error was:
{
  'reason': 'invalid',
  'message': "Could not parse '.' as int for field category_id (position 0) starting at location 340 ",
  'location': '/gdrive/home/Category taxonomy - master file'
}
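The error above points at a stray '.' in an integer column of a Drive-hosted file. One way to catch that class of problem earlier is to validate the column before anything downstream queries it. The sketch below is a hypothetical check over a CSV export of such a sheet; the function name and the CSV format are assumptions, and only the `category_id` field name comes from the error message.

```python
import csv
import io


def non_integer_cells(csv_text, column):
    """Return (line_number, value) for every cell in `column` that is not an int."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # Line 1 is the header.
        value = (row.get(column) or "").strip()
        try:
            int(value)
        except ValueError:
            bad.append((line_number, value))
    return bad
```

Running a check like this on a schedule, or before dependent jobs start, turns a mid-pipeline failure into an alert that names the offending row.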


When you make it really easy to deploy infrastructure...

...you get A LOT of random infrastructure


What does that mean for Thumbtack & Serverless?


Hacky is going to happen. With or without you.


Empower the team while limiting the wild west.


QUESTIONS?