MySQL and Search at Craigslist

Jeremy Zawodny

http://craigslist.org/

http://jeremy.zawodny.com/blog/

Post on 15-Jan-2015

Page 1: My Sql And Search At Craigslist

MySQL and Search at Craigslist

Jeremy Zawodny

http://craigslist.org/

http://jeremy.zawodny.com/blog/

Page 2: My Sql And Search At Craigslist

Who Am I?

● Creator and co-author of High Performance MySQL

● Creator of mytop

● Perl Hacker

● MySQL Geek

● Craigslist Engineer (as of July, 2008)

– MySQL, Data, Search, Perl

● Ex-Yahoo (Perl, MySQL, Search, Web Services)

Page 3: My Sql And Search At Craigslist

What is Craigslist?

Page 4: My Sql And Search At Craigslist

What is Craigslist?

● Local Classifieds

– Jobs, Housing, Autos, Goods, Services

● ~500 cities world-wide

● Free

– Except for jobs in ~18 cities and brokered apartments in NYC

– Over 20B pageviews/month

– 50M monthly users

– 50+ countries, multiple languages

– 40+M ads/month, 10+M images

Page 5: My Sql And Search At Craigslist

What is Craigslist?

● Forums

– 100M posts

– 100s of forums

Page 6: My Sql And Search At Craigslist

Technical and other Challenges

● High ad churn rate

– Post half-life can be short

● Growth

● High traffic volume

● Back-end tools and data analysis needs

● Growth

● Need to archive postings... forever!

– 100s of millions, searchable

● Internationalization and UTF-8

Page 7: My Sql And Search At Craigslist

Technical and other Challenges

● Small Team

– Fires take priority

– Infrastructure gets creaky

– Organic code and schema growth over years

● Growth

● Lack of abstractions

– Too much embedded SQL in code

● Documentation vs. Institutional Knowledge

– “Why do we have things configured like this?”

Page 8: My Sql And Search At Craigslist

Goals

● Use Open Source

● Keep infrastructure small and simple

– Lower power is good!

– Efficiency all around

– Do more with less

● Keep site easy and approachable

– Don't overload with features

– People are easily confused

Page 9: My Sql And Search At Craigslist

Craigslist Internals Overview

[Architecture diagram. Labels: Load Balancer; Read Proxy Array; Write Proxy Array; Perl + memcached; Web Read Array; Apache 1.3 + mod_perl; Object Cache; Read DB Cluster; Perl + memcached; MySQL 5.0.xx; Search Cluster; Sphinx; ...]

Not Included:

– user db, image db

– async tasks, email

– accounting, internal tools

– and more!

Page 10: My Sql And Search At Craigslist

Vertical Partitioning: Roles

[Diagram. Labels: Users, Classifieds, Forums, Stats, Archive; Write, Read, Long, Trash]

Page 11: My Sql And Search At Craigslist

Vertical Partitioning

● Different roles have different access patterns

– Sub-roles based on query type

● Easier to manage and scale

● Logical, self-contained data

● Servers may not need to be as big/fast/expensive

● Difficult to do retroactively

● Various named db “handles” in code
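
The named db “handles” idea can be sketched as a small registry keyed by role and sub-role. This is only an illustration: the slides mention Perl, Python is used here for brevity, and the role names, sub-role names, and host names below are all hypothetical placeholders rather than Craigslist's actual configuration.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    role: str       # which logical data set (hypothetical names)
    sub_role: str   # which query type the servers behind it are tuned for
    host: str       # placeholder host name, not a real server


# Hypothetical registry: one entry per (role, sub-role) pair.
HANDLES = {
    ("classifieds", "write"): Handle("classifieds", "write", "cl-master.example"),
    ("classifieds", "read"): Handle("classifieds", "read", "cl-read.example"),
    ("classifieds", "long"): Handle("classifieds", "long", "cl-long.example"),
    ("forums", "read"): Handle("forums", "read", "forums-read.example"),
}


def db_handle(role: str, sub_role: str) -> Handle:
    """Look up the named handle for a role and query type."""
    try:
        return HANDLES[(role, sub_role)]
    except KeyError:
        raise LookupError(f"no handle configured for {role}/{sub_role}") from None
```

Application code then asks for `db_handle("classifieds", "long")` instead of hard-coding a server, which is what makes it possible to give slow queries their own machines later.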

Page 12: My Sql And Search At Craigslist

Horizontal Partitioning: Hydra

[Diagram: a client connected to cluster_01, cluster_02, cluster_03, ... cluster_N]

Page 13: My Sql And Search At Craigslist

Horizontal Partitioning: Hydra

● Need to retrofit a lot of code

● Need non-blocking Perl MySQL client

● Wrapped http://code.google.com/p/perl-mysql-async/

● Eventually can size DB boxes based on price/power and adjust mapping function(s)

– Choose hardware first

– Make the db “fit”

● Archiving lets us age a cluster instead of migrating its data to a new one.
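
A minimal sketch of the two ideas on this slide, a mapping function that can be weighted per cluster and a non-blocking client that queries several clusters at once. Everything here is an assumption for illustration: the weights, the CRC32-based mapping, and the `query_cluster` stand-in are not taken from Hydra, and Python's asyncio stands in for the non-blocking Perl client.

```python
import asyncio
import zlib

# Hypothetical weights: a bigger box gets a larger share of the key space.
CLUSTER_WEIGHTS = {"cluster_01": 1, "cluster_02": 1, "cluster_03": 2}


def build_slots(weights):
    """Expand weights into a slot table, e.g. weight 2 yields two slots."""
    slots = []
    for name in sorted(weights):
        slots.extend([name] * weights[name])
    return slots


SLOTS = build_slots(CLUSTER_WEIGHTS)


def cluster_for(key: str) -> str:
    """Mapping function: stable hash of the key modulo the slot count."""
    return SLOTS[zlib.crc32(key.encode("utf-8")) % len(SLOTS)]


async def query_cluster(cluster: str, keys: list) -> dict:
    """Stand-in for a non-blocking query; a real client would await the network."""
    await asyncio.sleep(0)
    return {key: f"row for {key} from {cluster}" for key in keys}


async def fetch_many(keys):
    """Group keys by cluster, query all clusters concurrently, merge the results."""
    by_cluster = {}
    for key in keys:
        by_cluster.setdefault(cluster_for(key), []).append(key)
    parts = await asyncio.gather(
        *(query_cluster(cluster, ks) for cluster, ks in by_cluster.items())
    )
    merged = {}
    for part in parts:
        merged.update(part)
    return merged
```

Calling `asyncio.run(fetch_many(["101", "102", "103"]))` returns one merged dictionary. Note that with a plain modulo mapping like this one, changing the weights remaps existing keys, so adjusting the mapping function on a live system would also need a plan for data that already exists.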

Page 14: My Sql And Search At Craigslist

Search Evolution

● Problem: Users want to find stuff.

● Solution: Use MySQL Full Text.

● ...time passes...

● Problem: MySQL Full Text Doesn't Scale!

● Solution: Use Sphinx.

● ...time passes...

● Problem: Sphinx doesn't scale!

● Solution: Patch Sphinx.

Page 15: My Sql And Search At Craigslist

MySQL Full-Text Problems

● Hitting invisible limits

– CPU not pegged, Memory available

– Disk I/O not unreasonable

– Locking / Mutex contention? Probably.

● MyISAM has occasional crashing / corruption

● 5 clusters of 5 machines

– Partitioning based on city and category

– All “hand balanced” and high-maintenance

● ~30M queries/day

– Close to limits

Page 16: My Sql And Search At Craigslist

Sphinx: My First CL Project

● Sphinx is designed for text search

● Fast and lean C++ code

● Forking model scales well on multi-core

● Control over indexing, weighting, etc.

● Also spent some time looking at Apache Solr

Page 17: My Sql And Search At Craigslist

Search Implementation Details

● Partitioning based on cities (each has a numeric id)

● Attributes vs. Keywords

● Persistent Connections

– Custom client and server modifications

● Minimal stopword List

● Partition into 2 clusters (1 master, 4 slaves)
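
To make the “attributes vs. keywords” distinction concrete, here is a toy in-memory index with one instance per city id. It is a sketch of the general idea only, not of Sphinx internals: the `Posting` fields, the stopword list, and the filter names are all hypothetical.

```python
from dataclasses import dataclass

STOPWORDS = {"the", "a", "of"}  # hypothetical minimal stopword list


@dataclass
class Posting:
    doc_id: int
    city_id: int
    category_id: int
    price: int
    text: str


def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


class CityIndex:
    """One index per city: keyword postings plus numeric attributes."""

    def __init__(self):
        self.keyword_to_docs = {}
        self.attrs = {}

    def add(self, p):
        for word in tokenize(p.text):
            self.keyword_to_docs.setdefault(word, set()).add(p.doc_id)
        self.attrs[p.doc_id] = {"category_id": p.category_id, "price": p.price}

    def search(self, query, category_id=None, max_price=None):
        words = tokenize(query)
        if not words:
            return []
        # Keywords: intersect the document sets of every query word.
        hits = set.intersection(*(self.keyword_to_docs.get(w, set()) for w in words))
        # Attributes: numeric filters applied to the keyword matches.
        results = []
        for doc_id in sorted(hits):
            attrs = self.attrs[doc_id]
            if category_id is not None and attrs["category_id"] != category_id:
                continue
            if max_price is not None and attrs["price"] > max_price:
                continue
            results.append(doc_id)
        return results


indexes = {}  # city_id -> CityIndex


def index_posting(p):
    """Route a posting to the index for its city (partitioning by city id)."""
    indexes.setdefault(p.city_id, CityIndex()).add(p)
```

A query such as `indexes[10].search("red bike", max_price=500)` touches only one city's index, matches the words as keywords, and treats the price as a numeric attribute filter.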

Page 18: My Sql And Search At Craigslist

Sphinx Incremental Indexing

● Re-index every N minutes

● Use main + delta strategy

– Adopted as: index + today + delta

– One set per city (~500 * 3)

● Slaves handle live queries, update via rsync

● Need lots of FDs

● Use all 4 cores to index

● Every night, perform “daily merge”

● Generate config files via Perl
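
With one index + today + delta set per city, the configuration is large enough that generating it makes sense. The sketch below shows the shape of such a generator. The slide says the real one is written in Perl; the naming scheme, the data directory, and the stanza layout here are assumptions, and the output is loosely modeled on a Sphinx index stanza rather than being a complete, valid configuration.

```python
INDEX_KINDS = ("index", "today", "delta")  # the three sets named on the slide


def index_names(city_id: int) -> list:
    """Hypothetical naming scheme: one index name per kind for a city."""
    return [f"city{city_id}_{kind}" for kind in INDEX_KINDS]


def render_config(city_ids, data_dir="/var/sphinx"):
    """Emit one stanza per index; data_dir and the layout are illustrative only."""
    stanzas = []
    for city_id in city_ids:
        for name in index_names(city_id):
            stanzas.append(f"index {name}\n{{\n    path = {data_dir}/{name}\n}}\n")
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Roughly 500 cities with 3 indexes each gives about 1,500 stanzas.
    total = sum(len(index_names(city_id)) for city_id in range(500))
    print(total, "indexes")
    print(render_config([1]))
```

A search for one city would then query all three of that city's indexes, with the small delta rebuilt every few minutes and the nightly merge folding the day's postings into the main index.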

Page 19: My Sql And Search At Craigslist

Sphinx Incremental Indexing

Page 20: My Sql And Search At Craigslist

Sphinx Issues

● Merge bugs [fixed]

● File descriptor corruption [fixed]

● Persistent connections [fixed]

– Overhead of fork() was substantial in our testing

– 200 queries/sec vs. 1,000 queries/sec per box

● Missing attribute updates [unreported]

● Bogus docids in responses

● We need to upgrade to latest Sphinx soon

● Andrew and team have been excellent!

Page 21: My Sql And Search At Craigslist

Search Project Results

● From 25 MySQL Boxes to 10 Sphinx

● Lots more headroom!

● New Features

– Nearby Search

● No seizing or locking issues

● 1,000+ qps during peak w/room to grow

● 50M queries per day w/steady growth

● Cluster partitioning built but not needed (yet?)

● Better separation of code
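
A quick back-of-the-envelope check puts the daily totals from the slides on the same scale as the peak figure. The per-box numbers assume load is spread evenly across machines, which is a simplification (the MySQL clusters were hand balanced by city and category).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day):
    """Convert a daily query count into an average queries-per-second rate."""
    return queries_per_day / SECONDS_PER_DAY


mysql_avg = average_qps(30_000_000)    # ~30M queries/day on 25 MySQL boxes
sphinx_avg = average_qps(50_000_000)   # 50M queries/day on 10 Sphinx boxes
peak_to_average = 1_000 / sphinx_avg   # 1,000+ qps at peak vs. the daily average

print(f"MySQL full text: {mysql_avg:.0f} qps average, {mysql_avg / 25:.1f} per box")
print(f"Sphinx: {sphinx_avg:.0f} qps average, {sphinx_avg / 10:.1f} per box")
print(f"Peak is at least {peak_to_average:.1f}x the Sphinx average")
```

The averages come out to roughly 347 and 579 queries per second, so the Sphinx cluster handles about two thirds more traffic on well under half the machines.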

Page 22: My Sql And Search At Craigslist

Sphinx Wishlist

● Efficient delete handling (kill lists)

● Non-fatal “missing” indexes

● Index dump tool

● Live document add/change/delete

● Built-in replication

● Stats and counters

● Text attributes

● Protocol checksum

Page 23: My Sql And Search At Craigslist

Data Archiving, Replication, Indexes

● Problem: We want to keep everything.

● Solution: Archive to an archive cluster.

● Problem: Archiving is too painful. Index updates are expensive! Slaves affected.

● Solution: Archive with home-grown eventually consistent replication.

Page 24: My Sql And Search At Craigslist

Data Archiving: OOB Replication

● Eventual Consistency

● Master process

– SET SQL_LOG_BIN=0

– Select expired IDs

– Export records from live master

– Import records into archive master

– Delete expired from live master

– Add IDs to list
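
The master-side steps above can be sketched as one function. This is a hedged illustration, not the actual tool: Python's built-in sqlite3 stands in for MySQL so the example runs anywhere, and the table names (`postings`, `archived_ids`), the `expires` column, and keeping the ID list on the live master are all assumptions.

```python
import sqlite3

POSTINGS = "CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT, expires INTEGER);"
EXPIRED_LIST = (
    "CREATE TABLE archived_ids "
    "(seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER);"
)


def make_db(schema):
    """Create an in-memory database with the given schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(schema)
    return db


def archive_expired(live, archive, now, batch=1000):
    """One pass of the master process; returns the number of rows archived."""
    # On MySQL the session would first run SET SQL_LOG_BIN=0 so these
    # statements stay out of the binary log and never reach the slaves.
    rows = live.execute(
        "SELECT id, body, expires FROM postings "
        "WHERE expires <= ? ORDER BY id LIMIT ?",
        (now, batch),
    ).fetchall()
    if not rows:
        return 0
    # Import into the archive before deleting, so a crash between the two
    # steps leaves a duplicate rather than a lost record.
    archive.executemany("INSERT OR REPLACE INTO postings VALUES (?, ?, ?)", rows)
    archive.commit()
    ids = [(row[0],) for row in rows]
    live.executemany("DELETE FROM postings WHERE id = ?", ids)
    # The sequence-numbered list is what slaves later use to catch up.
    live.executemany("INSERT INTO archived_ids (posting_id) VALUES (?)", ids)
    live.commit()
    return len(rows)
```

Because the deletes bypass normal replication, the slaves still hold the expired rows after this pass; the sequence-numbered list is what lets each of them remove those rows later at its own pace.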

Page 25: My Sql And Search At Craigslist

Data Archiving: OOB Replication

● Slave process

– One per MySQL slave

– Throttled to minimize impact

– State kept on slave (clone friendly)

– Simple logic: select expired IDs added since my sequence number; delete expired records; update local “last seen” sequence number
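
The slave-side loop is simple enough to sketch in full. As with any illustration of a home-grown tool, the details are assumptions: sqlite3 stands in for MySQL, and the `archived_ids` list, the `archive_state` table, and reading the list over a separate master connection are hypothetical choices.

```python
import sqlite3
import time

MASTER_SCHEMA = (
    "CREATE TABLE archived_ids "
    "(seq INTEGER PRIMARY KEY AUTOINCREMENT, posting_id INTEGER);"
)
SLAVE_SCHEMA = (
    "CREATE TABLE postings (id INTEGER PRIMARY KEY, body TEXT);"
    "CREATE TABLE archive_state (last_seq INTEGER NOT NULL);"
    "INSERT INTO archive_state VALUES (0);"
)


def slave_catch_up(master, slave, batch=500, pause=0.0):
    """Apply archived-ID list entries newer than this slave's last seen sequence.

    Returns the number of list entries processed.
    """
    last_seen = slave.execute("SELECT last_seq FROM archive_state").fetchone()[0]
    processed = 0
    while True:
        rows = master.execute(
            "SELECT seq, posting_id FROM archived_ids "
            "WHERE seq > ? ORDER BY seq LIMIT ?",
            (last_seen, batch),
        ).fetchall()
        if not rows:
            return processed
        slave.executemany(
            "DELETE FROM postings WHERE id = ?", [(row[1],) for row in rows]
        )
        last_seen = rows[-1][0]
        # The sequence number is stored on the slave itself, so a slave cloned
        # from this one resumes from the same point.
        slave.execute("UPDATE archive_state SET last_seq = ?", (last_seen,))
        slave.commit()
        processed += len(rows)
        time.sleep(pause)  # throttle so live queries are not starved
```

Small batches plus a pause between them are what keep the deletes from competing with live traffic, at the cost of the slave lagging the master's archive pass for a while (hence “eventual consistency”).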

Page 26: My Sql And Search At Craigslist

Long Term Data Archiving

● Schema coupling is bad

– ALTER TABLE takes forever

– Lots of NULLs flying around

● CouchDB or similar long-term?

– Schema-free feels like a good fit

● Tested some home grown solutions already

● Separate storage and indexing?

– Indexing with Sphinx?
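
One way to picture “separate storage and indexing” with schema-free documents is the toy below. It is not CouchDB's or Sphinx's API, just a sketch under stated assumptions: documents are stored as JSON blobs keyed by id, and a separate index covers only the hypothetical `title` and `body` fields.

```python
import json


class ArchiveStore:
    """Schema-free storage: each posting is a JSON document keyed by id."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        self._docs[doc_id] = json.dumps(doc, sort_keys=True)

    def get(self, doc_id):
        return json.loads(self._docs[doc_id])


class ArchiveIndex:
    """Separate index: maps words from chosen text fields to document ids."""

    def __init__(self, fields=("title", "body")):
        self.fields = fields
        self._words = {}

    def add(self, doc_id, doc):
        for field in self.fields:
            # A missing field contributes nothing; no NULL column is needed.
            for word in str(doc.get(field, "")).lower().split():
                self._words.setdefault(word, set()).add(doc_id)

    def lookup(self, word):
        return sorted(self._words.get(word.lower(), set()))
```

Old and new postings can carry different fields without an ALTER TABLE, and the index can be rebuilt or replaced without touching the stored documents.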

Page 27: My Sql And Search At Craigslist

Drizzle, XtraDB, Future Stuff

● CouchDB looks very interesting. Maybe for archive?

● XtraDB / InnoDB plugin

– Better concurrency

– Better tuning of InnoDB internals

● libdrizzle + Perl

– DBI/DBD may not fit an async model well

– Can talk to both MySQL and Drizzle!

● Oracle buying Sun?!?!

Page 28: My Sql And Search At Craigslist

We're Hiring!

● Work in San Francisco

● Flexible, Small Company

● Excellent Benefits

● Help Millions of People Every Week

● We Need Perl/MySQL Hackers

● Come Help us Scale and Grow

Page 29: My Sql And Search At Craigslist

Questions?