streaming way to webscale: how we scale bitly via streaming
Streaming Your Way to Web Scale: Scaling Bitly via Stream-Based Processing
All Things Open
October 23, 2014, 4:15pm
Peter [email protected]
@tpherndon
backend developer for Bitly
Scaling Bitly via Stream-Based Processing
or
Streaming Your Way to Web Scale
http://www.mongodb-is-web-scale.com/
(NSFW due to language)
Define streaming — public API surface that puts events into message queues that are consumed by web services
Need to be asynchronous
we use Tornado
and Go
we had an older queue, but replaced it with a newer one.
written in Go
command v. event messages
Notification of an event
allows interested listeners to respond as they require
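The difference between a command message and an event message can be sketched as below. The field names and the describe helper are hypothetical, chosen only for illustration; they are not Bitly's message format.

```python
import json

# A command tells one specific receiver what to do.
command = {"type": "command", "action": "send_welcome_email", "user": "u123"}

# An event only announces that something happened; any number of
# listeners may react to it however they need.
event = {"type": "event", "name": "user_signed_up", "user": "u123"}


def describe(raw):
    """Return a short human-readable description of a JSON message."""
    msg = json.loads(raw)
    if msg["type"] == "command":
        return "do %s for %s" % (msg["action"], msg["user"])
    return "%s happened for %s" % (msg["name"], msg["user"])


command_text = describe(json.dumps(command))
event_text = describe(json.dumps(event))
```

With events, the sender does not need to know who is listening, so new listeners can be added without touching the sender.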
λ Lambda architecture - batch + stream processing
Part Two, NSQ
NSQ
http://nsq.io
Look, Ma, real code!
https://github.com/bitly/nsq https://github.com/bitly/go-nsq https://github.com/bitly/pynsq https://github.com/jehiah/nsqauth-contrib
Servers
• nsqd
• nsqlookupd
• nsqadmin
• pynsqauthd
Parts is parts! (part 1)
Clients
• go-nsq
• pynsq
Parts is parts! (part 2)
Utilities
• nsq_tail
• nsq_to_file
• nsq_to_nsq
• nsq_stat
Parts is parts! (part 3)
two basic building blocks of distributed NSQ
nsqd & nsqlookupd
put into context, look at evolution of a website
Basic Web App
Web App
Database
Basic web app. In Python, Django + Postgres, Flask + Postgres, Tornado + Postgres
Scaling the Mountain (of Load)
Web App
Database
Web App Web App
First bottleneck: web layer
Cache Rules Everything Around Me
Database
Web App Web App Web App
Cache
Remove web layer bottleneck, next is DB, so add caching layer
You Want Me to Replicate You
Database Database
Web App Web App Web App
Cache
Works for a while, but DB requests still take too long, so replicate
Shards Here, Shards There
Web App Web App Web App
Cache
Database Database Database Database Database Database
…and then shard
It’s Off to Work I Go!
Database
Web App
Cache
Queue
Worker
But individual requests still take too long because each one does too much work. So add a message queue and a worker. In Python, Celery
Message From a Bottle(.py)
Database
Web App
Cache
Queue
Worker
Web app sends messages to queue
Working the Event
Database
Web App
Cache
Queue
Worker
Worker pulls message off queue and processes
Write Here, Write Now
Database
Web App
Cache
Queue
Worker
Worker writes results to database, file system, etc.
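The web app / queue / worker flow on the last few slides can be sketched in a single process. The slides name Celery; this sketch swaps in the standard library's queue and threading modules so it runs without a broker. The handle_request and worker names, and the results list standing in for the database, are assumptions for illustration.

```python
import queue
import threading

jobs = queue.Queue()
results = []  # stands in for the database or file system


def handle_request(payload):
    """Web app side: enqueue the slow work and return immediately."""
    jobs.put(payload)
    return "accepted"


def worker():
    """Worker side: pull messages off the queue and process them."""
    while True:
        payload = jobs.get()
        if payload is None:  # sentinel: shut down
            jobs.task_done()
            break
        results.append(payload.upper())  # placeholder for the real work
        jobs.task_done()


thread = threading.Thread(target=worker)
thread.start()
status = handle_request("shorten this link")
jobs.put(None)  # ask the worker to stop once the backlog is drained
jobs.join()
thread.join()
```

The request returns as soon as the message is queued; the expensive part happens later in the worker.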
Write Here, Write Now (redux)
Web App
Database
Worker
Queue
Instead, imagine worker writes results
Sending Out an SMS
Web App
Database
Worker
Queue
and web app writes event messages to queue local to the web service
Listen, listen, LISTEN
Web App
Database
Worker
Queue
Queue
but worker is listening to a queue running on another server
Workin’ On a Chain(ed) Gang
Web App
Database
Worker
Queue
Web App
Database
Worker
Queue
Web App
Database
Worker
Queue
Look it up!
Web App
Database
Worker
Queue
Queue
Worker finds queue with topic via nsqlookupd
if __name__ == "__main__":
    tornado.options.parse_command_line()
    logatron_client.setup()
    Reader(
        topic=settings.get('nsqd_output_topic'),
        channel='queuereader_spam_metrics',
        validate_method=validate_message,
        message_handler=count_spam_actions,
        lookupd_http_addresses=settings.get('nsq_lookupd')
    )
    run()
/<service>/queuereader_<service>.py
How do I find things?
Sending Out an SMS
Web App
Database
Worker
Queue
topic: ‘spam_api’
First time app writes to a TOPIC in the local nsqd
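One way an app can write to a topic on its local nsqd is nsqd's HTTP publish endpoint (POST /pub?topic=..., on port 4151 by default). The sketch below only builds the request, so it runs without an nsqd. The build_pub_request helper and the message values are hypothetical; the 'o' and 'l' field names are borrowed from the queuereader code in this deck.

```python
import json
import urllib.parse
import urllib.request


def build_pub_request(topic, message, nsqd_http="http://127.0.0.1:4151"):
    """Build (but do not send) an HTTP publish request for a local nsqd."""
    query = urllib.parse.urlencode({"topic": topic})
    url = "%s/pub?%s" % (nsqd_http, query)
    body = json.dumps(message).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


req = build_pub_request("spam_api", {"o": "+", "l": "example"})
# To actually publish, with an nsqd running locally:
#     urllib.request.urlopen(req, timeout=2)
```

Publishing to a topic that does not exist yet is what causes nsqd to create it.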
Sending Out an SMS
Web App
Database
Worker
Queue
topic: ‘spam_api’
nsqlookupd
nsqd creates the topic and registers it with nsqlookupd
Where Am I Again?
Web App
Database
Worker
nsqd topic: 'spam_counter'
nsqd topic: 'spam_api'
nsqlookupd topic: 'spam_api'?
A worker in another service that is looking for a topic asks nsqlookupd, which replies with the address of the nsqd hosting that topic
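The lookup step can be sketched as parsing the JSON that nsqlookupd returns from its HTTP /lookup?topic=... endpoint (port 4161 by default). This assumes the documented response shape, a "producers" list whose entries carry "broadcast_address" and "tcp_port"; older releases wrapped that in a "data" envelope, which the sketch also accepts. The producer_addresses helper and the sample hostname are hypothetical.

```python
def producer_addresses(lookup_response):
    """Extract (host, tcp_port) pairs from a decoded /lookup response."""
    # Older nsqlookupd versions nest the payload under "data".
    data = lookup_response.get("data", lookup_response)
    return [
        (producer["broadcast_address"], producer["tcp_port"])
        for producer in data.get("producers", [])
    ]


# Hand-written sample of what a lookup for 'spam_api' might return.
sample = {
    "channels": ["spam_counter"],
    "producers": [
        {"broadcast_address": "web01.example.com",
         "tcp_port": 4150,
         "http_port": 4151},
    ],
}
addresses = producer_addresses(sample)
```

A client library such as pynsq does this polling itself when given lookupd_http_addresses, so a queuereader normally never calls the endpoint directly.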
Talkin’ ‘Bout Something
Web App
Database
Worker
nsqd topic: 'spam_counter'
nsqd topic: 'spam_api'
channel: 'spam_counter'
queuereader connects to nsqd, registers a channel
Cross-Town Traffic
nsqd topic: 'spam_api'
Worker Worker Worker
channel: 'spam_counter'
Messages on a channel are divided among that channel's subscribers, which allows horizontal scaling
Channeling the Ghost
nsqd topic: 'spam_api'
Worker Worker Worker
channel: 'spam_counter'
Worker
channel: 'nsq_to_file'
Each channel receives a full copy of all messages on the topic
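The topic and channel semantics on the last two slides can be modelled in memory. This is a toy model, not NSQ code: the Topic and Channel classes are hypothetical, and the strict round-robin is a simplification (real nsqd hands messages to whichever subscribers are ready).

```python
class Channel:
    """One named channel on a topic, with any number of subscribers."""

    def __init__(self):
        self.inboxes = []
        self.delivered = 0

    def subscribe(self):
        inbox = []
        self.inboxes.append(inbox)
        return inbox

    def deliver(self, message):
        # Each message goes to exactly one subscriber of the channel.
        inbox = self.inboxes[self.delivered % len(self.inboxes)]
        inbox.append(message)
        self.delivered += 1


class Topic:
    """A topic that copies every message to each of its channels."""

    def __init__(self):
        self.channels = {}

    def channel(self, name):
        return self.channels.setdefault(name, Channel())

    def publish(self, message):
        # Every channel receives its own copy of every message.
        for channel in self.channels.values():
            channel.deliver(message)


topic = Topic()
workers = [topic.channel("spam_counter").subscribe() for _ in range(3)]
archiver = topic.channel("nsq_to_file").subscribe()
for number in range(6):
    topic.publish(number)
```

After six messages, the three 'spam_counter' workers hold two messages each, while the single 'nsq_to_file' subscriber holds all six.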
nsqadmin
How we manage our message queues
pynsqauthd
It’s made of PEOPLE!
https://github.com/jehiah/nsqauth-contrib
pynsq client
It’s still made of PEOPLE!
https://github.com/bitly/pynsq
• settings.py
• <service>_api.py
• queuereader_<service>.py
• README.md
/<service>
Queuereaders are part of streaming architecture
if __name__ == "__main__":
    tornado.options.parse_command_line()
    logatron_client.setup()
    Reader(
        topic=settings.get('nsqd_output_topic'),
        channel='queuereader_spam_metrics',
        validate_method=validate_message,
        message_handler=count_spam_actions,
        lookupd_http_addresses=settings.get('nsq_lookupd')
    )
    run()
/<service>/queuereader_<service>.py, 1 of 4
def validate_message(message):
    # '+' messages only need an 'l' field
    if message.get('o') == '+' and message.get('l'):
        return True
    # '-' messages need both 'l' and 'bl'
    if message.get('o') == '-' and message.get('l')\
            and message.get('bl'):
        return True
    return False
/<service>/queuereader_<service>.py, 2 of 4
def count_spam_actions(message, nsq_msg):
    key_section = statsd_keys[message['o']]
    key = key_section.get(message['l'], key_section['default'])
    statsd.incr(key)
    if key == 'remove_by_manual':
        key_section = statsd_keys['-manual']
        key = key_section.get(message['bl'], key_section['default'])
        statsd.incr(key)
    # mark the message as finished so it is not redelivered
    return nsq_msg.finish()
/<service>/queuereader_<service>.py, 3 of 4
if __name__ == "__main__":
    tornado.options.parse_command_line()
    logatron_client.setup()
    Reader(
        topic=settings.get('nsqd_output_topic'),
        channel='queuereader_spam_metrics',
        validate_method=validate_message,
        message_handler=count_spam_actions,
        lookupd_http_addresses=settings.get('nsq_lookupd')
    )
    run()
/<service>/queuereader_<service>.py, 4 of 4
utilities
• nsq_tail
• nsq_to_file
• to_nsq
• nsq_to_nsq
• nsq_stat
Parts is parts! (part 3, redux)
Features & Guarantees
(aka Trade-Offs)
Distributed, No SPOF || Horizontally Scalable || TLS || statsd integration || Easy to Deploy || Cluster Administration
Messages NOT Durable
Delivered at least once
Un-ordered Delivery
Eventually-Consistent Discovery
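Because delivery is at least once, the same message may reach a handler more than once, so handlers should tolerate duplicates. A minimal sketch of one approach, remembering message IDs already handled, follows. The IdempotentHandler class is hypothetical; its in-memory set is unbounded and local to one process, so a real service would need a bounded or shared store.

```python
class IdempotentHandler:
    """Wrap a handler so that repeated deliveries are processed once."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def __call__(self, message_id, body):
        if message_id in self.seen:
            return False  # duplicate delivery: skip the side effects
        self.handler(body)
        self.seen.add(message_id)
        return True


counts = {}


def count(body):
    counts[body] = counts.get(body, 0) + 1


handle = IdempotentHandler(count)
# "m1" is delivered twice, as can happen after a timeout and requeue.
outcomes = [handle("m1", "spam"), handle("m2", "spam"), handle("m1", "spam")]
```

Only the first delivery of each ID is counted, so the redelivered "m1" leaves the totals unchanged.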
A Thousand Points of Light (well, 58)
Whoa…
8.2 billion decodes per month
Streaming Architecture
• Easy to build new services
• Easy to scale individual components horizontally
• Durable in the face of single component failure
• Distributed
THINGS TO THINK ABOUT
• Monitoring, monitoring, monitoring
• Failure modes: how can things fail? How does your application as a whole handle the failure of individual components?
• Measurement: metrics show the range
• Timeouts: connection timeouts, DNS timeouts. A slow network is the same as a failed service
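The point about timeouts can be sketched as a wrapper that treats a slow call exactly like a failed one. The call_with_timeout helper and the "unavailable" fallback value are assumptions for illustration, not part of NSQ or of Bitly's code.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout, fallback):
    """Run func in a thread; return fallback if it is slow or raises."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout)
    except Exception:
        # A timeout and an outright error are handled the same way.
        return fallback
    finally:
        # Do not block on the slow call; let its thread finish on its own.
        pool.shutdown(wait=False)


def slow_service():
    time.sleep(0.3)
    return "late"


def broken_service():
    raise RuntimeError("connection refused")


fast = call_with_timeout(lambda: "ok", 1.0, "unavailable")
slow = call_with_timeout(slow_service, 0.05, "unavailable")
broken = call_with_timeout(broken_service, 1.0, "unavailable")
```

The caller sees the same fallback for the slow service and the broken one, which is the behavior the slide argues for.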
NSQ
http://nsq.io
Duke of URL
https://github.com/bitly/nsq https://github.com/bitly/go-nsq https://github.com/bitly/pynsq https://github.com/jehiah/nsqauth-contrib
Photo Credits
Web Scale - http://www.mongodb-is-web-scale.com
Waterfall - https://www.flickr.com/photos/desatur8/14949285342
Tornado - https://www.flickr.com/photos/indigente/798304
John de Lancie - https://www.flickr.com/photos/cayusa/1394930005
Ben Whishaw - https://www.flickr.com/photos/rossendalewadey/6032496676
Command Key - https://www.flickr.com/photos/klash/3175479797
iPhone6 Event - https://www.flickr.com/photos/notionscapital/15067798867
Wait for iPhone - https://www.flickr.com/photos/josh_gray/662814907
NSQ Logo - http://nsq.io
All other photos by T. Peter Herndon
Questions?
Peter [email protected]
@tpherndon