high performance django 1

David Cramer http://www.davidcramer.net/ http://www.ibegin.com/

High Performance Django

Curse

•  Peak daily traffic of approx. 15m pages, 150m hits.

•  Average monthly traffic 120m pages, 6m uniques.

•  Python, MySQL, Squid, memcached, mod_python, lighty.

•  Most developers came strictly from PHP (myself included).

•  12 web servers, 4 database servers, 2 squid caches.

iBegin

•  Massive amounts of data, 100m+ rows.

•  Python, PHP, MySQL, mod_wsgi.

•  Small team of developers.

•  Complex database partitioning/synchronization tasks.

•  Attempting to not branch off of Django.

Areas of Concern

•  Database (ORM)

•  Webserver (Resources, Handling Millions of Reqs)

•  Caching (Invalidation, Cache Dump)

•  Template Rendering (Logic Separation)

•  Profiling

Tools of the Trade

•  Webserver (Apache, Nginx, Lighttpd)

•  Object Cache (memcached)

•  Database (MySQL, PostgreSQL, …)

•  Page Cache (Squid, Nginx, Varnish)

•  Load Balancing (Nginx, Perlbal)

How We Did It

•  “Primary” web servers serving Django using mod_python.

•  Media servers using Django on lighttpd.

•  Static served using additional instances of lighttpd.

•  Load balancers passing requests to multiple Squids.

•  Squids passing requests to multiple web servers.

Lessons Learned

•  Don’t be afraid to experiment. You’re not limited to one setup.

•  mod_wsgi is a huge step forward from mod_python.

•  Serving static files using different software can help.

•  Send proper HTTP headers where they are needed.

•  Use services like S3, Akamai, Limelight, etc.

Webserver Software

Python Scripts

•  Apache (wsgi, mod_py, fastcgi)

•  Lighttpd (fastcgi)

•  Nginx (fastcgi)

Reverse Proxies

•  Nginx

•  Squid

•  Varnish

Static Content

•  Apache

•  Lighttpd

•  Tinyhttpd

•  Nginx

Software Load Balancers

•  Nginx

•  Perlbal

Database (ORM)

•  Won’t make your queries efficient. Make your own indexes.

•  select_related() can be good, as well as bad.

•  Inherited ordering (Meta: ordering) will get you.

•  Hundreds of queries on a page is never a good thing.

•  Know when to not use the ORM.
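The point about indexes can be checked outside the ORM. The following is a minimal sketch, assuming an in-memory SQLite database and a hypothetical `poll` table (neither comes from the talk, which used MySQL); it asks the database for its query plan before and after an index is added.

```python
import sqlite3

# Hypothetical table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE poll (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER)"
)
conn.executemany(
    "INSERT INTO poll (name, category_id) VALUES (?, ?)",
    [("poll %d" % i, i % 50) for i in range(5000)],
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


sql = "SELECT name FROM poll WHERE category_id = 7"
plan_before = query_plan(sql)   # full table scan
conn.execute("CREATE INDEX poll_category_idx ON poll (category_id)")
plan_after = query_plan(sql)    # lookup through the new index

print(plan_before)
print(plan_after)
```

The same habit applies to MySQL or PostgreSQL with their own `EXPLAIN` output: look at the plan for the queries the ORM generates rather than assuming they are efficient.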

Handling JOINs

    # Django 1.0-era API; max_length values are placeholders.
    from django.contrib.auth.models import User
    from django.db import models
    from django.shortcuts import render_to_response
    from django.template import RequestContext

    class Category(models.Model):
        name = models.CharField(max_length=100)
        created_by = models.ForeignKey(User)

    class Poll(models.Model):
        name = models.CharField(max_length=100)
        category = models.ForeignKey(Category)
        created_by = models.ForeignKey(User)

    # We need to output a page listing all Polls with
    # their name and category's name.

    def a_bad_example(request):
        # We have just caused Poll to JOIN with User and Category,
        # which will also JOIN with User a second time.
        my_polls = Poll.objects.all().select_related()
        return render_to_response('polls.html', locals(), RequestContext(request))

    def a_good_example(request):
        # Use select_related explicitly in each case.
        my_polls = Poll.objects.all().select_related('category')
        return render_to_response('polls.html', locals(), RequestContext(request))
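What an explicit `select_related('category')` saves can be seen without Django. This is a rough sketch, assuming an in-memory SQLite database and hypothetical `poll` and `category` tables standing in for the models; the `run` helper is made up for counting queries. It compares one lookup per row against a single JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE poll (
        id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES category(id)
    );
""")
conn.executemany(
    "INSERT INTO category (id, name) VALUES (?, ?)",
    [(i, "category %d" % i) for i in range(1, 6)],
)
conn.executemany(
    "INSERT INTO poll (name, category_id) VALUES (?, ?)",
    [("poll %d" % i, (i % 5) + 1) for i in range(20)],
)

query_log = []


def run(sql, params=()):
    """Execute a query and record it so round trips can be counted."""
    query_log.append(sql)
    return conn.execute(sql, params).fetchall()


def lazy_listing():
    """One query for the polls, then one more per poll for its category."""
    rows = []
    for name, category_id in run("SELECT name, category_id FROM poll"):
        category_name = run(
            "SELECT name FROM category WHERE id = ?", (category_id,)
        )[0][0]
        rows.append((name, category_name))
    return rows


def joined_listing():
    """The same listing fetched with a single JOIN."""
    return run(
        "SELECT poll.name, category.name FROM poll "
        "JOIN category ON category.id = poll.category_id"
    )


lazy_rows = lazy_listing()
lazy_queries = len(query_log)
del query_log[:]
joined_rows = joined_listing()
joined_queries = len(query_log)

print(lazy_queries, joined_queries)  # prints: 21 1
```

Both functions return the same rows; only the number of round trips differs, and that is the number that grows with the length of the page.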

Template Rendering

•  Sandboxed engines are typically slower by nature.

•  Keep logic in views and template tags.

•  Be aware of performance in loops, and groupby (regroup).

•  Loaded templates can be cached to avoid disk reads.

•  Switching template engines is easy, but may not give you any worthwhile performance gain.
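Caching loaded templates is easy to picture with the standard library alone. The sketch below is not Django's template loader: it assumes `string.Template` as a stand-in engine and a hypothetical `load_template` helper, and shows a file being read and parsed once and then reused.

```python
import os
import tempfile
from string import Template

disk_reads = 0
_template_cache = {}


def load_template(path):
    """Return a parsed template, touching the disk only on the first request."""
    global disk_reads
    if path not in _template_cache:
        with open(path) as handle:
            _template_cache[path] = Template(handle.read())
        disk_reads += 1
    return _template_cache[path]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "poll.txt")
    with open(path, "w") as handle:
        handle.write("$name ($category)")

    # Three renders, but the file is opened only once.
    rendered = [
        load_template(path).substitute(name="poll %d" % i, category="general")
        for i in range(3)
    ]

print(rendered, disk_reads)
```

A long-running process needs some way to drop stale entries when template files change; this sketch leaves that out.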

Template Engines

Caching

•  Two flavors of caching: object cache and browser cache.

•  Django provides built-in support for both.

•  Invalidation is a headache without a well-thought-out plan.

•  Caching isn’t a solution for slow loading pages or improper indexes.

•  Use a reverse proxy in between the browser and your web servers: Squid, Varnish, Nginx, etc.

Cache With a Plan

•  Build your pages to use proper cache headers.

•  Create a plan for object cache expiration, and invalidation.

•  For typical web apps you can serve the same cached page for both anonymous and authenticated users.

•  Contain commonly used querysets in managers for transparent caching and invalidation.
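As an illustration of "proper cache headers", here is a small sketch of the two headers a browser or a reverse proxy such as Squid looks at. The helper name `cache_headers` and its parameters are assumptions for this example, not part of Django or of the talk.

```python
import time
from email.utils import formatdate


def cache_headers(max_age, shared=True, now=None):
    """Build response headers that let browsers and proxies cache a page.

    shared=True marks the page as cacheable by shared caches (public);
    shared=False restricts it to the user's own browser (private).
    """
    now = time.time() if now is None else now
    scope = "public" if shared else "private"
    return {
        "Cache-Control": "%s, max-age=%d" % (scope, max_age),
        "Expires": formatdate(now + max_age, usegmt=True),
    }


# A page that is identical for every visitor can be cached by the proxy.
headers = cache_headers(300, now=0)
print(headers)

# A page with per-user content should stay out of shared caches.
private_headers = cache_headers(60, shared=False, now=0)
print(private_headers)
```

In a Django view these values would be set on the response object; the `now` argument exists only so that the output is reproducible.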

Cache Commonly Used Items

    from django.core.cache import cache
    from django.db import models

    def my_context_processor(request):
        # We access object_list every time we use our context processors so
        # it makes sense to cache this, no?
        cache_key = 'mymodel:all'
        object_list = cache.get(cache_key)
        if object_list is None:
            object_list = MyModel.objects.all()
            cache.set(cache_key, object_list)
        return {'object_list': object_list}

    # Now that we are caching the object list we are going to want to invalidate it
    class MyModel(models.Model):
        name = models.CharField(max_length=100)  # max_length is a placeholder

        def save(self, *args, **kwargs):
            # save it before you update the cache
            super(MyModel, self).save(*args, **kwargs)
            cache.set('mymodel:all', MyModel.objects.all())
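The same read-through and invalidate-on-write pattern can be tried without Django or memcached. In this sketch `DictCache` is a tiny in-process stand-in for memcached and `CachedStore` is a hypothetical class playing the role a manager would. It deletes the key on write rather than rewriting it, a variation on the slide's approach that is named here as such.

```python
class DictCache:
    """Tiny in-process stand-in for memcached (an assumption for this sketch)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


class CachedStore:
    """Keeps reads and writes in one place so invalidation cannot be forgotten."""

    cache_key = "mymodel:all"

    def __init__(self, cache):
        self.cache = cache
        self.rows = []          # stands in for the database table
        self.backend_reads = 0  # how often we had to hit the "database"

    def all(self):
        cached = self.cache.get(self.cache_key)
        if cached is None:
            self.backend_reads += 1
            cached = list(self.rows)
            self.cache.set(self.cache_key, cached)
        return cached

    def save(self, name):
        # Write first, then invalidate, so the next read repopulates the cache.
        self.rows.append(name)
        self.cache.delete(self.cache_key)


store = CachedStore(DictCache())
store.save("first")
first_read = store.all()
second_read = store.all()      # served from the cache
store.save("second")           # invalidates the cached list
third_read = store.all()
print(third_read, store.backend_reads)
```

Deleting the key costs one extra cache miss after each write, while rewriting it on save (as the slide does) keeps the cache warm at the price of doing the query on every save.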

Profiling Code

•  Finding the bottleneck can be time consuming.

•  Tools exist to help identify common problematic areas.

–  cProfile/profile Python modules.

–  PDB (Python Debugger)
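For a quick look at what `cProfile` reports before wiring it into middleware, the following standalone sketch profiles a made-up `handler` function (a placeholder for a view) and prints the five most expensive entries by cumulative time.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately wasteful helper so it shows up in the profile."""
    return sum(i * i for i in range(n))


def handler():
    """Placeholder for a view function."""
    return [slow_sum(2000) for _ in range(50)]


profiler = cProfile.Profile()
result = profiler.runcall(handler)  # runs handler() under the profiler

out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)  # top five entries only
report = out.getvalue()
print(report)
```

Sorting by cumulative time tends to surface the call path that dominates a request, which is usually the first thing worth knowing.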

Profiling Code With cProfile

    import sys

    try:
        import cProfile as profile
    except ImportError:
        import profile
    try:
        from cStringIO import StringIO
    except ImportError:
        from StringIO import StringIO

    from django.conf import settings

    class ProfilerMiddleware(object):
        def can(self, request):
            # Only profile in DEBUG, when ?prof is present, and for internal IPs.
            return settings.DEBUG and 'prof' in request.GET and (
                not settings.INTERNAL_IPS or
                request.META['REMOTE_ADDR'] in settings.INTERNAL_IPS)

        def process_view(self, request, callback, callback_args, callback_kwargs):
            if self.can(request):
                self.profiler = profile.Profile()
                args = (request,) + callback_args
                return self.profiler.runcall(callback, *args, **callback_kwargs)

        def process_response(self, request, response):
            if self.can(request):
                self.profiler.create_stats()
                out = StringIO()
                # print_stats() writes to stdout, so capture it temporarily.
                old_stdout, sys.stdout = sys.stdout, out
                self.profiler.print_stats(1)
                sys.stdout = old_stdout
                response.content = '<pre>%s</pre>' % out.getvalue()
            return response

http://localhost:8000/?prof

Profiling Database Queries

    try:
        from cStringIO import StringIO
    except ImportError:
        from StringIO import StringIO

    from django.conf import settings
    from django.db import connection

    class DatabaseProfilerMiddleware(object):
        def can(self, request):
            return settings.DEBUG and 'dbprof' in request.GET and (
                not settings.INTERNAL_IPS or
                request.META['REMOTE_ADDR'] in settings.INTERNAL_IPS)

        def process_response(self, request, response):
            if self.can(request):
                out = StringIO()
                out.write('time\tsql\n')
                total_time = 0
                # query['time'] is a string, so sort on its float value (slowest first).
                for query in sorted(connection.queries, key=lambda x: float(x['time']), reverse=True):
                    total_time += float(query['time']) * 1000
                    out.write('%s\t%s\n' % (query['time'], query['sql']))
                response.content = '<pre style="white-space:pre-wrap">%d queries executed in %.3f seconds\n\n%s</pre>' % (len(connection.queries), total_time / 1000, out.getvalue())
            return response

http://localhost:8000/?dbprof
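The summarising step of the query profiler can be exercised on its own. Django's `connection.queries` is a list of dicts whose `time` values are strings, so this sketch sorts on the numeric value; the function name `summarize_queries` and the sample data are made up for the example.

```python
def summarize_queries(queries):
    """Return (count, total_seconds, rows) with the slowest query first."""
    # Times arrive as strings such as "0.002"; compare them as floats so that
    # "10.000" sorts above "9.500".
    rows = sorted(queries, key=lambda q: float(q["time"]), reverse=True)
    total = sum(float(q["time"]) for q in rows)
    return len(rows), total, rows


# Sample data in the shape of connection.queries (values are invented).
sample = [
    {"time": "0.002", "sql": "SELECT id, name FROM poll"},
    {"time": "10.000", "sql": "SELECT COUNT(*) FROM poll_vote"},
    {"time": "9.500", "sql": "SELECT id, name FROM category ORDER BY name"},
]

count, total, rows = summarize_queries(sample)
print("%d queries executed in %.3f seconds" % (count, total))
for row in rows:
    print("%s\t%s" % (row["time"], row["sql"]))
```

Sorting on the raw strings would put "9.500" ahead of "10.000", which is why the conversion to `float` matters once any query takes ten seconds or more.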

Summary

•  Database efficiency is the typical problem in web apps.

•  Develop and deploy a caching plan early on.

•  Use profiling tools to find your problematic areas. Don’t pre-optimize unless there is good reason.

•  Find someone who knows more than me to configure your server software.

Slides and code available online at: http://www.davidcramer.net/djangocon

Thanks!
