Tuesday 3:15 p.m.–3:45 p.m.
Diving into the Wreck: a postmortem look at real-world performance
- Audience level:
- Best Practices & Patterns
When I was a young engineer interested in performance, much of the advice I saw on performance management focused on algorithms and rules of thumb. It’s good advice, but it doesn’t address the most common problems. This talk will cover a handful of the most common performance problems I’ve encountered in my career. We will talk about how to recognize them, what causes them, and how to resolve them.
Conversations about software performance typically center on the big-O characteristics of algorithms and on rules of thumb: put the loop inside the function call; use “is not None” rather than “!= None”; copy global names into default arguments for faster lookup. These rules are helpful, and they guide a programmer toward idiomatic code habits. But most software is not slow because of a misplaced loop, or because it uses comparison operators where identity operators would suffice. Practically speaking, most software is slow because of data access and networked services. Logic tells us that single-point chokepoints are the critical path in an application, and physics suggests that the disks and networks involved are almost certainly orders of magnitude slower than our code.
But tracing observed behaviors back to a root cause can be daunting. Observed behaviors rarely point directly at the root cause, and often point at a proximate cause that may even camouflage the deeper concern. Webserver queueing, where requests pile up in the webserver’s network stack waiting for a worker to service them, often seems to indicate insufficient webserver resources. But often the webserver’s workers are themselves waiting on databases, and adding more webservers only deepens the queue rather than addressing the problem. Other observed behaviors may mislead: a server that runs out of memory may appear to have sprung a memory leak, but it might simply suffer from a poor data access strategy. Sometimes the performance degradation of an endpoint may be due entirely to a “noisy neighbor”, another endpoint that seems snappy enough but is creating a great deal of database contention.
Part of what makes databases such an insidious source of latency is that most modern applications are distributed applications. Even the simplest web stacks typically include a webserver, an application server, and a database server. More sophisticated applications add more and more distinct services.
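To make the out-of-memory scenario above concrete, here is a minimal sketch of three data access strategies that compute the same total. The `orders` table, its contents, and the function names are hypothetical, and an in-memory SQLite database stands in for a real database server; the point is only that the first strategy holds every row in application memory while the other two do not.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (amount) VALUES (?)",
    ((n % 100,) for n in range(10_000)),
)


def total_by_loading_everything(conn):
    # Materializes every row as a Python object before summing.
    # Memory use grows with the size of the table.
    rows = conn.execute("SELECT amount FROM orders").fetchall()
    return sum(amount for (amount,) in rows)


def total_by_streaming(conn):
    # Iterates over the cursor, so rows are consumed one at a time
    # instead of being collected into a list first.
    cursor = conn.execute("SELECT amount FROM orders")
    return sum(amount for (amount,) in cursor)


def total_in_database(conn):
    # Lets the database do the aggregation and returns a single row.
    (total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
    return total


print(total_by_loading_everything(conn))
print(total_by_streaming(conn))
print(total_in_database(conn))
```

All three functions return the same number; they differ in how much data crosses into the application and how long it stays there.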
Distributed applications add a whole host of full and partial failure modes that can have unexpected impacts on an application’s performance. Consider, for instance, a particular endpoint that seems to be running slowly on average. The database is fine, and the data access design is well-considered. But there’s a call to a remote service where every third or fourth request is timing out and being retried. Or, in that same application, so many connections are being created and closed that the application can no longer open new ones.
Certainly, some software can benefit from optimizing its use of the language. And the scalability of a routine will be reflected in its big-O analysis. But real-world performance remediation is very rarely about algorithms or idioms. It’s usually the database.
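The retried-timeout case can be sketched with simple arithmetic. The numbers below are assumptions chosen for illustration (a 50 ms healthy call, a 2 s client timeout, every fourth request timing out once before succeeding on retry), and the latencies are simulated rather than measured.

```python
from statistics import mean, median

FAST_CALL_S = 0.05  # assumed latency of a healthy remote call
TIMEOUT_S = 2.0     # assumed client-side timeout before retrying


def request_latency(request_number: int, fail_every: int = 4) -> float:
    """Simulated latency of one request that makes one remote call.

    Every `fail_every`-th request times out once, then succeeds on retry.
    """
    if request_number % fail_every == 0:
        # Waited out the full timeout, then paid for the retried call.
        return TIMEOUT_S + FAST_CALL_S
    return FAST_CALL_S


latencies = [request_latency(n) for n in range(1, 101)]

print(f"median: {median(latencies):.3f}s")
print(f"mean:   {mean(latencies):.3f}s")
print(f"max:    {max(latencies):.3f}s")
```

Under these assumptions the typical request is still fast, while the average is roughly ten times slower. That is why an endpoint can look slow “on average” even though most of its requests, and its database, are fine.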