Make Sure Your Programs Crash

Type: Talk
Audience level: Intermediate
Category: Best Practices/Patterns
March 9th 3:20 p.m. – 3:50 p.m.

Description

With Python, segmentation faults and the like simply don't happen -- programs do not crash. However, the world is a messy, chaotic place. What happens when your programs crash? I will talk about how to make sure that your application survives crashes, reboots and other nasty problems.

Abstract

Handling crashes is divided into two parts -- resilience (making sure that your software maintains correctness in the face of crashes) and speed of recovery (minimizing the time it takes to get back to full working condition). I will talk about techniques to allow for resilience -- separating master data from cache data, minimizing the amount of master data, using atomic file operations, using databases and persisting structures in the right order. Then I will talk about speedy recovery techniques, among them process separation, working while restarting and more. I will conclude by surveying the options for testing all of these things, so that the crashes are made to happen in the development/testing environment.
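As an illustration of the atomic file operations mentioned above, here is a minimal sketch of one common approach: write to a temporary file in the same directory, flush it to disk, then rename it over the target. The function name `atomic_write_json` and the JSON format are assumptions made for the example, not something the talk specifies.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data so that readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # the rename is atomic on POSIX filesystems
    except BaseException:
        # If anything failed before the rename, do not leave the temp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies partway through the write, the target file still holds its previous contents; at worst a stray temporary file is left in the directory.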

Outline:

  • Ways Python programs can crash
    • Infinite loops
    • Getting stuck
    • Memory leaks
    • Exceptions
      • Catching exceptions considered scary
    • Thread deadlocks
  • Minimizing effects of a crash
    • Atomic file operations
    • Databases
    • Vertical process splitting
    • Horizontal process splitting
    • Limiting process lifetime
  • Detecting crashes
    • Process death
    • Process unresponsiveness
    • Test communication
    • Helper checker processes
  • Restarting processes
    • Minimize master data
    • Boot-up speed
    • Order of start-up and communication
    • Testing by killing processes
    • Testing by pausing processes
  • Conclusions
    • Python processes can still crash
    • Plan for crashes
    • Test your plan for crashes
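
The "Process death" and "Testing by killing processes" items in the outline can be sketched roughly as follows. This is only an illustration under stated assumptions: the worker is a stand-in child process that sleeps, and `start_worker`, `ensure_running` and `kill_and_recover` are hypothetical names, not part of the talk.

```python
import subprocess
import sys

# Hypothetical stand-in for a real worker: a child interpreter that just sleeps.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_worker():
    return subprocess.Popen(WORKER_CMD)


def ensure_running(worker):
    """Return a live worker, starting a replacement if the old one has died."""
    if worker.poll() is None:  # poll() returns None while the process is alive
        return worker
    return start_worker()


def kill_and_recover():
    """Test by killing: kill the worker, then check that it gets replaced."""
    worker = start_worker()
    worker.kill()
    worker.wait(timeout=5)  # reap the dead process so poll() reports its exit
    replacement = ensure_running(worker)
    recovered = replacement is not worker and replacement.poll() is None
    # Clean up the replacement so the test leaves no stray processes.
    replacement.kill()
    replacement.wait(timeout=5)
    return recovered
```

Calling `kill_and_recover()` in a test suite makes the crash happen on purpose in the development environment. Note that this check only detects process death; an unresponsive process that is still alive would need a separate probe, such as a test message with a timeout.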