Make Sure Your Programs Crash

Moshe Zadka

Type: Talk
Audience level: Intermediate
Category: Best Practices/Patterns

March 9th 3:20 p.m. – 3:50 p.m.

Description

With Python, segmentation faults and the like simply don't happen -- programs do not crash. However, the world is a messy, chaotic place. What happens when your programs crash? I will talk about how to make sure that your application survives crashes, reboots and other nasty problems.

Abstract

Handling crashes is divided into two parts -- resilience (making sure that your software maintains correctness in the face of crashes) and speed of recovery (minimizing the time it takes to get back to full working condition). I will talk about techniques that allow for resilience -- separating master data from cache data, minimizing the amount of master data, using atomic file operations, using databases and persisting structures in the right order. Then I will talk about speedy recovery techniques, among them process separation, working while restarting and more. I will conclude by surveying the options for testing all of these things, so that crashes are made to happen in the development/testing environment.
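
As an illustration of the "atomic file operations" technique mentioned above, here is a minimal sketch, not taken from the talk itself. It assumes the master data fits in a single JSON file; the function name `atomic_write_json` is hypothetical. The idea is to write a temporary file in the same directory and then rename it over the target, so a crash leaves either the old file or the new one, never a half-written one.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data to path so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temporary file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic rename; replaces an existing file
    except BaseException:
        # If anything failed before the rename, do not leave the temporary file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A caller would use `atomic_write_json("state.json", {"jobs_done": 17})` wherever it would otherwise open the file for writing directly. Whether the directory entry itself also needs an `fsync` depends on the durability guarantees required.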

Outline:

Ways Python programs can crash
- Infinite loops
- Getting stuck
- Memory leaks
- Exceptions
  - Catching exceptions considered scary
- Thread deadlocks
Minimizing effects of a crash
- Atomic file operations
- Databases
- Vertical process splitting
- Horizontal process splitting
- Limiting process lifetime
Detecting crashes
- Process death
- Process unresponsiveness
- Test communication
- Helper checker processes
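
The detection items above could be sketched roughly as follows. This is an assumption-laden illustration, not the talk's implementation: the worker script, the "ping"/"pong" protocol and the names `start_worker`, `is_dead` and `responds` are all hypothetical. Process death is detected with `Popen.poll()`, and unresponsiveness by sending a test message and waiting a bounded time for the reply.

```python
import subprocess
import sys
import threading

# Hypothetical worker: answers "pong" to every "ping" line it reads on stdin.
WORKER = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    if line.strip() == 'ping':\n"
    "        print('pong', flush=True)\n"
)


def start_worker():
    """Start the worker as a separate process connected through pipes."""
    return subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def is_dead(proc):
    """Process death: poll() returns the exit code once the child has exited."""
    return proc.poll() is not None


def responds(proc, timeout=5.0):
    """Test communication: send 'ping' and expect 'pong' within timeout seconds."""
    answer = []

    def ask():
        try:
            proc.stdin.write("ping\n")
            proc.stdin.flush()
            answer.append(proc.stdout.readline().strip())
        except (OSError, ValueError):
            pass  # broken pipe or closed file: treat it as no answer

    # The blocking read runs in a daemon thread so a stuck worker cannot hang the checker.
    asker = threading.Thread(target=ask, daemon=True)
    asker.start()
    asker.join(timeout)
    return answer == ["pong"]
```

A helper checker process could call `is_dead` and `responds` periodically and restart the worker when either check fails. A live process that fails `responds` is stuck rather than dead, which is the case a plain exit-code check misses.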
Restarting processes
- Minimize master data
- Boot-up speed
- Order of start-up and communication
- Testing by killing processes
- Testing by pausing processes
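
"Testing by killing processes" might look like the following sketch. Everything here is an assumption made for illustration: the embedded writer script, the file layout and the name `kill_and_check` are hypothetical. The test starts a process that keeps rewriting a state file with write-then-rename, kills it without warning, and checks that the file on disk still parses.

```python
import json
import os
import subprocess
import sys
import tempfile
import time

# Hypothetical program under test: rewrites a state file forever using write-then-rename.
WRITER = """
import json, os, sys
path = sys.argv[1]
n = 0
while True:
    n += 1
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"counter": n, "padding": "x" * 10000}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)
"""


def kill_and_check(rounds=3):
    """Kill the writer abruptly several times; the state file must always parse."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "state.json")
        for _ in range(rounds):
            proc = subprocess.Popen([sys.executable, "-c", WRITER, path])
            # Wait until at least one complete write has landed on disk.
            deadline = time.monotonic() + 10
            while not os.path.exists(path):
                if proc.poll() is not None or time.monotonic() > deadline:
                    proc.kill()
                    proc.wait()
                    raise RuntimeError("writer did not produce a state file")
                time.sleep(0.01)
            time.sleep(0.05)  # let it run briefly, then kill it at an arbitrary point
            proc.kill()       # SIGKILL on POSIX: the child gets no chance to clean up
            proc.wait()
            with open(path) as f:
                state = json.load(f)  # raises if a partial file was ever visible
            assert state["counter"] >= 1
    return True
```

Pausing can be tested in a similar way on POSIX systems by sending `signal.SIGSTOP` with `os.kill(proc.pid, signal.SIGSTOP)` and later `signal.SIGCONT`, then checking that the rest of the system notices the unresponsive process and recovers.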
Conclusions
- Python processes can still crash
- Plan for crashes
- Test your plan for crashes