Like Switching on the Light

Managing an Elastic Compute Cluster with Python


George BelotskyHeath Johns
Chief Scientist VP of Engineering
george@cinematx.com heath@cinematx.com

Updated April 16, 2008

CinematX Digital Inc.

Note: some important URLs appear directly in the slides, so the attendees can write them down immediately if they so desire. Others which are easy to find (e.g. via a Google search) or less critical are present in print-only sections such as this one, as well as in hyperlinks.

These slides try to use complete paragraphs, rather than bullet points. This should make them much more informative, especially for later reference. It can also unfortunately make the slide deck less useful as a prompt to the speaker during the actual talk.

What is Elastic Computing?

Elastic computing is essentially utility computing. Instead of having dedicated machines, you pay for the resources only when they are required. This is similar to purchasing electricity, telephone services, etc.

This talk will focus on Amazon's Elastic Compute Cloud service (EC2). We will discuss the basics of working with EC2, mostly from the perspective of typical datacenter applications, such as Web server farms. If you want to use EC2 for high performance computing, see Peter Skomoroch's talk (#131) in addition to this one.

Ideally, you have already signed up for EC2, and acquired just a little familiarity with the service. It should be possible, however, to follow along even if you have not done these things.

Peter Skomoroch's talk (#131): http://us.pycon.org/2008/conference/schedule/event/10/

Components

Virtualization makes elastic computing practical. EC2 uses Xen. You can also use User Mode Linux (UML) to further subdivide EC2 instances into "smaller" virtuals.

Most of the tools are familiar (SSH, BASH, Python, etc.) but the environment can be very different.

Case in point: Amazon EC2 does not offer persistent storage for your filesystem. Once an AMI instance stops, all changes are lost.

Post-conference note: in mid-April 2008, Amazon announced that EC2 will soon support persistent disk volumes, which you will be able to mount in your AMI instances.

User Mode Linux: http://user-mode-linux.sourceforge.net/

Firefox Plugins

Command line tools are available from Amazon, but you will likely find the EC2 UI and the S3 Organizer (both Firefox plugins) to be absolutely essential tools.

Be sure to install at least the EC2 plugin before doing anything else!

Important Notes: EC2 allocates units of computation on an hourly basis. Do not leave machines running without need, because the costs can add up quickly. There are also bandwidth charges, which tend to be small until you go live, but beware of running lengthy stress tests. Storage charges for Amazon Machine Images (AMIs) are unlikely to be a significant factor. Amazon's Simple Storage Service (S3) stores these images.

EC2 UI: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=609
S3 Organizer: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=771

The Basics

Elementary EC2 Security

You will need the AWS Access Key ID and the AWS Secret Access Key. See the "AWS Access Identifiers" link under "You Web Services Account" drop-down menu on the Amazon AWS homepage.

You will also need to generate a keypair for SSH. By default, EC2 SSH services are strictly key based. Note that you will get a large number of SSH login attempts from automated attacks, so it is best to keep the default behavior.

You can define a number of security groups, which specify open ports. An AMI instance can have several security groups assigned upon launch. Changes to permissions in these groups affect all running instances using the group in question.

AWS security can be quite difficult at first. To start, verify that you can can create and shutdown AMI instances using the EC2 UI plugin for Firefox. Then, try to log into one of your running instances via SSH.

The Boto Library

The Boto library allows you to automate EC2 operations using Python.

http://code.google.com/p/boto/

Boto comes with a good tutorial, which we will review now.

http://boto.googlecode.com/svn/trunk/doc/ec2-tut.txt

Automated Administration with Paramiko

Paramiko is a very flexible SSH library for Python.

SSH is a well known facility with good security, which is constantly maintained. In addition, you will need SSH access to your machines anyway. So, SSH-based automated administration does not open any additional security holes.

The Linux Journal carried an article on this topic. The Capistrano package is another example of automation using shell commands.

http://www.linuxjournal.com/article/8257
http://www.capify.org/

A variation on this theme is to make Python (or IPython) the login shell on your target machines.

Useful Paramiko Subclass

import paramiko

class SSHClient(paramiko.SSHClient):
  def __init__(self):
    super(SSHClient, self).__init__()
    self.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    self.load_system_host_keys('/home/prov/.ssh/known_hosts')

        
  def exec_command(self, command):
    chan = self._transport.open_session()
    chan.set_combine_stderr(True)

    chan.exec_command(command)
    stdin = chan.makefile('wb')
    stdout = chan.makefile('rb')

    return stdin, stdout

The above class combines stdout and stderr when executing commands via SSH. This usually makes it easier to write code to analyze the results.

Establishing the SSH Connection

def _connect(self):
  ssh = SSHClient()
  ssh.connect(self._host, self._port, self._user, key_filename=self._keypath)
  return ssh

Establishing SSH connections can be expensive. You can implement a connection pool here instead of returning a new connection every time.

Sending a Command String

 def _command(self, command):
        ssh = self._connect()

        with closing(ssh):
            inp,out = ssh.exec_command(command)
            inp.close()  
            with closing(out):
                return out.read()

Note the extensive use of contexts, to cleanly release resources even if an exception occurs.

Fabrics, Machines, Commands

This is an approach for designing programs that control an elastic compute facility, or several such facilities.

Fabrics (e.g. EC2) supply "prototype machine" objects. These "prototype" objects contain enough information to build instances of Machine classes, which are capable of connecting to a remote host (e.g. via SSH). Command objects use Machine objects to carry out operations on the remote hosts.

Some Common Problems

Concurrency Model

The logic of configuring and maintaining AMI instances can get quite complex, but commands to each instance tend to run completely independently of one another.

Each command stream is bound by the network, and the length of time it takes for the target machine to execute the commands.

This makes threads the most appealing model for the kind of automation described here.

When is a Machine Really Ready?

This test must be done in two stages.

First, check that EC2 reports the machine's state as "running" (see the Boto library tutorial which we covered earlier).

This, however, is not enough! A machine in the "running" state may not have finished booting. Try executing a no-op command (e.g. echo > /dev/null) via SSH until it succeeds.

Post-conference note: An obvious optimization is to have the machine itself initiate contact after booting (by sending a message to a specific server, downloading and executing scripts, etc.). You may still need to support the above two-stage process, however, to catch machines that failed to fully initialize (e.g. due to an error in the AMI).

Your Own AMIs

Any changes you make to a running instance are not saved automatically.

A "self bundling" AMI includes the tools ec2-bundle-vol and ec2-upload-bundle, to create a new AMI from the running instance, and save it into S3, respectively. Note that you must register the AMI after uploading it, or it will not be usable. The EC2 UI Firefox plugin provides an easy way to register AMIs. When you de-register an AMI, it will remain in S3 until you explicitly delete it (e.g. using the S3 Organizer Firefox plugin). You can also re-register it instead.

If you want to build a Ubuntu AMI, use the instructions at the following links. The last link is a script which you could run on a stock Amazon Fedora Core instance; it should generate a Ubuntu AMI.

http://developer.amazonwebservices.com/connect/message.jspa?messageID=42535#42535
http://blog.atlantistech.com/index.php/2006/10/04/amazon-elastic-compute-cloud-walkthrough/
http://www.anvilon.com/software/download/build-ubuntu-7.10-gutsy-ami

Overview of Selected Advanced Topics

Cloudlets

A root-like, movable, virtual environment, devoted to the needs of a single user. It may run locally on virtual machines in the user's PC, or on the "cloud." Cloudlets keep important user state, but there is absolutely no guarantee that users will always get the same virtual machine every time they connect, or even that the cloudlet will run on the same host.

Running UML instances on AMI instance is an economical way to implement cloudlets on EC2. Such "nested" UML instances can also serve any other purpose you choose. For an excellent source of UML filesystem images, see "uml.nagafix.co.uk". You will, however, need to build your own UML kernel. It is essential to set the "Host memory split" (under "UML-specific options") to 2G/2G, or the kernel will not work under EC2.

To ensure that users get their cloudlets quickly, you can maintain a free list. Before adding a cloudlet to the free list, configure it as much as possible, leaving the bare minimum of tasks that need to be done when the user requests a machine.

Storing System State

Consider keeping all state information related to an AMI instance on that particular instance. In almost all applications, at least some state must reside on each machine. Hence, making the machine responsible for all of its state can avoid synchronization problems, without increasing overall system complexity.

You can create files to mark key events; these can also function as locks.

If you can define a set of diagnostic commands to determine the state of a machine, this is even better.

Normal Accident Theory

Normal Accident Theory (NAT) describes the accident potential of many different systems. It was first developed by Charles Perrow in 1979, and is covered in his classic book, Normal Accidents.

According to NAT, tight coupling and high complexity lead to inevitable failures. A distributed, dynamic environment such as EC2 is naturally complex, so reducing coupling is a major concern.

Even small, innocuous-looking classes (e.g. implementing a no-operation command, to check if an SSH connection is usable) can couple diverse parts of a program together.

Often, it is best to avoid extending the functionality of a class. Instead, you can use design patterns such as adapters and decorators to modify or add behavior where it is needed.

Normal Accidents (book): http://www.amazon.com/Normal-Accidents-Living-High-Risk-Technologies/dp/0691004129

Credits

Thanks to our absolutely brilliant development team.