Monday 10:50 a.m.–11:20 a.m.
Python as a configuration language
- Audience level:
- Systems Administration
The enormous size of Facebook's infrastructure and its "Move Fast" culture pose a challenge for the configuration framework, requiring a safe, fast and usable system to deploy settings to hundreds of thousands of servers. Python plays a special role in this framework as the tool for generating the configuration, paired with Thrift for enhanced schema validation and type safety.
Managing configuration for large systems and huge numbers of services is a daunting task. Simple configs, hand-edited and stored in a repository, don't scale to interdependent configuration and large numbers of services, and intermediary services often buckle under assumptions of consistency when availability is paramount.

To answer these challenges, Facebook created "Configerator", a complete Python-based framework for human-edited configuration that simplifies the workflow from creation to distribution, empowering engineers to programmatically generate configuration with good safety guarantees before deploying it to the server fleet. The language used for config generation is, of course, Python, chosen for its simplicity, readability, flexibility and existing adoption within Facebook. One drawback of Python is that it is a loosely typed language, but given the wide usage of Thrift at Facebook, and the fact that Thrift can serve not only as an interface definition language but also as a data definition language, it was an easy choice to complement the config generation. This combination allows any Facebook engineer to be immediately productive with familiar tools aimed at a dramatically different goal. For more advanced use cases, domain-level validation is also supported, allowing extremely tailored verifications to be added when needed.

The Configerator framework only requires that a Python module construct a Thrift object, which is then rendered at build time to a "materialized" JSON file read by the configuration consumer. The framework does not impose any other structure. Config files can import other config files and Thrift definitions, define common functionality and override specifics. A config file can be as conceptually simple as:

```python
from common.cluster import default_config

default_config("cluster1", some_feature=True)
```

whereas the produced materialized JSON file may be thousands of lines long.
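To illustrate the build-time model, here is a minimal sketch of what a `default_config` helper could look like. It uses a plain dataclass as a stand-in for a Thrift-generated struct (in Configerator the schema comes from a Thrift data definition); the field names and defaults are purely illustrative:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical stand-in for a Thrift-generated struct; in the real
# framework the schema is defined in Thrift, giving type checking.
@dataclass
class ClusterConfig:
    name: str
    some_feature: bool = False
    replicas: int = 3

def default_config(name: str, **overrides) -> str:
    """Build a config object and render it to 'materialized' JSON."""
    cfg = ClusterConfig(name=name, **overrides)
    return json.dumps(asdict(cfg), indent=2, sort_keys=True)

print(default_config("cluster1", some_feature=True))
```

The key idea is that the concise Python source is the input, while the verbose JSON output is what consumers actually read.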
This allows for testing, reproducible builds, functional validation and other good practices with a very concise config language, versus the highly redundant and massive materialized JSON files.

A **Web UI** makes it easy to edit simple config files directly from a browser, without any local repository checkout or CLI tooling. This is especially useful for the simple cases where the full power of Python as a configuration language is not needed.

A commonly overlooked use case is programmatic access, when config is generated by other tools or the authoritative value of a config is needed. Configerator has a rich **API** for programmatic access and modification of config, recently rewritten to use pygit for much faster git repository access.

Configuration history is often required for release activities and the inevitable roll-back of a bad config push. The configuration resides in a set of **git** repositories, and the "materialized" JSON files are stored alongside the source config in the same commits, so deterministic roll-backs are as easy as a "git revert" followed by the usual tests, code review, etc.

Despite all the validation, any new config can have unexpected side effects, and a fast push of a broken config can cause outages. The Configerator framework supports automatic **canaries**: deploying the config to a subset of machines with different profiles, watching for any anomalies it might cause, and only then merging it to the main repository. Canarying begins only after careful testing by the developers making changes, but is an additional requirement because it can detect problems that only happen at scale. Services can define their own testing procedures, including health metrics and deploy stages, if they have special needs not covered by the default profiles. A more robust controlled-rollout mechanism is also in development to make slow rollouts of new configs in production possible.
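The canary flow described above can be sketched in a few lines. This is not Configerator's actual implementation; the function name, sampling strategy and threshold are illustrative assumptions, showing only the shape of the check: deploy to a sample, compare a health metric against a threshold, and proceed only if no anomaly appears.

```python
# Hypothetical sketch of a canary check: push the new config to a small
# sample of hosts, measure a per-host health metric (here, error rate),
# and gate the merge on every sampled host staying below a threshold.
def canary_ok(hosts, health_metric, sample_size=3, max_error_rate=0.01):
    """Return True if the sampled hosts show no anomaly under the new config."""
    sample = hosts[:sample_size]  # the subset that receives the config first
    rates = [health_metric(host) for host in sample]
    return all(rate <= max_error_rate for rate in rates)

hosts = ["host%d" % i for i in range(10)]
healthy = lambda host: 0.0  # pretend every canary host reports zero errors
print(canary_ok(hosts, healthy))
```

In the real framework, services can plug in their own health metrics and deploy stages at this point.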
Distributing the generated configs was an interesting challenge due to Facebook's sheer number of servers. Once a config change is merged and pushed to the main git repos, a set of "tailers" sees the changes and syncs them to a global Zeus ensemble (Facebook's custom version of **ZooKeeper**). Zeus observers fan out from there to Zeus proxies deployed in each cluster, sending notifications to the end boxes and serving the config contents. On each Facebook server, the Configerator **proxy** then syncs updates for the configs accessed by services on that host. The result is a small delay between a change being merged in the repository and being fully distributed to all boxes that need the config, usually well below 30 seconds. In the unlikely event of a network partition or any malfunction, hosts can still safely use their locally synced configs, keeping the system highly available.

The usage growth of the Configerator framework within Facebook has been amazing, and we believe this is largely thanks to the simplicity of a framework that hides all the complexities of validation, testing and config distribution, while retaining the flexibility to create complex configs in a really simple and concise way, in Python code. In conclusion, we will also talk about lessons learned, the challenges of scaling, and how PEP 0484 (Type Hints) might make Python an even better language for configuration generation.
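The consumer side of this design can be sketched as follows: readers always load the last locally synced materialized JSON file, so a partition upstream only delays updates instead of breaking reads. The path layout and function name here are illustrative assumptions, not Configerator's actual API.

```python
import json
import os
import tempfile

# Hypothetical sketch of a config consumer. A proxy process keeps the
# materialized JSON synced to a local path; readers load that file and
# fall back to a default if it is missing or unreadable.
def read_config(path, default=None):
    """Load the locally synced materialized JSON config."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return default

# Simulate a config that the proxy has already synced to disk.
synced_path = os.path.join(tempfile.mkdtemp(), "cluster1.json")
with open(synced_path, "w") as f:
    json.dump({"some_feature": True}, f)

print(read_config(synced_path))
```

Because reads never depend on a live network call, a host keeps serving with its last known-good config during an outage, trading a bounded staleness window for availability.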