On the Hour Data Ingestion from the Web to a Mongo Database

Will Voorhees, Benjamin Bengfort, Rebecca Bilbro, Tony Ojeda


Systems designers face the problem of integrating a variety of components, from databases to computational processes, in such a way that they can run unattended. While individual tasks like connecting to a database or fetching data from an API may seem simple in isolation, complexity grows as simple components are integrated in meaningful ways. Questions about how to run processes on a schedule, how to detect and automatically recover from errors, how to manage each process, and how to view or administer the system as a whole become critical to success.

We were confronted with these questions when we attempted to build a system that would fetch posts from RSS feeds on an hourly basis and store them in a Mongo database. This seemingly simple problem statement, intended to create a corpus of natural language for analytics and application-oriented machine learning, became more complex as we integrated each component. In this poster we will present the lessons we learned from building the system and demonstrate a robust architecture that uses Python processes as a backbone for work, scheduling, and administration. The system, called [Baleen](https://github.com/bbengfort/baleen) after the whales that ingest huge amounts of plankton, has been running since March 2016 and has collected over 1 million HTML posts, creating a corpus containing hundreds of millions of words (which we hope will grow to over a billion words by the time PyCon arrives).

Some problems came up immediately: What happens when you ingest duplicate documents? How often do you synchronize? How do you handle errors without stopping jobs? Other problems emerged later: How do you get a quick view of the system as a whole? How do you detect global failure? And still other problems took months to notice: What happens when you run out of memory or disk space? What happens if you ingest video or audio instead of text?
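Two of the immediate questions, duplicate documents and errors that should not stop a job, can be illustrated with a minimal sketch. This is not Baleen's actual code: the function names, the `fetch` callable, the in-memory `seen` set, and the `store` list (standing in for a Mongo collection) are all assumptions made for illustration.

```python
import hashlib
import logging

logger = logging.getLogger("ingest")


def content_signature(html: str) -> str:
    """Return a SHA-256 hex digest of the post body, used as a duplicate key."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def ingest_feeds(feeds, fetch, seen, store):
    """Fetch every feed; a failure in one feed does not stop the others.

    feeds -- iterable of feed identifiers (hypothetical)
    fetch -- callable returning a list of HTML strings for a feed (hypothetical)
    seen  -- set of signatures already stored
    store -- list standing in for a database collection
    """
    counts = {"stored": 0, "duplicates": 0, "errors": 0}
    for feed in feeds:
        try:
            posts = fetch(feed)
        except Exception:
            # Log the traceback and move on to the next feed.
            logger.exception("failed to fetch %s", feed)
            counts["errors"] += 1
            continue
        for html in posts:
            signature = content_signature(html)
            if signature in seen:
                counts["duplicates"] += 1
                continue
            seen.add(signature)
            store.append({"feed": feed, "signature": signature, "content": html})
            counts["stored"] += 1
    return counts
```

Hashing the content rather than the URL is one possible design choice: it catches the same post syndicated under different links, at the cost of treating lightly edited posts as new documents.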
Our solutions may not be the best, but they are our own. We used the [schedule](#) library to kick off hourly jobs, and stored information about both the jobs and the collected data in a Mongo database. We created a Flask web application that reads this information from Mongo and displays the status of the system. We created a command line application that allowed us to quickly manage different parts of the system, and a utility that configures the app from a YAML file. We created email handlers to notify us of major failures, and used a suite of Linux tools to deploy the system on AWS.
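The hourly scheduling and job bookkeeping described above might look roughly like the following sketch. It assumes the third-party `schedule` package is installed; `run_job`, `main`, and the `jobs_log` list (a stand-in for a Mongo collection of job records) are hypothetical names, not Baleen's API.

```python
import time
from datetime import datetime, timezone


def run_job(task, jobs_log):
    """Run `task` and append a record of its outcome to `jobs_log`.

    Exceptions are captured in the record instead of propagating, so a
    failed run does not kill the long-running scheduler loop.
    """
    record = {
        "started": datetime.now(timezone.utc),
        "finished": None,
        "result": None,
        "error": None,
    }
    try:
        record["result"] = task()
    except Exception as exc:
        record["error"] = repr(exc)
    record["finished"] = datetime.now(timezone.utc)
    jobs_log.append(record)
    return record


def main():
    """Schedule a placeholder task every hour and loop forever."""
    import schedule  # third-party package, imported here so run_job stays stdlib-only

    jobs_log = []  # assumption: a real system would write these records to Mongo
    schedule.every().hour.do(run_job, task=lambda: "ingested", jobs_log=jobs_log)
    while True:
        schedule.run_pending()
        time.sleep(1)
```

Calling `main()` starts the loop. Because each run leaves a record whether it succeeds or fails, a status page or email handler can read the same records to report on the system.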