Saturday 3:15 p.m.–4 p.m.
Graph Database Patterns in Python
- Audience level:
Creating and using models for a graph database can be quite different from doing so for row-, column-, or document-oriented databases: the same query pattern can differ significantly in structure and performance. This session will show how to create models in Python for Titan property graphs that let you manipulate graphs as if you were querying with the Gremlin DSL.
A property graph database, represented by G = (V, E, λ), is basically a set of vertices (V) connected by edges (E), where each element can carry properties (λ) as key-value pairs. The most important feature of a graph database is index-free adjacency: each vertex holds direct references to its neighbors, so stepping from a vertex to an adjacent one takes roughly constant time regardless of the size of the graph. Relational databases, by contrast, pay for a join at every hop of a traversal, so the cost grows quickly with depth, while document databases (one family of NoSQL stores) encourage denormalization and data embedding instead.
My favorite graph database is Titan, because it lets me choose the storage backend I prefer (Cassandra) and the index backend I prefer (Elasticsearch). Titan also integrates seamlessly with Gremlin, the graph traversal language created for Blueprints to manipulate property graphs. However, Titan is written in Java, and Gremlin uses Groovy syntax. There are Java frameworks, like Rexster, that expose the graph database functionality through a REST API. This can work for simple use cases, but once you need more complex traversals the Rexster API falls short, because it provides just the basic CRUD operations (and its read operations execute only very simple traversals). For Python programmers, talking to Titan through the Rexster API adds too much HTTP overhead and limits you to the functionality Rexster provides.
Instead, you can create your own set of database models that expose Gremlin-like operations and follow the Blueprints Pipes architecture. These models let you execute complex traversals and get the most out of your graph database without imposing a data model from your code. To achieve this, we must follow the S in SOLID design in particular (the single responsibility principle): a vertex model should ONLY return vertices belonging to its own namespace and only manipulate those vertices. The same principle applies to edges.
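To make the namespace rule concrete, here is a minimal in-memory sketch. It does not talk to Titan; the `Graph`, `Vertex`, `Edge`, and `VertexModel` classes and the `namespace` attribute are hypothetical names chosen for illustration. Each vertex keeps direct references to its outgoing edges (index-free adjacency), and each model only ever returns vertices from its own namespace.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    """A vertex with key-value properties and direct references to its edges."""
    vid: int
    namespace: str
    properties: dict = field(default_factory=dict)
    out_edges: list = field(default_factory=list)  # index-free adjacency


@dataclass(eq=False)
class Edge:
    """A labeled, directed edge between two vertices."""
    label: str
    source: Vertex
    target: Vertex
    properties: dict = field(default_factory=dict)


class Graph:
    """Toy stand-in for the database: G = (V, E, properties)."""

    def __init__(self):
        self.vertices = {}

    def add_vertex(self, vid, namespace, **properties):
        vertex = Vertex(vid, namespace, properties)
        self.vertices[vid] = vertex
        return vertex

    def add_edge(self, source, label, target, **properties):
        edge = Edge(label, source, target, properties)
        source.out_edges.append(edge)
        return edge


class VertexModel:
    """Hypothetical base model: it only sees vertices of its own namespace."""
    namespace = None

    def __init__(self, graph):
        self.graph = graph

    def all(self):
        return [v for v in self.graph.vertices.values()
                if v.namespace == self.namespace]

    def get(self, vid):
        vertex = self.graph.vertices.get(vid)
        if vertex is not None and vertex.namespace == self.namespace:
            return vertex
        return None  # vertices from other namespaces are invisible here


class Person(VertexModel):
    namespace = "person"


class City(VertexModel):
    namespace = "city"


g = Graph()
alice = g.add_vertex(1, "person", name="Alice")
bob = g.add_vertex(2, "person", name="Bob")
madrid = g.add_vertex(3, "city", name="Madrid")
g.add_edge(alice, "knows", bob)
g.add_edge(alice, "lives_in", madrid)

people = Person(g)
person_names = sorted(v.properties["name"] for v in people.all())
# Following an edge is a direct reference lookup, not an index scan.
known_by_alice = [e.target for e in alice.out_edges if e.label == "knows"]
```

Asking the `Person` model for vertex 3 returns `None`, because that vertex belongs to the `city` namespace; only the `City` model can reach it.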
Following this design principle, you can create factory classes for vertices and edges that implement the most commonly used traversals (such as g.v, g.V, g.V.has, g.out, g.outE, etc.) and then create a set of models where each model corresponds to a namespace (for vertices) or a label (for edges). Most of this is achieved through the correct use of the Factory pattern, Pipes and Filters, and Decorators.
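One way these three patterns could fit together is sketched below, again against a plain in-memory store rather than Titan. Everything here is an assumption made for illustration: the `step` decorator, the `Pipe`, `VertexModel`, and `VertexFactory` classes, and the dictionary layout of vertices and edges. The method names `V`, `v`, `has`, `out`, and `outE` merely mirror the Gremlin steps mentioned above.

```python
import functools


def step(func):
    """Decorator: wrap a generator method so it returns a new, lazy Pipe."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        return Pipe(lambda: func(self, *args, **kwargs))
    return wrapper


class Pipe:
    """Pipes and Filters: each step lazily consumes the previous one."""

    def __init__(self, producer):
        self._producer = producer  # callable returning a fresh iterable

    def __iter__(self):
        return iter(self._producer())

    @step
    def has(self, key, value):
        for element in self:
            if element["properties"].get(key) == value:
                yield element

    @step
    def outE(self, label=None):
        for vertex in self:
            for edge in vertex["out"]:
                if label is None or edge["label"] == label:
                    yield edge

    @step
    def out(self, label=None):
        for edge in self.outE(label):
            yield edge["target"]


class VertexModel:
    """Model bound to one namespace; entry points mirror g.V and g.v."""

    def __init__(self, store, namespace):
        self._store = store
        self.namespace = namespace

    def V(self):
        return Pipe(lambda: (v for v in self._store.values()
                             if v["namespace"] == self.namespace))

    def v(self, vid):
        vertex = self._store.get(vid)
        ok = vertex is not None and vertex["namespace"] == self.namespace
        found = [vertex] if ok else []
        return Pipe(lambda: iter(found))


class VertexFactory:
    """Factory: builds one model per namespace over a shared store."""

    def __init__(self, store):
        self._store = store

    def create(self, namespace):
        return VertexModel(self._store, namespace)


def add_vertex(store, vid, namespace, **properties):
    store[vid] = {"id": vid, "namespace": namespace,
                  "properties": properties, "out": []}
    return store[vid]


def connect(source, label, target):
    source["out"].append({"label": label, "source": source,
                          "target": target, "properties": {}})


store = {}
alice = add_vertex(store, 1, "person", name="Alice")
bob = add_vertex(store, 2, "person", name="Bob")
carol = add_vertex(store, 3, "person", name="Carol")
madrid = add_vertex(store, 4, "city", name="Madrid")
connect(alice, "knows", bob)
connect(alice, "knows", carol)
connect(alice, "lives_in", madrid)

factory = VertexFactory(store)
people = factory.create("person")
friends = people.V().has("name", "Alice").out("knows")
names = [v["properties"]["name"] for v in friends]
labels = [e["label"] for e in people.v(1).outE()]
```

Because every step returns a new `Pipe` that wraps a producer function, no work happens until the chain is iterated, and the same chain can be iterated more than once. A real implementation would translate each step into a Gremlin traversal instead of filtering Python dictionaries.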