PySociaLite: Python integrated query language for parallel/distributed data processing

Jiwon Seo

Audience level:


PySociaLite is a Python-integrated query language for data processing. The query language (SociaLite) supports simple annotations for partitioning data across distributed machines, so distributed data can be processed without explicit communication code. Python integration makes SociaLite more powerful still by making the large body of existing Python code accessible from SociaLite queries.


This talk presents PySociaLite, the Python integration of the SociaLite query language. SociaLite is a query language for distributed data processing. Because it is based on Datalog, a high-level declarative query language, SociaLite can express data processing logic succinctly. Furthermore, the SociaLite compiler automatically compiles queries into efficient parallel code that runs on distributed multi-core machines.

With its Python integration, PySociaLite allows SociaLite queries to be embedded directly in Python (Jython) programs. This integration lets SociaLite queries access Python functions and variables, and lets parts of a query be implemented in Python code. Syntactically, an embedded SociaLite query is enclosed in a pair of backticks (`). The embedded queries are preprocessed (using PyParsing) and rewritten into a function call that takes the query as a string parameter. Python-level functions and variables used inside a query are prefixed with a dollar sign ($); the preprocessor recognizes them as well and passes them as parameters to the function call.

PySociaLite applies optimizations to the interoperation between SociaLite and Python. For example, to access Python functions from SociaLite queries, a Python frame object (PyFrame object) needs to be created for each Python function call. For better performance, this PyFrame object is cached and reused.

With PySociaLite, programmers get the best of both worlds: the performance of SociaLite and the productivity of Python. Large amounts of data are stored compactly in SociaLite tables and processed in parallel, while the data processing logic can be implemented with the help of Python.
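
The preprocessing step described above can be illustrated with a minimal sketch. This is not PySociaLite's actual preprocessor: it uses the standard `re` module rather than PyParsing, the runtime function name `run_query` is a hypothetical placeholder, the keyword-argument convention for passing `$`-prefixed names is an assumption, and the query text in the usage note is illustrative only.

```python
import re

# Hypothetical name for the runtime entry point that receives the query string.
# The real PySociaLite function name is not given in this abstract.
RUNTIME_CALL = "run_query"

# A query is anything between a pair of backticks.
QUERY_RE = re.compile(r"`([^`]*)`")
# A Python-level name inside a query is prefixed with a dollar sign.
VAR_RE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def rewrite_query(match):
    """Turn one backtick-delimited query into a function-call expression."""
    query = match.group(1)
    # Collect $-prefixed names in first-seen order, without duplicates.
    names = []
    for name in VAR_RE.findall(query):
        if name not in names:
            names.append(name)
    # Pass each referenced Python object as a keyword argument (an assumption).
    args = "".join(", {0}={0}".format(n) for n in names)
    return "{}({!r}{})".format(RUNTIME_CALL, query, args)


def preprocess(source):
    """Rewrite every embedded query in a piece of Python source text."""
    return QUERY_RE.sub(rewrite_query, source)
```

For example, `preprocess("`T(a) :- S(a), $f(a).`")` returns the text `run_query('T(a) :- S(a), $f(a).', f=f)`, so the Python function `f` reaches the runtime alongside the query string. A regular expression cannot tell a backtick inside a string literal or comment from a query delimiter, which is one reason a real preprocessor would use a proper parser.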