
PyCon 2011 Atlanta

March 9th–17th


Handling ridiculous amounts of data with probabilistic data structures


Experienced / Talk
March 12th 4:15 p.m. – 4:45 p.m.
Part of my job as a scientist involves playing with rather large amounts of data (200 GB+). In doing so, we stumbled across some neat CS techniques that scale well, are easy to understand, and are trivial to implement. These techniques allow us to make some, and possibly many, types of data analysis map-reducible. I'll talk about interesting implementation details, fun science, and neat computer science.


If this is given as an extreme talk, I will cover interesting details and issues in:

(1) Python as the backbone for a non-SciPy scientific software package: using Python as a frontend to C++ code, especially for parallelization and testing purposes.

(2) Implementing probabilistic data structures with one-sided error as pre-filters for data retrieval and analysis, in ways that are generally useful.

(3) Efficiently breaking down certain types of sparse graph problems using these probabilistic data structures, so that large graphs can be analyzed straightforwardly. This will be applied to plagiarism detection and/or duplicate code detection.
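
To make the idea in (2) concrete, here is a minimal sketch of one well-known probabilistic data structure with one-sided error, a Bloom filter. The abstract does not say which structures the talk uses, so the choice of a Bloom filter, the class name `BloomFilter`, and the parameters `num_bits` and `num_hashes` are all assumptions made for illustration. The filter can report an item as present when it was never added (a false positive), but it never reports an added item as absent, which is what makes it usable as a pre-filter.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; names and defaults are hypothetical."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        # Filters built with the same parameters merge with a bitwise OR,
        # so partial filters built on separate data chunks can be combined.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share num_bits and num_hashes")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


# Build one filter per data chunk, then merge them.
chunk_a = BloomFilter()
chunk_b = BloomFilter()
for word in ["alpha", "beta"]:
    chunk_a.add(word)
for word in ["gamma", "delta"]:
    chunk_b.add(word)

merged = chunk_a.union(chunk_b)
print("alpha" in merged and "delta" in merged)  # True: added items are never missed
```

Because merging is a bitwise OR, filters built independently on separate chunks of data can be combined afterwards, which is one way a structure like this fits a map-reduce style of analysis. Whether the talk's own implementation works this way is not stated in the abstract.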