PyCon Pittsburgh. April 15-23, 2020.

Talk: 1 + 1 = 1 or Record Deduplication with Python

Presented by:

Flávio Juvenal da Silva Junior


How do you find duplicate records in a dataset that has no unique identifier, such as the SSN for US citizens? The answer is Record Deduplication: cleaning attributes and comparing them in a fuzzy way to find matches. In this talk, you’ll learn how to do this through Python examples, with no expert Data Science knowledge required.
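As a rough illustration of cleaning and fuzzy comparison, here is a minimal sketch using only the Python standard library. It is not taken from the talk: the helper names `clean` and `similarity` are hypothetical, and `difflib.SequenceMatcher` is just one of several possible string-similarity measures.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def clean(value: str) -> str:
    """Normalize a string: strip accents, lowercase, drop punctuation, collapse spaces.

    Hypothetical helper for illustration; real pipelines often need
    attribute-specific cleaning (addresses, phone numbers, dates).
    """
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(no_punct.split())


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means the cleaned strings are identical."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


if __name__ == "__main__":
    print(clean("  Flávio  JUVENAL, Jr. "))
    print(similarity("Flavio Juvenal", "Flavio Juvenall"))
    print(similarity("Flavio Juvenal", "Guido van Rossum"))
```

Cleaning first means that differences in case, accents, or punctuation do not lower the score, so the fuzzy comparison only has to absorb genuine typos and variations.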

Record Deduplication has several critical applications in government and business. For example, by deduping Census records, the Australian government found that there were 250,000 fewer people in the country than previously thought. This reduction affected government agencies’ estimates and even led to revised economic projections. Similarly, businesses can use record deduplication techniques to clean up customer data. Through Python examples, this talk covers the main concepts of Record Deduplication: what kinds of problems it can solve, the most common workflow for the process, the algorithms involved, and the tools and libraries you can use. Although some of the concepts discussed are related to data mining, any intermediate-level Python developer will be able to learn the basics of deduping data with Python.
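To make the idea of a deduplication workflow more concrete, the sketch below compares every pair of records in a tiny, made-up customer list and flags pairs whose average attribute similarity passes a threshold. The sample records, the function names, and the 0.85 threshold are all assumptions chosen for illustration, not the workflow or tools presented in the talk.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Made-up sample data for illustration only.
records = [
    {"id": 1, "name": "Maria Silva", "city": "Sao Paulo"},
    {"id": 2, "name": "Maria  Sliva", "city": "sao paulo"},
    {"id": 3, "name": "John Smith", "city": "Sydney"},
]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(value.lower().split())


def field_similarity(a: str, b: str) -> float:
    """Fuzzy similarity of two attribute values, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def record_similarity(r1: dict, r2: dict, fields=("name", "city")) -> float:
    """Average the per-attribute similarities (an unweighted, assumed scoring rule)."""
    scores = [field_similarity(r1[f], r2[f]) for f in fields]
    return sum(scores) / len(scores)


def find_duplicates(rows: list, threshold: float = 0.85) -> list:
    """Return id pairs whose similarity meets the (assumed) threshold."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(rows, 2)
        if record_similarity(a, b) >= threshold
    ]


if __name__ == "__main__":
    print(find_duplicates(records))  # [(1, 2)]
```

Comparing all pairs grows quadratically with the number of records, so this approach only suits small datasets; larger ones typically need a way to limit which pairs are compared.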

