Data Profiler is an open-source library from Capital One that uses machine learning to help companies monitor big data and detect private customer information so that it can be protected. It provides a pre-trained deep learning model that efficiently identifies sensitive information and generates statistics, along with infrastructure for building custom data labelers. Data Profiler accepts a wide range of data formats, including CSV, Avro, Parquet, JSON, text, and pandas DataFrames. Whether the data is structured, semi-structured, or unstructured, the library can identify its schema, statistics, and entities. The data labeler is versatile: models can be modified as needed, and it’s possible to run multiple models on the same dataset with just a few lines of code. Data Profiler is available on GitHub.
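To give a sense of what "a few lines of code" can look like, here is a minimal sketch. It assumes the library is installed and imported as `dataprofiler`, and that `Data`, `Profiler`, and `report` behave as described in the project's public README; the helper name `profile_file` and its default output format are our own illustrative choices, not part of the library.

```python
from typing import Any, Dict


def profile_file(path: str, output_format: str = "compact") -> Dict[str, Any]:
    """Load a dataset and return its profile report as a dictionary.

    Hypothetical helper for illustration; not part of the library itself.
    """
    # Imported inside the function so the library and its machine learning
    # dependencies are loaded only when a profile is actually requested.
    import dataprofiler as dp

    # Data is expected to detect the file type (CSV, Avro, Parquet, JSON,
    # or text) from the file itself.
    data = dp.Data(path)

    # Profiler computes the schema and statistics for the loaded data.
    profile = dp.Profiler(data)

    # The report is returned as a dictionary in the requested format.
    return profile.report(report_options={"output_format": output_format})
```

Calling `profile_file("customers.csv")`, where the file name is only a placeholder, would return the report for that file. Wrapping the three library calls in one function keeps the format detection, profiling, and reporting steps visible while leaving room to swap in different report options.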
We invite data scientists, machine learning engineers, and software engineers, from beginner to expert level, to learn how to extract data properties efficiently with Data Profiler.