Data science and machine learning rely on high-quality datasets for visualization, statistical inference, and modeling. However, the barriers to validating datasets and testing data processing, analysis, or model-training code are high, even with the extensive tooling that the Python ecosystem offers, such as
In this talk, I address this problem by defining statistical typing: a runtime typing system that extends logical data types into the class of statistical data types. The additional semantics that statistical data types offer let us naturally express schemas as generative data contracts, which serve both to validate data at runtime and to generate valid samples for testing.
To illustrate this concept, I'll use
pandas data testing library, to show how statistical typing makes data testing easier by enabling you to validate real-world data with reusable schemas and isolate units of processing, analysis, or model-training code.
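
As a rough illustration of what a "generative data contract" could look like, here is a minimal sketch in plain Python. It is not the API of any particular library: the names `StatisticalType`, `Schema`, `SCHEMA`, and `mean_price_by_label` are hypothetical and exist only for this example. Each statistical type pairs a logical data type with a value-level constraint and a sampler, so the same schema can both validate data at runtime and generate valid samples for tests.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StatisticalType:
    """Hypothetical statistical type: a logical dtype plus a value constraint."""
    dtype: type                                # logical data type, e.g. float
    check: Callable[[object], bool]            # constraint each value must satisfy
    sample: Callable[[random.Random], object]  # draws one value (assumed to satisfy check)

    def validate(self, values: List[object]) -> None:
        for v in values:
            if not isinstance(v, self.dtype):
                raise TypeError(f"{v!r} is not of type {self.dtype.__name__}")
            if not self.check(v):
                raise ValueError(f"{v!r} violates the statistical constraint")


@dataclass
class Schema:
    """Hypothetical schema: a mapping from column names to statistical types."""
    columns: Dict[str, StatisticalType]

    def validate(self, table: Dict[str, List[object]]) -> Dict[str, List[object]]:
        missing = set(self.columns) - set(table)
        if missing:
            raise KeyError(f"missing columns: {sorted(missing)}")
        for name, stype in self.columns.items():
            stype.validate(table[name])
        return table  # return the input so validation can be chained

    def generate(self, n: int, seed: int = 0) -> Dict[str, List[object]]:
        rng = random.Random(seed)  # seeded so generated test data is reproducible
        return {
            name: [stype.sample(rng) for _ in range(n)]
            for name, stype in self.columns.items()
        }


# A schema for a two-column table: positive prices and a binary label.
SCHEMA = Schema({
    "price": StatisticalType(float, lambda x: x > 0, lambda r: r.uniform(0.01, 100.0)),
    "label": StatisticalType(str, lambda x: x in {"a", "b"}, lambda r: r.choice(["a", "b"])),
})


def mean_price_by_label(table: Dict[str, List[object]]) -> Dict[str, float]:
    """The unit of processing code under test; validates its input first."""
    SCHEMA.validate(table)
    groups: Dict[str, List[float]] = {}
    for price, label in zip(table["price"], table["label"]):
        groups.setdefault(label, []).append(price)
    return {label: sum(prices) / len(prices) for label, prices in groups.items()}
```

In a test, `SCHEMA.generate(100)` would stand in for real data, so `mean_price_by_label` can be exercised in isolation, while `SCHEMA.validate(real_table)` applies the same contract to real-world inputs.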