PyCon 2019 in Cleveland, Ohio

Sunday 10 a.m.–1 p.m. in Expo Hall

Association Rules Mining Using Python Generators and Pandas to Handle Large Datasets

Srivignessh Pacham Sri Srinivasan


Association rule mining with the Apriori algorithm is a standard approach to deriving association rules. Basic pandas implementations of the algorithm split the data into multiple subsets and are not suitable for large datasets: they use too much RAM, and the algorithm fails to execute. A Python generator, however, makes it possible to process one value at a time, discard it when finished, and move on to the next value. This makes generators well suited to creating item pairs, counting how often they co-occur, and determining the association rules. A generator is a special type of function that returns an iterator over a sequence of items, unlike a regular function, which returns all of its values at once (e.g., returning all the elements of a list). A generator yields one value at a time. To get the next value, we must ask for it, either explicitly by passing the generator to the built-in next() function or implicitly via a for loop. This is a great property of generators because it means that we don't have to store all of the values in memory at once. This memory-efficient implementation is tested on a Market Basket Analysis dataset for various minimum support thresholds.
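The generator-based pair counting described in the abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the author's implementation: the function names (item_pairs, association_rules), the toy orders list, and the restriction to rules between single items are all hypothetical, and it uses only the standard library rather than pandas to stay self-contained.

```python
from collections import Counter
from itertools import combinations


def item_pairs(orders):
    """Lazily yield every unordered item pair found within each order."""
    for items in orders:
        # Deduplicate and sort so (a, b) and (b, a) count as the same pair.
        yield from combinations(sorted(set(items)), 2)


def association_rules(orders, min_support):
    """Return (antecedent, consequent, support, confidence) tuples for item pairs."""
    n_orders = len(orders)
    item_counts = Counter(item for items in orders for item in set(items))
    # Counter consumes the generator one pair at a time, so the full list
    # of pairs is never materialized; only the running counts are kept.
    pair_counts = Counter(item_pairs(orders))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n_orders
        if support < min_support:
            continue
        rules.append((a, b, support, count / item_counts[a]))  # rule a -> b
        rules.append((b, a, support, count / item_counts[b]))  # rule b -> a
    return rules


# Hypothetical toy data standing in for a market basket dataset.
orders = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

pairs = item_pairs(orders)
first_pair = next(pairs)  # explicitly request a single value from the generator
rules = association_rules(orders, min_support=0.5)
```

Here orders is an in-memory list only to keep the example short. For a dataset that does not fit in RAM, the same generator could presumably be fed from a file or a chunked reader, with the order count accumulated during iteration instead of taken from len().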