Patent attributes
A system, method and computer-readable medium for missing data identification, including identifying columns in tables of a database, generating categorical columns of categorical data by transforming data values in the columns into categorical data values, generating a co-occurrence matrix corresponding to a pair of categorical columns in the categorical columns, determining an expected frequency of co-occurrence corresponding to each unique pair of categorical data values based at least in part on marginal totals corresponding to the categorical data values in the co-occurrence matrix, and identifying one or more locations of missing data based at least in part on the count of co-occurrence of each unique pair of categorical data values and the expected frequency of co-occurrence corresponding to each unique pair of categorical data values.
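The steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation. It assumes the two columns are already categorical, and it assumes the expected frequency is computed with the standard contingency-table formula (row total times column total, divided by the grand total), which the abstract does not specify. The function names, the sample data, and the `min_expected` and `ratio` thresholds are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cooccurrence_counts(col_a, col_b):
    """Count how often each (a, b) pair of categorical values shares a row."""
    return Counter(zip(col_a, col_b))


def expected_frequencies(counts):
    """Expected count for every possible pair, derived from marginal totals.

    Assumption: independence formula row_total * col_total / grand_total.
    """
    row_totals, col_totals = Counter(), Counter()
    for (a, b), n in counts.items():
        row_totals[a] += n
        col_totals[b] += n
    grand_total = sum(counts.values())
    return {
        (a, b): row_totals[a] * col_totals[b] / grand_total
        for a in row_totals
        for b in col_totals
    }


def flag_missing(counts, expected, min_expected=1.0, ratio=0.2):
    """Flag pairs observed far less often than expected.

    min_expected and ratio are hypothetical thresholds, not from the source.
    """
    return sorted(
        pair
        for pair, e in expected.items()
        if e >= min_expected and counts.get(pair, 0) < ratio * e
    )


# Hypothetical sample data: no row pairs region "south" with product "B".
regions = ["north"] * 8 + ["south"] * 4
products = ["A"] * 4 + ["B"] * 4 + ["A"] * 4

counts = cooccurrence_counts(regions, products)
expected = expected_frequencies(counts)
flagged = flag_missing(counts, expected)
print(flagged)
```

In this sample the pair (south, B) never occurs, although the marginal totals give it an expected count of 4 × 4 / 12 ≈ 1.33, so it is the only pair flagged as a candidate location of missing data.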