One example method includes identifying dissimilar items in a data set. The data set may be walked one or more times, and the nodes or vertices of the data set may be scored based on the number of times each node is touched during the walks. Nodes whose scores fall below a threshold score are determined to be dissimilar nodes in the data set. This allows a diverse set of nodes to be identified. The resulting set of dissimilar nodes may be used to prevent unintentional bias in algorithmic training.
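The walk-and-score steps can be illustrated with a minimal sketch. It rests on several assumptions that the text does not state: the data set is represented as an adjacency-list graph, each walk is a uniform random walk from a randomly chosen start node, and the function names `score_nodes_by_walks` and `dissimilar_nodes`, the parameters, and the threshold value are all hypothetical. It is one possible reading of the method, not a definitive implementation.

```python
import random


def score_nodes_by_walks(graph, num_walks, walk_length, seed=0):
    """Score each node by how many times walks over the graph touch it.

    Assumption: walks are uniform random walks; the source only says the
    data set "may be walked one or more times".
    """
    rng = random.Random(seed)  # seeded so that repeated runs are reproducible
    scores = {node: 0 for node in graph}
    nodes = sorted(graph)
    for _ in range(num_walks):
        current = rng.choice(nodes)  # assumed: random start node
        scores[current] += 1
        for _ in range(walk_length):
            neighbors = graph[current]
            if not neighbors:
                break  # dead end: this walk stops early
            current = rng.choice(neighbors)
            scores[current] += 1
    return scores


def dissimilar_nodes(scores, threshold):
    """Return the nodes whose score falls below the threshold score."""
    return {node for node, score in scores.items() if score < threshold}


# Hypothetical data set: a tight cluster (a, b, c) with a sparse tail (d, e).
graph = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "e"],
    "e": ["d"],
}
scores = score_nodes_by_walks(graph, num_walks=200, walk_length=10)
# The threshold here is an arbitrary illustrative choice (the mean score).
threshold = sum(scores.values()) / len(scores)
print(sorted(dissimilar_nodes(scores, threshold)))
```

In this sketch, nodes that are weakly connected to the rest of the graph tend to be touched less often and so tend to fall below the threshold. How the threshold is chosen is left open here, and using the mean score is only a placeholder.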