A system and method for grouping similar documents is provided. Frequencies of occurrences are determined for terms and noun phrases within a set of documents. A subset of the documents is selected by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence. Each of the documents in the subset is mapped to a cluster of documents based on a similarity of the documents to the cluster documents.