Patent 12067363 was granted and assigned to ASAPP in August 2024 by the United States Patent and Trademark Office.
A system, method, and computer program are provided for text sanitization. The system builds a corpus of document vectors (including tokenizing each document, creating a vector representation based on the tokens, and building a corpus of vector representations), obtains a new document for text sanitization, tokenizes the new document, creates a new document vector based on the tokens in the new document, and accesses the corpus of document vectors. The system filters each of the tokens in the new document against a privacy threshold. The system performs a k-anonymity sanitization process such that the new document vector becomes indistinguishable from at least k other document vectors in the corpus of document vectors. The system replaces or redacts the tokens in the document flagged as unsafe. The system updates the corpus of document vectors to include the new document vector in its form prior to the filtering and k-anonymity sanitization steps.
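The steps in the abstract can be illustrated with a small sketch. This is not the patented implementation; it is a toy reading of the abstract under several assumptions. Each document vector is taken to be a set of lowercase tokens. The privacy threshold is taken to be a minimum document frequency (the hypothetical parameter `min_doc_freq`). "Indistinguishable from at least k other document vectors" is read as "the retained tokens all appear together in at least k corpus documents", reached by greedily dropping the rarest token. The function names, the `[REDACTED]` placeholder, and the sample sentences are all illustrative.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z0-9']+")
REDACTED = "[REDACTED]"  # hypothetical placeholder for unsafe tokens


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def build_corpus(documents):
    """Build the corpus: one token-set 'vector' per document (an assumption)."""
    return [frozenset(tokenize(doc)) for doc in documents]


def sanitize(text, corpus, k=2, min_doc_freq=2):
    """Return a sanitized copy of text and add its original vector to corpus."""
    original_vector = frozenset(tokenize(text))

    # Document frequency of every token across the existing corpus.
    doc_freq = Counter(tok for vec in corpus for tok in vec)

    # Step 1: privacy threshold filter (assumed: minimum document frequency).
    safe = {tok for tok in original_vector if doc_freq[tok] >= min_doc_freq}

    # Step 2: k-anonymity (assumed reading): keep dropping the rarest token
    # until at least k corpus vectors contain every remaining token.
    def matching_vectors(tokens):
        return sum(1 for vec in corpus if tokens <= vec)

    while safe and matching_vectors(safe) < k:
        rarest = min(safe, key=lambda tok: (doc_freq[tok], tok))
        safe.remove(rarest)

    # Step 3: redact every token that was flagged as unsafe.
    def redact(match):
        word = match.group(0)
        return word if word.lower() in safe else REDACTED

    sanitized = TOKEN_RE.sub(redact, text)

    # Step 4: update the corpus with the vector as it was before sanitization.
    corpus.append(original_vector)
    return sanitized


corpus = build_corpus([
    "my order has not arrived yet",
    "my order arrived damaged",
    "has my order shipped yet",
])
print(sanitize("My order 48213 has not arrived", corpus, k=2, min_doc_freq=2))
# prints: My order [REDACTED] has [REDACTED] [REDACTED]
```

In this example the order number never appears in the corpus and "not" appears only once, so both fail the threshold filter. The k-anonymity step then drops "arrived", because only one corpus document contains "my", "order", "has", and "arrived" together, while two contain the first three. A greedy drop is only one of many possible strategies and can redact more than strictly necessary.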