One embodiment of the present invention sets forth a technique for processing recordings of events. The technique includes applying a machine learning model to a plurality of samples from one or more recordings of the events to generate a plurality of embeddings representing the plurality of samples. The technique also includes generating, based on metadata comprising timestamps of voice activity during the events and participants associated with the voice activity, a plurality of labels that identify speakers associated with the plurality of samples. The technique further includes storing mappings of the plurality of embeddings to the plurality of labels.
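
The three steps above (embedding samples, labeling them from voice-activity metadata, and storing the mappings) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the embodiment: the `Sample` and `VoiceActivity` types, the largest-time-overlap labeling rule, the in-memory list used as storage, and `toy_embed`, which merely stands in for the machine learning model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """A window of audio cut from a recording (hypothetical type)."""
    start: float  # seconds from the start of the recording
    end: float
    audio: Tuple[float, ...]


@dataclass(frozen=True)
class VoiceActivity:
    """One metadata entry: a timestamped span and the participant speaking."""
    start: float
    end: float
    participant: str


def label_sample(sample: Sample,
                 activity: Sequence[VoiceActivity]) -> Optional[str]:
    """Return the participant whose voice activity overlaps the sample most.

    Assumption: largest time overlap is used as the labeling rule.
    Returns None when no voice activity overlaps the sample.
    """
    best_label, best_overlap = None, 0.0
    for span in activity:
        overlap = min(sample.end, span.end) - max(sample.start, span.start)
        if overlap > best_overlap:
            best_label, best_overlap = span.participant, overlap
    return best_label


def build_mappings(samples: Sequence[Sample],
                   activity: Sequence[VoiceActivity],
                   embed: Callable[[Tuple[float, ...]], List[float]],
                   ) -> List[Tuple[List[float], str]]:
    """Embed each sample, label it from the metadata, and collect the pairs."""
    mappings = []
    for sample in samples:
        label = label_sample(sample, activity)
        if label is None:
            continue  # no speaker metadata covers this sample
        mappings.append((embed(sample.audio), label))
    return mappings


def toy_embed(audio: Tuple[float, ...]) -> List[float]:
    """Placeholder for the machine learning model: mean and peak value."""
    return [sum(audio) / len(audio), max(audio)]


samples = [
    Sample(0.0, 1.0, (0.1, 0.3)),
    Sample(1.0, 2.0, (0.5, 0.7)),
    Sample(5.0, 6.0, (0.0, 0.0)),  # falls outside all voice activity
]
activity = [
    VoiceActivity(0.0, 1.2, "alice"),
    VoiceActivity(1.2, 2.5, "bob"),
]
mappings = build_mappings(samples, activity, toy_embed)
```

In this sketch the third sample is dropped because no voice-activity span overlaps it, so `mappings` holds two embedding-label pairs. A real system would likely replace `toy_embed` with a trained speaker-embedding model and persist the pairs in a data store rather than a list.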