Systems and methods are disclosed for enabling a player of a video game to designate custom voice utterances to control an in-game character. One or more machine learning models may learn, from player demonstrations of desired character actions, the in-game character actions associated with each of a number of player-defined utterances. During execution of an instance of a video game, in response to an indication that a given utterance was spoken by the player, current game state information may be provided to the one or more trained machine learning models. A system may then cause one or more in-game actions to be performed by a non-player character in the instance of the video game based on output of the one or more machine learning models.
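
The flow described above (demonstration, utterance detection, model query on current game state, resulting character action) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the disclosure: the class names `UtteranceModel` and `VoiceCommandController`, the two-element game state (distance to nearest enemy, player health), and the action labels. A simple nearest-neighbor lookup over recorded demonstrations stands in for the one or more machine learning models, whose actual type is not specified here.

```python
from dataclasses import dataclass, field
from math import dist
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class UtteranceModel:
    """Hypothetical stand-in for a trained model: 1-nearest-neighbor
    lookup over the (game state, action) pairs the player demonstrated."""
    demos: List[Tuple[Tuple[float, ...], str]] = field(default_factory=list)

    def add_demonstration(self, state: Sequence[float], action: str) -> None:
        self.demos.append((tuple(state), action))

    def predict(self, state: Sequence[float]) -> str:
        if not self.demos:
            raise ValueError("no demonstrations recorded for this utterance")
        query = tuple(state)
        # Return the action demonstrated in the most similar game state.
        _, action = min(self.demos, key=lambda demo: dist(demo[0], query))
        return action


class VoiceCommandController:
    """Maps each player-defined utterance to its own model (an assumption;
    a single shared model conditioned on the utterance would also fit)."""

    def __init__(self) -> None:
        self.models: Dict[str, UtteranceModel] = {}

    def record_demonstration(
        self, utterance: str, state: Sequence[float], action: str
    ) -> None:
        model = self.models.setdefault(utterance, UtteranceModel())
        model.add_demonstration(state, action)

    def on_utterance(self, utterance: str, state: Sequence[float]) -> Optional[str]:
        """Called when speech recognition reports that `utterance` was spoken.
        Returns the action for the non-player character, or None if the
        utterance has never been defined by the player."""
        model = self.models.get(utterance)
        if model is None:
            return None
        return model.predict(state)


# Assumed state layout: (distance_to_nearest_enemy, player_health).
controller = VoiceCommandController()
controller.record_demonstration("cover me", (2.0, 0.3), "attack_nearest_enemy")
controller.record_demonstration("cover me", (20.0, 0.9), "follow_player")

# An enemy is close and health is low, so the nearer demonstration wins.
action = controller.on_utterance("cover me", (3.0, 0.4))
```

In this sketch the same utterance can yield different actions depending on game state, which is the reason the current state is passed to the model at the moment the utterance is detected rather than binding each utterance to one fixed action.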