Patent attributes
Implementations include:
- receiving, by an application programming interface (API) server of a plurality of API servers, a prediction request from a client system, each of the plurality of API servers including a stateless server;
- selecting, by the API server, a model server from a plurality of model servers based on the prediction request, each of the plurality of model servers including a stateful server;
- calling, by the API server, the model server to execute inference using an ML model loaded to memory of the model server;
- receiving, by the API server, an inference result from the ML model; and
- sending, by the API server, the inference result to the client system.
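The flow described above can be sketched in code. This is a minimal illustration, not the patented implementation: all class and method names (`ApiServer`, `ModelServer`, `select_model_server`, the hash-based routing rule, and the stand-in model) are assumptions introduced here for clarity.

```python
# Hypothetical sketch of the described flow: a stateless API server
# routes a prediction request to one of several stateful model servers,
# each of which holds an ML model resident in memory.
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PredictionRequest:
    model_name: str
    features: List[float]


class ModelServer:
    """Stateful server: keeps an ML model loaded in memory across requests."""

    def __init__(self, name: str, model: Callable[[List[float]], float]):
        self.name = name
        self.model = model  # stand-in for a real model kept in memory

    def infer(self, request: PredictionRequest) -> float:
        # Execute inference with the in-memory model.
        return self.model(request.features)


class ApiServer:
    """Stateless server: holds no per-request state between calls."""

    def __init__(self, model_servers: List[ModelServer]):
        self.model_servers = model_servers

    def select_model_server(self, request: PredictionRequest) -> ModelServer:
        # One possible selection rule (an assumption): deterministic
        # routing based on a hash of the requested model name.
        digest = hashlib.sha256(request.model_name.encode()).digest()
        return self.model_servers[digest[0] % len(self.model_servers)]

    def handle(self, request: PredictionRequest) -> float:
        # Select a model server, call it to execute inference,
        # and return the inference result to the client.
        server = self.select_model_server(request)
        return server.infer(request)


# Usage: three stateful model servers behind one stateless API server.
servers = [ModelServer(f"model-server-{i}", sum) for i in range(3)]
api = ApiServer(servers)
result = api.handle(PredictionRequest("ranker", [1.0, 2.0, 3.0]))
print(result)  # 6.0
```

Because the API server is stateless, any instance in the fleet can serve any request; only the routing rule must be shared so repeated requests for the same model reach a server that already has that model in memory.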