A mechanism is provided in a data processing system for rating the difficulty of a question. The mechanism receives an input question and generates one or more candidate answers from a corpus of knowledge using a pipeline of software engines. The pipeline of software engines generates a plurality of features extracted from the question, the one or more candidate answers, or the corpus of knowledge. The mechanism then generates a question difficulty score based on the plurality of features using a machine learning model. The machine learning model maps each feature to an assigned weight that scales that feature's contribution to the difficulty score.
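
The scoring step can be illustrated with a minimal sketch. The abstract states only that the model maps features to assigned weights, so the linear combination, the logistic squashing into the range (0, 1), the feature names, and the weight values below are all assumptions made for illustration, not the mechanism's actual model.

```python
import math
from typing import Dict

# Hypothetical feature weights. In a real system these would be learned
# from questions of known difficulty; names and values are illustrative.
WEIGHTS: Dict[str, float] = {
    "question_length": 0.02,        # longer questions assumed harder
    "num_candidate_answers": 0.1,   # more candidates assumed harder
    "top_answer_confidence": -3.0,  # confident top answer assumed easier
}
BIAS = 0.5  # assumed intercept term


def difficulty_score(features: Dict[str, float],
                     weights: Dict[str, float] = WEIGHTS,
                     bias: float = BIAS) -> float:
    """Return a difficulty score in (0, 1) from weighted features."""
    # Each feature is scaled by its assigned weight; features without
    # an assigned weight contribute nothing to the sum.
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    # Logistic function maps the weighted sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Two hypothetical feature vectors, as a pipeline might extract them.
easy = difficulty_score({
    "question_length": 8,
    "num_candidate_answers": 2,
    "top_answer_confidence": 0.95,
})
hard = difficulty_score({
    "question_length": 30,
    "num_candidate_answers": 12,
    "top_answer_confidence": 0.20,
})
print(f"easy={easy:.3f} hard={hard:.3f}")
```

Under these assumed weights, the question with a short text, few candidate answers, and a confident top answer receives the lower score. Any other model that assigns weights to features could take the place of the weighted sum.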