WizardLM is a family of instruction-following large language models (LLMs) powered by Evol-Instruct, a method that uses LLMs instead of humans to automatically mass-produce open-domain instructions for fine-tuning. The family includes WizardLM, WizardCoder, and WizardMath. WizardLM and the Evol-Instruct method were introduced in an April 2023 paper from researchers at Microsoft and Peking University led by Can Xu, a senior applied scientist at Microsoft's STCA (Software Technology Center Asia) working in the S+D NLP Science Group.
Instruction-following LLMs are trained or fine-tuned on open-domain instruction data, which has typically been written by human annotators. However, the manual creation of instructions is time-consuming and labor-intensive. Evol-Instruct leverages LLMs to automatically generate large amounts of instruction data with varying levels of complexity. Starting with an initial set of instructions, the team from Microsoft STCA used Evol-Instruct to rewrite them step by step into more complex instructions. The generated instruction data was then used to fine-tune the LLaMA LLM to produce WizardLM.
Initial attempts to train LLMs for NLP tasks were based on a small number of hand-written instructions accompanying each task. These closed-domain instructions have two main drawbacks: all samples in an NLP dataset share only a few common instructions, and each instruction asks for only one task (e.g., translation or summarization). LLMs have achieved better results, performing more complicated and diverse tasks, when trained on open-domain instruction data generated by human users. However, collecting this data is expensive and time-consuming, and it also introduces skew. Experts make up a small proportion of annotators, so the resulting instruction data tends to skew towards easy or moderate examples.
Evol-Instruct is an automatic method capable of mass-producing open-domain instructions (including more complicated instructions) using LLMs instead of humans. The diagram below shows a running example of Evol-Instruct starting with a simple instruction and then randomly selecting in-depth evolving (blue line) or in-breadth evolving (red line) to generate new and more complicated instructions.

Example of instructions generated using Evol-Instruct starting from a single, simple instruction.
In-depth evolving includes five types of operations: adding constraints, deepening, concretizing, increasing reasoning steps, and complicating input. In-breadth evolving is a mutation, i.e., generating a completely new instruction based on the given instruction. All six operations are implemented by prompting an LLM. An instruction eliminator was developed to filter out failed instructions created by the LLM, a process known as elimination evolving. The evolutionary process is repeated for several rounds to generate instruction data covering a range of complexity.
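The loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the prompt templates, the function names (evolve_once, is_failed, evol_instruct), the eliminator heuristics, and the choice to carry a failed instruction forward unchanged are all assumptions made for the example. The LLM is passed in as a plain callable, so any real model client could be substituted.

```python
import random
from typing import Callable, List

# Hypothetical prompt templates, one per operation named in the text.
# The wording is illustrative only; it is not the paper's actual prompts.
IN_DEPTH_OPS = {
    "add_constraints": "Rewrite the instruction, adding one more constraint:\n{instruction}",
    "deepening": "Rewrite the instruction so it probes the topic more deeply:\n{instruction}",
    "concretizing": "Rewrite the instruction, replacing general concepts with specific ones:\n{instruction}",
    "increase_reasoning": "Rewrite the instruction so it needs more reasoning steps:\n{instruction}",
    "complicate_input": "Rewrite the instruction with more complex input data:\n{instruction}",
}
IN_BREADTH_OPS = {
    "mutation": "Write a brand-new instruction in the same domain as:\n{instruction}",
}

LLM = Callable[[str], str]  # maps a prompt string to generated text


def evolve_once(instruction: str, llm: LLM, rng: random.Random) -> str:
    """Pick one of the six operations at random and apply it by prompting the LLM."""
    ops = {**IN_DEPTH_OPS, **IN_BREADTH_OPS}
    name = rng.choice(sorted(ops))
    return llm(ops[name].format(instruction=instruction))


def is_failed(original: str, evolved: str) -> bool:
    """Placeholder eliminator (assumed heuristics): reject empty or unchanged output."""
    text = evolved.strip()
    return not text or text == original.strip()


def evol_instruct(seeds: List[str], llm: LLM, rounds: int = 3, seed: int = 0) -> List[str]:
    """Evolve the seed instructions for several rounds and return all kept instructions."""
    rng = random.Random(seed)
    pool = list(seeds)      # seeds plus every successfully evolved instruction
    current = list(seeds)   # the generation that will be evolved next
    for _ in range(rounds):
        next_generation = []
        for instruction in current:
            candidate = evolve_once(instruction, llm, rng)
            if is_failed(instruction, candidate):
                # Assumption: a failed evolution keeps the old instruction for the next round.
                next_generation.append(instruction)
            else:
                next_generation.append(candidate)
                pool.append(candidate)
        current = next_generation
    return pool


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the instruction with a marker appended."""
    return prompt.rsplit("\n", 1)[-1] + " (evolved)"


if __name__ == "__main__":
    for line in evol_instruct(["What is 1+1?"], toy_llm, rounds=3):
        print(line)
```

Because the pool keeps the seeds together with every generation, the resulting data mixes simple and progressively more complex instructions, which matches the stated goal of covering a range of complexity.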
WizardLM is an LLM built to validate the Evol-Instruct method by fine-tuning the open-source LLaMA model on evolved instructions. In their April 2023 paper, the team behind WizardLM evaluated the model's performance against leading work on instruction fine-tuning. The instruction datasets compared with WizardLM's were the data used by Alpaca (generated using self-instruct) and the 70k ShareGPT dataset (shared by real users) used by Vicuna.
Due to the low proportion of difficult instructions in previous instruction-following test datasets, the team created a new difficulty-balanced test dataset, named the Evol-Instruct testset. Human annotators and GPT-4 were both used to evaluate Alpaca, Vicuna, ChatGPT, and WizardLM on the Evol-Instruct testset and Vicuna’s testset. The paper reports that instructions from Evol-Instruct were superior to those from the human-created ShareGPT data and that the WizardLM model outperforms Vicuna. Additionally, labelers preferred WizardLM outputs over those from ChatGPT on complex test instructions.
The table below shows the WizardLM models that have been released alongside their evaluation and license. Evaluation is determined using the MT-Bench, AlpacaEval, GSM8k, and HumanEval benchmarks.
WizardLM models
WizardCoder is the result of adapting the Evol-Instruct method to code. Most existing models performing code-related tasks are pre-trained solely on extensive raw code data without instruction finetuning. In a paper released in June 2023, the WizardLM team demonstrated the capabilities of WizardCoder and the extension of Evol-Instruct to code-related instructions.
The table below shows the WizardCoder models that have been released alongside their evaluation and license. Evaluation is based on the HumanEval dataset from OpenAI, which consists of 164 programming challenges, and the MBPP (mostly basic Python programming) benchmark, which consists of around 1,000 crowd-sourced Python programming problems.
WizardCoder models
WizardMath, first described in an August 2023 paper, is a fine-tuned version of LLaMA-2 using a proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to generate instructions for math tasks. Most open-source models are only pre-trained on large-scale internet data, without specific math-related optimization.
The table below shows the WizardMath models that have been released alongside their evaluation and license. Evaluation is defined in terms of the GSM8k and MATH benchmarks.

