Gorilla is a large language model (LLM) that generates API calls given a natural language query. An end-to-end model tailored to serve correct API calls without additional coding, Gorilla is designed to work as part of a wider ecosystem, integrating with other tools. The model's website describes it as "An Appstore for LLMs."
Gorilla was created by researchers from UC Berkeley and Microsoft Research, who claim it outperforms several baseline models for code generation, including GPT-4. Gorilla is fine-tuned on APIBench, a new dataset of API descriptions from three machine learning hubs—Torch Hub, TensorFlow Hub, and HuggingFace. Gorilla can also call out to an external document database containing API definitions, accessing new APIs without re-training.
The first Gorilla model (for HuggingFace API descriptions) was released on May 27, 2023, based on LLaMA-7B, a 7 billion parameter LLM created by Meta. The next day, two LLaMA-based versions of the model were released for Torch Hub and TensorFlow API descriptions. On June 6, 2023, two more versions were released under the Apache 2.0 license, allowing developers to use the models for commercial use. Unlike previous models, these two releases were not based on LLaMA-7B: one is based on MPT-7B from MosaicML, and the other on Falcon-7B from the Technology Innovation Institute.
The Gorilla code and model files are available on GitHub. A Google Colab notebook demo of the model has also been released, allowing users to launch the three LLaMA-7B-based models; the MPT-7B-based model is served via a hosted endpoint, with plans to add a Falcon-7B version. Users can also run Gorilla using the command line interface (CLI). On June 29, 2023, a research prototype of Gorilla CLI was released, building on the Gorilla LLMs to provide potential commands for execution based on natural language queries. Gorilla CLI was released under the Apache 2.0 license.
A paper describing Gorilla was first released on May 24, 2023. Titled "Gorilla: Large Language Model Connected with Massive APIs," the paper is authored by Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. At the time of its publication, both lead authors, Patil and Zhang, were fourth-year PhD students under Professor Gonzalez at UC Berkeley. Wang, a former PhD student at UC Berkeley who worked with Gonzalez, is a senior researcher at Microsoft Research within the Physics of AGI group; previously she was part of the Computer Vision Group. Gonzalez is a professor in the Department of Electrical Engineering and Computer Science at UC Berkeley, a co-director and founding member of the UC Berkeley RISE Lab, and a member of the Berkeley AI Research (BAIR) group.
The day after submitting the paper (May 25, 2023), the APIBench dataset and the evaluation code of Gorilla were released. On May 27, 2023, the team released the first Gorilla model (Hugging Face APIs) as well as the APIZoo contribution guide for community API contributions. On May 28, they released two more versions of the Gorilla model based on Torch Hub and TensorFlow Hub APIs. On May 30, 2023, the CLI to chat with Gorilla was introduced. On June 6, two commercially usable Apache 2.0 licensed Gorilla models were released, based on MPT-7B and Falcon-7B. On June 29, 2023, Gorilla CLI was released, an LLM-backed tool that provides CLI commands for execution based on natural language queries.
APIBench is a large corpus of APIs developed by the team behind Gorilla. Constructed by scraping machine learning APIs from public model hubs, APIBench contains APIs with complicated and often overlapping functionality. The researchers chose three major model hubs to construct APIBench:
- HuggingFace—925 API calls
- TorchHub—94 API calls
- TensorHub—696 API calls
TorchHub and TensorHub were scraped exhaustively. However, HuggingFace contains a large number of models (>200,000), many of which have poor documentation, lack dependencies, or have limited information on their model cards. Therefore, only the twenty most downloaded models per task category were used. The task categories included seven in multimodal data, eight in computer vision, twelve in natural language processing, five in audio, two in tabular data, and two in reinforcement learning.
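The per-task filtering described above can be sketched in Python. This is a minimal illustration, not the team's actual scraping code; the record fields (`task`, `downloads`) are hypothetical stand-ins for Hugging Face model-card metadata.

```python
from collections import defaultdict

def top_models_per_task(models, k=20):
    """Group model records by task category and keep the k most
    downloaded models in each category.

    `models` is a list of dicts with hypothetical keys
    'task', 'downloads', and 'name'.
    """
    by_task = defaultdict(list)
    for m in models:
        by_task[m["task"]].append(m)
    # Sort each category by download count, descending, then truncate.
    return {
        task: sorted(entries, key=lambda m: m["downloads"], reverse=True)[:k]
        for task, entries in by_task.items()
    }
```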
Ten synthetic user question prompts per API were also generated using Self-Instruct, such that each entry in the dataset becomes an instruction–reference API pair. GPT-4 was used to generate the synthetic instruction data. Three in-context examples were provided, along with the reference API documentation, tasking GPT-4 to refrain from using API names or hints when generating instructions.
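The generation setup can be sketched as a prompt-assembly step: three in-context instruction/API examples followed by the reference documentation and the instruction-writing request. The exact wording below is an assumption for illustration, not the paper's verbatim prompt.

```python
def build_selfinstruct_prompt(api_doc, examples):
    """Assemble a Self-Instruct-style prompt for GPT-4.

    `examples` is a list of dicts with hypothetical keys
    'instruction' and 'api'; only the first three are used,
    mirroring the three in-context examples in the paper.
    """
    parts = []
    for ex in examples[:3]:
        parts.append(f"Instruction: {ex['instruction']}\nAPI: {ex['api']}")
    parts.append(
        "Write 10 user instructions that this API could satisfy. "
        "Do not mention the API name or give hints.\n"
        f"API documentation:\n{api_doc}"
    )
    return "\n\n".join(parts)
```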
Gorilla is a retrieve-aware fine-tuned model, specifically for API calls. The model employs Self-Instruct to generate instruction/API pairs. To fine-tune the base model (LLaMA for the initial releases), these pairs are converted into a user-agent chat-style conversation, with each round of the conversation making up a data point. Next, standard instruction fine-tuning was performed on the base model. Gorilla was trained with and without the retriever.
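The conversion of an instruction/API pair into a chat-style data point can be sketched as follows; the role labels and field names are hypothetical, as the paper does not prescribe a specific schema.

```python
def to_chat_example(instruction, api_call):
    """Convert one instruction/API pair into a single-round
    user-agent conversation suitable for instruction fine-tuning.

    Field names ('role', 'content') are illustrative assumptions.
    """
    return [
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": api_call},
    ]
```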
API calls often come with constraints. Therefore, Gorilla must not only comprehend the functionality of an API call but also categorize the calls according to different constraint parameters. The paper shows that augmenting an LLM with retrieval does not always improve performance. During inference, the user provides the natural language prompt. These prompts can describe a simple task or be more vague. Gorilla has two inference modes—zero-shot and retrieval. In zero-shot mode, the prompt is passed directly to the Gorilla LLM, which returns the API that helps accomplish the task or goal. In retrieval mode, the retriever (either BM25 or GPT-Index) returns the most up-to-date API documentation stored in the API database. This is concatenated to the prompt along with the message "Use this API documentation for reference" before it is passed to Gorilla. Besides the concatenation, no further prompt tuning is performed.
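The retrieval-mode concatenation step is simple enough to sketch directly. The function below assumes the retriever (BM25 or GPT-Index in the paper) has already returned a documentation string; the exact joining format is an assumption.

```python
def build_retrieval_prompt(user_query, retrieved_doc):
    """Concatenate retrieved API documentation to the user's prompt,
    as in Gorilla's retrieval mode. Per the paper, no prompt tuning
    beyond this concatenation is performed.
    """
    return (
        f"{user_query}\n"
        f"Use this API documentation for reference: {retrieved_doc}"
    )
```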
Given a natural language prompt, there are a number of different LLM APIs that Gorilla could provide to complete the task. For example, there are many different image generation models. To evaluate the performance of Gorilla and verify the APIs delivered, their functional equivalence is compared using the dataset collected. An AST sub-tree matching strategy is adopted to trace which API in the dataset is being called. The AST matching process also directly identifies hallucinations, which the paper defines as an API call that is not a sub-tree of any API in the database—in other words, the model returned an imagined tool. This is different from invoking an incorrect API, described in the paper as an error, not a hallucination.
Gorilla: Large Language Model Connected with Massive APIs
Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez
May 24, 2023