Big-data and machine learning software provider Databricks Inc. today released Dolly 2.0, the next iteration of the company’s open-source generative artificial intelligence model that has ChatGPT-like capabilities.
Databricks released the original Dolly two weeks ago in response to the many large language models on the market today that remain largely inaccessible to researchers and businesses because they are locked behind paywalls and controlled by centralized services.
Dolly 2.0, like its predecessor, uses a smaller model than most LLMs, which makes it particularly lightweight. The original Dolly had 6 billion parameters, compared with the 175 billion of OpenAI LP’s GPT-3, and Dolly 2.0 doubles that to 12 billion parameters. It has also been fine-tuned on a high-quality instruction-following dataset crowdsourced from Databricks employees.
When asked questions by users, the AI draws on its training to generate coherent sentences and responses, the hallmark of generative AI. It can do this even though it is much smaller than OpenAI’s models, which makes it possible to run on a company’s internal servers without sharing data with a third party.
“We believe models like Dolly will help democratize LLMs, transforming them from something very few companies can afford into a commodity every company can own and customize to improve their products,” the company said when it released Dolly.
To that end, Databricks has fully open-sourced Dolly 2.0, including its training code and dataset, for commercial use. The dataset included with Dolly 2.0 is “databricks-dolly-15k,” which contains 15,000 high-quality human-generated prompt-and-response pairs that anyone can use, modify and extend under a Creative Commons license.
The original Dolly was trained for $30 using a dataset that the Stanford Alpaca team created with the OpenAI API. Because OpenAI prohibits using ChatGPT’s outputs to create a competing model, Dolly could not be used commercially. To get around that issue, Databricks set out to build a new model entirely its own, using only employee-written responses rather than a training set “tainted” by restrictive licensing terms.
To produce the training data, Databricks set up a contest among its 5,000 employees to write original questions and answers. Tasks included answering open-ended questions such as “Why do people like comedy movies?”, extracting and summarizing information from Wikipedia, and brainstorming and creative writing exercises such as producing love letters, poetry or songs.
According to Databricks, Dolly 2.0 is currently the only model of its kind free of this commercial-use restriction. Others, including Alpaca, Koala, GPT4All and Vicuna, cannot be used commercially because of the terms attached to their training data.
“We’ve heard repeatedly from our customers that they would be best served by owning their models, allowing them to create higher quality models for their domain-specific applications without handing their sensitive data over to third parties,” the company said.