At the European Open Source AI Index, we welcome all efforts to promote openness in the AI space. Most new models we add end up somewhere in the middle of the index, often because they build on widespread models like Llama or Mistral that are not themselves very open to begin with. Much to our delight, there is a genuinely new model in town that has overtaken good old BLOOMZ for the second-place spot.
YuLan-Mini is a new model by the Gaoling School of Artificial Intelligence. Its creators claim particularly strong performance in math and code, and the model is fully open according to nearly all of our openness measures. The model creators publish the data used to train their base model on HuggingFace, providing thorough information on the data mixture in an accompanying table. The model weights themselves are published under the open MIT license, and training procedures and code are documented both on GitHub and in a corresponding paper. The model itself is made available through Ollama for convenient use.
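For readers who want to take a quick look themselves, the sketch below shows one minimal way to load the model with the HuggingFace transformers library. Note that the repository id yulan-team/YuLan-Mini is our assumption of where the weights live; check the team's HuggingFace page for the exact checkpoint name.

```python
# Minimal sketch: loading YuLan-Mini with HuggingFace transformers.
# The repository id below is an assumption; consult the YuLan team's
# HuggingFace page for the exact checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yulan-team/YuLan-Mini"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```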
The detailed documentation of YuLan-Mini brings home the degree of scaffolding possible in the current open source AI landscape. For instance, to bootstrap math abilities, YuLan-Mini uses Qwen 2.5 Math 7B Instruct as a teacher model; for instruction tuning, AllenAI's DOLMA dataset plays an important role; and for reward modeling, the Skywork Reward model, built on public data, is used. YuLan-Mini exemplifies the continuing reliance of current models on large amounts of synthetic data, in a tradition that goes back to Alpaca's first GPT-4-derived datasets. The prominence of synthetic data is something we follow with interest, as model makers have to walk a fine line between performance improvements and model collapse.
We commend the effort involved in open-sourcing this model to such a significant extent. Somewhat surprisingly, YuLan-Mini has received relatively little attention from the open-source community. With all versions of the model currently sitting at fewer than 150 downloads per month, we think there is a lot of untapped potential here. We encourage anyone interested in open source generative AI to give this model a spin, peek under the hood, and learn from its high documentation standards.
Parameter descriptions:
Base Model Data
Are data sources for training the base model comprehensively documented and made available? If a distinction between the base (foundation) model and the end (user) model is not applicable, this mirrors the end model data entries.
End User Model Data
Are data sources for training the model that the end user interacts with comprehensively documented and made available?
Base Model Weights
Are the weights of the base model made freely available? If a distinction between the base (foundation) model and the end (user) model is not applicable, this mirrors the end model weights entries.
End User Model Weights
Are the weights of the model that the end user interacts with made freely available?
Training Code
Is the source code of data source processing, model training, and tuning comprehensively made available?
Code Documentation
Is the source code of data source processing, model training, and tuning comprehensively documented?
Hardware Architecture
Is the hardware architecture used for data source processing and model training comprehensively documented?
Preprint
Are archived preprint(s) available that detail all major parts of the system, including data source processing, model training, and tuning steps?
Paper
Are peer-reviewed scientific publications available that detail all major parts of the system, including data source processing, model training, and tuning steps?
Modelcard
Is a model card available in a standardized format that provides comprehensive insight into model architecture, training, fine-tuning, and evaluation?
Datasheet
Is a datasheet as defined in "Datasheets for Datasets" (Gebru et al. 2021) available?
Package
Is a packaged release of the model available on a software repository (e.g., the Python Package Index or Homebrew)?
API and Meta Prompts
Is an API available that provides unrestricted access to the model (other than security and CDN restrictions)? If applicable, this entry also collects information on the use and availability of meta prompts.
Licenses
Is the project fully covered by Open Source Initiative (OSI)-approved licenses, including all data sources and training pipeline code?