European Open Source AI Index

Teuken

by OpenGPT-X

Open-source multilingual LLM that claims to support all 24 official languages of the European Union.
Text
Full
https://huggingface.co/openGPT-X/Teuken-7B-instruct-commercial-v0.4
Teuken-7B-base
Teuken-7B-instruct
Apache-2.0
Project aiming to develop LLMs in Germany.
https://opengpt-x.de/en/
September 2024
Availability
Base Model Data
Dataset described as deriving from Common Crawl, but no filtered dataset is provided. Either a filtered dataset or a fully reproducible and persistent data pipeline would be preferred here.
https://arxiv.org/pdf/2410.08800
End User Model Data
The Huggingface model card shows a table with all datasets used for the end model.
https://huggingface.co/openGPT-X/Teuken-7B-instruct-commercial-v0.4#instruction-tuning-data
Base Model Weights
Available via Huggingface repository.
https://huggingface.co/openGPT-X/Teuken-7B-base-v0.6
End User Model Weights
Available via Huggingface repository.
https://huggingface.co/openGPT-X/Teuken-7B-instruct-commercial-v0.4
Training Code
An SBATCH script with training code is available in a fork of Megatron-LM. However, there is no easily visible and navigable repository containing the code used to train the model. Making the repository more visible would alleviate this.
https://github.com/OpenGPTX/Megatron-LM/blob/main/examples/7B_EU24_juwels_part_3_fw_after3T.sbatch
Documentation
Code Documentation
The README of the repository containing the training code is unchanged from the base repo; more elaborate documentation would be warranted. A good example of documentation style is the repository for the OLMo model: https://github.com/allenai/OLMo
Hardware Architecture
The preprint describes the architecture, providing details about design decisions and hyperparameters.
https://arxiv.org/abs/2410.03730
Preprint
Three corresponding preprints, detailing the models, data, and evaluation.
https://arxiv.org/abs/2410.03730
https://arxiv.org/abs/2410.08928
https://arxiv.org/abs/2410.08800
Paper
Peer-reviewed paper published at ECAI. The other publications are only available as preprints.
https://ecai2025.org/accepted-papers/
Modelcard
Detailed modelcard showing training details, data, technical specifications, and example usage.
https://huggingface.co/openGPT-X/Teuken-7B-instruct-commercial-v0.4
Datasheet
No datasheet containing a detailed description of data collection and curation is attached to a persistent version of the model data, as would be preferred here. A persistent version of the filtered data, with the information from the data preprint at https://arxiv.org/abs/2410.08800 attached, would be sufficient.
Access
Licenses
Apache 2.0, an OSI-approved license.
Supported by the Centre for Language Studies and the Dutch Research Council. Website design & development © 2024 by BSTN. This version of the index generated 09 April 2026, website content last updated 11 March 2026.