OpenGPT-X: Open Source AI 'Made in Germany' falls short of own claims
by Dick Blankvoort & Jenia Jitsev
The OpenGPT-X initiative is an organization seeking to create open source AI models 'Made in Germany'. In this article, we evaluate its flagship Teuken models through the lens of open source. In collaboration with Jenia Jitsev of LAION and the Jülich Supercomputing Centre, we explore some of the project's strengths and weaknesses. We find that the model falls short of its own claims of openness, and document early signs of adverse consequences for the European open source AI ecosystem.
Parameter descriptions:
Base Model Data
Are datasources for training the base model comprehensively documented and made available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
End User Model Data
Are datasources for training the model that the end user interacts with comprehensively documented and made available?
Base Model Weights
Are the weights of the base models made freely available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
End User Model Weights
Are the weights of the model that the end user interacts with made freely available?
Training Code
Is the source code of dataset processing, model training and tuning comprehensively made available?
Code Documentation
Is the source code of datasource processing, model training and tuning comprehensively documented?
Hardware Architecture
Is the hardware architecture used for datasource processing and model training comprehensively documented?
Preprint
Are archived preprint(s) available that detail all major parts of the system including datasource processing, model training and tuning steps?
Paper
Are peer-reviewed scientific publications available that detail all major parts of the system including datasource processing, model training and tuning steps?
Modelcard
Is a model card available in standardized format that provides comprehensive insight on model architecture, training, fine-tuning, and evaluation?
Datasheet
Is a datasheet as defined in "Datasheets for Datasets" (Gebru et al. 2021) available?
Package
Is a packaged release of the model available on a software repository (e.g. a Python Package Index, Homebrew)?
API and Meta Prompts
Is an API available that provides unrestricted access to the model (other than security and CDN restrictions)? If applicable, this entry also collects information on the use and availability of meta prompts.
Licenses
Is the project fully covered by Open Source Initiative (OSI)-approved licenses, including all data sources and training pipeline code?
The data used to train the base model of the Teuken series originates from two sources: the FineWeb-Edu dataset and a private dataset constructed by OpenGPT-X itself. The FineWeb-Edu dataset is widely available and used as pretraining data in over 200 models. The private dataset, on the other hand, has a number of qualities that make obtaining insight into it challenging.
The full set of training data was not made publicly available. This means that anyone seeking to inspect or reproduce the data behind the models needs to request data access from the model authors. Although this is fairly standard practice, it does pose challenges to transparency and makes it difficult to reconstruct what data went into the model.
Reconstruction is further hampered by the lack of a clear, open source codebase documenting the processes of data collection, curation, filtering and processing. While the model does come with a preprint that documents parts of the training procedure, not enough detail is provided to make reproduction feasible. Here we list a selection of data sources that are difficult or impossible to access.
We ran into various issues attempting to reconstruct how the model was trained, including 404 pages, pages inaccessible without a specific university account, and links to sites without directions on versioning and access methods. Furthermore, as mentioned in the model preprint, the data mixture was adapted before being used for pretraining: English data was strategically downsampled and data in other languages upsampled. However, no information was made available on how this processing step was carried out. Without an exhaustive list of data sources and clarity on data processing procedures, we conclude that reproducing the training procedure is not feasible.
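Re-balancing a multilingual mixture typically comes down to computing per-language sampling weights. As a purely hypothetical illustration of the kind of step left undocumented here, the sketch below uses temperature-based sampling, a common technique in multilingual pretraining; the `rebalance` function and the toy counts are our own assumptions, not OpenGPT-X's actual procedure.

```python
# Hypothetical illustration of multilingual mixture re-balancing via
# temperature-based sampling (a common technique) -- NOT the undocumented
# procedure OpenGPT-X actually used.

def rebalance(token_counts, temperature=0.7):
    """Compute per-language sampling probabilities.

    A temperature below 1 flattens the distribution: high-resource
    languages (e.g. English) are downsampled, low-resource ones upsampled.
    """
    total = sum(token_counts.values())
    raw = {lang: (n / total) ** temperature for lang, n in token_counts.items()}
    norm = sum(raw.values())
    return {lang: p / norm for lang, p in raw.items()}

# Toy mixture: English dominates the raw token counts (80%).
counts = {"en": 800, "de": 120, "fr": 50, "bg": 30}
probs = rebalance(counts)

assert probs["en"] < 0.8   # English share drops below its raw proportion
assert probs["bg"] > 0.03  # low-resource languages gain sampling mass
```

With temperature 0.7 the English share falls from 80% to roughly 66% in this toy example, while the smallest language more than doubles its share.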
Next, we examined the preprocessing procedure: all the ways in which the training data was transformed before model training began. A 2025 preprint on data processing for the OpenGPT-X family claims that "each step of our pipeline saves intermediate results", but these checkpoints are not made available. Making such intermediate steps publicly available can greatly contribute to the openness of a model, as AllenAI's OLMo models and the Dolma dataset demonstrate.
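To illustrate what "saving intermediate results" can enable, here is a minimal sketch of a pipeline that persists each stage's output so third parties could audit it. The stage names and filter rules are hypothetical and not taken from OpenGPT-X's pipeline.

```python
import json
import pathlib
import tempfile

# Minimal sketch of a data pipeline that persists each stage's output as a
# checkpoint, so intermediate results could in principle be published and
# audited. Stage names and filters are hypothetical.

def run_pipeline(docs, outdir):
    stages = [
        ("dedup", lambda d: list(dict.fromkeys(d))),           # drop exact duplicates
        ("min_length", lambda d: [x for x in d if len(x) >= 10]),  # drop short docs
    ]
    outdir = pathlib.Path(outdir)
    for name, fn in stages:
        docs = fn(docs)
        # Persist this stage's intermediate result as a checkpoint file.
        (outdir / f"{name}.json").write_text(json.dumps(docs))
    return docs

with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(
        ["short", "a longer document here", "a longer document here"], tmp
    )

assert result == ["a longer document here"]
```

Publishing the per-stage checkpoint files (here `dedup.json` and `min_length.json`) is what distinguishes an auditable pipeline from a prose description of one.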
The instruction-tuning data mixture of Teuken is laid out quite clearly. As far as we can tell, all data sources are properly linked and open source datasets are used. Access to the instruction-tuning data could be further improved by publishing the final dataset, as for instance in the case of Cohere Labs' Aya models.
One of the claimed innovations of the Teuken model is its use of a balanced multilingual tokenizer which, the preprint claims, was crucial for bringing down training costs. In this light, it seems a missed opportunity that relatively little is known about this tokenizer, which is only available bundled with the instruction-tuned model. The paper from which it derives provides some implementation details, but no guidance regarding the underlying data mixture. The research community could benefit from further information about this piece of technology.
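One way the community commonly quantifies tokenizer balance is "fertility": the average number of tokens produced per word, where lower and more uniform values across languages translate into cheaper multilingual training. The sketch below computes fertility with a toy greedy tokenizer of our own invention; Teuken's actual tokenizer data mixture and metrics remain undisclosed.

```python
# Sketch of the "fertility" metric often used to compare multilingual
# tokenizers: average tokens produced per whitespace-separated word.
# The toy tokenizer and vocabulary here are stand-ins for illustration.

def toy_tokenize(word, vocab):
    """Greedy longest-match tokenization against a small vocabulary,
    falling back to single characters when nothing matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

def fertility(text, vocab):
    words = text.split()
    return sum(len(toy_tokenize(w, vocab)) for w in words) / len(words)

# A vocabulary biased toward English fragments tokenizes German worse:
vocab = {"the", "quick", "ing", "tion"}
en = fertility("the quick brown fox", vocab)
de = fertility("der schnelle braune Fuchs", vocab)
assert en < de  # English gets fewer tokens per word under this vocab
```

A "balanced" multilingual tokenizer, as Teuken claims to have, would keep fertility roughly even across languages; without the tokenizer's training mixture, that claim cannot be independently checked.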
All in all, the data processing pipeline underlying the Teuken models is not as open as it might appear at first glance. While data sources are partly documented, claims about openness are hard to verify and full reproducibility is out of reach: there is no publicly downloadable training dataset, no comprehensive documentation of data sources, and little detail about the multilingual tokenizer and its implementation.
Teuken claims to be "open source", a core element of which is that the weights of the model are published (open-weights). In this section, we take a closer look at weight availability of the Teuken base and end models.
The base model weights of (pretrained) Teuken are available only upon request from the model authors. We have approached the model creators with such a request, but have not yet received a response. For the end model weights, two versions are shared via OpenGPT-X's HuggingFace page: a research model and a commercial model (the preprint makes no mention of such a distinction). The model card and the OpenGPT-X website claim that the research model included training data with a non-permissive license, while for the commercial model this data was excluded. Based on this, we assume that the commercial model excludes the part of the private data mixture that was licensed non-permissively. If true, this further highlights the need for training data transparency, which would allow independent parties to verify such claims.
Like many "open" models, Teuken is perhaps better characterised as "open weights" than as "open source". But whereas many model providers share both base models and final models (>50 in our index), for Teuken only the final model weights are openly shared, with the base models available only upon request and fully at the discretion of the model provider.
A hallmark of "open source" technology is the open availability of source code. While many open models provide most or all of their source code (good examples are OLMo by AllenAI, SmolLM2 by HuggingFace, and DCLM by the DataComp initiative and their collaborators), this is different for the Teuken models. We have not been able to locate the source repositories used for training or fine-tuning, nor documentation on how to reproduce model training (apart from imprecise prose descriptions in preprints).
While OpenGPT-X does share a number of code repositories on GitHub, most of them represent forks of software from open source initiatives, for instance EleutherAI's LM evaluation harness and HuggingFace's OLM project. This makes clear that the OpenGPT-X initiative leans heavily on open source technologies, but has not yet reciprocated by releasing source code back to the open source community.
OpenGPT-X claims its model data are processed and stored in line with European standards for data storage and processing. For a European AI initiative, following these data transparency principles is essential. However, given the previously documented lack of data transparency, this claim is difficult to verify. A particularly thorny issue here is the inheritance of usage restrictions from the licenses of datasets used in model training.
Examining only the subset of used datasets that we could verify (see table), we found that this portion of the datasets used for base model training is covered by a variety of non-permissive licenses. Ultimately it remains unclear whether the two published models were trained on data covered by these licenses. Full transparency would alleviate doubts about whether the model adheres to the claimed data standards and was trained exclusively on appropriately licensed data. Notably, the choice to go ahead and publish the model weights offers an easy way out: a working model can be delivered without clarifying the underlying training data licensing situation. This is a widely used practice in the field, but one with a clear adverse impact on the open source AI community.
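Such verification would essentially amount to checking each training dataset's license against an allowlist of permissive licenses. Below is a minimal sketch of such an audit over a hypothetical manifest; the dataset names and license tags are invented for illustration, since the full list of Teuken's training datasets is unpublished.

```python
# Sketch of a license audit over a training-data manifest. Dataset names
# and license tags are hypothetical; a real audit would need the full,
# currently unpublished, list of training datasets.

PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause", "cc-by-4.0", "cc0-1.0"}

manifest = [
    {"dataset": "web-crawl-subset", "license": "odc-by"},
    {"dataset": "wiki-dump", "license": "cc-by-sa-3.0"},  # share-alike restriction
    {"dataset": "code-corpus", "license": "mit"},
    {"dataset": "news-archive", "license": "all-rights-reserved"},
]

def audit(entries, allowlist):
    """Return the datasets whose license is not on the permissive allowlist."""
    return [e["dataset"] for e in entries if e["license"] not in allowlist]

flagged = audit(manifest, PERMISSIVE)
assert "code-corpus" not in flagged
assert "news-archive" in flagged
```

With a published manifest, anyone could run such a check; without one, claims of compliant licensing rest entirely on trust in the model provider.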
Impact on the research community. The documented lack of data transparency and model openness has consequences for any party using or building on these models. Other models building on Teuken would, like Mistral- and Llama-based models, inherit its shortcomings. Truly open models can only be built on truly open foundations. Large, publicly funded model publishers that actively invite others to adopt and build on their technology have a responsibility to provide (base) models that meet the highest possible transparency and openness standards. In the event of widespread adoption in research and development, the lack of openness of OpenGPT-X would negatively affect the openness standards of any study it is part of or any project that builds on it.
Impact on public perception. OpenGPT-X's own press releases and media coverage paint a rosy picture. Teuken is described as a fully open source model that handles data in line with European regulations. Given the limited reproducibility and availability, however, these claims must be taken with a grain of salt. Though some observations regarding model openness are promising, we conclude that the two published models and related technologies (e.g. the published tokenizer) do not warrant the label of fully open source, and that, when it comes to model openness, the Teuken models are not on par with ideal openness standards.
Impact on companies. The Teuken models claim to offer better control over the technology used while allowing for optimizing models to specific use cases. It may indeed be true that the model can serve as a basis for corporate applications in proprietary settings. However, given the higher performance of other open-weights LLMs, also in multi-lingual scenarios, we do not see a strong use case for Teuken over other open-weight models.
Impact on the EU LLM ecosystem. Within the European LLM ecosystem, OpenGPT-X has attracted quite some attention, but similar initiatives exist. Salamandra by the Barcelona Supercomputing Center provides similar capabilities and is trained on data from 35 European languages. EuroLLM, developed by the large-scale UTTER project, is another multilingual LLM with aims similar to Teuken's. Occiglot and Pharia are two further large language models with similar aims. The niche that Teuken fills in relation to these models is its greater focus on German as a primary language and its design for specialist applications. Though Teuken has a place within this broader ecosystem, it remains to be seen to what degree the model can strengthen European innovation and competitiveness, as is its aim. A promising step in this direction is that OpenGPT-X has also published its own OpenGPT-X European LLM leaderboard, which has seen some community uptake.
OpenGPT-X has been marketed as a European model that rivals DeepSeek. Based on our investigation, we conclude that Teuken is indeed comparable to DeepSeek in at least one regard: in terms of transparency.