OpenGPT-X: Open Source AI 'Made in Germany' falls short of own claims
by Dick Blankvoort & Jenia Jitsev
The OpenGPT-X initiative is an organization seeking to create open source AI models 'Made in Germany'. In this article, we evaluate its flagship Teuken models through the lens of open source. In collaboration with Jenia Jitsev of LAION and the Jülich Supercomputing Centre, we explore some of the project's strengths and weaknesses. We find that the model falls short of its own claims of openness, and document early signs of adverse consequences for the European open source AI ecosystem.
Parameter descriptions:
Base Model Data
Are datasources for training the base model comprehensively documented and made available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
End User Model Data
Are datasources for training the model that the end user interacts with comprehensively documented and made available?
Base Model Weights
Are the weights of the base models made freely available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
End User Model Weights
Are the weights of the model that the end user interacts with made freely available?
Training Code
Is the source code of dataset processing, model training and tuning comprehensively made available?
Code Documentation
Is the source code of datasource processing, model training and tuning comprehensively documented?
Hardware Architecture
Is the hardware architecture used for datasource processing and model training comprehensively documented?
Preprint
Are archived preprint(s) available that detail all major parts of the system including datasource processing, model training and tuning steps?
Paper
Are peer-reviewed scientific publications available that detail all major parts of the system including datasource processing, model training and tuning steps?
Modelcard
Is a model card available in standardized format that provides comprehensive insight on model architecture, training, fine-tuning, and evaluation?
Datasheet
Is a datasheet as defined in "Datasheets for Datasets" (Gebru et al. 2021) available?
Package
Is a packaged release of the model available on a software repository (e.g. a Python Package Index, Homebrew)?
API and Meta Prompts
Is an API available that provides unrestricted access to the model (other than security and CDN restrictions)? If applicable, this entry also collects information on the use and availability of meta prompts.
Licenses
Is the project fully covered by Open Source Initiative (OSI)-approved licenses, including all data sources and training pipeline code?
The data used to train the base model of the Teuken series originates from two sources: the FineWeb-Edu dataset and a private dataset constructed by OpenGPT-X itself. The FineWeb-Edu dataset is widely available and used as pretraining data in over 200 models. The private dataset, on the other hand, has a number of qualities that make obtaining insight into it challenging.
The full set of training data was not made publicly available. This means that anyone seeking to inspect or reproduce the data behind the models needs to request data access from the model authors. Although this is fairly standard practice, it does pose challenges to transparency and makes it difficult to reconstruct what data went into the model.
Reconstruction is further hampered by the lack of a clear, open source codebase documenting the processes of data collection, curation, filtering and processing. While the model does come with a preprint that documents parts of the training procedure, not enough detail is provided to make reproduction feasible. Here we list a selection of data sources that are difficult or impossible to access.
We ran into various issues attempting to reconstruct how the model was trained, including 404 pages, pages inaccessible without a specific university account, and links to sites without directions on versioning and access methods. Furthermore, as mentioned in the model preprint, the data mixture was adapted before being used for pretraining: English data was strategically downsampled and data in other languages upsampled. However, no information was made available on how this processing step was carried out. Without an exhaustive list of data sources and clarity on data processing procedures, we conclude that reproducing the training procedure is not feasible.
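Re-balancing a multilingual mixture typically comes down to computing per-language sampling weights. As a purely hypothetical illustration of the kind of step left undocumented here, the sketch below uses temperature-based sampling, a common technique in multilingual pretraining; the `rebalance` function and the toy counts are our own assumptions, not OpenGPT-X's actual procedure.

```python
# Hypothetical illustration of multilingual mixture re-balancing via
# temperature-based sampling (a common technique) -- NOT the undocumented
# procedure OpenGPT-X actually used.

def rebalance(token_counts, temperature=0.7):
    """Compute per-language sampling probabilities.

    A temperature below 1 flattens the distribution: high-resource
    languages (e.g. English) are downsampled, low-resource ones upsampled.
    """
    total = sum(token_counts.values())
    raw = {lang: (n / total) ** temperature for lang, n in token_counts.items()}
    norm = sum(raw.values())
    return {lang: p / norm for lang, p in raw.items()}

# Toy mixture: English dominates the raw token counts (80%).
counts = {"en": 800, "de": 120, "fr": 50, "bg": 30}
probs = rebalance(counts)

assert probs["en"] < 0.8   # English share drops below its raw proportion
assert probs["bg"] > 0.03  # low-resource languages gain sampling mass
```

With temperature 0.7 the English share falls from 80% to roughly 66% in this toy example, while the smallest language more than doubles its share.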
Next, we examined the preprocessing procedure: all the ways in which the training data was transformed before model training began. A 2025 preprint on data processing for the OpenGPT-X family claims that "each step of our pipeline saves intermediate results", but these checkpoints are not made available. Making such intermediate steps publicly available can greatly contribute to the openness of a model, as AllenAI's OLMo models and the Dolma dataset demonstrate.
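To illustrate what "saving intermediate results" can enable, here is a minimal sketch of a pipeline that persists each stage's output so third parties could audit it. The stage names and filter rules are hypothetical and not taken from OpenGPT-X's pipeline.

```python
import json
import pathlib
import tempfile

# Minimal sketch of a data pipeline that persists each stage's output as a
# checkpoint, so intermediate results could in principle be published and
# audited. Stage names and filters are hypothetical.

def run_pipeline(docs, outdir):
    stages = [
        ("dedup", lambda d: list(dict.fromkeys(d))),           # drop exact duplicates
        ("min_length", lambda d: [x for x in d if len(x) >= 10]),  # drop short docs
    ]
    outdir = pathlib.Path(outdir)
    for name, fn in stages:
        docs = fn(docs)
        # Persist this stage's intermediate result as a checkpoint file.
        (outdir / f"{name}.json").write_text(json.dumps(docs))
    return docs

with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(
        ["short", "a longer document here", "a longer document here"], tmp
    )

assert result == ["a longer document here"]
```

Publishing the per-stage checkpoint files (here `dedup.json` and `min_length.json`) is what distinguishes an auditable pipeline from a prose description of one.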
The instruction-tuning data mixture of Teuken is laid out quite clearly. As far as we can tell, all data sources are properly linked and open source datasets are used. Access to the instruction-tuning data could be further improved by publishing the final dataset, as for instance in the case of Cohere Labs' Aya models.
One of the claimed innovations of the Teuken model is its use of a balanced multilingual tokenizer which, the preprint claims, was crucial for bringing down training costs. In this light, it seems a missed opportunity that relatively little is known about this tokenizer, which is only available bundled with the instruction-tuned model. The paper from which it derives provides some implementation details, but no guidance regarding the underlying data mixture. The research community could benefit from further information about this piece of technology.
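One way the community commonly quantifies tokenizer balance is "fertility": the average number of tokens produced per word, where lower and more uniform values across languages translate into cheaper multilingual training. The sketch below computes fertility with a toy greedy tokenizer of our own invention; Teuken's actual tokenizer data mixture and metrics remain undisclosed.

```python
# Sketch of the "fertility" metric often used to compare multilingual
# tokenizers: average tokens produced per whitespace-separated word.
# The toy tokenizer and vocabulary here are stand-ins for illustration.

def toy_tokenize(word, vocab):
    """Greedy longest-match tokenization against a small vocabulary,
    falling back to single characters when nothing matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

def fertility(text, vocab):
    words = text.split()
    return sum(len(toy_tokenize(w, vocab)) for w in words) / len(words)

# A vocabulary biased toward English fragments tokenizes German worse:
vocab = {"the", "quick", "ing", "tion"}
en = fertility("the quick brown fox", vocab)
de = fertility("der schnelle braune Fuchs", vocab)
assert en < de  # English gets fewer tokens per word under this vocab
```

A "balanced" multilingual tokenizer, as Teuken claims to have, would keep fertility roughly even across languages; without the tokenizer's training mixture, that claim cannot be independently checked.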
All in all, the data processing pipeline underlying the Teuken models is not as open as it might appear at first glance. While data sources are partly documented, claims about openness are hard to verify and full reproducibility is out of reach: there is no publicly downloadable training dataset, no comprehensive documentation of data sources, and little detail about the multilingual tokenizer and its implementation.
Teuken claims to be "open source", a core element of which is that the weights of the model are published (open-weights). In this section, we take a closer look at weight availability of the Teuken base and end models.
The base model weights of (pretrained) Teuken are available only upon request from the model authors. We have approached the model creators with such a request, but have not yet received a response. For the end model weights, two versions are shared via OpenGPT-X's HuggingFace page: a research model and a commercial model (the preprint makes no mention of such a distinction). The model card and the OpenGPT-X website claim that the research model included training data with a non-permissive license, while for the commercial model this data was excluded. Based on this, we assume that the commercial model excludes the part of the private data mixture that was licensed non-permissively. If true, this further highlights the need for training data transparency, which would allow independent parties to verify such claims.
Like many "open" models, Teuken is perhaps better characterised as "open weights" than as "open source". But whereas many model providers share both base models and final models (>50 in our index), for Teuken only the final model weights are openly shared, with the base models available only upon request and fully at the discretion of the model provider.
A hallmark of "open source" technology is the open availability of source code. While many open models provide most or all of their source code (good examples are OLMo by AllenAI, SmolLM2 by HuggingFace, and DCLM by the DataComp initiative and their collaborators), this is different for the Teuken models. We have not been able to locate the source repositories used for training or fine-tuning, nor documentation on how to reproduce model training (apart from imprecise prose descriptions in preprints).
While OpenGPT-X does share a number of code repositories on GitHub, most of them represent forks of software from open source initiatives, for instance EleutherAI's LM evaluation harness and HuggingFace's OLM project. This makes clear that the OpenGPT-X initiative leans heavily on open source technologies, but has not yet reciprocated by releasing source code back to the open source community.
OpenGPT-X claims its model data are processed and stored in line with European standards for data storage and processing. For a European AI initiative, following these data transparency principles is essential. However, given the previously documented lack of data transparency, this claim is difficult to verify. A particularly thorny issue here is the inheritance of usage restrictions from the licenses of datasets used in model training.
Examining only the subset of used datasets that we could verify (see table), we found that this portion of the datasets used for base model training is covered by a variety of non-permissive licenses. Ultimately it remains unclear whether the two published models were trained on data covered by these licenses. Full transparency would alleviate doubts about whether the model adheres to the claimed data standards and was trained exclusively on appropriately licensed data. Notably, the choice to go ahead and publish the model weights offers an easy way out: a working model can be delivered without clarifying the underlying training data licensing situation. This is a widely used practice in the field, but one with a clear adverse impact on the open source AI community.
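Such verification would essentially amount to checking each training dataset's license against an allowlist of permissive licenses. Below is a minimal sketch of such an audit over a hypothetical manifest; the dataset names and license tags are invented for illustration, since the full list of Teuken's training datasets is unpublished.

```python
# Sketch of a license audit over a training-data manifest. Dataset names
# and license tags are hypothetical; a real audit would need the full,
# currently unpublished, list of training datasets.

PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause", "cc-by-4.0", "cc0-1.0"}

manifest = [
    {"dataset": "web-crawl-subset", "license": "odc-by"},
    {"dataset": "wiki-dump", "license": "cc-by-sa-3.0"},  # share-alike restriction
    {"dataset": "code-corpus", "license": "mit"},
    {"dataset": "news-archive", "license": "all-rights-reserved"},
]

def audit(entries, allowlist):
    """Return the datasets whose license is not on the permissive allowlist."""
    return [e["dataset"] for e in entries if e["license"] not in allowlist]

flagged = audit(manifest, PERMISSIVE)
assert "code-corpus" not in flagged
assert "news-archive" in flagged
```

With a published manifest, anyone could run such a check; without one, claims of compliant licensing rest entirely on trust in the model provider.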
Impact on the research community. The documented lack of data transparency and model openness has consequences for any party using or building on these models. Other models building on Teuken would, like Mistral- and Llama-based models, inherit its shortcomings. Truly open models can only be built on truly open foundations. Large, publicly funded model publishers that actively invite others to adopt and build on their technology have a responsibility to provide (base) models that meet the highest possible transparency and openness standards. In the event of widespread adoption in research and development, the lack of openness of OpenGPT-X would negatively affect the openness standards of any study it is part of or any project that builds on it.
Impact on public perception. OpenGPT-X's own press releases and media coverage paint a rosy picture. Teuken is described as a fully open source model that handles data in line with European regulations. Given the limited reproducibility and availability, however, these claims must be taken with a grain of salt. Though some observations regarding model openness are promising, we conclude that the two published models and related technologies (e.g. the published tokenizer) do not warrant the label of fully open source, and that, when it comes to model openness, the Teuken models are not on par with ideal openness standards.
Impact on companies. The Teuken models claim to offer better control over the technology used while allowing for optimizing models to specific use cases. It may indeed be true that the model can serve as a basis for corporate applications in proprietary settings. However, given the higher performance of other open-weights LLMs, also in multi-lingual scenarios, we do not see a strong use case for Teuken over other open-weight models.
Impact on the EU LLM ecosystem. Within the European LLM ecosystem, OpenGPT-X has attracted quite some attention, but similar initiatives exist. Salamandra by the Barcelona Supercomputing Center provides similar capabilities and is trained on data from 35 European languages. EuroLLM, developed by the large-scale UTTER project, is another multilingual LLM with aims similar to Teuken's. Occiglot and Pharia are two further large language models with similar aims. The niche that Teuken fills in relation to these models is its greater focus on German as a primary language and its design for specialist applications. Though Teuken has a place within this broader ecosystem, it remains to be seen to what degree the model can strengthen European innovation and competitiveness, as is its aim. A promising step in this direction is that OpenGPT-X has also published its own OpenGPT-X European LLM leaderboard, which has seen some community uptake.
OpenGPT-X has been marketed as a European model that rivals DeepSeek. Based on our investigation, we conclude that Teuken is indeed comparable to DeepSeek in at least one regard: in terms of transparency.