European Open Source AI Index
BERT: The original sin of open-washing large language models
by Andreas Liesenfeld
01 August 2025

In late 2018, before large language models (LLMs) became mainstream, Google released BERT, a model bound to shake up the natural language processing (NLP) research community. It dramatically improved the state of the art on many benchmarks in the field, trailblazing the coming explosion of interest in LLMs. An awe-struck research community celebrated what the model could do and marvelled at the new "transformer" machine learning architecture it was built on. BERT even spawned a new scientific field dedicated to its study: BERTology. Amid the buzz, few asked what data it was trained on or whether it was really "open source", as Google claimed.

Google's BERT continues to shape NLP research to this day. The paper that introduced it became the most cited scientific publication in the field, with over 125,000 citations. A host of popular spin-off models were built on top of the original, bringing BERT-based technology to domains as diverse as biomedicine (BioBERT), science (SciBERT), finance (FinBERT), and medicine (Med-BERT). In short, with the help of BERT, Google shaped the course of an entire scientific discipline for years to come. Here's why this still matters today.

We still don't know exactly how the "open source" BERT was built. Dazzled by its technological marvels, even the scientists who founded BERTology, the field that examines what BERT can do, did not ask how it was built; their work does not mention terms such as "training" or "data". The authors of BERT themselves stated only vaguely in their paper that the model was trained on English Wikipedia and BookCorpus, a now-defunct collection of self-published English novels. Even before the release of BERT, these datasets were in use by Google's "Brain" division for research purposes. But using BookCorpus for AI training violated the publisher's Terms of Service, and records of correspondence between journalists and Google show that critical questions about the legitimacy of using this data were brought to Google's attention as early as 2016. Since BERT's release in 2018, neither Google's press release nor any other official source has clarified what data went into the model.

The field of open source AI still suffers from a lack of transparency when it comes to sharing training data, and since 2018 many other genAI heavyweights have followed Google's lead in blurring the line between open-weight and open-source. As the first LLM to become highly influential in science despite deliberate nondisclosure of its training data, BERT was the original sin of open-washing, leading an entire scientific community down a dubious path. And the dazzled NLP research community failed to call out the adverse effects of this practice on science. BERT should be remembered both as a milestone in the history of NLP and as the first testament to the adverse effects of overclaiming transparency in generative AI. Open-washing is as old as LLMs themselves.

Supported by the Centre for Language Studies and the Dutch Research Council. Website design & development © 2024 by BSTN. This version of the index generated 09 April 2026, website content last updated 11 March 2026.