
BERT: The original sin of open-washing large language models

by Andreas Liesenfeld
01 August 2025

In late 2018, before large language models (LLMs) became mainstream, Google released a large language model bound to shake up the natural language processing (NLP) research community. The model dramatically improved the state of the art on many benchmarks in the field, trailblazing the coming explosion of interest in LLMs. An awe-struck research community celebrated what the model could do and marvelled at the new "transformer" machine learning architecture it was built on. BERT even spawned a new scientific field dedicated to its study: BERTology. Amid the buzz, few asked what data it was trained on or whether it was really "open source", as Google claimed.

Google's BERT continues to shape research in NLP to this day. The paper that introduced it became the most cited scientific publication in the field, with over 125,000 citations. A host of popular spin-off models were built on top of the original, providing BERT-based technologies for domains as diverse as biology (BioBERT), science (SciBERT), finance (FinBERT), and medicine (Med-BERT). In short, with the help of BERT, Google shaped the course of an entire scientific discipline for years to come. Here's why this still matters today.

We still don't know exactly how "open source" BERT was built. Dazzled by its technological marvels, even the scientists who introduced the field of BERTology, which examines what BERT can do, did not ask how it was built. Their work doesn't mention terms such as "training" or "data". The authors of BERT themselves vaguely stated in their paper that the model was trained on English Wikipedia and a now defunct corpus of English novels called the book corpus. Even before the release of BERT, these datasets were in use by Google's "Brain" division for research purposes. But the use of the book corpus, a collection of self-published books, for AI training violated the publisher's Terms of Service, and records of correspondence between journalists and Google show that critical questions regarding the legitimacy of using this data were brought to Google's attention as early as 2016. Since BERT was released in 2018, neither Google's press release nor any other official source has clarified what data was used in the BERT model.

The field of open source AI still suffers from a lack of transparency when it comes to sharing training data, and since 2018 many other genAI heavyweights have followed Google's lead in blurring the line between open weight and open source. As the first LLM that was highly influential in science despite deliberate nondisclosure of its training data, BERT was the original sin of open-washing that led an entire scientific community down a dubious path. And the dazzled NLP research community failed to call out the adverse effects of this practice on science. BERT should be remembered both as a milestone in the history of NLP and as the first testament to the adverse effects of overclaiming transparency in the field of generative AI. Open-washing is as old as LLMs.

Parameter descriptions:

Base Model Data
Are data sources for training the base model comprehensively documented and made available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model data entries.
End User Model Data
Are data sources for training the model that the end user interacts with comprehensively documented and made available?
Base Model Weights
Are the weights of the base model made freely available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end model weights entries.
End User Model Weights
Are the weights of the model that the end user interacts with made freely available?
Training Code
Is the source code for data source processing, model training, and tuning comprehensively made available?
Code Documentation
Is the source code for data source processing, model training, and tuning comprehensively documented?
Hardware Architecture
Is the hardware architecture used for data source processing and model training comprehensively documented?
Preprint
Are archived preprint(s) available that detail all major parts of the system, including data source processing, model training, and tuning steps?
Paper
Are peer-reviewed scientific publications available that detail all major parts of the system, including data source processing, model training, and tuning steps?
Modelcard
Is a model card available in standardized format that provides comprehensive insight on model architecture, training, fine-tuning, and evaluation?
Datasheet
Is a datasheet as defined in "Datasheets for Datasets" (Gebru et al. 2021) available?
Package
Is a packaged release of the model available on a software repository (e.g. the Python Package Index, Homebrew)?
API and Meta Prompts
Is an API available that provides unrestricted access to the model (other than security and CDN restrictions)? If applicable, this entry also collects information on the use and availability of meta prompts.
Licenses
Is the project fully covered by Open Source Initiative (OSI)-approved licenses, including all data sources and training pipeline code?
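The parameters above amount to a per-system openness scorecard. As a minimal sketch of what such a record might look like in code (the field names, the three-level "open/partial/closed" scale, and the BERT ratings shown are illustrative assumptions, not the index's actual schema or verdicts):

```python
# Hypothetical sketch of one openness-index record. The rating scale and
# the example scores below are assumptions for illustration only.
from dataclasses import dataclass

OPENNESS_LEVELS = ("open", "partial", "closed")

@dataclass
class OpennessRecord:
    system: str        # display name, e.g. "BERT by Google AI"
    base_model: str    # underlying base model
    scores: dict       # parameter name -> one of OPENNESS_LEVELS

    def openness_ratio(self) -> float:
        """Fraction of assessed parameters rated fully open."""
        if not self.scores:
            return 0.0
        return sum(v == "open" for v in self.scores.values()) / len(self.scores)

# Illustrative record (not the index's actual ratings):
bert = OpennessRecord(
    system="BERT by Google AI",
    base_model="BERT",
    scores={
        "Base Model Data": "closed",   # training data never fully clarified
        "Base Model Weights": "open",  # checkpoints were released
        "Training Code": "open",
        "Preprint": "open",
        "Datasheet": "closed",
    },
)
print(round(bert.openness_ratio(), 2))
```

A record like this makes the article's point concrete: a system can score "open" on weights and code while remaining "closed" on the data dimensions that matter most for transparency.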
Last updated 12 Aug 2025
OLMo by Ai2
OLMo-2-0325-32B
YuLan by Gaoling School of Artificial Intelligence
YuLan-Mini
BLOOMZ by BigScience Workshop
BLOOM
Poro by Silo AI and TurkuNLP and High Performance Language Technologies (HPLT)
Poro-34B
Open Assistant by LAION
Pythia-12B
mT0 by BigScience Workshop
mT5-XXL
Whisper by OpenAI
Whisper-large-v3
Pythia by EleutherAI and Together Computer
Pythia-6.9B
Amber by LLM360
Amber
K2 by LLM360
K2
SmolLM by HuggingFace
SmolLM2-1.7B
OpenChat by OpenChat
Meta-Llama-3-8B
Arabic StableLM by StabilityAI
StableLM-2-1.6B
Instella by AMD
Instella-3B
Dolly by Databricks
Pythia-12B
Tülu by Ai2
Llama-3.1-405B
T5 by Google AI
T5
RedPajama by Together Computer
RedPajama-INCITE-7B-Base
Neo by Multimodal Art Projection
Neo-7B
BERT by Google AI
BERT
AquilaChat by Beijing Academy of Artificial Intelligence
Aquila2-70B-Expr
Eurus by OpenBMB
Mixtral-8x22B-v0.1
DeepSeek V3 by DeepSeek
DeepSeek-V3-Base
Yi by 01.AI
Yi-34B
Teuken by OpenGPT-X
Teuken-7B-base
Salamandra by Barcelona Supercomputing Center
Salamandra-7B
NeuralChat by Intel
Mistral-7B-v0.1
MPT by Databricks
MPT-30B
Lucie by OpenLLM-France
Lucie-7B
Guru by LLM360
Guru-32B
GPT-SW3 by AI Sweden
GPT-SW3-6.7B-V2
GPT-NeoXT by Together Computer
GPT-NeoX-20B
Fietje by Bram Vanroy
Phi-2
BTLM by Cerebras
BTLM-3B-8K-Base
Pharia by Aleph Alpha Research
Pharia-1-LLM-7B
minChatGPT by Ethan Yanjia Li
GPT2-Medium
Xwin-LM by Xwin-LM
Llama-2-13B
Vicuna by LMSYS
Vicuna-13B
Phi by Microsoft
Phi-4
OpenELM by Apple
OpenELM-3B
Occiglot by Occiglot
Occiglot-7B-EU5
Mistral by Mistral AI
Mistral-Large-2411
GLM by Zhipu AI
GLM-4-32B-0414
Falcon by Technology Innovation Institute
Falcon3-10B-Base
Minerva by Sapienza Natural Language Processing Group
Minerva-7B-base-v1.0
DeepSeek R1 by DeepSeek
DeepSeek-V3-Base
Zephyr by HuggingFace
Mixtral-8x22B-v0.1
WizardLM by Microsoft and Peking University
LLaMA-7B
SynLogic by Minimax AI
SynLogic-32B
InternLM by Shanghai AI Laboratory
InternLM3-8B
CT-LLM by Multimodal Art Projection
CT-LLM-Base
Mistral NeMo by Mistral AI and NVIDIA
Mistral-NeMo-12B-Base
Saul by Equall
Mixtral-8x22B-v0.1
Qwen by Alibaba
Qwen3-235B-A22B-Base
Kimi K2 by Moonshot AI
Kimi K2 Base
Granite by IBM
Granite-3.3-8B-Base
MiMo by Xiaomi
MiMo-7B-Base
Airoboros by Jon Durbin
Qwen1.5-110B
Starling by NexusFlow
Llama-2-13B
Gemma by Google AI
Gemma-3-27B-PT
Geitje by Bram Vanroy
Mistral-7B-v0.1
BELLE by KE Technologies
Llama-2-13B
Llama 4 by Meta
Llama-4-Maverick-17B-128E
dots.llm1 by RedNote
dots.llm1.base
Marco by Alibaba
Marco-LLM-GLO
Viking by Silo AI and TurkuNLP and High Performance Language Technologies (HPLT)
Viking-33B
Llama 3.1 by Meta
Llama-3.1-405B
OpenMoE by Zheng Zian
OpenMoE-8B
LongAlign by Zhipu AI
Llama-2-13B
UltraLM by OpenBMB
Llama-13B
Command-R by Cohere AI
C4AI-Command-R-V01
Stanford Alpaca by Stanford University CRFM
Llama-7B
StripedHyena by Together Computer
StripedHyena-Hessian-7B
Claire by OpenLLM-France
Falcon-7B
Llama 3.3 by Meta
Llama-3.3-70B
Stable Beluga by StabilityAI
Llama-2-70B
Solar by Upstage AI
Mistral-7B-v0.1
RWKV by BlinkDL/RWKV
RWKV-x070-Pile-1.47B-ctx4096
Persimmon by Adept AI Labs
Persimmon-8B-Base
OPT by Meta
OPT-30B
Nanbeige by Nanbeige LLM lab
Nanbeige2-16B
Jais by G42
Llama-2-70B
Infinity-Instruct by Beijing Academy of Artificial Intelligence
Llama-3.1-70B
H2O-Danube by H2O.ai
H2O-Danube3.1-4B-Chat
FastChat-T5 by LMSYS
Flan-T5-XL
Crystal by LLM360
Crystal
BitNet by Microsoft
BitNet-b1.58-2B4T
Baichuan by Baichuan Intelligent Technology
Baichuan2-13B-Base
StableVicuna by CarperAI
LLaMA-13B
Llama 3 by Meta
Meta-Llama-3-70B
Llama 2 by Meta
Llama-2-70B
Koala by BAIR
Llama-13B
XGen by Salesforce
XGen-Small-9B-Base-R
Hunyuan by Tencent
Hunyuan-A52B-Pretrain
Snowflake Arctic by Snowflake
Snowflake-Arctic-Base
Llama-Sherkala by G42
Llama-3.1-8B
DeepHermes by Nous Research
Llama-3.1-8B
Minimax-Text by Minimax AI
MiniMax-Text-01
Gemma Japanese by Google AI
Gemma-2-2B

Supported by the Centre for Language Studies and the Dutch Research Council. Website design & development © 2024 by BSTN. This version of the index generated 12 Aug 2025, website content last updated 12 Aug 2025.