BERT: The original sin of open-washing large language models
by Andreas Liesenfeld
01 August 2025
In late 2018, before large language models (LLMs) became mainstream, Google released a large language model bound to shake up the natural language processing (NLP) research community.
The model dramatically improved the state of the art on many benchmarks in the field, blazing the trail for the coming explosion of interest in LLMs.
An awe-struck research community celebrated what the model could do and marvelled at the new "transformer" machine learning architecture it was built on.
BERT even spawned a new scientific field dedicated to its study: BERTology.
Amidst the buzz, few asked what data it was trained on or whether it was really "open source" as Google claimed.
Google's BERT continues to shape research in NLP to this day.
The paper that introduced it became the most cited scientific publication in the field, with over 125,000 citations.
A host of popular spin-off models were built on top of the original, providing BERT-based technologies for domains as diverse as biology (BioBERT), science (SciBERT), finance (FinBERT), and medicine (Med-BERT).
In short, with the help of BERT, Google shaped the course of an entire scientific discipline for years to come.
Here's why this still matters today.
We still don't know exactly how "open source" BERT was built.
Dazzled by its technological marvels, even the scientists who founded BERTology, the field devoted to examining what BERT can do, did not ask how it was built.
Their work doesn't mention terms such as "training" or "data".
The authors of BERT themselves only vaguely stated in their paper that the model was trained on English Wikipedia and a now-defunct corpus of English novels known as the BookCorpus.
Even before the release of BERT, these datasets were in use by Google's "Brain" division for research purposes. But using the BookCorpus, a collection of self-published books, for AI training violated the publisher's terms of service, and records of correspondence between journalists and Google show that critical questions regarding the legitimacy of using this data were brought to Google's attention as early as 2016.
Since BERT was released in 2018, neither Google's press release nor any other official source has clarified what data was used in the BERT model.
The field of open source AI still suffers from a lack of transparency when it comes to sharing training data, and since 2018 many other genAI heavyweights have followed Google's lead in blurring the line between open weight and open source. As the first LLM to become highly influential in science despite deliberate nondisclosure of its training data, BERT was the original sin of open-washing, one that led an entire scientific community down a dubious path.
And the dazzled NLP research community failed to call out the adverse effects of this practice on science.
BERT should be remembered both as a milestone in the history of NLP and as the first testament to the adverse effects of overclaiming transparency in the field of generative AI.
Open-washing is as old as LLMs themselves.
Parameter descriptions:
Base Model Data
Are data sources for training the base model comprehensively documented and made available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end user model data entry.
End User Model Data
Are data sources for training the model that the end user interacts with comprehensively documented and made available?
Base Model Weights
Are the weights of the base model made freely available? In case a distinction between base (foundation) and end (user) model is not applicable, this mirrors the end user model weights entry.
End User Model Weights
Are the weights of the model that the end user interacts with made freely available?
Training Code
Is the source code for data source processing, model training, and tuning comprehensively made available?
Code Documentation
Is the source code for data source processing, model training, and tuning comprehensively documented?
Hardware Architecture
Is the hardware architecture used for data source processing and model training comprehensively documented?
Preprint
Are archived preprint(s) available that detail all major parts of the system, including data source processing, model training, and tuning steps?
Paper
Are peer-reviewed scientific publications available that detail all major parts of the system, including data source processing, model training, and tuning steps?
Modelcard
Is a model card available in a standardized format that provides comprehensive insight into model architecture, training, fine-tuning, and evaluation?
Datasheet
Is a datasheet as defined in "Datasheets for Datasets" (Gebru et al. 2021) available?
Package
Is a packaged release of the model available in a software repository (e.g. the Python Package Index, Homebrew)?
API and Meta Prompts
Is an API available that provides unrestricted access to the model (other than security and CDN restrictions)? If applicable, this entry also collects information on the use and availability of meta prompts.
Licenses
Is the project fully covered by Open Source Initiative (OSI)-approved licenses, including all data sources and training pipeline code?
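To make the structure of this parameter list concrete, here is a minimal sketch, in Python, of how an assessment along these fourteen parameters might be recorded. The three-level scale (open / partial / closed), the class and field names, and the placeholder ratings are all assumptions for illustration; this is not the index's actual code, and the example values are not its published ratings for any model.

# Minimal sketch (hypothetical; not the index's actual code) of how an
# openness assessment along the fourteen parameters above could be recorded.
from dataclasses import dataclass
from enum import Enum


class Openness(Enum):
    # Three-level scale assumed here purely for illustration.
    OPEN = "open"        # comprehensively documented / freely available
    PARTIAL = "partial"  # partially documented or available with restrictions
    CLOSED = "closed"    # not documented or not available


@dataclass
class OpennessAssessment:
    # One record per system, one field per parameter described above.
    system_name: str
    base_model_data: Openness
    end_user_model_data: Openness
    base_model_weights: Openness
    end_user_model_weights: Openness
    training_code: Openness
    code_documentation: Openness
    hardware_architecture: Openness
    preprint: Openness
    paper: Openness
    modelcard: Openness
    datasheet: Openness
    package: Openness
    api_and_meta_prompts: Openness
    licenses: Openness

    def summary(self) -> str:
        # Count how many of the fourteen parameters are rated fully open.
        ratings = [v for v in vars(self).values() if isinstance(v, Openness)]
        fully_open = sum(1 for r in ratings if r is Openness.OPEN)
        return f"{self.system_name}: {fully_open}/{len(ratings)} parameters fully open"


# Hypothetical usage with placeholder ratings (not actual index data):
example = OpennessAssessment(
    system_name="some-model",
    base_model_data=Openness.PARTIAL,
    end_user_model_data=Openness.PARTIAL,
    base_model_weights=Openness.OPEN,
    end_user_model_weights=Openness.OPEN,
    training_code=Openness.PARTIAL,
    code_documentation=Openness.PARTIAL,
    hardware_architecture=Openness.CLOSED,
    preprint=Openness.OPEN,
    paper=Openness.OPEN,
    modelcard=Openness.PARTIAL,
    datasheet=Openness.CLOSED,
    package=Openness.OPEN,
    api_and_meta_prompts=Openness.PARTIAL,
    licenses=Openness.PARTIAL,
)
print(example.summary())  # -> "some-model: 5/14 parameters fully open"

Scoring each parameter explicitly, rather than attaching a single "open source" yes/no label to a system, is the point of such a schema: it makes gaps like undisclosed training data visible instead of letting them hide behind a blanket claim of openness.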