In this first of a series of blog posts on the benefits of open-source AI, we discuss key principles which allow open-source to enable greater oversight. We subsequently discuss the theoretical motivations behind the EU AI Act treating open-source models with lesser scrutiny, and lastly explore how releasing models in an open-source way makes it easier for watchdogs to investigate legal violations in a model's training.
Open-source AI requires open data. This means that any model seeking to describe itself as open must publish the data on which it is trained. This is crucial for oversight, as it allows for inspecting exactly on which material a model was trained and for ensuring that no copyright violations are associated with the training procedure.
Open-source AI requires open training code. By publishing code in a clearly-readable manner, it becomes possible to double-check that the model is neither trained with censorship in mind nor by stealing a proprietary model's code or data.
Open-source AI requires clear documentation. Lastly, open-source AI models must be published in such a way that both its data and training code are clearly documented. This prevents the mechanics behind a model's creation becoming difficult to interpret based on a lack of good documentation, and enables more straightforward oversight.
It is often thought that the EU AI act contains a specific exception for open-source models. In reality, this is not the case. Though the act does allow some lesser degree of scrutiny for models published under a free and open-source license by waiving the duty to maintain up-to-date technical information about the model, it still requires model creators to disclose model architecture, weights, and training procedures. One possible reason for treating open-source models in this manner is that models published under a free and open-license are by definition open to adoption by outside parties. These authors, through adapting the model, will likely be able to exercise their own kind of scrutiny. Another reason would be to incentivize model creators to publish models in the public sphere.
A famous example of open-source status enabling greater oversight in the AI field is that of LAION-5B. This dataset marketed itself as a large-scale fully-open image dataset for creating image diffusion models, enabling widespread adoption by model creators. The inherent openness of this dataset also enabled greater scrutiny. The open-source status of LAION enabled not only widespread adoption but also the auditing of the pretraining data of a large number of major image models.
By publishing models in a more open manner, model creators enable greater oversight. Openness allows for a closer investigation of a model's inner workings, catching legal issues with the way a model is constructed and allowing model creators to be held to higher standards of accountability. This allows for more ethical AI for all.