Open-source definition of AI is here, but data remains point of discussion

Update October 28, 2024 (Erik van Klinken): The OSI has unveiled the 1.0 version of its open-source AI definition. There is no material difference to its contents when compared to Release Candidate 1, detailed below.

Original article posted on October 10, 2024:

The Open Source Initiative (OSI) recently presented the long-awaited first version of its open-source AI definition. Although the definition is quite straightforward, there is criticism about how the standards organisation defines training data as open-source.

OSI, the standards organisation behind the open-source definitions, has worked for the past two years to develop a standard for open-source AI. This definition, of which the so-called Release Candidate 1 (RC1) has finally been released, should determine whether a particular AI model or LLM can truly be considered open-source.

Requirements for ‘true’ open-source AI

In the now-released RC1 version, OSI states that an open-source AI model or LLM must comprise three components: the software used to compile the dataset and run training, the model parameters along with the code to run inference, and any data that can be made legally available.

In addition, four freedoms determine whether the LLM in question is actually open-source: the ability to use it for any purpose without prior permission, to study how it works, to adapt it for any application, and to share it, either with or without modifications.

The now-released definition has other stringent requirements in addition to these components. For example, the complete source code for training and running the LLMs must be available under OSI-approved licenses. Model parameters and weights must also be shared under open terms.

With this, the open-source standards organisation wants to prevent companies from ‘openwashing’ their LLMs: calling them open-source without complying with actual open-source standards. According to OSI, something can only be called open-source if it meets an established definition.

Definition of data criticized

The arrival of this open-source definition for AI, and LLMs in particular, has been warmly welcomed. Yet there is also criticism, especially regarding the training data: one of the three major components, which under this definition does not itself have to be truly open-source, writes ZDNET.

Critics feel that for an LLM to be truly open-source, the underlying data must be as well. In their view, if this data is not fully public, it undermines the full reproducibility, transparency, and security of LLMs.

The open-source standards organisation acknowledges that full data disclosure is a problem: in many cases, it is simply not possible to share complete datasets publicly. The organisation therefore distinguishes several categories of data: open data, public data, and non-shareable data. Different legal rules apply to each category, which means that data can only be shared in the form those regulations permit, OSI comments.

Instead of requiring a full (open-source) dataset, the now-released open-source AI definition therefore requires ‘sufficiently detailed information’ about the training data used to train the LLM.

Compromise

This is a compromise. Laws and regulations, such as privacy rules or intellectual property rights, often restrict data sharing. Therefore, the definition provided now tries to balance transparency with practical and legal considerations.

OSI also argues that if the data purists had their way, open-source AI would be limited to a niche, with only LLMs trained solely on open data qualifying.

According to OSI, data purists are not the only ones taking issue with the new Open-Source AI Definition. AI companies are also struggling with it, because they consider their training regimes, how they train their LLMs, and how they compile and filter their datasets to be trade secrets. They do not want to make these public, as the definition now requires.

Also read: ‘Regulations are no fun, but uncertainty is even less so’