Open source has always been an integral part of artificial intelligence. By definition, the term open source refers to software for which the original source code is made publicly available. Thus, anyone can become a contributor, redistributor, or user of this software within the terms of its open-source license.
The values of open source are the same values to which the machine learning (ML) community has always aspired: collaboration, peer review, transparency, reliability, flexibility, and accessibility. Originating from academia, this mindset of sharing and transparency now also permeates industry, with leading companies such as Google and Microsoft being among the most well-known contributors of open-source machine learning models.
Due to the complexity and fast pace of the ML world, the most widely adopted models, pipelines, frameworks, and infrastructures have generally been open source. Following this trend, the number of publicly available open-source machine learning models keeps increasing.
This article defines open-source AI models, reviews the most popular releases, discusses pros and cons of the current open-source landscape, and concludes with an overview of how to productionize these models.
Open-source models are machine learning models, typically distributed as binaries, that have been pre-trained on large datasets to achieve state-of-the-art performance on a given task. These model binaries are released for everyone to use, either for direct inference or for transfer learning, as we’ll explore in the last section of this article.
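As an illustration, the snippet below loads a pre-trained model for direct inference using the Hugging Face transformers library. This is only one of many hubs and APIs, and the default sentiment-analysis model it pulls down is an arbitrary choice for the example:

```python
# A minimal inference sketch using the Hugging Face `transformers`
# library (one provider among many; `pip install transformers` assumed).
from transformers import pipeline

# Downloads a default pre-trained sentiment-analysis model on first use.
classifier = pipeline("sentiment-analysis")

# Run inference on new text without any training of our own.
print(classifier("Open-source models make it easy to get started."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```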
Usually, these trained models are released together with the code that implements the underlying machine learning algorithm and, sometimes, with the training data as well. In such cases, full reproducibility is ensured, and users can review, modify, and contribute to the solution, as per the standard open-source definition.
Most open-source AI models are deep learning models. Neural networks benefit from huge datasets and sophisticated architectures that can grow to encompass a vast number of parameters, so training them requires extensive time and hardware. Even in the few cases where both code and data are available, fully replicating the training of these models is extremely resource-intensive and thus, for most individuals and organizations, infeasible.
One example of an open-source deep learning model, and one of the largest ever released, is a GPT-like model named YaLM 100B, which was trained on 1.7 TB of text for 65 days on a pool of 800 high-end NVIDIA A100 graphics cards.
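To put that in perspective, 800 GPUs × 65 days × 24 hours comes to roughly 1.25 million GPU-hours; at an assumed on-demand cloud rate of about $2 per A100-hour, a single training run would cost on the order of $2.5 million, before accounting for experimentation and failed runs.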
To find this and other open-source models, we can look at some of the most well-established providers:
These platforms predominantly redirect to GitHub, which is, ultimately, the largest open-source model repository.
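For instance, individual model files can be fetched programmatically from the Hugging Face Hub, one such provider; the repository and filename below are arbitrary illustrative choices:

```python
# A hedged sketch: fetching a file from a public model repository on
# the Hugging Face Hub (`pip install huggingface_hub` assumed).
from huggingface_hub import hf_hub_download

# Download (and locally cache) one file from the chosen repository;
# "bert-base-uncased" is just an illustrative public model.
config_path = hf_hub_download(repo_id="bert-base-uncased",
                              filename="config.json")
print(config_path)  # local path of the cached file
```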
Open-source models offer various benefits, which help boost AI adoption across a wider variety of applications:
Open-source AI also draws criticism, for reasons including:
The AI community seems to have reached a consensus that adopting open-source models is the new standard when delivering AI applications, especially for NLP and computer vision tasks. An often-considered alternative is AutoML.
To productionize open-source models, the first step is to download the model and its library into the development and production environments. From here, two paths are most commonly taken:

1. Use the pre-trained model as-is, deploying it directly for inference.
2. Fine-tune the pre-trained model on your own data via transfer learning before deployment, as sketched below.
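As a rough illustration of the second path, the sketch below adapts a pre-trained torchvision ResNet to a new task by freezing the backbone and replacing the output head; the model choice and NUM_CLASSES are placeholder assumptions, not a prescribed recipe:

```python
# A minimal transfer-learning sketch with torchvision
# (`pip install torch torchvision` assumed).
import torch.nn as nn
from torchvision import models

# Download the pre-trained weights (the "open-source model binary").
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone so its weights stay fixed.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for our own task;
# NUM_CLASSES is a placeholder for the target dataset.
NUM_CLASSES = 10
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# From here, train only `model.fc` on the custom dataset as usual.
```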
When deploying the original or fine-tuned pre-trained model to production, the same considerations apply as for a model trained from scratch, with one exception: path one does not require a continuous retraining pipeline.
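For example, the pre-trained (or fine-tuned) model can be wrapped in a lightweight web service. The Flask app below is a bare-bones sketch, with the endpoint name and port chosen arbitrarily; a real deployment would add monitoring, versioning, and scaling on top:

```python
# A bare-bones serving sketch (`pip install flask transformers` assumed);
# production setups would add monitoring, batching, and model versioning.
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
classifier = pipeline("sentiment-analysis")  # original or fine-tuned model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"text": "some input"}.
    text = request.json["text"]
    return jsonify(classifier(text))

if __name__ == "__main__":
    app.run(port=8080)  # illustrative port
```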
The philosophy of productionizing machine learning models and pipelines built upon open-source principles and software is often referred to as “open-source model infrastructure.” Iguazio supports this approach by offering MLRun, the first end-to-end open-source MLOps orchestration framework.