Pruna AI, a European start-up that focuses on compression algorithms for AI models, has made its optimization framework open source.
Pruna AI has built a framework that applies various efficiency methods to AI models, such as caching, pruning, quantization and distillation. It also standardizes how compressed models are saved and loaded, how these compression methods are combined, and how models are evaluated after compression. John Rachwan, co-founder and CTO of Pruna AI, told TechCrunch that the framework helps developers streamline these processes.
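To make one of those methods concrete, here is a minimal sketch of symmetric 8-bit weight quantization, the idea of storing model weights as small integers plus a shared scale factor. The function names and numbers are illustrative only and are not Pruna's actual API.

```python
def quantize(weights, bits=8):
    """Map float weights to signed integers sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax  # one scale per tensor
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.51, -1.27, 0.03, 0.89]
q, scale = quantize(weights)
restored = dequantize(q, scale)
# Each restored weight is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

Storing 8-bit integers instead of 16- or 32-bit floats shrinks the model two- to four-fold, at the cost of the small rounding error the assertion above bounds; this accuracy trade-off is exactly what an evaluation step after compression has to measure.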
The framework is also able to assess whether there is a significant loss of quality after compression, and it shows the performance improvements achieved as a result.
Larger AI laboratories are already using various compression methods. For example, OpenAI uses distillation to develop faster versions of its leading models. It is likely that GPT-4 Turbo was created in this way as a faster version of GPT-4. Similarly, the Flux.1 fast image generation model is a distilled version of the Flux.1 model from Black Forest Labs.
Teacher-student model
Distillation is a technique in which knowledge is transferred from a large AI model to a smaller one via a teacher-student setup. Developers send requests to the teacher model and record its responses, sometimes comparing them against a dataset to measure accuracy. The student model is then trained to approximate the teacher's behavior.
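The teacher-student step above can be sketched in a few lines: the student is trained to match the teacher's temperature-softened output distribution. This is a standard illustration of distillation, not Pruna's or OpenAI's specific recipe, and the numbers are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature = softer targets."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """How far the student's distribution q is from the teacher's p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [4.0, 1.0, 0.2]   # recorded teacher responses
soft_targets = softmax(teacher_logits, temperature=2.0)

student_logits = [2.0, 1.5, 1.0]   # the smaller model's current output
loss = kl_divergence(soft_targets, softmax(student_logits, temperature=2.0))
# Training would adjust the student's weights to drive this loss toward zero,
# so its outputs converge on the teacher's.
```

The temperature softening matters: it exposes how the teacher ranks the wrong answers too, which carries more signal than the single correct label.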
Rachwan noted that large companies usually develop these kinds of solutions internally. What you find in the open-source world, he said, is usually built around a single method, such as one quantization method for LLMs or one caching method for diffusion models. A tool that brings all these methods together, makes them easy to use and combines them is hard to find, he added, and that is exactly the added value Pruna now offers.
Image and video generation in particular
Although Pruna AI supports every type of model, from large language models to diffusion, speech-to-text and computer vision models, the company is currently focusing specifically on image and video generation. Pruna AI’s existing customers include Scenario and PhotoRoom. In addition to the open-source version, the company offers an enterprise edition with advanced optimization features, including an optimization agent.
One feature that Pruna will be launching soon is the compression agent, Rachwan says. This agent allows users to specify their model and a specific performance requirement, such as increasing speed without decreasing accuracy by more than 2%. The agent then does the work by finding the best combination of compression methods. This saves developers the trouble of manual optimization.
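As a hypothetical sketch of what such an agent could do under the hood, the search can be framed as trying combinations of compression methods and keeping the fastest one whose accuracy drop stays within the user's budget. The method names, speedup factors and accuracy costs below are invented for illustration; a real agent would measure these rather than look them up.

```python
from itertools import combinations

# Per method: (speedup factor, accuracy drop in percentage points).
# Assumed here to compose multiplicatively / additively for simplicity.
METHODS = {
    "quantization": (1.8, 1.2),
    "pruning":      (1.4, 0.9),
    "caching":      (1.6, 0.0),
}

def best_combination(max_accuracy_drop=2.0):
    """Return the fastest method combination within the accuracy budget."""
    best, best_speedup = (), 1.0
    for r in range(1, len(METHODS) + 1):
        for combo in combinations(METHODS, r):
            speedup, drop = 1.0, 0.0
            for method in combo:
                s, d = METHODS[method]
                speedup *= s
                drop += d
            if drop <= max_accuracy_drop and speedup > best_speedup:
                best, best_speedup = combo, speedup
    return best, best_speedup

combo, speedup = best_combination(max_accuracy_drop=2.0)
# With these toy numbers, quantization + caching wins: it yields the
# largest speedup while keeping the accuracy drop under 2 points.
```

With the toy numbers above, quantization plus pruning is rejected (a 2.1-point drop exceeds the 2% budget), while quantization plus caching passes and gives the biggest speedup, which is the kind of trade-off the agent automates.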
Pruna AI charges an hourly rate for the pro version, comparable to renting a GPU on a cloud service such as AWS, according to Rachwan. An optimized model can significantly reduce inference costs, especially when the model is a crucial part of the AI infrastructure. Rachwan shared an example in which Pruna AI used its compression framework to make a Llama model eight times smaller without much loss of quality. The company hopes customers will see the framework as an investment that pays for itself.
A few months ago, Pruna AI raised seed investment of $6.5 million. Investors in the start-up include EQT Ventures, Daphni, Motier Ventures and Kima Ventures.