Small Language Models (SLMs) can be seen as following the same evolutionary pattern as microservices. In the early days of building services, application development teams would package all their logic into a single service, a monolith, and deploy it to a cluster. As cluster management technologies improved, those teams worked out that a better way to utilise cluster resources was to break the larger service up into smaller services (microservices) that could be configured and scaled separately. Now that we realise Large Language Models (LLMs) are not the AI panacea they were initially thought to be, is it time to go big on going small?
Keith Pijanowski certainly thinks so. As an AI solutions engineer at MinIO, a company specialising in object storage for enterprise AI, he has seen enough of the AI hype cycle play out to be able to lay down some sagacious but still mostly sanguine advice for software engineers now looking to embrace language models of all sizes.
“Harking back to our microservices example, let’s remember that infrequently used services could be given a fraction of a CPU, minimal memory and scaled to two instances,” said Pijanowski. “Those services used for critical capabilities could be given more resources and scaled to hundreds of instances if needed. This was not only a more efficient use of resources – it also made the application easier to update and maintain. Fast forward to the generative AI boom… and the same pattern is playing out – only the names have changed.”
When generative AI took off, everyone rushed to deploy open source frontier models, or LLMs, which can be thought of as the monoliths of this space. Many of these LLMs cannot fit into the memory of a single GPU, making them even more costly to deploy in a fault-tolerant manner.
For example, notes Pijanowski, if a team needs 10 instances of an LLM to handle the needs of its users and the LLM requires two GPUs for a single instance, then it needs 20 GPUs in total. GPUs are expensive, and this is a costly deployment. He suggests that SLMs solve this problem in the same way that microservices made clusters more efficient. Through various techniques (quantisation, distillation and so on), LLMs can be compressed. From there, they can be fine-tuned to do one thing well. When deployed, they can be scaled according to usage.
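To make the compression point concrete, here is a minimal sketch of loading a model with 4-bit quantised weights so that it fits on a single GPU. It assumes the Hugging Face transformers and bitsandbytes libraries are installed; the model name is a placeholder, not a recommendation.

```python
# Hypothetical sketch: load a causal LM with 4-bit quantised weights so it fits on one GPU.
# Requires: pip install transformers accelerate bitsandbytes (and a CUDA-capable GPU).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "some-org/some-7b-model"  # placeholder model identifier

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # roughly 4x smaller weights than fp16

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,  # compress weights to 4-bit at load time
    device_map="auto",                 # place the model on the available GPU
)
```

From there, the compressed model can be fine-tuned to do one thing well and scaled out per usage, just as a microservice would be.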
Future uses of LLMs & SLMs
“In my opinion, an SLM should always fit into a single GPU; otherwise, you need to employ training and inference techniques developed for very large LLMs. Now that we understand how history repeats itself, let’s look at the future of generative AI and how SLMs will be used,” insisted Pijanowski. “The future of generative AI is agentic AI, where LLMs know more than just the information locked up within an organisation’s custom corpus. Specifically, the LLMs that drive agentic AI will also learn about all of an organisation’s data and not just a custom corpus. This includes APIs, databases of all types, SLMs that have been trained for a specific purpose and other LLMs.”
An easy way to visualise agentic AI is to think of it as two layers working together to solve a more complex problem than could be solved with basic generative AI. The first layer is a control layer: an LLM that has been trained on all the data types previously listed. It can also plan the tasks it has been asked to do, and some implementations of this layer will allow the LLM to generate and execute code to get the data needed. (The scariness of this is a subject for another post.)
The bottom line for Pijanowski is that this layer will be a very large LLM that requires a lot of resources. The second layer is a data layer: the collection of APIs, databases, SLMs and possibly other LLMs (not trained to act as a control layer) that can bring back valuable information. When a request is sent to an agentic AI, the control layer decides which entities in the data layer need to be used to solve the problem. This style of AI demonstrates LLMs working alongside SLMs. It is not a single-industry phenomenon: all industries will benefit from this new AI.
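To make the two-layer idea concrete, here is a toy sketch of a control layer routing requests to data-layer entities. Everything in it is illustrative: the entity names are invented, and the keyword matcher stands in for what would, in practice, be a large LLM doing the planning and synthesis.

```python
# Toy sketch of the agentic two-layer pattern: a control layer picks data-layer entities.
# All names are hypothetical; a real control layer would be a large LLM, not keyword matching.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataLayerEntity:
    name: str
    keywords: List[str]            # what the control layer knows this entity covers
    handler: Callable[[str], str]  # an API call, database query or purpose-trained SLM


# Data layer: stubbed-out APIs, databases and SLMs that can bring back information.
DATA_LAYER = [
    DataLayerEntity("sales_db", ["revenue", "orders"], lambda q: f"sales_db result for: {q}"),
    DataLayerEntity("policy_slm", ["holiday", "policy"], lambda q: f"policy_slm answer for: {q}"),
]


def control_layer(request: str) -> str:
    """Decide which data-layer entities to consult, call them, and combine the results."""
    chosen = [e for e in DATA_LAYER if any(k in request.lower() for k in e.keywords)]
    results = [e.handler(request) for e in chosen]
    return "\n".join(results) if results else "No data-layer entity matched the request."


print(control_layer("What were last quarter's orders and revenue?"))
```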
What size is small?
“Whenever you classify something, you should always clearly define the dividing line. So what is small and what is large when it comes to language models? A conventional delineation is that SLMs are a few hundred MBs to a few GBs in size. LLMs, on the other hand, would be tens to thousands of GBs. I like this dividing line; it is empirical and works with the loose definition I previously mentioned: an SLM fits into a small GPU, and an LLM requires a high-end GPU and may still need to be split,” said Pijanowski. “However, for fun, maybe we can come up with an engineer-friendly definition that takes into account the day-to-day life of an AI/ML engineer who has to train and fine-tune these SLMs.”
He says that such a definition would state that the SLM has to fit on an engineering workstation. Capitalisation-focused technology company Nvidia recently announced DGX Spark, an engineering workstation better suited for AI/ML engineering than a gaming system. It comes with 128 GB of unified system memory.
Once this system is generally available, perhaps we can bump our definition of SLMs up to tens of GBs in size.
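As a back-of-the-envelope check on these dividing lines, weight-only memory footprints can be estimated as parameter count multiplied by bytes per parameter. The sketch below ignores activations, KV cache and framework overhead, and the 128 GB figure refers to a DGX Spark-class workstation as described above.

```python
# Rough weight-only footprint: parameters x bytes per parameter (runtime overheads ignored).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
WORKSTATION_MEMORY_GB = 128  # unified memory on a DGX Spark-class workstation

def weight_footprint_gb(params_billions: float, precision: str) -> float:
    # 1e9 parameters x bytes per parameter / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]

for params in (1, 7, 70):
    for precision in ("fp16", "int8", "int4"):
        gb = weight_footprint_gb(params, precision)
        verdict = "fits" if gb <= WORKSTATION_MEMORY_GB else "does not fit"
        print(f"{params}B parameters @ {precision}: ~{gb:.0f} GB of weights ({verdict} in {WORKSTATION_MEMORY_GB} GB)")
```

On this arithmetic, a 7B-parameter model is a few GB once quantised, while a 70B-parameter model at half precision already exceeds a single workstation's memory, which lines up with the MB-to-GB versus tens-of-GB dividing line above.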