Why Nvidia’s rivals think they have a chance to topple it

Nvidia’s dominance of the AI hardware market is hard to overstate. Yet its would-be successors think they’re in with a shot. Let’s delve into why.

We’ve outlined Nvidia’s stellar position in the AI hardware space before. Its GPUs power the enterprise AI ramp-up, from training to inference, inside data centers worldwide. Thousands of them have been harnessed to train OpenAI’s ChatGPT-powering frontier models, Meta’s Llama LLMs and most others. Even DeepSeek, hit by export restrictions and limited hardware, still opted to train its models on Nvidia. The company’s near-monopoly on data center GPUs carries enormous weight, and developers have long embraced CUDA, the software framework exclusive to Nvidia’s chips. Despite claimants to the throne lining up in droves, no genuine competitor has emerged.

But we’ve been here before. Though none have enjoyed a surge quite as dramatic as Nvidia’s climb into the trillions of dollars of market cap, the likes of IBM and, more recently, Intel know the mighty can indeed fall. Complacency can seep in at the top of the heap, and a brain drain may follow once the people key to the Nvidia story head off into early, highly lucrative retirement.

A potential weakness

Nvidia’s other potential weakness lies in its actual products: GPUs. They are extremely capable general-purpose parallel processors, designed to handle parallel workloads better than any commonly found processor in the world. Nevertheless, they appear to be hitting a limit. The latest generations of data center chips (Hopper and Blackwell) already push up against TSMC’s maximum reticle size.

In other words, a single die cannot grow beyond its current scale. Blackwell works around this by fusing two dies together, but that packaging step has already led to costly delays, and similar problems may crop up with the next-gen Rubin architecture and beyond. On top of that, there is inherent overhead in the way GPUs are run: fat that a potential rival could trim away. GPUs are multifaceted, and they have been co-opted as AI engines not because they were designed to process GenAI workloads efficiently, but because they were the nearest bit of compute at hand with scale to boot.
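To put rough numbers on that ceiling, here is a quick back-of-the-envelope sketch. The figures are commonly cited approximations, not official Nvidia or TSMC specifications.

```python
# Back-of-the-envelope illustration of the reticle ceiling.
# Figures are commonly cited approximations, not vendor-confirmed numbers.

RETICLE_LIMIT_MM2 = 26 * 33   # standard lithography exposure field, ~858 mm^2
H100_DIE_MM2 = 814            # widely reported Hopper die size
BLACKWELL_DIES = 2            # B200 joins two near-reticle-sized dies

print(f"Reticle limit:      {RETICLE_LIMIT_MM2} mm^2")
print(f"H100 die:           {H100_DIE_MM2} mm^2 "
      f"({H100_DIE_MM2 / RETICLE_LIMIT_MM2:.0%} of the limit)")
print(f"Blackwell approach: ~{BLACKWELL_DIES * RETICLE_LIMIT_MM2} mm^2 of silicon, "
      "reachable only by bonding dies with a die-to-die interconnect")
```

The takeaway: a monolithic die is already close to maxed out, so any further growth has to come from packaging tricks rather than bigger silicon.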

The road to toppling Nvidia therefore runs through a purpose-built GenAI architecture: a chip essentially tailor-made for the AI revolution, with all the efficiency and speed that brings. Its key advantage would be to strip out the GPU overhead and chop away the silicon that makes Nvidia hardware so versatile but is needless for AI. Several companies are seeking to do exactly this.

Cerebras: a wafer-scale “model on a chip”

The promises made by Cerebras Systems are lofty indeed. Boasting the “world’s fastest inference” – 70x faster than on GPUs – the US company has come to the fore as one of the most prominent alternatives to Nvidia. If a Blackwell chip seemed vast, have a look at Cerebras’ behemoths. Its processors are “wafer-scale”, meaning each is a cut of silicon nearly as large as the industry-standard 300mm wafer will allow.

A single Cerebras WSE-3 carries 44GB of on-chip memory, some 880 times that of an Nvidia H100. The real triumph, though, is memory bandwidth, often the bottleneck in GenAI training and inference: at 21 petabytes per second, the WSE-3 beats an H100 roughly seven thousand times over. Of course, these are theoretical throughput figures, and even supposed apples-to-apples benchmarks don’t reveal how much optimization was required to exploit such imposing specs.
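The headline ratios are easy to reproduce with a bit of arithmetic. The sketch below assumes commonly cited H100 figures of roughly 50 MB of on-chip SRAM and 3.35 TB/s of HBM bandwidth; depending on which H100 bandwidth figure is used, the result lands in the same ballpark as the claimed seven-thousand-fold gap.

```python
# Rough sanity check of the headline ratios, using commonly cited H100
# figures as assumptions (not vendor-confirmed numbers).

wse3_onchip_gb = 44       # WSE-3 on-chip memory
h100_onchip_gb = 0.05     # ~50 MB of on-chip SRAM on an H100

wse3_bw_tbps = 21_000     # 21 PB/s expressed in TB/s
h100_bw_tbps = 3.35       # H100 SXM HBM3 bandwidth, ~3.35 TB/s

print(f"On-chip memory ratio: {wse3_onchip_gb / h100_onchip_gb:,.0f}x")  # ~880x
print(f"Bandwidth ratio:      {wse3_bw_tbps / h100_bw_tbps:,.0f}x")      # thousands, near the claimed 7,000x
```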

More concrete evidence for the company’s future lies in its client list. Meta, Docker, Aleph Alpha and Nasdaq are among those using its technology, each drawing on one or more of Cerebras’ offerings. These range from all-encompassing AI Model Services to pay-per-hour and pay-per-model schemes for training, fine-tuning and inference at scale. The Llama 3.3 line, Mistral and StarCoder stand out as Cerebras-compatible LLMs with some real weight behind them.

Cerebras is exceedingly likely to need more than the 720 million dollars it has raised across six funding rounds so far. With Nvidia pouring billions of dollars into R&D every year, an eventual IPO by its wafer-scale rival could narrow the chasm somewhat. Ultimately, performance and efficiency may also swing matters in Cerebras’ favour.

What’s clear is that the swathes of on-chip memory make the chip design a far closer match to what AI models need to be fed than a cluster of GPUs tethered together through Ethernet or Nvidia’s own InfiniBand. After all, the weights and activations are right there, available in nanoseconds rather than having to travel across comparatively glacial interconnects. The per-access difference is tiny in absolute terms, but it spans orders of magnitude, and added up across months of AI training and inference such gaps become enormous.
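To make that compounding effect tangible, here is a purely illustrative calculation. All latencies and access counts below are assumptions, and real clusters hide much of this waiting by overlapping communication with compute, so treat it as an upper-bound sketch rather than a measurement.

```python
# Illustrative only: assumed latencies and access counts, chosen to show how
# small per-access gaps compound at scale. Real systems overlap communication
# with compute and hide much of this.

ON_CHIP_NS = 5            # assumed on-chip SRAM access time
INTERCONNECT_NS = 2_000   # assumed cross-node fetch over InfiniBand/Ethernet (~2 us)

fetches_per_second = 100_000                       # assumed synchronizing off-chip fetches
fetches_per_month = fetches_per_second * 30 * 24 * 3600

extra_hours = (INTERCONNECT_NS - ON_CHIP_NS) * fetches_per_month / 1e9 / 3600
print(f"Per-access gap: {INTERCONNECT_NS / ON_CHIP_NS:.0f}x slower off-chip")
print(f"Over a month:   ~{extra_hours:.0f} extra hours spent waiting on data")
```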

SambaNova: data is key

Another challenger, SambaNova, is taking a different architectural route. Four years ago, well before ChatGPT arrived on the scene, the company had already raised more than a billion dollars. Just like Cerebras, it aims its current offerings squarely at Nvidia’s GPU solutions and their inherent shortcomings for AI. SambaNova bills its RDU (reconfigurable dataflow unit) as “built for the next generation of AI workloads, known as Agentic AI”. In other words, the company has organized its hardware around the model’s compute graph rather than relying on sequential instructions.

A single SN40L RDU can supposedly contain “hundreds of models” in-memory. That’s due to a frankly gargantuan 1.5 TB of DRAM, with 64GB of co-packaged HBM and a hyper-fast 520MB of SRAM in cache. A single SN40L node can transfer data at over 1TB per second. At face value, Nvidia has got that covered with 8 TB/s for the latest generation of GPUs Blackwell. Nevertheless, as things stand SambaNova claims its dataflow architecture leads to the fastest inference on Llama 3.1 405B on Earth. According to the company, the built-in efficiency of the RDU when it comes to handling data means higher performance than traditional GPUs is possible at “a fraction of the footprint”.

It’s somewhat less clear where SambaNova has actually been deployed in the enterprise. National laboratories such as Argonne and Lawrence Livermore appear to be fans, as are some specialized companies targeting healthcare. SambaNova’s ultimate aim is to deliver an on-prem AI training solution for enterprises. Despite generous funding, we would need to see more big names flock to SambaNova, whether announced officially or not, to be more assured of its long-term viability.

Groq: the lowest latency promise

Another AI startup offering a GPU alternative is Groq. The development of its Language Processing Unit (LPU) was led by ex-Google TPU designer Jonathan Ross. The LPU rolled out in early 2024 and is available to try online. Where other would-be Nvidia rivals target training as well as inference, Groq’s aim is clear-cut: “Groq is Fast AI Inference”. Through an OpenAI-compatible API, the company wants to shift users away from closed models like GPT-4o and o1. There’s a real chance for partnerships with the likes of Meta and DeepSeek, then.

That alone suggests Groq isn’t really intending to take on Nvidia head-on. Since we first covered the company a year ago, it has become noticeable that firms like Groq would rather target the end user directly and abstract away the hardware itself. The end goal is the lowest possible latency. If you simply want to run Llama 3.3 70B fast without local hardware, this may be the right product. Given that Groq provides no clear information about major hardware deals, we can only assume there aren’t many beyond national laboratories experimenting with the chips and customers reaching Groq via its API.
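Because the API follows OpenAI’s chat-completions conventions, switching an existing client over is mostly a matter of changing the base URL and model name. A minimal sketch, assuming the openai Python package and a Groq API key; the base URL and model identifier below reflect Groq’s public documentation at the time of writing and may change.

```python
# Minimal sketch: pointing an OpenAI-style client at Groq's endpoint.
# Base URL and model name are taken from Groq's public docs and may change.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",   # Groq's OpenAI-compatible endpoint
    api_key=os.environ["GROQ_API_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",             # hosted Llama 3.3 70B
    messages=[{"role": "user", "content": "Summarize why inference latency matters."}],
)
print(response.choices[0].message.content)
```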

However, the LPU is another example of tweaking the GPU formula towards what enterprises actually intend to compute. “The Groq LPU architecture started with the principle of software-first”, the company says, and this has resulted in a chip dedicated to linear algebra, “the primary requirement for AI inference”. In effect, the compiler has determined the chip layout, and no routers or controllers stand in the way of the hardware communicating with itself. The LPU is an “assembly line” or “conveyor belt” in Groq’s parlance, shifting data between its on-chip memory modules and across chips. This is intended to sidestep the overhead inherent to what the company calls Nvidia’s “hub and spoke” approach.

The end result is a chip capable of 750 TOPS, with 230 MB of SRAM and 80 TB/s of on-die memory bandwidth. Given that a single GroqChip is merely one part of a GroqRack compute cluster, these base specs ultimately aren’t what the company highlights most. Its claim to fame is fast inference above all. Perhaps mass AI adoption will allow Groq to find its niche and tell the world about its successes. So far, we can only go by the 1,425,093,318 total requests made to Groq-hosted LLMs at the time of writing.

Etched: a transformer ASIC to rule the roost

The closest hardware analogue to a transformer model is a transformer ASIC. “Transformers etched into silicon” is how Nvidia challenger Etched describes its Sohu chip, which looks rather a lot like a GPU, complete with VRMs surrounding the silicon die and a rectangular add-in card form factor. Eight of them reportedly dwarf the throughput of eight Nvidia B200 GPUs, let alone eight of the older H100s. The end result: 500,000 tokens per second on Llama 70B.

144GB of HBM3E feeds a single ‘core’, in effect an LLM’s architecture transposed onto silicon. Support is said to extend to models of up to 100 trillion parameters, far beyond the current crop of state-of-the-art LLMs. A fully open-source software stack should charm those unwilling to stay within Nvidia’s walled CUDA garden.

Critically, Etched hits Nvidia where it hurts. As mentioned, GPUs are about as big as they can get; they cannot grow without tricks such as die-to-die interconnects, which usually fall short of in-silicon speed. And unlike most of the competitors above, Etched’s design is algorithm-specific. What remains unclear is exactly when Sohu will arrive: after creating a buzz in mid-2024, the company has gone rather quiet.

AMD, Intel, Google, Amazon…

We should note some other, far more familiar potential Nvidia rivals. The most obvious is AMD, whose Instinct MI series of accelerators is the closest thing to a drop-in equivalent to Nvidia’s GPUs. Some of its parts even integrate Instinct GPU chiplets with Epyc CPU cores in a single package, fusing GPU and CPU power into a promising all-in-one AI package. The problem is that its ROCm software stack appears under-adopted and under-appreciated. CUDA dominates, and so does Nvidia: why develop frameworks or model pipelines for a chip that isn’t ubiquitous like its rival?
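To make the ecosystem point concrete: PyTorch’s ROCm builds expose AMD GPUs through the same torch.cuda interface, which is what lets much existing model code run on Instinct hardware with little change. A minimal sketch, assuming a ROCm or CUDA build of PyTorch is installed:

```python
# Sketch: detecting whether PyTorch is running on ROCm (AMD) or CUDA (Nvidia).
# Assumes a ROCm or CUDA build of PyTorch; ROCm builds expose AMD GPUs through
# the same torch.cuda API, which is what eases porting existing code.
import torch

if torch.cuda.is_available():
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    print(f"Accelerator backend: {backend}, device: {torch.cuda.get_device_name(0)}")
else:
    print("No supported GPU accelerator found")
```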

A similar issue faces Intel, only more so. Its Gaudi line of AI accelerators has not generated the kind of demand that has helped AMD’s stock climb over the past two years. On top of that, with CEO Pat Gelsinger out the door, the company appears rudderless, unable to execute on AI while its other market segments come under heavy attack. Without a performance lead to its name or the challenger status AMD enjoys, hopes of a change in fortune are slim.

The cloud providers, meanwhile, are among Nvidia’s biggest customers, and all of them are looking to reduce their reliance on the AI chip behemoth by building alternatives of their own. Google has done so for years, and its Tensor Processing Units (TPUs) are proven options for those looking to run AI in the cloud. Nevertheless, they can never be ubiquitous when they’re only available through Google Cloud.

The same goes for AWS’ impressive Trainium and Inferentia chips, which will never be found outside an Amazon-operated data center. It is up to Google and AWS (with Microsoft likely to follow) to build out a developer stack that abstracts away the underlying architecture, which usually means a portable shift back to an Nvidia option is never far away. After all, you can only capture a major audience if it is already likely to pick your stack anyway.

Conclusion: no end in sight

There are plenty more Nvidia alternatives to name. We could go on about Graphcore, which we haven’t heard anything about since dire news emerged in 2023, or Tenstorrent, which is building AI chips on the open-source RISC-V architecture. The selection above is but a fraction of the overall playing field, but we consider these the most notable challengers. There’s always a chance a left-field candidate emerges in the hardware space, just as DeepSeek did in the AI model-maker race.

We’ll end where we began. Nvidia has a firm hold on the GenAI market, especially for training. Despite the lofty benchmarks touted by the AI chip startups above, we’ve seen nothing to dissuade the average AI infrastructure decision-maker from buying Nvidia. Any alternative has to leap ahead with stellar efficiency, an outright performance crown, or both.

Even then, the incumbent never lets go that easily. Nvidia is already busy moving into areas of AI it had yet to reach. Apart from its dominant presence in consumer machines, it now offers devkits entirely dedicated to GenAI with Project Digits, while Jetson Nano serves edge deployments. No competitor, not even Nvidia’s closest rival AMD, has this degree of flexibility. That will help the company weather future storms, even if it has to abandon the GPU’s all-rounder status to push further ahead. A shift towards a dedicated transformer/GenAI processor is most easily made when you’re backed by roughly 3 trillion dollars of market capitalization.

Also read: Nvidia announces an AI supercomputer for your desktop