Differentiations for AI Hardware Vendors

Part 7 of learning in public about multiagent systems, this time on the technical and strategic differentiations between AI hardware vendors. Here I’m covering NVIDIA, AMD, Tenstorrent, Cerebras, SambaNova, and Groq. There are a ton of established players, but I’m more interested in the new entrants.

(Disclaimer: My promise is that these write-ups are written by me, a real human, rather than just an LLM!)

If you’re new to this space, I shared an intro to how AI hardware works through the metaphor of a factory.

The observations below are based on discussions with technical experts, startups, and industry leaders, plus secondary research. If you’re working in this space, please reach out! I’d be grateful to chat with practitioners. I’ll be distributing a final report in a few months; let me know if you’d like to receive or contribute to it.

How Established and Emerging Players are Differentiating

  • Software ecosystems/vertical integration (either internally or through strategic partnerships)
  • Specialization for unique data flows of model architectures (e.g. for Transformers)
  • Expanding memory to fit full models on single chips
  • Improving high-bandwidth memory (especially important for Transformers, given the O(N^2) memory footprint of attention mechanisms; see the sketch after this list)
  • Implementing high-speed interconnects to scale computation power through distributed processing
  • Optimization for low-latency inference
  • Power optimization
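
To make that O(N^2) point concrete, here’s a back-of-the-envelope sketch of how the attention score matrix grows with context length (my own illustration, assuming 32 heads and FP16 scores, not any vendor’s numbers):

```python
# Back-of-the-envelope: memory for the attention score matrix, per layer.
# Shape is (batch, heads, N, N) -- quadratic in sequence length N.
def attention_scores_gib(seq_len: int, heads: int = 32, batch: int = 1,
                         bytes_per_elem: int = 2) -> float:  # FP16 = 2 bytes
    return batch * heads * seq_len**2 * bytes_per_elem / 2**30

for n in (2048, 8192, 32768):
    print(f"N={n:>6}: {attention_scores_gib(n):7.2f} GiB")
# N=  2048:    0.25 GiB
# N=  8192:    4.00 GiB
# N= 32768:   64.00 GiB  (4x the context -> 16x the memory)
```

(Flash-attention-style kernels avoid materializing the full matrix, but attention still puts heavy pressure on memory bandwidth; hence the HBM arms race.)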

1. NVIDIA (market hegemon; software leader)

(+) NVIDIA’s GPUs are obviously ubiquitous, commanding a hard-to-believe 85% of datacenter GPU revenue.

(+) CUDA, NVIDIA’s software stack, is immensely dominant. Many argue (and I agree) that CUDA/cuDNN are NVIDIA’s real strategic advantage. The position may be further entrenched with AI-generated code (which will likely favor CUDA implementations, given how well-represented they are in training data).

(+) NVIDIA powers essentially every major data center.

(+) NVIDIA leads in interconnects, particularly through NVLink and NVSwitch for high peer-to-peer bandwidth within systems. However, NVIDIA needs to be a leader in interconnects because…

( – ) GPUs were never built for Transformers (they predate them by roughly 20 years). They’re great for parallelized computation, but not purpose-built for LLMs, though NVIDIA has added a “Transformer Engine” for FP8 mixed-precision computation on its latest products. NVIDIA is facing its first meaningful competition from Transformer-native hardware architectures that have just started to reach production-level maturity.
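
For intuition on why FP8 works (and why it needs the careful scaling that the Transformer Engine manages), here’s a conceptual NumPy sketch of scaled low-precision quantization. This is my own toy illustration, not NVIDIA’s implementation; the E4M3 maximum is real, but the rounding step is a crude stand-in:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def scaled_fp8_roundtrip(x: np.ndarray) -> np.ndarray:
    # Scale the tensor so its largest value lands near FP8's max (using the
    # narrow format's full dynamic range), quantize coarsely, then rescale.
    scale = FP8_E4M3_MAX / np.abs(x).max()
    x_quant = np.round(x * scale * 8) / 8  # crude stand-in for FP8 rounding
    return x_quant / scale

x = np.random.randn(4, 4).astype(np.float32)
err = np.abs(x - scaled_fp8_roundtrip(x)).max()
print(f"max round-trip error: {err:.5f}")  # small, but nonzero
```

The point: activations and gradients vary wildly in magnitude, so hardware that tracks per-tensor scale factors can keep 8-bit math accurate enough to be useful.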

( – ) NVIDIA is committed to backwards compatibility with CUDA. This is great but means that CUDA can be slower to evolve.

2. AMD (established player focusing on memory)

(+) AMD is one of the few long-standing competitors to NVIDIA, and it’s betting on expanded memory. For example, a single MI300X fits models up to ~70B parameters in FP16 entirely in memory, with 2.4x the capacity of NVIDIA’s flagship H100 (192 GB of HBM3 vs 80 GB).
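
The arithmetic behind that claim, as a quick sanity check (FP16 weights only, ignoring the KV cache, activations, and runtime overhead):

```python
def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    # FP16 stores each parameter in 2 bytes.
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weights_gb(70))       # 140.0 GB -> fits in the MI300X's 192 GB
print(weights_gb(70) > 80)  # True: too big for a single 80 GB H100
```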

(+) AMD prices below NVIDIA, and data centers (such as Azure) like having a counterbalance to NVIDIA.

(+) Newer silicon targets mixed workloads by combining CPUs and GPUs in a single package, aiming at the convergence of high-performance computing and AI.

( – ) AMD is severely limited by the relative immaturity of its software stack (ROCm) compared to NVIDIA’s CUDA, but it is navigating this through strategic partnerships, such as with Hugging Face, the world’s largest repository of Transformer models.

3. SambaNova (new entrant focused on utilization and vertical integration)

(+) SambaNova’s insight was that GPUs are often not fully utilized (like running our factory at 40% capacity), so they developed the “Reconfigurable Dataflow Unit” (RDU), which maps an entire model’s computation graph onto the chip. The circuits are reconfigurable (akin to an FPGA), so the chip can adapt to different model architectures.
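
Here’s a toy Python sketch of the general contrast between kernel-by-kernel execution and dataflow execution. This is my own illustration of the idea, not SambaNova’s compiler:

```python
import numpy as np

# A tiny "model": matmul -> GELU -> matmul.
W1, W2 = np.random.randn(64, 256), np.random.randn(256, 64)
gelu = lambda x: 0.5 * x * (1 + np.tanh(0.79788456 * (x + 0.044715 * x**3)))

def kernel_by_kernel(x):
    # GPU-style: each op is a separate kernel launch, and every intermediate
    # round-trips through off-chip memory (simulated here by `buffers`).
    buffers = {}
    buffers["h1"] = x @ W1               # kernel 1: compute, store
    buffers["h2"] = gelu(buffers["h1"])  # kernel 2: load, compute, store
    return buffers["h2"] @ W2            # kernel 3: load, compute

def dataflow(x):
    # Dataflow-style: the whole graph is mapped onto the chip at once, so
    # intermediates stream stage-to-stage and never leave the die.
    return gelu(x @ W1) @ W2

x = np.random.randn(8, 64)
assert np.allclose(kernel_by_kernel(x), dataflow(x))  # same math, less plumbing
```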

(+) The result is huge performance improvements for giant models, and scaling becomes easier/more attractive because you don’t have to deal with complicated distributed programming. On an optimal task, 16 SambaNova SN40L chips achieved 198 tokens/second generating text with DeepSeek-R1-671B.

(+) SambaNova is pursuing vertical integration through its SambaNova Suite, which includes hardware, foundation models, SambaStudio, and fine-tuning capabilities.

( – ) The price of vertical integration is that it requires users to adopt the full SambaNova stack, and it’s unclear whether customers will be willing to do so.

4. Groq (new entrant focused on low-latency inference)

(+) Like SambaNova, Groq is addressing low utilization, but in contrast, Groq has focused on small, single-batch inference (basically processing one request at a time). The result is very low-latency inference, which is useful for workloads that demand immediate responses, such as high-frequency trading, user-facing interfaces, or real-time agents. In an optimal use case, Groq’s LPU delivered ~300 tokens/second generating text with a Llama2-70B model, roughly 6x the H100’s performance (though the speed depends on kernels and a number of other factors).

(+) Groq has innovated through its LPU (“Language Processing Unit”), which basically uses a single large sequential core with a deterministic, statically scheduled execution model (meaning the compiler knows, a priori, exactly when and where every computation will execute).
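
A toy sketch of what static scheduling means in software terms (my own illustration of the concept, not Groq’s compiler): the “compiler” fixes every operation’s slot up front, so the “hardware” makes no runtime decisions.

```python
from typing import Callable

def compile_schedule(ops: list[Callable[[float], float]]) -> list[tuple[int, Callable]]:
    # Assign each op a deterministic cycle slot at compile time.
    return [(cycle, op) for cycle, op in enumerate(ops)]

def execute(schedule: list[tuple[int, Callable]], x: float) -> float:
    for cycle, op in schedule:  # no queues, reordering, or cache misses to
        x = op(x)               # wait on -- latency is known before launch
    return x

sched = compile_schedule([lambda x: x * 2, lambda x: x + 1, lambda x: x ** 2])
print(execute(sched, 3.0))  # same path, same timing, every run: 49.0
```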

( – ) Groq’s focus on single-batch inference means it’s not well-suited to larger batches, so its best customers will be the aforementioned time-sensitive workloads. This limits its reach.

5. Tenstorrent (new entrant focused on software flexibility, licensable IP, and power efficiency)

(+) Tenstorrent is taking a modular approach, aiming for the high utilization of an optimized ASIC (application-specific integrated circuit) that can be scaled out via “chiplets.” The design centers on custom Tensix cores, which pair RISC-V CPU cores with a neural processing unit (NPU) well-suited to matrix computation. Early results suggest that Tenstorrent may be highly competitive on power efficiency, given this granularity of control over the architecture.

(+) The focus on flexibility (both hardware and software) is attractive for developers who want to experiment beyond CUDA.

(+) Tenstorrent’s focus on licensable IP could become a really interesting play if companies start to build in-house AI chips. This is well-precedented in the semiconductor industry.

( – ) Tenstorrent’s ecosystem is still early stage compared to some of its competitors, so it remains to be seen how they’ll expand as they mature.

6. Cerebras (new entrant focused on fitting entire models onto single chips and high memory bandwidth)

(+) Cerebras basically said “let’s put the whole model on a single chip and give it a whole bunch of bandwidth”: its wafer-scale chip has an internal memory bandwidth of 20 PB/s (which I find both amazing and hilarious; the youths would say Cerebras is the most “extra” hardware vendor).
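
To put 20 PB/s in context, here’s my own rough arithmetic comparing how long it takes to stream the FP16 weights of a 70B-parameter model once at Cerebras’s on-chip bandwidth vs the H100’s ~3.35 TB/s of HBM3 bandwidth. (Streaming the weights once is a floor on per-token latency for memory-bandwidth-bound inference; the comparison is apples-to-oranges since one is on-die SRAM and the other off-chip HBM, but it shows the scale.)

```python
WEIGHTS_BYTES = 70e9 * 2  # 70B parameters in FP16 = 140 GB

def sweep_time_ms(bandwidth_bytes_per_s: float) -> float:
    # Time to read every weight once.
    return WEIGHTS_BYTES / bandwidth_bytes_per_s * 1e3

print(f"H100 HBM3 (~3.35 TB/s): {sweep_time_ms(3.35e12):.3f} ms")  # ~41.8 ms
print(f"Cerebras   (20 PB/s):   {sweep_time_ms(20e15):.4f} ms")    # ~0.007 ms
```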

(+) Cerebras attracts customers who want to simplify training huge models and/or running big inference jobs. This might be entities like government labs, if I had to guess.

( – ) Like Groq, this represents a commitment to a particular segment of use cases.

Broad observations

  • We’re only now seeing the real emergence of Transformer-native hardware architectures, which makes sense given how long it takes to design, validate, and manufacture application-specific silicon at scale.
  • Companies are committing to application-specific advantages (such as Groq focusing on low-latency inference at the cost of high-batch inference, or Cerebras capturing the other end of the spectrum). I read this as the industry maturing.
  • As it relates to multiagent systems, I’m still thinking about who is best suited. Groq is attractive because of its focus on single batches, but I’m not sure how a nondeterministic swarm of agents would interact with its statically scheduled execution. Any of the vertically integrated players (NVIDIA, SambaNova, Google (whom I didn’t profile here), etc.) are also attractive options.
  • I am curious to what extent partner companies are desperate to offset NVIDIA’s dominance. Will data centers try to play hardware vendors off of each other? Will this end up with another Kubernetes-style agnosticism to the underlying hardware?

As mentioned, I’ll be distributing a final report in a few months. Please reach out if you’d be interested in getting/contributing to it!





Disclaimer

Views expressed above are the author’s own.


