NVIDIA Vera CPU Benchmarks: Olympus Cores Establish ARM as a Credible Server Threat

For the last decade, the data center conversation has been almost entirely dominated by GPUs. CPUs were treated as host processors, important but ultimately supporting actors to the accelerators doing the real work. That framing has been quietly eroding for the past few years as agentic AI workloads have created entirely new bottlenecks that GPUs don’t solve, and Phoronix’s first independent benchmarks of NVIDIA’s new Vera CPU, published this week, make clear just how aggressive NVIDIA’s response is. The 88-core Olympus-powered chip didn’t just match its Intel Xeon and AMD EPYC competitors; it beat them, with Phoronix calling it the most performant ARM Linux server processor they have ever tested.

Part 1: The Goals and Strategic Implications

Why NVIDIA Built Vera

Vera is NVIDIA’s second-generation data center CPU, succeeding the Grace processor that launched in 2022. Where Grace used standard Arm Neoverse V2 cores in a relatively conservative implementation, Vera represents something fundamentally different: a fully custom CPU design built around NVIDIA’s in-house Olympus cores and explicitly targeted at agentic AI and reinforcement learning workloads. Most notably, Vera will serve as the host CPU in NVIDIA’s Vera Rubin NVL72 AI racks, and it will also be offered standalone in CPU-only racks. The strategic intent is clear: NVIDIA isn’t just building a host processor for its own GPUs; it’s building a general-purpose data center CPU intended to compete directly with Intel and AMD in their core territory.

The motivation behind this becomes obvious when you understand what agentic AI workloads actually demand from the underlying hardware. Modern AI agents don’t just run inference once and return a result; they execute loops, manage state, call tools, parse responses, decide next actions, and coordinate multi-tenant data pipelines. Each of those activities is a control-heavy, latency-sensitive task that benefits enormously from strong single-thread CPU performance and high memory bandwidth. The Olympus cores deliver the leadership single-thread performance required to run control-heavy software environments at scale, while spatial multithreading enables 176 total threads by physically partitioning each core’s resources rather than time-slicing them.

The Benchmark Results That Shook the Server Market

The Phoronix benchmarks provide the first independent validation of NVIDIA’s performance claims. In the geometric mean of all test results, NVIDIA’s Vera topped the chart, performing nearly 11% better than AMD’s most advanced designs and 55.3% better than the best single-socket Intel Xeon. Against specific parts, the 88-core Vera ended up 63% faster than its predecessor, the 72-core Grace CPU. It was also roughly 10% faster than AMD’s EPYC 9575F, a 64-core Zen 5 chip that boosts up to 5 GHz, and it beat the Intel Xeon 6980P, a 128-core chip from the Granite Rapids family, by about 55%.

These numbers deserve careful interpretation. Vera beats a 128-core Intel chip with just 88 cores, which suggests a substantial per-core advantage for the Olympus architecture. The lead over the 64-core EPYC 9575F is harder to dismiss, because AMD’s Zen 5 represents the current state of the art in x86 server cores, although Vera has 24 more cores than that part, so the result shows a socket-level lead rather than a per-core one. The 63% lead over Grace, NVIDIA’s previous-generation CPU, validates the company’s claim that switching from standard Arm Neoverse cores to fully custom Olympus cores represented a generational performance leap rather than an incremental update.
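To make the per-core reading concrete, here is a rough sketch that divides the quoted geomean leads by core count. It uses only the figures cited above; the function and variable names are illustrative, and the normalization itself is a simplification, since a geomean mixes lightly and heavily threaded tests and does not scale linearly with cores.

```python
# Vera's geomean score is normalised to 1.0.
# Core counts and leads are the figures quoted in the text above.
VERA_CORES = 88

# name -> (core count, Vera's lead over this chip as a fraction)
RIVALS = {
    "Grace": (72, 0.63),
    "EPYC 9575F": (64, 0.10),
    "Xeon 6980P": (128, 0.55),
}


def per_core_ratio(rival_cores, vera_lead):
    """Vera's score per core divided by the rival's score per core."""
    vera_per_core = 1.0 / VERA_CORES
    rival_score = 1.0 / (1.0 + vera_lead)  # rival's score relative to Vera
    rival_per_core = rival_score / rival_cores
    return vera_per_core / rival_per_core


if __name__ == "__main__":
    for name, (cores, lead) in RIVALS.items():
        ratio = per_core_ratio(cores, lead)
        print(f"Vera vs {name}: {ratio:.2f}x per core")
```

On these inputs the ratio comes out at roughly 2.25x against the Xeon 6980P and 1.33x against Grace, but about 0.80x against the EPYC 9575F. In other words, by this crude measure the socket-level win over AMD comes from having more cores, not faster ones.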

The Power Efficiency Question

One important caveat: NVIDIA did not permit Phoronix to run or publish power-efficiency tests on the early Vera silicon, so metrics such as performance per watt are missing from the results. The Vera module Phoronix received was early pre-production hardware, and pre-release power tuning and optimizations may further improve the system’s performance per watt. This is normal practice for pre-production silicon evaluations, but it’s worth keeping in mind that the headline performance numbers tell only half the story.

What we do know is that the Vera CPU, as tested, had a peak 450 Watt socket TDP, with the LPDDR5X memory adding around 50 Watts or less of additional power consumption. For comparison, the recent Intel Xeon 6980P and AMD EPYC 9755 designs fall within similar TDP ranges. The key question is whether Vera’s performance advantage scales when normalized to power consumption, which would determine whether it’s actually more efficient or just more aggressively configured. Until NVIDIA permits independent efficiency benchmarks, the AI industry will have to take the company’s 2x efficiency claims at face value.

What This Means for Intel and AMD

The strategic implications for the established server CPU vendors are significant. Intel and AMD have spent years optimizing for cloud-scale data center workloads, building out broad ecosystem support, and maintaining deep relationships with hyperscalers. NVIDIA’s entry into this market with a competitive, performance-leading product creates a structural challenge that neither company has fully faced before. AMD, in particular, is positioned uncomfortably, because agentic AI has been the kind of workload where it has been gaining ground against Intel, and NVIDIA is now contesting precisely that segment.

The longer-term competitive picture isn’t entirely one-sided. AMD’s next-generation EPYC Venice, based on the Zen 6 core, is expected to deliver substantial performance improvements when it ships, and Intel’s Diamond Rapids, the successor to Granite Rapids, is similarly positioned to close the gap. But NVIDIA’s entry as a credible third competitor changes the dynamics of an industry that has effectively been a two-horse race since AMD’s Zen-era resurgence. Hyperscalers will now have a third validated option for high-performance server CPUs, which changes negotiation dynamics and accelerates pressure for both Intel and AMD to continue innovating aggressively.

The ARM Server Inflection Point

Vera’s performance also represents an inflection point for ARM in the data center more broadly. Phoronix has been benchmarking ARM Linux servers since the Calxeda days, and it reports that Vera with its Olympus cores is more competitive with Intel and AMD x86_64 CPUs than any other ARM or non-x86_64 processor it has tested. ARM-based server chips have been competitive in specific niches (AWS Graviton for cloud-scale workloads, Ampere for cost-sensitive deployments), but they have generally traded off peak performance for efficiency or cost. Vera is the first ARM server CPU that competes at the absolute top of the performance envelope.

This matters because it changes the question that software ecosystem maintainers, enterprise IT departments, and hyperscalers have been asking. The question used to be whether ARM could match x86 performance for specific workloads at attractive economics. The new question is whether x86 can match ARM in the workloads where NVIDIA’s Olympus implementation now leads. National laboratories and supercomputing centers planning to deploy Vera CPUs include Leibniz Supercomputing Center, Los Alamos National Laboratory, Lawrence Berkeley National Laboratory’s National Energy Research Scientific Computing Center, and the Texas Advanced Computing Center, suggesting that the HPC and research segments are already moving.

Part 2: The Olympus Architecture in Technical Detail

Core Microarchitecture

The Olympus core is NVIDIA’s first fully custom data-center CPU core, and its microarchitecture is aggressive. Per the NVIDIA technical blog, it features a 10-wide instruction fetch and decode front end, a neural branch predictor capable of evaluating two taken branches per cycle, and six 128-bit SVE2 vector engines with FP8 support. A 10-wide front end is genuinely large; most modern x86 cores top out at 8 to 9 wide, and the neural branch predictor handling two taken branches per cycle is unusual outside of the most aggressive performance-oriented designs. This is not a conservative implementation. It’s a clear statement that NVIDIA wants leadership single-thread performance, accepting the design complexity that comes with that ambition.

The cache hierarchy reflects a similar ambition. Over Grace, Vera also has double the L2 cache at 2MB per core, a larger unified L3 cache at 164MB, and PCIe Gen 6 and CXL 3.1 connectivity. With 88 cores each having 2MB of L2 cache, the total L2 capacity alone reaches 176MB, on top of 164MB of unified L3. That’s an enormous amount of on-chip cache by current data center CPU standards, and it’s particularly well-suited to agentic workloads that need to keep working sets close to compute units to minimize memory latency.
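The cache totals follow from simple arithmetic, sketched below using only the figures quoted above. The function names are illustrative. The per-core L3 figure is just an average share of a unified cache, and the combined total assumes L2 and L3 hold distinct data, which the source does not state.

```python
CORES = 88
L2_PER_CORE_MB = 2
L3_TOTAL_MB = 164


def total_l2_mb(cores=CORES, l2_per_core_mb=L2_PER_CORE_MB):
    """Sum of the private L2 caches across all cores."""
    return cores * l2_per_core_mb


def total_on_chip_mb():
    """L2 plus L3, assuming (not confirmed) that the two hold distinct data."""
    return total_l2_mb() + L3_TOTAL_MB


def average_l3_share_mb(cores=CORES):
    """Average slice of the unified L3 per core; any core may use more."""
    return L3_TOTAL_MB / cores


if __name__ == "__main__":
    print(f"Total L2: {total_l2_mb()} MB")
    print(f"L2 + L3: {total_on_chip_mb()} MB")
    print(f"Average L3 share per core: {average_l3_share_mb():.2f} MB")
```

That works out to 340MB of combined L2 and L3, or a little under 4MB of cache per core on average.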

SVE2 and Native FP8 Support

The vector engine implementation is one of the most distinctive aspects of Olympus. Each of the 88 Olympus cores features six 128-bit SVE2 vector engines, a 50% increase over Grace. More importantly, it is the first CPU to natively support FP8 precision. By processing data in the same 8-bit format used by the latest GPUs, Vera can move and manipulate AI data without the translation tax of converting between different formats, drastically reducing latency during the critical pre-fill stages of model inference.

The technical details here matter. Native FP8 support comes through the FP8DOT2 and FP8DOT4 instruction extensions. With FP8DOT2, the FDOT instruction takes two vectors of FP8 values (E4M3 or E5M2), multiplies them pairwise, and accumulates each pair of products into FP16; with FP8DOT4, it accumulates groups of four products into FP32. This is hardware FP8 arithmetic, with no lookup tables or software conversion required. With six 128-bit engines, each core can process 96 E4M3 elements per cycle (6 × 128 bits / 8 bits per element), which is a substantial throughput improvement for workloads that can be expressed in FP8 format.
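For readers who want to see what a four-way FP8 dot product computes, here is a minimal software model in Python. It is a sketch of the arithmetic only, not NVIDIA’s or Arm’s implementation: the function names are made up for illustration, the E4M3 decoding follows the common variant with bias 7 and no infinities, and any scaling or format-selection state the real instructions use is ignored, as is intermediate rounding.

```python
import struct


def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # this variant has NaN but no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal values
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def to_f32(value):
    """Round a Python float to FP32, the accumulator width described above."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def fdot4_lane(acc, a4, b4):
    """One FP32 lane: accumulator plus the sum of four FP8 products."""
    total = acc
    for a, b in zip(a4, b4):
        total += decode_e4m3(a) * decode_e4m3(b)
    return to_f32(total)


def fdot4_vector(acc, a, b):
    """A 128-bit vector: 16 FP8 bytes per source feed 4 FP32 accumulators."""
    if len(a) != 16 or len(b) != 16 or len(acc) != 4:
        raise ValueError("expected 16 FP8 bytes per source and 4 accumulators")
    return [
        fdot4_lane(acc[i], a[4 * i:4 * i + 4], b[4 * i:4 * i + 4])
        for i in range(4)
    ]


# Six 128-bit engines, 8 bits per FP8 element.
FP8_ELEMENTS_PER_CYCLE = 6 * 128 // 8

if __name__ == "__main__":
    ones = bytes([0x38] * 16)  # 0x38 encodes 1.0 in E4M3
    twos = bytes([0x40] * 16)  # 0x40 encodes 2.0 in E4M3
    print(fdot4_vector([0.0, 0.0, 0.0, 0.5], ones, twos))
    print(FP8_ELEMENTS_PER_CYCLE)
```

The point of the model is the shape of the operation: sixteen 8-bit inputs per source vector collapse into four 32-bit sums, which is why a wider accumulator is paired with the narrow input format.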

Spatial Multithreading

One of the more unusual architectural decisions in Olympus is NVIDIA’s implementation of Spatial Multithreading. This is described as a new type of multithreading that enables a total of 176 threads by physically partitioning each core’s resources rather than time-slicing them, allowing the system to optimize for performance or density at runtime. This differs from conventional simultaneous multithreading, as in Intel’s Hyper-Threading or AMD’s SMT, where two threads share the same physical execution resources and dynamically contend for them.

NVIDIA’s documentation indicates that Spatial Multithreading affects per-thread performance: most pipelines effectively halve per-thread throughput with two threads active, except for a few per-thread-dedicated pipelines, so developers should decide whether to run two threads per core for a given workload or keep one thread per core. The implication is that Vera offers fine-grained control over the tradeoff between throughput density and per-thread performance, which is exactly what agentic AI workloads need. Some agent loops want maximum per-thread responsiveness; others want raw parallelism. Spatial Multithreading lets administrators choose.
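One way an administrator might act on that choice on Linux is to pin a latency-sensitive process to a single hardware thread per core. The sketch below assumes that Vera’s two threads per core show up as siblings in the standard sysfs topology files, as SMT threads do on other Linux systems; that is an assumption, not something the source confirms, and the helper names are hypothetical.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,88" or "4-5" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def one_thread_per_core(siblings_by_cpu):
    """Pick the lowest-numbered logical CPU from each group of siblings."""
    groups = {frozenset(parse_cpu_list(s)) for s in siblings_by_cpu.values()}
    groups.discard(frozenset())  # ignore empty or unreadable entries
    return sorted(min(group) for group in groups)


def read_sibling_lists():
    """Read sibling lists from Linux sysfs; returns {} if none are exposed."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    result = {}
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as handle:
            result[cpu] = handle.read()
    return result


def pin_current_process(cpus):
    """Restrict this process to the given CPUs that it is allowed to use."""
    target = set(cpus) & os.sched_getaffinity(0)  # Linux-only calls
    if target:
        os.sched_setaffinity(0, target)
    return target


if __name__ == "__main__":
    chosen = one_thread_per_core(read_sibling_lists())
    print(f"one thread per core -> {len(chosen)} logical CPUs: {chosen}")
```

Calling pin_current_process(chosen) would then keep a control-heavy agent loop from sharing a core with a second thread, while throughput-oriented jobs can simply skip the pinning and use every logical CPU.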

The Memory Subsystem

Memory bandwidth is where Vera most clearly distinguishes itself from established x86 competitors. The chip offers 1.2 TB/s of memory bandwidth and supports up to 1.5 TB of LPDDR5X memory in the SOCAMM2 format. On per-core memory bandwidth, Vera leads substantially: about 2.8 times that of an EPYC 9755 or Xeon 6980P with standard DDR5, and still clearly ahead of Xeon 6 configurations using MRDIMM-8800. For workloads where memory access dominates execution time, which describes a great deal of modern AI inference and agentic computation, that bandwidth advantage translates directly to performance.
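The per-core comparison can be approximated with peak theoretical numbers. In the sketch below, Vera’s 1.2 TB/s and 88 cores come from the text; the x86 figures (128 cores, 12 memory channels, DDR5-6000 for the EPYC 9755, DDR5-6400 and MRDIMM-8800 for the Xeon 6980P) are assumptions added for illustration, and real sustained bandwidth will be lower than these peaks on every platform.

```python
VERA_BANDWIDTH_GBS = 1200.0  # 1.2 TB/s, from the text
VERA_CORES = 88


def ddr_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Peak bandwidth of a DDR memory system with 64-bit channels, in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


def vera_advantage(rival_cores, rival_bandwidth_gbs):
    """Vera's bandwidth per core divided by the rival's bandwidth per core."""
    vera_per_core = VERA_BANDWIDTH_GBS / VERA_CORES
    rival_per_core = rival_bandwidth_gbs / rival_cores
    return vera_per_core / rival_per_core


# Assumed configurations: (core count, peak bandwidth in GB/s).
PLATFORMS = {
    "EPYC 9755, 12ch DDR5-6000": (128, ddr_bandwidth_gbs(12, 6000)),
    "Xeon 6980P, 12ch DDR5-6400": (128, ddr_bandwidth_gbs(12, 6400)),
    "Xeon 6980P, 12ch MRDIMM-8800": (128, ddr_bandwidth_gbs(12, 8800)),
}

if __name__ == "__main__":
    print(f"Vera: {VERA_BANDWIDTH_GBS / VERA_CORES:.1f} GB/s per core")
    for name, (cores, bandwidth) in PLATFORMS.items():
        print(f"{name}: {bandwidth / cores:.1f} GB/s per core, "
              f"Vera leads by {vera_advantage(cores, bandwidth):.2f}x")
```

Under these assumptions Vera lands at roughly 13.6 GB/s per core, which is about 2.8x to 3.0x the standard DDR5 configurations and about 2.1x the MRDIMM one, in line with the ratios quoted above.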

The use of LPDDR5X in SOCAMM2 modules is also notable. LPDDR5X consumes significantly less power than equivalent DDR5 configurations, and the SOCAMM2 form factor enables higher density mounting and easier serviceability than traditional registered DIMMs. The tradeoff is that LPDDR5X-based systems are more difficult to upgrade post-deployment because the memory modules are physically different from the DIMM standards used in conventional servers. For data center operators who refresh hardware every 3 to 5 years anyway, this is generally acceptable; for enterprises that want incremental upgrades, it represents a meaningful change in operating model.

The Scalable Coherency Fabric

The interconnect inside Vera is another aspect that distinguishes it from competing designs. A second-generation Scalable Coherency Fabric provides 3.4 TB/s of bisection bandwidth, connecting the cores across a unified monolithic die and eliminating the latency issues common in chiplet architectures. The choice of a monolithic die for an 88-core chip is significant because both AMD and Intel have moved aggressively toward chiplet-based designs for their highest-core-count parts, accepting some inter-chiplet latency in exchange for better manufacturing economics.

NVIDIA’s bet with Vera is that the latency advantages of a monolithic die matter more than the cost advantages of chiplets for the specific workloads Vera targets. This makes sense for agentic AI workloads where cross-core communication patterns are unpredictable and latency variance translates directly to worst-case response times. Whether it’s the right tradeoff for the broader server market is a more open question; monolithic dies at 88 cores represent significant manufacturing complexity and yield challenges, which is part of why competing vendors moved away from that approach.

NVLink-C2C and Platform Integration

Vera’s integration with NVIDIA’s broader platform is one of its most strategically important characteristics. The chip features NVLink-C2C with 1.8 TB/s of coherent CPU-GPU bandwidth, roughly double that of the previous-generation Grace implementation. In a Vera Rubin system, this enables genuinely tight coupling between CPU and GPU memory, allowing AI workloads to span both compute domains without the bottleneck of slower PCIe interconnects. A full Vera Rubin rack, labeled NVL144 when individual GPU dies are counted and NVL72 when GPU packages are counted, integrates 144 Rubin GPU dies with 36 Vera CPUs to deliver up to 3.6 NVFP4 ExaFLOPS for inference and up to 1.2 FP8 ExaFLOPS for training performance.
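The rack-level figures are easier to reason about once broken down per device. The following sketch does that division using only the numbers quoted above; the function names are illustrative, and the GPU count is taken exactly as the rack figure states it.

```python
GPUS_PER_RACK = 144
CPUS_PER_RACK = 36
INFERENCE_EXAFLOPS = 3.6  # NVFP4, rack total
TRAINING_EXAFLOPS = 1.2   # FP8, rack total


def gpus_per_cpu():
    """How many GPUs each Vera CPU hosts, by the rack's own GPU count."""
    return GPUS_PER_RACK // CPUS_PER_RACK


def petaflops_per_gpu(rack_exaflops):
    """Rack throughput divided evenly across the GPUs, in PFLOPS."""
    return rack_exaflops * 1000.0 / GPUS_PER_RACK


if __name__ == "__main__":
    print(f"GPUs per Vera CPU: {gpus_per_cpu()}")
    print(f"Inference: {petaflops_per_gpu(INFERENCE_EXAFLOPS):.1f} PFLOPS per GPU")
    print(f"Training: {petaflops_per_gpu(TRAINING_EXAFLOPS):.1f} PFLOPS per GPU")
```

Each Vera therefore feeds four GPUs as counted in the rack figure, each rated at about 25 PFLOPS of NVFP4 inference, which gives a sense of how much orchestration work lands on a single host CPU.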

This platform integration is also Vera’s competitive moat. Even if Intel or AMD CPUs match Vera’s raw performance, neither can replicate the CPU-GPU coherence and bandwidth that NVLink-C2C enables. For customers buying into the Vera Rubin platform for AI workloads, the host CPU choice is effectively constrained because the integration advantages are too significant to give up. This is the playbook NVIDIA has executed repeatedly across its product lines: build a credible standalone component while reserving the best integration benefits for customers who go all-in on NVIDIA’s full stack.

What Comes Next

NVIDIA Vera data center CPUs remain on track to ship in the second half of 2026, which means the next several months will be dominated by ecosystem readiness work, OEM partner announcements, and the slow process of moving software to fully exploit Olympus-specific capabilities such as native FP8 and the six SVE2 vector engines in each core. Major cloud providers, including AWS, Google Cloud, Azure, and Oracle, are planning to offer Vera Rubin-based services, with general availability following hardware shipments. The path to Vera becoming a meaningful share of the server CPU market depends heavily on how quickly hyperscaler deployments ramp up and how effectively NVIDIA’s software ecosystem supports both standalone Vera deployments and Vera Rubin-integrated systems.

The Phoronix benchmarks have established that Vera is a credible competitor at the absolute top of the server CPU performance envelope, which was the open question before this week. The remaining questions (power efficiency, total cost of ownership, software ecosystem maturity, and supply availability) will determine how aggressively customers actually move workloads to Vera over the next 18 to 24 months. What is no longer in doubt is that NVIDIA has succeeded in producing a server CPU that can compete on its own technical merits with anything Intel or AMD currently offers. For an industry that has effectively been a two-vendor market for high-end server CPUs since the Itanium era, this represents the most significant structural change in years.
