NVIDIA unveils NVFP4 4-bit pretraining methodology for efficient AI

NVIDIA has successfully pretrained a 12-billion-parameter Mamba-Transformer model on 10 trillion tokens using a new 4-bit precision method.

David Katzman

May 18, 2026 · 3 min read

NVIDIA has successfully pretrained a 12-billion-parameter Mamba-Transformer model on 10 trillion tokens using a new 4-bit precision method. This method, called NVFP4, achieved accuracy nearly identical to its 8-bit counterpart. The development could reshape how organizations approach large language model development by making advanced AI more attainable.

Training large language models traditionally demands immense computational resources and high precision. However, NVIDIA's NVFP4 results suggest that near-equivalent performance can be achieved with significantly reduced precision and cost. This challenges a core assumption in AI training: that high precision is always necessary for competitive LLM performance at scale.

The industry is likely to see faster development and deployment of even larger and more complex AI models. Lower training costs would make advanced AI more accessible for a broader range of applications, opening doors beyond a handful of hyperscale players.

How NVIDIA Achieved 4-bit Precision with High Accuracy

  • NVFP4 uses a two-level scaling strategy: a fine-grained FP8 (E4M3) scaling factor for each 16-value micro-block, plus a second-level FP32 scalar applied per tensor, according to NVIDIA's developer site.
  • Each NVFP4 value has 1 sign bit, 2 exponent bits, and 1 mantissa bit (a layout commonly written E2M1), supporting a value range of approximately -6 to 6.
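
To make the bit layout concrete, the short Python sketch below decodes all sixteen 4-bit codes. The exponent bias of 1 and the subnormal handling follow the usual E2M1 convention and are assumptions here, since the article does not spell them out. The function and constant names are illustrative only.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit code laid out as sign | 2 exponent bits | 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the magnitude is 0 or 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an assumed exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# The eight distinct magnitudes a single 4-bit value can take.
E2M1_MAGNITUDES = sorted({abs(decode_e2m1(code)) for code in range(16)})

# Storage cost per value: 4 bits plus one 8-bit scale shared by 16 values.
# The single FP32 scalar per tensor is negligible and ignored here.
BLOCK_SIZE = 16
bits_per_value = 4 + 8 / BLOCK_SIZE
```

Under these assumptions the only representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, which matches the stated range of roughly -6 to 6, and the effective storage cost comes to about 4.5 bits per value.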

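This two-level scaling and specific bit allocation are crucial for NVFP4 to maintain numerical stability and precision. Because a 4-bit value can represent only a handful of magnitudes, the per-block and per-tensor scales do the work of mapping real tensor values into that narrow range, which helps the format overcome the traditional challenges of low-precision arithmetic.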
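
The following Python sketch illustrates how two-level scaling might work in practice: it quantizes a flat list of numbers to the 4-bit grid and immediately dequantizes it ("fake quantization"). It is a simplified illustration, not NVIDIA's implementation. Choosing the scales from the maximum absolute value of each block and of the tensor is an assumption, Python floats stand in for FP32, and all function names are hypothetical.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # 4-bit magnitudes
FP4_MAX, E4M3_MAX, BLOCK = 6.0, 448.0, 16


def round_to_e4m3(x: float) -> float:
    """Round a non-negative number to a nearby FP8 E4M3 value (max 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    _, e = math.frexp(x)           # x = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)      # clamp to the smallest normal exponent
    ulp = 2.0 ** (exponent - 3)    # 3 mantissa bits
    return min(round(x / ulp) * ulp, E4M3_MAX)


def round_to_e2m1(x: float) -> float:
    """Round to the nearest 4-bit magnitude, saturating at 6.
    Ties go to the smaller magnitude in this sketch."""
    mag = min(abs(x), FP4_MAX)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, x)


def nvfp4_fake_quantize(values):
    """Quantize and dequantize a flat list with two-level scaling."""
    tensor_amax = max(abs(v) for v in values)
    if tensor_amax == 0.0:
        return [0.0 for _ in values]
    # Second level: one scalar per tensor (FP32 in the real format),
    # chosen so that the largest block scale fits within E4M3's range.
    tensor_scale = tensor_amax / (FP4_MAX * E4M3_MAX)
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # First level: one E4M3 scale per 16-value block.
        block_scale = round_to_e4m3(block_amax / FP4_MAX / tensor_scale)
        step = block_scale * tensor_scale
        if step == 0.0:
            out.extend(0.0 for _ in block)
            continue
        out.extend(round_to_e2m1(v / step) * step for v in block)
    return out
```

Calling `nvfp4_fake_quantize` on a list of activations returns values of the same length that have passed through the 4-bit grid, which makes it easy to measure the rounding error a given tensor would incur under these assumptions.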

Setting New Benchmarks for Large-Scale 4-bit Pretraining

NVIDIA validated NVFP4 by pretraining a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens. According to MarkTechPost, this is the longest publicly documented training run in 4-bit precision to date. The resulting NVFP4-trained model achieved 62.58% on MMLU-Pro 5-shot, closely matching the 62.62% of the FP8 baseline.

This validation demonstrates that NVFP4 can handle massive models and datasets. It pushes the boundaries of low-precision training and sets a new reference point for efficiency and accuracy.

NVFP4's Role in Low-Precision AI Training

TetraJet-v2 is an end-to-end 4-bit fully-quantized training (FQT) method that uses NVFP4 for activations, weights, and gradients in all linear layers, according to a paper published on arXiv. The method addresses weight oscillation and outlier issues in low-precision LLM training.

TetraJet-v2 proposes three techniques: an unbiased double-block quantization method, OsciReset to suppress weight oscillation, and OutControl to retain accuracy on outliers. These innovations show that robust 4-bit training requires a broader suite of algorithmic advancements beyond the NVFP4 format itself.
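
To illustrate what "unbiased" means for a quantizer, the sketch below uses stochastic rounding, a common generic technique in which a value is rounded up or down at random so that the average result equals the input. This is an assumption chosen for illustration only. The article does not describe how TetraJet-v2's double-block method achieves unbiasedness, and this code is not that method. The function name is hypothetical.

```python
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # 4-bit magnitudes


def stochastic_round_e2m1(x: float, rng: random.Random) -> float:
    """Round x to one of its two neighboring 4-bit values at random.

    For |x| <= 6 the expected output equals x, so the rounding error
    averages out instead of accumulating in one direction.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)  # saturate values outside the range
    lower = max(g for g in E2M1_GRID if g <= mag)
    upper = min(g for g in E2M1_GRID if g >= mag)
    if upper == lower:
        return sign * lower  # already exactly representable
    # Probability of rounding up grows linearly with distance from lower.
    p_up = (mag - lower) / (upper - lower)
    return sign * (upper if rng.random() < p_up else lower)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [stochastic_round_e2m1(2.4, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
```

Each individual sample here is either 2.0 or 3.0, yet `sample_mean` lands close to 2.4. Round-to-nearest would instead return 2.0 every time, a systematic error that can add up over many training steps.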

NVFP4 is not just a standalone innovation but a foundational component that enables more robust and accurate fully-quantized training methods. Methods built on it significantly narrow the performance gap to full-precision models by tackling long-standing technical hurdles through full-stack hardware-software co-design.

The Future of Efficient AI: Faster, Cheaper, Smarter

NVIDIA's NVFP4 result, a 12-billion-parameter model pretrained on 10 trillion tokens with near-FP8 accuracy, signals a sharp increase in the number of organizations capable of building state-of-the-art LLMs. This democratizes access to cutting-edge AI development, moving it beyond hyperscale entities.

Algorithmic innovations like OsciReset and OutControl within TetraJet-v2, which build on NVFP4, shift where advancements come from. Future LLM efficiency will come less from brute-force compute and more from sophisticated, full-stack hardware-software co-design. Proprietary optimization techniques will become a key competitive differentiator.

Companies that are not actively exploring 4-bit pretraining methodologies like NVFP4 risk being outmaneuvered by competitors that achieve comparable model performance at a fraction of the computational cost and time. By late 2026, high-precision training could become an increasingly unsustainable luxury.