
NVIDIA Beats Everyone To DeepSeek V4 With Day-0 Blackwell Support, Pushing 3,500 Tokens Per Second On 1.6T Models

DeepSeek V4 is out, bringing major optimizations, including up to 1.6T model sizes, and NVIDIA is ready with Day-0 support on Blackwell GPUs using NVFP4.

NVIDIA Blackwell NVFP4 Architecture Delivers Major Speed-Ups In DeepSeek V4 With More Optimizations On The Way

With the launch of DeepSeek V4, we saw some major optimizations in compute and memory requirements.

The updated AI model uses just 27% of the single-token inference FLOPs and 10% of the KV cache when running a one-million-token context window. Two new models were also introduced: a Pro model with 1.6T parameters and a Flash version with 284B parameters.

| Specification | DeepSeek-V4-Pro | DeepSeek-V4-Flash |
| --- | --- | --- |
| Modality | Text | Text |
| Total parameters | 1.6T | 284B |
| Active parameters | 49B | 13B |
| Context length | 1M tokens | 1M tokens |
| Max output length | Up to 384K tokens (per DeepSeek API docs) | Up to 384K tokens (per DeepSeek API docs) |
| Primary use cases | Advanced reasoning, coding, long-context agents | High-speed efficiency, chat, routing, summarization |
| License | MIT | MIT |
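To put the table's figures in perspective, here is a back-of-envelope sketch in Python. It assumes the common rule of thumb that dense decode costs roughly 2 FLOPs per active parameter per token, and the 400 GB baseline KV-cache figure is a purely hypothetical illustration, not a published number:

```python
# Back-of-envelope numbers based on the spec table above.
# Assumption: decode FLOPs per token ~ 2 * active parameters (rule of thumb).

PRO_TOTAL = 1.6e12    # total parameters, DeepSeek-V4-Pro
PRO_ACTIVE = 49e9     # active parameters per token (MoE routing)
FLASH_ACTIVE = 13e9   # active parameters, DeepSeek-V4-Flash

# Fraction of parameters actually exercised per token in the Pro model
active_fraction = PRO_ACTIVE / PRO_TOTAL
print(f"Pro active fraction: {active_fraction:.1%}")   # ~3.1%

# Approximate decode FLOPs per token
pro_flops = 2 * PRO_ACTIVE
flash_flops = 2 * FLASH_ACTIVE
print(f"Pro:   ~{pro_flops / 1e9:.0f} GFLOPs/token")
print(f"Flash: ~{flash_flops / 1e9:.0f} GFLOPs/token")

# The article claims V4 needs only 10% of the KV cache at a 1M-token
# context. If a baseline hypothetically needed 400 GB of KV cache there,
# V4 would need:
baseline_kv_gb = 400          # hypothetical baseline, for illustration only
v4_kv_gb = 0.10 * baseline_kv_gb
print(f"KV cache at 1M tokens: {v4_kv_gb:.0f} GB (vs {baseline_kv_gb} GB baseline)")
```

The takeaway is that MoE routing keeps per-token compute closer to a ~50B dense model than a 1.6T one, which is what makes the throughput numbers below plausible.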

With this launch, NVIDIA is showcasing Day-0 support and performance for DeepSeek V4 on Blackwell GPUs. The company states that Blackwell GPUs provide the scale and low-latency performance required to run the 1M-token long-context inference and trillion-parameter AI models that V4 offers.

From data center deployments on NVIDIA Blackwell to managed NIM microservices and fine-tuning workflows, NVIDIA provides a range of options for integrating DeepSeek and other open models across different stages of development and deployment. NVIDIA is an active contributor to the open-source ecosystem, having released several hundred projects under open-source licenses, and its commitment to optimizing community software and open models lets users broadly share work in AI safety and resilience.

via NVIDIA

In the performance slide, NVIDIA demonstrates almost 3,500 TPS (tokens per second) of throughput per GPU (GB300 or Blackwell Ultra), and these are just preliminary figures that are expected to rise as further optimizations to the co-design stack are made. The NVIDIA Blackwell stack offers a range of technologies specifically designed for models such as V4, including NVFP4, Dynamo, optimized CUDA kernels, advanced parallelization techniques, and more.
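One reason NVFP4 matters at this model scale is raw weight memory. A rough sketch, assuming NVIDIA's published NVFP4 layout (4-bit elements in 16-element blocks with an FP8/E4M3 scale each, about 4.5 bits per weight effective) and the 288 GB of HBM on a GB300 GPU, and ignoring KV cache and activations entirely:

```python
# Rough weight-memory footprint of a 1.6T-parameter model at different
# precisions. NVFP4 ~ 4.5 bits/weight (4-bit values + per-16-element FP8 scale).
import math

PARAMS = 1.6e12
GB = 1e9  # decimal gigabytes, for simplicity

def weights_gb(bits_per_weight: float) -> float:
    """Total weight storage in GB for the model at a given precision."""
    return PARAMS * bits_per_weight / 8 / GB

fp16 = weights_gb(16)
fp8 = weights_gb(8)
nvfp4 = weights_gb(4.5)
print(f"FP16:  {fp16:.0f} GB")   # 3200 GB
print(f"FP8:   {fp8:.0f} GB")    # 1600 GB
print(f"NVFP4: {nvfp4:.0f} GB")  # 900 GB

# Minimum GPU count just to hold the weights, assuming 288 GB HBM per
# Blackwell Ultra (GB300) GPU; KV cache and activations need more on top.
HBM_GB = 288
for name, gb in [("FP16", fp16), ("NVFP4", nvfp4)]:
    print(f"{name}: needs at least {math.ceil(gb / HBM_GB)} GPUs")
```

Even as a lower bound, going from FP16 to NVFP4 cuts the weight footprint roughly 3.5x, which shrinks the GPU count per model replica and leaves far more HBM for the 1M-token KV cache.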

📊 Day 0 performance is here: DeepSeek-V4-Pro running on NVIDIA Blackwell Ultra.

Using @vllm_project's Day 0 recipe, we've captured the initial performance Pareto for DeepSeek's flagship 1M long-context model. This curve highlights the baseline for balancing AI factory… pic.twitter.com/s6wi1Xvegj

— NVIDIA AI (@NVIDIAAI) April 24, 2026

What's key to DeepSeek V4 is its use of FP4 (MXFP4) quantization, which accelerates both rollouts and inference passes. With FP4, DeepSeek V4 models reduce memory traffic and sampling latency.
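To illustrate what MXFP4 quantization actually does, here is a toy block quantizer following the OCP Microscaling format: 32-element blocks share a power-of-two scale, and each element is rounded to the nearest FP4 (E2M1) representable value (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}). This is an illustrative sketch, not DeepSeek's or NVIDIA's implementation, which runs in hardware kernels:

```python
# Toy MXFP4-style block quantizer: 32 elements share a power-of-two scale,
# and each element is snapped to the nearest E2M1 (FP4) grid point.
import math

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # FP4 magnitudes
E2M1_GRID = sorted(v * s for v in E2M1_VALUES for s in (1, -1))
E2M1_MAX = 6.0
BLOCK = 32  # MXFP4 block size

def quantize_block(xs):
    """Quantize one block of <=32 floats to (shared scale, FP4 values)."""
    amax = max(abs(x) for x in xs) or 1.0
    # Power-of-two shared scale so the largest element lands near E2M1_MAX
    exp = math.floor(math.log2(amax / E2M1_MAX))
    scale = 2.0 ** exp
    codes = [min(E2M1_GRID, key=lambda g: abs(x / scale - g)) for x in xs]
    return scale, codes

def dequantize(scale, codes):
    return [scale * c for c in codes]

vals = [0.07, -0.5, 1.9, 3.3, -6.0, 0.01] + [0.0] * (BLOCK - 6)
scale, codes = quantize_block(vals)
recon = dequantize(scale, codes)
max_err = max(abs(a - b) for a, b in zip(vals, recon))
print(f"scale = 2^{int(math.log2(scale))}, max abs error = {max_err:.3f}")
```

Because every element in a block is stored in 4 bits plus one shared scale, weight and KV traffic shrink dramatically versus FP16, which is the memory-bandwidth saving the article refers to.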

One thing that should be highlighted is that Huawei’s latest Ascend chips, the Ascend 950PR and Ascend 950DT, both planned for 2026, feature MXFP4 instructions. This shows that DeepSeek V4 will also be fully compatible with China’s domestic AI chips.

With NVIDIA's ongoing optimizations, upcoming models will see robust ecosystem support out of the box.

About the author: A Software Engineer by training and a PC enthusiast by passion, Hassan Mujtaba serves as Wccftech's Senior Editor for the hardware section. With years of experience in the industry, he specializes in deep-dive technical analysis of next-generation CPU and GPU architectures, motherboards, and cooling solutions. His work involves not only breaking news on upcoming technologies but also extensive hands-on reviews and benchmarking.

