
NVIDIA says Blackwell confidential AI inference keeps up to 98% of baseline performance
NVIDIA says Blackwell confidential computing can protect AI inference with benchmark overhead generally below 8.1%.
NVIDIA has published new benchmark results for Confidential Computing on Blackwell GPUs, arguing that hardware-rooted protections for AI inference can run close to normal production speed. The company says its confidential computing mode protects enterprise data, proprietary model weights and the model itself while an inference workload is active, not only while data is stored or moving across a network.
The July 2 technical post focuses on Blackwell systems including RTX PRO 6000, HGX B200 and HGX B300. NVIDIA says HGX B200 and HGX B300 support confidential computing across as many as eight GPUs, with NVLink traffic between them encrypted. At the silicon level, each GPU holds a private signing key that is fused in during manufacturing and is inaccessible to software, firmware and the host system. That key anchors an attestation chain used to verify a workload before secrets such as model decryption keys are released.
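The gating step described above, in which secrets are released only after attestation succeeds, can be sketched roughly as follows. This is a minimal illustration and not NVIDIA's attestation API: `Evidence`, `Policy`, `release_model_key` and the `verify_signature` callback are all hypothetical names, and a real verifier would validate a certificate chain rooted in the fused device key instead of calling a stub.

```python
import hmac
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical attestation evidence presented by a workload."""
    gpu_measurement: bytes  # assumed to come from the GPU's signed hardware report
    cpu_measurement: bytes  # assumed to come from the CPU TEE (for example Intel TDX)
    signature: bytes        # signature covering the measurements


@dataclass(frozen=True)
class Policy:
    """Hypothetical reference values the key holder expects to see."""
    gpu_measurement: bytes
    cpu_measurement: bytes


def release_model_key(
    evidence: Evidence,
    policy: Policy,
    verify_signature: Callable[[Evidence], bool],
    model_key: bytes,
) -> Optional[bytes]:
    """Return the model decryption key only if every check passes."""
    # 1. The evidence must be signed by a key that chains to trusted hardware.
    #    The actual cryptographic check is delegated to the caller (assumption).
    if not verify_signature(evidence):
        return None
    # 2. Both measurements must match the expected values.
    #    compare_digest avoids leaking mismatch position through timing.
    if not hmac.compare_digest(evidence.gpu_measurement, policy.gpu_measurement):
        return None
    if not hmac.compare_digest(evidence.cpu_measurement, policy.cpu_measurement):
        return None
    # 3. Only now is the secret handed over.
    return model_key
```

The point of the design is ordering: the key holder never sends the decryption key first and checks later, so a workload that fails any check learns nothing about the secret.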
Why it matters
Confidential computing has become more important as companies move sensitive AI workloads into shared cloud and hosted infrastructure. The tradeoff has often been performance: stronger isolation and encrypted paths can add launch overhead or reduce host-to-device bandwidth. NVIDIA's claim is that the overhead is small enough for large-scale inference deployments, especially when framework changes reduce the cost of secure work submission.
In NVIDIA's test, an HGX B300 system running the Qwen 3.5 397B-A17B model at FP8 precision under SGLang was benchmarked with confidential computing off and then on. Across concurrency levels from 4 to 256 requests and input/output token settings of 1024/1024 and 8192/1024, the reported differences in throughput and median time per output token generally stayed under 8.1%. NVIDIA summarizes the result as up to 98% of the performance of a non-confidential setup, a best-case figure that corresponds to roughly 2% overhead.
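For teams reproducing this kind of comparison on their own workloads, the two headline figures relate through simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, it is not NVIDIA's benchmarking code, and the only numbers taken from the post are the 8.1% and 98% figures mentioned in the comments.

```python
def overhead_pct(baseline: float, confidential: float,
                 higher_is_better: bool = True) -> float:
    """Percent overhead of a confidential run relative to the baseline run.

    Use higher_is_better=True for throughput (tokens per second) and
    higher_is_better=False for latency metrics such as time per output token.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if higher_is_better:
        delta = baseline - confidential   # lost throughput
    else:
        delta = confidential - baseline   # added latency
    return 100.0 * delta / baseline


def retained_throughput_pct(overhead: float) -> float:
    """Share of baseline throughput that remains for a given overhead."""
    return 100.0 - overhead


# An 8.1% throughput overhead leaves about 91.9% of baseline throughput,
# while "98% of baseline performance" corresponds to about 2% overhead.
worst_case_retained = retained_throughput_pct(8.1)
best_case_overhead = overhead_pct(baseline=100.0, confidential=98.0)
```

Measuring both directions matters because throughput and latency move in opposite senses: a slower confidential run shows up as a lower tokens-per-second figure but a higher time per output token.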
- Remote attestation checks the GPU hardware report and CPU trusted execution environment measurements before secrets are delivered.
- Framework work in FlashInfer and SGLang is meant to reduce timing, copy and CUDA graph overhead in confidential mode.
- The benchmark used Ubuntu guests, Intel TDX, NVIDIA driver 595.71.05, CUDA 13.2 and NCCL 2.28.9-1.
The findings are still vendor-run benchmarks, so buyers will need to test their own models, prompt lengths and deployment patterns. Even so, the data gives security and infrastructure teams a fresh reference point for weighing confidential AI inference against traditional throughput targets.
CyberOGZ Team





