
Why Nvidia Dominates AI: A History of CUDA and Parallel Computing


Abstract

The trajectory of Nvidia Corporation—from a fledgling startup sketching ideas in a roadside diner to the world’s most valuable company—is a singular case study in technological convergence. While the company’s public identity for its first two decades was inextricably linked to the consumer video game market, its internal architectural roadmap was progressively steering toward a different horizon: general-purpose parallel computing. This report provides an exhaustive analysis of this thirty-year metamorphosis. It argues that Nvidia’s dominance in Artificial Intelligence (AI) was not merely a result of fortuitous timing during the deep learning boom of 2012, but the consequence of a high-risk, multi-decade strategy to unify graphics processing and scientific computing. By dissecting the technical evolution from the fixed-function pipelines of the 1990s to the unified shaders of the G80, and finally to the stochastic precision engines of the Blackwell architecture, this paper illustrates how the Graphics Processing Unit (GPU) evolved into the fundamental engine of the intelligence economy. Furthermore, it examines the development of the CUDA software platform as a competitive "moat" that effectively captured the academic and industrial research communities long before AI became a commercial imperative.

Part I: The Geometry of Survival (1993–1999)

The Denny’s Covenant

In early 1993, the computing landscape was poised on a precipice. The personal computer, largely defined by the serial processing capabilities of the Central Processing Unit (CPU), was becoming a ubiquitous tool for business, yet it remained fundamentally incapable of handling the complex mathematics required for 3D simulation. At a Denny’s restaurant in San Jose, California, three engineers—Jensen Huang, Chris Malachowsky, and Curtis Priem—gathered to discuss this limitation. Their hypothesis was simple yet radical: the future of computing would not rely solely on the CPU’s ability to process instructions sequentially at higher clock speeds, but on a specialized co-processor dedicated to accelerating heavy computational tasks through parallelism.1

Founded on April 5, 1993, Nvidia entered a market that was rapidly becoming overcrowded. By the mid-1990s, nearly 90 distinct companies were vying for dominance in the graphics accelerator market, including formidable incumbents like 3dfx Interactive, ATI Technologies, and S3 Graphics.3 The industry was characterized by brutal cyclicality, razor-thin margins, and a "winner-take-all" dynamic where missing a single product cycle could lead to bankruptcy.

The NV1 and the Quadratic Failure

Nvidia’s first foray into the market, the NV1, launched in 1995, illustrates the perils of architectural divergence. Unlike its competitors who were coalescing around polygon-based rendering (constructing 3D objects out of triangles), Nvidia bet on "quadratic texture mapping"—a technique using curved surfaces. While theoretically elegant, this approach was incompatible with the emerging industry standard APIs, specifically Microsoft’s DirectX, which favored polygons. The NV1 was a commercial failure, nearly bankrupting the young company. This early near-death experience instilled a corporate culture of paranoia and agility that would later define its AI pivots.3

RIVA 128 and the Embrace of Standards

Facing insolvency, Jensen Huang made a critical strategic decision: abandon the proprietary curved-surface technology and embrace the industry standard polygon pipeline, but execute it faster than anyone else. The result was the RIVA 128 (1997), a chip that saved the company. It provided competent 3D acceleration and significantly faster 2D performance than rivals, allowing Nvidia to capture the OEM market. This marked the first step in Nvidia’s "velocity" strategy: releasing a new architecture every six months to outpace Moore’s Law—a cycle they termed "The Rhythm".3

The Invention of the GPU: GeForce 256

In 1999, Nvidia released a product that arguably defined the modern era of accelerated computing: the GeForce 256. Nvidia’s marketing team coined a new term for this device: the "GPU" (Graphics Processing Unit). While "graphics card" was the colloquial term, "GPU" signified a specific technical milestone—the integration of hardware-based Transform and Lighting (T&L) directly onto the silicon.1

Prior to the GeForce 256, the host CPU was responsible for the "transform" (calculating the location of 3D objects in space) and "lighting" (calculating how light hits those objects) phases of the graphics pipeline. The graphics card merely drew the resulting pixels (rasterization). By moving T&L to the graphics card, Nvidia effectively created a second processor within the PC, one specialized for floating-point vector mathematics. The GeForce 256 was the first mass-market device to bring workstation-class geometric processing to consumers. It relieved the CPU of a massive computational burden, freeing it to handle game logic and AI. This was the first clear signal that the graphics processor could evolve into a mathematical co-processor, a concept that would lie dormant for another seven years until the dawn of CUDA.
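To make the "transform" stage concrete: it is, at its heart, a 4x4 matrix applied independently to every vertex in a scene. The kernel fragment below expresses that workload in modern CUDA C++ purely for illustration; the GeForce 256's T&L unit was fixed-function hardware, not programmable, but this is the class of floating-point vector math it took over from the CPU.

```cpp
// Illustrative only: the 1999-era T&L unit was fixed-function, but the math it
// performed is this same 4x4 matrix-vector product, applied to every vertex.
#include <cuda_runtime.h>

struct Vec4 { float x, y, z, w; };

// One thread per vertex: transform each vertex by a row-major 4x4 matrix m.
__global__ void transformVertices(const float* m, const Vec4* in, Vec4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Vec4 v = in[i];
    out[i] = { m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
               m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
               m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
               m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w };
}
```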

Part II: The Unified Architecture and the Programmability Revolution (2000–2006)

The early 2000s were characterized by the "Shader War." As graphics became more sophisticated, developers demanded more control over how pixels were colored. They moved from "fixed-function" pipelines (where the hardware dictated the look of water or metal) to "programmable shaders" (where developers wrote small programs to define surface properties).

The Programmable Shader Era

With the GeForce 3 (2001), Nvidia introduced programmable vertex and pixel shaders. This allowed for unprecedented visual effects, but the hardware architecture remained rigid. A GPU would have a specific number of "Vertex Shaders" (to handle geometry) and "Pixel Shaders" (to handle color). If a game scene had millions of triangles but simple lighting, the Pixel Shaders sat idle. If a scene was a flat wall with complex graffiti, the Vertex Shaders sat idle. This inefficiency was a fundamental limitation of the split-architecture design.5

The G80 Architecture: The Great Unification

In 2006, Nvidia launched the G80 architecture (GeForce 8800 GTX), a product that represents the single most significant architectural shift in the company’s history. The G80 abandoned the distinction between vertex and pixel shaders entirely. In their place, it introduced the Unified Shader Architecture.

The GPU was no longer a collection of specialized tools; it was a massive array of identical, general-purpose cores, organized into what Nvidia called Streaming Multiprocessors (SMs).5 These cores could process vertices, pixels, or geometry dynamically. A hardware scheduler assigned tasks to the cores based on immediate demand. If a frame required heavy geometry processing, 80% of the cores became vertex processors. If the next frame required complex lighting, they switched to pixel processing.

The Unintended Consequence: A Supercomputer on a Chip

While the primary goal of the G80 was to render video games more efficiently, the byproduct was revolutionary. By creating a processor composed of hundreds of generic floating-point calculators, Nvidia had accidentally built a massively parallel engine suitable for any mathematical task, not just graphics.

This architecture fundamentally aligned with the mathematical needs of scientific simulation. Whether simulating fluid dynamics, folding proteins, or—crucially—calculating the weights of a neural network, the workload consists of performing the same operation on millions of data points simultaneously. The G80 was the hardware manifestation of this "Single Instruction, Multiple Thread" (SIMT) paradigm.

Part III: The Software Moat – CUDA and the Democratization of Supercomputing (2006–2010)

Hardware without software is silicon without a soul. The G80 was a powerful engine, but it spoke a language (graphics APIs like DirectX and OpenGL) that was incomprehensible to physicists, biologists, and mathematicians. To harness the G80 for science, a researcher would have to disguise their numerical data as "pixels" and their equations as "texture maps," a cumbersome process known as "hacking the pipeline."

The Birth of CUDA

In November 2006, alongside the G80, Nvidia released CUDA (Compute Unified Device Architecture). CUDA was a parallel computing platform and programming model that let developers program the GPU's compute engines using extensions to C (and, later, C++), industry-standard high-level languages.1

This was a transformative moment. For the first time, a scientist could write code that looked like standard C/C++, compile it, and run it on the GPU without knowing anything about 3D graphics. They could explicitly allocate GPU memory, copy data to and from the device, and launch "kernels" (functions executed in parallel by thousands of GPU threads).
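A minimal sketch of that workflow, assuming a CUDA-capable GPU and the nvcc compiler: allocate memory on the device, copy data across, launch a kernel over a grid of threads, and copy the result back. The kernel, names, and sizes here are illustrative rather than drawn from any official sample.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Kernel: each thread computes one element of y = a*x + y (SAXPY).
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));            // allocate GPU memory
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);   // launch the kernel

    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                   // expect 3*1 + 2 = 5
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```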

The "Bet the Company" Moment

Jensen Huang’s strategy for CUDA was fraught with risk. He insisted that CUDA hardware support be included in every GPU Nvidia shipped, from the $3,000 workstation cards to the $50 budget cards in college dorm rooms.3 This decision cost the company millions in silicon die area (transistors used for CUDA logic instead of graphics performance) and billions in R&D, with no guarantee of a return. Wall Street was skeptical; for years, Nvidia’s stock languished as investors questioned the wisdom of spending resources on a feature used by a niche group of scientists.

However, this ubiquity created a massive installed base. Any graduate student with a laptop and an Nvidia card suddenly had a personal supercomputer. This accessibility fueled a grassroots revolution in high-performance computing. Researchers at Oak Ridge National Laboratory, molecular biologists at Stanford, and oil exploration firms began porting their code to CUDA, seeing speedups of 10x to 100x compared to CPUs.

SIMT vs. SIMD: The Secret Sauce

The success of CUDA was also rooted in its handling of parallelism. Traditional CPUs used SIMD (Single Instruction, Multiple Data), where a single instruction operates on a fixed-width vector (e.g., 8 numbers at once). SIMD is efficient but rigid; the programmer must align data perfectly, and branching logic (if/else statements) destroys performance.9

Nvidia’s SIMT (Single Instruction, Multiple Threads) approach was different. In SIMT, the programmer writes code for a single thread. The hardware then groups these threads into "warps" (typically 32 threads) and executes them in lockstep. Crucially, if threads within a warp diverge (some take the 'if' path, others take the 'else'), the hardware handles the serialization and re-convergence automatically.5 This made GPUs significantly more flexible and easier to program for complex, irregular algorithms than vector CPUs, lowering the barrier to entry for the scientific community.
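A short sketch of the divergence behavior just described, assuming the usual 32-thread warp: lanes that take different branches are executed in turn and then re-converge, so the program stays correct and only throughput is affected.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Threads in the same warp take different branches based on their lane index.
// SIMT hardware runs the 'if' path for half the lanes, then the 'else' path
// for the other half, and re-converges automatically.
__global__ void divergentKernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i % 32) < 16) {          // first half of each warp
        out[i] = sinf((float)i);
    } else {                      // second half: serialized, then re-converged
        out[i] = cosf((float)i);
    }
}

int main() {
    const int n = 1024;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    divergentKernel<<<n / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    printf("kernel finished; correctness is unaffected by divergence\n");
    cudaFree(d);
    return 0;
}
```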

Part IV: The Deep Learning Ignition (2010–2015)

By 2010, the "AI Winter"—a period of reduced funding and interest in artificial intelligence—was thawing. However, the dominant approaches were still based on logic rules and hand-crafted features. Neural networks, a concept dating back to the 1950s, were viewed with skepticism. They were theoretically sound but practically impossible to train; they required too much data and too much compute.

The ImageNet Challenge

The crucible for modern AI was the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). This competition required algorithms to classify 1.2 million images into 1,000 distinct categories. For years, the error rates hovered around 25%, with incremental improvements gained through painstaking manual tuning of computer vision algorithms.12

AlexNet: The Proof of Concept

In 2012, a team from the University of Toronto—doctoral student Alex Krizhevsky, Ilya Sutskever, and Professor Geoffrey Hinton—entered the competition with a "Deep Convolutional Neural Network" (CNN) named AlexNet.

The architecture of AlexNet was massive for its time: 8 layers, 650,000 neurons, and 60 million parameters.12 Training such a behemoth on CPUs would have taken months. Instead, Krizhevsky turned to CUDA. He used two consumer-grade Nvidia GeForce GTX 580 GPUs.

The Constraints of the GTX 580

The GTX 580, based on the Fermi architecture, was a powerful card for gaming but had a critical limitation: it possessed only 3GB of memory (VRAM).14 The AlexNet model was too large to fit into the memory of a single card.

To surmount this, the team implemented model parallelism: they sliced the network itself in half, placing half of each layer's kernels (feature maps) on each GPU. The two GPUs communicated only at specific layers to exchange activations.16 This forced innovation, splitting a single network across two chips, foreshadowed the massive multi-GPU clusters of the future.
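A schematic sketch of that split, assuming a machine with two CUDA devices: each GPU computes half of a layer's outputs, and a peer copy gathers the halves, standing in for AlexNet's cross-GPU synchronization layers. The layer math is a placeholder; this shows the shape of the idea, not the original cuda-convnet implementation.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for half of a layer: each GPU produces `half` output values.
__global__ void halfLayer(const float* in, float* out, int half, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half) out[i] = scale * in[i];   // stand-in for the real layer math
}

int main() {
    const int half = 1 << 10;                 // outputs per GPU
    float *in0, *out0, *in1, *out1, *gathered;

    cudaSetDevice(0);                         // first half of the model
    cudaMalloc(&in0, half * sizeof(float));   // inputs left uninitialized;
    cudaMalloc(&out0, half * sizeof(float));  // values are irrelevant here
    cudaMalloc(&gathered, 2 * half * sizeof(float));
    halfLayer<<<(half + 255) / 256, 256>>>(in0, out0, half, 1.0f);

    cudaSetDevice(1);                         // second half of the model
    cudaMalloc(&in1, half * sizeof(float));
    cudaMalloc(&out1, half * sizeof(float));
    halfLayer<<<(half + 255) / 256, 256>>>(in1, out1, half, 2.0f);
    cudaDeviceSynchronize();

    // "Synchronization layer": gather both halves on GPU 0.
    cudaSetDevice(0);
    cudaMemcpy(gathered, out0, half * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpyPeer(gathered + half, 0, out1, 1, half * sizeof(float));
    cudaDeviceSynchronize();

    printf("two half-layers computed on separate GPUs, gathered on GPU 0\n");
    cudaFree(in0); cudaFree(out0); cudaFree(gathered);
    cudaSetDevice(1); cudaFree(in1); cudaFree(out1);
    return 0;
}
```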

The Results that Shook the World

When the results of the 2012 ImageNet competition were announced, the academic world was stunned. AlexNet achieved a top-5 error rate of 15.3%, far ahead of the second-place entry, which used traditional computer vision methods and scored 26.2%.13 It was a paradigm shift. The paper detailing AlexNet, "ImageNet Classification with Deep Convolutional Neural Networks," explicitly cited the GPU as the enabling technology, noting that training took five to six days on two GTX 580 3GB GPUs.13

This moment validated Jensen Huang’s decade-long bet. The scientific community realized that the "black box" of the neural network, when fed enough data and powered by enough parallel compute, could outperform human-coded logic. Nvidia immediately pivoted its entire corporate strategy to support this new workload.

Part V: The Era of Specialization – From Pascal to Volta (2016–2019)

Post-AlexNet, the demands of deep learning began to outpace the capabilities of standard graphics cards. Neural networks grew deeper and wider, doubling in computational requirements every few months. Nvidia responded by bifurcating its GPU architecture: one line optimized for gaming (GeForce) and another rigorously engineered for the Data Center (Tesla).

Pascal (2016) and the Memory Wall

The first major bottleneck to fall was memory bandwidth. The Pascal architecture (Tesla P100) introduced High Bandwidth Memory (HBM2). Unlike traditional GDDR memory which sits on the printed circuit board, HBM2 stacks memory dies vertically and places them on the same silicon package as the GPU, connected by microscopic wires (through-silicon vias).19

This stacked-memory design, packaged with TSMC's CoWoS (Chip-on-Wafer-on-Substrate) process, increased memory bandwidth from roughly 300 GB/s to 720 GB/s.19 This was critical for AI training, which is often "memory-bound"—meaning the compute cores spend more time waiting for data to arrive than actually processing it.

Simultaneously, Pascal introduced NVLink 1.0, a proprietary interconnect that allowed GPUs to talk to each other at 160 GB/s, five times faster than the standard PCIe bus.20 This allowed researchers to scale their models beyond the "two GPU" limit of AlexNet to clusters of 8 or 16 GPUs, treating them as a single logical unit.

Volta (2017): The Invention of the Tensor Core

If Pascal was about moving data, the Volta architecture (Tesla V100) was about the math itself. Nvidia engineers analyzed the specific mathematical operation that dominates deep learning: matrix multiplication.

In a neural network layer, the system must multiply a matrix of inputs by a matrix of weights and add a bias. This operation, D = A * B + C, is known as a GEMM (General Matrix Multiply). Standard CUDA cores perform it one scalar operation at a time: one multiply, one add.

Volta introduced the Tensor Core. This was a specialized hardware unit that fused the multiplication and addition steps into a single clock cycle operation for 4x4 matrices.21

  • The Analogy: If a CUDA core is a carpenter hammering one nail at a time, a Tensor Core is a nail gun that drives a 4x4 grid of nails instantly.

  • The Impact: The V100 delivered 125 TeraFLOPS of deep learning performance, a 12x improvement over the Pascal generation.19 This massive leap was achieved by embracing "Mixed Precision" (FP16), trading a little numerical precision for raw speed, a compromise that proved perfectly acceptable for the probabilistic nature of neural networks (a programming sketch of this pattern follows below).
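CUDA exposes Tensor Cores to programmers through the warp-level WMMA API in mma.h, which presents them as 16x16x16 tile operations rather than raw 4x4 units. Below is a minimal single-tile sketch, assuming a Volta-or-newer GPU (compiled with, e.g., -arch=sm_70): FP16 inputs, FP32 accumulation, exactly the mixed-precision GEMM pattern described above.

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile of A by a 16x16 FP16 tile of B,
// accumulating into a 16x16 FP32 tile of D: the D = A*B + C pattern that
// Tensor Cores accelerate.
__global__ void wmmaTile(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // C = 0 in this sketch
    wmma::load_matrix_sync(a, A, 16);        // leading dimension = 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);          // tile multiply-accumulate
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
// Launch with a single warp, e.g.: wmmaTile<<<1, 32>>>(dA, dB, dD);
```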

Part VI: The Generative Explosion – Ampere, Hopper, and Blackwell (2020–2025)

As the 2020s approached, the nature of AI models shifted again. The "Transformer" architecture (introduced by Google in 2017) began to replace CNNs. Transformers, such as GPT (Generative Pre-trained Transformer), scaled incredibly well with data and compute. They didn't just classify images; they generated language, code, and reasoning.

Ampere (2020): Sparsity and Ubiquity

The Ampere architecture (A100) became the workhorse of the early generative AI era. It introduced Structural Sparsity, a hardware feature that exploited the fact that many weights in a neural network are zero (or close enough to be ignored). The A100 could automatically "skip" the math for these zeros, doubling the effective throughput for sparse models.23
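Concretely, Ampere's sparsity feature targets a 2:4 pattern: in every group of four consecutive weights, at most two are non-zero, and the Sparse Tensor Cores skip the zeros. The host-side sketch below shows how a dense weight array might be pruned into that pattern by zeroing the two smallest-magnitude weights per group; real workflows typically fine-tune the model afterward to recover accuracy.

```cpp
#include <cstdio>
#include <cmath>
#include <vector>
#include <utility>

// Enforce a 2:4 structured-sparsity pattern: in each group of 4 weights,
// zero out the two with the smallest magnitude.
void pruneTwoFour(std::vector<float>& w) {
    for (size_t g = 0; g + 4 <= w.size(); g += 4) {
        int idx[4] = {0, 1, 2, 3};
        // Sort the four indices by |weight|, ascending (tiny insertion sort).
        for (int i = 1; i < 4; ++i)
            for (int j = i; j > 0 &&
                 std::fabs(w[g + idx[j]]) < std::fabs(w[g + idx[j - 1]]); --j)
                std::swap(idx[j], idx[j - 1]);
        w[g + idx[0]] = 0.0f;   // zero the two smallest-magnitude weights
        w[g + idx[1]] = 0.0f;
    }
}

int main() {
    std::vector<float> w = {0.9f, -0.1f, 0.05f, 1.2f, 0.3f, -0.7f, 0.2f, 0.01f};
    pruneTwoFour(w);
    for (float v : w) printf("%5.2f ", v);   // two zeros in each group of four
    printf("\n");
    return 0;
}
```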

Ampere also introduced Multi-Instance GPU (MIG). While training large models required the whole GPU, running them (inference) often left the massive chip underutilized. MIG allowed an A100 to be sliced into seven secure, isolated instances. This made the A100 the economic standard for cloud providers (AWS, Azure, Google Cloud), as they could sell slices of the GPU to different customers.19

Hopper (2022) and the Transformer Engine

The release of the H100 (Hopper architecture) coincided perfectly with the release of ChatGPT. The H100 was designed specifically for Transformers. Its marquee feature was the Transformer Engine.

This engine dynamically managed precision. In computing, "precision" refers to the number of bits used to store a number (e.g., FP32 uses 32 bits). The Transformer Engine could analyze the data flowing through the chip in real-time. If the numbers were stable, it would switch to FP8 (8-bit floating point), doubling the speed and halving the memory usage compared to FP16. If the numbers became volatile, it would switch back to FP16 to preserve accuracy.25 This allowed the H100 to train the massive Large Language Models (LLMs) up to 9x faster than the A100.27
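The core bookkeeping behind FP8 is a scaling factor derived from a tensor's observed range, chosen so that values fill the narrow FP8 grid without overflowing. The host-side sketch below illustrates that idea for the E4M3 format (maximum representable magnitude of 448); it is a simplification of what the Transformer Engine library actually does, which involves running histories of per-tensor maxima.

```cpp
#include <cstdio>
#include <cmath>
#include <vector>
#include <algorithm>

// Pick a per-tensor scale so the largest magnitude lands near the top of the
// FP8 E4M3 range (~448). Real FP8 training tracks a history of amax values;
// this is the single-tensor version.
float chooseE4M3Scale(const std::vector<float>& t) {
    float amax = 0.0f;
    for (float v : t) amax = std::max(amax, std::fabs(v));
    const float E4M3_MAX = 448.0f;
    return (amax > 0.0f) ? E4M3_MAX / amax : 1.0f;
}

int main() {
    std::vector<float> activations = {0.02f, -0.8f, 3.1f, 0.0004f};
    float scale = chooseE4M3Scale(activations);
    printf("scale = %.2f\n", scale);   // values are multiplied by this scale
                                       // before the FP8 cast, divided after
    for (float v : activations)
        printf("%g -> scaled %g\n", v, v * scale);
    return 0;
}
```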

The demand for H100s became a global phenomenon. Tech CEOs bragged about their H100 stockpiles in earnings calls. Nations competed to secure allocations. The chip became the de facto currency of the AI revolution.28

Blackwell (2025): The Reticle Limit and FP4

As models pushed toward trillions of parameters (e.g., GPT-4), Nvidia faced a physical limit: the Reticle Limit. Lithography machines can only print a single die up to a certain maximum size (around 800-900 mm²). The H100 was already near this limit.

To continue scaling, the Blackwell (B200) architecture adopted a "chiplet" design. It fused two maximum-size dies onto a single package, connected by a 10 TB/s interface. To the software, it appeared as one giant chip, but physically it was two.29

Blackwell also introduced FP4 precision—using just 4 bits to represent numbers. This is an incredibly low resolution (imagine representing a high-def photo with just 16 colors). To make this work without destroying the model's intelligence, Nvidia developed a "Block Scaling" technique, grouping numbers into small blocks that shared a scaling factor. This allowed Blackwell to run inference on massive models with unprecedented energy efficiency.30
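A sketch of block scaling in that spirit: values are grouped into small blocks, each block stores one shared scale, and every value is snapped to the nearest point of the tiny FP4 (E2M1) grid, whose positive magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}. Nvidia's published NVFP4 format uses 16-element blocks with an FP8 scale factor; the version below uses a smaller block, a full-precision scale, and nearest-value rounding purely for readability.

```cpp
#include <cstdio>
#include <cmath>
#include <vector>
#include <algorithm>

// The 8 non-negative magnitudes representable in FP4 (E2M1).
static const float FP4_GRID[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Quantize one block: pick a shared scale so the block's max maps to 6.0,
// then snap every scaled value to the nearest FP4 grid point.
void quantizeBlockFP4(const float* in, float* out, int n, float* scaleOut) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(in[i]));
    float scale = (amax > 0.0f) ? amax / 6.0f : 1.0f;   // shared per-block scale
    *scaleOut = scale;
    for (int i = 0; i < n; ++i) {
        float v = std::fabs(in[i]) / scale;
        float best = FP4_GRID[0];
        for (float g : FP4_GRID)
            if (std::fabs(v - g) < std::fabs(v - best)) best = g;
        out[i] = std::copysign(best * scale, in[i]);    // dequantized value
    }
}

int main() {
    std::vector<float> block = {0.01f, -0.2f, 0.7f, 2.4f, -5.9f, 0.0f, 1.1f, 0.35f};
    std::vector<float> q(block.size());
    float scale;
    quantizeBlockFP4(block.data(), q.data(), (int)block.size(), &scale);
    for (size_t i = 0; i < block.size(); ++i)
        printf("%6.2f -> %6.2f\n", block[i], q[i]);
    return 0;
}
```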

Part VII: The Strategic Moat and Market Dominance

The Revenue Inversion

The financial consequence of this architectural journey is a complete inversion of Nvidia’s business model. As recently as 2020, Gaming was the company’s primary revenue source. By late 2025, Data Center revenue had grown to $51.2 billion per quarter, accounting for nearly 90% of total revenue, while Gaming stood at $4.3 billion.32 This shift reflects the reality that Nvidia is no longer a consumer electronics company; it is an industrial infrastructure company.

The CUDA Ecosystem as a Barrier to Entry

Competitors like AMD and Intel have produced capable hardware (e.g., the MI300X and Gaudi 3). However, Nvidia’s true dominance lies in its software. The CUDA ecosystem, nurtured for 20 years, acts as a formidable "moat."

Key frameworks like PyTorch and TensorFlow are deeply optimized for CUDA. Libraries like cuDNN (for deep learning) and TensorRT (for inference) provide "turnkey" performance that competitors struggle to match. While open alternatives like OpenAI’s Triton compiler attempt to bridge the gap by allowing code to run on any GPU, Nvidia’s aggressive release cycle means that by the time competitors optimize for the current generation, Nvidia has introduced a new feature (like FP4) that only its software stack supports fully.8

Sovereign AI and the Energy Challenge

Looking forward, Nvidia is pivoting to "Sovereign AI"—building domestic AI supercomputers for nations that want to control their own data and intelligence infrastructure.36 However, the company faces a new, physical constraint: energy. A single rack of Blackwell GPUs (the NVL72) consumes 120 kilowatts of power, necessitating liquid cooling and massive grid upgrades.37 The future of Nvidia’s rise will depend not just on transistor density, but on thermodynamic engineering.

Conclusion: The Foundry of Intelligence

The history of Nvidia is a history of convergence. It is the story of how the trivial pursuit of rendering polygons for video games inadvertently solved the most profound computational challenge of the 21st century. Jensen Huang’s insight—that the GPU could be a general-purpose parallel computer—was a gamble that took nearly 15 years to pay off.

From the G80’s unified shaders to the split-brain experiments of AlexNet, and finally to the stochastic precision of Blackwell, every step in Nvidia’s evolution paved the way for the current AI explosion. The company did not just supply the chips; through CUDA, it built the language in which AI is written. Today, Nvidia stands not merely as a hardware vendor, but as the architect of the "AI Factory," the industrial facilities where the raw material of data is refined into the new electricity of the modern world: intelligence.

Appendix: Technical Comparison of Nvidia Data Center Architectures

| Architecture | Year | Process Node | Key Innovation | Tensor Performance (FP16/BF16) | Memory Bandwidth | Max VRAM |
|---|---|---|---|---|---|---|
| Pascal (P100) | 2016 | TSMC 16nm | HBM2, NVLink 1.0 | 21.2 TFLOPS (no Tensor Cores) | 720 GB/s | 16 GB |
| Volta (V100) | 2017 | TSMC 12nm | Tensor Cores (1st Gen) | 125 TFLOPS | 900 GB/s | 32 GB |
| Ampere (A100) | 2020 | TSMC 7nm | TF32, Sparsity, MIG | 312 TFLOPS | 2,039 GB/s | 80 GB |
| Hopper (H100) | 2022 | TSMC 4N | Transformer Engine, FP8 | 990 TFLOPS (FP16) / 1,980 (FP8) | 3,350 GB/s | 80 GB |
| Blackwell (B200) | 2025 | TSMC 4NP | FP4, Dual-Die, NVLink 5 | 2,250 TFLOPS (FP16) / 9,000 (FP4) | 8,000 GB/s | 192 GB |

Data sourced from 19

Appendix: The Evolution of Precision in AI Training

The table below illustrates the aggressive reduction in precision (number of bits) required for AI models, driven by Nvidia's hardware support.

| Format | Bits | Range (approx.) | Era Introduced | Use Case |
|---|---|---|---|---|
| FP32 (Single) | 32 | 10^-38 to 10^38 | Pre-2016 | Scientific Simulation, Legacy AI |
| FP16 (Half) | 16 | 10^-5 to 65,504 | Volta (2017) | Mixed Precision Training |
| BF16 (Bfloat) | 16 | 10^-38 to 10^38 | Ampere (2020) | Stable Training (same range as FP32) |
| TF32 | 19 | 10^-38 to 10^38 | Ampere (2020) | High-Performance Math (internal only) |
| FP8 | 8 | Varies (E4M3/E5M2) | Hopper (2022) | Transformer Training & Inference |
| FP4 | 4 | Limited (block scaled) | Blackwell (2025) | Large-Scale Inference |

Data sourced from 30

Works cited

  1. Our History: Innovations Over the Years - NVIDIA, accessed January 10, 2026, https://www.nvidia.com/en-us/about-nvidia/corporate-timeline/

  2. History of Nvidia: Company timeline and facts - TheStreet, accessed January 10, 2026, https://www.thestreet.com/technology/nvidia-company-history-timeline

  3. Nvidia Part I: The GPU Company (1993-2006) | Acquired Podcast, accessed January 10, 2026, https://www.acquired.fm/episodes/nvidia-the-gpu-company-1993-2006

  4. NVIDIA in Brief, accessed January 10, 2026, https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/corporate-nvidia-in-brief.pdf

  5. A history of NVidia Stream Multiprocessor - Fabien Sanglard's Website, accessed January 10, 2026, https://fabiensanglard.net/cuda/

  6. Nvidia's G80 chip, a story of how a new architecture changed everything in a matter of days : r/pcmasterrace - Reddit, accessed January 10, 2026, https://www.reddit.com/r/pcmasterrace/comments/1iibsa5/nvidias_g80_chip_a_story_of_how_a_new/

  7. CUDA - Wikipedia, accessed January 10, 2026, https://en.wikipedia.org/wiki/CUDA

  8. What exactly is “CUDA”? (Democratizing AI Compute, Part 2) - Modular, accessed January 10, 2026, https://www.modular.com/blog/democratizing-compute-part-2-what-exactly-is-cuda

  9. SIMT vs SIMD: Second Edition - Benjamin H Glick, accessed January 10, 2026, https://www.glick.cloud/blog/simt-vs-simd-second-edition

  10. Single instruction, multiple data - Wikipedia, accessed January 10, 2026, https://en.wikipedia.org/wiki/Single_instruction,_multiple_data

  11. SIMD Versus SIMT What is the difference between SIMT vs SIMD - CUDA Programming and Performance - NVIDIA Developer Forums, accessed January 10, 2026, https://forums.developer.nvidia.com/t/simd-versus-simt-what-is-the-difference-between-simt-vs-simd/10459

  12. AlexNet - Wikipedia, accessed January 10, 2026, https://en.wikipedia.org/wiki/AlexNet

  13. ImageNet Classification with Deep Convolutional Neural Networks, accessed January 10, 2026, https://proceedings.neurips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

  14. 8.1. Deep Convolutional Neural Networks (AlexNet) - Dive into Deep Learning, accessed January 10, 2026, http://d2l.ai/chapter_convolutional-modern/alexnet.html

  15. [discussion] Why was AlexNet split on two GPUs each of memory size 3GB when it can fit on 1 GB? - Reddit, accessed January 10, 2026, https://www.reddit.com/r/mlscaling/comments/1gc6pvk/discussion_why_was_alexnet_split_on_two_gpus_each/

  16. AlexNet and ImageNet: The Birth of Deep Learning - Pinecone, accessed January 10, 2026, https://www.pinecone.io/learn/series/image-search/imagenet/

  17. AlexNet Architecture Explained. The convolutional neural network (CNN)… | by Siddhesh Bangar | Medium, accessed January 10, 2026, https://medium.com/@siddheshb008/alexnet-architecture-explained-b6240c528bd5

  18. Accelerating AI with GPUs: A New Computing Model - NVIDIA Blog, accessed January 10, 2026, https://blogs.nvidia.com/blog/accelerating-ai-artificial-intelligence-gpus/

  19. Evolution of NVIDIA Data Center GPUs: From Pascal to Grace ..., accessed January 10, 2026, https://www.serversimply.com/blog/evolution-of-nvidia-data-center-gpus

  20. The Evolution of NVIDIA NVLink: NVSwitch and PCIe Comparison in 2025, accessed January 10, 2026, https://network-switch.com/blogs/networking/the-evolution-of-nvidia-nvlink-technology

  21. Tensor Cores Explained in Simple Terms - DigitalOcean, accessed January 10, 2026, https://www.digitalocean.com/community/tutorials/understanding-tensor-cores

  22. Numerical behavior of NVIDIA tensor cores - PMC - NIH, accessed January 10, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC7959640/

  23. NVIDIA Ampere Architecture, accessed January 10, 2026, https://www.nvidia.com/en-us/data-center/ampere-architecture/

  24. NVIDIA A100 Tensor Core GPU Architecture, accessed January 10, 2026, https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

  25. NVIDIA Hopper GPU Architecture, accessed January 10, 2026, https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/

  26. NVIDIA Ampere, Hopper, and Blackwell GPUs — What's in it for ML Workloads? - Medium, accessed January 10, 2026, https://medium.com/@najeebkan/nvidia-ampere-hopper-and-blackwell-gpus-whats-in-it-for-ml-workloads-c81676e122aa

  27. NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog, accessed January 10, 2026, https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

  28. The AI Boom: 3 Key Takeaways from Nvidia's Q3 Earnings | LeverageSharesUS, accessed January 10, 2026, https://leverageshares.com/us/insights/the-ai-boom-3-key-takeaways-from-nvidias-q3-earnings/

  29. The Engine Behind AI Factories | NVIDIA Blackwell Architecture, accessed January 10, 2026, https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/

  30. Introducing NVFP4 for Efficient and Accurate Low-Precision Inference - NVIDIA Developer, accessed January 10, 2026, https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/

  31. NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit - NVIDIA Developer, accessed January 10, 2026, https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/

  32. Nvidia: From Gaming to Data Center Giant - Voronoi, accessed January 10, 2026, https://www.voronoiapp.com/technology/-Nvidias-Shift-from-Gaming-to-Data-Center-Dominance-3153

  33. NVIDIA Announces Financial Results for Third Quarter Fiscal 2026, accessed January 10, 2026, https://nvidianews.nvidia.com/news/nvidia-announces-financial-results-for-third-quarter-fiscal-2026

  34. Nvidia's AI Moat in 2025: A Deep Dive - Sundeep Teki, accessed January 10, 2026, https://www.sundeepteki.org/blog/nvidias-ai-moat-in-2025-a-deep-dive

  35. The CUDA Advantage: How NVIDIA Came to Dominate AI And The Role of GPU Memory in Large-Scale Model Training | by Aidan Pak | Medium, accessed January 10, 2026, https://medium.com/@aidanpak/the-cuda-advantage-how-nvidia-came-to-dominate-ai-and-the-role-of-gpu-memory-in-large-scale-model-e0cdb98a14a0

  36. Nvidia's $65 Billion Forecast Sends a Clear Message About the AI Boom - Medium, accessed January 10, 2026, https://medium.com/@impactnews-wire/nvidias-65-billion-forecast-sends-a-clear-message-about-the-ai-boom-21fe182fcb7a

  37. Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion, accessed January 10, 2026, https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/

  38. Deep dive into NVIDIA Blackwell Benchmarks — where does the 4x training and 30x inference performance gain, and 25x reduction in energy usage come from? - adrian cockcroft - Medium, accessed January 10, 2026, https://adrianco.medium.com/deep-dive-into-nvidia-blackwell-benchmarks-where-does-the-4x-training-and-30x-inference-0209f1971e71

  39. NVIDIA G80 GPU Specs - TechPowerUp, accessed January 10, 2026, https://www.techpowerup.com/gpu-specs/nvidia-g80.g52

  40. NVLink & NVSwitch: Fastest HPC Data Center Platform | NVIDIA, accessed January 10, 2026, https://www.nvidia.com/en-us/data-center/nvlink/

  41. NVIDIA RTX BLACKWELL GPU ARCHITECTURE, accessed January 10, 2026, https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf

  42. How can using FP16, BF16, or FP8 mixed precision speed up my model training? - Runpod, accessed January 10, 2026, https://www.runpod.io/articles/guides/fp16-bf16-fp8-mixed-precision-speed-up-my-model-training

  43. Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training - NVIDIA Developer, accessed January 10, 2026, https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/

  44. Train With Mixed Precision - NVIDIA Docs, accessed January 10, 2026, https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html
