Hardware Architecture for AI: GPUs, TPUs, CUDA & Memory Bandwidth

Introduction

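Modern AI workloads, and LLM inference in particular, depend on hardware built for massively parallel arithmetic and fast memory access. This lesson explores the foundational hardware architectures that make them possible: GPUs, TPUs, the CUDA software layer, and memory bandwidth.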

CPU vs. GPU Architecture

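CPUs are designed for general-purpose processing, with a relatively small number of powerful cores optimized for low-latency sequential work. GPUs instead pack thousands of simpler cores that apply the same operation to many data elements at once. AI workloads are dominated by matrix multiplication, in which every output element can be computed independently of the others, so they map naturally onto GPUs and run far faster there than on CPUs.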
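
To make the parallelism concrete, here is a minimal plain-Python sketch. It is illustrative only: the function names `output_cell` and `matmul` are invented for this example, and real frameworks use optimized GPU kernels rather than loops like these. Each output cell is computed from the inputs alone, so the cells can be filled in any order, or all at once on parallel hardware.

```python
# Illustrative sketch: every cell of C = A x B is an independent dot product.
# Names here are hypothetical; this is not how a GPU library implements it.

def output_cell(a, b, i, j):
    """Dot product of row i of a with column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul(a, b, order=None):
    """Multiply two matrices, visiting output cells in an arbitrary order."""
    rows, cols = len(a), len(b[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    if order is not None:
        cells = order(cells)  # reorder the work items
    c = [[0] * cols for _ in range(rows)]
    for i, j in cells:
        # No cell reads another cell's result, so order does not matter.
        c[i][j] = output_cell(a, b, i, j)
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
forward = matmul(a, b)
backward = matmul(a, b, order=lambda cells: list(reversed(cells)))
print(forward)              # [[19, 22], [43, 50]]
print(forward == backward)  # True
```

Because the result is the same in either order, a GPU is free to assign each cell (or tile of cells) to a different core and compute them simultaneously.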

TPUs and Specialized Accelerators

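Tensor Processing Units (TPUs) are custom ASICs developed by Google specifically for machine learning workloads. Their hardware is built around dedicated matrix-multiplication units, and they are programmed through the XLA compiler, which TensorFlow, JAX, and PyTorch (via PyTorch/XLA) can all target. TPUs offer large performance gains on workloads dominated by dense matrix operations, such as neural network training and inference.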

CUDA and Software Interfaces

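CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. It lets developers use CUDA-enabled GPUs for general-purpose computation by writing functions called kernels, which run across many GPU threads in parallel. On NVIDIA hardware, major AI frameworks such as PyTorch and TensorFlow rely on CUDA and its libraries (for example cuBLAS and cuDNN) under the hood.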
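
The sketch below emulates, in plain Python, how a CUDA-style launch maps threads onto data. It is a teaching aid under stated assumptions, not CUDA code: `launch` and `vector_add_kernel` are hypothetical names, and the loops run sequentially, whereas a GPU would run the threads in parallel. The one detail taken from CUDA itself is the standard index formula, block index times block size plus thread index.

```python
# Plain-Python emulation of a 1-D CUDA-style kernel launch (hypothetical names).

def vector_add_kernel(block_idx, block_dim, thread_idx, x, y, out):
    """Work done by a single 'thread': add one pair of elements."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # guard threads past the end
        out[i] = x[i] + y[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per (block, thread) pair; sequential stand-in."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
x = list(range(n))
y = [10 * v for v in x]
out = [0] * n

block_dim = 4                                # threads per block (assumed)
grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) = 3 blocks
launch(vector_add_kernel, grid_dim, block_dim, x, y, out)
print(out)  # [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
```

Three blocks of four threads give twelve threads for ten elements, which is why the bounds check inside the kernel is needed.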

The Role of Memory Bandwidth

For Large Language Models (LLMs), memory bandwidth is often the primary inference bottleneck rather than raw compute power. Generating each token requires streaming the model's weights from memory to the processing cores, so at small batch sizes the cores spend much of their time waiting for data. This is why High Bandwidth Memory (HBM) is critical in modern AI accelerators.
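
A rough back-of-the-envelope estimate shows why bandwidth sets the ceiling. The sketch below assumes that every weight is read from memory exactly once per generated token, with a batch size of one, and it ignores the KV cache, activations, and any compute time. The function name and the example figures (a 7-billion-parameter model, 1000 GB/s of bandwidth) are hypothetical and chosen only for illustration.

```python
# Upper bound on decode speed if weight streaming is the only cost.
# Assumption: all weights are read once per token; nothing else is counted.

def max_tokens_per_second(num_params, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-bound ceiling on tokens generated per second."""
    model_bytes = num_params * bytes_per_param
    bandwidth_bytes_s = bandwidth_gb_s * 1e9
    return bandwidth_bytes_s / model_bytes


# Hypothetical 7B-parameter model on a device with 1000 GB/s of bandwidth.
fp16 = max_tokens_per_second(7e9, 2, 1000)    # 16-bit weights, 14 GB
int4 = max_tokens_per_second(7e9, 0.5, 1000)  # 4-bit weights, 3.5 GB

print(f"16-bit weights: about {fp16:.0f} tokens/s at most")
print(f"4-bit weights:  about {int4:.0f} tokens/s at most")
```

Under these assumptions the 16-bit model tops out at roughly 71 tokens per second and the 4-bit model at roughly 286. Shrinking the weights or raising the bandwidth raises the ceiling; adding more compute does not.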
