Stanford CS231n is the defining course for deep learning applied to computer vision. Originally taught by Fei-Fei Li and Andrej Karpathy, and later by Justin Johnson and Serena Yeung, it trained a generation of engineers who now build vision systems for phones, cars, and medical devices.
These CS231n notes are a study reference covering the major topics from the course. Each section explains the concepts, the equations that matter, and the intuitions that make the ideas stick.
Stanford CS231n 2017 Lecture 1 — Introduction and Historical Context.
For the mathematical foundations CS231n builds on, see the Stanford CS229 machine learning notes. For the broader deep learning curriculum, see the Stanford CS230 deep learning notes.
What CS231n Covers and Why Computer Vision Is Hard
CS231n is hard because visual recognition is hard — and that hardness is not obvious until you think carefully about what the problem actually requires.
When you see a photograph of a cat, you identify it as a cat in a fraction of a second without conscious effort. The image is a grid of pixel values — integers between 0 and 255 in three channels (R, G, B). There is no "cat" in those numbers. What you are doing, without knowing it, is extracting a hierarchical representation: edges → textures → parts → objects. CS231n is a course about building systems that learn to do that.
The challenges the course addresses:
- Viewpoint variation: the same object looks very different from different angles
- Illumination: lighting changes alter pixel values dramatically without changing the object
- Deformation: cats are not rigid bodies — they stretch, curl, and compress
- Occlusion: objects are often partially hidden
- Intraclass variation: all golden retrievers are one class, but they vary enormously
These challenges are why hand-engineered features failed for general visual recognition and why learned features from deep networks succeeded.
CS231n's curriculum structure:
- Image classification with traditional methods (k-NN, linear classifiers)
- Convolutional Neural Networks: architecture and training
- Training deep networks: optimization, regularization, batch normalization
- Applications: object detection, segmentation, image captioning
- Recurrent networks, attention mechanisms, and transformers
- Generative models: VAEs and GANs
- Video understanding and 3D vision
Image Classification: From Nearest Neighbors to Linear Classifiers
CS231n grounds the classification problem concretely before introducing neural networks. The starting point is the simplest possible approach.
The nearest neighbor classifier assigns the class of the most similar training image to each test image, using pixel-level distance (L1 or L2). It is simple enough to understand completely and bad enough to motivate everything that comes after.
L1 (Manhattan) distance between two images:
d_1(I_1, I_2) = Σ_p |I_1^p - I_2^p|
L2 (Euclidean) distance:
d_2(I_1, I_2) = √(Σ_p (I_1^p - I_2^p)²)
The k-Nearest Neighbor (kNN) classifier votes among the k nearest images. On CIFAR-10 (10-class image classification, 50,000 training images), nearest neighbor classification on raw pixels reaches only about 35 to 40% accuracy, against a chance level of 10%. The best neural networks achieve above 99%.
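The distance and voting steps above can be sketched in a few lines of NumPy. This is a minimal illustration on toy 2D points standing in for flattened images; the function name `knn_predict` and the data are assumptions made for the example, and real pixel arrays should be cast to float first so the subtraction does not overflow.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1, metric="l2"):
    """Label each row of X_test by majority vote among its k nearest training rows."""
    preds = []
    for x in X_test:
        diff = X_train - x                          # broadcasts to shape (N, D)
        if metric == "l1":
            dists = np.abs(diff).sum(axis=1)        # Manhattan distance
        else:
            dists = np.sqrt((diff ** 2).sum(axis=1))  # Euclidean distance
        nearest = np.argsort(dists)[:k]             # indices of the k closest rows
        votes = np.bincount(y_train[nearest])
        preds.append(int(np.argmax(votes)))         # ties go to the smallest label
    return np.array(preds)

# Toy data: two clusters with integer labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.2, 0.1], [5.5, 5.2]])
preds = knn_predict(X_train, y_train, X_test, k=3)
```

Note that "training" is just storing the data; all the cost is paid at prediction time, which is the opposite of what you want in deployment.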
Why does kNN fail for images? Because L2 distance in pixel space has nothing to do with semantic similarity. A shifted version of an image — semantically identical — can have high L2 distance from the original. Two images that look visually dissimilar can have low L2 distance if their backgrounds are similar. Pixel space is the wrong metric space for visual semantics.
Cross-validation is introduced here: split the training data into folds (typically 5 or 10), train on all but one fold, evaluate on the held-out fold, rotate, and average. For kNN, this is how you choose k, the number of neighbors. The key principle: the test set is used exactly once, at the very end, and never for hyperparameter selection.
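A sketch of fold rotation for choosing the number of neighbors, assuming synthetic Gaussian blobs in place of images. The helper names `knn_accuracy` and `cross_validate_k` are illustrative, not taken from the course code.

```python
import numpy as np

def knn_accuracy(X_tr, y_tr, X_val, y_val, k):
    # Pairwise L2 distances, shape (num_val, num_train)
    d = np.sqrt(((X_val[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2))
    nearest = np.argsort(d, axis=1)[:, :k]
    preds = np.array([np.bincount(y_tr[idx]).argmax() for idx in nearest])
    return float((preds == y_val).mean())

def cross_validate_k(X, y, k_choices, num_folds=5):
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Hold out fold i, train on the rest
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            accs.append(knn_accuracy(X_tr, y_tr, X_folds[i], y_folds[i], k))
        results[k] = float(np.mean(accs))
    return results

rng = np.random.default_rng(0)
# Two well-separated blobs, shuffled so every fold contains both classes
X = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
perm = rng.permutation(100)
X, y = X[perm], y[perm]
scores = cross_validate_k(X, y, k_choices=[1, 3, 5])
best_k = max(scores, key=scores.get)
```

The test set never appears in this loop; it is touched only after `best_k` is fixed.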
Linear classifiers learn a parametric mapping from images to class scores:
f(x, W) = Wx + b
For CIFAR-10: x is a 3072-dimensional vector (32×32×3 pixels), W is a 10×3072 matrix, b is a 10-dimensional bias vector. The output is 10 class scores.
The "template matching" interpretation: each row of W is a template for one class. The score for a class is the dot product of the image with its template. Visualizing the learned templates shows what the linear classifier "expects" each class to look like — a single averaged image. This is why linear classifiers are limited.
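The score function and the template view can be made concrete with the CIFAR-10 shapes given above. The weights here are random placeholders rather than trained values, so only the shapes and the dot-product structure are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 10, 32 * 32 * 3           # CIFAR-10 shapes from the text
W = 0.01 * rng.standard_normal((num_classes, dim))
b = np.zeros(num_classes)
x = rng.random(dim)                          # stand-in for one flattened image

scores = W @ x + b                           # f(x, W) = Wx + b, shape (10,)
predicted_class = int(np.argmax(scores))

# Each row of W has as many entries as the image has values, so it can be
# reshaped to image shape and viewed as that class's template.
template_0 = W[0].reshape(32, 32, 3)
```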
Convolutional Neural Networks: Architecture and Intuition
CNNs are the architecture that made deep learning work for vision. The key innovations over fully connected networks are local connectivity, parameter sharing, and translation equivariance.
Why not just use fully connected networks for images? For a 224×224 RGB image, the input dimension is 224×224×3 = 150,528. A single hidden layer with 1000 neurons would require about 150 million parameters just for the first layer. This is computationally wasteful and statistically hopeless: the model would have far more parameters than most datasets have training examples, so it would overfit badly. It also ignores the fact that visual patterns are local (an edge detector does not need to see the whole image at once).
The convolutional layer addresses this with two design principles:
- Local connectivity: each neuron connects to a small spatial region of the input (the receptive field), not the entire input.
- Parameter sharing: the same filter (set of weights) is applied at every spatial location. This is valid because a vertical edge detector is useful everywhere in an image, not just in the top-left corner.
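The effect of these two principles on parameter count is easy to check with arithmetic. The fully connected numbers come from the text; the choice of 64 filters of size 3×3 for the convolutional layer is an illustrative assumption.

```python
# Fully connected: every hidden unit sees every input value
fc_inputs = 224 * 224 * 3
fc_hidden = 1000
fc_params = fc_inputs * fc_hidden + fc_hidden    # weights + biases

# Convolutional: K filters of shape (F, F, C), shared across all positions
F, C, K = 3, 3, 64                               # K = 64 is an illustrative choice
conv_params = K * (F * F * C) + K                # weights + one bias per filter

ratio = fc_params / conv_params
```

The convolutional layer's parameter count does not depend on the image size at all, only on the filter shape and the number of filters.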
A convolutional filter W is a 3D tensor of shape (F, F, C) where F is the filter size and C is the number of input channels. The filter is slid across the spatial dimensions of the input, computing a dot product at each position. The output is a 2D "activation map."
Key hyperparameters:
- Filter size (F): typically 3×3 or 5×5
- Number of filters (K): depth of the output volume
- Stride (S): step size when sliding the filter
- Padding (P): zero-padding around the input border to control output size
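Output spatial size: (W - F + 2P) / S + 1, where W here is the input width (or height), not the filter weights. For example, a 32×32 input with F = 5, P = 2, S = 1 gives (32 - 5 + 4) / 1 + 1 = 32, so the spatial size is preserved.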
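A deliberately naive sketch of the sliding dot product for a single filter, together with the output-size formula. It favors readability over speed (real libraries vectorize this heavily), and the function names are made up for this example.

```python
import numpy as np

def conv_output_size(W, F, P, S):
    """Spatial output size for input size W, filter size F, padding P, stride S."""
    if (W - F + 2 * P) % S != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return (W - F + 2 * P) // S + 1

def conv2d_single_filter(x, w, b=0.0, stride=1, pad=0):
    """x: (H, W, C) input, w: (F, F, C) filter -> 2D activation map."""
    H, W_in, _ = x.shape
    F = w.shape[0]
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))   # zero-pad spatial dims only
    H_out = conv_output_size(H, F, pad, stride)
    W_out = conv_output_size(W_in, F, pad, stride)
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch = xp[i * stride:i * stride + F, j * stride:j * stride + F, :]
            out[i, j] = np.sum(patch * w) + b          # dot product over F*F*C values
    return out

x = np.ones((5, 5, 3))
w = np.ones((3, 3, 3))
out = conv2d_single_filter(x, w, stride=1, pad=1)      # padding keeps the 5x5 size
```

With all-ones input and filter, interior outputs equal 27 (the full 3×3×3 patch) while corner outputs equal 12, because zero padding removes part of the patch.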
The pooling layer reduces spatial size (and computation) by summarizing spatial regions. Max pooling takes the maximum value in each region. Average pooling takes the mean. Max pooling is more common in practice — it captures the presence of a feature, discarding exact location.
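Max and average pooling over non-overlapping 2×2 regions can be written with a reshape. This sketch assumes a single 2D activation map with even height and width.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W) activation map with even H and W -> (H/2, W/2)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def avg_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

a = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [0, 0, 1, 0],
              [0, 9, 0, 0]], dtype=float)
pooled_max = max_pool_2x2(a)
pooled_avg = avg_pool_2x2(a)
```

In the bottom-left region the single large value 9 survives max pooling regardless of where it sits inside the 2×2 window, which is the "presence, not location" behavior described above.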
A canonical CNN architecture:
INPUT → [[CONV → RELU] × N → POOL] × M → [FC → RELU] × K → FC → SOFTMAX
Where the convolutional blocks extract features and the fully connected layers classify based on the extracted features.
Classic architectures covered in CS231n:
- LeNet-5 (1998): the first successful CNN, applied to digit recognition
- AlexNet (2012): won ImageNet with 16.4% top-5 error, sparking the deep learning era. First use of ReLU, dropout, and GPU training in a competition-winning architecture
- VGGNet (2014): showed that depth matters; used uniform 3×3 filters throughout
- GoogLeNet/Inception (2014): introduced inception modules — multiple filter sizes in parallel — reducing parameters while increasing depth
- ResNet (2015): introduced residual connections (skip connections) enabling networks of 152+ layers; won ImageNet with 3.57% top-5 error
Residual connections address the optimization difficulty that kept very deep plain networks from training well; the skip path also gives gradients a direct route back to earlier layers, which mitigates vanishing gradients. The key insight: instead of learning H(x), learn F(x) = H(x) - x (the residual). If the optimal transformation is close to the identity, this is much easier to learn, because the block only has to push F(x) toward zero. The skip connection passes the input directly to the output: y = F(x) + x.
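A simplified residual block, using fully connected layers for brevity. Actual ResNet blocks use convolutions and batch normalization, so treat this only as an illustration of y = F(x) + x.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is a small two-layer transform."""
    f = relu(x @ W1) @ W2      # F(x): the residual the block has to learn
    return f + x               # skip connection adds the input back

x = np.array([[1.0, -2.0, 3.0]])
W_zero = np.zeros((3, 3))
y_identity = residual_block(x, W_zero, W_zero)   # F(x) = 0, so y equals x
```

With all weights at zero the block is exactly the identity, which is why adding more residual blocks should not make a network worse than its shallower counterpart.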
Training Deep Networks: Optimization, Regularization, and Batch Normalization
Understanding how to train a deep network is as important as understanding the architecture. CS231n devotes significant attention to the practical details that determine whether a network trains at all.
Activation functions. The choice of non-linearity affects training dynamics dramatically.
- Sigmoid σ(x) = 1/(1+e^{-x}): saturates at 0 and 1, killing gradients for extreme values. Almost never used in hidden layers anymore.
- Tanh: zero-centered (better than sigmoid), still saturates.
- ReLU (Rectified Linear Unit) f(x) = max(0, x): does not saturate for positive x, fast to compute, default choice for hidden layers. Problem: "dying ReLU" — neurons with consistently negative pre-activations produce zero gradients permanently.
- Leaky ReLU f(x) = max(0.01x, x): allows a small gradient for negative x, preventing dying ReLUs.
- ELU, SELU, Swish: modern alternatives with various tradeoffs in practice.
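The saturation and dying-ReLU behavior in the list above shows up directly in the derivatives. A small sketch comparing gradients at a few sample inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # at most 0.25, near 0 for large |x|

def relu_grad(x):
    return (x > 0).astype(float)    # exactly 0 for negative inputs

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)   # small but nonzero for negative inputs

xs = np.array([-10.0, -1.0, 0.5, 10.0])
g_sigmoid = sigmoid_grad(xs)
g_relu = relu_grad(xs)
g_leaky = leaky_relu_grad(xs)
```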
Weight initialization. If all weights are initialized to zero, all neurons in a layer compute the same output and receive the same gradient — symmetry is never broken. Random initialization breaks symmetry. The scale matters:
- Too large: activations saturate, gradients vanish
- Too small: activations shrink through layers, gradients also shrink
Xavier initialization (for tanh): Var(W) = 1/n_in. He initialization (for ReLU): Var(W) = 2/n_in. These are derived by requiring that the variance of activations remains roughly constant across layers.
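The effect of the initialization scale can be observed by pushing random data through a stack of ReLU layers and measuring the activation spread at the end. The layer width, depth, and the fixed scale 0.01 are arbitrary choices for this experiment.

```python
import numpy as np

def forward_std(scale_fn, n=512, depth=10, seed=0):
    """Std of activations after `depth` ReLU layers with weights ~ N(0, scale_fn(n)^2)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((1000, n))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale_fn(n)
        h = np.maximum(0.0, h @ W)            # one ReLU layer
    return float(h.std())

he_std = forward_std(lambda n_in: np.sqrt(2.0 / n_in))   # Var(W) = 2 / n_in
small_std = forward_std(lambda n_in: 0.01)               # too small: signal collapses
```

With He initialization the activation scale stays on the order of 1 after ten layers, while the fixed small scale drives activations toward zero, and the gradients with them.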
Batch Normalization is one of the most impactful practical advances in deep learning. Before feeding the output of a layer to the next, normalize it to have zero mean and unit variance across the batch:
x̂ = (x - μ_B) / √(σ_B² + ε)
y = γ x̂ + β
The learnable parameters γ (scale) and β (shift) allow the network to undo the normalization if needed. Benefits: enables higher learning rates, reduces sensitivity to initialization, acts as regularization (slightly). BatchNorm is typically inserted after convolutional or fully connected layers, before the activation function.
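A sketch of the training-mode forward pass for a batch of feature vectors, following the two equations above. At test time, implementations use running averages of the mean and variance instead of batch statistics; that part is omitted here.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch. Normalizes each feature over the batch dimension."""
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 4))        # features with mean 5, std 3
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```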
Regularization techniques:
- L2 weight decay: adds λ||W||² to the loss, encouraging small weights
- Dropout: during training, randomly zero out neurons with probability p at each forward pass. Prevents co-adaptation of features. At test time, multiply activations by the keep probability 1 - p so expected values match training (or equivalently, use "inverted dropout": divide the surviving activations by 1 - p during training and do nothing at test time)
- Data augmentation: random crops, horizontal flips, color jitter, cutout. The most practically important regularization technique for vision
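A minimal sketch of inverted dropout. Here `p_drop` is the probability of zeroing a unit, so the keep probability is 1 - `p_drop`; some references parameterize dropout by the keep probability instead, so check the convention before reusing code.

```python
import numpy as np

def dropout_forward(x, p_drop, train, rng):
    """Inverted dropout: rescale at training time, identity at test time."""
    if not train:
        return x                                   # nothing to do at test time
    keep = 1.0 - p_drop
    mask = (rng.random(x.shape) < keep) / keep     # survivors are scaled by 1/keep
    return x * mask

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout_forward(x, p_drop=0.5, train=True, rng=rng)
test_out = dropout_forward(x, p_drop=0.5, train=False, rng=rng)
```

Because survivors are scaled up during training, the expected activation is the same in both modes, which is what lets the test-time path stay untouched.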
Optimization algorithms:
- SGD with momentum: accumulates a velocity vector in directions of persistent gradient, dampening oscillation: v = μv - α∇J; W += v
- RMSProp: adapts learning rates per-parameter based on recent gradient magnitudes
- Adam: combines momentum and adaptive learning rates. The default choice for most applications
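The update rules above can be compared on a toy quadratic. This sketch uses J(w) = 0.5·||w||², whose gradient is simply w; the learning rate of 0.1 is chosen for the toy problem, while the Adam decay rates are the commonly used defaults.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.1, mu=0.9):
    v = mu * v - lr * grad                        # v = mu*v - alpha*grad
    return w + v, v                               # W += v

def adam_step(w, grad, m, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2       # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    s_hat = s / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

w_sgd, v = np.array([5.0, -3.0]), np.zeros(2)
w_adam, m, s = np.array([5.0, -3.0]), np.zeros(2), np.zeros(2)
for t in range(1, 201):
    w_sgd, v = sgd_momentum_step(w_sgd, w_sgd, v)        # gradient of J is w
    w_adam, m, s = adam_step(w_adam, w_adam, m, s, t)
```

One property worth noticing: on the first step Adam moves each parameter by almost exactly the learning rate, regardless of the gradient's magnitude, because the bias-corrected ratio m_hat / sqrt(s_hat) equals the sign of the gradient.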
Learning rate scheduling: start with a relatively high learning rate, reduce it (by 10×) when validation performance plateaus. Warm restarts and cosine annealing are alternatives that work well with modern architectures.
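Two schedules as plain functions of the epoch. The step schedule here drops on a fixed interval as a simplification of "reduce on plateau", which would need validation metrics; the interval of 30 epochs is an arbitrary illustrative value.

```python
import math

def step_decay(base_lr, epoch, drop_every=30, factor=0.1):
    """Multiply the learning rate by `factor` every `drop_every` epochs."""
    return base_lr * (factor ** (epoch // drop_every))

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    """Smoothly decay from base_lr to min_lr along half a cosine period."""
    cos = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos

step_lrs = [step_decay(0.1, e) for e in (0, 29, 30, 65)]
cosine_lrs = [cosine_annealing(0.1, e, 100) for e in (0, 50, 100)]
```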
Object Detection, Segmentation, and Beyond Classification
Image classification assigns one label to an entire image. Object detection localizes and classifies multiple objects. Semantic segmentation assigns a class label to every pixel. These harder tasks drove most of the architectural innovations in the 2015–2022 period.
Sliding window detection applies a classifier at every position and scale. Accurate, but computationally prohibitive — a 256×256 image might require thousands of classifier evaluations.
Region-based CNNs (R-CNN) propose regions first, then classify them. Selective search proposes ~2000 candidate regions per image; each is warped to a fixed size and passed through a CNN for feature extraction and classification. Better accuracy, still slow (47 seconds per image on a GPU).
Fast R-CNN processes the entire image with a CNN once, extracting features for all regions from the shared feature map using RoI (Region of Interest) pooling. Much faster.
Faster R-CNN replaces selective search with a Region Proposal Network (RPN) that shares computation with the detection network. The RPN slides over the feature map, predicting objectness scores and bounding box offsets at anchor boxes of multiple scales and aspect ratios. End-to-end trainable. State of the art at the time.
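To make "anchor boxes of multiple scales and aspect ratios" concrete, here is a sketch that generates the anchors centered at one feature-map location. The three scales and three ratios mirror the commonly cited Faster R-CNN configuration, but the function itself and its conventions (ratio defined as height over width, boxes as corner coordinates) are assumptions for illustration.

```python
import numpy as np

def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales) * len(ratios), 4) boxes as (x1, y1, x2, y2)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)          # width and height chosen so area stays s*s
            h = s * np.sqrt(r)
            boxes.append([center_x - w / 2, center_y - h / 2,
                          center_x + w / 2, center_y + h / 2])
    return np.array(boxes)

anchors = make_anchors(300, 300)
widths = anchors[:, 2] - anchors[:, 0]
heights = anchors[:, 3] - anchors[:, 1]
areas = widths * heights
```

The RPN then predicts, for every anchor at every location, an objectness score and four offsets that adjust the anchor toward a nearby object.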
YOLO (You Only Look Once) reframes detection as a single regression problem: divide the image into a grid, and have each grid cell predict bounding boxes and class probabilities directly. Because the whole image is processed in one forward pass, it is far faster than the original R-CNN pipeline and fast enough for real-time detection.
Semantic segmentation assigns a class to every pixel. Fully Convolutional Networks (FCNs) replace the final fully connected layers of a classification network with convolutional layers, producing a spatial prediction map. Upsampling (transposed convolutions, bilinear upsampling) restores spatial resolution.
Instance segmentation (Mask R-CNN) extends Faster R-CNN with a branch that predicts a binary segmentation mask for each detected instance. This combines object detection and segmentation.
Recurrent Networks, Attention, and Vision-Language Models
CS231n's second half covers sequence modeling and how vision connects to language — a connection that has become increasingly central as multimodal models dominate the field.
Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state that summarizes information from all previous inputs:
h_t = f(h_{t-1}, x_t) = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t
The same weights are used at every time step (weight sharing over time). The hidden state is the network's "memory."
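The recurrence above, unrolled over a short random sequence. The dimensions and random weights are placeholders; biases are omitted to match the equations as written.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # new hidden state
    y_t = W_hy @ h_t                            # output at this time step
    return h_t, y_t

rng = np.random.default_rng(0)
H, D, O, T = 4, 3, 2, 5                         # hidden, input, output sizes; sequence length
W_hh, W_xh, W_hy = (0.1 * rng.standard_normal(shape)
                    for shape in [(H, H), (H, D), (O, H)])
xs = rng.standard_normal((T, D))

h = np.zeros(H)
outputs = []
for x_t in xs:                                  # the same weights are reused at every step
    h, y = rnn_step(h, x_t, W_hh, W_xh, W_hy)
    outputs.append(y)
outputs = np.stack(outputs)
```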
Backpropagation Through Time (BPTT) unrolls the RNN over time steps and applies backprop. The problem: gradients either vanish (for long sequences, gradients from early time steps become vanishingly small) or explode.
LSTMs (Long Short-Term Memory) address vanishing gradients with explicit memory cells and gating mechanisms:
- Forget gate: what fraction of the previous cell state to retain
- Input gate: how much new information to write to the cell state
- Output gate: how much of the cell state to expose as hidden state
The cell state acts as a "highway" for gradients to flow through time with minimal modification, enabling LSTMs to learn dependencies over hundreds of time steps.
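A single LSTM step with the three gates listed above plus the candidate values that the input gate writes. Packing all four transforms into one weight matrix, and the row ordering used here, are layout assumptions for this sketch; libraries differ on both.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W: (4H, D+H), b: (4H,). Rows ordered as input, forget, output, candidate."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:H])              # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])          # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])      # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])      # candidate values
    c_t = f * c_prev + i * g         # additive update: the gradient "highway"
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# With the forget gate saturated open and the input gate shut, the cell state
# is carried forward essentially unchanged.
H, D = 2, 3
W = np.zeros((4 * H, D + H))
b = np.concatenate([np.full(H, -20.0), np.full(H, 20.0), np.zeros(H), np.zeros(H)])
c_prev = np.array([1.0, -2.0])
h_t, c_t = lstm_step(np.ones(D), np.zeros(H), c_prev, W, b)
```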
Image captioning applies sequence models to vision. The architecture: a CNN encodes the image into a feature vector; an LSTM decoder generates words one at a time, conditioned on the image feature and the previously generated words. This was state of the art in 2015.
Attention mechanisms allow the model to focus on different parts of the image when generating each word. Instead of a single fixed image feature, the decoder computes a weighted sum of spatial CNN features at each decoding step, with weights (attention weights) computed based on the current decoder state. The same idea had been introduced shortly before for neural machine translation, and it is the direct ancestor of the attention used in transformers.
The Transformer architecture (covered in later CS231n lectures) extends attention to an entirely attention-based model without recurrence. The self-attention mechanism allows every position in a sequence to attend to every other position:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Multi-head attention runs multiple attention operations in parallel, each learning different relationships. Transformers parallelize over sequence length (unlike RNNs) and have become the dominant architecture for language, vision, and multimodal tasks.
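A single-head version of the attention formula in NumPy. The random projection matrices stand in for learned weights, and the sequence length and embedding size are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))     # (n_queries, n_keys), rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))                   # 6 positions, 8-dim embeddings
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Every output row is a convex combination of the value rows, and all rows are computed in one matrix product, which is where the parallelism over sequence length comes from.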
Vision Transformers (ViT) apply the transformer directly to images by dividing the image into 16×16 pixel patches, linearly embedding each patch, and processing the sequence of patch embeddings with a standard transformer encoder. ViT matches or outperforms CNN-based models on large-scale datasets while being simpler to scale.
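The patch-splitting and linear embedding steps can be sketched as a reshape followed by a matrix product. The class token and position embeddings that ViT also uses are left out, and the embedding width of 64 is an arbitrary small value for the example.

```python
import numpy as np

def patchify(image, patch=16):
    """image: (H, W, C) with H, W divisible by patch -> (num_patches, patch*patch*C)."""
    H, W, C = image.shape
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)       # one flattened row per patch

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image)                         # 14 * 14 = 196 patches of 768 values
W_embed = 0.02 * rng.standard_normal((768, 64))   # placeholder for the learned projection
tokens = patches @ W_embed                        # sequence fed to the transformer encoder
```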
Generative Models: VAEs and GANs
CS231n concludes with generative models — systems that learn to generate new images rather than just classify them.
Variational Autoencoders (VAEs) learn a latent representation z of the data distribution. An encoder network maps input x to a distribution over z (typically Gaussian: mean μ and variance σ²). A decoder network maps z back to x. The objective, which is maximized during training, combines reconstruction quality and a KL divergence term that encourages the latent space to be smooth and structured:
L(x) = E_{z~q}[log p(x|z)] - KL(q(z|x) || p(z))
The reparameterization trick makes this differentiable: sample ε ~ N(0, I), then z = μ + σ ⊙ ε. This allows gradients to flow through the sampling operation.
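A sketch of the sampling step and the KL term for a diagonal Gaussian encoder and a standard normal prior. Parameterizing the encoder output as log-variance is a common convention assumed here, not something stated in the text.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    eps = rng.standard_normal(mu.shape)           # eps ~ N(0, I): the only random part
    return mu + np.exp(0.5 * log_var) * eps       # z = mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.full((10000, 2), 3.0)
log_var = np.full((10000, 2), np.log(0.25))       # sigma = 0.5
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Because z is a deterministic function of μ, σ, and the noise ε, gradients with respect to μ and σ are well defined even though z is random.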
VAEs tend to produce blurry samples. This known limitation is usually attributed to the pixel-wise reconstruction loss, which rewards averaging over plausible outputs rather than committing to one sharp image.
Generative Adversarial Networks (GANs) take a different approach. A generator G maps noise z to images. A discriminator D tries to distinguish real images from generated ones. They are trained simultaneously in a minimax game:
min_G max_D E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
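The two sides of the minimax objective written as losses over batches of discriminator outputs. The inputs here are hand-picked probabilities rather than outputs of real networks, and the small `eps` is added only to keep the logarithm finite.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    # D maximizes log D(x) + log(1 - D(G(z))), so we minimize the negative
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss_minimax(d_fake, eps=1e-12):
    # G minimizes log(1 - D(G(z))): it wants D to assign high probability to fakes
    return np.mean(np.log(1.0 - d_fake + eps))

undecided = np.full(8, 0.5)                       # D cannot tell real from fake
loss_undecided = discriminator_loss(undecided, undecided)
loss_confident = discriminator_loss(np.full(8, 0.9), np.full(8, 0.1))
g_loss_fooled = generator_loss_minimax(np.full(8, 0.9))
g_loss_caught = generator_loss_minimax(np.full(8, 0.1))
```

When the discriminator outputs 0.5 everywhere its loss is 2·ln 2 ≈ 1.386, which corresponds to the equilibrium where generated samples are indistinguishable from real ones.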
GANs produce sharper, more realistic images than VAEs. The training challenges are significant: mode collapse (generator produces only a few types of images), training instability, and sensitivity to hyperparameters. Techniques like progressive growing (PGGAN), Wasserstein distance (WGAN), and spectral normalization address these problems.
Practical applications of generative models: image synthesis, image-to-image translation (pix2pix, CycleGAN), super-resolution, style transfer, data augmentation for training other models.
What Should You Do After CS231n?
CS231n gives you the vocabulary and intuitions to read and implement modern computer vision papers. The natural next steps depend on where you want to go.
For research: Read the key papers that CS231n covers — AlexNet, VGGNet, ResNet, Faster R-CNN, GANs, Attention Is All You Need, ViT. Implement them from scratch in PyTorch. Work through the Stanford CS231n programming assignments (available on GitHub).
For applied work: The course is available at cs231n.stanford.edu with full lecture notes and materials. Hugging Face provides pre-trained vision models for most practical tasks — understanding CS231n means understanding what those models are doing internally.
For a comprehensive ML background: The Stanford CS229 machine learning notes fill in the classical ML foundations. The fast.ai practical deep learning notes offer a complementary, code-first perspective. For the theory side, our Andrew Ng ML course notes cover supervised learning fundamentals.
For broader AI: See our MIT 6.034 AI course notes for symbolic and classical AI methods, and our Coursera Deep Learning Specialization notes for a structured path through the deep learning curriculum.
Should You Study CS231n in 2026?
Yes — with the same caveat that applies to all foundational courses. The specific architectures have evolved (ViT has partially replaced CNNs for large-scale tasks, diffusion models have superseded GANs for image generation) but the concepts have not. Understanding why CNNs work, what attention mechanisms compute, and how training dynamics affect generalization is as valuable now as it was in 2017.
The course materials through 2017 are fully available on YouTube. The lecture notes are thorough. The programming assignments are excellent. It remains one of the best free ways to build a solid foundation in deep learning for vision.
CS231n is dense material. Every lecture introduces three or four concepts that each deserve their own study session. Notiq turns CS231n lectures into structured notes automatically — key concepts, equations, and self-test questions extracted from the video. Work through a lecture, generate your notes, and review with spaced repetition.

