Stanford CS229 Machine Learning Course Notes (Official 2026 Syllabus) — Free Complete Reference

·20 min read
Stanford CS229 Machine Learning Course Notes (Official 2026 Syllabus) — Free Complete Reference

Share this article

Stanford CS229 is the machine learning course that defined how universities teach the subject. Andrew Ng built it around the idea that students who understand the underlying mathematics — not just how to call a library function, but why the algorithm works — will be far more capable when they encounter problems the textbook did not anticipate.

These cs229 lecture notes are a complete study reference covering every major topic in the course. Whether you are working through the lectures on YouTube, preparing for a technical interview, or returning to CS229 after time away, this document gives you the key ideas, equations, and intuitions organized for efficient review.

Stanford CS229 Autumn 2018 Lecture 1. These notes are organized around the lecture topics — watch alongside or use them for review.

For a broader guide to learning technical courses from YouTube lectures, see our complete guide to YouTube study notes. For notes from the deep learning companion course, see our Stanford CS230 deep learning notes.

What Is CS229, and Who Is It For?

CS229 is Stanford's graduate machine learning course. It covers the field end-to-end: from the mathematics of supervised learning through unsupervised methods, learning theory, and reinforcement learning. The course is mathematically demanding — it assumes comfort with linear algebra, multivariable calculus, and probability theory. But the payoff is a working mental model of why each algorithm behaves the way it does.

The course is taught by Andrew Ng, who originally designed it based on his PhD research at Stanford and his experience building ML systems at Google Brain. The lecture notes he wrote for CS229 are available free on the Stanford CS229 website and are widely considered the best single written reference for classical machine learning — more rigorous than most textbooks and more practical than most academic papers.

CS229 versus CS229A (Coursera): The Stanford on-campus course goes significantly deeper than the Coursera version of Andrew Ng's ML course. CS229 derives the algorithms from first principles, covers learning theory (VC dimension, bias-variance in the PAC learning framework), and includes topics like Gaussian processes, the EM algorithm, and factor analysis that the Coursera version omits. These notes focus on the on-campus curriculum.

Official Stanford CS229 Syllabus — 2026 Course Topics

These notes follow the official Stanford CS229 syllabus as it is taught on campus in the current academic year. The course is divided into four blocks that span the entire field of classical machine learning. Every topic below is treated in this reference, in the same order Andrew Ng covers them.

Block 1 — Supervised learning (the bulk of the course, ~6 weeks):

  • Linear regression, least squares, normal equations
  • Logistic regression and the perceptron
  • Generalized linear models (GLMs) and exponential families
  • Generative learning (Gaussian discriminant analysis, naive Bayes)
  • Support vector machines — geometric margin, primal, dual, kernels, SMO. (For the full dual derivation, see our dedicated CS229 SVM dual formulation notes.)
  • Decision trees and ensemble methods (bagging, boosting, random forests)
  • Neural networks — forward propagation, backpropagation, regularization

Block 2 — Learning theory (~1 week):

  • Bias-variance trade-off in the PAC framework
  • VC dimension and uniform convergence
  • Cross-validation and model selection

Block 3 — Unsupervised learning (~3 weeks):

  • K-means clustering
  • Mixture of Gaussians and the EM algorithm
  • Factor analysis
  • Principal Components Analysis (PCA)
  • Independent Components Analysis (ICA)

Block 4 — Reinforcement learning (~2 weeks):

  • Markov Decision Processes (MDPs)
  • Value iteration and policy iteration
  • Q-learning and function approximation

Each block builds on the previous: supervised learning sets up the algorithmic vocabulary, learning theory explains why algorithms generalize, unsupervised learning generalizes to the no-label setting, and reinforcement learning extends to sequential decision-making. The order matters — the syllabus is structured so each block makes the next one easier.

Linear Regression and Least Squares: The Foundation

CS229 opens with linear regression — not because it is the most important algorithm, but because every concept you need for the rest of the course appears in its simplest form here.

The hypothesis function for linear regression with n features:

h_θ(x) = θ^T x = θ_0 + θ_1 x_1 + ... + θ_n x_n

where x_0 = 1 by convention. The parameters θ are what we learn.

The cost function measures how wrong our parameters are:

J(θ) = (1/2) Σ_i (h_θ(x^(i)) - y^(i))²

The factor of 1/2 is a calculus convenience — the derivative of the squared term brings down a 2 that cancels it.

Gradient descent minimizes J iteratively:

θ_j := θ_j - α ∂J/∂θ_j

where α is the learning rate. For linear regression, this update becomes the LMS (Least Mean Squares) rule:

θ_j := θ_j + α Σ_i (y^(i) - h_θ(x^(i))) x_j^(i)

The batch version sums over all training examples per update. The stochastic version updates after each single example — faster per step, noisier convergence.

The normal equation gives the exact minimum analytically:

θ = (X^T X)^{-1} X^T y

No learning rate, no iterations, no convergence worry. The cost is O(n³) for the matrix inversion — practical for n up to a few thousand features but prohibitively slow for large n. The lecture notes prove this derivation rigorously using matrix calculus, which is worth following once.

Locally Weighted Regression (LWR) extends linear regression to non-linear patterns without feature engineering. Instead of fitting a single global model, LWR fits a local linear model around each query point x, weighting nearby training examples more heavily using:

w^(i) = exp(-( x^(i) - x )² / 2τ²)

LWR is a non-parametric method — the model grows with the training set. It is not practical at massive scale but illustrates the bias-variance tradeoff concretely: small τ fits closely (low bias, high variance), large τ smooths heavily (high bias, low variance).

Logistic Regression, GLMs, and the Exponential Family

Logistic regression is the natural extension of linear regression to binary classification. The sigmoid function maps the linear predictor to a probability:

h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})

The sigmoid satisfies g'(z) = g(z)(1 - g(z)), which makes the backpropagation gradient especially clean.

The log-likelihood for logistic regression with Bernoulli-distributed outputs leads to the cross-entropy cost, and gradient descent on this cost gives an update rule that is structurally identical to the LMS rule for linear regression — a hint of deeper structure.

Generalized Linear Models (GLMs) are that deeper structure. CS229 introduces the exponential family distribution:

p(y; η) = b(y) exp(η^T T(y) - a(η))

where η is the natural parameter, T(y) is the sufficient statistic, and a(η) is the log-partition function. Most distributions you use in practice — Gaussian, Bernoulli, Poisson, multinomial, Gamma — belong to this family.

A GLM assumes:

  1. y | x; θ ~ ExponentialFamily(η)
  2. η = θ^T x (the natural parameter is a linear function of the features)
  3. The prediction is h(x) = E[y | x]

Under these assumptions, the gradient descent update for any GLM has exactly the same form. Linear regression (Gaussian output) and logistic regression (Bernoulli output) are both special cases. Softmax regression (multinomial output) is a third, giving a clean foundation for multi-class classification. This unifying view is one of the most elegant parts of CS229.

Support Vector Machines, Kernels, and Margins

SVMs are the most mathematically intricate section of CS229 — and in many ways the most intellectually satisfying.

The functional and geometric margins. For a linear classifier, the functional margin for a training example is:

γ̂^(i) = y^(i) (w^T x^(i) + b)

The geometric margin normalizes by ||w||:

γ^(i) = y^(i) (w^T x^(i) + b) / ||w||

The geometric margin is the actual Euclidean distance from the example to the decision boundary. The SVM objective is to find the w, b that maximize the minimum geometric margin across all training examples. Maximizing the margin produces classifiers with better generalization.

The optimal margin classifier leads to the primal optimization problem:

min_{w,b}  (1/2) ||w||²
subject to: y^(i) (w^T x^(i) + b) ≥ 1  for all i

Converting to the dual (using Lagrange multipliers and the KKT conditions) produces an optimization problem expressed entirely in terms of inner products x^(i)^T x^(j). The full derivation — every step from the primal margin problem through the Lagrangian to the dual, with KKT conditions and the soft-margin extension — is in the dedicated CS229 SVM dual formulation notes.

The kernel trick: if the primal-dual relationship only depends on inner products, you can replace the inner product with a kernel function K(x, z) that computes the inner product in a (potentially infinite-dimensional) feature space without ever computing the features explicitly:

K(x, z) = φ(x)^T φ(z)

The Gaussian kernel K(x, z) = exp(-||x - z||² / 2σ²) corresponds to an infinite-dimensional feature space. This allows SVMs to learn highly non-linear decision boundaries with no change to the algorithm — just substitute K for the inner product.

Mercer's theorem characterizes which kernel functions are valid (correspond to some feature map). A kernel matrix must be positive semi-definite.

Soft-margin SVMs handle non-separable data by introducing slack variables ξ^(i) that allow some margin violations, controlled by the parameter C. Large C → harder margin, risk of overfitting. Small C → softer margin, more robust. The trade-off between margin size and training error is the same bias-variance trade-off in a different coat.

The SMO algorithm solves the SVM dual efficiently by iteratively optimizing over pairs of Lagrange multipliers while holding all others fixed — a clever simplification that avoids the full quadratic programming complexity.

Generative vs. Discriminative Models: GDA and Naive Bayes

CS229 makes a structural distinction between learning algorithms that is worth carrying through your entire ML career.

Discriminative models learn the decision boundary directly — they model p(y | x). Logistic regression is discriminative. So are SVMs.

Generative models learn the data distribution — they model p(x | y) and p(y), then use Bayes' theorem to compute p(y | x). To classify a new point, you ask: which class is more likely to have generated this input?

Gaussian Discriminant Analysis (GDA) assumes that p(x | y = 0) and p(x | y = 1) are both Gaussian (with the same covariance matrix Σ but different means μ_0, μ_1). Fitting GDA means computing the MLE estimates of μ_0, μ_1, Σ, and p(y). The resulting classifier has a linear decision boundary — and can be shown to be equivalent to logistic regression under the Gaussian assumption. GDA has stronger assumptions and is more efficient when the assumptions hold; logistic regression is more robust when they do not.

Naive Bayes assumes that features are conditionally independent given the class label — a strong assumption that is almost always violated in practice and still produces remarkably good classifiers. For text classification, it models each word's presence/absence as an independent Bernoulli variable conditioned on the class. The resulting classifier is fast, interpretable, and competitive with more complex methods on many text tasks.

The key insight from this section: if you have limited data and your generative model assumptions are roughly correct, generative models can outperform discriminative models. With large amounts of data, discriminative models typically win.

Neural Networks: Architecture, Backpropagation, and Training

CS229's neural network coverage is more mathematically grounded than most introductory treatments. The course derives backpropagation from first principles rather than presenting it as a recipe.

Architecture. A neural network is a directed acyclic graph where each node computes a weighted sum of its inputs followed by a non-linear activation function. The input layer, one or more hidden layers, and the output layer define the architecture. In modern notation, a network with L layers has parameters Θ = {W^(1), b^(1), ..., W^(L), b^(L)}.

Forward propagation for a single layer:

z^(l) = W^(l) a^(l-1) + b^(l)
a^(l) = g(z^(l))

The expressivity result: A neural network with one hidden layer and a sigmoid activation can approximate any continuous function to arbitrary precision, given enough neurons (Universal Approximation Theorem). This is an existence result — it does not tell you how to train the network or how many neurons you need.

Backpropagation computes ∂J/∂W^(l) for all layers simultaneously using the chain rule:

δ^(L) = ∂J/∂z^(L)  (output layer error)
δ^(l) = (W^(l+1))^T δ^(l+1) ⊙ g'(z^(l))  (hidden layer error)
∂J/∂W^(l) = δ^(l) (a^(l-1))^T

The symbol ⊙ denotes element-wise multiplication. The computational graph view of backpropagation — where each operation has a forward pass that computes a value and a backward pass that computes the gradient — generalizes to arbitrary computation graphs and is the foundation of modern automatic differentiation libraries.

Training considerations CS229 covers:

  • Weight initialization: random small values, not zero (zero initialization causes symmetry that prevents learning)
  • Learning rate selection: too large diverges, too small converges slowly; adaptive methods like Adam address this
  • Mini-batch gradient descent: practical compromise between batch (stable, slow) and stochastic (fast, noisy)
  • Dropout and regularization for generalization (covered in depth in CS230)

kNN, Decision Trees, and Ensemble Methods

CS229 covers the major non-parametric and tree-based methods that remain practically important.

k-Nearest Neighbors (kNN) is the simplest learning algorithm that works: to classify a new point, find its k nearest neighbors in the training set and take a majority vote. kNN is non-parametric — it stores the entire training set and makes no assumptions about the data distribution.

The bias-variance trade-off in kNN is controlled by k. Small k → low bias (flexible boundary), high variance. Large k → high bias (smooth boundary), low variance. kNN is O(n) per prediction, which is slow for large training sets. Practical implementations use tree structures (KD-tree, ball tree) to speed up neighbor search.

Decision trees partition the feature space by recursively splitting on the feature and threshold that maximizes information gain (for classification) or variance reduction (for regression). A decision tree can represent any function given enough depth, but deep unpruned trees overfit severely.

Ensemble methods address the high variance of individual trees by combining many trees:

  • Bagging (Bootstrap Aggregating): train B trees on B bootstrap samples of the training data, average their predictions. Reduces variance without increasing bias.
  • Random Forests extend bagging by also sampling a random subset of features at each split. This decorrelates the trees, further reducing variance. Random forests are one of the most reliable "out of the box" classifiers available.
  • Boosting (AdaBoost): train trees sequentially, each focusing on the examples the previous trees got wrong (by reweighting training examples). Combines weak learners into a strong one. Reduces bias and can also reduce variance.

The bias-variance perspective on ensembles: bagging reduces variance, boosting primarily reduces bias. Understanding which problem you have guides which method to use.

Unsupervised Learning: K-Means, EM, PCA, and ICA

The unsupervised learning section of CS229 is substantially more rigorous than most introductory treatments.

K-Means is derived as coordinate descent on the distortion function:

J(c, μ) = Σ_i ||x^(i) - μ_{c^(i)}||²

The E-step (assign each point to its nearest centroid) and M-step (move each centroid to the cluster mean) each decrease J, so the algorithm converges. It always converges to a local minimum — which may not be the global minimum. Run multiple times with different random initializations.

The EM Algorithm is K-means generalized to probabilistic cluster assignments. Where K-means makes a hard assignment (each point belongs to exactly one cluster), EM makes soft assignments (each point has a probability of belonging to each cluster).

The EM algorithm alternates between:

  • E-step: Compute the posterior probability that each latent variable (cluster assignment) takes each value, given the current parameters. These are the "soft assignments."
  • M-step: Update the parameters to maximize the expected log-likelihood under the posterior from the E-step.

EM is guaranteed to converge (the log-likelihood is non-decreasing with each iteration), but like K-means it can converge to a local maximum. Gaussian Mixture Models (GMMs) are the most common EM application.

Principal Component Analysis (PCA) finds the directions of maximum variance in the data. The k principal components are the eigenvectors corresponding to the k largest eigenvalues of the sample covariance matrix. CS229 derives PCA both from the variance-maximization perspective and the reconstruction-error perspective — both lead to the same answer.

Independent Component Analysis (ICA) solves the "cocktail party problem": recover independent source signals from a mixture of those signals, without knowing the mixing matrix. Unlike PCA, ICA assumes the sources are non-Gaussian and statistically independent. The FastICA algorithm is the practical implementation.

Reinforcement Learning: MDPs, Value Functions, and Policy Learning

CS229 ends with reinforcement learning — learning not from labeled examples but from rewards and penalties received while interacting with an environment.

The MDP formalism. A Markov Decision Process is defined by:

  • State space S
  • Action space A
  • Transition probabilities P_{sa}(s') = P(s_{t+1} = s' | s_t = s, a_t = a)
  • Reward function R(s) or R(s, a)
  • Discount factor γ ∈ [0, 1)

The agent's goal is to find a policy π: S → A that maximizes the expected discounted total reward:

E[Σ_t γ^t R(s_t)]

Value functions. The value function V^π(s) is the expected total discounted reward when starting in state s and following policy π:

V^π(s) = E[R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π]

The optimal value function V*(s) is the maximum over all policies:

V*(s) = max_a [R(s) + γ Σ_{s'} P_{sa}(s') V*(s')]

This is the Bellman equation — a recursive relationship that defines V* uniquely.

Value iteration solves the Bellman equation iteratively: initialize V arbitrarily, then repeatedly apply the Bellman update until convergence. Policy iteration alternates between evaluating the current policy and improving it greedily. Both converge to the optimal policy.

Q-Learning is the off-policy TD method that learns the optimal action-value function Q*(s, a) — the expected return from taking action a in state s and then following the optimal policy. The update:

Q(s, a) := Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]

Q-learning converges to Q* with probability 1 under mild conditions. Combined with a neural network function approximator for Q, it becomes Deep Q-Networks (DQN) — the algorithm behind AlphaGo and the Atari-playing agent.

The exploration-exploitation trade-off is the fundamental challenge of RL: to learn the optimal policy you must try actions, but to collect high rewards you must exploit what you know. ε-greedy policies (take a random action with probability ε, exploit otherwise) are the simplest approach. Upper Confidence Bound (UCB) methods are theoretically motivated.

Is CS229 the Right Starting Point for You?

CS229 is not an easy course. It is genuinely graduate-level. But it is also freely available, the lecture notes are exceptional, and the mathematical depth it builds is a genuine competitive advantage for anyone who wants to do serious ML work.

If you are new to ML, the Andrew Ng Coursera ML course notes are a better entry point — same instructor, lower mathematical bar, more accessible. Come back to CS229 after you have the intuitions.

If you already know supervised learning basics and want to go deeper, CS229 is exactly right. Follow the lectures on YouTube, work through the problem sets (posted on the Stanford site), and use these notes for review.

For deep learning beyond CS229's scope, the Stanford CS230 deep learning notes cover CNNs, RNNs, and transformers in depth. For AI methods beyond ML, see the MIT 6.034 AI course notes. For a structured deep learning curriculum, the Coursera Deep Learning Specialization notes are a direct path from CS229's foundation.

How to Study CS229 Effectively

The most common mistake CS229 students make is trying to memorize derivations. The course is not about memorization — it is about understanding why each step follows from the previous one.

What to study actively:

  • Derive the normal equation from scratch once. Understand why (X^T X)^{-1} appears.
  • Prove that the logistic regression cost function is convex.
  • Trace through one complete backpropagation example by hand.
  • Run the K-means algorithm on a small example with three clusters.

What to understand conceptually:

  • Why does the SVM maximize margin? What does margin correspond to geometrically?
  • Why is the kernel trick computationally important, not just theoretically interesting?
  • What is the difference between a generative and discriminative model, and why does it matter?

For a structured approach to note-taking during technical lectures, see our guide on reverse engineering college courses from lecture recordings. The active recall techniques in our active recall guide apply directly to the self-test exercises in the official CS229 lecture notes.

Are these the official Stanford CS229 course notes?

No — these are an independent free study reference written from the lectures and the publicly available CS229 materials. The official Stanford CS229 lecture notes are written by Andrew Ng and the CS229 teaching staff, and are hosted at cs229.stanford.edu. They are also free. The two complement each other: the official notes are the canonical reference and are more rigorous in places; these notes are more navigable, hyperlinked across topics, and structured for active recall.

Where can I find the official CS229 course notes in 2026?

The official Stanford CS229 lecture notes, problem sets, and supplementary materials are at cs229.stanford.edu and are updated each academic year. The 2026 syllabus is the same in structure as previous recent years — supervised learning → learning theory → unsupervised learning → reinforcement learning — with minor updates to the deep learning and reinforcement learning blocks reflecting current research. The lectures are also recorded and posted to YouTube each year.

Is CS229 free?

Yes — the lectures, lecture notes, problem sets, and supplementary materials are all free on the Stanford CS229 website. Stanford does not offer the on-campus course for credit to non-enrolled students, but the entire curriculum is available for self-study at no cost. The free Coursera version of Andrew Ng's ML course (notes here) is a lighter alternative if CS229's mathematical depth is intimidating.

What is the difference between CS229 and Andrew Ng's Coursera ML course?

CS229 is the Stanford on-campus graduate course. Andrew Ng's Coursera ML course is a deliberately simplified version covering the same major topics with less mathematical rigor. CS229 derives algorithms from first principles, includes learning theory (VC dimension, PAC framework), and covers topics the Coursera version omits (Gaussian processes, EM algorithm, factor analysis). For a side-by-side: CS229 is ~6 hours/week of dense math; Coursera ML is ~3 hours/week of applied implementation.

What math do I need for CS229?

Comfort with linear algebra (eigenvalues, projections, the SVD), multivariable calculus (gradients, Jacobians, chain rule), and probability theory (Bayes' rule, expectation, common distributions, the central limit theorem). If any of those feel shaky, brush up before starting CS229 — the course assumes them from lecture 1. The MIT 18.06 linear algebra notes cover the linear algebra prerequisites.

In what order should I cover the CS229 topics?

Follow the official syllabus order — supervised learning → learning theory → unsupervised → reinforcement. The order is not arbitrary: each block uses ideas from the previous one. Supervised learning establishes the vocabulary (loss functions, optimization, generalization). Learning theory explains why the algorithms work. Unsupervised learning generalizes the same math to the no-label case. Reinforcement learning is the most advanced and assumes everything before it.


CS229 is one of the highest-leverage investments you can make in your ML education. The mathematical foundations it builds pay dividends for years. If you study it with Notiq — paste the lecture URL, get structured notes automatically, and review with spaced repetition — you will retain far more than passive watching allows.

Start your CS229 study at notiq.study

Share this article

Related Articles