Andrew Ng's Machine Learning Course: Notes from All 11 Weeks

Andrew Ng's Machine Learning course is the course that introduced a generation of engineers to ML. It launched on Coursera in 2012, has been taken by over five million people, and remains — despite the explosion of newer content — the clearest end-to-end introduction to classical machine learning that exists. The explanations are precise without being inaccessible, the math is shown but not fetishized, and the intuition built across 11 weeks is genuine.

These Andrew Ng machine learning notes cover every week in detail. They are written as a study reference: the kind of document you return to before an interview, before starting a new project, or when a concept you learned two years ago has started to blur.

If you want a framework for how to approach lecture-heavy courses like this one, read our guide on how to learn from YouTube lectures. For notes from a comparable course, see the Stanford CS230 Deep Learning notes.

Week 1: Introduction and Linear Regression with One Variable

The course opens with a framing question: what is machine learning, and why does it matter? Ng defines it as giving computers the ability to learn without being explicitly programmed — a definition borrowed from Arthur Samuel, who coined the term in 1959.

Two types of learning problems are introduced immediately:

Supervised learning — you have labeled training data. You feed the algorithm input-output pairs and it learns to predict outputs for new inputs. Regression (predicting a continuous value) and classification (predicting a category) are the two subtypes.

Unsupervised learning — you have data without labels. The algorithm must find structure on its own. Clustering is the canonical example.

Week 1 then moves to the first concrete model: linear regression with one variable (univariate linear regression). The hypothesis is:

h(x) = θ₀ + θ₁x

The cost function measures how wrong the current parameters are:

J(θ₀, θ₁) = (1/2m) Σ (h(xᵢ) - yᵢ)²

This is the mean squared error divided by two — the factor of two is a convenience that cancels with the derivative. Minimizing J is the goal of training.

Gradient descent is introduced as the optimization algorithm: repeatedly update each parameter by subtracting a fraction of the partial derivative of J with respect to that parameter. The learning rate α controls the step size. Too large and the algorithm diverges. Too small and it converges slowly.

The key intuition: gradient descent moves downhill on the cost surface. For linear regression, the cost surface is a convex bowl with a single global minimum, so gradient descent converges to it whenever α is small enough.
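
As a minimal sketch of the update rule described above, the following pure-Python loop runs batch gradient descent for the one-variable hypothesis. The toy data (points on y = 1 + 2x), the learning rate of 0.1, and the function name gradient_descent are assumptions chosen for illustration, not values from the course.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error h(x_i) - y_i for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical toy data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this data the parameters approach θ₀ = 1 and θ₁ = 2. Raising alpha well above roughly 0.45 makes the same loop diverge, which is the "too large" failure mode described above.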

Week 2: Linear Regression with Multiple Variables

Extending to multiple features is mostly notational. The hypothesis becomes:

h(x) = θᵀx = θ₀x₀ + θ₁x₁ + ... + θₙxₙ

where x₀ = 1 by convention, making θ₀ the bias term.

Feature scaling becomes important here. If features have very different ranges — house size in square feet (1000–5000) versus number of bedrooms (1–5) — gradient descent converges slowly because the cost surface is elongated. The fix: normalize features to roughly the same scale, either by dividing by the range or using Z-score normalization.
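
A short sketch of Z-score normalization with NumPy follows. The house data and the function name zscore_normalize are made up for illustration. The sketch assumes no feature is constant, since a zero standard deviation would cause a division by zero.

```python
import numpy as np


def zscore_normalize(X):
    """Scale each column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # assumed non-zero for every feature
    return (X - mu) / sigma, mu, sigma


# Hypothetical examples: [size in square feet, number of bedrooms]
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [1416.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X)
```

The returned mu and sigma should be reused to scale any new example before prediction, so that training and prediction see features on the same scale.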

The week also introduces the normal equation as an analytical alternative to gradient descent:

θ = (XᵀX)⁻¹Xᵀy

This gives the exact solution in one step: no learning rate, no iterations. The catch: it requires inverting XᵀX, an n × n matrix, which costs roughly O(n³). As a rule of thumb, once n grows past about 10,000 features, gradient descent is usually the faster choice.

When is XᵀX non-invertible? When features are linearly dependent (redundant features) or when you have more features than training examples. These are signs of poor problem setup, not algorithm failure.
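
The normal equation can be sketched in a few lines of NumPy. This version uses the pseudo-inverse (np.linalg.pinv) rather than a plain inverse so that it still returns a solution when XᵀX is singular; that choice, the toy data, and the function name are assumptions for illustration.

```python
import numpy as np


def normal_equation(X, y):
    """Solve for theta analytically; pinv tolerates a singular X^T X."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y


# First column is x0 = 1 (bias); toy targets lie on y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)
```

No feature scaling is needed for this approach, since there is no iterative descent whose convergence the scaling would help.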

Week 3: Logistic Regression and Classification

Linear regression predicts continuous values. Classification predicts discrete categories. For binary classification, the target y ∈ {0, 1}.

The key move: replace the linear hypothesis with the sigmoid function:

h(x) = 1 / (1 + e^(-θᵀx))

The sigmoid squashes any real-valued input to (0, 1), which can be interpreted as the probability that y = 1. The decision boundary is h(x) = 0.5, i.e., θᵀx = 0.

The cost function for logistic regression cannot be the same squared-error cost used for linear regression — that would produce a non-convex surface with many local minima. Instead:

J(θ) = -(1/m) Σ [yᵢ log(h(xᵢ)) + (1-yᵢ) log(1-h(xᵢ))]

This is cross-entropy loss. Gradient descent still applies and the update rule looks identical to linear regression — it just means something different because h is now the sigmoid.
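
To make the cost and its gradient concrete, here is a small NumPy sketch. The function names and the four-example dataset are hypothetical. The gradient line has the same form as the linear regression gradient, with h swapped for the sigmoid.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_cost_and_grad(theta, X, y):
    """Cross-entropy cost J(theta) and its gradient for logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m  # same form as the linear regression gradient
    return cost, grad


# Hypothetical data: first column is the bias feature x0 = 1
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
cost0, grad0 = logistic_cost_and_grad(np.zeros(2), X, y)
```

With θ = 0 every prediction is 0.5, so the cost equals log 2 ≈ 0.693, a handy sanity check when implementing this from scratch.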

Regularization is introduced here and used in every model from here on. L2 regularization adds a penalty term to J:

J(θ) = -(1/m) Σ [log-loss] + (λ/2m) Σⱼ θⱼ²

The regularization parameter λ controls the trade-off between fitting the training data and keeping the weights small. Large λ → underfitting. Small λ → overfitting. The bias term θ₀ is not regularized by convention.
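
A sketch of the regularized cost and gradient, assuming the same NumPy conventions (bias in column 0 of X); the function name and data are illustrative. The only changes from the unregularized version are the penalty term and the matching (λ/m)θⱼ term in the gradient, both of which skip θ₀.

```python
import numpy as np


def regularized_cost_and_grad(theta, X, y, lam):
    """L2-regularized logistic regression cost and gradient."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    penalized = theta.copy()
    penalized[0] = 0.0  # the bias term theta_0 is not regularized
    cost = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    cost += lam / (2 * m) * (penalized @ penalized)
    grad = X.T @ (h - y) / m + (lam / m) * penalized
    return cost, grad


# Hypothetical two-example dataset and parameter vector
X = np.array([[1.0, 0.5], [1.0, -1.0]])
y = np.array([1.0, 0.0])
theta = np.array([1.0, 2.0])
cost_plain, grad_plain = regularized_cost_and_grad(theta, X, y, lam=0.0)
cost_reg, grad_reg = regularized_cost_and_grad(theta, X, y, lam=4.0)
```

Here λ = 4 with m = 2 and θ₁ = 2 adds exactly (4/4)·2² = 4 to the cost and (4/2)·2 = 4 to the θ₁ gradient, while the θ₀ gradient is unchanged.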

Week 4: Neural Networks — Representation

The motivation for neural networks: logistic regression cannot learn complex non-linear decision boundaries well, even with feature engineering. Neural networks approximate arbitrary functions.

A neural network is a directed graph of computational units called neurons. Each neuron computes a weighted sum of its inputs, adds a bias, and applies an activation function (sigmoid, for now). Neurons are arranged in layers: input → one or more hidden layers → output.

Forward propagation is the process of computing the output for a given input:

a⁽¹⁾ = x (input layer activations are just the features)
z⁽²⁾ = Θ⁽¹⁾a⁽¹⁾ (linear combination, after prepending the bias unit a₀⁽¹⁾ = 1)
a⁽²⁾ = g(z⁽²⁾) (apply sigmoid element-wise)
Repeat for each subsequent layer
h(x) = a⁽L⁾ (output of last layer)

The weight matrix Θ⁽ˡ⁾ has shape (s_{l+1} × (s_l + 1)) where s_l is the number of units in layer l.
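
The forward pass can be sketched as a short loop. The weights below are hand-picked so that the two-layer network computes XNOR, in the spirit of the logic-gate examples used to motivate hidden layers; treat the specific numbers and the function name forward_propagate as illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward_propagate(x, thetas):
    """Return the output-layer activations for a single input vector x."""
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # prepend the bias unit
        a = sigmoid(theta @ a)          # theta has shape (s_next, s_current + 1)
    return a


# Hidden unit 1 computes AND, hidden unit 2 computes (NOT x1) AND (NOT x2)
theta1 = np.array([[-30.0, 20.0, 20.0], [10.0, -20.0, -20.0]])
# The output unit ORs the two hidden units, giving XNOR overall
theta2 = np.array([[-10.0, 20.0, 20.0]])

outputs = {
    (x1, x2): forward_propagate([x1, x2], [theta1, theta2])[0]
    for x1 in (0, 1)
    for x2 in (0, 1)
}
```

The output is close to 1 when the two inputs agree and close to 0 when they differ, something no single logistic unit can represent.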

The key insight about why neural networks can approximate complex functions: each hidden layer learns a representation of the input. The early layers learn low-level features (edges, textures in an image), deeper layers combine those into higher-level concepts (faces, objects). This hierarchy of representations is what makes deep networks powerful.

Week 5: Neural Networks — Learning

Backpropagation is the algorithm for computing the gradient of the cost function with respect to each weight in the network. Without it, training deep networks would require finite-difference approximation — too slow to be practical.

The cost function for a neural network with K output classes:

J(Θ) = -(1/m) Σᵢ Σₖ [y_k^(i) log(h_k(x^(i))) + (1 - y_k^(i)) log(1 - h_k(x^(i)))] + regularization

Backpropagation computes δ⁽ˡ⁾ — the "error" at each layer — by propagating gradients backward from the output:

δ⁽L⁾ = a⁽L⁾ - y (output error)
δ⁽ˡ⁾ = (Θ⁽ˡ⁾)ᵀδ⁽ˡ⁺¹⁾ .* g'(z⁽ˡ⁾) (propagate backward)
∂J/∂Θ⁽ˡ⁾ᵢⱼ = (1/m) Σₜ aⱼ⁽ˡ⁾ δᵢ⁽ˡ⁺¹⁾ + regularization (aⱼ⁽ˡ⁾ and δᵢ⁽ˡ⁺¹⁾ evaluated on training example t, summed over all m examples)

Gradient checking is introduced as a debugging technique: compute the numerical gradient using finite differences and verify it matches the analytical gradient from backprop. Always disable gradient checking before actually training: the numerical version needs two cost evaluations per parameter, so it is far too slow to run on every iteration.
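
A minimal sketch of the numerical side of gradient checking, using a two-sided difference. The step size of 1e-4 and the quadratic test function are assumptions for illustration; in real use cost_fn would be the network's cost as a function of its unrolled parameters.

```python
import numpy as np


def numerical_gradient(cost_fn, theta, eps=1e-4):
    """Approximate dJ/dtheta_i with a centered finite difference."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        # Two cost evaluations per parameter: this is why it is slow
        grad[i] = (cost_fn(theta + step) - cost_fn(theta - step)) / (2 * eps)
    return grad


# Sanity check on J(theta) = sum(theta^2), whose true gradient is 2 * theta
theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(lambda t: np.sum(t ** 2), theta)
analytic = 2 * theta
```

If approx and the backprop gradient agree to several decimal places, the backprop implementation is very likely correct.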

Random initialization is critical: if all weights start at zero, all neurons in a layer compute the same function and the network never learns different features. Initialize each weight to a small random value drawn uniformly from [-ε, ε] (the course exercise uses ε ≈ 0.12).
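
Symmetry breaking takes only a couple of lines. This sketch assumes a uniform draw from [-ε, ε] and a layer with 400 inputs and 25 hidden units; the layer sizes, the seed, and the function name random_init are illustrative choices.

```python
import numpy as np


def random_init(n_out, n_in, epsilon=0.12, seed=0):
    """Weight matrix of shape (n_out, n_in + 1) with entries in [-epsilon, epsilon)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-epsilon, epsilon, size=(n_out, n_in + 1))


# Hypothetical layer: 400 input units (plus bias) feeding 25 hidden units
W = random_init(25, 400)
```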

Week 6: Advice for Applying Machine Learning

This is arguably the most practically valuable week in the course. It answers the question: your model is not working — what do you do next?

The diagnostic framework starts with the bias-variance trade-off:

High bias (underfitting): training error and validation error are both high. The model is too simple.
High variance (overfitting): training error is low but validation error is much higher. The model memorized the training set.

Fixes for high bias: add features, add polynomial features, decrease λ, use a more complex model.

Fixes for high variance: add more training data, reduce the feature set, increase λ.

Learning curves are the diagnostic tool: plot training error and validation error as a function of training set size. A high-bias model shows both curves converging at a high error. A high-variance model shows a large gap between training and validation curves.
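
The sketch below computes a learning curve for a deliberately underpowered model: a straight line fitted to quadratic data. The data, the use of np.linalg.lstsq as the fitting routine, and the function name learning_curve are assumptions for illustration.

```python
import numpy as np


def learning_curve(X_train, y_train, X_val, y_val):
    """Training and validation error as the training set grows."""
    train_err, val_err = [], []
    for m in range(2, len(y_train) + 1):
        # Fit on the first m examples only
        theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
        train_err.append(np.mean((X_train[:m] @ theta - y_train[:m]) ** 2) / 2)
        # Validation error is always measured on the full validation set
        val_err.append(np.mean((X_val @ theta - y_val) ** 2) / 2)
    return train_err, val_err


# Hypothetical data: y = x^2, modelled with a line (bias + x)
x_train = np.arange(10.0)
x_val = np.arange(0.5, 10.0)
X_train = np.column_stack([np.ones_like(x_train), x_train])
X_val = np.column_stack([np.ones_like(x_val), x_val])
train_err, val_err = learning_curve(X_train, x_train ** 2, X_val, x_val ** 2)
```

Training error starts at zero (two points are fit exactly by a line) and climbs to about 26.4, while validation error stays high as well. More data cannot close that gap to zero, which is the signature of high bias.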

The section on evaluating models introduces the train/validation/test split: use the training set to fit parameters, the validation set to select the model (hyperparameters), and the test set for a final unbiased estimate of performance. Never make model decisions based on test set performance — that leaks information.
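
A shuffled three-way split can be sketched as follows. The 60/20/20 proportions, the seed, and the function name are assumptions; other proportions are common when datasets are very large.

```python
import numpy as np


def train_val_test_split(X, y, seed=0):
    """Shuffle, then split 60% / 20% / 20% into train, validation, test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(0.6 * len(y))
    n_val = int(0.2 * len(y))
    train, val, test = np.split(order, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])


# Hypothetical dataset with 10 examples and 2 features
X = np.arange(20.0).reshape(10, 2)
y = np.arange(10)
(X_tr, y_tr), (X_va, y_va), (X_te, y_te) = train_val_test_split(X, y)
```

Hyperparameters such as λ or the polynomial degree are then chosen by comparing validation error, and the test set is touched once at the end.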

Week 7: Support Vector Machines

SVMs are a different approach to classification. Instead of maximizing likelihood (logistic regression), SVMs maximize the margin — the distance between the decision boundary and the nearest training examples from each class.

The optimization problem finds the hyperplane that separates the classes with the largest margin. The training examples that lie on the margin boundary are the support vectors — they are the only points that matter for determining the decision boundary.

The kernel trick is what makes SVMs powerful for non-linear problems. Instead of explicitly mapping features to a high-dimensional space, you define a kernel function K(x, l) that computes the similarity between a point x and a landmark l. The Gaussian (RBF) kernel is most common:

K(x, l) = exp(-||x - l||² / (2σ²))

With σ large, the similarity features vary smoothly: higher bias, lower variance. With σ small, they fall off sharply and can fit complex boundaries: lower bias, higher variance.
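
The kernel itself is a one-liner. In this sketch the function name gaussian_kernel and the sample points are illustrative.

```python
import numpy as np


def gaussian_kernel(x, l, sigma):
    """Similarity between point x and landmark l; 1 when they coincide."""
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-(diff @ diff) / (2 * sigma ** 2))


same_point = gaussian_kernel([1.0, 2.0], [1.0, 2.0], sigma=1.0)
wide = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0)    # exp(-25 / 50)
narrow = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=1.0)  # exp(-25 / 2)
```

For the same pair of points, the wide kernel still reports noticeable similarity while the narrow one is essentially zero, which is why small σ lets the boundary bend tightly around individual examples.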

The reason the kernel trick works at all is that the SVM optimization, once converted to its dual form, depends on the training data only through inner products — and a kernel function is just an inner product in a higher-dimensional space. For the full step-by-step derivation from the primal margin problem to the dual (Lagrangian, KKT, recovering the bias), see the dedicated CS229 SVM dual formulation notes.

When to use SVMs versus logistic regression? When n is large relative to m (many features, few examples), logistic regression or an SVM without a kernel works well. When n is small and m is intermediate, SVMs with a Gaussian kernel work well. When n is small and m is very large, add features and use logistic regression or a linear SVM.

Week 8: Unsupervised Learning and K-Means

The course shifts to unsupervised learning. Without labels, the goal is to find structure in the data.

K-means clustering is the primary algorithm:

Initialize K cluster centroids randomly (or using k-means++ for better initialization)
Assign each example to the nearest centroid
Move each centroid to the mean of the examples assigned to it
Repeat the assignment and update steps until the assignments stop changing

K-means minimizes the within-cluster sum of squared distances. It always converges but may converge to a local minimum. The fix: run K-means multiple times with different random initializations and take the best result.
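
The steps above can be sketched in NumPy as follows. This version initializes centroids by picking K distinct training examples at random (not k-means++), and keeps a centroid in place if its cluster becomes empty; both choices, along with the toy data, are assumptions for illustration.

```python
import numpy as np


def kmeans(X, k, iterations=100, seed=0):
    """Plain K-means; returns centroids, labels, and the distortion cost."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct training examples
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    cost = np.sum((X - centroids[labels]) ** 2)
    return centroids, labels, cost


# Two well-separated hypothetical clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels, cost = kmeans(X, k=2)
```

To guard against local minima, call kmeans with several different seed values and keep the run with the lowest cost.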

Choosing K is not purely algorithmic — the right K often depends on the downstream use. The elbow method (plotting cost versus K and looking for a kink) can help but is often ambiguous.

Principal Component Analysis (PCA) is introduced as the primary dimensionality reduction technique. PCA finds the directions of maximum variance in the data (principal components) and projects the data onto a lower-dimensional subspace.

The algorithm: zero-mean the data, compute the covariance matrix, compute its SVD, take the first k singular vectors as the principal components.
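
A compact sketch of that procedure, assuming features are already on comparable scales. The function name pca, the returned tuple, and the toy data are illustrative choices.

```python
import numpy as np


def pca(X, k):
    """Project X onto its first k principal components."""
    mu = X.mean(axis=0)
    X_centered = X - mu
    sigma = X_centered.T @ X_centered / len(X)  # covariance matrix
    U, S, _ = np.linalg.svd(sigma)
    U_reduce = U[:, :k]                         # first k principal directions
    Z = X_centered @ U_reduce                   # compressed representation
    variance_retained = S[:k].sum() / S.sum()
    return Z, U_reduce, mu, variance_retained


# Hypothetical 2-D data lying exactly on the line y = 2x
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
Z, U_reduce, mu, variance_retained = pca(X, k=1)
X_approx = Z @ U_reduce.T + mu  # reconstruction from the compressed data
```

Because this data is perfectly one-dimensional, a single component retains all the variance and the reconstruction matches the original. On real data, k is typically chosen as the smallest value that retains a target fraction of variance, such as 99%.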

PCA is useful for data compression, visualization, and as a preprocessing step to speed up supervised learning. Common mistake: using PCA to prevent overfitting. Regularization is the right tool for that. Use PCA to reduce storage/computation, not to fix high variance.

Week 9: Anomaly Detection and Recommender Systems

Anomaly detection flags unusual examples based on a model of what normal looks like. Fit a Gaussian distribution to each feature of the training data, compute the probability density p(x) for a new example as the product of the per-feature densities, and flag it as anomalous if p(x) < ε.
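
A minimal sketch of this density-based detector follows. The training data, the threshold value, and the function names are all assumptions; in practice ε would be chosen on a labeled validation set rather than fixed by hand.

```python
import numpy as np


def fit_gaussian(X):
    """Per-feature mean and variance of the (mostly normal) training data."""
    return X.mean(axis=0), X.var(axis=0)


def density(X, mu, var):
    """p(x) as the product of independent per-feature Gaussian densities."""
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)


# Hypothetical "normal" examples with two features
X_train = np.array([[1.0, 10.0], [1.2, 9.5], [0.8, 10.5], [1.1, 10.2], [0.9, 9.8]])
mu, var = fit_gaussian(X_train)

epsilon = 1e-3  # hypothetical threshold
X_new = np.array([[1.0, 10.0], [5.0, 3.0]])
is_anomaly = density(X_new, mu, var) < epsilon
```

The first new example sits at the training mean and passes; the second is many standard deviations away on both features and is flagged.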

When to use anomaly detection versus supervised learning? Use anomaly detection when you have very few positive examples (0–20 anomalies), when anomalies are heterogeneous (many different types), or when future anomalies might look different from past ones. Use supervised learning when you have enough positive examples to learn from.

Recommender systems are presented through the movie rating problem: users rate movies and the algorithm predicts ratings for unseen movies. Two approaches:

Content-based filtering: use features of the movies (genre, director, etc.) to predict ratings. Given user preferences (learned from past ratings), predict how much they would like each unrated movie.

Collaborative filtering: no movie features needed. The algorithm simultaneously learns user preference vectors and movie feature vectors by minimizing the prediction error over all known ratings. This is matrix factorization — decomposing the ratings matrix into user and item embeddings.
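
The joint objective can be sketched as a single function that returns the cost and both gradients. The layout assumed here (Y is movies × users, R marks which entries are rated, X holds movie features, Theta holds user preferences) and the function name are assumptions for illustration.

```python
import numpy as np


def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Collaborative filtering cost and gradients over rated entries only."""
    error = (X @ Theta.T - Y) * R  # zero out entries with no rating
    cost = 0.5 * np.sum(error ** 2)
    cost += lam / 2 * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = error @ Theta + lam * X
    Theta_grad = error.T @ X + lam * Theta
    return cost, X_grad, Theta_grad


# Hypothetical tiny problem: 2 movies, 2 users, 1 latent feature
X = np.array([[1.0], [2.0]])
Theta = np.array([[1.0], [0.5]])
Y = np.array([[3.0, 0.0], [0.0, 2.0]])
R = np.array([[1.0, 0.0], [1.0, 1.0]])  # user 2 has not rated movie 1
cost, X_grad, Theta_grad = cofi_cost_and_grads(X, Theta, Y, R, lam=0.0)
```

Training then alternates nothing: both X and Theta are updated in the same gradient step (for example X -= alpha * X_grad and Theta -= alpha * Theta_grad), starting from small random values so the learned features differ from one another.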

The practical insight: collaborative filtering works remarkably well with no domain knowledge. The learned features often correspond to interpretable concepts (action versus romance, etc.) even though no one told the algorithm about those concepts.

Week 10: Large-Scale Machine Learning

With millions of training examples, the standard gradient descent (batch gradient descent) is too slow — it requires a pass through the entire dataset per iteration. This week covers algorithms that scale.

Stochastic gradient descent (SGD): update the parameters after each single training example. Much faster per update, but the cost does not decrease monotonically — it oscillates. Converges to near the minimum rather than exactly.

Mini-batch gradient descent: update after each batch of b examples (typically 10–1000). A compromise: more stable than SGD, faster than batch. This is what is used in practice.
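
A sketch of mini-batch gradient descent for linear regression is shown below. The learning rate, batch size, epoch count, and noiseless toy data are assumptions picked so the example converges quickly. Setting batch_size=1 turns the same loop into stochastic gradient descent.

```python
import numpy as np


def minibatch_gradient_descent(X, y, alpha=0.5, batch_size=10, epochs=500, seed=0):
    """Linear regression trained with shuffled mini-batches."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle the examples every epoch
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ theta - y[batch]
            # Gradient estimated from this batch only
            theta -= alpha * X[batch].T @ error / len(batch)
    return theta


# Hypothetical noiseless data on y = 1 + 2x, with a bias column
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = minibatch_gradient_descent(X, y)
```

With noisy real data the parameters hover around the minimum instead of settling exactly, which is why the learning rate is often decayed over time.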

Online learning: a variant of SGD where you continuously train on a stream of new data. The model adapts to changing data distributions over time. Useful for systems where user behavior evolves.

Map-reduce for parallelizing gradient descent: split the training set across multiple machines, compute partial sums of the gradient on each machine, then sum across machines to get the full gradient. In the ideal case the speedup is close to linear in the number of machines; network latency and the aggregation step eat into it in practice.
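
The decomposition works because the batch gradient is a sum over examples. The sketch below simulates the workers sequentially in one process, so it shows only the arithmetic, not real distribution; the function names and the choice of four workers are hypothetical.

```python
import numpy as np


def partial_gradient_sum(theta, X_chunk, y_chunk):
    """Map step: unnormalized gradient sum over one worker's examples."""
    return X_chunk.T @ (X_chunk @ theta - y_chunk)


def mapreduce_gradient(theta, X, y, n_workers=4):
    """Reduce step: add the partial sums and divide by m."""
    chunks = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    partial_sums = [partial_gradient_sum(theta, Xc, yc) for Xc, yc in chunks]
    return sum(partial_sums) / len(y)


# Hypothetical linear regression data with a bias column
X = np.column_stack([np.ones(10), np.arange(10.0)])
y = 1.0 + 2.0 * np.arange(10.0)
theta = np.zeros(2)
grad = mapreduce_gradient(theta, X, y)
full_batch_grad = X.T @ (X @ theta - y) / len(y)
```

The combined result equals the ordinary full-batch gradient, so the optimizer itself does not need to change.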

The practical takeaway from this week: before trying complex algorithmic improvements, try mini-batch gradient descent with a well-tuned learning rate. It solves most large-scale training problems.

Week 11: Application Example — Photo OCR and the ML Pipeline

The final week uses photo OCR (reading text in images of street signs, storefronts, etc.) as a case study for building an ML pipeline. The pipeline has stages: text detection → character segmentation → character recognition.

The key concepts introduced:

The ML pipeline: most real-world ML systems are not a single model but a sequence of models, each taking the output of the previous as input. Debugging such systems requires knowing which component to improve.

Ceiling analysis: given a pipeline, how much would overall performance improve if each component were perfect? This tells you where to focus your engineering effort. If replacing the text detector with a perfect one would only improve the pipeline by 1% but replacing the character recognizer would improve it by 15%, work on the recognizer.
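
The bookkeeping behind ceiling analysis is simple enough to sketch directly. The accuracy figures below are illustrative placeholders: each entry is the overall pipeline accuracy measured after replacing that stage, and every stage before it, with ground-truth output.

```python
def ceiling_analysis(baseline, accuracy_with_perfect_stages):
    """Gain in overall accuracy attributable to perfecting each stage in turn."""
    gains = {}
    previous = baseline
    for stage, accuracy in accuracy_with_perfect_stages:
        gains[stage] = accuracy - previous
        previous = accuracy
    return gains


# Hypothetical overall accuracies, in percent
gains = ceiling_analysis(
    baseline=72,
    accuracy_with_perfect_stages=[
        ("text detection", 89),
        ("character segmentation", 90),
        ("character recognition", 100),
    ],
)
```

With these numbers, text detection offers the largest potential gain (17 points) and character segmentation the smallest (1 point), so effort spent on segmentation would be largely wasted.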

Artificial data synthesis: if you do not have enough training data, create it. For character recognition, take clean fonts and warp them with transformations (rotations, distortions) to create a large artificial training set. This technique has been validated empirically across many vision problems.

The bigger lesson: most practitioners spend too much time running algorithms and not enough time thinking about what to improve. Ceiling analysis is a simple but underused tool for making that decision with data.

What These Notes Do Not Replace

These notes cover the conceptual content of the course but cannot replace working through the programming assignments. The assignments implement everything from scratch in Octave/MATLAB: the gradient descent, the neural network, the SVM, the recommender system. The implementation experience builds intuition that passive reading does not.

The course is available for free on YouTube and has been translated into multiple languages. For a deeper treatment of neural networks and deep learning, the Stanford CS230 Deep Learning notes and the MIT 6.034 AI course notes are natural continuations.

For practical note-taking strategies when working through technical courses like this, see our guide on taking notes from YouTube lectures and the broader discussion of AI-assisted study notes.

The original course slides are available from Stanford. A strong external reference for the mathematical foundations of neural networks is Michael Nielsen's Neural Networks and Deep Learning: free and thorough, though it writes weights and biases as w and b rather than using Ng's Θ notation.

Is Andrew Ng's ML Course Still Worth Taking in 2026?

Yes, but with context. The course was last updated with the TensorFlow/Python version of the specialization. The original Octave-based version covers classical ML (weeks 1–8) better than any other resource — the explanations of gradient descent, regularization, and SVMs are still the clearest available.

What the course does not cover: modern deep learning architectures (transformers, diffusion models), large language models, or current best practices for training neural networks at scale. For those, you need CS231n, CS224n, or the fast.ai course.

Use Andrew Ng's ML course for what it is: the best possible foundation in classical supervised learning, with enough depth on neural networks to understand what the deeper courses are building on.