CS229 SVM Dual Formulation — Complete Derivation Notes (Lagrangian + KKT + Kernel Trick)

Machine Learning Stanford CS229 SVM Lecture Notes Convex Optimization

June 4, 2026·15 min read

Share this article

These are complete CS229 SVM dual formulation notes — the full derivation Andrew Ng works through across Lecture 6 and Lecture 7 of Stanford's machine learning course. If you came here from search looking for "CS229 SVM dual derivation notes" or "CS229 SVM dual form notes," you are in the right place: every step from the primal max-margin problem to the dual is written out below, with the Lagrangian, the KKT conditions, and the substitution that makes the kernel trick possible.

If you want the broader overview of every CS229 lecture, see our complete CS229 lecture notes. This page is the deep dive on one specific topic: the dual form of the support vector machine.

Quick orientation: what "SVM dual" means and why CS229 spends so much time on it

CS229 introduces the support vector machine as a margin-maximization problem. You have a labelled dataset, you want a hyperplane that separates the classes, and among all separating hyperplanes you specifically want the one that pushes the closest training points as far away from the boundary as possible. That formulation — the one Ng writes on the board first — is called the primal.

The dual is the same optimization problem rewritten in a different set of variables. Mathematically the two are equivalent (they share the same optimum) but the dual has three properties that make it the form everyone actually uses in practice:

The dual depends on the training data only through inner products $\langle x^{(i)}, x^{(j)} \rangle$ — never on the raw feature vectors. This is what makes the kernel trick possible.
The dual has the same number of variables as you have training examples ( $m$ ), regardless of how high-dimensional your feature space is. If you map into an infinite-dimensional space with an RBF kernel, the dual stays finite.
The dual makes it explicit which training points actually determine the boundary (the support vectors). For most points, the dual variable is zero. Only support vectors carry a non-zero dual.

The reason CS229 derives the dual in such painful detail is that the derivation isn't really about SVMs — it's about Lagrange duality, a tool that recurs constantly in machine learning research. Once you have done it once for SVMs, you have done it once for many other problems.

The primal problem (recap from CS229 Lecture 6)

The geometric-margin SVM primal — what Ng writes after the long warm-up about functional vs geometric margins — is:

\min_{w, b} \; \frac{1}{2} \|w\|^2

subject to:

y^{(i)} (w^\top x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m

A few things worth noting before we attack the dual:

We are minimizing $\frac{1}{2} \|w\|^2$ , not maximizing the margin directly. The trick is that the constraint $y^{(i)}(w^\top x^{(i)} + b) \ge 1$ fixes the functional margin to 1, which means the geometric margin equals $1/\|w\|$ . Minimizing $\|w\|^2$ maximizes that geometric margin. The factor of $\frac{1}{2}$ is there to keep the gradient clean later.
The problem is convex — quadratic objective, linear constraints. That means strong duality holds, and the KKT conditions are both necessary and sufficient.
This is the hard-margin version. We will derive the soft-margin dual after the hard-margin one because the structure is identical and the soft-margin slack variables just introduce a box constraint on the duals at the end.

Building the Lagrangian

The general recipe for a constrained optimization problem $\min f(w) \text{ s.t. } g_i(w) \le 0$ is to write the Lagrangian by attaching a non-negative multiplier $\alpha_i \ge 0$ to each inequality constraint:

\mathcal{L}(w, \alpha) = f(w) + \sum_i \alpha_i g_i(w)

Two things go in that definition. First, we need the constraints in the standard form $g_i(w) \le 0$ . Ours are written as $y^{(i)}(w^\top x^{(i)} + b) \ge 1$ , which is equivalent to:

g_i(w, b) = 1 - y^{(i)}(w^\top x^{(i)} + b) \le 0

Second, we have two primal variables ( $w$ and $b$ ), so the Lagrangian is in both. Plugging in:

\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 \right]

The sign in front of the sum is minus because we converted $g_i \le 0$ above and the standard form is $+\alpha_i g_i$ . Expanded:

\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i y^{(i)} (w^\top x^{(i)} + b) + \sum_i \alpha_i

This Lagrangian is the object we will minimize over $(w, b)$ to get the dual function $\theta_\mathcal{D}(\alpha)$ , and then maximize over $\alpha$ .

Minimizing the Lagrangian over the primal variables

This is the step where the dual form starts to fall out of the algebra. We take partial derivatives of $\mathcal{L}$ with respect to $w$ and $b$ , set them to zero, and substitute the resulting equations back into the Lagrangian.

Derivative with respect to $w$ :

\nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0

This gives us the first KKT stationarity condition:

w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}

This identity already does a lot of work — it tells you that the optimal weight vector is a linear combination of the training points, weighted by their (signed) dual variable. Points with $\alpha_i = 0$ contribute nothing.

Derivative with respect to $b$ :

\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0

So:

\sum_{i=1}^{m} \alpha_i y^{(i)} = 0

This is a constraint on the duals — they must be balanced across the two classes (since $y^{(i)} \in \{-1, +1\}$ ). It does not give us a value for $b$ . We will recover $b$ later from the KKT complementary slackness condition.

Substituting back to get the dual function

Now we plug $w = \sum_i \alpha_i y^{(i)} x^{(i)}$ and $\sum_i \alpha_i y^{(i)} = 0$ back into the Lagrangian. This is the bookkeeping step that is easy to mess up — work through it slowly.

Term 1: $\frac{1}{2} \|w\|^2$ . Using the substitution:

\frac{1}{2} \|w\|^2 = \frac{1}{2} w^\top w = \frac{1}{2} \left( \sum_i \alpha_i y^{(i)} x^{(i)} \right)^\top \left( \sum_j \alpha_j y^{(j)} x^{(j)} \right) = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

Term 2: $-\sum_i \alpha_i y^{(i)} (w^\top x^{(i)} + b)$ .

Distribute:

-\sum_i \alpha_i y^{(i)} w^\top x^{(i)} - b \sum_i \alpha_i y^{(i)}

The second piece vanishes because $\sum_i \alpha_i y^{(i)} = 0$ . The first piece, substituting $w$ again:

-\sum_i \alpha_i y^{(i)} \left( \sum_j \alpha_j y^{(j)} x^{(j)} \right)^\top x^{(i)} = -\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

Term 3: $+\sum_i \alpha_i$ . Stays as is.

Adding the three terms:

\mathcal{L} = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle - \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle + \sum_i \alpha_i

The two double sums combine — one has coefficient $\frac{1}{2}$ , the other $-1$ , leaving $-\frac{1}{2}$ . The result is the dual function:

\theta_\mathcal{D}(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

This is the equation Ng circles on the board and underlines twice. Memorize the structure: a linear term $\sum \alpha_i$ minus a quadratic term where every pair contributes a kernel-weighted product.

The dual problem

Putting the constraints back in, the dual optimization is:

\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

subject to:

\alpha_i \ge 0, \quad i = 1, \ldots, m

and:

\sum_{i=1}^{m} \alpha_i y^{(i)} = 0

That is the hard-margin SVM dual. Compare it to the primal — both are convex quadratic programs, both have the same optimum, but the dual has only $m$ variables and depends on the data only through inner products. That second property is the whole game.

KKT conditions and recovering $b$

The KKT conditions are the system of equations that must hold at the optimum of a convex problem with constraints. There are four:

Stationarity — gradients of the Lagrangian vanish (we already used these to get $w$ and the constraint on $\alpha$ ).
Primal feasibility — $y^{(i)}(w^\top x^{(i)} + b) \ge 1$ for all $i$ .
Dual feasibility — $\alpha_i \ge 0$ for all $i$ .
Complementary slackness — $\alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right] = 0$ for all $i$ .

Complementary slackness is the magic one. It says that for every training point, either:

$\alpha_i = 0$ (the point doesn't contribute to $w$ and isn't a support vector), or
$y^{(i)}(w^\top x^{(i)} + b) = 1$ (the functional margin is exactly 1 — the point sits on the margin boundary).

Points with $\alpha_i > 0$ are by definition support vectors. They are the only training examples that determine the hyperplane. To recover $b$ , pick any support vector $i^*$ (any point with $\alpha_{i^*} > 0$ ) and solve:

y^{(i^*)}(w^\top x^{(i^*)} + b) = 1 \implies b = y^{(i^*)} - w^\top x^{(i^*)}

In practice you average $b$ across all support vectors for numerical stability.

Why the dual form makes the kernel trick possible

The kernel trick is what made SVMs the dominant classifier from roughly 1995 to 2012, and it is impossible to do in the primal. Look once more at the dual:

\theta_\mathcal{D}(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

The features $x^{(i)}$ appear only in the inner product $\langle x^{(i)}, x^{(j)} \rangle$ . Nowhere else.

That means if we are willing to map the data into a higher-dimensional feature space $\phi(x)$ — to allow non-linear decision boundaries — we never need to actually compute $\phi(x)$ . We only need a function $K(x, z) = \langle \phi(x), \phi(z) \rangle$ that returns the inner product directly. That function is a kernel.

The famous example is the Gaussian (RBF) kernel:

K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)

This kernel corresponds to an infinite-dimensional feature map. In the primal we would have to store and optimize over infinite-dimensional $w$ , which is impossible. In the dual we just compute $K(x^{(i)}, x^{(j)})$ for each pair of training points, which is $O(m^2)$ kernel evaluations. The optimization stays finite-dimensional in $\alpha$ .

This is the single biggest reason CS229 spends two full lectures on the dual derivation: not because SVMs are uniquely important, but because the kernel trick — and the broader idea that an algorithm depending only on inner products can be silently lifted into a richer feature space — is one of the most reused ideas in classical ML.

The soft-margin extension

The hard-margin SVM only works on linearly separable data. The soft-margin version, which Ng derives in Lecture 7, allows misclassifications via slack variables $\xi_i$ and a penalty parameter $C$ :

\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i

subject to:

y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

Run through the same Lagrangian-and-substitute exercise (now with multipliers $\alpha_i$ for the margin constraints and $r_i$ for $\xi_i \ge 0$ ) and the dual comes out almost identical:

\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

subject to:

0 \le \alpha_i \le C, \quad \sum_i \alpha_i y^{(i)} = 0

The only change is that $\alpha_i$ now has an upper bound $C$ (the box constraint). The dual function is unchanged. The kernel trick still works the same way. SMO — the optimizer Ng covers next — exploits exactly this box-constrained structure.

Worked example: SVM on the XOR-like dataset

Suppose you have four training points in 2D:

$x^{(1)} = (1, 1)$ , $y^{(1)} = +1$
$x^{(2)} = (-1, -1)$ , $y^{(2)} = +1$
$x^{(3)} = (1, -1)$ , $y^{(3)} = -1$
$x^{(4)} = (-1, 1)$ , $y^{(4)} = -1$

These are not linearly separable in the original space. But with a polynomial kernel $K(x, z) = (1 + x^\top z)^2$ , which corresponds to mapping into a 6-dimensional space including all pairwise products, the data becomes separable.

The dual problem becomes a 4-variable QP. You compute the $4 \times 4$ kernel matrix:

K_{ij} = (1 + x^{(i)\top} x^{(j)})^2

then solve for $\alpha$ , and the resulting classifier is:

f(x) = \text{sign}\left( \sum_{i=1}^{4} \alpha_i y^{(i)} K(x^{(i)}, x) + b \right)

You never compute $\phi(x)$ explicitly. You only ever evaluate $K$ .

Common derivation pitfalls

Three places where CS229 problem sets historically catch students:

Sign errors in the Lagrangian. The constraint $y^{(i)}(w^\top x^{(i)} + b) \ge 1$ becomes $g_i = 1 - y^{(i)}(\ldots) \le 0$ . The Lagrangian adds $+\alpha_i g_i$ , which is $-\alpha_i \left[y^{(i)}(\ldots) - 1\right]$ . Drop a sign here and the dual flips into a $\min$ instead of a $\max$ .
Forgetting $\sum_i \alpha_i y^{(i)} = 0$ . This is not a side note — it's the constraint that makes the $b$ term in the Lagrangian vanish when you substitute. Without it, the substitution doesn't simplify cleanly.
Confusing functional and geometric margin. The primal constraint uses functional margin = 1; the objective minimizes $\|w\|^2$ , which maximizes the geometric margin. They are related by $\gamma_{\text{geo}} = \gamma_{\text{func}} / \|w\|$ . Slipping between the two without realizing it is the most common source of off-by-a-factor errors.

How this fits with the rest of CS229

The SVM dual is the case study for Lagrange duality, but the same machinery shows up in:

Logistic regression with $\ell_1$ regularization — different objective, similar duality argument
The kernel perceptron — direct dual representation, no QP needed
Gaussian processes — also depend only on inner products / kernel matrices
Many other CS229 problem-set problems that ask you to derive a dual from a constrained primal

If you want the broader course, the complete CS229 lecture notes cover every major topic — linear regression, GLMs, neural networks, unsupervised learning, RL. The natural follow-on to this post is the SMO optimizer Ng presents in Lecture 8, which solves the box-constrained QP that this derivation set up.

What does "SVM dual" actually mean?

It's the optimization problem you get when you rewrite the SVM margin-maximization problem (the primal) using Lagrange multipliers and then maximize over those multipliers instead of the original weight vector $w$ . Mathematically equivalent — same optimum — but expressed in different variables that turn out to be more useful.

Why does CS229 derive the dual at all instead of just using the primal?

Two reasons. First, the dual depends on the data only through inner products $\langle x^{(i)}, x^{(j)} \rangle$ , which makes the kernel trick possible — you can implicitly map into arbitrarily-high-dimensional feature spaces by just replacing the inner product with a kernel. The primal can't do this. Second, the dual has the same number of variables as you have training points, regardless of feature dimension. With an RBF kernel (infinite-dimensional feature space) the primal is uncomputable; the dual stays finite.

What are the KKT conditions and which ones matter for SVMs?

KKT (Karush-Kuhn-Tucker) conditions are the necessary-and-sufficient optimality conditions for a convex constrained problem. There are four: stationarity (gradients vanish), primal feasibility (constraints hold), dual feasibility ( $\alpha_i \ge 0$ ), and complementary slackness ( $\alpha_i \cdot g_i = 0$ for every constraint). For SVMs, complementary slackness is the important one — it tells you that for every training point, either $\alpha_i = 0$ (not a support vector) or the point is exactly on the margin. That's how you identify support vectors and how you recover the bias term $b$ .

What's the difference between SVM primal and dual form notes?

The primal optimizes over the weight vector $w$ and bias $b$ . The dual optimizes over Lagrange multipliers $\alpha_i$ , one per training point. The two formulations have the same optimum (strong duality), but the dual lets you use kernels and makes support vectors explicit. Most software (LIBSVM, sklearn) actually solves the dual, even when you ask for "an SVM."

Where does the kernel trick fit in this derivation?

Look at the final dual objective: $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle$ . The training features $x^{(i)}$ appear only inside the inner product. Replace every $\langle x^{(i)}, x^{(j)} \rangle$ with a kernel $K(x^{(i)}, x^{(j)})$ and you've implicitly mapped your data into whatever feature space that kernel corresponds to — without ever computing the mapping. That's the kernel trick. It only works because the dual is structured this way.

How do I recover $b$ after solving the dual?

Pick any support vector $i^*$ — any training point with $\alpha_{i^*} > 0$ . By complementary slackness, that point sits exactly on the margin boundary, so $y^{(i^*)}(w^\top x^{(i^*)} + b) = 1$ . Solve for $b$ : $b = y^{(i^*)} - w^\top x^{(i^*)}$ , where $w = \sum_i \alpha_i y^{(i)} x^{(i)}$ . For numerical stability, average across all support vectors.

What's the soft-margin SVM dual?

Same dual objective. The only change is the constraint: $0 \le \alpha_i \le C$ instead of $\alpha_i \ge 0$ . The upper bound $C$ is the regularization parameter from the primal — it caps how much any single training point can influence the boundary. The constraint $\sum_i \alpha_i y^{(i)} = 0$ stays the same. The kernel trick still works identically.

Turn any Stanford lecture into notes like these

These notes were synthesized from the CS229 SVM lectures. Notiq does the same thing for any YouTube lecture — paste a link, get back structured notes with the derivations written out, flashcards, exam questions, and worked examples. Try the Stanford CS229 SVM lecture in Notiq's library or paste your own lecture.

The Notiq Team

Share this article

LinkedIn X / Twitter

On This Page

Study tips in your inbox

Weekly insights on AI studying, note-taking science, and getting more from video lectures.

CS229 SVM Dual Formulation — Complete Derivation Notes (Lagrangian + KKT + Kernel Trick)

Quick orientation: what "SVM dual" means and why CS229 spends so much time on it

The primal problem (recap from CS229 Lecture 6)

Building the Lagrangian

Minimizing the Lagrangian over the primal variables

Substituting back to get the dual function

The dual problem

KKT conditions and recovering $b$

Why the dual form makes the kernel trick possible

The soft-margin extension

Worked example: SVM on the XOR-like dataset

Common derivation pitfalls

How this fits with the rest of CS229

What does "SVM dual" actually mean?

Why does CS229 derive the dual at all instead of just using the primal?

What are the KKT conditions and which ones matter for SVMs?

What's the difference between SVM primal and dual form notes?

Where does the kernel trick fit in this derivation?

How do I recover $b$ after solving the dual?

What's the soft-margin SVM dual?

Turn any Stanford lecture into notes like these

Related Articles

Stanford CS229 Machine Learning Course Notes (Official 2026 Syllabus) — Free Complete Reference

How to Learn Machine Learning from YouTube: A Rigorous Self-Study Roadmap

SICP Notes — Structure and Interpretation of Computer Programs (MIT 6.001)

CS229 SVM Dual Formulation — Complete Derivation Notes (Lagrangian + KKT + Kernel Trick)

Quick orientation: what "SVM dual" means and why CS229 spends so much time on it

The primal problem (recap from CS229 Lecture 6)

Building the Lagrangian

Minimizing the Lagrangian over the primal variables

Substituting back to get the dual function

The dual problem

KKT conditions and recovering bbb

Why the dual form makes the kernel trick possible

The soft-margin extension

Worked example: SVM on the XOR-like dataset

Common derivation pitfalls

How this fits with the rest of CS229

What does "SVM dual" actually mean?

Why does CS229 derive the dual at all instead of just using the primal?

What are the KKT conditions and which ones matter for SVMs?

What's the difference between SVM primal and dual form notes?

Where does the kernel trick fit in this derivation?

How do I recover bbb after solving the dual?

What's the soft-margin SVM dual?

Turn any Stanford lecture into notes like these

Related Articles

Stanford CS229 Machine Learning Course Notes (Official 2026 Syllabus) — Free Complete Reference

How to Learn Machine Learning from YouTube: A Rigorous Self-Study Roadmap

SICP Notes — Structure and Interpretation of Computer Programs (MIT 6.001)

KKT conditions and recovering $b$

How do I recover $b$ after solving the dual?