CS229 SVM Dual Formulation — Complete Derivation Notes (Lagrangian + KKT + Kernel Trick)

·15 min read
CS229 SVM Dual Formulation — Complete Derivation Notes (Lagrangian + KKT + Kernel Trick)

Share this article

These are complete CS229 SVM dual formulation notes — the full derivation Andrew Ng works through across Lecture 6 and Lecture 7 of Stanford's machine learning course. If you came here from search looking for "CS229 SVM dual derivation notes" or "CS229 SVM dual form notes," you are in the right place: every step from the primal max-margin problem to the dual is written out below, with the Lagrangian, the KKT conditions, and the substitution that makes the kernel trick possible.

If you want the broader overview of every CS229 lecture, see our complete CS229 lecture notes. This page is the deep dive on one specific topic: the dual form of the support vector machine.

Quick orientation: what "SVM dual" means and why CS229 spends so much time on it

CS229 introduces the support vector machine as a margin-maximization problem. You have a labelled dataset, you want a hyperplane that separates the classes, and among all separating hyperplanes you specifically want the one that pushes the closest training points as far away from the boundary as possible. That formulation — the one Ng writes on the board first — is called the primal.

The dual is the same optimization problem rewritten in a different set of variables. Mathematically the two are equivalent (they share the same optimum) but the dual has three properties that make it the form everyone actually uses in practice:

  1. The dual depends on the training data only through inner products x(i),x(j)\langle x^{(i)}, x^{(j)} \rangle — never on the raw feature vectors. This is what makes the kernel trick possible.
  2. The dual has the same number of variables as you have training examples (mm), regardless of how high-dimensional your feature space is. If you map into an infinite-dimensional space with an RBF kernel, the dual stays finite.
  3. The dual makes it explicit which training points actually determine the boundary (the support vectors). For most points, the dual variable is zero. Only support vectors carry a non-zero dual.

The reason CS229 derives the dual in such painful detail is that the derivation isn't really about SVMs — it's about Lagrange duality, a tool that recurs constantly in machine learning research. Once you have done it once for SVMs, you have done it once for many other problems.

The primal problem (recap from CS229 Lecture 6)

The geometric-margin SVM primal — what Ng writes after the long warm-up about functional vs geometric margins — is:

minw,b  12w2\min_{w, b} \; \frac{1}{2} \|w\|^2

subject to:

y(i)(wx(i)+b)1,i=1,,my^{(i)} (w^\top x^{(i)} + b) \ge 1, \quad i = 1, \ldots, m

A few things worth noting before we attack the dual:

  • We are minimizing 12w2\frac{1}{2} \|w\|^2, not maximizing the margin directly. The trick is that the constraint y(i)(wx(i)+b)1y^{(i)}(w^\top x^{(i)} + b) \ge 1 fixes the functional margin to 1, which means the geometric margin equals 1/w1/\|w\|. Minimizing w2\|w\|^2 maximizes that geometric margin. The factor of 12\frac{1}{2} is there to keep the gradient clean later.
  • The problem is convex — quadratic objective, linear constraints. That means strong duality holds, and the KKT conditions are both necessary and sufficient.
  • This is the hard-margin version. We will derive the soft-margin dual after the hard-margin one because the structure is identical and the soft-margin slack variables just introduce a box constraint on the duals at the end.

Building the Lagrangian

The general recipe for a constrained optimization problem minf(w) s.t. gi(w)0\min f(w) \text{ s.t. } g_i(w) \le 0 is to write the Lagrangian by attaching a non-negative multiplier αi0\alpha_i \ge 0 to each inequality constraint:

L(w,α)=f(w)+iαigi(w)\mathcal{L}(w, \alpha) = f(w) + \sum_i \alpha_i g_i(w)

Two things go in that definition. First, we need the constraints in the standard form gi(w)0g_i(w) \le 0. Ours are written as y(i)(wx(i)+b)1y^{(i)}(w^\top x^{(i)} + b) \ge 1, which is equivalent to:

gi(w,b)=1y(i)(wx(i)+b)0g_i(w, b) = 1 - y^{(i)}(w^\top x^{(i)} + b) \le 0

Second, we have two primal variables (ww and bb), so the Lagrangian is in both. Plugging in:

L(w,b,α)=12w2i=1mαi[y(i)(wx(i)+b)1]\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} (w^\top x^{(i)} + b) - 1 \right]

The sign in front of the sum is minus because we converted gi0g_i \le 0 above and the standard form is +αigi+\alpha_i g_i. Expanded:

L(w,b,α)=12w2iαiy(i)(wx(i)+b)+iαi\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_i \alpha_i y^{(i)} (w^\top x^{(i)} + b) + \sum_i \alpha_i

This Lagrangian is the object we will minimize over (w,b)(w, b) to get the dual function θD(α)\theta_\mathcal{D}(\alpha), and then maximize over α\alpha.

Minimizing the Lagrangian over the primal variables

This is the step where the dual form starts to fall out of the algebra. We take partial derivatives of L\mathcal{L} with respect to ww and bb, set them to zero, and substitute the resulting equations back into the Lagrangian.

Derivative with respect to ww:

wL=wi=1mαiy(i)x(i)=0\nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0

This gives us the first KKT stationarity condition:

w=i=1mαiy(i)x(i)w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}

This identity already does a lot of work — it tells you that the optimal weight vector is a linear combination of the training points, weighted by their (signed) dual variable. Points with αi=0\alpha_i = 0 contribute nothing.

Derivative with respect to bb:

Lb=i=1mαiy(i)=0\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0

So:

i=1mαiy(i)=0\sum_{i=1}^{m} \alpha_i y^{(i)} = 0

This is a constraint on the duals — they must be balanced across the two classes (since y(i){1,+1}y^{(i)} \in \{-1, +1\}). It does not give us a value for bb. We will recover bb later from the KKT complementary slackness condition.

Substituting back to get the dual function

Now we plug w=iαiy(i)x(i)w = \sum_i \alpha_i y^{(i)} x^{(i)} and iαiy(i)=0\sum_i \alpha_i y^{(i)} = 0 back into the Lagrangian. This is the bookkeeping step that is easy to mess up — work through it slowly.

Term 1: 12w2\frac{1}{2} \|w\|^2. Using the substitution:

12w2=12ww=12(iαiy(i)x(i))(jαjy(j)x(j))=12i,jαiαjy(i)y(j)x(i),x(j)\frac{1}{2} \|w\|^2 = \frac{1}{2} w^\top w = \frac{1}{2} \left( \sum_i \alpha_i y^{(i)} x^{(i)} \right)^\top \left( \sum_j \alpha_j y^{(j)} x^{(j)} \right) = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

Term 2: iαiy(i)(wx(i)+b)-\sum_i \alpha_i y^{(i)} (w^\top x^{(i)} + b).

Distribute:

iαiy(i)wx(i)biαiy(i)-\sum_i \alpha_i y^{(i)} w^\top x^{(i)} - b \sum_i \alpha_i y^{(i)}

The second piece vanishes because iαiy(i)=0\sum_i \alpha_i y^{(i)} = 0. The first piece, substituting ww again:

iαiy(i)(jαjy(j)x(j))x(i)=i,jαiαjy(i)y(j)x(i),x(j)-\sum_i \alpha_i y^{(i)} \left( \sum_j \alpha_j y^{(j)} x^{(j)} \right)^\top x^{(i)} = -\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

Term 3: +iαi+\sum_i \alpha_i. Stays as is.

Adding the three terms:

L=12i,jαiαjy(i)y(j)x(i),x(j)i,jαiαjy(i)y(j)x(i),x(j)+iαi\mathcal{L} = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle - \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle + \sum_i \alpha_i

The two double sums combine — one has coefficient 12\frac{1}{2}, the other 1-1, leaving 12-\frac{1}{2}. The result is the dual function:

θD(α)=i=1mαi12i=1mj=1mαiαjy(i)y(j)x(i),x(j)\theta_\mathcal{D}(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

This is the equation Ng circles on the board and underlines twice. Memorize the structure: a linear term αi\sum \alpha_i minus a quadratic term where every pair contributes a kernel-weighted product.

The dual problem

Putting the constraints back in, the dual optimization is:

maxα  i=1mαi12i,jαiαjy(i)y(j)x(i),x(j)\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

subject to:

αi0,i=1,,m\alpha_i \ge 0, \quad i = 1, \ldots, m

and:

i=1mαiy(i)=0\sum_{i=1}^{m} \alpha_i y^{(i)} = 0

That is the hard-margin SVM dual. Compare it to the primal — both are convex quadratic programs, both have the same optimum, but the dual has only mm variables and depends on the data only through inner products. That second property is the whole game.

KKT conditions and recovering bb

The KKT conditions are the system of equations that must hold at the optimum of a convex problem with constraints. There are four:

  1. Stationarity — gradients of the Lagrangian vanish (we already used these to get ww and the constraint on α\alpha).
  2. Primal feasibilityy(i)(wx(i)+b)1y^{(i)}(w^\top x^{(i)} + b) \ge 1 for all ii.
  3. Dual feasibilityαi0\alpha_i \ge 0 for all ii.
  4. Complementary slacknessαi[y(i)(wx(i)+b)1]=0\alpha_i \left[ y^{(i)}(w^\top x^{(i)} + b) - 1 \right] = 0 for all ii.

Complementary slackness is the magic one. It says that for every training point, either:

  • αi=0\alpha_i = 0 (the point doesn't contribute to ww and isn't a support vector), or
  • y(i)(wx(i)+b)=1y^{(i)}(w^\top x^{(i)} + b) = 1 (the functional margin is exactly 1 — the point sits on the margin boundary).

Points with αi>0\alpha_i > 0 are by definition support vectors. They are the only training examples that determine the hyperplane. To recover bb, pick any support vector ii^* (any point with αi>0\alpha_{i^*} > 0) and solve:

y(i)(wx(i)+b)=1    b=y(i)wx(i)y^{(i^*)}(w^\top x^{(i^*)} + b) = 1 \implies b = y^{(i^*)} - w^\top x^{(i^*)}

In practice you average bb across all support vectors for numerical stability.

Why the dual form makes the kernel trick possible

The kernel trick is what made SVMs the dominant classifier from roughly 1995 to 2012, and it is impossible to do in the primal. Look once more at the dual:

θD(α)=iαi12i,jαiαjy(i)y(j)x(i),x(j)\theta_\mathcal{D}(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

The features x(i)x^{(i)} appear only in the inner product x(i),x(j)\langle x^{(i)}, x^{(j)} \rangle. Nowhere else.

That means if we are willing to map the data into a higher-dimensional feature space ϕ(x)\phi(x) — to allow non-linear decision boundaries — we never need to actually compute ϕ(x)\phi(x). We only need a function K(x,z)=ϕ(x),ϕ(z)K(x, z) = \langle \phi(x), \phi(z) \rangle that returns the inner product directly. That function is a kernel.

The famous example is the Gaussian (RBF) kernel:

K(x,z)=exp(xz22σ2)K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)

This kernel corresponds to an infinite-dimensional feature map. In the primal we would have to store and optimize over infinite-dimensional ww, which is impossible. In the dual we just compute K(x(i),x(j))K(x^{(i)}, x^{(j)}) for each pair of training points, which is O(m2)O(m^2) kernel evaluations. The optimization stays finite-dimensional in α\alpha.

This is the single biggest reason CS229 spends two full lectures on the dual derivation: not because SVMs are uniquely important, but because the kernel trick — and the broader idea that an algorithm depending only on inner products can be silently lifted into a richer feature space — is one of the most reused ideas in classical ML.

The soft-margin extension

The hard-margin SVM only works on linearly separable data. The soft-margin version, which Ng derives in Lecture 7, allows misclassifications via slack variables ξi\xi_i and a penalty parameter CC:

minw,b,ξ  12w2+Ciξi\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i

subject to:

y(i)(wx(i)+b)1ξi,ξi0y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

Run through the same Lagrangian-and-substitute exercise (now with multipliers αi\alpha_i for the margin constraints and rir_i for ξi0\xi_i \ge 0) and the dual comes out almost identical:

maxα  iαi12i,jαiαjy(i)y(j)x(i),x(j)\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle

subject to:

0αiC,iαiy(i)=00 \le \alpha_i \le C, \quad \sum_i \alpha_i y^{(i)} = 0

The only change is that αi\alpha_i now has an upper bound CC (the box constraint). The dual function is unchanged. The kernel trick still works the same way. SMO — the optimizer Ng covers next — exploits exactly this box-constrained structure.

Worked example: SVM on the XOR-like dataset

Suppose you have four training points in 2D:

  • x(1)=(1,1)x^{(1)} = (1, 1), y(1)=+1y^{(1)} = +1
  • x(2)=(1,1)x^{(2)} = (-1, -1), y(2)=+1y^{(2)} = +1
  • x(3)=(1,1)x^{(3)} = (1, -1), y(3)=1y^{(3)} = -1
  • x(4)=(1,1)x^{(4)} = (-1, 1), y(4)=1y^{(4)} = -1

These are not linearly separable in the original space. But with a polynomial kernel K(x,z)=(1+xz)2K(x, z) = (1 + x^\top z)^2, which corresponds to mapping into a 6-dimensional space including all pairwise products, the data becomes separable.

The dual problem becomes a 4-variable QP. You compute the 4×44 \times 4 kernel matrix:

Kij=(1+x(i)x(j))2K_{ij} = (1 + x^{(i)\top} x^{(j)})^2

then solve for α\alpha, and the resulting classifier is:

f(x)=sign(i=14αiy(i)K(x(i),x)+b)f(x) = \text{sign}\left( \sum_{i=1}^{4} \alpha_i y^{(i)} K(x^{(i)}, x) + b \right)

You never compute ϕ(x)\phi(x) explicitly. You only ever evaluate KK.

Common derivation pitfalls

Three places where CS229 problem sets historically catch students:

  1. Sign errors in the Lagrangian. The constraint y(i)(wx(i)+b)1y^{(i)}(w^\top x^{(i)} + b) \ge 1 becomes gi=1y(i)()0g_i = 1 - y^{(i)}(\ldots) \le 0. The Lagrangian adds +αigi+\alpha_i g_i, which is αi[y(i)()1]-\alpha_i \left[y^{(i)}(\ldots) - 1\right]. Drop a sign here and the dual flips into a min\min instead of a max\max.

  2. Forgetting iαiy(i)=0\sum_i \alpha_i y^{(i)} = 0. This is not a side note — it's the constraint that makes the bb term in the Lagrangian vanish when you substitute. Without it, the substitution doesn't simplify cleanly.

  3. Confusing functional and geometric margin. The primal constraint uses functional margin = 1; the objective minimizes w2\|w\|^2, which maximizes the geometric margin. They are related by γgeo=γfunc/w\gamma_{\text{geo}} = \gamma_{\text{func}} / \|w\|. Slipping between the two without realizing it is the most common source of off-by-a-factor errors.

How this fits with the rest of CS229

The SVM dual is the case study for Lagrange duality, but the same machinery shows up in:

  • Logistic regression with 1\ell_1 regularization — different objective, similar duality argument
  • The kernel perceptron — direct dual representation, no QP needed
  • Gaussian processes — also depend only on inner products / kernel matrices
  • Many other CS229 problem-set problems that ask you to derive a dual from a constrained primal

If you want the broader course, the complete CS229 lecture notes cover every major topic — linear regression, GLMs, neural networks, unsupervised learning, RL. The natural follow-on to this post is the SMO optimizer Ng presents in Lecture 8, which solves the box-constrained QP that this derivation set up.

What does "SVM dual" actually mean?

It's the optimization problem you get when you rewrite the SVM margin-maximization problem (the primal) using Lagrange multipliers and then maximize over those multipliers instead of the original weight vector ww. Mathematically equivalent — same optimum — but expressed in different variables that turn out to be more useful.

Why does CS229 derive the dual at all instead of just using the primal?

Two reasons. First, the dual depends on the data only through inner products x(i),x(j)\langle x^{(i)}, x^{(j)} \rangle, which makes the kernel trick possible — you can implicitly map into arbitrarily-high-dimensional feature spaces by just replacing the inner product with a kernel. The primal can't do this. Second, the dual has the same number of variables as you have training points, regardless of feature dimension. With an RBF kernel (infinite-dimensional feature space) the primal is uncomputable; the dual stays finite.

What are the KKT conditions and which ones matter for SVMs?

KKT (Karush-Kuhn-Tucker) conditions are the necessary-and-sufficient optimality conditions for a convex constrained problem. There are four: stationarity (gradients vanish), primal feasibility (constraints hold), dual feasibility (αi0\alpha_i \ge 0), and complementary slackness (αigi=0\alpha_i \cdot g_i = 0 for every constraint). For SVMs, complementary slackness is the important one — it tells you that for every training point, either αi=0\alpha_i = 0 (not a support vector) or the point is exactly on the margin. That's how you identify support vectors and how you recover the bias term bb.

What's the difference between SVM primal and dual form notes?

The primal optimizes over the weight vector ww and bias bb. The dual optimizes over Lagrange multipliers αi\alpha_i, one per training point. The two formulations have the same optimum (strong duality), but the dual lets you use kernels and makes support vectors explicit. Most software (LIBSVM, sklearn) actually solves the dual, even when you ask for "an SVM."

Where does the kernel trick fit in this derivation?

Look at the final dual objective: iαi12i,jαiαjy(i)y(j)x(i),x(j)\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle. The training features x(i)x^{(i)} appear only inside the inner product. Replace every x(i),x(j)\langle x^{(i)}, x^{(j)} \rangle with a kernel K(x(i),x(j))K(x^{(i)}, x^{(j)}) and you've implicitly mapped your data into whatever feature space that kernel corresponds to — without ever computing the mapping. That's the kernel trick. It only works because the dual is structured this way.

How do I recover bb after solving the dual?

Pick any support vector ii^* — any training point with αi>0\alpha_{i^*} > 0. By complementary slackness, that point sits exactly on the margin boundary, so y(i)(wx(i)+b)=1y^{(i^*)}(w^\top x^{(i^*)} + b) = 1. Solve for bb: b=y(i)wx(i)b = y^{(i^*)} - w^\top x^{(i^*)}, where w=iαiy(i)x(i)w = \sum_i \alpha_i y^{(i)} x^{(i)}. For numerical stability, average across all support vectors.

What's the soft-margin SVM dual?

Same dual objective. The only change is the constraint: 0αiC0 \le \alpha_i \le C instead of αi0\alpha_i \ge 0. The upper bound CC is the regularization parameter from the primal — it caps how much any single training point can influence the boundary. The constraint iαiy(i)=0\sum_i \alpha_i y^{(i)} = 0 stays the same. The kernel trick still works identically.


Turn any Stanford lecture into notes like these

These notes were synthesized from the CS229 SVM lectures. Notiq does the same thing for any YouTube lecture — paste a link, get back structured notes with the derivations written out, flashcards, exam questions, and worked examples. Try the Stanford CS229 SVM lecture in Notiq's library or paste your own lecture.

Share this article

Related Articles