These are complete CS229 SVM dual formulation notes — the full derivation Andrew Ng works through across Lecture 6 and Lecture 7 of Stanford's machine learning course. If you came here from search looking for "CS229 SVM dual derivation notes" or "CS229 SVM dual form notes," you are in the right place: every step from the primal max-margin problem to the dual is written out below, with the Lagrangian, the KKT conditions, and the substitution that makes the kernel trick possible.
If you want the broader overview of every CS229 lecture, see our complete CS229 lecture notes. This page is the deep dive on one specific topic: the dual form of the support vector machine.
Quick orientation: what "SVM dual" means and why CS229 spends so much time on it
CS229 introduces the support vector machine as a margin-maximization problem. You have a labelled dataset, you want a hyperplane that separates the classes, and among all separating hyperplanes you specifically want the one that pushes the closest training points as far away from the boundary as possible. That formulation — the one Ng writes on the board first — is called the primal.
The dual is the same optimization problem rewritten in a different set of variables. Mathematically the two are equivalent (they share the same optimum) but the dual has three properties that make it the form everyone actually uses in practice:
- The dual depends on the training data only through inner products — never on the raw feature vectors. This is what makes the kernel trick possible.
- The dual has the same number of variables as you have training examples (), regardless of how high-dimensional your feature space is. If you map into an infinite-dimensional space with an RBF kernel, the dual stays finite.
- The dual makes it explicit which training points actually determine the boundary (the support vectors). For most points, the dual variable is zero. Only support vectors carry a non-zero dual.
The reason CS229 derives the dual in such painful detail is that the derivation isn't really about SVMs — it's about Lagrange duality, a tool that recurs constantly in machine learning research. Once you have done it once for SVMs, you have done it once for many other problems.
The primal problem (recap from CS229 Lecture 6)
The geometric-margin SVM primal — what Ng writes after the long warm-up about functional vs geometric margins — is:
subject to:
A few things worth noting before we attack the dual:
- We are minimizing , not maximizing the margin directly. The trick is that the constraint fixes the functional margin to 1, which means the geometric margin equals . Minimizing maximizes that geometric margin. The factor of is there to keep the gradient clean later.
- The problem is convex — quadratic objective, linear constraints. That means strong duality holds, and the KKT conditions are both necessary and sufficient.
- This is the hard-margin version. We will derive the soft-margin dual after the hard-margin one because the structure is identical and the soft-margin slack variables just introduce a box constraint on the duals at the end.
Building the Lagrangian
The general recipe for a constrained optimization problem is to write the Lagrangian by attaching a non-negative multiplier to each inequality constraint:
Two things go in that definition. First, we need the constraints in the standard form . Ours are written as , which is equivalent to:
Second, we have two primal variables ( and ), so the Lagrangian is in both. Plugging in:
The sign in front of the sum is minus because we converted above and the standard form is . Expanded:
This Lagrangian is the object we will minimize over to get the dual function , and then maximize over .
Minimizing the Lagrangian over the primal variables
This is the step where the dual form starts to fall out of the algebra. We take partial derivatives of with respect to and , set them to zero, and substitute the resulting equations back into the Lagrangian.
Derivative with respect to :
This gives us the first KKT stationarity condition:
This identity already does a lot of work — it tells you that the optimal weight vector is a linear combination of the training points, weighted by their (signed) dual variable. Points with contribute nothing.
Derivative with respect to :
So:
This is a constraint on the duals — they must be balanced across the two classes (since ). It does not give us a value for . We will recover later from the KKT complementary slackness condition.
Substituting back to get the dual function
Now we plug and back into the Lagrangian. This is the bookkeeping step that is easy to mess up — work through it slowly.
Term 1: . Using the substitution:
Term 2: .
Distribute:
The second piece vanishes because . The first piece, substituting again:
Term 3: . Stays as is.
Adding the three terms:
The two double sums combine — one has coefficient , the other , leaving . The result is the dual function:
This is the equation Ng circles on the board and underlines twice. Memorize the structure: a linear term minus a quadratic term where every pair contributes a kernel-weighted product.
The dual problem
Putting the constraints back in, the dual optimization is:
subject to:
and:
That is the hard-margin SVM dual. Compare it to the primal — both are convex quadratic programs, both have the same optimum, but the dual has only variables and depends on the data only through inner products. That second property is the whole game.
KKT conditions and recovering
The KKT conditions are the system of equations that must hold at the optimum of a convex problem with constraints. There are four:
- Stationarity — gradients of the Lagrangian vanish (we already used these to get and the constraint on ).
- Primal feasibility — for all .
- Dual feasibility — for all .
- Complementary slackness — for all .
Complementary slackness is the magic one. It says that for every training point, either:
- (the point doesn't contribute to and isn't a support vector), or
- (the functional margin is exactly 1 — the point sits on the margin boundary).
Points with are by definition support vectors. They are the only training examples that determine the hyperplane. To recover , pick any support vector (any point with ) and solve:
In practice you average across all support vectors for numerical stability.
Why the dual form makes the kernel trick possible
The kernel trick is what made SVMs the dominant classifier from roughly 1995 to 2012, and it is impossible to do in the primal. Look once more at the dual:
The features appear only in the inner product . Nowhere else.
That means if we are willing to map the data into a higher-dimensional feature space — to allow non-linear decision boundaries — we never need to actually compute . We only need a function that returns the inner product directly. That function is a kernel.
The famous example is the Gaussian (RBF) kernel:
This kernel corresponds to an infinite-dimensional feature map. In the primal we would have to store and optimize over infinite-dimensional , which is impossible. In the dual we just compute for each pair of training points, which is kernel evaluations. The optimization stays finite-dimensional in .
This is the single biggest reason CS229 spends two full lectures on the dual derivation: not because SVMs are uniquely important, but because the kernel trick — and the broader idea that an algorithm depending only on inner products can be silently lifted into a richer feature space — is one of the most reused ideas in classical ML.
The soft-margin extension
The hard-margin SVM only works on linearly separable data. The soft-margin version, which Ng derives in Lecture 7, allows misclassifications via slack variables and a penalty parameter :
subject to:
Run through the same Lagrangian-and-substitute exercise (now with multipliers for the margin constraints and for ) and the dual comes out almost identical:
subject to:
The only change is that now has an upper bound (the box constraint). The dual function is unchanged. The kernel trick still works the same way. SMO — the optimizer Ng covers next — exploits exactly this box-constrained structure.
Worked example: SVM on the XOR-like dataset
Suppose you have four training points in 2D:
- ,
- ,
- ,
- ,
These are not linearly separable in the original space. But with a polynomial kernel , which corresponds to mapping into a 6-dimensional space including all pairwise products, the data becomes separable.
The dual problem becomes a 4-variable QP. You compute the kernel matrix:
then solve for , and the resulting classifier is:
You never compute explicitly. You only ever evaluate .
Common derivation pitfalls
Three places where CS229 problem sets historically catch students:
-
Sign errors in the Lagrangian. The constraint becomes . The Lagrangian adds , which is . Drop a sign here and the dual flips into a instead of a .
-
Forgetting . This is not a side note — it's the constraint that makes the term in the Lagrangian vanish when you substitute. Without it, the substitution doesn't simplify cleanly.
-
Confusing functional and geometric margin. The primal constraint uses functional margin = 1; the objective minimizes , which maximizes the geometric margin. They are related by . Slipping between the two without realizing it is the most common source of off-by-a-factor errors.
How this fits with the rest of CS229
The SVM dual is the case study for Lagrange duality, but the same machinery shows up in:
- Logistic regression with regularization — different objective, similar duality argument
- The kernel perceptron — direct dual representation, no QP needed
- Gaussian processes — also depend only on inner products / kernel matrices
- Many other CS229 problem-set problems that ask you to derive a dual from a constrained primal
If you want the broader course, the complete CS229 lecture notes cover every major topic — linear regression, GLMs, neural networks, unsupervised learning, RL. The natural follow-on to this post is the SMO optimizer Ng presents in Lecture 8, which solves the box-constrained QP that this derivation set up.
What does "SVM dual" actually mean?
It's the optimization problem you get when you rewrite the SVM margin-maximization problem (the primal) using Lagrange multipliers and then maximize over those multipliers instead of the original weight vector . Mathematically equivalent — same optimum — but expressed in different variables that turn out to be more useful.
Why does CS229 derive the dual at all instead of just using the primal?
Two reasons. First, the dual depends on the data only through inner products , which makes the kernel trick possible — you can implicitly map into arbitrarily-high-dimensional feature spaces by just replacing the inner product with a kernel. The primal can't do this. Second, the dual has the same number of variables as you have training points, regardless of feature dimension. With an RBF kernel (infinite-dimensional feature space) the primal is uncomputable; the dual stays finite.
What are the KKT conditions and which ones matter for SVMs?
KKT (Karush-Kuhn-Tucker) conditions are the necessary-and-sufficient optimality conditions for a convex constrained problem. There are four: stationarity (gradients vanish), primal feasibility (constraints hold), dual feasibility (), and complementary slackness ( for every constraint). For SVMs, complementary slackness is the important one — it tells you that for every training point, either (not a support vector) or the point is exactly on the margin. That's how you identify support vectors and how you recover the bias term .
What's the difference between SVM primal and dual form notes?
The primal optimizes over the weight vector and bias . The dual optimizes over Lagrange multipliers , one per training point. The two formulations have the same optimum (strong duality), but the dual lets you use kernels and makes support vectors explicit. Most software (LIBSVM, sklearn) actually solves the dual, even when you ask for "an SVM."
Where does the kernel trick fit in this derivation?
Look at the final dual objective: . The training features appear only inside the inner product. Replace every with a kernel and you've implicitly mapped your data into whatever feature space that kernel corresponds to — without ever computing the mapping. That's the kernel trick. It only works because the dual is structured this way.
How do I recover after solving the dual?
Pick any support vector — any training point with . By complementary slackness, that point sits exactly on the margin boundary, so . Solve for : , where . For numerical stability, average across all support vectors.
What's the soft-margin SVM dual?
Same dual objective. The only change is the constraint: instead of . The upper bound is the regularization parameter from the primal — it caps how much any single training point can influence the boundary. The constraint stays the same. The kernel trick still works identically.
Turn any Stanford lecture into notes like these
These notes were synthesized from the CS229 SVM lectures. Notiq does the same thing for any YouTube lecture — paste a link, get back structured notes with the derivations written out, flashcards, exam questions, and worked examples. Try the Stanford CS229 SVM lecture in Notiq's library or paste your own lecture.

