CS50 AI Notes — Complete Course Summary for All 7 Lectures

·16 min read
CS50 AI Notes — Complete Course Summary for All 7 Lectures

Share this article

Harvard's CS50 AI — formally CS50's Introduction to Artificial Intelligence with Python — is one of the most accessible and complete introductions to AI available anywhere. It is free, it is practical, and it genuinely covers the breadth of the field in a way that builds a coherent mental model rather than a list of disconnected techniques.

These cs50 ai notes cover every topic in the course: search algorithms, knowledge and inference, probability and uncertainty, optimization, machine learning, neural networks, and natural language processing. Each section explains not just what the algorithm does but why it works and where it fits in the broader map of AI.

CS50 AI Lecture 0: Search. The full lecture is available free on the CS50 AI website.

For broader Harvard CS50 course notes, see our Harvard CS50 notes. For more advanced AI coverage, see the MIT 6.034 AI course notes. For a path deeper into machine learning, see the Andrew Ng ML course notes.

What Makes CS50 AI Different From Other AI Courses?

CS50 AI sits at an unusual position in the AI education landscape. It is more applied than the academic AI courses (MIT 6.034, Stanford CS221) but more conceptually rigorous than typical bootcamp content. It assumes basic Python familiarity and uses Python throughout, which makes the implementations directly runnable without language overhead.

The course structure mirrors the historical development of AI: it starts with search-based methods that dominated the field in the 1950s–1980s, moves through probabilistic and optimization methods, and ends with the machine learning and neural network approaches that dominate today. This progression is not just historical — it shows you why each new family of methods was needed and what problems the previous approaches could not solve.

Each week includes:

  • A two-hour lecture with clear explanations and live coding demonstrations
  • Problem sets that require implementing the algorithms from scratch
  • Real projects: a Minesweeper solver, a crossword generator, a shopping classifier, a Nim player, a traffic sign recognizer

The projects are what distinguish CS50 AI from a course you could watch and immediately forget. Implementing a BFS-based maze solver teaches you more about search than any amount of reading.

Week 0 covers search — the oldest and most foundational approach in AI. The problem: given a state space and a goal, find a sequence of actions that leads from the initial state to the goal.

Search problem components:

  • Initial state
  • Actions: what moves are available from each state
  • Transition model: what state results from taking an action
  • Goal test: how to recognize a goal state
  • Path cost: how expensive each path is

Uninformed search explores without information about where the goal is.

Depth-first search (DFS) explores one branch as deeply as possible before backtracking. Data structure: a stack (LIFO). DFS will find a solution if one exists but may find a very long path when a short one exists. Not optimal.

Breadth-first search (BFS) explores all states at depth k before any state at depth k+1. Data structure: a queue (FIFO). BFS finds the shortest path (in terms of number of actions). The cost: it stores all frontier states in memory, which can be enormous.

Greedy best-first search uses a heuristic h(n) to estimate the distance from the current state to the goal, always expanding the state with the lowest h(n). Fast, but not guaranteed to find the optimal path.

A search* combines the actual path cost g(n) with the heuristic h(n):

f(n) = g(n) + h(n)

A* is optimal if the heuristic is admissible (never overestimates the true cost) and consistent (h(n) ≤ cost(n, n') + h(n') for each successor n'). With an admissible heuristic, A* always finds the optimal path. The Manhattan distance is admissible for grid-based navigation; Euclidean distance is admissible for many spatial problems.

Adversarial search handles situations where an opponent is trying to minimize your score. The minimax algorithm computes the optimal move by assuming both players play perfectly:

MAX-VALUE(state):
  if TERMINAL(state): return UTILITY(state)
  v = -∞
  for each action in ACTIONS(state):
    v = max(v, MIN-VALUE(RESULT(state, action)))
  return v

MIN-VALUE(state):
  if TERMINAL(state): return UTILITY(state)
  v = +∞
  for each action in ACTIONS(state):
    v = min(v, MAX-VALUE(RESULT(state, action)))
  return v

Alpha-beta pruning optimizes minimax by eliminating branches that cannot possibly affect the optimal decision. If the maximizing player has already found a value α and the minimizing player has a better option β < α somewhere above, the current subtree can be pruned. This reduces the effective branching factor dramatically — in chess, it cuts from b^d to approximately b^{d/2}.

Depth-limited minimax and evaluation functions handle games too complex to search to terminal states. The evaluation function estimates the value of non-terminal states — a game-specific heuristic.

Knowledge and Inference: Logic and Automated Reasoning

Week 1 covers knowledge representation and inference — how to encode facts about the world in a formal language and derive new facts from them automatically.

Propositional logic uses sentences built from atomic propositions (P, Q, R) and logical connectives:

  • ¬P: NOT P
  • P ∧ Q: P AND Q
  • P ∨ Q: P OR Q
  • P → Q: P IMPLIES Q (equivalent to ¬P ∨ Q)
  • P ↔ Q: P IF AND ONLY IF Q

A model assigns a truth value to every proposition. A knowledge base is a set of sentences that are known to be true. The entailment relation KB ⊨ α means: in every model where KB is true, α is also true.

Model checking (the brute-force approach) enumerates all possible models and checks whether α is true whenever KB is true. For n propositions, there are 2^n models — exponential. Not practical for large knowledge bases.

Resolution is a polynomial-time inference algorithm. Convert all sentences to conjunctive normal form (CNF): a conjunction of clauses, where each clause is a disjunction of literals. The resolution rule:

P ∨ Q₁ ∨ ... ∨ Qₙ
¬P ∨ R₁ ∨ ... ∨ Rₘ
―――――――――――――――――――
Q₁ ∨ ... ∨ Qₙ ∨ R₁ ∨ ... ∨ Rₘ

To prove α, add ¬α to the KB and attempt to derive the empty clause (contradiction). If you succeed, α is entailed.

First-order logic (FOL) extends propositional logic with objects, predicates, and quantifiers:

  • ∀x Person(x) → Mortal(x): all persons are mortal
  • ∃x Cat(x) ∧ Black(x): some cat is black

FOL is expressive enough to represent most factual knowledge. The cost: inference in FOL is semi-decidable — some inferences can be proved, but the algorithm may not terminate for non-entailed conclusions.

The practical takeaway from this section: logic-based systems are excellent for domains with well-defined rules and complete knowledge (constraint satisfaction, planning, configuration). They fail when knowledge is incomplete, uncertain, or too complex to enumerate. That failure motivates the next section.

Uncertainty: Probability, Bayes' Rule, and Bayesian Networks

Week 2 covers probabilistic reasoning — how to make decisions and draw inferences when knowledge is incomplete.

Probability basics. The probability of an event e, P(e), is a number in [0, 1] representing the agent's degree of belief. P(Ω) = 1 (some outcome always occurs). For mutually exclusive events: P(A ∪ B) = P(A) + P(B).

Conditional probability: P(A | B) = P(A ∩ B) / P(B). The probability of A given that B is known to be true.

Bayes' rule relates the two directions of conditioning:

P(A | B) = P(B | A) P(A) / P(B)

This is the foundation of Bayesian reasoning: update your belief P(A) (the prior) using evidence B to get P(A | B) (the posterior). Bayes' rule is how a doctor updates their diagnosis after seeing a test result, how a spam filter updates its classification after seeing the email content, and how a self-driving car updates its world model after receiving sensor readings.

Bayesian Networks encode the joint probability distribution over multiple variables compactly using conditional independence. A Bayesian network is a directed acyclic graph where:

  • Each node is a random variable
  • Each node has a conditional probability table P(X_i | Parents(X_i))
  • The joint distribution factors as: P(X_1, ..., X_n) = Π_i P(X_i | Parents(X_i))

Example: a network for diagnosing whether it rained based on whether the grass is wet and whether the sprinkler ran. Rain and Sprinkler are parents of WetGrass.

Inference in Bayesian networks answers queries like P(Rain | WetGrass = true). Exact inference via enumeration is feasible for small networks. Approximate inference methods (Gibbs sampling, likelihood weighting) scale to larger networks.

Hidden Markov Models (HMMs) model temporal sequences with hidden states. The robot localization problem, speech recognition, and text tagging are all HMM problems. The Viterbi algorithm computes the most likely sequence of hidden states given a sequence of observations efficiently using dynamic programming.

Optimization: Local Search and Constraint Satisfaction

Week 3 covers optimization — finding the best solution according to an objective function, even when exhaustive search is infeasible.

Local search algorithms start with a complete (possibly suboptimal) solution and iteratively improve it by modifying one part at a time.

Hill climbing moves to the best neighboring state at each step. Problems: local maxima (better than all neighbors but not globally optimal), plateaus, and ridges. Hill climbing gets stuck reliably on hard optimization problems.

Simulated annealing addresses local maxima by occasionally accepting worse moves. The probability of accepting a worse move decreases over time (the "temperature" decreases):

P(accept worse move) = e^{ΔE/T}

At high temperature, the algorithm explores broadly. As temperature decreases, it converges toward a good solution. With the right cooling schedule, simulated annealing finds solutions close to the global optimum.

Constraint Satisfaction Problems (CSPs) are optimization problems where variables must be assigned values from their domains such that all constraints are satisfied. Examples: map coloring, Sudoku, scheduling.

Solving CSPs: backtracking search assigns values to variables one at a time, backtracking when a constraint is violated. Two key improvements:

Forward checking — after each assignment, check whether the remaining variables still have legal values. If any variable's domain becomes empty, backtrack immediately rather than continuing.

Arc consistency (AC-3 algorithm) — a variable X_i is arc-consistent with respect to X_j if for every value in X_i's domain, there exists a compatible value in X_j's domain. Enforcing arc consistency before and during search prunes the search space dramatically.

Minimum remaining values (MRV) heuristic — choose the variable with the fewest remaining legal values next. Fails fast, which is efficient.

Least constraining value heuristic — assign the value that eliminates the fewest choices for neighboring variables. Keeps the most options open.

Machine Learning: Supervised, Unsupervised, and Reinforcement

Week 4 covers machine learning at the conceptual level — how to learn from data rather than hand-coding rules.

Supervised learning. Given labeled training examples (x, y), learn a function f such that f(x) ≈ y for new inputs.

k-Nearest Neighbors (kNN) classifies a new point by finding the k nearest training examples and majority-voting their labels. No explicit training phase — the training data is the model. Sensitive to irrelevant features and outliers.

Perceptrons (linear classifiers) learn a decision boundary of the form w·x + b = 0 by updating weights whenever a misclassified example is found:

w = w + α (y - ŷ) x

The perceptron convergence theorem guarantees convergence if the data is linearly separable. If not, it never converges.

Support Vector Machines (introduced conceptually, not derived in full) find the maximum-margin linear classifier. The key insight: the margin is maximized by making the classifier as uncertain as possible while still correct — this is why SVMs generalize well.

Regression vs. Classification. CS50 AI covers both: linear regression for continuous outputs, logistic regression for binary classification.

Overfitting and regularization. A model that memorizes the training data fails to generalize. Regularization penalizes model complexity (large weights). Cross-validation estimates generalization performance without touching the test set.

Unsupervised learning. K-means clustering assigns each data point to one of k clusters, minimizing within-cluster variance. The algorithm is simple, but choosing k is non-trivial.

Reinforcement learning. An agent learns by interacting with an environment: taking actions, receiving rewards, and updating its strategy. The Q-learning algorithm learns Q(s, a) — the expected future reward from taking action a in state s:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]

CS50 AI implements a Nim-playing agent that learns purely from self-play using Q-learning — a compelling concrete example of RL in action.

Neural Networks: Perceptrons, Backpropagation, and CNNs

Week 5 covers neural networks — the engine behind most modern AI applications.

From perceptrons to networks. A single perceptron can only learn linearly separable functions. The XOR problem is not linearly separable — a network of two layers can solve it. The Universal Approximation Theorem says a network with one hidden layer can approximate any continuous function to arbitrary accuracy.

Activation functions. The choice of non-linearity determines what functions the network can learn:

  • Sigmoid: smooth, bounded in (0,1), but vanishing gradients for extreme values
  • ReLU: max(0, x), does not saturate for positive inputs, fast to compute, default choice
  • Softmax: normalizes a vector to probabilities, used in output layer for multi-class classification

Forward propagation computes the network's output for a given input by passing activations through each layer. Backpropagation computes the gradient of the loss with respect to all weights by applying the chain rule backward through the network. These two algorithms together enable training.

Gradient descent variants:

  • Batch gradient descent: update using the full dataset — stable but slow
  • Stochastic gradient descent (SGD): update after each single example — fast but noisy
  • Mini-batch gradient descent: update after each small batch (32–256 examples) — the practical default

Overfitting and regularization for neural networks:

  • Dropout: randomly zero out neurons during training, preventing co-adaptation
  • L2 regularization: add ||w||² penalty to the loss
  • Early stopping: monitor validation loss and stop when it starts increasing

Convolutional Neural Networks. For image tasks, CNNs dramatically outperform fully connected networks by exploiting spatial locality and translation invariance. CS50 AI walks through the components:

  • Convolutional layers: learnable filters applied at every spatial location
  • Pooling layers: spatial downsampling (max pooling takes the maximum in each region)
  • Fully connected layers: classification based on extracted features

The course's traffic sign recognition project uses a CNN trained on the German Traffic Sign Recognition Benchmark — a practical example of CNN training from scratch on a real dataset.

Natural Language Processing: Language Models, Syntax, and Meaning

Week 6 covers NLP — a domain where the gap between what AI can do and what it cannot is actively shrinking.

n-gram language models estimate the probability of a sequence of words by factoring over n-grams:

P(w_1, w_2, ..., w_n) ≈ Π_i P(w_i | w_{i-n+1}, ..., w_{i-1})

A bigram model (n=2) estimates P(w_i | w_{i-1}). Trained on large corpora, bigram models generate recognizably language-like text but without coherent meaning.

Bag of words representations ignore word order entirely, treating documents as unordered sets of words. Useful for classification (spam detection, sentiment analysis) where local context matters less than word presence.

TF-IDF (term frequency–inverse document frequency) weights words by their importance: frequent in this document (high TF) but rare across documents (high IDF) → high TF-IDF → likely a topic-specific term. Used for information retrieval and document similarity.

Formal syntax and parsing. Context-free grammars (CFGs) define the grammatical structure of sentences:

S → NP VP
NP → Det N | N
VP → V NP | V

Chart parsing algorithms (CYK, Earley) parse sentences according to a CFG efficiently. The ambiguity problem: most sentences have multiple valid parse trees. The correct interpretation depends on semantics, which syntax alone cannot resolve.

Word vectors and embeddings. Word2Vec and GloVe represent words as dense vectors in a continuous semantic space, trained such that words with similar meanings have similar vectors. The famous analogy: vector(king) - vector(man) + vector(woman) ≈ vector(queen). These embeddings are the input layer of most neural NLP models.

Transformers and attention in NLP. CS50 AI introduces the transformer architecture as the dominant paradigm for language: self-attention allows every word in a sequence to attend to every other word when being encoded. This produces context-sensitive representations — "bank" next to "river" is encoded differently than "bank" next to "money."

Large language models (GPT, BERT, etc.) are transformers pre-trained on vast corpora. Their emergent capabilities — reasoning, question answering, code generation — were not specifically trained for. CS50 AI closes with a discussion of what these models can and cannot do, and why careful evaluation of AI claims is essential.

How Does CS50 AI Compare to Other AI Courses?

CS50 AI is an excellent first course in AI because it is genuinely comprehensive without being mathematically overwhelming. But it is a starting point, not an endpoint.

What CS50 AI gives you: a working understanding of search, logic, probability, optimization, supervised learning, neural networks, and NLP basics, with Python implementations of every major algorithm. This is a very solid foundation.

What it does not cover in depth: the mathematics behind machine learning (see CS229), convolutional networks at production scale (see CS231n), reinforcement learning theory (RL algorithms beyond Q-learning), transformer internals, or current large language model research.

Where to go after CS50 AI:

Is CS50 AI Worth Doing If You Already Know Python?

Yes, for two reasons.

First, the course teaches you to think in terms of AI problem formulations — how to map a real problem onto search, CSP, or MDP formalisms. This skill does not come automatically from knowing Python.

Second, the projects force you to implement algorithms from scratch. You do not truly understand A* until you have implemented it and debugged it on a real problem. The Minesweeper solver, crossword generator, and traffic sign recognizer are all projects you can show a potential employer.

For a guide to studying technical courses effectively, see our YouTube to notes complete guide and our guide on reverse engineering college courses from lecture videos.


CS50 AI covers a lot of ground. Seven weeks, seven major topic areas, seven programming projects. Notiq lets you generate structured notes from each CS50 AI lecture automatically, extract key concepts and equations, and review with spaced repetition. Keep the knowledge sharp between lectures.

Start your CS50 AI study at notiq.study

Share this article

Related Articles