Welcome to STAT 4830: Numerical optimization for data science and machine learning. This course teaches you how to formulate optimization problems, select and implement algorithms, and use frameworks like PyTorch to build and train models. Below are some highlights of the syllabus to get you oriented:
Let’s start with a classic problem: sorting important emails from spam.
Consider two messages that land in your inbox:
email1 = """
Subject: URGENT! You've won $1,000,000!!!
Dear Friend! Act NOW to claim your PRIZE money!!!
Click here: www.totally-legit-prizes.com
"""
email2 = """
Subject: Team meeting tomorrow
Hi everyone, Just a reminder about our 2pm project sync.
Please review the attached agenda.
"""
Computers can’t directly understand text like we do. Instead, we convert each email into numbers called features. Think of features as measurements that help distinguish spam from real mail:
def extract_features(email):
features = {
'exclamation_count': email.count('!'),
'urgent_words': len(['urgent', 'act now', 'prize'] & set(email.lower().split())),
'suspicious_links': len([link for link in email.split() if 'www' in link]), # Any link is suspicious imo
'time_sent': email.timestamp.hour, # Spam often sent at odd hours
'length': len(email)
}
return features
# Our spam email gets turned into numbers
spam_features = extract_features(email1)
print(spam_features)
# {'exclamation_count': 8,
# 'urgent_words': 3,
# 'suspicious_links': 1,
# 'time_sent': 3,
# 'length': 142}
How do we decide if an email is spam? We assign a weight to each feature. Think of weights as measuring how suspicious each feature is:
import torch
weights = {
'exclamation_count': 0.5, # More ! marks → more suspicious
'urgent_words': 0.8, # "Urgent" language → very suspicious
'suspicious_links': 0.6, # Unknown links → somewhat suspicious
'time_sent': 0.1, # Time matters less
'length': -0.2 # Longer emails might be less suspicious
}
# PyTorch needs numbers in tensor form
features = torch.tensor([1.0, 3.0, 1.0, 3.0, 142.0])
w = torch.tensor(list(weights.values()), requires_grad=True)
Features flow through a sequence of transformations:
We combine features and weights to get a “spam score”. But how do we turn this score into a yes/no decision? We use a function called sigmoid that turns any number into a “probability” between 0 and 1:
def sigmoid(x):
return 1 / (1 + torch.exp(-x))
def spam_score(features, weights):
raw_score = torch.dot(features, weights) # Combine features and weights
probability = sigmoid(raw_score) # Convert to probability
return probability
The sigmoid function provides crucial properties:
Our spam filter needs to find weights that correctly classify emails. We can write this as an optimization problem:
\[\min_{w} \frac{1}{n} \sum_{i=1}^n \left[-y_i \log(\sigma(x_i^\top w)) - (1-y_i) \log(1-\sigma(x_i^\top w)) \right]\]Where:
This formula measures our mistakes (called “cross-entropy loss”).
The cross-entropy loss teaches our model to make confident, correct predictions while severely punishing mistakes. Let’s see how it works by examining the two curves in our plot, which show how we penalize predictions for spam and non-spam emails.
Consider a legitimate email (not spam, label = 0). Our model assigns it a probability of being spam:
uncertain_prediction = 0.3 # "Maybe spam (30% chance)?"
confident_wrong = 0.8 # "Probably spam (80% chance)"
confident_right = 0.1 # "Probably not spam (10% chance)"
The green curve (“Wrong: Not Spam”) shows how we penalize mistakes on legitimate emails:
The red curve (“Right: Spam”) works similarly but for actual spam emails. Together, the curves create a powerful learning dynamic:
Uncertainty gets corrected: The point “Uncertain & Wrong” shows a prediction hovering around 0.3 - not catastrophically wrong, but penalized enough to encourage learning.
Confidence gets rewarded: The point “Confident & Right” shows a prediction around 0.1, receiving minimal penalty. This reinforces correct, confident predictions.
Catastrophic mistakes get prevented: Both curves shoot upward at their edges, creating enormous penalties for high-confidence mistakes. This prevents the model from becoming overly confident in the wrong direction.
The two curves meet at 0.5 (our decision boundary), creating perfect symmetry between spam and non-spam mistakes. This balance means our model:
The mathematics behind this intuition uses the logarithm:
# For legitimate email (label = 0):
loss = -log(1 - prediction) # Penalize high predictions
# For spam email (label = 1):
loss = -log(prediction) # Penalize low predictions
Through training, these penalties guide the model from uncertainty toward confident, correct predictions:
# Before training (uncertain)
email1 = 0.48 # Struggles to classify
email2 = 0.51 # Nearly random guesses
# After training (confident + correct)
email1 = 0.02 # Confidently marks non-spam
email2 = 0.97 # Confidently marks spam
Each training step pushes predictions away from the uncertain middle and toward confident extremes - but only when the evidence justifies that confidence. This careful balance between confidence and caution produces a robust spam filter that makes reliable decisions.
Imagine you’re hiking in a valley and want to reach the lowest point. A natural strategy is to:
This is exactly how the most well-known algorithm for optimization–called gradient descent–works.
We can implement the gradient descent algorithm in PyTorch. Here’s how each step or iteration works:
The process repeats until the mistakes can’t be reduced further:
# Start with random weights
weights = torch.randn(5, requires_grad=True)
learning_rate = 0.01
for _ in range(1000):
# 1. Make predictions and calculate mistakes
predictions = spam_score(features, weights)
loss = cross_entropy_loss(predictions, true_labels)
# 2. Calculate gradient (steepest direction)
loss.backward()
# 3. Take a step downhill
with torch.no_grad():
weights -= learning_rate * weights.grad
weights.grad.zero_()
More formally, each iteration measures how well our current weights classify all of our training emails, calculates a gradient of the loss function with respect the weights $w$ (an “error-reducing” direction), and updates all weights by taking a small step in this error-reducing direction. The learning rate (0.01) controls these steps - smaller values move more reliably toward better weights but take longer, while larger values move faster but risk overshooting good solutions.
The optimization process continues until it takes 1000 steps in the gradient direction. This is just one way to stop the algorithm; others exist. For example, we could instead halt when the loss plateaus (stops decreasing significantly) or reaches a target threshold. Each approach balances computation time against solution quality.
When you checkout the notebook for this lecture, this is what you’ll see as you run the training loop: The first plot shows the value of the cross-entropy loss as we train the model. This and the “training accuracy” (shown in the second plot) are both metrics that measure how well our model performs on the training data. They are are diagnostic plots – they say at least the optimization part of the code is working.
The third plot is far more important. It shows the performance of the model on a test set of emails that were not used to train the model. A gap between training and test accuracy would signal overfitting, but here both metrics converge and stay stable, indicating the model generalizes well to new examples.
Together, these plots confirm our model learns a robust set of rules for distinguishing spam from legitimate email. The rapid initial progress followed by stability suggests the feature set captures the essential patterns in our data.
PyTorch makes this optimization process painless by automating the most challenging part - computing gradients. As calculations flow through our code, PyTorch builds a record of operations (a computational graph). When we call backward()
, this graph enables automatic calculation of all required derivatives.
The framework achieves this automation through tensors - its fundamental building blocks that store numbers in grid structures representing single values, lists, tables, or higher-dimensional arrays. PyTorch uses tensors because they track operation history for automatic gradients and enable parallel computation. This design lets code run efficiently on both CPUs and GPUs with minimal changes.
We’ll talk more about these intricacies as the course progresses.
While our spam filter demonstrates these concepts simply, the same pattern powers most machine learning: predict outputs, measure errors, compute gradients, adjust parameters. Different problems require different loss functions or model architectures, but this optimization loop remains central and is the focus of this course.
Our spam filter illustrates the key ideas you’ll learn in this course:
So what will we cover in this course?
We begin with the essential tools: norms, inner products, and matrix decompositions. Linear regression serves as our first optimization problem, solvable through direct methods like LU factorization and Gaussian elimination. These methods work well for moderate problem sizes but struggle with large datasets, motivating our transition to iterative methods.
This section introduces optimization formulations in several core application areas, and classical formulations that are solveable by existing software tools.
First, we explore problem formulations in three core application areas:
Statistical Estimation and Inverse Problems Reconstruct signals from indirect observations - as in the reconstruction of the first black hole image from radio telescope data.
Machine Learning Models Train models that learn patterns from data, whether for prediction (spam detection) or generation (language models).
Sequential Decision Making Make decisions in sequence where each choice reshapes future possibilities - from chess strategies to spacecraft navigation.
Second, we study classical convex formulations, such as Linear Programs (LP), Quadratic Programs (QP), and Semidefinite Programs (SDP). These formulations have well-understood properties and moderately sized problem can be solved efficiently by existing software tools, such as CVXPY and others (e.g., Google’s MathOpt). While our core applications often involve more complex optimization problems, recognizing when parts of them fit into these classical templates allows us to leverage powerful, ready-made solvers.
Gradients, Hessians, and Taylor expansions provide the mathematical foundation for optimization algorithms. This section explains how these tools guide algorithm design and implementation.
Automatic differentiation (AD) powers modern deep learning frameworks. We examine how AD works, starting with a minimal implementation (Micrograd) before moving to PyTorch. This progression reveals how frameworks handle derivatives automatically, enabling rapid iteration on complex models.
Large datasets make computing exact gradients impractical. Stochastic and mini-batch variants of gradient descent offer a solution by sampling data subsets. We examine theoretical guarantees (global minima for convex problems, critical points for nonconvex ones) and practical modifications like Adam, momentum, and learning rate schedules.
Some optimization landscapes require more than gradient information. Newton, Gauss-Newton, and quasi-Newton methods reshape the optimization landscape for faster convergence. Linear solvers (Conjugate Gradient, CoLA) make these approaches practical for large problems.
Based on class interests, we may cover:
We conclude with practical strategies for large-scale optimization:
The details of this outline shift based on class interests. By the end of the course, you will have a toolbox of optimization methods, an understanding of their theoretical underpinnings, and practical experience in applying them to real problems.
What will you learn by the end of this course?
Modeling and Formulation
By the end of this course, you should be able to take a real-world problem (in data science, machine learning, or sequential decision-making) and translate it into a formal optimization problem with well-defined objectives and constraints.
Algorithm Selection and Analysis
You will learn how to choose an appropriate algorithm—from basic gradient descent to more advanced quasi-Newton methods—and understand the trade-offs of different approaches (speed, accuracy, scalability).
Software Proficiency
We will use modern libraries like PyTorch (possibly Jax) to implement and experiment with these algorithms. You will gain experience with auto differentiation and learn best practices for tuning and debugging iterative solvers.
Optimization in Practice
Although we’ll cover fundamental optimization theory (convexity, convergence rates, saddle points, etc.), the focus is on practical usage. You will learn which methods to try first and how to iterate quickly when working with large datasets and complicated models.
Research Methods
This course also prepares you for research or advanced development tasks. You’ll see how to benchmark optimization methods, reproduce existing studies, and innovate ways to handle constraints like privacy and distributed data.
When you finish, you’ll be equipped to handle the optimization component of modern data science and machine learning projects, appreciating both the theoretical and practical dimensions.