In Lecture 0, we classified spam using word frequencies as vectors. Each email became a point in high-dimensional space, where dimensions represented word counts. Linear algebra revealed the underlying geometry - similar emails clustered together, and a linear boundary separated spam from non-spam.
Now we’ll explore how PyTorch implements these operations efficiently. Three key ideas emerge: tensors as the computational counterpart of vectors and matrices, vectorized operations (dot products, matrix multiplication, broadcasting) that exploit memory layout, and decompositions such as the SVD that reveal structure in data.
Let’s start with vectors and their computational counterpart, tensors.
Vectors form the building blocks of data representation. In the spam example, each dimension measured a word’s frequency. Here, we’ll use temperature readings to build intuition:
import torch

# Temperature readings (Celsius)
readings = torch.tensor([22.5, 23.1, 21.8]) # Morning, noon, night
print(readings) # tensor([22.5000, 23.1000, 21.8000])
PyTorch implements vectors as tensors, optimizing the underlying memory and computation:
# Compare two days
morning = torch.tensor([22.5, 23.1, 21.8]) # Yesterday
evening = torch.tensor([21.0, 22.5, 20.9]) # Today
alpha = 0.5 # Averaging weight
# Vector addition: component-wise operation
total = morning + evening # Parallel computation
print(total) # tensor([43.5000, 45.6000, 42.7000])
# Scalar multiplication: uniform scaling
weighted = alpha * morning # Efficient broadcast
print(weighted) # tensor([11.2500, 11.5500, 10.9000])
PyTorch generalizes vectors to n-dimensional arrays. The shape property defines the array structure and guides computation:
# Vector creation methods
temps = torch.tensor([22.5, 23.1, 21.8]) # From data
print(f"Vector shape: {temps.shape}") # torch.Size([3])
zeros = torch.zeros(3) # Initialized
print(f"Zeros shape: {zeros.shape}") # torch.Size([3])
# Matrix: week of readings
weekly = torch.randn(7, 3) # Random normal
print(f"Matrix shape: {weekly.shape}") # torch.Size([7, 3])
morning = torch.tensor([22.5, 23.1, 21.8])
evening = torch.zeros_like(morning) # Same shape as morning
combined = morning + evening
print(combined.shape, combined) # Shape preserved in operations
PyTorch implements these operations directly. The dot product reveals relationships between vectors: for temperature data, high values indicate that two days follow similar patterns of variation. The norm quantifies the overall magnitude of a day’s readings.
# Combining sensor readings
day1 = torch.tensor([22.5, 23.1, 21.8]) # Warmer day
day2 = torch.tensor([21.0, 22.5, 20.9]) # Cooler day
# Average readings
avg = (day1 + day2) / 2
print(avg) # tensor([21.75, 22.80, 21.35])
# Dot product reveals pattern similarity
similarity = torch.dot(day1, day2)
print(f"Similarity: {similarity:.1f}") # 1447.9: high similarity
print(f"Day 1 magnitude: {torch.norm(day1, p=2):.1f}") # 38.9
print(f"Day 2 magnitude: {torch.norm(day2, p=2):.1f}") # 37.2
These numbers summarize the two days: the large dot product reflects similar, sizeable readings, and the norms (38.9 vs. 37.2) show that day 1 was slightly warmer overall. The angle between the vectors, computed as $\cos(\theta) = \frac{x \cdot y}{\|x\|\,\|y\|}$, measures pattern similarity independent of magnitude. For these days:
cos_theta = similarity / (torch.norm(day1) * torch.norm(day2))
print(f"Pattern similarity: {cos_theta:.3f}") # 0.999: nearly identical patterns
This near-1 cosine shows the temperature curves have almost identical shapes, just shifted slightly in magnitude. The same building blocks also quantify how much readings vary within a single day:
# Computing average deviation from mean
readings = torch.tensor([22.5, 23.1, 21.8])
mean = readings.mean()
deviations = readings - mean
magnitude = torch.sqrt(torch.dot(deviations, deviations))
print(f"Average deviation: {magnitude/3:.4f}") # Average deviation: 0.3067
Computers store tensors in physical memory as contiguous blocks. PyTorch uses row-major ordering, so elements within a row sit next to each other in memory. This layout affects performance: reading along a row touches consecutive addresses and uses the cache well, while reading down a column requires strided jumps through memory. For temperature data:
week_temps = torch.randn(7, 3)          # Week of readings (rows = days)
weights = torch.tensor([0.5, 0.3, 0.2]) # Per-time weights
# Fast: accessing one day's readings (row)
day_readings = week_temps[0]            # Contiguous memory access
# Slower: accessing one time across days (column)
morning_temps = week_temps[:, 0]        # Strided memory access
# Matrix multiply organizes computation to maximize cache usage
result = torch.mm(week_temps, weights.view(-1, 1)) # Shape [7, 1]
Understanding memory layout helps you choose efficient operations; a tensor’s stride, shown in the sketch below, records exactly how far apart neighboring elements sit in memory.
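A small illustrative sketch of inspecting layout with stride() and is_contiguous(), which every PyTorch tensor provides:
week_temps = torch.randn(7, 3)
print(week_temps.stride())        # (3, 1): step 3 elements to the next row, 1 to the next column
print(week_temps.is_contiguous()) # True: rows are packed back to back
transposed = week_temps.t()       # Transpose is a view, not a copy
print(transposed.stride())        # (1, 3): column-style access over the original layout
print(transposed.is_contiguous()) # False: logical neighbors are no longer adjacent in memory
packed = transposed.contiguous()  # Explicit copy back into row-major order
print(packed.is_contiguous())     # True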
As an exercise, given temperature readings from two days, compute their dot product, norms, and cosine similarity using the operations above:
day1 = torch.tensor([22.5, 23.1, 21.8]) # Morning, noon, night
day2 = torch.tensor([21.0, 22.5, 20.9]) # Morning, noon, night
Matrices help analyze multiple days of temperature readings at once:
# One week of temperature readings (7 days × 3 times per day)
week_temps = torch.tensor([
[22.5, 23.1, 21.8], # Monday
[21.0, 22.5, 20.9], # Tuesday
[23.1, 24.0, 22.8], # Wednesday
[22.8, 23.5, 21.9], # Thursday
[21.5, 22.8, 21.2], # Friday
[20.9, 21.8, 20.5], # Saturday
[21.2, 22.0, 20.8] # Sunday
])
print(f"Shape: {week_temps.shape}") # torch.Size([7, 3])
print(f"Total elements: {week_temps.numel()}") # Number of elements
For matrices $A$ and $B$ of the same size, addition and scalar multiplication work element-wise: $(A + B)_{ij} = a_{ij} + b_{ij}$ and $(\alpha A)_{ij} = \alpha\, a_{ij}$.
These operations preserve the structure of the data while revealing patterns:
# Last week's temperatures
last_week = torch.tensor([
[21.5, 22.1, 20.8], # Morning, noon, night
[20.0, 21.5, 19.9],
[22.1, 23.0, 21.8]
])
# This week's temperatures
this_week = torch.tensor([
[22.5, 23.1, 21.8],
[21.0, 22.5, 20.9],
[23.1, 24.0, 22.8]
])
# Temperature change shows consistent warming
temp_change = this_week - last_week
print("Temperature changes:")
print(temp_change)
# tensor([[1., 1., 1.], # Uniform 1°C increase
# [1., 1., 1.], # across all times
# [1., 1., 1.]]) # and days
# Average temperatures reveal daily pattern
daily_means = this_week.mean(dim=0)
print("\nAverage temperatures:")
print(daily_means) # tensor([22.2000, 23.2000, 21.8333])
# Morning Noon Night
The outputs reveal clear patterns: every reading rose by exactly 1°C from last week, and noon is consistently the warmest time of day while night is the coolest. Before moving on, consider this:
# What's the output shape and meaning?
week_temps = torch.randn(7, 3) # Week of readings
day_weights = torch.ones(7) / 7 # Equal weights for each day
weighted_means = ???
Matrix multiplication combines information across dimensions. For matrices $A$ and $B$: $c_{ij} = \sum_{k=1}^n a_{ik}b_{kj}$
For matrix-vector multiplication $(Ax = b)$: $b_i = \sum_{j=1}^n a_{ij}x_j$
This operation is fundamental because a single matrix-vector product computes many weighted sums at once, and optimized libraries (BLAS) make it very fast:
# Temperature readings and weights
temps = torch.tensor([
[22.5, 23.1, 21.8], # Day 1: morning, noon, night
[21.0, 22.5, 20.9], # Day 2: morning, noon, night
[23.1, 24.0, 22.8] # Day 3: morning, noon, night
])
weights = torch.tensor([0.5, 0.3, 0.2]) # Per-day weights: recent days matter more (day 1 most recent)
# Matrix-vector multiply: weight each day's row, then sum per time of day
weighted_means = torch.mv(temps.t(), weights) # Uses BLAS
print("Weighted averages per time:")
print(weighted_means) # ≈ tensor([22.17, 23.10, 21.73])
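As a sanity check (an illustrative addition, not part of the original example), the same result comes from explicitly scaling each day’s row by its weight and adding:
# Illustrative check: the matrix-vector product is just these weighted sums, vectorized
manual = weights[0] * temps[0] + weights[1] * temps[1] + weights[2] * temps[2]
print(torch.allclose(manual, weighted_means)) # True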
The weighted averages keep the familiar daily shape (noon warmest, night coolest) while weighting recent days more heavily, capturing current trends rather than long-term averages.
Broadcasting generalizes operations between tensors of different shapes. It extends the mathematical concept of scalar multiplication to more general shape-compatible operations. For a vector $v \in \mathbb{R}^n$ and matrix $A \in \mathbb{R}^{m \times n}$:
\[(A * v)_{ij} = a_{ij} * v_j\]
This operation implicitly replicates the vector across rows, but without copying memory. The computational advantages are significant: no $m \times n$ copy of $v$ is ever materialized, and the element-wise work runs inside optimized kernels rather than a Python loop.
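A quick sketch (with illustrative random values) showing that broadcasting matches explicit replication, and that the replicated view created by expand() avoids a copy: its row stride is 0.
A = torch.randn(4, 3)
v = torch.randn(3)
explicit = A * v.expand(4, 3) # Explicitly replicated view of v
implicit = A * v              # Broadcasting does the replication for us
print(torch.allclose(explicit, implicit)) # True
print(v.expand(4, 3).stride())            # (0, 1): rows reuse the same memory, nothing is copied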
# Temperature readings across days
day_temps = torch.tensor([
[22.5, 23.1, 21.8], # Day 1: morning, noon, night
[21.0, 22.5, 20.9], # Day 2
[23.1, 24.0, 22.8] # Day 3
])
# Sensor calibration factors (per time of day)
calibration = torch.tensor([1.02, 0.98, 1.01])
# Broadcasting: each time slot gets its calibration
calibrated = day_temps * calibration # Shape [3,3] * [3] -> [3,3]
print("Original vs Calibrated:")
print(day_temps[0]) # Before calibration
print(calibrated[0]) # After calibration
Broadcasting rules follow mathematical intuition: shapes are aligned from their trailing dimensions, any dimension of size 1 is stretched to match its partner, and a missing leading dimension is treated as size 1. Two shapes are compatible when every aligned pair of sizes is equal or contains a 1.
This enables concise, efficient code:
# Temperature adjustments
base = torch.tensor([[22.5, 23.1, 21.8], # Base readings
[21.0, 22.5, 20.9]])
offset = torch.tensor([0.5, 0.0, -0.5]) # Per-time adjustments
scale = torch.tensor([1.02, 0.98]).view(-1, 1) # Per-day scaling
# Multiple broadcasts in one expression
adjusted = scale * (base + offset) # Combines both adjustments
print("Adjusted shape:", adjusted.shape)
temps = torch.tensor([[22.5, 23.1, 21.8],
[21.0, 22.5, 20.9]])
offset = torch.tensor([0.5, 0.0, -0.5])
adjusted = temps + offset # What's the shape?
As an exercise, given several days of readings, per-sensor calibration factors, and per-day importance weights, how would you combine all three into a single calibrated, weighted summary? (One possible approach follows the code.)
week_temps = torch.tensor([
[22.5, 23.1, 21.8], # Each row: morning, noon, night
[21.0, 22.5, 20.9],
[23.1, 24.0, 22.8]
])
calibration = torch.tensor([1.02, 0.98, 1.01]) # Per sensor
importance = torch.tensor([0.5, 0.3, 0.2]) # Per day
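One possible combination, sketched under the assumption that the goal is calibrated, importance-weighted averages per time of day:
# Illustrative combination; the intended exercise answer may differ
calibrated = week_temps * calibration # Broadcast: correct each sensor's bias
summary = importance @ calibrated     # Weight the days, sum per time of day -> shape [3]
print(summary)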
SVD (Singular Value Decomposition) reveals structure in data by factoring a matrix $A$ into orthogonal components: $A = U\Sigma V^T$. Let’s see how this helps analyze our spam data:
# Feature matrix X: 10 emails × 5 features (built in Lecture 0, not repeated here)
# Features: exclamation_count, urgent_words, suspicious_links, caps_ratio, length
U, S, V = torch.linalg.svd(X) # Note: V holds V^T; its rows are the right singular vectors
print("Singular values:", S)
# tensor([11.3077, 1.4219, 0.5334, 0.2697, 0.0496])
print("Energy per pattern:", 100 * S**2 / torch.sum(S**2), "%")
# tensor([98.17, 1.55, 0.22, 0.06, 0.00]) %
The decomposition reveals that a single pattern carries over 98% of the total energy. Looking at the first pattern:
print("Feature pattern:", V[0]) # tensor([-0.0206, -0.0076, -0.0061, -0.0010, -0.9997])
print("Email pattern:", U[:, 0]) # Similar weights around -0.3 for all emails
This dominant pattern, with singular value 11.31 (98.2% of the energy), shows an important principle in data analysis: high variation doesn’t always mean high discriminative power. The first singular vector is dominated by text length (-0.9997), with negligible contributions from other features. Despite capturing most of the data’s variation, this direction doesn’t help classify spam - both legitimate and spam emails can be long or short, as shown by the similar weights (around -0.3) for all emails in U[:, 0].
Looking at the second pattern:
print("Second feature pattern:", V[1]) # tensor([0.9283, 0.2724, 0.2503, 0.0304, -0.0227])
print("Second email pattern:", U[:, 1]) # Positive for spam, negative for non-spam
The second singular vector, despite having a much smaller singular value (1.42, only 1.55% of the energy), is far more informative for classification. It shows that exclamation marks (0.9283) and urgent words (0.2724) cluster together, and the corresponding email pattern clearly separates spam (positive values for the first 5 emails) from non-spam (negative values for the last 5 emails).
This illustrates a key insight: the directions of highest variation in your data (found by SVD) may not be the most useful for your task. While text length accounts for most of the variation between emails, the more subtle patterns of exclamation marks and urgent language are what actually distinguish spam from legitimate messages.
To quantify how well these patterns represent our data, we need a way to measure matrix size. The Frobenius norm extends our vector norm concept:
For vectors, $\|x\| = \sqrt{\sum_i x_i^2}$ measures total magnitude; for matrices, $\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}$ measures total variation.
This norm is natural because it treats the matrix as one long vector, it is unchanged by orthogonal transformations, and it equals $\sqrt{\sum_i \sigma_i^2}$, tying it directly to the singular values.
For our spam features:
# Three equivalent ways to measure total variation
print(f"Element-wise: {torch.sqrt((X**2).sum()):.1f}") # 11.4
print(f"As vector: {X.view(-1).norm():.1f}") # 11.4
print(f"From SVD: {S.norm():.1f}") # 11.4
SVD expresses matrices as sums of rank-1 patterns:
$A = \sum_{i=1}^n \sigma_i u_i v_i^T$
Each term contributes a rank-1 matrix: $u_i$ describes a pattern across rows (emails), $v_i$ a pattern across columns (features), and $\sigma_i$ sets that pattern’s strength.
We can approximate $A$ using just the first $k$ terms:
$A_k = \sum_{i=1}^k \sigma_i u_i v_i^T = U_k\Sigma_kV_k^T$
For our spam data:
def reconstruct(k):
    return U[:, :k] @ torch.diag(S[:k]) @ V[:k, :]
# Compare reconstructions
original = X[0] # First email features
rank1 = reconstruct(1)[0] # Using only top pattern
rank2 = reconstruct(2)[0] # Using top two patterns
print("Original:", original)
print("Rank 1:", rank1)
print("Rank 2:", rank2)
This truncation isn’t just simple - it’s optimal. The Eckart-Young theorem states that for any rank-$k$ matrix $B$:
\[\|A - A_k\|_F \leq \|A - B\|_F\]
Moreover, the squared error is exactly the sum of the discarded squared singular values:
\[\|A - A_k\|_F^2 = \sum_{i=k+1}^n \sigma_i^2\]
For our spam data:
def approx_error(k):
    """Compute relative error for rank-k approximation."""
    truncated = S[k:].norm(p=2)**2 # Squared Frobenius norm of discarded values
    total = S.norm(p=2)**2 # Total squared Frobenius norm
    return torch.sqrt(truncated / total)
# Show error decreases with rank
for k in range(1, 4):
    print(f"Rank {k} relative error: {approx_error(k):.2e}")
# Rank 1: 1.35e-01 (13.5% error)
# Rank 2: 5.03e-02 (5% error)
# Rank 3: 2.75e-02 (2.75% error)
The rapid decay of singular values (11.3 → 1.4 → 0.5) explains why low-rank approximations work well.
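Another way to see this, using the S computed above, is the cumulative share of energy captured as rank grows:
energy = S**2 / (S**2).sum()       # Fraction of total variation per pattern
print(torch.cumsum(energy, dim=0)) # ≈ tensor([0.982, 0.997, 0.999, 1.000, 1.000])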
Some patterns require multiple components. Consider this checkerboard:
pattern_image = torch.tensor([
[200, 50, 200, 50], # Alternating bright-dark
[50, 200, 50, 200], # Opposite pattern
[200, 50, 200, 50], # First pattern repeats
[50, 200, 50, 200] # Second pattern repeats
], dtype=torch.float)
U, S, V = torch.linalg.svd(pattern_image)
print("Singular values:", S)
# tensor([5.0000e+02, 3.0000e+02, 8.8238e-06, 2.4984e-06])
print("Energy per pattern:", 100 * S**2 / torch.sum(S**2), "%")
# tensor([7.3529e+01, 2.6471e+01, 2.2900e-14, 1.8359e-15]) %
The SVD reveals two essential patterns: the first component is a uniform background (every entry ≈ 125, the overall average), and the second is an alternating ±75 checkerboard. Both patterns are necessary because a rank-1 matrix can only contain rows that are scalar multiples of one another, while this image has two genuinely different row types; the reconstruction sketch below makes this concrete.
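A short reconstruction sketch using torch.outer (the signs of individual singular vectors are arbitrary, but each product $\sigma_i u_i v_i^T$ is not):
rank1 = S[0] * torch.outer(U[:, 0], V[0])         # Uniform background, ≈ 125 everywhere
rank2 = rank1 + S[1] * torch.outer(U[:, 1], V[1]) # Adds the ±75 alternation
print(rank1)
print(torch.allclose(rank2, pattern_image)) # True (up to floating-point rounding)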
These principles extend to high-dimensional data: images, word-count matrices, and sensor logs all compress well when a few dominant patterns carry most of the energy.
PyTorch implements linear algebra efficiently through three key mechanisms: contiguous tensor storage, vectorized operations backed by optimized kernels such as BLAS, and broadcasting that avoids unnecessary copies:
# Data representation
x = torch.tensor([1., 2., 3.]) # Vector (like temperature readings); float dtype so the matmul and norm below work
y = torch.zeros_like(x) # Initialize (like sensor baseline)
z = torch.randn(3, 3) # Random sampling
A = z.float() # Type conversion for computation
# Core computations
b = x + 2 # Broadcasting (calibration)
c = torch.dot(x, x) # Inner product (pattern similarity)
d = A @ x # Matrix multiply (weighted average)
e = torch.mean(A, dim=0) # Reduction (daily averages)
# Structure manipulation
f = A.t() # Transpose (change access pattern)
g = A.view(-1) # Reshape (flatten for computation)
h = A[:, :2] # Slice (select time window)
# Pattern analysis
U, S, V = torch.linalg.svd(A) # Decomposition (find patterns)
norm = torch.norm(x, p=2) # Vector norm (measure magnitude)
These operations combine to solve real problems: calibrating and averaging sensor readings, weighting recent data, and uncovering the low-rank structure that separates spam from legitimate email. The key is choosing the right operation for each task: element-wise arithmetic and broadcasting for per-element adjustments, matrix-vector products for weighted combinations, and decompositions like the SVD for finding patterns.