3. Model Compression
This is where we use eigenvectors to learn which directions are important during training, and which are redundant.
The standard tool for this is Singular Value Decomposition (SVD), which uses singular values that play a role similar to eigenvalues. This method is crucial for compressing LLMs. You can approximate a large weight matrix as a product of low-rank factors, keeping only the top singular values and vectors. This reduces memory and computation while preserving most of the model’s performance. Techniques like LoRA (Low-Rank Adaptation) exploit this principle. Rank measures how many independent directions a matrix has. A 4096×4096 matrix of fine-tuning updates might have an effective rank of only 8–16, meaning most of its directions are redundant. We will cover both rank and eigenvectors in more depth in the next article.
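To make "effective rank" concrete, here is a minimal sketch using NumPy. It builds a small 64×64 stand-in for a 4096×4096 matrix out of two thin factors plus a little noise, then counts how many singular values are large enough to matter. The names `B`, `A`, `delta_w` and the 1% threshold are assumptions chosen for illustration, not part of any library or of LoRA itself.

```python
import numpy as np

rng = np.random.default_rng(0)

n, r = 64, 8                            # small stand-ins for 4096 and a rank of 8
B = rng.standard_normal((n, r))         # tall, thin factor (hypothetical name)
A = rng.standard_normal((r, n))         # short, wide factor (hypothetical name)
noise = 1e-6 * rng.standard_normal((n, n))

delta_w = B @ A + noise                 # only 8 directions carry real signal

# Singular values come back sorted from largest to smallest.
singular_values = np.linalg.svd(delta_w, compute_uv=False)

# Assumed rule of thumb: a direction "counts" if its singular value
# is at least 1% of the largest one.
effective_rank = int(np.sum(singular_values > 0.01 * singular_values[0]))

print("matrix shape:  ", delta_w.shape)
print("effective rank:", effective_rank)
```

The noise means the matrix is not exactly low rank, which is closer to what you would see in practice: a handful of strong directions and a long tail of tiny ones.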
This kind of compression is rarely used to shrink models for distribution; quantization is more useful for that purpose. Instead, low-rank compression is mainly useful for efficient fine-tuning. Let’s get an introduction to this topic and learn the essential basics.
The LoRA Principle and Its Connection to Eigenvalues
This is where the eigenvalue theory becomes really practical. Let’s build this up from intuition to the mathematics.
The core LoRA insight is that most weight updates live in a low-dimensional subspace. Even though a weight matrix has millions of parameters, the actual changes during fine-tuning happen along relatively few important directions.
Let’s walk through it with full tuning and then with LoRA tuning.
With full tuning (the expensive method), we add a full-size update matrix ΔW to the entire layer of weights:
W_pretrained = load_model("llama-2-7b").layer[5].weight
W_new = W_pretrained + ΔW
For a 4096×4096 layer, that means roughly 16 million update values to store and 16 million additions to apply.
And that’s for just one step in training.
Now consider tuning with LoRA.
Let’s break it down with simple analogies and build up to the relationship between LoRA and eigenvalues.
First, imagine a change matrix: ΔW. This matrix is going to be the matrix of changes that we will apply to the weight matrix. Let’s argue by analogy: Imagine you have a recipe book (the original model weights). Then when you fine-tune that recipe book, you’re making edits. You might want to “Add more salt to recipe 1” or “Use less sugar in recipe 2” or “Cook 10 minutes longer for recipe 3.” This list of changes — the difference between the old recipe book and the new one — is your ΔW (delta W).
Original weights:
W_old = [[1.5, 2.3, 0.8],
[0.2, 1.7, 3.1],
[2.9, 0.5, 1.2]]
Your changes:
ΔW =
[[0.2, 0.2, 0.1],
[0.1, 0.2, 0.2],
[0.2, 0.2, 0.2]]
After fine-tuning:
W_new = W_old + ΔW =
[[1.7, 2.5, 0.9],
[0.3, 1.9, 3.3],
[3.1, 0.7, 1.4]]
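The toy update above is easy to check yourself. This short NumPy sketch uses the same numbers as the example; the variable name `delta_W` simply stands in for ΔW.

```python
import numpy as np

# Original weights from the example above.
W_old = np.array([[1.5, 2.3, 0.8],
                  [0.2, 1.7, 3.1],
                  [2.9, 0.5, 1.2]])

# The list of changes (delta W).
delta_W = np.array([[0.2, 0.2, 0.1],
                    [0.1, 0.2, 0.2],
                    [0.2, 0.2, 0.2]])

# Fine-tuning adds the changes element by element.
W_new = W_old + delta_W

print(np.round(W_new, 1))
```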
You might notice that the ΔW change matrix has small values. That makes sense: We are tuning the original weights slowly. Thus we are adjusting the original weights with small values.
However, a real neural network layer is far larger than 3×3. A typical weight matrix might be 4096×4096, so ΔW holds more than 16 million numbers!
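A bit of arithmetic shows why this size matters. The sketch below compares storing a full 4096×4096 update with storing two thin factors of rank 8. The rank of 8 is an assumption taken from the 8–16 range mentioned earlier, and the two-factor layout is the general low-rank idea rather than a specific library’s implementation.

```python
d = 4096   # layer width used in this article
r = 8      # assumed rank, picked from the 8-16 range mentioned earlier

full_update = d * d               # every entry of a full 4096 x 4096 update
low_rank_update = d * r + r * d   # two thin factors: (d x r) and (r x d)

print(f"full update:     {full_update:,} numbers")
print(f"low-rank update: {low_rank_update:,} numbers")
print(f"reduction:       {full_update // low_rank_update}x")
```

Under these assumptions the full update needs 16,777,216 numbers while the two factors need 65,536, which is 256 times fewer.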
SVD = Singular Value Decomposition
This is why we use singular value decomposition. Think of SVD as a way to analyze and simplify your changes. In the cooking analogy, imagine we made 100 recipe changes in our recipe book. SVD helps us determine whether we can group these changes into themes or patterns.
Maybe we notice Theme 1: “Make things sweeter.” Theme 1 affects 40 recipes, representing a very strong pattern among all of our recipe changes.
Then we notice Theme 2: “Cook things longer.” This theme affects 30 recipes, so it is also a strong pattern.
And we recognize Theme 3: “Add more herbs.” But we only apply this change to 10 recipes, so it is a weaker pattern.
And then we see Theme 4: “Use olive oil.” But this applies to only 2 recipes, so it’s a very weak pattern.
Finally we recognize that Themes 5–100 are barely noticeable changes to 1–2 recipes each.
SVD finds these themes automatically!
Let’s drive this home with a second analogy: the photography analogy.
Imagine that our change matrix, ΔW, is a photograph of 4096×4096 pixels.
Your photograph (ΔW):
████████████████████████████
████████████████████████████
████ YOUR FACE HERE █████
████████████████████████████
████████████████████████████
Singular value decomposition breaks the photo into layers, ordered by importance:
Layer 1 (most important):
░░░░░░░░░░░░░░░░░░░░
░░░ FACE SHAPE ░░░░░
░░░░░░░░░░░░░░░░░░░░
Layer 2 (important):
░░░░░░░░░░░░░░░
░░░ SHADING ░░░
░░░░░░░░░░░░░░░
Layer 3 (less important):
░░░░░░░░░░░░░░░░░
░░ HAIR DETAIL ░░
░░░░░░░░░░░░░░░░░
Layer 4 (minor):
░░░░░░░░░░░
░ TEXTURE ░
░░░░░░░░░░░
Layers 5-4096 are barely visible details
Now if we want to make changes to the photo, we can apply face shape changes only to Layer 1. We can apply shading changes only to Layer 2, and so on. The first few layers contain most of the picture! Layers 100–4096 are just tiny details (noise, texture of paper, etc.)
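The "layers" idea can be sketched in a few lines of NumPy. Each layer is one singular value multiplied by the outer product of one column of U and one row of V^T, and adding layers back in order rebuilds the picture. The 32×32 random matrix here is only a stand-in for a photograph, and the helper names `layer` and `rebuild` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
photo = rng.standard_normal((32, 32))   # random stand-in for the photograph

U, S, Vt = np.linalg.svd(photo)

def layer(i):
    """Layer i: one rank-1 pattern, scaled by its singular value."""
    return S[i] * np.outer(U[:, i], Vt[i, :])

def rebuild(k):
    """Add up the first k layers, most important first."""
    return sum(layer(i) for i in range(k))

# Reconstruction error shrinks as more layers are added back.
errors = [np.linalg.norm(photo - rebuild(k)) for k in (1, 4, 16, 32)]

for k, err in zip((1, 4, 16, 32), errors):
    print(f"first {k:2d} layers -> error {err:.4f}")
```

A random matrix has no face in it, so its error falls off slowly. A real photograph, or a real ΔW, would be expected to lose most of its error within the first few layers.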
Mathematically, SVD breaks the change matrix into three pieces:
ΔW = U × Σ × V^T
What Are U, Σ, and V^T? For now, here’s the breakdown:
Σ (Sigma) — The Importance Scores
Σ is the EASIEST to understand — it’s just a list of numbers saying “how important is each theme?”
Σ = diag([σ₁, σ₂, σ₃, ..., σ₄₀₉₆])
Actual numbers might look like:
Σ = [15.3, ← Theme 1: VERY important (large number)
12.7, ← Theme 2: Very important
9.4, ← Theme 3: Important
6.8, ← Theme 4: Somewhat important
5.2, ← Theme 5: Somewhat important
3.9, ← Theme 6: Getting less important
2.1, ← Theme 7: Minor importance
1.8, ← Theme 8: Minor importance
0.7, ← Theme 9: Not very important
0.5, ← Theme 10: Not very important
...
0.001, ← Theme 100: Barely matters
...
0.00001] ← Theme 4096: Essentially zero
Think of Σ as volume knobs:
σ₁ = 15.3 → Turn up the volume on Theme 1 to 15.3
σ₂ = 12.7 → Turn up the volume on Theme 2 to 12.7
σ₄₀₉₆ = 0.00001 → Theme 4096 is basically silent
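Here is a small sketch of reading those importance scores with NumPy. It builds a 10×10 matrix whose singular values are the first ten numbers listed above, then asks `np.linalg.svd` to recover them. The construction with `Q1` and `Q2` (random orthogonal matrices from a QR factorization) is just an assumed way to manufacture such a matrix for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# The first ten importance scores from the list above.
sigma = np.array([15.3, 12.7, 9.4, 6.8, 5.2, 3.9, 2.1, 1.8, 0.7, 0.5])
n = sigma.size

# Random orthogonal matrices play the role of U and V (demo assumption).
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
delta_w = Q1 @ np.diag(sigma) @ Q2.T

# SVD hands the scores back, sorted from largest to smallest.
S = np.linalg.svd(delta_w, compute_uv=False)

# Share of the total "energy" (sum of squared scores) in the top k themes.
energy = np.cumsum(S**2) / np.sum(S**2)

print("recovered scores:", np.round(S, 1))
print("energy in top 4: ", round(float(energy[3]), 3))
```

With these particular numbers, the top four themes hold a little over 90% of the total energy, which is the kind of concentration that makes dropping the quiet themes reasonable.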
U and V^T — The Actual Themes
These are harder to visualize, but here’s the intuition: U and V^T contain representations of what the PATTERNS themselves are.
```
U: (4096, 4096) - "Output space themes"
V^T: (4096, 4096) - "Input space themes"
```
Going back to our photo analogy:
```
U = What each layer LOOKS like in the output
Layer 1 in U: [big blob in center, edges dark]
Layer 2 in U: [gradient from left to right]
Layer 3 in U: [fine details in corners]
V^T = What each layer RESPONDS to in the input
Layer 1 in V^T: [responds to overall brightness]
Layer 2 in V^T: [responds to edges]
Layer 3 in V^T: [responds to specific textures]
```
For beginners: Don’t worry too much about U and V^T. Just know:
U and V^T define what each pattern/theme is.
Σ tells you how important each pattern is.
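If you want to see the three pieces working together, this sketch applies a small matrix to an input vector in three stages: V^T measures how much the input matches each pattern, Σ scales each match, and U mixes the results into the output. The 5×5 size and the names `responses` and `scaled` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
delta_w = rng.standard_normal((5, 5))   # tiny stand-in for a 4096 x 4096 matrix
x = rng.standard_normal(5)              # an arbitrary input vector

U, S, Vt = np.linalg.svd(delta_w)

responses = Vt @ x        # how strongly the input matches each input-side pattern
scaled = S * responses    # importance scores turn each pattern up or down
y = U @ scaled            # output-side patterns are mixed into the final result

# The three-stage route gives the same answer as multiplying directly.
print(np.allclose(y, delta_w @ x))
```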
Now let’s come back to eigenvectors and eigenvalues — the whole reason we are here in the first place.
The singular values of ΔW are the square roots of the eigenvalues of ΔW^T ΔW!
ΔW: (4096, 4096)
M = ΔW^T @ ΔW
eigenvalues_of_M = [λ₁, λ₂, λ₃, ..., λ₄₀₉₆]
σ₁ = √λ₁
σ₂ = √λ₂
σ₃ = √λ₃
...
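This relationship is easy to verify numerically. The sketch below uses a small 6×6 random matrix as a stand-in for ΔW, computes the eigenvalues of M = ΔW^T @ ΔW, and compares their square roots with the singular values from `np.linalg.svd`. The matrix size and variable names are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
delta_w = rng.standard_normal((6, 6))   # small stand-in for a 4096 x 4096 matrix

# M is symmetric, so eigvalsh applies; it returns eigenvalues in ascending order.
M = delta_w.T @ delta_w
eigenvalues = np.linalg.eigvalsh(M)[::-1]   # flip to largest-first

# Clip guards against tiny negative values caused by floating-point rounding.
sqrt_eigenvalues = np.sqrt(np.clip(eigenvalues, 0.0, None))

singular_values = np.linalg.svd(delta_w, compute_uv=False)

print("sqrt of eigenvalues:", np.round(sqrt_eigenvalues, 4))
print("singular values:    ", np.round(singular_values, 4))
```

The two printed rows should agree up to floating-point rounding.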