Learn how ResNet-152 broke the 22-layer limit.
In weekly articles for the last 6 weeks, we have looked at many aspects of linear algebra and tried to connect it to machine learning. The underlying goal is to motivate ourselves to become experts in the math, so that we can be world class at machine learning.
For this reason, we have looked at the conceptual basis of eigenvalues and eigenvectors. Then we learned how to calculate them, and then we learned how they can be used to analyze real network layers using their cousins, singular vectors and singular values.
We have also looked in detail at recurrent neural networks, mostly to understand the value of matrix transformations, but also to understand the rudiments of network architecture.
All of this work comes together here.
This time we will look at another piece of network architecture: The residual connection.
The residual connection is a simple concept. It’s so simple that we might step back and say: ‘Is that all?’ And then: ‘If that’s all it is, then how can it be so effective?’
But let’s jump to some startling facts. Until about 2015, a network with around 20 layers, such as VGG-19 (19 layers) or GoogLeNet (22 layers), was considered deep. In 2015, ResNet-152 used 152 layers and won the ImageNet (ILSVRC 2015) classification challenge. After that, networks with 1000+ layers became possible.
Modern transformer-based networks don’t have so many layers. Instead, each transformer layer is much more complex than a ResNet layer. But ResNet was part of that growth.
Around 2012, AlexNet had 8 layers and performed simple convolutions. Around 2014, VGG had 16–19 layers, which was still relatively simple by today’s standards. The limit seemed to be about 20–22 layers before vanishing gradients killed training. In the same year (2014), GoogLeNet/Inception achieved the first major complexity increase, using parallel paths and concatenating their results. But networks still couldn’t go more than about 22 layers deep.
In 2015, ResNet changed that. Its innovation was the skip connection. We will shortly get to what that means, but the net result was that the 22-layer barrier was broken: networks with 152+ layers could be trained, because skip connections addressed the vanishing gradient problem. This made network depth a new frontier.
Later, with ResNeXt, DenseNet and Squeeze-and-Excitation nets in 2016 and 2017, the basic ResNet style had both depth and complexity. But that story is for another time.
Here we are concerned with how ResNet worked, how it stopped vanishing gradients, and how that is related to eigenvectors and eigenvalues. And in the end we will build and use our own ResNet network.
In a standard network with no residual connections, each layer sees only the transformed output of the previous layer:

In a ResNet-type of network, each layer sees the transformed output of the previous layer, but it ALSO gets the output from two layers back:

In the illustration, Layer1 is formed as usual from the input x0. This produces the output of Layer1, which we call x1.
Layer2 is formed from the input x1. But the output x2 is the sum of Layer2 and x0.
Layer3 is formed from the input x2, but the output x3 is the sum of Layer3 and x1.
The following illustration says the same thing. It shows how each input contributes both to the next layer and to the layer after that.

Key Difference: The red skip connections on the right allow the input to each layer to be added back to the output. This simple “+ x” is what enables training networks with 152+ layers!
If we were to put this in python pseudo code, it might look like this:
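A minimal sketch of that wiring, assuming NumPy and toy layers made of a small random linear map followed by ReLU. The names `layer`, `weights`, and `dim` are hypothetical and chosen only for illustration; a real ResNet block uses convolutions and batch normalization instead.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # hypothetical feature width, chosen only for illustration

# Hypothetical layers: one small random weight matrix per layer.
weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)]


def layer(w, x):
    """One plain layer: linear transform followed by ReLU."""
    return np.maximum(w @ x, 0.0)


x0 = rng.normal(size=dim)

# Plain network: each layer sees only the previous output.
p1 = layer(weights[0], x0)
p2 = layer(weights[1], p1)
p3 = layer(weights[2], p2)

# Residual wiring as described above: add the output from two layers back.
x1 = layer(weights[0], x0)
x2 = layer(weights[1], x1) + x0  # skip connection from x0
x3 = layer(weights[2], x2) + x1  # skip connection from x1

print("plain output:   ", p3)
print("residual output:", x3)
```

The only difference between the two versions is the `+ x0` and `+ x1` terms.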

The gradient flow problem without residuals is the problem of exploding or vanishing gradients that we discussed in another recent article. Here is the situation without residuals during backpropagation:
# Forward pass
x1 = W1 @ x0
x2 = W2 @ x1
x3 = W3 @ x2
output = W4 @ x3
# Backward pass - chain rule
∂L / ∂x0 = ∂L / ∂output × ∂output / ∂x3 × ∂x3 / ∂x2 × ∂x2 / ∂x1 × ∂x1 / ∂x0
∂L / ∂x0 = ∂L / ∂output × W4ᵀ × W3ᵀ × W2ᵀ × W1ᵀ
# Problem: Multiplying many matrices!
# If eigenvalues > 1: explosion
# If eigenvalues < 1: vanishing
In the forward pass, the data flows through the network by multiplying with weight matrices at each layer: x0 → W1 → x1 → W2 → x2 → W3 → x3 → W4 → output. Each layer transforms the data using its learned weights. This is the normal forward computation and is not the problem.
In the backward pass, during backpropagation, we need to compute how the loss depends on each weight matrix (∂L/∂W₁, ∂L/∂W₂, etc.) so we can adjust them to reduce the loss. This requires propagating the loss derivative backward by multiplying through ALL the weight matrix transposes: ∂L/∂output × W4ᵀ × W3ᵀ × W2ᵀ × W1ᵀ.
This leads to the multiplication problem.
When you multiply many matrices together, the size of the result is governed by their eigenvalues. If the weight matrices have eigenvalues with magnitude larger than 1, the loss signal grows exponentially with each multiplication. This leads to something called exploding gradients: the numbers become huge and eventually produce not-a-number (NaN) errors. If the eigenvalues have magnitude smaller than 1, the loss signal shrinks exponentially with each multiplication. This leads to vanishing gradients, where the numbers become tiny and approach zero.
In a 20-layer network without skip connections, you’d be multiplying 20 weight matrices together! Even if each matrix shrinks the loss signal by a factor of just 0.5, you get 0.5²⁰ ≈ 10⁻⁶ (essentially zero). The early layers receive almost no loss information, so they can’t learn anything.
On the other hand, if each weight matrix scales the gradient by 1.5, then the gradient reaching the early layers grows to about 1.5²⁰ ≈ 3,300 times its original size, and again, they can’t learn anything because the magnitudes are too high.
The problem only gets worse as we add more layers.
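The arithmetic above can be checked numerically. The sketch below is a deliberately idealized setup, not a real network: it assumes NumPy, and it builds each hypothetical weight matrix as a random orthogonal matrix times a constant `scale`, so that every layer changes the gradient's length by exactly that factor. The function name `backprop_norm` and its parameters are illustrative.

```python
import numpy as np


def backprop_norm(scale, depth=20, dim=16, seed=0):
    """Length of a unit gradient after `depth` transposed weight matrices."""
    rng = np.random.default_rng(seed)
    grad = np.ones(dim) / np.sqrt(dim)  # unit-length starting gradient
    for _ in range(depth):
        # Random orthogonal matrix times `scale`:
        # every singular value of w equals `scale`.
        q, _r = np.linalg.qr(rng.normal(size=(dim, dim)))
        w = scale * q
        grad = w.T @ grad
    return np.linalg.norm(grad)


shrinking = backprop_norm(0.5)  # each layer halves the signal
growing = backprop_norm(1.5)    # each layer multiplies it by 1.5

print(f"scale 0.5, 20 layers: {shrinking:.2e}")
print(f"scale 1.5, 20 layers: {growing:.2e}")
```

Real weight matrices do not scale every direction equally, but the same exponential trend tends to dominate once the depth is large.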
ResNet’s skip connections provide an alternative gradient path that doesn’t require multiplying through all those weight matrices. The gradient can flow directly backward through the “+” additions, bypassing the problematic matrix multiplications. This is why the simple “+ x” enables training 152-layer networks!
The mathematical reason behind this is rather neat. The forward pass of a residual layer adds the input x back onto the layer’s transformation F(x):

H(x) = F(x) + x

The derivative of that formula is used in backpropagation. Since the derivative of x with respect to x is 1, it looks like this:

∂H/∂x = ∂F/∂x + 1
But the “+1” means that the gradient won’t vanish!
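To see the effect of that "+1" across many layers, here is a small sketch under simplifying assumptions: NumPy, 50 hypothetical linear layers F(x) = W x with small random weights, and a skip connection around every layer so that H(x) = F(x) + x. In that case the per-layer Jacobian is W for a plain layer and W + I for a residual layer. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, depth = 8, 50

# Hypothetical small linear layers F_i(x) = W_i x, so dF_i/dx = W_i.
weights = [rng.normal(scale=0.02, size=(dim, dim)) for _ in range(depth)]
identity = np.eye(dim)

grad_plain = np.ones(dim)
grad_resid = np.ones(dim)
for w in reversed(weights):  # backpropagation visits the last layer first
    grad_plain = w.T @ grad_plain               # plain layer:    x -> W x
    grad_resid = (w + identity).T @ grad_resid  # residual layer: x -> W x + x

plain_norm = np.linalg.norm(grad_plain)
resid_norm = np.linalg.norm(grad_resid)

print(f"plain gradient norm at layer 1:    {plain_norm:.3e}")
print(f"residual gradient norm at layer 1: {resid_norm:.3e}")
```

With weights this small, the plain gradient collapses toward zero while the residual gradient stays at a usable magnitude, because the identity term carries it through every layer.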
Let’s trace gradients through 50 layers.
Without residuals, the gradient gets smaller and smaller as we apply it to subsequent layers:
# Gradient shrinks by a factor of 0.9 at each layer
gradient_layer_50 = 1.0
gradient_layer_40 = 1.0 * 0.9**10  # ≈ 0.35
gradient_layer_30 = 1.0 * 0.9**20  # ≈ 0.12
gradient_layer_20 = 1.0 * 0.9**30  # ≈ 0.04
gradient_layer_10 = 1.0 * 0.9**40  # ≈ 0.015
gradient_layer_1 = 1.0 * 0.9**50   # ≈ 0.005  ◄── Vanished!
With residuals, the gradient does not vanish:
# Each layer: ∂H/∂x = ∂F/∂x + 1
# Even if ∂F/∂x = 0.1 (small), we get 0.1 + 1 = 1.1
gradient_layer_50 = 1.0
gradient_layer_40 = 1.0 * 1.1**10  # ≈ 2.59  ◄── Growing!
gradient_layer_30 = 1.0 * 1.1**20  # ≈ 6.73
gradient_layer_20 = 1.0 * 1.1**30  # ≈ 17.4
gradient_layer_10 = 1.0 * 1.1**40  # ≈ 45.3
gradient_layer_1 = 1.0 * 1.1**50   # ≈ 117   ◄── Still flowing!
It looks like the gradient may now be exploding instead. In practice the ∂F/∂x terms are often small, so each factor stays close to 1 and the product remains manageable.
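A quick way to get a feel for this is to let the per-layer slope vary. The sketch below uses the same scalar simplification as the traces above and assumes, purely for illustration, that each layer’s ∂F/∂x is a random number between -0.1 and 0.1. The range and the seed are arbitrary choices, not measurements from a real network.

```python
import random

random.seed(0)
depth = 50

# Assumption for illustration: each layer's dF/dx is small and of either sign.
local_slopes = [random.uniform(-0.1, 0.1) for _ in range(depth)]

plain = 1.0     # without skip connections: multiply by dF/dx only
residual = 1.0  # with skip connections: multiply by dF/dx + 1
for slope in local_slopes:
    plain *= slope
    residual *= slope + 1.0

print(f"plain product over {depth} layers:    {plain:.3e}")
print(f"residual product over {depth} layers: {residual:.3f}")
```

The plain product is bounded by 0.1⁵⁰ in magnitude, while the residual product is a product of factors between 0.9 and 1.1 that partly cancel each other out.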
Strictly speaking, we are going to fine-tune an existing, pretrained classifier rather than train one from scratch, which saves some time. It will still take about 10 minutes on an M1 to get through about 10% of one epoch. So 100 minutes per epoch times 10 epochs comes to roughly 16 to 17 hours of training. But that’s less than a day for a real neural network that has 152 layers! Pretty cool on your home computer.
We’ll need to set up the following.
1. We’ll need the cat and dog images dataset from Kaggle. As of Feb 2026 the exact link to this dataset is here. If that link is broken, then do a search for ‘cat and dog images dataset’ and ‘Kaggle.’ It’s a pretty widely known dataset, so we can be confident that it’s out there. At this time, the large folder in that dataset contains 19.8k files.
a. Note the name: ‘Cat_And_Dog’ because the script below contains a reference to that exact name.
b. Note the path to that directory. The code below contains a path to that directory. In the code you’ll find the path as “/Users/username/Documents/data/Cat_And_Dog”. You’ll need to change the code to match the path to your Cat_And_Dog directory.
c. Note the folders within the Cat_And_Dog directory. Most of the images are in a folder named ‘train’ and if not, then you must either change the name of the folder or change the code.
2. After you have the data downloaded, you can load the file below. At the github gist link it is called ‘ResNet-152.py’. Note that the same link contains a longer markdown file with instructions. I prefer to load it in VS Code on a Mac or PC. Then open a new terminal window and run it: ‘python3 ResNet-152.py’
3. Be prepared to wait. If you train on almost 20,000 images it takes a long time on a standard PC. The code is set to train for 10 epochs, meaning it will go through the entire dataset 10 times. In our first attempt, it took about 2 hours per epoch.
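Before committing to a run of many hours, it may be worth confirming that the folder holds what the training script expects. The following is an optional helper sketch and not part of ResNet-152.py; the function names `count_by_prefix` and `expected_split` are hypothetical. It assumes the same convention the training script relies on: a flat folder whose file names start with "cat" or "dog", split 80/20 per class.

```python
from pathlib import Path


def count_by_prefix(flat_dir, prefixes=("cat", "dog")):
    """Count files in flat_dir whose lowercased name starts with each prefix."""
    counts = {prefix: 0 for prefix in prefixes}
    for path in Path(flat_dir).iterdir():
        if not path.is_file():
            continue
        name = path.name.lower()
        for prefix in prefixes:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
    return counts


def expected_split(counts, train_ratio=0.8):
    """Per-class (train, val) sizes, truncating the train share to an integer."""
    split = {}
    for label, n in counts.items():
        n_train = int(n * train_ratio)
        split[label] = (n_train, n - n_train)
    return split


# Example usage (adjust the path to your own machine):
# counts = count_by_prefix("/Users/username/Documents/data/Cat_And_Dog/train")
# print(counts, expected_split(counts))
```

If either count comes back as zero, the folder name or the file naming probably differs from what the script assumes, and it is cheaper to find that out now than after the first epoch.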
import os
import random
import shutil
import time
from pathlib import Path

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms
from tqdm import tqdm

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def prepare_local_cats_dogs_from_flat_train(
    data_dir,
    flat_train_subdir="Cat_And_Dog/train",  # adjust this if your folder name is different
    train_ratio=0.8,
):
    """
    Convert a flat train folder with files like 'cat.0.jpg', 'dog.1.jpg', ...
    into the structure expected by get_dataloaders:

        data_dir/
            train/
                cat/
                dog/
            val/
                cat/
                dog/

    Args:
        data_dir: root data directory (e.g. /Users/.../data/Cat_And_Dog)
        flat_train_subdir: relative path from data_dir to the flat train folder
        train_ratio: fraction of images per class to use for training
    """
    data_dir = Path(data_dir)
    flat_train_dir = data_dir / flat_train_subdir
    train_dir = data_dir / "train"
    val_dir = data_dir / "val"

    # If already prepared, skip
    if train_dir.exists() and val_dir.exists():
        print(f"[Local] Using existing train/val at {data_dir}")
        return

    if not flat_train_dir.exists():
        raise FileNotFoundError(f"Flat train folder not found: {flat_train_dir}")

    # Collect cat and dog files by prefix
    cat_files = sorted(
        [p for p in flat_train_dir.iterdir() if p.is_file() and p.name.lower().startswith("cat")]
    )
    dog_files = sorted(
        [p for p in flat_train_dir.iterdir() if p.is_file() and p.name.lower().startswith("dog")]
    )
    print(f"[Local] Found {len(cat_files)} cat files, {len(dog_files)} dog files in {flat_train_dir}")

    random.seed(42)
    random.shuffle(cat_files)
    random.shuffle(dog_files)

    def split_files(files):
        n_train = int(len(files) * train_ratio)
        train_files = files[:n_train]
        val_files = files[n_train:]
        return train_files, val_files

    cat_train, cat_val = split_files(cat_files)
    dog_train, dog_val = split_files(dog_files)

    # Create target dirs
    (train_dir / "cat").mkdir(parents=True, exist_ok=True)
    (train_dir / "dog").mkdir(parents=True, exist_ok=True)
    (val_dir / "cat").mkdir(parents=True, exist_ok=True)
    (val_dir / "dog").mkdir(parents=True, exist_ok=True)

    print("[Local] Copying files into train/val structure...")
    for src in cat_train:
        shutil.copy2(src, train_dir / "cat" / src.name)
    for src in cat_val:
        shutil.copy2(src, val_dir / "cat" / src.name)
    for src in dog_train:
        shutil.copy2(src, train_dir / "dog" / src.name)
    for src in dog_val:
        shutil.copy2(src, val_dir / "dog" / src.name)

    print("[Local] Done.")
    print(f" Train: {len(cat_train)} cats, {len(dog_train)} dogs")
    print(f" Val : {len(cat_val)} cats, {len(dog_val)} dogs")

def get_dataloaders(data_dir, batch_size=16, num_workers=4):
    """
    Expects directory structure:

        data_dir/
            train/
                cat/
                dog/
            val/
                cat/
                dog/
    """
    train_dir = Path(data_dir) / "train"
    val_dir = Path(data_dir) / "val"

    # Standard ImageNet-like transforms
    train_transforms = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],  # ImageNet means
            std=[0.229, 0.224, 0.225],   # ImageNet stds
        ),
    ])
    val_transforms = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225],
        ),
    ])

    train_dataset = datasets.ImageFolder(train_dir, transform=train_transforms)
    val_dataset = datasets.ImageFolder(val_dir, transform=val_transforms)

    train_loader = DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True,
        num_workers=num_workers, pin_memory=True
    )
    val_loader = DataLoader(
        val_dataset, batch_size=batch_size, shuffle=False,
        num_workers=num_workers, pin_memory=True
    )
    return train_loader, val_loader, train_dataset.classes

def build_resnet152(num_classes=2, pretrained=True):
    # Load ResNet-152 from torchvision
    model = models.resnet152(weights=models.ResNet152_Weights.DEFAULT if pretrained else None)
    # Replace the final fully-connected layer for 2 classes (cat vs dog)
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)
    return model

def train_one_epoch(model, loader, criterion, optimizer, epoch, device):
    model.train()
    running_loss = 0.0
    running_corrects = 0
    total = 0

    pbar = tqdm(loader, desc=f"Epoch {epoch} [train]", leave=False)
    for inputs, labels in pbar:
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        _, preds = torch.max(outputs, 1)
        loss.backward()
        optimizer.step()

        batch_size = inputs.size(0)
        running_loss += loss.item() * batch_size
        running_corrects += torch.sum(preds == labels).item()
        total += batch_size

        pbar.set_postfix({
            "loss": f"{running_loss / total:.4f}",
            "acc": f"{running_corrects / total:.4f}",
        })

    epoch_loss = running_loss / total
    epoch_acc = running_corrects / total
    return epoch_loss, epoch_acc

def eval_one_epoch(model, loader, criterion, epoch, device):
    model.eval()
    running_loss = 0.0
    running_corrects = 0
    total = 0

    with torch.inference_mode():
        pbar = tqdm(loader, desc=f"Epoch {epoch} [val]", leave=False)
        for inputs, labels in pbar:
            inputs = inputs.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)

            outputs = model(inputs)
            loss = criterion(outputs, labels)
            _, preds = torch.max(outputs, 1)

            batch_size = inputs.size(0)
            running_loss += loss.item() * batch_size
            running_corrects += torch.sum(preds == labels).item()
            total += batch_size

            pbar.set_postfix({
                "loss": f"{running_loss / total:.4f}",
                "acc": f"{running_corrects / total:.4f}",
            })

    epoch_loss = running_loss / total
    epoch_acc = running_corrects / total
    return epoch_loss, epoch_acc

def main():
    # ----- CONFIG -----
    # Root data directory; the flat train/ folder is expected directly inside it
    data_dir = "/Users/username/Documents/data/Cat_And_Dog"
    batch_size = 16
    num_epochs = 10
    lr = 1e-4
    weight_decay = 1e-4
    # ------------------
    print(f"Using device: {DEVICE}")

    # 1) Prepare local data from flat train folder (no Kaggle API)
    prepare_local_cats_dogs_from_flat_train(
        data_dir=data_dir,
        flat_train_subdir="train",  # change if your flat train path is different
        train_ratio=0.8,
    )

    # 2) Build dataloaders
    train_loader, val_loader, classes = get_dataloaders(
        data_dir=data_dir,
        batch_size=batch_size,
        num_workers=4,
    )
    print(f"Classes: {classes}")  # should be ['cat', 'dog'] or similar

    # Model
    model = build_resnet152(num_classes=len(classes), pretrained=True)
    model = model.to(DEVICE)

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    best_val_acc = 0.0
    save_path = Path("resnet152_cats_dogs_best.pt")

    for epoch in range(1, num_epochs + 1):
        t0 = time.time()
        train_loss, train_acc = train_one_epoch(
            model, train_loader, criterion, optimizer, epoch, DEVICE
        )
        val_loss, val_acc = eval_one_epoch(
            model, val_loader, criterion, epoch, DEVICE
        )
        dt = time.time() - t0
        print(
            f"Epoch {epoch:02d}/{num_epochs} "
            f"- {dt:.1f}s "
            f"train_loss: {train_loss:.4f}, train_acc: {train_acc:.4f} "
            f"val_loss: {val_loss:.4f}, val_acc: {val_acc:.4f}"
        )

        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save({
                "model_state_dict": model.state_dict(),
                "classes": classes,
            }, save_path)
            print(f" → New best model saved to {save_path} (val_acc={val_acc:.4f})")

    print(f"Training complete. Best val_acc = {best_val_acc:.4f}")
    print(f"Best model weights saved to: {save_path}")


if __name__ == "__main__":
    main()
Once the training is done, you have a model! Your terminal output will look something like this:
Epoch 8 [val]: 91%|███████████████████▏ | 226/248 [16:03<01:33, 4.26s/it, loss=0.0225, acc=0.9917]/Users/username/Library/Python/3.9/lib/python/site-packages/PIL/TiffImagePlugin.py:868: UserWarning: Truncated File Read
warnings.warn(str(msg))
Epoch 08/10 - 8059.8s train_loss: 0.0659, train_acc: 0.9737 val_loss: 0.0233, val_acc: 0.9912
Epoch 9 [val]: 91%|███████████████████▏ | 226/248 [16:18<01:33, 4.27s/it, loss=0.0373, acc=0.9900]/Users/username/Library/Python/3.9/lib/python/site-packages/PIL/TiffImagePlugin.py:868: UserWarning: Truncated File Read
warnings.warn(str(msg))
Epoch 09/10 - 8092.5s train_loss: 0.0636, train_acc: 0.9743 val_loss: 0.0346, val_acc: 0.9906
Epoch 10 [val]: 91%|██████████████████▏ | 226/248 [17:18<01:38, 4.49s/it, loss=0.0288, acc=0.9917]/Users/username/Library/Python/3.9/lib/python/site-packages/PIL/TiffImagePlugin.py:868: UserWarning: Truncated File Read
warnings.warn(str(msg))
Epoch 10/10 - 8458.0s train_loss: 0.0628, train_acc: 0.9757 val_loss: 0.0269, val_acc: 0.9922
Training complete. Best val_acc = 0.9927
Best model weights saved to: resnet152_cats_dogs_best.pt
The output is telling us that we have a very high training accuracy of 0.9757 after the 10th epoch, with a validation accuracy of 0.9922. The train_loss started at 0.1137 and dropped to 0.0628 by the end. The output also says that the script has created a new file with our trained weights: ‘resnet152_cats_dogs_best.pt’, saved in the directory the script was run from. This is a large file; it might be about 200MB.
Now for the best part. Grab the inference script gist from github or copy and paste the code below into a python file in VS Code. When you downloaded the Kaggle data, most of the photos were in the ‘train’ folder. But there were also test folders. Check this to be sure. Then run the following command to test one photo at a time:
python3 infer_resnet152.py resnet152_cats_dogs_best.pt "/Users/username/Documents/data/Cat_And_Dog/test/cat_245.jpg"
import sys
from pathlib import Path

import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Same model definition as in training
def build_resnet152(num_classes=2, pretrained=False):
    model = models.resnet152(
        weights=models.ResNet152_Weights.DEFAULT if pretrained else None
    )
    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)
    return model


# Same transforms as validation in ResNet-152.py
val_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])


def load_model(checkpoint_path: str):
    ckpt = torch.load(checkpoint_path, map_location=DEVICE)
    classes = ckpt["classes"]
    model = build_resnet152(num_classes=len(classes), pretrained=False)
    model.load_state_dict(ckpt["model_state_dict"])
    model.to(DEVICE)
    model.eval()
    return model, classes


def predict_image(model, classes, image_path: str):
    img = Image.open(image_path).convert("RGB")
    x = val_transforms(img).unsqueeze(0).to(DEVICE)  # (1, C, H, W)
    with torch.inference_mode():
        logits = model(x)
        probs = torch.softmax(logits, dim=1)[0]
    pred_idx = int(torch.argmax(probs).item())
    pred_class = classes[pred_idx]
    pred_prob = float(probs[pred_idx].item())
    return pred_class, pred_prob


def main():
    if len(sys.argv) != 3:
        print("Usage: python infer_resnet152.py /path/to/resnet152_cats_dogs_best.pt /path/to/image.jpg")
        sys.exit(1)

    ckpt_path = sys.argv[1]
    image_path = sys.argv[2]

    print(f"Loading model from: {ckpt_path}")
    model, classes = load_model(ckpt_path)
    print(f"Classes: {classes}")

    pred_class, pred_prob = predict_image(model, classes, image_path)
    print(f"Image: {image_path}")
    print(f"Prediction: {pred_class} (p = {pred_prob:.4f})")


if __name__ == "__main__":
    main()
If all goes well, you’ll see output like the following (this particular run used the test image cat_38.jpg):
Loading model from: resnet152_cats_dogs_best.pt
Classes: ['cat', 'dog']
Image: /Users/username/Documents/data/Cat_And_Dog/test/cat_38.jpg
Prediction: cat (p = 0.9470)
It’s telling us that it loaded the model, that the model has two classes, that the test image is ‘cat_38.jpg’, and that it’s a cat, with a probability of 0.9470. That is very confident. We just found a cat.
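The probability comes from a softmax over the model’s two raw output scores (logits). A minimal pure-Python sketch of that step follows; the logit values are hypothetical, picked only so that the result lands near the 0.947 reported above, and the actual logits depend on your trained model and image.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the classes ['cat', 'dog'].
logits = [2.0, -0.88]
probs = softmax(logits)

print(f"cat: {probs[0]:.4f}, dog: {probs[1]:.4f}")
```

With only two classes, the result depends solely on the gap between the two logits: a gap of roughly 2.9 is enough to produce a probability of about 0.95.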
Residual connections represent one of the most elegant solutions in the history of deep learning — a single + x that unlocked a decade of progress. By giving gradients a direct path backward through the network, ResNet broke the 22-layer barrier and opened the door to architectures with hundreds or thousands of layers. The mathematical intuition ties directly back to what we’ve been building throughout this series: eigenvalue behavior governs whether a signal grows or vanishes, and the +1 term in the skip connection’s derivative is what keeps the gradient alive. As you train your own ResNet-152 classifier, you’re not just running someone else’s code — you’re standing on one of the pivotal ideas that made modern AI possible.
February 25, 2026