Learn by Directing AI
Unit 3

Transfer Learning for Match Quality

Step 1: What transfer learning means

The baseline model used TF-IDF for the text columns -- nurse bios and hospital requirement notes. TF-IDF counts word frequencies: it can tell you that a bio mentions "ICU" or "pediatric," but not what those words mean in context.

Transfer learning takes a different approach. Instead of learning language from scratch on Priya's 2400 records, you adapt a model that has already been trained on billions of words of text. That model already understands language structure, word relationships, and context. You are adapting existing knowledge to a specific task rather than building from nothing.

The comparison makes the value concrete: training a language model from scratch on 2400 records produces a weak understanding of text. Adapting a model that already understands language produces a much stronger one, with less data and less training time. The pretrained model starts from meaningful representations rather than random weights.

Step 2: Load a pretrained model

Direct Claude to load a small pretrained model from Hugging Face -- DistilBERT is a good choice. It is small enough to fine-tune on a single machine but powerful enough to produce useful text representations.

When you provide Claude the context for this step, include what the model will be used for (text component of the matching model), what data it will process (nurse bios and hospital requirement notes), and what the baseline comparison is (TF-IDF). That context shapes the code Claude generates -- without it, Claude may produce generic fine-tuning code that does not connect to your Pipeline.
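A minimal sketch of what that direction might produce (assumes the transformers and torch libraries are installed; num_labels=1 treats the match score as a single regression-style target, which is an assumption about your label format):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# Tokenize a nurse bio and a hospital requirement note as a sequence pair,
# so the model sees both sides of the match in one input.
inputs = tokenizer(
    "ICU nurse with 5 years of pediatric experience",   # nurse bio
    "Seeking pediatric ICU coverage for night shifts",  # hospital note
    truncation=True,
    return_tensors="pt",
)
```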

Step 3: Implement the freezing strategy

Here is where AI direction matters. By default, AI-generated fine-tuning code updates all layers of a pretrained model with a uniform learning rate. That risks catastrophic forgetting -- the pretrained knowledge that makes transfer learning valuable gets overwritten by aggressive updates on a small dataset.

The better approach: freeze the base model and train only the classification head first. The base layers keep their pretrained knowledge. The classification head learns to map those representations to your matching task.

AI commonly generates code that updates all parameters from the start. When you direct Claude to implement fine-tuning, specify the freezing strategy explicitly: freeze the base layers, set requires_grad = False for those parameters, and train only the head. After the head converges, you can selectively unfreeze a few top layers with a lower learning rate if needed.
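A minimal sketch of the freezing step, assuming the DistilBERT model from Step 2 (in transformers' sequence-classification wrapper, the .distilbert attribute holds the pretrained base):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1
)

# Freeze the pretrained base so its knowledge is not overwritten.
for param in model.distilbert.parameters():
    param.requires_grad = False

# Only the head (pre_classifier + classifier) should remain trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```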

Step 4: Set up learning rate scheduling

A constant learning rate is a blunt instrument: it either moves too aggressively in early epochs, overshooting good solutions, or too conservatively in late epochs, failing to converge. Learning rate scheduling adapts the rate as training progresses.

A warmup period starts with a very low learning rate and ramps up. This prevents large, destructive updates in the first few epochs when the model is still adjusting. After warmup, cosine annealing gradually reduces the learning rate, allowing fine-grained convergence.

Direct Claude to implement a warmup + cosine schedule. Left to itself, AI applies a default schedule (or none at all) without considering how the learning rate interacts with the batch size and dataset size. Specify the warmup duration and the decay pattern explicitly.
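A sketch of the schedule in plain PyTorch (the step counts and base learning rate here are illustrative; transformers also ships a ready-made get_cosine_schedule_with_warmup helper that does the same thing):

```python
import math
import torch

head = torch.nn.Linear(10, 1)  # stand-in for the trainable classification head
optimizer = torch.optim.AdamW(head.parameters(), lr=2e-5)

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    # Linear warmup from 0 to the base rate, then cosine decay back to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, call optimizer.step() then scheduler.step() per batch.
```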

Step 5: Combine and train

The model now has two inputs: tabular features through the scikit-learn Pipeline, and text features through the pretrained model. Combine them at evaluation time -- the Pipeline handles structured data, the pretrained model handles text, and the outputs merge for final predictions.
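One hypothetical way to merge the two outputs -- a weighted average of the components' scores, with the weight tuned on a validation split (combine_scores and both input names are illustrative, not part of the Pipeline API):

```python
import numpy as np

def combine_scores(tabular_scores, text_scores, text_weight=0.5):
    """Blend the Pipeline's match scores with the text model's scores."""
    tabular_scores = np.asarray(tabular_scores, dtype=float)
    text_scores = np.asarray(text_scores, dtype=float)
    return (1.0 - text_weight) * tabular_scores + text_weight * text_scores

# Example: lean on the tabular model, weighting the text component at 0.25.
print(combine_scores([0.8, 0.2], [0.6, 0.4], text_weight=0.25))  # -> [0.75 0.25]
```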

Train and compare against the baseline. Log both experiments in MLflow with the same metrics so the comparison is direct.

Step 6: Evaluate and explain

The transfer learning model should outperform the baseline on the primary matching metric. The comparison in MLflow shows how much improvement the pretrained text representations provide over TF-IDF.

This is worth understanding in terms Priya would appreciate: "We adapted an existing language model that already understands text -- instead of building that understanding from scratch on your 2400 records. The adapted model produces better match scores because it understands the meaning in nurse bios and hospital notes, not just which words appear."

A stakeholder who hears "we fine-tuned DistilBERT" has learned nothing. A stakeholder who hears "we adapted existing language understanding to your matching problem because building it from scratch would have required far more data" has understood the decision.

✓ Check

Check: The transfer learning model's test performance exceeds the baseline on the primary matching metric. The base model layers are frozen during initial training (verify by checking requires_grad is False for base layers).
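One way to script that verification (pure PyTorch sketch; with the real model you would pass model.distilbert as the base module):

```python
import torch

def base_is_frozen(base_module):
    """True only if no base parameter would receive gradient updates."""
    return all(not p.requires_grad for p in base_module.parameters())

base = torch.nn.Linear(4, 4)       # stand-in for the pretrained base
assert not base_is_frozen(base)    # freshly created layers are trainable
for p in base.parameters():
    p.requires_grad = False
assert base_is_frozen(base)        # the check from above now passes
```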