CLAUDE.md

MedConnect Nurse-Hospital Matching Model

Project and client

You are building a nurse-hospital matching model for Priya Krishnamurthy, Director of Placement Operations at MedConnect Staffing in Bangalore, India. MedConnect places ~400 nurses per quarter across 80 hospitals in South India. Priya wants a model that scores nurse-hospital match quality so her team works from a ranked list instead of manual matching.

What you are building

A Pipeline-based matching model that:

  1. Combines structured features (certifications, experience, location) with text features (nurse bios, hospital notes) using scikit-learn's ColumnTransformer
  2. Uses transfer learning (pretrained language model) for the text component
  3. Includes a fairness audit with disaggregated evaluation across nurse regions
  4. Produces ranked match quality scores for Priya's team
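
The structured-plus-text combination above can be sketched with a ColumnTransformer feeding a single Pipeline. The column names here (years_experience, certification, region, nurse_bio) are illustrative assumptions, not taken from placement-data.csv, and TF-IDF stands in for the transfer-learning text component until the DistilBERT features are wired in:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

preprocessor = ColumnTransformer([
    # Numeric columns: passed as a list, so the transformer sees a 2-D block
    ("num", StandardScaler(), ["years_experience"]),
    # Categorical columns: unseen categories at predict time are ignored
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["certification", "region"]),
    # Text column: passed as a bare string so TfidfVectorizer gets a 1-D series
    ("text", TfidfVectorizer(max_features=500), "nurse_bio"),
])

model = Pipeline([
    ("features", preprocessor),
    ("regressor", Ridge()),
])
```

Because all preprocessing lives inside the Pipeline, `model.fit(X_train, y_train)` learns scalers, encoders, and vocabulary from training data only, which is exactly the no-leakage property the verification targets check for.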

Tech stack

  • Python 3.11+
  • pandas (data processing)
  • scikit-learn (Pipeline, ColumnTransformer, feature selection, baseline models)
  • Hugging Face Transformers (transfer learning -- DistilBERT or similar)
  • PyTorch (training loop for transfer learning component)
  • MLflow (experiment tracking)
  • Jupyter (development)

File structure

p6/
  materials/
    placement-data.csv        # 2400 placement records (text + tabular)
    pipeline-template.py      # Skeleton Pipeline with ColumnTransformer
    fairness-audit-guide.md   # Guide for disaggregated evaluation
    CLAUDE.md                 # This file
  notebooks/                  # Working notebooks
  models/                     # Saved models and artifacts
  evaluation/                 # Evaluation reports and fairness audit

Key materials

  • placement-data.csv -- 2400 rows of nurse-hospital placement data. Mixed tabular and text columns. Nurse regions include South, West, North, East, Northeast, and Central India.
  • pipeline-template.py -- Skeleton scikit-learn Pipeline with ColumnTransformer structure, with blanks to fill in: column lists and transformer choices.
  • fairness-audit-guide.md -- Guide covering disaggregated evaluation, fairness metrics, and intervention options.

Work breakdown

  1. Profile the placement dataset. Understand column types, distributions, missing values, rating skew.
  2. Build a scikit-learn Pipeline with ColumnTransformer for heterogeneous data. Feature selection inside cross-validation. Train baseline model.
  3. Add transfer learning for text features. Implement freezing strategy and learning rate scheduling. Compare against baseline in MLflow.
  4. Fairness audit: disaggregate model predictions by nurse region. Discover and address regional placement bias. Communicate findings to Priya.
  5. Finalize deliverables. Write client summary and README. Push to GitHub.
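
Step 2's "feature selection inside cross-validation" requirement can be sketched as follows: SelectKBest is a Pipeline step, so `cross_val_score` refits it on each training fold rather than on the full dataset. The value of `k` is illustrative:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_regression, k=10)),  # refit per CV fold, never on the full data
    ("model", Ridge()),
])

# Usage (X, y come from the profiled placement data):
# scores = cross_val_score(pipe, X, y, cv=5)
# scores.std() < 0.05 is the leakage signal from the verification targets
```

Fitting SelectKBest before splitting would let each validation fold influence which features are kept; keeping it inside the Pipeline is what makes the fold-consistency check meaningful.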

Verification targets

  • Cross-validation scores consistent across folds (std < 0.05) -- no leakage signal
  • Feature selection inside Pipeline (SelectKBest not applied before Pipeline)
  • Transfer learning model outperforms baseline on primary matching metric
  • Base model layers frozen during initial training (requires_grad = False)
  • Northeast India placement score gap: >15% before intervention, <5% after
  • Overall performance remains above baseline after fairness intervention
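
The per-region gap targets above can be checked with a small disaggregation helper. This is a sketch assuming a predictions frame with hypothetical columns "region", "y_true", and "y_pred", and MAE as a stand-in for the primary matching metric:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

def per_region_report(df: pd.DataFrame) -> pd.DataFrame:
    """MAE per nurse region, plus each region's relative gap from the overall MAE."""
    overall = mean_absolute_error(df["y_true"], df["y_pred"])
    rows = []
    for region, grp in df.groupby("region"):
        mae = mean_absolute_error(grp["y_true"], grp["y_pred"])
        rows.append({
            "region": region,
            "mae": mae,
            "gap_vs_overall": (mae - overall) / overall,  # e.g. 0.15 = 15% worse
        })
    return pd.DataFrame(rows)
```

Running this before and after the fairness intervention gives the before/after numbers for the Northeast India gap target directly.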

Commit convention

Commit after completing each unit's work. Use descriptive messages: "Profile placement dataset", "Build Pipeline with ColumnTransformer", "Add transfer learning model", "Complete fairness audit", "Final deliverables and README".

Important rules

  • Never preprocess before splitting. All preprocessing happens inside the Pipeline.
  • Feature selection must be inside cross-validation, not on the full dataset.
  • Always compute per-region metrics, not just aggregate scores.
  • When fine-tuning pretrained models, freeze the base layers first. Only unfreeze selectively with a lower learning rate.
  • Log all experiments in MLflow with parameters, metrics, and model artifacts.
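
The freeze-then-selectively-unfreeze rule can be sketched in plain PyTorch. Here `base` is a stand-in for the pretrained encoder (with Transformers you would iterate the real DistilBERT submodule's parameters instead); layer sizes and learning rates are illustrative:

```python
import torch
from torch import nn

base = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))  # stand-in encoder
head = nn.Linear(16, 1)                                                # task head

# Phase 1: freeze every base parameter; only the head trains.
for p in base.parameters():
    p.requires_grad = False
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Phase 2: selectively unfreeze only the last base layer, at a much lower LR.
for p in base[-1].parameters():
    p.requires_grad = True
opt = torch.optim.AdamW([
    {"params": head.parameters(), "lr": 1e-3},
    {"params": base[-1].parameters(), "lr": 1e-5},
])

frozen = sum(1 for p in base.parameters() if not p.requires_grad)
```

The separate optimizer parameter groups are what implement "unfreeze selectively with a lower learning rate": unfrozen base layers update slowly while the head keeps its full rate.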