MedConnect Nurse-Hospital Matching Model

Project and client

You are building a nurse-hospital matching model for Priya Krishnamurthy, Director of Placement Operations at MedConnect Staffing in Bangalore, India. MedConnect places ~400 nurses per quarter across 80 hospitals in South India. Priya wants a model that scores nurse-hospital match quality so her team works from a ranked list instead of manual matching.

What you are building

A Pipeline-based matching model that:

Combines structured features (certifications, experience, location) with text features (nurse bios, hospital notes) using scikit-learn ColumnTransformer
Uses transfer learning (pretrained language model) for the text component
Includes a fairness audit with disaggregated evaluation across nurse regions
Produces ranked match quality scores for Priya's team
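
The combination of structured and text features described above could look roughly like the following. This is a minimal sketch, not the contents of pipeline-template.py: the column names (`years_experience`, `certification`, `nurse_region`, `nurse_bio`), the numeric rating target, and the toy rows are all hypothetical, and `TfidfVectorizer` stands in for the text branch only as a simple baseline before any pretrained language model is added.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; replace with the real ones from placement-data.csv.
NUMERIC_COLS = ["years_experience"]
CATEGORICAL_COLS = ["certification", "nurse_region"]
TEXT_COL = "nurse_bio"  # a plain string (not a list) so the vectorizer gets 1-D input

numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_branch, NUMERIC_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLS),
        ("text", TfidfVectorizer(max_features=500), TEXT_COL),
    ]
)

# Preprocessing and the estimator live in one Pipeline, so every step is
# fitted only on whatever data is passed to fit().
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])

# Made-up rows purely to show the call pattern.
toy = pd.DataFrame({
    "years_experience": [2.0, 7.0, None, 12.0],
    "certification": ["BSc Nursing", "GNM", "BSc Nursing", "GNM"],
    "nurse_region": ["South", "Northeast", "West", "South"],
    "nurse_bio": [
        "ICU nurse with critical care experience",
        "Paediatric ward nurse, fluent in three languages",
        "Recently qualified, interested in emergency care",
        "Senior theatre nurse with surgical background",
    ],
})
target = [4.5, 3.0, 4.0, 2.5]  # assumed numeric match-quality rating

model.fit(toy, target)
ranked = toy.assign(match_score=model.predict(toy)).sort_values(
    "match_score", ascending=False
)
print(ranked[["nurse_region", "match_score"]])
```

Passing the text column as a single string rather than a one-element list matters here: text vectorizers expect a 1-D sequence of documents, and a list selector would hand them a 2-D frame instead.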

Tech stack

Python 3.11+
pandas (data processing)
scikit-learn (Pipeline, ColumnTransformer, feature selection, baseline models)
Hugging Face Transformers (transfer learning -- DistilBERT or similar)
PyTorch (training loop for transfer learning component)
MLflow (experiment tracking)
Jupyter (development)

File structure

p6/
  materials/
    placement-data.csv        # 2400 placement records (text + tabular)
    pipeline-template.py      # Skeleton Pipeline with ColumnTransformer
    fairness-audit-guide.md   # Guide for disaggregated evaluation
    CLAUDE.md                 # This file
  notebooks/                  # Working notebooks
  models/                     # Saved models and artifacts
  evaluation/                 # Evaluation reports and fairness audit

Key materials

placement-data.csv -- 2400 rows of nurse-hospital placement data. Mixed tabular and text columns. Nurse regions include South, West, North, East, Northeast, and Central India.
pipeline-template.py -- Skeleton scikit-learn Pipeline with a ColumnTransformer structure. The column lists and transformer choices are left blank for you to fill in.
fairness-audit-guide.md -- Guide covering disaggregated evaluation, fairness metrics, and intervention options.

Work breakdown

Profile the placement dataset. Understand column types, distributions, missing values, rating skew.
Build a scikit-learn Pipeline with ColumnTransformer for heterogeneous data. Feature selection inside cross-validation. Train baseline model.
Add transfer learning for text features. Implement freezing strategy and learning rate scheduling. Compare against baseline in MLflow.
Fairness audit: disaggregate model predictions by nurse region. Discover and address regional placement bias. Communicate findings to Priya.
Finalize deliverables. Write client summary and README. Push to GitHub.
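
For the fairness audit step, disaggregating predictions by nurse region can be sketched with a pandas group-by. The column names (`nurse_region`, `rating`, `predicted`) and the toy rows are hypothetical, and the gap definition used here (each region's mean predicted score relative to the overall mean) is only one assumption about what "gap" means; fairness-audit-guide.md may define it differently.

```python
import pandas as pd


def per_region_report(df, region_col, y_true_col, y_pred_col):
    """Summarise predictions per region instead of a single aggregate score."""
    with_error = df.assign(abs_error=(df[y_pred_col] - df[y_true_col]).abs())
    report = with_error.groupby(region_col).agg(
        n=(y_pred_col, "size"),
        mean_score=(y_pred_col, "mean"),
        mae=("abs_error", "mean"),
    )
    # Assumed gap definition: percent difference from the overall mean score.
    overall_mean = df[y_pred_col].mean()
    report["gap_vs_overall_pct"] = (
        100 * (report["mean_score"] - overall_mean) / overall_mean
    )
    # Most under-scored region first.
    return report.sort_values("gap_vs_overall_pct")


# Made-up predictions purely to show the output shape.
preds = pd.DataFrame({
    "nurse_region": ["South", "South", "Northeast", "Northeast"],
    "rating": [4.0, 4.0, 4.0, 4.0],
    "predicted": [4.0, 4.0, 3.0, 3.0],
})
report = per_region_report(preds, "nurse_region", "rating", "predicted")
print(report)
```

Including the per-region row count `n` alongside each metric helps when a region has few records, since a large gap on a handful of rows is weaker evidence than the same gap on hundreds.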

Verification targets

Cross-validation scores consistent across folds (std < 0.05) -- no leakage signal
Feature selection inside Pipeline (SelectKBest not applied before Pipeline)
Transfer learning model outperforms baseline on primary matching metric
Base model layers frozen during initial training (requires_grad = False)
Northeast India placement score gap: >15% before intervention, <5% after
Overall performance remains above baseline after fairness intervention
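
The first two targets can be checked together by placing SelectKBest inside the Pipeline and inspecting the spread of fold scores. The sketch below uses synthetic data from `make_regression` as a stand-in for the placement data, and the choices of `k=5`, `Ridge`, and R² scoring are illustrative assumptions rather than project decisions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real features come from placement-data.csv.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# SelectKBest is a Pipeline step, so it is re-fitted on each training fold
# and never sees the held-out fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("model", Ridge(alpha=1.0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
fold_std = float(np.std(fold_scores))

print("fold scores:", np.round(fold_scores, 3))
print("std across folds:", round(fold_std, 4), "target: < 0.05")
```

Calling `SelectKBest.fit` on the full dataset before `cross_val_score` would let the held-out folds influence which features are kept, which is the leakage pattern the second target is meant to rule out.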

Commit convention

Commit after completing each unit of work in the breakdown above. Use descriptive messages: "Profile placement dataset", "Build Pipeline with ColumnTransformer", "Add transfer learning model", "Complete fairness audit", "Final deliverables and README".

Important rules

Never preprocess before splitting. All preprocessing happens inside the Pipeline.
Feature selection must be inside cross-validation, not on the full dataset.
Always compute per-region metrics, not just aggregate scores.
When fine-tuning pretrained models, freeze the base layers first. Only unfreeze selectively with a lower learning rate.
Log all experiments in MLflow with parameters, metrics, and model artifacts.
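
The freeze-first rule can be sketched in PyTorch as two phases. `TinyTextModel` below is a hypothetical stand-in for a pretrained encoder plus a new scoring head, used so the example runs without downloading weights; the learning rates are illustrative assumptions. With a Hugging Face model the same pattern would apply to the encoder's parameters, whose attribute name depends on the model class.

```python
import torch
from torch import nn


class TinyTextModel(nn.Module):
    """Hypothetical stand-in: `base` plays the pretrained encoder, `head` is new."""

    def __init__(self):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        return self.head(self.base(x))


model = TinyTextModel()

# Phase 1: freeze every base parameter and train only the new head.
for param in model.base.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
frozen_after_phase1 = all(not p.requires_grad for p in model.base.parameters())

# ... phase 1 training loop would run here ...

# Phase 2: unfreeze only the last base layer and give it a much lower
# learning rate than the head, using separate optimizer parameter groups.
for param in model.base[-1].parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": model.head.parameters(), "lr": 1e-3},
    {"params": model.base[-1].parameters(), "lr": 1e-5},
])

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print("base frozen after phase 1:", frozen_after_phase1)
print("trainable in phase 2:", trainable)
```

Rebuilding the optimizer after unfreezing keeps the parameter groups explicit, which also makes the two learning rates easy to log as MLflow parameters.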