MedConnect Nurse-Hospital Matching Model
Project and client
You are building a nurse-hospital matching model for Priya Krishnamurthy, Director of Placement Operations at MedConnect Staffing in Bangalore, India. MedConnect places ~400 nurses per quarter across 80 hospitals in South India. Priya wants a model that scores nurse-hospital match quality so her team works from a ranked list instead of manual matching.
What you are building
A Pipeline-based matching model that:
- Combines structured features (certifications, experience, location) with text features (nurse bios, hospital notes) using scikit-learn ColumnTransformer
- Uses transfer learning (pretrained language model) for the text component
- Includes a fairness audit with disaggregated evaluation across nurse regions
- Produces ranked match quality scores for Priya's team
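As a rough illustration of the structure described above, the sketch below combines numeric, categorical, and text columns in one ColumnTransformer and ranks rows by predicted score. It rests on several assumptions: the column names (years_experience, certification, nurse_region, nurse_bio, good_match) are hypothetical and not taken from placement-data.csv, the target is treated as binary, and TF-IDF stands in for the pretrained language model so that the example runs without downloading weights.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; replace with the real ones from placement-data.csv.
NUMERIC_COLS = ["years_experience"]
CATEGORICAL_COLS = ["certification", "nurse_region"]
TEXT_COL = "nurse_bio"  # a plain string, because TfidfVectorizer expects 1-D input

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), NUMERIC_COLS),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL_COLS),
        # TF-IDF is a stand-in here, not the transfer-learning text component.
        ("text", TfidfVectorizer(), TEXT_COL),
    ]
)

baseline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tiny synthetic frame, for illustration only (not real placement data).
toy = pd.DataFrame({
    "years_experience": [2, 10, 5, 1, 8, 3],
    "certification": ["ICU", "ICU", "General", "General", "Pediatric", "ICU"],
    "nurse_region": ["South", "Northeast", "West", "South", "North", "Northeast"],
    "nurse_bio": [
        "icu nurse with ventilator training",
        "senior icu nurse, trauma experience",
        "general ward nurse, night shifts",
        "new graduate, general ward",
        "pediatric nurse with neonatal care experience",
        "icu nurse, recently certified",
    ],
    "good_match": [0, 1, 1, 0, 1, 0],  # assumed binary target
})

X = toy.drop(columns="good_match")
y = toy["good_match"]

baseline.fit(X, y)

# Probability of the positive class serves as the match quality score.
scores = baseline.predict_proba(X)[:, 1]
ranked = toy.assign(match_score=scores).sort_values("match_score", ascending=False)
print(ranked[["nurse_region", "certification", "match_score"]])
```

Scoring the training rows, as done here, only demonstrates the mechanics; real ranking would score held-out or new nurse-hospital pairs.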
Tech stack
- Python 3.11+
- pandas (data processing)
- scikit-learn (Pipeline, ColumnTransformer, feature selection, baseline models)
- Hugging Face Transformers (transfer learning -- DistilBERT or similar)
- PyTorch (training loop for transfer learning component)
- MLflow (experiment tracking)
- Jupyter (development)
File structure
p6/
  materials/
    placement-data.csv        # 2400 placement records (text + tabular)
    pipeline-template.py      # Skeleton Pipeline with ColumnTransformer
    fairness-audit-guide.md   # Guide for disaggregated evaluation
  CLAUDE.md                   # This file
  notebooks/                  # Working notebooks
  models/                     # Saved models and artifacts
  evaluation/                 # Evaluation reports and fairness audit
Key materials
- placement-data.csv -- 2400 rows of nurse-hospital placement data. Mixed tabular and text columns. Nurse regions include South, West, North, East, Northeast, and Central India.
- pipeline-template.py -- Skeleton scikit-learn Pipeline with a ColumnTransformer structure. The blanks to fill in are the column lists and the transformer choices.
- fairness-audit-guide.md -- Guide covering disaggregated evaluation, fairness metrics, and intervention options.
Work breakdown
- Profile the placement dataset. Understand column types, distributions, missing values, and rating skew.
- Build a scikit-learn Pipeline with ColumnTransformer for heterogeneous data. Feature selection inside cross-validation. Train baseline model.
- Add transfer learning for text features. Implement freezing strategy and learning rate scheduling. Compare against baseline in MLflow.
- Run the fairness audit. Disaggregate model predictions by nurse region. Discover and address regional placement bias. Communicate findings to Priya.
- Finalize deliverables. Write client summary and README. Push to GitHub.
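The fairness-audit step above (disaggregating predictions by nurse region) could be sketched roughly as follows. The column names nurse_region and match_score are hypothetical, and the gap definition used here (a region's mean predicted score relative to the overall mean) is only one plausible reading of "score gap"; fairness-audit-guide.md may define it differently.

```python
import pandas as pd


def per_region_report(df: pd.DataFrame, region_col: str, score_col: str) -> pd.DataFrame:
    """Mean score and row count per region, plus the relative gap to the overall mean.

    The gap definition is an assumption made for this sketch:
    (region mean - overall mean) / overall mean.
    """
    overall_mean = df[score_col].mean()
    report = df.groupby(region_col)[score_col].agg(["mean", "count"])
    report["gap_vs_overall"] = (report["mean"] - overall_mean) / overall_mean
    # Most disadvantaged regions (largest negative gap) come first.
    return report.sort_values("gap_vs_overall")


# Synthetic predictions, for illustration only.
predictions = pd.DataFrame({
    "nurse_region": ["South", "South", "Northeast", "Northeast"],
    "match_score": [0.8, 0.8, 0.6, 0.6],
})

report = per_region_report(predictions, "nurse_region", "match_score")
print(report)
```

Keeping the count column next to the mean helps flag regions whose gap rests on very few placements.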
Verification targets
- Cross-validation scores consistent across folds (std < 0.05) -- no leakage signal
- Feature selection inside Pipeline (SelectKBest not applied before Pipeline)
- Transfer learning model outperforms baseline on primary matching metric
- Base model layers frozen during initial training (requires_grad = False)
- Northeast India placement score gap: >15% before intervention, <5% after
- Overall performance remains above baseline after fairness intervention
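The first two targets (fold-to-fold consistency and feature selection inside the Pipeline) can be checked with a sketch along these lines. It assumes a binary target and uses synthetic data from make_classification; the choice of f_classif, k=10, ROC AUC, and five folds is illustrative, not prescribed by the project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the placement features (assumption: binary target).
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, random_state=0
)

# SelectKBest is a Pipeline step, so it is refit on the training part of each
# fold and never sees that fold's validation rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
fold_std = fold_scores.std()

print("fold scores:", fold_scores.round(3))
print("std across folds:", round(float(fold_std), 4))
print("meets std < 0.05 target:", bool(fold_std < 0.05))
```

Calling SelectKBest on the full X before cross_val_score would let validation rows influence which features are kept, which is the leakage this target guards against.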
Commit convention
Commit after completing each unit's work. Use descriptive messages: "Profile placement dataset", "Build Pipeline with ColumnTransformer", "Add transfer learning model", "Complete fairness audit", "Final deliverables and README".
Important rules
- Never preprocess before splitting. All preprocessing happens inside the Pipeline.
- Feature selection must be inside cross-validation, not on the full dataset.
- Always compute per-region metrics, not just aggregate scores.
- When fine-tuning pretrained models, freeze the base layers first. Only unfreeze selectively with a lower learning rate.
- Log all experiments in MLflow with parameters, metrics, and model artifacts.
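The freezing rule above might look like the following in PyTorch. This is a minimal sketch: TinyMatcher is a hypothetical stand-in whose encoder plays the role of the pretrained base model (a real run would load DistilBERT or similar through Hugging Face Transformers), and the learning rates 1e-3 and 1e-5 are placeholder values.

```python
import torch
from torch import nn


class TinyMatcher(nn.Module):
    """Hypothetical stand-in: `encoder` plays the role of the pretrained base."""

    def __init__(self, vocab_size: int = 100, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Embedding(vocab_size, hidden),
            nn.Linear(hidden, hidden),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        pooled = self.encoder(token_ids).mean(dim=1)  # mean-pool over tokens
        return self.head(pooled).squeeze(-1)


model = TinyMatcher()

# Phase 1: freeze every base parameter so that only the head is trained.
for param in model.encoder.parameters():
    param.requires_grad = False

trainable_phase1 = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)

# ... train the head here ...

# Phase 2: unfreeze only the last encoder layer and give it a lower learning rate.
last_layer_params = list(model.encoder[-1].parameters())
for param in last_layer_params:
    param.requires_grad = True
optimizer.add_param_group({"params": last_layer_params, "lr": 1e-5})

trainable_phase2 = [name for name, p in model.named_parameters() if p.requires_grad]
print("phase 1 trainable:", trainable_phase1)
print("phase 2 trainable:", trainable_phase2)
```

Using a separate parameter group keeps the head at its original learning rate while the newly unfrozen layer moves more slowly.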