Step 1: What sentence-transformers do
The tabular features capture purchase history, product popularity, and category affinity. But the product descriptions -- "Crafted from organic cotton, this wrap midi dress features a relaxed silhouette" -- contain meaning that no tabular feature can express. Two products might share a category and price range but describe completely different styles.
Sentence-transformers convert text into dense numerical vectors -- here, 384 dimensions that capture semantic meaning. "Relaxed linen dress" and "casual organic cotton shift dress" produce vectors that sit close together in that 384-dimensional space, even though the phrases share almost no exact words. TF-IDF would score them as nearly unrelated; embeddings recognize them as similar.
This is not using an LLM to generate text or make predictions. It is using a pre-trained model to produce feature representations that feed a classical scikit-learn classifier. The embedding is a feature, like price or category -- just a much richer one.
Step 2: Generate embeddings from product descriptions
Ask Claude to write a script that generates embeddings from the product description field in materials/products.csv using a pre-trained sentence-transformer model.
AI commonly generates embedding code that processes the entire dataset at once -- computing embeddings on every product description before any train/test split happens. Watch for this. The boundary discipline you applied to tabular preprocessing in earlier projects applies here too.
For a pre-trained model used purely in inference mode (no fine-tuning), the leakage risk from computing embeddings before splitting is low. The model was not trained on your product descriptions, so computing embeddings on the full corpus does not "teach" it anything about your test set. But the habit matters. If you later fine-tune the embedding model on your corpus, computing embeddings before splitting would be genuine leakage. Build the habit now.
Step 3: Enforce the train/test boundary
Direct Claude to restructure the embedding generation so it runs only on the training set first, then applies the same pre-trained model to the test set separately.
Restructure the embedding pipeline to respect the train/test boundary. Split the data first, generate embeddings on the training set, then generate embeddings on the test set using the same model. The model is pre-trained and not fine-tuned, but I want this boundary enforced as practice.
The result: embeddings are computed in the right order. Training set first, test set second. The model parameters do not change between the two -- this is inference, not training. But the pipeline structure is correct.
Step 4: Combine embeddings with tabular features
Ask Claude to build a combined feature set: the tabular features from Unit 2 alongside the embedding features from this unit. Both feed a single scikit-learn classifier.
The combination is straightforward -- concatenate the feature matrices. But check the dimensionality. The tabular features might have 15-20 columns. The embeddings have 384 dimensions. If the embedding dimensions dominate, the classifier may ignore the tabular features entirely. Ask Claude whether dimensionality reduction (PCA or similar) would help balance the two feature types.
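One way the combination might look, with synthetic matrices standing in for the real features (the PCA component count and the scaler are illustrative choices, and in the real pipeline both would be fit on training data only):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tab = rng.normal(size=(200, 18))    # stand-in for ~15-20 tabular features
X_emb = rng.normal(size=(200, 384))   # stand-in for sentence embeddings

# Shrink the embeddings so they do not drown out the tabular columns.
pca = PCA(n_components=50, random_state=0)
X_emb_reduced = pca.fit_transform(X_emb)

# Scale the tabular side, then concatenate column-wise.
scaler = StandardScaler()
X_combined = np.hstack([
    scaler.fit_transform(X_tab),
    X_emb_reduced,
])
print(X_combined.shape)  # (200, 68)
```

Whether 50 components is the right number is an empirical question -- compare validation metrics with and without the reduction.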
Step 5: Version the embedding pipeline with DVC
The embedding pipeline needs the same versioning as the tabular pipeline. Ask Claude to add embedding generation as a DVC stage, with the product descriptions as input and the embedding feature matrix as output.
Now the full feature pipeline is DVC-tracked: raw data, tabular features, and embedding features. Any historical experiment can be reproduced with the exact features that existed at that point.
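The new stage in dvc.yaml might look something like this (the script and output paths are illustrative -- use whatever names your pipeline already follows):

```yaml
stages:
  embed_descriptions:
    cmd: python scripts/generate_embeddings.py
    deps:
      - materials/products.csv
      - scripts/generate_embeddings.py
    outs:
      - features/embeddings.npy
```

With the script listed as a dependency, changing the embedding model or batch logic invalidates the stage, so `dvc repro` regenerates the feature matrix automatically.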
Step 6: Evaluate combined vs tabular-only
Train two models: one on tabular features alone, one on the combined (tabular + embedding) features. Compare them on recommendation quality metrics -- precision@k, recall@k, or NDCG.
The combined model should outperform on at least one metric. If it does not, investigate. Are the embedding dimensions dominating? Is the classifier not using the embedding signal effectively? This is an evaluation decision, not a coding task -- you decide what "better" means for Max's recommendations.
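For reference, precision@k has a short standard definition -- the fraction of the top-k recommended items that the customer actually engaged with. A minimal sketch:

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in set(relevant))
    return hits / k

# Toy check: 2 of the top 3 recommendations are relevant.
print(precision_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3))
```

Computing this per customer and averaging gives the model-level number to compare between the tabular-only and combined runs.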
After seeing these results, Max will have a follow-up. He wants image-based features -- "our customers shop visually." This is a reasonable idea but out of scope for this project. Text embeddings capture semantic similarity. Image features are a different pipeline. Explain what the current embeddings capture and suggest image features as a future enhancement. Max will accept: "Ja, okay, text first, images later -- makes sense."
Check: Embedding generation runs only on the training set before being applied to test. The combined feature set outperforms tabular-only on at least one key metric.