Project Chronicle: A Journey into Virtual Try-On with Diffusion Models
This document outlines the development journey of this project, which aims to implement the "TryOnDiffusion: A Tale of Two UNets" paper. It serves as a log of the learning process, implementation steps, challenges faced, and future goals.
Phase 1: Foundational Learning (The Groundwork)
- Core Concepts: Started with the fundamentals of Computer Vision and mastered the PyTorch framework.
- Generative Adversarial Networks (GANs): Implemented and trained a POKEGAN to gain practical experience with generative models.
- Introduction to Diffusion Models: Shifted focus to diffusion models, successfully training a Denoising Diffusion Probabilistic Model (DDPM) on the Fashion MNIST dataset (28x28 images) using an NVIDIA RTX 3090.
- Data Pipeline Mastery: Revisited and gained a deeper understanding of PyTorch's `DataLoader` and custom data handling pipelines.
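The custom data handling mentioned above follows the standard PyTorch pattern: subclass `Dataset` and wrap it in a `DataLoader`. A minimal sketch, using random tensors in place of real image files (the class name and shapes are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Minimal custom Dataset; a hypothetical in-memory tensor store
# stands in for real image files on disk.
class ImagePairDataset(Dataset):
    def __init__(self, images, labels):
        self.images = images      # (N, C, H, W) tensor
        self.labels = labels      # (N,) tensor

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

# 28x28 grayscale placeholders, mirroring the Fashion MNIST shape.
ds = ImagePairDataset(torch.randn(100, 1, 28, 28),
                      torch.zeros(100, dtype=torch.long))
loader = DataLoader(ds, batch_size=16, shuffle=True, num_workers=0)
x, y = next(iter(loader))
print(x.shape)  # torch.Size([16, 1, 28, 28])
```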
Phase 2: Advanced Concepts & Paper Selection (Scaling Up)
- Advanced Architectures: Studied Transformers and the Attention mechanism to understand how models process long-range dependencies.
- Modulation Techniques: Explored specific neural network techniques like Feature-wise Linear Modulation (FiLM) for conditioning generative models.
- Research & Direction: After a thorough literature review, the "TryOnDiffusion: A Tale of Two UNets" paper was selected as the primary research goal for this project.
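The FiLM technique studied in this phase can be sketched as a small PyTorch module: a conditioning vector is projected to a per-channel scale (gamma) and shift (beta) that modulate a feature map. Dimensions and names here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

# Feature-wise Linear Modulation (FiLM): project the conditioning
# vector to per-channel gamma/beta, then modulate the feature map.
class FiLM(nn.Module):
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features, cond):
        # features: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over H, W
        beta = beta[:, :, None, None]
        return gamma * features + beta

film = FiLM(cond_dim=32, num_channels=64)
out = film(torch.randn(4, 64, 16, 16), torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 64, 16, 16])
```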
Phase 3: Implementation, Training, and Debugging (Getting Hands-On)
- Codebase Adaptation: Forked and analyzed an open-source implementation by fashnAI as a starting point.
- Custom Development:
  - Engineered a custom data mapper and `DataLoader` to process the HR-VITON dataset.
  - Wrote a custom trainer script tailored to the model's specific needs, for better control over the training loop.
- Technical Challenges: Successfully debugged and resolved several breaking changes caused by library updates in the original repository.
- Model Training:
  - Initiated training on a subset of the HR-VITON dataset (500 images).
  - Utilized an NVIDIA RTX 4090 (24 GB) for the computationally intensive training process.
  - Tracked metrics, losses, and logs meticulously using Weights & Biases (`wandb`).
- Evaluation: Created a sampling script to generate image outputs from checkpoints to qualitatively assess model performance.
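The core of such a sampling script is the DDPM ancestral sampling loop: starting from Gaussian noise, repeatedly subtract the predicted noise and step backward through the schedule. A self-contained sketch with a dummy noise predictor standing in for the trained UNet checkpoint (the linear beta schedule and step count are illustrative defaults):

```python
import torch

# Sketch of DDPM ancestral sampling; `model` is any eps-predictor.
@torch.no_grad()
def sample(model, shape, timesteps=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, timesteps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)
    for t in reversed(range(timesteps)):
        eps = model(x, torch.full((shape[0],), t, device=device))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Dummy eps-predictor so the sketch runs end to end.
dummy = lambda x, t: torch.zeros_like(x)
imgs = sample(dummy, (2, 1, 28, 28), timesteps=10)
print(imgs.shape)  # torch.Size([2, 1, 28, 28])
```

In the real script, `dummy` would be replaced by the UNet restored from a training checkpoint, and the resulting tensors saved as images for qualitative review.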
Phase 4: The Plateau & The Path Forward (Current Status)
Current Challenge: The model's loss has plateaued and no longer decreases. This suggests the model has stopped learning, likely due to overfitting on the small dataset or a subtle issue in the data pipeline.
Visual Analysis
Sample model output after 2000 epochs.
Original Input | Input Features | Generated Output
---|---|---
*(sample image)* | *(sample image)* | *(sample image)*
W&B loss curve, clearly illustrating the training plateau.
- Immediate Goals:
  - Debug the training process: Perform sanity checks like overfitting on a single batch to verify the model's learning capacity.
  - Verify the data pipeline: Thoroughly visualize the inputs (warped clothes, agnostic masks, pose maps) being fed to the model to ensure they are correct.
  - Investigate Loss Function: The current loss (e.g., L1 or L2) might not be optimal. Experiment with alternatives like a perceptual loss (LPIPS - Learned Perceptual Image Patch Similarity) to better capture visual similarity.
  - Tune Hyperparameters: Experiment with the learning rate and other key hyperparameters.
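The single-batch overfitting check above is worth sketching explicitly: train repeatedly on one fixed batch and confirm the loss collapses toward zero. If it does not, the bug is in the model, loss, or data, not the dataset size. A toy regressor stands in for the diffusion UNet here:

```python
import torch
import torch.nn as nn

# Sanity check: a model that cannot drive the loss toward zero on a
# single fixed batch has a bug somewhere (model, loss, or data).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 16)           # one fixed batch, reused every step
target = torch.randn(8, 16)

first_loss = None
for step in range(500):
    loss = nn.functional.mse_loss(model(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if first_loss is None:
        first_loss = loss.item()

print(loss.item() < first_loss)  # expect True: loss should collapse
```

For the diffusion model, the same idea applies with one cached batch of (noised image, conditioning, target noise) tuples fed through the real training step.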
- Long-Term Vision: Resolve the training plateau, scale up the training to a larger dataset, and successfully replicate the results of the TryOnDiffusion paper.