Project Chronicle: A Journey into Virtual Try-On with Diffusion Models
This document outlines the development journey of this project, which aims to implement the "TryOnDiffusion: A Tale of Two UNets" paper. It serves as a log of the learning process, implementation steps, challenges faced, and future goals.
Phase 1: Foundational Learning (The Groundwork)
- Core Concepts: Started with the fundamentals of Computer Vision and mastered the PyTorch framework.
- Generative Adversarial Networks (GANs): Implemented and trained a POKEGAN to gain practical experience with generative models.
- Introduction to Diffusion Models: Shifted focus to diffusion models, successfully training a Denoising Diffusion Probabilistic Model (DDPM) on the Fashion MNIST dataset (28x28 images) using an NVIDIA RTX 3090.
- Data Pipeline Mastery: Revisited and gained a deeper understanding of PyTorch's `DataLoader` and custom data handling pipelines.
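The custom data handling mentioned above follows the standard PyTorch pattern: subclass `Dataset` and wrap it in a `DataLoader`. A minimal sketch, using random tensors in place of real image files (the class name and shapes are illustrative):

```python
import torch
from torch.utils.data import Dataset, DataLoader

# Minimal custom Dataset; a hypothetical in-memory tensor store
# stands in for real image files on disk.
class ImagePairDataset(Dataset):
    def __init__(self, images, labels):
        self.images = images      # (N, C, H, W) tensor
        self.labels = labels      # (N,) tensor

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

# 28x28 grayscale placeholders, mirroring the Fashion MNIST shape.
ds = ImagePairDataset(torch.randn(100, 1, 28, 28),
                      torch.zeros(100, dtype=torch.long))
loader = DataLoader(ds, batch_size=16, shuffle=True, num_workers=0)
x, y = next(iter(loader))
print(x.shape)  # torch.Size([16, 1, 28, 28])
```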
Phase 2: Advanced Concepts & Paper Selection (Scaling Up)
- Advanced Architectures: Studied Transformers and the Attention mechanism to understand how models process long-range dependencies.
- Modulation Techniques: Explored specific neural network techniques like Feature-wise Linear Modulation (FiLM) for conditioning generative models.
- Research & Direction: After a thorough literature review, the "TryOnDiffusion: A Tale of Two UNets" paper was selected as the primary research goal for this project.
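The FiLM technique studied in this phase can be sketched as a small PyTorch module: a conditioning vector is projected to a per-channel scale (gamma) and shift (beta) that modulate a feature map. Dimensions and names here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

# Feature-wise Linear Modulation (FiLM): project the conditioning
# vector to per-channel gamma/beta, then modulate the feature map.
class FiLM(nn.Module):
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, features, cond):
        # features: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over H, W
        beta = beta[:, :, None, None]
        return gamma * features + beta

film = FiLM(cond_dim=32, num_channels=64)
out = film(torch.randn(4, 64, 16, 16), torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 64, 16, 16])
```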
Phase 3: Implementation, Training, and Debugging (Getting Hands-On)
- Codebase Adaptation: Forked and analyzed an open-source implementation by fashnAI as a starting point.
- Custom Development:
  - Engineered a custom data mapper and `DataLoader` to process the HR-VITON dataset.
  - Wrote a custom trainer script tailored to the model's specific needs, for better control over the training loop.
- Technical Challenges: Successfully debugged and resolved several breaking changes caused by library updates in the original repository.
- Model Training:
  - Initiated training on a subset of the HR-VITON dataset (500 images).
  - Utilized an NVIDIA RTX 4090 (24 GB) for the computationally intensive training process.
  - Tracked metrics, losses, and logs meticulously using Weights & Biases (`wandb`).
- Evaluation: Created a sampling script to generate image outputs from checkpoints to qualitatively assess model performance.
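The core of such a sampling script is the DDPM ancestral sampling loop: starting from Gaussian noise, repeatedly subtract the predicted noise and step backward through the schedule. A self-contained sketch with a dummy noise predictor standing in for the trained UNet checkpoint (the linear beta schedule and step count are illustrative defaults):

```python
import torch

# Sketch of DDPM ancestral sampling; `model` is any eps-predictor.
@torch.no_grad()
def sample(model, shape, timesteps=1000, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, timesteps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)
    for t in reversed(range(timesteps)):
        eps = model(x, torch.full((shape[0],), t, device=device))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

# Dummy eps-predictor so the sketch runs end to end.
dummy = lambda x, t: torch.zeros_like(x)
imgs = sample(dummy, (2, 1, 28, 28), timesteps=10)
print(imgs.shape)  # torch.Size([2, 1, 28, 28])
```

In the real script, `dummy` would be replaced by the UNet restored from a training checkpoint, and the resulting tensors saved as images for qualitative review.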
Phase 4: The Plateau & The Path Forward (Current Status)
Current Challenge: The model's loss has plateaued and no longer decreases. This suggests the model has stopped learning, likely due to overfitting on the small dataset or a subtle issue in the data pipeline.
Visual Analysis
Sample model output after 2000 epochs.
Original Input | Input Features | Generated Output
---|---|---
*(sample image)* | *(sample image)* | *(sample image)*
W&B loss curve, clearly illustrating the training plateau.
- Immediate Goals:
  - Debug the training process: Perform sanity checks like overfitting on a single batch to verify the model's learning capacity.
  - Verify the data pipeline: Thoroughly visualize the inputs (warped clothes, agnostic masks, pose maps) being fed to the model to ensure they are correct.
  - Investigate Loss Function: The current loss (e.g., L1 or L2) might not be optimal. Experiment with alternatives like a perceptual loss (LPIPS - Learned Perceptual Image Patch Similarity) to better capture visual similarity.
  - Tune Hyperparameters: Experiment with the learning rate and other key hyperparameters.
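The single-batch overfitting check above is worth sketching explicitly: train repeatedly on one fixed batch and confirm the loss collapses toward zero. If it does not, the bug is in the model, loss, or data, not the dataset size. A toy regressor stands in for the diffusion UNet here:

```python
import torch
import torch.nn as nn

# Sanity check: a model that cannot drive the loss toward zero on a
# single fixed batch has a bug somewhere (model, loss, or data).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 16)           # one fixed batch, reused every step
target = torch.randn(8, 16)

first_loss = None
for step in range(500):
    loss = nn.functional.mse_loss(model(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if first_loss is None:
        first_loss = loss.item()

print(loss.item() < first_loss)  # expect True: loss should collapse
```

For the diffusion model, the same idea applies with one cached batch of (noised image, conditioning, target noise) tuples fed through the real training step.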
- Long-Term Vision: Resolve the training plateau, scale up the training to a larger dataset, and successfully replicate the results of the TryOnDiffusion paper.