---
license: creativeml-openrail-m
language:
- en
metrics:
- accuracy
pipeline_tag: image-to-image
---

# Project Chronicle: A Journey into Virtual Try-On with Diffusion Models

This document outlines the development journey of this project, which aims to implement the "TryOnDiffusion: A Tale of Two UNets" paper. It serves as a log of the learning process, implementation steps, challenges faced, and future goals.

## Tech Stack

![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=pytorch&logoColor=white) ![Transformers](https://img.shields.io/badge/🤗%20Transformers-yellow?style=for-the-badge) ![Weights & Biases](https://img.shields.io/badge/Weights%26_Biases-FFBE00?style=for-the-badge&logo=WeightsAndBiases&logoColor=black) ![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)

[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow)](https://huggingface.co/Aditya757864/TRY_ON)

---

## Phase 1: Foundational Learning (The Groundwork)

* **Core Concepts:** Started with the fundamentals of **Computer Vision** and mastered the **PyTorch** framework.
* **Generative Adversarial Networks (GANs):** Implemented and trained a **POKEGAN** to gain practical experience with generative models.
* **Introduction to Diffusion Models:** Shifted focus to diffusion models, successfully training a **Denoising Diffusion Probabilistic Model (DDPM)** on the Fashion MNIST dataset (28x28 images) using an NVIDIA RTX 3090.
* **Data Pipeline Mastery:** Revisited and gained a deeper understanding of PyTorch's `DataLoader` and custom data handling pipelines.

---

## Phase 2: Advanced Concepts & Paper Selection (Scaling Up)

* **Advanced Architectures:** Studied **Transformers** and the **Attention** mechanism to understand how models process long-range dependencies.
* **Modulation Techniques:** Explored specific neural network techniques like **Feature-wise Linear Modulation (FiLM)** for conditioning generative models.
* **Research & Direction:** After a thorough literature review, the **"TryOnDiffusion: A Tale of Two UNets"** paper was selected as the primary research goal for this project.

---

## Phase 3: Implementation, Training, and Debugging (Getting Hands-On)

* **Codebase Adaptation:** Forked and analyzed an open-source implementation by **fashnAI** as a starting point.
* **Custom Development:**
    * Engineered a **custom data mapper and `DataLoader`** to process the HR-VITON dataset.
    * Wrote a **custom trainer script** tailored to the model's specific needs and for better control over the training loop.
* **Technical Challenges:** Successfully debugged and resolved several breaking changes caused by library updates in the original repository.
* **Model Training:**
    * Initiated training on a subset of the **HR-VITON dataset (500 images)**.
    * Utilized an **NVIDIA RTX 4090 (24GB)** for the computationally intensive training process.
    * Tracked metrics, losses, and logs meticulously using **Weights & Biases (`wandb`)**.
* **Evaluation:** Created a **sampling script** to generate image outputs from checkpoints to qualitatively assess model performance.

---

## Phase 4: The Plateau & The Path Forward (Current Status)

> **Current Challenge:** The model's loss has **stagnated and remains constant**. This suggests the model is no longer learning, likely due to overfitting on the small dataset or a subtle issue in the data pipeline.
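For a loss curve this flat, a quick first check is whether the model can overfit a single, fixed batch at all. The sketch below is a hypothetical illustration, not the project's actual trainer: `model`, `compute_diffusion_loss`, and `dataloader` are placeholders for the real objects.

```python
import torch

def overfit_single_batch(model, compute_diffusion_loss, dataloader,
                         steps=500, lr=1e-4, device="cuda"):
    """Sanity check: reuse one batch and confirm the loss can be driven down."""
    model.train().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Freeze a single batch and train on it repeatedly.
    batch = next(iter(dataloader))
    batch = {k: v.to(device) for k, v in batch.items() if torch.is_tensor(v)}

    for step in range(steps):
        optimizer.zero_grad()
        loss = compute_diffusion_loss(model, batch)  # e.g. noise-prediction MSE
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step:4d} | loss {loss.item():.4f}")
```

If the loss does not fall well below the plateau value even here, the issue is more likely in the model or loss computation than in the dataset size.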
### Visual Analysis

*Sample model output after 2000 epochs.*

| Original Input | Input Features | Generated Output |
| ----- | ----- | ----- |
| Original Input Image | Input Features Image | Generated Output Image |

*W&B loss curve, clearly illustrating the training plateau.*

![Wandb loss curve showing a flat line](./assets/wandb.png)

* **Immediate Goals:**
    1. **Debug the training process:** Perform sanity checks like overfitting on a single batch to verify the model's learning capacity (see the sketch above).
    2. **Verify the data pipeline:** Thoroughly visualize the inputs (warped clothes, agnostic masks, pose maps) being fed to the model to ensure they are correct (a visualization sketch appears at the end of this document).
    3. **Investigate Loss Function:** The current loss (e.g., L1 or L2) might not be optimal. Experiment with alternatives like a perceptual loss (LPIPS - Learned Perceptual Image Patch Similarity) to better capture visual similarity (a loss sketch appears at the end of this document).
    4. **Tune Hyperparameters:** Experiment with the learning rate and other key hyperparameters.

* **Long-Term Vision:** Resolve the training plateau, scale up the training to a larger dataset, and successfully replicate the results of the TryOnDiffusion paper.
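To make Immediate Goal 2 concrete, here is a minimal sketch that dumps the conditioning inputs from the custom HR-VITON `DataLoader` to disk for visual inspection. The batch keys used below (`person`, `warped_cloth`, `agnostic_mask`, `pose_map`) are assumptions and should be adapted to the actual data mapper's output.

```python
import os
from torchvision.utils import save_image

def dump_batch_inputs(dataloader, out_dir="debug_inputs", n_batches=2):
    """Write each conditioning tensor as an image grid so it can be inspected."""
    os.makedirs(out_dir, exist_ok=True)
    for i, batch in enumerate(dataloader):
        if i >= n_batches:
            break
        # Hypothetical keys -- replace with the data mapper's actual field names.
        for key in ("person", "warped_cloth", "agnostic_mask", "pose_map"):
            if key in batch:
                # normalize=True rescales each image to [0, 1] for viewing
                save_image(batch[key].float(), f"{out_dir}/batch{i}_{key}.png",
                           normalize=True)
```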
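For Immediate Goal 3, a minimal sketch of adding an LPIPS term on top of an L1 pixel loss, using the `lpips` package (`pip install lpips`). The `perceptual_weight` value is an arbitrary starting point, not a value from the TryOnDiffusion paper, and LPIPS expects 3-channel images scaled to [-1, 1].

```python
import torch.nn.functional as F
import lpips

# Frozen VGG-based perceptual metric; move it to the training device as needed.
lpips_fn = lpips.LPIPS(net="vgg").eval()

def combined_loss(pred, target, perceptual_weight=0.1):
    """L1 pixel loss plus a weighted LPIPS perceptual term."""
    pixel_loss = F.l1_loss(pred, target)
    perceptual_loss = lpips_fn(pred, target).mean()
    return pixel_loss + perceptual_weight * perceptual_loss
```

In a diffusion setup the perceptual term is usually applied to a reconstructed image (e.g., a predicted clean sample) rather than the raw noise prediction, so adopting it may require a small change to the trainer.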