fmthoker committed on
Commit 4940c8b · verified · 1 Parent(s): 401fa20

Upload 26 files
INSTALL.md ADDED
@@ -0,0 +1,24 @@
1
+ # SMILE Installation
2
+
3
+ This project relies on several open-source libraries. We recommend using **`conda`** to manage your Python environment and installing dependencies via the provided `environment.yml` file.
4
+
5
+ ## Installation Steps
6
+ 1. **Clone the repository**
7
+ ```bash
8
+ git clone https://github.com/fmthoker/SMILE.git
9
+ cd SMILE
10
+ ```
11
+ 2. **Create a conda environment**
12
+ ```bash
13
+ conda env create -f environment.yml
14
+ ```
15
+ 3. **Activate the environment**
16
+ ```bash
17
+ conda activate smile
18
+ ```
19
+ 4. **Download CLIP weights (Optional, only required for pretraining)**
20
+ ```bash
21
+ mkdir clip_weights
22
+ ```
23
+ For pretraining, please download the [CLIP weights](https://huggingface.co/fmthoker/SMILE/resolve/main/clip_weights/ViT-B-16.pt) and place them in the `clip_weights` folder created above.
24
+
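+ For convenience, the download in step 4 can be scripted. The snippet below is a minimal sketch that uses `wget` (any downloader works) together with the URL given above:
+ ```bash
+ # Fetch the CLIP ViT-B/16 weights into the clip_weights folder created in step 4
+ mkdir -p clip_weights
+ wget -O clip_weights/ViT-B-16.pt https://huggingface.co/fmthoker/SMILE/resolve/main/clip_weights/ViT-B-16.pt
+ ```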
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 Fida Mohammad Thoker
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,113 @@
1
- ---
2
- license: mit
3
- ---
1
+ # Official PyTorch Implementation of SMILE (CVPR 2025).
2
+
3
+ ![SMILE Framework](figs/smile.jpg)
4
+
5
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)<br>
6
+ [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/fmthoker/SMILE/tree/main/SMILE_MODELS)
7
+
8
+
9
+ > [**SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning**](https://arxiv.org/abs/2504.00527)<br>
10
+ > [Fida Mohammad Thoker](https://fmthoker.github.io/), [Letian Jiang](https://tonnew5418.github.io/), [Chen Zhao](https://zhao-chen.com/), [Bernard Ghanem](https://cemse.kaust.edu.sa/profiles/bernard-ghanem)<br>King Abdullah University of Science and Technology (KAUST)
11
+
12
+ ## 📰 News
13
+ **[2025.6.2]** Code and pre-trained models are available now! <br>
14
+ **[2025.5.28]** Code and pre-trained models will be released here. Feel free to **watch** this repository for the latest updates.
15
+
16
+ ## ✨ Highlights
17
+
18
+ ### 🔥 State-of-the-art on SSv2 and K400
19
+
20
+ Our method achieves state-of-the-art performance on **SSv2** and **K400** benchmarks with a ViT-B backbone, surpassing prior self-supervised video models by up to **2.5%**, thanks to efficient *CLIP-based semantic supervision*.
21
+
22
+ ### ⚡️ Leading Results Across Generalization Challenges
23
+
24
+ We evaluate our method on the [**SEVERE benchmark**](https://bpiyush.github.io/SEVERE-website/), covering domain shift, low-shot learning, fine-grained actions, and task adaptability. Our model consistently outperforms prior methods and achieves a **3.0% average gain** over strong baselines, demonstrating superior generalization in diverse video understanding tasks.
25
+
26
+ ### 😮 Superior Motion Representation Without Video-Text Alignment
27
+
28
+ Compared to CLIP-based methods such as [**ViCLIP**](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) and [**UMT**](https://github.com/OpenGVLab/unmasked_teacher), our model achieves higher accuracy on motion-sensitive datasets, particularly under *linear probing*. This indicates stronger video representations learned with less data and without relying on video-text alignment.
29
+
30
+ ## 🚀 Main Results and Models
31
+
32
+ ### ✨ Something-Something V2
33
+
34
+ | Method | Pretrain Dataset | Pretrain Epochs | Backbone | Top-1 | Finetune |
35
+ | :------: | :--------------: | :-------------: | :------: | :---: | :------: |
36
+ | SMILE | K400 | 800 | ViT-S | 69.1 | TODO |
37
+ | SMILE | K400 | 600 | ViT-B | 72.1 | [log](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/ssv2/VIT_B_600_EPOCHS/log.txt) / [checkpoint](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/ssv2/VIT_B_600_EPOCHS/ssv2_finetuned_after_k400_pretraining_first_stage_300_epochs_2nd_stage_300_epochs.pth) |
38
+ | SMILE | K400 | 1200 | ViT-B | 72.4 | [log](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/ssv2/VIT_B_1200_EPOCHS/log.txt) / [checkpoint](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/ssv2/VIT_B_1200_EPOCHS/ssv2_finetuned_after_k400_pretraining_first_stage_800_epochs_2nd_stage_400_epochs.pth)
39
+ | SMILE | SSv2 | 800 | ViT-B | 72.5 | TODO |
40
+
41
+ ### ✨ Kinetics-400
42
+
43
+ | Method | Pretrain Dataset | Pretrain Epochs | Backbone | Top-1 | Pretrain | Finetune |
44
+ | :------: | :--------------: | :-------------: | :------: | :---: | :------: | :------: |
45
+ | SMILE | K400 | 800 | ViT-S | 79.5 | TODO | TODO |
46
+ | SMILE | K400 | 600 | ViT-B | 83.1 | [checkpoint](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/pretrain/k400_pretraining_first_stage_300_epochs_2nd_stage_300_epochs.pth) | [log](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/k400/VIT_B_600_EPOCHS/log.txt) / [checkpoint](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/k400/VIT_B_600_EPOCHS/k400_finetuned_after_k400_pretraining_first_stage_300_epochs_2nd_stage_300_epochs.pth) |
47
+ | SMILE | K400 | 1200 | ViT-B | 83.4 | [checkpoint](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/pretrain/k400_pretraining_first_stage_800_epochs_2nd_stage_400_epochs.pth) | [log](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/k400/VIT_B_1200_EPOCHS/log.txt) / [checkpoint](https://huggingface.co/fmthoker/SMILE/resolve/main/SMILE_MODELS/finetune/k400/VIT_B_1200_EPOCHS/k400_finetuned_after_k400_pretraining_first_stage_800_epochs_2nd_stage_400_epochs.pth) |
48
+
49
+ ## 🔨 Installation
50
+
51
+ Please follow the instructions in [INSTALL.md](INSTALL.md).
52
+
53
+ ## ➡️ Data Preparation
54
+
55
+ We follow the [VideoMAE data preparation guide](https://github.com/MCG-NJU/VideoMAE/blob/main/DATASET.md) to prepare our datasets (K400 and SSv2). We provide our annotation files for both datasets in [annotation_files](annotation_files). For pretraining, we use the training sets (`train.csv`).
56
+
57
+ We provide the list of segmented object images used for pretraining in [object_instances.txt](annotation_files/object_instances.txt). The images will be released later.
58
+
59
+
60
+ ## 🔄 Pre-training
61
+
62
+ Following the [VideoMAE pre-training guide](https://github.com/MCG-NJU/VideoMAE/blob/main/PRETRAIN.md), we provide scripts for pre-training on the Kinetics-400 (K400) dataset using the ViT-Base model: [scripts/pretrain/](./scripts/pretrain/)
63
+
64
+ As described in the paper, we adopt a two-stage training strategy. Please refer to the script names to identify which stage to run.
65
+
66
+ If you wish to perform your own pre-training, make sure to update the following parameters in the scripts:
67
+
68
+ - `DATA_PATH`: Path to your dataset
69
+ - `OUTPUT_DIR`: Directory to save output results
70
+ - `OBJECTS_PATH`: Path to the overlaying objects image dataset (image data to be released)
71
+ - `FIRST_STAGE_CKPT`: Path to the ckpt from first stage pretraining ( for second stage training)
72
+
73
+ > **Note:** Our pre-training experiments were conducted using 8 V100(32 GB) GPUs.
74
+ ---
75
+
76
+ ## ⤴️ Fine-tuning with Pre-trained Models
77
+
78
+ Following the [VideoMAE finetuning guide](https://github.com/MCG-NJU/VideoMAE/blob/main/FINETUNE.md), we provide scripts for fine-tuning on the Something-Something v2 (SSv2) and Kinetics-400 (K400) datasets using the ViT-Base model: [scripts/finetune/](./scripts/finetune)
79
+
80
+
81
+ To perform your own fine-tuning, please update the following parameters in the script:
82
+
83
+ - `DATA_PATH`: Path to your dataset
84
+ - `MODEL_PATH`: Path to the pre-trained model
85
+ - `OUTPUT_DIR`: Directory to save output results
86
+
87
+ > **Note:** Our finetuning experiments were conducted using 4 V100(32 GB) GPUs.
88
+
89
+ ## ☎️ Contact
90
+
91
+ Fida Mohammad Thoker: [email protected]
92
+
93
+ ## 👍 Acknowledgements
94
+
95
+ We sincerely thank [Michael Dorkenwald](https://mdorkenwald.com/) for providing the object image dataset that supports this work.<br>
96
+ This project is built upon [VideoMAE](https://github.com/MCG-NJU/VideoMAE) and [tubelet-contrast](https://github.com/fmthoker/tubelet-contrast). Thanks to the contributors of these great codebases.
97
+
98
+ ## 🔒 License
99
+
100
+ This project is released under the MIT license. For more details, please refer to the [LICENSE](https://github.com/fmthoker/SMILE/blob/main/LICENSE) file.
101
+
102
+ ## ✏️ Citation
103
+
104
+ If you find this project helpful, please consider leaving a star ⭐️ and citing our paper:
105
+
106
+ ```bibtex
107
+ @inproceedings{thoker2025smile,
108
+ author = {Thoker, Fida Mohammad and Jiang, Letian and Zhao, Chen and Ghanem, Bernard},
109
+ title = {SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning},
110
+ booktitle = {CVPR},
111
+ year = {2025},
112
+ }
113
+ ```
datasets.py ADDED
@@ -0,0 +1,271 @@
1
+ import os
2
+ from torchvision import transforms
3
+ from transforms import *
4
+ from masking_generator import TubeMaskingGenerator, TubeletMaskingGenerator
5
+ from kinetics import VideoClsDataset, VideoMAE
6
+ from ssv2 import SSVideoClsDataset
7
+ import synthetic_tubelets as synthetic_tubelets
8
+ import ast
9
+ import random
10
+
11
+ class DataAugmentationForVideoMAE(object):
12
+ def __init__(self, args):
13
+ self.input_mean = [0.485, 0.456, 0.406] # IMAGENET_DEFAULT_MEAN
14
+ self.input_std = [0.229, 0.224, 0.225] # IMAGENET_DEFAULT_STD
15
+ normalize = GroupNormalize(self.input_mean, self.input_std)
16
+ self.train_augmentation = GroupMultiScaleCrop(args.input_size, [1, .875, .75, .66])
17
+ self.add_tubelets = args.add_tubelets
18
+ self.mask_type = args.mask_type
19
+
20
+ # original transform without adding tubelets
21
+ self.transform_original = transforms.Compose([
22
+ self.train_augmentation,
23
+ Stack(roll=False),
24
+ ToTorchFormatTensor(div=True),
25
+ normalize,
26
+ ])
27
+
28
+ # tubelet transform
29
+ if args.add_tubelets:
30
+ scales = ast.literal_eval(args.scales)
31
+
32
+ self.tubelets = synthetic_tubelets.PatchMask(
33
+ use_objects=args.use_objects,
34
+ objects_path=args.objects_path,
35
+ region_sampler=dict(
36
+ scales=scales,
37
+ ratios=[0.5, 0.67, 0.75, 1.0, 1.33, 1.50, 2.0],
38
+ scale_jitter=0.18,
39
+ num_rois=2,
40
+ ),
41
+ key_frame_probs=[0.5, 0.3, 0.2],
42
+ loc_velocity=12,
43
+ rot_velocity=6,
44
+ size_velocity=0.025,
45
+ label_prob=1.0,
46
+ motion_type=args.motion_type,
47
+ patch_transformation='rotation',)
48
+
49
+
50
+ self.transform1 = transforms.Compose([
51
+ self.train_augmentation,
52
+ self.tubelets,
53
+ ])
54
+ self.transform2 = transforms.Compose([Stack(roll=False),
55
+ ToTorchFormatTensor(div=True),
56
+ normalize,
57
+ ])
58
+ else:
59
+ self.transform = self.transform_original
60
+
61
+ self.original_masked_position_generator = TubeMaskingGenerator(
62
+ args.window_size, args.mask_ratio
63
+ )
64
+
65
+ if args.mask_type == 'tube':
66
+ self.masked_position_generator = self.original_masked_position_generator
67
+ elif args.mask_type == 'tubelet':
68
+ self.masked_position_generator = TubeletMaskingGenerator(
69
+ args.window_size, args.mask_ratio, args.visible_frames, args.sub_mask_type
70
+ )
71
+ else:
72
+ raise NotImplementedError
73
+
74
+
75
+ def __call__(self, images):
76
+ process_data, _, traj_rois = self.ComposedTransform(images)
77
+
78
+ if self.mask_type == 'tubelet' and traj_rois is not None:
79
+ return process_data, self.masked_position_generator(traj_rois)
80
+ else:
81
+ return process_data, self.masked_position_generator()
82
+
83
+ def ComposedTransform(self, images):
84
+ traj_rois = None
85
+
86
+ if self.add_tubelets:
87
+ data = self.transform1(images)
88
+ process_data, traj_rois = data[:-1], data[-1]
89
+ process_data, _ = self.transform2(process_data)
90
+ else:
91
+ process_data, _ = self.transform(images)
92
+
93
+ return process_data, _, traj_rois
94
+
95
+ def __repr__(self):
96
+ repr = "(DataAugmentationForVideoMAE,\n"
97
+ try:
98
+ self.transform
99
+ except:
100
+ repr += " transform = %s,\n" % (str(self.transform1) + str(self.transform2))
101
+ else:
102
+ repr += " transform = %s,\n" % str(self.transform)
103
+
104
+ repr += " Masked position generator = %s,\n" % str(self.masked_position_generator)
105
+ repr += ")"
106
+ return repr
107
+
108
+
109
+ def build_pretraining_dataset(args):
110
+ transform = DataAugmentationForVideoMAE(args)
111
+ dataset = VideoMAE(
112
+ root=None,
113
+ setting=args.data_path,
114
+ video_ext='mp4',
115
+ is_color=True,
116
+ modality='rgb',
117
+ new_length=args.num_frames,
118
+ new_step=args.sampling_rate,
119
+ transform=transform,
120
+ temporal_jitter=False,
121
+ video_loader=True,
122
+ use_decord=True,
123
+ lazy_init=False)
124
+ print("Data Aug = %s" % str(transform))
125
+ return dataset
126
+
127
+
128
+ def build_dataset(is_train, test_mode, args):
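+ # Build a supervised video-classification dataset (Kinetics-400 / SSv2 / UCF101 / HMDB51,
+ # plus the 'Mini' variants). args.data_path is the directory holding the train/val/test csv
+ # annotation files; returns (dataset, nb_classes) and asserts nb_classes == args.nb_classes.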
129
+ if args.data_set == 'Kinetics-400' or args.data_set == "Mini-Kinetics":
130
+ mode = None
131
+ anno_path = None
132
+ if is_train is True:
133
+ mode = 'train'
134
+ if 'Mini' in args.data_set:
135
+ anno_path = os.path.join(args.data_path, 'train_mini_kinetics.csv')
136
+ else:
137
+ anno_path = os.path.join(args.data_path, 'train.csv')
138
+ elif test_mode is True:
139
+ mode = 'test'
140
+ if 'Mini' in args.data_set:
141
+ anno_path = os.path.join(args.data_path, 'test_mini_kinetics.csv')
142
+ else:
143
+ anno_path = os.path.join(args.data_path, 'test.csv')
144
+ else:
145
+ mode = 'validation'
146
+ if 'Mini' in args.data_set:
147
+ anno_path = os.path.join(args.data_path, 'val_mini_kinetics.csv')
148
+ else:
149
+ anno_path = os.path.join(args.data_path, 'val.csv')
150
+
151
+ dataset = VideoClsDataset(
152
+ anno_path=anno_path,
153
+ data_path='/',
154
+ mode=mode,
155
+ clip_len=args.num_frames,
156
+ frame_sample_rate=args.sampling_rate,
157
+ num_segment=1,
158
+ test_num_segment=args.test_num_segment,
159
+ test_num_crop=args.test_num_crop,
160
+ num_crop=1 if not test_mode else 3,
161
+ keep_aspect_ratio=True,
162
+ crop_size=args.input_size,
163
+ short_side_size=args.short_side_size,
164
+ new_height=256,
165
+ new_width=320,
166
+ args=args)
167
+ if 'Mini' in args.data_set:
168
+ nb_classes = 200
169
+ else:
170
+ nb_classes = 400
171
+
172
+ elif args.data_set == 'SSV2' or args.data_set == 'SSV2-Mini':
173
+ mode = None
174
+ anno_path = None
175
+ if is_train is True:
176
+ mode = 'train'
177
+ if 'Mini' in args.data_set:
178
+ anno_path = os.path.join(args.data_path, 'train_mini.csv')
179
+ else:
180
+ anno_path = os.path.join(args.data_path, 'train.csv')
181
+ elif test_mode is True:
182
+ mode = 'test'
183
+ anno_path = os.path.join(args.data_path, 'test.csv')
184
+ else:
185
+ mode = 'validation'
186
+ anno_path = os.path.join(args.data_path, 'val.csv')
187
+
188
+ dataset = SSVideoClsDataset(
189
+ anno_path=anno_path,
190
+ data_path='/',
191
+ mode=mode,
192
+ clip_len=1,
193
+ num_segment=args.num_frames,
194
+ test_num_segment=args.test_num_segment,
195
+ test_num_crop=args.test_num_crop,
196
+ num_crop=1 if not test_mode else 3,
197
+ keep_aspect_ratio=True,
198
+ crop_size=args.input_size,
199
+ short_side_size=args.short_side_size,
200
+ new_height=256,
201
+ new_width=320,
202
+ args=args)
203
+ nb_classes = 174
204
+
205
+ elif args.data_set == 'UCF101':
206
+ mode = None
207
+ anno_path = None
208
+ if is_train is True:
209
+ mode = 'train'
210
+ anno_path = os.path.join(args.data_path, 'train.csv')
211
+ elif test_mode is True:
212
+ mode = 'test'
213
+ anno_path = os.path.join(args.data_path, 'test.csv')
214
+ else:
215
+ mode = 'validation'
216
+ anno_path = os.path.join(args.data_path, 'val.csv')
217
+
218
+ dataset = VideoClsDataset(
219
+ anno_path=anno_path,
220
+ data_path='/',
221
+ mode=mode,
222
+ clip_len=args.num_frames,
223
+ frame_sample_rate=args.sampling_rate,
224
+ num_segment=1,
225
+ test_num_segment=args.test_num_segment,
226
+ test_num_crop=args.test_num_crop,
227
+ num_crop=1 if not test_mode else 3,
228
+ keep_aspect_ratio=True,
229
+ crop_size=args.input_size,
230
+ short_side_size=args.short_side_size,
231
+ new_height=256,
232
+ new_width=320,
233
+ args=args)
234
+ nb_classes = 101
235
+
236
+ elif args.data_set == 'HMDB51':
237
+ mode = None
238
+ anno_path = None
239
+ if is_train is True:
240
+ mode = 'train'
241
+ anno_path = os.path.join(args.data_path, 'train.csv')
242
+ elif test_mode is True:
243
+ mode = 'test'
244
+ anno_path = os.path.join(args.data_path, 'test.csv')
245
+ else:
246
+ mode = 'validation'
247
+ anno_path = os.path.join(args.data_path, 'val.csv')
248
+
249
+ dataset = VideoClsDataset(
250
+ anno_path=anno_path,
251
+ data_path='/',
252
+ mode=mode,
253
+ clip_len=args.num_frames,
254
+ frame_sample_rate=args.sampling_rate,
255
+ num_segment=1,
256
+ test_num_segment=args.test_num_segment,
257
+ test_num_crop=args.test_num_crop,
258
+ num_crop=1 if not test_mode else 3,
259
+ keep_aspect_ratio=True,
260
+ crop_size=args.input_size,
261
+ short_side_size=args.short_side_size,
262
+ new_height=256,
263
+ new_width=320,
264
+ args=args)
265
+ nb_classes = 51
266
+ else:
267
+ raise NotImplementedError()
268
+ assert nb_classes == args.nb_classes
269
+ print("Number of the class = %d" % args.nb_classes)
270
+
271
+ return dataset, nb_classes
dynamic_utils.py ADDED
@@ -0,0 +1,133 @@
1
+ # Copyright (c) Microsoft Corporation. All rights reserved.
2
+ # Licensed under the MIT License.
3
+
4
+ import numpy as np
5
+ from typing import List
6
+
7
+
8
+ def sample_key_frames(num_frames: int,
9
+ key_frame_probs: List[float]) -> np.ndarray:
10
+ """ Sample the indices of key frames.
11
+
12
+ Args:
13
+ num_frames (int): number of frames in whole video
14
+ key_frame_probs (List[float]): the sampling probability of how many
15
+ key frames will be sampled. The sum of this array should be 1.0.
16
+
17
+ Returns:
18
+ frame_inds (np.ndarray): key frame index, in range
19
+ of [0, num_frames - 1]. Note that the first frame and the
20
+ last frame will always be key frames.
21
+
22
+ Examples:
23
+ >>> sample_key_frames(16, [1.0, ])
24
+ np.ndarray([0, 15])
25
+ >>> sample_key_frames(16, [0.5, 0.5])
26
+ np.ndarray([0, 15])
27
+ np.ndarray([0, 7, 15])
28
+ np.ndarray([0, 8, 15])
29
+ np.ndarray([0, 15])
30
+ """
31
+ # how many key frames
32
+ num_key_frames = np.random.choice(len(key_frame_probs), p=key_frame_probs)
33
+ # if there is no inner key frame, we will directly
34
+ # sample the first frame and the last frame.
35
+ if num_key_frames == 0:
36
+ return np.array([0, num_frames - 1], dtype=np.int32)
37
+ avg_duration = num_frames / (num_key_frames + 1)
38
+ ticks = np.array([int(avg_duration * i)
39
+ for i in range(1, num_key_frames + 1)], dtype=np.int32)
40
+
41
+ # add random jitter
42
+ jitter_range = int(avg_duration / 3)
43
+ if jitter_range > 0:
44
+ jitter = np.random.randint(-jitter_range,
45
+ jitter_range, size=len(ticks))
46
+ else:
47
+ jitter = np.zeros((len(ticks),), np.int32)
48
+
49
+ ticks = ticks + jitter
50
+ # add the first frame and last frame
51
+ ticks = np.concatenate((ticks, np.array([0, num_frames - 1])), axis=0)
52
+ # remove duplication and sort array
53
+ ticks = np.sort(np.unique(ticks))
54
+ return ticks
55
+
56
+
57
+ def extend_key_frame_to_all(array: np.ndarray,
58
+ key_frame_inds: np.ndarray,
59
+ interpolate: str = 'uniform') -> np.ndarray:
60
+ """ Interpolate the values between key frames.
61
+
62
+ This function is used in some data augmentations for video clips. For
63
+ example, we first decide the color distortion values in some key frames,
64
+ then we can interpolate the values in the rest of frames. This strategy
65
+ will make the data augmentations more smooth over the entire video clip.
66
+
67
+ Args:
68
+ array (np.ndarray): The values in the key frames, in shape of [K, *]
69
+ key_frame_inds (np.ndarray): the frame index list of key frames, in
70
+ shape of [K, ]
71
+ interpolate (str): interpolation type. 'uniform' means the linear
72
+ interpolation; 'accelerate' means the constant acceleration.
73
+ 'decelerate' means the reverse order of 'accelerate'.
74
+
75
+ Returns:
76
+ out_array (np.ndarray): the interpolated values, in shape of [N, *].
77
+ N denotes the value of key_frame_inds[-1].
78
+
79
+ Examples:
80
+ >>> values = np.array([0.0, 5.0])
81
+ >>> inds = np.array([0, 10])
82
+ >>> extend_key_frame_to_all(values, inds)
83
+ array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
84
+ >>> extend_key_frame_to_all(values, inds, 'accelerate')
85
+ array([0. , 0.05, 0.2 , 0.45, 0.8 , 1.25, 1.8 , 2.45, 3.2 , 4.05, 5.])
86
+ """
87
+
88
+ def _uniform_interpolate(start_state, end_state, index_delta):
89
+ delta_state = (end_state - start_state) * (1.0 / index_delta)
90
+ return np.concatenate([start_state + _ * delta_state
91
+ for _ in range(index_delta+1)], axis=0)
92
+
93
+ def _accelerate_interpolate(start_state, end_state, index_delta):
94
+ a = 2 * (end_state - start_state) / (index_delta ** 2)
95
+ return np.concatenate([start_state + 0.5 * a * (_**2)
96
+ for _ in range(index_delta+1)], axis=0)
97
+
98
+ def _decelerate_interpolate(start_state, end_state, index_delta):
99
+ a = 2 * (start_state - end_state) / (index_delta ** 2)
100
+ return np.concatenate([end_state + 0.5 * a * ((index_delta-_)**2)
101
+ for _ in range(index_delta+1)], axis=0)
102
+
103
+ assert key_frame_inds[0] == 0 and key_frame_inds[-1] > 0
104
+ num_key_frames = len(key_frame_inds)
105
+ assert num_key_frames == len(array)
106
+ num_frames = key_frame_inds[-1] + 1
107
+
108
+ out_array = np.zeros((num_frames, ) + array.shape[1:], dtype=array.dtype)
109
+ for i in range(num_key_frames - 1):
110
+ # fill the values between i -> i+1
111
+ st_idx, end_idx = key_frame_inds[i:i+2]
112
+ if interpolate == 'uniform':
113
+ inter_func = _uniform_interpolate
114
+ elif interpolate == 'accelerate':
115
+ inter_func = _accelerate_interpolate
116
+ elif interpolate == 'decelerate':
117
+ inter_func = _decelerate_interpolate
118
+ elif interpolate == 'random':
119
+ inter_index = np.random.choice(3, p=[0.7, 0.15, 0.15])
120
+ if inter_index == 0:
121
+ inter_func = _uniform_interpolate
122
+ elif inter_index == 1:
123
+ inter_func = _accelerate_interpolate
124
+ else:
125
+ inter_func = _decelerate_interpolate
126
+ else:
127
+ raise NotImplementedError
128
+ i_out = inter_func(array[i:i+1],
129
+ array[i+1:i+2],
130
+ end_idx - st_idx)
131
+ out_array[st_idx:end_idx+1] = i_out
132
+
133
+ return out_array
engine_for_finetuning.py ADDED
@@ -0,0 +1,375 @@
1
+ import os
2
+ import numpy as np
3
+ import math
4
+ import sys
5
+ from typing import Iterable, Optional
6
+ import torch
7
+ from mixup import Mixup
8
+ from timm.utils import accuracy, ModelEma
9
+ import utils_mae as utils
10
+ from scipy.special import softmax
11
+ import gc
12
+ import pickle
13
+
14
+ def train_class_batch(model, samples, target, criterion):
15
+ outputs = model(samples)
16
+ loss = criterion(outputs, target)
17
+ return loss, outputs
18
+
19
+
20
+ def get_loss_scale_for_deepspeed(model):
21
+ optimizer = model.optimizer
22
+ return optimizer.loss_scale if hasattr(optimizer, "loss_scale") else optimizer.cur_scale
23
+
24
+
25
+ def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
26
+ data_loader: Iterable, optimizer: torch.optim.Optimizer,
27
+ device: torch.device, epoch: int, loss_scaler, max_norm: float = 0,
28
+ model_ema: Optional[ModelEma] = None, mixup_fn: Optional[Mixup] = None, log_writer=None,
29
+ start_steps=None, lr_schedule_values=None, wd_schedule_values=None,
30
+ num_training_steps_per_epoch=None, update_freq=None):
31
+ model.train(True)
32
+ metric_logger = utils.MetricLogger(delimiter=" ")
33
+ metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
34
+ metric_logger.add_meter('min_lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
35
+ header = 'Epoch: [{}]'.format(epoch)
36
+ print_freq = 10
37
+
38
+ if loss_scaler is None:
39
+ model.zero_grad()
40
+ model.micro_steps = 0
41
+ else:
42
+ optimizer.zero_grad()
43
+
44
+ for data_iter_step, (samples, targets, _, _) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
45
+ step = data_iter_step // update_freq
46
+ if step >= num_training_steps_per_epoch:
47
+ continue
48
+ it = start_steps + step # global training iteration
49
+ # Update LR & WD for the first acc
50
+ if (lr_schedule_values is not None or wd_schedule_values is not None) and data_iter_step % update_freq == 0:
51
+ for i, param_group in enumerate(optimizer.param_groups):
52
+ if lr_schedule_values is not None:
53
+ param_group["lr"] = lr_schedule_values[it] * param_group["lr_scale"]
54
+ if wd_schedule_values is not None and param_group["weight_decay"] > 0:
55
+ param_group["weight_decay"] = wd_schedule_values[it]
56
+
57
+ samples = samples.to(device, non_blocking=True)
58
+ targets = targets.to(device, non_blocking=True)
59
+
60
+ if mixup_fn is not None:
61
+ samples, targets = mixup_fn(samples, targets)
62
+
63
+ if loss_scaler is None:
64
+ samples = samples.half()
65
+ loss, output = train_class_batch(
66
+ model, samples, targets, criterion)
67
+ else:
68
+ with torch.cuda.amp.autocast():
69
+ loss, output = train_class_batch(
70
+ model, samples, targets, criterion)
71
+
72
+ loss_value = loss.item()
73
+
74
+ if not math.isfinite(loss_value):
75
+ print("Loss is {}, stopping training".format(loss_value))
76
+ sys.exit(1)
77
+
78
+ if loss_scaler is None:
79
+ loss /= update_freq
80
+ model.backward(loss)
81
+ model.step()
82
+
83
+ if (data_iter_step + 1) % update_freq == 0:
84
+ # model.zero_grad()
85
+ # Deepspeed will call step() & model.zero_grad() automatic
86
+ if model_ema is not None:
87
+ model_ema.update(model)
88
+ grad_norm = None
89
+ loss_scale_value = get_loss_scale_for_deepspeed(model)
90
+ else:
91
+ # this attribute is added by timm on one optimizer (adahessian)
92
+ is_second_order = hasattr(optimizer, 'is_second_order') and optimizer.is_second_order
93
+ loss /= update_freq
94
+ grad_norm = loss_scaler(loss, optimizer, clip_grad=max_norm,
95
+ parameters=model.parameters(), create_graph=is_second_order,
96
+ update_grad=(data_iter_step + 1) % update_freq == 0)
97
+ if (data_iter_step + 1) % update_freq == 0:
98
+ optimizer.zero_grad()
99
+ if model_ema is not None:
100
+ model_ema.update(model)
101
+ loss_scale_value = loss_scaler.state_dict()["scale"]
102
+
103
+ torch.cuda.synchronize()
104
+
105
+ if mixup_fn is None:
106
+ class_acc = (output.max(-1)[-1] == targets).float().mean()
107
+ else:
108
+ class_acc = None
109
+ metric_logger.update(loss=loss_value)
110
+ metric_logger.update(class_acc=class_acc)
111
+ metric_logger.update(loss_scale=loss_scale_value)
112
+ min_lr = 10.
113
+ max_lr = 0.
114
+ for group in optimizer.param_groups:
115
+ min_lr = min(min_lr, group["lr"])
116
+ max_lr = max(max_lr, group["lr"])
117
+
118
+ metric_logger.update(lr=max_lr)
119
+ metric_logger.update(min_lr=min_lr)
120
+ weight_decay_value = None
121
+ for group in optimizer.param_groups:
122
+ if group["weight_decay"] > 0:
123
+ weight_decay_value = group["weight_decay"]
124
+ metric_logger.update(weight_decay=weight_decay_value)
125
+ metric_logger.update(grad_norm=grad_norm)
126
+
127
+ if log_writer is not None:
128
+ log_writer.update(loss=loss_value, head="loss")
129
+ log_writer.update(class_acc=class_acc, head="loss")
130
+ log_writer.update(loss_scale=loss_scale_value, head="opt")
131
+ log_writer.update(lr=max_lr, head="opt")
132
+ log_writer.update(min_lr=min_lr, head="opt")
133
+ log_writer.update(weight_decay=weight_decay_value, head="opt")
134
+ log_writer.update(grad_norm=grad_norm, head="opt")
135
+
136
+ log_writer.set_step()
137
+
138
+ # gather the stats from all processes
139
+ metric_logger.synchronize_between_processes()
140
+ print("Averaged stats:", metric_logger)
141
+ return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
142
+
143
+
144
+ @torch.no_grad()
145
+ def validation_one_epoch(data_loader, model, device):
146
+ criterion = torch.nn.CrossEntropyLoss()
147
+
148
+ metric_logger = utils.MetricLogger(delimiter=" ")
149
+ header = 'Val:'
150
+
151
+ # switch to evaluation mode
152
+ model.eval()
153
+
154
+ for batch in metric_logger.log_every(data_loader, 10, header):
155
+ videos = batch[0]
156
+ target = batch[1]
157
+ videos = videos.to(device, non_blocking=True)
158
+ target = target.to(device, non_blocking=True)
159
+
160
+ # compute output
161
+ with torch.cuda.amp.autocast():
162
+ output = model(videos)
163
+ loss = criterion(output, target)
164
+
165
+ acc1, acc5 = accuracy(output, target, topk=(1, 5))
166
+
167
+ batch_size = videos.shape[0]
168
+ metric_logger.update(loss=loss.item())
169
+ metric_logger.meters['acc1'].update(acc1.item(), n=batch_size)
170
+ metric_logger.meters['acc5'].update(acc5.item(), n=batch_size)
171
+ # gather the stats from all processes
172
+ metric_logger.synchronize_between_processes()
173
+ print('* Acc@1 {top1.global_avg:.3f} Acc@5 {top5.global_avg:.3f} loss {losses.global_avg:.3f}'
174
+ .format(top1=metric_logger.acc1, top5=metric_logger.acc5, losses=metric_logger.loss))
175
+
176
+ return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
177
+
178
+
179
+
180
+ @torch.no_grad()
181
+ def final_test(data_loader, model, device, file):
182
+ criterion = torch.nn.CrossEntropyLoss()
183
+
184
+ metric_logger = utils.MetricLogger(delimiter=" ")
185
+ header = 'Test:'
186
+
187
+ # switch to evaluation mode
188
+ model.eval()
189
+ final_result = []
190
+
191
+ for batch in metric_logger.log_every(data_loader, 10, header):
192
+ videos = batch[0]
193
+ target = batch[1]
194
+ ids = batch[2]
195
+ chunk_nb = batch[3]
196
+ split_nb = batch[4]
197
+ videos = videos.to(device, non_blocking=True)
198
+ target = target.to(device, non_blocking=True)
199
+
200
+ # compute output
201
+ with torch.cuda.amp.autocast():
202
+ output = model(videos)
203
+ loss = criterion(output, target)
204
+
205
+ for i in range(output.size(0)):
206
+ string = "{} {} {} {} {}\n".format(ids[i], \
207
+ str(output.data[i].cpu().numpy().tolist()), \
208
+ str(int(target[i].cpu().numpy())), \
209
+ str(int(chunk_nb[i].cpu().numpy())), \
210
+ str(int(split_nb[i].cpu().numpy())))
211
+ final_result.append(string)
212
+
213
+ acc1, acc5 = accuracy(output, target, topk=(1, 5))
214
+
215
+ batch_size = videos.shape[0]
216
+ metric_logger.update(loss=loss.item())
217
+ metric_logger.meters['acc1'].update(acc1.item(), n=batch_size)
218
+ metric_logger.meters['acc5'].update(acc5.item(), n=batch_size)
219
+
220
+ if not os.path.exists(file):
221
+ os.mknod(file)
222
+ with open(file, 'w') as f:
223
+ f.write("{}, {}\n".format(acc1, acc5))
224
+ for line in final_result:
225
+ f.write(line)
226
+ # gather the stats from all processes
227
+ metric_logger.synchronize_between_processes()
228
+ print('* Acc@1 {top1.global_avg:.3f} Acc@5 {top5.global_avg:.3f} loss {losses.global_avg:.3f}'
229
+ .format(top1=metric_logger.acc1, top5=metric_logger.acc5, losses=metric_logger.loss))
230
+
231
+ return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
232
+
233
+
234
+ def merge(eval_path, num_tasks):
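+ # Merge the per-rank prediction files written by final_test, average the softmaxed scores
+ # over all temporal/spatial views of each video, and return video-level top-1 / top-5
+ # accuracy (in percent).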
235
+ dict_feats = {}
236
+ dict_label = {}
237
+ dict_pos = {}
238
+ print("Reading individual output files")
239
+
240
+ for x in range(num_tasks):
241
+ file = os.path.join(eval_path, str(x) + '.txt')
242
+ lines = open(file, 'r').readlines()[1:]
243
+ for line in lines:
244
+ line = line.strip()
245
+ name = line.split('[')[0]
246
+ label = line.split(']')[1].split(' ')[1]
247
+ chunk_nb = line.split(']')[1].split(' ')[2]
248
+ split_nb = line.split(']')[1].split(' ')[3]
249
+ data = np.fromstring(line.split('[')[1].split(']')[0], dtype=float, sep=',')
250
+ data = softmax(data)
251
+ if not name in dict_feats:
252
+ dict_feats[name] = []
253
+ dict_label[name] = 0
254
+ dict_pos[name] = []
255
+ if chunk_nb + split_nb in dict_pos[name]:
256
+ continue
257
+ dict_feats[name].append(data)
258
+ dict_pos[name].append(chunk_nb + split_nb)
259
+ dict_label[name] = label
260
+ print("Computing final results")
261
+
262
+ input_lst = []
263
+ print(len(dict_feats))
264
+ for i, item in enumerate(dict_feats):
265
+ input_lst.append([i, item, dict_feats[item], dict_label[item]])
266
+ from multiprocessing import Pool
267
+ p = Pool(64)
268
+ ans = p.map(compute_video, input_lst)
269
+ top1 = [x[1] for x in ans]
270
+ top5 = [x[2] for x in ans]
271
+ pred = [x[0] for x in ans]
272
+ label = [x[3] for x in ans]
273
+ final_top1 ,final_top5 = np.mean(top1), np.mean(top5)
274
+ return final_top1*100 ,final_top5*100
275
+
276
+ def compute_video(lst):
277
+ i, video_id, data, label = lst
278
+ feat = [x for x in data]
279
+ feat = np.mean(feat, axis=0)
280
+ pred = np.argmax(feat)
281
+ top1 = (int(pred) == int(label)) * 1.0
282
+ top5 = (int(label) in np.argsort(-feat)[:5]) * 1.0
283
+ return [pred, top1, top5, int(label)]
284
+
285
+ def merge_mean_per_class(eval_path, num_tasks,nb_classes):
286
+ dict_feats = {}
287
+ dict_label = {}
288
+ dict_pos = {}
289
+ #print("Reading individual output files")
290
+
291
+ for x in range(num_tasks):
292
+ file = os.path.join(eval_path, str(x) + '.txt')
293
+ lines = open(file, 'r').readlines()[1:]
294
+ for line in lines:
295
+ line = line.strip()
296
+ name = line.split('[')[0]
297
+ label = line.split(']')[1].split(' ')[1]
298
+ chunk_nb = line.split(']')[1].split(' ')[2]
299
+ split_nb = line.split(']')[1].split(' ')[3]
300
+ data = np.fromstring(line.split('[')[1].split(']')[0], dtype=float, sep=',')
301
+ data = softmax(data)
302
+ if not name in dict_feats:
303
+ dict_feats[name] = []
304
+ dict_label[name] = 0
305
+ dict_pos[name] = []
306
+ if chunk_nb + split_nb in dict_pos[name]:
307
+ continue
308
+ dict_feats[name].append(data)
309
+ dict_pos[name].append(chunk_nb + split_nb)
310
+ dict_label[name] = label
311
+ print("Computing mean per class results")
312
+
313
+ input_lst = []
314
+ all_pred = []
315
+ all_label = []
316
+
317
+ classes = torch.arange(nb_classes)
318
+ classwise_top1 = [0 for c in classes]
319
+ classwise_top5 = [0 for c in classes]
320
+ actual_nb_classes = nb_classes
321
+ cnt = 0
322
+
323
+ for c in classes:
324
+ input_lst = []
325
+ for i, item in enumerate(dict_feats):
326
+ if int(dict_label[item]) == c:
327
+ input_lst.append([i, item, dict_feats[item], dict_label[item]])
328
+ cnt += len(input_lst)
329
+
330
+ # p = Pool(4)
331
+ # ans = p.map(compute_video, input_lst)
332
+ if len(input_lst) == 0:
333
+ actual_nb_classes -= 1
334
+ print(f"Class {c} is not present in test set, skip")
335
+ continue
336
+
337
+ ans = []
338
+ for i in input_lst:
339
+ ans.append(compute_video(i))
340
+ top1 = [x[1] for x in ans]
341
+ top5 = [x[2] for x in ans]
342
+ pred = [x[0] for x in ans]
343
+ label = [x[3] for x in ans]
344
+
345
+ # for i in pred:
346
+ # all_pred.append(i)
347
+ # for j in label:
348
+ # all_label.append(j)
349
+ final_top1 ,final_top5 = np.mean(top1), np.mean(top5)
350
+
351
+ classwise_top1[c] = final_top1*100
352
+ classwise_top5[c] = final_top5*100
353
+
354
+ del input_lst
355
+ del ans
356
+ del top1
357
+ del top5
358
+ del pred
359
+ del label
360
+ gc.collect()
361
+
362
+ assert cnt == len(dict_feats)
363
+ # pred_cnt = 0
364
+ # for idx, p in enumerate(all_pred):
365
+ # if int(p) == int(all_label[idx]):
366
+ # pred_cnt += 1
367
+ # print(pred_cnt/len(all_pred))
368
+ classwise_top1_path = os.path.join(eval_path, "classwise_top1.pkl")
369
+ with open(classwise_top1_path, 'wb') as file:
370
+ pickle.dump(classwise_top1, file)
371
+
372
+ classwise_top1 = np.sum(classwise_top1) / actual_nb_classes
373
+ classwise_top5 = np.sum(classwise_top5) / actual_nb_classes
374
+
375
+ return classwise_top1,classwise_top5
engine_for_pretraining.py ADDED
@@ -0,0 +1,152 @@
1
+ import math
2
+ import sys
3
+ from typing import Iterable
4
+ import torch
5
+ import torch.nn as nn
6
+ import utils_mae as utils
7
+ from einops import rearrange
8
+ from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
9
+
10
+ def train_one_epoch(model: torch.nn.Module, data_loader: Iterable, optimizer: torch.optim.Optimizer,
11
+ device: torch.device, epoch: int, loss_scaler, max_norm: float = 0, patch_size: int = 16,
12
+ normlize_target: bool = True, log_writer=None, lr_scheduler=None, start_steps=None,
13
+ lr_schedule_values=None, wd_schedule_values=None,teacher_model=None,target_type='pixel', multiple_sampling=False):
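+ # One pre-training epoch. Reconstruction targets are either (optionally normalised) pixel
+ # patches ('pixel') or frozen teacher features ('dino' / 'clip'); when multiple_sampling is
+ # set, each clip is masked twice and both masked views are reconstructed in the same pass.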
14
+
15
+ model.train()
16
+ metric_logger = utils.MetricLogger(delimiter=" ")
17
+ metric_logger.add_meter('lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
18
+ metric_logger.add_meter('min_lr', utils.SmoothedValue(window_size=1, fmt='{value:.6f}'))
19
+ header = 'Epoch: [{}]'.format(epoch)
20
+ print_freq = 10
21
+
22
+ loss_func = nn.MSELoss()
23
+
24
+ for step, batch in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
25
+ # assign learning rate & weight decay for each step
26
+ it = start_steps + step # global training iteration
27
+ if lr_schedule_values is not None or wd_schedule_values is not None:
28
+ for i, param_group in enumerate(optimizer.param_groups):
29
+ if lr_schedule_values is not None:
30
+ param_group["lr"] = lr_schedule_values[it] * param_group["lr_scale"]
31
+ if wd_schedule_values is not None and param_group["weight_decay"] > 0:
32
+ param_group["weight_decay"] = wd_schedule_values[it]
33
+
34
+ videos, bool_masked_pos = batch
35
+ videos = videos.to(device, non_blocking=True)
36
+ bool_masked_pos = bool_masked_pos.to(device, non_blocking=True).flatten(1).to(torch.bool)
37
+ #print("input_1",videos.size(),bool_masked_pos.size())
38
+ bs, _, nf, h, w = videos.shape
39
+
40
+ idx = torch.randperm(bool_masked_pos.size(0))
41
+ shuffled_bool_masked_pos = bool_masked_pos[idx,:]
42
+
43
+ if 'pixel' in target_type:
44
+
45
+ with torch.no_grad():
46
+ # calculate the predict label
47
+ mean = torch.as_tensor(IMAGENET_DEFAULT_MEAN).to(device)[None, :, None, None, None]
48
+ std = torch.as_tensor(IMAGENET_DEFAULT_STD).to(device)[None, :, None, None, None]
49
+ unnorm_videos = videos * std + mean # in [0, 1]
50
+
51
+ if normlize_target:
52
+ videos_squeeze = rearrange(unnorm_videos, 'b c (t p0) (h p1) (w p2) -> b (t h w) (p0 p1 p2) c', p0=2, p1=patch_size, p2=patch_size)
53
+ videos_norm = (videos_squeeze - videos_squeeze.mean(dim=-2, keepdim=True)
54
+ ) / (videos_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6)
55
+ # we find that the mean is about 0.48 and standard deviation is about 0.08.
56
+ videos_patch = rearrange(videos_norm, 'b n p c -> b n (p c)')
57
+ else:
58
+ videos_patch = rearrange(unnorm_videos, 'b c (t p0) (h p1) (w p2) -> b (t h w) (p0 p1 p2 c)', p0=2, p1=patch_size, p2=patch_size)
59
+
60
+ B, _, C = videos_patch.shape
61
+ if not multiple_sampling:
62
+ labels = videos_patch[bool_masked_pos].reshape(B, -1, C)
63
+ else:
64
+ labels_1 = videos_patch[bool_masked_pos].reshape(B, -1, C)
65
+ labels_2 = videos_patch[shuffled_bool_masked_pos].reshape(B, -1, C)
66
+
67
+ elif 'dino' in target_type or 'clip' in target_type:
68
+
69
+ with torch.no_grad():
70
+ permuted_video = videos.permute(0, 2, 1, 3, 4)
71
+ bs, nf, _, h, w = permuted_video.shape
72
+ permuted_video = permuted_video[:, ::2].flatten(0, 1)
73
+ permuted_video = permuted_video.to(device, non_blocking=True)
74
+ features = teacher_model(permuted_video)
75
+ _, num_patches, dim = features.shape # renamed from `np` to avoid shadowing the numpy alias
76
+ features = features.reshape(bs, nf//2, num_patches, dim)
77
+ features.requires_grad = False
78
+
79
+ features = features.to(device, non_blocking=True)
80
+ with torch.no_grad():
81
+ features_squeeze = rearrange(features, 'b n o c -> b (n o) c')
82
+ if normlize_target:
83
+ labels = (features_squeeze - features_squeeze.mean(dim=-2, keepdim=True)
84
+ ) / (features_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6)
85
+ else:
86
+ labels = features_squeeze
87
+ B, _, C = labels.shape
88
+ if not multiple_sampling:
89
+ labels = labels[bool_masked_pos].reshape(B, -1, C)
90
+ else:
91
+ labels_1 = labels[bool_masked_pos].reshape(B, -1, C)
92
+ labels_2 = labels[shuffled_bool_masked_pos].reshape(B, -1, C)
93
+
94
+
95
+ with torch.cuda.amp.autocast():
96
+ if not multiple_sampling:
97
+ outputs = model(videos, bool_masked_pos)
98
+ else:
99
+ outputs_1 = model(videos, bool_masked_pos)
100
+ outputs_2 = model(videos,shuffled_bool_masked_pos)
101
+
102
+ labels = torch.cat((labels_1,labels_2),dim=0)
103
+ outputs = torch.cat((outputs_1,outputs_2),dim=0)
104
+
105
+ loss = loss_func(input=outputs, target=labels)
106
+
107
+ loss_value = loss.item()
108
+ if not math.isfinite(loss_value):
109
+ print("Loss is {}, stopping training".format(loss_value))
110
+ sys.exit(1)
111
+
112
+ optimizer.zero_grad()
113
+ # this attribute is added by timm on one optimizer (adahessian)
114
+ is_second_order = hasattr(optimizer, 'is_second_order') and optimizer.is_second_order
115
+ grad_norm = loss_scaler(loss, optimizer, clip_grad=max_norm,
116
+ parameters=model.parameters(), create_graph=is_second_order)
117
+ loss_scale_value = loss_scaler.state_dict()["scale"]
118
+
119
+ torch.cuda.synchronize()
120
+
121
+ metric_logger.update(loss=loss_value)
122
+ metric_logger.update(loss_scale=loss_scale_value)
123
+ min_lr = 10.
124
+ max_lr = 0.
125
+ for group in optimizer.param_groups:
126
+ min_lr = min(min_lr, group["lr"])
127
+ max_lr = max(max_lr, group["lr"])
128
+
129
+ metric_logger.update(lr=max_lr)
130
+ metric_logger.update(min_lr=min_lr)
131
+ weight_decay_value = None
132
+ for group in optimizer.param_groups:
133
+ if group["weight_decay"] > 0:
134
+ weight_decay_value = group["weight_decay"]
135
+ metric_logger.update(weight_decay=weight_decay_value)
136
+ metric_logger.update(grad_norm=grad_norm)
137
+
138
+ if log_writer is not None:
139
+ log_writer.update(loss=loss_value, head="loss")
140
+ log_writer.update(loss_scale=loss_scale_value, head="opt")
141
+ log_writer.update(lr=max_lr, head="opt")
142
+ log_writer.update(min_lr=min_lr, head="opt")
143
+ log_writer.update(weight_decay=weight_decay_value, head="opt")
144
+ log_writer.update(grad_norm=grad_norm, head="opt")
145
+ log_writer.set_step()
146
+
147
+ if lr_scheduler is not None:
148
+ lr_scheduler.step_update(start_steps + step)
149
+ # gather the stats from all processes
150
+ metric_logger.synchronize_between_processes()
151
+ print("Averaged stats:", metric_logger)
152
+ return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
environment.yml ADDED
@@ -0,0 +1,259 @@
1
+ name: smile
2
+ channels:
3
+ - pytorch
4
+ - nvidia
5
+ - anaconda
6
+ - conda-forge
7
+ - defaults
8
+ dependencies:
9
+ - _libgcc_mutex=0.1=conda_forge
10
+ - _openmp_mutex=4.5=2_gnu
11
+ - alsa-lib=1.2.8=h166bdaf_0
12
+ - aom=3.5.0=h27087fc_0
13
+ - appdirs=1.4.4=pyh9f0ad1d_0
14
+ - attr=2.5.1=h166bdaf_1
15
+ - blas=1.0=mkl
16
+ - bottleneck=1.3.5=py310ha9d4c09_0
17
+ - brotli-python=1.1.0=py310hc6cd4ac_1
18
+ - bzip2=1.0.8=hd590300_5
19
+ - c-ares=1.23.0=hd590300_0
20
+ - ca-certificates=2023.11.17=hbcca054_0
21
+ - cairo=1.16.0=ha61ee94_1014
22
+ - certifi=2023.11.17=py310h06a4308_0
23
+ - charset-normalizer=3.3.2=pyhd8ed1ab_0
24
+ - click=8.1.7=unix_pyh707e725_0
25
+ - colorama=0.4.6=pyhd8ed1ab_0
26
+ - cuda-cudart=11.8.89=0
27
+ - cuda-cupti=11.8.87=0
28
+ - cuda-libraries=11.8.0=0
29
+ - cuda-nvrtc=11.8.89=0
30
+ - cuda-nvtx=11.8.86=0
31
+ - cuda-runtime=11.8.0=0
32
+ - dbus=1.13.6=h5008d03_3
33
+ - docker-pycreds=0.4.0=py_0
34
+ - einops=0.7.0=pyhd8ed1ab_1
35
+ - expat=2.5.0=hcb278e6_1
36
+ - ffmpeg=5.1.2=gpl_h8dda1f0_106
37
+ - fftw=3.3.10=nompi_hc118613_108
38
+ - filelock=3.13.1=pyhd8ed1ab_0
39
+ - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
40
+ - font-ttf-inconsolata=3.000=h77eed37_0
41
+ - font-ttf-source-code-pro=2.038=h77eed37_0
42
+ - font-ttf-ubuntu=0.83=h77eed37_1
43
+ - fontconfig=2.14.2=h14ed4e7_0
44
+ - fonts-conda-ecosystem=1=0
45
+ - fonts-conda-forge=1=0
46
+ - freeglut=3.2.2=h9c3ff4c_1
47
+ - freetype=2.12.1=h267a509_2
48
+ - fsspec=2023.12.0=pyhca7485f_0
49
+ - gettext=0.21.1=h27087fc_0
50
+ - gitdb=4.0.11=pyhd8ed1ab_0
51
+ - gitpython=3.1.40=pyhd8ed1ab_0
52
+ - glib=2.78.1=hfc55251_1
53
+ - glib-tools=2.78.1=hfc55251_1
54
+ - gmp=6.3.0=h59595ed_0
55
+ - gmpy2=2.1.2=py310h3ec546c_1
56
+ - gnutls=3.7.9=hb077bed_0
57
+ - graphite2=1.3.13=h58526e2_1001
58
+ - gst-plugins-base=1.22.0=h4243ec0_2
59
+ - gstreamer=1.22.0=h25f0c4b_2
60
+ - gstreamer-orc=0.4.34=hd590300_0
61
+ - harfbuzz=6.0.0=h8e241bc_0
62
+ - hdf5=1.14.0=nompi_hb72d44e_103
63
+ - huggingface_hub=0.19.4=pyhd8ed1ab_0
64
+ - icu=70.1=h27087fc_0
65
+ - idna=3.6=pyhd8ed1ab_0
66
+ - intel-openmp=2023.1.0=hdb19cb5_46306
67
+ - jack=1.9.22=h11f4161_0
68
+ - jasper=2.0.33=h0ff4b12_1
69
+ - jinja2=3.1.2=pyhd8ed1ab_1
70
+ - jpeg=9e=h166bdaf_2
71
+ - keyutils=1.6.1=h166bdaf_0
72
+ - krb5=1.20.1=h81ceb04_0
73
+ - lame=3.100=h166bdaf_1003
74
+ - lcms2=2.15=hfd0df8a_0
75
+ - ld_impl_linux-64=2.40=h41732ed_0
76
+ - lerc=4.0.0=h27087fc_0
77
+ - libabseil=20230802.1=cxx17_h59595ed_0
78
+ - libaec=1.1.2=h59595ed_1
79
+ - libblas=3.9.0=1_h86c2bf4_netlib
80
+ - libcap=2.67=he9d0100_0
81
+ - libcblas=3.9.0=5_h92ddd45_netlib
82
+ - libclang=15.0.7=default_hb11cfb5_4
83
+ - libclang13=15.0.7=default_ha2b6cf4_4
84
+ - libcublas=11.11.3.6=0
85
+ - libcufft=10.9.0.58=0
86
+ - libcufile=1.8.1.2=0
87
+ - libcups=2.3.3=h36d4200_3
88
+ - libcurand=10.3.4.101=0
89
+ - libcurl=8.1.2=h409715c_0
90
+ - libcusolver=11.4.1.48=0
91
+ - libcusparse=11.7.5.86=0
92
+ - libdb=6.2.32=h9c3ff4c_0
93
+ - libdeflate=1.17=h0b41bf4_0
94
+ - libdrm=2.4.114=h166bdaf_0
95
+ - libedit=3.1.20191231=he28a2e2_2
96
+ - libev=4.33=h516909a_1
97
+ - libevent=2.1.10=h28343ad_4
98
+ - libexpat=2.5.0=hcb278e6_1
99
+ - libffi=3.4.2=h7f98852_5
100
+ - libflac=1.4.3=h59595ed_0
101
+ - libgcc-ng=13.2.0=h807b86a_3
102
+ - libgcrypt=1.10.3=hd590300_0
103
+ - libgfortran-ng=13.2.0=h69a702a_3
104
+ - libgfortran5=13.2.0=ha4646dd_3
105
+ - libglib=2.78.1=h783c2da_1
106
+ - libglu=9.0.0=he1b5a44_1001
107
+ - libgomp=13.2.0=h807b86a_3
108
+ - libgpg-error=1.47=h71f35ed_0
109
+ - libhwloc=2.9.1=hd6dc26d_0
110
+ - libiconv=1.17=h166bdaf_0
111
+ - libidn2=2.3.4=h166bdaf_0
112
+ - libjpeg-turbo=2.0.0=h9bf148f_0
113
+ - liblapack=3.9.0=5_h92ddd45_netlib
114
+ - liblapacke=3.9.0=5_h92ddd45_netlib
115
+ - libllvm15=15.0.7=hadd5161_1
116
+ - libnghttp2=1.58.0=h47da74e_0
117
+ - libnpp=11.8.0.86=0
118
+ - libnsl=2.0.1=hd590300_0
119
+ - libnvjpeg=11.9.0.86=0
120
+ - libogg=1.3.4=h7f98852_1
121
+ - libopencv=4.7.0=py310hb48cf42_1
122
+ - libopus=1.3.1=h7f98852_1
123
+ - libpciaccess=0.17=h166bdaf_0
124
+ - libpng=1.6.39=h753d276_0
125
+ - libpq=15.3=hbcd7760_1
126
+ - libprotobuf=3.21.12=hfc55251_2
127
+ - libsndfile=1.2.2=hc60ed4a_1
128
+ - libsqlite=3.44.2=h2797004_0
129
+ - libssh2=1.11.0=h0841786_0
130
+ - libstdcxx-ng=13.2.0=h7e041cc_3
131
+ - libsystemd0=253=h8c4010b_1
132
+ - libtasn1=4.19.0=h166bdaf_0
133
+ - libtiff=4.5.0=h6adf6a1_2
134
+ - libtool=2.4.7=h27087fc_0
135
+ - libudev1=253=h0b41bf4_1
136
+ - libunistring=0.9.10=h7f98852_0
137
+ - libuuid=2.38.1=h0b41bf4_0
138
+ - libva=2.18.0=h0b41bf4_0
139
+ - libvorbis=1.3.7=h9c3ff4c_0
140
+ - libvpx=1.11.0=h9c3ff4c_3
141
+ - libwebp-base=1.3.2=hd590300_0
142
+ - libxcb=1.13=h7f98852_1004
143
+ - libxkbcommon=1.5.0=h79f4944_1
144
+ - libxml2=2.10.3=hca2bb57_4
145
+ - libzlib=1.2.13=hd590300_5
146
+ - llvm-openmp=15.0.7=h0cdce71_0
147
+ - lz4-c=1.9.4=hcb278e6_0
148
+ - markupsafe=2.1.3=py310h2372a71_1
149
+ - mkl=2023.1.0=h213fc3f_46344
150
+ - mkl-service=2.4.0=py310h5eee18b_1
151
+ - mpc=1.3.1=hfe3b2da_0
152
+ - mpfr=4.2.1=h9458935_0
153
+ - mpg123=1.32.3=h59595ed_0
154
+ - mpmath=1.3.0=pyhd8ed1ab_0
155
+ - mysql-common=8.0.33=hf1915f5_6
156
+ - mysql-libs=8.0.33=hca2cd23_6
157
+ - ncurses=6.4=h59595ed_2
158
+ - nettle=3.9.1=h7ab15ed_0
159
+ - networkx=3.2.1=pyhd8ed1ab_0
160
+ - nspr=4.35=h27087fc_0
161
+ - nss=3.95=h1d7d5a4_0
162
+ - numexpr=2.8.7=py310h85018f9_0
163
+ - numpy=1.26.2=py310hb13e2d6_0
164
+ - opencv=4.7.0=py310hff52083_1
165
+ - openh264=2.3.1=hcb278e6_2
166
+ - openjpeg=2.5.0=hfec8fc6_2
167
+ - openssl=3.1.4=hd590300_0
168
+ - p11-kit=0.24.1=hc5aa10d_0
169
+ - packaging=23.2=pyhd8ed1ab_0
170
+ - pandas=2.1.1=py310h1128e8f_0
171
+ - pathtools=0.1.2=py_1
172
+ - pcre2=10.42=hcad00b1_0
173
+ - pillow=9.4.0=py310h023d228_1
174
+ - pip=23.3.1=pyhd8ed1ab_0
175
+ - pixman=0.42.2=h59595ed_0
176
+ - protobuf=4.21.12=py310heca2aa9_0
177
+ - pthread-stubs=0.4=h36c2ea0_1001
178
+ - pulseaudio=16.1=hcb278e6_3
179
+ - pulseaudio-client=16.1=h5195f5e_3
180
+ - pulseaudio-daemon=16.1=ha8d29e2_3
181
+ - py-opencv=4.7.0=py310hfdc917e_1
182
+ - pysocks=1.7.1=pyha2e5f31_6
183
+ - python=3.10.13=hd12c33a_0_cpython
184
+ - python-dateutil=2.8.2=pyhd3eb1b0_0
185
+ - python-tzdata=2023.3=pyhd3eb1b0_0
186
+ - python_abi=3.10=4_cp310
187
+ - pytorch=2.1.1=py3.10_cuda11.8_cudnn8.7.0_0
188
+ - pytorch-cuda=11.8=h7e8668a_5
189
+ - pytorch-mutex=1.0=cuda
190
+ - pytz=2023.3.post1=py310h06a4308_0
191
+ - pyyaml=6.0.1=py310h2372a71_1
192
+ - qt-main=5.15.8=h5d23da1_6
193
+ - readline=8.2=h8228510_1
194
+ - requests=2.31.0=pyhd8ed1ab_0
195
+ - safetensors=0.3.3=py310hcb5633a_1
196
+ - scipy=1.11.3=py310h5f9d8c6_0
197
+ - sentry-sdk=1.38.0=pyhd8ed1ab_0
198
+ - setproctitle=1.3.3=py310h2372a71_0
199
+ - setuptools=68.2.2=pyhd8ed1ab_0
200
+ - six=1.16.0=pyh6c4a22f_0
201
+ - smmap=5.0.0=pyhd8ed1ab_0
202
+ - svt-av1=1.4.1=hcb278e6_0
203
+ - sympy=1.12=pypyh9d50eac_103
204
+ - tbb=2021.9.0=hf52228f_0
205
+ - tensorboardx=2.6.2.2=pyhd8ed1ab_0
206
+ - timm=0.9.12=pyhd8ed1ab_0
207
+ - tk=8.6.13=noxft_h4845f30_101
208
+ - torchaudio=2.1.1=py310_cu118
209
+ - torchtriton=2.1.0=py310
210
+ - torchvision=0.16.1=py310_cu118
211
+ - tqdm=4.66.1=pyhd8ed1ab_0
212
+ - typing-extensions=4.8.0=hd8ed1ab_0
213
+ - typing_extensions=4.8.0=pyha770c72_0
214
+ - tzdata=2023c=h71feb2d_0
215
+ - urllib3=2.1.0=pyhd8ed1ab_0
216
+ - wandb=0.15.12=pyhd8ed1ab_0
217
+ - wheel=0.42.0=pyhd8ed1ab_0
218
+ - x264=1!164.3095=h166bdaf_2
219
+ - x265=3.5=h924138e_3
220
+ - xcb-util=0.4.0=h516909a_0
221
+ - xcb-util-image=0.4.0=h166bdaf_0
222
+ - xcb-util-keysyms=0.4.0=h516909a_0
223
+ - xcb-util-renderutil=0.3.9=h166bdaf_0
224
+ - xcb-util-wm=0.4.1=h516909a_0
225
+ - xkeyboard-config=2.38=h0b41bf4_0
226
+ - xorg-fixesproto=5.0=h7f98852_1002
227
+ - xorg-inputproto=2.3.2=h7f98852_1002
228
+ - xorg-kbproto=1.0.7=h7f98852_1002
229
+ - xorg-libice=1.1.1=hd590300_0
230
+ - xorg-libsm=1.2.4=h7391055_0
231
+ - xorg-libx11=1.8.4=h0b41bf4_0
232
+ - xorg-libxau=1.0.11=hd590300_0
233
+ - xorg-libxdmcp=1.1.3=h7f98852_0
234
+ - xorg-libxext=1.3.4=h0b41bf4_2
235
+ - xorg-libxfixes=5.0.3=h7f98852_1004
236
+ - xorg-libxi=1.7.10=h7f98852_0
237
+ - xorg-libxrender=0.9.10=h7f98852_1003
238
+ - xorg-renderproto=0.11.1=h7f98852_1002
239
+ - xorg-xextproto=7.3.0=h0b41bf4_1003
240
+ - xorg-xproto=7.0.31=h7f98852_1007
241
+ - xz=5.2.6=h166bdaf_0
242
+ - yaml=0.2.5=h7f98852_2
243
+ - zlib=1.2.13=hd590300_5
244
+ - zstd=1.5.5=hfc55251_0
245
+ - pip:
246
+ - annotated-types==0.6.0
247
+ - decord==0.6.0
248
+ - hjson==3.1.0
249
+ - ninja==1.11.1.1
250
+ - psutil==5.9.6
251
+ - py-cpuinfo==9.0.0
252
+ - pydantic==2.5.2
253
+ - pydantic-core==2.14.5
254
+ - pynvml==11.5.0
255
+ - imutils==0.5.4
256
+ - transformers==4.31.0
257
+ - ftfy
258
+ - easydict
259
+ - matplotlib==3.10.0
functional.py ADDED
@@ -0,0 +1,89 @@
1
+ import numbers
2
+ import cv2
3
+ import numpy as np
4
+ import PIL
5
+ import torch
6
+
7
+
8
+ def _is_tensor_clip(clip):
9
+ return torch.is_tensor(clip) and clip.ndimension() == 4
10
+
11
+
12
+ def crop_clip(clip, min_h, min_w, h, w):
13
+ if isinstance(clip[0], np.ndarray):
14
+ cropped = [img[min_h:min_h + h, min_w:min_w + w, :] for img in clip]
15
+
16
+ elif isinstance(clip[0], PIL.Image.Image):
17
+ cropped = [
18
+ img.crop((min_w, min_h, min_w + w, min_h + h)) for img in clip
19
+ ]
20
+ else:
21
+ raise TypeError('Expected numpy.ndarray or PIL.Image ' +
22
+ 'but got list of {0}'.format(type(clip[0])))
23
+ return cropped
24
+
25
+
26
+ def resize_clip(clip, size, interpolation='bilinear'):
27
+ if isinstance(clip[0], np.ndarray):
28
+ if isinstance(size, numbers.Number):
29
+ im_h, im_w, im_c = clip[0].shape
30
+ # Min spatial dim already matches minimal size
31
+ if (im_w <= im_h and im_w == size) or (im_h <= im_w
32
+ and im_h == size):
33
+ return clip
34
+ new_h, new_w = get_resize_sizes(im_h, im_w, size)
35
+ size = (new_w, new_h)
36
+ else:
37
+ size = size[0], size[1]
38
+ if interpolation == 'bilinear':
39
+ np_inter = cv2.INTER_LINEAR
40
+ else:
41
+ np_inter = cv2.INTER_NEAREST
42
+ scaled = [
43
+ cv2.resize(img, size, interpolation=np_inter) for img in clip
44
+ ]
45
+ elif isinstance(clip[0], PIL.Image.Image):
46
+ if isinstance(size, numbers.Number):
47
+ im_w, im_h = clip[0].size
48
+ # Min spatial dim already matches minimal size
49
+ if (im_w <= im_h and im_w == size) or (im_h <= im_w
50
+ and im_h == size):
51
+ return clip
52
+ new_h, new_w = get_resize_sizes(im_h, im_w, size)
53
+ size = (new_w, new_h)
54
+ else:
55
+ size = size[1], size[0]
56
+ if interpolation == 'bilinear':
57
+ pil_inter = PIL.Image.BILINEAR
58
+ else:
59
+ pil_inter = PIL.Image.NEAREST
60
+ scaled = [img.resize(size, pil_inter) for img in clip]
61
+ else:
62
+ raise TypeError('Expected numpy.ndarray or PIL.Image' +
63
+ ' but got list of {0}'.format(type(clip[0])))
64
+ return scaled
65
+
66
+
67
+ def get_resize_sizes(im_h, im_w, size):
68
+ if im_w < im_h:
69
+ ow = size
70
+ oh = int(size * im_h / im_w)
71
+ else:
72
+ oh = size
73
+ ow = int(size * im_w / im_h)
74
+ return oh, ow
75
+
76
+
77
+ def normalize(clip, mean, std, inplace=False):
78
+ if not _is_tensor_clip(clip):
79
+ raise TypeError('tensor is not a torch clip.')
80
+
81
+ if not inplace:
82
+ clip = clip.clone()
83
+
84
+ dtype = clip.dtype
85
+ mean = torch.as_tensor(mean, dtype=dtype, device=clip.device)
86
+ std = torch.as_tensor(std, dtype=dtype, device=clip.device)
87
+ clip.sub_(mean[:, None, None, None]).div_(std[:, None, None, None])
88
+
89
+ return clip
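For orientation, the helpers above compose as in the sketch below (hypothetical usage, assuming the file is importable as `functional`): resize the short side, crop, convert to a `(C, T, H, W)` tensor, then normalize.

```python
# Hypothetical usage of the clip helpers defined in functional.py above.
import numpy as np
import torch
import functional as F  # the module added in this commit

# A toy clip: 8 RGB frames of shape (H, W, C).
clip = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8) for _ in range(8)]

clip = F.resize_clip(clip, size=256, interpolation='bilinear')   # short side -> 256
clip = F.crop_clip(clip, min_h=16, min_w=16, h=224, w=224)       # 224x224 window

# normalize() expects a 4-D float tensor shaped (C, T, H, W).
tensor = torch.from_numpy(np.stack(clip)).float() / 255.0        # (T, H, W, C)
tensor = tensor.permute(3, 0, 1, 2)                              # (C, T, H, W)
tensor = F.normalize(tensor, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
print(tensor.shape)  # torch.Size([3, 8, 224, 224])
```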
kinetics.py ADDED
@@ -0,0 +1,559 @@
1
+ import os
2
+ import numpy as np
3
+ from numpy.lib.function_base import disp
4
+ import torch
5
+ import decord
6
+ from PIL import Image
7
+ from torchvision import transforms
8
+ from random_erasing import RandomErasing
9
+ import warnings
10
+ from decord import VideoReader, cpu
11
+ from torch.utils.data import Dataset
12
+ import video_transforms as video_transforms
13
+ import volume_transforms as volume_transforms
14
+
15
+ class VideoClsDataset(Dataset):
16
+ """Load your own video classification dataset."""
17
+
18
+ def __init__(self, anno_path, data_path, mode='train', clip_len=8,
19
+ frame_sample_rate=2, crop_size=224, short_side_size=256,
20
+ new_height=256, new_width=340, keep_aspect_ratio=True,
21
+ num_segment=1, num_crop=1, test_num_segment=10, test_num_crop=3,args=None):
22
+ self.anno_path = anno_path
23
+ self.data_path = data_path
24
+ self.mode = mode
25
+ self.clip_len = clip_len
26
+ self.frame_sample_rate = frame_sample_rate
27
+ self.crop_size = crop_size
28
+ self.short_side_size = short_side_size
29
+ self.new_height = new_height
30
+ self.new_width = new_width
31
+ self.keep_aspect_ratio = keep_aspect_ratio
32
+ self.num_segment = num_segment
33
+ self.test_num_segment = test_num_segment
34
+ self.num_crop = num_crop
35
+ self.test_num_crop = test_num_crop
36
+ self.args = args
37
+ self.aug = False
38
+ self.rand_erase = False
39
+ if self.mode in ['train']:
40
+ self.aug = True
41
+ if self.args.reprob > 0:
42
+ self.rand_erase = True
43
+ if VideoReader is None:
44
+ raise ImportError("Unable to import `decord` which is required to read videos.")
45
+
46
+ import pandas as pd
47
+ cleaned = pd.read_csv(self.anno_path, header=None, delimiter=' ')
48
+ self.dataset_samples = list(cleaned.values[:, 0])
49
+ self.label_array = list(cleaned.values[:, 1])
50
+
51
+ if (mode == 'train'):
52
+ pass
53
+
54
+ elif (mode == 'validation'):
55
+ self.data_transform = video_transforms.Compose([
56
+ video_transforms.Resize(self.short_side_size, interpolation='bilinear'),
57
+ video_transforms.CenterCrop(size=(self.crop_size, self.crop_size)),
58
+ volume_transforms.ClipToTensor(),
59
+ video_transforms.Normalize(mean=[0.485, 0.456, 0.406],
60
+ std=[0.229, 0.224, 0.225])
61
+ ])
62
+ elif mode == 'test':
63
+ self.data_resize = video_transforms.Compose([
64
+ video_transforms.Resize(size=(short_side_size), interpolation='bilinear')
65
+ ])
66
+ self.data_transform = video_transforms.Compose([
67
+ volume_transforms.ClipToTensor(),
68
+ video_transforms.Normalize(mean=[0.485, 0.456, 0.406],
69
+ std=[0.229, 0.224, 0.225])
70
+ ])
71
+ self.test_seg = []
72
+ self.test_dataset = []
73
+ self.test_label_array = []
74
+ for ck in range(self.test_num_segment):
75
+ for cp in range(self.test_num_crop):
76
+ for idx in range(len(self.label_array)):
77
+ sample_label = self.label_array[idx]
78
+ self.test_label_array.append(sample_label)
79
+ self.test_dataset.append(self.dataset_samples[idx])
80
+ self.test_seg.append((ck, cp))
81
+
82
+ def __getitem__(self, index):
83
+ if self.mode == 'train':
84
+ args = self.args
85
+ scale_t = 1
86
+
87
+ sample = self.dataset_samples[index]
88
+ buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t) # T H W C
89
+ if len(buffer) == 0:
90
+ while len(buffer) == 0:
91
+ warnings.warn("video {} not correctly loaded during training".format(sample))
92
+ index = np.random.randint(self.__len__())
93
+ sample = self.dataset_samples[index]
94
+ buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t)
95
+
96
+ if args.num_sample > 1:
97
+ frame_list = []
98
+ label_list = []
99
+ index_list = []
100
+ for _ in range(args.num_sample):
101
+ new_frames = self._aug_frame(buffer, args)
102
+ label = self.label_array[index]
103
+ frame_list.append(new_frames)
104
+ label_list.append(label)
105
+ index_list.append(index)
106
+ return frame_list, label_list, index_list, {}
107
+ else:
108
+ buffer = self._aug_frame(buffer, args)
109
+
110
+ return buffer, self.label_array[index], index, {}
111
+
112
+ elif self.mode == 'validation':
113
+ sample = self.dataset_samples[index]
114
+ buffer = self.loadvideo_decord(sample)
115
+ if len(buffer) == 0:
116
+ while len(buffer) == 0:
117
+ warnings.warn("video {} not correctly loaded during validation".format(sample))
118
+ index = np.random.randint(self.__len__())
119
+ sample = self.dataset_samples[index]
120
+ buffer = self.loadvideo_decord(sample)
121
+ buffer = self.data_transform(buffer)
122
+ return buffer, self.label_array[index], sample.split("/")[-1].split(".")[0]
123
+
124
+ elif self.mode == 'test':
125
+ sample = self.test_dataset[index]
126
+ chunk_nb, split_nb = self.test_seg[index]
127
+ buffer = self.loadvideo_decord(sample)
128
+
129
+ while len(buffer) == 0:
130
+ warnings.warn("video {}, temporal {}, spatial {} not found during testing".format(\
131
+ str(self.test_dataset[index]), chunk_nb, split_nb))
132
+ index = np.random.randint(self.__len__())
133
+ sample = self.test_dataset[index]
134
+ chunk_nb, split_nb = self.test_seg[index]
135
+ buffer = self.loadvideo_decord(sample)
136
+
137
+ buffer = self.data_resize(buffer)
138
+ if isinstance(buffer, list):
139
+ buffer = np.stack(buffer, 0)
140
+
141
+ spatial_step = 1.0 * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size) \
142
+ / (self.test_num_crop - 1)
143
+ temporal_step = max(1.0 * (buffer.shape[0] - self.clip_len) \
144
+ / (self.test_num_segment - 1), 0)
145
+ temporal_start = int(chunk_nb * temporal_step)
146
+ spatial_start = int(split_nb * spatial_step)
147
+ if buffer.shape[1] >= buffer.shape[2]:
148
+ buffer = buffer[temporal_start:temporal_start + self.clip_len, \
149
+ spatial_start:spatial_start + self.short_side_size, :, :]
150
+ else:
151
+ buffer = buffer[temporal_start:temporal_start + self.clip_len, \
152
+ :, spatial_start:spatial_start + self.short_side_size, :]
153
+
154
+ buffer = self.data_transform(buffer)
155
+ return buffer, self.test_label_array[index], sample.split("/")[-1].split(".")[0], \
156
+ chunk_nb, split_nb
157
+ else:
158
+ raise NameError('mode {} unknown'.format(self.mode))
159
+
160
+ def _aug_frame(
161
+ self,
162
+ buffer,
163
+ args,
164
+ ):
165
+
166
+ aug_transform = video_transforms.create_random_augment(
167
+ input_size=(self.crop_size, self.crop_size),
168
+ auto_augment=args.aa,
169
+ interpolation=args.train_interpolation,
170
+ )
171
+
172
+ buffer = [
173
+ transforms.ToPILImage()(frame) for frame in buffer
174
+ ]
175
+
176
+ buffer = aug_transform(buffer)
177
+
178
+ buffer = [transforms.ToTensor()(img) for img in buffer]
179
+ buffer = torch.stack(buffer) # T C H W
180
+ buffer = buffer.permute(0, 2, 3, 1) # T H W C
181
+
182
+ # T H W C
183
+ buffer = tensor_normalize(
184
+ buffer, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
185
+ )
186
+ # T H W C -> C T H W.
187
+ buffer = buffer.permute(3, 0, 1, 2)
188
+ # Perform data augmentation.
189
+ scl, asp = (
190
+ [0.25, 1.0],
191
+ [0.75, 1.3333],
192
+ )
193
+
194
+ buffer = spatial_sampling(
195
+ buffer,
196
+ spatial_idx=-1,
197
+ min_scale=256,
198
+ max_scale=320,
199
+ crop_size=self.crop_size,
200
+ random_horizontal_flip=False if args.data_set == 'SSV2' else True ,
201
+ inverse_uniform_sampling=False,
202
+ aspect_ratio=asp,
203
+ scale=scl,
204
+ motion_shift=False
205
+ )
206
+
207
+ if self.rand_erase:
208
+ erase_transform = RandomErasing(
209
+ args.reprob,
210
+ mode=args.remode,
211
+ max_count=args.recount,
212
+ num_splits=args.recount,
213
+ device="cpu",
214
+ )
215
+ buffer = buffer.permute(1, 0, 2, 3)
216
+ buffer = erase_transform(buffer)
217
+ buffer = buffer.permute(1, 0, 2, 3)
218
+
219
+ return buffer
220
+
221
+
222
+ def loadvideo_decord(self, sample, sample_rate_scale=1):
223
+ """Load video content using Decord"""
224
+ fname = sample
225
+
226
+ if not (os.path.exists(fname)):
227
+ return []
228
+
229
+ # avoid hanging issue
230
+ if os.path.getsize(fname) < 1 * 1024:
231
+ print('SKIP: ', fname, " - ", os.path.getsize(fname))
232
+ return []
233
+ try:
234
+ if self.keep_aspect_ratio:
235
+ vr = VideoReader(fname, num_threads=1, ctx=cpu(0))
236
+ else:
237
+ vr = VideoReader(fname, width=self.new_width, height=self.new_height,
238
+ num_threads=1, ctx=cpu(0))
239
+ except:
240
+ print("video cannot be loaded by decord: ", fname)
241
+ return []
242
+
243
+ if self.mode == 'test':
244
+ all_index = [x for x in range(0, len(vr), self.frame_sample_rate)]
245
+ while len(all_index) < self.clip_len:
246
+ all_index.append(all_index[-1])
247
+ vr.seek(0)
248
+ buffer = vr.get_batch(all_index).asnumpy()
249
+ return buffer
250
+
251
+ # handle temporal segments
252
+ converted_len = int(self.clip_len * self.frame_sample_rate)
253
+ seg_len = len(vr) // self.num_segment
254
+
255
+ all_index = []
256
+ for i in range(self.num_segment):
257
+ if seg_len <= converted_len:
258
+ index = np.linspace(0, seg_len, num=seg_len // self.frame_sample_rate)
259
+ index = np.concatenate((index, np.ones(self.clip_len - seg_len // self.frame_sample_rate) * seg_len))
260
+ index = np.clip(index, 0, seg_len - 1).astype(np.int64)
261
+ else:
262
+ end_idx = np.random.randint(converted_len, seg_len)
263
+ str_idx = end_idx - converted_len
264
+ index = np.linspace(str_idx, end_idx, num=self.clip_len)
265
+ index = np.clip(index, str_idx, end_idx - 1).astype(np.int64)
266
+ index = index + i*seg_len
267
+ all_index.extend(list(index))
268
+
269
+ all_index = all_index[::int(sample_rate_scale)]
270
+ vr.seek(0)
271
+ buffer = vr.get_batch(all_index).asnumpy()
272
+ return buffer
273
+
274
+ def __len__(self):
275
+ if self.mode != 'test':
276
+ return len(self.dataset_samples)
277
+ else:
278
+ return len(self.test_dataset)
279
+
280
+
281
+ def spatial_sampling(
282
+ frames,
283
+ spatial_idx=-1,
284
+ min_scale=256,
285
+ max_scale=320,
286
+ crop_size=224,
287
+ random_horizontal_flip=True,
288
+ inverse_uniform_sampling=False,
289
+ aspect_ratio=None,
290
+ scale=None,
291
+ motion_shift=False,
292
+ ):
293
+ """
294
+ Perform spatial sampling on the given video frames. If spatial_idx is
295
+ -1, perform random scale, random crop, and random flip on the given
296
+ frames. If spatial_idx is 0, 1, or 2, perform spatial uniform sampling
297
+ with the given spatial_idx.
298
+ Args:
299
+ frames (tensor): frames of images sampled from the video. The
300
+ dimension is `num frames` x `height` x `width` x `channel`.
301
+ spatial_idx (int): if -1, perform random spatial sampling. If 0, 1,
302
+ or 2, perform left, center, right crop if width is larger than
303
+ height, and perform top, center, bottom crop if height is larger
304
+ than width.
305
+ min_scale (int): the minimal size of scaling.
306
+ max_scale (int): the maximal size of scaling.
307
+ crop_size (int): the size of height and width used to crop the
308
+ frames.
309
+ inverse_uniform_sampling (bool): if True, sample uniformly in
310
+ [1 / max_scale, 1 / min_scale] and take a reciprocal to get the
311
+ scale. If False, take a uniform sample from [min_scale,
312
+ max_scale].
313
+ aspect_ratio (list): Aspect ratio range for resizing.
314
+ scale (list): Scale range for resizing.
315
+ motion_shift (bool): Whether to apply motion shift for resizing.
316
+ Returns:
317
+ frames (tensor): spatially sampled frames.
318
+ """
319
+ assert spatial_idx in [-1, 0, 1, 2]
320
+ if spatial_idx == -1:
321
+ if aspect_ratio is None and scale is None:
322
+ frames, _ = video_transforms.random_short_side_scale_jitter(
323
+ images=frames,
324
+ min_size=min_scale,
325
+ max_size=max_scale,
326
+ inverse_uniform_sampling=inverse_uniform_sampling,
327
+ )
328
+ frames, _ = video_transforms.random_crop(frames, crop_size)
329
+ else:
330
+ transform_func = (
331
+ video_transforms.random_resized_crop_with_shift
332
+ if motion_shift
333
+ else video_transforms.random_resized_crop
334
+ )
335
+ frames = transform_func(
336
+ images=frames,
337
+ target_height=crop_size,
338
+ target_width=crop_size,
339
+ scale=scale,
340
+ ratio=aspect_ratio,
341
+ )
342
+ if random_horizontal_flip:
343
+ frames, _ = video_transforms.horizontal_flip(0.5, frames)
344
+ else:
345
+ # The testing is deterministic and no jitter should be performed.
346
+ # min_scale, max_scale, and crop_size are expected to be the same.
347
+ assert len({min_scale, max_scale, crop_size}) == 1
348
+ frames, _ = video_transforms.random_short_side_scale_jitter(
349
+ frames, min_scale, max_scale
350
+ )
351
+ frames, _ = video_transforms.uniform_crop(frames, crop_size, spatial_idx)
352
+ return frames
353
+
354
+
355
+ def tensor_normalize(tensor, mean, std):
356
+ """
357
+ Normalize a given tensor by subtracting the mean and dividing the std.
358
+ Args:
359
+ tensor (tensor): tensor to normalize.
360
+ mean (tensor or list): mean value to subtract.
361
+ std (tensor or list): std to divide.
362
+ """
363
+ if tensor.dtype == torch.uint8:
364
+ tensor = tensor.float()
365
+ tensor = tensor / 255.0
366
+ if type(mean) == list:
367
+ mean = torch.tensor(mean)
368
+ if type(std) == list:
369
+ std = torch.tensor(std)
370
+ tensor = tensor - mean
371
+ tensor = tensor / std
372
+ return tensor
373
+
374
+
375
+ class VideoMAE(torch.utils.data.Dataset):
376
+ """Load your own video classification dataset.
377
+ Parameters
378
+ ----------
379
+ root : str, required.
380
+ Path to the root folder storing the dataset.
381
+ setting : str, required.
382
+ A text file describing the dataset, each line per video sample.
383
+ There are three items in each line: (1) video path; (2) video length and (3) video label.
384
+ train : bool, default True.
385
+ Whether to load the training or validation set.
386
+ test_mode : bool, default False.
387
+ Whether to perform evaluation on the test set.
388
+ Usually a three-crop or ten-crop evaluation strategy is involved.
389
+ name_pattern : str, default None.
390
+ The naming pattern of the decoded video frames.
391
+ For example, img_00012.jpg.
392
+ video_ext : str, default 'mp4'.
393
+ If video_loader is set to True, please specify the video format accordingly.
394
+ is_color : bool, default True.
395
+ Whether the loaded image is color or grayscale.
396
+ modality : str, default 'rgb'.
397
+ Input modalities, we support only rgb video frames for now.
398
+ Will add support for rgb difference image and optical flow image later.
399
+ num_segments : int, default 1.
400
+ Number of segments to evenly divide the video into clips.
401
+ A useful technique to obtain global video-level information.
402
+ Limin Wang, et al., Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, ECCV 2016.
403
+ num_crop : int, default 1.
404
+ Number of crops for each image. default is 1.
405
+ Common choices are three crops and ten crops during evaluation.
406
+ new_length : int, default 1.
407
+ The length of input video clip. Default is a single image, but it can be multiple video frames.
408
+ For example, new_length=16 means we will extract a video clip of consecutive 16 frames.
409
+ new_step : int, default 1.
410
+ Temporal sampling rate. For example, new_step=1 means we will extract a video clip of consecutive frames.
411
+ new_step=2 means we will extract a video clip of every other frame.
412
+ temporal_jitter : bool, default False.
413
+ Whether to temporally jitter if new_step > 1.
414
+ video_loader : bool, default False.
415
+ Whether to use video loader to load data.
416
+ use_decord : bool, default True.
417
+ Whether to use Decord video loader to load data. Otherwise use mmcv video loader.
418
+ transform : function, default None.
419
+ A function that takes data and label and transforms them.
420
+ data_aug : str, default 'v1'.
421
+ Different types of automatic data augmentation. Supports v1, v2, v3 and v4.
422
+ lazy_init : bool, default False.
423
+ If set to True, build a dataset instance without loading any dataset.
424
+ """
425
+ def __init__(self,
426
+ root,
427
+ setting,
428
+ train=True,
429
+ test_mode=False,
430
+ name_pattern='img_%05d.jpg',
431
+ video_ext='mp4',
432
+ is_color=True,
433
+ modality='rgb',
434
+ num_segments=1,
435
+ num_crop=1,
436
+ new_length=1,
437
+ new_step=1,
438
+ transform=None,
439
+ temporal_jitter=False,
440
+ video_loader=False,
441
+ use_decord=False,
442
+ lazy_init=False):
443
+
444
+ super(VideoMAE, self).__init__()
445
+ self.root = root
446
+ self.setting = setting
447
+ self.train = train
448
+ self.test_mode = test_mode
449
+ self.is_color = is_color
450
+ self.modality = modality
451
+ self.num_segments = num_segments
452
+ self.num_crop = num_crop
453
+ self.new_length = new_length
454
+ self.new_step = new_step
455
+ self.skip_length = self.new_length * self.new_step
456
+ self.temporal_jitter = temporal_jitter
457
+ self.name_pattern = name_pattern
458
+ self.video_loader = video_loader
459
+ self.video_ext = video_ext
460
+ self.use_decord = use_decord
461
+ self.transform = transform
462
+ self.lazy_init = lazy_init
463
+
464
+
465
+ if not self.lazy_init:
466
+ self.clips = self._make_dataset(root, setting)
467
+ if len(self.clips) == 0:
468
+ raise(RuntimeError("Found 0 video clips in subfolders of: " + root + "\n"
469
+ "Check your data directory (opt.data-dir)."))
470
+
471
+ def __getitem__(self, index):
472
+ try:
473
+ directory, target = self.clips[index]
474
+ if self.video_loader:
475
+ if '.' in directory.split('/')[-1]:
476
+ # data in the "setting" file already have extension, e.g., demo.mp4
477
+ video_name = directory
478
+ else:
479
+ # data in the "setting" file do not have extension, e.g., demo
480
+ # So we need to provide extension (i.e., .mp4) to complete the file name.
481
+ video_name = '{}.{}'.format(directory, self.video_ext)
482
+
483
+ decord_vr = decord.VideoReader(video_name, num_threads=1)
484
+ duration = len(decord_vr)
485
+
486
+ segment_indices, skip_offsets = self._sample_train_indices(duration)
487
+
488
+ images = self._video_TSN_decord_batch_loader(directory, decord_vr, duration, segment_indices, skip_offsets)
489
+
490
+ process_data, mask = self.transform((images, None)) # T*C,H,W
491
+ process_data = process_data.view((self.new_length, 3) + process_data.size()[-2:]).transpose(0,1) # T*C,H,W -> T,C,H,W -> C,T,H,W
492
+ return (process_data, mask)
493
+ except Exception as error:
494
+ print(error, " failed to load: ", video_name)
495
+ return self[(index+1) % len(self)]
496
+
497
+
498
+ def __len__(self):
499
+ return len(self.clips)
500
+
501
+ def _make_dataset(self, directory, setting):
502
+ if not os.path.exists(setting):
503
+ raise(RuntimeError("Setting file %s doesn't exist. Check opt.train-list and opt.val-list. " % (setting)))
504
+ clips = []
505
+ with open(setting) as split_f:
506
+ data = split_f.readlines()
507
+ for line in data:
508
+ line_info = line.split(' ')
509
+ # line format: video_path, video_duration, video_label
510
+ if len(line_info) < 2:
511
+ raise(RuntimeError('Video input format is not correct, missing one or more elements. %s' % line))
512
+ clip_path = os.path.join(line_info[0])
513
+ target = int(line_info[1])
514
+ item = (clip_path, target)
515
+ clips.append(item)
516
+ return clips
517
+
518
+ def _sample_train_indices(self, num_frames):
519
+ average_duration = (num_frames - self.skip_length + 1) // self.num_segments
520
+ if average_duration > 0:
521
+ offsets = np.multiply(list(range(self.num_segments)),
522
+ average_duration)
523
+ offsets = offsets + np.random.randint(average_duration,
524
+ size=self.num_segments)
525
+ elif num_frames > max(self.num_segments, self.skip_length):
526
+ offsets = np.sort(np.random.randint(
527
+ num_frames - self.skip_length + 1,
528
+ size=self.num_segments))
529
+ else:
530
+ offsets = np.zeros((self.num_segments,))
531
+
532
+ if self.temporal_jitter:
533
+ skip_offsets = np.random.randint(
534
+ self.new_step, size=self.skip_length // self.new_step)
535
+ else:
536
+ skip_offsets = np.zeros(
537
+ self.skip_length // self.new_step, dtype=int)
538
+ return offsets + 1, skip_offsets
539
+
540
+
541
+ def _video_TSN_decord_batch_loader(self, directory, video_reader, duration, indices, skip_offsets):
542
+ sampled_list = []
543
+ frame_id_list = []
544
+ for seg_ind in indices:
545
+ offset = int(seg_ind)
546
+ for i, _ in enumerate(range(0, self.skip_length, self.new_step)):
547
+ if offset + skip_offsets[i] <= duration:
548
+ frame_id = offset + skip_offsets[i] - 1
549
+ else:
550
+ frame_id = offset - 1
551
+ frame_id_list.append(frame_id)
552
+ if offset + self.new_step < duration:
553
+ offset += self.new_step
554
+ try:
555
+ video_data = video_reader.get_batch(frame_id_list).asnumpy()
556
+ sampled_list = [Image.fromarray(video_data[vid, :, :, :]).convert('RGB') for vid, _ in enumerate(frame_id_list)]
557
+ except:
558
+ raise RuntimeError('Error occurred in reading frames {} from video {} of duration {}.'.format(frame_id_list, directory, duration))
559
+ return sampled_list
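As a rough usage sketch (hypothetical paths and argument values; the `args` namespace only stubs the fields that `VideoClsDataset` and `_aug_frame` actually read), the fine-tuning dataset above would be constructed roughly like this:

```python
# Hypothetical construction of the fine-tuning dataset defined in kinetics.py.
# The annotation file is space-separated: <video_path> <integer_label> per line.
from types import SimpleNamespace
from kinetics import VideoClsDataset

args = SimpleNamespace(
    reprob=0.25, remode='pixel', recount=1,   # RandomErasing settings
    aa='rand-m7-n4-mstd0.5-inc1',             # RandAugment policy string
    train_interpolation='bicubic',
    num_sample=1,                             # one augmented view per clip
    data_set='Kinetics-400',                  # anything but 'SSV2' keeps h-flip on
)

dataset = VideoClsDataset(
    anno_path='labels/kinetics400_train.csv',  # placeholder path
    data_path='',                              # unused placeholder
    mode='train',
    clip_len=16, frame_sample_rate=4,
    crop_size=224, short_side_size=256,
    args=args,
)

frames, label, idx, _ = dataset[0]   # frames: (C, T, H, W) tensor after _aug_frame
```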
masking_generator.py ADDED
@@ -0,0 +1,185 @@
1
+ import numpy as np
2
+ import random
3
+ import ast
4
+
5
+ class TubeMaskingGenerator:
6
+ def __init__(self, input_size, mask_ratio):
7
+ self.frames, self.height, self.width = input_size
8
+ self.num_patches_per_frame = self.height * self.width
9
+ self.total_patches = self.frames * self.num_patches_per_frame
10
+ self.num_masks_per_frame = int(mask_ratio * self.num_patches_per_frame)
11
+ self.total_masks = self.frames * self.num_masks_per_frame
12
+
13
+ def __repr__(self):
14
+ repr_str = "Masks: total patches {}, mask patches {}".format(
15
+ self.total_patches, self.total_masks
16
+ )
17
+ return repr_str
18
+
19
+ def __call__(self):
20
+ mask_per_frame = np.hstack([
21
+ np.zeros(self.num_patches_per_frame - self.num_masks_per_frame),
22
+ np.ones(self.num_masks_per_frame),
23
+ ])
24
+ np.random.shuffle(mask_per_frame)
25
+ mask = np.tile(mask_per_frame, (self.frames,1)).flatten()
26
+ return mask
27
+
28
+
29
+ class TubeletMaskingGenerator:
30
+ def __init__(self, input_size, mask_ratio, visible_frames, mask_type="tube", traj_unmask_ratio=0.1):
31
+ self.tube_masking_generator = TubeMaskingGenerator(input_size, mask_ratio)
32
+ self.frames, self.height, self.width = input_size
33
+ self.num_patches_per_frame = self.height * self.width
34
+ self.total_patches = self.frames * self.num_patches_per_frame
35
+ self.num_masks_per_frame = int(mask_ratio * self.num_patches_per_frame)
36
+ self.total_masks = self.frames * self.num_masks_per_frame
37
+ self.patch_size = 16
38
+ self.traj_unmask_ratio = traj_unmask_ratio
39
+ if visible_frames is not None:
40
+ visible_list = ast.literal_eval(visible_frames)
41
+ self.visible_frames = [int(element) for element in visible_list]
42
+ else:
43
+ self.visible_frames = None
44
+
45
+ self.mask_type = mask_type
46
+
47
+ def _balance_num_masks(self, combined_mask,
48
+ unmasked_object_patches_index,
49
+ unmasked_non_object_patches_index,
50
+ masked_object_patches_index,
51
+ tube_masked_index=None,
52
+ tube_unmasked_index=None):
53
+ current_masks = np.sum(combined_mask)
54
+ num_diff = np.abs(self.total_masks - current_masks)
55
+
56
+ if tube_masked_index is None or tube_unmasked_index is None:
57
+ # tubelet masking without tube mask
58
+ # if too many masked patches, we unmask some patches
59
+ if current_masks > self.total_masks:
60
+ picked_index = masked_object_patches_index[np.random.choice(masked_object_patches_index.size, size=int(num_diff), replace=False)]
61
+ combined_mask[picked_index] = 0.
62
+ # if too few masked patches, we first try to mask non-object patches, if not enough, then we mask protected patches
63
+ elif current_masks < self.total_masks:
64
+ if num_diff <= len(unmasked_non_object_patches_index):
65
+ picked_index = unmasked_non_object_patches_index[np.random.choice(unmasked_non_object_patches_index.size, size=int(num_diff), replace=False)]
66
+ combined_mask[picked_index] = 1.
67
+ else:
68
+ combined_mask[unmasked_non_object_patches_index] = 1.
69
+ picked_index = unmasked_object_patches_index[np.random.choice(unmasked_object_patches_index.size, size=int(num_diff - len(unmasked_non_object_patches_index)), replace=False)]
70
+ combined_mask[picked_index] = 1.
71
+ else:
72
+ # if too many masked patches, we first try to unmask tube masked patches, if not enough, then we unmask object patches
73
+ tube_masked_non_object_index = np.array(list(set(tube_masked_index) - set(masked_object_patches_index) - set(unmasked_object_patches_index)))
74
+ if current_masks > self.total_masks:
75
+ if num_diff <= len(tube_masked_non_object_index):
76
+ picked_index = tube_masked_non_object_index[np.random.choice(tube_masked_non_object_index.size, size=int(num_diff), replace=False)]
77
+ combined_mask[picked_index] = 0.
78
+ else:
79
+ combined_mask[tube_masked_non_object_index] = 0.
80
+ picked_index = masked_object_patches_index[np.random.choice(masked_object_patches_index.size, size=int(num_diff - len(tube_masked_non_object_index)), replace=False)]
81
+ combined_mask[picked_index] = 0.
82
+ # if too few masked patches, we first try to mask non-object patches, if not enough, then we mask protected patches
83
+ elif current_masks < self.total_masks:
84
+ tube_unmasked_non_object_index = np.array(list(set(tube_unmasked_index) - set(masked_object_patches_index) - set(unmasked_object_patches_index)))
85
+ if num_diff <= len(tube_unmasked_non_object_index):
86
+ picked_index = tube_unmasked_non_object_index[np.random.choice(tube_unmasked_non_object_index.size, size=int(num_diff), replace=False)]
87
+ combined_mask[picked_index] = 1.
88
+ else:
89
+ combined_mask[tube_unmasked_non_object_index] = 1.
90
+ picked_index = unmasked_object_patches_index[np.random.choice(unmasked_object_patches_index.size, size=int(num_diff - len(tube_unmasked_non_object_index)), replace=False)]
91
+ combined_mask[picked_index] = 1.
92
+
93
+ balanced_mask = combined_mask
94
+ return balanced_mask
95
+
96
+ def __repr__(self):
97
+ repr_str = "Masks: total patches {}, mask patches {}".format(
98
+ self.total_patches, self.total_masks
99
+ )
100
+ return repr_str
101
+
102
+ # 1 in mask array means masked, 0 means unmasked
103
+ def __call__(self, traj_rois):
104
+ # generate the original VideoMAE tube mask and initialize the tube mask index
105
+ tube_mask = self.tube_masking_generator()
106
+ tube_masked_index = None
107
+ tube_unmasked_index = None
108
+
109
+ # initialize mask
110
+ num_tubelet, num_frame, box = traj_rois.shape
111
+ assert num_frame % 2 == 0 and self.frames == (num_frame // 2)
112
+ combined_mask = np.zeros((num_frame // 2, self.height, self.width))
113
+ # assume patch size is (2, 16, 16) so mask shape should be (8, 14, 14)
114
+ # we combine the traj_rois of two consecutive frames to one large traj_rois
115
+
116
+ # pick one tubelet that is not masked
117
+ if self.visible_frames is None:
118
+ picked_frame = np.random.randint(0, (num_frame // 2))
119
+ picked_list = [picked_frame]
120
+ else:
121
+ picked_list = self.visible_frames
122
+
123
+ # combined mask 1 means object patches that should be masked, 2 means object patches that should not be masked, 0 means non-object patches
124
+ for roi_idx, roi in enumerate(traj_rois):
125
+ for i in range(num_frame // 2):
126
+ min_x = min( (roi[2 * i][0], roi[2 * i + 1][0]) )
127
+ max_x = max( (roi[2 * i][2], roi[2 * i + 1][2]) )
128
+ min_y = min( (roi[2 * i][1], roi[2 * i + 1][1]) )
129
+ max_y = max( (roi[2 * i][3], roi[2 * i + 1][3]) )
130
+
131
+ patch_index_x_min = max( int(np.floor(min_x / self.patch_size)), 0)
132
+ patch_index_x_max = min( int(np.ceil(max_x / self.patch_size)) + 1, 14)
133
+ patch_index_y_min = max( int(np.floor(min_y / self.patch_size)), 0)
134
+ patch_index_y_max = min( int(np.ceil(max_y / self.patch_size)) + 1, 14)
135
+
136
+ if i in picked_list:
137
+ combined_mask[i][patch_index_y_min:patch_index_y_max, patch_index_x_min:patch_index_x_max] = 2.
138
+ else:
139
+ combined_mask[i][patch_index_y_min:patch_index_y_max, patch_index_x_min:patch_index_x_max] = 1.
140
+
141
+ combined_mask = combined_mask.flatten()
142
+ masked_object_patches_index = np.where(combined_mask == 1.)[0]
143
+ unmasked_non_object_patches_index = np.where(combined_mask == 0.)[0]
144
+ unmasked_object_patches_index = np.where(combined_mask == 2.)[0]
145
+ combined_mask[unmasked_object_patches_index] = 0.
146
+
147
+ tube_masked_index = np.where(tube_mask == 1.)[0]
148
+ tube_unmasked_index = np.where(tube_mask == 0.)[0]
149
+
150
+ # combine tubelet mask and tube mask
151
+ combined_mask = np.bitwise_or(combined_mask.astype(bool), tube_mask.astype(bool)).astype(np.float32)
152
+
153
+ if self.mask_type == "tube+picked_frame_visible":
154
+ # unmasked the protected patches
155
+ combined_mask[unmasked_object_patches_index] = 0.
156
+
157
+ elif self.mask_type == "tube+traj_mask":
158
+ # get index of unmasked traj patches
159
+ traj_unmask_ratio = self.traj_unmask_ratio
160
+ traj_patches_index = np.array(list(set(masked_object_patches_index) | set(unmasked_object_patches_index)))
161
+ unmasked_traj_patches_index = traj_patches_index[np.random.choice(traj_patches_index.size, size=int(traj_unmask_ratio * len(traj_patches_index)), replace=False)]
162
+
163
+ # mask the whole traj
164
+ combined_mask[traj_patches_index] = 1.
165
+ # unmask those selected patches
166
+ combined_mask[unmasked_traj_patches_index] = 0.
167
+
168
+ # update indexes
169
+ unmasked_object_patches_index = unmasked_traj_patches_index
170
+ masked_object_patches_index = np.array(list(set(traj_patches_index) - set(unmasked_traj_patches_index)))
171
+
172
+
173
+ # balance masked patch number
174
+ mask = self._balance_num_masks(combined_mask,
175
+ unmasked_object_patches_index,
176
+ unmasked_non_object_patches_index,
177
+ masked_object_patches_index,
178
+ tube_masked_index,
179
+ tube_unmasked_index)
180
+
181
+
182
+ assert np.sum(mask) == self.total_masks
183
+ return mask
184
+
185
+
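The plain tube mask can be exercised on its own. The sketch below uses the 8 x 14 x 14 token grid implied by the code above (16 frames with tubelet size 2, a 224 x 224 input, and 16 x 16 patches):

```python
# Sketch: generate a tube mask for an 8 x 14 x 14 token grid at 90% masking.
import numpy as np
from masking_generator import TubeMaskingGenerator

gen = TubeMaskingGenerator(input_size=(8, 14, 14), mask_ratio=0.9)
mask = gen()                                    # flat array: 1 = masked, 0 = visible
print(mask.shape)                               # (1568,) = 8 * 14 * 14
print(int(mask.sum()) == gen.total_masks)       # True: exactly total_masks ones

# The same spatial positions are masked in every frame (a "tube" along time).
per_frame = mask.reshape(8, -1)
print(bool(np.all(per_frame == per_frame[0])))  # True
```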
mixup.py ADDED
@@ -0,0 +1,316 @@
1
+ """ Mixup and Cutmix
2
+
3
+ Papers:
4
+ mixup: Beyond Empirical Risk Minimization (https://arxiv.org/abs/1710.09412)
5
+
6
+ CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (https://arxiv.org/abs/1905.04899)
7
+
8
+ Code Reference:
9
+ CutMix: https://github.com/clovaai/CutMix-PyTorch
10
+
11
+ Hacked together by / Copyright 2019, Ross Wightman
12
+ """
13
+ import numpy as np
14
+ import torch
15
+
16
+
17
+ def one_hot(x, num_classes, on_value=1., off_value=0., device='cuda'):
18
+ x = x.long().view(-1, 1)
19
+ return torch.full((x.size()[0], num_classes), off_value, device=device).scatter_(1, x, on_value)
20
+
21
+
22
+ def mixup_target(target, num_classes, lam=1., smoothing=0.0, device='cuda'):
23
+ off_value = smoothing / num_classes
24
+ on_value = 1. - smoothing + off_value
25
+ y1 = one_hot(target, num_classes, on_value=on_value, off_value=off_value, device=device)
26
+ y2 = one_hot(target.flip(0), num_classes, on_value=on_value, off_value=off_value, device=device)
27
+ return y1 * lam + y2 * (1. - lam)
28
+
29
+
30
+ def rand_bbox(img_shape, lam, margin=0., count=None):
31
+ """ Standard CutMix bounding-box
32
+ Generates a random square bbox based on lambda value. This impl includes
33
+ support for enforcing a border margin as percent of bbox dimensions.
34
+
35
+ Args:
36
+ img_shape (tuple): Image shape as tuple
37
+ lam (float): Cutmix lambda value
38
+ margin (float): Percentage of bbox dimension to enforce as margin (reduce amount of box outside image)
39
+ count (int): Number of bbox to generate
40
+ """
41
+ ratio = np.sqrt(1 - lam)
42
+ img_h, img_w = img_shape[-2:]
43
+ cut_h, cut_w = int(img_h * ratio), int(img_w * ratio)
44
+ margin_y, margin_x = int(margin * cut_h), int(margin * cut_w)
45
+ cy = np.random.randint(0 + margin_y, img_h - margin_y, size=count)
46
+ cx = np.random.randint(0 + margin_x, img_w - margin_x, size=count)
47
+ yl = np.clip(cy - cut_h // 2, 0, img_h)
48
+ yh = np.clip(cy + cut_h // 2, 0, img_h)
49
+ xl = np.clip(cx - cut_w // 2, 0, img_w)
50
+ xh = np.clip(cx + cut_w // 2, 0, img_w)
51
+ return yl, yh, xl, xh
52
+
53
+
54
+ def rand_bbox_minmax(img_shape, minmax, count=None):
55
+ """ Min-Max CutMix bounding-box
56
+ Inspired by Darknet cutmix impl, generates a random rectangular bbox
57
+ based on min/max percent values applied to each dimension of the input image.
58
+
59
+ Typical defaults for minmax are usually in the .2-.3 for min and .8-.9 range for max.
60
+
61
+ Args:
62
+ img_shape (tuple): Image shape as tuple
63
+ minmax (tuple or list): Min and max bbox ratios (as percent of image size)
64
+ count (int): Number of bbox to generate
65
+ """
66
+ assert len(minmax) == 2
67
+ img_h, img_w = img_shape[-2:]
68
+ cut_h = np.random.randint(int(img_h * minmax[0]), int(img_h * minmax[1]), size=count)
69
+ cut_w = np.random.randint(int(img_w * minmax[0]), int(img_w * minmax[1]), size=count)
70
+ yl = np.random.randint(0, img_h - cut_h, size=count)
71
+ xl = np.random.randint(0, img_w - cut_w, size=count)
72
+ yu = yl + cut_h
73
+ xu = xl + cut_w
74
+ return yl, yu, xl, xu
75
+
76
+
77
+ def cutmix_bbox_and_lam(img_shape, lam, ratio_minmax=None, correct_lam=True, count=None):
78
+ """ Generate bbox and apply lambda correction.
79
+ """
80
+ if ratio_minmax is not None:
81
+ yl, yu, xl, xu = rand_bbox_minmax(img_shape, ratio_minmax, count=count)
82
+ else:
83
+ yl, yu, xl, xu = rand_bbox(img_shape, lam, count=count)
84
+ if correct_lam or ratio_minmax is not None:
85
+ bbox_area = (yu - yl) * (xu - xl)
86
+ lam = 1. - bbox_area / float(img_shape[-2] * img_shape[-1])
87
+ return (yl, yu, xl, xu), lam
88
+
89
+
90
+ class Mixup:
91
+ """ Mixup/Cutmix that applies different params to each element or whole batch
92
+
93
+ Args:
94
+ mixup_alpha (float): mixup alpha value, mixup is active if > 0.
95
+ cutmix_alpha (float): cutmix alpha value, cutmix is active if > 0.
96
+ cutmix_minmax (List[float]): cutmix min/max image ratio, cutmix is active and uses this vs alpha if not None.
97
+ prob (float): probability of applying mixup or cutmix per batch or element
98
+ switch_prob (float): probability of switching to cutmix instead of mixup when both are active
99
+ mode (str): how to apply mixup/cutmix params (per 'batch', 'pair' (pair of elements), or 'elem' (element))
100
+ correct_lam (bool): apply lambda correction when cutmix bbox clipped by image borders
101
+ label_smoothing (float): apply label smoothing to the mixed target tensor
102
+ num_classes (int): number of classes for target
103
+ """
104
+ def __init__(self, mixup_alpha=1., cutmix_alpha=0., cutmix_minmax=None, prob=1.0, switch_prob=0.5,
105
+ mode='batch', correct_lam=True, label_smoothing=0.1, num_classes=1000):
106
+ self.mixup_alpha = mixup_alpha
107
+ self.cutmix_alpha = cutmix_alpha
108
+ self.cutmix_minmax = cutmix_minmax
109
+ if self.cutmix_minmax is not None:
110
+ assert len(self.cutmix_minmax) == 2
111
+ # force cutmix alpha == 1.0 when minmax active to keep logic simple & safe
112
+ self.cutmix_alpha = 1.0
113
+ self.mix_prob = prob
114
+ self.switch_prob = switch_prob
115
+ self.label_smoothing = label_smoothing
116
+ self.num_classes = num_classes
117
+ self.mode = mode
118
+ self.correct_lam = correct_lam # correct lambda based on clipped area for cutmix
119
+ self.mixup_enabled = True # set to false to disable mixing (intended to be set by train loop)
120
+
121
+ def _params_per_elem(self, batch_size):
122
+ lam = np.ones(batch_size, dtype=np.float32)
123
+ use_cutmix = np.zeros(batch_size, dtype=bool)  # np.bool was removed in recent NumPy
124
+ if self.mixup_enabled:
125
+ if self.mixup_alpha > 0. and self.cutmix_alpha > 0.:
126
+ use_cutmix = np.random.rand(batch_size) < self.switch_prob
127
+ lam_mix = np.where(
128
+ use_cutmix,
129
+ np.random.beta(self.cutmix_alpha, self.cutmix_alpha, size=batch_size),
130
+ np.random.beta(self.mixup_alpha, self.mixup_alpha, size=batch_size))
131
+ elif self.mixup_alpha > 0.:
132
+ lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha, size=batch_size)
133
+ elif self.cutmix_alpha > 0.:
134
+ use_cutmix = np.ones(batch_size, dtype=bool)
135
+ lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha, size=batch_size)
136
+ else:
137
+ assert False, "One of mixup_alpha > 0., cutmix_alpha > 0., cutmix_minmax not None should be true."
138
+ lam = np.where(np.random.rand(batch_size) < self.mix_prob, lam_mix.astype(np.float32), lam)
139
+ return lam, use_cutmix
140
+
141
+ def _params_per_batch(self):
142
+ lam = 1.
143
+ use_cutmix = False
144
+ if self.mixup_enabled and np.random.rand() < self.mix_prob:
145
+ if self.mixup_alpha > 0. and self.cutmix_alpha > 0.:
146
+ use_cutmix = np.random.rand() < self.switch_prob
147
+ lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha) if use_cutmix else \
148
+ np.random.beta(self.mixup_alpha, self.mixup_alpha)
149
+ elif self.mixup_alpha > 0.:
150
+ lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha)
151
+ elif self.cutmix_alpha > 0.:
152
+ use_cutmix = True
153
+ lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha)
154
+ else:
155
+ assert False, "One of mixup_alpha > 0., cutmix_alpha > 0., cutmix_minmax not None should be true."
156
+ lam = float(lam_mix)
157
+ return lam, use_cutmix
158
+
159
+ def _mix_elem(self, x):
160
+ batch_size = len(x)
161
+ lam_batch, use_cutmix = self._params_per_elem(batch_size)
162
+ x_orig = x.clone() # need to keep an unmodified original for mixing source
163
+ for i in range(batch_size):
164
+ j = batch_size - i - 1
165
+ lam = lam_batch[i]
166
+ if lam != 1.:
167
+ if use_cutmix[i]:
168
+ (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
169
+ x[i].shape, lam, ratio_minmax=self.cutmix_minmax, correct_lam=self.correct_lam)
170
+ x[i][..., yl:yh, xl:xh] = x_orig[j][..., yl:yh, xl:xh]
171
+ lam_batch[i] = lam
172
+ else:
173
+ x[i] = x[i] * lam + x_orig[j] * (1 - lam)
174
+ return torch.tensor(lam_batch, device=x.device, dtype=x.dtype).unsqueeze(1)
175
+
176
+ def _mix_pair(self, x):
177
+ batch_size = len(x)
178
+ lam_batch, use_cutmix = self._params_per_elem(batch_size // 2)
179
+ x_orig = x.clone() # need to keep an unmodified original for mixing source
180
+ for i in range(batch_size // 2):
181
+ j = batch_size - i - 1
182
+ lam = lam_batch[i]
183
+ if lam != 1.:
184
+ if use_cutmix[i]:
185
+ (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
186
+ x[i].shape, lam, ratio_minmax=self.cutmix_minmax, correct_lam=self.correct_lam)
187
+ x[i][:, yl:yh, xl:xh] = x_orig[j][:, yl:yh, xl:xh]
188
+ x[j][:, yl:yh, xl:xh] = x_orig[i][:, yl:yh, xl:xh]
189
+ lam_batch[i] = lam
190
+ else:
191
+ x[i] = x[i] * lam + x_orig[j] * (1 - lam)
192
+ x[j] = x[j] * lam + x_orig[i] * (1 - lam)
193
+ lam_batch = np.concatenate((lam_batch, lam_batch[::-1]))
194
+ return torch.tensor(lam_batch, device=x.device, dtype=x.dtype).unsqueeze(1)
195
+
196
+ def _mix_batch(self, x):
197
+ lam, use_cutmix = self._params_per_batch()
198
+ if lam == 1.:
199
+ return 1.
200
+ if use_cutmix:
201
+ (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
202
+ x.shape, lam, ratio_minmax=self.cutmix_minmax, correct_lam=self.correct_lam)
203
+ x[..., yl:yh, xl:xh] = x.flip(0)[..., yl:yh, xl:xh]
204
+ else:
205
+ x_flipped = x.flip(0).mul_(1. - lam)
206
+ x.mul_(lam).add_(x_flipped)
207
+ return lam
208
+
209
+ def __call__(self, x, target):
210
+ assert len(x) % 2 == 0, 'Batch size should be even when using this'
211
+ if self.mode == 'elem':
212
+ lam = self._mix_elem(x)
213
+ elif self.mode == 'pair':
214
+ lam = self._mix_pair(x)
215
+ else:
216
+ lam = self._mix_batch(x)
217
+ target = mixup_target(target, self.num_classes, lam, self.label_smoothing, x.device)
218
+ return x, target
219
+
220
+
221
+ class FastCollateMixup(Mixup):
222
+ """ Fast Collate w/ Mixup/Cutmix that applies different params to each element or whole batch
223
+
224
+ A Mixup impl that's performed while collating the batches.
225
+ """
226
+
227
+ def _mix_elem_collate(self, output, batch, half=False):
228
+ batch_size = len(batch)
229
+ num_elem = batch_size // 2 if half else batch_size
230
+ assert len(output) == num_elem
231
+ lam_batch, use_cutmix = self._params_per_elem(num_elem)
232
+ for i in range(num_elem):
233
+ j = batch_size - i - 1
234
+ lam = lam_batch[i]
235
+ mixed = batch[i][0]
236
+ if lam != 1.:
237
+ if use_cutmix[i]:
238
+ if not half:
239
+ mixed = mixed.copy()
240
+ (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
241
+ output.shape, lam, ratio_minmax=self.cutmix_minmax, correct_lam=self.correct_lam)
242
+ mixed[:, yl:yh, xl:xh] = batch[j][0][:, yl:yh, xl:xh]
243
+ lam_batch[i] = lam
244
+ else:
245
+ mixed = mixed.astype(np.float32) * lam + batch[j][0].astype(np.float32) * (1 - lam)
246
+ np.rint(mixed, out=mixed)
247
+ output[i] += torch.from_numpy(mixed.astype(np.uint8))
248
+ if half:
249
+ lam_batch = np.concatenate((lam_batch, np.ones(num_elem)))
250
+ return torch.tensor(lam_batch).unsqueeze(1)
251
+
252
+ def _mix_pair_collate(self, output, batch):
253
+ batch_size = len(batch)
254
+ lam_batch, use_cutmix = self._params_per_elem(batch_size // 2)
255
+ for i in range(batch_size // 2):
256
+ j = batch_size - i - 1
257
+ lam = lam_batch[i]
258
+ mixed_i = batch[i][0]
259
+ mixed_j = batch[j][0]
260
+ assert 0 <= lam <= 1.0
261
+ if lam < 1.:
262
+ if use_cutmix[i]:
263
+ (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
264
+ output.shape, lam, ratio_minmax=self.cutmix_minmax, correct_lam=self.correct_lam)
265
+ patch_i = mixed_i[:, yl:yh, xl:xh].copy()
266
+ mixed_i[:, yl:yh, xl:xh] = mixed_j[:, yl:yh, xl:xh]
267
+ mixed_j[:, yl:yh, xl:xh] = patch_i
268
+ lam_batch[i] = lam
269
+ else:
270
+ mixed_temp = mixed_i.astype(np.float32) * lam + mixed_j.astype(np.float32) * (1 - lam)
271
+ mixed_j = mixed_j.astype(np.float32) * lam + mixed_i.astype(np.float32) * (1 - lam)
272
+ mixed_i = mixed_temp
273
+ np.rint(mixed_j, out=mixed_j)
274
+ np.rint(mixed_i, out=mixed_i)
275
+ output[i] += torch.from_numpy(mixed_i.astype(np.uint8))
276
+ output[j] += torch.from_numpy(mixed_j.astype(np.uint8))
277
+ lam_batch = np.concatenate((lam_batch, lam_batch[::-1]))
278
+ return torch.tensor(lam_batch).unsqueeze(1)
279
+
280
+ def _mix_batch_collate(self, output, batch):
281
+ batch_size = len(batch)
282
+ lam, use_cutmix = self._params_per_batch()
283
+ if use_cutmix:
284
+ (yl, yh, xl, xh), lam = cutmix_bbox_and_lam(
285
+ output.shape, lam, ratio_minmax=self.cutmix_minmax, correct_lam=self.correct_lam)
286
+ for i in range(batch_size):
287
+ j = batch_size - i - 1
288
+ mixed = batch[i][0]
289
+ if lam != 1.:
290
+ if use_cutmix:
291
+ mixed = mixed.copy() # don't want to modify the original while iterating
292
+ mixed[..., yl:yh, xl:xh] = batch[j][0][..., yl:yh, xl:xh]
293
+ else:
294
+ mixed = mixed.astype(np.float32) * lam + batch[j][0].astype(np.float32) * (1 - lam)
295
+ np.rint(mixed, out=mixed)
296
+ output[i] += torch.from_numpy(mixed.astype(np.uint8))
297
+ return lam
298
+
299
+ def __call__(self, batch, _=None):
300
+ batch_size = len(batch)
301
+ assert batch_size % 2 == 0, 'Batch size should be even when using this'
302
+ half = 'half' in self.mode
303
+ if half:
304
+ batch_size //= 2
305
+ output = torch.zeros((batch_size, *batch[0][0].shape), dtype=torch.uint8)
306
+ if self.mode == 'elem' or self.mode == 'half':
307
+ lam = self._mix_elem_collate(output, batch, half=half)
308
+ elif self.mode == 'pair':
309
+ lam = self._mix_pair_collate(output, batch)
310
+ else:
311
+ lam = self._mix_batch_collate(output, batch)
312
+ target = torch.tensor([b[1] for b in batch], dtype=torch.int64)
313
+ target = mixup_target(target, self.num_classes, lam, self.label_smoothing, device='cpu')
314
+ target = target[:batch_size]
315
+ return output, target
316
+
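A minimal driving sketch for the batch-level path of `Mixup` (toy tensors standing in for a real video batch; the shapes and class count are illustrative only):

```python
# Sketch: batch-level Mixup/CutMix on a toy video batch (B, C, T, H, W).
import torch
from mixup import Mixup

mixup_fn = Mixup(
    mixup_alpha=0.8, cutmix_alpha=1.0,
    prob=1.0, switch_prob=0.5, mode='batch',
    label_smoothing=0.1, num_classes=400,
)

videos = torch.randn(4, 3, 16, 224, 224)        # batch size must be even
targets = torch.randint(0, 400, (4,))

videos, soft_targets = mixup_fn(videos, targets)
print(soft_targets.shape)                        # torch.Size([4, 400])
print(float(soft_targets.sum(dim=1)[0]))         # each row sums to ~1 (smoothed one-hot mix)
```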
modeling_finetune.py ADDED
@@ -0,0 +1,351 @@
1
+ from functools import partial
2
+ import numpy as np
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from timm.models.layers import drop_path, to_2tuple, trunc_normal_
7
+ from timm.models.registry import register_model
8
+ import torch.utils.checkpoint as checkpoint
9
+
10
+
11
+ def _cfg(url='', **kwargs):
12
+ return {
13
+ 'url': url,
14
+ 'num_classes': 400, 'input_size': (3, 224, 224), 'pool_size': None,
15
+ 'crop_pct': .9, 'interpolation': 'bicubic',
16
+ 'mean': (0.5, 0.5, 0.5), 'std': (0.5, 0.5, 0.5),
17
+ **kwargs
18
+ }
19
+
20
+
21
+ class DropPath(nn.Module):
22
+ """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
23
+ """
24
+ def __init__(self, drop_prob=None):
25
+ super(DropPath, self).__init__()
26
+ self.drop_prob = drop_prob
27
+
28
+ def forward(self, x):
29
+ return drop_path(x, self.drop_prob, self.training)
30
+
31
+ def extra_repr(self) -> str:
32
+ return 'p={}'.format(self.drop_prob)
33
+
34
+
35
+ class Mlp(nn.Module):
36
+ def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
37
+ super().__init__()
38
+ out_features = out_features or in_features
39
+ hidden_features = hidden_features or in_features
40
+ self.fc1 = nn.Linear(in_features, hidden_features)
41
+ self.act = act_layer()
42
+ self.fc2 = nn.Linear(hidden_features, out_features)
43
+ self.drop = nn.Dropout(drop)
44
+
45
+ def forward(self, x):
46
+ x = self.fc1(x)
47
+ x = self.act(x)
48
+ # x = self.drop(x)
49
+ # commented out to follow the original BERT implementation
50
+ x = self.fc2(x)
51
+ x = self.drop(x)
52
+ return x
53
+
54
+
55
+ class Attention(nn.Module):
56
+ def __init__(
57
+ self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
58
+ proj_drop=0., attn_head_dim=None):
59
+ super().__init__()
60
+ self.num_heads = num_heads
61
+ head_dim = dim // num_heads
62
+ if attn_head_dim is not None:
63
+ head_dim = attn_head_dim
64
+ all_head_dim = head_dim * self.num_heads
65
+ self.scale = qk_scale or head_dim ** -0.5
66
+
67
+ self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
68
+ if qkv_bias:
69
+ self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
70
+ self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
71
+ else:
72
+ self.q_bias = None
73
+ self.v_bias = None
74
+
75
+ self.attn_drop = nn.Dropout(attn_drop)
76
+ self.proj = nn.Linear(all_head_dim, dim)
77
+ self.proj_drop = nn.Dropout(proj_drop)
78
+
79
+ def forward(self, x):
80
+ B, N, C = x.shape
81
+ qkv_bias = None
82
+ if self.q_bias is not None:
83
+ qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
84
+ # qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
85
+ qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
86
+ qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
87
+ q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
88
+
89
+ q = q * self.scale
90
+ attn = (q @ k.transpose(-2, -1))
91
+
92
+
93
+ attn = attn.softmax(dim=-1)
94
+ attn = self.attn_drop(attn)
95
+
96
+ x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
97
+ x = self.proj(x)
98
+ x = self.proj_drop(x)
99
+ return x
100
+
101
+
102
+ class Block(nn.Module):
103
+
104
+ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
105
+ drop_path=0., init_values=None, act_layer=nn.GELU, norm_layer=nn.LayerNorm,
106
+ attn_head_dim=None):
107
+ super().__init__()
108
+ self.norm1 = norm_layer(dim)
109
+ self.attn = Attention(
110
+ dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
111
+ attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim)
112
+ # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
113
+ self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
114
+ self.norm2 = norm_layer(dim)
115
+ mlp_hidden_dim = int(dim * mlp_ratio)
116
+ self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
117
+
118
+ if init_values > 0:
119
+ self.gamma_1 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True)
120
+ self.gamma_2 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True)
121
+ else:
122
+ self.gamma_1, self.gamma_2 = None, None
123
+
124
+ def forward(self, x):
125
+ if self.gamma_1 is None:
126
+ x = x + self.drop_path(self.attn(self.norm1(x)))
127
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
128
+ else:
129
+ x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x)))
130
+ x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
131
+ return x
132
+
133
+
134
+ class PatchEmbed(nn.Module):
135
+ """ Image to Patch Embedding
136
+ """
137
+ def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, num_frames=16, tubelet_size=2):
138
+ super().__init__()
139
+ img_size = to_2tuple(img_size)
140
+ patch_size = to_2tuple(patch_size)
141
+ self.tubelet_size = int(tubelet_size)
142
+ num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) * (num_frames // self.tubelet_size)
143
+ self.img_size = img_size
144
+ self.patch_size = patch_size
145
+ self.num_patches = num_patches
146
+ self.proj = nn.Conv3d(in_channels=in_chans, out_channels=embed_dim,
147
+ kernel_size = (self.tubelet_size, patch_size[0],patch_size[1]),
148
+ stride=(self.tubelet_size, patch_size[0], patch_size[1]))
149
+
150
+ def forward(self, x, **kwargs):
151
+ B, C, T, H, W = x.shape
152
+ # FIXME look at relaxing size constraints
153
+ assert H == self.img_size[0] and W == self.img_size[1], \
154
+ f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
155
+ x = self.proj(x).flatten(2).transpose(1, 2)
156
+ return x
157
+
158
+ # sin-cos position encoding
159
+ # https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L31
160
+ def get_sinusoid_encoding_table(n_position, d_hid):
161
+ ''' Sinusoid position encoding table '''
162
+ # TODO: make it with torch instead of numpy
163
+ def get_position_angle_vec(position):
164
+ return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]
165
+
166
+ sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
167
+ sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
168
+ sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
169
+
170
+ return torch.tensor(sinusoid_table,dtype=torch.float, requires_grad=False).unsqueeze(0)
171
+
172
+
173
+ class VisionTransformer(nn.Module):
174
+ """ Vision Transformer with support for patch or hybrid CNN input stage
175
+ """
176
+ def __init__(self,
177
+ img_size=224,
178
+ patch_size=16,
179
+ in_chans=3,
180
+ num_classes=1000,
181
+ embed_dim=768,
182
+ depth=12,
183
+ num_heads=12,
184
+ mlp_ratio=4.,
185
+ qkv_bias=False,
186
+ qk_scale=None,
187
+ fc_drop_rate=0.,
188
+ drop_rate=0.,
189
+ attn_drop_rate=0.,
190
+ drop_path_rate=0.,
191
+ norm_layer=nn.LayerNorm,
192
+ init_values=0.,
193
+ use_learnable_pos_emb=False,
194
+ init_scale=0.,
195
+ all_frames=16,
196
+ tubelet_size=2,
197
+ use_checkpoint=False,
198
+ use_mean_pooling=True,
199
+ pretrained_cfg=None,
200
+ pretrained_cfg_overlay = None
201
+ ):
202
+ super().__init__()
203
+ self.num_classes = num_classes
204
+ self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
205
+ self.tubelet_size = tubelet_size
206
+ self.patch_embed = PatchEmbed(
207
+ img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim, num_frames=all_frames, tubelet_size=self.tubelet_size)
208
+ num_patches = self.patch_embed.num_patches
209
+ self.use_checkpoint = use_checkpoint
210
+
211
+ if use_learnable_pos_emb:
212
+ self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
213
+ else:
214
+ # fixed sine-cosine positional embeddings
215
+ self.pos_embed = get_sinusoid_encoding_table(num_patches, embed_dim)
216
+
217
+ self.pos_drop = nn.Dropout(p=drop_rate)
218
+
219
+
220
+ dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
221
+ self.blocks = nn.ModuleList([
222
+ Block(
223
+ dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
224
+ drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
225
+ init_values=init_values)
226
+ for i in range(depth)])
227
+ self.norm = nn.Identity() if use_mean_pooling else norm_layer(embed_dim)
228
+ self.fc_norm = norm_layer(embed_dim) if use_mean_pooling else None
229
+ self.fc_dropout = nn.Dropout(p=fc_drop_rate) if fc_drop_rate > 0 else nn.Identity()
230
+ self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
231
+
232
+ if use_learnable_pos_emb:
233
+ trunc_normal_(self.pos_embed, std=.02)
234
+
235
+ trunc_normal_(self.head.weight, std=.02)
236
+ self.apply(self._init_weights)
237
+
238
+ self.head.weight.data.mul_(init_scale)
239
+ self.head.bias.data.mul_(init_scale)
240
+
241
+ def _init_weights(self, m):
242
+ if isinstance(m, nn.Linear):
243
+ trunc_normal_(m.weight, std=.02)
244
+ if isinstance(m, nn.Linear) and m.bias is not None:
245
+ nn.init.constant_(m.bias, 0)
246
+ elif isinstance(m, nn.LayerNorm):
247
+ nn.init.constant_(m.bias, 0)
248
+ nn.init.constant_(m.weight, 1.0)
249
+
250
+ def get_num_layers(self):
251
+ return len(self.blocks)
252
+
253
+ @torch.jit.ignore
254
+ def no_weight_decay(self):
255
+ return {'pos_embed', 'cls_token'}
256
+
257
+ def get_classifier(self):
258
+ return self.head
259
+
260
+ def reset_classifier(self, num_classes, global_pool=''):
261
+ self.num_classes = num_classes
262
+ self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
263
+
264
+ def forward_features(self, x):
265
+ x = self.patch_embed(x)
266
+ B, _, _ = x.size()
267
+
268
+ if self.pos_embed is not None:
269
+ x = x + self.pos_embed.expand(B, -1, -1).type_as(x).to(x.device).clone().detach()
270
+ x = self.pos_drop(x)
271
+
272
+ if self.use_checkpoint:
273
+ for blk in self.blocks:
274
+ x = checkpoint.checkpoint(blk, x)
275
+ else:
276
+ for blk in self.blocks:
277
+ x = blk(x)
278
+
279
+ x = self.norm(x)
280
+ if self.fc_norm is not None:
281
+ return self.fc_norm(x.mean(1))
282
+ else:
283
+ return x[:, 0]
284
+
285
+ def forward(self, x):
286
+ x = self.forward_features(x)
287
+ x = self.head(self.fc_dropout(x))
288
+ return x
289
+
290
+
291
+ @register_model
292
+ def vit_small_patch16_224(pretrained=False, **kwargs):
293
+ model = VisionTransformer(
294
+ patch_size=16, embed_dim=384, depth=12, num_heads=6, mlp_ratio=4, qkv_bias=True,
295
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
296
+ model.default_cfg = _cfg()
297
+ return model
298
+
299
+
300
+ @register_model
301
+ def vit_base_patch16_224(pretrained=False, **kwargs):
302
+ model = VisionTransformer(
303
+ patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
304
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
305
+ model.default_cfg = _cfg()
306
+ return model
307
+
308
+
309
+ @register_model
310
+ def vit_base_patch16_384(pretrained=False, **kwargs):
311
+ model = VisionTransformer(
312
+ img_size=384, patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True,
313
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
314
+ model.default_cfg = _cfg()
315
+ return model
316
+
317
+
318
+ @register_model
319
+ def vit_large_patch16_224(pretrained=False, **kwargs):
320
+ model = VisionTransformer(
321
+ patch_size=16, embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
322
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
323
+ model.default_cfg = _cfg()
324
+ return model
325
+
326
+
327
+ @register_model
328
+ def vit_large_patch16_384(pretrained=False, **kwargs):
329
+ model = VisionTransformer(
330
+ img_size=384, patch_size=16, embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
331
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
332
+ model.default_cfg = _cfg()
333
+ return model
334
+
335
+
336
+ @register_model
337
+ def vit_large_patch16_512(pretrained=False, **kwargs):
338
+ model = VisionTransformer(
339
+ img_size=512, patch_size=16, embed_dim=1024, depth=24, num_heads=16, mlp_ratio=4, qkv_bias=True,
340
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
341
+ model.default_cfg = _cfg()
342
+ return model
343
+
344
+
345
+ @register_model
346
+ def vit_huge_patch16_224(pretrained=False, **kwargs):
347
+ model = VisionTransformer(
348
+ patch_size=16, embed_dim=1280, depth=32, num_heads=16, mlp_ratio=4, qkv_bias=True,
349
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), **kwargs)
350
+ model.default_cfg = _cfg()
351
+ return model
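A hedged usage sketch for the fine-tuning backbones registered above (argument values are illustrative; `create_model` forwards the keyword arguments to the `VisionTransformer` constructor):

```python
import torch
from timm.models import create_model
import modeling_finetune  # noqa: F401  (registers the vit_* video variants above)

model = create_model(
    "vit_base_patch16_224",
    pretrained=False,
    num_classes=400,      # e.g. Kinetics-400
    all_frames=16,
    tubelet_size=2,
    drop_path_rate=0.1,
)
clip = torch.randn(2, 3, 16, 224, 224)   # [B, C, T, H, W]
logits = model(clip)
print(logits.shape)                      # torch.Size([2, 400])
```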
modeling_pretrain.py ADDED
@@ -0,0 +1,398 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ import torch.nn.functional as F
5
+ import torch.utils.checkpoint as checkpoint
6
+ from functools import partial
7
+
8
+ from modeling_finetune import Block, _cfg, PatchEmbed, get_sinusoid_encoding_table
9
+ from timm.models.registry import register_model
10
+ from timm.models.layers import trunc_normal_ as __call_trunc_normal_
11
+
12
+
13
+
14
+ def trunc_normal_(tensor, mean=0., std=1.):
15
+ __call_trunc_normal_(tensor, mean=mean, std=std, a=-std, b=std)
16
+
17
+
18
+ __all__ = [
19
+ 'pretrain_videomae_small_patch16_224',
20
+ 'pretrain_videomae_base_patch16_224',
21
+ 'pretrain_videomae_large_patch16_224',
22
+ 'pretrain_videomae_huge_patch16_224',
23
+ ]
24
+
25
+
26
+ class PretrainVisionTransformerEncoder(nn.Module):
27
+ """ Vision Transformer with support for patch or hybrid CNN input stage
28
+ """
29
+ def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=0, embed_dim=768, depth=12,
30
+ num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
31
+ drop_path_rate=0., norm_layer=nn.LayerNorm, init_values=None, tubelet_size=2, use_checkpoint=False,
32
+ use_learnable_pos_emb=False):
33
+ super().__init__()
34
+ self.num_classes = num_classes
35
+ self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
36
+ self.patch_embed = PatchEmbed(
37
+ img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim,tubelet_size=tubelet_size)
38
+ num_patches = self.patch_embed.num_patches
39
+ self.use_checkpoint = use_checkpoint
40
+
41
+
42
+ # TODO: Add the cls token
43
+ if use_learnable_pos_emb:
44
+ self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
45
+ else:
46
+ # sine-cosine positional embeddings
47
+ self.pos_embed = get_sinusoid_encoding_table(num_patches, embed_dim)
48
+
49
+ dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
50
+ self.blocks = nn.ModuleList([
51
+ Block(
52
+ dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
53
+ drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
54
+ init_values=init_values)
55
+ for i in range(depth)])
56
+ self.norm = norm_layer(embed_dim)
57
+ self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
58
+
59
+ if use_learnable_pos_emb:
60
+ trunc_normal_(self.pos_embed, std=.02)
61
+
62
+ self.apply(self._init_weights)
63
+
64
+
65
+ def _init_weights(self, m):
66
+ if isinstance(m, nn.Linear):
67
+ nn.init.xavier_uniform_(m.weight)
68
+ if isinstance(m, nn.Linear) and m.bias is not None:
69
+ nn.init.constant_(m.bias, 0)
70
+ elif isinstance(m, nn.LayerNorm):
71
+ nn.init.constant_(m.bias, 0)
72
+ nn.init.constant_(m.weight, 1.0)
73
+
74
+ def get_num_layers(self):
75
+ return len(self.blocks)
76
+
77
+ @torch.jit.ignore
78
+ def no_weight_decay(self):
79
+ return {'pos_embed', 'cls_token'}
80
+
81
+ def get_classifier(self):
82
+ return self.head
83
+
84
+ def reset_classifier(self, num_classes, global_pool=''):
85
+ self.num_classes = num_classes
86
+ self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
87
+
88
+ def forward_features(self, x, mask):
89
+ _, _, T, _, _ = x.shape
90
+ x = self.patch_embed(x)
91
+
92
+ x = x + self.pos_embed.type_as(x).to(x.device).clone().detach()
93
+
94
+ B, _, C = x.shape
95
+ x_vis = x[~mask].reshape(B, -1, C) # ~mask means visible
96
+
97
+ if self.use_checkpoint:
98
+ for blk in self.blocks:
99
+ x_vis = checkpoint.checkpoint(blk, x_vis)
100
+ else:
101
+ for blk in self.blocks:
102
+ x_vis = blk(x_vis)
103
+
104
+ x_vis = self.norm(x_vis)
105
+ return x_vis
106
+
107
+ def forward(self, x, mask):
108
+ x = self.forward_features(x, mask)
109
+ x = self.head(x)
110
+ return x
111
+
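The boolean-mask indexing in `forward_features` above relies on every sample keeping the same number of visible tokens; a small sketch of that selection (toy sizes, purely illustrative):

```python
import torch

B, N, C = 2, 8, 4                        # batch, tokens, channels (toy sizes)
x = torch.randn(B, N, C)
mask = torch.zeros(B, N, dtype=torch.bool)
mask[:, ::2] = True                      # True = masked, False = visible (same count per sample)
x_vis = x[~mask].reshape(B, -1, C)       # gather only the visible tokens
print(x_vis.shape)                       # torch.Size([2, 4, 4])
```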
112
+ class PretrainVisionTransformerDecoder(nn.Module):
113
+ """ Vision Transformer with support for patch or hybrid CNN input stage
114
+ """
115
+ def __init__(self, patch_size=16, num_classes=768, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.,
116
+ qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0., drop_path_rate=0.,
117
+ norm_layer=nn.LayerNorm, init_values=None, num_patches=196, tubelet_size=2, use_checkpoint=False
118
+ ):
119
+ super().__init__()
120
+ self.num_classes = num_classes
121
+ #assert num_classes == 3 * tubelet_size * patch_size ** 2
122
+ self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
123
+ self.patch_size = patch_size
124
+ self.use_checkpoint = use_checkpoint
125
+
126
+
127
+ dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
128
+ self.blocks = nn.ModuleList([
129
+ Block(
130
+ dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
131
+ drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
132
+ init_values=init_values)
133
+ for i in range(depth)])
134
+ self.norm = norm_layer(embed_dim)
135
+ self.head = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
136
+
137
+ self.apply(self._init_weights)
138
+
139
+
140
+ def _init_weights(self, m):
141
+ if isinstance(m, nn.Linear):
142
+ nn.init.xavier_uniform_(m.weight)
143
+ if isinstance(m, nn.Linear) and m.bias is not None:
144
+ nn.init.constant_(m.bias, 0)
145
+ elif isinstance(m, nn.LayerNorm):
146
+ nn.init.constant_(m.bias, 0)
147
+ nn.init.constant_(m.weight, 1.0)
148
+
149
+ def get_num_layers(self):
150
+ return len(self.blocks)
151
+
152
+ @torch.jit.ignore
153
+ def no_weight_decay(self):
154
+ return {'pos_embed', 'cls_token'}
155
+
156
+ def get_classifier(self):
157
+ return self.head
158
+
159
+ def reset_classifier(self, num_classes, global_pool=''):
160
+ self.num_classes = num_classes
161
+ self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()
162
+
163
+ def forward(self, x, return_token_num):
164
+ if self.use_checkpoint:
165
+ for blk in self.blocks:
166
+ x = checkpoint.checkpoint(blk, x)
167
+ else:
168
+ for blk in self.blocks:
169
+ x = blk(x)
170
+
171
+ if return_token_num > 0:
172
+ x = self.head(self.norm(x[:, -return_token_num:])) # only return the mask tokens predict pixels
173
+ else:
174
+ x = self.head(self.norm(x))
175
+
176
+ return x
177
+
178
+ class FeatureExtractor(torch.nn.Module):
179
+ def __init__(self, vit_model, input_size, patch_size):
180
+ super(FeatureExtractor, self).__init__()
181
+ self.vit_model = vit_model
182
+ self.input_size = input_size
183
+ self.patch_size = patch_size
184
+ self.spatial_resolution = input_size // patch_size
185
+ assert self.spatial_resolution * patch_size == input_size
186
+
187
+ def forward(self, x):
188
+ if self.patch_size == 14:
189
+ features = self.vit_model.forward_features(x)[:, 5:]
190
+ bs, np, dim = features.shape
191
+ features = features.reshape(bs, self.spatial_resolution, self.spatial_resolution, dim).permute(0, 3, 1, 2)
192
+ features = F.interpolate(features, size=(14, 14), mode='bilinear')
193
+ features = features.flatten(2, -1).permute(0, 2, 1)
194
+ else:
195
+ features = self.vit_model.forward_features(x)[:, 1:]
196
+ return features
197
+
198
+ class PretrainVisionTransformer(nn.Module):
199
+ """ Vision Transformer with support for patch or hybrid CNN input stage
200
+ """
201
+ def __init__(self,
202
+ img_size=224,
203
+ patch_size=16,
204
+ encoder_in_chans=3,
205
+ encoder_num_classes=0,
206
+ encoder_embed_dim=768,
207
+ encoder_depth=12,
208
+ encoder_num_heads=12,
209
+ decoder_num_classes=1536, # decoder_num_classes=768,
210
+ decoder_embed_dim=512,
211
+ decoder_depth=8,
212
+ decoder_num_heads=8,
213
+ mlp_ratio=4.,
214
+ qkv_bias=False,
215
+ qk_scale=None,
216
+ drop_rate=0.,
217
+ attn_drop_rate=0.,
218
+ drop_path_rate=0.,
219
+ norm_layer=nn.LayerNorm,
220
+ init_values=0.,
221
+ use_learnable_pos_emb=False,
222
+ use_checkpoint=False,
223
+ tubelet_size=2,
224
+ num_classes=0, # avoid the error from create_fn in timm
225
+ in_chans=0, # avoid the error from create_fn in timm
226
+ pretrained_cfg=None, # avoid the error from create_fn in timm
227
+ pretrained_cfg_overlay=None, # avoid the error from create_fn in timm
228
+ ):
229
+ super().__init__()
230
+ self.encoder = PretrainVisionTransformerEncoder(
231
+ img_size=img_size,
232
+ patch_size=patch_size,
233
+ in_chans=encoder_in_chans,
234
+ num_classes=encoder_num_classes,
235
+ embed_dim=encoder_embed_dim,
236
+ depth=encoder_depth,
237
+ num_heads=encoder_num_heads,
238
+ mlp_ratio=mlp_ratio,
239
+ qkv_bias=qkv_bias,
240
+ qk_scale=qk_scale,
241
+ drop_rate=drop_rate,
242
+ attn_drop_rate=attn_drop_rate,
243
+ drop_path_rate=drop_path_rate,
244
+ norm_layer=norm_layer,
245
+ init_values=init_values,
246
+ tubelet_size=tubelet_size,
247
+ use_checkpoint=use_checkpoint,
248
+ use_learnable_pos_emb=use_learnable_pos_emb)
249
+
250
+ self.decoder = PretrainVisionTransformerDecoder(
251
+ patch_size=patch_size,
252
+ num_patches=self.encoder.patch_embed.num_patches,
253
+ num_classes=decoder_num_classes,
254
+ embed_dim=decoder_embed_dim,
255
+ depth=decoder_depth,
256
+ num_heads=decoder_num_heads,
257
+ mlp_ratio=mlp_ratio,
258
+ qkv_bias=qkv_bias,
259
+ qk_scale=qk_scale,
260
+ drop_rate=drop_rate,
261
+ attn_drop_rate=attn_drop_rate,
262
+ drop_path_rate=drop_path_rate,
263
+ norm_layer=norm_layer,
264
+ init_values=init_values,
265
+ tubelet_size=tubelet_size,
266
+ use_checkpoint=use_checkpoint)
267
+
268
+ self.encoder_to_decoder = nn.Linear(encoder_embed_dim, decoder_embed_dim, bias=False)
269
+
270
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, decoder_embed_dim))
271
+
272
+ self.pos_embed = get_sinusoid_encoding_table(self.encoder.patch_embed.num_patches, decoder_embed_dim)
273
+
274
+ trunc_normal_(self.mask_token, std=.02)
275
+
276
+
277
+ def _init_weights(self, m):
278
+ if isinstance(m, nn.Linear):
279
+ nn.init.xavier_uniform_(m.weight)
280
+ if isinstance(m, nn.Linear) and m.bias is not None:
281
+ nn.init.constant_(m.bias, 0)
282
+ elif isinstance(m, nn.LayerNorm):
283
+ nn.init.constant_(m.bias, 0)
284
+ nn.init.constant_(m.weight, 1.0)
285
+
286
+ def get_num_layers(self):
287
+ # this wrapper has no .blocks attribute of its own; report the encoder depth instead
+ return self.encoder.get_num_layers()
288
+
289
+ @torch.jit.ignore
290
+ def no_weight_decay(self):
291
+ return {'pos_embed', 'cls_token', 'mask_token'}
292
+
293
+ def forward(self, x, mask):
294
+ _, _, T, _, _ = x.shape
295
+ x_encoder = self.encoder(x, mask) # [B, N_vis, C_e]
296
+ x_vis = self.encoder_to_decoder(x_encoder) # [B, N_vis, C_d]
297
+ B, N, C = x_vis.shape
298
+ # we don't unshuffle the correct visible token order,
299
+ # but shuffle the pos embedding accordingly.
300
+ expand_pos_embed = self.pos_embed.expand(B, -1, -1).type_as(x).to(x.device).clone().detach()
301
+ pos_emd_vis = expand_pos_embed[~mask].reshape(B, -1, C)
302
+ pos_emd_mask = expand_pos_embed[mask].reshape(B, -1, C)
303
+ x_full = torch.cat([x_vis + pos_emd_vis, self.mask_token + pos_emd_mask], dim=1) # [B, N, C_d]
304
+ x = self.decoder(x_full, pos_emd_mask.shape[1]) # [B, N_mask, 3 * 16 * 16]
305
+ return x
306
+
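A hedged end-to-end sketch of one masked-autoencoding forward pass with the base model registered below; the random 90% mask here is a simplification of the tube masking used in the actual pretraining pipeline:

```python
import torch
from timm.models import create_model
import modeling_pretrain  # noqa: F401  (registers the pretrain_videomae_* variants)

model = create_model("pretrain_videomae_base_patch16_224", pretrained=False, decoder_depth=4)
clip = torch.randn(1, 3, 16, 224, 224)                 # [B, C, T, H, W]
num_patches = model.encoder.patch_embed.num_patches    # 1568 with these defaults
mask = torch.zeros(1, num_patches, dtype=torch.bool)
mask[:, torch.randperm(num_patches)[: int(0.9 * num_patches)]] = True
pred = model(clip, mask)       # predictions are produced for the masked tokens only
print(pred.shape)              # [1, 1411, 1536]  (1536 = 3 * 2 * 16 * 16 pixels per tubelet)
```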
307
+ @register_model
308
+ def pretrain_videomae_small_patch16_224(pretrained=False, **kwargs):
309
+ model = PretrainVisionTransformer(
310
+ img_size=224,
311
+ patch_size=16,
312
+ encoder_embed_dim=384,
313
+ encoder_depth=12,
314
+ encoder_num_heads=6,
315
+ encoder_num_classes=0,
316
+ decoder_embed_dim=192,
317
+ decoder_num_heads=3,
318
+ mlp_ratio=4,
319
+ qkv_bias=True,
320
+ norm_layer=partial(nn.LayerNorm, eps=1e-6),
321
+ **kwargs)
322
+ model.default_cfg = _cfg()
323
+ if pretrained:
324
+ checkpoint = torch.load(
325
+ kwargs["init_ckpt"], map_location="cpu"
326
+ )
327
+ model.load_state_dict(checkpoint["model"])
328
+ return model
329
+
330
+ @register_model
331
+ def pretrain_videomae_base_patch16_224(pretrained=False, **kwargs):
332
+ model = PretrainVisionTransformer(
333
+ img_size=224,
334
+ patch_size=16,
335
+ encoder_embed_dim=768,
336
+ encoder_depth=12,
337
+ encoder_num_heads=12,
338
+ encoder_num_classes=0,
339
+ decoder_embed_dim=384,
340
+ decoder_num_heads=6,
341
+ mlp_ratio=4,
342
+ qkv_bias=True,
343
+ norm_layer=partial(nn.LayerNorm, eps=1e-6),
344
+ **kwargs)
345
+ model.default_cfg = _cfg()
346
+ if pretrained:
347
+ checkpoint = torch.load(
348
+ kwargs["init_ckpt"], map_location="cpu"
349
+ )
350
+ model.load_state_dict(checkpoint["model"])
351
+ return model
352
+
353
+ @register_model
354
+ def pretrain_videomae_large_patch16_224(pretrained=False, **kwargs):
355
+ model = PretrainVisionTransformer(
356
+ img_size=224,
357
+ patch_size=16,
358
+ encoder_embed_dim=1024,
359
+ encoder_depth=24,
360
+ encoder_num_heads=16,
361
+ encoder_num_classes=0,
362
+ decoder_embed_dim=512,
363
+ decoder_num_heads=8,
364
+ mlp_ratio=4,
365
+ qkv_bias=True,
366
+ norm_layer=partial(nn.LayerNorm, eps=1e-6),
367
+ **kwargs)
368
+ model.default_cfg = _cfg()
369
+ if pretrained:
370
+ checkpoint = torch.load(
371
+ kwargs["init_ckpt"], map_location="cpu"
372
+ )
373
+ model.load_state_dict(checkpoint["model"])
374
+ return model
375
+
376
+ @register_model
377
+ def pretrain_videomae_huge_patch16_224(pretrained=False, **kwargs):
378
+ model = PretrainVisionTransformer(
379
+ img_size=224,
380
+ patch_size=16,
381
+ encoder_embed_dim=1280,
382
+ encoder_depth=32,
383
+ encoder_num_heads=16,
384
+ encoder_num_classes=0,
385
+ decoder_num_classes=1536,
386
+ decoder_embed_dim=640,
387
+ decoder_num_heads=8,
388
+ mlp_ratio=4,
389
+ qkv_bias=True,
390
+ norm_layer=partial(nn.LayerNorm, eps=1e-6),
391
+ **kwargs)
392
+ model.default_cfg = _cfg()
393
+ if pretrained:
394
+ checkpoint = torch.load(
395
+ kwargs["init_ckpt"], map_location="cpu"
396
+ )
397
+ model.load_state_dict(checkpoint["model"])
398
+ return model
optim_factory.py ADDED
@@ -0,0 +1,175 @@
1
+ import torch
2
+ from torch import optim as optim
3
+
4
+ from timm.optim.adafactor import Adafactor
5
+ from timm.optim.adahessian import Adahessian
6
+ from timm.optim.adamp import AdamP
7
+ from timm.optim.lookahead import Lookahead
8
+ from timm.optim.nadam import Nadam
9
+ #from timm.optim.novograd import NovoGrad
10
+ from timm.optim.nvnovograd import NvNovoGrad
11
+ from timm.optim.radam import RAdam
12
+ from timm.optim.rmsprop_tf import RMSpropTF
13
+ from timm.optim.sgdp import SGDP
14
+
15
+ import json
16
+
17
+ try:
18
+ from apex.optimizers import FusedNovoGrad, FusedAdam, FusedLAMB, FusedSGD
19
+ has_apex = True
20
+ except ImportError:
21
+ has_apex = False
22
+
23
+
24
+ def get_num_layer_for_vit(var_name, num_max_layer):
25
+ if var_name in ("cls_token", "mask_token", "pos_embed"):
26
+ return 0
27
+ elif var_name.startswith("patch_embed"):
28
+ return 0
29
+ elif var_name.startswith("rel_pos_bias"):
30
+ return num_max_layer - 1
31
+ elif var_name.startswith("blocks"):
32
+ layer_id = int(var_name.split('.')[1])
33
+ return layer_id + 1
34
+ else:
35
+ return num_max_layer - 1
36
+
37
+
38
+ class LayerDecayValueAssigner(object):
39
+ def __init__(self, values):
40
+ self.values = values
41
+
42
+ def get_scale(self, layer_id):
43
+ return self.values[layer_id]
44
+
45
+ def get_layer_id(self, var_name):
46
+ return get_num_layer_for_vit(var_name, len(self.values))
47
+
48
+
49
+ def get_parameter_groups(model, weight_decay=1e-5, skip_list=(), get_num_layer=None, get_layer_scale=None):
50
+ parameter_group_names = {}
51
+ parameter_group_vars = {}
52
+
53
+ for name, param in model.named_parameters():
54
+ if not param.requires_grad:
55
+ continue # frozen weights
56
+ if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
57
+ group_name = "no_decay"
58
+ this_weight_decay = 0.
59
+ else:
60
+ group_name = "decay"
61
+ this_weight_decay = weight_decay
62
+ if get_num_layer is not None:
63
+ layer_id = get_num_layer(name)
64
+ group_name = "layer_%d_%s" % (layer_id, group_name)
65
+ else:
66
+ layer_id = None
67
+
68
+ if group_name not in parameter_group_names:
69
+ if get_layer_scale is not None:
70
+ scale = get_layer_scale(layer_id)
71
+ else:
72
+ scale = 1.
73
+
74
+ parameter_group_names[group_name] = {
75
+ "weight_decay": this_weight_decay,
76
+ "params": [],
77
+ "lr_scale": scale
78
+ }
79
+ parameter_group_vars[group_name] = {
80
+ "weight_decay": this_weight_decay,
81
+ "params": [],
82
+ "lr_scale": scale
83
+ }
84
+
85
+ parameter_group_vars[group_name]["params"].append(param)
86
+ parameter_group_names[group_name]["params"].append(name)
87
+ print("Param groups = %s" % json.dumps(parameter_group_names, indent=2))
88
+ return list(parameter_group_vars.values())
89
+
90
+
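A hedged sketch of how these helpers are typically combined for layer-wise learning-rate decay on a 12-block ViT (the decay value 0.75 matches the fine-tuning default used elsewhere in this upload):

```python
from optim_factory import LayerDecayValueAssigner

num_layers = 12          # transformer blocks
layer_decay = 0.75
# One scale per "layer id": 0 = patch embed / tokens, 1..12 = blocks, 13 = head and the rest.
values = [layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2)]
assigner = LayerDecayValueAssigner(values)

print(assigner.get_layer_id("patch_embed.proj.weight"))    # 0  -> smallest lr scale
print(assigner.get_layer_id("blocks.11.mlp.fc1.weight"))   # 12
print(assigner.get_scale(12))                              # 0.75
```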
91
+ def create_optimizer(args, model, get_num_layer=None, get_layer_scale=None, filter_bias_and_bn=True, skip_list=None):
92
+ opt_lower = args.opt.lower()
93
+ weight_decay = args.weight_decay
94
+ if weight_decay and filter_bias_and_bn:
95
+ skip = {}
96
+ if skip_list is not None:
97
+ skip = skip_list
98
+ elif hasattr(model, 'no_weight_decay'):
99
+ skip = model.no_weight_decay()
100
+ parameters = get_parameter_groups(model, weight_decay, skip, get_num_layer, get_layer_scale)
101
+ weight_decay = 0.
102
+ else:
103
+ parameters = model.parameters()
104
+
105
+ if 'fused' in opt_lower:
106
+ assert has_apex and torch.cuda.is_available(), 'APEX and CUDA required for fused optimizers'
107
+
108
+ opt_args = dict(lr=args.lr, weight_decay=weight_decay)
109
+ if hasattr(args, 'opt_eps') and args.opt_eps is not None:
110
+ opt_args['eps'] = args.opt_eps
111
+ if hasattr(args, 'opt_betas') and args.opt_betas is not None:
112
+ opt_args['betas'] = args.opt_betas
113
+
114
+ print("optimizer settings:", opt_args)
115
+
116
+ opt_split = opt_lower.split('_')
117
+ opt_lower = opt_split[-1]
118
+ if opt_lower == 'sgd' or opt_lower == 'nesterov':
119
+ opt_args.pop('eps', None)
120
+ optimizer = optim.SGD(parameters, momentum=args.momentum, nesterov=True, **opt_args)
121
+ elif opt_lower == 'momentum':
122
+ opt_args.pop('eps', None)
123
+ optimizer = optim.SGD(parameters, momentum=args.momentum, nesterov=False, **opt_args)
124
+ elif opt_lower == 'adam':
125
+ optimizer = optim.Adam(parameters, **opt_args)
126
+ elif opt_lower == 'adamw':
127
+ optimizer = optim.AdamW(parameters, **opt_args)
128
+ elif opt_lower == 'nadam':
129
+ optimizer = Nadam(parameters, **opt_args)
130
+ elif opt_lower == 'radam':
131
+ optimizer = RAdam(parameters, **opt_args)
132
+ elif opt_lower == 'adamp':
133
+ optimizer = AdamP(parameters, wd_ratio=0.01, nesterov=True, **opt_args)
134
+ elif opt_lower == 'sgdp':
135
+ optimizer = SGDP(parameters, momentum=args.momentum, nesterov=True, **opt_args)
136
+ elif opt_lower == 'adadelta':
137
+ optimizer = optim.Adadelta(parameters, **opt_args)
138
+ elif opt_lower == 'adafactor':
139
+ if not args.lr:
140
+ opt_args['lr'] = None
141
+ optimizer = Adafactor(parameters, **opt_args)
142
+ elif opt_lower == 'adahessian':
143
+ optimizer = Adahessian(parameters, **opt_args)
144
+ elif opt_lower == 'rmsprop':
145
+ optimizer = optim.RMSprop(parameters, alpha=0.9, momentum=args.momentum, **opt_args)
146
+ elif opt_lower == 'rmsproptf':
147
+ optimizer = RMSpropTF(parameters, alpha=0.9, momentum=args.momentum, **opt_args)
148
+ elif opt_lower == 'novograd':
149
+ # NovoGrad was dropped from recent timm releases (its import above is commented out),
+ # so fall back to the equivalent NvNovoGrad implementation to avoid a NameError.
+ optimizer = NvNovoGrad(parameters, **opt_args)
150
+ elif opt_lower == 'nvnovograd':
151
+ optimizer = NvNovoGrad(parameters, **opt_args)
152
+ elif opt_lower == 'fusedsgd':
153
+ opt_args.pop('eps', None)
154
+ optimizer = FusedSGD(parameters, momentum=args.momentum, nesterov=True, **opt_args)
155
+ elif opt_lower == 'fusedmomentum':
156
+ opt_args.pop('eps', None)
157
+ optimizer = FusedSGD(parameters, momentum=args.momentum, nesterov=False, **opt_args)
158
+ elif opt_lower == 'fusedadam':
159
+ optimizer = FusedAdam(parameters, adam_w_mode=False, **opt_args)
160
+ elif opt_lower == 'fusedadamw':
161
+ optimizer = FusedAdam(parameters, adam_w_mode=True, **opt_args)
162
+ elif opt_lower == 'fusedlamb':
163
+ optimizer = FusedLAMB(parameters, **opt_args)
164
+ elif opt_lower == 'fusednovograd':
165
+ opt_args.setdefault('betas', (0.95, 0.98))
166
+ optimizer = FusedNovoGrad(parameters, **opt_args)
167
+ else:
168
+ assert False and "Invalid optimizer"
169
+ raise ValueError
170
+
171
+ if len(opt_split) > 1:
172
+ if opt_split[0] == 'lookahead':
173
+ optimizer = Lookahead(optimizer)
174
+
175
+ return optimizer
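A hedged sketch of calling `create_optimizer` outside the training script; it only needs an `argparse`-style namespace with the fields it reads (the values below are illustrative):

```python
import argparse
import torch
from optim_factory import create_optimizer

args = argparse.Namespace(opt="adamw", lr=1e-3, weight_decay=0.05,
                          opt_eps=1e-8, opt_betas=(0.9, 0.999), momentum=0.9)
model = torch.nn.Linear(10, 2)           # stand-in for the video transformer
optimizer = create_optimizer(args, model)
print(type(optimizer).__name__)          # AdamW, with decay / no-decay parameter groups
```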
rand_augment.py ADDED
@@ -0,0 +1,531 @@
1
+ """
2
+ This implementation is based on
3
+ https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/auto_augment.py
4
+ published under an Apache License 2.0.
5
+
6
+ COMMENT FROM ORIGINAL:
7
+ AutoAugment, RandAugment, and AugMix for PyTorch
8
+ This code implements the searched ImageNet policies with various tweaks and
9
+ improvements and does not include any of the search code. AA and RA
10
+ Implementation adapted from:
11
+ https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py
12
+ AugMix adapted from:
13
+ https://github.com/google-research/augmix
14
+ Papers:
15
+ AutoAugment: Learning Augmentation Policies from Data
16
+ https://arxiv.org/abs/1805.09501
17
+ Learning Data Augmentation Strategies for Object Detection
18
+ https://arxiv.org/abs/1906.11172
19
+ RandAugment: Practical automated data augmentation...
20
+ https://arxiv.org/abs/1909.13719
21
+ AugMix: A Simple Data Processing Method to Improve Robustness and
22
+ Uncertainty https://arxiv.org/abs/1912.02781
23
+
24
+ Hacked together by / Copyright 2020 Ross Wightman
25
+ """
26
+
27
+ import math
28
+ import numpy as np
29
+ import random
30
+ import re
31
+ import PIL
32
+ from PIL import Image, ImageEnhance, ImageOps
33
+
34
+ _PIL_VER = tuple([int(x) for x in PIL.__version__.split(".")[:2]])
35
+
36
+ _FILL = (128, 128, 128)
37
+
38
+ # This signifies the max integer that the controller RNN could predict for the
39
+ # augmentation scheme.
40
+ _MAX_LEVEL = 10.0
41
+
42
+ _HPARAMS_DEFAULT = {
43
+ "translate_const": 250,
44
+ "img_mean": _FILL,
45
+ }
46
+
47
+ _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
48
+
49
+
50
+ def _interpolation(kwargs):
51
+ interpolation = kwargs.pop("resample", Image.BILINEAR)
52
+ if isinstance(interpolation, (list, tuple)):
53
+ return random.choice(interpolation)
54
+ else:
55
+ return interpolation
56
+
57
+
58
+ def _check_args_tf(kwargs):
59
+ if "fillcolor" in kwargs and _PIL_VER < (5, 0):
60
+ kwargs.pop("fillcolor")
61
+ kwargs["resample"] = _interpolation(kwargs)
62
+
63
+
64
+ def shear_x(img, factor, **kwargs):
65
+ _check_args_tf(kwargs)
66
+ return img.transform(
67
+ img.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), **kwargs
68
+ )
69
+
70
+
71
+ def shear_y(img, factor, **kwargs):
72
+ _check_args_tf(kwargs)
73
+ return img.transform(
74
+ img.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), **kwargs
75
+ )
76
+
77
+
78
+ def translate_x_rel(img, pct, **kwargs):
79
+ pixels = pct * img.size[0]
80
+ _check_args_tf(kwargs)
81
+ return img.transform(
82
+ img.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), **kwargs
83
+ )
84
+
85
+
86
+ def translate_y_rel(img, pct, **kwargs):
87
+ pixels = pct * img.size[1]
88
+ _check_args_tf(kwargs)
89
+ return img.transform(
90
+ img.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), **kwargs
91
+ )
92
+
93
+
94
+ def translate_x_abs(img, pixels, **kwargs):
95
+ _check_args_tf(kwargs)
96
+ return img.transform(
97
+ img.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), **kwargs
98
+ )
99
+
100
+
101
+ def translate_y_abs(img, pixels, **kwargs):
102
+ _check_args_tf(kwargs)
103
+ return img.transform(
104
+ img.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), **kwargs
105
+ )
106
+
107
+
108
+ def rotate(img, degrees, **kwargs):
109
+ _check_args_tf(kwargs)
110
+ if _PIL_VER >= (5, 2):
111
+ return img.rotate(degrees, **kwargs)
112
+ elif _PIL_VER >= (5, 0):
113
+ w, h = img.size
114
+ post_trans = (0, 0)
115
+ rotn_center = (w / 2.0, h / 2.0)
116
+ angle = -math.radians(degrees)
117
+ matrix = [
118
+ round(math.cos(angle), 15),
119
+ round(math.sin(angle), 15),
120
+ 0.0,
121
+ round(-math.sin(angle), 15),
122
+ round(math.cos(angle), 15),
123
+ 0.0,
124
+ ]
125
+
126
+ def transform(x, y, matrix):
127
+ (a, b, c, d, e, f) = matrix
128
+ return a * x + b * y + c, d * x + e * y + f
129
+
130
+ matrix[2], matrix[5] = transform(
131
+ -rotn_center[0] - post_trans[0],
132
+ -rotn_center[1] - post_trans[1],
133
+ matrix,
134
+ )
135
+ matrix[2] += rotn_center[0]
136
+ matrix[5] += rotn_center[1]
137
+ return img.transform(img.size, Image.AFFINE, matrix, **kwargs)
138
+ else:
139
+ return img.rotate(degrees, resample=kwargs["resample"])
140
+
141
+
142
+ def auto_contrast(img, **__):
143
+ return ImageOps.autocontrast(img)
144
+
145
+
146
+ def invert(img, **__):
147
+ return ImageOps.invert(img)
148
+
149
+
150
+ def equalize(img, **__):
151
+ return ImageOps.equalize(img)
152
+
153
+
154
+ def solarize(img, thresh, **__):
155
+ return ImageOps.solarize(img, thresh)
156
+
157
+
158
+ def solarize_add(img, add, thresh=128, **__):
159
+ lut = []
160
+ for i in range(256):
161
+ if i < thresh:
162
+ lut.append(min(255, i + add))
163
+ else:
164
+ lut.append(i)
165
+ if img.mode in ("L", "RGB"):
166
+ if img.mode == "RGB" and len(lut) == 256:
167
+ lut = lut + lut + lut
168
+ return img.point(lut)
169
+ else:
170
+ return img
171
+
172
+
173
+ def posterize(img, bits_to_keep, **__):
174
+ if bits_to_keep >= 8:
175
+ return img
176
+ return ImageOps.posterize(img, bits_to_keep)
177
+
178
+
179
+ def contrast(img, factor, **__):
180
+ return ImageEnhance.Contrast(img).enhance(factor)
181
+
182
+
183
+ def color(img, factor, **__):
184
+ return ImageEnhance.Color(img).enhance(factor)
185
+
186
+
187
+ def brightness(img, factor, **__):
188
+ return ImageEnhance.Brightness(img).enhance(factor)
189
+
190
+
191
+ def sharpness(img, factor, **__):
192
+ return ImageEnhance.Sharpness(img).enhance(factor)
193
+
194
+
195
+ def _randomly_negate(v):
196
+ """With 50% prob, negate the value"""
197
+ return -v if random.random() > 0.5 else v
198
+
199
+
200
+ def _rotate_level_to_arg(level, _hparams):
201
+ # range [-30, 30]
202
+ level = (level / _MAX_LEVEL) * 30.0
203
+ level = _randomly_negate(level)
204
+ return (level,)
205
+
206
+
207
+ def _enhance_level_to_arg(level, _hparams):
208
+ # range [0.1, 1.9]
209
+ return ((level / _MAX_LEVEL) * 1.8 + 0.1,)
210
+
211
+
212
+ def _enhance_increasing_level_to_arg(level, _hparams):
213
+ # the 'no change' level is 1.0, moving away from that towards 0. or 2.0 increases the enhancement blend
214
+ # range [0.1, 1.9]
215
+ level = (level / _MAX_LEVEL) * 0.9
216
+ level = 1.0 + _randomly_negate(level)
217
+ return (level,)
218
+
219
+
220
+ def _shear_level_to_arg(level, _hparams):
221
+ # range [-0.3, 0.3]
222
+ level = (level / _MAX_LEVEL) * 0.3
223
+ level = _randomly_negate(level)
224
+ return (level,)
225
+
226
+
227
+ def _translate_abs_level_to_arg(level, hparams):
228
+ translate_const = hparams["translate_const"]
229
+ level = (level / _MAX_LEVEL) * float(translate_const)
230
+ level = _randomly_negate(level)
231
+ return (level,)
232
+
233
+
234
+ def _translate_rel_level_to_arg(level, hparams):
235
+ # default range [-0.45, 0.45]
236
+ translate_pct = hparams.get("translate_pct", 0.45)
237
+ level = (level / _MAX_LEVEL) * translate_pct
238
+ level = _randomly_negate(level)
239
+ return (level,)
240
+
241
+
242
+ def _posterize_level_to_arg(level, _hparams):
243
+ # As per Tensorflow TPU EfficientNet impl
244
+ # range [0, 4], 'keep 0 up to 4 MSB of original image'
245
+ # intensity/severity of augmentation decreases with level
246
+ return (int((level / _MAX_LEVEL) * 4),)
247
+
248
+
249
+ def _posterize_increasing_level_to_arg(level, hparams):
250
+ # As per Tensorflow models research and UDA impl
251
+ # range [4, 0], 'keep 4 down to 0 MSB of original image',
252
+ # intensity/severity of augmentation increases with level
253
+ return (4 - _posterize_level_to_arg(level, hparams)[0],)
254
+
255
+
256
+ def _posterize_original_level_to_arg(level, _hparams):
257
+ # As per original AutoAugment paper description
258
+ # range [4, 8], 'keep 4 up to 8 MSB of image'
259
+ # intensity/severity of augmentation decreases with level
260
+ return (int((level / _MAX_LEVEL) * 4) + 4,)
261
+
262
+
263
+ def _solarize_level_to_arg(level, _hparams):
264
+ # range [0, 256]
265
+ # intensity/severity of augmentation decreases with level
266
+ return (int((level / _MAX_LEVEL) * 256),)
267
+
268
+
269
+ def _solarize_increasing_level_to_arg(level, _hparams):
270
+ # range [0, 256]
271
+ # intensity/severity of augmentation increases with level
272
+ return (256 - _solarize_level_to_arg(level, _hparams)[0],)
273
+
274
+
275
+ def _solarize_add_level_to_arg(level, _hparams):
276
+ # range [0, 110]
277
+ return (int((level / _MAX_LEVEL) * 110),)
278
+
279
+
280
+ LEVEL_TO_ARG = {
281
+ "AutoContrast": None,
282
+ "Equalize": None,
283
+ "Invert": None,
284
+ "Rotate": _rotate_level_to_arg,
285
+ # There are several variations of the posterize level scaling in various Tensorflow/Google repositories/papers
286
+ "Posterize": _posterize_level_to_arg,
287
+ "PosterizeIncreasing": _posterize_increasing_level_to_arg,
288
+ "PosterizeOriginal": _posterize_original_level_to_arg,
289
+ "Solarize": _solarize_level_to_arg,
290
+ "SolarizeIncreasing": _solarize_increasing_level_to_arg,
291
+ "SolarizeAdd": _solarize_add_level_to_arg,
292
+ "Color": _enhance_level_to_arg,
293
+ "ColorIncreasing": _enhance_increasing_level_to_arg,
294
+ "Contrast": _enhance_level_to_arg,
295
+ "ContrastIncreasing": _enhance_increasing_level_to_arg,
296
+ "Brightness": _enhance_level_to_arg,
297
+ "BrightnessIncreasing": _enhance_increasing_level_to_arg,
298
+ "Sharpness": _enhance_level_to_arg,
299
+ "SharpnessIncreasing": _enhance_increasing_level_to_arg,
300
+ "ShearX": _shear_level_to_arg,
301
+ "ShearY": _shear_level_to_arg,
302
+ "TranslateX": _translate_abs_level_to_arg,
303
+ "TranslateY": _translate_abs_level_to_arg,
304
+ "TranslateXRel": _translate_rel_level_to_arg,
305
+ "TranslateYRel": _translate_rel_level_to_arg,
306
+ }
307
+
308
+
309
+ NAME_TO_OP = {
310
+ "AutoContrast": auto_contrast,
311
+ "Equalize": equalize,
312
+ "Invert": invert,
313
+ "Rotate": rotate,
314
+ "Posterize": posterize,
315
+ "PosterizeIncreasing": posterize,
316
+ "PosterizeOriginal": posterize,
317
+ "Solarize": solarize,
318
+ "SolarizeIncreasing": solarize,
319
+ "SolarizeAdd": solarize_add,
320
+ "Color": color,
321
+ "ColorIncreasing": color,
322
+ "Contrast": contrast,
323
+ "ContrastIncreasing": contrast,
324
+ "Brightness": brightness,
325
+ "BrightnessIncreasing": brightness,
326
+ "Sharpness": sharpness,
327
+ "SharpnessIncreasing": sharpness,
328
+ "ShearX": shear_x,
329
+ "ShearY": shear_y,
330
+ "TranslateX": translate_x_abs,
331
+ "TranslateY": translate_y_abs,
332
+ "TranslateXRel": translate_x_rel,
333
+ "TranslateYRel": translate_y_rel,
334
+ }
335
+
336
+
337
+ class AugmentOp:
338
+ """
339
+ Apply for video.
340
+ """
341
+
342
+ def __init__(self, name, prob=0.5, magnitude=10, hparams=None):
343
+ hparams = hparams or _HPARAMS_DEFAULT
344
+ self.aug_fn = NAME_TO_OP[name]
345
+ self.level_fn = LEVEL_TO_ARG[name]
346
+ self.prob = prob
347
+ self.magnitude = magnitude
348
+ self.hparams = hparams.copy()
349
+ self.kwargs = {
350
+ "fillcolor": hparams["img_mean"]
351
+ if "img_mean" in hparams
352
+ else _FILL,
353
+ "resample": hparams["interpolation"]
354
+ if "interpolation" in hparams
355
+ else _RANDOM_INTERPOLATION,
356
+ }
357
+
358
+ # If magnitude_std is > 0, we introduce some randomness
359
+ # in the usually fixed policy and sample magnitude from a normal distribution
360
+ # with mean `magnitude` and std-dev of `magnitude_std`.
361
+ # NOTE This is my own hack, being tested, not in papers or reference impls.
362
+ self.magnitude_std = self.hparams.get("magnitude_std", 0)
363
+
364
+ def __call__(self, img_list):
365
+ if self.prob < 1.0 and random.random() > self.prob:
366
+ return img_list
367
+ magnitude = self.magnitude
368
+ if self.magnitude_std and self.magnitude_std > 0:
369
+ magnitude = random.gauss(magnitude, self.magnitude_std)
370
+ magnitude = min(_MAX_LEVEL, max(0, magnitude)) # clip to valid range
371
+ level_args = (
372
+ self.level_fn(magnitude, self.hparams)
373
+ if self.level_fn is not None
374
+ else ()
375
+ )
376
+
377
+ if isinstance(img_list, list):
378
+ return [
379
+ self.aug_fn(img, *level_args, **self.kwargs) for img in img_list
380
+ ]
381
+ else:
382
+ return self.aug_fn(img_list, *level_args, **self.kwargs)
383
+
384
+
385
+ _RAND_TRANSFORMS = [
386
+ "AutoContrast",
387
+ "Equalize",
388
+ "Invert",
389
+ "Rotate",
390
+ "Posterize",
391
+ "Solarize",
392
+ "SolarizeAdd",
393
+ "Color",
394
+ "Contrast",
395
+ "Brightness",
396
+ "Sharpness",
397
+ "ShearX",
398
+ "ShearY",
399
+ "TranslateXRel",
400
+ "TranslateYRel",
401
+ ]
402
+
403
+
404
+ _RAND_INCREASING_TRANSFORMS = [
405
+ "AutoContrast",
406
+ "Equalize",
407
+ "Invert",
408
+ "Rotate",
409
+ "PosterizeIncreasing",
410
+ "SolarizeIncreasing",
411
+ "SolarizeAdd",
412
+ "ColorIncreasing",
413
+ "ContrastIncreasing",
414
+ "BrightnessIncreasing",
415
+ "SharpnessIncreasing",
416
+ "ShearX",
417
+ "ShearY",
418
+ "TranslateXRel",
419
+ "TranslateYRel",
420
+ ]
421
+
422
+
423
+ # These experimental weights are based loosely on the relative improvements mentioned in paper.
424
+ # They may not result in increased performance, but could likely be tuned to do so.
425
+ _RAND_CHOICE_WEIGHTS_0 = {
426
+ "Rotate": 0.3,
427
+ "ShearX": 0.2,
428
+ "ShearY": 0.2,
429
+ "TranslateXRel": 0.1,
430
+ "TranslateYRel": 0.1,
431
+ "Color": 0.025,
432
+ "Sharpness": 0.025,
433
+ "AutoContrast": 0.025,
434
+ "Solarize": 0.005,
435
+ "SolarizeAdd": 0.005,
436
+ "Contrast": 0.005,
437
+ "Brightness": 0.005,
438
+ "Equalize": 0.005,
439
+ "Posterize": 0,
440
+ "Invert": 0,
441
+ }
442
+
443
+
444
+ def _select_rand_weights(weight_idx=0, transforms=None):
445
+ transforms = transforms or _RAND_TRANSFORMS
446
+ assert weight_idx == 0 # only one set of weights currently
447
+ rand_weights = _RAND_CHOICE_WEIGHTS_0
448
+ probs = [rand_weights[k] for k in transforms]
449
+ probs /= np.sum(probs)
450
+ return probs
451
+
452
+
453
+ def rand_augment_ops(magnitude=10, hparams=None, transforms=None):
454
+ hparams = hparams or _HPARAMS_DEFAULT
455
+ transforms = transforms or _RAND_TRANSFORMS
456
+ return [
457
+ AugmentOp(name, prob=0.5, magnitude=magnitude, hparams=hparams)
458
+ for name in transforms
459
+ ]
460
+
461
+
462
+ class RandAugment:
463
+ def __init__(self, ops, num_layers=2, choice_weights=None):
464
+ self.ops = ops
465
+ self.num_layers = num_layers
466
+ self.choice_weights = choice_weights
467
+
468
+ def __call__(self, img):
469
+ # no replacement when using weighted choice
470
+ ops = np.random.choice(
471
+ self.ops,
472
+ self.num_layers,
473
+ replace=self.choice_weights is None,
474
+ p=self.choice_weights,
475
+ )
476
+ for op in ops:
477
+ img = op(img)
478
+ return img
479
+
480
+
481
+ def rand_augment_transform(config_str, hparams):
482
+ """
483
+ RandAugment: Practical automated data augmentation... - https://arxiv.org/abs/1909.13719
484
+
485
+ Create a RandAugment transform
486
+ :param config_str: String defining configuration of random augmentation. Consists of multiple sections separated by
487
+ dashes ('-'). The first section defines the specific variant of rand augment (currently only 'rand'). The remaining
488
+ sections, which are not order specific, determine
489
+ 'm' - integer magnitude of rand augment
490
+ 'n' - integer num layers (number of transform ops selected per image)
491
+ 'w' - integer probability weight index (index of a set of weights to influence choice of op)
492
+ 'mstd' - float std deviation of magnitude noise applied
493
+ 'inc' - integer (bool), use augmentations that increase in severity with magnitude (default: 0)
494
+ Ex 'rand-m9-n3-mstd0.5' results in RandAugment with magnitude 9, num_layers 3, magnitude_std 0.5
495
+ 'rand-mstd1-w0' results in magnitude_std 1.0, weights 0, default magnitude of 10 and num_layers 2
496
+ :param hparams: Other hparams (kwargs) for the RandAugmentation scheme
497
+ :return: A PyTorch compatible Transform
498
+ """
499
+ magnitude = _MAX_LEVEL # default to _MAX_LEVEL for magnitude (currently 10)
500
+ num_layers = 2 # default to 2 ops per image
501
+ weight_idx = None # default to no probability weights for op choice
502
+ transforms = _RAND_TRANSFORMS
503
+ config = config_str.split("-")
504
+ assert config[0] == "rand"
505
+ config = config[1:]
506
+ for c in config:
507
+ cs = re.split(r"(\d.*)", c)
508
+ if len(cs) < 2:
509
+ continue
510
+ key, val = cs[:2]
511
+ if key == "mstd":
512
+ # noise param injected via hparams for now
513
+ hparams.setdefault("magnitude_std", float(val))
514
+ elif key == "inc":
515
+ if bool(val):
516
+ transforms = _RAND_INCREASING_TRANSFORMS
517
+ elif key == "m":
518
+ magnitude = int(val)
519
+ elif key == "n":
520
+ num_layers = int(val)
521
+ elif key == "w":
522
+ weight_idx = int(val)
523
+ else:
524
+ raise NotImplementedError(f"Unknown RandAugment config section: {key}")
525
+ ra_ops = rand_augment_ops(
526
+ magnitude=magnitude, hparams=hparams, transforms=transforms
527
+ )
528
+ choice_weights = (
529
+ None if weight_idx is None else _select_rand_weights(weight_idx)
530
+ )
531
+ return RandAugment(ra_ops, num_layers, choice_weights=choice_weights)
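A hedged usage sketch of the config-string interface above, using the policy string the fine-tuning script passes by default (`rand-m7-n4-mstd0.5-inc1`); when given a list of frames, each sampled op is applied identically to every frame:

```python
from PIL import Image
from rand_augment import rand_augment_transform

aug = rand_augment_transform("rand-m7-n4-mstd0.5-inc1", {"translate_const": 100})
frames = [Image.new("RGB", (224, 224), (128, 128, 128)) for _ in range(16)]  # dummy clip
augmented = aug(frames)
print(len(augmented), augmented[0].size)   # 16 (224, 224)
```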
random_erasing.py ADDED
@@ -0,0 +1,173 @@
1
+ """
2
+ This implementation is based on
3
+ https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/random_erasing.py
4
+ published under an Apache License 2.0.
5
+ """
6
+ import math
7
+ import random
8
+ import torch
9
+
10
+
11
+ def _get_pixels(
12
+ per_pixel, rand_color, patch_size, dtype=torch.float32, device="cuda"
13
+ ):
14
+ # NOTE I've seen CUDA illegal memory access errors being caused by the normal_()
15
+ # paths, flip the order so normal is run on CPU if this becomes a problem
16
+ # Issue has been fixed in master https://github.com/pytorch/pytorch/issues/19508
17
+ if per_pixel:
18
+ return torch.empty(patch_size, dtype=dtype, device=device).normal_()
19
+ elif rand_color:
20
+ return torch.empty(
21
+ (patch_size[0], 1, 1), dtype=dtype, device=device
22
+ ).normal_()
23
+ else:
24
+ return torch.zeros((patch_size[0], 1, 1), dtype=dtype, device=device)
25
+
26
+
27
+ class RandomErasing:
28
+ """Randomly selects a rectangle region in an image and erases its pixels.
29
+ 'Random Erasing Data Augmentation' by Zhong et al.
30
+ See https://arxiv.org/pdf/1708.04896.pdf
31
+ This variant of RandomErasing is intended to be applied to either a batch
32
+ or single image tensor after it has been normalized by dataset mean and std.
33
+ Args:
34
+ probability: Probability that the Random Erasing operation will be performed.
35
+ min_area: Minimum percentage of erased area wrt input image area.
36
+ max_area: Maximum percentage of erased area wrt input image area.
37
+ min_aspect: Minimum aspect ratio of erased area.
38
+ mode: pixel color mode, one of 'const', 'rand', or 'pixel'
39
+ 'const' - erase block is constant color of 0 for all channels
40
+ 'rand' - erase block is same per-channel random (normal) color
41
+ 'pixel' - erase block is per-pixel random (normal) color
42
+ max_count: maximum number of erasing blocks per image, area per box is scaled by count.
43
+ per-image count is randomly chosen between 1 and this value.
44
+ """
45
+
46
+ def __init__(
47
+ self,
48
+ probability=0.5,
49
+ min_area=0.02,
50
+ max_area=1 / 3,
51
+ min_aspect=0.3,
52
+ max_aspect=None,
53
+ mode="const",
54
+ min_count=1,
55
+ max_count=None,
56
+ num_splits=0,
57
+ device="cuda",
58
+ cube=True,
59
+ ):
60
+ self.probability = probability
61
+ self.min_area = min_area
62
+ self.max_area = max_area
63
+ max_aspect = max_aspect or 1 / min_aspect
64
+ self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect))
65
+ self.min_count = min_count
66
+ self.max_count = max_count or min_count
67
+ self.num_splits = num_splits
68
+ mode = mode.lower()
69
+ self.rand_color = False
70
+ self.per_pixel = False
71
+ self.cube = cube
72
+ if mode == "rand":
73
+ self.rand_color = True # per block random normal
74
+ elif mode == "pixel":
75
+ self.per_pixel = True # per pixel random normal
76
+ else:
77
+ assert not mode or mode == "const"
78
+ self.device = device
79
+
80
+ def _erase(self, img, chan, img_h, img_w, dtype):
81
+ if random.random() > self.probability:
82
+ return
83
+ area = img_h * img_w
84
+ count = (
85
+ self.min_count
86
+ if self.min_count == self.max_count
87
+ else random.randint(self.min_count, self.max_count)
88
+ )
89
+ for _ in range(count):
90
+ for _ in range(10):
91
+ target_area = (
92
+ random.uniform(self.min_area, self.max_area) * area / count
93
+ )
94
+ aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))
95
+ h = int(round(math.sqrt(target_area * aspect_ratio)))
96
+ w = int(round(math.sqrt(target_area / aspect_ratio)))
97
+ if w < img_w and h < img_h:
98
+ top = random.randint(0, img_h - h)
99
+ left = random.randint(0, img_w - w)
100
+ img[:, top : top + h, left : left + w] = _get_pixels(
101
+ self.per_pixel,
102
+ self.rand_color,
103
+ (chan, h, w),
104
+ dtype=dtype,
105
+ device=self.device,
106
+ )
107
+ break
108
+
109
+ def _erase_cube(
110
+ self,
111
+ img,
112
+ batch_start,
113
+ batch_size,
114
+ chan,
115
+ img_h,
116
+ img_w,
117
+ dtype,
118
+ ):
119
+ if random.random() > self.probability:
120
+ return
121
+ area = img_h * img_w
122
+ count = (
123
+ self.min_count
124
+ if self.min_count == self.max_count
125
+ else random.randint(self.min_count, self.max_count)
126
+ )
127
+ for _ in range(count):
128
+ for _ in range(100):
129
+ target_area = (
130
+ random.uniform(self.min_area, self.max_area) * area / count
131
+ )
132
+ aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))
133
+ h = int(round(math.sqrt(target_area * aspect_ratio)))
134
+ w = int(round(math.sqrt(target_area / aspect_ratio)))
135
+ if w < img_w and h < img_h:
136
+ top = random.randint(0, img_h - h)
137
+ left = random.randint(0, img_w - w)
138
+ for i in range(batch_start, batch_size):
139
+ img_instance = img[i]
140
+ img_instance[
141
+ :, top : top + h, left : left + w
142
+ ] = _get_pixels(
143
+ self.per_pixel,
144
+ self.rand_color,
145
+ (chan, h, w),
146
+ dtype=dtype,
147
+ device=self.device,
148
+ )
149
+ break
150
+
151
+ def __call__(self, input):
152
+ if len(input.size()) == 3:
153
+ self._erase(input, *input.size(), input.dtype)
154
+ else:
155
+ batch_size, chan, img_h, img_w = input.size()
156
+ # skip first slice of batch if num_splits is set (for clean portion of samples)
157
+ batch_start = (
158
+ batch_size // self.num_splits if self.num_splits > 1 else 0
159
+ )
160
+ if self.cube:
161
+ self._erase_cube(
162
+ input,
163
+ batch_start,
164
+ batch_size,
165
+ chan,
166
+ img_h,
167
+ img_w,
168
+ input.dtype,
169
+ )
170
+ else:
171
+ for i in range(batch_start, batch_size):
172
+ self._erase(input[i], chan, img_h, img_w, input.dtype)
173
+ return input
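A hedged usage sketch: with `cube=True` (the default) the same erased region is applied to every frame of a normalized clip shaped `[T, C, H, W]`:

```python
import torch
from random_erasing import RandomErasing

erase = RandomErasing(probability=0.25, mode="pixel", max_count=1, device="cpu", cube=True)
clip = torch.randn(16, 3, 224, 224)   # [T, C, H, W], already mean/std normalized
clip = erase(clip)                    # same spatio-temporal cube erased in every frame (when triggered)
print(clip.shape)                     # torch.Size([16, 3, 224, 224])
```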
run_class_finetuning.py ADDED
@@ -0,0 +1,582 @@
1
+ import argparse
2
+ import datetime
3
+ import numpy as np
4
+ import time
5
+ import torch
6
+ import torch.backends.cudnn as cudnn
7
+ import json
8
+ import os
9
+ from functools import partial
10
+ from pathlib import Path
11
+ from collections import OrderedDict
12
+
13
+ from mixup import Mixup
14
+ from timm.models import create_model
15
+ from timm.loss import LabelSmoothingCrossEntropy, SoftTargetCrossEntropy
16
+ from timm.utils import ModelEma
17
+ from optim_factory import create_optimizer, get_parameter_groups, LayerDecayValueAssigner
18
+
19
+ from datasets import build_dataset
20
+ from engine_for_finetuning import train_one_epoch, validation_one_epoch, final_test, merge, merge_mean_per_class
21
+ from utils_mae import NativeScalerWithGradNormCount as NativeScaler
22
+ from utils_mae import multiple_samples_collate
23
+ import utils_mae as utils
24
+ import modeling_finetune
25
+
26
+
27
+ def get_args():
28
+ parser = argparse.ArgumentParser('VideoMAE fine-tuning and evaluation script for video classification', add_help=False)
29
+ parser.add_argument('--batch_size', default=64, type=int)
30
+ parser.add_argument('--epochs', default=30, type=int)
31
+ parser.add_argument('--update_freq', default=1, type=int)
32
+ parser.add_argument('--save_ckpt_freq', default=100, type=int)
33
+ parser.add_argument('--val_freq', default=1, type=int)
34
+
35
+ # Model parameters
36
+ parser.add_argument('--model', default='vit_base_patch16_224', type=str, metavar='MODEL',
37
+ help='Name of model to train')
38
+ parser.add_argument('--tubelet_size', type=int, default= 2)
39
+ parser.add_argument('--input_size', default=224, type=int,
40
+ help='videos input size')
41
+
42
+ parser.add_argument('--fc_drop_rate', type=float, default=0.0, metavar='PCT',
43
+ help='Dropout rate (default: 0.)')
44
+ parser.add_argument('--drop', type=float, default=0.0, metavar='PCT',
45
+ help='Dropout rate (default: 0.)')
46
+ parser.add_argument('--attn_drop_rate', type=float, default=0.0, metavar='PCT',
47
+ help='Attention dropout rate (default: 0.)')
48
+ parser.add_argument('--drop_path', type=float, default=0.1, metavar='PCT',
49
+ help='Drop path rate (default: 0.1)')
50
+
51
+ parser.add_argument('--disable_eval_during_finetuning', action='store_true', default=False)
52
+ parser.add_argument('--model_ema', action='store_true', default=False)
53
+ parser.add_argument('--model_ema_decay', type=float, default=0.9999, help='')
54
+ parser.add_argument('--model_ema_force_cpu', action='store_true', default=False, help='')
55
+
56
+ # Optimizer parameters
57
+ parser.add_argument('--opt', default='adamw', type=str, metavar='OPTIMIZER',
58
+ help='Optimizer (default: "adamw")')
59
+ parser.add_argument('--opt_eps', default=1e-8, type=float, metavar='EPSILON',
60
+ help='Optimizer Epsilon (default: 1e-8)')
61
+ parser.add_argument('--opt_betas', default=None, type=float, nargs='+', metavar='BETA',
62
+ help='Optimizer Betas (default: None, use opt default)')
63
+ parser.add_argument('--clip_grad', type=float, default=None, metavar='NORM',
64
+ help='Clip gradient norm (default: None, no clipping)')
65
+ parser.add_argument('--momentum', type=float, default=0.9, metavar='M',
66
+ help='SGD momentum (default: 0.9)')
67
+ parser.add_argument('--weight_decay', type=float, default=0.05,
68
+ help='weight decay (default: 0.05)')
69
+ parser.add_argument('--weight_decay_end', type=float, default=None, help="""Final value of the
70
+ weight decay. We use a cosine schedule for WD and using a larger decay by
71
+ the end of training improves performance for ViTs.""")
72
+
73
+ parser.add_argument('--lr', type=float, default=1e-3, metavar='LR',
74
+ help='learning rate (default: 1e-3)')
75
+ parser.add_argument('--layer_decay', type=float, default=0.75)
76
+
77
+ parser.add_argument('--warmup_lr', type=float, default=1e-6, metavar='LR',
78
+ help='warmup learning rate (default: 1e-6)')
79
+ parser.add_argument('--min_lr', type=float, default=1e-6, metavar='LR',
80
+ help='lower lr bound for cyclic schedulers that hit 0 (default: 1e-6)')
81
+
82
+ parser.add_argument('--warmup_epochs', type=int, default=5, metavar='N',
83
+ help='epochs to warmup LR, if scheduler supports')
84
+ parser.add_argument('--warmup_steps', type=int, default=-1, metavar='N',
85
+ help='num of steps to warmup LR, will overload warmup_epochs if set > 0')
86
+
87
+ # Augmentation parameters
88
+ parser.add_argument('--color_jitter', type=float, default=0.4, metavar='PCT',
89
+ help='Color jitter factor (default: 0.4)')
90
+ parser.add_argument('--num_sample', type=int, default=2,
91
+ help='Number of repeated-augmentation samples per clip (default: 2)')
92
+ parser.add_argument('--aa', type=str, default='rand-m7-n4-mstd0.5-inc1', metavar='NAME',
93
+ help='Use AutoAugment policy, e.g. "v0" or "original" (default: rand-m7-n4-mstd0.5-inc1)')
94
+ parser.add_argument('--smoothing', type=float, default=0.1,
95
+ help='Label smoothing (default: 0.1)')
96
+ parser.add_argument('--train_interpolation', type=str, default='bicubic',
97
+ help='Training interpolation (random, bilinear, bicubic default: "bicubic")')
98
+
99
+ # Evaluation parameters
100
+ parser.add_argument('--crop_pct', type=float, default=None)
101
+ parser.add_argument('--short_side_size', type=int, default=224)
102
+ parser.add_argument('--test_num_segment', type=int, default=5)
103
+ parser.add_argument('--test_num_crop', type=int, default=3)
104
+
105
+ # Random Erase params
106
+ parser.add_argument('--reprob', type=float, default=0.25, metavar='PCT',
107
+ help='Random erase prob (default: 0.25)')
108
+ parser.add_argument('--remode', type=str, default='pixel',
109
+ help='Random erase mode (default: "pixel")')
110
+ parser.add_argument('--recount', type=int, default=1,
111
+ help='Random erase count (default: 1)')
112
+ parser.add_argument('--resplit', action='store_true', default=False,
113
+ help='Do not random erase first (clean) augmentation split')
114
+
115
+ # Mixup params
116
+ parser.add_argument('--mixup', type=float, default=0.8,
117
+ help='mixup alpha, mixup enabled if > 0.')
118
+ parser.add_argument('--cutmix', type=float, default=1.0,
119
+ help='cutmix alpha, cutmix enabled if > 0.')
120
+ parser.add_argument('--cutmix_minmax', type=float, nargs='+', default=None,
121
+ help='cutmix min/max ratio, overrides alpha and enables cutmix if set (default: None)')
122
+ parser.add_argument('--mixup_prob', type=float, default=1.0,
123
+ help='Probability of performing mixup or cutmix when either/both is enabled')
124
+ parser.add_argument('--mixup_switch_prob', type=float, default=0.5,
125
+ help='Probability of switching to cutmix when both mixup and cutmix enabled')
126
+ parser.add_argument('--mixup_mode', type=str, default='batch',
127
+ help='How to apply mixup/cutmix params. Per "batch", "pair", or "elem"')
128
+
129
+ # Finetuning params
130
+ parser.add_argument('--finetune', default='', help='finetune from checkpoint')
131
+ parser.add_argument('--model_key', default='model|module', type=str)
132
+ parser.add_argument('--model_prefix', default='', type=str)
133
+ parser.add_argument('--init_scale', default=0.001, type=float)
134
+ parser.add_argument('--use_checkpoint', action='store_true')
135
+ parser.set_defaults(use_checkpoint=False)
136
+ parser.add_argument('--use_mean_pooling', action='store_true')
137
+ parser.set_defaults(use_mean_pooling=True)
138
+ parser.add_argument('--use_cls', action='store_false', dest='use_mean_pooling')
139
+
140
+ # Dataset parameters
141
+ parser.add_argument('--data_path', default='/path/to/list_kinetics-400', type=str,
142
+ help='dataset path')
143
+ parser.add_argument('--eval_data_path', default=None, type=str,
144
+ help='dataset path for evaluation')
145
+ parser.add_argument('--nb_classes', default=400, type=int,
146
+ help='number of the classification types')
147
+ parser.add_argument('--imagenet_default_mean_and_std', default=True, action='store_true')
148
+ parser.add_argument('--num_segments', type=int, default= 1)
149
+ parser.add_argument('--num_frames', type=int, default= 16)
150
+ parser.add_argument('--sampling_rate', type=int, default= 4)
151
+ parser.add_argument('--data_set', default='Kinetics-400', choices=['Kinetics-400', 'SSV2', 'UCF101', 'HMDB51','image_folder','SSV2-Mini', 'Mini-Kinetics'],
152
+ type=str, help='dataset')
153
+ parser.add_argument('--output_dir', default='',
154
+ help='path where to save, empty for no saving')
155
+ parser.add_argument('--log_dir', default=None,
156
+ help='path where to tensorboard log')
157
+ parser.add_argument('--device', default='cuda',
158
+ help='device to use for training / testing')
159
+ parser.add_argument('--seed', default=0, type=int)
160
+ parser.add_argument('--resume', default='',
161
+ help='resume from checkpoint')
162
+ parser.add_argument('--auto_resume', action='store_true')
163
+ parser.add_argument('--no_auto_resume', action='store_false', dest='auto_resume')
164
+ parser.set_defaults(auto_resume=True)
165
+
166
+ parser.add_argument('--save_ckpt', action='store_true')
167
+ parser.add_argument('--no_save_ckpt', action='store_false', dest='save_ckpt')
168
+ parser.set_defaults(save_ckpt=True)
169
+
170
+ parser.add_argument('--start_epoch', default=0, type=int, metavar='N',
171
+ help='start epoch')
172
+ parser.add_argument('--eval', action='store_true',
173
+ help='Perform evaluation only')
174
+ parser.add_argument('--dist_eval', action='store_true', default=False,
175
+ help='Enabling distributed evaluation')
176
+ parser.add_argument('--num_workers', default=10, type=int)
177
+ parser.add_argument('--pin_mem', action='store_true',
178
+ help='Pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.')
179
+ parser.add_argument('--no_pin_mem', action='store_false', dest='pin_mem')
180
+ parser.set_defaults(pin_mem=True)
181
+
182
+ # distributed training parameters
183
+ parser.add_argument('--world_size', default=1, type=int,
184
+ help='number of distributed processes')
185
+ parser.add_argument('--local_rank', default=-1, type=int)
186
+ parser.add_argument('--dist_on_itp', action='store_true')
187
+ parser.add_argument('--dist_url', default='env://',
188
+ help='url used to set up distributed training')
189
+
190
+ parser.add_argument('--enable_deepspeed', action='store_true', default=False)
191
+
192
+ # debug mode
193
+ parser.add_argument('--not_dist', action='store_true', default=False)
194
+ parser.add_argument('--num_outputs', default=8, type=int)
195
+
196
+ known_args, _ = parser.parse_known_args()
197
+
198
+ if known_args.enable_deepspeed:
199
+ try:
200
+ import deepspeed
201
+ from deepspeed import DeepSpeedConfig
202
+ parser = deepspeed.add_config_arguments(parser)
203
+ ds_init = deepspeed.initialize
204
+ except ImportError:
205
+ print("Please 'pip install deepspeed'")
206
+ exit(0)
207
+ else:
208
+ ds_init = None
209
+
210
+ return parser.parse_args(), ds_init
211
+
212
+
213
+ def main(args, ds_init):
214
+ if args.not_dist:
215
+ args.distributed = False
216
+ else:
217
+ utils.init_distributed_mode(args)
218
+
219
+ if ds_init is not None:
220
+ utils.create_ds_config(args)
221
+
222
+ print(args)
223
+
224
+ device = torch.device(args.device)
225
+
226
+ # fix the seed for reproducibility
227
+ seed = args.seed + utils.get_rank()
228
+ torch.manual_seed(seed)
229
+ np.random.seed(seed)
230
+ # random.seed(seed)
231
+
232
+ cudnn.benchmark = True
233
+
234
+ dataset_train, args.nb_classes = build_dataset(is_train=True, test_mode=False, args=args)
235
+ if args.disable_eval_during_finetuning:
236
+ dataset_val = None
237
+ else:
238
+ dataset_val, _ = build_dataset(is_train=False, test_mode=False, args=args)
239
+ dataset_test, _ = build_dataset(is_train=False, test_mode=True, args=args)
240
+
241
+
242
+ num_tasks = utils.get_world_size()
243
+ global_rank = utils.get_rank()
244
+ sampler_train = torch.utils.data.DistributedSampler(
245
+ dataset_train, num_replicas=num_tasks, rank=global_rank, shuffle=True
246
+ )
247
+ print("Sampler_train = %s" % str(sampler_train))
248
+ if args.dist_eval:
249
+ if len(dataset_val) % num_tasks != 0:
250
+ print('Warning: Enabling distributed evaluation with an eval dataset not divisible by process number. '
251
+ 'This will slightly alter validation results as extra duplicate entries are added to achieve '
252
+ 'equal num of samples per-process.')
253
+ sampler_val = torch.utils.data.DistributedSampler(
254
+ dataset_val, num_replicas=num_tasks, rank=global_rank, shuffle=False)
255
+ sampler_test = torch.utils.data.DistributedSampler(
256
+ dataset_test, num_replicas=num_tasks, rank=global_rank, shuffle=False)
257
+ else:
258
+ sampler_val = torch.utils.data.SequentialSampler(dataset_val)
259
+
260
+ if global_rank == 0 and args.log_dir is not None:
261
+ os.makedirs(args.log_dir, exist_ok=True)
262
+ log_writer = utils.TensorboardLogger(log_dir=args.log_dir)
263
+ else:
264
+ log_writer = None
265
+
266
+ if args.num_sample > 1:
267
+ collate_func = partial(multiple_samples_collate, fold=False)
268
+ else:
269
+ collate_func = None
270
+
271
+ data_loader_train = torch.utils.data.DataLoader(
272
+ dataset_train, sampler=sampler_train,
273
+ batch_size=args.batch_size,
274
+ num_workers=args.num_workers,
275
+ pin_memory=args.pin_mem,
276
+ drop_last=True,
277
+ collate_fn=collate_func,
278
+ )
279
+
280
+ if dataset_val is not None:
281
+ data_loader_val = torch.utils.data.DataLoader(
282
+ dataset_val, sampler=sampler_val,
283
+ batch_size=int(1.5 * args.batch_size),
284
+ num_workers=args.num_workers,
285
+ pin_memory=args.pin_mem,
286
+ drop_last=False
287
+ )
288
+ else:
289
+ data_loader_val = None
290
+
291
+ if dataset_test is not None:
292
+ data_loader_test = torch.utils.data.DataLoader(
293
+ dataset_test, sampler=sampler_test,
294
+ batch_size=args.batch_size,
295
+ num_workers=args.num_workers,
296
+ pin_memory=args.pin_mem,
297
+ drop_last=False
298
+ )
299
+ else:
300
+ data_loader_test = None
301
+
302
+ mixup_fn = None
303
+ mixup_active = args.mixup > 0 or args.cutmix > 0. or args.cutmix_minmax is not None
304
+ if mixup_active:
305
+ print("Mixup is activated!")
306
+ mixup_fn = Mixup(
307
+ mixup_alpha=args.mixup, cutmix_alpha=args.cutmix, cutmix_minmax=args.cutmix_minmax,
308
+ prob=args.mixup_prob, switch_prob=args.mixup_switch_prob, mode=args.mixup_mode,
309
+ label_smoothing=args.smoothing, num_classes=args.nb_classes)
310
+
311
+ model = create_model(
312
+ args.model,
313
+ pretrained=False,
314
+ num_classes=args.nb_classes,
315
+ all_frames=args.num_frames * args.num_segments,
316
+ tubelet_size=args.tubelet_size,
317
+ fc_drop_rate=args.fc_drop_rate,
318
+ drop_rate=args.drop,
319
+ drop_path_rate=args.drop_path,
320
+ attn_drop_rate=args.attn_drop_rate,
321
+ drop_block_rate=None,
322
+ use_checkpoint=args.use_checkpoint,
323
+ use_mean_pooling=args.use_mean_pooling,
324
+ init_scale=args.init_scale,
325
+ )
326
+
327
+ patch_size = model.patch_embed.patch_size
328
+ print("Patch size = %s" % str(patch_size))
329
+ args.window_size = (args.num_frames // 2, args.input_size // patch_size[0], args.input_size // patch_size[1])
330
+ args.patch_size = patch_size
331
+
332
+ if args.finetune:
333
+ if args.finetune.startswith('https'):
334
+ checkpoint = torch.hub.load_state_dict_from_url(
335
+ args.finetune, map_location='cpu', check_hash=True)
336
+ else:
337
+ checkpoint = torch.load(args.finetune, map_location='cpu')
338
+
339
+ print("Load ckpt from %s" % args.finetune)
340
+ checkpoint_model = None
341
+ for model_key in args.model_key.split('|'):
342
+ if model_key in checkpoint:
343
+ checkpoint_model = checkpoint[model_key]
344
+ print("Load state_dict by model_key = %s" % model_key)
345
+ break
346
+ if checkpoint_model is None:
347
+ checkpoint_model = checkpoint
348
+ state_dict = model.state_dict()
349
+ for k in ['head.weight', 'head.bias']:
350
+ if k in checkpoint_model and checkpoint_model[k].shape != state_dict[k].shape:
351
+ print(f"Removing key {k} from pretrained checkpoint")
352
+ del checkpoint_model[k]
353
+
354
+ all_keys = list(checkpoint_model.keys())
355
+ new_dict = OrderedDict()
356
+ for key in all_keys:
357
+ if key.startswith('backbone.'):
358
+ new_dict[key[9:]] = checkpoint_model[key]
359
+ elif key.startswith('encoder.'):
360
+ new_dict[key[8:]] = checkpoint_model[key]
361
+ else:
362
+ new_dict[key] = checkpoint_model[key]
363
+ checkpoint_model = new_dict
364
+
365
+ # interpolate position embedding
366
+ if 'pos_embed' in checkpoint_model:
367
+ pos_embed_checkpoint = checkpoint_model['pos_embed']
368
+ embedding_size = pos_embed_checkpoint.shape[-1] # channel dim
369
+ num_patches = model.patch_embed.num_patches #
370
+ num_extra_tokens = model.pos_embed.shape[-2] - num_patches # 0/1
371
+
372
+ # height (== width) for the checkpoint position embedding
373
+ orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)//(args.num_frames // model.patch_embed.tubelet_size)) ** 0.5)
374
+ # height (== width) for the new position embedding
375
+ new_size = int((num_patches // (args.num_frames // model.patch_embed.tubelet_size) )** 0.5)
376
+ # class_token and dist_token are kept unchanged
377
+ if orig_size != new_size:
378
+ print("Position interpolate from %dx%d to %dx%d" % (orig_size, orig_size, new_size, new_size))
379
+ extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
380
+ # only the position tokens are interpolated
381
+ pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
382
+ # B, L, C -> BT, H, W, C -> BT, C, H, W
383
+ pos_tokens = pos_tokens.reshape(-1, args.num_frames // model.patch_embed.tubelet_size, orig_size, orig_size, embedding_size)
384
+ pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
385
+ pos_tokens = torch.nn.functional.interpolate(
386
+ pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
387
+ # BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
388
+ pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, args.num_frames // model.patch_embed.tubelet_size, new_size, new_size, embedding_size)
389
+ pos_tokens = pos_tokens.flatten(1, 3) # B, L, C
390
+ new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
391
+ checkpoint_model['pos_embed'] = new_pos_embed
392
+
393
+ utils.load_state_dict(model, checkpoint_model, prefix=args.model_prefix)
394
+
395
+ model.to(device)
396
+
397
+ model_ema = None
398
+ if args.model_ema:
399
+ model_ema = ModelEma(
400
+ model,
401
+ decay=args.model_ema_decay,
402
+ device='cpu' if args.model_ema_force_cpu else '',
403
+ resume='')
404
+ print("Using EMA with decay = %.8f" % args.model_ema_decay)
405
+
406
+ model_without_ddp = model
407
+ n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
408
+
409
+ print("Model = %s" % str(model_without_ddp))
410
+ print('number of params:', n_parameters)
411
+
412
+ total_batch_size = args.batch_size * args.update_freq * utils.get_world_size()
413
+ num_training_steps_per_epoch = len(dataset_train) // total_batch_size
414
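+ # Linear LR scaling rule: base, min and warmup LRs are scaled by (effective batch size / 256).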
+ args.lr = args.lr * total_batch_size / 256
415
+ args.min_lr = args.min_lr * total_batch_size / 256
416
+ args.warmup_lr = args.warmup_lr * total_batch_size / 256
417
+ print("LR = %.8f" % args.lr)
418
+ print("Batch size = %d" % total_batch_size)
419
+ print("Update frequent = %d" % args.update_freq)
420
+ print("Number of training examples = %d" % len(dataset_train))
421
+ print("Number of training training per epoch = %d" % num_training_steps_per_epoch)
422
+
423
+ num_layers = model_without_ddp.get_num_layers()
424
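+ # Layer-wise LR decay: block i gets an LR multiplier of layer_decay ** (num_layers + 1 - i), so earlier layers are updated more conservatively.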
+ if args.layer_decay < 1.0:
425
+ assigner = LayerDecayValueAssigner(list(args.layer_decay ** (num_layers + 1 - i) for i in range(num_layers + 2)))
426
+ else:
427
+ assigner = None
428
+
429
+ if assigner is not None:
430
+ print("Assigned values = %s" % str(assigner.values))
431
+
432
+ skip_weight_decay_list = model.no_weight_decay()
433
+ print("Skip weight decay list: ", skip_weight_decay_list)
434
+
435
+ if args.enable_deepspeed:
436
+ loss_scaler = None
437
+ optimizer_params = get_parameter_groups(
438
+ model, args.weight_decay, skip_weight_decay_list,
439
+ assigner.get_layer_id if assigner is not None else None,
440
+ assigner.get_scale if assigner is not None else None)
441
+ model, optimizer, _, _ = ds_init(
442
+ args=args, model=model, model_parameters=optimizer_params, dist_init_required=not args.distributed,
443
+ )
444
+
445
+ print("model.gradient_accumulation_steps() = %d" % model.gradient_accumulation_steps())
446
+ assert model.gradient_accumulation_steps() == args.update_freq
447
+ else:
448
+ if args.distributed:
449
+ model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=True)
450
+ model_without_ddp = model.module
451
+
452
+ optimizer = create_optimizer(
453
+ args, model_without_ddp, skip_list=skip_weight_decay_list,
454
+ get_num_layer=assigner.get_layer_id if assigner is not None else None,
455
+ get_layer_scale=assigner.get_scale if assigner is not None else None)
456
+ loss_scaler = NativeScaler()
457
+
458
+ print("Use step level LR scheduler!")
459
+ lr_schedule_values = utils.cosine_scheduler(
460
+ args.lr, args.min_lr, args.epochs, num_training_steps_per_epoch,
461
+ warmup_epochs=args.warmup_epochs, warmup_steps=args.warmup_steps,
462
+ )
463
+ if args.weight_decay_end is None:
464
+ args.weight_decay_end = args.weight_decay
465
+ wd_schedule_values = utils.cosine_scheduler(
466
+ args.weight_decay, args.weight_decay_end, args.epochs, num_training_steps_per_epoch)
467
+ print("Max WD = %.7f, Min WD = %.7f" % (max(wd_schedule_values), min(wd_schedule_values)))
468
+
469
+ if mixup_fn is not None:
470
+ # smoothing is handled with mixup label transform
471
+ criterion = SoftTargetCrossEntropy()
472
+ elif args.smoothing > 0.:
473
+ criterion = LabelSmoothingCrossEntropy(smoothing=args.smoothing)
474
+ else:
475
+ criterion = torch.nn.CrossEntropyLoss()
476
+
477
+ print("criterion = %s" % str(criterion))
478
+
479
+ utils.auto_load_model(
480
+ args=args, model=model, model_without_ddp=model_without_ddp,
481
+ optimizer=optimizer, loss_scaler=loss_scaler, model_ema=model_ema)
482
+
483
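+ # Evaluation-only mode: every rank writes its predictions to "<rank>.txt", and rank 0 merges them into overall and per-class top-1/top-5 accuracy.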
+ if args.eval:
484
+ if not args.not_dist:
485
+ preds_file = os.path.join(args.output_dir, str(global_rank) + '.txt')
486
+ test_stats = final_test(data_loader_test, model, device, preds_file)
487
+ torch.distributed.barrier()
488
+ else:
489
+ num_tasks = args.num_outputs
490
+
491
+ if global_rank == 0:
492
+ print("Start merging results...")
493
+ final_top1, final_top5 = merge(args.output_dir, num_tasks)
494
+ print(f"Accuracy of the network on the {len(dataset_test)} test videos: Top-1: {final_top1:.2f}%, Top-5: {final_top5:.2f}%")
495
+ log_stats = {'Final top-1': final_top1,
496
+ 'Final Top-5': final_top5}
497
+
498
+ final_top1_per_class, final_top5_per_class = merge_mean_per_class(args.output_dir, num_tasks, args.nb_classes)
499
+ print(f"Accuracy of the network on the {len(dataset_test)} test videos: Mean-Top-1: {final_top1_per_class:.2f}%, Mean-Top-5: {final_top5_per_class:.2f}%")
500
+ log_stats["Class-Mean-Top-1"] = final_top1_per_class
501
+ log_stats["Class-Mean-Top-5"] = final_top5_per_class
502
+
503
+ if args.output_dir and utils.is_main_process():
504
+ with open(os.path.join(args.output_dir, "log.txt"), mode="a", encoding="utf-8") as f:
505
+ f.write(json.dumps(log_stats) + "\n")
506
+ exit(0)
507
+
508
+
509
+ print(f"Start training for {args.epochs} epochs")
510
+ start_time = time.time()
511
+ max_accuracy = 0.0
512
+ for epoch in range(args.start_epoch, args.epochs):
513
+ if args.distributed:
514
+ data_loader_train.sampler.set_epoch(epoch)
515
+ if log_writer is not None:
516
+ log_writer.set_step(epoch * num_training_steps_per_epoch * args.update_freq)
517
+ train_stats = train_one_epoch(
518
+ model, criterion, data_loader_train, optimizer,
519
+ device, epoch, loss_scaler, args.clip_grad, model_ema, mixup_fn,
520
+ log_writer=log_writer, start_steps=epoch * num_training_steps_per_epoch,
521
+ lr_schedule_values=lr_schedule_values, wd_schedule_values=wd_schedule_values,
522
+ num_training_steps_per_epoch=num_training_steps_per_epoch, update_freq=args.update_freq,
523
+ )
524
+ if args.output_dir and args.save_ckpt:
525
+ if (epoch + 1) % args.save_ckpt_freq == 0 or epoch + 1 == args.epochs:
526
+ utils.save_model(
527
+ args=args, model=model, model_without_ddp=model_without_ddp, optimizer=optimizer,
528
+ loss_scaler=loss_scaler, epoch=epoch, model_ema=model_ema)
529
+ if data_loader_val is not None and (epoch + 1) % args.val_freq == 0:
530
+ test_stats = validation_one_epoch(data_loader_val, model, device)
531
+ print(f"Accuracy of the network on the {len(dataset_val)} val videos: {test_stats['acc1']:.1f}%")
532
+ if max_accuracy < test_stats["acc1"]:
533
+ max_accuracy = test_stats["acc1"]
534
+ if args.output_dir and args.save_ckpt:
535
+ utils.save_model(
536
+ args=args, model=model, model_without_ddp=model_without_ddp, optimizer=optimizer,
537
+ loss_scaler=loss_scaler, epoch="best", model_ema=model_ema)
538
+
539
+ print(f'Max accuracy: {max_accuracy:.2f}%')
540
+ if log_writer is not None:
541
+ log_writer.update(val_acc1=test_stats['acc1'], head="perf", step=epoch)
542
+ log_writer.update(val_acc5=test_stats['acc5'], head="perf", step=epoch)
543
+ log_writer.update(val_loss=test_stats['loss'], head="perf", step=epoch)
544
+
545
+ log_stats = {**{f'train_{k}': v for k, v in train_stats.items()},
546
+ **{f'val_{k}': v for k, v in test_stats.items()},
547
+ 'epoch': epoch,
548
+ 'n_parameters': n_parameters}
549
+ else:
550
+ log_stats = {**{f'train_{k}': v for k, v in train_stats.items()},
551
+ 'epoch': epoch,
552
+ 'n_parameters': n_parameters}
553
+ if args.output_dir and utils.is_main_process():
554
+ if log_writer is not None:
555
+ log_writer.flush()
556
+ with open(os.path.join(args.output_dir, "log.txt"), mode="a", encoding="utf-8") as f:
557
+ f.write(json.dumps(log_stats) + "\n")
558
+
559
+ preds_file = os.path.join(args.output_dir, str(global_rank) + '.txt')
560
+ test_stats = final_test(data_loader_test, model, device, preds_file)
561
+ torch.distributed.barrier()
562
+ if global_rank == 0:
563
+ print("Start merging results...")
564
+ final_top1, final_top5 = merge(args.output_dir, num_tasks)
565
+ print(f"Accuracy of the network on the {len(dataset_test)} test videos: Top-1: {final_top1:.2f}%, Top-5: {final_top5:.2f}%")
566
+ log_stats = {'Final top-1': final_top1,
567
+ 'Final Top-5': final_top5}
568
+ if args.output_dir and utils.is_main_process():
569
+ with open(os.path.join(args.output_dir, "log.txt"), mode="a", encoding="utf-8") as f:
570
+ f.write(json.dumps(log_stats) + "\n")
571
+
572
+
573
+ total_time = time.time() - start_time
574
+ total_time_str = str(datetime.timedelta(seconds=int(total_time)))
575
+ print('Training time {}'.format(total_time_str))
576
+
577
+
578
+ if __name__ == '__main__':
579
+ opts, ds_init = get_args()
580
+ if opts.output_dir:
581
+ Path(opts.output_dir).mkdir(parents=True, exist_ok=True)
582
+ main(opts, ds_init)
run_mae_pretraining.py ADDED
@@ -0,0 +1,359 @@
1
+ import argparse
2
+ import datetime
3
+ import numpy as np
4
+ import time
5
+ import torch
6
+ import torch.backends.cudnn as cudnn
7
+ import json
8
+ import os
9
+ from pathlib import Path
10
+ from timm.models import create_model
11
+ from optim_factory import create_optimizer
12
+ from datasets import build_pretraining_dataset
13
+ from engine_for_pretraining import train_one_epoch
14
+ from utils_mae import NativeScalerWithGradNormCount as NativeScaler
15
+ import utils_mae as utils
16
+ import modeling_pretrain
17
+ from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224, vit_large_patch16_224
18
+ from modeling_pretrain import FeatureExtractor
19
+
20
+
21
+ def get_args():
22
+ parser = argparse.ArgumentParser('VideoMAE pre-training script', add_help=False)
23
+ parser.add_argument('--batch_size', default=64, type=int)
24
+ parser.add_argument('--epochs', default=800, type=int)
25
+ parser.add_argument('--save_ckpt_freq', default=50, type=int)
26
+
27
+ # Model parameters
28
+ parser.add_argument('--model', default='pretrain_videomae_base_patch16_224', type=str, metavar='MODEL',
29
+ help='Name of model to train')
30
+
31
+ parser.add_argument('--decoder_depth', default=4, type=int,
32
+ help='depth of decoder')
33
+
34
+ parser.add_argument('--mask_type', default='tube', choices=['random', 'tube', 'tubelet'],
35
+ type=str, help='masked strategy of video tokens/patches')
36
+
37
+ parser.add_argument('--sub_mask_type', default='tube+picked_frame_visible', choices=['tube', 'tube+picked_frame_visible', 'tube+traj_mask'],
38
+ type=str, help='sub masked strategy of tubelet masking')
39
+
40
+ parser.add_argument('--mask_ratio', default=0.75, type=float,
41
+ help='ratio of the visual tokens/patches need be masked')
42
+
43
+ parser.add_argument('--input_size', default=224, type=int,
44
+ help='videos input size for backbone')
45
+
46
+ parser.add_argument('--drop_path', type=float, default=0.0, metavar='PCT',
47
+ help='Drop path rate (default: 0.0)')
48
+
49
+ parser.add_argument('--normlize_target', default=True, type=bool,
50
+ help='normalize the target patch pixels')
51
+
52
+ # Optimizer parameters
53
+ parser.add_argument('--opt', default='adamw', type=str, metavar='OPTIMIZER',
54
+ help='Optimizer (default: "adamw")')
55
+ parser.add_argument('--opt_eps', default=1e-8, type=float, metavar='EPSILON',
56
+ help='Optimizer Epsilon (default: 1e-8)')
57
+ parser.add_argument('--opt_betas', default=None, type=float, nargs='+', metavar='BETA',
58
+ help='Optimizer Betas (default: None, use opt default)')
59
+ parser.add_argument('--clip_grad', type=float, default=None, metavar='NORM',
60
+ help='Clip gradient norm (default: None, no clipping)')
61
+ parser.add_argument('--momentum', type=float, default=0.9, metavar='M',
62
+ help='SGD momentum (default: 0.9)')
63
+ parser.add_argument('--weight_decay', type=float, default=0.05,
64
+ help='weight decay (default: 0.05)')
65
+ parser.add_argument('--weight_decay_end', type=float, default=None, help="""Final value of the
66
+ weight decay. We use a cosine schedule for WD.
67
+ (Set the same value with args.weight_decay to keep weight decay no change)""")
68
+
69
+ parser.add_argument('--lr', type=float, default=1.5e-4, metavar='LR',
70
+ help='learning rate (default: 1.5e-4)')
71
+ parser.add_argument('--warmup_lr', type=float, default=1e-6, metavar='LR',
72
+ help='warmup learning rate (default: 1e-6)')
73
+ parser.add_argument('--min_lr', type=float, default=1e-5, metavar='LR',
74
+ help='lower lr bound for cyclic schedulers that hit 0 (1e-5)')
75
+
76
+ parser.add_argument('--warmup_epochs', type=int, default=40, metavar='N',
77
+ help='epochs to warmup LR, if scheduler supports')
78
+ parser.add_argument('--warmup_steps', type=int, default=-1, metavar='N',
79
+ help='num of steps to warmup LR, will overload warmup_epochs if set > 0')
80
+ parser.add_argument('--use_checkpoint', action='store_true')
81
+ parser.set_defaults(use_checkpoint=False)
82
+
83
+ # Augmentation parameters
84
+ parser.add_argument('--color_jitter', type=float, default=0.0, metavar='PCT',
85
+ help='Color jitter factor (default: 0.0)')
86
+ parser.add_argument('--train_interpolation', type=str, default='bicubic',
87
+ help='Training interpolation (random, bilinear, bicubic; default: "bicubic")')
88
+
89
+ # Dataset parameters
90
+ parser.add_argument('--data_path', default='/path/to/list_kinetics-400', type=str,
91
+ help='dataset path')
92
+ parser.add_argument('--imagenet_default_mean_and_std', default=True, action='store_true')
93
+ parser.add_argument('--num_frames', type=int, default= 16)
94
+ parser.add_argument('--sampling_rate', type=int, default= 4)
95
+ parser.add_argument('--output_dir', default='',
96
+ help='path where to save, empty for no saving')
97
+ parser.add_argument('--log_dir', default=None,
98
+ help='path where to tensorboard log')
99
+ parser.add_argument('--device', default='cuda',
100
+ help='device to use for training / testing')
101
+ parser.add_argument('--seed', default=0, type=int)
102
+ parser.add_argument('--resume', default='', help='resume from checkpoint')
103
+ parser.add_argument('--auto_resume', action='store_true')
104
+ parser.add_argument('--no_auto_resume', action='store_false', dest='auto_resume')
105
+ parser.set_defaults(auto_resume=True)
106
+
107
+ parser.add_argument('--start_epoch', default=0, type=int, metavar='N',
108
+ help='start epoch')
109
+ parser.add_argument('--num_workers', default=10, type=int)
110
+ parser.add_argument('--pin_mem', action='store_true',
111
+ help='Pin CPU memory in DataLoader for more efficient (sometimes) transfer to GPU.')
112
+ parser.add_argument('--no_pin_mem', action='store_false', dest='pin_mem',
113
+ help='')
114
+ parser.set_defaults(pin_mem=True)
115
+
116
+ # distributed training parameters
117
+ parser.add_argument('--world_size', default=1, type=int,
118
+ help='number of distributed processes')
119
+ parser.add_argument('--local-rank', default=-1, type=int)
120
+ parser.add_argument('--dist_on_itp', action='store_true')
121
+ parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training')
122
+
123
+ # Tubelet params
124
+ parser.add_argument('--add_tubelets', action='store_true')
125
+ parser.set_defaults(add_tubelets=False)
126
+ parser.add_argument('--use_objects', action='store_true')
127
+ parser.set_defaults(use_objects=False)
128
+ parser.add_argument('--objects_path', type=str, default=None)
129
+ parser.add_argument('--motion_type', type=str, default='gaussian')
130
+ parser.add_argument('--scales', type=str, default='[32, 48, 56, 64, 96, 128]')
131
+ parser.add_argument('--visible_frames', type=str, default=None) # not used
132
+ parser.add_argument('--traj_unmask_ratio', type=float, default=0.1)
133
+
134
+ #dino params
135
+ parser.add_argument('--target_type', default='pixel', choices=['pixel', 'dino_v1', 'clip'], type=str, help='define target type for loss')
136
+ parser.add_argument('--distillation_teacher', default="clip_b", type=str, choices=['dino_s', 'dino_b', 'clip_b'], help='distillation teacher model')
137
+
138
+ # multiple sampling
139
+ parser.add_argument('--multiple_sampling', action='store_true')
140
+ # for 2nd stage training
141
+ parser.add_argument('--first_stage_path', type=str, default=None)
142
+
143
+ return parser.parse_args()
144
+
145
+
146
+
147
+ def get_teacher_student_models(args):
148
+ print(f"Creating model: {args.model}")
149
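+ # Decoder output dim: 1536 (a flattened 2x16x16 RGB tubelet) for pixel reconstruction, otherwise the teacher's feature dim (384 for DINO ViT-S, 768 for DINO/CLIP ViT-B).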
+ if args.target_type=='pixel':
150
+ dec_dim = 1536
151
+ elif 'dino' in args.target_type or 'clip' in args.target_type:
152
+ if args.distillation_teacher == 'dino_s':
153
+ dec_dim = 384
154
+ elif args.distillation_teacher == 'dino_b' or args.distillation_teacher == 'clip_b':
155
+
156
+ dec_dim = 768
157
+
158
+ student_model = create_model(
159
+ args.model,
160
+ pretrained=False,
161
+ drop_path_rate=args.drop_path,
162
+ drop_block_rate=None,
163
+ decoder_depth=args.decoder_depth,
164
+ use_checkpoint=args.use_checkpoint,
165
+ decoder_num_classes=dec_dim,
166
+ )
167
+
168
+ if args.target_type == 'dino_v1':
169
+
170
+ # load dino
171
+ if args.distillation_teacher == 'dino_s':
172
+ pretraining = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
173
+ teacher_model = vit_small_patch16_224(pretrained=False)
174
+ elif args.distillation_teacher == 'dino_b':
175
+ pretraining = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')
176
+ teacher_model = vit_base_patch16_224(pretrained=False)
177
+
178
+ msg = teacher_model.load_state_dict(pretraining.state_dict(), strict=False)
179
+ teacher_model = FeatureExtractor(teacher_model, args.input_size, 16)
180
+ print(msg)
181
+ teacher_model.eval()
182
+
183
+ elif args.target_type == 'clip':
184
+
185
+ # load clip
186
+ from utils_viclip.config import Config
187
+ from utils_viclip.config_utils import setup_viclip
188
+ from tasks.shared_utils import setup_model
189
+ from models_viclip.viclip import ViCLIP
190
+
191
+ config = setup_viclip('configs/config.py')
192
+ model_cls = eval(config.model.get('model_cls', 'ViCLIP'))
193
+ teacher_model = setup_model(
194
+ config,
195
+ model_cls=model_cls,
196
+ has_decoder=False,
197
+ pretrain=False,
198
+ find_unused_parameters=False,
199
+ )
200
+ teacher_model.eval()
201
+ else:
202
+ teacher_model = None
203
+
204
+
205
+ return student_model, teacher_model
206
+
207
+ def load_first_stage(model,args):
208
+ if args.first_stage_path is not None:
209
+ checkpoint = torch.load(args.first_stage_path, map_location='cpu')
210
+ print("loading first stage from ",args.first_stage_path)
211
+ checkpoint_model = checkpoint['model']
212
+ utils.load_state_dict(model, checkpoint_model)
213
+
214
+
215
+ def main(args):
216
+ utils.init_distributed_mode(args)
217
+
218
+ print(args)
219
+
220
+ device = torch.device(args.device)
221
+
222
+ # fix the seed for reproducibility
223
+ seed = args.seed + utils.get_rank()
224
+ torch.manual_seed(seed)
225
+ np.random.seed(seed)
226
+
227
+ cudnn.benchmark = True
228
+
229
+ student_model, teacher_model = get_teacher_student_models(args)
230
+
231
+ patch_size = student_model.encoder.patch_embed.patch_size
232
+ print("Patch size = %s" % str(patch_size))
233
+ args.window_size = (args.num_frames // 2, args.input_size // patch_size[0], args.input_size // patch_size[1]) # [8, 14, 14]
234
+ print(f"Window Size = {args.window_size}")
235
+ args.patch_size = patch_size
236
+
237
+
238
+ # Start from pretrained first stage model
239
+ if args.first_stage_path is not None:
240
+ load_first_stage(student_model,args)
241
+
242
+ # get dataset
243
+ dataset_train = build_pretraining_dataset(args)
244
+
245
+
246
+ num_tasks = utils.get_world_size()
247
+ global_rank = utils.get_rank()
248
+ sampler_rank = global_rank
249
+
250
+ total_batch_size = args.batch_size * num_tasks
251
+ num_training_steps_per_epoch = len(dataset_train) // total_batch_size
252
+
253
+ sampler_train = torch.utils.data.DistributedSampler(
254
+ dataset_train, num_replicas=num_tasks, rank=sampler_rank, shuffle=True
255
+ )
256
+ print("Sampler_train = %s" % str(sampler_train))
257
+
258
+
259
+ if global_rank == 0 and args.log_dir is not None:
260
+ os.makedirs(args.log_dir, exist_ok=True)
261
+ log_writer = utils.TensorboardLogger(log_dir=args.log_dir)
262
+ else:
263
+ log_writer = None
264
+
265
+ data_loader_train = torch.utils.data.DataLoader(
266
+ dataset_train, sampler=sampler_train,
267
+ batch_size=args.batch_size if not args.multiple_sampling else int(args.batch_size/2),
268
+ num_workers=args.num_workers,
269
+ pin_memory=args.pin_mem,
270
+ drop_last=True,
271
+ worker_init_fn=utils.seed_worker
272
+ )
273
+
274
+ student_model.to(device)
275
+ if teacher_model is not None:
276
+ teacher_model.to(device)
277
+ model_without_ddp = student_model
278
+ n_parameters = sum(p.numel() for p in student_model.parameters() if p.requires_grad)
279
+
280
+ print("Model = %s" % str(model_without_ddp))
281
+ print('number of params: {} M'.format(n_parameters / 1e6))
282
+
283
+ args.lr = args.lr * total_batch_size / 256
284
+ args.min_lr = args.min_lr * total_batch_size / 256
285
+ args.warmup_lr = args.warmup_lr * total_batch_size / 256
286
+ print("LR = %.8f" % args.lr)
287
+ print("Batch size = %d" % total_batch_size)
288
+ print("Number of training steps = %d" % num_training_steps_per_epoch)
289
+ print("Number of training examples per epoch = %d" % (total_batch_size * num_training_steps_per_epoch))
290
+
291
+ if args.distributed:
292
+ student_model = torch.nn.parallel.DistributedDataParallel(student_model, device_ids=[args.gpu], find_unused_parameters=False)
293
+ model_without_ddp = student_model.module
294
+
295
+ optimizer = create_optimizer(
296
+ args, model_without_ddp)
297
+ loss_scaler = NativeScaler()
298
+
299
+ print("Use step level LR & WD scheduler!")
300
+ lr_schedule_values = utils.cosine_scheduler(
301
+ args.lr, args.min_lr, args.epochs, num_training_steps_per_epoch,
302
+ warmup_epochs=args.warmup_epochs, warmup_steps=args.warmup_steps,
303
+ )
304
+ if args.weight_decay_end is None:
305
+ args.weight_decay_end = args.weight_decay
306
+ wd_schedule_values = utils.cosine_scheduler(
307
+ args.weight_decay, args.weight_decay_end, args.epochs, num_training_steps_per_epoch)
308
+ print("Max WD = %.7f, Min WD = %.7f" % (max(wd_schedule_values), min(wd_schedule_values)))
309
+
310
+ utils.auto_load_model(
311
+ args=args, model=student_model, model_without_ddp=model_without_ddp, optimizer=optimizer, loss_scaler=loss_scaler)
312
+ torch.cuda.empty_cache()
313
+ print(f"Start training for {args.epochs} epochs")
314
+ start_time = time.time()
315
+ for epoch in range(args.start_epoch, args.epochs):
316
+ if args.distributed:
317
+ data_loader_train.sampler.set_epoch(epoch)
318
+ if log_writer is not None:
319
+ log_writer.set_step(epoch * num_training_steps_per_epoch)
320
+ train_stats = train_one_epoch(
321
+ student_model, data_loader_train,
322
+ optimizer, device, epoch, loss_scaler,
323
+ args.clip_grad, log_writer=log_writer,
324
+ start_steps=epoch * num_training_steps_per_epoch,
325
+ lr_schedule_values=lr_schedule_values,
326
+ wd_schedule_values=wd_schedule_values,
327
+ patch_size=patch_size[0],
328
+ normlize_target=args.normlize_target,
329
+ teacher_model = teacher_model,
330
+ target_type=args.target_type,
331
+ multiple_sampling=args.multiple_sampling,
332
+ )
333
+ if args.output_dir:
334
+ if (epoch + 1) % args.save_ckpt_freq == 0 or epoch + 1 == args.epochs:
335
+ utils.save_model(
336
+ args=args, model=student_model, model_without_ddp=model_without_ddp, optimizer=optimizer,
337
+ loss_scaler=loss_scaler, epoch=epoch)
338
+
339
+ log_stats = {**{f'train_{k}': v for k, v in train_stats.items()},
340
+ 'epoch': epoch, 'n_parameters': n_parameters}
341
+
342
+ if args.output_dir and utils.is_main_process():
343
+ if log_writer is not None:
344
+ log_writer.flush()
345
+ with open(os.path.join(args.output_dir, "log.txt"), mode="a", encoding="utf-8") as f:
346
+ f.write(json.dumps(log_stats) + "\n")
347
+ #if (epoch + 1) % 2 == 0:
348
+ #exit(0)
349
+
350
+ total_time = time.time() - start_time
351
+ total_time_str = str(datetime.timedelta(seconds=int(total_time)))
352
+ print('Training time {}'.format(total_time_str))
353
+
354
+
355
+ if __name__ == '__main__':
356
+ opts = get_args()
357
+ if opts.output_dir:
358
+ Path(opts.output_dir).mkdir(parents=True, exist_ok=True)
359
+ main(opts)
run_videomae_vis.py ADDED
@@ -0,0 +1,198 @@
1
+ # -*- coding: utf-8 -*-
2
+ import argparse
3
+ import numpy as np
4
+ import torch
5
+ import torch.backends.cudnn as cudnn
6
+ from PIL import Image
7
+ from pathlib import Path
8
+ from timm.models import create_model
9
+ import utils
10
+ import modeling_pretrain
11
+ from datasets import DataAugmentationForVideoMAE
12
+ from torchvision.transforms import ToPILImage
13
+ from einops import rearrange
14
+ from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
15
+ from decord import VideoReader, cpu
16
+ from torchvision import transforms
17
+ from transforms import *
18
+ # from datasets import DataAugmentationForVideoMAE
19
+ from masking_generator import TubeMaskingGenerator
20
+ class DataAugmentationForVideoMAE(object):
21
+ def __init__(self, args):
22
+ self.input_mean = [0.485, 0.456, 0.406] # IMAGENET_DEFAULT_MEAN
23
+ self.input_std = [0.229, 0.224, 0.225] # IMAGENET_DEFAULT_STD
24
+ normalize = GroupNormalize(self.input_mean, self.input_std)
25
+ self.train_augmentation = GroupCenterCrop(args.input_size)
26
+ self.transform = transforms.Compose([
27
+ self.train_augmentation,
28
+ Stack(roll=False),
29
+ ToTorchFormatTensor(div=True),
30
+ normalize,
31
+ ])
32
+ if args.mask_type == 'tube':
33
+ self.masked_position_generator = TubeMaskingGenerator(
34
+ args.window_size, args.mask_ratio
35
+ )
36
+
37
+ def __call__(self, images):
38
+ process_data , _ = self.transform(images)
39
+ return process_data, self.masked_position_generator()
40
+
41
+ def __repr__(self):
42
+ repr = "(DataAugmentationForVideoMAE,\n"
43
+ repr += " transform = %s,\n" % str(self.transform)
44
+ repr += " Masked position generator = %s,\n" % str(self.masked_position_generator)
45
+ repr += ")"
46
+ return repr
47
+
48
+ def get_args():
49
+ parser = argparse.ArgumentParser('VideoMAE visualization reconstruction script', add_help=False)
50
+ parser.add_argument('img_path', type=str, help='input video path')
51
+ parser.add_argument('save_path', type=str, help='save video path')
52
+ parser.add_argument('model_path', type=str, help='checkpoint path of model')
53
+ parser.add_argument('--mask_type', default='tube', choices=['random', 'tube', 'tubelet'],
54
+ type=str, help='masked strategy of video tokens/patches')
55
+ parser.add_argument('--num_frames', type=int, default= 16)
56
+ parser.add_argument('--sampling_rate', type=int, default= 4)
57
+ parser.add_argument('--decoder_depth', default=4, type=int,
58
+ help='depth of decoder')
59
+ parser.add_argument('--input_size', default=224, type=int,
60
+ help='videos input size for backbone')
61
+ parser.add_argument('--device', default='cuda:0',
62
+ help='device to use for training / testing')
63
+ parser.add_argument('--imagenet_default_mean_and_std', default=True, action='store_true')
64
+ parser.add_argument('--mask_ratio', default=0.75, type=float,
65
+ help='ratio of the visual tokens/patches need be masked')
66
+ # Model parameters
67
+ parser.add_argument('--model', default='pretrain_videomae_small_patch16_224', type=str, metavar='MODEL',
68
+ help='Name of model to vis')
69
+ parser.add_argument('--drop_path', type=float, default=0.0, metavar='PCT',
70
+ help='Drop path rate (default: 0.1)')
71
+
72
+ # Tubelet params
73
+ parser.add_argument('--add_tubelets', action='store_true')
74
+ parser.set_defaults(add_tubelets=True)
75
+ parser.add_argument('--use_objects', action='store_true')
76
+ parser.set_defaults(use_objects=True)
77
+ parser.add_argument('--motion_type', type=str, default='gaussian')
78
+ parser.add_argument('--scales', type=str, default='[32, 48, 56, 64, 96, 128]')
79
+ parser.add_argument('--loc_velocity', type=int, default=12)
80
+ parser.add_argument('--mixed_tubelet', action='store_true')
81
+ parser.set_defaults(mixed_tubelet=False)
82
+ parser.add_argument('--visible_frames', type=str, default=None)
83
+
84
+
85
+ return parser.parse_args()
86
+
87
+
88
+ def get_model(args):
89
+ print(f"Creating model: {args.model}")
90
+ model = create_model(
91
+ args.model,
92
+ pretrained=False,
93
+ drop_path_rate=args.drop_path,
94
+ drop_block_rate=None,
95
+ decoder_depth=args.decoder_depth
96
+ )
97
+
98
+ return model
99
+
100
+
101
+ def main(args):
102
+ print(args)
103
+
104
+ device = torch.device(args.device)
105
+ cudnn.benchmark = True
106
+
107
+ model = get_model(args)
108
+ patch_size = model.encoder.patch_embed.patch_size
109
+ print("Patch size = %s" % str(patch_size))
110
+ args.window_size = (args.num_frames // 2, args.input_size // patch_size[0], args.input_size // patch_size[1])
111
+ args.patch_size = patch_size
112
+
113
+ model.to(device)
114
+ checkpoint = torch.load(args.model_path, map_location='cpu')
115
+ model.load_state_dict(checkpoint['model'])
116
+ model.eval()
117
+
118
+ if args.save_path:
119
+ Path(args.save_path).mkdir(parents=True, exist_ok=True)
120
+
121
+ with open(args.img_path, 'rb') as f:
122
+ vr = VideoReader(f, ctx=cpu(0))
123
+ duration = len(vr)
124
+ new_length = 1
125
+ new_step = 1
126
+ skip_length = new_length * new_step
127
+ # frame_id_list = [1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57, 61]
128
+
129
+
130
+ tmp = np.arange(0,32, 2) + 60
131
+ frame_id_list = tmp.tolist()
132
+ # average_duration = (duration - skip_length + 1) // args.num_frames
133
+ # if average_duration > 0:
134
+ # frame_id_list = np.multiply(list(range(args.num_frames)),
135
+ # average_duration)
136
+ # frame_id_list = frame_id_list + np.random.randint(average_duration,
137
+ # size=args.num_frames)
138
+
139
+ video_data = vr.get_batch(frame_id_list).asnumpy()
140
+ print(video_data.shape)
141
+ img = [Image.fromarray(video_data[vid, :, :, :]).convert('RGB') for vid, _ in enumerate(frame_id_list)]
142
+
143
+ transforms = DataAugmentationForVideoMAE(args)
144
+ img, bool_masked_pos = transforms((img, None)) # T*C,H,W
145
+ # print(img.shape)
146
+ img = img.view((args.num_frames , 3) + img.size()[-2:]).transpose(0,1) # T*C,H,W -> T,C,H,W -> C,T,H,W
147
+ # img = img.view(( -1 , args.num_frames) + img.size()[-2:])
148
+ bool_masked_pos = torch.from_numpy(bool_masked_pos)
149
+
150
+ with torch.no_grad():
151
+ # img = img[None, :]
152
+ # bool_masked_pos = bool_masked_pos[None, :]
153
+ img = img.unsqueeze(0)
154
+ print(img.shape)
155
+ bool_masked_pos = bool_masked_pos.unsqueeze(0)
156
+
157
+ img = img.to(device, non_blocking=True)
158
+ bool_masked_pos = bool_masked_pos.to(device, non_blocking=True).flatten(1).to(torch.bool)
159
+ outputs = model(img, bool_masked_pos)
160
+
161
+ #save original video
162
+ mean = torch.as_tensor(IMAGENET_DEFAULT_MEAN).to(device)[None, :, None, None, None]
163
+ std = torch.as_tensor(IMAGENET_DEFAULT_STD).to(device)[None, :, None, None, None]
164
+ ori_img = img * std + mean # in [0, 1]
165
+ imgs = [ToPILImage()(ori_img[0,:,vid,:,:].cpu()) for vid, _ in enumerate(frame_id_list) ]
166
+ for id, im in enumerate(imgs):
167
+ im.save(f"{args.save_path}/ori_img{id}.jpg")
168
+
169
+ img_squeeze = rearrange(ori_img, 'b c (t p0) (h p1) (w p2) -> b (t h w) (p0 p1 p2) c', p0=2, p1=patch_size[0], p2=patch_size[0])
170
+ img_norm = (img_squeeze - img_squeeze.mean(dim=-2, keepdim=True)) / (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6)
171
+ img_patch = rearrange(img_norm, 'b n p c -> b n (p c)')
172
+ img_patch[bool_masked_pos] = outputs
173
+
174
+ #make mask
175
+ mask = torch.ones_like(img_patch)
176
+ mask[bool_masked_pos] = 0
177
+ mask = rearrange(mask, 'b n (p c) -> b n p c', c=3)
178
+ mask = rearrange(mask, 'b (t h w) (p0 p1 p2) c -> b c (t p0) (h p1) (w p2) ', p0=2, p1=patch_size[0], p2=patch_size[1], h=14, w=14)
179
+
180
+ #save reconstruction video
181
+ rec_img = rearrange(img_patch, 'b n (p c) -> b n p c', c=3)
182
+ # Notice: To visualize the reconstruction video, we add the predict and the original mean and var of each patch.
183
+ rec_img = rec_img * (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6) + img_squeeze.mean(dim=-2, keepdim=True)
184
+ rec_img = rearrange(rec_img, 'b (t h w) (p0 p1 p2) c -> b c (t p0) (h p1) (w p2)', p0=2, p1=patch_size[0], p2=patch_size[1], h=14, w=14)
185
+ imgs = [ ToPILImage()(rec_img[0, :, vid, :, :].cpu().clamp(0,0.996)) for vid, _ in enumerate(frame_id_list) ]
186
+
187
+ for id, im in enumerate(imgs):
188
+ im.save(f"{args.save_path}/rec_img{id}.jpg")
189
+
190
+ #save masked video
191
+ img_mask = rec_img * mask
192
+ imgs = [ToPILImage()(img_mask[0, :, vid, :, :].cpu()) for vid, _ in enumerate(frame_id_list)]
193
+ for id, im in enumerate(imgs):
194
+ im.save(f"{args.save_path}/mask_img{id}.jpg")
195
+
196
+ if __name__ == '__main__':
197
+ opts = get_args()
198
+ main(opts)
ssv2.py ADDED
@@ -0,0 +1,363 @@
1
+ import os
2
+ import numpy as np
3
+ import torch
4
+ from torchvision import transforms
5
+ from random_erasing import RandomErasing
6
+ import warnings
7
+ from decord import VideoReader, cpu
8
+ from torch.utils.data import Dataset
9
+ import video_transforms as video_transforms
10
+ import volume_transforms as volume_transforms
11
+
12
+
13
+ class SSVideoClsDataset(Dataset):
14
+ """Load your own video classification dataset."""
15
+
16
+ def __init__(self, anno_path, data_path, mode='train', clip_len=8,
17
+ crop_size=224, short_side_size=256, new_height=256,
18
+ new_width=340, keep_aspect_ratio=True, num_segment=1,
19
+ num_crop=1, test_num_segment=10, test_num_crop=3, args=None):
20
+ self.anno_path = anno_path
21
+ self.data_path = data_path
22
+ self.mode = mode
23
+ self.clip_len = clip_len
24
+ self.crop_size = crop_size
25
+ self.short_side_size = short_side_size
26
+ self.new_height = new_height
27
+ self.new_width = new_width
28
+ self.keep_aspect_ratio = keep_aspect_ratio
29
+ self.num_segment = num_segment
30
+ self.test_num_segment = test_num_segment
31
+ self.num_crop = num_crop
32
+ self.test_num_crop = test_num_crop
33
+ self.args = args
34
+ self.aug = False
35
+ self.rand_erase = False
36
+ if self.mode in ['train']:
37
+ self.aug = True
38
+ if self.args.reprob > 0:
39
+ self.rand_erase = True
40
+ if VideoReader is None:
41
+ raise ImportError("Unable to import `decord` which is required to read videos.")
42
+
43
+ import pandas as pd
44
+ cleaned = pd.read_csv(self.anno_path, header=None, delimiter=' ')
45
+ self.dataset_samples = list(cleaned.values[:, 0])
46
+ self.label_array = list(cleaned.values[:, 1])
47
+
48
+ if (mode == 'train'):
49
+ pass
50
+
51
+ elif (mode == 'validation'):
52
+ self.data_transform = video_transforms.Compose([
53
+ video_transforms.Resize(self.short_side_size, interpolation='bilinear'),
54
+ video_transforms.CenterCrop(size=(self.crop_size, self.crop_size)),
55
+ volume_transforms.ClipToTensor(),
56
+ video_transforms.Normalize(mean=[0.485, 0.456, 0.406],
57
+ std=[0.229, 0.224, 0.225])
58
+ ])
59
+ elif mode == 'test':
60
+ self.data_resize = video_transforms.Compose([
61
+ video_transforms.Resize(size=(short_side_size), interpolation='bilinear')
62
+ ])
63
+ self.data_transform = video_transforms.Compose([
64
+ volume_transforms.ClipToTensor(),
65
+ video_transforms.Normalize(mean=[0.485, 0.456, 0.406],
66
+ std=[0.229, 0.224, 0.225])
67
+ ])
68
+ self.test_seg = []
69
+ self.test_dataset = []
70
+ self.test_label_array = []
71
+ for ck in range(self.test_num_segment):
72
+ for cp in range(self.test_num_crop):
73
+ for idx in range(len(self.label_array)):
74
+ sample_label = self.label_array[idx]
75
+ self.test_label_array.append(sample_label)
76
+ self.test_dataset.append(self.dataset_samples[idx])
77
+ self.test_seg.append((ck, cp))
78
+
79
+ def __getitem__(self, index):
80
+ if self.mode == 'train':
81
+ args = self.args
82
+ scale_t = 1
83
+
84
+ sample = self.dataset_samples[index]
85
+ buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t) # T H W C
86
+ if len(buffer) == 0:
87
+ while len(buffer) == 0:
88
+ warnings.warn("video {} not correctly loaded during training".format(sample))
89
+ index = np.random.randint(self.__len__())
90
+ sample = self.dataset_samples[index]
91
+ buffer = self.loadvideo_decord(sample, sample_rate_scale=scale_t)
92
+
93
+ if args.num_sample > 1:
94
+ frame_list = []
95
+ label_list = []
96
+ index_list = []
97
+ for _ in range(args.num_sample):
98
+ new_frames = self._aug_frame(buffer, args)
99
+ label = self.label_array[index]
100
+ frame_list.append(new_frames)
101
+ label_list.append(label)
102
+ index_list.append(index)
103
+ return frame_list, label_list, index_list, {}
104
+ else:
105
+ buffer = self._aug_frame(buffer, args)
106
+
107
+ return buffer, self.label_array[index], index, {}
108
+
109
+ elif self.mode == 'validation':
110
+ sample = self.dataset_samples[index]
111
+ buffer = self.loadvideo_decord(sample)
112
+ if len(buffer) == 0:
113
+ while len(buffer) == 0:
114
+ warnings.warn("video {} not correctly loaded during validation".format(sample))
115
+ index = np.random.randint(self.__len__())
116
+ sample = self.dataset_samples[index]
117
+ buffer = self.loadvideo_decord(sample)
118
+ buffer = self.data_transform(buffer)
119
+ return buffer, self.label_array[index], sample.split("/")[-1].split(".")[0]
120
+
121
+ elif self.mode == 'test':
122
+ sample = self.test_dataset[index]
123
+ chunk_nb, split_nb = self.test_seg[index]
124
+ buffer = self.loadvideo_decord(sample)
125
+
126
+ while len(buffer) == 0:
127
+ warnings.warn("video {}, temporal {}, spatial {} not found during testing".format(\
128
+ str(self.test_dataset[index]), chunk_nb, split_nb))
129
+ index = np.random.randint(self.__len__())
130
+ sample = self.test_dataset[index]
131
+ chunk_nb, split_nb = self.test_seg[index]
132
+ buffer = self.loadvideo_decord(sample)
133
+
134
+ buffer = self.data_resize(buffer)
135
+ if isinstance(buffer, list):
136
+ buffer = np.stack(buffer, 0)
137
+
138
+ spatial_step = 1.0 * (max(buffer.shape[1], buffer.shape[2]) - self.short_side_size) \
139
+ / (self.test_num_crop - 1)
140
+ temporal_start = chunk_nb # 0/1
141
+ spatial_start = int(split_nb * spatial_step)
142
+ if buffer.shape[1] >= buffer.shape[2]:
143
+ buffer = buffer[temporal_start::2, \
144
+ spatial_start:spatial_start + self.short_side_size, :, :]
145
+ else:
146
+ buffer = buffer[temporal_start::2, \
147
+ :, spatial_start:spatial_start + self.short_side_size, :]
148
+
149
+ buffer = self.data_transform(buffer)
150
+ return buffer, self.test_label_array[index], sample.split("/")[-1].split(".")[0], \
151
+ chunk_nb, split_nb
152
+ else:
153
+ raise NameError('mode {} unkown'.format(self.mode))
154
+
155
+ def _aug_frame(
156
+ self,
157
+ buffer,
158
+ args,
159
+ ):
160
+
161
+ aug_transform = video_transforms.create_random_augment(
162
+ input_size=(self.crop_size, self.crop_size),
163
+ auto_augment=args.aa,
164
+ interpolation=args.train_interpolation,
165
+ )
166
+
167
+ buffer = [
168
+ transforms.ToPILImage()(frame) for frame in buffer
169
+ ]
170
+
171
+ buffer = aug_transform(buffer)
172
+
173
+ buffer = [transforms.ToTensor()(img) for img in buffer]
174
+ buffer = torch.stack(buffer) # T C H W
175
+ buffer = buffer.permute(0, 2, 3, 1) # T H W C
176
+
177
+ # T H W C
178
+ buffer = tensor_normalize(
179
+ buffer, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
180
+ )
181
+ # T H W C -> C T H W.
182
+ buffer = buffer.permute(3, 0, 1, 2)
183
+ # Perform data augmentation.
184
+ scl, asp = (
185
+ [0.25, 1.0],
186
+ [0.75, 1.3333],
187
+ )
188
+
189
+ buffer = spatial_sampling(
190
+ buffer,
191
+ spatial_idx=-1,
192
+ min_scale=256,
193
+ max_scale=320,
194
+ crop_size=self.crop_size,
195
+ random_horizontal_flip=False if args.data_set == 'SSV2' else True,
196
+ inverse_uniform_sampling=False,
197
+ aspect_ratio=asp,
198
+ scale=scl,
199
+ motion_shift=False
200
+ )
201
+
202
+ if self.rand_erase:
203
+ erase_transform = RandomErasing(
204
+ args.reprob,
205
+ mode=args.remode,
206
+ max_count=args.recount,
207
+ num_splits=args.recount,
208
+ device="cpu",
209
+ )
210
+ buffer = buffer.permute(1, 0, 2, 3)
211
+ buffer = erase_transform(buffer)
212
+ buffer = buffer.permute(1, 0, 2, 3)
213
+
214
+ return buffer
215
+
216
+
217
+ def loadvideo_decord(self, sample, sample_rate_scale=1):
218
+ """Load video content using Decord"""
219
+ fname = sample
220
+
221
+ if not (os.path.exists(fname)):
222
+ return []
223
+
224
+ # avoid hanging issue
225
+ if os.path.getsize(fname) < 1 * 1024:
226
+ print('SKIP: ', fname, " - ", os.path.getsize(fname))
227
+ return []
228
+ try:
229
+ if self.keep_aspect_ratio:
230
+ vr = VideoReader(fname, num_threads=1, ctx=cpu(0))
231
+ else:
232
+ vr = VideoReader(fname, width=self.new_width, height=self.new_height,
233
+ num_threads=1, ctx=cpu(0))
234
+ except:
235
+ print("video cannot be loaded by decord: ", fname)
236
+ return []
237
+
238
+ if self.mode == 'test':
239
+ all_index = []
240
+ tick = len(vr) / float(self.num_segment)
241
+ all_index = list(np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segment)] +
242
+ [int(tick * x) for x in range(self.num_segment)]))
243
+ while len(all_index) < (self.num_segment * self.test_num_segment):
244
+ all_index.append(all_index[-1])
245
+ all_index = list(np.sort(np.array(all_index)))
246
+ vr.seek(0)
247
+ buffer = vr.get_batch(all_index).asnumpy()
248
+ return buffer
249
+
250
+ # handle temporal segments
251
+ average_duration = len(vr) // self.num_segment
252
+ all_index = []
253
+ if average_duration > 0:
254
+ all_index += list(np.multiply(list(range(self.num_segment)), average_duration) + np.random.randint(average_duration,
255
+ size=self.num_segment))
256
+ elif len(vr) > self.num_segment:
257
+ all_index += list(np.sort(np.random.randint(len(vr), size=self.num_segment)))
258
+ else:
259
+ all_index += list(np.zeros((self.num_segment,)))
260
+ all_index = list(np.array(all_index))
261
+ vr.seek(0)
262
+ buffer = vr.get_batch(all_index).asnumpy()
263
+ return buffer
264
+
265
+ def __len__(self):
266
+ if self.mode != 'test':
267
+ return len(self.dataset_samples)
268
+ else:
269
+ return len(self.test_dataset)
270
+
271
+
272
+ def spatial_sampling(
273
+ frames,
274
+ spatial_idx=-1,
275
+ min_scale=256,
276
+ max_scale=320,
277
+ crop_size=224,
278
+ random_horizontal_flip=True,
279
+ inverse_uniform_sampling=False,
280
+ aspect_ratio=None,
281
+ scale=None,
282
+ motion_shift=False,
283
+ ):
284
+ """
285
+ Perform spatial sampling on the given video frames. If spatial_idx is
286
+ -1, perform random scale, random crop, and random flip on the given
287
+ frames. If spatial_idx is 0, 1, or 2, perform spatial uniform sampling
288
+ with the given spatial_idx.
289
+ Args:
290
+ frames (tensor): frames of images sampled from the video. The
291
+ dimension is `num frames` x `height` x `width` x `channel`.
292
+ spatial_idx (int): if -1, perform random spatial sampling. If 0, 1,
293
+ or 2, perform left, center, right crop if width is larger than
294
+ height, and perform top, center, bottom crop if height is larger
295
+ than width.
296
+ min_scale (int): the minimal size of scaling.
297
+ max_scale (int): the maximal size of scaling.
298
+ crop_size (int): the size of height and width used to crop the
299
+ frames.
300
+ inverse_uniform_sampling (bool): if True, sample uniformly in
301
+ [1 / max_scale, 1 / min_scale] and take a reciprocal to get the
302
+ scale. If False, take a uniform sample from [min_scale,
303
+ max_scale].
304
+ aspect_ratio (list): Aspect ratio range for resizing.
305
+ scale (list): Scale range for resizing.
306
+ motion_shift (bool): Whether to apply motion shift for resizing.
307
+ Returns:
308
+ frames (tensor): spatially sampled frames.
309
+ """
310
+ assert spatial_idx in [-1, 0, 1, 2]
311
+ if spatial_idx == -1:
312
+ if aspect_ratio is None and scale is None:
313
+ frames, _ = video_transforms.random_short_side_scale_jitter(
314
+ images=frames,
315
+ min_size=min_scale,
316
+ max_size=max_scale,
317
+ inverse_uniform_sampling=inverse_uniform_sampling,
318
+ )
319
+ frames, _ = video_transforms.random_crop(frames, crop_size)
320
+ else:
321
+ transform_func = (
322
+ video_transforms.random_resized_crop_with_shift
323
+ if motion_shift
324
+ else video_transforms.random_resized_crop
325
+ )
326
+ frames = transform_func(
327
+ images=frames,
328
+ target_height=crop_size,
329
+ target_width=crop_size,
330
+ scale=scale,
331
+ ratio=aspect_ratio,
332
+ )
333
+ if random_horizontal_flip:
334
+ frames, _ = video_transforms.horizontal_flip(0.5, frames)
335
+ else:
336
+ # The testing is deterministic and no jitter should be performed.
337
+ # min_scale, max_scale, and crop_size are expected to be the same.
338
+ assert len({min_scale, max_scale, crop_size}) == 1
339
+ frames, _ = video_transforms.random_short_side_scale_jitter(
340
+ frames, min_scale, max_scale
341
+ )
342
+ frames, _ = video_transforms.uniform_crop(frames, crop_size, spatial_idx)
343
+ return frames
344
+
345
+
346
+ def tensor_normalize(tensor, mean, std):
347
+ """
348
+ Normalize a given tensor by subtracting the mean and dividing by the std.
349
+ Args:
350
+ tensor (tensor): tensor to normalize.
351
+ mean (tensor or list): mean value to subtract.
352
+ std (tensor or list): std to divide.
353
+ """
354
+ if tensor.dtype == torch.uint8:
355
+ tensor = tensor.float()
356
+ tensor = tensor / 255.0
357
+ if type(mean) == list:
358
+ mean = torch.tensor(mean)
359
+ if type(std) == list:
360
+ std = torch.tensor(std)
361
+ tensor = tensor - mean
362
+ tensor = tensor / std
363
+ return tensor
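For reference, the sketch below shows how the `tensor_normalize` / permute path above composes for a single clip. It is a minimal, self-contained illustration: the random uint8 tensor stands in for decoded frames, and the helper restates the normalization logic locally rather than importing it, since this dataset module's import path is not shown in the diff.

```python
import torch

def tensor_normalize(tensor, mean, std):
    # Mirrors the helper above: scale uint8 to [0, 1], then subtract the
    # per-channel mean and divide by the per-channel std.
    if tensor.dtype == torch.uint8:
        tensor = tensor.float() / 255.0
    return (tensor - torch.tensor(mean)) / torch.tensor(std)

# Stand-in for a decoded clip: 16 frames of 224x224 RGB, laid out T H W C.
clip = torch.randint(0, 256, (16, 224, 224, 3), dtype=torch.uint8)

clip = tensor_normalize(clip, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
clip = clip.permute(3, 0, 1, 2)  # T H W C -> C T H W, as in the code above

print(clip.shape)  # torch.Size([3, 16, 224, 224])
```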
synthetic_tubelets.py ADDED
@@ -0,0 +1,785 @@
1
+ # Copyright (c) Microsoft Corporation. All rights reserved.
2
+ # Licensed under the MIT License.
3
+ import random
4
+ import numpy as np
5
+ import random
6
+ import cv2
7
+ from typing import List
8
+ from PIL import Image
9
+
10
+ from dynamic_utils import (extend_key_frame_to_all,
11
+ sample_key_frames)
12
+ import imutils
13
+ import math
14
+ from scipy.ndimage import gaussian_filter1d
15
+ from glob import glob
16
+
17
+
18
+ class RandomRegionSampler(object):
19
+
20
+ def __init__(self,
21
+ num_rois: int,
22
+ scales: tuple,
23
+ ratios: tuple,
24
+ scale_jitter: float):
25
+ """ Randomly sample several RoIs
26
+
27
+ Args:
28
+ num_rois (int): number of sampled RoIs per image
29
+ scales (tuple): scales of candidate bounding boxes
30
+ ratios (tuple): aspect ratios of candidate bounding boxes
31
+ scale_jitter (float): scale jitter factor, positive number
32
+ """
33
+
34
+ self.num_rois = num_rois
35
+ self.scale_jitter = scale_jitter
36
+
37
+ scales = np.array(scales, np.float32)
38
+ ratios = np.array(ratios, np.float32)
39
+ widths = scales.reshape(1, -1) * np.sqrt(ratios).reshape(-1, 1)
40
+ heights = scales.reshape(1, -1) / np.sqrt(ratios).reshape(-1, 1)
41
+ self.anchors = np.concatenate((widths.reshape(-1, 1),
42
+ heights.reshape(-1, 1)), axis=-1)
43
+
44
+ def sample(self, data: List[np.ndarray]) -> np.ndarray:
45
+ """ Sample boxes.
46
+
47
+ Args:
48
+ data (list): image list, each element is a numpy.ndarray
49
+ in shape of [H, W, 3]
50
+
51
+ Returns:
52
+ boxes (np.ndarray): the sampled bounding boxes. in shape of
53
+ [self.num_rois, 4], represented in (x1, y1, x2, y2).
54
+
55
+ """
56
+ h, w = data[0].shape[0:2]
57
+
58
+ # random sample box shapes
59
+ anchor_inds = np.random.randint(0, len(self.anchors),
60
+ size=(self.num_rois, ))
61
+ box_shapes = self.anchors[anchor_inds].copy()
62
+ if self.scale_jitter is not None:
63
+ scale_factors = np.random.uniform(-self.scale_jitter,
64
+ self.scale_jitter,
65
+ size=(self.num_rois, 2))
66
+ box_shapes = box_shapes * np.exp(scale_factors)
67
+ box_shapes[:, 0] = np.clip(box_shapes[:, 0], 1, w - 1)
68
+ box_shapes[:, 1] = np.clip(box_shapes[:, 1], 1, h - 1)
69
+
70
+ #print("box shapes",box_shapes,box_shapes.shape)
71
+ # random sample box x1, y1
72
+ x1 = np.random.uniform(0, w - box_shapes[:, 0])
73
+ y1 = np.random.uniform(0, h - box_shapes[:, 1])
74
+ #print("x1, y1",x1,y1)
75
+ boxes = np.concatenate((x1.reshape(-1, 1),
76
+ y1.reshape(-1, 1),
77
+ (x1 + box_shapes[:, 0]).reshape(-1, 1),
78
+ (y1 + box_shapes[:, 1]).reshape(-1, 1)),
79
+ axis=1)
80
+ #print("sampled initial boxes",boxes)
81
+
82
+ return boxes
83
+
84
+ def sample_box_shapes(self, data: List[np.ndarray]) -> np.ndarray:
85
+ """ Sample boxes.
86
+
87
+ Args:
88
+ data (list): image list, each element is a numpy.ndarray
89
+ in shape of [H, W, 3]
90
+
91
+ Returns:
92
+ boxes (np.ndarray): the sampled bounding boxes. in shape of
93
+ [self.num_rois, 4], represented in (x1, y1, x2, y2).
94
+
95
+ """
96
+ h, w = data[0].shape[0:2]
97
+
98
+ # random sample box shapes
99
+ anchor_inds = np.random.randint(0, len(self.anchors),
100
+ size=(self.num_rois, ))
101
+ box_shapes = self.anchors[anchor_inds].copy()
102
+ if self.scale_jitter is not None:
103
+ scale_factors = np.random.uniform(-self.scale_jitter,
104
+ self.scale_jitter,
105
+ size=(self.num_rois, 2))
106
+ box_shapes = box_shapes * np.exp(scale_factors)
107
+ box_shapes[:, 0] = np.clip(box_shapes[:, 0], 1, w - 1)
108
+ box_shapes[:, 1] = np.clip(box_shapes[:, 1], 1, h - 1)
109
+
110
+ #print(" gaussian box shapes",box_shapes)
111
+
112
+ return box_shapes
113
+
114
+
115
+ class PatchMask(object):
116
+
117
+ def __init__(self,
118
+ use_objects: bool,
119
+ objects_path: str,
120
+ region_sampler: dict,
121
+ key_frame_probs: list,
122
+ loc_velocity: float,
123
+ rot_velocity: float,
124
+ size_velocity: float,
125
+ label_prob: float,
126
+ patch_transformation: str,
127
+ motion_type: str):
128
+
129
+ """ Core transformation in Catch-the-Patch.
130
+
131
+ Args:
132
+ region_sampler (dict): region sampler setting, it will be used to
133
+ construct a RandomRegionSampler object.
134
+ key_frame_probs (list): probabilities of sampling how many key
135
+ frames. The sum of this list should be 1.
136
+ loc_velocity (float): the maximum patch movement speed. (pix per
137
+ frame).
138
+ size_velocity (float): the maximum size change ratios between two
139
+ neighbouring frames.
140
+ label_prob (float): the fraction of frames that will be
141
+ modified. Note that even if a frame is not modified, we still
142
+ force the model to infer the patch positions. (see MRM module
143
+ in the paper).
144
+ """
145
+ self.region_sampler = RandomRegionSampler(**region_sampler)
146
+ self.key_frame_probs = key_frame_probs
147
+ self.loc_velocity = loc_velocity
148
+ self.rot_velocity = rot_velocity
149
+ self.size_velocity = size_velocity
150
+ self.label_prob = label_prob
151
+ if motion_type is not None:
152
+ self.motion_type = motion_type
153
+ self.patch_transformation = patch_transformation
154
+ self.use_objects = use_objects
155
+
156
+ if self.use_objects:
157
+ #self.object_list = glob("/ibex/user/jianl0b/Dataset/Fida_file_1/video_images/micheal_objects/cleaned/images/*/*")
158
+ self.object_list = glob(objects_path+"/*/*")
159
+
160
+ #self.object_list = glob("/ibex/project/c2134/Fida/micheal_objects_big/cleaned_big/images/*/*")
161
+ print(self.object_list[0:10],len(self.object_list))
162
+
163
+ def paste_objects(self, data, traj_rois, boxes):
164
+
165
+ objects_list = []
166
+ label_list = []
167
+
168
+ for i in range(len(boxes)):
169
+ objects, crop_index = self.pick_objects(data, traj_rois[i])
170
+ labels = np.random.uniform(0, 1, size=(len(data), ))
171
+ labels[crop_index] = 0.0
172
+ labels[0] = 0.0
173
+ labels = labels <= self.label_prob
174
+ objects_list.append(objects)
175
+ label_list.append(labels)
176
+
177
+ return objects_list, None, label_list
178
+
179
+ def paste_patches(self, data, traj_rois, boxes):
180
+
181
+ patches_list = []
182
+ alphas_list = []
183
+ label_list = []
184
+
185
+ for i in range(len(boxes)):
186
+ patches, crop_index = self.pick_patches(data, traj_rois[i])
187
+ alphas = self.pick_alphas(data, traj_rois[i], crop_index)
188
+ labels = np.random.uniform(0, 1, size=(len(data), ))
189
+ labels[crop_index] = 0.0
190
+ labels[0] = 0.0
191
+ labels = labels <= self.label_prob
192
+ patches_list.append(patches)
193
+ alphas_list.append(alphas)
194
+ label_list.append(labels)
195
+
196
+ return patches_list, alphas_list, label_list
197
+
198
+
199
+
200
+
201
+
202
+ def pick_patches(self,
203
+ data: List[np.ndarray],
204
+ traj_rois: np.ndarray) -> tuple:
205
+ """ Pick image patches from the raw video frame.
206
+
207
+ We randomly select a frame index and crop that frame according to
208
+ the trajectory RoIs. The cropped patch is then resized to the
209
+ per-frame sizes specified by traj_rois.
210
+
211
+ Args:
212
+ data (List[np.ndarray]): list of images, each element is in shape
213
+ of [H, W, 3]
214
+ traj_rois (np.ndarray): the generated trajectories, in shape of
215
+ [N_frames, 4]. (x1, y1, x2, y2)
216
+
217
+ Returns:
218
+ patches (List[np.ndarray]): the cropped patches
219
+ select_idx (int): the frame index from which the source patch
220
+ was cropped.
221
+ """
222
+ traj_sizes = traj_rois[..., 2:4] - traj_rois[..., 0:2]
223
+ num = len(traj_sizes)
224
+ select_idx = random.randint(0, num - 1)
225
+ x1, y1, x2, y2 = traj_rois[select_idx]
226
+ traj_rois_H = y2 - y1
227
+ traj_rois_W = x2 - x1
228
+
229
+ img = data[select_idx]
230
+ img_H, img_W, _ = img.shape
231
+
232
+ if img_W - traj_rois_W - 1 >= 0 and img_H - traj_rois_H - 1 >= 0:
233
+ new_x1 = random.randint(0, img_W - traj_rois_W - 1)
234
+ new_y1 = random.randint(0, img_H - traj_rois_H - 1)
235
+ new_x2 = new_x1 + traj_rois_W
236
+ new_y2 = new_y1 + traj_rois_H
237
+ img = img[new_y1:new_y2, new_x1:new_x2, :]
238
+ else:
239
+ img = img
240
+ patches = [cv2.resize(img, (traj_sizes[i, 0], traj_sizes[i, 1]))
241
+ for i in range(traj_rois.shape[0])]
242
+ return patches, select_idx
243
+
244
+ def pick_objects(self,
245
+ data: List[np.ndarray],
246
+ traj_rois: np.ndarray) -> tuple:
247
+ """ Pick image patches from the raw video frame.
248
+
249
+ We just randomly select a frame index, and crop the frame according to
250
+ the trajectory rois. This cropped patch will be resized into the
251
+ suitable size specified by the traj_rois.
252
+
253
+ Args:
254
+ data (List[np.ndarray]): list of images, each element is in shape
255
+ of [H, W, 3]
256
+ traj_rois (np.ndarray): the generated trajectories, in shape of
257
+ [N_frames, 4]. (x1, y1, x2, y2)
258
+
259
+ Returns:
260
+ objects (List[PIL.Image]): the resized object images
261
+ select_idx (int): the frame index whose RoI sets the base
262
+ object size.
263
+ """
264
+ traj_sizes = traj_rois[..., 2:4] - traj_rois[..., 0:2]
265
+ num = len(traj_sizes)
266
+ select_idx = random.randint(0, num - 1)
267
+ #print(len(data),traj_rois.shape)
268
+ x1, y1, x2, y2 = traj_rois[select_idx]
269
+ #print(x1, y1, x2, y2)
270
+
271
+ object_ind = random.randint(0, len(self.object_list)- 1)
272
+ object_img = Image.open(self.object_list[object_ind])
273
+ object_img = object_img.resize((x2-x1,y2-y1))
274
+
275
+ objects = [object_img.resize((traj_sizes[i, 0], traj_sizes[i, 1]))
276
+ for i in range(traj_rois.shape[0])]
277
+
278
+ return objects, select_idx
279
+
280
+
281
+
282
+ def pick_alphas(self,
283
+ data,
284
+ traj_rois: np.ndarray,
285
+ crop_index: int):
286
+ """ Generate the alpha masks for merging the patches into the raw
287
+ frames:
288
+ out_frame = raw_frame * (1 - alpha) + patch * alpha.
289
+ Besides controlling transparency, the alpha values are also used to mask the
290
+ patches into some predefined shapes, like an ellipse or a rhombus.
291
+ There are several hard-coded constants in this function. We did not
292
+ conduct any ablation analysis on these constants; they should have
293
+ little impact on the final performance.
294
+
295
+ Args:
296
+ data (List[np.ndarray]): list of images, each element is in shape
297
+ of [H, W, 3]
298
+ traj_rois (np.ndarray): the generated trajectories, in shape of
299
+ [N_frames, 4]. (x1, y1, x2, y2)
300
+ crop_index (int): the frame index which the source patch
301
+ cropped from.
302
+
303
+ Returns:
304
+ alphas (List[np.ndarray]): the generated alpha values
305
+
306
+ """
307
+ traj_sizes = traj_rois[..., 2:4] - traj_rois[..., 0:2]
308
+ num_frames = traj_sizes.shape[0]
309
+
310
+ base_w, base_h = traj_sizes[crop_index]
311
+
312
+ base_x_grids, base_y_grids = np.meshgrid(
313
+ np.arange(base_w).astype(np.float32),
314
+ np.arange(base_h).astype(np.float32)
315
+
316
+ )
317
+ ctr_w = (base_w - 1) // 2
318
+ ctr_h = (base_h - 1) // 2
319
+
320
+ dist_to_ctr_x = np.abs(base_x_grids - ctr_w) / base_w
321
+ dist_to_ctr_y = np.abs(base_y_grids - ctr_h) / base_h
322
+
323
+ mask_type = int(np.random.choice(3, p=[0.5, 0.35, 0.15]))
324
+ if mask_type == 0:
325
+ dist_to_ctr = np.maximum(dist_to_ctr_x, dist_to_ctr_y)
326
+ base_alpha = np.ones((base_h, base_w), np.float32)
327
+ elif mask_type == 1:
328
+ dist_to_ctr = np.sqrt(dist_to_ctr_x ** 2 + dist_to_ctr_y ** 2)
329
+ base_alpha = np.where(dist_to_ctr < 0.5,
330
+ np.ones((base_h, base_w), np.float32),
331
+ np.zeros((base_h, base_w), np.float32))
332
+ elif mask_type == 2:
333
+ dist_to_ctr = (dist_to_ctr_x + dist_to_ctr_y)
334
+ base_alpha = np.where(dist_to_ctr < 0.5,
335
+ np.ones((base_h, base_w), np.float32),
336
+ np.zeros((base_h, base_w), np.float32))
337
+ else:
338
+ raise NotImplementedError
339
+
340
+ use_smooth_edge = random.uniform(0, 1) < 0.5
341
+ if use_smooth_edge:
342
+ turning_point = random.uniform(0.30, 0.45)
343
+ k = -1 / (0.5 - turning_point)
344
+ alpha_mul = k * dist_to_ctr - 0.5 * k
345
+ alpha_mul = np.clip(alpha_mul, 0, 1)
346
+ base_alpha = base_alpha * alpha_mul
347
+
348
+ # sample key frames
349
+ key_inds = sample_key_frames(num_frames, self.key_frame_probs)
350
+ frame_alphas = np.random.uniform(0.8, 1.0, size=(len(key_inds), 1))
351
+ frame_alphas = extend_key_frame_to_all(frame_alphas, key_inds)
352
+
353
+ alphas = []
354
+ for frame_idx in range(num_frames):
355
+ w, h = traj_sizes[frame_idx]
356
+ i_alpha = cv2.resize(base_alpha, (w, h))
357
+ i_alpha = i_alpha * frame_alphas[frame_idx]
358
+ alphas.append(i_alpha)
359
+ return alphas
360
+
361
+ def get_rotation_angles(self,
362
+ num_frames,
363
+ transform_param: dict):
364
+ key_frame_probs = transform_param['key_frame_probs']
365
+ loc_key_inds = sample_key_frames(num_frames, key_frame_probs)
366
+
367
+ rot_velocity = transform_param['rot_velocity']
368
+ rot_angles = np.zeros((transform_param['traj_rois'].shape[0],1))
369
+
370
+ #print("rotation angles original",rot_angles.shape,loc_key_inds)
371
+ rot_angles_list= [np.expand_dims(rot_angles, axis=0)]
372
+ for i in range(len(loc_key_inds) - 1):
373
+ if rot_velocity > 0:
374
+ index_diff = loc_key_inds[i + 1] - loc_key_inds[i]
375
+ shifts = np.random.uniform(low=-rot_velocity* index_diff,
376
+ high=rot_velocity* index_diff,
377
+ size=rot_angles.shape)
378
+ rot_angles = rot_angles + shifts
379
+ rot_angles_list.append(np.expand_dims(rot_angles, axis=0))
380
+ rot_angles = np.concatenate(rot_angles_list, axis=0)
381
+ rot_angles = extend_key_frame_to_all(rot_angles, loc_key_inds, 'random')
382
+ rot_angles = rot_angles.transpose((1, 0, 2))
383
+
384
+
385
+ return rot_angles
386
+
387
+ def get_shear_factors(self,
388
+ num_frames,
389
+ transform_param: dict):
390
+ key_frame_probs = transform_param['key_frame_probs']
391
+ loc_key_inds = sample_key_frames(num_frames, key_frame_probs)
392
+
393
+ #print("Loc key inds shear",loc_key_inds)
394
+
395
+ rot_velocity = transform_param['rot_velocity']
396
+ rot_angles = np.zeros((transform_param['traj_rois'].shape[0],1))
397
+
398
+ #print("rotation angles original",rot_angles.shape,loc_key_inds)
399
+ rot_angles_list= [np.expand_dims(rot_angles, axis=0)]
400
+ for i in range(len(loc_key_inds) - 1):
401
+ if rot_velocity > 0:
402
+ index_diff = loc_key_inds[i + 1] - loc_key_inds[i]
403
+ shifts = np.random.uniform(low=-rot_velocity* index_diff,
404
+ high=rot_velocity* index_diff,
405
+ size=rot_angles.shape)
406
+ #scales = np.exp(shifts)
407
+ #print("shifts shear", shifts)
408
+ #rot_angles = scales
409
+ rot_angles = rot_angles + shifts
410
+ rot_angles_list.append(np.expand_dims(rot_angles, axis=0))
411
+ rot_angles = np.concatenate(rot_angles_list, axis=0)
412
+ rot_angles = extend_key_frame_to_all(rot_angles, loc_key_inds, 'random')
413
+ rot_angles = rot_angles.transpose((1, 0, 2))
414
+
415
+ return rot_angles
416
+
417
+
418
+ def _apply_image(self,
419
+ data: List[np.ndarray],
420
+ transform_param: dict):
421
+
422
+ data_1 = data
423
+
424
+ # sort by patch size and paste the larger patches first;
425
+ # if a small patch were pasted first, it could be
426
+ # completely covered by a larger one.
427
+ sizes = transform_param['traj_rois'][..., 2:4] - \
428
+ transform_param['traj_rois'][..., 0:2]
429
+ avg_sizes = np.prod(np.mean(sizes, axis=1), axis=1)
430
+ arg_rank = np.argsort(avg_sizes)[::-1]
431
+
432
+ width, height,_ = data_1[0].shape
433
+ #print(width,height)
434
+
435
+
436
+ if self.use_objects:
437
+
438
+ if transform_param['patch_transformation'] == 'rotation':
439
+ rot_angles = self.get_rotation_angles(len(data_1),transform_param)
440
+ transformed_data_1 = []
441
+ for frame_idx in range(len(data_1)):
442
+ i_rois = transform_param['traj_rois'][:, frame_idx, :]
443
+ img = data_1[frame_idx].copy()
444
+ for patch_idx in arg_rank:
445
+ if not transform_param['traj_labels'][patch_idx][frame_idx]:
446
+ continue
447
+ i_object = transform_param['patches'][patch_idx][frame_idx] # here patches are objects
448
+ i_object = np.array(i_object)
449
+ angle = int(rot_angles[patch_idx][frame_idx])
450
+ rotated_i_object = imutils.rotate_bound(i_object, angle)
451
+
452
+ rotated_i_alpha = rotated_i_object[..., -1]
453
+ rotated_i_alpha = rotated_i_alpha / 255.0
454
+ rotated_i_object = rotated_i_object[..., :3]
455
+
456
+ h_prime, w_prime, channels = rotated_i_object.shape
457
+ x1, y1, x2, y2 = i_rois[patch_idx]
458
+ h, w = y2 - y1, x2 - x1
459
+ if ((h_prime - h) % 2) == 0:
460
+ delta_h1 = delta_h2 = math.ceil((h_prime - h) / 2)
461
+ else:
462
+ delta_h1 = math.ceil((h_prime - h) / 2)
463
+ delta_h2 = math.floor((h_prime - h) / 2)
464
+ if ((w_prime - w) % 2) == 0:
465
+ delta_w1 = delta_w2 = math.ceil((w_prime - w) / 2)
466
+ else:
467
+ delta_w1 = math.ceil((w_prime - w) / 2)
468
+ delta_w2 = math.floor((w_prime - w) / 2)
469
+
470
+ x1_new, y1_new, x2_new, y2_new = x1 - delta_w1, y1 - delta_h1, x2 + delta_w2, y2 + delta_h2
471
+ if all(i >= 0 for i in [x1_new, y1_new, x2_new, y2_new]) and all(
472
+ i < width for i in [x1_new, y1_new, x2_new, y2_new]):
473
+ # in bound
474
+ i_patch = rotated_i_object
475
+ i_alpha = rotated_i_alpha[..., np.newaxis]
476
+ img[y1_new:y2_new, x1_new:x2_new, :] = img[y1_new:y2_new, x1_new:x2_new, :] * (1 - i_alpha) + i_patch * i_alpha
477
+ else:
478
+ # out of bound
479
+ img_H, img_W, C = img.shape
480
+ patch_H, patch_W, _ = rotated_i_object.shape
481
+ extended_img = np.zeros((img_H + 2 * patch_H, img_W + 2 * patch_W, C), dtype=img.dtype)
482
+ extended_img[patch_H:(img_H + patch_H), patch_W:(img_W + patch_W), :] = img
483
+
484
+ x1_new += patch_W
485
+ x2_new += patch_W
486
+ y1_new += patch_H
487
+ y2_new += patch_H
488
+ i_alpha = rotated_i_alpha[..., np.newaxis]
489
+ extended_img[y1_new:y2_new, x1_new:x2_new, :] = extended_img[y1_new:y2_new, x1_new:x2_new, :] * (1 - i_alpha) + rotated_i_object * i_alpha
490
+ img = extended_img[patch_H:(img_H + patch_H), patch_W:(img_W + patch_W), :]
491
+
492
+ img = np.array(img)
493
+ transformed_data_1.append(img)
494
+
495
+ return transformed_data_1
496
+
497
+
498
+ @staticmethod
499
+ def rectangle_movement(boxes: np.ndarray,
500
+ img_wh: tuple,
501
+ loc_velocity: float,
502
+ size_velocity: float,
503
+ num_frames: int,
504
+ key_frame_probs: List[float]) -> np.ndarray:
505
+ """ Simulate the object movement.
506
+
507
+ Args:
508
+ boxes (np.ndarray): in shape of [N_boxes, 4]
509
+ img_wh (tuple): image width and image height
510
+ loc_velocity (float): max speed of the center point movement
511
+ size_velocity (float): max speed of size changes
512
+ num_frames (int): number of frames
513
+ key_frame_probs (List[float]): probability distribution of how many key
514
+ frames will be sampled.
515
+
516
+ Returns:
517
+ all_boxes (np.ndarray): the generated box trajectory, in shape
518
+ of [N_traj, N_frame, 4].
519
+
520
+ """
521
+ # Step 1, sample key frames for location changes
522
+ loc_key_inds = sample_key_frames(num_frames, key_frame_probs)
523
+ # Step 2, decide box locations in key frames
524
+ ctr_pts = (boxes[:, 0:2] + boxes[:, 2:4]) * 0.5
525
+ #print("center points original",ctr_pts)
526
+ box_sizes = (boxes[:, 2:4] - boxes[:, 0:2])
527
+ #print("box sizes = ",box_sizes,box_sizes.shape)
528
+
529
+ min_ctr_pts = box_sizes * 0.5
530
+ max_ctr_pts = np.array(img_wh[0:2]).reshape(1, 2) - box_sizes * 0.5
531
+
532
+ #print("initial center points ",ctr_pts,loc_key_inds)
533
+ ctr_pts_list = [np.expand_dims(ctr_pts, axis=0)]
534
+ #print("ctr pts list",ctr_pts_list)
535
+ for i in range(len(loc_key_inds) - 1):
536
+ if loc_velocity > 0:
537
+ index_diff = loc_key_inds[i + 1] - loc_key_inds[i]
538
+ shifts = np.random.uniform(low=-loc_velocity * index_diff,
539
+ high=loc_velocity * index_diff,
540
+ size=ctr_pts.shape)
541
+ #print("shifts",shifts)
542
+ ctr_pts = ctr_pts + shifts
543
+ ctr_pts = np.clip(ctr_pts, min_ctr_pts, max_ctr_pts)
544
+ ctr_pts_list.append(np.expand_dims(ctr_pts, axis=0))
545
+ ctr_pts = np.concatenate(ctr_pts_list, axis=0)
546
+
547
+ ctr_pts = extend_key_frame_to_all(ctr_pts, loc_key_inds, 'random')
548
+ #print("all center points ",ctr_pts,ctr_pts.shape)
549
+
550
+ # Step 3, sample key frames for shape changes
551
+ size_key_inds = sample_key_frames(num_frames, key_frame_probs)
552
+
553
+ # Step 4, setup shape in different key frames
554
+ box_sizes_list = [np.expand_dims(box_sizes, axis=0)]
555
+ for i in range(len(size_key_inds) - 1):
556
+ if size_velocity > 0:
557
+ index_diff = size_key_inds[i + 1] - size_key_inds[i]
558
+ scales = np.random.uniform(low=-size_velocity * index_diff,
559
+ high=size_velocity * index_diff,
560
+ size=box_sizes.shape)
561
+ scales = np.exp(scales)
562
+ box_sizes = box_sizes * scales
563
+ box_sizes_list.append(np.expand_dims(box_sizes, axis=0))
564
+ box_sizes = np.concatenate(box_sizes_list, axis=0)
565
+ # print("box sizes before interpolation",box_sizes,size_key_inds)
566
+ box_sizes = extend_key_frame_to_all(box_sizes, size_key_inds, 'random')
567
+ #print("box sizes after interpolation",box_sizes)
568
+
569
+ # Step 5, construct boxes in key frames
570
+ all_boxes = np.concatenate((ctr_pts - box_sizes * 0.5,
571
+ ctr_pts + box_sizes * 0.5), axis=2)
572
+ # all_boxes[..., 0::2] = np.clip(all_boxes[..., 0::2], 0, img_wh[0])
573
+ # all_boxes[..., 1::2] = np.clip(all_boxes[..., 1::2], 0, img_wh[1])
574
+ all_boxes = all_boxes.transpose((1, 0, 2))
575
+ return all_boxes
576
+
577
+ @staticmethod
578
+ def gaussian_movement(box_shapes: np.ndarray,
579
+ img_wh: tuple,
580
+ num_trajs: int,
581
+ size_velocity: float,
582
+ num_frames: int,
583
+ key_frame_probs: List[float]) -> np.ndarray:
584
+ """ Simulate the object movement.
585
+
586
+ Args:
587
+
588
+ Returns:
589
+ all_boxes (np.ndarray): the generated box trajectory, in shape
590
+ of [N_traj, N_frame, 4], plus the initial boxes in shape [N_traj, 4].
591
+
592
+ """
593
+
594
+ def create_traj(box_shapes):
595
+ w = img_wh[0]
596
+ h = img_wh[1]
597
+ #print("gaussian",w,h)
598
+
599
+ n_points = 48 # how many points to create trajectory
600
+ sigma = 8 # bigger sigma -> smoother trajectory
601
+
602
+ # simulate trajectory points
603
+ #x = np.random.uniform(0,112,n_points)
604
+ #y = np.random.uniform(0,112,n_points)
605
+
606
+ # for 112 x 112
607
+ x = np.random.uniform(1+box_shapes[0]/2,w-1-box_shapes[0]/2,n_points)
608
+ y = np.random.uniform(1+box_shapes[1]/2,h-1-box_shapes[1]/2,n_points)
609
+
610
+ # for 224x 224
611
+ # x = np.random.uniform(0,112,n_points)
612
+ # y = np.random.uniform(0,112,n_points)
613
+
614
+ # smooth trajectory
615
+ xk = gaussian_filter1d(x, sigma=sigma, mode='reflect')
616
+ yk = gaussian_filter1d(y, sigma=sigma, mode='reflect')
617
+
618
+ # normalize and random scale
619
+ xkk = (xk -xk.min())
620
+ xkk /= xkk.max()
621
+ ykk = (yk -yk.min())
622
+ ykk /= ykk.max()
623
+
624
+ #scaling_factor = np.random.randint(20,90)
625
+ scaling_factor = np.random.randint(40,180)
626
+ xkk*=scaling_factor # randomize
627
+ ykk*=scaling_factor # randomize
628
+
629
+
630
+ # random translate and clip
631
+ translation_factor_x = np.random.randint(0,w-scaling_factor)
632
+ translation_factor_y = np.random.randint(0,h-scaling_factor)
633
+ tr_x = xkk + translation_factor_x
634
+ tr_y = ykk + translation_factor_y
635
+
636
+ tr_x = np.clip(tr_x,0,w-1)
637
+ tr_y = np.clip(tr_y,0,h-1)
638
+
639
+ # sample 16 points from trajectory with linear spacing
640
+ idxs = np.round(np.linspace(0, tr_x.shape[0]-1, num=16)).astype(int)
641
+ x_f = tr_x[idxs].astype(int)
642
+ y_f = tr_y[idxs].astype(int)
643
+ #print(x_f.shape,y_f.shape)
644
+ traj = np.column_stack((x_f,y_f))
645
+ traj = np.expand_dims(traj, axis=1)
646
+ return traj
647
+
648
+ # Step 1 create a non-linear trajectory
649
+ #print(" number of rois",num_trajs,box_shapes.shape)
650
+ ctr_pts_list = []
651
+ for i in range(num_trajs):
652
+ ctr_pts_list.append(create_traj(box_shapes[i]))
653
+ ctr_pts = np.concatenate(ctr_pts_list, axis=1)
654
+ #print("all center points guassian ",ctr_pts,ctr_pts.shape)
655
+
656
+ # Step 2 create box shapes for the starting location
657
+
658
+ boxes_list = []
659
+ for i in range(num_trajs):
660
+ x1, y1 = ctr_pts[0][i][0], ctr_pts[0][i][1]
661
+ box = np.concatenate((
662
+ (x1 - box_shapes[i, 0]/2).reshape(-1, 1),
663
+ (y1 - box_shapes[i, 1]/2).reshape(-1, 1),
664
+ (x1 + box_shapes[i, 0]/2).reshape(-1, 1),
665
+ (y1 + box_shapes[i, 1]/2).reshape(-1, 1)),
666
+ axis=1)
667
+ boxes_list.append(box)
668
+
669
+ boxes= np.concatenate(boxes_list, axis=0)
670
+ box_sizes = (boxes[:, 2:4] - boxes[:, 0:2])
671
+ #print("bboxes guassian ",boxes,boxes.shape)
672
+ #print("guassian box sizes = ",box_sizes,box_sizes.shape)
673
+
674
+ # Step 3, sample key frames for shape changes
675
+ size_key_inds = sample_key_frames(num_frames, key_frame_probs)
676
+ # Step 4, setup shape in different key frames
677
+ box_sizes_list = [np.expand_dims(box_sizes, axis=0)]
678
+ for i in range(len(size_key_inds) - 1):
679
+ if size_velocity > 0:
680
+ index_diff = size_key_inds[i + 1] - size_key_inds[i]
681
+ scales = np.random.uniform(low=-size_velocity * index_diff,
682
+ high=size_velocity * index_diff,
683
+ size=box_sizes.shape)
684
+ scales = np.exp(scales)
685
+ box_sizes = box_sizes * scales
686
+ box_sizes_list.append(np.expand_dims(box_sizes, axis=0))
687
+ box_sizes = np.concatenate(box_sizes_list, axis=0)
688
+ # print("box sizes before interpolation",box_sizes)
689
+ box_sizes = extend_key_frame_to_all(box_sizes, size_key_inds, 'random')
690
+ #print("box sizes after interpolation",box_sizes)
691
+
692
+ # Step 5, construct boxes in key frames
693
+ all_boxes = np.concatenate((ctr_pts - box_sizes * 0.5,
694
+ ctr_pts + box_sizes * 0.5), axis=2)
695
+ # all_boxes[..., 0::2] = np.clip(all_boxes[..., 0::2], 0, img_wh[0])
696
+ # all_boxes[..., 1::2] = np.clip(all_boxes[..., 1::2], 0, img_wh[1])
697
+ all_boxes = all_boxes.transpose((1, 0, 2))
698
+ return all_boxes,boxes
699
+
700
+ def __call__(self,img_tuple):
701
+ #def get_transform_param(self, data: List[np.ndarray], *args, **kwargs):
702
+ """ Generate the transformation parameters.
703
+
704
+ Args:
705
+ data (List[np.ndarray]): list of image array, each element is in
706
+ a shape of [H, W, 3]
707
+
708
+ Returns:
709
+ params (dict): a dict that contains necessary transformation
710
+ params, which include:
711
+ 'patches': list of image patches (np.ndarray)
712
+ 'alphas': list of alpha mask, same size and shape as patches.
713
+ 'traj_rois': the trajectory position, in shape of
714
+ [N_traj, N_frame, 4]
715
+ 'traj_labels': whether the patches have been pasted on some
716
+ specific frames, in shape of [N_traj, N_frame]
717
+ """
718
+
719
+ #print("with tubelets")
720
+
721
+ img_group, label = img_tuple
722
+
723
+ #print("before length data",len(img_group),img_group[0].size)
724
+
725
+ new_data = [np.array(img) for img in img_group]
726
+
727
+ #print("after length data",len(new_data),new_data[0].shape)
728
+
729
+ data_1 = new_data # Step 1, generate the trajectories.
730
+
731
+ h, w = data_1[0].shape[0:2]
732
+
733
+ #print("motion type and size_velocity", self.motion_type,self.size_velocity)
734
+ #print(" patch transformation and rotation velocity =",self.patch_transformation,self.rot_velocity)
735
+ if self.motion_type == 'linear' :
736
+
737
+ boxes = self.region_sampler.sample(data_1)
738
+
739
+ traj_rois = self.rectangle_movement(boxes, (w, h),
740
+ self.loc_velocity,
741
+ self.size_velocity,
742
+ len(data_1),
743
+ self.key_frame_probs)
744
+ # gaussian
745
+ elif self.motion_type == 'gaussian' :
746
+
747
+ box_shapes = self.region_sampler.sample_box_shapes(data_1)
748
+
749
+ traj_rois,boxes = self.gaussian_movement(box_shapes, (w, h),
750
+ self.region_sampler.num_rois,
751
+ self.size_velocity,
752
+ len(data_1),
753
+ self.key_frame_probs)
754
+
755
+ #print("gaussian rois",traj_rois.shape)
756
+ traj_rois = np.round(traj_rois).astype(int)
757
+ # traj_rois[..., 0::2] = np.clip(traj_rois[..., 0::2], 0, w)
758
+ # traj_rois[..., 1::2] = np.clip(traj_rois[..., 1::2], 0, h)
759
+
760
+ # Step 2, crop the patches and prepare the alpha masks.
761
+ if not self.use_objects:
762
+
763
+ #print(" pasting patches")
764
+ patches_list, alphas_list, label_list = self.paste_patches(data_1,traj_rois,boxes)
765
+ else:
766
+ #print(" pasting objects")
767
+ patches_list, alphas_list, label_list = self.paste_objects(data_1,traj_rois,boxes)
768
+
769
+
770
+
771
+ transforms_dict = dict(
772
+ traj_rois=traj_rois,
773
+ patches=patches_list,
774
+ alphas=alphas_list,
775
+ traj_labels=label_list,
776
+ rot_velocity = self.rot_velocity,
777
+ patch_transformation = self.patch_transformation,
778
+ key_frame_probs = self.key_frame_probs
779
+ )
780
+
781
+ output_data = self._apply_image( new_data,transforms_dict)
782
+
783
+ ret_data = [Image.fromarray(img) for img in output_data]
784
+
785
+ return ret_data, label, traj_rois
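As a rough illustration of the trajectory generation in `PatchMask.gaussian_movement` above, the sketch below smooths random control points with `gaussian_filter1d` and subsamples one center per frame. It is a simplified stand-in (the scaling, translation and clipping steps of `create_traj` are omitted), and the frame size and point counts are illustrative defaults.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def toy_trajectory(img_w=224, img_h=224, n_points=48, n_frames=16, sigma=8):
    # Random control points inside the frame.
    x = np.random.uniform(0, img_w, n_points)
    y = np.random.uniform(0, img_h, n_points)
    # Smooth them into a continuous, non-linear path (bigger sigma -> smoother).
    xs = gaussian_filter1d(x, sigma=sigma, mode='reflect')
    ys = gaussian_filter1d(y, sigma=sigma, mode='reflect')
    # Keep one (x, y) center per frame, linearly spaced along the path.
    idxs = np.round(np.linspace(0, n_points - 1, num=n_frames)).astype(int)
    return np.column_stack((xs[idxs], ys[idxs]))  # shape [n_frames, 2]

centers = toy_trajectory()
print(centers.shape)  # (16, 2) -- one patch center per frame
```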
transforms.py ADDED
@@ -0,0 +1,206 @@
1
+ import torch
2
+ import torchvision.transforms.functional as F
3
+ import warnings
4
+ import random
5
+ import numpy as np
6
+ import torchvision
7
+ from PIL import Image, ImageOps
8
+ import numbers
9
+
10
+
11
+ class GroupRandomCrop(object):
12
+ def __init__(self, size):
13
+ if isinstance(size, numbers.Number):
14
+ self.size = (int(size), int(size))
15
+ else:
16
+ self.size = size
17
+
18
+ def __call__(self, img_tuple):
19
+ img_group, label = img_tuple
20
+
21
+ w, h = img_group[0].size
22
+ th, tw = self.size
23
+
24
+ out_images = list()
25
+
26
+ x1 = random.randint(0, w - tw)
27
+ y1 = random.randint(0, h - th)
28
+
29
+ for img in img_group:
30
+ assert(img.size[0] == w and img.size[1] == h)
31
+ if w == tw and h == th:
32
+ out_images.append(img)
33
+ else:
34
+ out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))
35
+
36
+ return (out_images, label)
37
+
38
+
39
+ class GroupCenterCrop(object):
40
+ def __init__(self, size):
41
+ self.worker = torchvision.transforms.CenterCrop(size)
42
+
43
+ def __call__(self, img_tuple):
44
+ img_group, label = img_tuple
45
+ return ([self.worker(img) for img in img_group], label)
46
+
47
+
48
+ class GroupNormalize(object):
49
+ def __init__(self, mean, std):
50
+ self.mean = mean
51
+ self.std = std
52
+
53
+ def __call__(self, tensor_tuple):
54
+ tensor, label = tensor_tuple
55
+ rep_mean = self.mean * (tensor.size()[0]//len(self.mean))
56
+ rep_std = self.std * (tensor.size()[0]//len(self.std))
57
+
58
+ # TODO: make efficient
59
+ for t, m, s in zip(tensor, rep_mean, rep_std):
60
+ t.sub_(m).div_(s)
61
+
62
+ return (tensor,label)
63
+
64
+
65
+ class GroupGrayScale(object):
66
+ def __init__(self, size):
67
+ self.worker = torchvision.transforms.Grayscale(size)
68
+
69
+ def __call__(self, img_tuple):
70
+ img_group, label = img_tuple
71
+ return ([self.worker(img) for img in img_group], label)
72
+
73
+
74
+ class GroupScale(object):
75
+ """ Rescales the input PIL.Image to the given 'size'.
76
+ 'size' will be the size of the smaller edge.
77
+ For example, if height > width, then image will be
78
+ rescaled to (size * height / width, size)
79
+ size: size of the smaller edge
80
+ interpolation: Default: PIL.Image.BILINEAR
81
+ """
82
+
83
+ def __init__(self, size, interpolation=Image.BILINEAR):
84
+ self.worker = torchvision.transforms.Resize(size, interpolation)
85
+
86
+ def __call__(self, img_tuple):
87
+ img_group, label = img_tuple
88
+ return ([self.worker(img) for img in img_group], label)
89
+
90
+
91
+ class GroupMultiScaleCrop(object):
92
+
93
+ def __init__(self, input_size, scales=None, max_distort=1, fix_crop=True, more_fix_crop=True):
94
+ self.scales = scales if scales is not None else [1, .875, .75, .66]
95
+ self.max_distort = max_distort
96
+ self.fix_crop = fix_crop
97
+ self.more_fix_crop = more_fix_crop
98
+ self.input_size = input_size if not isinstance(input_size, int) else [input_size, input_size]
99
+ self.interpolation = Image.BILINEAR
100
+
101
+ def __call__(self, img_tuple):
102
+ img_group, label = img_tuple
103
+
104
+ im_size = img_group[0].size
105
+
106
+ crop_w, crop_h, offset_w, offset_h = self._sample_crop_size(im_size)
107
+ crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group]
108
+ ret_img_group = [img.resize((self.input_size[0], self.input_size[1]), self.interpolation) for img in crop_img_group]
109
+ return (ret_img_group, label)
110
+
111
+ def _sample_crop_size(self, im_size):
112
+ image_w, image_h = im_size[0], im_size[1]
113
+
114
+ # find a crop size
115
+ base_size = min(image_w, image_h)
116
+ crop_sizes = [int(base_size * x) for x in self.scales]
117
+ crop_h = [self.input_size[1] if abs(x - self.input_size[1]) < 3 else x for x in crop_sizes]
118
+ crop_w = [self.input_size[0] if abs(x - self.input_size[0]) < 3 else x for x in crop_sizes]
119
+
120
+ pairs = []
121
+ for i, h in enumerate(crop_h):
122
+ for j, w in enumerate(crop_w):
123
+ if abs(i - j) <= self.max_distort:
124
+ pairs.append((w, h))
125
+
126
+ crop_pair = random.choice(pairs)
127
+ if not self.fix_crop:
128
+ w_offset = random.randint(0, image_w - crop_pair[0])
129
+ h_offset = random.randint(0, image_h - crop_pair[1])
130
+ else:
131
+ w_offset, h_offset = self._sample_fix_offset(image_w, image_h, crop_pair[0], crop_pair[1])
132
+
133
+ return crop_pair[0], crop_pair[1], w_offset, h_offset
134
+
135
+ def _sample_fix_offset(self, image_w, image_h, crop_w, crop_h):
136
+ offsets = self.fill_fix_offset(self.more_fix_crop, image_w, image_h, crop_w, crop_h)
137
+ return random.choice(offsets)
138
+
139
+ @staticmethod
140
+ def fill_fix_offset(more_fix_crop, image_w, image_h, crop_w, crop_h):
141
+ w_step = (image_w - crop_w) // 4
142
+ h_step = (image_h - crop_h) // 4
143
+
144
+ ret = list()
145
+ ret.append((0, 0)) # upper left
146
+ ret.append((4 * w_step, 0)) # upper right
147
+ ret.append((0, 4 * h_step)) # lower left
148
+ ret.append((4 * w_step, 4 * h_step)) # lower right
149
+ ret.append((2 * w_step, 2 * h_step)) # center
150
+
151
+ if more_fix_crop:
152
+ ret.append((0, 2 * h_step)) # center left
153
+ ret.append((4 * w_step, 2 * h_step)) # center right
154
+ ret.append((2 * w_step, 4 * h_step)) # lower center
155
+ ret.append((2 * w_step, 0 * h_step)) # upper center
156
+
157
+ ret.append((1 * w_step, 1 * h_step)) # upper left quarter
158
+ ret.append((3 * w_step, 1 * h_step)) # upper right quarter
159
+ ret.append((1 * w_step, 3 * h_step)) # lower left quarter
160
+ ret.append((3 * w_step, 3 * h_step)) # lower right quarter
161
+ return ret
162
+
163
+
164
+ class Stack(object):
165
+
166
+ def __init__(self, roll=False):
167
+ self.roll = roll
168
+
169
+ def __call__(self, img_tuple):
170
+ img_group, label = img_tuple
171
+
172
+ if img_group[0].mode == 'L':
173
+ return (np.concatenate([np.expand_dims(x, 2) for x in img_group], axis=2), label)
174
+ elif img_group[0].mode == 'RGB':
175
+ if self.roll:
176
+ return (np.concatenate([np.array(x)[:, :, ::-1] for x in img_group], axis=2), label)
177
+ else:
178
+ return (np.concatenate(img_group, axis=2), label)
179
+
180
+
181
+ class ToTorchFormatTensor(object):
182
+ """ Converts a PIL.Image (RGB) or numpy.ndarray (H x W x C) in the range [0, 255]
183
+ to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] """
184
+ def __init__(self, div=True):
185
+ self.div = div
186
+
187
+ def __call__(self, pic_tuple):
188
+ pic, label = pic_tuple
189
+
190
+ if isinstance(pic, np.ndarray):
191
+ # handle numpy array
192
+ img = torch.from_numpy(pic).permute(2, 0, 1).contiguous()
193
+ else:
194
+ # handle PIL Image
195
+ img = torch.ByteTensor(torch.ByteStorage.from_buffer(pic.tobytes()))
196
+ img = img.view(pic.size[1], pic.size[0], len(pic.mode))
197
+ # put it from HWC to CHW format
198
+ # yikes, this transpose takes 80% of the loading time/CPU
199
+ img = img.transpose(0, 1).transpose(0, 2).contiguous()
200
+ return (img.float().div(255.) if self.div else img.float(), label)
201
+
202
+
203
+ class IdentityTransform(object):
204
+
205
+ def __call__(self, data):
206
+ return data
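The group transforms above all pass `(img_group, label)` tuples through, so they can be chained with a plain `torchvision.transforms.Compose`. The sketch below assumes `transforms.py` is importable from the repo root and uses dummy PIL frames; the exact pipeline used for training is defined elsewhere in the codebase.

```python
from PIL import Image
from torchvision.transforms import Compose

# Assumes this file (transforms.py) sits on the Python path, e.g. the repo root.
from transforms import (GroupScale, GroupCenterCrop, Stack,
                        ToTorchFormatTensor, GroupNormalize)

pipeline = Compose([
    GroupScale(256),                  # resize shorter side to 256
    GroupCenterCrop(224),             # 224x224 center crop per frame
    Stack(roll=False),                # concatenate frames along channels (H, W, 3*T)
    ToTorchFormatTensor(div=True),    # HWC uint8 -> CHW float in [0, 1]
    GroupNormalize(mean=[0.485, 0.456, 0.406],
                   std=[0.229, 0.224, 0.225]),  # repeated over all 3*T channels
])

# Dummy clip: 16 RGB frames paired with a label, as the __call__ signatures expect.
frames = [Image.new('RGB', (320, 240)) for _ in range(16)]
clip_tensor, label = pipeline((frames, 0))
print(clip_tensor.shape)  # torch.Size([48, 224, 224])
```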
utils_mae.py ADDED
@@ -0,0 +1,536 @@
1
+ import io
2
+ import os
3
+ import math
4
+ import time
5
+ import json
6
+ from collections import defaultdict, deque
7
+ import datetime
8
+ import numpy as np
9
+ from timm.utils import get_state_dict
10
+ from torch.utils.data._utils.collate import default_collate
11
+ from pathlib import Path
12
+ import subprocess
13
+ import torch
14
+ import torch.distributed as dist
15
+ #from torch._six import inf
16
+ from torch import inf
17
+ import random
18
+
19
+ from tensorboardX import SummaryWriter
20
+
21
+
22
+ class SmoothedValue(object):
23
+ """Track a series of values and provide access to smoothed values over a
24
+ window or the global series average.
25
+ """
26
+
27
+ def __init__(self, window_size=20, fmt=None):
28
+ if fmt is None:
29
+ fmt = "{median:.4f} ({global_avg:.4f})"
30
+ self.deque = deque(maxlen=window_size)
31
+ self.total = 0.0
32
+ self.count = 0
33
+ self.fmt = fmt
34
+
35
+ def update(self, value, n=1):
36
+ self.deque.append(value)
37
+ self.count += n
38
+ self.total += value * n
39
+
40
+ def synchronize_between_processes(self):
41
+ """
42
+ Warning: does not synchronize the deque!
43
+ """
44
+ if not is_dist_avail_and_initialized():
45
+ return
46
+ t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
47
+ dist.barrier()
48
+ dist.all_reduce(t)
49
+ t = t.tolist()
50
+ self.count = int(t[0])
51
+ self.total = t[1]
52
+
53
+ @property
54
+ def median(self):
55
+ d = torch.tensor(list(self.deque))
56
+ return d.median().item()
57
+
58
+ @property
59
+ def avg(self):
60
+ d = torch.tensor(list(self.deque), dtype=torch.float32)
61
+ return d.mean().item()
62
+
63
+ @property
64
+ def global_avg(self):
65
+ return self.total / self.count
66
+
67
+ @property
68
+ def max(self):
69
+ return max(self.deque)
70
+
71
+ @property
72
+ def value(self):
73
+ return self.deque[-1]
74
+
75
+ def __str__(self):
76
+ return self.fmt.format(
77
+ median=self.median,
78
+ avg=self.avg,
79
+ global_avg=self.global_avg,
80
+ max=self.max,
81
+ value=self.value)
82
+
83
+
84
+ class MetricLogger(object):
85
+ def __init__(self, delimiter="\t"):
86
+ self.meters = defaultdict(SmoothedValue)
87
+ self.delimiter = delimiter
88
+
89
+ def update(self, **kwargs):
90
+ for k, v in kwargs.items():
91
+ if v is None:
92
+ continue
93
+ if isinstance(v, torch.Tensor):
94
+ v = v.item()
95
+ assert isinstance(v, (float, int))
96
+ self.meters[k].update(v)
97
+
98
+ def __getattr__(self, attr):
99
+ if attr in self.meters:
100
+ return self.meters[attr]
101
+ if attr in self.__dict__:
102
+ return self.__dict__[attr]
103
+ raise AttributeError("'{}' object has no attribute '{}'".format(
104
+ type(self).__name__, attr))
105
+
106
+ def __str__(self):
107
+ loss_str = []
108
+ for name, meter in self.meters.items():
109
+ loss_str.append(
110
+ "{}: {}".format(name, str(meter))
111
+ )
112
+ return self.delimiter.join(loss_str)
113
+
114
+ def synchronize_between_processes(self):
115
+ for meter in self.meters.values():
116
+ meter.synchronize_between_processes()
117
+
118
+ def add_meter(self, name, meter):
119
+ self.meters[name] = meter
120
+
121
+ def log_every(self, iterable, print_freq, header=None):
122
+ i = 0
123
+ if not header:
124
+ header = ''
125
+ start_time = time.time()
126
+ end = time.time()
127
+ iter_time = SmoothedValue(fmt='{avg:.4f}')
128
+ data_time = SmoothedValue(fmt='{avg:.4f}')
129
+ space_fmt = ':' + str(len(str(len(iterable)))) + 'd'
130
+ log_msg = [
131
+ header,
132
+ '[{0' + space_fmt + '}/{1}]',
133
+ 'eta: {eta}',
134
+ '{meters}',
135
+ 'time: {time}',
136
+ 'data: {data}'
137
+ ]
138
+ if torch.cuda.is_available():
139
+ log_msg.append('max mem: {memory:.0f}')
140
+ log_msg = self.delimiter.join(log_msg)
141
+ MB = 1024.0 * 1024.0
142
+ for obj in iterable:
143
+ data_time.update(time.time() - end)
144
+ yield obj
145
+ iter_time.update(time.time() - end)
146
+ if i % print_freq == 0 or i == len(iterable) - 1:
147
+ eta_seconds = iter_time.global_avg * (len(iterable) - i)
148
+ eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
149
+ if torch.cuda.is_available():
150
+ print(log_msg.format(
151
+ i, len(iterable), eta=eta_string,
152
+ meters=str(self),
153
+ time=str(iter_time), data=str(data_time),
154
+ memory=torch.cuda.max_memory_allocated() / MB))
155
+ else:
156
+ print(log_msg.format(
157
+ i, len(iterable), eta=eta_string,
158
+ meters=str(self),
159
+ time=str(iter_time), data=str(data_time)))
160
+ i += 1
161
+ end = time.time()
162
+ total_time = time.time() - start_time
163
+ total_time_str = str(datetime.timedelta(seconds=int(total_time)))
164
+ print('{} Total time: {} ({:.4f} s / it)'.format(
165
+ header, total_time_str, total_time / len(iterable)))
166
+
167
+
168
+ class TensorboardLogger(object):
169
+ def __init__(self, log_dir):
170
+ self.writer = SummaryWriter(logdir=log_dir)
171
+ self.step = 0
172
+
173
+ def set_step(self, step=None):
174
+ if step is not None:
175
+ self.step = step
176
+ else:
177
+ self.step += 1
178
+
179
+ def update(self, head='scalar', step=None, **kwargs):
180
+ for k, v in kwargs.items():
181
+ if v is None:
182
+ continue
183
+ if isinstance(v, torch.Tensor):
184
+ v = v.item()
185
+ assert isinstance(v, (float, int))
186
+ self.writer.add_scalar(head + "/" + k, v, self.step if step is None else step)
187
+
188
+ def flush(self):
189
+ self.writer.flush()
190
+
191
+ def seed_worker(worker_id):
192
+ worker_seed = torch.initial_seed() % 2**32
193
+ np.random.seed(worker_seed)
194
+ random.seed(worker_seed)
195
+
196
+ def _load_checkpoint_for_ema(model_ema, checkpoint):
197
+ """
198
+ Workaround for ModelEma._load_checkpoint to accept an already-loaded object
199
+ """
200
+ mem_file = io.BytesIO()
201
+ torch.save(checkpoint, mem_file)
202
+ mem_file.seek(0)
203
+ model_ema._load_checkpoint(mem_file)
204
+
205
+
206
+ def setup_for_distributed(is_master):
207
+ """
208
+ This function disables printing when not in master process
209
+ """
210
+ import builtins as __builtin__
211
+ builtin_print = __builtin__.print
212
+
213
+ def print(*args, **kwargs):
214
+ force = kwargs.pop('force', False)
215
+ if is_master or force:
216
+ builtin_print(*args, **kwargs)
217
+
218
+ __builtin__.print = print
219
+
220
+
221
+ def is_dist_avail_and_initialized():
222
+ if not dist.is_available():
223
+ return False
224
+ if not dist.is_initialized():
225
+ return False
226
+ return True
227
+
228
+
229
+ def get_world_size():
230
+ if not is_dist_avail_and_initialized():
231
+ return 1
232
+ return dist.get_world_size()
233
+
234
+
235
+ def get_rank():
236
+ if not is_dist_avail_and_initialized():
237
+ return 0
238
+ return dist.get_rank()
239
+
240
+
241
+ def is_main_process():
242
+ return get_rank() == 0
243
+
244
+
245
+ def save_on_master(*args, **kwargs):
246
+ if is_main_process():
247
+ torch.save(*args, **kwargs)
248
+
249
+
250
+ def init_distributed_mode(args):
251
+ if args.dist_on_itp:
252
+ args.rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
253
+ args.world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
254
+ args.gpu = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
255
+ args.dist_url = "tcp://%s:%s" % (os.environ['MASTER_ADDR'], os.environ['MASTER_PORT'])
256
+ os.environ['LOCAL_RANK'] = str(args.gpu)
257
+ os.environ['RANK'] = str(args.rank)
258
+ os.environ['WORLD_SIZE'] = str(args.world_size)
259
+ elif 'SLURM_PROCID' in os.environ:
260
+ args.rank = int(os.environ['SLURM_PROCID'])
261
+ args.gpu = int(os.environ['SLURM_LOCALID'])
262
+ args.world_size = int(os.environ['SLURM_NTASKS'])
263
+ os.environ['RANK'] = str(args.rank)
264
+ os.environ['LOCAL_RANK'] = str(args.gpu)
265
+ os.environ['WORLD_SIZE'] = str(args.world_size)
266
+
267
+ node_list = os.environ['SLURM_NODELIST']
268
+ addr = subprocess.getoutput(
269
+ f'scontrol show hostname {node_list} | head -n1')
270
+ if 'MASTER_ADDR' not in os.environ:
271
+ os.environ['MASTER_ADDR'] = addr
272
+ elif 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
273
+ args.rank = int(os.environ["RANK"])
274
+ args.world_size = int(os.environ['WORLD_SIZE'])
275
+ args.gpu = int(os.environ['LOCAL_RANK'])
276
+ else:
277
+ print('Not using distributed mode')
278
+ args.distributed = False
279
+ return
280
+
281
+ args.distributed = True
282
+
283
+ torch.cuda.set_device(args.gpu)
284
+ args.dist_backend = 'nccl'
285
+ print('| distributed init (rank {}): {}, gpu {}'.format(
286
+ args.rank, args.dist_url, args.gpu), flush=True)
287
+ torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
288
+ world_size=args.world_size, rank=args.rank)
289
+ torch.distributed.barrier()
290
+ # assert torch.distributed.is_initialized()
291
+ setup_for_distributed(args.rank == 0)
292
+
293
+
294
+ def load_state_dict(model, state_dict, prefix='', ignore_missing="relative_position_index"):
295
+ missing_keys = []
296
+ unexpected_keys = []
297
+ error_msgs = []
298
+ metadata = getattr(state_dict, '_metadata', None)
299
+ state_dict = state_dict.copy()
300
+ if metadata is not None:
301
+ state_dict._metadata = metadata
302
+
303
+ def load(module, prefix=''):
304
+ local_metadata = {} if metadata is None else metadata.get(
305
+ prefix[:-1], {})
306
+ module._load_from_state_dict(
307
+ state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
308
+ for name, child in module._modules.items():
309
+ if child is not None:
310
+ load(child, prefix + name + '.')
311
+
312
+ load(model, prefix=prefix)
313
+
314
+ warn_missing_keys = []
315
+ ignore_missing_keys = []
316
+ for key in missing_keys:
317
+ keep_flag = True
318
+ for ignore_key in ignore_missing.split('|'):
319
+ if ignore_key in key:
320
+ keep_flag = False
321
+ break
322
+ if keep_flag:
323
+ warn_missing_keys.append(key)
324
+ else:
325
+ ignore_missing_keys.append(key)
326
+
327
+ missing_keys = warn_missing_keys
328
+
329
+ if len(missing_keys) > 0:
330
+ print("Weights of {} not initialized from pretrained model: {}".format(
331
+ model.__class__.__name__, missing_keys))
332
+ if len(unexpected_keys) > 0:
333
+ print("Weights from pretrained model not used in {}: {}".format(
334
+ model.__class__.__name__, unexpected_keys))
335
+ if len(ignore_missing_keys) > 0:
336
+ print("Ignored weights of {} not initialized from pretrained model: {}".format(
337
+ model.__class__.__name__, ignore_missing_keys))
338
+ if len(error_msgs) > 0:
339
+ print('\n'.join(error_msgs))
340
+
341
+
342
+ class NativeScalerWithGradNormCount:
343
+ state_dict_key = "amp_scaler"
344
+
345
+ def __init__(self):
346
+ self._scaler = torch.cuda.amp.GradScaler()
347
+
348
+ def __call__(self, loss, optimizer, clip_grad=None, parameters=None, create_graph=False, update_grad=True):
349
+ self._scaler.scale(loss).backward(create_graph=create_graph)
350
+ if update_grad:
351
+ if clip_grad is not None:
352
+ assert parameters is not None
353
+ self._scaler.unscale_(optimizer) # unscale the gradients of optimizer's assigned params in-place
354
+ norm = torch.nn.utils.clip_grad_norm_(parameters, clip_grad)
355
+ else:
356
+ self._scaler.unscale_(optimizer)
357
+ norm = get_grad_norm_(parameters)
358
+ self._scaler.step(optimizer)
359
+ self._scaler.update()
360
+ else:
361
+ norm = None
362
+ return norm
363
+
364
+ def state_dict(self):
365
+ return self._scaler.state_dict()
366
+
367
+ def load_state_dict(self, state_dict):
368
+ self._scaler.load_state_dict(state_dict)
369
+
370
+
371
+ def get_grad_norm_(parameters, norm_type: float = 2.0) -> torch.Tensor:
372
+ if isinstance(parameters, torch.Tensor):
373
+ parameters = [parameters]
374
+ parameters = [p for p in parameters if p.grad is not None]
375
+ norm_type = float(norm_type)
376
+ if len(parameters) == 0:
377
+ return torch.tensor(0.)
378
+ device = parameters[0].grad.device
379
+ if norm_type == inf:
380
+ total_norm = max(p.grad.detach().abs().max().to(device) for p in parameters)
381
+ else:
382
+ total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]), norm_type)
383
+ return total_norm
384
+
385
+
386
+ def cosine_scheduler(base_value, final_value, epochs, niter_per_ep, warmup_epochs=0,
387
+ start_warmup_value=0, warmup_steps=-1):
388
+ warmup_schedule = np.array([])
389
+ warmup_iters = warmup_epochs * niter_per_ep
390
+ if warmup_steps > 0:
391
+ warmup_iters = warmup_steps
392
+ print("Set warmup steps = %d" % warmup_iters)
393
+ if warmup_epochs > 0:
394
+ warmup_schedule = np.linspace(start_warmup_value, base_value, warmup_iters)
395
+
396
+ iters = np.arange(epochs * niter_per_ep - warmup_iters)
397
+ schedule = np.array(
398
+ [final_value + 0.5 * (base_value - final_value) * (1 + math.cos(math.pi * i / (len(iters)))) for i in iters])
399
+
400
+ schedule = np.concatenate((warmup_schedule, schedule))
401
+
402
+ assert len(schedule) == epochs * niter_per_ep
403
+ return schedule
404
+
405
+
406
+ def save_model(args, epoch, model, model_without_ddp, optimizer, loss_scaler, model_ema=None):
407
+ output_dir = Path(args.output_dir)
408
+ epoch_name = str(epoch)
409
+ if loss_scaler is not None:
410
+ checkpoint_paths = [output_dir / ('checkpoint-%s.pth' % epoch_name)]
411
+ for checkpoint_path in checkpoint_paths:
412
+ to_save = {
413
+ 'model': model_without_ddp.state_dict(),
414
+ 'optimizer': optimizer.state_dict(),
415
+ 'epoch': epoch,
416
+ 'scaler': loss_scaler.state_dict(),
417
+ 'args': args,
418
+ }
419
+
420
+ if model_ema is not None:
421
+ to_save['model_ema'] = get_state_dict(model_ema)
422
+
423
+ save_on_master(to_save, checkpoint_path)
424
+ else:
425
+ client_state = {'epoch': epoch}
426
+ if model_ema is not None:
427
+ client_state['model_ema'] = get_state_dict(model_ema)
428
+ model.save_checkpoint(save_dir=args.output_dir, tag="checkpoint-%s" % epoch_name, client_state=client_state)
429
+
430
+
431
+ def auto_load_model(args, model, model_without_ddp, optimizer, loss_scaler, model_ema=None):
432
+ output_dir = Path(args.output_dir)
433
+ if loss_scaler is not None:
434
+ # torch.amp
435
+ if args.auto_resume and len(args.resume) == 0:
436
+ import glob
437
+ all_checkpoints = glob.glob(os.path.join(output_dir, 'checkpoint-*.pth'))
438
+ latest_ckpt = -1
439
+ for ckpt in all_checkpoints:
440
+ t = ckpt.split('-')[-1].split('.')[0]
441
+ if t.isdigit():
442
+ latest_ckpt = max(int(t), latest_ckpt)
443
+ if latest_ckpt >= 0:
444
+ args.resume = os.path.join(output_dir, 'checkpoint-%d.pth' % latest_ckpt)
445
+ print("Auto resume checkpoint: %s" % args.resume)
446
+
447
+ if args.resume:
448
+ if args.resume.startswith('https'):
449
+ checkpoint = torch.hub.load_state_dict_from_url(
450
+ args.resume, map_location='cpu', check_hash=True)
451
+ else:
452
+ checkpoint = torch.load(args.resume, map_location='cpu')
453
+ model_without_ddp.load_state_dict(checkpoint['model'])
454
+ print("Resume checkpoint %s" % args.resume)
455
+ if 'optimizer' in checkpoint and 'epoch' in checkpoint:
456
+ optimizer.load_state_dict(checkpoint['optimizer'])
457
+ args.start_epoch = checkpoint['epoch'] + 1
458
+ if hasattr(args, 'model_ema') and args.model_ema:
459
+ _load_checkpoint_for_ema(model_ema, checkpoint['model_ema'])
460
+ if 'scaler' in checkpoint:
461
+ loss_scaler.load_state_dict(checkpoint['scaler'])
462
+ print("With optim & sched!")
463
+ else:
464
+ # DeepSpeed: only '--auto_resume' is supported.
465
+ if args.auto_resume:
466
+ import glob
467
+ all_checkpoints = glob.glob(os.path.join(output_dir, 'checkpoint-*'))
468
+ latest_ckpt = -1
469
+ for ckpt in all_checkpoints:
470
+ t = ckpt.split('-')[-1].split('.')[0]
471
+ if t.isdigit():
472
+ latest_ckpt = max(int(t), latest_ckpt)
473
+ if latest_ckpt >= 0:
474
+ args.resume = os.path.join(output_dir, 'checkpoint-%d' % latest_ckpt)
475
+ print("Auto resume checkpoint: %d" % latest_ckpt)
476
+ _, client_states = model.load_checkpoint(args.output_dir, tag='checkpoint-%d' % latest_ckpt)
477
+ args.start_epoch = client_states['epoch'] + 1
478
+ if model_ema is not None:
479
+ if args.model_ema:
480
+ _load_checkpoint_for_ema(model_ema, client_states['model_ema'])
481
+
482
+
483
+ def create_ds_config(args):
484
+ args.deepspeed_config = os.path.join(args.output_dir, "deepspeed_config.json")
485
+ with open(args.deepspeed_config, mode="w") as writer:
486
+ ds_config = {
487
+ "train_batch_size": args.batch_size * args.update_freq * get_world_size(),
488
+ "train_micro_batch_size_per_gpu": args.batch_size,
489
+ "steps_per_print": 1000,
490
+ "optimizer": {
491
+ "type": "Adam",
492
+ "adam_w_mode": True,
493
+ "params": {
494
+ "lr": args.lr,
495
+ "weight_decay": args.weight_decay,
496
+ "bias_correction": True,
497
+ "betas": [
498
+ 0.9,
499
+ 0.999
500
+ ],
501
+ "eps": 1e-8
502
+ }
503
+ },
504
+ "fp16": {
505
+ "enabled": True,
506
+ "loss_scale": 0,
507
+ "initial_scale_power": 7,
508
+ "loss_scale_window": 128
509
+ }
510
+ }
511
+
512
+ writer.write(json.dumps(ds_config, indent=2))
513
+
514
+ def multiple_samples_collate(batch, fold=False):
515
+ """
516
+ Collate function for repeated augmentation. Each instance in the batch has
517
+ more than one sample.
518
+ Args:
519
+ batch (tuple or list): data batch to collate.
520
+ Returns:
521
+ (tuple): collated data batch.
522
+ """
523
+ inputs, labels, video_idx, extra_data = zip(*batch)
524
+ inputs = [item for sublist in inputs for item in sublist]
525
+ labels = [item for sublist in labels for item in sublist]
526
+ video_idx = [item for sublist in video_idx for item in sublist]
527
+ inputs, labels, video_idx, extra_data = (
528
+ default_collate(inputs),
529
+ default_collate(labels),
530
+ default_collate(video_idx),
531
+ default_collate(extra_data),
532
+ )
533
+ if fold:
534
+ return [inputs], labels, video_idx, extra_data
535
+ else:
536
+ return inputs, labels, video_idx, extra_data
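
A minimal sketch of plugging `multiple_samples_collate` into a `DataLoader`, assuming a dataset whose items are tuples of lists (one entry per repeated augmentation of the same video); `my_video_dataset` is a placeholder name:

```python
from functools import partial
from torch.utils.data import DataLoader

# `my_video_dataset` is hypothetical: each item is expected to be
# (inputs, labels, video_idx, extra_data), each a list with one entry
# per repeated sample of the same clip.
loader = DataLoader(
    my_video_dataset,
    batch_size=8,
    shuffle=True,
    collate_fn=partial(multiple_samples_collate, fold=False),
)
```
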
video_transforms.py ADDED
@@ -0,0 +1,1281 @@
1
+ #!/usr/bin/env python3
2
+ import math
3
+ import numpy as np
4
+ import random
5
+ import torch
6
+ import torchvision.transforms.functional as F
7
+ from PIL import Image
8
+ from torchvision import transforms
9
+
10
+ from rand_augment import rand_augment_transform
11
+ from random_erasing import RandomErasing
12
+
13
+
14
+ import numbers
15
+ import PIL
16
+ import torchvision
17
+
18
+ import functional as FF
19
+
20
+ _pil_interpolation_to_str = {
21
+ Image.NEAREST: "PIL.Image.NEAREST",
22
+ Image.BILINEAR: "PIL.Image.BILINEAR",
23
+ Image.BICUBIC: "PIL.Image.BICUBIC",
24
+ Image.LANCZOS: "PIL.Image.LANCZOS",
25
+ Image.HAMMING: "PIL.Image.HAMMING",
26
+ Image.BOX: "PIL.Image.BOX",
27
+ }
28
+
29
+
30
+ _RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
31
+
32
+
33
+ def _pil_interp(method):
34
+ if method == "bicubic":
35
+ return Image.BICUBIC
36
+ elif method == "lanczos":
37
+ return Image.LANCZOS
38
+ elif method == "hamming":
39
+ return Image.HAMMING
40
+ else:
41
+ return Image.BILINEAR
42
+
43
+
44
+ def random_short_side_scale_jitter(
45
+ images, min_size, max_size, boxes=None, inverse_uniform_sampling=False
46
+ ):
47
+ """
48
+ Perform a spatial short scale jittering on the given images and
49
+ corresponding boxes.
50
+ Args:
51
+ images (tensor): images to perform scale jitter. Dimension is
52
+ `num frames` x `channel` x `height` x `width`.
53
+ min_size (int): the minimal size to scale the frames.
54
+ max_size (int): the maximal size to scale the frames.
55
+ boxes (ndarray): optional. Corresponding boxes to images.
56
+ Dimension is `num boxes` x 4.
57
+ inverse_uniform_sampling (bool): if True, sample uniformly in
58
+ [1 / max_scale, 1 / min_scale] and take a reciprocal to get the
59
+ scale. If False, take a uniform sample from [min_scale, max_scale].
60
+ Returns:
61
+ (tensor): the scaled images with dimension of
62
+ `num frames` x `channel` x `new height` x `new width`.
63
+ (ndarray or None): the scaled boxes with dimension of
64
+ `num boxes` x 4.
65
+ """
66
+ if inverse_uniform_sampling:
67
+ size = int(
68
+ round(1.0 / np.random.uniform(1.0 / max_size, 1.0 / min_size))
69
+ )
70
+ else:
71
+ size = int(round(np.random.uniform(min_size, max_size)))
72
+
73
+ height = images.shape[2]
74
+ width = images.shape[3]
75
+ if (width <= height and width == size) or (
76
+ height <= width and height == size
77
+ ):
78
+ return images, boxes
79
+ new_width = size
80
+ new_height = size
81
+ if width < height:
82
+ new_height = int(math.floor((float(height) / width) * size))
83
+ if boxes is not None:
84
+ boxes = boxes * float(new_height) / height
85
+ else:
86
+ new_width = int(math.floor((float(width) / height) * size))
87
+ if boxes is not None:
88
+ boxes = boxes * float(new_width) / width
89
+
90
+ return (
91
+ torch.nn.functional.interpolate(
92
+ images,
93
+ size=(new_height, new_width),
94
+ mode="bilinear",
95
+ align_corners=False,
96
+ ),
97
+ boxes,
98
+ )
99
+
100
+
101
+ def crop_boxes(boxes, x_offset, y_offset):
102
+ """
103
+ Perform crop on the bounding boxes given the offsets.
104
+ Args:
105
+ boxes (ndarray or None): bounding boxes to perform crop. The dimension
106
+ is `num boxes` x 4.
107
+ x_offset (int): cropping offset in the x axis.
108
+ y_offset (int): cropping offset in the y axis.
109
+ Returns:
110
+ cropped_boxes (ndarray or None): the cropped boxes with dimension of
111
+ `num boxes` x 4.
112
+ """
113
+ cropped_boxes = boxes.copy()
114
+ cropped_boxes[:, [0, 2]] = boxes[:, [0, 2]] - x_offset
115
+ cropped_boxes[:, [1, 3]] = boxes[:, [1, 3]] - y_offset
116
+
117
+ return cropped_boxes
118
+
119
+
120
+ def random_crop(images, size, boxes=None):
121
+ """
122
+ Perform random spatial crop on the given images and corresponding boxes.
123
+ Args:
124
+ images (tensor): images to perform random crop. The dimension is
125
+ `num frames` x `channel` x `height` x `width`.
126
+ size (int): the size of height and width to crop on the image.
127
+ boxes (ndarray or None): optional. Corresponding boxes to images.
128
+ Dimension is `num boxes` x 4.
129
+ Returns:
130
+ cropped (tensor): cropped images with dimension of
131
+ `num frames` x `channel` x `size` x `size`.
132
+ cropped_boxes (ndarray or None): the cropped boxes with dimension of
133
+ `num boxes` x 4.
134
+ """
135
+ if images.shape[2] == size and images.shape[3] == size:
136
+ return images, boxes
137
+ height = images.shape[2]
138
+ width = images.shape[3]
139
+ y_offset = 0
140
+ if height > size:
141
+ y_offset = int(np.random.randint(0, height - size))
142
+ x_offset = 0
143
+ if width > size:
144
+ x_offset = int(np.random.randint(0, width - size))
145
+ cropped = images[
146
+ :, :, y_offset : y_offset + size, x_offset : x_offset + size
147
+ ]
148
+
149
+ cropped_boxes = (
150
+ crop_boxes(boxes, x_offset, y_offset) if boxes is not None else None
151
+ )
152
+
153
+ return cropped, cropped_boxes
154
+
155
+
156
+ def horizontal_flip(prob, images, boxes=None):
157
+ """
158
+ Perform horizontal flip on the given images and corresponding boxes.
159
+ Args:
160
+ prob (float): probability to flip the images.
161
+ images (tensor): images to perform horizontal flip, the dimension is
162
+ `num frames` x `channel` x `height` x `width`.
163
+ boxes (ndarray or None): optional. Corresponding boxes to images.
164
+ Dimension is `num boxes` x 4.
165
+ Returns:
166
+ images (tensor): images with dimension of
167
+ `num frames` x `channel` x `height` x `width`.
168
+ flipped_boxes (ndarray or None): the flipped boxes with dimension of
169
+ `num boxes` x 4.
170
+ """
171
+ if boxes is None:
172
+ flipped_boxes = None
173
+ else:
174
+ flipped_boxes = boxes.copy()
175
+
176
+ if np.random.uniform() < prob:
177
+ images = images.flip((-1))
178
+
179
+ if len(images.shape) == 3:
180
+ width = images.shape[2]
181
+ elif len(images.shape) == 4:
182
+ width = images.shape[3]
183
+ else:
184
+ raise NotImplementedError("Dimension does not supported")
185
+ if boxes is not None:
186
+ flipped_boxes[:, [0, 2]] = width - boxes[:, [2, 0]] - 1
187
+
188
+ return images, flipped_boxes
189
+
190
+
191
+ def uniform_crop(images, size, spatial_idx, boxes=None, scale_size=None):
192
+ """
193
+ Perform uniform spatial sampling on the images and corresponding boxes.
194
+ Args:
195
+ images (tensor): images to perform uniform crop. The dimension is
196
+ `num frames` x `channel` x `height` x `width`.
197
+ size (int): size of height and width to crop the images.
198
+ spatial_idx (int): 0, 1, or 2 for left, center, and right crop if width
199
+ is larger than height. Or 0, 1, or 2 for top, center, and bottom
200
+ crop if height is larger than width.
201
+ boxes (ndarray or None): optional. Corresponding boxes to images.
202
+ Dimension is `num boxes` x 4.
203
+ scale_size (int): optional. If not None, resize the images to scale_size before
204
+ performing any crop.
205
+ Returns:
206
+ cropped (tensor): images with dimension of
207
+ `num frames` x `channel` x `size` x `size`.
208
+ cropped_boxes (ndarray or None): the cropped boxes with dimension of
209
+ `num boxes` x 4.
210
+ """
211
+ assert spatial_idx in [0, 1, 2]
212
+ ndim = len(images.shape)
213
+ if ndim == 3:
214
+ images = images.unsqueeze(0)
215
+ height = images.shape[2]
216
+ width = images.shape[3]
217
+
218
+ if scale_size is not None:
219
+ if width <= height:
220
+ width, height = scale_size, int(height / width * scale_size)
221
+ else:
222
+ width, height = int(width / height * scale_size), scale_size
223
+ images = torch.nn.functional.interpolate(
224
+ images,
225
+ size=(height, width),
226
+ mode="bilinear",
227
+ align_corners=False,
228
+ )
229
+
230
+ y_offset = int(math.ceil((height - size) / 2))
231
+ x_offset = int(math.ceil((width - size) / 2))
232
+
233
+ if height > width:
234
+ if spatial_idx == 0:
235
+ y_offset = 0
236
+ elif spatial_idx == 2:
237
+ y_offset = height - size
238
+ else:
239
+ if spatial_idx == 0:
240
+ x_offset = 0
241
+ elif spatial_idx == 2:
242
+ x_offset = width - size
243
+ cropped = images[
244
+ :, :, y_offset : y_offset + size, x_offset : x_offset + size
245
+ ]
246
+ cropped_boxes = (
247
+ crop_boxes(boxes, x_offset, y_offset) if boxes is not None else None
248
+ )
249
+ if ndim == 3:
250
+ cropped = cropped.squeeze(0)
251
+ return cropped, cropped_boxes
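
As a quick illustration, `uniform_crop` can be called three times to obtain the left/center/right (or top/center/bottom) test-time views of a clip; the tensor below is a dummy input:

```python
import torch

frames = torch.rand(16, 3, 256, 320)  # dummy clip: T x C x H x W
views = [uniform_crop(frames, size=224, spatial_idx=i)[0] for i in range(3)]
# each element of `views` has shape 16 x 3 x 224 x 224
```
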
252
+
253
+
254
+ def clip_boxes_to_image(boxes, height, width):
255
+ """
256
+ Clip an array of boxes to an image with the given height and width.
257
+ Args:
258
+ boxes (ndarray): bounding boxes to perform clipping.
259
+ Dimension is `num boxes` x 4.
260
+ height (int): given image height.
261
+ width (int): given image width.
262
+ Returns:
263
+ clipped_boxes (ndarray): the clipped boxes with dimension of
264
+ `num boxes` x 4.
265
+ """
266
+ clipped_boxes = boxes.copy()
267
+ clipped_boxes[:, [0, 2]] = np.minimum(
268
+ width - 1.0, np.maximum(0.0, boxes[:, [0, 2]])
269
+ )
270
+ clipped_boxes[:, [1, 3]] = np.minimum(
271
+ height - 1.0, np.maximum(0.0, boxes[:, [1, 3]])
272
+ )
273
+ return clipped_boxes
274
+
275
+
276
+ def blend(images1, images2, alpha):
277
+ """
278
+ Blend two images with a given weight alpha.
279
+ Args:
280
+ images1 (tensor): the first images to be blended, the dimension is
281
+ `num frames` x `channel` x `height` x `width`.
282
+ images2 (tensor): the second images to be blended, the dimension is
283
+ `num frames` x `channel` x `height` x `width`.
284
+ alpha (float): the blending weight.
285
+ Returns:
286
+ (tensor): blended images, the dimension is
287
+ `num frames` x `channel` x `height` x `width`.
288
+ """
289
+ return images1 * alpha + images2 * (1 - alpha)
290
+
291
+
292
+ def grayscale(images):
293
+ """
294
+ Get the grayscale for the input images. The channels of images should be
295
+ in order BGR.
296
+ Args:
297
+ images (tensor): the input images for getting grayscale. Dimension is
298
+ `num frames` x `channel` x `height` x `width`.
299
+ Returns:
300
+ img_gray (tensor): grayscale images, the dimension is
301
+ `num frames` x `channel` x `height` x `width`.
302
+ """
303
+ # R -> 0.299, G -> 0.587, B -> 0.114.
304
+ img_gray = torch.tensor(images)
305
+ gray_channel = (
306
+ 0.299 * images[:, 2] + 0.587 * images[:, 1] + 0.114 * images[:, 0]
307
+ )
308
+ img_gray[:, 0] = gray_channel
309
+ img_gray[:, 1] = gray_channel
310
+ img_gray[:, 2] = gray_channel
311
+ return img_gray
312
+
313
+
314
+ def color_jitter(images, img_brightness=0, img_contrast=0, img_saturation=0):
315
+ """
316
+ Perform color jittering on the input images. The channels of images
317
+ should be in order BGR.
318
+ Args:
319
+ images (tensor): images to perform color jitter. Dimension is
320
+ `num frames` x `channel` x `height` x `width`.
321
+ img_brightness (float): jitter ratio for brightness.
322
+ img_contrast (float): jitter ratio for contrast.
323
+ img_saturation (float): jitter ratio for saturation.
324
+ Returns:
325
+ images (tensor): the jittered images, the dimension is
326
+ `num frames` x `channel` x `height` x `width`.
327
+ """
328
+
329
+ jitter = []
330
+ if img_brightness != 0:
331
+ jitter.append("brightness")
332
+ if img_contrast != 0:
333
+ jitter.append("contrast")
334
+ if img_saturation != 0:
335
+ jitter.append("saturation")
336
+
337
+ if len(jitter) > 0:
338
+ order = np.random.permutation(np.arange(len(jitter)))
339
+ for idx in range(0, len(jitter)):
340
+ if jitter[order[idx]] == "brightness":
341
+ images = brightness_jitter(img_brightness, images)
342
+ elif jitter[order[idx]] == "contrast":
343
+ images = contrast_jitter(img_contrast, images)
344
+ elif jitter[order[idx]] == "saturation":
345
+ images = saturation_jitter(img_saturation, images)
346
+ return images
347
+
348
+
349
+ def brightness_jitter(var, images):
350
+ """
351
+ Perform brightness jittering on the input images. The channels of images
352
+ should be in order BGR.
353
+ Args:
354
+ var (float): jitter ratio for brightness.
355
+ images (tensor): images to perform color jitter. Dimension is
356
+ `num frames` x `channel` x `height` x `width`.
357
+ Returns:
358
+ images (tensor): the jittered images, the dimension is
359
+ `num frames` x `channel` x `height` x `width`.
360
+ """
361
+ alpha = 1.0 + np.random.uniform(-var, var)
362
+
363
+ img_bright = torch.zeros(images.shape)
364
+ images = blend(images, img_bright, alpha)
365
+ return images
366
+
367
+
368
+ def contrast_jitter(var, images):
369
+ """
370
+ Perform contrast jittering on the input images. The channels of images
371
+ should be in order BGR.
372
+ Args:
373
+ var (float): jitter ratio for contrast.
374
+ images (tensor): images to perform color jitter. Dimension is
375
+ `num frames` x `channel` x `height` x `width`.
376
+ Returns:
377
+ images (tensor): the jittered images, the dimension is
378
+ `num frames` x `channel` x `height` x `width`.
379
+ """
380
+ alpha = 1.0 + np.random.uniform(-var, var)
381
+
382
+ img_gray = grayscale(images)
383
+ img_gray[:] = torch.mean(img_gray, dim=(1, 2, 3), keepdim=True)
384
+ images = blend(images, img_gray, alpha)
385
+ return images
386
+
387
+
388
+ def saturation_jitter(var, images):
389
+ """
390
+ Perform saturation jittering on the input images. The channels of images
391
+ should be in order BGR.
392
+ Args:
393
+ var (float): jitter ratio for saturation.
394
+ images (tensor): images to perform color jitter. Dimension is
395
+ `num frames` x `channel` x `height` x `width`.
396
+ Returns:
397
+ images (tensor): the jittered images, the dimension is
398
+ `num frames` x `channel` x `height` x `width`.
399
+ """
400
+ alpha = 1.0 + np.random.uniform(-var, var)
401
+ img_gray = grayscale(images)
402
+ images = blend(images, img_gray, alpha)
403
+
404
+ return images
405
+
406
+
407
+ def lighting_jitter(images, alphastd, eigval, eigvec):
408
+ """
409
+ Perform AlexNet-style PCA jitter on the given images.
410
+ Args:
411
+ images (tensor): images to perform lighting jitter. Dimension is
412
+ `num frames` x `channel` x `height` x `width`.
413
+ alphastd (float): jitter ratio for PCA jitter.
414
+ eigval (list): eigenvalues for PCA jitter.
415
+ eigvec (list[list]): eigenvectors for PCA jitter.
416
+ Returns:
417
+ out_images (tensor): the jittered images, the dimension is
418
+ `num frames` x `channel` x `height` x `width`.
419
+ """
420
+ if alphastd == 0:
421
+ return images
422
+ # generate alpha1, alpha2, alpha3.
423
+ alpha = np.random.normal(0, alphastd, size=(1, 3))
424
+ eig_vec = np.array(eigvec)
425
+ eig_val = np.reshape(eigval, (1, 3))
426
+ rgb = np.sum(
427
+ eig_vec * np.repeat(alpha, 3, axis=0) * np.repeat(eig_val, 3, axis=0),
428
+ axis=1,
429
+ )
430
+ out_images = torch.zeros_like(images)
431
+ if len(images.shape) == 3:
432
+ # C H W
433
+ channel_dim = 0
434
+ elif len(images.shape) == 4:
435
+ # T C H W
436
+ channel_dim = 1
437
+ else:
438
+ raise NotImplementedError(f"Unsupported dimension {len(images.shape)}")
439
+
440
+ for idx in range(images.shape[channel_dim]):
441
+ # C H W
442
+ if len(images.shape) == 3:
443
+ out_images[idx] = images[idx] + rgb[2 - idx]
444
+ # T C H W
445
+ elif len(images.shape) == 4:
446
+ out_images[:, idx] = images[:, idx] + rgb[2 - idx]
447
+ else:
448
+ raise NotImplementedError(
449
+ f"Unsupported dimension {len(images.shape)}"
450
+ )
451
+
452
+ return out_images
453
+
454
+
455
+ def color_normalization(images, mean, stddev):
456
+ """
457
+ Perform color normalization on the given images.
458
+ Args:
459
+ images (tensor): images to perform color normalization. Dimension is
460
+ `num frames` x `channel` x `height` x `width`.
461
+ mean (list): mean values for normalization.
462
+ stddev (list): standard deviations for normalization.
463
+
464
+ Returns:
465
+ out_images (tensor): the normalized images, the dimension is
466
+ `num frames` x `channel` x `height` x `width`.
467
+ """
468
+ if len(images.shape) == 3:
469
+ assert (
470
+ len(mean) == images.shape[0]
471
+ ), "channel mean not computed properly"
472
+ assert (
473
+ len(stddev) == images.shape[0]
474
+ ), "channel stddev not computed properly"
475
+ elif len(images.shape) == 4:
476
+ assert (
477
+ len(mean) == images.shape[1]
478
+ ), "channel mean not computed properly"
479
+ assert (
480
+ len(stddev) == images.shape[1]
481
+ ), "channel stddev not computed properly"
482
+ else:
483
+ raise NotImplementedError(f"Unsupported dimension {len(images.shape)}")
484
+
485
+ out_images = torch.zeros_like(images)
486
+ for idx in range(len(mean)):
487
+ # C H W
488
+ if len(images.shape) == 3:
489
+ out_images[idx] = (images[idx] - mean[idx]) / stddev[idx]
490
+ elif len(images.shape) == 4:
491
+ out_images[:, idx] = (images[:, idx] - mean[idx]) / stddev[idx]
492
+ else:
493
+ raise NotImplementedError(
494
+ f"Unsupported dimension {len(images.shape)}"
495
+ )
496
+ return out_images
497
+
498
+
499
+ def _get_param_spatial_crop(
500
+ scale, ratio, height, width, num_repeat=10, log_scale=True, switch_hw=False
501
+ ):
502
+ """
503
+ Given scale, ratio, height and width, return sampled coordinates of the videos.
504
+ """
505
+ for _ in range(num_repeat):
506
+ area = height * width
507
+ target_area = random.uniform(*scale) * area
508
+ if log_scale:
509
+ log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
510
+ aspect_ratio = math.exp(random.uniform(*log_ratio))
511
+ else:
512
+ aspect_ratio = random.uniform(*ratio)
513
+
514
+ w = int(round(math.sqrt(target_area * aspect_ratio)))
515
+ h = int(round(math.sqrt(target_area / aspect_ratio)))
516
+
517
+ if np.random.uniform() < 0.5 and switch_hw:
518
+ w, h = h, w
519
+
520
+ if 0 < w <= width and 0 < h <= height:
521
+ i = random.randint(0, height - h)
522
+ j = random.randint(0, width - w)
523
+ return i, j, h, w
524
+
525
+ # Fallback to central crop
526
+ in_ratio = float(width) / float(height)
527
+ if in_ratio < min(ratio):
528
+ w = width
529
+ h = int(round(w / min(ratio)))
530
+ elif in_ratio > max(ratio):
531
+ h = height
532
+ w = int(round(h * max(ratio)))
533
+ else: # whole image
534
+ w = width
535
+ h = height
536
+ i = (height - h) // 2
537
+ j = (width - w) // 2
538
+ return i, j, h, w
539
+
540
+
541
+ def random_resized_crop(
542
+ images,
543
+ target_height,
544
+ target_width,
545
+ scale=(0.8, 1.0),
546
+ ratio=(3.0 / 4.0, 4.0 / 3.0),
547
+ ):
548
+ """
549
+ Crop the given images to random size and aspect ratio. A crop of random
550
+ size (default: 0.8 to 1.0) of the original size and a random aspect
551
+ ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This
552
+ crop is finally resized to given size. This is popularly used to train the
553
+ Inception networks.
554
+
555
+ Args:
556
+ images: Images to perform resizing and cropping.
557
+ target_height: Desired height after cropping.
558
+ target_width: Desired width after cropping.
559
+ scale: Scale range of Inception-style area based random resizing.
560
+ ratio: Aspect ratio range of Inception-style area based random resizing.
561
+ """
562
+
563
+ height = images.shape[2]
564
+ width = images.shape[3]
565
+
566
+ i, j, h, w = _get_param_spatial_crop(scale, ratio, height, width)
567
+ cropped = images[:, :, i : i + h, j : j + w]
568
+ return torch.nn.functional.interpolate(
569
+ cropped,
570
+ size=(target_height, target_width),
571
+ mode="bilinear",
572
+ align_corners=False,
573
+ )
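
A small usage sketch on a dummy clip tensor (all values chosen only for illustration):

```python
import torch

clip = torch.rand(16, 3, 240, 320)  # T x C x H x W
out = random_resized_crop(clip, target_height=224, target_width=224,
                          scale=(0.5, 1.0))
print(out.shape)  # torch.Size([16, 3, 224, 224])
```
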
574
+
575
+
576
+ def random_resized_crop_with_shift(
577
+ images,
578
+ target_height,
579
+ target_width,
580
+ scale=(0.8, 1.0),
581
+ ratio=(3.0 / 4.0, 4.0 / 3.0),
582
+ ):
583
+ """
584
+ This is similar to random_resized_crop. However, it samples two different
585
+ boxes (for cropping) for the first and last frame. It then linearly
586
+ interpolates the two boxes for other frames.
587
+
588
+ Args:
589
+ images: Images to perform resizing and cropping.
590
+ target_height: Desired height after cropping.
591
+ target_width: Desired width after cropping.
592
+ scale: Scale range of Inception-style area based random resizing.
593
+ ratio: Aspect ratio range of Inception-style area based random resizing.
594
+ """
595
+ t = images.shape[1]
596
+ height = images.shape[2]
597
+ width = images.shape[3]
598
+
599
+ i, j, h, w = _get_param_spatial_crop(scale, ratio, height, width)
600
+ i_, j_, h_, w_ = _get_param_spatial_crop(scale, ratio, height, width)
601
+ i_s = [int(i) for i in torch.linspace(i, i_, steps=t).tolist()]
602
+ j_s = [int(i) for i in torch.linspace(j, j_, steps=t).tolist()]
603
+ h_s = [int(i) for i in torch.linspace(h, h_, steps=t).tolist()]
604
+ w_s = [int(i) for i in torch.linspace(w, w_, steps=t).tolist()]
605
+ out = torch.zeros((3, t, target_height, target_width))
606
+ for ind in range(t):
607
+ out[:, ind : ind + 1, :, :] = torch.nn.functional.interpolate(
608
+ images[
609
+ :,
610
+ ind : ind + 1,
611
+ i_s[ind] : i_s[ind] + h_s[ind],
612
+ j_s[ind] : j_s[ind] + w_s[ind],
613
+ ],
614
+ size=(target_height, target_width),
615
+ mode="bilinear",
616
+ align_corners=False,
617
+ )
618
+ return out
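
And the shifted variant, which interpolates two crop boxes across time; note that it indexes dimension 1 as time and returns a C x T x H x W tensor (dummy input below):

```python
import torch

clip = torch.rand(3, 16, 240, 320)   # C x T x H x W
out = random_resized_crop_with_shift(clip, target_height=224, target_width=224)
print(out.shape)                     # torch.Size([3, 16, 224, 224])
```
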
619
+
620
+
621
+ def create_random_augment(
622
+ input_size,
623
+ auto_augment=None,
624
+ interpolation="bilinear",
625
+ ):
626
+ """
627
+ Get video randaug transform.
628
+
629
+ Args:
630
+ input_size: The size of the input video in tuple.
631
+ auto_augment: Parameters for randaug. An example:
632
+ "rand-m7-n4-mstd0.5-inc1" (m is the magnitude and n is the number
633
+ of operations to apply).
634
+ interpolation: Interpolation method.
635
+ """
636
+ if isinstance(input_size, tuple):
637
+ img_size = input_size[-2:]
638
+ else:
639
+ img_size = input_size
640
+
641
+ if auto_augment:
642
+ assert isinstance(auto_augment, str)
643
+ if isinstance(img_size, tuple):
644
+ img_size_min = min(img_size)
645
+ else:
646
+ img_size_min = img_size
647
+ aa_params = {"translate_const": int(img_size_min * 0.45)}
648
+ if interpolation and interpolation != "random":
649
+ aa_params["interpolation"] = _pil_interp(interpolation)
650
+ if auto_augment.startswith("rand"):
651
+ return transforms.Compose(
652
+ [rand_augment_transform(auto_augment, aa_params)]
653
+ )
654
+ raise NotImplementedError
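
A hedged usage sketch, assuming (as in VideoMAE-style pipelines) that the returned transform is applied to a list of PIL frames; the frames below are dummies:

```python
from PIL import Image

randaug = create_random_augment(
    input_size=(224, 224),
    auto_augment="rand-m7-n4-mstd0.5-inc1",
    interpolation="bicubic",
)
frames = [Image.new("RGB", (224, 224)) for _ in range(16)]  # dummy PIL frames
augmented = randaug(frames)
```
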
655
+
656
+
657
+ def random_sized_crop_img(
658
+ im,
659
+ size,
660
+ jitter_scale=(0.08, 1.0),
661
+ jitter_aspect=(3.0 / 4.0, 4.0 / 3.0),
662
+ max_iter=10,
663
+ ):
664
+ """
665
+ Performs Inception-style cropping (used for training).
666
+ """
667
+ assert (
668
+ len(im.shape) == 3
669
+ ), "Currently only support image for random_sized_crop"
670
+ h, w = im.shape[1:3]
671
+ i, j, h, w = _get_param_spatial_crop(
672
+ scale=jitter_scale,
673
+ ratio=jitter_aspect,
674
+ height=h,
675
+ width=w,
676
+ num_repeat=max_iter,
677
+ log_scale=False,
678
+ switch_hw=True,
679
+ )
680
+ cropped = im[:, i : i + h, j : j + w]
681
+ return torch.nn.functional.interpolate(
682
+ cropped.unsqueeze(0),
683
+ size=(size, size),
684
+ mode="bilinear",
685
+ align_corners=False,
686
+ ).squeeze(0)
687
+
688
+
689
+ # The following code is modified from the timm lib; we will replace the following
690
+ # contents with a dependency on PyTorchVideo.
691
+ # https://github.com/facebookresearch/pytorchvideo
692
+ class RandomResizedCropAndInterpolation:
693
+ """Crop the given PIL Image to random size and aspect ratio with random interpolation.
694
+ A crop of random size (default: of 0.08 to 1.0) of the original size and a random
695
+ aspect ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This crop
696
+ is finally resized to given size.
697
+ This is popularly used to train the Inception networks.
698
+ Args:
699
+ size: expected output size of each edge
700
+ scale: range of size of the origin size cropped
701
+ ratio: range of aspect ratio of the origin aspect ratio cropped
702
+ interpolation: Default: PIL.Image.BILINEAR
703
+ """
704
+
705
+ def __init__(
706
+ self,
707
+ size,
708
+ scale=(0.08, 1.0),
709
+ ratio=(3.0 / 4.0, 4.0 / 3.0),
710
+ interpolation="bilinear",
711
+ ):
712
+ if isinstance(size, tuple):
713
+ self.size = size
714
+ else:
715
+ self.size = (size, size)
716
+ if (scale[0] > scale[1]) or (ratio[0] > ratio[1]):
717
+ print("range should be of kind (min, max)")
718
+
719
+ if interpolation == "random":
720
+ self.interpolation = _RANDOM_INTERPOLATION
721
+ else:
722
+ self.interpolation = _pil_interp(interpolation)
723
+ self.scale = scale
724
+ self.ratio = ratio
725
+
726
+ @staticmethod
727
+ def get_params(img, scale, ratio):
728
+ """Get parameters for ``crop`` for a random sized crop.
729
+ Args:
730
+ img (PIL Image): Image to be cropped.
731
+ scale (tuple): range of size of the origin size cropped
732
+ ratio (tuple): range of aspect ratio of the origin aspect ratio cropped
733
+ Returns:
734
+ tuple: params (i, j, h, w) to be passed to ``crop`` for a random
735
+ sized crop.
736
+ """
737
+ area = img.size[0] * img.size[1]
738
+
739
+ for _ in range(10):
740
+ target_area = random.uniform(*scale) * area
741
+ log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
742
+ aspect_ratio = math.exp(random.uniform(*log_ratio))
743
+
744
+ w = int(round(math.sqrt(target_area * aspect_ratio)))
745
+ h = int(round(math.sqrt(target_area / aspect_ratio)))
746
+
747
+ if w <= img.size[0] and h <= img.size[1]:
748
+ i = random.randint(0, img.size[1] - h)
749
+ j = random.randint(0, img.size[0] - w)
750
+ return i, j, h, w
751
+
752
+ # Fallback to central crop
753
+ in_ratio = img.size[0] / img.size[1]
754
+ if in_ratio < min(ratio):
755
+ w = img.size[0]
756
+ h = int(round(w / min(ratio)))
757
+ elif in_ratio > max(ratio):
758
+ h = img.size[1]
759
+ w = int(round(h * max(ratio)))
760
+ else: # whole image
761
+ w = img.size[0]
762
+ h = img.size[1]
763
+ i = (img.size[1] - h) // 2
764
+ j = (img.size[0] - w) // 2
765
+ return i, j, h, w
766
+
767
+ def __call__(self, img):
768
+ """
769
+ Args:
770
+ img (PIL Image): Image to be cropped and resized.
771
+ Returns:
772
+ PIL Image: Randomly cropped and resized image.
773
+ """
774
+ i, j, h, w = self.get_params(img, self.scale, self.ratio)
775
+ if isinstance(self.interpolation, (tuple, list)):
776
+ interpolation = random.choice(self.interpolation)
777
+ else:
778
+ interpolation = self.interpolation
779
+ return F.resized_crop(img, i, j, h, w, self.size, interpolation)
780
+
781
+ def __repr__(self):
782
+ if isinstance(self.interpolation, (tuple, list)):
783
+ interpolate_str = " ".join(
784
+ [_pil_interpolation_to_str[x] for x in self.interpolation]
785
+ )
786
+ else:
787
+ interpolate_str = _pil_interpolation_to_str[self.interpolation]
788
+ format_string = self.__class__.__name__ + "(size={0}".format(self.size)
789
+ format_string += ", scale={0}".format(
790
+ tuple(round(s, 4) for s in self.scale)
791
+ )
792
+ format_string += ", ratio={0}".format(
793
+ tuple(round(r, 4) for r in self.ratio)
794
+ )
795
+ format_string += ", interpolation={0})".format(interpolate_str)
796
+ return format_string
797
+
798
+
799
+ def transforms_imagenet_train(
800
+ img_size=224,
801
+ scale=None,
802
+ ratio=None,
803
+ hflip=0.5,
804
+ vflip=0.0,
805
+ color_jitter=0.4,
806
+ auto_augment=None,
807
+ interpolation="random",
808
+ use_prefetcher=False,
809
+ mean=(0.485, 0.456, 0.406),
810
+ std=(0.229, 0.224, 0.225),
811
+ re_prob=0.0,
812
+ re_mode="const",
813
+ re_count=1,
814
+ re_num_splits=0,
815
+ separate=False,
816
+ ):
817
+ """
818
+ If separate==True, the transforms are returned as a tuple of 3 separate transforms
819
+ for use in a mixing dataset that passes
820
+ * all data through the first (primary) transform, called the 'clean' data
821
+ * a portion of the data through the secondary transform
822
+ * normalizes and converts the branches above with the third, final transform
823
+ """
824
+ if isinstance(img_size, tuple):
825
+ img_size = img_size[-2:]
826
+ else:
827
+ img_size = img_size
828
+
829
+ scale = tuple(scale or (0.08, 1.0)) # default imagenet scale range
830
+ ratio = tuple(
831
+ ratio or (3.0 / 4.0, 4.0 / 3.0)
832
+ ) # default imagenet ratio range
833
+ primary_tfl = [
834
+ RandomResizedCropAndInterpolation(
835
+ img_size, scale=scale, ratio=ratio, interpolation=interpolation
836
+ )
837
+ ]
838
+ if hflip > 0.0:
839
+ primary_tfl += [transforms.RandomHorizontalFlip(p=hflip)]
840
+ if vflip > 0.0:
841
+ primary_tfl += [transforms.RandomVerticalFlip(p=vflip)]
842
+
843
+ secondary_tfl = []
844
+ if auto_augment:
845
+ assert isinstance(auto_augment, str)
846
+ if isinstance(img_size, tuple):
847
+ img_size_min = min(img_size)
848
+ else:
849
+ img_size_min = img_size
850
+ aa_params = dict(
851
+ translate_const=int(img_size_min * 0.45),
852
+ img_mean=tuple([min(255, round(255 * x)) for x in mean]),
853
+ )
854
+ if interpolation and interpolation != "random":
855
+ aa_params["interpolation"] = _pil_interp(interpolation)
856
+ if auto_augment.startswith("rand"):
857
+ secondary_tfl += [rand_augment_transform(auto_augment, aa_params)]
858
+ elif auto_augment.startswith("augmix"):
859
+ raise NotImplementedError("Augmix not implemented")
860
+ else:
861
+ raise NotImplementedError("Auto aug not implemented")
862
+ elif color_jitter is not None:
863
+ # color jitter is enabled when not using AA
864
+ if isinstance(color_jitter, (list, tuple)):
865
+ # color jitter should be a 3-tuple/list if spec brightness/contrast/saturation
866
+ # or 4 if also augmenting hue
867
+ assert len(color_jitter) in (3, 4)
868
+ else:
869
+ # if it's a scalar, duplicate for brightness, contrast, and saturation, no hue
870
+ color_jitter = (float(color_jitter),) * 3
871
+ secondary_tfl += [transforms.ColorJitter(*color_jitter)]
872
+
873
+ final_tfl = []
874
+ final_tfl += [
875
+ transforms.ToTensor(),
876
+ transforms.Normalize(mean=torch.tensor(mean), std=torch.tensor(std)),
877
+ ]
878
+ if re_prob > 0.0:
879
+ final_tfl.append(
880
+ RandomErasing(
881
+ re_prob,
882
+ mode=re_mode,
883
+ max_count=re_count,
884
+ num_splits=re_num_splits,
885
+ device="cpu",
886
+ cube=False,
887
+ )
888
+ )
889
+
890
+ if separate:
891
+ return (
892
+ transforms.Compose(primary_tfl),
893
+ transforms.Compose(secondary_tfl),
894
+ transforms.Compose(final_tfl),
895
+ )
896
+ else:
897
+ return transforms.Compose(primary_tfl + secondary_tfl + final_tfl)
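
A minimal sketch of the default (non-separate) path on a single dummy PIL image; auto-augment and random erasing are left at their defaults here to keep the example self-contained:

```python
from PIL import Image

train_tf = transforms_imagenet_train(img_size=224, interpolation="bicubic")
img = Image.new("RGB", (320, 240))  # dummy image
tensor = train_tf(img)              # 3 x 224 x 224, normalized tensor
```
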
898
+
899
+ ############################################################################################################
900
+ ############################################################################################################
901
+
902
+ class Compose(object):
903
+ """Composes several transforms
904
+ Args:
905
+ transforms (list of ``Transform`` objects): list of transforms
906
+ to compose
907
+ """
908
+
909
+ def __init__(self, transforms):
910
+ self.transforms = transforms
911
+
912
+ def __call__(self, clip):
913
+ for t in self.transforms:
914
+ clip = t(clip)
915
+ return clip
916
+
917
+
918
+ class RandomHorizontalFlip(object):
919
+ """Horizontally flip the list of given images randomly
920
+ with a probability 0.5
921
+ """
922
+
923
+ def __call__(self, clip):
924
+ """
925
+ Args:
926
+ img (PIL.Image or numpy.ndarray): List of images to be cropped
927
+ in format (h, w, c) in numpy.ndarray
928
+ Returns:
929
+ PIL.Image or numpy.ndarray: Randomly flipped clip
930
+ """
931
+ if random.random() < 0.5:
932
+ if isinstance(clip[0], np.ndarray):
933
+ return [np.fliplr(img) for img in clip]
934
+ elif isinstance(clip[0], PIL.Image.Image):
935
+ return [
936
+ img.transpose(PIL.Image.FLIP_LEFT_RIGHT) for img in clip
937
+ ]
938
+ else:
939
+ raise TypeError('Expected numpy.ndarray or PIL.Image' +
940
+ ' but got list of {0}'.format(type(clip[0])))
941
+ return clip
942
+
943
+
944
+ class RandomResize(object):
945
+ """Resizes a list of (H x W x C) numpy.ndarray to the final size
946
+ The larger the original image is, the longer it takes to
947
+ interpolate
948
+ Args:
949
+ interpolation (str): Can be one of 'nearest', 'bilinear'
950
+ defaults to nearest
951
+ size (tuple): (width, height)
952
+ """
953
+
954
+ def __init__(self, ratio=(3. / 4., 4. / 3.), interpolation='nearest'):
955
+ self.ratio = ratio
956
+ self.interpolation = interpolation
957
+
958
+ def __call__(self, clip):
959
+ scaling_factor = random.uniform(self.ratio[0], self.ratio[1])
960
+
961
+ if isinstance(clip[0], np.ndarray):
962
+ im_h, im_w, im_c = clip[0].shape
963
+ elif isinstance(clip[0], PIL.Image.Image):
964
+ im_w, im_h = clip[0].size
965
+
966
+ new_w = int(im_w * scaling_factor)
967
+ new_h = int(im_h * scaling_factor)
968
+ new_size = (new_w, new_h)
969
+ resized = FF.resize_clip(
970
+ clip, new_size, interpolation=self.interpolation)
971
+ return resized
972
+
973
+
974
+ class Resize(object):
975
+ """Resizes a list of (H x W x C) numpy.ndarray to the final size
976
+ The larger the original image is, the longer it takes to
977
+ interpolate
978
+ Args:
979
+ interpolation (str): Can be one of 'nearest', 'bilinear'
980
+ defaults to nearest
981
+ size (tuple): (width, height)
982
+ """
983
+
984
+ def __init__(self, size, interpolation='nearest'):
985
+ self.size = size
986
+ self.interpolation = interpolation
987
+
988
+ def __call__(self, clip):
989
+ resized = FF.resize_clip(
990
+ clip, self.size, interpolation=self.interpolation)
991
+ return resized
992
+
993
+
994
+ class RandomCrop(object):
995
+ """Extract random crop at the same location for a list of images
996
+ Args:
997
+ size (sequence or int): Desired output size for the
998
+ crop in format (h, w)
999
+ """
1000
+
1001
+ def __init__(self, size):
1002
+ if isinstance(size, numbers.Number):
1003
+ size = (size, size)
1004
+
1005
+ self.size = size
1006
+
1007
+ def __call__(self, clip):
1008
+ """
1009
+ Args:
1010
+ img (PIL.Image or numpy.ndarray): List of images to be cropped
1011
+ in format (h, w, c) in numpy.ndarray
1012
+ Returns:
1013
+ PIL.Image or numpy.ndarray: Cropped list of images
1014
+ """
1015
+ h, w = self.size
1016
+ if isinstance(clip[0], np.ndarray):
1017
+ im_h, im_w, im_c = clip[0].shape
1018
+ elif isinstance(clip[0], PIL.Image.Image):
1019
+ im_w, im_h = clip[0].size
1020
+ else:
1021
+ raise TypeError('Expected numpy.ndarray or PIL.Image' +
1022
+ ' but got list of {0}'.format(type(clip[0])))
1023
+ if w > im_w or h > im_h:
1024
+ error_msg = (
1025
+ 'Initial image size should be larger than '
1026
+ 'cropped size but got cropped sizes : ({w}, {h}) while '
1027
+ 'initial image is ({im_w}, {im_h})'.format(
1028
+ im_w=im_w, im_h=im_h, w=w, h=h))
1029
+ raise ValueError(error_msg)
1030
+
1031
+ x1 = random.randint(0, im_w - w)
1032
+ y1 = random.randint(0, im_h - h)
1033
+ cropped = FF.crop_clip(clip, y1, x1, h, w)
1034
+
1035
+ return cropped
1036
+
1037
+
1038
+ class ThreeCrop(object):
1039
+ """Extract random crop at the same location for a list of images
1040
+ Args:
1041
+ size (sequence or int): Desired output size for the
1042
+ crop in format (h, w)
1043
+ """
1044
+
1045
+ def __init__(self, size):
1046
+ if isinstance(size, numbers.Number):
1047
+ size = (size, size)
1048
+
1049
+ self.size = size
1050
+
1051
+ def __call__(self, clip):
1052
+ """
1053
+ Args:
1054
+ img (PIL.Image or numpy.ndarray): List of images to be cropped
1055
+ in format (h, w, c) in numpy.ndarray
1056
+ Returns:
1057
+ PIL.Image or numpy.ndarray: Cropped list of images
1058
+ """
1059
+ h, w = self.size
1060
+ if isinstance(clip[0], np.ndarray):
1061
+ im_h, im_w, im_c = clip[0].shape
1062
+ elif isinstance(clip[0], PIL.Image.Image):
1063
+ im_w, im_h = clip[0].size
1064
+ else:
1065
+ raise TypeError('Expected numpy.ndarray or PIL.Image' +
1066
+ ' but got list of {0}'.format(type(clip[0])))
1067
+ if w != im_w and h != im_h:
1068
+ clip = FF.resize_clip(clip, self.size, interpolation="bilinear")
1069
+ im_h, im_w, im_c = clip[0].shape
1070
+
1071
+ step = np.max((np.max((im_w, im_h)) - self.size[0]) // 2, 0)
1072
+ cropped = []
1073
+ for i in range(3):
1074
+ if (im_h > self.size[0]):
1075
+ x1 = 0
1076
+ y1 = i * step
1077
+ cropped.extend(FF.crop_clip(clip, y1, x1, h, w))
1078
+ else:
1079
+ x1 = i * step
1080
+ y1 = 0
1081
+ cropped.extend(FF.crop_clip(clip, y1, x1, h, w))
1082
+ return cropped
1083
+
1084
+
1085
+ class RandomRotation(object):
1086
+ """Rotate entire clip randomly by a random angle within
1087
+ given bounds
1088
+ Args:
1089
+ degrees (sequence or int): Range of degrees to select from
1090
+ If degrees is a number instead of sequence like (min, max),
1091
+ the range of degrees, will be (-degrees, +degrees).
1092
+ """
1093
+
1094
+ def __init__(self, degrees):
1095
+ if isinstance(degrees, numbers.Number):
1096
+ if degrees < 0:
1097
+ raise ValueError('If degrees is a single number,'
1098
+ ' must be positive')
1099
+ degrees = (-degrees, degrees)
1100
+ else:
1101
+ if len(degrees) != 2:
1102
+ raise ValueError('If degrees is a sequence,'
1103
+ ' it must be of len 2.')
1104
+
1105
+ self.degrees = degrees
1106
+
1107
+ def __call__(self, clip):
1108
+ """
1109
+ Args:
1110
+ img (PIL.Image or numpy.ndarray): List of images to be cropped
1111
+ in format (h, w, c) in numpy.ndarray
1112
+ Returns:
1113
+ PIL.Image or numpy.ndarray: Cropped list of images
1114
+ """
1115
+ import skimage
1116
+ angle = random.uniform(self.degrees[0], self.degrees[1])
1117
+ if isinstance(clip[0], np.ndarray):
1118
+ rotated = [skimage.transform.rotate(img, angle) for img in clip]
1119
+ elif isinstance(clip[0], PIL.Image.Image):
1120
+ rotated = [img.rotate(angle) for img in clip]
1121
+ else:
1122
+ raise TypeError('Expected numpy.ndarray or PIL.Image' +
1123
+ ' but got list of {0}'.format(type(clip[0])))
1124
+
1125
+ return rotated
1126
+
1127
+
1128
+ class CenterCrop(object):
1129
+ """Extract center crop at the same location for a list of images
1130
+ Args:
1131
+ size (sequence or int): Desired output size for the
1132
+ crop in format (h, w)
1133
+ """
1134
+
1135
+ def __init__(self, size):
1136
+ if isinstance(size, numbers.Number):
1137
+ size = (size, size)
1138
+
1139
+ self.size = size
1140
+
1141
+ def __call__(self, clip):
1142
+ """
1143
+ Args:
1144
+ img (PIL.Image or numpy.ndarray): List of images to be cropped
1145
+ in format (h, w, c) in numpy.ndarray
1146
+ Returns:
1147
+ PIL.Image or numpy.ndarray: Cropped list of images
1148
+ """
1149
+ h, w = self.size
1150
+ if isinstance(clip[0], np.ndarray):
1151
+ im_h, im_w, im_c = clip[0].shape
1152
+ elif isinstance(clip[0], PIL.Image.Image):
1153
+ im_w, im_h = clip[0].size
1154
+ else:
1155
+ raise TypeError('Expected numpy.ndarray or PIL.Image' +
1156
+ ' but got list of {0}'.format(type(clip[0])))
1157
+ if w > im_w or h > im_h:
1158
+ error_msg = (
1159
+ 'Initial image size should be larger than '
1160
+ 'cropped size but got cropped sizes : ({w}, {h}) while '
1161
+ 'initial image is ({im_w}, {im_h})'.format(
1162
+ im_w=im_w, im_h=im_h, w=w, h=h))
1163
+ raise ValueError(error_msg)
1164
+
1165
+ x1 = int(round((im_w - w) / 2.))
1166
+ y1 = int(round((im_h - h) / 2.))
1167
+ cropped = FF.crop_clip(clip, y1, x1, h, w)
1168
+
1169
+ return cropped
1170
+
1171
+
1172
+ class ColorJitter(object):
1173
+ """Randomly change the brightness, contrast and saturation and hue of the clip
1174
+ Args:
1175
+ brightness (float): How much to jitter brightness. brightness_factor
1176
+ is chosen uniformly from [max(0, 1 - brightness), 1 + brightness].
1177
+ contrast (float): How much to jitter contrast. contrast_factor
1178
+ is chosen uniformly from [max(0, 1 - contrast), 1 + contrast].
1179
+ saturation (float): How much to jitter saturation. saturation_factor
1180
+ is chosen uniformly from [max(0, 1 - saturation), 1 + saturation].
1181
+ hue(float): How much to jitter hue. hue_factor is chosen uniformly from
1182
+ [-hue, hue]. Should be >=0 and <= 0.5.
1183
+ """
1184
+
1185
+ def __init__(self, brightness=0, contrast=0, saturation=0, hue=0):
1186
+ self.brightness = brightness
1187
+ self.contrast = contrast
1188
+ self.saturation = saturation
1189
+ self.hue = hue
1190
+
1191
+ def get_params(self, brightness, contrast, saturation, hue):
1192
+ if brightness > 0:
1193
+ brightness_factor = random.uniform(
1194
+ max(0, 1 - brightness), 1 + brightness)
1195
+ else:
1196
+ brightness_factor = None
1197
+
1198
+ if contrast > 0:
1199
+ contrast_factor = random.uniform(
1200
+ max(0, 1 - contrast), 1 + contrast)
1201
+ else:
1202
+ contrast_factor = None
1203
+
1204
+ if saturation > 0:
1205
+ saturation_factor = random.uniform(
1206
+ max(0, 1 - saturation), 1 + saturation)
1207
+ else:
1208
+ saturation_factor = None
1209
+
1210
+ if hue > 0:
1211
+ hue_factor = random.uniform(-hue, hue)
1212
+ else:
1213
+ hue_factor = None
1214
+ return brightness_factor, contrast_factor, saturation_factor, hue_factor
1215
+
1216
+ def __call__(self, clip):
1217
+ """
1218
+ Args:
1219
+ clip (list): list of PIL.Image
1220
+ Returns:
1221
+ list PIL.Image : list of transformed PIL.Image
1222
+ """
1223
+ if isinstance(clip[0], np.ndarray):
1224
+ raise TypeError(
1225
+ 'Color jitter not yet implemented for numpy arrays')
1226
+ elif isinstance(clip[0], PIL.Image.Image):
1227
+ brightness, contrast, saturation, hue = self.get_params(
1228
+ self.brightness, self.contrast, self.saturation, self.hue)
1229
+
1230
+ # Create img transform function sequence
1231
+ img_transforms = []
1232
+ if brightness is not None:
1233
+ img_transforms.append(lambda img: torchvision.transforms.functional.adjust_brightness(img, brightness))
1234
+ if saturation is not None:
1235
+ img_transforms.append(lambda img: torchvision.transforms.functional.adjust_saturation(img, saturation))
1236
+ if hue is not None:
1237
+ img_transforms.append(lambda img: torchvision.transforms.functional.adjust_hue(img, hue))
1238
+ if contrast is not None:
1239
+ img_transforms.append(lambda img: torchvision.transforms.functional.adjust_contrast(img, contrast))
1240
+ random.shuffle(img_transforms)
1241
+
1242
+ # Apply to all images
1243
+ jittered_clip = []
1244
+ for img in clip:
1245
+ for func in img_transforms:
1246
+ jittered_img = func(img)
1247
+ jittered_clip.append(jittered_img)
1248
+
1249
+ else:
1250
+ raise TypeError('Expected numpy.ndarray or PIL.Image' +
1251
+ ' but got list of {0}'.format(type(clip[0])))
1252
+ return jittered_clip
1253
+
1254
+
1255
+ class Normalize(object):
1256
+ """Normalize a clip with mean and standard deviation.
1257
+ Given mean: ``(M1,...,Mn)`` and std: ``(S1,..,Sn)`` for ``n`` channels, this transform
1258
+ will normalize each channel of the input ``torch.*Tensor`` i.e.
1259
+ ``input[channel] = (input[channel] - mean[channel]) / std[channel]``
1260
+ .. note::
1261
+ This transform acts out of place, i.e., it does not mutate the input tensor.
1262
+ Args:
1263
+ mean (sequence): Sequence of means for each channel.
1264
+ std (sequence): Sequence of standard deviations for each channel.
1265
+ """
1266
+
1267
+ def __init__(self, mean, std):
1268
+ self.mean = mean
1269
+ self.std = std
1270
+
1271
+ def __call__(self, clip):
1272
+ """
1273
+ Args:
1274
+ clip (Tensor): Tensor clip of size (T, C, H, W) to be normalized.
1275
+ Returns:
1276
+ Tensor: Normalized Tensor clip.
1277
+ """
1278
+ return FF.normalize(clip, self.mean, self.std)
1279
+
1280
+ def __repr__(self):
1281
+ return self.__class__.__name__ + '(mean={0}, std={1})'.format(self.mean, self.std)
volume_transforms.py ADDED
@@ -0,0 +1,131 @@
1
+ import numpy as np
2
+ from PIL import Image
3
+ import torch
4
+
5
+
6
+ def convert_img(img):
7
+ """Converts (H, W, C) numpy.ndarray to (C, W, H) format
8
+ """
9
+ if len(img.shape) == 3:
10
+ img = img.transpose(2, 0, 1)
11
+ if len(img.shape) == 2:
12
+ img = np.expand_dims(img, 0)
13
+ return img
14
+
15
+
16
+ class ClipToTensor(object):
17
+ """Convert a list of m (H x W x C) numpy.ndarrays in the range [0, 255]
18
+ to a torch.FloatTensor of shape (C x m x H x W) in the range [0, 1.0]
19
+ """
20
+
21
+ def __init__(self, channel_nb=3, div_255=True, numpy=False):
22
+ self.channel_nb = channel_nb
23
+ self.div_255 = div_255
24
+ self.numpy = numpy
25
+
26
+ def __call__(self, clip):
27
+ """
28
+ Args: clip (list of numpy.ndarray): clip (list of images)
29
+ to be converted to tensor.
30
+ """
31
+ # Retrieve shape
32
+ if isinstance(clip[0], np.ndarray):
33
+ h, w, ch = clip[0].shape
34
+ assert ch == self.channel_nb, 'Got {0} instead of 3 channels'.format(
35
+ ch)
36
+ elif isinstance(clip[0], Image.Image):
37
+ w, h = clip[0].size
38
+ else:
39
+ raise TypeError('Expected numpy.ndarray or PIL.Image\
40
+ but got list of {0}'.format(type(clip[0])))
41
+
42
+ np_clip = np.zeros([self.channel_nb, len(clip), int(h), int(w)])
43
+
44
+ # Convert
45
+ for img_idx, img in enumerate(clip):
46
+ if isinstance(img, np.ndarray):
47
+ pass
48
+ elif isinstance(img, Image.Image):
49
+ img = np.array(img, copy=False)
50
+ else:
51
+ raise TypeError('Expected numpy.ndarray or PIL.Image\
52
+ but got list of {0}'.format(type(clip[0])))
53
+ img = convert_img(img)
54
+ np_clip[:, img_idx, :, :] = img
55
+ if self.numpy:
56
+ if self.div_255:
57
+ np_clip = np_clip / 255.0
58
+ return np_clip
59
+
60
+ else:
61
+ tensor_clip = torch.from_numpy(np_clip)
62
+
63
+ if not isinstance(tensor_clip, torch.FloatTensor):
64
+ tensor_clip = tensor_clip.float()
65
+ if self.div_255:
66
+ tensor_clip = torch.div(tensor_clip, 255)
67
+ return tensor_clip
68
+
69
+
70
+ # Note: this normalizes data to [-1, 1]
71
+ class ClipToTensor_K(object):
72
+ """Convert a list of m (H x W x C) numpy.ndarrays in the range [0, 255]
73
+ to a torch.FloatTensor of shape (C x m x H x W) in the range [-1.0, 1.0]
74
+ """
75
+
76
+ def __init__(self, channel_nb=3, div_255=True, numpy=False):
77
+ self.channel_nb = channel_nb
78
+ self.div_255 = div_255
79
+ self.numpy = numpy
80
+
81
+ def __call__(self, clip):
82
+ """
83
+ Args: clip (list of numpy.ndarray): clip (list of images)
84
+ to be converted to tensor.
85
+ """
86
+ # Retrieve shape
87
+ if isinstance(clip[0], np.ndarray):
88
+ h, w, ch = clip[0].shape
89
+ assert ch == self.channel_nb, 'Got {0} instead of 3 channels'.format(
90
+ ch)
91
+ elif isinstance(clip[0], Image.Image):
92
+ w, h = clip[0].size
93
+ else:
94
+ raise TypeError('Expected numpy.ndarray or PIL.Image\
95
+ but got list of {0}'.format(type(clip[0])))
96
+
97
+ np_clip = np.zeros([self.channel_nb, len(clip), int(h), int(w)])
98
+
99
+ # Convert
100
+ for img_idx, img in enumerate(clip):
101
+ if isinstance(img, np.ndarray):
102
+ pass
103
+ elif isinstance(img, Image.Image):
104
+ img = np.array(img, copy=False)
105
+ else:
106
+ raise TypeError('Expected numpy.ndarray or PIL.Image\
107
+ but got list of {0}'.format(type(clip[0])))
108
+ img = convert_img(img)
109
+ np_clip[:, img_idx, :, :] = img
110
+ if self.numpy:
111
+ if self.div_255:
112
+ np_clip = (np_clip - 127.5) / 127.5
113
+ return np_clip
114
+
115
+ else:
116
+ tensor_clip = torch.from_numpy(np_clip)
117
+
118
+ if not isinstance(tensor_clip, torch.FloatTensor):
119
+ tensor_clip = tensor_clip.float()
120
+ if self.div_255:
121
+ tensor_clip = torch.div(torch.sub(tensor_clip, 127.5), 127.5)
122
+ return tensor_clip
123
+
124
+
125
+ class ToTensor(object):
126
+ """Converts numpy array to tensor
127
+ """
128
+
129
+ def __call__(self, array):
130
+ tensor = torch.from_numpy(array)
131
+ return tensor
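
Finally, a sketch of how the clip-level transforms above are typically composed for evaluation, assuming decoded frames arrive as H x W x C uint8 arrays and that the repo's `functional.resize_clip` handles numpy inputs (as in VideoMAE-style pipelines); the frame data is a dummy:

```python
import numpy as np

clip = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
        for _ in range(16)]                    # dummy decoded frames

transform = Compose([                          # video_transforms.Compose
    Resize(256, interpolation="bilinear"),     # scale the short side to 256
    CenterCrop(224),                           # video_transforms.CenterCrop
    ClipToTensor(),                            # volume_transforms: C x T x H x W in [0, 1]
])
video = transform(clip)                        # torch.FloatTensor, 3 x 16 x 224 x 224
```
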