OpenGVLab/InternVideo2-Stage2_6B · Performance Issue and retrieval code

Hello,

I tested demo.py and noticed some unusual results.

Initially, I ran the code using the provided sample video file and text descriptions:

text_candidates = ["A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
                    "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
                    "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
                    "A cat excitedly runs through the yard, chasing a rabbit.",
                    "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery."]

The output was:

text: A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon. ~ prob: 0.5354
text: A cat excitedly runs through the yard, chasing a rabbit. ~ prob: 0.2978
text: A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys. ~ prob: 0.0989
text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0630
text: A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery. ~ prob: 0.0048

This result seems reasonable. However, when I reran the same video with paraphrased versions of the descriptions, the ranking changed: the dog description, which ranked first above, dropped to third place:

paraphrased_text_candidates = [
        "A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement.",
        "A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys.",
        "Wearing a blue jacket, a person clears the snow from their driveway with a shovel.",
        "A cat dashes energetically across the yard, pursuing a rabbit.",
        "Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere."
    ]

The output was:

text: A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys. ~ prob: 0.7446
text: A cat dashes energetically across the yard, pursuing a rabbit. ~ prob: 0.1992
text: A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement. ~ prob: 0.0257
text: Wearing a blue jacket, a person clears the snow from their driveway with a shovel. ~ prob: 0.0200
text: Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere. ~ prob: 0.0105
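
For a sense of scale, the gaps between the underlying similarity scores can be estimated from these probabilities. The sketch below assumes the probabilities come from a softmax over similarities multiplied by 100.0, as in the predict_label function quoted at the end of this post; similarity_gap is a helper name I made up for illustration.

```python
import math

def similarity_gap(p_a: float, p_b: float, scale: float = 100.0) -> float:
    """Similarity difference sim_a - sim_b implied by two probabilities
    taken from the same softmax over scale * sim (an assumption)."""
    # softmax gives p_a / p_b = exp(scale * (sim_a - sim_b))
    return math.log(p_a / p_b) / scale

# Probabilities copied from the two outputs above.
gap_original = similarity_gap(0.5354, 0.2978)     # dog vs. cat, original captions
gap_paraphrased = similarity_gap(0.7446, 0.0257)  # sleigh vs. dog, paraphrased captions

print(f"original:    {gap_original:.4f}")
print(f"paraphrased: {gap_paraphrased:.4f}")
```

If that assumption holds, the first-to-second gap in the original run is about 0.006, and in the paraphrased run the sleigh caption leads the dog caption by about 0.034. Differences of a few hundredths in similarity would then be enough to produce these swings in probability.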

I am also curious why the retrieval code (predict_label) multiplies the video embedding (feature) by 100 before applying the softmax in the following function:

def predict_label(self,
                  vid_feat: torch.Tensor,
                  txt_feat: torch.Tensor,
                  top: int=5):
    label_probs = (100.0 * vid_feat @ txt_feat.T).softmax(dim=-1)
    top_probs, top_labels = label_probs.float().cpu().topk(top, dim=-1)
    return top_probs, top_labels
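
To make the question concrete, here is a minimal sketch of what a factor of 100.0 does to a softmax. It uses plain Python instead of torch, and the similarity values are invented for illustration (they are not model outputs). I am assuming the features are L2-normalized, so that the dot products are cosine similarities.

```python
import math

def softmax(scores, scale=1.0):
    """Softmax over a list of scores after multiplying each by `scale`."""
    scaled = [scale * s for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one video and five captions.
# These numbers are made up for illustration only.
sims = [0.30, 0.29, 0.27, 0.26, 0.24]

unscaled = softmax(sims, scale=1.0)    # no scaling
scaled = softmax(sims, scale=100.0)    # same factor as in predict_label

print("scale=1:  ", [round(p, 4) for p in unscaled])
print("scale=100:", [round(p, 4) for p in scaled])
```

With scale 1.0 the five probabilities all stay close to 0.2, while with scale 100.0 the same ordering turns into a much sharper distribution. So the factor looks like a softmax temperature to me, but I would like to confirm whether that is the intent and why 100 was chosen.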