Performance Issue and retrieval code
#1, opened by junia3
Hello,
I tested demo.py and noticed some unusual results.
Initially, I ran the code using the provided sample video file and text descriptions:
text_candidates = [
    "A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon.",
    "A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys.",
    "A person dressed in a blue jacket shovels the snow-covered pavement outside their house.",
    "A cat excitedly runs through the yard, chasing a rabbit.",
    "A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery."
]
The output was:
text: A playful dog and its owner wrestle in the snowy yard, chasing each other with joyous abandon. ~ prob: 0.5354
text: A cat excitedly runs through the yard, chasing a rabbit. ~ prob: 0.2978
text: A man in a gray coat walks through the snowy landscape, pulling a sleigh loaded with toys. ~ prob: 0.0989
text: A person dressed in a blue jacket shovels the snow-covered pavement outside their house. ~ prob: 0.0630
text: A person bundled up in a blanket walks through the snowy landscape, enjoying the serene winter scenery. ~ prob: 0.0048
This result seems reasonable. However, when I tested with the following paraphrased descriptions, the results were not as expected:
paraphrased_text_candidates = [
    "A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement.",
    "A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys.",
    "Wearing a blue jacket, a person clears the snow from their driveway with a shovel.",
    "A cat dashes energetically across the yard, pursuing a rabbit.",
    "Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere."
]
The output was:
text: A man wearing a gray coat strides through the snowy terrain, dragging a sleigh stacked with toys. ~ prob: 0.7446
text: A cat dashes energetically across the yard, pursuing a rabbit. ~ prob: 0.1992
text: A cheerful dog and its owner tumble and chase each other in the snow-covered yard, full of excitement. ~ prob: 0.0257
text: Wearing a blue jacket, a person clears the snow from their driveway with a shovel. ~ prob: 0.0200
text: Wrapped in a warm blanket, a person strolls through the snowy landscape, admiring the peaceful winter atmosphere. ~ prob: 0.0105
I am also curious why the retrieval code (predict_label) multiplies the video embedding (vid_feat) by 100 before taking the softmax in the following function:
def predict_label(self,
                  vid_feat: torch.Tensor,
                  txt_feat: torch.Tensor,
                  top: int=5):
    label_probs = (100.0 * vid_feat @ txt_feat.T).softmax(dim=-1)
    top_probs, top_labels = label_probs.float().cpu().topk(top, dim=-1)
    return top_probs, top_labels
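
For context, here is a minimal sketch of what such a scale factor does numerically. It assumes the features are L2-normalized, so that vid_feat @ txt_feat.T yields cosine similarities; I have not verified this for this model. The similarity values below are hypothetical and not taken from demo.py. The constant resembles the temperature (logit scale) used in CLIP-style zero-shot classification, but whether that is the intent here is part of my question.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between one video and five captions.
# Cosine similarities of normalized embeddings often sit in a narrow band.
sims = [0.30, 0.29, 0.28, 0.27, 0.25]

# Without scaling, the softmax is almost uniform.
unscaled = softmax(sims)

# With the factor of 100 (assumed to act as an inverse temperature),
# small gaps between similarities become large gaps between probabilities.
scaled = softmax([100.0 * s for s in sims])

print("unscaled:", [round(p, 4) for p in unscaled])
print("scaled:  ", [round(p, 4) for p in scaled])
```

Under these assumptions the factor only sharpens the distribution: multiplying every logit by the same positive constant keeps the ranking of the candidates unchanged. So the scale by itself would not explain why the top-ranked caption changes after paraphrasing, although it does make small differences in similarity look like large differences in probability.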