zhiqing0205 committed on
Commit
74acc06
·
1 Parent(s): 4a80644

Add basic Python scripts and documentation

LogSAD技术详解.md ADDED
@@ -0,0 +1,621 @@
+ # LogSAD: A Technical Deep Dive into Training-free Anomaly Detection with Vision and Language Foundation Models
+
+ ## Project Overview
+
+ LogSAD (Towards Training-free Anomaly Detection with Vision and Language Foundation Models) is a training-free anomaly detection method published at CVPR 2025. By combining several pretrained vision and language foundation models, it detects both logical and structural anomalies on the MVTec LOCO dataset.
+
+ ## Overall Architecture and Pipeline
+
+ ### Core Idea
+ LogSAD exploits the strong representations of pretrained models and detects anomalies through multimodal feature fusion and logical reasoning, without any training on the target dataset.
+
+ ### System Architecture
+ ```
+ Input image (448x448)
+         ↓
+ ┌─────────────────────────────────────────────────┐
+ │ Multimodal feature extraction                   │
+ │  ├─ CLIP ViT-L-14 (image + text features)       │
+ │  ├─ DINOv2 ViT-L-14 (image features)            │
+ │  └─ SAM ViT-H (instance segmentation)           │
+ └─────────────────────────────────────────────────┘
+         ↓
+ ┌─────────────────────────────────────────────────┐
+ │ Feature processing and fusion                   │
+ │  ├─ K-means clustering segmentation             │
+ │  ├─ Text-guided semantic segmentation           │
+ │  └─ Multi-scale feature fusion                  │
+ └─────────────────────────────────────────────────┘
+         ↓
+ ┌─────────────────────────────────────────────────┐
+ │ Anomaly detection                               │
+ │  ├─ Structural anomaly detection (PatchCore)    │
+ │  ├─ Logical anomaly detection (histogram match) │
+ │  └─ Instance matching (Hungarian algorithm)     │
+ └─────────────────────────────────────────────────┘
+         ↓
+ Final anomaly score
+ ```
+
+ ## Pretrained Models in Detail
+
+ ### 1. CLIP ViT-L-14
+ **Role**: the core of vision-language understanding
+ - **Model**: `hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K`
+ - **Input size**: 448×448
+ - **Feature extraction layers**: [6, 12, 18, 24]
+ - **Feature dimension**: 1024
+ - **Output feature map**: 32×32 → 64×64 (interpolated)
+
+ **Implementation**:
+ ```python
+ # model_ensemble.py:96-97
+ self.model_clip, _, _ = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
+ self.feature_list = [6, 12, 18, 24]
+ ```
+
+ **How it cooperates**:
+ - Provides semantic feature representations of the image
+ - Encodes the semantics of different objects through text prompts
+ - Used for semantic segmentation and anomaly classification
+
+ ### 2. DINOv2 ViT-L-14
+ **Role**: provides richer visual features
+ - **Model**: `dinov2_vitl14`
+ - **Feature extraction layers**: [6, 12, 18, 24]
+ - **Feature dimension**: 1024
+ - **Output feature map**: 32×32 → 64×64 (interpolated)
+
+ **Implementation**:
+ ```python
+ # model_ensemble.py:181-186
+ from dinov2.dinov2.hub.backbones import dinov2_vitl14
+ self.model_dinov2 = dinov2_vitl14()
+ self.feature_list_dinov2 = [6, 12, 18, 24]
+ ```
+
+ **How it cooperates**:
+ - Provides stronger visual features for some categories (splicing_connectors, breakfast_box, juice_bottle)
+ - Complements the CLIP features and improves detection accuracy
+
+ ### 3. SAM (Segment Anything Model)
+ **Role**: instance segmentation
+ - **Model**: ViT-H variant
+ - **Checkpoint**: `./checkpoint/sam_vit_h_4b8939.pth`
+ - **Function**: automatically generates object masks
+
+ **Implementation**:
+ ```python
+ # model_ensemble.py:102-103
+ self.model_sam = sam_model_registry["vit_h"](checkpoint = "./checkpoint/sam_vit_h_4b8939.pth")
+ self.mask_generator = SamAutomaticMaskGenerator(model = self.model_sam)
+ ```
+
+ **How it cooperates**:
+ - Provides precise object boundaries
+ - Used for instance-level anomaly detection
+ - Fused with the semantic segmentation results
+
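+ Since SAM supplies the instances that are later compared against the normal memory, the instance-matching score mentioned in the architecture above boils down to an assignment problem. Below is a minimal, illustrative sketch of Hungarian matching between test and memorized instance features using `scipy.optimize.linear_sum_assignment`; the feature sizes and the exact cost definition are assumptions, not the repository's code:
+
+ ```python
+ import numpy as np
+ from scipy.optimize import linear_sum_assignment
+
+ # One L2-normalized feature vector per segmented instance (sizes are illustrative).
+ test_instances = np.random.rand(5, 256)
+ memory_instances = np.random.rand(6, 256)
+ test_instances /= np.linalg.norm(test_instances, axis=1, keepdims=True)
+ memory_instances /= np.linalg.norm(memory_instances, axis=1, keepdims=True)
+
+ cost = 1.0 - test_instances @ memory_instances.T  # cosine-distance cost matrix
+ rows, cols = linear_sum_assignment(cost)          # optimal one-to-one assignment
+ instance_match_score = cost[rows, cols].max()     # the worst matched pair drives the score
+ print(instance_match_score)
+ ```
+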
+ ## Data Processing and Resizing in Detail
+
+ ### Image Preprocessing Pipeline
+
+ 1. **Input size standardization**:
+ ```python
+ # evaluation.py:184
+ datamodule = MVTecLoco(root=dataset_path, eval_batch_size=1, image_size=(448, 448), category=category)
+ ```
+
+ 2. **Normalization**:
+ ```python
+ # model_ensemble.py:88-92
+ self.transform = v2.Compose([
+     v2.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
+                  std=(0.26862954, 0.26130258, 0.27577711)),
+ ])
+ ```
+
+ 3. **Feature map resizing**:
+ ```python
+ # model_ensemble.py:155-156
+ self.feat_size = 64      # target feature map size
+ self.ori_feat_size = 32  # original feature map size
+ ```
+
+ ### Detailed Resize Flow
+
+ **CLIP feature processing**:
+ ```python
+ # model_ensemble.py:245-255
+ # 1. Interpolate from 32x32 to 64x64
+ patch_tokens_clip = patch_tokens_clip.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
+ patch_tokens_clip = F.interpolate(patch_tokens_clip, size=(self.feat_size, self.feat_size),
+                                   mode=self.inter_mode, align_corners=self.align_corners)
+ patch_tokens_clip = patch_tokens_clip.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.feature_list))
+ ```
+
+ **DINOv2 feature processing**:
+ ```python
+ # model_ensemble.py:253-263
+ # Same interpolation flow
+ patch_tokens_dinov2 = F.interpolate(patch_tokens_dinov2, size=(self.feat_size, self.feat_size),
+                                     mode=self.inter_mode, align_corners=self.align_corners)
+ ```
+
+ **Interpolation parameters**:
+ - **Mode**: bilinear (`bilinear`)
+ - **Corner alignment**: `align_corners=True`
+ - **Anti-aliasing**: `antialias=True`
+
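+ A minimal, self-contained sketch of this reshape-and-interpolate step is shown below; the tensor contents are random and the shapes simply follow the settings above (4 CLIP layers, token width 1024):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ ori_feat_size, feat_size, width, n_layers = 32, 64, 1024, 4
+ patch_tokens = torch.randn(1, ori_feat_size * ori_feat_size, width * n_layers)
+
+ # (1, 1024, 4096) -> (1, 4096, 32, 32): restore the spatial grid before interpolation.
+ grid = patch_tokens.view(1, ori_feat_size, ori_feat_size, -1).permute(0, 3, 1, 2)
+ # Bilinear upsampling to 64x64, matching inter_mode / align_corners above.
+ grid = F.interpolate(grid, size=(feat_size, feat_size), mode="bilinear", align_corners=True)
+ # Back to a (64*64, 4096) token layout, then L2-normalize each token.
+ tokens_64 = grid.permute(0, 2, 3, 1).reshape(-1, width * n_layers)
+ tokens_64 = F.normalize(tokens_64, p=2, dim=-1)
+ print(tokens_64.shape)  # torch.Size([4096, 4096])
+ ```
+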
+ ## SAM Multi-Mask Processing
+
+ ### Handling the Many Masks Generated by SAM
+
+ **Mask generation**:
+ ```python
+ # model_ensemble.py:394
+ masks = self.mask_generator.generate(raw_image)
+ sorted_masks = sorted(masks, key=(lambda x: x['area']), reverse=True)
+ ```
+
+ **Mask fusion strategy**:
+ ```python
+ # model_ensemble.py:347-367
+ def merge_segmentations(a, b, background_class):
+     """Fuse SAM masks with the semantic segmentation result."""
+     # A voting scheme decides the semantic label of each SAM region
+     for label_a in unique_labels_a:
+         mask_a = (a == label_a)
+         labels_b = b[mask_a]
+         if labels_b.size > 0:
+             count_b = np.bincount(labels_b, minlength=unique_labels_b.max() + 1)
+             label_map[label_a] = np.argmax(count_b)  # majority vote
+ ```
+
+ **Multi-mask collaboration flow**:
+ 1. SAM generates all candidate instance masks
+ 2. K-means clustering produces a semantic segmentation mask (see the sketch after this list)
+ 3. Text guidance produces a patch-level semantic mask
+ 4. Masks from the different sources are fused by majority voting
+ 5. Small noisy regions are filtered out (threshold: 32 pixels)
+
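+ Step 2 can be sketched with `kmeans_pytorch`, which the repository lists as a dependency; the feature dimensionality and the number of clusters below are illustrative rather than the exact values used for every category:
+
+ ```python
+ import torch
+ from kmeans_pytorch import kmeans
+
+ # Patch features of one image on a 64x64 grid (random stand-in values).
+ feat_size, dim = 64, 2048
+ patch_feats = torch.randn(feat_size * feat_size, dim)
+
+ # Cluster patches into k prototype regions; the labels form a coarse segmentation mask.
+ cluster_ids, cluster_centers = kmeans(
+     X=patch_feats, num_clusters=10, distance='cosine', device=torch.device('cpu')
+ )
+ kmeans_mask = cluster_ids.view(feat_size, feat_size)
+ print(kmeans_mask.shape)  # torch.Size([64, 64])
+ ```
+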
+ ## Ground Truth Multi-Mask Handling
+
+ ### Mask Layout in the MVTec LOCO Dataset
+
+ **File structure**:
+ ```
+ dataset/
+ ├── test/category/image_filename.png          # test image
+ ├── ground_truth/category/image_filename/     # directory of GT masks for that image
+ │   ├── 000.png                               # mask of the first anomalous region
+ │   ├── 001.png                               # mask of the second anomalous region
+ │   ├── 002.png                               # mask of the third anomalous region
+ │   └── ...                                   # further anomalous regions
+ ```
+
+ **Aggregating multiple masks at load time**:
+ ```python
+ # anomalib/data/image/mvtec_loco.py:142-148
+ mask_samples = (
+     mask_samples.groupby(["path", "split", "label", "image_folder"])["image_path"]
+     .agg(list)  # aggregate the mask paths of one image into a list
+     .reset_index()
+     .rename(columns={"image_path": "mask_path"})
+ )
+ ```
+
+ ### Multi-Mask Fusion Strategy
+
+ **Step 1: mask path handling**:
+ ```python
+ # anomalib/data/image/mvtec_loco.py:279-280
+ if isinstance(mask_path, str):
+     mask_path = [mask_path]  # make sure mask_path is a list
+ ```
+
+ **Step 2: stacking the semantic masks**:
+ ```python
+ # anomalib/data/image/mvtec_loco.py:281-285
+ semantic_mask = (
+     Mask(torch.zeros(image.shape[-2:])).to(torch.uint8)  # normal image: zero mask
+     if label_index == LabelName.NORMAL
+     else Mask(torch.stack([self._read_mask(path) for path in mask_path]))  # anomalous image: stack all masks
+ )
+ ```
+
+ **Step 3: producing the binary mask**:
+ ```python
+ # anomalib/data/image/mvtec_loco.py:287
+ binary_mask = Mask(semantic_mask.view(-1, *semantic_mask.shape[-2:]).int().any(dim=0).to(torch.uint8))
+ ```
+
+ ### How the Fusion Works
+
+ **Shape transformations**:
+ - Input: several masks, each of shape (H, W)
+ - After stacking: (N, H, W), where N is the number of masks
+ - `view(-1, H, W)`: reshape to (N, H, W)
+ - `any(dim=0)`: logical OR along the first dimension, giving (H, W)
+
+ **Fusion logic**:
+ ```python
+ # Illustrative example
+ import torch
+
+ mask1 = torch.tensor([[0, 1, 0],
+                       [1, 0, 1],
+                       [0, 1, 0]], dtype=torch.uint8)
+ mask2 = torch.tensor([[0, 0, 1],
+                       [0, 1, 0],
+                       [1, 0, 0]], dtype=torch.uint8)
+
+ # Stack: shape (2, 3, 3)
+ stacked = torch.stack([mask1, mask2])
+
+ # any(dim=0): pixel-wise OR across masks
+ result = stacked.int().any(dim=0).to(torch.uint8)
+ # result = [[0, 1, 1],
+ #           [1, 1, 1],
+ #           [1, 1, 0]]
+ ```
+
+ ### Complete Data Loading Flow
+
+ **MVTec LOCO data item structure**:
+ ```python
+ # Normal sample
+ item = {
+     "image_path": "/path/to/normal_image.png",
+     "label": 0,
+     "image": torch.Tensor(...),
+     "mask": torch.zeros(H, W),          # zero mask
+     "mask_path": [],                    # empty list
+     "semantic_mask": torch.zeros(H, W)  # zero mask
+ }
+
+ # Anomalous sample
+ item = {
+     "image_path": "/path/to/abnormal_image.png",
+     "label": 1,
+     "image": torch.Tensor(...),
+     "mask": torch.Tensor(...),          # fused binary mask
+     "mask_path": [                      # list of mask paths
+         "/path/to/ground_truth/image/000.png",
+         "/path/to/ground_truth/image/001.png",
+         "/path/to/ground_truth/image/002.png"
+     ],
+     "semantic_mask": torch.Tensor(...)  # original stacked masks, shape (N, H, W)
+ }
+ ```
+
+ ### Mask Usage During Evaluation
+
+ **Key property**: LogSAD does **not** use the ground truth masks at inference time; detection relies solely on the input image. The ground truth masks are only used for:
+
+ 1. **Performance evaluation**: computing AUROC, F1, and related metrics
+ 2. **Visual comparison**: contrasting predictions with the annotations
+ 3. **Metric computation**: pixel-level and semantic-level detection performance
+
+ **Consistency check**:
+ ```python
+ # anomalib/data/image/mvtec_loco.py:158-174
+ # Verify that mask files correspond to their image files
+ image_stems = samples.loc[samples.label_index == LabelName.ABNORMAL]["image_path"].apply(lambda x: Path(x).stem)
+ mask_parent_stems = samples.loc[samples.label_index == LabelName.ABNORMAL]["mask_path"].apply(
+     lambda x: {Path(mask_path).parent.stem for mask_path in x},
+ )
+ # Ensure image '005.png' corresponds to masks '005/000.png', '005/001.png', etc.
+ ```
+
+ ### Multi-Mask Scenarios in Practice
+
+ **Typical cases**:
+ 1. **Splicing Connectors**: connectors, cables, and clamps may be annotated separately
+ 2. **Juice Bottle**: liquid, label, and bottle defects may be annotated separately
+ 3. **Breakfast Box**: different missing food items may be annotated separately
+ 4. **Screw Bag**: anomalies of individual screws, nuts, and washers are annotated separately
+
+ **Advantages of this handling**:
+ - Detailed information about each anomalous region is preserved
+ - Joint evaluation of multiple anomaly types is supported
+ - Fine-grained performance analysis is straightforward
+ - It remains compatible with conventional binary anomaly-detection evaluation
+
+ ## Category-Specific Rules in Detail
+
+ The code contains **five main category-specific branches**, one per dataset category:
+
+ ### 1. Pushpins
+
+ **Location**: `model_ensemble.py:432-479`
+
+ **Logic**:
+ ```python
+ if self.class_name == 'pushpins':
+     # 1. Object counting
+     pushpins_count = num_labels - 1
+     if self.few_shot_inited and pushpins_count != self.pushpins_count:
+         self.anomaly_flag = True
+
+     # 2. Patch histogram matching
+     clip_patch_hist = np.bincount(patch_mask.reshape(-1), minlength=self.patch_query_obj.shape[0])
+     patch_hist_similarity = (clip_patch_hist @ self.patch_token_hist.T)
+     score = 1 - patch_hist_similarity.max()
+ ```
+
+ **Anomaly types detected**:
+ - Abnormal pushpin count (expected count: 15)
+ - Abnormal color distribution
+
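+ The patch-histogram matching used above can be illustrated with a small NumPy sketch. It assumes the patch-level label histograms are L2-normalized so that the dot product is a cosine similarity in [0, 1]; the sizes and values are made up for illustration:
+
+ ```python
+ import numpy as np
+
+ n_classes = 4                                                  # number of patch-level query words (illustrative)
+ patch_mask = np.random.randint(0, n_classes, size=(64, 64))    # semantic label per patch
+
+ # Reference histograms collected from the normal (few-shot) images, one per image.
+ ref_hist = np.random.rand(4, n_classes)
+ ref_hist /= np.linalg.norm(ref_hist, axis=-1, keepdims=True)
+
+ test_hist = np.bincount(patch_mask.reshape(-1), minlength=n_classes).astype(np.float64)
+ test_hist /= np.linalg.norm(test_hist)
+
+ hist_score = 1.0 - float((test_hist @ ref_hist.T).max())       # low similarity -> high anomaly score
+ print(hist_score)
+ ```
+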
+ ### 2. Splicing Connectors
+
+ **Location**: `model_ensemble.py:481-615`
+
+ **Logic** (the most involved branch):
+ ```python
+ elif self.class_name == 'splicing_connectors':
+     # 1. Connected-component check
+     if count != 1:
+         self.anomaly_flag = True
+
+     # 2. Cable color vs. clamp count consistency
+     foreground_pixel_count = np.sum(erode_binary) / self.splicing_connectors_count[idx_color]
+     ratio = foreground_pixel_count / self.foreground_pixel_hist_splicing_connectors
+     if ratio > 1.2 or ratio < 0.8:
+         self.anomaly_flag = True
+
+     # 3. Left/right symmetry check
+     ratio = np.sum(left_count) / (np.sum(right_count) + 1e-5)
+     if ratio > 1.2 or ratio < 0.8:
+         self.anomaly_flag = True
+
+     # 4. Distance check
+     distance = np.sqrt((x1/w - x2/w)**2 + (y1/h - y2/h)**2)
+     ratio = distance / self.splicing_connectors_distance
+     if ratio < 0.6 or ratio > 1.4:
+         self.anomaly_flag = True
+ ```
+
+ **Anomaly types detected**:
+ - Broken or missing cable
+ - Cable color not matching the clamp count (yellow: 2 clamps, blue: 3 clamps, red: 5 clamps)
+ - Left/right clamp asymmetry
+ - Abnormal cable length
+
+ ### 3. Screw Bag
+
+ **Location**: `model_ensemble.py:617-670`
+
+ **Logic**:
+ ```python
+ elif self.class_name == 'screw_bag':
+     # Foreground pixel statistics
+     foreground_pixel_count = np.sum(np.bincount(kmeans_mask.reshape(-1))[:len(self.foreground_label_idx[self.class_name])])
+     ratio = foreground_pixel_count / self.foreground_pixel_hist_screw_bag
+     if ratio < 0.94 or ratio > 1.06:
+         self.anomaly_flag = True
+ ```
+
+ **Anomaly types detected**:
+ - Abnormal counts of screws, nuts, or washers
+ - Abnormal foreground pixel ratio (threshold: ±6%)
+
+ ### 4. Juice Bottle
+
+ **Location**: `model_ensemble.py:715-771`
+
+ **Logic**:
+ ```python
+ elif self.class_name == 'juice_bottle':
+     # Liquid vs. fruit-label consistency
+     liquid_idx = (liquid_feature @ query_liquid.T).argmax(-1).squeeze(0).item()
+     fruit_idx = (fruit_feature @ query_fruit.T).argmax(-1).squeeze(0).item()
+     if liquid_idx != fruit_idx:
+         self.anomaly_flag = True
+ ```
+
+ **Anomaly types detected**:
+ - Liquid color not matching the fruit on the label
+ - Misplaced label
+
+ ### 5. Breakfast Box
+
+ **Location**: `model_ensemble.py:672-713`
+
+ **Logic**:
+ ```python
+ elif self.class_name == 'breakfast_box':
+     # Relies mainly on patch histogram matching
+     sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0])
+     patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T)
+     score = 1 - patch_hist_similarity.max()
+ ```
+
+ **Anomaly types detected**:
+ - Abnormal food arrangement
+ - Missing or extra items
+
+ ## Few-shot vs. Full-data Modes
+
+ ### Data Handling Differences
+
+ **Few-shot mode** (`model_ensemble_few_shot.py`):
+ ```python
+ # Use all few-shot samples directly
+ FEW_SHOT_SAMPLES = [0, 1, 2, 3]  # fixed set of 4 samples
+ self.k_shot = few_shot_samples.size(0)
+ ```
+
+ **Full-data mode** (`model_ensemble.py`):
+ ```python
+ # Build a coreset from the full training set
+ FEW_SHOT_SAMPLES = range(len(datamodule.train_data))  # all training samples
+ self.k_shot = 4 if self.total_size > 4 else self.total_size
+ ```
+
+ ### Coreset Subsampling
+
+ **Few-shot mode**: no coreset; the raw features are used directly
+ ```python
+ # model_ensemble_few_shot.py:852
+ self.mem_patch_feature_clip_coreset = patch_tokens_clip
+ self.mem_patch_feature_dinov2_coreset = patch_tokens_dinov2
+ ```
+
+ **Full-data mode**: the memory bank is subsampled with the K-Center Greedy algorithm
+ ```python
+ # model_ensemble.py:892-896
+ clip_sampler = KCenterGreedy(embedding=mem_patch_feature_clip_coreset, sampling_ratio=0.25)
+ mem_patch_feature_clip_coreset = clip_sampler.sample_coreset()
+
+ dinov2_sampler = KCenterGreedy(embedding=mem_patch_feature_dinov2_coreset, sampling_ratio=0.25)
+ mem_patch_feature_dinov2_coreset = dinov2_sampler.sample_coreset()
+ ```
+
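+ For intuition, k-center greedy selection is essentially farthest-point sampling: keep adding the sample that is farthest from everything selected so far. The sketch below is a simplified stand-in for anomalib's `KCenterGreedy` (which typically also applies a random projection to the embeddings first); shapes are illustrative:
+
+ ```python
+ import torch
+
+ def kcenter_greedy(embeddings: torch.Tensor, ratio: float) -> torch.Tensor:
+     """Greedy farthest-point selection of a representative subset."""
+     n_select = max(1, int(len(embeddings) * ratio))
+     selected = [int(torch.randint(len(embeddings), (1,)))]
+     min_dist = torch.cdist(embeddings, embeddings[selected]).squeeze(1)
+     for _ in range(n_select - 1):
+         idx = int(min_dist.argmax())          # farthest point from the current coreset
+         selected.append(idx)
+         new_dist = torch.cdist(embeddings, embeddings[idx:idx + 1]).squeeze(1)
+         min_dist = torch.minimum(min_dist, new_dist)
+     return embeddings[selected]
+
+ memory = torch.randn(4096, 512)               # pooled patch features from normal images
+ coreset = kcenter_greedy(memory, ratio=0.25)
+ print(coreset.shape)                          # torch.Size([1024, 512])
+ ```
+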
+ ### Statistics Differences
+
+ **Few-shot mode**:
+ ```python
+ # model_ensemble_few_shot.py:185
+ self.stats = pickle.load(open("memory_bank/statistic_scores_model_ensemble_few_shot_val.pkl", "rb"))
+ ```
+
+ **Full-data mode**:
+ ```python
+ # model_ensemble.py:188
+ self.stats = pickle.load(open("memory_bank/statistic_scores_model_ensemble_val.pkl", "rb"))
+ ```
+
+ ### Processing Flow Differences
+
+ **Few-shot mode**:
+ 1. Compute features for the 4 reference samples directly
+ 2. No coreset computation is needed
+ 3. Run anomaly detection immediately
+
+ **Full-data mode**:
+ 1. Compute features for all training samples (`compute_coreset.py`)
+ 2. Select representative features with the K-Center Greedy algorithm
+ 3. Save the coreset to the `memory_bank/` directory
+ 4. Load the precomputed coreset for anomaly detection
+
+ ## Implementation Details and Optimizations
+
+ ### Memory Optimization
+
+ **Batched processing**:
+ ```python
+ # model_ensemble.py:926-928
+ for i in range(self.total_size//self.k_shot):
+     self.process(class_name, few_shot_samples[self.k_shot*i : min(self.k_shot*(i+1), self.total_size)],
+                  few_shot_paths[self.k_shot*i : min(self.k_shot*(i+1), self.total_size)])
+ ```
+
+ **Feature caching**:
+ - Precomputed coreset features are stored in the `memory_bank/` directory
+ - Score statistics are precomputed and cached as well
+
+ ### Multimodal Feature Fusion
+
+ **Feature layer selection**:
+ - **Clustering features**: the first two extracted CLIP feature maps (`cluster_feature_id = [0, 1]`)
+ - **Detection features**: the full features from layers 6, 12, 18, and 24
+
+ **Model choice per category**:
+ ```python
+ # model_ensemble.py:290-310
+ if self.class_name in ['pushpins', 'screw_bag']:
+     # PatchCore-style detection on CLIP features
+     len_feature_list = len(self.feature_list)
+     for patch_feature, mem_patch_feature in zip(patch_tokens_clip.chunk(len_feature_list, dim=-1),
+                                                  mem_patch_feature_clip_coreset.chunk(len_feature_list, dim=-1)):
+
+ if self.class_name in ['splicing_connectors', 'breakfast_box', 'juice_bottle']:
+     # PatchCore-style detection on DINOv2 features
+     len_feature_list = len(self.feature_list_dinov2)
+     for patch_feature, mem_patch_feature in zip(patch_tokens_dinov2.chunk(len_feature_list, dim=-1),
+                                                  mem_patch_feature_dinov2_coreset.chunk(len_feature_list, dim=-1)):
+ ```
+
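+ The PatchCore-style structural score behind both branches reduces to a nearest-neighbour lookup of test patches in the memory bank. A minimal sketch follows; it is illustrative only and ignores the per-layer chunking and the k-nearest-neighbour details used in the repository:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def structural_score(patch_feats: torch.Tensor, memory: torch.Tensor) -> float:
+     """Distance of each test patch to its closest memorized normal patch;
+     the image-level score is the maximum over patches."""
+     patch_feats = F.normalize(patch_feats, dim=-1)
+     memory = F.normalize(memory, dim=-1)
+     dist = 1.0 - patch_feats @ memory.T      # cosine distance, (n_patches, n_memory)
+     per_patch = dist.min(dim=1).values       # nearest neighbour per patch
+     return float(per_patch.max())
+
+ test_patches = torch.randn(64 * 64, 1024)    # patch tokens of one layer for one image
+ memory_bank = torch.randn(2048, 1024)        # coreset features from normal images
+ print(structural_score(test_patches, memory_bank))
+ ```
+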
+ ## Text Prompt Engineering
+
+ ### Semantic Query Dictionaries
+
+ **Object-level queries**:
+ ```python
+ # model_ensemble.py:123-136
+ self.query_words_dict = {
+     "breakfast_box": ['orange', "nectarine", "cereals", "banana chips", 'almonds', 'white box', 'black background'],
+     "juice_bottle": ['bottle', ['black background', 'background']],
+     "pushpins": [['pushpin', 'pin'], ['plastic box', 'black background']],
+     "screw_bag": [['screw'], 'plastic bag', 'background'],
+     "splicing_connectors": [['splicing connector', 'splice connector',], ['cable', 'wire'], ['grid']],
+ }
+ ```
+
+ **Patch-level queries**:
+ ```python
+ # model_ensemble.py:138-145
+ self.patch_query_words_dict = {
+     "juice_bottle": [['glass'], ['liquid in bottle'], ['fruit'], ['label', 'tag'], ['black background', 'background']],
+     "screw_bag": [['hex screw', 'hexagon bolt'], ['hex nut', 'hexagon nut'], ['ring washer', 'ring gasket'], ['plastic bag', 'background']],
+     # ...
+ }
+ ```
+
+ ### Text Encoding Strategy
+
+ **Multi-template encoding**:
+ ```python
+ # prompt_ensemble.py:98-120
+ def encode_obj_text(model, query_words, tokenizer, device):
+     for qw in query_words:
+         if type(qw) == list:
+             for qw2 in qw:
+                 token_input.extend([temp(qw2) for temp in openai_imagenet_template])
+         else:
+             token_input = [temp(qw) for temp in openai_imagenet_template]
+ ```
+
+ Each query word is augmented with the 82 ImageNet prompt templates, which makes the text features more robust.
+
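+ Below is a sketch of how such a template ensemble is typically collapsed into a single text embedding with OpenCLIP. It reuses `openai_imagenet_template` from this repository; the smaller ViT-B-32 checkpoint and the mean-pooling are illustrative choices, not necessarily the exact pooling used in `prompt_ensemble.py`:
+
+ ```python
+ import torch
+ import open_clip
+ from imagenet_template import openai_imagenet_template
+
+ model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k")
+ tokenizer = open_clip.get_tokenizer("ViT-B-32")
+
+ def encode_with_templates(word: str) -> torch.Tensor:
+     prompts = [template(word) for template in openai_imagenet_template]  # 82 phrasings
+     with torch.no_grad():
+         feats = model.encode_text(tokenizer(prompts))                    # (82, D)
+     feats = feats / feats.norm(dim=-1, keepdim=True)
+     mean_feat = feats.mean(dim=0)                                        # average the ensemble
+     return mean_feat / mean_feat.norm()
+
+ pushpin_text = encode_with_templates("pushpin")
+ print(pushpin_text.shape)
+ ```
+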
+ ## Performance Evaluation
+
+ ### Metrics
+
+ **Image-level metrics**:
+ - F1-Max (image)
+ - AUROC (image)
+
+ **Per anomaly type**:
+ - F1-Max (logical): logical anomalies
+ - AUROC (logical): logical anomalies
+ - F1-Max (structural): structural anomalies
+ - AUROC (structural): structural anomalies
+
+ ### Evaluation Flow
+
+ **Splitting the test data**:
+ ```python
+ # evaluation.py:222-227
+ if 'logical' not in image_path[0]:
+     image_metric_structure.update(output["pred_score"].cpu(), data["label"])
+ if 'structural' not in image_path[0]:
+     image_metric_logical.update(output["pred_score"].cpu(), data["label"])
+ ```
+
+ **Score fusion**:
+ ```python
+ # model_ensemble.py:227-231
+ standard_structural_score = (structural_score - self.stats[self.class_name]["structural_scores"]["mean"]) / self.stats[self.class_name]["structural_scores"]["unbiased_std"]
+ standard_instance_hungarian_match_score = (instance_hungarian_match_score - self.stats[self.class_name]["instance_hungarian_match_scores"]["mean"]) / self.stats[self.class_name]["instance_hungarian_match_scores"]["unbiased_std"]
+
+ pred_score = max(standard_instance_hungarian_match_score, standard_structural_score)
+ pred_score = sigmoid(pred_score)
+ ```
+
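+ A compact, runnable sketch of this standardize-then-fuse step is given below; the statistics are made up, whereas in the repository they come from the cached validation-score pickle:
+
+ ```python
+ import numpy as np
+
+ def sigmoid(z):
+     return 1.0 / (1.0 + np.exp(-z))
+
+ # Per-category score statistics precomputed on validation images (values made up).
+ stats = {"structural": {"mean": 0.21, "std": 0.05},
+          "instance":   {"mean": 0.35, "std": 0.08}}
+
+ structural_score, instance_score = 0.34, 0.38
+ z_structural = (structural_score - stats["structural"]["mean"]) / stats["structural"]["std"]
+ z_instance = (instance_score - stats["instance"]["mean"]) / stats["instance"]["std"]
+
+ # The final image-level score is the sigmoid of the larger standardized score;
+ # hard rule violations (anomaly_flag) override it with 1.0.
+ pred_score = sigmoid(max(z_structural, z_instance))
+ print(round(pred_score, 4))
+ ```
+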
+ ## Summary
+
+ LogSAD combines the strengths of several pretrained models to detect anomalies without any training:
+
+ 1. **Multimodal collaboration**: CLIP contributes semantic understanding, DINOv2 contributes visual features, and SAM contributes precise segmentation
+ 2. **Logical reasoning**: category-specific rules encoding domain knowledge catch complex logical anomalies
+ 3. **Feature fusion**: multi-scale feature extraction and fusion improve detection accuracy
+ 4. **Efficiency**: coreset subsampling and feature caching keep the method practical
+
+ The method achieves strong results on the MVTec LOCO dataset and demonstrates the potential of pretrained foundation models for anomaly detection.
README.md ADDED
@@ -0,0 +1,102 @@
1
+ # Towards Training-free Anomaly Detection with Vision and Language Foundation Models (CVPR 2025)
2
+
3
+ <div>
4
+ <a href="https://arxiv.org/abs/2503.18325"><img src="https://img.shields.io/static/v1?label=Arxiv&message=LogSAD&color=red&logo=arxiv"></a> &ensp;
5
+ </div>
6
+
7
+ ## System Requirements
8
+
9
+ **Hardware Requirements:**
10
+ - **GPU Memory:** 32GB VRAM (for running complete experiments)
11
+ - **Storage:** 70GB free disk space (for models, datasets, and results)
12
+ - **CUDA:** Compatible GPU with CUDA 12.1 support
13
+
14
+ **Software Requirements:**
15
+ - Python 3.10
16
+ - Conda (recommended for environment management)
17
+ - CUDA 12.1 runtime
18
+
19
+ > **Note:** The memory and storage requirements are for running the full experimental pipeline on all categories with visualization enabled. Smaller experiments on individual categories may require fewer resources.
20
+
21
+ ## Installation
22
+
23
+ ### Automated Setup (Recommended)
24
+
25
+ Run the setup script to automatically configure the complete environment:
26
+
27
+ ```bash
28
+ bash scripts/setup_environment.sh
29
+ ```
30
+
31
+ This script will:
32
+ - Create a conda environment named `logsad` with Python 3.10
33
+ - Install PyTorch with CUDA 12.1 support
34
+ - Install all required dependencies from `requirements.txt`
35
+ - Configure numpy compatibility
36
+
37
+ ### Manual Setup
38
+
39
+ If you prefer manual setup, download the checkpoint for [ViT-H SAM model](https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth) and put it in the checkpoint folder.
40
+
41
+ After installation, activate the environment:
42
+ ```bash
43
+ conda activate logsad
44
+ ```
45
+
46
+
47
+ ## Instructions for MVTEC LOCO dataset
48
+
49
+ ### Quick Start (Recommended)
50
+
51
+ Run evaluation for all categories using the provided shell scripts:
52
+
53
+ **Few-shot Protocol:**
54
+ ```bash
55
+ bash scripts/run_few_shot.sh
56
+ ```
57
+
58
+ **Full-data Protocol:**
59
+ ```bash
60
+ bash scripts/run_full_data.sh
61
+ ```
62
+
63
+ ### Manual Execution
64
+
65
+ #### Few-shot Protocol
66
+ Run the script for few-shot protocal:
67
+
68
+ ```
69
+ python evaluation.py --module_path model_ensemble_few_shot --category CATEGORY --dataset_path DATASET_PATH
70
+ ```
71
+
72
+ #### Full-data Protocol
73
+ Run the script to compute coreset for full-data scenarios:
74
+
75
+ ```
76
+ python compute_coreset.py --module_path model_ensemble --category CATEGORY --dataset_path DATASET_PATH
77
+ ```
78
+
79
+ Run the script for full-data protocol:
80
+
81
+ ```
82
+ python evaluation.py --module_path model_ensemble --category CATEGORY --dataset_path DATASET_PATH
83
+ ```
84
+
85
+ **Available categories:** breakfast_box, juice_bottle, pushpins, screw_bag, splicing_connectors
86
+
87
+
88
+ ## Acknowledgement
89
+ We are grateful for the following awesome projects when implementing LogSAD:
90
+ * [SAM](https://github.com/facebookresearch/segment-anything), [OpenCLIP](https://github.com/mlfoundations/open_clip), [DINOv2](https://github.com/facebookresearch/dinov2) and [NACLIP](https://github.com/sinahmr/NACLIP).
91
+
92
+
93
+ ## Citation
94
+ If you find our paper helpful in your research or applications, please cite it with
95
+ ```
96
+ @inproceedings{zhang2025logsad,
97
+ title={Towards Training-free Anomaly Detection with Vision and Language Foundation Models},
98
+ author={Jinjin Zhang and Guodong Wang and Yizhou Jin and Di Huang},
99
+ year={2025},
100
+ booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
101
+ }
102
+ ```
compute_coreset.py ADDED
@@ -0,0 +1,121 @@
1
+ """Sample evaluation script for track 2."""
2
+
3
+ import os
4
+
5
+ # Set cache directories to use checkpoint folder for model downloads
6
+ os.environ['TORCH_HOME'] = './checkpoint'
7
+ os.environ['HF_HOME'] = './checkpoint/huggingface'
8
+ os.environ['TRANSFORMERS_CACHE'] = './checkpoint/huggingface/transformers'
9
+ os.environ['HF_HUB_CACHE'] = './checkpoint/huggingface/hub'
10
+
11
+ # Create checkpoint subdirectories if they don't exist
12
+ os.makedirs('./checkpoint/huggingface/transformers', exist_ok=True)
13
+ os.makedirs('./checkpoint/huggingface/hub', exist_ok=True)
14
+
15
+ import argparse
16
+ import importlib
17
+ import importlib.util
18
+
19
+ import torch
20
+ import logging
21
+ from torch import nn
22
+
23
+ # NOTE: The following MVTecLoco import is not available in anomalib v1.0.1.
24
+ # It will be available in v1.1.0 which will be released on April 29th, 2024.
25
+ # If you are using an earlier version of anomalib, you could install anomalib
26
+ # from the anomalib source code from the following branch:
27
+ # https://github.com/openvinotoolkit/anomalib/tree/feature/mvtec-loco
28
+ from anomalib.data import MVTecLoco
29
+ from anomalib.metrics.f1_max import F1Max
30
+ from anomalib.metrics.auroc import AUROC
31
+ from tabulate import tabulate
32
+ import numpy as np
33
+
34
+ # FEW_SHOT_SAMPLES = [0, 1, 2, 3]
35
+
36
+ def parse_args() -> argparse.Namespace:
37
+ """Parse command line arguments.
38
+
39
+ Returns:
40
+ argparse.Namespace: Parsed arguments.
41
+ """
42
+ parser = argparse.ArgumentParser()
43
+ parser.add_argument("--module_path", type=str, required=True)
44
+ parser.add_argument("--class_name", default='MyModel', type=str, required=False)
45
+ parser.add_argument("--weights_path", type=str, required=False)
46
+ parser.add_argument("--dataset_path", default='/home/bhu/Project/datasets/mvtec_loco_anomaly_detection/', type=str, required=False)
47
+ parser.add_argument("--category", type=str, required=True)
48
+ parser.add_argument("--viz", action='store_true', default=False)
49
+ return parser.parse_args()
50
+
51
+
52
+ def load_model(module_path: str, class_name: str, weights_path: str) -> nn.Module:
53
+ """Load model.
54
+
55
+ Args:
56
+ module_path (str): Path to the module containing the model class.
57
+ class_name (str): Name of the model class.
58
+ weights_path (str): Path to the model weights.
59
+
60
+ Returns:
61
+ nn.Module: Loaded model.
62
+ """
63
+ # get model class
64
+ model_class = getattr(importlib.import_module(module_path), class_name)
65
+ # instantiate model
66
+ model = model_class()
67
+ # load weights
68
+ if weights_path:
69
+ model.load_state_dict(torch.load(weights_path))
70
+ return model
71
+
72
+
73
+ def run(module_path: str, class_name: str, weights_path: str, dataset_path: str, category: str, viz: bool) -> None:
74
+ """Run the evaluation script.
75
+
76
+ Args:
77
+ module_path (str): Path to the module containing the model class.
78
+ class_name (str): Name of the model class.
79
+ weights_path (str): Path to the model weights.
80
+ dataset_path (str): Path to the dataset.
81
+ category (str): Category of the dataset.
82
+ """
83
+ # Disable verbose logging from all libraries
84
+ logging.getLogger().setLevel(logging.ERROR)
85
+ logging.getLogger('anomalib').setLevel(logging.ERROR)
86
+ logging.getLogger('lightning').setLevel(logging.ERROR)
87
+ logging.getLogger('pytorch_lightning').setLevel(logging.ERROR)
88
+
89
+ device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
90
+
91
+ # Instantiate model class here
92
+ # Load the model here from checkpoint.
93
+ model = load_model(module_path, class_name, weights_path)
94
+ model.to(device)
95
+
96
+ # Create the dataset
97
+ datamodule = MVTecLoco(root=dataset_path, eval_batch_size=1, image_size=(448, 448), category=category)
98
+ datamodule.setup()
99
+
100
+ model.set_viz(viz)
101
+ model.set_save_coreset_features(True)
102
+
103
+
104
+ FEW_SHOT_SAMPLES = range(len(datamodule.train_data)) # traverse all dataset to build coreset
105
+
106
+ # pass few-shot images and dataset category to model
107
+ setup_data = {
108
+ "few_shot_samples": torch.stack([datamodule.train_data[idx]["image"] for idx in FEW_SHOT_SAMPLES]).to(device),
109
+ "few_shot_samples_path": [datamodule.train_data[idx]["image_path"] for idx in FEW_SHOT_SAMPLES],
110
+ "dataset_category": category,
111
+ }
112
+ model.setup(setup_data)
113
+
114
+ print(f"✓ Coreset computation completed for {category}")
115
+ print(f" Memory bank features saved to memory_bank/ directory")
116
+
117
+
118
+
119
+ if __name__ == "__main__":
120
+ args = parse_args()
121
+ run(args.module_path, args.class_name, args.weights_path, args.dataset_path, args.category, args.viz)
environment.yml ADDED
@@ -0,0 +1,116 @@
1
+ name: logsad
2
+ channels:
3
+ - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
4
+ - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro/
5
+ - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
6
+ - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
7
+ - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
8
+ - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
9
+ - defaults
10
+ dependencies:
11
+ - _libgcc_mutex=0.1=conda_forge
12
+ - _openmp_mutex=4.5=2_gnu
13
+ - bzip2=1.0.8=h4bc722e_7
14
+ - ca-certificates=2025.8.3=hbd8a1cb_0
15
+ - ld_impl_linux-64=2.44=h1423503_1
16
+ - libexpat=2.7.1=hecca717_0
17
+ - libffi=3.4.6=h2dba641_1
18
+ - libgcc=15.1.0=h767d61c_4
19
+ - libgcc-ng=15.1.0=h69a702a_4
20
+ - libgomp=15.1.0=h767d61c_4
21
+ - liblzma=5.8.1=hb9d3cd8_2
22
+ - libnsl=2.0.1=hb9d3cd8_1
23
+ - libsqlite=3.50.4=h0c1763c_0
24
+ - libuuid=2.38.1=h0b41bf4_0
25
+ - libxcrypt=4.4.36=hd590300_1
26
+ - libzlib=1.3.1=hb9d3cd8_2
27
+ - ncurses=6.5=h2d0b736_3
28
+ - openssl=3.5.2=h26f9b46_0
29
+ - pip=25.2=pyh8b19718_0
30
+ - python=3.10.18=hd6af730_0_cpython
31
+ - readline=8.2=h8c095d6_2
32
+ - setuptools=80.9.0=pyhff2d567_0
33
+ - tk=8.6.13=noxft_hd72426e_102
34
+ - wheel=0.45.1=pyhd8ed1ab_1
35
+ - pip:
36
+ - aiohappyeyeballs==2.6.1
37
+ - aiohttp==3.12.11
38
+ - aiosignal==1.3.2
39
+ - antlr4-python3-runtime==4.9.3
40
+ - async-timeout==5.0.1
41
+ - attrs==25.3.0
42
+ - certifi==2025.4.26
43
+ - charset-normalizer==3.4.2
44
+ - contourpy==1.3.2
45
+ - cycler==0.12.1
46
+ - einops==0.6.1
47
+ - faiss-cpu==1.8.0
48
+ - filelock==3.18.0
49
+ - fonttools==4.58.2
50
+ - freia==0.2
51
+ - frozenlist==1.6.2
52
+ - fsspec==2024.12.0
53
+ - ftfy==6.3.1
54
+ - hf-xet==1.1.3
55
+ - huggingface-hub==0.32.4
56
+ - idna==3.10
57
+ - imageio==2.37.0
58
+ - imgaug==0.4.0
59
+ - jinja2==3.1.6
60
+ - joblib==1.5.1
61
+ - jsonargparse==4.29.0
62
+ - kiwisolver==1.4.8
63
+ - kmeans-pytorch==0.3
64
+ - kornia==0.7.0
65
+ - lazy-loader==0.4
66
+ - lightning==2.2.5
67
+ - lightning-utilities==0.14.3
68
+ - markdown-it-py==3.0.0
69
+ - markupsafe==3.0.2
70
+ - matplotlib==3.10.3
71
+ - mdurl==0.1.2
72
+ - mpmath==1.3.0
73
+ - multidict==6.4.4
74
+ - networkx==3.4.2
75
+ - numpy==1.23.1
76
+ - omegaconf==2.3.0
77
+ - open-clip-torch==2.24.0
78
+ - opencv-python==4.8.1.78
79
+ - packaging==24.2
80
+ - pandas==2.0.3
81
+ - pillow==11.2.1
82
+ - propcache==0.3.1
83
+ - protobuf==6.31.1
84
+ - pygments==2.19.1
85
+ - pyparsing==3.2.3
86
+ - python-dateutil==2.9.0.post0
87
+ - pytorch-lightning==2.5.1.post0
88
+ - pytz==2025.2
89
+ - pyyaml==6.0.2
90
+ - regex==2024.11.6
91
+ - requests==2.32.3
92
+ - rich==13.7.1
93
+ - safetensors==0.5.3
94
+ - scikit-image==0.25.2
95
+ - scikit-learn==1.2.2
96
+ - scipy==1.15.3
97
+ - segment-anything==1.0
98
+ - sentencepiece==0.2.0
99
+ - shapely==2.1.1
100
+ - six==1.17.0
101
+ - sympy==1.14.0
102
+ - tabulate==0.9.0
103
+ - threadpoolctl==3.6.0
104
+ - tifffile==2025.5.10
105
+ - timm==1.0.15
106
+ - torch==2.1.2+cu121
107
+ - torchmetrics==1.7.2
108
+ - torchvision==0.16.2+cu121
109
+ - tqdm==4.67.1
110
+ - triton==2.1.0
111
+ - typing-extensions==4.14.0
112
+ - tzdata==2025.2
113
+ - urllib3==2.4.0
114
+ - wcwidth==0.2.13
115
+ - yarl==1.20.0
116
+ prefix: /opt/conda/envs/logsad
evaluation.py ADDED
@@ -0,0 +1,257 @@
1
+ """Sample evaluation script for track 2."""
2
+
3
+ import os
4
+ from datetime import datetime
5
+ from pathlib import Path
6
+
7
+ # Set cache directories to use checkpoint folder for model downloads
8
+ os.environ['TORCH_HOME'] = './checkpoint'
9
+ os.environ['HF_HOME'] = './checkpoint/huggingface'
10
+ os.environ['TRANSFORMERS_CACHE'] = './checkpoint/huggingface/transformers'
11
+ os.environ['HF_HUB_CACHE'] = './checkpoint/huggingface/hub'
12
+
13
+ # Create checkpoint subdirectories if they don't exist
14
+ os.makedirs('./checkpoint/huggingface/transformers', exist_ok=True)
15
+ os.makedirs('./checkpoint/huggingface/hub', exist_ok=True)
16
+
17
+ import argparse
18
+ import importlib
19
+ import importlib.util
20
+
21
+ import torch
22
+ import logging
23
+ from torch import nn
24
+
25
+ # NOTE: The following MVTecLoco import is not available in anomalib v1.0.1.
26
+ # It will be available in v1.1.0 which will be released on April 29th, 2024.
27
+ # If you are using an earlier version of anomalib, you could install anomalib
28
+ # from the anomalib source code from the following branch:
29
+ # https://github.com/openvinotoolkit/anomalib/tree/feature/mvtec-loco
30
+ from anomalib.data import MVTecLoco
31
+ from anomalib.metrics.f1_max import F1Max
32
+ from anomalib.metrics.auroc import AUROC
33
+ from tabulate import tabulate
34
+ import numpy as np
35
+
36
+ FEW_SHOT_SAMPLES = [0, 1, 2, 3]
37
+
38
+ def write_results_to_markdown(category, results_data, module_path):
39
+ """Write evaluation results to markdown file.
40
+
41
+ Args:
42
+ category (str): Dataset category name
43
+ results_data (dict): Dictionary containing all metrics
44
+ module_path (str): Model module path (for protocol identification)
45
+ """
46
+ # Determine protocol type from module path
47
+ protocol = "Few-shot" if "few_shot" in module_path else "Full-data"
48
+
49
+ # Create results directory
50
+ results_dir = Path("results")
51
+ results_dir.mkdir(exist_ok=True)
52
+
53
+ # Combined results file with simple protocol name
54
+ protocol_suffix = "few_shot" if "few_shot" in module_path else "full_data"
55
+ combined_file = results_dir / f"{protocol_suffix}_results.md"
56
+
57
+ # Read existing results if file exists
58
+ existing_results = {}
59
+ if combined_file.exists():
60
+ with open(combined_file, 'r') as f:
61
+ content = f.read()
62
+ # Parse existing results (basic parsing)
63
+ lines = content.split('\n')
64
+ for line in lines:
65
+ if '|' in line and line.count('|') >= 8:
66
+ parts = [p.strip() for p in line.split('|')]
67
+ if len(parts) >= 8 and parts[1] != 'Category' and not parts[1].startswith('-'):
68
+ existing_results[parts[1]] = {
69
+ 'k_shots': parts[2],
70
+ 'f1_image': parts[3],
71
+ 'auroc_image': parts[4],
72
+ 'f1_logical': parts[5],
73
+ 'auroc_logical': parts[6],
74
+ 'f1_structural': parts[7],
75
+ 'auroc_structural': parts[8]
76
+ }
77
+
78
+ # Add current results
79
+ existing_results[category] = {
80
+ 'k_shots': str(len(FEW_SHOT_SAMPLES)),
81
+ 'f1_image': f"{results_data['f1_image']:.2f}",
82
+ 'auroc_image': f"{results_data['auroc_image']:.2f}",
83
+ 'f1_logical': f"{results_data['f1_logical']:.2f}",
84
+ 'auroc_logical': f"{results_data['auroc_logical']:.2f}",
85
+ 'f1_structural': f"{results_data['f1_structural']:.2f}",
86
+ 'auroc_structural': f"{results_data['auroc_structural']:.2f}"
87
+ }
88
+
89
+ # Write combined results
90
+ with open(combined_file, 'w') as f:
91
+ f.write(f"# All Categories - {protocol} Protocol Results\n\n")
92
+ f.write(f"**Last Updated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
93
+ f.write(f"**Protocol:** {protocol}\n")
94
+ f.write(f"**Available Categories:** {', '.join(sorted(existing_results.keys()))}\n\n")
95
+
96
+ f.write("## Summary Table\n\n")
97
+ f.write("| Category | K-shots | F1-Max (Image) | AUROC (Image) | F1-Max (Logical) | AUROC (Logical) | F1-Max (Structural) | AUROC (Structural) |\n")
98
+ f.write("|----------|---------|----------------|---------------|------------------|-----------------|---------------------|-------------------|\n")
99
+
100
+ # Sort categories alphabetically
101
+ for cat in sorted(existing_results.keys()):
102
+ data = existing_results[cat]
103
+ f.write(f"| {cat} | {data['k_shots']} | {data['f1_image']} | {data['auroc_image']} | {data['f1_logical']} | {data['auroc_logical']} | {data['f1_structural']} | {data['auroc_structural']} |\n")
104
+
105
+ print(f"\n✓ Results saved to:")
106
+ print(f" - Combined: {combined_file}")
107
+
108
+ def parse_args() -> argparse.Namespace:
109
+ """Parse command line arguments.
110
+
111
+ Returns:
112
+ argparse.Namespace: Parsed arguments.
113
+ """
114
+ parser = argparse.ArgumentParser()
115
+ parser.add_argument("--module_path", type=str, required=True)
116
+ parser.add_argument("--class_name", default='MyModel', type=str, required=False)
117
+ parser.add_argument("--weights_path", type=str, required=False)
118
+ parser.add_argument("--dataset_path", default='/home/bhu/Project/datasets/mvtec_loco_anomaly_detection/', type=str, required=False)
119
+ parser.add_argument("--category", type=str, required=True)
120
+ parser.add_argument("--viz", action='store_true', default=False)
121
+ return parser.parse_args()
122
+
123
+
124
+ def load_model(module_path: str, class_name: str, weights_path: str) -> nn.Module:
125
+ """Load model.
126
+
127
+ Args:
128
+ module_path (str): Path to the module containing the model class.
129
+ class_name (str): Name of the model class.
130
+ weights_path (str): Path to the model weights.
131
+
132
+ Returns:
133
+ nn.Module: Loaded model.
134
+ """
135
+ # get model class
136
+ model_class = getattr(importlib.import_module(module_path), class_name)
137
+ # instantiate model
138
+ model = model_class()
139
+ # load weights
140
+ if weights_path:
141
+ model.load_state_dict(torch.load(weights_path))
142
+ return model
143
+
144
+
145
+ def run(module_path: str, class_name: str, weights_path: str, dataset_path: str, category: str, viz: bool) -> None:
146
+ """Run the evaluation script.
147
+
148
+ Args:
149
+ module_path (str): Path to the module containing the model class.
150
+ class_name (str): Name of the model class.
151
+ weights_path (str): Path to the model weights.
152
+ dataset_path (str): Path to the dataset.
153
+ category (str): Category of the dataset.
154
+ """
155
+ device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
156
+
157
+ # Instantiate model class here
158
+ # Load the model here from checkpoint.
159
+ model = load_model(module_path, class_name, weights_path)
160
+ model.to(device)
161
+
162
+ #
163
+ # Create the dataset
164
+ datamodule = MVTecLoco(root=dataset_path, eval_batch_size=1, image_size=(448, 448), category=category)
165
+ datamodule.setup()
166
+
167
+ model.set_viz(viz)
168
+
169
+ #
170
+ # Create the metrics
171
+ image_metric = F1Max()
172
+ pixel_metric = F1Max()
173
+
174
+ image_metric_logical = F1Max()
175
+ image_metric_structure = F1Max()
176
+
177
+ image_metric_auroc = AUROC()
178
+ pixel_metric_auroc = AUROC()
179
+
180
+ image_metric_auroc_logical = AUROC()
181
+ image_metric_auroc_structure = AUROC()
182
+
183
+
184
+ #
185
+ # pass few-shot images and dataset category to model
186
+ setup_data = {
187
+ "few_shot_samples": torch.stack([datamodule.train_data[idx]["image"] for idx in FEW_SHOT_SAMPLES]).to(device),
188
+ "few_shot_samples_path": [datamodule.train_data[idx]["image_path"] for idx in FEW_SHOT_SAMPLES],
189
+ "dataset_category": category,
190
+ }
191
+ model.setup(setup_data)
192
+
193
+ # Loop over the test set and compute the metrics
194
+ for data in datamodule.test_dataloader():
195
+ with torch.no_grad():
196
+ image_path = data['image_path']
197
+ output = model(data["image"].to(device), data['image_path'])
198
+
199
+ image_metric.update(output["pred_score"].cpu(), data["label"])
200
+ image_metric_auroc.update(output["pred_score"].cpu(), data["label"])
201
+
202
+ if 'logical' not in image_path[0]:
203
+ image_metric_structure.update(output["pred_score"].cpu(), data["label"])
204
+ image_metric_auroc_structure.update(output["pred_score"].cpu(), data["label"])
205
+ if 'structural' not in image_path[0]:
206
+ image_metric_logical.update(output["pred_score"].cpu(), data["label"])
207
+ image_metric_auroc_logical.update(output["pred_score"].cpu(), data["label"])
208
+
209
+
210
+
211
+ # Disable verbose logging from all libraries
212
+ logging.getLogger().setLevel(logging.ERROR)
213
+ logging.getLogger('anomalib').setLevel(logging.ERROR)
214
+ logging.getLogger('lightning').setLevel(logging.ERROR)
215
+ logging.getLogger('pytorch_lightning').setLevel(logging.ERROR)
216
+
217
+ # Set up our own logger for results only
218
+ logger = logging.getLogger('evaluation')
219
+ logger.handlers.clear()
220
+ logger.setLevel(logging.INFO)
221
+ formatter = logging.Formatter('%(asctime)s.%(msecs)03d - %(levelname)s: %(message)s', datefmt='%y-%m-%d %H:%M:%S')
222
+ console_handler = logging.StreamHandler()
223
+ console_handler.setFormatter(formatter)
224
+ logger.addHandler(console_handler)
225
+
226
+ table_ls = [[category,
227
+ str(len(FEW_SHOT_SAMPLES)),
228
+ str(np.round(image_metric.compute().item() * 100, decimals=2)),
229
+ str(np.round(image_metric_auroc.compute().item() * 100, decimals=2)),
230
+ # str(np.round(pixel_metric.compute().item() * 100, decimals=2)),
231
+ # str(np.round(pixel_metric_auroc.compute().item() * 100, decimals=2)),
232
+ str(np.round(image_metric_logical.compute().item() * 100, decimals=2)),
233
+ str(np.round(image_metric_auroc_logical.compute().item() * 100, decimals=2)),
234
+ str(np.round(image_metric_structure.compute().item() * 100, decimals=2)),
235
+ str(np.round(image_metric_auroc_structure.compute().item() * 100, decimals=2)),
236
+ ]]
237
+
238
+ results = tabulate(table_ls, headers=['category', 'K-shots', 'F1-Max(image)', 'AUROC(image)', 'F1-Max (logical)', 'AUROC (logical)', 'F1-Max (structural)', 'AUROC (structural)'], tablefmt="pipe")
239
+
240
+ logger.info("\n%s", results)
241
+
242
+ # Save results to markdown
243
+ results_data = {
244
+ 'f1_image': np.round(image_metric.compute().item() * 100, decimals=2),
245
+ 'auroc_image': np.round(image_metric_auroc.compute().item() * 100, decimals=2),
246
+ 'f1_logical': np.round(image_metric_logical.compute().item() * 100, decimals=2),
247
+ 'auroc_logical': np.round(image_metric_auroc_logical.compute().item() * 100, decimals=2),
248
+ 'f1_structural': np.round(image_metric_structure.compute().item() * 100, decimals=2),
249
+ 'auroc_structural': np.round(image_metric_auroc_structure.compute().item() * 100, decimals=2)
250
+ }
251
+ write_results_to_markdown(category, results_data, module_path)
252
+
253
+
254
+
255
+ if __name__ == "__main__":
256
+ args = parse_args()
257
+ run(args.module_path, args.class_name, args.weights_path, args.dataset_path, args.category, args.viz)
imagenet_template.py ADDED
@@ -0,0 +1,82 @@
1
+ openai_imagenet_template = [
2
+ lambda c: f'a bad photo of a {c}.',
3
+ lambda c: f'a photo of many {c}.',
4
+ lambda c: f'a sculpture of a {c}.',
5
+ lambda c: f'a photo of the hard to see {c}.',
6
+ lambda c: f'a low resolution photo of the {c}.',
7
+ lambda c: f'a rendering of a {c}.',
8
+ lambda c: f'graffiti of a {c}.',
9
+ lambda c: f'a bad photo of the {c}.',
10
+ lambda c: f'a cropped photo of the {c}.',
11
+ lambda c: f'a tattoo of a {c}.',
12
+ lambda c: f'the embroidered {c}.',
13
+ lambda c: f'a photo of a hard to see {c}.',
14
+ lambda c: f'a bright photo of a {c}.',
15
+ lambda c: f'a photo of a clean {c}.',
16
+ lambda c: f'a photo of a dirty {c}.',
17
+ lambda c: f'a dark photo of the {c}.',
18
+ lambda c: f'a drawing of a {c}.',
19
+ lambda c: f'a photo of my {c}.',
20
+ lambda c: f'the plastic {c}.',
21
+ lambda c: f'a photo of the cool {c}.',
22
+ lambda c: f'a close-up photo of a {c}.',
23
+ lambda c: f'a black and white photo of the {c}.',
24
+ lambda c: f'a painting of the {c}.',
25
+ lambda c: f'a painting of a {c}.',
26
+ lambda c: f'a pixelated photo of the {c}.',
27
+ lambda c: f'a sculpture of the {c}.',
28
+ lambda c: f'a bright photo of the {c}.',
29
+ lambda c: f'a cropped photo of a {c}.',
30
+ lambda c: f'a plastic {c}.',
31
+ lambda c: f'a photo of the dirty {c}.',
32
+ lambda c: f'a jpeg corrupted photo of a {c}.',
33
+ lambda c: f'a blurry photo of the {c}.',
34
+ lambda c: f'a photo of the {c}.',
35
+ lambda c: f'a good photo of the {c}.',
36
+ lambda c: f'a rendering of the {c}.',
37
+ lambda c: f'a {c} in a video game.',
38
+ lambda c: f'a photo of one {c}.',
39
+ lambda c: f'a doodle of a {c}.',
40
+ lambda c: f'a close-up photo of the {c}.',
41
+ lambda c: f'a photo of a {c}.',
42
+ lambda c: f'the origami {c}.',
43
+ lambda c: f'the {c} in a video game.',
44
+ lambda c: f'a sketch of a {c}.',
45
+ lambda c: f'a doodle of the {c}.',
46
+ lambda c: f'a origami {c}.',
47
+ lambda c: f'a low resolution photo of a {c}.',
48
+ lambda c: f'the toy {c}.',
49
+ lambda c: f'a rendition of the {c}.',
50
+ lambda c: f'a photo of the clean {c}.',
51
+ lambda c: f'a photo of a large {c}.',
52
+ lambda c: f'a rendition of a {c}.',
53
+ lambda c: f'a photo of a nice {c}.',
54
+ lambda c: f'a photo of a weird {c}.',
55
+ lambda c: f'a blurry photo of a {c}.',
56
+ lambda c: f'a cartoon {c}.',
57
+ lambda c: f'art of a {c}.',
58
+ lambda c: f'a sketch of the {c}.',
59
+ lambda c: f'a embroidered {c}.',
60
+ lambda c: f'a pixelated photo of a {c}.',
61
+ lambda c: f'itap of the {c}.',
62
+ lambda c: f'a jpeg corrupted photo of the {c}.',
63
+ lambda c: f'a good photo of a {c}.',
64
+ lambda c: f'a plushie {c}.',
65
+ lambda c: f'a photo of the nice {c}.',
66
+ lambda c: f'a photo of the small {c}.',
67
+ lambda c: f'a photo of the weird {c}.',
68
+ lambda c: f'the cartoon {c}.',
69
+ lambda c: f'art of the {c}.',
70
+ lambda c: f'a drawing of the {c}.',
71
+ lambda c: f'a photo of the large {c}.',
72
+ lambda c: f'a black and white photo of a {c}.',
73
+ lambda c: f'the plushie {c}.',
74
+ lambda c: f'a dark photo of a {c}.',
75
+ lambda c: f'itap of a {c}.',
76
+ lambda c: f'graffiti of the {c}.',
77
+ lambda c: f'a toy {c}.',
78
+ lambda c: f'itap of my {c}.',
79
+ lambda c: f'a photo of a cool {c}.',
80
+ lambda c: f'a photo of a small {c}.',
81
+ lambda c: f'a tattoo of the {c}.',
82
+ ]
model_ensemble.py ADDED
@@ -0,0 +1,1034 @@
1
+ import os
2
+
3
+ # Set cache directories to use checkpoint folder for model downloads
4
+ os.environ['TORCH_HOME'] = './checkpoint'
5
+ os.environ['HF_HOME'] = './checkpoint/huggingface'
6
+ os.environ['TRANSFORMERS_CACHE'] = './checkpoint/huggingface/transformers'
7
+ os.environ['HF_HUB_CACHE'] = './checkpoint/huggingface/hub'
8
+
9
+ # Create checkpoint subdirectories if they don't exist
10
+ os.makedirs('./checkpoint/huggingface/transformers', exist_ok=True)
11
+ os.makedirs('./checkpoint/huggingface/hub', exist_ok=True)
12
+
13
+ import torch
14
+ from torch import nn
15
+ from torchvision.transforms import v2
16
+ from torchvision.transforms.v2.functional import resize
17
+ import cv2
18
+ import json
19
+ import torch
20
+ import random
21
+ import logging
22
+ import argparse
23
+ import numpy as np
24
+ from PIL import Image
25
+ from skimage import measure
26
+ from tabulate import tabulate
27
+ from torchvision.ops.focal_loss import sigmoid_focal_loss
28
+ import torch.nn.functional as F
29
+ import torchvision.transforms as transforms
30
+ import torchvision.transforms.functional as TF
31
+ from sklearn.metrics import auc, roc_auc_score, average_precision_score, f1_score, precision_recall_curve, pairwise
32
+ from sklearn.mixture import GaussianMixture
33
+ import faiss
34
+ import open_clip_local as open_clip
35
+
36
+ from torch.utils.data.dataset import ConcatDataset
37
+ from scipy.optimize import linear_sum_assignment
38
+ from sklearn.random_projection import SparseRandomProjection
39
+ import cv2
40
+ from torchvision.transforms import InterpolationMode
41
+ from PIL import Image
42
+ import string
43
+
44
+ from prompt_ensemble import encode_text_with_prompt_ensemble, encode_normal_text, encode_abnormal_text, encode_general_text, encode_obj_text
45
+ from kmeans_pytorch import kmeans, kmeans_predict
46
+ from scipy.optimize import linear_sum_assignment
47
+ from scipy.stats import norm
48
+
49
+ from segment_anything import sam_model_registry, SamAutomaticMaskGenerator, SamPredictor
50
+ from matplotlib import pyplot as plt
51
+
52
+ import pickle
53
+ from scipy.stats import norm
54
+
55
+ from open_clip_local.pos_embed import get_2d_sincos_pos_embed
56
+
57
+ from anomalib.models.components import KCenterGreedy
58
+
59
+ def to_np_img(m):
60
+ m = m.permute(1, 2, 0).cpu().numpy()
61
+ mean = np.array([[[0.48145466, 0.4578275, 0.40821073]]])
62
+ std = np.array([[[0.26862954, 0.26130258, 0.27577711]]])
63
+ m = m * std + mean
64
+ return np.clip((m * 255.), 0, 255).astype(np.uint8)
65
+
66
+
67
+ def setup_seed(seed):
68
+ torch.manual_seed(seed)
69
+ torch.cuda.manual_seed_all(seed)
70
+ np.random.seed(seed)
71
+ random.seed(seed)
72
+ torch.backends.cudnn.deterministic = True
73
+ torch.backends.cudnn.benchmark = False
74
+
75
+
76
+ class MyModel(nn.Module):
77
+ """Example model class for track 2.
78
+
79
+ This class applies few-shot anomaly detection using the WinClip model from Anomalib.
80
+ """
81
+
82
+ def __init__(self) -> None:
83
+ super().__init__()
84
+
85
+ setup_seed(42)
86
+ # NOTE: Create your transformation pipeline (if needed).
87
+ self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
88
+ self.transform = v2.Compose(
89
+ [
90
+ v2.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)),
91
+ ],
92
+ )
93
+
94
+ # NOTE: Create your model.
95
+
96
+ self.model_clip, _, _ = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
97
+ self.tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
98
+ self.feature_list = [6, 12, 18, 24]
99
+ self.embed_dim = 768
100
+ self.vision_width = 1024
101
+
102
+ self.model_sam = sam_model_registry["vit_h"](checkpoint = "./checkpoint/sam_vit_h_4b8939.pth").to(self.device)
103
+ self.mask_generator = SamAutomaticMaskGenerator(model = self.model_sam)
104
+
105
+ self.memory_size = 2048
106
+ self.n_neighbors = 2
107
+
108
+ self.model_clip.eval()
109
+ self.test_args = None
110
+ self.align_corners = True # False
111
+ self.antialias = True # False
112
+ self.inter_mode = 'bilinear' # bilinear/bicubic
113
+
114
+ self.cluster_feature_id = [0, 1]
115
+
116
+ self.cluster_num_dict = {
117
+ "breakfast_box": 3, # unused
118
+ "juice_bottle": 8, # unused
119
+ "splicing_connectors": 10, # unused
120
+ "pushpins": 10,
121
+ "screw_bag": 10,
122
+ }
123
+ self.query_words_dict = {
124
+ "breakfast_box": ['orange', "nectarine", "cereals", "banana chips", 'almonds', 'white box', 'black background'],
125
+ "juice_bottle": ['bottle', ['black background', 'background']],
126
+ "pushpins": [['pushpin', 'pin'], ['plastic box', 'black background']],
127
+ "screw_bag": [['screw'], 'plastic bag', 'background'],
128
+ "splicing_connectors": [['splicing connector', 'splice connector',], ['cable', 'wire'], ['grid']],
129
+ }
130
+ self.foreground_label_idx = { # for query_words_dict
131
+ "breakfast_box": [0, 1, 2, 3, 4, 5],
132
+ "juice_bottle": [0],
133
+ "pushpins": [0],
134
+ "screw_bag": [0],
135
+ "splicing_connectors":[0, 1]
136
+ }
137
+
138
+ self.patch_query_words_dict = {
139
+ "breakfast_box": ['orange', "nectarine", "cereals", "banana chips", 'almonds', 'white box', 'black background'],
140
+ "juice_bottle": [['glass'], ['liquid in bottle'], ['fruit'], ['label', 'tag'], ['black background', 'background']],
141
+ "pushpins": [['pushpin', 'pin'], ['plastic box', 'black background']],
142
+ "screw_bag": [['hex screw', 'hexagon bolt'], ['hex nut', 'hexagon nut'], ['ring washer', 'ring gasket'], ['plastic bag', 'background']], # 79.71
143
+ "splicing_connectors": [['splicing connector', 'splice connector',], ['cable', 'wire'], ['grid']],
144
+ }
145
+
146
+
147
+ self.query_threshold_dict = {
148
+ "breakfast_box": [0., 0., 0., 0., 0., 0., 0.], # unused
149
+ "juice_bottle": [0., 0., 0.], # unused
150
+ "splicing_connectors": [0.15, 0.15, 0.15, 0., 0.], # unused
151
+ "pushpins": [0.2, 0., 0., 0.],
152
+ "screw_bag": [0., 0., 0.,],
153
+ }
154
+
155
+ self.feat_size = 64
156
+ self.ori_feat_size = 32
157
+
158
+ self.visualization = False #False # True #False
159
+
160
+ self.pushpins_count = 15
161
+
162
+ self.splicing_connectors_count = [2, 3, 5] # corresponding to yellow, blue, and red
163
+ self.splicing_connectors_distance = 0
164
+ self.splicing_connectors_cable_color_query_words_dict = [['yellow cable', 'yellow wire'], ['blue cable', 'blue wire'], ['red cable', 'red wire']]
165
+
166
+ self.juice_bottle_liquid_query_words_dict = [['red liquid', 'cherry juice'], ['yellow liquid', 'orange juice'], ['milky liquid']]
167
+ self.juice_bottle_fruit_query_words_dict = ['cherry', ['tangerine', 'orange'], 'banana']
168
+
169
+ # query words
170
+ self.foreground_pixel_hist = 0
171
+ self.foreground_pixel_hist_screw_bag = 366.0 # 4-shot statistics
172
+ self.foreground_pixel_hist_splicing_connectors = 4249.666666666667 # 4-shot statistics
173
+ # patch query words
174
+ self.patch_token_hist = []
175
+
176
+ self.few_shot_inited = False
177
+
178
+ self.save_coreset_features = False
179
+
180
+
181
+ from dinov2.dinov2.hub.backbones import dinov2_vitl14
182
+ self.model_dinov2 = dinov2_vitl14()
183
+ self.model_dinov2.to(self.device)
184
+ self.model_dinov2.eval()
185
+ self.feature_list_dinov2 = [6, 12, 18, 24]
186
+ self.vision_width_dinov2 = 1024
187
+
188
+ self.stats = pickle.load(open("memory_bank/statistic_scores_model_ensemble_val.pkl", "rb"))
189
+
190
+ self.mem_instance_masks = None
191
+
192
+ self.anomaly_flag = False
193
+ self.validation = False
194
+
195
+ def set_save_coreset_features(self, save_coreset_features):
196
+ self.save_coreset_features = save_coreset_features
197
+
198
+ def set_viz(self, viz):
199
+ self.visualization = viz
200
+
201
+ def set_val(self, val):
202
+ self.validation = val
203
+
204
+ def forward(self, batch: torch.Tensor, batch_path: list) -> dict[str, torch.Tensor]:
205
+ """Transform the input batch and pass it through the model.
206
+
207
+ This model returns a dictionary with the following keys
208
+ - ``anomaly_map`` - Anomaly map.
209
+ - ``pred_score`` - Predicted anomaly score.
210
+ """
211
+ self.anomaly_flag = False
212
+ batch = self.transform(batch).to(self.device)
213
+ results = self.forward_one_sample(batch, self.mem_patch_feature_clip_coreset, self.mem_patch_feature_dinov2_coreset, batch_path[0])
214
+
215
+ hist_score = results['hist_score']
216
+ structural_score = results['structural_score']
217
+ instance_hungarian_match_score = results['instance_hungarian_match_score']
218
+
219
+
220
+ if self.validation:
221
+ return {"hist_score": torch.tensor(hist_score), "structural_score": torch.tensor(structural_score), "instance_hungarian_match_score": torch.tensor(instance_hungarian_match_score)}
222
+
223
+ def sigmoid(z):
224
+ return 1/(1 + np.exp(-z))
225
+
226
+ # standardization
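+ # z-score each raw score with the mean / unbiased std precomputed offline (self.stats,
+ # loaded from memory_bank/statistic_scores_model_ensemble_val.pkl) so the structural and
+ # instance-matching scores are comparable; the larger z-score is squashed to (0, 1) by a
+ # sigmoid, and a triggered logical rule (self.anomaly_flag) overrides the result to 1 below.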
227
+ standard_structural_score = (structural_score - self.stats[self.class_name]["structural_scores"]["mean"]) / self.stats[self.class_name]["structural_scores"]["unbiased_std"]
228
+ standard_instance_hungarian_match_score = (instance_hungarian_match_score - self.stats[self.class_name]["instance_hungarian_match_scores"]["mean"]) / self.stats[self.class_name]["instance_hungarian_match_scores"]["unbiased_std"]
229
+
230
+ pred_score = max(standard_instance_hungarian_match_score, standard_structural_score)
231
+ pred_score = sigmoid(pred_score)
232
+
233
+
234
+ if self.anomaly_flag:
235
+ pred_score = 1.
236
+ self.anomaly_flag = False
237
+
238
+
239
+ return {"pred_score": torch.tensor(pred_score), "hist_score": torch.tensor(hist_score), "structural_score": torch.tensor(structural_score), "instance_hungarian_match_score": torch.tensor(instance_hungarian_match_score)}
240
+
241
+
242
+ def forward_one_sample(self, batch: torch.Tensor, mem_patch_feature_clip_coreset: torch.Tensor, mem_patch_feature_dinov2_coreset: torch.Tensor, path: str):
243
+
244
+ with torch.no_grad():
245
+ image_features, patch_tokens, proj_patch_tokens = self.model_clip.encode_image(batch, self.feature_list)
246
+ # image_features /= image_features.norm(dim=-1, keepdim=True)
247
+ patch_tokens = [p[:, 1:, :] for p in patch_tokens]
248
+ patch_tokens = [p.reshape(p.shape[0]*p.shape[1], p.shape[2]) for p in patch_tokens]
249
+
250
+ patch_tokens_clip = torch.cat(patch_tokens, dim=-1) # (1, 1024, 1024x4)
251
+ # patch_tokens_clip = torch.cat(patch_tokens[2:], dim=-1) # (1, 1024, 1024x2)
252
+ patch_tokens_clip = patch_tokens_clip.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
253
+ patch_tokens_clip = F.interpolate(patch_tokens_clip, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
254
+ patch_tokens_clip = patch_tokens_clip.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.feature_list))
255
+ patch_tokens_clip = F.normalize(patch_tokens_clip, p=2, dim=-1) # (1x64x64, 1024x4)
256
+
257
+ with torch.no_grad():
258
+ patch_tokens_dinov2 = self.model_dinov2.forward_features(batch, out_layer_list=self.feature_list)
259
+ patch_tokens_dinov2 = torch.cat(patch_tokens_dinov2, dim=-1) # (1, 1024, 1024x4)
260
+ patch_tokens_dinov2 = patch_tokens_dinov2.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
261
+ patch_tokens_dinov2 = F.interpolate(patch_tokens_dinov2, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
262
+ patch_tokens_dinov2 = patch_tokens_dinov2.permute(0, 2, 3, 1).view(-1, self.vision_width_dinov2 * len(self.feature_list_dinov2))
263
+ patch_tokens_dinov2 = F.normalize(patch_tokens_dinov2, p=2, dim=-1) # (1x64x64, 1024x4)
264
+
265
+ '''features for k-means segmentation'''
266
+ if self.feat_size != self.ori_feat_size:
267
+ proj_patch_tokens = proj_patch_tokens.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
268
+ proj_patch_tokens = F.interpolate(proj_patch_tokens, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
269
+ proj_patch_tokens = proj_patch_tokens.permute(0, 2, 3, 1).view(self.feat_size * self.feat_size, self.embed_dim)
270
+ proj_patch_tokens = F.normalize(proj_patch_tokens, p=2, dim=-1)
271
+
272
+ mid_features = None
273
+ for layer in self.cluster_feature_id:
274
+ temp_feat = patch_tokens[layer]
275
+ mid_features = temp_feat if mid_features is None else torch.cat((mid_features, temp_feat), -1)
276
+
277
+ if self.feat_size != self.ori_feat_size:
278
+ mid_features = mid_features.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
279
+ mid_features = F.interpolate(mid_features, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
280
+ mid_features = mid_features.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.cluster_feature_id))
281
+ mid_features = F.normalize(mid_features, p=2, dim=-1)
282
+
283
+ results = self.histogram(batch, mid_features, proj_patch_tokens, self.class_name, os.path.dirname(path).split('/')[-1] + "_" + os.path.basename(path).split('.')[0])
284
+
285
+ hist_score = results['score']
286
+
287
+ '''calculate patchcore'''
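+ # PatchCore-style structural score: for each feature stage, every patch token is compared
+ # (cosine similarity) against the few-shot coreset memory bank; the per-patch anomaly value
+ # is 1 - max similarity, and the image-level score is the max of the stage-averaged map.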
288
+ anomaly_maps_patchcore = []
289
+
290
+ if self.class_name in ['pushpins', 'screw_bag']: # clip feature for patchcore
291
+ len_feature_list = len(self.feature_list)
292
+ for patch_feature, mem_patch_feature in zip(patch_tokens_clip.chunk(len_feature_list, dim=-1), mem_patch_feature_clip_coreset.chunk(len_feature_list, dim=-1)):
293
+ patch_feature = F.normalize(patch_feature, dim=-1)
294
+ mem_patch_feature = F.normalize(mem_patch_feature, dim=-1)
295
+ normal_map_patchcore = (patch_feature @ mem_patch_feature.T)
296
+ normal_map_patchcore = (normal_map_patchcore.max(1)[0]).cpu().numpy() # 1: normal 0: abnormal
297
+ anomaly_map_patchcore = 1 - normal_map_patchcore
298
+
299
+ anomaly_maps_patchcore.append(anomaly_map_patchcore)
300
+
301
+ if self.class_name in ['splicing_connectors', 'breakfast_box', 'juice_bottle']: # dinov2 feature for patchcore
302
+ len_feature_list = len(self.feature_list_dinov2)
303
+ for patch_feature, mem_patch_feature in zip(patch_tokens_dinov2.chunk(len_feature_list, dim=-1), mem_patch_feature_dinov2_coreset.chunk(len_feature_list, dim=-1)):
304
+ patch_feature = F.normalize(patch_feature, dim=-1)
305
+ mem_patch_feature = F.normalize(mem_patch_feature, dim=-1)
306
+ normal_map_patchcore = (patch_feature @ mem_patch_feature.T)
307
+ normal_map_patchcore = (normal_map_patchcore.max(1)[0]).cpu().numpy() # 1: normal 0: abnormal
308
+ anomaly_map_patchcore = 1 - normal_map_patchcore
309
+
310
+ anomaly_maps_patchcore.append(anomaly_map_patchcore)
311
+
312
+ structural_score = np.stack(anomaly_maps_patchcore).mean(0).max()
313
+ # anomaly_map_structural = np.stack(anomaly_maps_patchcore).mean(0).reshape(self.feat_size, self.feat_size)
314
+
315
+ instance_masks = results["instance_masks"]
316
+ anomaly_instances_hungarian = []
317
+ instance_hungarian_match_score = 1.
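+ # Instance matching: average-pool the patch features inside each detected instance mask,
+ # build a cosine-distance cost matrix against the few-shot instance memory, and solve the
+ # optimal assignment with scipy's linear_sum_assignment (Hungarian algorithm); the matching
+ # cost, normalized by the smaller side of the cost matrix, is averaged over feature stages.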
318
+ if self.mem_instance_masks is not None and len(instance_masks) != 0:
319
+ for patch_feature, mem_instance_features_single_stage in zip(patch_tokens_clip.chunk(len_feature_list, dim=-1), self.mem_instance_features_multi_stage.chunk(len_feature_list, dim=1)):
320
+ instance_features = [patch_feature[mask, :].mean(0, keepdim=True) for mask in instance_masks]
321
+ instance_features = torch.cat(instance_features, dim=0)
322
+ instance_features = F.normalize(instance_features, dim=-1)
323
+
324
+ normal_instance_hungarian = (instance_features @ mem_instance_features_single_stage.T)
325
+ cost_matrix = (1 - normal_instance_hungarian).cpu().numpy()
326
+
327
+ row_ind, col_ind = linear_sum_assignment(cost_matrix)
328
+ cost = cost_matrix[row_ind, col_ind].sum()
329
+ cost = cost / min(cost_matrix.shape)
330
+ anomaly_instances_hungarian.append(cost)
331
+
332
+ instance_hungarian_match_score = np.mean(anomaly_instances_hungarian)
333
+
334
+ results = {'hist_score': hist_score, 'structural_score': structural_score, 'instance_hungarian_match_score': instance_hungarian_match_score}
335
+
336
+ return results
337
+
338
+
339
+ def histogram(self, image, cluster_feature, proj_patch_token, class_name, path):
340
+ def plot_results_only(sorted_anns):
341
+ cur = 1
342
+ img_color = np.zeros((sorted_anns[0]['segmentation'].shape[0], sorted_anns[0]['segmentation'].shape[1]))
343
+ for ann in sorted_anns:
344
+ m = ann['segmentation']
345
+ img_color[m] = cur
346
+ cur += 1
347
+ return img_color
348
+
349
+ def merge_segmentations(a, b, background_class):
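+ # Majority vote: relabel every SAM segment in `a` with the most frequent class it overlaps
+ # in the k-means/text mask `b`; segments with no overlap fall back to `background_class`.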
350
+ unique_labels_a = np.unique(a)
351
+ unique_labels_b = np.unique(b)
352
+
353
+ max_label_a = unique_labels_a.max()
354
+ label_map = np.zeros(max_label_a + 1, dtype=int)
355
+
356
+ for label_a in unique_labels_a:
357
+ mask_a = (a == label_a)
358
+
359
+ labels_b = b[mask_a]
360
+ if labels_b.size > 0:
361
+ count_b = np.bincount(labels_b, minlength=unique_labels_b.max() + 1)
362
+ label_map[label_a] = np.argmax(count_b)
363
+ else:
364
+ label_map[label_a] = background_class # default background
365
+
366
+ merged_a = label_map[a]
367
+ return merged_a
368
+
369
+ pseudo_labels = kmeans_predict(cluster_feature, self.cluster_centers, 'euclidean', device=self.device)
370
+ kmeans_mask = torch.ones_like(pseudo_labels) * (self.classes - 1) # default to background
371
+
372
+ for pl in pseudo_labels.unique():
373
+ mask = (pseudo_labels == pl).reshape(-1)
374
+ # filter small region
375
+ binary = mask.cpu().numpy().reshape(self.feat_size, self.feat_size).astype(np.uint8)
376
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
377
+ for i in range(1, num_labels):
378
+ temp_mask = labels == i
379
+ if np.sum(temp_mask) <= 8:
380
+ mask[temp_mask.reshape(-1)] = False
381
+
382
+ if mask.any():
383
+ region_feature = proj_patch_token[mask, :].mean(0, keepdim=True)
384
+ similarity = (region_feature @ self.query_obj.T)
385
+ prob, index = torch.max(similarity, dim=-1)
386
+ temp_label = index.squeeze(0).item()
387
+ temp_prob = prob.squeeze(0).item()
388
+ if temp_prob > self.query_threshold_dict[class_name][temp_label]: # threshold for each class
389
+ kmeans_mask[mask] = temp_label
390
+
391
+
392
+ raw_image = to_np_img(image[0])
393
+ height, width = raw_image.shape[:2]
394
+ masks = self.mask_generator.generate(raw_image)
395
+ # self.predictor.set_image(raw_image)
396
+
397
+ kmeans_label = pseudo_labels.view(self.feat_size, self.feat_size).cpu().numpy()
398
+ kmeans_mask = kmeans_mask.view(self.feat_size, self.feat_size).cpu().numpy()
399
+
400
+ patch_similarity = (proj_patch_token @ self.patch_query_obj.T)
401
+ patch_mask = patch_similarity.argmax(-1)
402
+ patch_mask = patch_mask.view(self.feat_size, self.feat_size).cpu().numpy()
403
+
404
+ sorted_masks = sorted(masks, key=(lambda x: x['area']), reverse=True)
405
+ sam_mask = plot_results_only(sorted_masks).astype(int)
406
+
407
+ resized_mask = cv2.resize(kmeans_mask, (width, height), interpolation = cv2.INTER_NEAREST)
408
+ merge_sam = merge_segmentations(sam_mask, resized_mask, background_class=self.classes-1)
409
+
410
+ resized_patch_mask = cv2.resize(patch_mask, (width, height), interpolation = cv2.INTER_NEAREST)
411
+ patch_merge_sam = merge_segmentations(sam_mask, resized_patch_mask, background_class=self.patch_query_obj.shape[0]-1)
412
+
413
+ # filter small region for merge sam
414
+ binary = np.isin(merge_sam, self.foreground_label_idx[self.class_name]).astype(np.uint8) # foreground 1 background 0
415
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
416
+ for i in range(1, num_labels):
417
+ temp_mask = labels == i
418
+ if np.sum(temp_mask) <= 32: # 448x448
419
+ merge_sam[temp_mask] = self.classes - 1 # set to background
420
+
421
+ # filter small region for patch merge sam
422
+ binary = (patch_merge_sam != (self.patch_query_obj.shape[0]-1) ).astype(np.uint8) # foreground 1 background 0
423
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
424
+ for i in range(1, num_labels):
425
+ temp_mask = labels == i
426
+ if np.sum(temp_mask) <= 32: # 448x448
427
+ patch_merge_sam[temp_mask] = self.patch_query_obj.shape[0]-1 # set to background
428
+
429
+ score = 0. # default to normal
430
+ self.anomaly_flag = False
431
+ instance_masks = []
432
+ if self.class_name == 'pushpins':
433
+ # object count hist
434
+ kernel = np.ones((3, 3), dtype=np.uint8) # dilate for robustness
435
+ binary = np.isin(merge_sam, self.foreground_label_idx[self.class_name]).astype(np.uint8) # foreground 1 background 0
436
+ dilate_binary = cv2.dilate(binary, kernel)
437
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(dilate_binary, connectivity=8)
438
+ pushpins_count = num_labels - 1 # number of pushpins
439
+
440
+ for i in range(1, num_labels):
441
+ instance_mask = (labels == i).astype(np.uint8)
442
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
443
+ if instance_mask.any():
444
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
445
+
446
+ if self.few_shot_inited and pushpins_count != self.pushpins_count and self.anomaly_flag is False:
447
+ self.anomaly_flag = True
448
+ print('number of pushpins: {}, but canonical number of pushpins: {}'.format(pushpins_count, self.pushpins_count))
449
+
450
+ # patch hist
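+ # L2-normalized class histogram of the patch-level text-segmentation labels; the logical
+ # score is 1 minus its maximum cosine similarity to the few-shot reference histograms.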
451
+ clip_patch_hist = np.bincount(patch_mask.reshape(-1), minlength=self.patch_query_obj.shape[0])
452
+ clip_patch_hist = clip_patch_hist / np.linalg.norm(clip_patch_hist)
453
+
454
+ if self.few_shot_inited:
455
+ patch_hist_similarity = (clip_patch_hist @ self.patch_token_hist.T)
456
+ score = 1 - patch_hist_similarity.max()
457
+
458
+ binary_foreground = dilate_binary.astype(np.uint8)
459
+
460
+ if len(instance_masks) != 0:
461
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
462
+
463
+ if self.visualization:
464
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
465
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
466
+ plt.figure(figsize=(20, 3))
467
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
468
+ plt.subplot(1, len(image_list), ind)
469
+ plt.imshow(temp_image)
470
+ plt.title(temp_title)
471
+ plt.margins(0, 0)
472
+ plt.axis('off')
473
+ # Extract relative path from class_name onwards
474
+ if class_name in path:
475
+ relative_path = path.split(class_name, 1)[-1]
476
+ if relative_path.startswith('/'):
477
+ relative_path = relative_path[1:]
478
+ save_path = f'visualization/full_data/{class_name}/{relative_path}.png'
479
+ else:
480
+ save_path = f'visualization/full_data/{class_name}/{path}.png'
481
+
482
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
483
+ plt.tight_layout()
484
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
485
+ plt.close()
486
+
487
+
488
+ # todo: same number in total but in different boxes or broken box
489
+ return {"score": score, "clip_patch_hist": clip_patch_hist, "instance_masks": instance_masks}
490
+
491
+ elif self.class_name == 'splicing_connectors':
492
+ # object count hist for default
493
+ sam_mask_max_area = sorted_masks[0]['segmentation'] # background
494
+ binary = (sam_mask_max_area == 0).astype(np.uint8) # sam_mask_max_area is background, background 0 foreground 1
495
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
496
+ count = 0
497
+ for i in range(1, num_labels):
498
+ temp_mask = labels == i
499
+ if np.sum(temp_mask) <= 64: # 448x448 64
500
+ binary[temp_mask] = 0 # set to background
501
+ else:
502
+ count += 1
503
+ if count != 1 and self.anomaly_flag is False: # cable cut or no cable or no connector
504
+ print('number of connected components in splicing_connectors: {}, but the default is 1.'.format(count))
505
+ self.anomaly_flag = True
506
+
507
+ merge_sam[~(binary.astype(bool))] = self.query_obj.shape[0] - 1 # remove noise
508
+ patch_merge_sam[~(binary.astype(bool))] = self.patch_query_obj.shape[0] - 1 # remove patch noise
509
+
510
+ # erode the cable and divide into left and right parts
511
+ kernel = np.ones((23, 23), dtype=np.uint8)
512
+ erode_binary = cv2.erode(binary, kernel)
513
+ h, w = erode_binary.shape
514
+ distance = 0
515
+
516
+ left, right = erode_binary[:, :int(w/2)], erode_binary[:, int(w/2):]
517
+ left_count = np.bincount(left.reshape(-1), minlength=self.classes)[1] # foreground
518
+ right_count = np.bincount(right.reshape(-1), minlength=self.classes)[1] # foreground
519
+
520
+ # binary_cable = (merge_sam == 1).astype(np.uint8)
521
+ binary_cable = (patch_merge_sam == 1).astype(np.uint8)
522
+
523
+ kernel = np.ones((5, 5), dtype=np.uint8)
524
+ binary_cable = cv2.erode(binary_cable, kernel)
525
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_cable, connectivity=8)
526
+ for i in range(1, num_labels):
527
+ temp_mask = labels == i
528
+ if np.sum(temp_mask) <= 64: # 448x448
529
+ binary_cable[temp_mask] = 0 # set to background
530
+
531
+
532
+ binary_cable = cv2.resize(binary_cable, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
533
+
534
+ binary_clamps = (patch_merge_sam == 0).astype(np.uint8)
535
+
536
+ kernel = np.ones((5, 5), dtype=np.uint8)
537
+ binary_clamps = cv2.erode(binary_clamps, kernel)
538
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_clamps, connectivity=8)
539
+ for i in range(1, num_labels):
540
+ temp_mask = labels == i
541
+ if np.sum(temp_mask) <= 64: # 448x448
542
+ binary_clamps[temp_mask] = 0 # set to background
543
+ else:
544
+ instance_mask = temp_mask.astype(np.uint8)
545
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
546
+ if instance_mask.any():
547
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
548
+
549
+ binary_clamps = cv2.resize(binary_clamps, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
550
+
551
+ binary_connector = cv2.resize(binary, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
552
+
553
+ query_cable_color = encode_obj_text(self.model_clip, self.splicing_connectors_cable_color_query_words_dict, self.tokenizer, self.device)
554
+ cable_feature = proj_patch_token[binary_cable.astype(bool).reshape(-1), :].mean(0, keepdim=True)
555
+ idx_color = (cable_feature @ query_cable_color.T).argmax(-1).squeeze(0).item()
556
+ foreground_pixel_count = np.sum(erode_binary) / self.splicing_connectors_count[idx_color]
557
+
558
+
559
+ slice_cable = binary[:, int(w/2)-1: int(w/2)+1]
560
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(slice_cable, connectivity=8)
561
+ cable_count = num_labels - 1
562
+ if cable_count != 1 and self.anomaly_flag is False: # two cables
563
+ print('cable count in splicing_connectors: {}, but the default cable count is 1.'.format(cable_count))
564
+ self.anomaly_flag = True
565
+
566
+ # {2-clamp: yellow 3-clamp: blue 5-clamp: red} cable color and clamp number mismatch
567
+ if self.few_shot_inited and self.foreground_pixel_hist_splicing_connectors != 0 and self.anomaly_flag is False:
568
+ ratio = foreground_pixel_count / self.foreground_pixel_hist_splicing_connectors
569
+ if (ratio > 1.2 or ratio < 0.8) and self.anomaly_flag is False: # color and number mismatch
570
+ print('cable color and number of clamps mismatch, cable color idx: {} (0: yellow 2-clamp, 1: blue 3-clamp, 2: red 5-clamp), foreground_pixel_count :{}, canonical foreground_pixel_hist: {}.'.format(idx_color, foreground_pixel_count, self.foreground_pixel_hist_splicing_connectors))
571
+ self.anomaly_flag = True
572
+
573
+ # left right hist for symmetry
574
+ ratio = np.sum(left_count) / (np.sum(right_count) + 1e-5)
575
+ if self.few_shot_inited and (ratio > 1.2 or ratio < 0.8) and self.anomaly_flag is False: # left right asymmetry in clamp
576
+ print('left and right connectors are not symmetric.')
577
+ self.anomaly_flag = True
578
+
579
+ # left and right centroids distance
580
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(erode_binary, connectivity=8)
581
+ if num_labels - 1 == 2:
582
+ centroids = centroids[1:]
583
+ x1, y1 = centroids[0]
584
+ x2, y2 = centroids[1]
585
+ distance = np.sqrt((x1/w - x2/w)**2 + (y1/h - y2/h)**2)
586
+ if self.few_shot_inited and self.splicing_connectors_distance != 0 and self.anomaly_flag is False:
587
+ ratio = distance / self.splicing_connectors_distance
588
+ if ratio < 0.6 or ratio > 1.4: # too short or too long centroids distance (cable) # 0.6 1.4
589
+ print('cable is too short or too long.')
590
+ self.anomaly_flag = True
591
+
592
+ # patch hist
593
+ sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0])#[:-1] # ignore background (grid) for statistic
594
+ sam_patch_hist = sam_patch_hist / np.linalg.norm(sam_patch_hist)
595
+
596
+ if self.few_shot_inited:
597
+ patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T)
598
+ score = 1 - patch_hist_similarity.max()
599
+
600
+ # todo mismatch cable link
601
+ binary_foreground = binary.astype(np.uint8) # only 1 instance, so additionally separate cable and clamps
602
+ if binary_connector.any():
603
+ instance_masks.append(binary_connector.astype(bool).reshape(-1))
604
+ if binary_clamps.any():
605
+ instance_masks.append(binary_clamps.astype(bool).reshape(-1))
606
+ if binary_cable.any():
607
+ instance_masks.append(binary_cable.astype(bool).reshape(-1))
608
+
609
+ if len(instance_masks) != 0:
610
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
611
+
612
+ if self.visualization:
613
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, binary_connector, merge_sam, patch_merge_sam, erode_binary, binary_cable, binary_clamps]
614
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'binary_connector', 'merge sam', 'patch merge sam', 'erode binary', 'binary_cable', 'binary_clamps']
615
+ plt.figure(figsize=(25, 3))
616
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
617
+ plt.subplot(1, len(image_list), ind)
618
+ plt.imshow(temp_image)
619
+ plt.title(temp_title)
620
+ plt.margins(0, 0)
621
+ plt.axis('off')
622
+ # Extract relative path from class_name onwards
623
+ if class_name in path:
624
+ relative_path = path.split(class_name, 1)[-1]
625
+ if relative_path.startswith('/'):
626
+ relative_path = relative_path[1:]
627
+ save_path = f'visualization/full_data/{class_name}/{relative_path}.png'
628
+ else:
629
+ save_path = f'visualization/full_data/{class_name}/{path}.png'
630
+
631
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
632
+ plt.tight_layout()
633
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
634
+ plt.close()
635
+
636
+ return {"score": score, "foreground_pixel_count": foreground_pixel_count, "distance": distance, "sam_patch_hist": sam_patch_hist, "instance_masks": instance_masks}
637
+
638
+ elif self.class_name == 'screw_bag':
639
+ # pixel hist of kmeans mask
640
+ foreground_pixel_count = np.sum(np.bincount(kmeans_mask.reshape(-1))[:len(self.foreground_label_idx[self.class_name])]) # foreground pixel
641
+ if self.few_shot_inited and self.foreground_pixel_hist_screw_bag != 0 and self.anomaly_flag is False:
642
+ ratio = foreground_pixel_count / self.foreground_pixel_hist_screw_bag
643
+ # todo: optimize
644
+ if ratio < 0.94 or ratio > 1.06: # 82.95 | 81.3
645
+ print('foreground pixel histogram of screw bag: {}, the canonical foreground pixel histogram of screw bag in few shot: {}'.format(foreground_pixel_count, self.foreground_pixel_hist_screw_bag))
646
+ self.anomaly_flag = True
647
+
648
+ # patch hist
649
+ binary_screw = np.isin(kmeans_mask, self.foreground_label_idx[self.class_name])
650
+ patch_mask[~binary_screw] = self.patch_query_obj.shape[0] - 1 # remove patch noise
651
+ resized_binary_screw = cv2.resize(binary_screw.astype(np.uint8), (patch_merge_sam.shape[1], patch_merge_sam.shape[0]), interpolation = cv2.INTER_NEAREST)
652
+ patch_merge_sam[~(resized_binary_screw.astype(bool))] = self.patch_query_obj.shape[0] - 1 # remove patch noise
653
+
654
+ clip_patch_hist = np.bincount(patch_mask.reshape(-1), minlength=self.patch_query_obj.shape[0])[:-1]
655
+ clip_patch_hist = clip_patch_hist / np.linalg.norm(clip_patch_hist)
656
+
657
+ if self.few_shot_inited:
658
+ patch_hist_similarity = (clip_patch_hist @ self.patch_token_hist.T)
659
+ score = 1 - patch_hist_similarity.max()
660
+
661
+ for i in range(self.patch_query_obj.shape[0]-1):
662
+ binary_foreground = (patch_merge_sam == i).astype(np.uint8)
663
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_foreground, connectivity=8)
664
+ for i in range(1, num_labels):
665
+ instance_mask = (labels == i).astype(np.uint8)
666
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
667
+ if instance_mask.any():
668
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
669
+
670
+ if len(instance_masks) != 0:
671
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
672
+
673
+ if self.visualization:
674
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
675
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
676
+ plt.figure(figsize=(20, 3))
677
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
678
+ plt.subplot(1, len(image_list), ind)
679
+ plt.imshow(temp_image)
680
+ plt.title(temp_title)
681
+ plt.margins(0, 0)
682
+ plt.axis('off')
683
+ # Extract relative path from class_name onwards
684
+ if class_name in path:
685
+ relative_path = path.split(class_name, 1)[-1]
686
+ if relative_path.startswith('/'):
687
+ relative_path = relative_path[1:]
688
+ save_path = f'visualization/full_data/{class_name}/{relative_path}.png'
689
+ else:
690
+ save_path = f'visualization/full_data/{class_name}/{path}.png'
691
+
692
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
693
+ plt.tight_layout()
694
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
695
+ plt.close()
696
+
697
+ # plt.axis('off')
698
+ # plt.imshow(patch_merge_sam)
699
+
700
+ # plt.savefig('pic/vis/{}_seg_{}.png'.format(class_name, path), bbox_inches='tight', pad_inches = 0) # pad_inches = 0
701
+ # plt.close()
702
+
703
+
704
+ return {"score": score, "foreground_pixel_count": foreground_pixel_count, "clip_patch_hist": clip_patch_hist, "instance_masks": instance_masks}
705
+
706
+ elif self.class_name == 'breakfast_box':
707
+ # patch hist
708
+ sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0])
709
+ sam_patch_hist = sam_patch_hist / np.linalg.norm(sam_patch_hist)
710
+
711
+ if self.few_shot_inited:
712
+ patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T)
713
+ score = 1 - patch_hist_similarity.max()
714
+
715
+ # todo: exist of foreground
716
+
717
+ binary_foreground = (patch_merge_sam != (self.patch_query_obj.shape[0] - 1)).astype(np.uint8)
718
+
719
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_foreground, connectivity=8)
720
+ for i in range(1, num_labels):
721
+ instance_mask = (labels == i).astype(np.uint8)
722
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
723
+ if instance_mask.any():
724
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
725
+
726
+
727
+ if len(instance_masks) != 0:
728
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
729
+
730
+ if self.visualization:
731
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
732
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
733
+ plt.figure(figsize=(20, 3))
734
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
735
+ plt.subplot(1, len(image_list), ind)
736
+ plt.imshow(temp_image)
737
+ plt.title(temp_title)
738
+ plt.margins(0, 0)
739
+ plt.axis('off')
740
+ # Extract relative path from class_name onwards
741
+ if class_name in path:
742
+ relative_path = path.split(class_name, 1)[-1]
743
+ if relative_path.startswith('/'):
744
+ relative_path = relative_path[1:]
745
+ save_path = f'visualization/full_data/{class_name}/{relative_path}.png'
746
+ else:
747
+ save_path = f'visualization/full_data/{class_name}/{path}.png'
748
+
749
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
750
+ plt.tight_layout()
751
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
752
+ plt.close()
753
+
754
+ # plt.axis('off')
755
+ # plt.imshow(patch_merge_sam)
756
+
757
+ # plt.savefig('pic/vis/{}_seg_{}.png'.format(class_name, path), bbox_inches='tight', pad_inches = 0) # pad_inches = 0
758
+ # plt.close()
759
+
760
+ return {"score": score, "sam_patch_hist": sam_patch_hist, "instance_masks": instance_masks}
761
+
762
+ elif self.class_name == 'juice_bottle':
763
+ # remove noise due to non sam mask
764
+ merge_sam[sam_mask == 0] = self.classes - 1
765
+ patch_merge_sam[sam_mask == 0] = self.patch_query_obj.shape[0] - 1 # 79.5
766
+
767
+ # [['glass'], ['liquid in bottle'], ['fruit'], ['label', 'tag'], ['black background', 'background']],
768
+ # fruit and liquid mismatch (todo if exist)
769
+ resized_patch_merge_sam = cv2.resize(patch_merge_sam, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
770
+ binary_liquid = (resized_patch_merge_sam == 1)
771
+ binary_fruit = (resized_patch_merge_sam == 2)
772
+
773
+ query_liquid = encode_obj_text(self.model_clip, self.juice_bottle_liquid_query_words_dict, self.tokenizer, self.device)
774
+ query_fruit = encode_obj_text(self.model_clip, self.juice_bottle_fruit_query_words_dict, self.tokenizer, self.device)
775
+
776
+ liquid_feature = proj_patch_token[binary_liquid.reshape(-1), :].mean(0, keepdim=True)
777
+ liquid_idx = (liquid_feature @ query_liquid.T).argmax(-1).squeeze(0).item()
778
+
779
+ fruit_feature = proj_patch_token[binary_fruit.reshape(-1), :].mean(0, keepdim=True)
780
+ fruit_idx = (fruit_feature @ query_fruit.T).argmax(-1).squeeze(0).item()
781
+
782
+ if (liquid_idx != fruit_idx) and self.anomaly_flag is False:
783
+ print('liquid: {}, but fruit: {}.'.format(self.juice_bottle_liquid_query_words_dict[liquid_idx], self.juice_bottle_fruit_query_words_dict[fruit_idx]))
784
+ self.anomaly_flag = True
785
+
786
+ # # todo centroid of fruit and tag_0 mismatch (if exist) , only one tag, center
787
+
788
+ # patch hist
789
+ sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0])
790
+ sam_patch_hist = sam_patch_hist / np.linalg.norm(sam_patch_hist)
791
+
792
+ if self.few_shot_inited:
793
+ patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T)
794
+ score = 1 - patch_hist_similarity.max()
795
+
796
+ binary_foreground = (patch_merge_sam != (self.patch_query_obj.shape[0] - 1) ).astype(np.uint8)
797
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_foreground, connectivity=8)
798
+ for i in range(1, num_labels):
799
+ instance_mask = (labels == i).astype(np.uint8)
800
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
801
+ if instance_mask.any():
802
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
803
+
804
+ if len(instance_masks) != 0:
805
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
806
+
807
+ if self.visualization:
808
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
809
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
810
+ plt.figure(figsize=(20, 3))
811
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
812
+ plt.subplot(1, len(image_list), ind)
813
+ plt.imshow(temp_image)
814
+ plt.title(temp_title)
815
+ plt.margins(0, 0)
816
+ plt.axis('off')
817
+ # Extract relative path from class_name onwards
818
+ if class_name in path:
819
+ relative_path = path.split(class_name, 1)[-1]
820
+ if relative_path.startswith('/'):
821
+ relative_path = relative_path[1:]
822
+ save_path = f'visualization/full_data/{class_name}/{relative_path}.png'
823
+ else:
824
+ save_path = f'visualization/full_data/{class_name}/{path}.png'
825
+
826
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
827
+ plt.tight_layout()
828
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
829
+ plt.close()
830
+
831
+ return {"score": score, "sam_patch_hist": sam_patch_hist, "instance_masks": instance_masks}
832
+
833
+ return {"score": score, "instance_masks": instance_masks}
834
+
835
+
836
+ def process_k_shot(self, class_name, few_shot_samples, few_shot_paths):
837
+ few_shot_samples = F.interpolate(few_shot_samples, size=(448, 448), mode=self.inter_mode, align_corners=self.align_corners, antialias=self.antialias)
838
+
839
+ with torch.no_grad():
840
+ image_features, patch_tokens, proj_patch_tokens = self.model_clip.encode_image(few_shot_samples, self.feature_list)
841
+ patch_tokens = [p[:, 1:, :] for p in patch_tokens]
842
+ patch_tokens = [p.reshape(p.shape[0]*p.shape[1], p.shape[2]) for p in patch_tokens]
843
+
844
+ patch_tokens_clip = torch.cat(patch_tokens, dim=-1) # (bs, 1024, 1024x4)
845
+ # patch_tokens_clip = torch.cat(patch_tokens[2:], dim=-1) # (bs, 1024, 1024x2)
846
+ patch_tokens_clip = patch_tokens_clip.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
847
+ patch_tokens_clip = F.interpolate(patch_tokens_clip, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
848
+ patch_tokens_clip = patch_tokens_clip.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.feature_list))
849
+ patch_tokens_clip = F.normalize(patch_tokens_clip, p=2, dim=-1) # (bsx64x64, 1024x4)
850
+
851
+ with torch.no_grad():
852
+ patch_tokens_dinov2 = self.model_dinov2.forward_features(few_shot_samples, out_layer_list=self.feature_list_dinov2) # 4 x [bs, 32x32, 1024]
853
+ patch_tokens_dinov2 = torch.cat(patch_tokens_dinov2, dim=-1) # (bs, 1024, 1024x4)
854
+ patch_tokens_dinov2 = patch_tokens_dinov2.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
855
+ patch_tokens_dinov2 = F.interpolate(patch_tokens_dinov2, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
856
+ patch_tokens_dinov2 = patch_tokens_dinov2.permute(0, 2, 3, 1).view(-1, self.vision_width_dinov2 * len(self.feature_list_dinov2))
857
+ patch_tokens_dinov2 = F.normalize(patch_tokens_dinov2, p=2, dim=-1) # (bsx64x64, 1024x4)
858
+
859
+
860
+ cluster_features = None
861
+ for layer in self.cluster_feature_id:
862
+ temp_feat = patch_tokens[layer]
863
+ cluster_features = temp_feat if cluster_features is None else torch.cat((cluster_features, temp_feat), 1)
864
+ if self.feat_size != self.ori_feat_size:
865
+ cluster_features = cluster_features.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
866
+ cluster_features = F.interpolate(cluster_features, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
867
+ cluster_features = cluster_features.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.cluster_feature_id))
868
+ cluster_features = F.normalize(cluster_features, p=2, dim=-1)
869
+
870
+ if self.feat_size != self.ori_feat_size:
871
+ proj_patch_tokens = proj_patch_tokens.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
872
+ proj_patch_tokens = F.interpolate(proj_patch_tokens, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
873
+ proj_patch_tokens = proj_patch_tokens.permute(0, 2, 3, 1).view(-1, self.embed_dim)
874
+ proj_patch_tokens = F.normalize(proj_patch_tokens, p=2, dim=-1)
875
+
876
+ if not self.cluster_init:
877
+ num_clusters = self.cluster_num_dict[class_name]
878
+ _, self.cluster_centers = kmeans(X=cluster_features, num_clusters=num_clusters, device=self.device)
879
+
880
+ self.query_obj = encode_obj_text(self.model_clip, self.query_words_dict[class_name], self.tokenizer, self.device)
881
+ self.patch_query_obj = encode_obj_text(self.model_clip, self.patch_query_words_dict[class_name], self.tokenizer, self.device)
882
+ self.classes = self.query_obj.shape[0]
883
+
884
+ self.cluster_init = True
885
+
886
+ scores = []
887
+ foreground_pixel_hist = []
888
+ splicing_connectors_distance = []
889
+ patch_token_hist = []
890
+ mem_instance_masks = []
891
+
892
+ for image, cluster_feature, proj_patch_token, few_shot_path in zip(few_shot_samples.chunk(self.k_shot), cluster_features.chunk(self.k_shot), proj_patch_tokens.chunk(self.k_shot), few_shot_paths):
893
+ # path = os.path.dirname(few_shot_path).split('/')[-1] + "_" + os.path.basename(few_shot_path).split('.')[0]
894
+ self.anomaly_flag = False
895
+ results = self.histogram(image, cluster_feature, proj_patch_token, class_name, "few_shot_" + os.path.basename(few_shot_path).split('.')[0])
896
+ if self.class_name == 'pushpins':
897
+ patch_token_hist.append(results["clip_patch_hist"])
898
+ mem_instance_masks.append(results['instance_masks'])
899
+
900
+ elif self.class_name == 'splicing_connectors':
901
+ foreground_pixel_hist.append(results["foreground_pixel_count"])
902
+ splicing_connectors_distance.append(results["distance"])
903
+ patch_token_hist.append(results["sam_patch_hist"])
904
+ mem_instance_masks.append(results['instance_masks'])
905
+
906
+ elif self.class_name == 'screw_bag':
907
+ foreground_pixel_hist.append(results["foreground_pixel_count"])
908
+ patch_token_hist.append(results["clip_patch_hist"])
909
+ mem_instance_masks.append(results['instance_masks'])
910
+
911
+ elif self.class_name == 'breakfast_box':
912
+ patch_token_hist.append(results["sam_patch_hist"])
913
+ mem_instance_masks.append(results['instance_masks'])
914
+
915
+ elif self.class_name == 'juice_bottle':
916
+ patch_token_hist.append(results["sam_patch_hist"])
917
+ mem_instance_masks.append(results['instance_masks'])
918
+
919
+ scores.append(results["score"])
920
+
921
+ if len(foreground_pixel_hist) != 0:
922
+ self.foreground_pixel_hist = np.mean(foreground_pixel_hist)
923
+ if len(splicing_connectors_distance) != 0:
924
+ self.splicing_connectors_distance = np.mean(splicing_connectors_distance)
925
+ if len(patch_token_hist) != 0: # patch hist
926
+ self.patch_token_hist = np.stack(patch_token_hist)
927
+ if len(mem_instance_masks) != 0:
928
+ self.mem_instance_masks = mem_instance_masks
929
+
930
+ # for interests matching
931
+ len_feature_list = len(self.feature_list)
932
+ for idx, batch_mem_patch_feature in enumerate(patch_tokens_clip.chunk(len_feature_list, dim=-1)): # 4 stages batch_mem_patch_feature (bsx64x64, 1024)
933
+ mem_instance_features = []
934
+ for mem_patch_feature, mem_instance_masks in zip(batch_mem_patch_feature.chunk(self.k_shot), self.mem_instance_masks): # k shot mem_patch_feature (64x64, 1024)
935
+ mem_instance_features.extend([mem_patch_feature[mask, :].mean(0, keepdim=True) for mask in mem_instance_masks])
936
+ mem_instance_features = torch.cat(mem_instance_features, dim=0)
937
+ mem_instance_features = F.normalize(mem_instance_features, dim=-1) # 4 stages
938
+ # mem_instance_features_multi_stage.append(mem_instance_features)
939
+ self.mem_instance_features_multi_stage[idx].append(mem_instance_features)
940
+
941
+
942
+ mem_patch_feature_clip_coreset = patch_tokens_clip
943
+ mem_patch_feature_dinov2_coreset = patch_tokens_dinov2
944
+
945
+ return scores, mem_patch_feature_clip_coreset, mem_patch_feature_dinov2_coreset
946
+
947
+ def process(self, class_name: str, few_shot_samples: list[torch.Tensor], few_shot_paths: list[str]):
948
+ few_shot_samples = self.transform(few_shot_samples).to(self.device)
949
+
950
+ scores, mem_patch_feature_clip_coreset, mem_patch_feature_dinov2_coreset = self.process_k_shot(class_name, few_shot_samples, few_shot_paths)
951
+
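+ # Greedy k-center (coreset) subsampling keeps 25% of the few-shot patch features as the
+ # CLIP / DINOv2 memory banks used by the PatchCore-style structural score.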
952
+ clip_sampler = KCenterGreedy(embedding=mem_patch_feature_clip_coreset, sampling_ratio=0.25)
953
+ mem_patch_feature_clip_coreset = clip_sampler.sample_coreset()
954
+
955
+ dinov2_sampler = KCenterGreedy(embedding=mem_patch_feature_dinov2_coreset, sampling_ratio=0.25)
956
+ mem_patch_feature_dinov2_coreset = dinov2_sampler.sample_coreset()
957
+
958
+ self.mem_patch_feature_clip_coreset.append(mem_patch_feature_clip_coreset)
959
+ self.mem_patch_feature_dinov2_coreset.append(mem_patch_feature_dinov2_coreset)
960
+
961
+
962
+ def setup(self, data: dict) -> None:
963
+ """Setup the few-shot samples for the model.
964
+
965
+ The evaluation script will call this method to pass the k images for few shot learning and the object class
966
+ name. In the case of MVTec LOCO this will be the dataset category name (e.g. breakfast_box). Please contact
967
+ the organizing committee if your model requires any additional dataset-related information at setup-time.
968
+ """
969
+ few_shot_samples = data.get("few_shot_samples")
970
+ class_name = data.get("dataset_category")
971
+ few_shot_paths = data.get("few_shot_samples_path")
972
+ self.class_name = class_name
973
+
974
+ print(few_shot_samples.shape)
975
+
976
+ self.total_size = few_shot_samples.size(0)
977
+
978
+ self.k_shot = 4 if self.total_size > 4 else self.total_size
979
+
980
+ self.cluster_init = False
981
+ self.mem_instance_features_multi_stage = [[],[],[],[]]
982
+
983
+ self.mem_patch_feature_clip_coreset = []
984
+ self.mem_patch_feature_dinov2_coreset = []
985
+
986
+ # Check if coreset files already exist
987
+ clip_file = 'memory_bank/mem_patch_feature_clip_{}.pt'.format(self.class_name)
988
+ dinov2_file = 'memory_bank/mem_patch_feature_dinov2_{}.pt'.format(self.class_name)
989
+ instance_file = 'memory_bank/mem_instance_features_multi_stage_{}.pt'.format(self.class_name)
990
+
991
+ files_exist = os.path.exists(clip_file) and os.path.exists(dinov2_file) and os.path.exists(instance_file)
992
+
993
+ if self.save_coreset_features and not files_exist:
994
+ print(f"Coreset files not found for {self.class_name}, computing and saving...")
995
+ for i in range(self.total_size//self.k_shot):
996
+ self.process(class_name, few_shot_samples[self.k_shot*i : min(self.k_shot*(i+1), self.total_size)], few_shot_paths[self.k_shot*i : min(self.k_shot*(i+1), self.total_size)])
997
+
998
+ # Coreset Subsampling
999
+ self.mem_patch_feature_clip_coreset = torch.cat(self.mem_patch_feature_clip_coreset, dim=0)
1000
+ torch.save(self.mem_patch_feature_clip_coreset, clip_file)
1001
+
1002
+ self.mem_patch_feature_dinov2_coreset = torch.cat(self.mem_patch_feature_dinov2_coreset, dim=0)
1003
+ torch.save(self.mem_patch_feature_dinov2_coreset, dinov2_file)
1004
+
1005
+ print(self.mem_patch_feature_dinov2_coreset.shape, self.mem_patch_feature_clip_coreset.shape)
1006
+
1007
+ self.mem_instance_features_multi_stage = [ torch.cat(mem_instance_features, dim=0) for mem_instance_features in self.mem_instance_features_multi_stage ]
1008
+ self.mem_instance_features_multi_stage = torch.cat(self.mem_instance_features_multi_stage, dim=1)
1009
+ torch.save(self.mem_instance_features_multi_stage, instance_file)
1010
+
1011
+ print(self.mem_instance_features_multi_stage.shape)
1012
+
1013
+ elif self.save_coreset_features and files_exist:
1014
+ print(f"Coreset files found for {self.class_name}, loading existing files...")
1015
+ self.process(class_name, few_shot_samples[0 : self.k_shot], few_shot_paths[0 : self.k_shot])
1016
+
1017
+ self.mem_patch_feature_clip_coreset = torch.load(clip_file)
1018
+ self.mem_patch_feature_dinov2_coreset = torch.load(dinov2_file)
1019
+ self.mem_instance_features_multi_stage = torch.load(instance_file)
1020
+
1021
+ print(self.mem_patch_feature_dinov2_coreset.shape, self.mem_patch_feature_clip_coreset.shape)
1022
+ print(self.mem_instance_features_multi_stage.shape)
1023
+
1024
+ else:
1025
+ self.process(class_name, few_shot_samples[0 : self.k_shot], few_shot_paths[0 : self.k_shot])
1026
+
1027
+ self.mem_patch_feature_clip_coreset = torch.load(clip_file)
1028
+ self.mem_patch_feature_dinov2_coreset = torch.load(dinov2_file)
1029
+ self.mem_instance_features_multi_stage = torch.load(instance_file)
1030
+
1031
+
1032
+ self.few_shot_inited = True
1033
+
1034
+
model_ensemble_few_shot.py ADDED
@@ -0,0 +1,935 @@
1
+ import os
2
+
3
+ # Set cache directories to use checkpoint folder for model downloads
4
+ os.environ['TORCH_HOME'] = './checkpoint'
5
+ os.environ['HF_HOME'] = './checkpoint/huggingface'
6
+ os.environ['TRANSFORMERS_CACHE'] = './checkpoint/huggingface/transformers'
7
+ os.environ['HF_HUB_CACHE'] = './checkpoint/huggingface/hub'
8
+
9
+ # Create checkpoint subdirectories if they don't exist
10
+ os.makedirs('./checkpoint/huggingface/transformers', exist_ok=True)
11
+ os.makedirs('./checkpoint/huggingface/hub', exist_ok=True)
12
+
13
+ import torch
14
+ from torch import nn
15
+ from torchvision.transforms import v2
16
+ from torchvision.transforms.v2.functional import resize
17
+ import cv2
18
+ import json
19
+ import torch
20
+ import random
21
+ import logging
22
+ import argparse
23
+ import numpy as np
24
+ from PIL import Image
25
+ from skimage import measure
26
+ from tabulate import tabulate
27
+ from torchvision.ops.focal_loss import sigmoid_focal_loss
28
+ import torch.nn.functional as F
29
+ import torchvision.transforms as transforms
30
+ import torchvision.transforms.functional as TF
31
+ from sklearn.metrics import auc, roc_auc_score, average_precision_score, f1_score, precision_recall_curve, pairwise
32
+ from sklearn.mixture import GaussianMixture
33
+ import faiss
34
+ import open_clip_local as open_clip
35
+
36
+ from torch.utils.data.dataset import ConcatDataset
37
+ from scipy.optimize import linear_sum_assignment
38
+ from sklearn.random_projection import SparseRandomProjection
39
+ import cv2
40
+ from torchvision.transforms import InterpolationMode
41
+ from PIL import Image
42
+ import string
43
+
44
+ from prompt_ensemble import encode_text_with_prompt_ensemble, encode_normal_text, encode_abnormal_text, encode_general_text, encode_obj_text
45
+ from kmeans_pytorch import kmeans, kmeans_predict
46
+ from scipy.optimize import linear_sum_assignment
47
+ from scipy.stats import norm
48
+
49
+ from segment_anything import sam_model_registry, SamAutomaticMaskGenerator, SamPredictor
50
+ from matplotlib import pyplot as plt
51
+
52
+ import matplotlib
53
+ matplotlib.use('Agg')
54
+
55
+ import pickle
56
+ from scipy.stats import norm
57
+
58
+ from open_clip_local.pos_embed import get_2d_sincos_pos_embed
59
+
60
+ def to_np_img(m):
61
+ m = m.permute(1, 2, 0).cpu().numpy()
62
+ mean = np.array([[[0.48145466, 0.4578275, 0.40821073]]])
63
+ std = np.array([[[0.26862954, 0.26130258, 0.27577711]]])
64
+ m = m * std + mean
65
+ return np.clip((m * 255.), 0, 255).astype(np.uint8)
66
+
67
+
68
+ def setup_seed(seed):
69
+ torch.manual_seed(seed)
70
+ torch.cuda.manual_seed_all(seed)
71
+ np.random.seed(seed)
72
+ random.seed(seed)
73
+ torch.backends.cudnn.deterministic = True
74
+ torch.backends.cudnn.benchmark = False
75
+
76
+
77
+ class MyModel(nn.Module):
78
+ """Example model class for track 2.
79
+
80
+ This class applies few-shot anomaly detection with an ensemble of CLIP, DINOv2 and SAM features.
81
+ """
82
+
83
+ def __init__(self) -> None:
84
+ super().__init__()
85
+
86
+ setup_seed(42)
87
+ # NOTE: Create your transformation pipeline (if needed).
88
+ self.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
89
+ self.transform = v2.Compose(
90
+ [
91
+ v2.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711)),
92
+ ],
93
+ )
94
+
95
+ # NOTE: Create your model.
96
+
97
+ self.model_clip, _, _ = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
98
+ self.tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K')
99
+ self.feature_list = [6, 12, 18, 24]
100
+ self.embed_dim = 768
101
+ self.vision_width = 1024
102
+
103
+ self.model_sam = sam_model_registry["vit_h"](checkpoint = "./checkpoint/sam_vit_h_4b8939.pth").to(self.device)
104
+ self.mask_generator = SamAutomaticMaskGenerator(model = self.model_sam)
105
+
106
+ self.memory_size = 2048
107
+ self.n_neighbors = 2
108
+
109
+ self.model_clip.eval()
110
+ self.test_args = None
111
+ self.align_corners = True # False
112
+ self.antialias = True # False
113
+ self.inter_mode = 'bilinear' # bilinear/bicubic
114
+
115
+ self.cluster_feature_id = [0, 1]
116
+
117
+ self.cluster_num_dict = {
118
+ "breakfast_box": 3, # unused
119
+ "juice_bottle": 8, # unused
120
+ "splicing_connectors": 10, # unused
121
+ "pushpins": 10,
122
+ "screw_bag": 10,
123
+ }
124
+ self.query_words_dict = {
125
+ "breakfast_box": ['orange', "nectarine", "cereals", "banana chips", 'almonds', 'white box', 'black background'],
126
+ "juice_bottle": ['bottle', ['black background', 'background']],
127
+ "pushpins": [['pushpin', 'pin'], ['plastic box', 'black background']],
128
+ "screw_bag": [['screw'], 'plastic bag', 'background'],
129
+ "splicing_connectors": [['splicing connector', 'splice connector',], ['cable', 'wire'], ['grid']],
130
+ }
131
+ self.foreground_label_idx = { # for query_words_dict
132
+ "breakfast_box": [0, 1, 2, 3, 4, 5],
133
+ "juice_bottle": [0],
134
+ "pushpins": [0],
135
+ "screw_bag": [0],
136
+ "splicing_connectors":[0, 1]
137
+ }
138
+
139
+ self.patch_query_words_dict = {
140
+ "breakfast_box": ['orange', "nectarine", "cereals", "banana chips", 'almonds', 'white box', 'black background'],
141
+ "juice_bottle": [['glass'], ['liquid in bottle'], ['fruit'], ['label', 'tag'], ['black background', 'background']],
142
+ "pushpins": [['pushpin', 'pin'], ['plastic box', 'black background']],
143
+ "screw_bag": [['hex screw', 'hexagon bolt'], ['hex nut', 'hexagon nut'], ['ring washer', 'ring gasket'], ['plastic bag', 'background']],
144
+ "splicing_connectors": [['splicing connector', 'splice connector',], ['cable', 'wire'], ['grid']],
145
+ }
146
+
147
+
148
+ self.query_threshold_dict = {
149
+ "breakfast_box": [0., 0., 0., 0., 0., 0., 0.], # unused
150
+ "juice_bottle": [0., 0., 0.], # unused
151
+ "splicing_connectors": [0.15, 0.15, 0.15, 0., 0.], # unused
152
+ "pushpins": [0.2, 0., 0., 0.],
153
+ "screw_bag": [0., 0., 0.,],
154
+ }
155
+
156
+ self.feat_size = 64
157
+ self.ori_feat_size = 32
158
+
159
+ self.visualization = False
160
+
161
+ self.pushpins_count = 15
162
+
163
+ self.splicing_connectors_count = [2, 3, 5] # corresponding to yellow, blue, and red
164
+ self.splicing_connectors_distance = 0
165
+ self.splicing_connectors_cable_color_query_words_dict = [['yellow cable', 'yellow wire'], ['blue cable', 'blue wire'], ['red cable', 'red wire']]
166
+
167
+ self.juice_bottle_liquid_query_words_dict = [['red liquid', 'cherry juice'], ['yellow liquid', 'orange juice'], ['milky liquid']]
168
+ self.juice_bottle_fruit_query_words_dict = ['cherry', ['tangerine', 'orange'], 'banana']
169
+
170
+ # query words
171
+ self.foreground_pixel_hist = 0
172
+ # patch query words
173
+ self.patch_token_hist = []
174
+
175
+ self.few_shot_inited = False
176
+
177
+
178
+ from dinov2.dinov2.hub.backbones import dinov2_vitl14
179
+ self.model_dinov2 = dinov2_vitl14()
180
+ self.model_dinov2.to(self.device)
181
+ self.model_dinov2.eval()
182
+ self.feature_list_dinov2 = [6, 12, 18, 24]
183
+ self.vision_width_dinov2 = 1024
184
+
185
+ self.stats = pickle.load(open("memory_bank/statistic_scores_model_ensemble_few_shot_val.pkl", "rb"))
186
+
187
+ self.mem_instance_masks = None
188
+
189
+ self.anomaly_flag = False
190
+ self.validation = False
191
+
192
+ def set_viz(self, viz):
193
+ self.visualization = viz
194
+
195
+ def set_val(self, val):
196
+ self.validation = val
197
+
198
+ def forward(self, batch: torch.Tensor, batch_path: list) -> dict[str, torch.Tensor]:
199
+ """Transform the input batch and pass it through the model.
200
+
201
+ This model returns a dictionary with the following keys
202
+ - ``anomaly_map`` - Anomaly map.
203
+ - ``pred_score`` - Predicted anomaly score.
204
+ """
205
+ self.anomaly_flag = False
206
+ batch = self.transform(batch).to(self.device)
207
+ results = self.forward_one_sample(batch, self.mem_patch_feature_clip_coreset, self.mem_patch_feature_dinov2_coreset, batch_path[0])
208
+
209
+ hist_score = results['hist_score']
210
+ structural_score = results['structural_score']
211
+ instance_hungarian_match_score = results['instance_hungarian_match_score']
212
+
213
+ anomaly_map_structural = results['anomaly_map_structural']
214
+
215
+ if self.validation:
216
+ return {"hist_score": torch.tensor(hist_score), "structural_score": torch.tensor(structural_score), "instance_hungarian_match_score": torch.tensor(instance_hungarian_match_score)}
217
+
218
+ def sigmoid(z):
219
+ return 1/(1 + np.exp(-z))
220
+
221
+ # standardization
222
+ standard_structural_score = (structural_score - self.stats[self.class_name]["structural_scores"]["mean"]) / self.stats[self.class_name]["structural_scores"]["unbiased_std"]
223
+ standard_instance_hungarian_match_score = (instance_hungarian_match_score - self.stats[self.class_name]["instance_hungarian_match_scores"]["mean"]) / self.stats[self.class_name]["instance_hungarian_match_scores"]["unbiased_std"]
224
+
225
+ pred_score = max(standard_instance_hungarian_match_score, standard_structural_score)
226
+ pred_score = sigmoid(pred_score)
227
+
228
+ if self.anomaly_flag:
229
+ pred_score = 1.
230
+ self.anomaly_flag = False
231
+
232
+ return {"pred_score": torch.tensor(pred_score), "anomaly_map": torch.tensor(anomaly_map_structural), "hist_score": torch.tensor(hist_score), "structural_score": torch.tensor(structural_score), "instance_hungarian_match_score": torch.tensor(instance_hungarian_match_score)}
233
+
234
+
235
+ def forward_one_sample(self, batch: torch.Tensor, mem_patch_feature_clip_coreset: torch.Tensor, mem_patch_feature_dinov2_coreset: torch.Tensor, path: str):
236
+
237
+ with torch.no_grad():
238
+ image_features, patch_tokens, proj_patch_tokens = self.model_clip.encode_image(batch, self.feature_list)
239
+ # image_features /= image_features.norm(dim=-1, keepdim=True)
240
+ patch_tokens = [p[:, 1:, :] for p in patch_tokens]
241
+ patch_tokens = [p.reshape(p.shape[0]*p.shape[1], p.shape[2]) for p in patch_tokens]
242
+
243
+ patch_tokens_clip = torch.cat(patch_tokens, dim=-1) # (1, 1024, 1024x4)
244
+ # patch_tokens_clip = torch.cat(patch_tokens[2:], dim=-1) # (1, 1024, 1024x2)
245
+ patch_tokens_clip = patch_tokens_clip.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
246
+ patch_tokens_clip = F.interpolate(patch_tokens_clip, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
247
+ patch_tokens_clip = patch_tokens_clip.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.feature_list))
248
+ patch_tokens_clip = F.normalize(patch_tokens_clip, p=2, dim=-1) # (1x64x64, 1024x4)
249
+
250
+ with torch.no_grad():
251
+ patch_tokens_dinov2 = self.model_dinov2.forward_features(batch, out_layer_list=self.feature_list)
252
+ patch_tokens_dinov2 = torch.cat(patch_tokens_dinov2, dim=-1) # (1, 1024, 1024x4)
253
+ patch_tokens_dinov2 = patch_tokens_dinov2.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
254
+ patch_tokens_dinov2 = F.interpolate(patch_tokens_dinov2, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
255
+ patch_tokens_dinov2 = patch_tokens_dinov2.permute(0, 2, 3, 1).view(-1, self.vision_width_dinov2 * len(self.feature_list_dinov2))
256
+ patch_tokens_dinov2 = F.normalize(patch_tokens_dinov2, p=2, dim=-1) # (1x64x64, 1024x4)
257
+
258
+ '''adding for kmeans seg '''
259
+ if self.feat_size != self.ori_feat_size:
260
+ proj_patch_tokens = proj_patch_tokens.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
261
+ proj_patch_tokens = F.interpolate(proj_patch_tokens, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
262
+ proj_patch_tokens = proj_patch_tokens.permute(0, 2, 3, 1).view(self.feat_size * self.feat_size, self.embed_dim)
263
+ proj_patch_tokens = F.normalize(proj_patch_tokens, p=2, dim=-1)
264
+
265
+ mid_features = None
266
+ for layer in self.cluster_feature_id:
267
+ temp_feat = patch_tokens[layer]
268
+ mid_features = temp_feat if mid_features is None else torch.cat((mid_features, temp_feat), -1)
269
+
270
+ if self.feat_size != self.ori_feat_size:
271
+ mid_features = mid_features.view(1, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
272
+ mid_features = F.interpolate(mid_features, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
273
+ mid_features = mid_features.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.cluster_feature_id))
274
+ mid_features = F.normalize(mid_features, p=2, dim=-1)
275
+
276
+ results = self.histogram(batch, mid_features, proj_patch_tokens, self.class_name, os.path.dirname(path).split('/')[-1] + "_" + os.path.basename(path).split('.')[0])
277
+
278
+ hist_score = results['score']
279
+
280
+ '''calculate patchcore'''
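+ # PatchCore-style structural branch: per feature level, every patch embedding is compared
+ # (cosine similarity) against the few-shot memory bank; 1 - max similarity is the per-patch
+ # anomaly, and the level-averaged map yields both the anomaly map and its max as the score.
+ # CLIP features serve pushpins / screw_bag, DINOv2 features the remaining three categories.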
281
+ anomaly_maps_patchcore = []
282
+
283
+ if self.class_name in ['pushpins', 'screw_bag']: # clip feature for patchcore
284
+ len_feature_list = len(self.feature_list)
285
+ for patch_feature, mem_patch_feature in zip(patch_tokens_clip.chunk(len_feature_list, dim=-1), mem_patch_feature_clip_coreset.chunk(len_feature_list, dim=-1)):
286
+ patch_feature = F.normalize(patch_feature, dim=-1)
287
+ mem_patch_feature = F.normalize(mem_patch_feature, dim=-1)
288
+ normal_map_patchcore = (patch_feature @ mem_patch_feature.T)
289
+ normal_map_patchcore = (normal_map_patchcore.max(1)[0]).cpu().numpy() # 1: normal 0: abnormal
290
+ anomaly_map_patchcore = 1 - normal_map_patchcore
291
+
292
+ anomaly_maps_patchcore.append(anomaly_map_patchcore)
293
+
294
+ if self.class_name in ['splicing_connectors', 'breakfast_box', 'juice_bottle']: # dinov2 feature for patchcore
295
+ len_feature_list = len(self.feature_list_dinov2)
296
+ for patch_feature, mem_patch_feature in zip(patch_tokens_dinov2.chunk(len_feature_list, dim=-1), mem_patch_feature_dinov2_coreset.chunk(len_feature_list, dim=-1)):
297
+ patch_feature = F.normalize(patch_feature, dim=-1)
298
+ mem_patch_feature = F.normalize(mem_patch_feature, dim=-1)
299
+ normal_map_patchcore = (patch_feature @ mem_patch_feature.T)
300
+ normal_map_patchcore = (normal_map_patchcore.max(1)[0]).cpu().numpy() # 1: normal 0: abnormal
301
+ anomaly_map_patchcore = 1 - normal_map_patchcore
302
+
303
+ anomaly_maps_patchcore.append(anomaly_map_patchcore)
304
+
305
+ structural_score = np.stack(anomaly_maps_patchcore).mean(0).max()
306
+ anomaly_map_structural = np.stack(anomaly_maps_patchcore).mean(0).reshape(self.feat_size, self.feat_size)
307
+
308
+ instance_masks = results["instance_masks"]
309
+ anomaly_instances_hungarian = []
310
+ instance_hungarian_match_score = 1.
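+ # Instance-matching branch: average-pool CLIP patch features inside every instance mask,
+ # build a cosine-distance cost matrix against the instances of the k reference images, and
+ # solve a Hungarian assignment; the mean matched cost over feature levels is the score.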
311
+ if self.mem_instance_masks is not None and len(instance_masks) != 0:
312
+ for patch_feature, batch_mem_patch_feature in zip(patch_tokens_clip.chunk(len_feature_list, dim=-1), mem_patch_feature_clip_coreset.chunk(len_feature_list, dim=-1)):
313
+ instance_features = [patch_feature[mask, :].mean(0, keepdim=True) for mask in instance_masks]
314
+ instance_features = torch.cat(instance_features, dim=0)
315
+ instance_features = F.normalize(instance_features, dim=-1)
316
+ mem_instance_features = []
317
+ for mem_patch_feature, mem_instance_masks in zip(batch_mem_patch_feature.chunk(self.k_shot), self.mem_instance_masks):
318
+ mem_instance_features.extend([mem_patch_feature[mask, :].mean(0, keepdim=True) for mask in mem_instance_masks])
319
+ mem_instance_features = torch.cat(mem_instance_features, dim=0)
320
+ mem_instance_features = F.normalize(mem_instance_features, dim=-1)
321
+
322
+ normal_instance_hungarian = (instance_features @ mem_instance_features.T)
323
+ cost_matrix = (1 - normal_instance_hungarian).cpu().numpy()
324
+
325
+ row_ind, col_ind = linear_sum_assignment(cost_matrix)
326
+ cost = cost_matrix[row_ind, col_ind].sum()
327
+ cost = cost / min(cost_matrix.shape)
328
+ anomaly_instances_hungarian.append(cost)
329
+
330
+ instance_hungarian_match_score = np.mean(anomaly_instances_hungarian)
331
+
332
+ results = {'hist_score': hist_score, 'structural_score': structural_score, 'instance_hungarian_match_score': instance_hungarian_match_score, "anomaly_map_structural": anomaly_map_structural}
333
+
334
+ return results
335
+
336
+
337
+ def histogram(self, image, cluster_feature, proj_patch_token, class_name, path):
338
+ def plot_results_only(sorted_anns):
339
+ cur = 1
340
+ img_color = np.zeros((sorted_anns[0]['segmentation'].shape[0], sorted_anns[0]['segmentation'].shape[1]))
341
+ for ann in sorted_anns:
342
+ m = ann['segmentation']
343
+ img_color[m] = cur
344
+ cur += 1
345
+ return img_color
346
+
347
+ def merge_segmentations(a, b, background_class):
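+ # Relabel every SAM segment in `a` with the majority semantic label it overlaps in `b`;
+ # segments with no overlap fall back to the background class.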
348
+ unique_labels_a = np.unique(a)
349
+ unique_labels_b = np.unique(b)
350
+
351
+ max_label_a = unique_labels_a.max()
352
+ label_map = np.zeros(max_label_a + 1, dtype=int)
353
+
354
+ for label_a in unique_labels_a:
355
+ mask_a = (a == label_a)
356
+
357
+ labels_b = b[mask_a]
358
+ if labels_b.size > 0:
359
+ count_b = np.bincount(labels_b, minlength=unique_labels_b.max() + 1)
360
+ label_map[label_a] = np.argmax(count_b)
361
+ else:
362
+ label_map[label_a] = background_class # default background
363
+
364
+ merged_a = label_map[a]
365
+ return merged_a
366
+
367
+ pseudo_labels = kmeans_predict(cluster_feature, self.cluster_centers, 'euclidean', device=self.device)
368
+ kmeans_mask = torch.ones_like(pseudo_labels) * (self.classes - 1) # default to background
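+ # Each patch is assigned to its nearest few-shot k-means center; every sufficiently large
+ # connected region is then named by its most similar text query (query_obj), but only when
+ # the similarity clears the per-class threshold, otherwise it stays background.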
369
+
370
+ for pl in pseudo_labels.unique():
371
+ mask = (pseudo_labels == pl).reshape(-1)
372
+ # filter small region
373
+ binary = mask.cpu().numpy().reshape(self.feat_size, self.feat_size).astype(np.uint8)
374
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
375
+ for i in range(1, num_labels):
376
+ temp_mask = labels == i
377
+ if np.sum(temp_mask) <= 8:
378
+ mask[temp_mask.reshape(-1)] = False
379
+
380
+ if mask.any():
381
+ region_feature = proj_patch_token[mask, :].mean(0, keepdim=True)
382
+ similarity = (region_feature @ self.query_obj.T)
383
+ prob, index = torch.max(similarity, dim=-1)
384
+ temp_label = index.squeeze(0).item()
385
+ temp_prob = prob.squeeze(0).item()
386
+ if temp_prob > self.query_threshold_dict[class_name][temp_label]: # threshold for each class
387
+ kmeans_mask[mask] = temp_label
388
+
389
+
390
+ raw_image = to_np_img(image[0])
391
+ height, width = raw_image.shape[:2]
392
+ masks = self.mask_generator.generate(raw_image)
393
+ # self.predictor.set_image(raw_image)
394
+
395
+ kmeans_label = pseudo_labels.view(self.feat_size, self.feat_size).cpu().numpy()
396
+ kmeans_mask = kmeans_mask.view(self.feat_size, self.feat_size).cpu().numpy()
397
+
398
+ patch_similarity = (proj_patch_token @ self.patch_query_obj.T)
399
+ patch_mask = patch_similarity.argmax(-1)
400
+ patch_mask = patch_mask.view(self.feat_size, self.feat_size).cpu().numpy()
401
+
402
+ sorted_masks = sorted(masks, key=(lambda x: x['area']), reverse=True)
403
+ sam_mask = plot_results_only(sorted_masks).astype(int) # np.int was removed in NumPy 1.24
404
+
405
+ resized_mask = cv2.resize(kmeans_mask, (width, height), interpolation = cv2.INTER_NEAREST)
406
+ merge_sam = merge_segmentations(sam_mask, resized_mask, background_class=self.classes-1)
407
+
408
+ resized_patch_mask = cv2.resize(patch_mask, (width, height), interpolation = cv2.INTER_NEAREST)
409
+ patch_merge_sam = merge_segmentations(sam_mask, resized_patch_mask, background_class=self.patch_query_obj.shape[0]-1)
410
+
411
+ # filter small region for merge sam
412
+ binary = np.isin(merge_sam, self.foreground_label_idx[self.class_name]).astype(np.uint8) # foreground 1 background 0
413
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
414
+ for i in range(1, num_labels):
415
+ temp_mask = labels == i
416
+ if np.sum(temp_mask) <= 32: # 448x448
417
+ merge_sam[temp_mask] = self.classes - 1 # set to background
418
+
419
+ # filter small region for patch merge sam
420
+ binary = (patch_merge_sam != (self.patch_query_obj.shape[0]-1) ).astype(np.uint8) # foreground 1 background 0
421
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
422
+ for i in range(1, num_labels):
423
+ temp_mask = labels == i
424
+ if np.sum(temp_mask) <= 32: # 448x448
425
+ patch_merge_sam[temp_mask] = self.patch_query_obj.shape[0]-1 # set to background
426
+
427
+ score = 0. # default to normal
428
+ self.anomaly_flag = False
429
+ instance_masks = []
430
+ if self.class_name == 'pushpins':
431
+ # object count hist
432
+ kernel = np.ones((3, 3), dtype=np.uint8) # dilate for robustness
433
+ binary = np.isin(merge_sam, self.foreground_label_idx[self.class_name]).astype(np.uint8) # foreground 1 background 0
434
+ dilate_binary = cv2.dilate(binary, kernel)
435
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(dilate_binary, connectivity=8)
436
+ pushpins_count = num_labels - 1 # number of pushpins
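+ # Every connected component of the dilated foreground is treated as one pushpin; any
+ # deviation from the canonical count of 15 is flagged as a logical anomaly.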
437
+
438
+ for i in range(1, num_labels):
439
+ instance_mask = (labels == i).astype(np.uint8)
440
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
441
+ if instance_mask.any():
442
+ instance_masks.append(instance_mask.astype(bool).reshape(-1)) # np.bool was removed in NumPy 1.24
443
+
444
+ if self.few_shot_inited and pushpins_count != self.pushpins_count and self.anomaly_flag is False:
445
+ self.anomaly_flag = True
446
+ print('number of pushpins: {}, but canonical number of pushpins: {}'.format(pushpins_count, self.pushpins_count))
447
+
448
+ # patch hist
449
+ clip_patch_hist = np.bincount(patch_mask.reshape(-1), minlength=self.patch_query_obj.shape[0])
450
+ clip_patch_hist = clip_patch_hist / np.linalg.norm(clip_patch_hist)
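+ # The L2-normalized histogram of per-patch text labels is compared with the k few-shot
+ # histograms; the logical score is 1 minus the best cosine similarity.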
451
+
452
+ if self.few_shot_inited:
453
+ patch_hist_similarity = (clip_patch_hist @ self.patch_token_hist.T)
454
+ score = 1 - patch_hist_similarity.max()
455
+
456
+ binary_foreground = dilate_binary.astype(np.uint8)
457
+
458
+ if len(instance_masks) != 0:
459
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
460
+
461
+ if self.visualization:
462
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
463
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
464
+ plt.figure(figsize=(20, 3))
465
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
466
+ plt.subplot(1, len(image_list), ind)
467
+ plt.imshow(temp_image)
468
+ plt.title(temp_title)
469
+ plt.margins(0, 0)
470
+ plt.axis('off')
471
+ # Extract relative path from class_name onwards
472
+ if class_name in path:
473
+ relative_path = path.split(class_name, 1)[-1]
474
+ if relative_path.startswith('/'):
475
+ relative_path = relative_path[1:]
476
+ save_path = f'visualization/few_shot/{class_name}/{relative_path}.png'
477
+ else:
478
+ save_path = f'visualization/few_shot/{class_name}/{path}.png'
479
+
480
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
481
+ plt.tight_layout()
482
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
483
+ plt.close()
484
+
485
+ # todo: same number in total but in different boxes or broken box
486
+ return {"score": score, "clip_patch_hist": clip_patch_hist, "instance_masks": instance_masks}
487
+
488
+ elif self.class_name == 'splicing_connectors':
489
+ # object count hist for default
490
+ sam_mask_max_area = sorted_masks[0]['segmentation'] # background
491
+ binary = (sam_mask_max_area == 0).astype(np.uint8) # sam_mask_max_area is background, background 0 foreground 1
492
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)
493
+ count = 0
494
+ for i in range(1, num_labels):
495
+ temp_mask = labels == i
496
+ if np.sum(temp_mask) <= 64: # 448x448 64
497
+ binary[temp_mask] = 0 # set to background
498
+ else:
499
+ count += 1
500
+ if count != 1 and self.anomaly_flag is False: # cable cut or no cable or no connector
501
+ print('number of connected components in splicing_connectors: {}, but the expected number of connected components is 1.'.format(count))
502
+ self.anomaly_flag = True
503
+
504
+ merge_sam[~(binary.astype(bool))] = self.query_obj.shape[0] - 1 # remove noise
505
+ patch_merge_sam[~(binary.astype(bool))] = self.patch_query_obj.shape[0] - 1 # remove patch noise
506
+
507
+ # erode the cable and divide into left and right parts
508
+ kernel = np.ones((23, 23), dtype=np.uint8)
509
+ erode_binary = cv2.erode(binary, kernel)
510
+ h, w = erode_binary.shape
511
+ distance = 0
512
+
513
+ left, right = erode_binary[:, :int(w/2)], erode_binary[:, int(w/2):]
514
+ left_count = np.bincount(left.reshape(-1), minlength=self.classes)[1] # foreground
515
+ right_count = np.bincount(right.reshape(-1), minlength=self.classes)[1] # foreground
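+ # A heavy erosion removes the thin cable so the two clamps separate; the foreground pixel
+ # counts of the left and right image halves are compared later to check symmetry.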
516
+
517
+ binary_cable = (patch_merge_sam == 1).astype(np.uint8)
518
+
519
+ kernel = np.ones((5, 5), dtype=np.uint8)
520
+ binary_cable = cv2.erode(binary_cable, kernel)
521
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_cable, connectivity=8)
522
+ for i in range(1, num_labels):
523
+ temp_mask = labels == i
524
+ if np.sum(temp_mask) <= 64: # 448x448
525
+ binary_cable[temp_mask] = 0 # set to background
526
+
527
+
528
+ binary_cable = cv2.resize(binary_cable, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
529
+
530
+ binary_clamps = (patch_merge_sam == 0).astype(np.uint8)
531
+
532
+ kernel = np.ones((5, 5), dtype=np.uint8)
533
+ binary_clamps = cv2.erode(binary_clamps, kernel)
534
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_clamps, connectivity=8)
535
+ for i in range(1, num_labels):
536
+ temp_mask = labels == i
537
+ if np.sum(temp_mask) <= 64: # 448x448
538
+ binary_clamps[temp_mask] = 0 # set to background
539
+ else:
540
+ instance_mask = temp_mask.astype(np.uint8)
541
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
542
+ if instance_mask.any():
543
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
544
+
545
+ binary_clamps = cv2.resize(binary_clamps, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
546
+
547
+ binary_connector = cv2.resize(binary, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
548
+
549
+ query_cable_color = encode_obj_text(self.model_clip, self.splicing_connectors_cable_color_query_words_dict, self.tokenizer, self.device)
550
+ cable_feature = proj_patch_token[binary_cable.astype(bool).reshape(-1), :].mean(0, keepdim=True)
551
+ idx_color = (cable_feature @ query_cable_color.T).argmax(-1).squeeze(0).item()
552
+ foreground_pixel_count = np.sum(erode_binary) / self.splicing_connectors_count[idx_color]
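+ # The clamp area is normalized by the clamp count implied by the predicted cable color
+ # (yellow -> 2, blue -> 3, red -> 5); a ratio far from the few-shot reference therefore
+ # signals a color / clamp-count mismatch.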
553
+
554
+
555
+ slice_cable = binary[:, int(w/2)-1: int(w/2)+1]
556
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(slice_cable, connectivity=8)
557
+ cable_count = num_labels - 1
558
+ if cable_count != 1 and self.anomaly_flag is False: # two cables
559
+ print('cable count in splicing_connectors: {}, but the expected cable count is 1.'.format(cable_count))
560
+ self.anomaly_flag = True
561
+
562
+ # {2-clamp: yellow 3-clamp: blue 5-clamp: red} cable color and clamp number mismatch
563
+ if self.few_shot_inited and self.foreground_pixel_hist != 0 and self.anomaly_flag is False:
564
+ ratio = foreground_pixel_count / self.foreground_pixel_hist
565
+ if (ratio > 1.2 or ratio < 0.8) and self.anomaly_flag is False: # color and number mismatch
566
+ print('cable color and number of clamps mismatch, cable color idx: {} (0: yellow 2-clamp, 1: blue 3-clamp, 2: red 5-clamp), foreground_pixel_count :{}, canonical foreground_pixel_hist: {}.'.format(idx_color, foreground_pixel_count, self.foreground_pixel_hist))
567
+ self.anomaly_flag = True
568
+
569
+ # left right hist for symmetry
570
+ ratio = np.sum(left_count) / (np.sum(right_count) + 1e-5)
571
+ if self.few_shot_inited and (ratio > 1.2 or ratio < 0.8) and self.anomaly_flag is False: # left right asymmetry in clamp
572
+ print('left and right connectors are not symmetric.')
573
+ self.anomaly_flag = True
574
+
575
+ # left and right centroids distance
576
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(erode_binary, connectivity=8)
577
+ if num_labels - 1 == 2:
578
+ centroids = centroids[1:]
579
+ x1, y1 = centroids[0]
580
+ x2, y2 = centroids[1]
581
+ distance = np.sqrt((x1/w - x2/w)**2 + (y1/h - y2/h)**2)
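+ # The normalized distance between the two clamp centroids is a proxy for cable length and
+ # is compared against the few-shot reference with a 0.6-1.4 tolerance below.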
582
+ if self.few_shot_inited and self.splicing_connectors_distance != 0 and self.anomaly_flag is False:
583
+ ratio = distance / self.splicing_connectors_distance
584
+ if ratio < 0.6 or ratio > 1.4: # centroid distance (cable) too short or too long
585
+ print('cable is too short or too long.')
586
+ self.anomaly_flag = True
587
+
588
+ # patch hist
589
+ sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0])#[:-1] # ignore background (grid) for statistic
590
+ sam_patch_hist = sam_patch_hist / np.linalg.norm(sam_patch_hist)
591
+
592
+ if self.few_shot_inited:
593
+ patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T)
594
+ score = 1 - patch_hist_similarity.max()
595
+
596
+ # todo mismatch cable link
597
+ binary_foreground = binary.astype(np.uint8) # only 1 instance, so additionally separate cable and clamps
598
+ if binary_connector.any():
599
+ instance_masks.append(binary_connector.astype(bool).reshape(-1))
600
+ if binary_clamps.any():
601
+ instance_masks.append(binary_clamps.astype(bool).reshape(-1))
602
+ if binary_cable.any():
603
+ instance_masks.append(binary_cable.astype(bool).reshape(-1))
604
+
605
+ if len(instance_masks) != 0:
606
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
607
+
608
+ if self.visualization:
609
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, binary_connector, merge_sam, patch_merge_sam, erode_binary, binary_cable, binary_clamps]
610
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'binary_connector', 'merge sam', 'patch merge sam', 'erode binary', 'binary_cable', 'binary_clamps']
611
+ plt.figure(figsize=(25, 3))
612
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
613
+ plt.subplot(1, len(image_list), ind)
614
+ plt.imshow(temp_image)
615
+ plt.title(temp_title)
616
+ plt.margins(0, 0)
617
+ plt.axis('off')
618
+ # Extract relative path from class_name onwards
619
+ if class_name in path:
620
+ relative_path = path.split(class_name, 1)[-1]
621
+ if relative_path.startswith('/'):
622
+ relative_path = relative_path[1:]
623
+ save_path = f'visualization/few_shot/{class_name}/{relative_path}.png'
624
+ else:
625
+ save_path = f'visualization/few_shot/{class_name}/{path}.png'
626
+
627
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
628
+ plt.tight_layout()
629
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
630
+ plt.close()
631
+
632
+ return {"score": score, "foreground_pixel_count": foreground_pixel_count, "distance": distance, "sam_patch_hist": sam_patch_hist, "instance_masks": instance_masks}
633
+
634
+ elif self.class_name == 'screw_bag':
635
+ # pixel hist of kmeans mask
636
+ foreground_pixel_count = np.sum(np.bincount(kmeans_mask.reshape(-1))[:len(self.foreground_label_idx[self.class_name])]) # foreground pixel
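+ # The total number of screw/nut/washer pixels is a coarse part-count signal; a deviation of
+ # more than ~6% from the few-shot reference flags a missing or extra part.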
637
+ if self.few_shot_inited and self.foreground_pixel_hist != 0 and self.anomaly_flag is False:
638
+ ratio = foreground_pixel_count / self.foreground_pixel_hist
639
+ # todo: optimize
640
+ if ratio < 0.94 or ratio > 1.06:
641
+ print('foreground pixel histogram of screw bag: {}, but the canonical foreground pixel histogram of screw bag in few shot: {}'.format(foreground_pixel_count, self.foreground_pixel_hist))
642
+ self.anomaly_flag = True
643
+
644
+ # patch hist
645
+ binary_screw = np.isin(kmeans_mask, self.foreground_label_idx[self.class_name])
646
+ patch_mask[~binary_screw] = self.patch_query_obj.shape[0] - 1 # remove patch noise
647
+ resized_binary_screw = cv2.resize(binary_screw.astype(np.uint8), (patch_merge_sam.shape[1], patch_merge_sam.shape[0]), interpolation = cv2.INTER_NEAREST)
648
+ patch_merge_sam[~(resized_binary_screw.astype(bool))] = self.patch_query_obj.shape[0] - 1 # remove patch noise
649
+
650
+ clip_patch_hist = np.bincount(patch_mask.reshape(-1), minlength=self.patch_query_obj.shape[0])[:-1]
651
+ clip_patch_hist = clip_patch_hist / np.linalg.norm(clip_patch_hist)
652
+
653
+ if self.few_shot_inited:
654
+ patch_hist_similarity = (clip_patch_hist @ self.patch_token_hist.T)
655
+ score = 1 - patch_hist_similarity.max()
656
+
657
+ # # todo: count of screw, nut and washer, screw of different length
658
+ binary_foreground = (patch_merge_sam != (self.patch_query_obj.shape[0] - 1)).astype(np.uint8)
659
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_foreground, connectivity=8)
660
+ for i in range(1, num_labels):
661
+ instance_mask = (labels == i).astype(np.uint8)
662
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
663
+ if instance_mask.any():
664
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
665
+
666
+ if len(instance_masks) != 0:
667
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
668
+
669
+ if self.visualization:
670
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
671
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
672
+ plt.figure(figsize=(20, 3))
673
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
674
+ plt.subplot(1, len(image_list), ind)
675
+ plt.imshow(temp_image)
676
+ plt.title(temp_title)
677
+ plt.margins(0, 0)
678
+ plt.axis('off')
679
+ # Extract relative path from class_name onwards
680
+ if class_name in path:
681
+ relative_path = path.split(class_name, 1)[-1]
682
+ if relative_path.startswith('/'):
683
+ relative_path = relative_path[1:]
684
+ save_path = f'visualization/few_shot/{class_name}/{relative_path}.png'
685
+ else:
686
+ save_path = f'visualization/few_shot/{class_name}/{path}.png'
687
+
688
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
689
+ plt.tight_layout()
690
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
691
+ plt.close()
692
+
693
+ return {"score": score, "foreground_pixel_count": foreground_pixel_count, "clip_patch_hist": clip_patch_hist, "instance_masks": instance_masks}
694
+
695
+ elif self.class_name == 'breakfast_box':
696
+ # patch hist
697
+ sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0])
698
+ sam_patch_hist = sam_patch_hist / np.linalg.norm(sam_patch_hist)
699
+
700
+ if self.few_shot_inited:
701
+ patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T)
702
+ score = 1 - patch_hist_similarity.max()
703
+
704
+ # todo: exist of foreground
705
+
706
+ binary_foreground = (patch_merge_sam != (self.patch_query_obj.shape[0] - 1)).astype(np.uint8)
707
+
708
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_foreground, connectivity=8)
709
+ for i in range(1, num_labels):
710
+ instance_mask = (labels == i).astype(np.uint8)
711
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
712
+ if instance_mask.any():
713
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
714
+
715
+ if len(instance_masks) != 0:
716
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
717
+
718
+ if self.visualization:
719
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
720
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
721
+ plt.figure(figsize=(20, 3))
722
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
723
+ plt.subplot(1, len(image_list), ind)
724
+ plt.imshow(temp_image)
725
+ plt.title(temp_title)
726
+ plt.margins(0, 0)
727
+ plt.axis('off')
728
+ # Extract relative path from class_name onwards
729
+ if class_name in path:
730
+ relative_path = path.split(class_name, 1)[-1]
731
+ if relative_path.startswith('/'):
732
+ relative_path = relative_path[1:]
733
+ save_path = f'visualization/few_shot/{class_name}/{relative_path}.png'
734
+ else:
735
+ save_path = f'visualization/few_shot/{class_name}/{path}.png'
736
+
737
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
738
+ plt.tight_layout()
739
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
740
+ plt.close()
741
+
742
+ return {"score": score, "sam_patch_hist": sam_patch_hist, "instance_masks": instance_masks}
743
+
744
+ elif self.class_name == 'juice_bottle':
745
+ # remove noise due to non sam mask
746
+ merge_sam[sam_mask == 0] = self.classes - 1
747
+ patch_merge_sam[sam_mask == 0] = self.patch_query_obj.shape[0] - 1 # 79.5
748
+
749
+ # [['glass'], ['liquid in bottle'], ['fruit'], ['label', 'tag'], ['black background', 'background']],
750
+ # fruit and liquid mismatch (todo if exist)
751
+ resized_patch_merge_sam = cv2.resize(patch_merge_sam, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
752
+ binary_liquid = (resized_patch_merge_sam == 1)
753
+ binary_fruit = (resized_patch_merge_sam == 2)
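+ # Logical consistency check: the liquid color and the fruit on the label are classified with
+ # CLIP text queries and must agree by index (cherry juice <-> cherry, orange juice <-> orange,
+ # milky liquid <-> banana); a mismatch is a logical anomaly.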
754
+
755
+ query_liquid = encode_obj_text(self.model_clip, self.juice_bottle_liquid_query_words_dict, self.tokenizer, self.device)
756
+ query_fruit = encode_obj_text(self.model_clip, self.juice_bottle_fruit_query_words_dict, self.tokenizer, self.device)
757
+
758
+ liquid_feature = proj_patch_token[binary_liquid.reshape(-1), :].mean(0, keepdim=True)
759
+ liquid_idx = (liquid_feature @ query_liquid.T).argmax(-1).squeeze(0).item()
760
+
761
+ fruit_feature = proj_patch_token[binary_fruit.reshape(-1), :].mean(0, keepdim=True)
762
+ fruit_idx = (fruit_feature @ query_fruit.T).argmax(-1).squeeze(0).item()
763
+
764
+ if (liquid_idx != fruit_idx) and self.anomaly_flag is False:
765
+ print('liquid: {}, but fruit: {}.'.format(self.juice_bottle_liquid_query_words_dict[liquid_idx], self.juice_bottle_fruit_query_words_dict[fruit_idx]))
766
+ self.anomaly_flag = True
767
+
768
+ # # todo centroid of fruit and tag_0 mismatch (if exist) , only one tag, center
769
+
770
+ # patch hist
771
+ sam_patch_hist = np.bincount(patch_merge_sam.reshape(-1), minlength=self.patch_query_obj.shape[0])
772
+ sam_patch_hist = sam_patch_hist / np.linalg.norm(sam_patch_hist)
773
+
774
+ if self.few_shot_inited:
775
+ patch_hist_similarity = (sam_patch_hist @ self.patch_token_hist.T)
776
+ score = 1 - patch_hist_similarity.max()
777
+
778
+ binary_foreground = (patch_merge_sam != (self.patch_query_obj.shape[0] - 1) ).astype(np.uint8)
779
+ num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(binary_foreground, connectivity=8)
780
+ for i in range(1, num_labels):
781
+ instance_mask = (labels == i).astype(np.uint8)
782
+ instance_mask = cv2.resize(instance_mask, (self.feat_size, self.feat_size), interpolation = cv2.INTER_NEAREST)
783
+ if instance_mask.any():
784
+ instance_masks.append(instance_mask.astype(bool).reshape(-1))
785
+
786
+ if len(instance_masks) != 0:
787
+ instance_masks = np.stack(instance_masks) #[N, 64x64]
788
+
789
+ if self.visualization:
790
+ image_list = [raw_image, kmeans_label, kmeans_mask, patch_mask, sam_mask, merge_sam, patch_merge_sam, binary_foreground]
791
+ title_list = ['raw image', 'k-means', 'kmeans mask', 'patch mask', 'sam mask', 'merge sam mask', 'patch merge sam', 'binary_foreground']
792
+ plt.figure(figsize=(20, 3))
793
+ for ind, (temp_title, temp_image) in enumerate(zip(title_list, image_list), start=1):
794
+ plt.subplot(1, len(image_list), ind)
795
+ plt.imshow(temp_image)
796
+ plt.title(temp_title)
797
+ plt.margins(0, 0)
798
+ plt.axis('off')
799
+ # Extract relative path from class_name onwards
800
+ if class_name in path:
801
+ relative_path = path.split(class_name, 1)[-1]
802
+ if relative_path.startswith('/'):
803
+ relative_path = relative_path[1:]
804
+ save_path = f'visualization/few_shot/{class_name}/{relative_path}.png'
805
+ else:
806
+ save_path = f'visualization/few_shot/{class_name}/{path}.png'
807
+
808
+ os.makedirs(os.path.dirname(save_path), exist_ok=True)
809
+ plt.tight_layout()
810
+ plt.savefig(save_path, bbox_inches='tight', dpi=150)
811
+ plt.close()
812
+
813
+ return {"score": score, "sam_patch_hist": sam_patch_hist, "instance_masks": instance_masks}
814
+
815
+ return {"score": score, "instance_masks": instance_masks}
816
+
817
+
818
+ def process_k_shot(self, class_name, few_shot_samples, few_shot_paths):
819
+ few_shot_samples = F.interpolate(few_shot_samples, size=(448, 448), mode=self.inter_mode, align_corners=self.align_corners, antialias=self.antialias)
820
+
821
+ with torch.no_grad():
822
+ image_features, patch_tokens, proj_patch_tokens = self.model_clip.encode_image(few_shot_samples, self.feature_list)
823
+ patch_tokens = [p[:, 1:, :] for p in patch_tokens]
824
+ patch_tokens = [p.reshape(p.shape[0]*p.shape[1], p.shape[2]) for p in patch_tokens]
825
+
826
+ patch_tokens_clip = torch.cat(patch_tokens, dim=-1) # (bs, 1024, 1024x4)
827
+ # patch_tokens_clip = torch.cat(patch_tokens[2:], dim=-1) # (bs, 1024, 1024x2)
828
+ patch_tokens_clip = patch_tokens_clip.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
829
+ patch_tokens_clip = F.interpolate(patch_tokens_clip, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
830
+ patch_tokens_clip = patch_tokens_clip.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.feature_list))
831
+ patch_tokens_clip = F.normalize(patch_tokens_clip, p=2, dim=-1) # (bsx64x64, 1024x4)
832
+
833
+ with torch.no_grad():
834
+ patch_tokens_dinov2 = self.model_dinov2.forward_features(few_shot_samples, out_layer_list=self.feature_list_dinov2) # 4 x [bs, 32x32, 1024]
835
+ patch_tokens_dinov2 = torch.cat(patch_tokens_dinov2, dim=-1) # (bs, 1024, 1024x4)
836
+ patch_tokens_dinov2 = patch_tokens_dinov2.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
837
+ patch_tokens_dinov2 = F.interpolate(patch_tokens_dinov2, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
838
+ patch_tokens_dinov2 = patch_tokens_dinov2.permute(0, 2, 3, 1).view(-1, self.vision_width_dinov2 * len(self.feature_list_dinov2))
839
+ patch_tokens_dinov2 = F.normalize(patch_tokens_dinov2, p=2, dim=-1) # (bsx64x64, 1024x4)
840
+
841
+ cluster_features = None
842
+ for layer in self.cluster_feature_id:
843
+ temp_feat = patch_tokens[layer]
844
+ cluster_features = temp_feat if cluster_features is None else torch.cat((cluster_features, temp_feat), 1)
845
+ if self.feat_size != self.ori_feat_size:
846
+ cluster_features = cluster_features.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
847
+ cluster_features = F.interpolate(cluster_features, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
848
+ cluster_features = cluster_features.permute(0, 2, 3, 1).view(-1, self.vision_width * len(self.cluster_feature_id))
849
+ cluster_features = F.normalize(cluster_features, p=2, dim=-1)
850
+
851
+ if self.feat_size != self.ori_feat_size:
852
+ proj_patch_tokens = proj_patch_tokens.view(self.k_shot, self.ori_feat_size, self.ori_feat_size, -1).permute(0, 3, 1, 2)
853
+ proj_patch_tokens = F.interpolate(proj_patch_tokens, size=(self.feat_size, self.feat_size), mode=self.inter_mode, align_corners=self.align_corners)
854
+ proj_patch_tokens = proj_patch_tokens.permute(0, 2, 3, 1).view(-1, self.embed_dim)
855
+ proj_patch_tokens = F.normalize(proj_patch_tokens, p=2, dim=-1)
856
+
857
+ num_clusters = self.cluster_num_dict[class_name]
858
+ _, self.cluster_centers = kmeans(X=cluster_features, num_clusters=num_clusters, device=self.device)
859
+
860
+ self.query_obj = encode_obj_text(self.model_clip, self.query_words_dict[class_name], self.tokenizer, self.device)
861
+ self.patch_query_obj = encode_obj_text(self.model_clip, self.patch_query_words_dict[class_name], self.tokenizer, self.device)
862
+ self.classes = self.query_obj.shape[0]
863
+
864
+ scores = []
865
+ foreground_pixel_hist = []
866
+ splicing_connectors_distance = []
867
+ patch_token_hist = []
868
+ mem_instance_masks = []
869
+
870
+ for image, cluster_feature, proj_patch_token, few_shot_path in zip(few_shot_samples.chunk(self.k_shot), cluster_features.chunk(self.k_shot), proj_patch_tokens.chunk(self.k_shot), few_shot_paths):
871
+ # path = os.path.dirname(few_shot_path).split('/')[-1] + "_" + os.path.basename(few_shot_path).split('.')[0]
872
+ self.anomaly_flag = False
873
+ results = self.histogram(image, cluster_feature, proj_patch_token, class_name, "few_shot_" + os.path.basename(few_shot_path).split('.')[0])
874
+ if self.class_name == 'pushpins':
875
+ patch_token_hist.append(results["clip_patch_hist"])
876
+ mem_instance_masks.append(results['instance_masks'])
877
+
878
+ elif self.class_name == 'splicing_connectors':
879
+ foreground_pixel_hist.append(results["foreground_pixel_count"])
880
+ splicing_connectors_distance.append(results["distance"])
881
+ patch_token_hist.append(results["sam_patch_hist"])
882
+ mem_instance_masks.append(results['instance_masks'])
883
+
884
+ elif self.class_name == 'screw_bag':
885
+ foreground_pixel_hist.append(results["foreground_pixel_count"])
886
+ patch_token_hist.append(results["clip_patch_hist"])
887
+ mem_instance_masks.append(results['instance_masks'])
888
+
889
+ elif self.class_name == 'breakfast_box':
890
+ patch_token_hist.append(results["sam_patch_hist"])
891
+ mem_instance_masks.append(results['instance_masks'])
892
+
893
+ elif self.class_name == 'juice_bottle':
894
+ patch_token_hist.append(results["sam_patch_hist"])
895
+ mem_instance_masks.append(results['instance_masks'])
896
+
897
+ scores.append(results["score"])
898
+
899
+ if len(foreground_pixel_hist) != 0:
900
+ self.foreground_pixel_hist = np.mean(foreground_pixel_hist)
901
+ if len(splicing_connectors_distance) != 0:
902
+ self.splicing_connectors_distance = np.mean(splicing_connectors_distance)
903
+ if len(patch_token_hist) != 0: # patch hist
904
+ self.patch_token_hist = np.stack(patch_token_hist)
905
+ if len(mem_instance_masks) != 0:
906
+ self.mem_instance_masks = mem_instance_masks
907
+
908
+ mem_patch_feature_clip_coreset = patch_tokens_clip
909
+ mem_patch_feature_dinov2_coreset = patch_tokens_dinov2
910
+
911
+ return scores, mem_patch_feature_clip_coreset, mem_patch_feature_dinov2_coreset
912
+
913
+
914
+
915
+ def process(self, class_name: str, few_shot_samples: list[torch.Tensor], few_shot_paths: list[str]):
916
+ few_shot_samples = self.transform(few_shot_samples).to(self.device)
917
+ scores, self.mem_patch_feature_clip_coreset, self.mem_patch_feature_dinov2_coreset = self.process_k_shot(class_name, few_shot_samples, few_shot_paths)
918
+
919
+ def setup(self, data: dict) -> None:
920
+ """Setup the few-shot samples for the model.
921
+
922
+ The evaluation script will call this method to pass the k images for few shot learning and the object class
923
+ name. In the case of MVTec LOCO this will be the dataset category name (e.g. breakfast_box). Please contact
924
+ the organizing committee if your model requires any additional dataset-related information at setup-time.
925
+ """
926
+ few_shot_samples = data.get("few_shot_samples")
927
+ class_name = data.get("dataset_category")
928
+ few_shot_paths = data.get("few_shot_samples_path")
929
+ self.class_name = class_name
930
+
931
+ self.k_shot = few_shot_samples.size(0)
932
+ self.process(class_name, few_shot_samples, few_shot_paths)
933
+ self.few_shot_inited = True
934
+
935
+
prompt_ensemble.py ADDED
@@ -0,0 +1,121 @@
1
+ import os
2
+ from typing import Union, List
3
+ import torch
4
+ import numpy as np
5
+ from tqdm import tqdm
6
+ from imagenet_template import openai_imagenet_template
7
+
8
+
9
+ def encode_text_with_prompt_ensemble(model, objs, tokenizer, device):
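+ # Prompt ensembling: every object name is expanded into normal/abnormal state phrases and a
+ # set of caption templates; the CLIP text embeddings are averaged and re-normalized so each
+ # object ends up with one "normal" and one "abnormal" text feature.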
10
+ prompt_normal = ['{}', 'flawless {}', 'perfect {}', 'unblemished {}', '{} without flaw', '{} without defect', '{} without damage']
11
+ prompt_abnormal = ['damaged {}', 'broken {}', '{} with flaw', '{} with defect', '{} with damage']
12
+ prompt_state = [prompt_normal, prompt_abnormal]
13
+ prompt_templates = ['a bad photo of a {}.', 'a low resolution photo of the {}.', 'a bad photo of the {}.', 'a cropped photo of the {}.', 'a bright photo of a {}.', 'a dark photo of the {}.', 'a photo of my {}.', 'a photo of the cool {}.', 'a close-up photo of a {}.', 'a black and white photo of the {}.', 'a bright photo of the {}.', 'a cropped photo of a {}.', 'a jpeg corrupted photo of a {}.', 'a blurry photo of the {}.', 'a photo of the {}.', 'a good photo of the {}.', 'a photo of one {}.', 'a close-up photo of the {}.', 'a photo of a {}.', 'a low resolution photo of a {}.', 'a photo of a large {}.', 'a blurry photo of a {}.', 'a jpeg corrupted photo of the {}.', 'a good photo of a {}.', 'a photo of the small {}.', 'a photo of the large {}.', 'a black and white photo of a {}.', 'a dark photo of a {}.', 'a photo of a cool {}.', 'a photo of a small {}.', 'there is a {} in the scene.', 'there is the {} in the scene.', 'this is a {} in the scene.', 'this is the {} in the scene.', 'this is one {} in the scene.']
14
+ text_prompts = {}
15
+ for obj in objs:
16
+ text_features = []
17
+ for i in range(len(prompt_state)):
18
+ prompted_state = [state.format(obj) for state in prompt_state[i]]
19
+ prompted_sentence = []
20
+ for s in prompted_state:
21
+ for template in prompt_templates:
22
+ prompted_sentence.append(template.format(s))
23
+ prompted_sentence = tokenizer(prompted_sentence).to(device)
24
+ class_embeddings = model.encode_text(prompted_sentence)
25
+ class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
26
+ class_embedding = class_embeddings.mean(dim=0)
27
+ class_embedding /= class_embedding.norm()
28
+ text_features.append(class_embedding)
29
+
30
+ text_features = torch.stack(text_features, dim=1).to(device)
31
+ text_prompts[obj] = text_features
32
+
33
+ return text_prompts
34
+
35
+
36
+ def encode_general_text(model, obj_list, tokenizer, device):
37
+ text_dir = '/data/yizhou/VAND2.0/wgd/general_texts/train2014'
38
+ text_name_list = sorted(os.listdir(text_dir))
39
+ bs = 100
40
+ sentences = []
41
+ embeddings = []
42
+ all_sentences = []
43
+ for text_name in tqdm(text_name_list):
44
+ with open(os.path.join(text_dir, text_name), 'r') as f:
45
+ for line in f.readlines():
46
+ sentences.append(line.strip())
47
+ if len(sentences) > bs:
48
+ prompted_sentences = tokenizer(sentences).to(device)
49
+ class_embeddings = model.encode_text(prompted_sentences)
50
+ class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
51
+ embeddings.append(class_embeddings)
52
+ all_sentences.extend(sentences)
53
+ sentences = []
54
+ # if len(all_sentences) > 10000:
55
+ # break
56
+ embeddings = torch.cat(embeddings, 0)
57
+ print(embeddings.size(0))
58
+ embeddings_dict = {}
59
+ for obj in obj_list:
60
+ embeddings_dict[obj] = embeddings
61
+ return embeddings_dict, all_sentences
62
+
63
+
64
+ def encode_abnormal_text(model, obj_list, tokenizer, device):
65
+ embeddings = {}
66
+ sentences = {}
67
+ for obj in obj_list:
68
+ sentence_abnormal = []
69
+ with open(os.path.join('text_prompt', 'v1', obj + '_abnormal.txt'), 'r') as f:
70
+ for line in f.readlines():
71
+ sentence_abnormal.append(line.strip().lower())
72
+
73
+ prompted_sentences = tokenizer(sentence_abnormal).to(device)
74
+ class_embeddings = model.encode_text(prompted_sentences)
75
+ class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
76
+ embeddings[obj] = class_embeddings
77
+ sentences[obj] = sentence_abnormal
78
+ return embeddings, sentences
79
+
80
+
81
+ def encode_normal_text(model, obj_list, tokenizer, device):
82
+ embeddings = {}
83
+ sentences = {}
84
+ for obj in obj_list:
85
+ sentence_abnormal = []
86
+ with open(os.path.join('text_prompt', 'v1', obj + '_normal.txt'), 'r') as f:
87
+ for line in f.readlines():
88
+ sentence_abnormal.append(line.strip().lower())
89
+
90
+ prompted_sentences = tokenizer(sentence_abnormal).to(device)
91
+ class_embeddings = model.encode_text(prompted_sentences)
92
+ class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
93
+ embeddings[obj] = class_embeddings
94
+ sentences[obj] = sentence_abnormal
95
+ return embeddings, sentences
96
+
97
+
98
+ def encode_obj_text(model, query_words, tokenizer, device):
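+ # Encodes each query entry (a single word or a list of synonyms) with the OpenAI ImageNet
+ # prompt templates and averages the CLIP text embeddings into one L2-normalized vector per class.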
99
+ # query_words = ['orange', "nectarine", "cereals", "banana chips", 'almonds', 'white box']
100
+ # query_words = ['liquid', 'glass', "top", 'black background']
101
+ # query_words = ["connector", "grid"]
102
+ # query_words = [['screw'], 'plastic bag', 'background']
103
+ # query_words = [['pushpin', 'pin'], ['plastic box'], 'box', 'black background']
104
+ query_features = []
105
+ with torch.no_grad():
106
+ for qw in query_words:
107
+ token_input = []
108
+ if type(qw) == list:
109
+ for qw2 in qw:
110
+ token_input.extend([temp(qw2) for temp in openai_imagenet_template])
111
+ else:
112
+ token_input = [temp(qw) for temp in openai_imagenet_template]
113
+ query = tokenizer(token_input).to(device)
114
+ feature = model.encode_text(query)
115
+ feature /= feature.norm(dim=-1, keepdim=True)
116
+ feature = feature.mean(dim=0)
117
+ feature /= feature.norm()
118
+ query_features.append(feature.unsqueeze(0))
119
+ query_features = torch.cat(query_features, dim=0)
120
+ return query_features
121
+
requirements.txt ADDED
@@ -0,0 +1,77 @@
1
+ aiohappyeyeballs==2.6.1
2
+ aiohttp==3.12.11
3
+ aiosignal==1.3.2
4
+ antlr4-python3-runtime==4.9.3
5
+ async-timeout==5.0.1
6
+ attrs==25.3.0
7
+ certifi==2025.4.26
8
+ charset-normalizer==3.4.2
9
+ contourpy==1.3.2
10
+ cycler==0.12.1
11
+ einops==0.6.1
12
+ faiss-cpu==1.8.0
13
+ filelock==3.18.0
14
+ fonttools==4.58.2
15
+ FrEIA==0.2
16
+ frozenlist==1.6.2
17
+ fsspec==2024.12.0
18
+ ftfy==6.3.1
19
+ hf-xet==1.1.3
20
+ huggingface-hub==0.32.4
21
+ idna==3.10
22
+ imageio==2.37.0
23
+ imgaug==0.4.0
24
+ Jinja2==3.1.6
25
+ joblib==1.5.1
26
+ jsonargparse==4.29.0
27
+ kiwisolver==1.4.8
28
+ kmeans-pytorch==0.3
29
+ kornia==0.7.0
30
+ lazy_loader==0.4
31
+ lightning==2.2.5
32
+ lightning-utilities==0.14.3
33
+ markdown-it-py==3.0.0
34
+ MarkupSafe==3.0.2
35
+ matplotlib==3.10.3
36
+ mdurl==0.1.2
37
+ mpmath==1.3.0
38
+ multidict==6.4.4
39
+ networkx==3.4.2
40
+ omegaconf==2.3.0
41
+ open-clip-torch==2.24.0
42
+ opencv-python==4.8.1.78
43
+ packaging==24.2
44
+ pandas==2.0.3
45
+ pillow==11.2.1
46
+ propcache==0.3.1
47
+ protobuf==6.31.1
48
+ Pygments==2.19.1
49
+ pyparsing==3.2.3
50
+ python-dateutil==2.9.0.post0
51
+ pytorch-lightning==2.5.1.post0
52
+ pytz==2025.2
53
+ PyYAML==6.0.2
54
+ regex==2024.11.6
55
+ requests==2.32.3
56
+ rich==13.7.1
57
+ safetensors==0.5.3
58
+ scikit-image==0.25.2
59
+ scikit-learn==1.2.2
60
+ scipy==1.15.3
61
+ segment-anything==1.0
62
+ sentencepiece==0.2.0
63
+ shapely==2.1.1
64
+ six==1.17.0
65
+ sympy==1.14.0
66
+ tabulate==0.9.0
67
+ threadpoolctl==3.6.0
68
+ tifffile==2025.5.10
69
+ timm==1.0.15
70
+ torchmetrics==1.7.2
71
+ tqdm==4.67.1
72
+ triton==2.1.0
73
+ typing_extensions==4.14.0
74
+ tzdata==2025.2
75
+ urllib3==2.4.0
76
+ wcwidth==0.2.13
77
+ yarl==1.20.0