YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

CLIP-Vit-Bert-Chinese pretrained model

这是中文版本的CLIP预训练模型,基于LiT-tuning(Locked-image Text tuning)的策略,使用140万中文图文对数据进行多模态对比学习预训练。

Github: CLIP-Chinese

Bolg: CLIP-Chinese:中文多模态对比学习CLIP预训练模型

Model and Training Detail

该模型主要由文本编码器与图像编码器组成,其中文本编码器为Bert,图像编码器为Vit,我们将该模型称为BertCLIP模型。训练时,Bert使用Langboat/mengzi-bert-base的权重进行初始化,Vit使用openai/clip-vit-large-patch32 的权重进行初始化。采用LiT-tuning(Locked-image Text tuning)的策略进行训练,也就是冻结Vit的权重,训练BertCLIP模型剩余的权重。

Usage

首先将项目clone到本地,并且安装依赖包

git clone https://github.com/yangjianxin1/CLIP-Chinese
pip install -r requirements.txt

使用如下脚本,就可成功加载预训练权重,对图片和文本进行预处理,并且得到模型的输出

from transformers import CLIPProcessor
from component.model import BertCLIPModel
from PIL import Image
import requests

model_name_or_path = 'YeungNLP/clip-vit-bert-chinese-1M'
# 加载预训练模型权重
model = BertCLIPModel.from_pretrained(model_name_or_path)
CLIPProcessor.tokenizer_class = 'BertTokenizerFast'
# 初始化processor
processor = CLIPProcessor.from_pretrained(model_name_or_path)
# 预处理输入
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["一只小狗在摇尾巴", "一只小猪在吃饭"], images=image, return_tensors="pt", padding=True)
inputs.pop('token_type_ids')    # 输入中不包含token_type_ids

outputs = model(**inputs)

# 对于每张图片,计算其与所有文本的相似度
logits_per_image = outputs.logits_per_image  # image-text的相似度得分
probs = logits_per_image.softmax(dim=1)  # 对分数进行归一化

# 对于每个文本,计算其与所有图片的相似度
logits_per_text = outputs.logits_per_text  # text-image的相似度得分
probs = logits_per_text.softmax(dim=1)  # 对分数进行归一化

# 获得文本编码
text_embeds = outputs.text_embeds
# 获得图像编码
image_embeds = outputs.image_embeds

单独加载图像编码器,进行下游任务

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPVisionModel

model_name_or_path = 'YeungNLP/clip-vit-bert-chinese-1M'
model = CLIPVisionModel.from_pretrained(model_name_or_path)
CLIPProcessor.tokenizer_class = 'BertTokenizerFast'
processor = CLIPProcessor.from_pretrained(model_name_or_path)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output 

单独加载文本编码器,进行下游任务

from component.model import BertCLIPTextModel
from transformers import BertTokenizerFast

model_name_or_path = 'YeungNLP/clip-vit-bert-chinese-1M'
model = BertCLIPTextModel.from_pretrained(model_name_or_path)
tokenizer = BertTokenizerFast.from_pretrained(model_name_or_path)

inputs = tokenizer(["一只小狗在摇尾巴", "一只小猪在吃饭"], padding=True, return_tensors="pt")
inputs.pop('token_type_ids')  # 输入中不包含token_type_ids

outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output
Downloads last month
66
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model has no pipeline_tag.