---
license: apache-2.0
datasets:
- Inst-IT/Inst-IT-Dataset
- lmms-lab/LLaVA-NeXT-Data
language:
- en
metrics:
- accuracy
pipeline_tag: video-text-to-text
tags:
- multimodal
- fine-grained
- instance-understanding
model-index:
- name: LLaVA-Next-Inst-It-Qwen2-7B
  results:
  - task:
      type: multimodal
    dataset:
      name: Inst-IT-Bench-I-OE
      type: Open-Ended
    metrics:
    - type: accuracy
      value: 67.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Inst-IT-Bench-I-MC
      type: Multi-Choice
    metrics:
    - type: accuracy
      value: 75.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: AI2D
      type: ai2d
    metrics:
    - type: accuracy
      value: 78.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MMMU
      type: mmmu
    metrics:
    - type: accuracy
      value: 42.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: POPE
      type: pope
    metrics:
    - type: accuracy
      value: 87.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: GQA
      type: gqa
    metrics:
    - type: accuracy
      value: 65.5
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MM-Vet
      type: mm-vet
    metrics:
    - type: accuracy
      value: 44.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Inst-IT-Bench-V-OE
      type: Open-Ended
    metrics:
    - type: accuracy
      value: 45.7
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: Inst-IT-Bench-V-MC
      type: Multi-Choice
    metrics:
    - type: accuracy
      value: 53.3
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: ActNet-QA
      type: actnet-qa
    metrics:
    - type: accuracy
      value: 55.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 50.4
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: NextQA
      type: nextqa
    metrics:
    - type: accuracy
      value: 73.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 54.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: TempCompass
      type: tempcompass
    metrics:
    - type: accuracy
      value: 63.9
      name: accuracy
      verified: true
---
# LLaVA-Next-Inst-It-Qwen2-7B
[**Homepage**](https://inst-it.github.io/) | [**Code**](https://github.com/inst-it/inst-it) | [**Paper**](https://huggingface.co/papers/2412.03565) | [**arXiv**](https://arxiv.org/abs/2412.03565)
LLaVA-Next-Inst-It-Qwen2-7B is a multimodal model that excels at instance-level understanding.
It was introduced in the paper [Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning](https://huggingface.co/papers/2412.03565).
* **Architecture**: siglip-so400m-patch14-384 + Qwen2-7B
* **Data**: LLaVA-NeXT-Data / Inst-IT-Dataset
* **Precision**: bfloat16
## Quick Start
**Install**
Our code is based on LLaVA-NeXT. Before running, please install LLaVA-NeXT to prepare the environment:
```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
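Optionally, as a quick sanity check, you can try importing the modules that are used in the examples below:
``` python
# Optional sanity check: these imports are the ones used throughout this card.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
print("LLaVA-NeXT environment is ready")
```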
**Error Handling**
<details>
<summary>Click to unfold</summary>
* **Common error case 1:**
``` shell
Exception: data did not match any variant of untagged enum ModelWrapper at line 757272 column 3
```
This is caused by an incompatible version of `transformers`; try updating it:
``` shell
pip install -U transformers
```
* **Common error case 2:**
``` shell
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([730, 1152]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
```
This error occurs when the vision tower is loaded from a local path. To fix it, you can prepare the environment in either of the following ways.
**Option 1: Install from our fork of LLaVA-NeXT:**
``` shell
pip install git+https://github.com/inst-it/LLaVA-NeXT.git
```
**Option 2: Install from LLaVA-NeXT and manually modify its code:**
* step 1: clone source code
``` shell
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
* step 2: before installing LLaVA-NeXT, you need to modify `line 17` of [llava/model/multimodal_encoder/builder.py](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/multimodal_encoder/builder.py#L17).
``` python
# Before modification:
if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
# After modification:
if "clip" in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
```
* step 3: install LLaVA-NeXT from source:
``` shell
cd LLaVA-NeXT
pip install --upgrade pip # Enable PEP 660 support.
pip install -e ".[train]"
```
We recommend Option 1 because it is simpler.
</details>
**Load Model**
``` python
from llava.model.builder import load_pretrained_model
from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.mm_utils import (
    KeywordsStoppingCriteria,
    get_model_name_from_path,
    tokenizer_image_token,
    process_images
)
from llava.conversation import SeparatorStyle, conv_templates
from llava.eval.model_vqa import preprocess_qwen
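# Config overrides used by this checkpoint (spatial pooling and newline-token layout).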
overwrite_config = {}
overwrite_config["mm_spatial_pool_stride"] = 2
overwrite_config["mm_spatial_pool_mode"] = 'bilinear'
overwrite_config["mm_pooling_position"] = 'after'
overwrite_config["mm_newline_position"] = 'no_token'
model_path = "Inst-IT/LLaVA-Next-Inst-It-Qwen2-7B"
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=model_name,
    device_map="auto",
    torch_dtype='bfloat16',
    overwrite_config=overwrite_config,
    attn_implementation='sdpa')
```
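You can optionally confirm that the checkpoint was loaded with the expected precision and device placement:
``` python
# Optional: check the precision and device placement of the loaded weights.
first_param = next(model.parameters())
print(first_param.dtype, first_param.device)  # expected: torch.bfloat16 on a CUDA device
```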
**Image Inference**
<details>
<summary>Inference without SoMs</summary>
Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts. In this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
``` python
import torch
import requests
from PIL import Image
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]
question = "Describe this image."
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'qwen_1_5'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
</details>
<details>
<summary>Inference with SoMs</summary>
Our model performs more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
You can refer to the instances you are interested in by their IDs.
Compared to the previous inference code, the following code is unchanged except for the input image, which is visually prompted with Set-of-Marks.
Refer to [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image; a minimal overlay sketch is also provided after the code below.
``` python
import torch
import requests
from PIL import Image
img_url = "https://github.com/inst-it/inst-it/blob/main/assets/demo/image_som.jpg?raw=true"
image = Image.open(requests.get(img_url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config).bfloat16()
image_sizes = [image.size]
# You can use [id] to refer to the instances that you are interested in
question = "Describe [8] in detail."
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'qwen_1_5'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=image_tensor,
        attention_mask=attention_masks,
        modalities="image",
        image_sizes=image_sizes,
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
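If you just want to experiment without running the full SoM pipeline, below is a minimal sketch that overlays numeric instance IDs on an image with PIL. The bounding boxes, IDs, and file path are hypothetical placeholders; in practice they would come from a segmentation model such as SAM, as described in the [SoM](https://github.com/microsoft/SoM) repository.
``` python
from PIL import Image, ImageDraw

def overlay_instance_ids(image, boxes):
    """Draw each instance ID and its box on a copy of the image.

    `boxes` maps an instance ID to an (x1, y1, x2, y2) box; both are
    placeholders for the output of a real SoM pipeline.
    """
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for instance_id, (x1, y1, x2, y2) in boxes.items():
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), str(instance_id), fill="red")
    return marked

# Example with a made-up box on a plain (unmarked) image; replace with real SoM / SAM outputs.
plain_image = Image.open("your_image.jpg")  # placeholder path
marked_image = overlay_instance_ids(plain_image, {8: (50, 40, 220, 300)})
```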
</details>
**Video Inference**
For videos, we organize the frames into a list. You can use the format \<t\> to refer to a specific timestamp (e.g. \<1\>).
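The examples below download pre-extracted frames. If you are starting from a local video file instead, a minimal sketch for uniformly sampling frames into such a list could look like this (the file path and the number of frames are placeholders):
``` python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

video_frames = sample_frames("video.mp4", num_frames=8)  # "video.mp4" is a placeholder
```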
<details>
<summary>Inference without SoMs</summary>
Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts. In this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
``` python
import torch
import requests
from PIL import Image
frame_urls = [
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_1.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_2.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_3.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_4.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_5.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_6.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_7.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]
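# Choose one of the example questions below; the later assignment overrides the earlier one.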
question = "Describe the video." # overall video caption
question = "What happens at frame <1>?" # caption a specific moment
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'qwen_1_5'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
</details>
<details>
<summary>Inference with SoMs</summary>
Our model performs more fine-grained understanding when [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts are provided.
You can refer to the instances you are interested in by their IDs.
Compared to the previous inference code, the following code is unchanged except for the input video, which is visually prompted with Set-of-Marks.
Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video.
``` python
import torch
import requests
from PIL import Image
frame_urls = [
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_1.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_2.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_3.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_4.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_5.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_6.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_7.jpg?raw=true",
"https://github.com/inst-it/inst-it/blob/main/assets/demo/som_frame_8.jpg?raw=true"
]
video = [Image.open(requests.get(frame_url, stream=True).raw) for frame_url in frame_urls]
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda()
video = video.bfloat16()
videos = [video]
# You can use [id] to refer to the instances that you are interested in
question = "Is [3] visible at <1>?"
question = DEFAULT_IMAGE_TOKEN + "\n" + question
conv_template = 'qwen_1_5'
conv = conv_templates[conv_template].copy()
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = preprocess_qwen([{'from': 'human','value': question},{'from': 'gpt','value': None}], tokenizer, has_image=True).cuda()
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).long().cuda()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
with torch.inference_mode():
    output_ids = model.generate(
        inputs=input_ids,
        images=videos,
        attention_mask=attention_masks,
        modalities="video",
        use_cache=True,
        stopping_criteria=[stopping_criteria],
        max_new_tokens=4096
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```
</details>
## Contact
Feel free to contact us if you have any questions or suggestions:
- Email (Wujian Peng): [email protected]
- Email (Lingchen Meng): [email protected]
## Citation
``` bibtex
@article{peng2024inst,
title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2412.03565},
year={2024}
}
```