
bilingual-gpt-neox-4b-minigpt4


Overview

This repository provides an English-Japanese bilingual multimodal conversational model in the style of MiniGPT-4, built by combining a 3.8-billion-parameter GPT-NeoX language model with BLIP-2.

The model is based on rinna/bilingual-gpt-neox-4b and BLIP-2.
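
For reference, the language-model side of this combination can be loaded on its own with the transformers library. The snippet below is only a minimal sketch of loading the base rinna/bilingual-gpt-neox-4b model; the use_fast=False flag follows the base model's documentation, the float16 setting is illustrative, and none of this is specific to the multimodal checkpoint described here.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load only the base bilingual LLM; the vision side (BLIP-2 plus a linear layer)
# is added by the MiniGPT-4 code described in the sections below.
llm_tokenizer = AutoTokenizer.from_pretrained("rinna/bilingual-gpt-neox-4b", use_fast=False)
llm = AutoModelForCausalLM.from_pretrained(
    "rinna/bilingual-gpt-neox-4b", torch_dtype=torch.float16
)

if torch.cuda.is_available():
    llm = llm.to("cuda")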


I/O Format

A special format has been adopted to construct inputs.

  • An input prompt is formatted as a conversation between ユーザー and システム.
  • Each input utterance consists of (1) its speaker ("ユーザー" or "システム"), (2) a colon (":"), (3) a whitespace (" "), and (4) utterance text (e.g. "猫はどんな体勢をしていますか?").
  • An utterance including an image is formatted as (1) its speaker ("ユーザー"), (2) a colon (":"), (3) a whitespace (" "), (4) a placeholder for the image ("<Img><ImageHere></Img>"), (5) another whitespace (" "), and (6) utterance text (e.g. "What can you see?").
    • The placeholder (<ImageHere>) is automatically replaced with the embedding of the input image in the function get_context_emb (a sketch of this step follows the example below).
  • The input prompt should end with "システム: " to prompt the model to generate a response.
  • All the utterances in the input prompt should be separated by a newline \n.

The following is an example of constructing an input prompt from a conversation.

prompt = [
    {
        "speaker": "ユーザー",
        "text": "<Img><ImageHere></Img> What can you see?"
    },
    {
        "speaker": "システム",
        "text": "a cat on a table with a laptop"
    },
    {
        "speaker": "ユーザー",
        "text": "猫はどんな体勢をしていますか?"
    },
]
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
prompt = "\n".join(prompt)
prompt = (
    prompt
    + "\n"
    + "システム: "
)
print(prompt)
"""
ユーザー: <Img><ImageHere></Img> What can you see?
システム: a cat on a table with a laptop
ユーザー: 猫はどんな体勢をしていますか?
システム: 
"""

How to use the model

1. Download dependencies

  • The BLIP-2 implementation included in MiniGPT-4 is used for inference.
  • customized_mini_gpt4.py is a script that replaces the LLM, swapping the LLaMA architecture for the GPT-NeoX one.
  • checkpoint.pth contains the fine-tuned weights of the linear layer (file size: 177 MB).
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd ./MiniGPT-4
git checkout 22d8888 # latest version as of July 31, 2023.
wget https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/customized_mini_gpt4.py
wget https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/checkpoint.pth
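
Optionally, you can confirm that both files are in place before running inference. This is just a quick sanity check, not part of the original instructions.

import os

# Confirm the customized script and the fine-tuned linear-layer checkpoint
# (about 177 MB) are present in the MiniGPT-4 directory.
for path in ["./customized_mini_gpt4.py", "./checkpoint.pth"]:
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")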

2. Inference

Please run this script in the MiniGPT-4 directory.

import torch
import requests
from PIL import Image
from minigpt4.processors.blip_processors import Blip2ImageEvalProcessor
from customized_mini_gpt4 import CustomizedMiniGPT4

ckpt_path = "./checkpoint.pth"

model = CustomizedMiniGPT4(gpt_neox_model="rinna/bilingual-gpt-neox-4b")
tokenizer = model.gpt_neox_tokenizer

if torch.cuda.is_available():
    model = model.to("cuda")

if ckpt_path is not None:
    print("Load BLIP2-LLM Checkpoint: {}".format(ckpt_path))
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt['model'], strict=False)

vis_processor = Blip2ImageEvalProcessor()

image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
image = vis_processor(raw_image).unsqueeze(0).to(model.device)
image_emb = model.encode_img(image)

# "prompt" is the string constructed in the I/O Format section above.
embs = model.get_context_emb(prompt, [image_emb])

output_ids = model.gpt_neox_model.generate(
    inputs_embeds=embs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.85,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
print(output)
"""横になっています。"""

How to cite

@misc{rinna-bilingual-gpt-neox-4b-minigpt4,
    title = {rinna/bilingual-gpt-neox-4b-minigpt4},
    author = {Mitsuda, Koh and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

Acknowledgement

License

The MIT license
