metadata
tasks:
  - visual-question-answering
widgets:
  - task: visual-question-answering
    inputs:
      - type: image
        name: image
        title: 图片
        validator:
          max_size: 10M
          max_resolution: 5000*5000
      - type: text
        name: question
        title: 问题
    examples:
      - name: 1
        title: 示例1
        inputs:
          - name: image
            data: >-
              https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa_5.jpg
          - name: question
            data: what name is this guy?
      - name: 2
        title: 示例2
        inputs:
          - name: image
            data: >-
              https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa_4.jpg
          - name: question
            data: what is the name of the planet?
      - name: 3
        title: 示例3
        inputs:
          - name: image
            data: >-
              https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa_1.jpg
          - name: question
            data: what airline owns this plane?
      - name: 4
        title: 示例4
        inputs:
          - name: image
            data: >-
              http://xingchen-data.oss-cn-zhangjiakou.aliyuncs.com/maas/visual-question-answering/visual_question_answering.png
          - name: question
            data: what is grown on the plant?
      - name: 5
        title: 示例5
        inputs:
          - name: image
            data: >-
              https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa_3.jpg
          - name: question
            data: What do you call the devices on top of the pole?
      - name: 6
        title: 示例6
        inputs:
          - name: image
            data: >-
              https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa_2.jpg
          - name: question
            data: what does this machine do?
    inferencespec:
      cpu: 4
      memory: 12000
      gpu: 1
      gpu_memory: 16000
model-type:
  - mplug
domain:
  - multi-modal
frameworks:
  - pytorch
backbone:
  - transformer
containers: null
metrics:
  - accuracy
license: apache-2.0
finetune-support: true
language:
  - en
tags:
  - transformer
  - Alibaba
  - volume:abs/2205.12005
datasets:
  - CC
  - MSCOCO
  - VG
  - SBU
  - VQA

Introduction to Visual Question Answering

Visual question answering (VQA): given a question and an image, the model produces an answer based on the information in the image. This requires multi-modal understanding, and most mainstream approaches today are built on multi-modal pre-training. The best-known visual question answering datasets include VQA and GQA.

Model Description

This model is mPLUG fine-tuned on the English VQA dataset for the downstream visual question answering task. mPLUG is a multi-modal foundation model that unifies understanding and generation; it introduces an efficient cross-modal fusion framework based on skip-connections. On VQA, mPLUG supports open-ended answer generation and achieves state-of-the-art results for open-ended generation. For details, see: mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
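
The cross-modal skip-connected fusion idea can be pictured roughly as follows. This is purely a conceptual sketch and not the actual mPLUG implementation: it assumes a stack of text layers in which cross-attention to the visual features is applied only in every few layers, while the remaining layers skip fusion entirely; all module and parameter names here (SkipConnectedFusionBlock, fuse_every, etc.) are hypothetical.

import torch
import torch.nn as nn

class SkipConnectedFusionBlock(nn.Module):
    """Conceptual sketch only: every block runs text self-attention, but
    cross-attention to visual features happens only in designated fusion
    blocks; the others skip it, keeping fusion sparse and cheap."""

    def __init__(self, dim: int, num_heads: int, fuse: bool):
        super().__init__()
        self.fuse = fuse
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True) if fuse else None
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, image):
        # text: (B, L_t, dim) token features; image: (B, L_v, dim) visual features
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        if self.fuse:  # cross-modal fusion only in selected blocks
            text = text + self.cross_attn(self.norm2(text), image, image)[0]
        return text + self.ffn(self.norm3(text))

class SkipConnectedFusion(nn.Module):
    """Stack of blocks where only every `fuse_every`-th block fuses the modalities."""

    def __init__(self, dim=768, num_heads=12, num_layers=6, fuse_every=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            SkipConnectedFusionBlock(dim, num_heads, fuse=(i % fuse_every == fuse_every - 1))
            for i in range(num_layers))

    def forward(self, text, image):
        for blk in self.blocks:
            text = blk(text, image)
        return text

# Toy usage: fused = SkipConnectedFusion()(torch.randn(2, 16, 768), torch.randn(2, 49, 768))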

[Figure: mplug]

The model's generated results are shown in the figure below:

[Figure: vqa_case]

Expected Model Usage and Scope of Application

This model is mainly intended to generate an answer given a question and the corresponding image. Users can experiment with various inputs on their own. See the code example below for how to call the model.

How to Use

After installing MaaS-lib, you can use the visual-question-answering capability. (Note: running the model requires about 9 GB of memory.)
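
A minimal install sketch, assuming the standard pip distribution of the ModelScope library with its multi-modal extras:

pip install "modelscope[multi-modal]"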

Code Example

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

model_id = 'damo/mplug_visual-question-answering_coco_large_en'
# The input is a dict with an image (URL or local path) and a question in English.
input_vqa = {
    'image': 'https://alice-open.oss-cn-zhangjiakou.aliyuncs.com/mPLUG/image_mplug_vqa.jpg',
    'question': 'What is the woman doing?',
}

# Build a visual question answering pipeline backed by the mPLUG model.
pipeline_vqa = pipeline(Tasks.visual_question_answering, model=model_id)
print(pipeline_vqa(input_vqa))
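
The pipeline returns a dictionary of outputs. The following is a minimal sketch of reading the answer back, assuming the answer string is exposed under the standard OutputKeys.TEXT ('text') key; inspect the printed result above if your version differs:

from modelscope.outputs import OutputKeys

result = pipeline_vqa(input_vqa)
# Assumed key: the answer string is expected under OutputKeys.TEXT ('text').
print(result[OutputKeys.TEXT])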

Model Limitations and Possible Bias

The model is trained on specific datasets and may therefore exhibit some bias. Users should evaluate it on their own data before deciding how to use it.

Training Data

The training dataset for this model is VQA, which contains 83k images. The data is available for download.

Model Training Pipeline

Preprocessing

The training dataset must contain image, question, and answer fields. The following is a preprocessing example used in the training code, based on a small slice of the coco_caption dataset:

from modelscope.msdatasets import MsDataset

datadict = MsDataset.load('coco_captions_small_slice')
# Attach a fixed question to every record and rename the raw columns
# to the `image` / `answer` fields expected by the trainer.
train_dataset = MsDataset(datadict['train'].to_hf_dataset().map(
    lambda _: {
        'question': 'what the picture describes?'
    }).rename_column('image:FILE',
                     'image').rename_column('answer:Value', 'answer'))
test_dataset = MsDataset(datadict['test'].to_hf_dataset().map(
    lambda _: {
        'question': 'what the picture describes?'
    }).rename_column('image:FILE',
                     'image').rename_column('answer:Value', 'answer'))
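
After this mapping, each record carries the three fields the trainer needs. A hypothetical example of one preprocessed record (all values are illustrative only):

sample_record = {
    'image': '/path/to/train2014/000000123456.jpg',  # hypothetical local image path
    'question': 'what the picture describes?',        # fixed question added by the map above
    'answer': 'a man riding a horse on a beach',      # hypothetical caption used as the answer
}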

Training

The following is a code example of training with the trainer in modelscope:

from modelscope.metainfo import Trainers
from modelscope.trainers import EpochBasedTrainer, build_trainer

kwargs = dict(
    model='damo/mplug_visual-question-answering_coco_large_en',
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    max_epochs=2,           # adjust to your training budget
    work_dir='./work_dir')  # checkpoints and logs are written here

trainer: EpochBasedTrainer = build_trainer(
    name=Trainers.nlp_base_trainer, default_args=kwargs)
trainer.train()
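
After training finishes, the held-out split can be scored with the same trainer. A minimal sketch, assuming the trainer's standard evaluate() method:

# Evaluate on eval_dataset; returns a dict of metrics (e.g. accuracy, as listed in the metadata above).
metrics = trainer.evaluate()
print(metrics)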

Evaluation and Results

On the VQA dataset, mPLUG achieves SOTA among models of comparable size and pre-training data, and ranks near the top of the VQA leaderboard.

[Figure: mplug_vqa_score]

[Figure: vqa_leaderboard]

Related Papers and Citation

If our model is helpful to you, please cite our paper:

@article{li2022mplug,
      title={mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections},
      author={Li, Chenliang and Xu, Haiyang and Tian, Junfeng and Wang, Wei and Yan, Ming and Bi, Bin and Ye, Jiabo and Chen, Hehong and Xu, Guohai and Cao, Zheng and Zhang, Ji and Huang, Songfang and Huang, Fei and Zhou, Jingren and Si, Luo},
      year={2022},
      journal={arXiv preprint arXiv:2205.12005}
}