Document Question Answering Model - Kaleidoscope_small_v1
This model is a fine-tuned version of sberbank-ai/ruBert-base designed for the task of document question answering. It has been adapted specifically for extracting answers from a provided document context and fine-tuned on a custom JSON dataset containing context, question, and answer triples.
Key Features
- Objective: Extract answers from documents based on user questions.
- Base Model: sberbank-ai/ruBert-base (now published on the Hub as ai-forever/ruBert-base).
- Dataset: A custom JSON file with fields: context, question, and answer.
- Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments.
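A minimal sketch of this concatenation, reusing the example pair from the answering section below; pair encoding with a BERT tokenizer packs the inputs as [CLS] question [SEP] context [SEP]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")

# Passing (question, context) as a pair yields one packed sequence:
# [CLS] question tokens [SEP] context tokens [SEP]
enc = tokenizer(
    "Кто разработал теорию относительности?",
    "Альберт Эйнштейн разработал теорию относительности.",
)
print(tokenizer.decode(enc["input_ids"]))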
Training Settings:
- Number of epochs: 20.
- Batch size: 4 per device.
- Warmup: 10% of total steps (warmup ratio 0.1).
- FP16 training enabled (if CUDA is available).
- Hardware: Training was performed on a single RTX 3070.
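For reference, here is how these settings map onto the standard transformers TrainingArguments. This is a sketch, not the published script: output_dir and the per-epoch evaluation/saving strategy are assumptions.

import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kaleidoscope_small_v1",  # assumed name, not from the card
    num_train_epochs=20,
    per_device_train_batch_size=4,
    warmup_ratio=0.1,                    # 10% of total steps
    fp16=torch.cuda.is_available(),      # mixed precision only on CUDA
    evaluation_strategy="epoch",         # assumption; card does not state the interval
    save_strategy="epoch",
    load_best_model_at_end=True,         # needed for early stopping on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# Trainer uses AdamW by default, matching the optimizer described below.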
Description
The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process include:
- Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
- Feature Preparation: The script tokenizes the document and question with a sliding-window approach to handle long texts (see the sketch after this list).
- Training Process: Uses mixed-precision training and the AdamW optimizer for faster, more stable optimization.
- Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and employs early stopping based on validation loss.
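A hypothetical sketch of that loader and the sliding-window feature preparation, assuming a dataset.json file with the three fields and a window stride of 128 tokens (neither the filename nor the stride is stated in the card):

import json
from transformers import AutoTokenizer

# Each entry holds the three fields described above:
# {"context": ..., "question": ..., "answer": ...}
with open("dataset.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/ruBert-base")
features = tokenizer(
    [s["question"] for s in samples],
    [s["context"] for s in samples],
    truncation="only_second",        # truncate only the context, never the question
    max_length=384,                  # matches the inference example below
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,  # one feature per window of a long document
    return_offsets_mapping=True,     # map tokens back to character positions
)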
This model is well suited to interactive document question answering, making it a useful tool for applications such as customer support, document search, and automated Q&A systems.
While primarily focused on Russian texts, the model also accepts English inputs, although English support has not been tested.
Example Usage
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model.to(device)
model.eval()

# Read the document that will serve as the answering context.
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # Pack the question and context into one sequence, truncating the
    # pair to the model's 384-token window.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # The predicted answer span runs from the highest-scoring start
    # position to the highest-scoring end position.
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
Example of answering
RU Context:
Альберт Эйнштейн разработал теорию относительности.
(Albert Einstein developed the theory of relativity.)
Question:
Кто разработал теорию относительности?
(Who developed the theory of relativity?)
Answer:
альберт эинштеин
(albert einstein)
EN Context:
I had a red car.
Question:
What kind of car did I have?
Answer:
a red car
Fine-tuned by LaciaStudio | LaciaAI