Document Question Answering Model - Kaleidoscope_small_v1
This model is a fine-tuned version of sberbank-ai/ruBert-base designed for the task of document question answering. It has been adapted specifically for extracting answers from a provided document context and fine-tuned on a custom JSON dataset containing context, question, and answer triples.
Key Features
- Objective: Extract answers from documents based on user questions.
- Base Model: sberbank-ai/ruBert-base (now published on the Hub as ai-forever/ruBert-base).
- Dataset: A custom JSON file with fields: context, question, and answer.
- Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments.
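A minimal sketch of this concatenation, reusing the example pair from the answering section below; pair encoding with a BERT tokenizer packs the inputs as [CLS] question [SEP] context [SEP]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")

# Passing (question, context) as a pair yields one packed sequence:
# [CLS] question tokens [SEP] context tokens [SEP]
enc = tokenizer(
    "Кто разработал теорию относительности?",
    "Альберт Эйнштейн разработал теорию относительности.",
)
print(tokenizer.decode(enc["input_ids"]))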
Training Settings:
- Number of epochs: 20.
- Batch size: 4 per device.
- Warmup: 10% of total steps (warmup ratio 0.1).
- FP16 training enabled (if CUDA is available).
- Hardware: Training was performed on a single RTX 3070.
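For reference, here is how these settings map onto the standard transformers TrainingArguments. This is a sketch, not the published script: output_dir and the per-epoch evaluation/saving strategy are assumptions.

import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kaleidoscope_small_v1",  # assumed name, not from the card
    num_train_epochs=20,
    per_device_train_batch_size=4,
    warmup_ratio=0.1,                    # 10% of total steps
    fp16=torch.cuda.is_available(),      # mixed precision only on CUDA
    evaluation_strategy="epoch",         # assumption; card does not state the interval
    save_strategy="epoch",
    load_best_model_at_end=True,         # needed for early stopping on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
# Trainer uses AdamW by default, matching the optimizer described below.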
Description
The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process include:
- Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
- Feature Preparation: The script tokenizes the document and question with a sliding-window approach to handle long texts (see the sketch after this list).
- Training Process: Uses mixed-precision training and the AdamW optimizer for faster, more stable optimization.
- Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and employs early stopping based on validation loss.
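A hypothetical sketch of that loader and the sliding-window feature preparation, assuming a dataset.json file with the three fields and a window stride of 128 tokens (neither the filename nor the stride is stated in the card):

import json
from transformers import AutoTokenizer

# Each entry holds the three fields described above:
# {"context": ..., "question": ..., "answer": ...}
with open("dataset.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

tokenizer = AutoTokenizer.from_pretrained("sberbank-ai/ruBert-base")
features = tokenizer(
    [s["question"] for s in samples],
    [s["context"] for s in samples],
    truncation="only_second",        # truncate only the context, never the question
    max_length=384,                  # matches the inference example below
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,  # one feature per window of a long document
    return_offsets_mapping=True,     # map tokens back to character positions
)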
This model is well suited to interactive document question answering, making it a useful tool for applications such as customer support, document search, and automated Q&A systems.
While primarily focused on Russian texts, the model also accepts English inputs, although English support has not been tested.
Example Usage
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model.to(device)
model.eval()

# Read the document that will serve as the answering context.
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # Pack the question and context into one sequence, truncating the
    # pair to the model's 384-token window.
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # The predicted answer span runs from the highest-scoring start
    # position to the highest-scoring end position.
    start_index = torch.argmax(outputs.start_logits)
    end_index = torch.argmax(outputs.end_logits)
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
Example of answering
RU Context:
Альберт Эйнштейн разработал теорию относительности.
(Albert Einstein developed the theory of relativity.)
Question:
Кто разработал теорию относительности?
(Who developed the theory of relativity?)
Answer:
альберт эинштеин
(albert einstein)
EN Context:
I had a red car.
Question:
What kind of car did I have?
Answer:
a red car
Fine-tuned by LaciaStudio | LaciaAI