---
license: cc-by-nc-4.0
language:
- ru
- en
pipeline_tag: document-question-answering
tags:
- DocumentQA
- QuestionAnswering
- NLP
- DeepLearning
- Transformers
- Multimodal
- HuggingFace
- ruBert
- MachineLearning
- DeepQA
- AIForDocs
- Docs
- NeuralNetworks
- torch
- pytorch
library_name: transformers
metrics:
- accuracy
- f1
- recall
- exact_match
- precision
base_model:
- ai-forever/ruBert-base
---

![Official Kaleidoscope Logo](https://huggingface.co/LaciaStudio/Kaleidoscope_small_v1/resolve/main/Kaleidoscope.png)

# Document Question Answering Model - Kaleidoscope_small_v1

This model is a fine-tuned version of sberbank-ai/ruBert-base designed for document question answering. It is adapted specifically to extract answers from a provided document context and was fine-tuned on a custom JSON dataset containing context, question, and answer triples.

# Key Features

* Objective: Extract answers from documents based on user questions.
* Base Model: sberbank-ai/ruBert-base.
* Dataset: A custom JSON file with the fields context, question, and answer.
* Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments.

# Training Settings

* Number of epochs: 20.
* Batch size: 4 per device.
* Warmup: 0.1 of total training steps.
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on a single RTX 3070.

# Description

The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process include:

* Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
* Feature Preparation: The script tokenizes the document and question with a sliding-window approach to handle long texts.
* Training Process: Mixed precision training and the AdamW optimizer are used to improve optimization (a configuration sketch is shown below).
* Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and employs early stopping based on validation loss.

This model is well suited for interactive document question answering tasks, making it a powerful tool for applications such as customer support, document search, and automated Q&A systems.

While primarily focused on Russian texts, the model also supports English-language inputs, although English support has not been tested.
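For reference, below is a minimal sketch of what a fine-tuning setup consistent with the settings above could look like, using the Transformers `Trainer` (which defaults to the AdamW optimizer). This is not the original training script: the dataset path `dataset.json`, the validation split size, the `stride` value, the output directory, and the early-stopping patience are illustrative assumptions, and the step that converts answer strings into `start_positions`/`end_positions` labels is omitted.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")
model = AutoModelForQuestionAnswering.from_pretrained("ai-forever/ruBert-base")

def prepare_features(examples):
    # Concatenate question and context; the sliding window (stride +
    # return_overflowing_tokens) splits long documents into overlapping
    # chunks instead of truncating them.
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # ...here the answer strings would be mapped to start_positions /
    # end_positions token labels, and helper columns dropped...
    return tokenized

# Assumed: a JSON file with "context", "question" and "answer" fields,
# split into training and validation parts.
raw = load_dataset("json", data_files="dataset.json")["train"].train_test_split(test_size=0.1)
train_dataset = raw["train"].map(prepare_features, batched=True, remove_columns=raw["train"].column_names)
eval_dataset = raw["test"].map(prepare_features, batched=True, remove_columns=raw["test"].column_names)

training_args = TrainingArguments(
    output_dir="kaleidoscope_small_v1",   # assumed output directory
    num_train_epochs=20,                  # 20 epochs
    per_device_train_batch_size=4,        # batch size of 4 per device
    warmup_ratio=0.1,                     # warmup over 0.1 of total steps
    fp16=torch.cuda.is_available(),       # mixed precision when CUDA is available
    eval_strategy="epoch",                # named evaluation_strategy in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",    # early stopping tracks validation loss
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # assumed patience
)
trainer.train()
```

The `stride` and `return_overflowing_tokens` arguments implement the sliding-window handling of long documents described above, and `EarlyStoppingCallback` together with `metric_for_best_model="eval_loss"` mirrors the early stopping on validation loss.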
# Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model.to(device)

# Read the document that will serve as the context
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # The question and the context are concatenated, as during training
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    # Take the most likely start and end positions of the answer span
    start_index = torch.argmax(start_logits)
    end_index = torch.argmax(end_logits)
    # Decode the answer span back into text
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
```

# Example Answers

**RU**

*Context:*
```
Альберт Эйнштейн разработал теорию относительности.
```

*Question:*
```
Кто разработал теорию относительности?
```

*Answer:*
```
альберт эинштеин
```

**EN**

*Context:*
```
I had a red car.
```

*Question:*
```
What kind of car did I have?
```

*Answer:*
```
a red car
```

**Finetuned by LaciaStudio | LaciaAI**