---
license: cc-by-nc-4.0
language:
- ru
- en
pipeline_tag: document-question-answering
tags:
- DocumentQA
- QuestionAnswering
- NLP
- DeepLearning
- Transformers
- Multimodal
- HuggingFace
- ruBert
- MachineLearning
- DeepQA
- AIForDocs
- Docs
- NeuralNetworks
- torch
- pytorch
library_name: transformers
metrics:
- accuracy
- f1
- recall
- exact_match
- precision
base_model:
- ai-forever/ruBert-base
---

![Official Kaleidoscope Logo](https://huggingface.co/LaciaStudio/Kaleidoscope_small_v1/resolve/main/Kaleidoscope.png)

# Document Question Answering Model - Kaleidoscope_small_v1

This model is a fine-tuned version of sberbank-ai/ruBert-base designed for document question answering. It is adapted specifically to extract answers from a provided document context and was fine-tuned on a custom JSON dataset containing context, question, and answer triples.

# Key Features

* Objective: Extract answers from documents based on user questions.
* Base Model: sberbank-ai/ruBert-base.
* Dataset: A custom JSON file with the fields context, question, and answer.
* Preprocessing: The input is formed by concatenating the question and the document context, guiding the model to focus on the relevant segments.

# Training Settings

* Number of epochs: 20.
* Batch size: 4 per device.
* Warmup: 0.1 of total training steps.
* FP16 training enabled (if CUDA is available).
* Hardware: Training was performed on a single RTX 3070.

# Description

The model was fine-tuned using the Transformers library with a custom training pipeline. Key aspects of the training process include:

* Custom Dataset: A loader reads a JSON file containing context, question, and answer triples.
* Feature Preparation: The script tokenizes the document and question with a sliding-window approach to handle long texts.
* Training Process: Mixed precision training and the AdamW optimizer are used to improve optimization (a configuration sketch is shown below).
* Evaluation and Checkpointing: The training script evaluates model performance on a validation set, saves checkpoints, and employs early stopping based on validation loss.

This model is well suited for interactive document question answering tasks, making it a powerful tool for applications such as customer support, document search, and automated Q&A systems.

While primarily focused on Russian texts, the model also supports English-language inputs, although English support has not been tested.
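For reference, below is a minimal sketch of what a fine-tuning setup consistent with the settings above could look like, using the Transformers `Trainer` (which defaults to the AdamW optimizer). This is not the original training script: the dataset path `dataset.json`, the validation split size, the `stride` value, the output directory, and the early-stopping patience are illustrative assumptions, and the step that converts answer strings into `start_positions`/`end_positions` labels is omitted.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruBert-base")
model = AutoModelForQuestionAnswering.from_pretrained("ai-forever/ruBert-base")

def prepare_features(examples):
    # Concatenate question and context; the sliding window (stride +
    # return_overflowing_tokens) splits long documents into overlapping
    # chunks instead of truncating them.
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # ...here the answer strings would be mapped to start_positions /
    # end_positions token labels, and helper columns dropped...
    return tokenized

# Assumed: a JSON file with "context", "question" and "answer" fields,
# split into training and validation parts.
raw = load_dataset("json", data_files="dataset.json")["train"].train_test_split(test_size=0.1)
train_dataset = raw["train"].map(prepare_features, batched=True, remove_columns=raw["train"].column_names)
eval_dataset = raw["test"].map(prepare_features, batched=True, remove_columns=raw["test"].column_names)

training_args = TrainingArguments(
    output_dir="kaleidoscope_small_v1",   # assumed output directory
    num_train_epochs=20,                  # 20 epochs
    per_device_train_batch_size=4,        # batch size of 4 per device
    warmup_ratio=0.1,                     # warmup over 0.1 of total steps
    fp16=torch.cuda.is_available(),       # mixed precision when CUDA is available
    eval_strategy="epoch",                # named evaluation_strategy in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",    # early stopping tracks validation loss
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # assumed patience
)
trainer.train()
```

The `stride` and `return_overflowing_tokens` arguments implement the sliding-window handling of long documents described above, and `EarlyStoppingCallback` together with `metric_for_best_model="eval_loss"` mirrors the early stopping on validation loss.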
# Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model = AutoModelForQuestionAnswering.from_pretrained("LaciaStudio/Kaleidoscope_small_v1")
model.to(device)

# Read the document that will serve as the context
file_path = input("Enter document path: ")
with open(file_path, "r", encoding="utf-8") as f:
    context = f.read()

while True:
    question = input("Enter question (or 'exit' to quit): ")
    if question.lower() == "exit":
        break
    # The question and the context are concatenated, as during training
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=384)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    # Take the most likely start and end positions of the answer span
    start_index = torch.argmax(start_logits)
    end_index = torch.argmax(end_logits)
    # Decode the answer span back into text
    answer_tokens = inputs["input_ids"][0][start_index:end_index + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    print("Answer:", answer)
```

# Example Answers

**RU**

*Context:*
```
Альберт Эйнштейн разработал теорию относительности.
```

*Question:*
```
Кто разработал теорию относительности?
```

*Answer:*
```
альберт эинштеин
```

**EN**

*Context:*
```
I had a red car.
```

*Question:*
```
What kind of car did I have?
```

*Answer:*
```
a red car
```

**Finetuned by LaciaStudio | LaciaAI**