RuBERTExtSumGazeta

Model description

Model for extractive summarization based on rubert-base-cased

Intended uses & limitations

How to use

Colab: link

import razdel
from transformers import AutoTokenizer, BertForTokenClassification

model_name = "IlyaGusev/rubert_ext_sum_gazeta"

tokenizer = AutoTokenizer.from_pretrained(model_name)
sep_token = tokenizer.sep_token
sep_token_id = tokenizer.sep_token_id

model = BertForTokenClassification.from_pretrained(model_name)

article_text = "..."
sentences = [s.text for s in razdel.sentenize(article_text)]
article_text = sep_token.join(sentences)

inputs = tokenizer(
    [article_text],
    max_length=500,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
sep_mask = inputs["input_ids"][0] == sep_token_id

# Fix token_type_ids
current_token_type_id = 0 
for pos, input_id in enumerate(inputs["input_ids"][0]):
    inputs["token_type_ids"][0][pos] = current_token_type_id
    if input_id == sep_token_id:
        current_token_type_id = 1 - current_token_type_id

# Infer model
with torch.no_grad(): 
    outputs = model(**inputs) 
logits = outputs.logits[0, :, 1]

# Choose sentences 
logits = logits[sep_mask]
logits, indices = logits.sort(descending=True)
logits, indices = logits.cpu().tolist(), indices.cpu().tolist()
pairs = list(zip(logits, indices))
pairs = pairs[:3]
indices = list(sorted([idx for _, idx in pairs]))
summary = " ".join([sentences[idx] for idx in indices])
print(summary)

Limitations and bias

  • The model should work well with Gazeta.ru articles, but for any other agencies it can suffer from domain shift

Training data

Training procedure

TBD

Eval results

TBD

Evaluation: https://github.com/IlyaGusev/summarus/blob/master/evaluate.py

Flags: --language ru --tokenize-after --lower

Downloads last month
144
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The model authors have turned it off explicitly.

Dataset used to train IlyaGusev/rubert_ext_sum_gazeta