Model Card for scholawrite-bert-classifier

Model Details

Model Description

This model is referred to as BERT-SW-CLF in the paper. It is fine-tuned from bert-base-uncased (Hugging Face) on the train split of the ScholaWrite dataset. The sole purpose of this model is to predict the next writing intention given scholarly writing in LaTeX.

  • Developed by: *Linghe Wang, *Minhwa Lee, Ross Volkov, Luan Chau, Dongyeop Kang
  • Language: English
  • Finetuned from model: bert-base-uncased

Uses

Direct Use

The model is intended to be used for next writing intention prediction in LaTeX paper drafts. It takes the 'before' text wrapped in special tokens as input and outputs the next writing intention, which is one of 15 predefined labels.
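
For illustration, an input string is built by wrapping the 'before' draft text in the special tokens introduced during fine-tuning (a minimal sketch; the full pipeline appears under "How to Get Started with the Model"):

# Illustrative input format: the "before" text wrapped in the special tokens.
before_text = "In this section, we describe our data collection platform."
model_input = "<INPUT>" + "<BT>" + before_text + "</BT>" + "</INPUT>"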

Out-of-Scope Use

The model is fine-tuned only for next writing intention prediction and was inferenced in a closed environment. Its main goal is to examine the usefulness of our dataset. It is suitable for academic use, but not for production, general public use, or consumer-oriented services. In addition, using this model on tasks other than next intention prediction in LaTeX paper drafts may not work well.

Bias and Limitations

The bias and limitations of this model mainly come from the dataset (ScholaWrite) it was fine-tuned on.

First, the ScholaWrite dataset is currently limited to the computer science domain, as LaTeX is predominantly used in computer science journals and conferences. This domain-specific focus may restrict the model's generalizability to other scientific disciplines. Future work could address this limitation by collecting keystroke data from a broader range of fields with diverse writing conventions and tools, such as the humanities or biological sciences. For example, students in the humanities usually write book-length papers and integrate more sources, which could affect cognitive complexities.

Second, all participants were early-career researchers (e.g., PhD students) at an R1 university in the United States, which means the model may not learn the professional writing behaviors and cognitive processes of experts. Expanding the dataset to include senior researchers, such as post-doctoral fellows and professors, could offer valuable insights into how writing strategies and revision behaviors evolve with research experience and expertise.

Third, the dataset is exclusive to English-language writing, which restricts the model's ability to predict the next writing intention in multilingual or non-English contexts. Expanding to multilingual settings could reveal unique cognitive and linguistic insights into writing across languages.

How to Get Started with the Model

import os
from dotenv import load_dotenv

import torch
from transformers import BertTokenizer, BertForSequenceClassification
from huggingface_hub import login

load_dotenv()
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)

TOTAL_CLASSES = 15

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens("<INPUT>")  # start input
tokenizer.add_tokens("</INPUT>") # end input
tokenizer.add_tokens("<BT>")     # before text
tokenizer.add_tokens("</BT>")    # before text
tokenizer.add_tokens("<PWA>")    # start previous writing action
tokenizer.add_tokens("</PWA>")   # end previous writing action

model = BertForSequenceClassification.from_pretrained('minnesotanlp/scholawrite-bert-classifier', num_labels=TOTAL_CLASSES)

# Wrap the "before" text in the special tokens expected by the classifier.
before_text = "sample before text"
text = "<INPUT>" + "<BT>" + before_text + "</BT>" + "</INPUT>"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**inputs).logits.argmax(1)
print("class:", pred.item())

Fine-tuning Details

Fine-tuning Data

This model is fine-tuned on the train split of the minnesotanlp/scholawrite dataset, which consists of keystroke logs of an end-to-end scholarly writing process with thorough annotations of the cognitive writing intention behind each keystroke. No additional data pre-processing or filtering was performed on the dataset.
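
A minimal sketch of loading this split with the datasets library; column names and contents should be checked against the dataset card:

from datasets import load_dataset

# Load the train split of the ScholaWrite keystroke dataset used for fine-tuning.
train_split = load_dataset("minnesotanlp/scholawrite", split="train")
print(train_split)      # inspect the available columns
print(train_split[0])   # one annotated keystroke record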

Fine-tuning Procedure

The model was fine-tuned by passing the before_text section of a prompt as the input and using the annotated intention as the ground truth. The model outputs an integer class id corresponding to one of the 15 intention labels.
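
As an illustration, a training instance might be constructed as follows; the intention_to_id mapping shown is a two-label stand-in for the full 15-label scheme, and the column names before_text and intention are assumptions based on the description above, not verified against the dataset schema.

from transformers import BertTokenizer

# Tokenizer with the same special tokens as in the getting-started snippet above.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["<INPUT>", "</INPUT>", "<BT>", "</BT>", "<PWA>", "</PWA>"])

# Illustrative label mapping; the real mapping covers all 15 ScholaWrite intentions.
intention_to_id = {"Text Production": 0, "Clarity": 1}

def encode_example(example):
    # Wrap the "before" text in the special tokens and attach an integer label.
    text = "<INPUT>" + "<BT>" + example["before_text"] + "</BT>" + "</INPUT>"
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=512)
    enc["labels"] = intention_to_id[example["intention"]]
    return enc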

Fine-tuning Hyperparameters

  • Fine-tuning regime: fp32
  • learning_rate: 2e-5
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 8
  • num_train_epochs: 10
  • weight_decay: 0.01
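
Below is a minimal sketch of how these values map onto a standard transformers Trainer configuration; the output directory, evaluation dataset, and variable names (base_model, encoded_train, encoded_eval) are illustrative assumptions rather than the authors' actual training script.

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Fresh base model with room for the added special tokens.
base_model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=15)
base_model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="scholawrite-bert-classifier",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=encoded_train,  # tokenized train split (see the preprocessing sketch above)
    eval_dataset=encoded_eval,    # tokenized held-out split
)
trainer.train()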

Machine Specs

  • Hardware: 2x NVIDIA RTX A6000
  • Hours used: 3.5 hrs
  • Compute Region: Minnesota

Testing Procedure

Testing Data

Test split of the minnesotanlp/scholawrite dataset.

Metrics

The data has class imbalance in both the training and testing splits, so we use the weighted F1 score to measure performance.
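
For reference, the weighted F1 score can be computed with scikit-learn as sketched below; the arrays are placeholders rather than actual model predictions.

from sklearn.metrics import f1_score

# y_true / y_pred: integer intention ids for the test split (placeholders here).
y_true = [0, 3, 3, 7, 1]
y_pred = [0, 3, 2, 7, 1]

# "weighted" averages per-class F1 scores by class support,
# which accounts for the class imbalance in the dataset.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
print(f"weighted F1: {weighted_f1:.2f}")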

Results

        BERT    RoBERTa    LLama-8B-Instruct    GPT-4o
Base    0.04    0.02       0.12                 0.08
+ SW    0.64    0.64       0.13                 -

Summary

The table above presents the weighted F1 scores for predicting writing intentions across baseline and fine-tuned models. All models fine-tuned on ScholaWrite show improved performance compared to their baselines. BERT and RoBERTa achieved the largest improvement, while LLama-8B-Instruct showed a modest improvement after fine-tuning. These results demonstrate the effectiveness of our ScholaWrite dataset for aligning language models with writers' intentions.

BibTeX

@misc{wang2025scholawritedatasetendtoendscholarly,
      title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
      author={Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
      year={2025},
      eprint={2502.02904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02904},
      }