---
base_model: google-bert/bert-base-multilingual-uncased
library_name: transformers
license: apache-2.0
tags:
- autotrain
- text-classification
---
# 📚 Institutional Books Topic Classifier
This model was trained as part of the analysis and post-processing work performed in preparation for the release of the [Institutional Books 1.0 dataset](https://huggingface.co/collections/instdin/institutional-books-68366258bfb38364238477cf) by the Institutional Data Initiative.
We used this text classifier to assign each volume a topic derived from the first level of the [Library of Congress Classification Outline](https://www.loc.gov/catdir/cpso/lcco/).
Complete experimental setup and results are available in our [technical report](https://arxiv.org/abs/2506.08300) (Section 4.5).
**Code:** https://github.com/instdin/institutional-books-1-pipeline
## Base model
[google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)
## Input format
Book metadata, formatted as follows:
```
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note
```
All of the fields listed in this example are optional.
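Because every field is optional, the input string can be assembled from whatever metadata is available. The helper below is a minimal sketch, not code from the released pipeline: the name `format_book_metadata` and the choice to omit empty fields entirely (rather than leave them blank) are assumptions. Only the field labels and their order come from the example above.

```python
from typing import Optional

# Field labels, in the order used by the example above.
FIELD_ORDER = ("Title", "Author", "Year", "Language", "General Note")


def format_book_metadata(
    title: Optional[str] = None,
    author: Optional[str] = None,
    year: Optional[int] = None,
    language: Optional[str] = None,
    general_note: Optional[str] = None,
) -> str:
    """Build the newline-separated "Label: value" input string.

    Assumption: missing or empty fields are skipped, not emitted as blanks.
    """
    values = (title, author, year, language, general_note)
    lines = [
        f"{label}: {value}"
        for label, value in zip(FIELD_ORDER, values)
        if value is not None and str(value).strip()
    ]
    return "\n".join(lines)


# Only three of the five fields are known for this record.
print(format_book_metadata(author="Hymers, J.", year=1848, language="English"))
```

The resulting string can be passed directly to the classifier.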
## Categories
- GENERAL WORKS
- PHILOSOPHY. PSYCHOLOGY. RELIGION
- AUXILIARY SCIENCES OF HISTORY
- WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.
- HISTORY OF THE AMERICAS
- GEOGRAPHY. ANTHROPOLOGY. RECREATION
- SOCIAL SCIENCES
- POLITICAL SCIENCE
- LAW
- EDUCATION
- MUSIC AND BOOKS ON MUSIC
- FINE ARTS
- LANGUAGE AND LITERATURE
- SCIENCE
- MEDICINE
- AGRICULTURE
- TECHNOLOGY
- MILITARY SCIENCE
- NAVAL SCIENCE
- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
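These 20 labels correspond to the 21 top-level classes of the Library of Congress Classification Outline, where classes E and F share the "HISTORY OF THE AMERICAS" label. The lookup below is a sketch for relating an existing LCC call number to one of the labels, for example when comparing predictions against catalog records. The names `LCC_CLASS_TO_LABEL` and `label_for_call_number` are hypothetical, and this card does not state that the training labels were produced this way.

```python
from typing import Optional

# Top-level LCC class letter -> category label used by this classifier.
LCC_CLASS_TO_LABEL = {
    "A": "GENERAL WORKS",
    "B": "PHILOSOPHY. PSYCHOLOGY. RELIGION",
    "C": "AUXILIARY SCIENCES OF HISTORY",
    "D": "WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.",
    "E": "HISTORY OF THE AMERICAS",
    "F": "HISTORY OF THE AMERICAS",
    "G": "GEOGRAPHY. ANTHROPOLOGY. RECREATION",
    "H": "SOCIAL SCIENCES",
    "J": "POLITICAL SCIENCE",
    "K": "LAW",
    "L": "EDUCATION",
    "M": "MUSIC AND BOOKS ON MUSIC",
    "N": "FINE ARTS",
    "P": "LANGUAGE AND LITERATURE",
    "Q": "SCIENCE",
    "R": "MEDICINE",
    "S": "AGRICULTURE",
    "T": "TECHNOLOGY",
    "U": "MILITARY SCIENCE",
    "V": "NAVAL SCIENCE",
    "Z": "BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)",
}


def label_for_call_number(call_number: str) -> Optional[str]:
    """Return the category for an LCC call number, or None if unrecognized."""
    cleaned = call_number.strip().upper()
    if not cleaned:
        return None
    # The first letter of an LCC call number identifies the top-level class.
    return LCC_CLASS_TO_LABEL.get(cleaned[0])


# "QA553" is an illustrative call number, not taken from the dataset.
print(label_for_call_number("QA553"))
```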
## Training data
- Train split: 80,830 samples
- Test split: 5,000 samples
- An additional set of 1,000 samples was set aside for benchmarking purposes
## Validation metrics
| Metric | Value |
| --- | --- |
| loss | 0.157407745718956 |
| f1_macro | 0.9613886456444749 |
| f1_micro | 0.9694 |
| f1_weighted | 0.9693030681223207 |
| precision_macro | 0.9679892485977634 |
| precision_micro | 0.9694 |
| precision_weighted | 0.9695713537396466 |
| recall_macro | 0.9560667596679707 |
| recall_micro | 0.9694 |
| recall_weighted | 0.9694 |
| accuracy | 0.9694 |
Post-training benchmark accuracy: 97.8% (978/1000)
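The micro-averaged precision, recall, and F1 all equal the accuracy (0.9694). This is expected when each sample receives exactly one label: every misclassification counts as one false positive and one false negative, so the micro scores reduce to accuracy. Macro averaging weights every category equally regardless of size, which is why the macro scores differ. The pure-Python sketch below illustrates this on made-up toy labels. It is not the evaluation code used for this model, and the function name `f1_scores` is hypothetical.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    # Micro: pool the counts across classes, then compute one score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one score per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data, invented for illustration only.
y_true = ["SCIENCE", "SCIENCE", "SCIENCE", "LAW", "MUSIC"]
y_pred = ["SCIENCE", "SCIENCE", "LAW", "LAW", "MUSIC"]

micro, macro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.4f} micro_f1={micro:.4f} macro_f1={macro:.4f}")
```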
## Quickstart
```python
from transformers import pipeline
to_label = """
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note
"""
pipe = pipeline("text-classification", model="instdin/institutional-books-topic-classifier-bert")
result = pipe(to_label.strip())
print(result[0]) # {'label': 'SCIENCE', 'score': 0.9996894598007202}
```
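When labelling many volumes, it can help to separate confident predictions from those that may deserve manual review. The sketch below works on the list of `{'label': ..., 'score': ...}` dictionaries that a `text-classification` pipeline returns when given a list of inputs. The function name `split_by_confidence` and the 0.9 threshold are assumptions chosen for illustration, not values recommended by the authors, and the sample predictions stand in for real pipeline output.

```python
def split_by_confidence(predictions, threshold=0.9):
    """Partition pipeline predictions into (confident, uncertain) lists.

    `threshold` is a hypothetical cut-off; tune it for your own data.
    """
    confident, uncertain = [], []
    for prediction in predictions:
        target = confident if prediction["score"] >= threshold else uncertain
        target.append(prediction)
    return confident, uncertain


# Stand-in for `pipe([...])` output. The first entry mirrors the Quickstart
# result; the second is made up to show the low-confidence branch.
sample_predictions = [
    {"label": "SCIENCE", "score": 0.9996894598007202},
    {"label": "LAW", "score": 0.55},
]

confident, uncertain = split_by_confidence(sample_predictions)
print(f"{len(confident)} confident, {len(uncertain)} for review")
```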
## About IDI
The Institutional Data Initiative at Harvard Law School Library works with knowledge institutions—from libraries and museums to cultural groups and government agencies—to refine and publish their collections as data. [Reach out to collaborate on your collections](https://institutionaldatainitiative.org/#get-involved).
## Cite
```bibtex
@misc{cargnelutti2025institutionalbooks10242b,
title={Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability},
author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
year={2025},
eprint={2506.08300},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.08300},
}
```