---
base_model: google-bert/bert-base-multilingual-uncased
library_name: transformers
license: apache-2.0
tags:
- autotrain
- text-classification
---

# 📚 Institutional Books Topic Classifier

This model was trained as part of the analysis and post-processing work performed in preparation for the release of the [Institutional Books 1.0 dataset](https://huggingface.co/collections/instdin/institutional-books-68366258bfb38364238477cf) by the Institutional Data Initiative.

We used this text classifier to assign a topic, derived from the first level of the [Library of Congress' Classification Outline](https://www.loc.gov/catdir/cpso/lcco/), to individual volumes.

Complete experimental setup and results are available in our [technical report](https://arxiv.org/abs/2506.08300) (Section 4.5).

**Code:** https://github.com/instdin/institutional-books-1-pipeline

## Base model
[google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)

## Input format 
Book metadata, formatted as follows:
```
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature. 
Author: Hymers, J.   
Year: 1848
Language: English
General Note: Example of a general note
```

All of the fields listed in this example are optional.
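
As a minimal sketch of how such an input string could be assembled from a metadata record, the helper below emits one `Label: value` line per available field, in the order shown above. The function name `format_book_metadata` and the choice to drop missing fields entirely (rather than emit an empty `Title:` line) are assumptions made for illustration; the card only states that the fields are optional.

```python
from typing import Optional, Union

# Field labels, in the order used by the example above.
FIELD_ORDER = ("Title", "Author", "Year", "Language", "General Note")


def format_book_metadata(
    title: Optional[str] = None,
    author: Optional[str] = None,
    year: Optional[Union[int, str]] = None,
    language: Optional[str] = None,
    general_note: Optional[str] = None,
) -> str:
    """Build a classifier input string (hypothetical helper).

    Assumption: fields that are missing or blank are skipped entirely.
    """
    values = (title, author, year, language, general_note)
    lines = []
    for label, value in zip(FIELD_ORDER, values):
        if value is None:
            continue
        text = str(value).strip()
        if text:  # ignore whitespace-only values
            lines.append(f"{label}: {text}")
    return "\n".join(lines)


# Only the fields that are known need to be passed.
print(format_book_metadata(author="Hymers, J.", year=1848, language="English"))
```

The resulting string can be passed directly to the classifier in place of a hand-written block.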

## Categories
- GENERAL WORKS
- PHILOSOPHY. PSYCHOLOGY. RELIGION
- AUXILIARY SCIENCES OF HISTORY
- WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.
- HISTORY OF THE AMERICAS
- GEOGRAPHY. ANTHROPOLOGY. RECREATION
- SOCIAL SCIENCES
- POLITICAL SCIENCE
- LAW
- EDUCATION
- MUSIC AND BOOKS ON MUSIC
- FINE ARTS
- LANGUAGE AND LITERATURE
- SCIENCE
- MEDICINE
- AGRICULTURE
- TECHNOLOGY
- MILITARY SCIENCE
- NAVAL SCIENCE
- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)

## Training data
- Train split: 80,830 samples
- Test split: 5,000 samples
- An additional set of 1,000 samples was set aside for benchmarking purposes

## Validation Metrics
| Metric | Value |
| --- | --- | 
| loss | 0.157407745718956 |
| f1_macro | 0.9613886456444749 |
| f1_micro | 0.9694 |
| f1_weighted | 0.9693030681223207 |
| precision_macro | 0.9679892485977634 |
| precision_micro | 0.9694 |
| precision_weighted | 0.9695713537396466 |
| recall_macro | 0.9560667596679707 |
| recall_micro | 0.9694 |
| recall_weighted | 0.9694 |
| accuracy | 0.9694 |

Post-training benchmark accuracy: 97.8% (978/1000) 
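
In the table above, the micro-averaged precision, recall and F1 all equal the accuracy (0.9694), while the macro-averaged values differ. This is expected for single-label classification, assuming the metrics follow the usual definitions: every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled counts reduce to accuracy. The sketch below illustrates this on toy data; the function name and the sample labels are made up for illustration and are not taken from the evaluation set.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1, accuracy) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # wrongly predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_n, fp_n, fn_n):
        denom = 2 * tp_n + fp_n + fn_n
        return 2 * tp_n / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one score per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return micro, macro, accuracy


# Toy data (illustrative only): one SCIENCE volume is mislabeled as LAW.
y_true = ["SCIENCE", "SCIENCE", "SCIENCE", "LAW", "LAW", "MEDICINE"]
y_pred = ["SCIENCE", "SCIENCE", "LAW", "LAW", "LAW", "MEDICINE"]

micro, macro, accuracy = micro_macro_f1(y_true, y_pred)
print(f"micro F1={micro:.4f} macro F1={macro:.4f} accuracy={accuracy:.4f}")
```

Because macro averaging weights every class equally, a lower macro than micro score, as in the table, suggests that some less frequent categories are classified less reliably than the common ones.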

## Quickstart

```python
from transformers import pipeline

to_label = """
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature. 
Author: Hymers, J.   
Year: 1848
Language: English
General Note: Example of a general note
"""

pipe = pipeline("text-classification", model="instdin/institutional-books-topic-classifier-bert")
result = pipe(to_label.strip())
print(result[0]) # {'label': 'SCIENCE', 'score': 0.9996894598007202}
```

## About IDI
The Institutional Data Initiative at Harvard Law School Library works with knowledge institutions—from libraries and museums to cultural groups and government agencies—to refine and publish their collections as data. [Reach out to collaborate on your collections](https://institutionaldatainitiative.org/#get-involved).

## Cite
```bibtex
@misc{cargnelutti2025institutionalbooks10242b,
      title={Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability}, 
      author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
      year={2025},
      eprint={2506.08300},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08300}, 
}
```