MatteoCargnelutti committed
Commit 4995454 · verified · 1 Parent(s): 9fb05b3

Update README.md

  1. README.md +24 -6
README.md CHANGED
@@ -9,12 +9,19 @@ widget:
9
  license: apache-2.0
10
  ---
11
 
12
- # 📚 Institutional Books Pipeline
13
 
14
- ## Training data
15
 
16
  ## Input format
17
- Text, formatted as follows:
18
  ```
19
  Title: Full title of the book
20
  Author: Lorem Ipsum
@@ -26,8 +33,6 @@ General Note: A great book
26
  All of the fields listed in this example are optional.
27
 
28
  ## Categories
29
- First level of the [Library of Congress Classification Outline](https://www.loc.gov/catdir/cpso/lcco/)
30
-
31
  - GENERAL WORKS
32
  - PHILOSOPHY. PSYCHOLOGY. RELIGION
33
  - AUXILIARY SCIENCES OF HISTORY
@@ -49,6 +54,12 @@ First level of the [Library of Congress Classification Outline](https://www.loc.
49
  - NAVAL SCIENCE
50
  - BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
51
 
52
  ## Validation Metrics
53
  | Metric | Value |
54
  | --- | --- |
@@ -62,4 +73,11 @@ First level of the [Library of Congress Classification Outline](https://www.loc.
62
  | recall_macro | 0.9560667596679707 |
63
  | recall_micro | 0.9694 |
64
  | recall_weighted | 0.9694 |
65
- | accuracy | 0.9694 |
 
9
  license: apache-2.0
10
  ---
11
 
12
+ # 📚 Institutional Books Topic Classifier
13
 
14
+ This model was trained as part of the analysis and experiments performed in preparation for the release of the [Institutional Books 1.0 dataset](https://huggingface.co/collections/instdin/institutional-books-68366258bfb38364238477cf).
15
+
16
+ It is a text classifier that we used to assign one of 20 topics, derived from the first level of the [Library of Congress Classification Outline](https://www.loc.gov/catdir/cpso/lcco/), to individual volumes.
17
+
18
+ The complete experimental setup and results are available in our [technical report]() (Section 4.5).
19
+
20
+ ## Base model
21
+ [google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)
22
 
23
  ## Input format
24
+ Book metadata, formatted as follows:
25
  ```
26
  Title: Full title of the book
27
  Author: Lorem Ipsum
 
33
  All of the fields listed in this example are optional.
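The input format above (one `Label: value` line per field, all fields optional) could be produced with a small helper along these lines. This is a minimal sketch, not the authors' preprocessing code: the function name `format_book_metadata` is hypothetical, and only the labels visible in this diff (`Title`, `Author`, `General Note`) are used, since the remaining fields are elided by the hunk.

```python
from typing import Mapping, Optional


def format_book_metadata(fields: Mapping[str, Optional[str]]) -> str:
    """Render metadata as one 'Label: value' line per non-empty field.

    Hypothetical helper: the README only shows the target text format,
    not how the authors built it.
    """
    lines = []
    for label, value in fields.items():
        if value is None:
            continue  # every field is optional, so missing ones are skipped
        # Collapse internal whitespace so each field stays on a single line.
        text = " ".join(str(value).split())
        if text:
            lines.append(f"{label}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_book_metadata({
        "Title": "Full title of the book",
        "Author": "Lorem Ipsum",
        "General Note": "A great book",
    }))
```

The resulting string is what would be passed to the classifier, for example through a `transformers` text-classification pipeline loaded with this model's repository id (not stated in this diff).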
34
 
35
  ## Categories
36
  - GENERAL WORKS
37
  - PHILOSOPHY. PSYCHOLOGY. RELIGION
38
  - AUXILIARY SCIENCES OF HISTORY
 
54
  - NAVAL SCIENCE
55
  - BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
56
 
57
+ ## Training data
58
+ - Train split: 80,830 samples
59
+ - Test split: 5,000 samples
60
+
61
+ An additional set of 1,000 samples was set aside for benchmarking purposes.
62
+
63
  ## Validation Metrics
64
  | Metric | Value |
65
  | --- | --- |
 
73
  | recall_macro | 0.9560667596679707 |
74
  | recall_micro | 0.9694 |
75
  | recall_weighted | 0.9694 |
76
+ | accuracy | 0.9694 |
77
+
78
+ **Benchmark accuracy:** 97.2% (920)
79
+
80
+ ## Cite
81
+ ```
82
+ TBD
83
+ ```
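In the Validation Metrics table, `recall_micro`, `recall_weighted`, and `accuracy` all share the value 0.9694 while `recall_macro` differs. That is expected for single-label multiclass classification, where micro-averaged and support-weighted recall both reduce to accuracy. The sketch below illustrates this on toy labels; the function name `recall_summary` and the data are illustrative only and are not taken from the authors' evaluation code.

```python
from collections import Counter


def recall_summary(y_true, y_pred):
    """Compute accuracy and micro/weighted/macro recall for single-label data."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    per_class = {c: hits[c] / n for c, n in support.items()}
    total = len(y_true)
    return {
        "accuracy": sum(hits.values()) / total,
        # Micro: pool true positives and supports over all classes.
        "recall_micro": sum(hits.values()) / sum(support.values()),
        # Weighted: per-class recall weighted by class support.
        "recall_weighted": sum(per_class[c] * support[c] for c in support) / total,
        # Macro: unweighted mean, so small classes count as much as large ones.
        "recall_macro": sum(per_class.values()) / len(per_class),
    }


if __name__ == "__main__":
    toy_true = ["A", "A", "A", "B"]
    toy_pred = ["A", "A", "B", "B"]
    print(recall_summary(toy_true, toy_pred))
```

If the table was computed on the 5,000-sample test split (an assumption; the README does not say so explicitly), an accuracy of 0.9694 corresponds to 4,847 correct predictions.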