INTRODUCTION:

Developed as part of the BookNLP-fr project, this coreference resolution model is built on top of camembert-large embeddings. It is trained to link mentions of the same entity across a text, with a focus on French literary works.

This specific model has been trained to link entities of the following types: PER, LOC, FAC, TIME, VEH, GPE.

MODEL PERFORMANCE (LOOCV):

Overall coreference resolution performance on non-overlapping windows of different lengths:

| # | Window width (tokens) | Document count | Sample count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|---|---|---|---|---|---|---|---|
| 0 | 500 | 29 | 677 | 92.34% | 84.96% | 81.83% | 86.37% |
| 1 | 1,000 | 29 | 332 | 92.55% | 81.71% | 78.50% | 84.25% |
| 2 | 2,000 | 28 | 162 | 92.72% | 77.82% | 75.57% | 82.03% |
| 3 | 5,000 | 19 | 56 | 93.02% | 72.37% | 70.03% | 78.47% |
| 4 | 10,000 | 18 | 27 | 93.37% | 67.07% | 67.81% | 76.09% |
| 5 | 25,000 | 2 | 3 | 94.63% | 57.67% | 58.21% | 70.17% |
| 6 | 50,000 | 1 | 1 | 97.32% | 55.82% | 49.44% | 67.52% |
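
The document and sample counts above can be reproduced from the token counts listed in the TRAINING CORPUS table, assuming each document is cut into full windows of the given width and any trailing remainder is dropped. That windowing rule is an assumption: it is not stated in this card, but it matches the reported counts. The function names below are illustrative only.

```python
# Token counts of the 29 corpus documents, copied from the TRAINING CORPUS table.
TOKEN_COUNTS = [
    14299, 12315, 24776, 30987, 15408, 11768, 11834, 12557, 12281, 5425,
    2554, 2929, 4067, 2251, 2034, 1864, 2141, 2441, 2860, 2343,
    12703, 13023, 10982, 10305, 12389, 14637, 11902, 12285, 71219,
]


def window_counts(token_counts, width):
    """Return (document_count, sample_count) for full, non-overlapping windows."""
    samples_per_doc = [n // width for n in token_counts]
    document_count = sum(1 for s in samples_per_doc if s > 0)
    return document_count, sum(samples_per_doc)


def split_into_windows(tokens, width):
    """Cut a token list into full windows of `width` tokens (remainder dropped)."""
    full = len(tokens) // width
    return [tokens[i * width:(i + 1) * width] for i in range(full)]


for width in (500, 1000, 2000, 5000, 10000, 25000, 50000):
    print(width, window_counts(TOKEN_COUNTS, width))
# 500 -> (29, 677), 1000 -> (29, 332), 5000 -> (19, 56), ...
```

Under this assumption, a document shorter than the window width contributes no sample, which is why the document count falls as the window grows.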

Coreference resolution performance on the fully annotated sample of each document:

| # | Token count | Mention count | MUC F1 | B3 F1 | CEAFe F1 | CONLL F1 |
|---|---|---|---|---|---|---|
| 0 | 1,864 | 306 | 96.82% | 91.68% | 77.76% | 88.75% |
| 1 | 2,034 | 354 | 96.55% | 88.22% | 83.37% | 89.38% |
| 2 | 2,141 | 352 | 95.03% | 80.11% | 76.04% | 83.73% |
| 3 | 2,251 | 252 | 91.53% | 78.58% | 70.66% | 80.26% |
| 4 | 2,343 | 320 | 86.76% | 69.48% | 72.74% | 76.33% |
| 5 | 2,441 | 358 | 92.73% | 67.48% | 69.44% | 76.55% |
| 6 | 2,554 | 376 | 87.63% | 65.67% | 70.02% | 74.44% |
| 7 | 2,860 | 474 | 91.95% | 78.24% | 75.48% | 81.89% |
| 8 | 2,929 | 435 | 95.16% | 61.06% | 80.01% | 78.74% |
| 9 | 4,067 | 569 | 94.55% | 82.97% | 75.41% | 84.31% |
| 10 | 5,425 | 671 | 87.39% | 56.20% | 63.71% | 69.10% |
| 11 | 10,305 | 1,551 | 95.83% | 68.15% | 72.73% | 78.90% |
| 12 | 10,982 | 1,252 | 96.05% | 65.75% | 75.09% | 78.96% |
| 13 | 11,768 | 1,932 | 92.65% | 67.69% | 72.51% | 77.62% |
| 14 | 11,834 | 861 | 88.99% | 60.48% | 71.20% | 73.56% |
| 15 | 11,902 | 1,999 | 93.95% | 59.57% | 68.01% | 73.84% |
| 16 | 12,281 | 1,480 | 92.24% | 71.24% | 80.86% | 81.45% |
| 17 | 12,285 | 1,735 | 94.85% | 72.31% | 70.94% | 79.37% |
| 18 | 12,315 | 1,745 | 93.64% | 60.48% | 68.45% | 74.19% |
| 19 | 12,389 | 2,059 | 92.87% | 63.61% | 69.26% | 75.25% |
| 20 | 12,557 | 1,498 | 92.24% | 79.00% | 78.10% | 83.11% |
| 21 | 12,703 | 2,297 | 88.94% | 61.19% | 76.12% | 75.42% |
| 22 | 13,023 | 1,861 | 91.53% | 66.21% | 74.13% | 77.29% |
| 23 | 14,299 | 1,849 | 95.73% | 71.32% | 77.98% | 81.68% |
| 24 | 14,637 | 2,471 | 94.67% | 71.41% | 76.06% | 80.71% |
| 25 | 15,408 | 2,013 | 91.54% | 56.61% | 64.54% | 70.90% |
| 26 | 24,776 | 3,092 | 92.94% | 63.22% | 70.87% | 75.67% |
| 27 | 30,987 | 3,481 | 89.25% | 52.00% | 70.11% | 70.45% |
| 28 | 71,219 | 11,857 | 97.28% | 53.34% | 46.67% | 65.76% |
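
For readers unfamiliar with the last column: the CoNLL F1 score (the CONLL F1 column) is conventionally the unweighted mean of the MUC, B3 and CEAFe F1 scores. The short check below recomputes it for three rows of the table above; the function name is illustrative, and because the inputs are already rounded, a recomputed value may differ from a reported one in the last digit.

```python
def conll_f1(muc, b3, ceafe):
    """Unweighted mean of the three coreference F1 scores (in percent)."""
    return (muc + b3 + ceafe) / 3


# (MUC F1, B3 F1, CEAFe F1, reported CONLL F1) for rows 0, 1 and 28.
rows = [
    (96.82, 91.68, 77.76, 88.75),
    (96.55, 88.22, 83.37, 89.38),
    (97.28, 53.34, 46.67, 65.76),
]

for muc, b3, ceafe, reported in rows:
    print(f"recomputed {conll_f1(muc, b3, ceafe):.2f}  reported {reported:.2f}")
```

On the longest document (row 28), the high MUC F1 alongside low B3 and CEAFe F1 is what pulls the average down.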

TRAINING PARAMETERS:

  • Entity types: PER, LOC, FAC, TIME, VEH, GPE
  • Split strategy: Leave-one-out cross-validation (29 files)
  • Train/Validation split: 0.85 / 0.15
  • Batch size: 16,000
  • Initial learning rate: 0.0004
  • Focal loss gamma: 1
  • Focal loss alpha: 0.25
  • Pronoun lookup antecedents: 30
  • Common and proper noun lookup antecedents: 300
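
To make the two focal loss parameters concrete, here is a minimal sketch of the standard binary focal loss for a single mention pair, using gamma = 1 and alpha = 0.25 from the list above. It assumes the usual formulation in which alpha weights the positive (coreferent) class and 1 - alpha the negative class. The card does not say which variant or implementation was used in training, so treat this as an illustration only.

```python
import math

GAMMA = 1.0   # focusing parameter, from the training parameters
ALPHA = 0.25  # class-balancing weight, from the training parameters


def focal_loss(p, label, gamma=GAMMA, alpha=ALPHA, eps=1e-12):
    """Binary focal loss for one predicted probability `p` and a 0/1 label."""
    p_t = p if label == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # weight of the true class
    p_t = min(max(p_t, eps), 1.0)                   # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Well-classified pairs are down-weighted relative to uncertain ones.
print(focal_loss(0.9, 1))  # confident, correct positive: small loss
print(focal_loss(0.6, 1))  # less confident positive: larger loss
```

With gamma = 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy.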

MODEL ARCHITECTURE:

Model Input: 2,165-dimensional vector

  • Concatenated maximum-context camembert-large embeddings (2 * 1,024 = 2,048 dimensions)

  • Additional mention features (106 dimensions):

    • Length of mentions
    • Position of the mention's start token within the sentence
    • Grammatical category of the mentions (pronoun, common noun, proper noun)
    • Dependency relation of the mention's head (one-hot encoded)
    • Gender of the mentions (one-hot encoded)
    • Number (singular/plural) of the mentions (one-hot encoded)
    • Grammatical person of the mentions (one-hot encoded)
  • Additional mention-pair features (11 dimensions):

    • Distance between mention IDs
    • Distance between start tokens of mentions
    • Distance between end tokens of mentions
    • Distance between sentences containing mentions
    • Distance between paragraphs containing mentions
    • Difference in nesting levels of mentions
    • Ratio of shared tokens between mentions
    • Exact text match between mentions (binary)
    • Exact match of mention heads (binary)
    • Match of syntactic heads between mentions (binary)
    • Match of entity types between mentions (binary)
  • Hidden Layers:

    • Number of layers: 3
    • Units per layer: 1,900
    • Activation function: ReLU
    • Dropout rate: 0.6
  • Final Layer:

    • Type: Linear
    • Input: 1,900 dimensions
    • Output: 1 dimension (mention pair coreference score)
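
The dimensions listed above can be checked with a few lines of arithmetic. The sketch below derives the input size and the shape of each fully connected layer, then counts trainable parameters under the assumption that every layer has a bias term and that there are no other trainable components (for example, normalization layers). Those assumptions, and all names in the code, are illustrative rather than taken from the training code.

```python
EMBEDDING_DIM = 1024     # camembert-large embedding size (2 * 1,024 = 2,048)
MENTION_FEATURES = 106   # additional mention features
PAIR_FEATURES = 11       # additional mention-pair features
HIDDEN_UNITS = 1900
HIDDEN_LAYERS = 3


def layer_shapes():
    """Return the (input_dim, output_dim) of each linear layer, in order."""
    input_dim = 2 * EMBEDDING_DIM + MENTION_FEATURES + PAIR_FEATURES
    dims = [input_dim] + [HIDDEN_UNITS] * HIDDEN_LAYERS + [1]
    return list(zip(dims[:-1], dims[1:]))


def parameter_count(shapes):
    """Weights plus biases, assuming one bias per output unit."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


shapes = layer_shapes()
print(shapes)                   # [(2165, 1900), (1900, 1900), (1900, 1900), (1900, 1)]
print(parameter_count(shapes))  # 11341101
```

Under these assumptions the scoring network has roughly 11.3 million parameters, most of them in the first hidden layer.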

Model Output: Continuous prediction between 0 (not coreferent) and 1 (coreferent) indicating the degree of confidence.
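
Because the model scores mention pairs, a separate step is needed to turn scores into entity clusters. The card does not describe that step, so the following is only one plausible sketch: each mention is compared with a limited number of preceding mentions (30 for pronouns, 300 otherwise, reading the "lookup antecedents" training parameters as such limits) and is attached to its best-scoring antecedent if the score exceeds a threshold. The threshold of 0.5, the greedy linking rule, and all function names are assumptions.

```python
# Hypothetical limits on how many preceding mentions are scored, taken from
# the "lookup antecedents" training parameters (this reading is an assumption).
MAX_ANTECEDENTS = {"pronoun": 30, "other": 300}


def candidate_antecedents(index, categories):
    """Indices of the preceding mentions considered for mention `index`."""
    key = "pronoun" if categories[index] == "pronoun" else "other"
    start = max(0, index - MAX_ANTECEDENTS[key])
    return list(range(start, index))


def cluster_mentions(categories, score_pair, threshold=0.5):
    """Greedy best-antecedent clustering; returns one cluster ID per mention."""
    cluster_of = []
    next_cluster = 0
    for i in range(len(categories)):
        best_j, best_score = None, threshold
        for j in candidate_antecedents(i, categories):
            score = score_pair(j, i)  # stand-in for the model's output in [0, 1]
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:
            cluster_of.append(next_cluster)  # no antecedent: start a new entity
            next_cluster += 1
        else:
            cluster_of.append(cluster_of[best_j])
    return cluster_of


# Toy example with made-up scores (not model outputs).
toy_scores = {(0, 1): 0.9, (0, 3): 0.8, (2, 3): 0.3}
categories = ["proper noun", "pronoun", "proper noun", "pronoun"]
clusters = cluster_mentions(categories, lambda j, i: toy_scores.get((j, i), 0.1))
print(clusters)  # [0, 0, 1, 0]
```

In the toy example both pronouns attach to the first proper noun, while the second proper noun starts its own entity.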

HOW TO USE:

*** UNDER CONSTRUCTION ***

TRAINING CORPUS:

| # | Document | Token count | Included in model eval |
|---|---|---|---|
| 0 | 1836_Gautier-Theophile_La-morte-amoureuse | 14,299 tokens | True |
| 1 | 1840_Sand-George_Pauline | 12,315 tokens | True |
| 2 | 1842_Balzac-Honore-de_La-Maison-du-chat-qui-pelote | 24,776 tokens | True |
| 3 | 1844_Balzac-Honore-de_La-Maison-Nucingen | 30,987 tokens | True |
| 4 | 1844_Balzac-Honore-de_Sarrasine | 15,408 tokens | True |
| 5 | 1856_Cousin-Victor_Madame-de-Hautefort | 11,768 tokens | True |
| 6 | 1863_Gautier-Theophile_Le-capitaine-Fracasse | 11,834 tokens | True |
| 7 | 1873_Zola-Emile_Le-ventre-de-Paris | 12,557 tokens | True |
| 8 | 1881_Flaubert-Gustave_Bouvard-et-Pecuchet | 12,281 tokens | True |
| 9 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_1-MADEMOISELLE-FIFI | 5,425 tokens | True |
| 10 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_2-MADAME-BAPTISTE | 2,554 tokens | True |
| 11 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-1_3-LA-ROUILLE | 2,929 tokens | True |
| 12 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_1-MARROCA | 4,067 tokens | True |
| 13 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_2-LA-BUCHE | 2,251 tokens | True |
| 14 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-2_3-LA-RELIQUE | 2,034 tokens | True |
| 15 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_1-FOU | 1,864 tokens | True |
| 16 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_2-REVEIL | 2,141 tokens | True |
| 17 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_3-UNE-RUSE | 2,441 tokens | True |
| 18 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_4-A-CHEVAL | 2,860 tokens | True |
| 19 | 1882_Guy-de-Maupassant_Mademoiselle-Fifi-3_5-UN-REVEILLON | 2,343 tokens | True |
| 20 | 1901_Lucie-Achard_Rosalie-de-Constant-sa-famille-et-ses-amis | 12,703 tokens | True |
| 21 | 1903_Conan-Laure_Elisabeth_Seton | 13,023 tokens | True |
| 22 | 1904_Rolland-Romain_Jean-Christophe_Tome-I-L-aube | 10,982 tokens | True |
| 23 | 1904_Rolland-Romain_Jean-Christophe_Tome-II-Le-matin | 10,305 tokens | True |
| 24 | 1917_Adèle-Bourgeois_Némoville | 12,389 tokens | True |
| 25 | 1923_Radiguet-Raymond_Le-diable-au-corps | 14,637 tokens | True |
| 26 | 1926_Audoux-Marguerite_De-la-ville-au-moulin | 11,902 tokens | True |
| 27 | 1937_Audoux-Marguerite_Douce-Lumiere | 12,285 tokens | True |
| 28 | Manon_Lescaut_PEDRO | 71,219 tokens | True |
| | TOTAL | 346,579 tokens | 29 files used for cross-validation |
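
Since all 29 files take part in the leave-one-out cross-validation, each file is held out once while the other 28 are used for training. The generator below is a minimal sketch of that fold structure. The function name is illustrative, the file names are shortened for the example, and the further 0.85 / 0.15 train/validation split is left out because the card does not say at which level it is applied.

```python
def leave_one_out(files):
    """Yield (training_files, held_out_file) once per file."""
    for i, held_out in enumerate(files):
        training = files[:i] + files[i + 1:]
        yield training, held_out


# Shortened, illustrative file names; the real corpus has 29 entries.
corpus = ["La-morte-amoureuse", "Pauline", "Sarrasine"]

for training, held_out in leave_one_out(corpus):
    print(f"evaluate on {held_out}, train on {len(training)} other files")
```

Applied to the full corpus, this yields 29 folds with 28 training files each, which is the setting behind the per-document scores reported under MODEL PERFORMANCE.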

CONTACT:

mail: antoine [dot] bourgois [at] protonmail [dot] com

MODEL ID:

AntoineBourgois/BookNLP-fr_coreference-resolution_camembert-large_FAC_GPE_LOC_PER_TIME_VEH