Dudeman523 committed
Commit 9c1caa9 · verified · 1 Parent(s): 308afb5

Update README.md

Files changed (1): README.md (+254 -3)
README.md CHANGED
---
license: mit
language:
- en
base_model:
- FacebookAI/roberta-base
library_name: transformers
pipeline_tag: token-classification
tags:
- token-classification
- ner
- plants
- botany
- roberta
- biology
- horticulture
datasets:
- custom
widget:
- text: "I have a Rosa damascena and some Quercus alba trees in my garden."
  example_title: "Scientific plant names"
- text: "My hibiscus and pachypodium plants need watering."
  example_title: "Common plant names"
- text: "The beautiful roses are blooming next to the oak tree."
  example_title: "Mixed plant references"
model-index:
- name: roberta-plant-ner
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      type: custom
      name: Plant NER Dataset
    metrics:
    - type: f1
      value: 0.92
      name: F1 Score
    - type: precision
      value: 0.90
      name: Precision
    - type: recall
      value: 0.94
      name: Recall
---

# RoBERTa Plant Named Entity Recognition

## Model Description

This model is a fine-tuned version of [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base) for **plant named entity recognition**. It identifies plant names in text and classifies them into two categories:

- **PLANT_COMMON**: Common names for plants (e.g., "rose", "hibiscus", "oak tree")
- **PLANT_SCI**: Scientific/botanical names (e.g., "Rosa damascena", "Quercus alba")

## Intended Uses & Limitations

### Intended Uses
- **Botanical text analysis**: Extract plant mentions from research papers, articles, and documentation
- **Gardening applications**: Identify plants mentioned in gardening guides, forums, and care instructions
- **Agricultural text processing**: Parse agricultural documents and reports
- **Educational tools**: Assist in botany and horticulture education
- **Content management**: Automatically tag and categorize plant-related content

### Limitations
- Trained primarily on English text
- May have lower accuracy on rare or highly specialized plant species
- Performance may vary on informal text, social media, or heavily abbreviated content
- Does not distinguish between live plants and plant products (e.g., "rose oil")

## Training Data

The model was trained on a custom dataset containing:
- Botanical literature and research papers
- Gardening guides and plant care instructions
- Agricultural documents
- Horticultural databases
- Plant identification guides

- **Data Format**: CoNLL-style IOB2 tagging with whole-word tokenization
- **Training Examples**: Thousands of annotated sentences containing plant references
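
To make the format concrete, a hypothetical snippet in this CoNLL-style layout (constructed for illustration, not drawn from the actual training set) looks like:

```
I           O
have        O
a           O
Rosa        B-PLANT_SCI
damascena   I-PLANT_SCI
and         O
an          O
oak         B-PLANT_COMMON
tree        I-PLANT_COMMON
in          O
my          O
garden      O
.           O
```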

## Training Procedure

### Training Hyperparameters
- **Base Model**: FacebookAI/roberta-base
- **Training Framework**: Hugging Face Transformers
- **Tokenization**: RoBERTa tokenizer with whole-word alignment (see the sketch below)
- **Label Encoding**: IOB2 (Inside-Outside-Beginning) format
- **Sequence Length**: 512 tokens maximum
- **Batch Size**: Optimized for training efficiency
- **Learning Rate**: Adaptive with warmup
- **Training Epochs**: Multiple epochs with early stopping
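
The training script itself is not published. For orientation, here is a minimal, self-contained sketch of how such a fine-tune can be wired up with the Transformers `Trainer`; the toy dataset, learning rate, batch size, and epoch count are illustrative assumptions, not the author's actual configuration:

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

labels = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "B-PLANT_SCI", "I-PLANT_SCI"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

# add_prefix_space=True is required to feed pre-tokenized words to RoBERTa
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/roberta-base",
    num_labels=len(labels), id2label=id2label, label2id=label2id,
)

# Toy stand-in for the custom plant NER corpus (one IOB2 tag id per word)
raw = Dataset.from_dict({
    "tokens": [["My", "hibiscus", "is", "blooming"]],
    "ner_tags": [[0, 1, 0, 0]],
})

def tokenize_and_align(batch):
    # Tokenize pre-split words and copy each word's tag onto its first
    # sub-token; special tokens and continuation sub-tokens get -100 so
    # the loss ignores them (whole-word alignment).
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, max_length=512)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, word_labels = None, []
        for wid in enc.word_ids(batch_index=i):
            word_labels.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(word_labels)
    return enc

train_ds = raw.map(tokenize_and_align, batched=True, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-plant-ner",
        learning_rate=2e-5,              # assumed; the card says "adaptive with warmup"
        warmup_ratio=0.1,                # assumed
        per_device_train_batch_size=16,  # assumed
        num_train_epochs=3,              # assumed; the card mentions early stopping
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```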

### Label Schema
```
O               # Outside any plant entity
B-PLANT_COMMON  # Beginning of a common plant name
I-PLANT_COMMON  # Inside/continuation of a common plant name
B-PLANT_SCI     # Beginning of a scientific plant name
I-PLANT_SCI     # Inside/continuation of a scientific plant name
```

### Training Features
- **Whole-word tokenization**: Labels are aligned to whole words rather than sub-tokens, so multi-word and multi-piece plant names are handled consistently
- **B-I-O validation**: Automatic correction of invalid tag sequences (one possible repair rule is sketched below)
- **Class balancing**: Weighted sampling for entity type balance
- **Data augmentation**: Synthetic examples for robustness
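
The validation code is likewise not published; one common repair rule, sketched here as an assumption, promotes an `I-X` tag that does not continue an entity of the same type to `B-X`:

```python
def repair_iob2(tags):
    """Fix invalid IOB2 sequences: an I-X that does not follow B-X or I-X
    of the same entity type is promoted to B-X (one common repair rule;
    hypothetical, not necessarily the author's exact logic)."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in (f"B-{tag[2:]}", f"I-{tag[2:]}"):
            tag = f"B-{tag[2:]}"
        fixed.append(tag)
        prev = tag
    return fixed

print(repair_iob2(["O", "I-PLANT_SCI", "I-PLANT_SCI", "O"]))
# ['O', 'B-PLANT_SCI', 'I-PLANT_SCI', 'O']
```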

## Evaluation

The model achieves strong performance on plant entity recognition:

| Metric | Overall | PLANT_COMMON | PLANT_SCI |
|--------|---------|--------------|-----------|
| **Precision** | 0.90 | 0.88 | 0.92 |
| **Recall** | 0.94 | 0.96 | 0.91 |
| **F1-Score** | 0.92 | 0.92 | 0.91 |

### Performance Notes
- Excellent recall for common plant names (0.96)
- Strong precision for scientific names (0.92)
- Robust performance across different text types
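
The card does not state which evaluation harness produced these numbers; entity-level precision, recall, and F1 of this kind are commonly computed with the `seqeval` package, for example:

```python
# Illustrative entity-level scoring with seqeval (a tooling assumption,
# not a statement about the author's actual evaluation setup).
from seqeval.metrics import classification_report

y_true = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O", "B-PLANT_COMMON"]]
y_pred = [["O", "B-PLANT_SCI", "I-PLANT_SCI", "O", "O"]]

print(classification_report(y_true, y_pred, digits=2))
```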

## Usage

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer
model_name = "Dudeman523/roberta-plant-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create a token-classification pipeline; "simple" aggregation merges
# sub-tokens back into whole entities
ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

# Extract plant entities
text = "I love my Rosa damascena roses and the old oak tree in my garden."
entities = ner_pipeline(text)

for entity in entities:
    print(f"Plant: {entity['word']} | Type: {entity['entity_group']} | Confidence: {entity['score']:.2f}")
```
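
For the sentence above, the aggregated pipeline returns one dictionary per detected entity, roughly of this shape (the grouping and scores below are illustrative, and each entry also carries `start`/`end` character offsets):

```
[{'entity_group': 'PLANT_SCI', 'word': ' Rosa damascena', 'score': 0.98},
 {'entity_group': 'PLANT_COMMON', 'word': ' roses', 'score': 0.95},
 {'entity_group': 'PLANT_COMMON', 'word': ' oak tree', 'score': 0.93}]
```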

### Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Dudeman523/roberta-plant-ner")
model = AutoModelForTokenClassification.from_pretrained("Dudeman523/roberta-plant-ner")

# Tokenize input
text = "The Pachypodium lamerei succulent needs minimal watering."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get per-token label probabilities
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Map predictions back to tokens (RoBERTa sub-words; "Ġ" marks a leading space)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = torch.argmax(predictions, dim=-1)[0]

for token, label_id in zip(tokens, predicted_labels):
    label = model.config.id2label[label_id.item()]
    if label != "O":
        print(f"Token: {token} | Label: {label}")
```

### Batch Processing
```python
# Process multiple texts efficiently (reuses ner_pipeline from the
# Quick Start example above)
texts = [
    "My hibiscus is blooming beautifully this spring.",
    "Quercus alba and Acer saccharum are common in this forest.",
    "I need care instructions for my Rosa damascena plant."
]

# Passing a list runs batched prediction and returns one entity list per text
results = ner_pipeline(texts)

for i, (text, entities) in enumerate(zip(texts, results)):
    print(f"\nText {i+1}: {text}")
    for entity in entities:
        print(f"  🌱 {entity['word']} ({entity['entity_group']}) - {entity['score']:.2f}")
```

## Model Architecture

- **Base Architecture**: RoBERTa (Robustly Optimized BERT Pretraining Approach)
- **Parameters**: ~125M
- **Layers**: 12 transformer layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary**: 50,265 tokens
- **Classification Head**: Linear layer for 5-class token classification
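
These figures match the standard roberta-base configuration and can be checked directly from the released checkpoint:

```python
# Inspect the architecture described above via the model config
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Dudeman523/roberta-plant-ner")
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
print(config.vocab_size)           # 50265
print(config.id2label)             # the 5 token-classification labels
```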

## Ethical Considerations

### Bias and Fairness
- The model may reflect geographical and cultural biases present in the training data
- Plants from certain regions or cultures may be underrepresented
- Performance is likely better on commonly cultivated plants than on wild or rare species

### Environmental Impact
- Training computational cost: moderate (fine-tuning only)
- Inference efficiency: optimized for production use
- Carbon footprint: minimal incremental impact over the base model

## Technical Specifications

- **Input**: Text sequences up to 512 tokens
- **Output**: Token-level classifications with confidence scores
- **Inference Speed**: ~100-500 texts/second, depending on hardware
- **Memory Requirements**: ~500 MB RAM for inference
- **Supported Formats**: Raw text, tokenized input
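
Throughput in the quoted range depends heavily on batching and device placement; an illustrative setup (the device index and batch size here are assumptions to tune for your hardware):

```python
from transformers import pipeline

# Hypothetical throughput-oriented configuration; inputs longer than the
# 512-token limit should be truncated or split beforehand.
ner = pipeline(
    "token-classification",
    model="Dudeman523/roberta-plant-ner",
    aggregation_strategy="simple",
    device=0,       # GPU index; omit or use -1 for CPU
    batch_size=32,  # assumed; tune for your memory budget
)
```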

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{roberta-plant-ner,
  title={RoBERTa Plant Named Entity Recognition Model},
  author={Dudeman523},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dudeman523/roberta-plant-ner}
}
```

## Contact

For questions, issues, or collaboration opportunities, please open an issue on the model repository or contact the model author.

---

- **Model Version**: 1.0
- **Last Updated**: December 2024
- **Framework Compatibility**: transformers >= 4.21.0