metadata
viewer: false
license:
- apache-2.0
language:
- en
Model Summary
To support reproduction of GneissWeb, we provide GneissWeb.Med_classifier, a fastText classifier for the medical category. This model is used as part of the ensemble filter in GneissWeb to detect documents with medical content.
Please refer to the GneissWeb page for more details.
Developers: IBM Research
Release Date: Feb 21st, 2025
License: Apache 2.0.
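As a rough illustration of how a fastText classifier like this one could be applied to a single document, here is a minimal sketch using the `fasttext` Python package. The model path is a placeholder, and the label strings returned depend on how the model was trained; neither is specified on this page, so treat both as assumptions. The whitespace normalization is included because `predict` processes one line at a time and raises an error on text containing newlines.

```python
def prepare_text(text: str) -> str:
    """Collapse all whitespace runs, including newlines, into single spaces."""
    return " ".join(text.split())


def top_prediction(model_path: str, text: str) -> tuple[str, float]:
    """Return the top label and its probability for one document.

    model_path is a hypothetical local path to the downloaded model file.
    """
    import fasttext  # imported lazily so prepare_text works without the package

    model = fasttext.load_model(model_path)
    # For a single string, predict returns (labels, probabilities), top-k first.
    labels, probs = model.predict(prepare_text(text), k=1)
    return labels[0], float(probs[0])
```

In a filtering pipeline, the returned probability would typically be compared against a threshold chosen by the pipeline; the threshold GneissWeb uses is not stated here.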
Training Data
The model is trained on 800k documents labeled using the WatsonNLP hierarchical categorization. Please refer to the fastText text classification tutorial for details. The training data is selected as follows:
- Positive documents: 400k documents randomly sampled from the documents labeled with the medical category with a confidence score of 0.95 or above.
- Negative documents: 400k documents randomly sampled from the documents labeled with any category other than science, education, medical, and technology, with a confidence score of 0.95 or above.
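The selection rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used to build the model: each document is assumed to carry a single category string and a confidence score, and the category strings, the `__label__medical` / `__label__other` label names, and the function names are all hypothetical. Only the `__label__` prefix itself is the standard fastText supervised format.

```python
import random

# Categories that are never used as negatives, per the rule above.
EXCLUDED_FROM_NEGATIVES = {"science", "education", "medical", "technology"}


def select_training_docs(docs, n_per_class, threshold=0.95, seed=0):
    """Split (text, category, confidence) records into sampled positives and negatives."""
    positives, negatives = [], []
    for text, category, confidence in docs:
        if confidence < threshold:
            continue  # keep only confidently labeled documents
        if category == "medical":
            positives.append(text)
        elif category not in EXCLUDED_FROM_NEGATIVES:
            negatives.append(text)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    pos = rng.sample(positives, min(n_per_class, len(positives)))
    neg = rng.sample(negatives, min(n_per_class, len(negatives)))
    return pos, neg


def to_fasttext_lines(pos, neg):
    """Format texts as one-line fastText training examples (label names are assumed)."""
    lines = [f"__label__medical {' '.join(t.split())}" for t in pos]
    lines += [f"__label__other {' '.join(t.split())}" for t in neg]
    return lines
```

With `n_per_class=400_000`, this would yield the 800k-document balanced set described above, provided enough documents pass the confidence threshold in each group.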