metadata
viewer: false
license:
  - apache-2.0
language:
  - en

Model Summary

To make GneissWeb reproducible, we provide GneissWeb.Sci_classifier, a fastText classifier for the science category. This model is used as part of the GneissWeb ensemble filter to detect documents with science content.

Please refer to GneissWeb for more details.

     Developers: IBM Research

     Release Date: Feb 21st, 2025

     License: Apache 2.0.
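
As a rough illustration of how a fastText classifier like this one can be used to score a document, here is a minimal sketch. It assumes the `fasttext` Python package is installed and the model file has been downloaded locally. The label name `__label__science` and the helper names are assumptions made for illustration, not confirmed by this model card; check the labels of the loaded model with `model.get_labels()` before relying on them.

```python
def load_classifier(model_path):
    """Load a fastText model from a local file (path supplied by the caller)."""
    import fasttext  # requires: pip install fasttext

    return fasttext.load_model(model_path)


def science_probability(model, text, science_label="__label__science"):
    """Return the probability the model assigns to the science label.

    `science_label` is a hypothetical label name; adjust it to whatever
    the loaded model actually reports.
    """
    # fastText predicts on a single line of text, so collapse all
    # whitespace (including newlines) into single spaces first.
    single_line = " ".join(text.split())
    # k=-1 asks fastText for every label with its probability.
    labels, probs = model.predict(single_line, k=-1)
    for label, prob in zip(labels, probs):
        if label == science_label:
            return float(prob)
    return 0.0
```

A typical call would be `science_probability(load_classifier(path_to_model), document_text)`, where `path_to_model` points at the downloaded model file. How the resulting score is combined with the other filters in the ensemble is described in GneissWeb, not here.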

Training Data

The model is trained on 800k documents labeled using WatsonNLP hierarchical categorization. Please refer to the fastText text classification tutorial for details. Training data is selected as follows:

  • Positive documents: 400k documents randomly sampled from the documents labeled with the science category with a confidence score of 0.95 or above.
  • Negative documents: 400k documents randomly sampled from the documents labeled with any category other than science, education, medical, or technology, with a confidence score of 0.95 or above.
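
The selection rule above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the original pipeline: the record fields (`text`, `category`, `confidence`), the category strings, and the output labels `__label__science` and `__label__other` are all hypothetical. Only the 0.95 threshold, the excluded categories, and the balanced sampling come from the description above; the `__label__` prefix is fastText's default label marker.

```python
import random

# Categories that may not be used as negatives (names are assumed spellings).
EXCLUDED_FROM_NEGATIVES = {"science", "education", "medical", "technology"}


def to_fasttext_line(label, text):
    """Format one training example as '<label> <text>' on a single line."""
    return f"{label} {' '.join(text.split())}"


def build_training_lines(docs, per_class, threshold=0.95, seed=0):
    """Select high-confidence positives and negatives and label them.

    docs: iterable of dicts with 'text', 'category', 'confidence' keys
    per_class: number of documents to sample per class (400k in the card)
    """
    docs = list(docs)
    positives = [
        d for d in docs
        if d["category"] == "science" and d["confidence"] >= threshold
    ]
    negatives = [
        d for d in docs
        if d["category"] not in EXCLUDED_FROM_NEGATIVES
        and d["confidence"] >= threshold
    ]

    rng = random.Random(seed)
    # Sample without replacement, capped at what is available.
    pos_sample = rng.sample(positives, min(per_class, len(positives)))
    neg_sample = rng.sample(negatives, min(per_class, len(negatives)))

    lines = [to_fasttext_line("__label__science", d["text"]) for d in pos_sample]
    lines += [to_fasttext_line("__label__other", d["text"]) for d in neg_sample]
    rng.shuffle(lines)  # avoid feeding all positives before all negatives
    return lines
```

Writing the returned lines to a text file, one per line, gives a file in the input format that fastText's supervised training expects. Documents in the education, medical, and technology categories are left out of both classes, presumably because they overlap with science content.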