Model Summary

To support reproduction of GneissWeb, we provide GneissWeb.Tech_classifier, a fastText classifier for the technology category. This model is used as part of the GneissWeb ensemble filter to detect documents with technology content.

Please refer to GneissWeb for more details.

     Developers: IBM Research

     Release Date: Feb 21st, 2025

     License: Apache 2.0.
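
As a rough illustration of how a fastText model like this one could be used to score a document, the sketch below wraps `predict` from the `fasttext` Python package. Several names are assumptions, not confirmed by this card: the model file name (`model.bin`), the positive label string (`__label__technology`), and the helper names `load_classifier` and `technology_score`. Check the repository files and the model's own label list before relying on any of them.

```python
# Minimal sketch, not the official GneissWeb pipeline.
# ASSUMPTIONS (hypothetical, verify against the repository):
#   - the model file is named "model.bin"
#   - the positive class is labeled "__label__technology"

TECH_LABEL = "__label__technology"  # assumed label name


def load_classifier(repo_id="ibm-granite/GneissWeb.Tech_classifier",
                    filename="model.bin"):
    """Download the model file from the Hugging Face Hub and load it.

    Imports are local so the scoring helper below can be used (and tested)
    without fasttext or huggingface_hub installed.
    """
    from huggingface_hub import hf_hub_download
    import fasttext

    path = hf_hub_download(repo_id=repo_id, filename=filename)
    return fasttext.load_model(path)


def technology_score(model, text, label=TECH_LABEL):
    """Return the probability the model assigns to `label` (0.0 if absent)."""
    # fastText's predict() handles one line at a time, so collapse newlines
    # and other whitespace runs into single spaces first.
    single_line = " ".join(text.split())
    # k=-1 asks for all labels with their probabilities.
    labels, probs = model.predict(single_line, k=-1)
    scores = dict(zip(labels, probs))
    return float(scores.get(label, 0.0))


# Example usage (requires network access and the packages above):
#   model = load_classifier()
#   print(model.get_labels())  # confirm the actual label names first
#   print(technology_score(model, "New GPU architecture doubles throughput"))
```

How the resulting score is thresholded or combined with the other components of the ensemble filter is not specified here; see the GneissWeb documentation for that.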

Training Data

The model is trained on 800k documents labeled using the WatsonNLP hierarchical categorization. Please refer to the fastText text classification tutorial for details. Training data is selected as follows:

  • Positive documents: 400k documents randomly sampled from the documents labeled with the technology category with a confidence score of 0.95 and above.
  • Negative documents: 400k documents randomly sampled from the documents labeled with any category other than science, education, medical, and technology categories with a confidence score of 0.95 and above.
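
The selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`text`, `category`, `confidence`), the label strings, and the function names are hypothetical, and each document is assumed to carry a single top category, which simplifies the hierarchical categorization.

```python
import random

# Categories that may not appear among the negatives, per the rules above.
EXCLUDED_FROM_NEGATIVES = {"science", "education", "medical", "technology"}

POS_LABEL = "__label__technology"  # assumed label name
NEG_LABEL = "__label__other"       # assumed label name


def to_fasttext_line(label, text):
    """Format one example as '<label> <text>' on a single line."""
    # fastText expects one example per line, so collapse all whitespace.
    return f"{label} {' '.join(text.split())}"


def select_training_lines(records, n_per_class, min_conf=0.95, seed=0):
    """Sample n_per_class positives and negatives and return fastText lines.

    `records` is an iterable of dicts with hypothetical keys
    'text', 'category', and 'confidence'.
    Raises ValueError if either pool has fewer than n_per_class documents.
    """
    records = list(records)
    confident = [r for r in records if r["confidence"] >= min_conf]
    positives = [r for r in confident if r["category"] == "technology"]
    negatives = [r for r in confident
                 if r["category"] not in EXCLUDED_FROM_NEGATIVES]

    rng = random.Random(seed)  # fixed seed for a reproducible sample
    sampled_pos = rng.sample(positives, n_per_class)
    sampled_neg = rng.sample(negatives, n_per_class)

    lines = [to_fasttext_line(POS_LABEL, r["text"]) for r in sampled_pos]
    lines += [to_fasttext_line(NEG_LABEL, r["text"]) for r in sampled_neg]
    rng.shuffle(lines)  # avoid all positives preceding all negatives
    return lines
```

With `n_per_class=400_000` this would yield the 800k-document training set described above. The lines could then be written to a file and passed to `fasttext.train_supervised(input=...)`; the hyperparameters actually used for this model are not stated here.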
