oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT Optimized WayraPPL

🚀 A100-optimized TensorRT version of WayraPPL for high-throughput perplexity estimation.

⚠️ Hardware Requirements

The prebuilt TensorRT engine requires an NVIDIA A100 GPU with:

  • GPU Architecture: sm_80 (A100-80GB)
  • CUDA: 12.8+
  • TensorRT: 10.13.x
  • Driver: 570.124.06+

🚀 Performance

  • Throughput: ~50,000+ samples/sec (A100)
  • Latency: <1ms per sample
  • Batch Size: Up to 2048
  • Memory: ~2GB GPU memory

📦 Installation

# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should be 10.13.x

🔧 Usage

Option 1: PyTorch Model (Standard)

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")

Option 2: TensorRT Engine (High Performance)

from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference
texts = ["Your text here"] * 1000  # Large batch
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
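
The engine accepts batches of up to 2048 sequences, so larger corpora need to be scored in chunks. A minimal sketch, assuming the model and tokenizer objects from the example above and that infer returns one perplexity value per input row:

import numpy as np

def score_corpus(texts, tokenizer, model, batch_size=2048):
    # Score a large corpus in engine-sized chunks; returns one PPL per document
    scores = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        enc = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        out = model.infer(enc["input_ids"].numpy(), enc["attention_mask"].numpy())
        scores.append(np.asarray(out).reshape(-1))  # assumes one score per input row
    return np.concatenate(scores)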

Files Included

  • PyTorch Model: Standard HuggingFace format

    • pytorch_model.bin - Model weights
    • config.json - Model configuration
    • tokenizer.json - Tokenizer
  • TensorRT Engine: A100-optimized

    • wayrappl_fp16_bs2048.engine - TensorRT engine (A100 only)
    • tensorrt_config.json - Engine configuration
    • tensorrt_inference.py - Inference code
    • tensorrt_requirements.txt - Dependencies

Use Cases

  • Semantic Filtering
  • Curriculum Learning
  • Large-scale dataset cleaning (millions of documents); see the filtering sketch after this list
  • Real-time perplexity estimation
  • High-throughput data quality assessment
  • Production MLOps pipelines
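
As an illustration of the dataset-cleaning use case, a hedged sketch that keeps only documents below a perplexity cutoff. The threshold and the ppl_scores array here are placeholders; the scores could come from the chunked scoring sketch in the usage section:

import numpy as np

documents = ["first document ...", "second document ..."]  # your corpus
ppl_scores = np.array([12.3, 187.5])                       # e.g. from score_corpus(...)

PPL_THRESHOLD = 100.0                                       # illustrative cutoff, tune per corpus
keep = ppl_scores < PPL_THRESHOLD
clean_docs = [doc for doc, k in zip(documents, keep) if k]
print(f"Kept {len(clean_docs)} / {len(documents)} documents")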

Model Details

  • Base: Knowledge distillation from meta-llama/Llama-3.2-1B
  • Architecture: GPT2-based Transformer blocks with perplexity heads
  • Languages: Spanish, Portuguese, English
  • Max Length: 512 tokens (see the padding note after this list)
  • Precision: FP16 (TensorRT), FP32 (PyTorch)
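
Because the maximum length is 512 tokens and TensorRT engines are commonly built for fixed input shapes, batches may need to be padded to the full 512 tokens before calling the engine. A sketch of that tokenization call, reusing the tokenizer from the usage examples (whether the shipped engine expects fixed or dynamic sequence lengths is an assumption to verify against tensorrt_config.json):

# Pad every sequence to exactly 512 tokens if the engine expects fixed shapes
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)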

⚡ Benchmarks (A100)

| Model Type     | Throughput (samples/sec) | Latency | Memory |
|----------------|--------------------------|---------|--------|
| Llama 3 1B     | ~200                     | 50ms    | 8GB    |
| Wayra PyTorch  | ~1,000                   | 10ms    | 4GB    |
| Wayra TensorRT | ~50,000                  | <1ms    | 2GB    |
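
A minimal sketch of how the TensorRT throughput figure can be measured, assuming the model and tokenizer objects from the usage section (the first call is treated as a warm-up because it typically pays one-time initialization costs):

import time

batch = 2048
enc = tokenizer(["Your text here"] * batch, return_tensors="pt",
                padding=True, truncation=True)
ids, mask = enc["input_ids"].numpy(), enc["attention_mask"].numpy()

model.infer(ids, mask)  # warm-up run

runs = 10
start = time.perf_counter()
for _ in range(runs):
    model.infer(ids, mask)
elapsed = time.perf_counter() - start
print(f"~{runs * batch / elapsed:,.0f} samples/sec")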

Troubleshooting

"TensorRT engine not compatible"

  • Ensure you're using an A100-SXM4-80GB GPU (sm_80 architecture)
  • Check CUDA version: nvidia-smi (should report 12.8+)
  • Verify TensorRT: python -c "import tensorrt; print(tensorrt.__version__)" (should print 10.13.x)

"CUDA out of memory"

  • Reduce batch size in inference
  • Use gradient checkpointing if training

Citation

@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}

License

Apache 2.0 - See LICENSE file


Note: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.
