oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT Optimized WayraPPL
🚀 A100-optimized TensorRT version of WayraPPL for high-throughput perplexity estimation.
⚠️ Hardware Requirements
This model works on NVIDIA A100 GPUs with:
- GPU Architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+
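To confirm your GPU matches the engine's target architecture before installing anything, here is a minimal check, assuming PyTorch with CUDA support is available:

```python
# Minimal environment check (sketch; assumes PyTorch with CUDA support is installed)
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")       # A100 reports sm_80
print(f"CUDA (PyTorch build): {torch.version.cuda}")  # should be 12.x
```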
🚀 Performance
- Throughput: ~50,000+ samples/sec (A100)
- Latency: <1ms per sample
- Batch Size: Up to 2048
- Memory: ~2GB GPU memory
📦 Installation
```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # should print 10.13.x
```
🔧 Usage
Option 1: PyTorch Model (Standard)
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")
```
Option 2: TensorRT Engine (High Performance)
```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT engine (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference
texts = ["Your text here"] * 1000  # large batch
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
```
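For corpora larger than the engine's maximum batch size (2048), one simple pattern is to score in engine-sized chunks. This is a sketch under the assumption that `WayraPPLTensorRT.infer` returns one perplexity value per input row as a NumPy array; `score_corpus` is a hypothetical helper, not part of the released code.

```python
import numpy as np

def score_corpus(texts, tokenizer, model, batch_size=2048):
    """Score a large list of texts in engine-sized chunks (hypothetical helper)."""
    scores = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        enc = tokenizer(chunk, return_tensors="np", padding=True,
                        truncation=True, max_length=512)
        scores.append(model.infer(enc["input_ids"], enc["attention_mask"]))
    return np.concatenate(scores)
```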
Files Included
PyTorch Model (standard Hugging Face format):
- `pytorch_model.bin`: model weights
- `config.json`: model configuration
- `tokenizer.json`: tokenizer

TensorRT Engine (A100-optimized):
- `wayrappl_fp16_bs2048.engine`: TensorRT engine (A100 only)
- `tensorrt_config.json`: engine configuration
- `tensorrt_inference.py`: inference code
- `tensorrt_requirements.txt`: dependencies
Use Cases
- Semantic Filtering
- Curriculum Learning
- Large-scale dataset cleaning (millions of documents)
- Real-time perplexity estimation
- High-throughput data quality assessment
- Production MLOps pipelines
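As an illustration of perplexity-based filtering with the PyTorch model (a hedged sketch; `filter_by_ppl` and the threshold value are placeholders, not part of the released code):

```python
import torch

def filter_by_ppl(texts, tokenizer, model, threshold=8.0, device="cuda"):
    """Keep only texts whose estimated perplexity is below `threshold`.
    Assumes `model` has already been moved to `device` and set to eval mode."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512).to(device)
    with torch.no_grad():
        ppl = model(**inputs)["ppl"]   # assumes one score per input, as in the usage example
    keep = ppl < threshold
    return [text for text, ok in zip(texts, keep.tolist()) if ok]
```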
Model Details
- Base: Knowledge distillation from meta-llama/Llama-3.2-1B
- Architecture: GPT2-based Transformer blocks with perplexity heads
- Languages: Spanish, Portuguese, English
- Max Length: 512 tokens
- Precision: FP16 (TensorRT), FP32 (PyTorch)
⚡ Benchmarks (A100)
| Model Type | Throughput | Latency | Memory |
|---|---|---|---|
| Llama 3 1B | ~200 samples/sec | 50 ms | 8 GB |
| Wayra PyTorch | ~1,000 samples/sec | 10 ms | 4 GB |
| Wayra TensorRT | ~50,000 samples/sec | <1 ms | 2 GB |
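To reproduce a rough throughput figure on your own hardware, a minimal timing sketch (assuming the `model` and `tokenizer` from the TensorRT usage section above, and that `infer` blocks until results are ready):

```python
import time

texts = ["sample document"] * 2048                     # one full engine batch
enc = tokenizer(texts, return_tensors="np", padding=True,
                truncation=True, max_length=512)

model.infer(enc["input_ids"], enc["attention_mask"])   # warm-up run

start = time.perf_counter()
model.infer(enc["input_ids"], enc["attention_mask"])
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:,.0f} samples/sec")
```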
Troubleshooting
"TensorRT engine not compatible"
- Ensure you're using an A100-SXM4-80GB GPU (sm_80 architecture)
- Check the CUDA version with `nvidia-smi` (should be 12.8+)
- Verify TensorRT with `python -c "import tensorrt; print(tensorrt.__version__)"` (should print 10.13.x)
"CUDA out of memory"
- Reduce batch size in inference
- Use gradient checkpointing if training
Citation
```bibtex
@software{WayraPPL,
  title  = {WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author = {Omar U. Florez and LatamGPT Team},
  year   = {2025},
  url    = {https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```
License
Apache 2.0 - See LICENSE file
Note: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or retrain the TensorRT engine for your specific hardware.