oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT Optimized WayraPPL

🚀 A100-optimized TensorRT version of WayraPPL for high-throughput perplexity estimation.

⚠️ Hardware Requirements

The prebuilt TensorRT engine requires an NVIDIA A100 GPU with:

  • GPU Architecture: sm_80 (A100-80GB)
  • CUDA: 12.8+
  • TensorRT: 10.13.x
  • Driver: 570.124.06+

🚀 Performance

  • Throughput: ~50,000+ samples/sec (A100)
  • Latency: <1ms per sample
  • Batch Size: Up to 2048
  • Memory: ~2GB GPU memory

📦 Installation

# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should be 10.13.x

🔧 Usage

Option 1: PyTorch Model (Standard)

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")

Option 2: TensorRT Engine (High Performance)

from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference
texts = ["Your text here"] * 1000  # Large batch
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
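
The engine accepts batches of up to 2048 sequences, so larger corpora need to be scored in chunks. A minimal sketch, assuming the model and tokenizer objects from the example above and that infer returns one perplexity value per input row:

import numpy as np

def score_corpus(texts, tokenizer, model, batch_size=2048):
    # Score a large corpus in engine-sized chunks; returns one PPL per document
    scores = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        enc = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        out = model.infer(enc["input_ids"].numpy(), enc["attention_mask"].numpy())
        scores.append(np.asarray(out).reshape(-1))  # assumes one score per input row
    return np.concatenate(scores)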

Files Included

  • PyTorch Model: Standard HuggingFace format

    • pytorch_model.bin - Model weights
    • config.json - Model configuration
    • tokenizer.json - Tokenizer
  • TensorRT Engine: A100-optimized

    • wayrappl_fp16_bs2048.engine - TensorRT engine (A100 only)
    • tensorrt_config.json - Engine configuration
    • tensorrt_inference.py - Inference code
    • tensorrt_requirements.txt - Dependencies

Use Cases

  • Semantic Filtering
  • Curriculum Learning
  • Large-scale dataset cleaning (millions of documents); see the filtering sketch after this list
  • Real-time perplexity estimation
  • High-throughput data quality assessment
  • Production MLOps pipelines
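
As an illustration of the dataset-cleaning use case, a hedged sketch that keeps only documents below a perplexity cutoff. The threshold and the ppl_scores array here are placeholders; the scores could come from the chunked scoring sketch in the usage section:

import numpy as np

documents = ["first document ...", "second document ..."]  # your corpus
ppl_scores = np.array([12.3, 187.5])                       # e.g. from score_corpus(...)

PPL_THRESHOLD = 100.0                                       # illustrative cutoff, tune per corpus
keep = ppl_scores < PPL_THRESHOLD
clean_docs = [doc for doc, k in zip(documents, keep) if k]
print(f"Kept {len(clean_docs)} / {len(documents)} documents")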

Model Details

  • Base: Knowledge distillation from meta-llama/Llama-3.2-1B
  • Architecture: GPT2-based Transformer blocks with perplexity heads
  • Languages: Spanish, Portuguese, English
  • Max Length: 512 tokens (see the padding note after this list)
  • Precision: FP16 (TensorRT), FP32 (PyTorch)
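
Because the maximum length is 512 tokens and TensorRT engines are commonly built for fixed input shapes, batches may need to be padded to the full 512 tokens before calling the engine. A sketch of that tokenization call, reusing the tokenizer from the usage examples (whether the shipped engine expects fixed or dynamic sequence lengths is an assumption to verify against tensorrt_config.json):

# Pad every sequence to exactly 512 tokens if the engine expects fixed shapes
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)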

⚡ Benchmarks (A100)

| Model Type     | Throughput (samples/sec) | Latency | Memory |
|----------------|--------------------------|---------|--------|
| Llama 3 1B     | ~200                     | 50ms    | 8GB    |
| Wayra PyTorch  | ~1,000                   | 10ms    | 4GB    |
| Wayra TensorRT | ~50,000                  | <1ms    | 2GB    |
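
A minimal sketch of how the TensorRT throughput figure can be measured, assuming the model and tokenizer objects from the usage section (the first call is treated as a warm-up because it typically pays one-time initialization costs):

import time

batch = 2048
enc = tokenizer(["Your text here"] * batch, return_tensors="pt",
                padding=True, truncation=True)
ids, mask = enc["input_ids"].numpy(), enc["attention_mask"].numpy()

model.infer(ids, mask)  # warm-up run

runs = 10
start = time.perf_counter()
for _ in range(runs):
    model.infer(ids, mask)
elapsed = time.perf_counter() - start
print(f"~{runs * batch / elapsed:,.0f} samples/sec")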

Troubleshooting

"TensorRT engine not compatible"

  • Ensure you're using an A100-SXM4-80GB GPU (sm_80 architecture)
  • Check CUDA version: nvidia-smi (should report 12.8+)
  • Verify TensorRT: python -c "import tensorrt; print(tensorrt.__version__)" (should print 10.13.x)

"CUDA out of memory"

  • Reduce batch size in inference
  • Use gradient checkpointing if training

Citation

@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}

License

Apache 2.0 - See LICENSE file


Note: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.
