🚦 La Route 2.0 — AI Prompt Router

La Route 2.0 is like a GPS for AI prompts.
When you give it a piece of text (a question, a request, or any message), it analyzes the prompt and decides:

  • How sensitive the content is (low / high)
  • What size model you need (small / large)
  • Which tool is best suited to answer it (an offline LLM, an LLM with extra research abilities, or a search engine)

The goal: ✅ save resources, improve safety, and get better answers by sending each prompt to the right place instead of using the same heavy model for everything.


📊 What It Predicts

| Task        | Labels                                              |
|-------------|-----------------------------------------------------|
| Sensitivity | low, high                                           |
| Model size  | small, large                                        |
| Best tool   | LLM-with-research-mode, Offline-LLM, Search-engine  |
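
The usage example below reads these labels from label_maps.json in the repo. Judging from how the code indexes that file (label_maps[task][str(pred)]), it plausibly maps each task to id → label entries, roughly as sketched here; the task key names and id ordering are assumptions, and the file shipped in the repo is authoritative:

{
  "sensitivity": {"0": "low", "1": "high"},
  "model_size": {"0": "small", "1": "large"},
  "best_tool": {"0": "LLM-with-research-mode", "1": "Offline-LLM", "2": "Search-engine"}
}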

🔎 How It Works (In Simple Terms)

  1. You send a prompt (e.g. "Who is the Prime Minister of Canada?")
  2. The model classifies it:
    • Sensitivity → Low
    • Model size → Small
    • Best tool → Search engine
  3. The system then routes the prompt to the cheapest, safest, or most efficient tool.

It’s like a traffic controller for prompts — making sure each one takes the best route to the right “answering engine.”
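
To make step 3 concrete, here is a minimal routing sketch. It reuses the classify_text helper defined in the usage example below; the task keys ("sensitivity", "best_tool") and the backend functions (call_offline_llm, call_search_engine, call_research_llm) are hypothetical placeholders, not part of this repo.

def route_prompt(text):
    """Hypothetical router: pick a backend from the classifier's predictions."""
    preds = classify_text(text)  # see the usage example below
    # Sensitive prompts stay on a secure, on-premise model regardless of tool.
    if preds["sensitivity"]["label"] == "high":
        return call_offline_llm(text)       # placeholder backend
    tool = preds["best_tool"]["label"]
    if tool == "Search-engine":
        return call_search_engine(text)     # placeholder backend
    if tool == "LLM-with-research-mode":
        return call_research_llm(text)      # placeholder backend
    return call_offline_llm(text)           # placeholder backend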


🖼️ Workflow Diagram


User Prompt
     │
     ▼
Shared ModernBERT Encoder
     │
     ├── Sensitivity → low/high
     ├── Model Size → small/large
     └── Best Tool → LLM-with-research-mode / Offline-LLM / Search-engine
     │
     ▼
 Route to Best Model for Answer

💡 Why use La Route 2.0?

  • ⚖️ Safer by design: Prompts are automatically routed to the most appropriate model. Instead of forcing all requests through the strictest (or loosest) setup, you can use cloud LLMs for everyday, non‑sensitive queries and keep sensitive prompts on secure, on‑premise models.
  • 💸 More efficient: Don’t waste compute on heavyweight models when a smaller one will do. Routing cuts cost, energy use, and latency by balancing resources intelligently.
  • 🛠 Right tool for the job: Not all prompts need an LLM. For factual lookups, a search engine may be faster and more accurate. For longer reasoning, a research‑mode LLM is better. Routing ensures each request is solved by the tool best suited to it.

🔧 Quick Usage Example

import json

import torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModel

repo_id = "monsimas/la-route-2"
model_dir = snapshot_download(repo_id)  # download weights, tokenizer, and label maps locally

tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Load label maps (id -> label per task) and the number of classes per head
with open(f"{model_dir}/label_maps.json") as f:
    label_maps = json.load(f)
with open(f"{model_dir}/num_labels.json") as f:
    num_labels_dict = json.load(f)

# Define the multitask model: a shared encoder with one linear head per task
class MultiTaskModel(torch.nn.Module):
    def __init__(self, shared_model, num_labels_dict):
        super().__init__()
        self.shared_model = shared_model
        h = shared_model.config.hidden_size
        self.heads = torch.nn.ModuleDict({
            task: torch.nn.Linear(h, n) for task, n in num_labels_dict.items()
        })

    def forward(self, input_ids, attention_mask):
        out = self.shared_model(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS]-token pooling
        return {task: self.heads[task](pooled) for task in self.heads}

# Load base encoder + multitask heads
base_model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")
model = MultiTaskModel(base_model, num_labels_dict)
state_dict = torch.load(f"{model_dir}/model_state.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()

def classify_text(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=384, padding=True)
    with torch.no_grad():
        logits = model(**inputs)  # dict of per-task logits
    predictions = {}
    for task, logit in logits.items():
        probs = F.softmax(logit, dim=-1)
        pred = torch.argmax(probs, dim=-1).item()
        predictions[task] = {
            "label": label_maps[task][str(pred)],  # map class id back to its label
            "confidence": float(probs[0, pred]),
        }
    return predictions

print(classify_text("Who is the Prime Minister of Canada?"))
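
For this prompt, the walkthrough above predicts low sensitivity, a small model, and the search engine as the best tool, so the printed dict should look roughly like the following (task keys and confidence values are illustrative):

{
  "sensitivity": {"label": "low", "confidence": 0.99},
  "model_size": {"label": "small", "confidence": 0.97},
  "best_tool": {"label": "Search-engine", "confidence": 0.95}
}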

🛠️ Training Details

  • Base model: answerdotai/ModernBERT-base
  • Data: Compar:IA-conversations + ShareGPT (augmented for coverage)
  • Max length: 384 tokens
  • Batch size: 8
  • Learning rate: 5e‑5
  • Multitask heads: Sensitivity, Model Size, Best Tool
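
The card does not spell out the training loop, but with a shared encoder and three heads the standard recipe is to sum one cross-entropy loss per task. Below is a minimal sketch under that assumption, reusing the MultiTaskModel from the usage example; the AdamW optimizer and the batch layout (a "labels" dict keyed by task) are assumptions, not confirmed details:

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def training_step(batch):
    """One multitask step: sum cross-entropy over all heads (illustrative)."""
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = sum(F.cross_entropy(logits[task], batch["labels"][task]) for task in logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()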

⚖️ Limitations

  • Tool and label definitions are domain-specific.
  • The classifier does not generate answers itself — only routes prompts.
  • Sensitivity classification may mislabel edge cases.

