ToolCallVerifier - Unauthorized Tool Call Detection

Stage 2 of Two-Stage LLM Agent Defense Pipeline


🎯 What This Model Does

ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

| Label | Description |
| --- | --- |
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected/malicious content (BLOCK) |
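
The per-token labels still have to be turned into a single allow/block decision. The following is a minimal sketch of one way to do that, not the model's official post-processing. It assumes each token arrives as a dict with `entity`, `score`, `start`, and `end` keys (the shape produced by the Transformers token-classification pipeline without aggregation) and that the label strings match the table above. The function names and the 0.5 threshold are hypothetical.

```python
from typing import Dict, List, Tuple

UNAUTHORIZED = "UNAUTHORIZED"  # assumed to match the model's label name


def flagged_spans(tokens: List[Dict], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Merge adjacent UNAUTHORIZED tokens into (start, end) character spans."""
    spans: List[Tuple[int, int]] = []
    for tok in tokens:
        if tok["entity"] != UNAUTHORIZED or tok["score"] < threshold:
            continue
        start, end = tok["start"], tok["end"]
        # Extend the previous span when this token touches or follows it closely.
        if spans and start <= spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


def should_block(tokens: List[Dict], threshold: float = 0.5) -> bool:
    """Block the tool call if any token is confidently UNAUTHORIZED."""
    return bool(flagged_spans(tokens, threshold))


# Hand-written token predictions for illustration only. In practice they might
# come from transformers.pipeline("token-classification", model=...) applied
# to the serialized tool call.
call = '{"to": "bob@example.com", "bcc": "attacker@evil.test"}'


def fake_token(text: str, label: str, score: float) -> Dict:
    start = call.index(text)
    return {"entity": label, "score": score, "start": start, "end": start + len(text)}


tokens = [
    fake_token("bob@example.com", "AUTHORIZED", 0.99),
    fake_token("attacker", "UNAUTHORIZED", 0.97),
    fake_token("@evil.test", "UNAUTHORIZED", 0.95),
]
suspicious = [call[s:e] for s, e in flagged_spans(tokens)]
print(should_block(tokens), suspicious)
```

Blocking on any single flagged token is the strictest policy. A deployment that sees too many false positives could instead require a minimum span length or a higher threshold.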

🚨 Attack Categories Covered

| Category | Source | Description |
| --- | --- | --- |
| Delimiter Injection | LLMail | <<end_context>>, >>}}]]) |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | START_USER_SESSION, EXECUTE_USERQUERY |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | <execute_action>, <tool_call> |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |

🔗 Integration with FunctionCallSentinel

This model is Stage 2 of a two-stage defense pipeline:

┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │           ToolCallVerifier (This Model)                 │
                               │   Token-level verification before tool execution        │
                               └─────────────────────────────────────────────────────────┘

| Scenario | Recommendation |
| --- | --- |
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
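
The flow in the diagram and the recommendations above can be wired together roughly as follows. This is a sketch under stated assumptions rather than a reference integration: the scenario keys, the `run_guarded` function, and the four callables are hypothetical stand-ins for Stage 1 (FunctionCallSentinel), the LLM, Stage 2 (this model), and the tool executor. Passing the prompt to Stage 2 is also an assumption.

```python
from typing import Any, Callable, Dict, Tuple

# Mirrors the recommendation table; the keys are hypothetical identifiers.
STAGES_BY_SCENARIO: Dict[str, Tuple[str, ...]] = {
    "general_chatbot": ("stage1",),
    "tool_agent_low_risk": ("stage1",),
    "tool_agent_high_risk": ("stage1", "stage2"),
    "email_or_file_access": ("stage1", "stage2"),
    "financial_transactions": ("stage1", "stage2"),
}


def run_guarded(
    prompt: str,
    scenario: str,
    prompt_is_safe: Callable[[str], bool],                # Stage 1 check
    call_llm: Callable[[str], Any],                       # returns a tool call
    tool_call_is_authorized: Callable[[str, Any], bool],  # Stage 2 check
    execute_tool: Callable[[Any], Any],
) -> Any:
    """Run one request through whichever stages the scenario calls for."""
    stages = STAGES_BY_SCENARIO[scenario]
    if "stage1" in stages and not prompt_is_safe(prompt):
        return "blocked: stage 1"
    tool_call = call_llm(prompt)
    # Stage 2 runs after the LLM has produced a tool call and before execution.
    if "stage2" in stages and not tool_call_is_authorized(prompt, tool_call):
        return "blocked: stage 2"
    return execute_tool(tool_call)


# Stub components for illustration only.
result = run_guarded(
    prompt="Summarize my inbox",
    scenario="email_or_file_access",
    prompt_is_safe=lambda p: True,
    call_llm=lambda p: {"name": "send_email", "arguments": {"to": "attacker@evil.test"}},
    tool_call_is_authorized=lambda p, call: call["name"] != "send_email",
    execute_tool=lambda call: "executed",
)
print(result)  # the stub Stage 2 rejects send_email, so this prints "blocked: stage 2"
```

Keeping the stages as injected callables lets the same wrapper serve low-risk deployments (Stage 1 only) and high-risk ones (both stages) without code changes.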

🎯 Intended Use

Primary Use Cases

  • LLM Agent Security: Verify tool calls before execution
  • Prompt Injection Defense: Detect unauthorized actions from injected prompts
  • API Gateway Protection: Filter malicious tool calls at infrastructure level

Out of Scope

  • General text classification
  • Non-tool-calling scenarios
  • Languages other than English

📜 License

Apache 2.0

Model size: 0.1B params · Tensor type: F32 · Format: Safetensors