SDR-Arena
Open arena
Benchmarking the generative personalization capabilities of LLMs
| Rank | Agent | Quality Score | Duration | Tokens/Prompt | Searches/Prompt | Prompts |
|---|---|---|---|---|---|---|
| 1 | STORM (Stanford STORM Team) | 42.5% | 128.4s | 39,250 | 12.3 | 179 |
| 2 | Qwen + WebSearch (DR-Bench Team) | 36.8% | 55.3s | 8,627 | 3.8 | 179 |
| 3 | Azure GPT-4o + WebSearch (DR-Bench Team) | 36.4% | 38.2s | 16,727 | 4.1 | 179 |
| 4 | Open Deep Research (LangChain Team) | 33.5% | 156.7s | 106,366 | 8.6 | 179 |
Browse individual benchmark prompts and compare how each agent responded.
Prompt #0
| Ground Truth Point | Score | Reasoning |
|---|---|---|
| The electronic CLIQ® locking system solved the ... | 3/5 | The candidate pitch captures the core value of reliability and user... |
| The eCLIQ locking system solved the need for ro... | 4/5 | The candidate pitch captures the use of 1,100 cylinders and the web... |
| The eCLIQ locking system solved the need for co... | 5/5 | The candidate pitch perfectly captures the GT point, including the ... |
| Ground Truth Point | Score | Reasoning |
|---|---|---|
| The electronic CLIQ® locking system solved the ... | 2/5 | The candidate pitch identifies the correct product and mentions imp... |
| The eCLIQ locking system solved the need for ro... | 1/5 | The candidate pitch vaguely mentions security improvements but does... |
| The eCLIQ locking system solved the need for co... | 1/5 | The candidate pitch does not address the flexibility of the system ... |
| Ground Truth Point | Score | Reasoning |
|---|---|---|
| The electronic CLIQ® locking system solved the ... | 2/5 | The candidate pitch mentions streamlining facility management and c... |
| The eCLIQ locking system solved the need for ro... | 1/5 | The candidate pitch vaguely touches on security but does not addres... |
| The eCLIQ locking system solved the need for co... | 0/5 | The candidate pitch does not address the need for consistent and re... |
| Ground Truth Point | Score | Reasoning |
|---|---|---|
| The electronic CLIQ® locking system solved the ... | 3/5 | The candidate pitch captures the idea of scalability and reliabilit... |
| The eCLIQ locking system solved the need for ro... | 2/5 | The candidate pitch mentions the 1,100 cylinders, which aligns with... |
| The eCLIQ locking system solved the need for co... | 3/5 | The candidate pitch captures the idea of scalability and expansion ... |
Upload a JSON file containing your agent's outputs for the benchmark prompts. Once uploaded, the SDR-Arena team will evaluate your results using our LLM-as-judge coverage scoring pipeline and add your agent to the leaderboard.
Expected Format
Your JSON file should follow this structure:
{
  "agent_name": "my-agent-v1",
  "agent_author": "Your Name",
  "agent_description": "Brief description...",
  "results": {
    "0": {
      "prompt_id": 0,
      "status": "completed",
      "output": "Agent's research report...",
      "duration_seconds": 45.2,
      "tokens": {
        "prompt_tokens": 12000,
        "completion_tokens": 3000,
        "total_tokens": 15000
      },
      "searches": [
        {
          "queries": ["query 1"],
          "num_results": 10
        }
      ]
    }
  }
}
Required Fields
- agent_name: Unique name for your agent
- results: Object keyed by prompt ID
Per result:
- status: "completed" or "failed"
- output: The agent's research report text
Optional (recommended):
- agent_author, agent_description
- duration_seconds, tokens, searches
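Before uploading, it can help to check a file against the required fields listed above. The following is a minimal sketch of such a check, not the validator the SDR-Arena upload tab actually runs. The function name `validate_submission` and the exact rules (for example, requiring `output` to be a string even for failed results) are assumptions made for illustration.

```python
ALLOWED_STATUSES = {"completed", "failed"}


def validate_submission(data):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: it only checks the required fields named in this
    document and ignores the optional ones.
    """
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]

    problems = []

    name = data.get("agent_name")
    if not isinstance(name, str) or not name.strip():
        problems.append("agent_name must be a non-empty string")

    results = data.get("results")
    if not isinstance(results, dict) or not results:
        problems.append("results must be a non-empty object keyed by prompt ID")
        return problems

    for prompt_id, entry in results.items():
        if not isinstance(entry, dict):
            problems.append(f"results[{prompt_id!r}] must be an object")
            continue
        if entry.get("status") not in ALLOWED_STATUSES:
            problems.append(
                f"results[{prompt_id!r}].status must be 'completed' or 'failed'"
            )
        # Assumption: output must be present as a string, possibly empty.
        if not isinstance(entry.get("output"), str):
            problems.append(f"results[{prompt_id!r}].output must be a string")

    return problems
```

To use it, load the file with `json.load` and pass the resulting dictionary to `validate_submission`; any returned messages point at the entries that need fixing.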
About SDR-Arena
The first framework for benchmarking the generative personalization capabilities of LLMs, grounded in Bayesian Persuasion theory. Agents act as Sales Development Representatives (SDRs), researching prospects via time-restricted web search and generating personalized pitch points scored against ground truth from real-world customer success stories.
🎯 What is SDR-Arena?
SDR-Arena evaluates how well LLM-based agents can perform generative personalization — the task of researching a prospect and articulating why a specific product addresses that prospect's needs. Each agent acts as an SDR:
- Receives a seller–buyer pair and a time boundary
- Researches the prospect through time-restricted web search (only information available before the original interaction date)
- Generates personalized value propositions (pitch points)
- Is scored against ground-truth pitch points extracted from real-world customer success stories
The framework is rooted in Bayesian Persuasion: the SDR agent must select and present information that maximally shifts the prospect's beliefs toward the value of the product.
Customer Success Stories
A public corpus of 6,200+ success stories across 22 industries and 200 enterprises, forming the SDR-Bench dataset for rigorous evaluation.
Historical Internet Simulation
A temporal boundary (Wt) prevents future data leakage — agents only see information available at the original interaction date, ensuring fair comparison across time periods.
LLM-as-Judge Coverage
Outputs scored on a 0–5 Likert scale measuring Sales Effectiveness and Factual Precision, aggregated into a Weighted Coverage Score.
Open Submissions
Anyone can run their own agent, collect outputs, and submit results. Upload a JSON file and the SDR-Arena team will evaluate and rank your agent on the leaderboard.
📑 Dataset: SDR-Bench Corpus
The benchmark draws from the SDR-Bench corpus — 6,279 customer success stories spanning 22 industries and 200 enterprises. Each success story captures a real seller–buyer engagement with documented value propositions.
Note: The underlying paper also studies emails and call transcripts, but this public leaderboard benchmarks exclusively on success stories. This ensures a consistent ground truth derived from published customer outcomes.
For each prompt the agent receives:
- The seller (product/company) and buyer (prospect company)
- A temporal boundary (Wt) — the date before which web information is considered available
- Instructions to generate personalized pitch points
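The temporal boundary can be pictured as a date filter over search results. The sketch below is one possible way an agent could apply it on its own side; how SDR-Arena enforces the boundary internally is not described here. The `SearchResult` type, its fields, and the choice to drop undated results are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SearchResult:
    """Hypothetical search hit; real search APIs expose different fields."""
    url: str
    published: Optional[date]  # None when the publication date is unknown


def filter_before_boundary(results: List[SearchResult], boundary: date) -> List[SearchResult]:
    """Keep only results published strictly before the temporal boundary.

    Assumption: undated results are dropped, since they cannot be shown
    to predate the boundary.
    """
    return [
        r for r in results
        if r.published is not None and r.published < boundary
    ]
```

Dropping undated pages is the conservative choice: it may discard useful material, but it avoids leaking information published after the original interaction.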
🧪 Evaluation: Coverage Judge
Agent outputs are evaluated using an LLM-as-Judge Coverage Scoring pipeline. Ground-truth pitch points are extracted from each success story, and a judge LLM evaluates how well the agent's output covers each ground-truth point on a 0–5 Likert scale:
| Score | Label | Meaning |
|---|---|---|
| 0 | Miss | No relevant mention of the ground-truth point |
| 1 | Marketing Fluff | Vague or generic claim without substance |
| 2 | Topic Match | Correct topic area but missing specific connection |
| 3 | Implied Match | Reasonable inference but not explicitly stated |
| 4 | Strong Argument | Clear, specific, and well-supported connection |
| 5 | Strategic Bullseye | Exact match with compelling, evidence-backed reasoning |
The Weighted Coverage Score (WCS) is computed as:
WCS = ∑ scores / (5 × N)
where N is the number of ground-truth points. The final quality score is expressed as a percentage (0–100%).
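As a worked example of the formula above, the sketch below computes the WCS for a single prompt from its per-point judge scores. The function name `weighted_coverage_score` is illustrative and not taken from the SDR-Arena code. The sample scores 3, 4 and 5 are the ones in the first Prompt #0 table earlier on this page. How per-prompt scores are combined into the leaderboard figure is not specified here, so the sketch stops at one prompt.

```python
def weighted_coverage_score(scores):
    """Compute WCS = sum(scores) / (5 * N) for one prompt.

    `scores` holds one 0-5 judge score per ground-truth point.
    Returns a fraction between 0.0 and 1.0.
    """
    if not scores:
        raise ValueError("at least one ground-truth score is required")
    for s in scores:
        if not 0 <= s <= 5:
            raise ValueError(f"score {s!r} is outside the 0-5 scale")
    return sum(scores) / (5 * len(scores))


# Scores from the first Prompt #0 table: 3/5, 4/5, 5/5.
wcs = weighted_coverage_score([3, 4, 5])
print(f"{wcs:.1%}")  # prints 80.0%
```

Here the three scores sum to 12 out of a possible 15, giving a quality score of 80% for that prompt.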
📏 Metrics
| Metric | Description |
|---|---|
| Quality Score | LLM-as-judge Weighted Coverage Score (0–100%) |
| Avg Duration | Average time per prompt in seconds |
| Tokens / Prompt | Average LLM tokens consumed per prompt |
| Searches / Prompt | Average number of web search queries per prompt |
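The three efficiency metrics can be derived from the optional fields of a submission. The sketch below shows one plausible way to do that. It assumes that "tokens" means `total_tokens`, that a search count is the number of query strings across all `searches` entries, and that results missing a field are simply left out of that average. The leaderboard's actual aggregation rules are not documented here, and `summarize_efficiency` is a hypothetical name.

```python
from statistics import mean


def summarize_efficiency(results):
    """Average the optional per-result metrics of a submission's `results`.

    Each average is None when no result carries the corresponding field.
    """
    durations, tokens, query_counts = [], [], []

    for entry in results.values():
        if "duration_seconds" in entry:
            durations.append(entry["duration_seconds"])
        if "total_tokens" in entry.get("tokens", {}):
            tokens.append(entry["tokens"]["total_tokens"])
        if "searches" in entry:
            # Assumption: one search entry may bundle several queries.
            query_counts.append(
                sum(len(s.get("queries", [])) for s in entry["searches"])
            )

    def avg(values):
        return mean(values) if values else None

    return {
        "avg_duration_seconds": avg(durations),
        "tokens_per_prompt": avg(tokens),
        "searches_per_prompt": avg(query_counts),
    }
```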
📄 Submission Format
Upload a JSON file with your agent's outputs. The expected structure:
{
  "agent_name": "my-agent-v1",
  "agent_author": "Your Name",
  "agent_description": "Brief description...",
  "results": {
    "0": {
      "prompt_id": 0,
      "status": "completed",
      "output": "Agent's personalized pitch points...",
      "duration_seconds": 45.2,
      "tokens": {
        "prompt_tokens": 12000,
        "completion_tokens": 3000,
        "total_tokens": 15000
      },
      "searches": [
        { "queries": ["query"], "num_results": 10 }
      ]
    }
  }
}
Each result entry should contain the agent's personalized pitch points for the corresponding benchmark prompt. Include timing, token, and search metrics if available.
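A file in this shape can be assembled with a small loop around the agent. The sketch below is one way to do it under stated assumptions: `run_agent` stands in for whatever callable produces your agent's pitch points, `prompts` is a hypothetical mapping from prompt ID to prompt text, and the actual layout of the SDR-Bench prompt dataset may differ. Token and search metrics are omitted because they depend on the agent's internals.

```python
import json
import time


def build_submission(agent_name, prompts, run_agent):
    """Run `run_agent` on every prompt and collect results in the upload format.

    `prompts` maps prompt ID -> prompt text (hypothetical layout).
    `run_agent` is any callable that takes the prompt text and returns a string.
    """
    results = {}
    for prompt_id, prompt_text in prompts.items():
        start = time.perf_counter()
        try:
            output = run_agent(prompt_text)
            status = "completed"
        except Exception:
            # Record the failure instead of aborting the whole run.
            output, status = "", "failed"
        results[str(prompt_id)] = {
            "prompt_id": prompt_id,
            "status": status,
            "output": output,
            "duration_seconds": round(time.perf_counter() - start, 3),
        }
    return {"agent_name": agent_name, "results": results}


def save_submission(submission, path):
    """Write the submission dictionary to `path` as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(submission, f, ensure_ascii=False, indent=2)
```

Catching exceptions per prompt keeps one failed run from discarding the outputs already collected; failed entries are still recorded with `"status": "failed"`.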
📤 How to Submit
- Run your agent on the SDR-Bench prompt dataset
- Collect the outputs into a JSON file following the format above
- Go to the Upload Results tab
- Upload your JSON file — it will be validated automatically
- The SDR-Arena team will evaluate your results and update the leaderboard
Built with Gradio · Hosted on Hugging Face Spaces