SDR-Arena


Benchmarking the generative personalization capabilities of LLMs

- Agents: 4
- Benchmark prompts: 179
- Best quality score: 42.5%
- Fastest average duration: 38.2s
| Rank | Agent | Team | Quality Score | Avg Duration | Tokens/Prompt | Searches/Prompt | Prompts |
|------|-------|------|---------------|--------------|---------------|-----------------|---------|
| 1 | STORM | Stanford STORM Team | 42.5% | 128.4s | 39,250 | 12.3 | 179 |
| 2 | Qwen + WebSearch | DR-Bench Team | 36.8% | 55.3s | 8,627 | 3.8 | 179 |
| 3 | Azure GPT-4o + WebSearch | DR-Bench Team | 36.4% | 38.2s | 16,727 | 4.1 | 179 |
| 4 | Open Deep Research | LangChain Team | 33.5% | 156.7s | 106,366 | 8.6 | 179 |
Quality scores are based on 179 evaluated prompts, rated for coverage by an LLM-as-Judge on a 0-5 Likert scale.
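One plausible reading of the percentage scores is the mean 0-5 Likert coverage rating normalized by the maximum rating (e.g. a mean rating of 2.125/5 corresponds to 42.5%). The exact aggregation used by SDR-Arena is not stated here, so the sketch below is an assumption, not the benchmark's documented formula:

```python
def quality_score(likert_ratings):
    """Convert per-prompt Likert coverage ratings (0-5) into a percentage.

    Assumed aggregation: mean rating divided by the maximum rating (5),
    times 100. SDR-Arena's actual formula may differ.
    """
    if not likert_ratings:
        raise ValueError("need at least one rating")
    return 100.0 * sum(likert_ratings) / (5 * len(likert_ratings))

# Hypothetical ratings whose mean is 2.125/5:
print(quality_score([2, 2, 2.5, 2]))  # → 42.5
```

Under this assumption, even the top agent's 42.5% corresponds to a mean coverage rating just above 2 out of 5, which gives a rough sense of the headroom left on the benchmark.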
