Benchmarks

Realistic benchmarks for financial AI.

We evaluate AI models on the tasks that matter most to financial institutions, using real data, realistic scenarios, and domain-relevant metrics.

PIBench - Prompt Injection Resistance

Updated Jun 30, 2026

The first benchmark of prompt-injection resistance for agentic underwriting. Measures defense success across 16 frontier models, three providers, and five attack vectors—with and without untrusted-content tagging.

Koen Roelofs, Jakob Schmitt, Maximilian Eber, PhD

Task types

Prompt-injection defense
Untrusted-content tagging
Attack-vector analysis
False-positive testing

Data source

78 cases (53 injections, 25 benign) hand-authored by underwriting and security experts

Evaluation method

Defense Success Rate across 5 attack vectors, with and without tagging
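
This page does not define Defense Success Rate. The following is a minimal sketch, assuming it is the share of injection cases in which the model did not act on the injected instruction, reported per attack vector and per tagging condition. The record fields (`vector`, `tagged`, `defended`) and the vector names are hypothetical and not taken from the benchmark.

```python
from collections import defaultdict


def defense_success_rate(results):
    """Return the defense success rate per (vector, tagged) group.

    `results` is an iterable of dicts with hypothetical keys:
      'vector'   - name of the attack vector
      'tagged'   - True if untrusted content was tagged
      'defended' - True if the model resisted the injection
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for record in results:
        key = (record["vector"], record["tagged"])
        totals[key] += 1
        successes[key] += int(record["defended"])
    return {key: successes[key] / totals[key] for key in totals}


# Illustrative records only; these are not benchmark data.
example_results = [
    {"vector": "vector_a", "tagged": True, "defended": True},
    {"vector": "vector_a", "tagged": True, "defended": False},
    {"vector": "vector_a", "tagged": False, "defended": False},
    {"vector": "vector_b", "tagged": True, "defended": True},
]
rates = defense_success_rate(example_results)
```

Grouping by both vector and tagging condition keeps the with-tagging and without-tagging results directly comparable for each vector.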

Last updated

2026-06-30


KYBench - Adverse Media Search

Updated Apr 2, 2026

Evaluating AI agents for adverse media research in Know Your Business reviews. Tests how well AI systems investigate regulatory red flags, fraud history, and sanctions violations across real businesses.

David Ahn, Maximilian Eber, PhD, Sahith Jagarlamudi

Task types

Web investigation
Adverse media detection
Evidence quality
Risk calibration

Data source

47 real businesses annotated by expert compliance practitioners

Evaluation method

Elo ratings, Adj F1, and RAIS evidence quality scoring
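
How KYBench derives its Elo ratings is not stated here. As an illustration only, the sketch below applies the standard Elo update to a single pairwise comparison between two agents. The K-factor of 32, the starting rating of 1500, and the function name are assumptions.

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Apply one standard Elo update for a comparison between A and B.

    score_a is 1.0 if A wins, 0.5 for a draw, and 0.0 if A loses.
    k is an assumed K-factor, not a value taken from the benchmark.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


# Two agents start level at 1500; A wins the comparison.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
```

Because the update is zero-sum, the points A gains equal the points B loses, so the average rating across agents stays constant.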

Last updated

2026-04-02


FinSpread-Bench

Updated Mar 10, 2026

The first public benchmark for agentic financial spreading. Evaluates how well AI systems extract, calculate, and reason across financial documents—like bank statements, tax returns, payslips, and financial spreads—in real-world decision scenarios.

Nico Klees, Maximilian Eber, PhD

Task types

Extraction
Cross-document reasoning
Calculation
Structured output

Data source

Anonymized data from Taktile co-development partners

Evaluation method

Automated metrics and expert human evaluation

Last updated

2026-03-04
