Benchmark · April 2026

KYBench - Adverse Media Search

Updated Apr 2, 2026

The first public benchmark of AI-driven adverse media investigation. Evaluates detection accuracy, evidence quality, reliability across agent runs, and cost efficiency across eight frontier models and 31 configurations.

David Ahn, Maximilian Eber, PhD, Sahith Jagarlamudi

KYB · Adverse Media · Compliance AI

Introduction

A compliance analyst at a mid-size fintech spends between 30 and 90 minutes on a single Know Your Business review. The bottleneck is not form processing or database lookups; those are solved. The bottleneck is adverse media research: the open-ended web investigation that determines whether a business has a history of fraud, sanctions violations, money laundering, or other regulatory red flags. It is unstructured, judgment-intensive, and scales poorly.

KYB compliance spans onboarding and ongoing monitoring, and a large portion of that work is manual human investigation. At a mid-size bank screening 10,000 businesses per month, that is 120,000 onboarding investigations per year. Factor in annual re-screening of the existing portfolio at similar volume, and the total reaches roughly 240,000 investigations, or 120,000 investigator-hours per year at 30 minutes per investigation. At a fully-loaded analyst rate of $80 per hour, manual investigation alone approaches $10 million annually. For most compliance teams, it is one of the most time-consuming and costly components of the KYB process.
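For concreteness, the arithmetic behind those figures, using only the assumptions stated above:

```python
# Back-of-envelope cost of manual adverse media investigation,
# using the assumptions stated above (illustrative, not benchmark data).
monthly_onboarding = 10_000
onboarding_per_year = monthly_onboarding * 12              # 120,000 investigations
rescreening_per_year = onboarding_per_year                 # re-screening at similar volume
investigations = onboarding_per_year + rescreening_per_year  # 240,000 investigations
hours = investigations * 0.5                               # 30 minutes each -> 120,000 hours
annual_cost = hours * 80                                   # $80/hr fully loaded -> $9.6M
print(f"{investigations:,} investigations, {hours:,.0f} hours, ${annual_cost:,.0f}/year")
```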

We built KYBench - Adverse Media Search to answer a simple question: which AI agent configurations are actually good at this task, how reliable are they, and how do you deploy them in production? This post covers the key findings, configuration analysis, model comparisons, and deployment guidance. The full methodology will be published in an accompanying research paper.

Key findings

Three results stood out. First, how you configure the agent matters more than which model you choose. Investigation strategy alone swings performance by 10 percentage points, rivaling the spread between the best and worst models on the leaderboard. Second, the agent produces different findings on the same business across identical runs. At 83% consistency, this is on par with disagreement rates between senior compliance analysts on the same cases. Third, AI agents already produce higher-quality evidence than human analysts on average (14.60 vs. 13.50 out of 20), across all four scoring dimensions including risk calibration, the area where human judgment was assumed to dominate.

Test dataset

KYBench - Adverse Media Search covers 47 real businesses across the full spectrum of adverse media risk, from clean small businesses to entities with active fraud convictions, OFAC sanctions, and organized crime connections. Each business was annotated by expert compliance practitioners with 25+ years of AML industry experience across 9 adverse media categories.

Two businesses from the dataset illustrate the range of outcomes: one comes back clean, one gets escalated.

No findings

S'Wich Bistro, LLC

No adverse media detected across 9 categories

✓ Clear to proceed

Flagged

Ozy Media Inc.

FRAUD · SEC_VIOLATIONS
⚠ Escalate for EDD

We will return to Ozy Media when we look at how agents and humans compare on the same investigation. The task requires finding the right sources, filing them under the right risk categories, and getting the severity right. All three are harder than they look at scale.

Evaluation methodology

We evaluated 31 agent configurations across varying models, investigation strategies, prompt framings, temperature settings, step budgets, and agent architectures. Ground truth was established by human annotators.

Methodology note

Why general-purpose search falls short for compliance.

General-purpose search engines are optimized for consumer intent, not evidentiary retrieval. They rank by engagement signals, not legal authority. In practice, this means SEC enforcement actions surface below news summaries of the same case; OFAC designation notices rank behind legal marketing pages that reference them; and court filings from foreign jurisdictions often don't appear at all. For compliance, those are exactly the sources that matter most.

All investigations in this benchmark were powered by Parallel, whose retrieval is built around an evidentiary objective rather than an engagement objective. Sources are ranked by legal authority: government enforcement databases, regulatory filings, and court records rank above commentary about those same documents. The RAIS scores throughout this benchmark reflect agents running on that retrieval layer, and the gap between primary-source citations and news-coverage citations is a direct consequence of that ranking signal.

What makes a good review?

Measuring AI quality is straightforward when a task has a clear right answer. Adverse media research does not. Analysts aren't grading a test; they're making a judgment call. What matters is not whether the agent found many hits, but whether those hits are real, credible, and actionable.

That specificity is what makes standard benchmarks poorly suited to this domain. A raw precision/recall score treats a DOJ press release and a single-source blog post as equally strong evidence of the same risk. A coding benchmark leaderboard tells you nothing about whether a model's citations would hold up in a compliance audit. The questions a compliance team actually needs answered are about coverage (did the agent identify the risk categories that are genuinely present?), source quality (are those findings backed by primary-source government and regulatory documents, not secondary news coverage of the same events?), and precision (did the agent avoid flagging unrelated entities or low-signal noise?).

Designing metrics that reflect what compliance teams actually care about is a core part of the Taktile Labs research agenda. The benchmark below is built around that principle: every metric traces back to a question a senior compliance analyst would recognize as meaningful.

How we score

Each configuration was scored on four complementary metrics (expand each for the full methodology):

Elo Rating – pairwise head-to-head quality, false positives penalized

Elo answers: which report would you rather put in front of a reviewer? Rather than measuring against a fixed checklist, Elo is determined through direct head-to-head comparison, the same approach used to rank chess players or sports teams.

For each business, a judge model reads two investigation reports side-by-side and picks the stronger one, based on how clearly findings are communicated, how well evidence is organized, and how useful the report would be for a human reviewer. A model that writes clear, confident reports tends to win, even if another model found more raw evidence. Models that consistently beat others accumulate higher Elo scores; beating a strong competitor counts for more than beating a weak one.

One deliberate design choice: reports that over-flag unrelated entities (citing adverse media about the wrong company just to appear thorough) will lose to a more precise report, even if the over-flagging report found all the true positives. This makes Elo resistant to a common gaming strategy (flood the report with citations) and rewards the kind of judgment a good analyst would exercise.

Elo scores are on an arbitrary scale (higher is better); the absolute number matters less than the relative ranking.
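The post does not publish the exact rating parameters, so the sketch below shows only the standard Elo update that the description implies, with an assumed K-factor of 32 and the conventional 400-point logistic scale:

```python
def elo_update(rating_a, rating_b, a_wins, k=32):
    """Standard Elo update after one head-to-head report comparison.

    a_wins: True if the judge preferred report A. K=32 and the 400-point
    logistic scale are conventional defaults, not the benchmark's settings.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b
```

Beating a higher-rated opponent yields a larger update because the expected score is lower, which is exactly the "beating a strong competitor counts for more" property described above.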

Adjusted F1 – category detection that doesn't penalize novel discoveries

Adjusted F1 answers: out of all the risk categories that matter for this business (fraud, sanctions, money laundering, regulatory violations), did the agent get the right ones?

It measures which of the 9 adverse media categories the agent correctly identifies. A standard F1 compares agent findings against human annotations, but human annotation is imperfect, and if an agent surfaces real adverse media the human missed, it would unfairly count as a false positive.

Adjusted F1 fixes this: when an agent's novel finding is later confirmed as genuine, it is removed from the false positive count. The result is a score that reflects actual detection ability rather than agreement with an imperfect reference. An agent that catches everything the human caught, plus legitimate findings the human missed, scores above a naive baseline.

$$
\begin{aligned}
\text{precision} &= \frac{TP}{TP + FP} \\[6pt]
\text{recall} &= \frac{TP}{TP + FN} \\[6pt]
\text{Adjusted F1} &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{aligned}
$$

Confirmed novel findings are reclassified from FP to TP, so the agent is not penalized for genuine discoveries the human missed.
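A minimal sketch of that reclassification on category sets; data shapes are illustrative, not the benchmark's implementation:

```python
def adjusted_f1(agent_cats, human_cats, confirmed_novel):
    """Adjusted F1 over risk categories.

    confirmed_novel: agent findings absent from the human annotation that
    were later verified as genuine; they move from FP to TP.
    """
    agent, human = set(agent_cats), set(human_cats)
    novel = set(confirmed_novel) & (agent - human)   # only verified extras count
    tp = len(agent & human) + len(novel)
    fp = len(agent - human) - len(novel)
    fn = len(human - agent)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```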

Risk-Adjusted Investigation Score (RAIS) – holistic evidence quality

RAIS answers: was the evidence the agent produced authoritative enough to support a compliance decision: drawn from credible primary sources, covering the right risk categories, and free of entity-confusion errors?

It is a composite of four components; the three positive weights sum to approximately 1.0 (0.53 + 0.24 + 0.24), giving a range of roughly [0, 1] before any FIC penalty:

$$
\text{RAIS} = 0.53 \cdot \text{EW-F1} + 0.24 \cdot \text{ECS} + 0.24 \cdot D - 0.15 \cdot \text{FIC}
$$

Weights are grounded in Wolfsberg Adverse Media Guidance (2019), FATF Recommendation 10, and FinCEN SAR standards. Expand each component for a full explanation.

EW-F1 Evidence-weighted source overlap with human annotation

EW-F1 answers: did the agent find the same evidence the human found, and were those sources as authoritative? It is the most heavily weighted component because finding the right sources is the core task of an adverse media investigation.

Not all sources are equally valuable. A DOJ conviction entry carries far more weight than a blog post mentioning the same case. Each source is scored by its Evidence Weight (EW), a quality score between 0 and 1 based on: entity match (gate), legal stage (35%), source credibility (30%), recency (20%), and role clarity (15%).

$$
\begin{aligned}
\text{EW}(s) &= \text{entity\_match} \times [\,0.35 \cdot \text{legal\_stage} + 0.30 \cdot \text{source\_cred} \\
&\qquad\qquad + 0.20 \cdot \text{recency} + 0.15 \cdot \text{role\_clarity}\,] \\[6pt]
\text{EW-precision} &= \frac{\sum \text{EW}(s)_{[\text{agent} \cap \text{human}]}}{\sum \text{EW}(s)_{[\text{agent}]}} \\[6pt]
\text{EW-recall} &= \frac{\sum \text{EW}(s)_{[\text{agent} \cap \text{human}]}}{\sum \text{EW}(s)_{[\text{human}]}} \\[6pt]
\text{EW-F1} &= \frac{2 \cdot \text{EW-precision} \cdot \text{EW-recall}}{\text{EW-precision} + \text{EW-recall}}
\end{aligned}
$$

ECS Evidence Coverage Score – source credibility breadth

ECS answers: did the agent consult authoritative sources, or did it mostly rely on social media and general news?

Every URL is assigned to a quality tier: Tier 3 (1.0, primary regulatory/legal), Tier 2 (0.6, major financial press), Tier 1 (0.3, general news), Tier 0 (0.1, social media). The score uses a logarithmic scale to reward depth over volume.

$$
\text{ECS} = \frac{\log\bigl(1 + \sum_u \text{tier}(u)\bigr)}{\log(1 + 27)}
$$

$u$ ranges over the URLs visited by the agent; the denominator $\log(1+27)$ reflects a max-evidence investigation: 3 Tier-3 sources across each of 9 categories ($3 \times 9 \times 1.0 = 27$).

D Discovery Delta – novel adverse media confirmed beyond the human set

D answers: did the agent uncover genuine adverse media that the human investigator missed?

Novel sources are independently verified. D = 0 means nothing new; D = 0.5 means novel findings equal in quality to the entire human set; D → 1.0 represents vastly more discoveries with diminishing returns.

$$
\begin{aligned}
R &= \frac{\sum \text{EW}(s)_{[\text{novel, CONFIRMED}]}}{\sum \text{EW}(s)_{[\text{human}]}} \\[6pt]
D &= \frac{R}{1 + R}
\end{aligned}
$$

FIC False Include Cost – penalty for entity confusion

FIC answers: are the adverse findings actually about the right company? It penalizes citing records about the wrong entity, calibrated to the severity of the mistake.

$$
\text{FIC} = \sum_{s \,:\, \text{entity\_match}(s) = 0} \text{EW}^*(s)
$$

$\text{EW}^*$ ignores the entity gate, so high-quality misattributed sources incur a larger penalty.
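Pulling the four components together, a minimal sketch of the scoring pipeline; function names and data shapes are illustrative, not the benchmark's implementation:

```python
import math

def evidence_weight(entity_match, legal_stage, source_cred, recency, role_clarity):
    """EW(s): the entity-match gate times a weighted quality score; all inputs in [0, 1]."""
    return entity_match * (0.35 * legal_stage + 0.30 * source_cred
                           + 0.20 * recency + 0.15 * role_clarity)

def evidence_coverage_score(url_tiers):
    """ECS: log-scaled sum of per-URL tier values (1.0 / 0.6 / 0.3 / 0.1),
    normalized by the max-evidence investigation (3 Tier-3 sources x 9 categories = 27)."""
    return math.log(1 + sum(url_tiers)) / math.log(1 + 27)

def discovery_delta(novel_confirmed_ew, human_ew):
    """D: confirmed novel evidence weight relative to the human set, squashed into [0, 1)."""
    r = sum(novel_confirmed_ew) / sum(human_ew)
    return r / (1 + r)

def rais(ew_f1, ecs, d, fic):
    """Composite Risk-Adjusted Investigation Score."""
    return 0.53 * ew_f1 + 0.24 * ecs + 0.24 * d - 0.15 * fic
```

Each per-business RAIS is computed this way; the leaderboard reports the mean across the 47 cases.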

Configuration experiments

What moves performance: investigation strategy matters more than model choice

The single largest performance lever is not model selection. It is investigation strategy. Running Gemini 2.5 Pro with different strategies produces dramatically different results. Hypothesis planning (structuring the investigation upfront) raises adjusted F1 from 77.9% to 87.7%, a +9.8 pp gain. Self-critique adds a post-hoc review pass for a +7.8 pp gain at lower operational overhead. Corroboration gates and chain-of-thought requirements show smaller gains and introduce conservative bias that suppresses legitimate findings in sparse categories like money laundering.

Gemini 2.5 Pro · Strategy variations · Adj F1 vs. baseline (77.9%)

| Strategy | Adj F1 | Δ |
| --- | --- | --- |
| Hypothesis planning | 87.7% | +9.8 pp |
| Self-critique | 85.7% | +7.8 pp |
| Skeptic persona | 85.7% | +7.8 pp |
| Corroboration gate | 79.2% | +1.3 pp |
| CoT chains | 79.2% | +1.3 pp |
| Baseline | 77.9% | n/a |

Gemini 2.5 Pro with hypothesis planning at $0.11/business matches Claude Opus 4.6's accuracy (87.7% adjusted F1) at one-tenth the cost, the most important cost-efficiency finding in the benchmark.

Six words can shift performance by 10 percentage points

We ran semantically equivalent prompt variants against the same model on the same businesses. Adjusted F1 ranged from 77.1% to 87.1%, a 10 percentage-point spread from prompt wording alone. That gap exceeds the performance difference between models within the same family at standard settings.

Gemini 2.5 Pro · Task framing sensitivity · 47 businesses

| Prompt framing | Adj F1 | RAIS |
| --- | --- | --- |
| "Identify regulatory and financial crime risks for: [business name]" | 87.1% | 0.347 |
| "Screen for compliance concerns about: [business name]" | 82.0% | 0.317 |
| "Research adverse media for: [business name]" | 77.1% | 0.319 |

The two metrics do not move in lockstep. On adjusted F1, "Identify regulatory and financial crime risks" is the clear winner (87.1%), activating more domain-specific detection behavior. On RAIS, the spread is far narrower, and the lowest-coverage framing, "Research adverse media" (0.319), edges out "Screen for compliance concerns" (0.317) despite a 5 pp detection deficit: category coverage and evidence-weighted citation quality respond to wording independently. The right choice depends on what you're optimizing for: detection breadth or evidence quality. Before shipping any prompt modification, run a regression test across at least 20 businesses with ground truth labels; check per-category consistency, not just overall F1 (a minimal harness follows below).
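A minimal regression harness along those lines, assuming you can wrap your agent in a callable; `run_agent`, the prompt template, and the label format are stand-ins for your own setup:

```python
from collections import defaultdict

def prompt_regression(run_agent, prompt_template, labeled_businesses):
    """Per-category confusion counts for a candidate prompt.

    run_agent: callable taking a prompt string and returning the set of
        flagged categories (stand-in for your agent invocation).
    labeled_businesses: list of {"name": str, "labels": set of categories}
        with ground-truth annotations; use at least ~20 cases.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for biz in labeled_businesses:
        flagged = run_agent(prompt_template.format(name=biz["name"]))
        truth = set(biz["labels"])
        for cat in flagged | truth:
            key = "tp" if (cat in flagged and cat in truth) else ("fp" if cat in flagged else "fn")
            counts[cat][key] += 1
    return dict(counts)
```

Compare the per-category counts between the old and new prompt before shipping; a wording change that lifts overall F1 can still suppress a sparse category like money laundering.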

In order of impact: investigation strategy (hypothesis planning +9.8 pp, self-critique +7.8 pp), prompt framing (up to 10 pp swing from wording alone), temperature (T=0 outperforms T=0.7 by 10 pp), step budget (15 steps outperforms 50), and guidance level (category taxonomy alone outperforms taxonomy plus explicit thresholds).

Results

With configuration effects established, here is how variations in the underlying model affect the agent's performance using the standard configuration.

Leaderboard

Elo shown with 95% CI · Human baseline: 913

| # | Model | Elo | Adj F1 | RAIS | Avg time | Cost / case |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Claude Sonnet 4.6 | 1289 | 77.0% | 0.485 | 1.3m | $0.12 |
| 2 | Claude Haiku 4.5 | 1204 | 81.0% | 0.492 | 29s | $0.06 |
| 3 | Claude Opus 4.6 | 1143 | 87.7% | 0.529 | 1.3m | $1.00 |
| 4 | Gemini 3.1 Pro | 977 | 77.2% | 0.454 | 35s | $0.45 |
| 5 | Gemini 3.1 Flash-Lite | 927 | 72.2% | 0.430 | 18s | $0.001 |
| H | Human Analyst¹ | 913 | n/a | n/a | 30.0m² | $30² |
| 6 | GPT-5.2 Pro* | 899 | 90.3% | 0.472 | 5.0m | $0.80 |
| 7 | Gemini 2.5 Pro | 835 | 85.7% | 0.414 | 39s | $0.12 |
| 8 | GPT-5 | 816 | 81.5% | 0.379 | 24s | $0.60 |

* GPT-5.2 Pro hits its context window at 30 steps; more than 20% of complex cases fail to complete. The Adj F1 of 90.3% reflects a capped 15-step configuration (raw precision = 66.7%, recall = 43.8% before novel findings are reclassified). Avg time reflects failing cases and is not representative.
¹ Adj F1 and RAIS are omitted for the Human Analyst; both metrics compare agent outputs against human-derived ground truth and are not meaningful when applied to the reference itself.
² Human time and cost estimated at 25–35 minutes per investigation at a fully-loaded analyst rate of $60–70/hr, consistent with industry benchmarks for adverse media screening. Excludes QA review and case documentation.

The Elo, adjusted F1, and RAIS columns tell different stories, and all three are useful. Elo measures whether the final report is persuasive and usable: it rewards presentation quality. RAIS measures which specific sources were cited and how authoritative they are: it rewards evidence depth. A model can lead on Elo by writing clear, confident reports while scoring lower on RAIS if its citations are less authoritative than a competitor's. Claude Sonnet 4.6 leads on Elo (1289) but sits near the bottom on adjusted F1 (77.0%) and third on RAIS (0.485): it wins more head-to-head matchups but produces thinner evidence trails than Opus. GPT-5.2 Pro ranks first on adjusted F1 (90.3%) but sixth of the eight models on Elo (899), below the human baseline: its context truncation reduces evidence breadth even when individual citations are high quality. Gemini 3.1 Flash-Lite ranks lowest on adjusted F1 but fifth on Elo (927), winning head-to-head matchups more often than its accuracy score suggests, particularly on clear-cut cases. For compliance deployment, RAIS is the column that matters most: Claude Opus leads at 0.529, reflecting consistent primary-source citations with well-calibrated legal stage assessments.

Agents already exceed human evidence quality

The most striking result: AI agents produce higher-quality investigation outputs than human compliance analysts. To measure this, each report, human and agent alike, was scored by an independent LLM judge on a 20-point rubric across four dimensions (0–5 each):

  • Evidence completeness – did the investigator find all material adverse findings?
  • Entity precision – are all flagged findings actually about the right business, not a namesake?
  • Source quality – how credible and current are the cited sources?
  • Risk calibration – are findings assessed at the right severity level, distinguishing allegations from convictions?

The judge has web-search access to independently verify what exists, so evidence completeness is assessed against what is actually findable, not just what the annotator chose to include. The judge scores blind to whether each output is human or agent-produced.

Human analysts averaged 13.50/20; AI agents averaged 14.60/20. These are average scores across 47 human annotations and 183 agent runs on the same dataset. Agents scored higher across all four dimensions. The largest gaps were on entity precision and risk calibration; the smallest on evidence completeness and source quality.

Case study · Ozy Media Inc.

Let's return to Ozy Media Inc., whose CEO was convicted of securities fraud in 2024. The AI scored 19/20 on the independent rubric; the human scored 17/20. The gap is on risk calibration: the human's legal stage assessments had inconsistencies the AI avoided.

Categories

FRAUD · SEC_VIOLATIONS · OTHER

Summary

Screening of Ozy Media Inc. confirms federal fraud convictions and subsequent executive clemency. Watson sentenced to 116 months; penalties fully commuted March 2025. Despite the commutation, the underlying findings of “exceptional dishonesty” present extreme reputational risk. Escalate for EDD.

Sources (9)

Rubric score · Human analyst

| Dimension | Score |
| --- | --- |
| Evidence completeness | 4 / 5 |
| Entity precision | 5 / 5 |
| Source quality | 5 / 5 |
| Risk calibration | 3 / 5 |
| Total | 17 / 20 |

Both found FRAUD and SEC_VIOLATIONS with conviction-level evidence. Two of the human's citations expose the risk calibration gap: the NPR sentencing article and the Deadline conviction article were scored at legal stage 1 (allegation) when both report a completed conviction, which should be legal stage 4. Those miscalibrations are what drove the human's risk calibration score to 3/5 on the rubric.

The human's OTHER category warrants a closer look. Its three sources (Reuters, NBC, and BBC) all cover the same criminal sentencing already documented under FRAUD. The NBC article about the Trump commutation is the only genuinely new finding; Reuters and BBC are alternative news coverage of the same conviction event, not a distinct category of adverse media. The agent's decision to stop at FRAUD and SEC_VIOLATIONS was the more calibrated call: adverse media search is about identifying which risk categories apply to a business, not exhaustively cataloging every outlet that covered the same underlying event.

Here is how RAIS breaks down for this specific case. RAIS is calculated per business, and the leaderboard number is the mean across all 47 cases.

AI Agent (Claude Opus 4.6) · Ozy Media

| Component | Value |
| --- | --- |
| EW-F1 | 0.330 |
| ECS | 0.949 |
| D (Discovery Delta) | 0.427 |
| FIC (penalty) | 0.000 |
| RAIS | 0.499 |

EW-F1 = 0.330 looks low for a strong run, and that is the point. EW-F1 measures source overlap with the human citation set. The agent cited DOJ press releases; the human cited NPR and Deadline. Same conviction, different URLs. The DOJ source carries an evidence weight of 0.93; the NPR article carries 0.47. The agent chose better sources, but low overlap means low EW-F1.

ECS = 0.949 reflects how thoroughly the agent searched. It queried government enforcement databases, court records, and regulatory filings: the top-tier sources. That breadth is captured here regardless of whether it matches the human's specific URLs.

D = 0.427 means the agent confirmed substantial novel evidence beyond the human's set. FIC = 0.000: every source named the right entity. The overall RAIS of 0.499 is slightly below Claude Opus's average of 0.529. Ozy Media is a complex case, but the component breakdown shows why EW-F1 alone would have significantly undersold this run.

For the task of finding and documenting adverse media, the question is no longer whether AI can do this. The questions worth asking are about reliability and deployment architecture.

Reliability

Agents don't hallucinate: they flag too eagerly

The threat from AI adverse media agents is miscalibrated thresholds applied to real evidence, not invented evidence. The hallucination rate across all false positives is 0%. Every false positive cited a real, accessible source; the agent found something real, and the question is whether a compliance officer would have flagged it. The dominant error type (69% of false positives) is over-flagging real but marginal evidence: the agent cited something genuine, but a human annotator would have set it aside — because the case was still at the allegation stage, the reporting was too old to be actionable, or the person involved played only a peripheral role in the business. Entity confusion accounted for 1.5% of false positives.

A human review queue built around this reality should focus on threshold adjudication, not fact-checking. Throughput and completion rate must also be tracked alongside accuracy: models that hit context window limits during deep investigations will silently truncate their analysis, producing incomplete outputs that pass accuracy checks on short cases but fail on complex ones.

Stochastic consistency is a real deployment concern

Accuracy measured once is not the same as accuracy you can rely on. Consistency rate measures how often an agent produces the same risk category verdicts when run on the same business multiple times. We ran three frontier agent configurations three times each on all 47 businesses with identical inputs. The Claude Opus 4.6 configuration achieved an 83.3% consistency rate; the GPT-5.2 Pro configuration, 71.1%; the Gemini 3.1 Pro configuration, 74.5%.

While 83% might sound low, it is on par with human performance. Given the ambiguity in the task, it is not unusual for people to disagree on the correct judgment for one case in six. Our own annotation process produced category-level disagreements between annotators on exactly the businesses that also drove AI inconsistency. Claude Opus's 83.3% consistency rate reflects a level of judgment variance comparable to what you'd see between two senior compliance analysts applying the same guidelines independently.

Two signals make the remaining inconsistency manageable in practice. Inconsistency concentrates in genuinely borderline cases, the same businesses that fall below Claude Opus's confidence threshold in the hybrid routing model, as the example below illustrates. And higher capability aligns with higher consistency; the most accurate models are also the most repeatable.
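For concreteness, one way to compute a consistency rate of this kind; the pairwise-agreement definition below is an illustrative reading, not necessarily the benchmark's exact formula:

```python
from itertools import combinations

def consistency_rate(runs):
    """Fraction of run pairs whose category verdicts match exactly, per business.

    runs: one dict per run, mapping business_id -> frozenset of flagged categories.
    """
    agree = total = 0
    for run_a, run_b in combinations(runs, 2):
        for biz in run_a.keys() & run_b.keys():   # businesses present in both runs
            total += 1
            agree += run_a[biz] == run_b[biz]
    return agree / total if total else 1.0
```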

Case study · Hess Services Inc.

Run 1: ⚠ OTHER · Run 2: ✓ Clean · Run 3: ✓ Clean

All three runs investigated the same oilfield services company in Hays, Kansas and found the same evidence: a federal employment discrimination lawsuit (Myers v. Hess Services, Inc., Title VII) filed in 2020, settled at ADR, and dismissed with prejudice. The facts were identical across runs. What differed was the judgment call: Run 1 flagged the settled lawsuit as OTHER; Runs 2 and 3 concluded it did not meet the threshold for adverse media. The human annotator agreed with Runs 2 and 3.

Claude Opus confidence score (max across all categories): 25 · routing threshold: 70
Scale: 0 = uncertain → human review · 100 = confident → auto-approve

A confidence score of 25 sits well below the routing threshold of 70. This case would be sent to human review automatically. A reviewer confirms: the lawsuit was settled and dismissed, no ongoing risk, business cleared. The inconsistency is real, but the routing system catches it before it reaches a decision.

Deployment strategies

Full automation is the wrong target. A hybrid model (where the agent screens every case first and a human analyst reviews only the uncertain ones) reduces analyst workload by 93% while matching or exceeding human practitioner performance. The agent assigns each case a confidence score from 0 to 100. Running fully automatically, the agent still misses enough cases that a human backstop matters. Routing only the low-confidence cases (those scoring below 70) to a human changes the picture: only 6.7% of cases require a human look, yet results are on par with reviewing every case manually. Those are the genuinely hard cases where human judgment makes the difference.
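In code, the routing rule is a one-line threshold check. Field names and the verdict shape below are illustrative assumptions; the 0-100 confidence scale and the threshold of 70 are taken from the benchmark:

```python
ROUTING_THRESHOLD = 70  # on the agent's 0-100 confidence scale

def route(case):
    """Hybrid deployment: auto-resolve confident cases, queue the rest for a human.

    case is assumed to carry the agent's verdict and a confidence score
    (max across risk categories); in the benchmark ~6.7% of cases fall
    below the threshold and reach a human reviewer.
    """
    if case["confidence"] >= ROUTING_THRESHOLD:
        return ("auto_resolve", case["verdict"])
    return ("human_review", case)
```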

Configuration quick reference

Recommended configurations by deployment goal

| Goal | Model | Key setting | RAIS | Cost |
| --- | --- | --- | --- | --- |
| Best overall | Claude Opus 4.6 | Self-critique · confidence routing (θ = 0.70) · 93% of cases auto-resolved | 0.529 | $1.20 |
| Best value | Claude Haiku 4.5 | Standard prompt | 0.492 | $0.06 |
| Lowest cost | Gemini 3.1 Flash-Lite | Standard prompt | 0.430 | $0.001 |

Conclusion

KYBench - Adverse Media Search provides a repeatable evaluation framework for AI-assisted adverse media screening, and the first results point to a few clear conclusions. The strongest frontier agents now deliver human-level or better results. Configuration choices matter more than model selection: investigation strategy alone swings performance by 10 percentage points. Every model produces inconsistent verdicts on borderline cases, but those are precisely the cases a hybrid routing model catches. Combining agents with human review of uncertain cases reduces analyst workload by 93% while raising the overall quality bar above what either achieves alone. Future work will focus on expanding the annotated dataset across jurisdictions and entity types, and on improving confidence calibration for the borderline cases where models still disagree. We will update results as new models are released.