The bench

Public benchmarks.
Honest scores.

We run the same public benchmarks the research community uses. We compare our scores against numbers the other labs have published themselves. Where SLX leads, we say so. Where it ties or trails, we say that too.

Provenance

How to read these cards.

SLX rows

Measured in-house on the published evaluation sets. Same prompt template, same seed, same grading harness used by the benchmark's original authors. Run logs available on request.

Competitors

Scores quoted directly from each lab's own reports, technical papers, or public leaderboards. Rows marked src are cited; internal evals list their harness in the footer.

What we don't do

No re-running of competitor models in our own harness, where scoring choices could disadvantage them. No proprietary benchmarks of our own design, apart from internal evals that are flagged as such. No scores without a citable source.

Chapter VIII · The scores

Three benches.
One per card.

Each card tells one story: a clause identification task, a filings QA task, and a routing task SLX built for its own endpoint catalog. Every row links to a source except the SLX row, which is measured in-house. See the methodology below for harness and seed details.

01 · CUAD Benchmark · atticusproject.org
Full Test Set · N = 510 · Verified Apr 2026
Identifies legal clauses across commercial contracts.
Scored as F1 across 33 binary clause types on the CUAD v1.0 dataset. Higher is better. Evaluated against expert lawyer annotations.
Score · higher is better · F1 / 1.00
SLX: 0.709
GPT-4.1 mini: 0.644 (src)
GPT-4.1: 0.641 (src)
Claude Sonnet 4: 0.600 (src)
Qwen3-8B: 0.540 (src)
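
For readers who want to see what an F1 score across binary clause types involves, here is a minimal Python sketch. It assumes macro-averaging (one F1 per clause type, then an unweighted mean), which this page does not specify. The function names and the toy labels are illustrative only and are not taken from the SLX harness or the CUAD tooling.

```python
from typing import Dict, List


def binary_f1(gold: List[bool], pred: List[bool]) -> float:
    """F1 for a single binary clause type; 0.0 when there are no true positives."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold_by_type: Dict[str, List[bool]],
             pred_by_type: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-type F1 (assumption: macro-averaging)."""
    scores = [binary_f1(gold_by_type[t], pred_by_type[t]) for t in gold_by_type]
    return sum(scores) / len(scores)


# Toy data: two clause types, four contracts each (hypothetical labels).
gold = {
    "Governing Law": [True, True, False, False],
    "Non-Compete": [True, False, False, True],
}
pred = {
    "Governing Law": [True, False, False, False],
    "Non-Compete": [True, True, False, True],
}

print(round(macro_f1(gold, pred), 3))  # 0.733
```

A micro-averaged F1, which pools true and false positives across all clause types before computing precision and recall, can give a different number on the same predictions, so the averaging choice matters when comparing rows.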

Open-Source Set · N = 150 · Verified Apr 2026
Answers financial questions from public SEC filings.
Exact-match accuracy across numerical, qualitative, and analytical questions drawn from 10-K and 10-Q filings. Higher is better. Evaluated against expert analyst annotations.
Score · higher is better · Accuracy %
SLX: 85.0%
KodeX 70B (fine-tuned): 79.7%
Claude 4: 76.0%
Llama 3: 41.0%
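
Exact-match accuracy depends heavily on how answers are normalized before comparison. The Python sketch below shows one plausible scheme, under stated assumptions: the normalization rules (lowercasing, stripping dollar signs and thousands separators, collapsing whitespace) and the function names are hypothetical and are not the harness used for the scores above.

```python
import re
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, trim, drop '$' and ',' characters, collapse runs of whitespace."""
    text = answer.strip().lower()
    text = re.sub(r"[,$]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text


def exact_match_accuracy(gold: List[str], pred: List[str]) -> float:
    """Percentage of predictions equal to the gold answer after normalization."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one question is required")
    hits = sum(normalize(g) == normalize(p) for g, p in zip(gold, pred))
    return 100.0 * hits / len(gold)


# Toy data (hypothetical): two matches after normalization, one miss.
gold_answers = ["$1,250 million", "Yes", "12.5%"]
model_answers = ["1250 million", "yes ", "12.4%"]

print(round(exact_match_accuracy(gold_answers, model_answers), 1))  # 66.7
```

A stricter normalizer would count the first answer as a miss, which is why the grading rules need to be fixed before scores from different sources are compared.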

05 · Endpoint Resolution · cenizaslabs.com
Internal Eval · N = 1,000 · Verified Apr 2026
Resolves the correct API endpoint across 300+ enterprise apps.
Single-call accuracy on resolving the correct endpoint from a catalog of 10,000+ endpoints across 300+ apps. Higher is better. Evaluated on diverse natural-language client queries.
Score · higher is better · Accuracy %
SLX: 98.0%
Claude Opus 4.7: 79.1% (src)
GPT-5.5: 75.3% (src)
Gemini 3 Pro: 70.8% (src)
GPT-5.4: 68.1% (src)

Methodology

How we ran it.

i.

Public evaluation sets

Every public benchmark scored on this page is publicly documented, and internal evals are flagged as such. No proprietary scorecards. Anyone with the same eval harness can reproduce our numbers against our logs.

ii.

Cited competitor scores

We do not re-run competitor models in our own harness. We quote each lab's own published number and link to its source.

iii.

Internal evals flagged

Where a bench is internal (e.g. Endpoint Resolution), the footer names the harness. Raw logs available on request so you can verify the numbers.

iv.

Losses disclosed

We publish benchmarks where we trail. A page that only shows wins cannot support the wins. Nothing trails today; when a benchmark does, we'll add the row.

Run it yourself

Don't take
our word.

Request access to the eval harness, the raw logs, or the in-context trace for any single problem on any bench above. We'll send the files, unedited.