SLX — AI Workspace for Deals, Drafts & Pipelines

Full Test Set · N = 510 · Verified Apr 2026

Contract clause extraction across commercial agreements — identifying 40+ provision types (termination, liability, indemnification, IP, change of control) across contracts running 20–100+ pages.

Scored as F1 across 33 binary clause types on the CUAD v1.0 dataset. Higher is better. Evaluated against expert lawyer annotations.
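As a rough illustration of the metric described above, here is a minimal sketch that computes F1 per binary clause type and then averages across types. The averaging method (macro-average) is an assumption, since the card does not say how the per-clause scores are combined. The clause names and labels are hypothetical toy data, not CUAD records.

```python
from typing import Dict, List


def binary_f1(gold: List[bool], pred: List[bool]) -> float:
    """F1 for one binary clause type (clause present / absent per contract)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        # Convention chosen for this sketch: no true positives scores 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Dict[str, List[bool]], pred: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-clause F1 (assumed aggregation, see lead-in)."""
    scores = [binary_f1(gold[clause], pred[clause]) for clause in gold]
    return sum(scores) / len(scores)


# Hypothetical labels for four contracts and two clause types.
gold_labels = {
    "termination": [True, True, False, False],
    "change_of_control": [True, False, False, False],
}
pred_labels = {
    "termination": [True, False, False, True],
    "change_of_control": [True, False, False, False],
}

score = macro_f1(gold_labels, pred_labels)
print(f"macro F1 = {score:.3f}")  # termination 0.5, change_of_control 1.0
```

Macro-averaging weights every clause type equally, so rare provisions count as much as common ones. A micro-average over all decisions would be a different, equally plausible reading of the description.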

Score · higher is better · F1 / 1.00

SLX · 0.85
GPT-4.1 mini · 0.644
GPT-4.1 · 0.641
Claude Sonnet 4 · 0.600
Qwen3-8B · 0.540

Scores as published. Competitor numbers from ContractEval, Aug 2025. SLX leads by 0.206 F1.

03 · FinanceBench · patronus-ai/financebench

Open-Source Set · N = 150 · Verified Apr 2026

Financial document question-answering on public filings — extracting figures, ratios, and qualitative disclosures from 100–300 page 10-Ks and 10-Qs with citations to source.

Exact-match accuracy across numerical, qualitative, and analytical questions drawn from 10-K and 10-Q filings. Higher is better. Evaluated against expert analyst annotations.
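Exact-match accuracy as described above can be sketched as follows. The normalization rules (case folding, stripping currency symbols and thousands separators) are assumptions made for illustration. The card does not specify how answers are normalized before comparison, and the sample answers are hypothetical.

```python
import re
from typing import List


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase, drop '$' and ',', collapse spaces."""
    text = answer.strip().lower()
    text = re.sub(r"[$,]", "", text)
    return re.sub(r"\s+", " ", text)


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions equal to the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one question")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical model answers and analyst references.
model_answers = ["$1,577.0", "Yes", "8.7%"]
analyst_answers = ["1577.0", "yes", "8.9%"]

accuracy = exact_match_accuracy(model_answers, analyst_answers)
print(f"exact-match accuracy = {accuracy:.1f}%")
```

Exact match is strict: the third answer is off by 0.2 percentage points and counts as a full miss, which is why the choice of normalization can move the headline number.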

Score · higher is better · Accuracy %

SLX · 95.0%
KodeX 70B (fine-tuned) · 79.7%
Claude 4 · 76.0%
Llama 3 · 41.0%

Scores as published. Competitor numbers from Islam et al., Patronus AI (2023). SLX leads by 15.3pp.

05 · Endpoint Resolution · cenizaslabs.com

Internal Eval · N = 73,316 · Verified Apr 2026

Tool selection across large API catalogs — mapping a natural-language request to the correct endpoint among 10,000+ across 300+ enterprise applications.

Single-call accuracy on resolving the correct endpoint from a catalog of 10,000+ endpoints across 300+ apps. Higher is better. Evaluated on diverse natural-language client queries.
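One way to read "single-call accuracy" is top-1 accuracy: each query gets exactly one resolution attempt, and it counts only if the returned endpoint matches the expected one. The sketch below assumes that reading. The `toy_resolve` function, the endpoint names, and the queries are all hypothetical stand-ins, not the actual harness or catalog.

```python
from typing import Callable, List, Tuple


def single_call_accuracy(
    resolve: Callable[[str], str],
    cases: List[Tuple[str, str]],
) -> float:
    """Percentage of queries whose single resolved endpoint is the expected one."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = sum(resolve(query) == expected for query, expected in cases)
    return 100.0 * correct / len(cases)


def toy_resolve(query: str) -> str:
    """Hypothetical keyword resolver standing in for the system under test."""
    q = query.lower()
    if "invoice" in q:
        return "billing.invoices.create"
    if "contact" in q:
        return "crm.contacts.create"
    return "unknown"


# Hypothetical (query, expected endpoint) pairs.
cases = [
    ("Create an invoice for Acme", "billing.invoices.create"),
    ("Add a new contact named Dana", "crm.contacts.create"),
    ("Archive the Q3 deal", "crm.deals.archive"),
    ("Send the invoice reminder", "billing.invoices.remind"),
]

accuracy = single_call_accuracy(toy_resolve, cases)
print(f"single-call accuracy = {accuracy:.1f}%")
```

With one attempt and no partial credit, near-miss endpoints in the same application (the last case here) score the same as unrelated ones, which makes the metric harder as the catalog grows.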

Score · higher is better · Accuracy %

SLX · 98.0%
Claude Opus 4.7 · 48.1%
GPT-5.5 · 46.6%
Gemini 3.3 · 17.0%
Claude Sonnet 4.6 · 35.4%
GPT-5.4-mini · 27.4%

SLX scored on internal benchmark. Competitor numbers from same internal eval harness, Apr 2026. SLX leads by 49.9pp.

Public benchmarks.
Honest scores.

Provenance

Competitors

What we don't do

Three benches.
One per card.

How we ran it.

Public evaluation sets

Cited competitor scores

Internal evals flagged

Losses disclosed

Don't take
our word.
