We run the same public benchmarks the research community uses. We compare our scores against numbers the other labs have published themselves. Where SLX leads, we say so. Where it ties or trails, we say that too.
Measured in-house on the published evaluation sets. Same prompt template, same seed, same grading harness used by the benchmark's original authors. Run logs available on request.
Scores quoted directly from each lab's own reports, technical papers, or public leaderboards. Rows marked src are cited; internal evals list their harness in the footer.
No re-running of competitor models in our own harness (where scoring choices could disadvantage them). No proprietary benchmarks of our own design, except internal evals that are labeled as such (e.g. Endpoint Resolution). No competitor scores without a citable source.
Each card tells one story: a clause identification task, a filings QA task, and a routing task SLX built for its own endpoint catalog. Every row links to a source except the SLX row, which is measured in-house; see the methodology below for harness and seed details.
Every scored benchmark on this page is publicly documented. No proprietary scorecards. Anyone with the same eval harness can rerun an evaluation and check the result against our logs.
We do not re-run competitor models in our own harness. We quote each lab's own published number and link to its source.
Where a benchmark is internal (e.g. Endpoint Resolution), the footer names the harness. Raw logs are available on request so you can verify the numbers.
We publish benchmarks where we trail. A page that shows only wins gives readers no reason to trust them. No SLX score trails today; we will add the row when one does.