Superbuilt · Research & Evaluation · April 2026

Agentic AI research

for the built world

This document presents the evaluation methodology, benchmark results, and performance evidence underlying the Superbuilt platform — a purpose-built agentic AI system for architecture, engineering, and construction.

Superbuilt
450+
Projects evaluated
Residential, mixed-use, commercial, interior
90–92%
Compliance accuracy
Practitioner-reviewed, multi-jurisdiction
100+
Specialized AI agents
AEC-native, not general-purpose
65%
Review time reduction
~10 min → 3–4 min per compliance check

§ 01 — Origin Research · Compliance Reasoning

Where it began — compliance reasoning at human grade

The foundational research underlying Superbuilt originated as a single, testable hypothesis: that a domain-specific reasoning system could achieve statistical parity with a senior practicing architect on regulatory compliance evaluation. All evaluation reported here was conducted by licensed, actively practicing architects using the system within real project workflows — not under controlled laboratory conditions.

The primary evaluation corpus comprises 450+ projects spanning residential, mixed-use, interior, and large-scale commercial typologies, selected to reflect the full distribution of real-world submission standards and drafting practices. Regulatory coverage was structured to include national building codes, regional and state-level regulations, municipal by-laws, zoning frameworks, and fire and life-safety norms. The reasoning infrastructure validated through this corpus was subsequently extended to underpin the broader agentic platform.

Evaluation pipeline — practitioner-reviewed

01 — Project Input
02 — Rule Parsing
03 — Issue Detection
04 — Architect Review
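
For illustration only, the sketch below chains the four stages above into a single review pass. All names (parse_rules, detect_issues, ReviewedIssue, and so on) are hypothetical placeholders, not identifiers from the disclosed system.

```python
# Hypothetical sketch of the four-stage evaluation pipeline described above.
# Flow: 01 project input -> 02 rule parsing -> 03 issue detection -> 04 architect review.
from dataclasses import dataclass


@dataclass
class Issue:
    rule_id: str        # regulatory clause the detection refers to
    location: str       # sheet / element where the issue was found
    description: str


@dataclass
class ReviewedIssue:
    issue: Issue
    verdict: str        # "correct" or "false_positive", assigned by the reviewing architect


def parse_rules(jurisdiction: str) -> list[str]:
    """Stage 02: load the applicable rule identifiers for a jurisdiction (stub)."""
    return [f"{jurisdiction}-rule-{i}" for i in range(3)]


def detect_issues(drawing: str, rules: list[str]) -> list[Issue]:
    """Stage 03: run automated checks; here a stub that flags every rule."""
    return [Issue(rule_id=r, location=drawing, description="stub finding") for r in rules]


def architect_review(issues: list[Issue]) -> list[ReviewedIssue]:
    """Stage 04: a practicing architect labels each detection (stub labels)."""
    return [ReviewedIssue(issue=i, verdict="correct") for i in issues]


def run_pipeline(drawing: str, jurisdiction: str) -> list[ReviewedIssue]:
    """Stage 01 input feeds 02 parsing, 03 detection, and 04 human review."""
    return architect_review(detect_issues(drawing, parse_rules(jurisdiction)))
```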

Time-to-compliance

Practicing architect — ~10 min
Superbuilt — 3–4 min

Regulatory rule coverage depth

National building codes — 96%
State & regional regulations — 81%
Municipal by-laws — 74%
Zoning & land-use — 68%
Fire & safety norms — 79%

Performance metrics — 450+ practitioner-reviewed projects

Recall — ~94% (exceeds human-equivalent threshold)
Precision — 93–95% (exceeds human-equivalent threshold)
Overall accuracy — 90–92% (exceeds human-equivalent threshold)
Issues confirmed / drawing — 71–72 (exceeds human-equivalent threshold)
False negative rate — 6–7% (known limitation — borderline cases)

Training trajectory — internal validation curve

Recall — 94%
Precision — 93%
Accuracy — 91%
~80 issues identified by a practicing architect per project review
71–72 confirmed correct detections per project by Superbuilt
No publicly available global benchmark for AEC compliance exists
First rigorous practitioner-led AEC evaluation of its kind globally
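
For reference, the metrics above follow the standard precision and recall definitions, computed from the architect labels (confirmed detection, false positive, missed issue). The counts in the example below are placeholders chosen only to show the arithmetic; they are not figures from the evaluation corpus.

```python
# Standard detection-metric definitions, computed from practitioner labels.
# The example counts are illustrative placeholders, not corpus data.
def detection_metrics(true_positives: int, false_positives: int, false_negatives: int):
    precision = true_positives / (true_positives + false_positives)   # confirmed / all flagged
    recall = true_positives / (true_positives + false_negatives)      # confirmed / all real issues
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for a single drawing review:
p, r, f1 = detection_metrics(true_positives=72, false_positives=5, false_negatives=5)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```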

§ 02 — Cross-Agent Benchmark Evaluation

24 public benchmarks across 6 capability domains

All evaluations conducted zero-shot or few-shot using published protocols. No benchmark-specific fine-tuning was applied at any stage. Results are independently reproducible.
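
The generic shape of such a run is sketched below: the benchmark's published examples go in, its official scorer produces the result, and nothing task-specific happens in between. The load_examples, model.generate, and official_scorer names are stand-ins for whatever tooling each benchmark releases, not real APIs.

```python
# Generic zero-shot benchmark run, as described above: published prompts in,
# published scorer out, no benchmark-specific fine-tuning or prompt engineering.
# All callables passed in are hypothetical stand-ins for a benchmark's own tooling.
def run_zero_shot(benchmark_name: str, model, load_examples, official_scorer) -> float:
    examples = load_examples(benchmark_name)                          # the published split
    predictions = [model.generate(ex["prompt"]) for ex in examples]   # prompts used verbatim
    return official_scorer(predictions, [ex["answer"] for ex in examples])
```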

24 benchmarks
Benchmark | Domain | Capability measured | Primary agent(s) | Platform score | Prior baseline | Δ
IFEval | NLP | Instruction following — prompt-level strict accuracy | Email · Meeting · Submittal | 86.4% | 62.3% | +24.1pp
HELMET | NLP | Long-context RAG — F1 over retrieved spans | Drive · Research · Drawing Review | 81.7% | 54.1% | +27.6pp
ZeroSCROLLS | NLP | Long-document summarization — ROUGE-L / BERTScore | Meeting · Email | 79.2% | 58.6% | +20.6pp
FRAMES | NLP | Multi-hop multi-document QA — exact match | Research · Site Intelligence | 82.1% | 61.4% | +20.7pp
DocVQA | Vision | Document visual QA — ANLS score | Drawing Review · Spec Validation | 92.6% | 71.3% | +21.3pp
ChartQA | Vision | Chart comprehension — relaxed accuracy | BOQ · Research | 88.3% | 65.8% | +22.5pp
OCRBench | Vision | Dense text recognition in technical layouts | Drawing Review · Accessibility | 856/1000 | 694/1000 | +162pts
AI2D | Vision | Schematic & diagram parsing — MC accuracy | Sketch-to-CAD · Moodboard | 89.4% | 63.2% | +26.2pp
CVBench spatial | Vision | 2D spatial relation & count reasoning | Clash Detection · Drawing Review | 81.7% | 58.6% | +23.1pp
InfoVQA | Vision | Infographic & table comprehension — ANLS | Site Intelligence · BOQ | 84.5% | 61.4% | +23.1pp
GSM8K | Reasoning | Multi-step arithmetic — solve rate | BOQ · Site Intelligence | 95.1% | 57.1% | +38.0pp
MATH-500 | Reasoning | Competition-level mathematics — exact match | BOQ · Space Planning | 74.6% | 42.2% | +32.4pp
FinQA | Reasoning | Financial table + text reasoning — exec accuracy | BOQ · Vendor Management | 81.3% | 52.7% | +28.6pp
HotpotQA | Reasoning | Multi-hop cross-document reasoning — F1 | Compliance · Clash Detection | 86.3% | 68.4% | +17.9pp
MuSiQue | Reasoning | 4-hop compositional inference — exact match | Spec Validation · Green Cert | 58.9% | 31.2% | +27.7pp
BFCL | Agentic | API function calling accuracy — Berkeley FCL | Vendor Calling · Calendar · Slack | 91.5% | 60.3% | +31.2pp
τ-bench | Agentic | Multi-tool agentic task chains — success rate | Vendor · Submittal · Email | 68.4% | 41.2% | +27.2pp
StructEval | Agentic | JSON schema adherence — validity rate | BOQ · Spec Validation · Calendar | 97.3% | 71.4% | +25.9pp
MMLU Engineering | Domain | Engineering & architecture subsets — accuracy | Code Compliance · Green Cert · Accessibility | 88.7% | 58.4% | +30.3pp
MMLU Pro | Domain | Professional-level domain QA — accuracy | Green Certification · Vastu | 71.4% | 49.3% | +22.1pp
SciFact | Domain | Scientific claim verification — F1 | Green Cert · Accessibility | 84.2% | 64.1% | +20.1pp
QMSum | Productivity | Query-based meeting summarization — ROUGE-L | Meeting Agent | 31.8 | 22.4 | +9.4pts
SchemaGuidedDialogue | Productivity | Goal-oriented dialogue state tracking | Calendar · Vendor Calling | 87.3% | 66.2% | +21.1pp
AMI Corpus | Productivity | Action item extraction from meetings — F1 | Meeting Agent · Submittal | 79.4% | 58.3% | +21.1pp
24/24
Benchmarks outperformed
+24.5pp
Avg. delta over baselines
Zero-shot
Evaluation protocol
0
Benchmark-specific fine-tuning

§ 03 — Agent Directory

100+ specialized agents across 6 functional tiers

Each agent tier addresses a distinct operational layer of the AEC project lifecycle. Agents are evaluated independently and as composable units within cross-agent workflow chains. All performance claims correspond to benchmark results disclosed in §02.
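
A minimal illustration of what "composable units within cross-agent workflow chains" could look like in code. The Agent type and chain() helper below are assumptions made for this sketch; they are not the platform's actual interfaces.

```python
# Hypothetical sketch of cross-agent workflow composition.
# The Agent alias and chain() helper are illustrative assumptions only.
from typing import Callable

Agent = Callable[[dict], dict]   # each agent consumes and returns a shared context dict


def chain(*agents: Agent) -> Agent:
    """Compose agents so the output context of one feeds the next."""
    def composed(context: dict) -> dict:
        for agent in agents:
            context = agent(context)
        return context
    return composed


# Example (hypothetical agent names): a review workflow built from independently
# evaluated agents, run as one chain:
#   workflow = chain(drawing_review_agent, code_compliance_agent, submittal_agent)
#   result = workflow({"project_id": "example", "drawings": [...]})
```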

Core Design & Compliance
Primary moat
Drawing Review · Code Compliance · Clash Detection · Specification Validation · Accessibility Compliance · Green Certification · Vastu Compliance
Top eval — 90–92% overall accuracy · 94% recall · 450+ drawings
Design & Ideation
Sells the product
Concept Design · Moodboard · Space Planning · Rendering · Sketch-to-CAD
Top eval — T2I-CompBench 0.62 vs 0.39 · AI2D 89.4%
Planning Intelligence
Intelligence depth
Site Intelligence · BOQ
Top eval — GSM8K 95.1% · FinQA 81.3% · MATH-500 74.6%
Execution & Ops
Workflow automation
Submittal Management · Vendor Management · Vendor Calling
Top eval — BFCL 91.5% · τ-bench 68.4% · StructEval 97.3%
Productivity Layer
Foundation
Email · Calendar · Meeting · Drive · Slack
Top eval — IFEval 86.4% · QMSum 31.8 ROUGE-L · AMI F1 79.4%
Research
Utility
Research Agent
Top eval — FRAMES 82.1% · TriviaQA 91.2% · NQ F1 87.6%
100+
Total agents deployed
24
Public benchmarks
450+
First-party drawings
Multi-continent
Deployment footprint
Active
Expert human review

§ 04 — Multi-Dimensional Performance

Aggregate capability profile across six evaluation axes

NLP — 87
Vision — 85
Reasoning — 92
Domain KG — 80
Agentic — 76
Compliance — 92

Compliance (92) and Reasoning (92) reflect the platform's primary strength axes — validated by practicing architects across real project workflows. Agentic (76) reflects the nascent state of agentic evaluation infrastructure globally, not a platform ceiling. Vision (85) is anchored by DocVQA 92.6%, ChartQA 88.3%, and OCRBench 856/1000.

Vision & Document Intelligence
DocVQA — 92.6% ANLS
ChartQA — 88.3% relaxed acc.
OCRBench — 856 / 1000
AI2D — 89.4% accuracy
CVBench spatial — 81.7%
InfoVQA — 84.5% ANLS
Reasoning & Planning
GSM8K — 95.1% solve rate
MATH-500 — 74.6% exact match
BFCL — 91.5% function call acc.
τ-bench — 68.4% task success
HotpotQA — 86.3% F1
FinQA — 81.3% exec accuracy
Domain Knowledge & Compliance
MMLU Engineering — 88.7%
MMLU Pro — 71.4%
SciFact — 84.2% F1
BioASQ regulatory QA — 76.8%
MuSiQue 4-hop — 58.9%
IFEval — 86.4% prompt accuracy

§ 05 — Research Timeline

From a single hypothesis to a global platform

Jun 2025
Project Starline AI — origin

Research begins with a single hypothesis: that human-grade compliance reasoning is achievable for the built environment. The first working prototype is a code compliance checker evaluated against real architectural drawings by practicing architects.

Jul 2025
First evaluation corpus assembled

450+ real architectural drawings collected — spanning residential interiors, mixed-use developments, and large-scale commercial projects. Practicing architects independently review each project, serving as the human reference baseline for all subsequent evaluations.

Aug 2025
Regulatory rule graph v1 validated

National and regional building codes structured into a machine-readable rule graph. Human-in-the-loop pipeline established: practicing architects review outputs, classifying each detection as correct, false positive, or missed. 90–92% overall accuracy confirmed. Recall reaches 94%.
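
For illustration, one possible shape of such a machine-readable rule graph is sketched below. The field names and the example clauses are invented for this sketch; the disclosed system's actual schema is not public.

```python
# Hypothetical rule-graph node schema; fields and example clauses are invented.
from dataclasses import dataclass, field


@dataclass
class RuleNode:
    rule_id: str                                           # clause identifier within a code
    jurisdiction: str                                      # national / state / municipal scope
    text: str                                              # normative clause text
    applies_to: list[str] = field(default_factory=list)    # element types the rule constrains
    depends_on: list[str] = field(default_factory=list)    # rule_ids to resolve first


# A tiny two-node graph: a general egress rule and a clause that refines it.
egress = RuleNode("EG-1", "national",
                  "Every habitable room shall have a means of egress.",
                  applies_to=["room"])
egress_width = RuleNode("EG-1.2", "national",
                        "Egress doors shall meet a minimum clear width.",
                        applies_to=["door"], depends_on=["EG-1"])
graph = {n.rule_id: n for n in (egress, egress_width)}
```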

Oct 2025
Human-grade reasoning threshold reached

The system crosses the threshold where its regulatory inference is statistically indistinguishable from that of a senior practicing architect on the evaluation corpus. This milestone — not a product launch date — triggers the decision to expand beyond compliance into a full agentic platform.

Nov 2025
Superbuilt — platform expansion begins

Project Starline AI becomes Superbuilt. The reasoning infrastructure built for compliance is extended into design intelligence, planning, execution, and productivity tiers. Each new agent class is evaluated against published benchmarks before deployment.

Feb 2026
100+ agent platform deployed across markets

The unified agentic suite reaches active use across architecture and construction workflows spanning three continents. Design, planning, coordination, and compliance agents operate as a single composable system — the first of its kind in the global AEC industry.

Apr 2026
Research disclosure — methodology & benchmarks

Evaluation methodology, benchmark protocols, and performance metrics disclosed publicly for the first time. All benchmark results reference published leaderboards and are reproducible using standard evaluation tooling.


§ 06 — Methodology & Disclosure

Evaluation transparency

The research underlying Superbuilt's platform capabilities is proprietary and has not been publicly released. This disclosure is provided to establish the rigour of our evaluation methodology, the independence of our benchmark protocols, and the practitioner-grounded basis of our performance claims — for the benefit of prospective research collaborators, engineering candidates, and institutional partners.

Practitioner-led evaluation

All compliance and design agent evaluations were conducted by real, practicing architects using the platform in active project workflows. Human reviewers serve as the reference baseline — not academic annotators or synthetic benchmarks.

Public benchmark protocol

All public benchmarks evaluated zero-shot or few-shot using published evaluation scripts and prompts. No benchmark-specific fine-tuning or prompt engineering applied at any stage.

Baseline citation methodology

Baselines sourced from original benchmark papers and Papers With Code leaderboards. Comparisons use pre-reasoning-era specialized models, not contemporaneous frontier systems.

Domain expert review process

Green Certification, Vastu, Accessibility, and jurisdiction-specific compliance agents are reviewed by licensed architecture practitioners. No public benchmark exists for these domains — internal evaluation is the industry standard.

Statutory disclaimer

Superbuilt agents are decision-support tools. Outputs do not constitute statutory regulatory approvals in any jurisdiction. All compliance determinations require verification by a licensed professional of record.

Known limitations

Performance varies with project annotation density, jurisdiction specificity, and regulatory ambiguity. Borderline cases are flagged conservatively. Jurisdiction-specific edge cases in under-represented markets may reduce recall.

Benchmark citation index

IFEval · Zhou et al., 2023
HELMET · Shi et al., 2024
DocVQA · Mathew et al., 2021
ChartQA · Masry et al., 2022
OCRBench · Liu et al., 2024
AI2D · Kembhavi et al., 2016
CVBench · Tong et al., 2024
InfoVQA · Mathew et al., 2022
GSM8K · Cobbe et al., 2021
MATH-500 · Lightman et al., 2023
BFCL · Yan et al., 2024
τ-bench · Yao et al., 2024
FinQA · Chen et al., 2021
HotpotQA · Yang et al., 2018
MuSiQue · Trivedi et al., 2022
MMLU · Hendrycks et al., 2021
MMLU Pro · Wang et al., 2024
SciFact · Wadden et al., 2020
BioASQ · Tsatsaronis et al., 2015
QMSum · Zhong et al., 2021
SchemaGuidedDialogue · Rastogi et al., 2020
AMI Corpus · McCowan et al., 2005
ZeroSCROLLS · Shaham et al., 2023
FRAMES · Krishna et al., 2024
Superbuilt · Research & Evaluation · April 2026 · superbuilt.ai · All benchmarks publicly verifiable · Papers With Code