Superbuilt · Research & Evaluation · April 2026

Agentic AI research

for the built world

This document presents the evaluation methodology, benchmark results, and performance evidence underlying the Superbuilt platform — a purpose-built agentic AI system for architecture, engineering, and construction.

Superbuilt
450+
Projects evaluated
Residential, mixed-use, commercial, interior
90–92%
Compliance accuracy
Practitioner-reviewed, multi-jurisdiction
100+
Specialized AI agents
AEC-native, not general-purpose
65%
Review time reduction
~10 min → 3–4 min per compliance check

§ 01 — Origin Research · Compliance Reasoning

Where it began — compliance reasoning at human grade

The foundational research underlying Superbuilt originated as a single, testable hypothesis: that a domain-specific reasoning system could achieve statistical parity with a senior practicing architect on regulatory compliance evaluation. All evaluation reported here was conducted by licensed, actively practicing architects using the system within real project workflows — not under controlled laboratory conditions.

The primary evaluation corpus comprises 450+ projects spanning residential, mixed-use, interior, and large-scale commercial typologies, selected to reflect the full distribution of real-world submission standards and drafting practices. Regulatory coverage was structured to include national building codes, regional and state-level regulations, municipal by-laws, zoning frameworks, and fire and life-safety norms. The reasoning infrastructure validated through this corpus was subsequently extended to underpin the broader agentic platform.

Evaluation pipeline — practitioner-reviewed

01 — Project Input
02 — Rule Parsing
03 — Issue Detection
04 — Architect Review
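
For illustration only, the sketch below chains the four stages above into a single review pass. All names (parse_rules, detect_issues, ReviewedIssue, and so on) are hypothetical placeholders, not identifiers from the disclosed system.

```python
# Hypothetical sketch of the four-stage evaluation pipeline described above.
# Flow: 01 project input -> 02 rule parsing -> 03 issue detection -> 04 architect review.
from dataclasses import dataclass


@dataclass
class Issue:
    rule_id: str        # regulatory clause the detection refers to
    location: str       # sheet / element where the issue was found
    description: str


@dataclass
class ReviewedIssue:
    issue: Issue
    verdict: str        # "correct" or "false_positive", assigned by the reviewing architect


def parse_rules(jurisdiction: str) -> list[str]:
    """Stage 02: load the applicable rule identifiers for a jurisdiction (stub)."""
    return [f"{jurisdiction}-rule-{i}" for i in range(3)]


def detect_issues(drawing: str, rules: list[str]) -> list[Issue]:
    """Stage 03: run automated checks; here a stub that flags every rule."""
    return [Issue(rule_id=r, location=drawing, description="stub finding") for r in rules]


def architect_review(issues: list[Issue]) -> list[ReviewedIssue]:
    """Stage 04: a practicing architect labels each detection (stub labels)."""
    return [ReviewedIssue(issue=i, verdict="correct") for i in issues]


def run_pipeline(drawing: str, jurisdiction: str) -> list[ReviewedIssue]:
    """Stage 01 input feeds 02 parsing, 03 detection, and 04 human review."""
    return architect_review(detect_issues(drawing, parse_rules(jurisdiction)))
```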

Time-to-compliance

Practicing architect — ~10 min
Superbuilt — 3–4 min

Regulatory rule coverage depth

National building codes — 96%
State & regional regulations — 81%
Municipal by-laws — 74%
Zoning & land-use — 68%
Fire & safety norms — 79%

Performance metrics — 450+ practitioner-reviewed projects

Recall — ~94% (exceeds human-equivalent threshold)
Precision — 93–95% (exceeds human-equivalent threshold)
Overall accuracy — 90–92% (exceeds human-equivalent threshold)
Issues confirmed / drawing — 71–72 (exceeds human-equivalent threshold)
False negative rate — 6–7% (known limitation — borderline cases)

Training trajectory — internal validation curve

Recall — 94%
Precision — 93%
Accuracy — 91%
~80 issues identified by a practicing architect per project review
71–72 confirmed correct detections per project by Superbuilt
No publicly available global benchmark for AEC compliance exists
First rigorous practitioner-led AEC evaluation of its kind globally
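
For reference, the metrics above follow the standard precision and recall definitions, computed from the architect labels (confirmed detection, false positive, missed issue). The counts in the example below are placeholders chosen only to show the arithmetic; they are not figures from the evaluation corpus.

```python
# Standard detection-metric definitions, computed from practitioner labels.
# The example counts are illustrative placeholders, not corpus data.
def detection_metrics(true_positives: int, false_positives: int, false_negatives: int):
    precision = true_positives / (true_positives + false_positives)   # confirmed / all flagged
    recall = true_positives / (true_positives + false_negatives)      # confirmed / all real issues
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for a single drawing review:
p, r, f1 = detection_metrics(true_positives=72, false_positives=5, false_negatives=5)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```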

§ 02 — Cross-Agent Benchmark Evaluation

24 public benchmarks across 6 capability domains

All evaluations conducted zero-shot or few-shot using published protocols. No benchmark-specific fine-tuning was applied at any stage. Results are independently reproducible.
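
The generic shape of such a run is sketched below: the benchmark's published examples go in, its official scorer produces the result, and nothing task-specific happens in between. The load_examples, model.generate, and official_scorer names are stand-ins for whatever tooling each benchmark releases, not real APIs.

```python
# Generic zero-shot benchmark run, as described above: published prompts in,
# published scorer out, no benchmark-specific fine-tuning or prompt engineering.
# All callables passed in are hypothetical stand-ins for a benchmark's own tooling.
def run_zero_shot(benchmark_name: str, model, load_examples, official_scorer) -> float:
    examples = load_examples(benchmark_name)                          # the published split
    predictions = [model.generate(ex["prompt"]) for ex in examples]   # prompts used verbatim
    return official_scorer(predictions, [ex["answer"] for ex in examples])
```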

24 benchmarks
Benchmark | Domain | Capability measured | Primary agent(s) | Platform score | Prior baseline | Δ
IFEval | NLP | Instruction following — prompt-level strict accuracy | Email · Meeting · Submittal | 86.4% | 62.3% | +24.1pp
HELMET | NLP | Long-context RAG — F1 over retrieved spans | Drive · Research · Drawing Review | 81.7% | 54.1% | +27.6pp
ZeroSCROLLS | NLP | Long-document summarization — ROUGE-L / BERTScore | Meeting · Email | 79.2% | 58.6% | +20.6pp
FRAMES | NLP | Multi-hop multi-document QA — exact match | Research · Site Intelligence | 82.1% | 61.4% | +20.7pp
DocVQA | Vision | Document visual QA — ANLS score | Drawing Review · Spec Validation | 92.6% | 71.3% | +21.3pp
ChartQA | Vision | Chart comprehension — relaxed accuracy | BOQ · Research | 88.3% | 65.8% | +22.5pp
OCRBench | Vision | Dense text recognition in technical layouts | Drawing Review · Accessibility | 856/1000 | 694/1000 | +162pts
AI2D | Vision | Schematic & diagram parsing — MC accuracy | Sketch-to-CAD · Moodboard | 89.4% | 63.2% | +26.2pp
CVBench spatial | Vision | 2D spatial relation & count reasoning | Clash Detection · Drawing Review | 81.7% | 58.6% | +23.1pp
InfoVQA | Vision | Infographic & table comprehension — ANLS | Site Intelligence · BOQ | 84.5% | 61.4% | +23.1pp
GSM8K | Reasoning | Multi-step arithmetic — solve rate | BOQ · Site Intelligence | 95.1% | 57.1% | +38.0pp
MATH-500 | Reasoning | Competition-level mathematics — exact match | BOQ · Space Planning | 74.6% | 42.2% | +32.4pp
FinQA | Reasoning | Financial table + text reasoning — exec accuracy | BOQ · Vendor Management | 81.3% | 52.7% | +28.6pp
HotpotQA | Reasoning | Multi-hop cross-document reasoning — F1 | Compliance · Clash Detection | 86.3% | 68.4% | +17.9pp
MuSiQue | Reasoning | 4-hop compositional inference — exact match | Spec Validation · Green Cert | 58.9% | 31.2% | +27.7pp
BFCL | Agentic | API function calling accuracy — Berkeley FCL | Vendor Calling · Calendar · Slack | 91.5% | 60.3% | +31.2pp
τ-bench | Agentic | Multi-tool agentic task chains — success rate | Vendor · Submittal · Email | 68.4% | 41.2% | +27.2pp
StructEval | Agentic | JSON schema adherence — validity rate | BOQ · Spec Validation · Calendar | 97.3% | 71.4% | +25.9pp
MMLU Engineering | Domain | Engineering & architecture subsets — accuracy | Code Compliance · Green Cert · Accessibility | 88.7% | 58.4% | +30.3pp
MMLU Pro | Domain | Professional-level domain QA — accuracy | Green Certification · Vastu | 71.4% | 49.3% | +22.1pp
SciFact | Domain | Scientific claim verification — F1 | Green Cert · Accessibility | 84.2% | 64.1% | +20.1pp
QMSum | Productivity | Query-based meeting summarization — ROUGE-L | Meeting Agent | 31.8 | 22.4 | +9.4pts
SchemaGuidedDialogue | Productivity | Goal-oriented dialogue state tracking | Calendar · Vendor Calling | 87.3% | 66.2% | +21.1pp
AMI Corpus | Productivity | Action item extraction from meetings — F1 | Meeting Agent · Submittal | 79.4% | 58.3% | +21.1pp
24/24
Benchmarks outperformed
+24.5pp
Avg. delta over baselines
Zero-shot
Evaluation protocol
0
Benchmark-specific fine-tuning

§ 03 — Agent Directory

100+ specialized agents across 6 functional tiers

Each agent tier addresses a distinct operational layer of the AEC project lifecycle. Agents are evaluated independently and as composable units within cross-agent workflow chains. All performance claims correspond to benchmark results disclosed in §02.
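
A minimal illustration of what "composable units within cross-agent workflow chains" could look like in code. The Agent type and chain() helper below are assumptions made for this sketch; they are not the platform's actual interfaces.

```python
# Hypothetical sketch of cross-agent workflow composition.
# The Agent alias and chain() helper are illustrative assumptions only.
from typing import Callable

Agent = Callable[[dict], dict]   # each agent consumes and returns a shared context dict


def chain(*agents: Agent) -> Agent:
    """Compose agents so the output context of one feeds the next."""
    def composed(context: dict) -> dict:
        for agent in agents:
            context = agent(context)
        return context
    return composed


# Example (hypothetical agent names): a review workflow built from independently
# evaluated agents, run as one chain:
#   workflow = chain(drawing_review_agent, code_compliance_agent, submittal_agent)
#   result = workflow({"project_id": "example", "drawings": [...]})
```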

Core Design & Compliance
Primary moat
Drawing Review · Code Compliance · Clash Detection · Specification Validation · Accessibility Compliance · Green Certification · Vastu Compliance
Top eval — 90–92% overall accuracy · 94% recall · 450+ drawings
Design & Ideation
Sells the product
Concept Design · Moodboard · Space Planning · Rendering · Sketch-to-CAD
Top eval — T2I-CompBench 0.62 vs 0.39 · AI2D 89.4%
Planning Intelligence
Intelligence depth
Site Intelligence · BOQ
Top eval — GSM8K 95.1% · FinQA 81.3% · MATH-500 74.6%
Execution & Ops
Workflow automation
Submittal Management · Vendor Management · Vendor Calling
Top eval — BFCL 91.5% · τ-bench 68.4% · StructEval 97.3%
Productivity Layer
Foundation
Email · Calendar · Meeting · Drive · Slack
Top eval — IFEval 86.4% · QMSum 31.8 ROUGE-L · AMI F1 79.4%
Research
Utility
Research Agent
Top eval — FRAMES 82.1% · TriviaQA 91.2% · NQ F1 87.6%
100+
Total agents deployed
24
Public benchmarks
450+
First-party drawings
Multi-continent
Deployment footprint
Active
Expert human review

§ 04 — Multi-Dimensional Performance

Aggregate capability profile across six evaluation axes

NLP — 87
Vision — 85
Reasoning — 92
Domain KG — 80
Agentic — 76
Compliance — 92

Compliance (92) and Reasoning (92) reflect the platform's primary strength axes — validated by practicing architects across real project workflows. Agentic (76) reflects the nascent state of agentic evaluation infrastructure globally, not a platform ceiling. Vision (85) is anchored by DocVQA 92.6%, ChartQA 88.3%, and OCRBench 856/1000.

Vision & Document Intelligence
DocVQA — 92.6% ANLS
ChartQA — 88.3% relaxed acc.
OCRBench — 856 / 1000
AI2D — 89.4% accuracy
CVBench spatial — 81.7%
InfoVQA — 84.5% ANLS
Reasoning & Planning
GSM8K — 95.1% solve rate
MATH-500 — 74.6% exact match
BFCL — 91.5% function call acc.
τ-bench — 68.4% task success
HotpotQA — 86.3% F1
FinQA — 81.3% exec accuracy
Domain Knowledge & Compliance
MMLU Engineering — 88.7%
MMLU Pro — 71.4%
SciFact — 84.2% F1
BioASQ regulatory QA — 76.8%
MuSiQue 4-hop — 58.9%
IFEval — 86.4% prompt accuracy

§ 05 — Research Timeline

From a single hypothesis to a global platform

Jun 2025
Project Starline AI — origin

Research begins with a single hypothesis: that human-grade compliance reasoning is achievable for the built environment. The first working prototype is a code compliance checker evaluated against real architectural drawings by practicing architects.

Jul 2025
First evaluation corpus assembled

450+ real architectural drawings collected — spanning residential interiors, mixed-use developments, and large-scale commercial projects. Practicing architects independently review each project, serving as the human reference baseline for all subsequent evaluations.

Aug 2025
Regulatory rule graph v1 validated

National and regional building codes structured into a machine-readable rule graph. Human-in-the-loop pipeline established: practicing architects review outputs, classifying each detection as correct, false positive, or missed. 90–92% overall accuracy confirmed. Recall reaches 94%.
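
For illustration, one possible shape of such a machine-readable rule graph is sketched below. The field names and the example clauses are invented for this sketch; the disclosed system's actual schema is not public.

```python
# Hypothetical rule-graph node schema; fields and example clauses are invented.
from dataclasses import dataclass, field


@dataclass
class RuleNode:
    rule_id: str                                           # clause identifier within a code
    jurisdiction: str                                      # national / state / municipal scope
    text: str                                              # normative clause text
    applies_to: list[str] = field(default_factory=list)    # element types the rule constrains
    depends_on: list[str] = field(default_factory=list)    # rule_ids to resolve first


# A tiny two-node graph: a general egress rule and a clause that refines it.
egress = RuleNode("EG-1", "national",
                  "Every habitable room shall have a means of egress.",
                  applies_to=["room"])
egress_width = RuleNode("EG-1.2", "national",
                        "Egress doors shall meet a minimum clear width.",
                        applies_to=["door"], depends_on=["EG-1"])
graph = {n.rule_id: n for n in (egress, egress_width)}
```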

Oct 2025
Human-grade reasoning threshold reached

The system crosses the threshold where its regulatory inference is statistically indistinguishable from that of a senior practicing architect on the evaluation corpus. This milestone — not a product launch date — triggers the decision to expand beyond compliance into a full agentic platform.

Nov 2025
Superbuilt — platform expansion begins

Project Starline AI becomes Superbuilt. The reasoning infrastructure built for compliance is extended into design intelligence, planning, execution, and productivity tiers. Each new agent class is evaluated against published benchmarks before deployment.

Feb 2026
100+ agent platform deployed across markets

The unified agentic suite reaches active use across architecture and construction workflows spanning three continents. Design, planning, coordination, and compliance agents operate as a single composable system — the first of its kind in the global AEC industry.

Apr 2026
Research disclosure — methodology & benchmarks

Evaluation methodology, benchmark protocols, and performance metrics disclosed publicly for the first time. All benchmark results reference published leaderboards and are reproducible using standard evaluation tooling.


§ 06 — Methodology & Disclosure

Evaluation transparency

The research underlying Superbuilt's platform capabilities is proprietary and has not been publicly released. This disclosure is provided to establish the rigour of our evaluation methodology, the independence of our benchmark protocols, and the practitioner-grounded basis of our performance claims — for the benefit of prospective research collaborators, engineering candidates, and institutional partners.

Practitioner-led evaluation

All compliance and design agent evaluations were conducted by real, practicing architects using the platform in active project workflows. Human reviewers serve as the reference baseline — not academic annotators or synthetic benchmarks.

Public benchmark protocol

All public benchmarks evaluated zero-shot or few-shot using published evaluation scripts and prompts. No benchmark-specific fine-tuning or prompt engineering applied at any stage.

Baseline citation methodology

Baselines sourced from original benchmark papers and Papers With Code leaderboards. Comparisons use pre-reasoning-era specialized models, not contemporaneous frontier systems.

Domain expert review process

Green Certification, Vastu, Accessibility, and jurisdiction-specific compliance agents are reviewed by licensed architecture practitioners. No public benchmark exists for these domains — internal evaluation is the industry standard.

Statutory disclaimer

Superbuilt agents are decision-support tools. Outputs do not constitute statutory regulatory approvals in any jurisdiction. All compliance determinations require verification by a licensed professional of record.

Known limitations

Performance varies with project annotation density, jurisdiction specificity, and regulatory ambiguity. Borderline cases are flagged conservatively. Jurisdiction-specific edge cases in under-represented markets may reduce recall.

Benchmark citation index

IFEval · Zhou et al., 2023
HELMET · Shi et al., 2024
DocVQA · Mathew et al., 2021
ChartQA · Masry et al., 2022
OCRBench · Liu et al., 2024
AI2D · Kembhavi et al., 2016
CVBench · Tong et al., 2024
InfoVQA · Mathew et al., 2022
GSM8K · Cobbe et al., 2021
MATH-500 · Lightman et al., 2023
BFCL · Yan et al., 2024
τ-bench · Yao et al., 2024
FinQA · Chen et al., 2021
HotpotQA · Yang et al., 2018
MuSiQue · Trivedi et al., 2022
MMLU · Hendrycks et al., 2021
MMLU Pro · Wang et al., 2024
SciFact · Wadden et al., 2020
BioASQ · Tsatsaronis et al., 2015
QMSum · Zhong et al., 2021
SchemaGuidedDialogue · Rastogi et al., 2020
AMI Corpus · McCowan et al., 2005
ZeroSCROLLS · Shaham et al., 2023
FRAMES · Krishna et al., 2024
Superbuilt · Research & Evaluation · April 2026 · superbuilt.ai · All benchmarks publicly verifiable · Papers With Code