Superbuilt · Research & Evaluation · April 2026
This document presents the evaluation methodology, benchmark results, and performance evidence underlying the Superbuilt platform — a purpose-built agentic AI system for architecture, engineering, and construction.
§ 01 — Origin Research · Compliance Reasoning
The foundational research underlying Superbuilt began with a single, testable hypothesis: that a domain-specific reasoning system could achieve statistical parity with a senior practicing architect on regulatory compliance evaluation. All evaluation reported here was conducted by licensed, actively practicing architects using the system within real project workflows — not under controlled laboratory conditions.
The primary evaluation corpus comprises 450+ projects spanning residential, mixed-use, interior, and large-scale commercial typologies, selected to reflect the full distribution of real-world submission standards and drafting practices. Regulatory coverage was structured to include national building codes, regional and state-level regulations, municipal by-laws, zoning frameworks, and fire and life-safety norms. The reasoning infrastructure validated through this corpus was subsequently extended to underpin the broader agentic platform.
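For readers picturing how such a corpus is organized, the sketch below shows one plausible record structure for a tiered regulatory index. It is illustrative only: the `RuleRecord` class, its fields, and the tier names are assumptions made for explanation, not Superbuilt's internal schema.

```python
from dataclasses import dataclass
from enum import Enum


class JurisdictionTier(Enum):
    """Regulatory layers covered by the evaluation corpus described above."""
    NATIONAL_CODE = "national building code"
    REGIONAL_STATE = "regional / state-level regulation"
    MUNICIPAL_BYLAW = "municipal by-law"
    ZONING = "zoning framework"
    FIRE_LIFE_SAFETY = "fire and life-safety norm"


@dataclass
class RuleRecord:
    """One machine-checkable clause in a tiered rule index (hypothetical schema)."""
    rule_id: str                 # clause reference within the source document
    tier: JurisdictionTier       # regulatory layer the clause belongs to
    jurisdiction: str            # geography in which the clause applies
    clause_text: str             # normative text as published
    applies_to: list[str]        # project typologies that can trigger the clause


# Illustrative entry only; not an actual rule from the corpus.
example = RuleRecord(
    rule_id="FLS-EXIT-WIDTH-01",
    tier=JurisdictionTier.FIRE_LIFE_SAFETY,
    jurisdiction="example-municipality",
    clause_text="Minimum clear exit width for assembly occupancies ...",
    applies_to=["mixed-use", "large-scale commercial"],
)
```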
[Figures: evaluation pipeline (practitioner-reviewed) · time-to-compliance · regulatory rule coverage depth · performance metrics across 450+ practitioner-reviewed projects · training trajectory (internal validation curve)]
§ 02 — Cross-Agent Benchmark Evaluation
All evaluations were conducted zero-shot or few-shot using published protocols. No benchmark-specific fine-tuning was applied at any stage, and results are independently reproducible.
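In practice, a zero-shot pass of the kind described here reduces to prompting the model with each benchmark item verbatim and scoring the raw output with the benchmark's published metric. The loop below is a minimal sketch of that protocol; `generate` and `official_metric` are placeholder callables, not Superbuilt or benchmark-suite APIs.

```python
from typing import Callable, Iterable


def evaluate_zero_shot(
    items: Iterable[dict],
    generate: Callable[[str], str],
    official_metric: Callable[[str, str], float],
) -> float:
    """Score a model zero-shot: verbatim prompts, no fine-tuning, published metric.

    Each item is assumed to carry the benchmark's own 'prompt' and 'answer'
    fields; 'official_metric' stands in for the benchmark's scoring function.
    """
    scores = []
    for item in items:
        prediction = generate(item["prompt"])  # no task-specific prompt engineering
        scores.append(official_metric(prediction, item["answer"]))
    return sum(scores) / len(scores) if scores else 0.0
```

A few-shot run differs only in that the benchmark's published exemplars are prepended to each prompt exactly as the protocol specifies.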
§ 03 — Agent Directory
Each agent tier addresses a distinct operational layer of the AEC project lifecycle. Agents are evaluated independently and as composable units within cross-agent workflow chains. All performance claims correspond to benchmark results disclosed in §02.
§ 04 — Multi-Dimensional Performance
Compliance (92) and Reasoning (92) reflect the platform's primary strength axis — validated by practicing architects across real project workflows. Agentic (76) reflects the still-nascent state of agentic evaluation infrastructure rather than a platform ceiling. Vision (85) is anchored by DocVQA 92.6%, ChartQA 88.3%, and OCRBench 856/1000.
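The vision sub-scores above are reported on different scales: DocVQA and ChartQA are percentages, while OCRBench is scored out of 1000. The arithmetic below puts them on a common 0–100 axis; the unweighted mean is illustrative only, since the weighting behind the Vision axis score of 85 is not disclosed here.

```python
# Put the three vision sub-benchmarks on a common 0-100 scale.
docvqa = 92.6                    # reported as a percentage
chartqa = 88.3                   # reported as a percentage
ocrbench = 856 / 1000 * 100      # 856 out of 1000 -> 85.6

# Unweighted mean shown for illustration only; the actual aggregation
# behind the Vision axis score is not specified in this document.
vision_unweighted_mean = (docvqa + chartqa + ocrbench) / 3  # ~88.8
```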
§ 05 — Research Timeline
Research begins with a single hypothesis: that human-grade compliance reasoning is achievable for the built environment. The first working prototype is a code compliance checker evaluated against real architectural drawings by practicing architects.
450+ real architectural drawings collected — spanning residential interiors, mixed-use developments, and large-scale commercial projects. Practicing architects independently review each project, serving as the human reference baseline for all subsequent evaluations.
National and regional building codes structured into a machine-readable rule graph. Human-in-the-loop pipeline established: practicing architects review outputs, classifying each detection as correct, false positive, or missed. 90–92% overall accuracy confirmed. Recall reaches 94%.
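The reviewer classifications in this pipeline map directly onto standard detection metrics: a "correct" detection is a true positive, a "false positive" flag is a false positive, and a "missed" item is a false negative. The sketch below shows that mapping; the counts at the bottom are placeholders for illustration, not figures from the corpus.

```python
from collections import Counter


def detection_metrics(review_labels: list[str]) -> dict[str, float]:
    """Compute precision and recall from per-detection reviewer labels.

    Labels follow the human-in-the-loop protocol described above:
    'correct' (true positive), 'false_positive', and 'missed' (false negative).
    """
    counts = Counter(review_labels)
    tp = counts["correct"]
    fp = counts["false_positive"]
    fn = counts["missed"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


# Placeholder counts for illustration only (not corpus figures).
labels = ["correct"] * 94 + ["missed"] * 6 + ["false_positive"] * 9
print(detection_metrics(labels))  # recall = 0.94 with these made-up counts
```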
The system crosses the threshold where its regulatory inference is statistically indistinguishable from that of a senior practicing architect on the evaluation corpus. This milestone — not a product launch date — triggers the decision to expand beyond compliance into a full agentic platform.
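Statistical indistinguishability of this kind is typically assessed with a paired test over items judged by both the system and the human reviewer. The exact test used is not disclosed in this document; the sketch below implements an exact McNemar-style test on discordant pairs as one plausible formulation, with made-up counts.

```python
from math import comb


def mcnemar_exact_p(system_only_correct: int, human_only_correct: int) -> float:
    """Two-sided exact McNemar test on discordant pairs (one plausible parity test).

    system_only_correct: items the system judged correctly but the reviewer did not.
    human_only_correct:  items the reviewer judged correctly but the system did not.
    A large p-value means the two error profiles cannot be distinguished
    at the chosen significance level.
    """
    n = system_only_correct + human_only_correct
    if n == 0:
        return 1.0
    k = min(system_only_correct, human_only_correct)
    # Exact binomial tail probability under the null hypothesis p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up discordant counts for illustration only.
print(mcnemar_exact_p(system_only_correct=11, human_only_correct=14))
```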
Project Starline AI becomes Superbuilt. The reasoning infrastructure built for compliance is extended into design intelligence, planning, execution, and productivity tiers. Each new agent class is evaluated against published benchmarks before deployment.
The unified agentic suite reaches active use across architecture and construction workflows spanning three continents. Design, planning, coordination, and compliance agents operate as a single composable system — the first of its kind in the global AEC industry.
Evaluation methodology, benchmark protocols, and performance metrics disclosed publicly for the first time. All benchmark results reference published leaderboards and are reproducible using standard evaluation tooling.
§ 06 — Methodology & Disclosure
The research underlying Superbuilt's platform capabilities is proprietary and has not been publicly released. This disclosure is provided to establish the rigor of our evaluation methodology, the independence of our benchmark protocols, and the practitioner-grounded basis of our performance claims — for the benefit of prospective research collaborators, engineering candidates, and institutional partners.
All compliance and design agent evaluations were conducted by real, practicing architects using the platform in active project workflows. Human reviewers serve as the reference baseline — not academic annotators or synthetic benchmarks.
All public benchmarks were evaluated zero-shot or few-shot using published evaluation scripts and prompts. No benchmark-specific fine-tuning or prompt engineering was applied at any stage.
Baselines are sourced from original benchmark papers and Papers With Code leaderboards. Comparisons use pre-reasoning-era specialized models, not contemporaneous frontier systems.
Green Certification, Vastu, Accessibility, and jurisdiction-specific compliance agents are reviewed by licensed architecture practitioners. No public benchmark exists for these domains — internal evaluation is the industry standard.
Superbuilt agents are decision-support tools. Outputs do not constitute statutory regulatory approvals in any jurisdiction. All compliance determinations require verification by a licensed professional of record.
Performance varies with project annotation density, jurisdiction specificity, and regulatory ambiguity. Borderline cases are flagged conservatively, as sketched below, and jurisdiction-specific edge cases in under-represented markets may reduce recall.
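Conservative flagging means that a finding the system cannot confidently resolve is routed to the professional of record rather than silently cleared. The routing logic below is a schematic sketch; the threshold value and function names are assumptions, not platform internals.

```python
def route_finding(rule_id: str, confidence: float, threshold: float = 0.8) -> str:
    """Route a compliance finding conservatively (illustrative logic only).

    Findings below the confidence threshold are flagged for the licensed
    professional of record instead of being auto-cleared.
    """
    if confidence >= threshold:
        return f"{rule_id}: reported (confidence {confidence:.2f})"
    return f"{rule_id}: borderline, flagged for review by the professional of record"
```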
Benchmark citation index