Comparative Analysis: Pre-Flight vs MITRE/FAA ALUE Benchmarks

A comprehensive analysis of two pioneering aviation LLM assurance benchmarks, examining how Airside Labs' Pre-Flight and MITRE/FAA's ALUE address distinct operational layers in aerospace AI safety.
The Short Version
2025 gave the aviation industry not one but two purpose-built benchmarks for evaluating large language models. We launched Pre-Flight in January. Eight months later, MITRE and the FAA released the Aerospace Language Understanding Evaluation (ALUE). The natural instinct is to compare them head-to-head, but that misses the point. These benchmarks tackle different problems at different layers of the aviation stack, and understanding where each one fits matters more than picking a winner.
Pre-Flight was built for the commercial side of the house — ground operations, ICAO compliance, dispatch procedures, and increasingly, security and regulatory audit readiness against frameworks like the EU AI Act, OWASP LLM Top 10, and NIST RMF. Think of it as validating whether an LLM can be trusted with the operational and compliance realities that airlines and ground handlers face every day.
ALUE, backed by the FAA through MITRE's federally funded research centre, is aimed squarely at the National Airspace System. Its focus is on aviation-specific language, nomenclature, and the kind of system-level safety assurance you need before letting an AI anywhere near air traffic management decisions. The FAA wants a "definitive library" of aviation terminology that LLMs must demonstrate they understand before they can be considered for use in NAS operations.
The two benchmarks complement each other. Pre-Flight validates compliance and security posture early — does this model meet governance and risk standards? ALUE then validates whether the model has the deep aerospace domain understanding required for integration into federal safety-critical systems. It is a layered approach, not a competition.
Why Aviation AI Needs Purpose-Built Evaluation
There is nothing controversial about saying that AI errors in aviation could be catastrophic. What is more interesting is how the industry is responding. Both benchmarks emerged from the same recognition: generic AI evaluation methods are not fit for purpose in safety-critical aerospace applications. You cannot rely on general-purpose leaderboards to tell you whether an LLM will hallucinate a non-existent NOTAM, confuse ICAO wake turbulence categories, or leak sensitive passenger data.
The Regulators Are Watching
EASA's September 2025 survey results confirmed what most of us already knew. Aviation professionals are most concerned about the limits of AI performance, data protection, accountability, and safety implications. A strong majority called for active regulation and supervision by EASA and national authorities. The message is clear: assurance is not optional, it is a prerequisite.
These concerns are exactly what both benchmarks aim to address, albeit from different angles.
ALUE: The Federal Standard
MITRE has been shaping aerospace capabilities for over sixty years through its work with the FAA's Center for Advanced Aviation System Development (CAASD). That institutional weight matters. When MITRE publishes a benchmark, it carries the authority of the federal aviation safety establishment.
ALUE is designed to set common, verifiable standards for AI tools operating within the NAS. The emphasis is on whether LLMs genuinely understand aviation language and context — not just surface-level pattern matching, but the kind of precise, technically accurate comprehension that air traffic management demands. MITRE's research found that general-purpose models tend to produce verbose, unstructured outputs when confronted with aviation tasks, and that structured prompts and in-context examples significantly improved performance. This finding is baked into the benchmark's design.
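The structured-prompt effect is easy to picture with a small sketch. The example task, answer choices, and format instruction below are our own illustrative inventions, not material from ALUE; the point is how a single worked in-context example plus an explicit output constraint steers a model toward the concise, parseable answers the MITRE team reported.

```python
def build_structured_prompt(question: str, choices: list[str]) -> str:
    """Assemble a prompt with one in-context example and a strict answer format."""
    # One worked example shows the model the expected shape of the answer.
    example = (
        "Question: What does the abbreviation 'QNH' refer to?\n"
        "Choices: A) Runway heading  B) Altimeter setting  C) Wind speed\n"
        "Answer: B"
    )
    formatted = "  ".join(f"{letter}) {text}" for letter, text in zip("ABC", choices))
    return (
        # An explicit constraint discourages verbose, unstructured replies.
        "Answer with a single letter only. Do not explain.\n\n"
        f"{example}\n\n"
        f"Question: {question}\n"
        f"Choices: {formatted}\n"
        "Answer:"
    )

print(build_structured_prompt(
    "Which ICAO wake turbulence category applies to the A380?",
    ["Medium", "Heavy", "Super"],
))
```

A scorer can then compare the model's single-letter reply against the key, which is far more robust than parsing a paragraph of free text.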
Pre-Flight: Commercial Agility and Compliance
We built Pre-Flight to bridge the gap between cutting-edge AI capabilities and safe, reliable deployment for commercial aviation. The design reflects a market-driven need: airlines, ground handlers, and aviation technology providers need rapid, demonstrable validation before integrating LLMs into their operations. That means testing against the EU AI Act, GDPR, OWASP, NIST RMF, and MITRE ATLAS — not as a theoretical exercise, but to produce audit-ready compliance documentation.
Where the government-backed ALUE focuses on NAS safety and airspace integrity, Pre-Flight addresses the immediate, tangible risks that commercial partners face: data privacy vulnerabilities, regulatory non-compliance, and the business consequences of deploying an AI system that cannot reliably follow standard operating procedures.
A Closer Look at ALUE
The ALUE benchmark, developed by Eugene Mangortey, Kunal Sarkhel, and their team, is built to be flexible. It supports open-source and domain-specific LLMs, custom datasets, user-defined prompts, and various quantitative metrics. This versatility is intentional — the FAA and broader aerospace community need a framework that can adapt to different evaluation needs across the sector.
Where ALUE Is Heading
The more interesting story is ALUE's roadmap. Future versions are expected to require LLMs to extract data from charts, consult aircraft operational manuals, and determine technical parameters like thrust settings and flap configurations under specific conditions. This is a fundamentally different challenge from answering knowledge questions. The model would need to synthesise information from multiple sources — text, charts, technical manuals — and arrive at a specific, safety-critical conclusion.
In practical terms, ALUE is moving toward validating Retrieval-Augmented Generation (RAG) and multi-modal reasoning. That positions it at the high end of the aviation operational hierarchy, where system-critical performance and real-time decision support become the bar to clear.
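To make the RAG direction concrete, here is a toy sketch of the retrieval step such an evaluation would exercise: given a technical question, select the most relevant manual excerpt before asking the model for a parameter. The excerpts and the word-overlap scoring are invented stand-ins for real manual content and real embedding-based retrieval, not anything from ALUE.

```python
def retrieve(question: str, passages: list[str]) -> str:
    """Pick the passage sharing the most words with the question (toy lexical retrieval)."""
    q_tokens = set(question.lower().split())
    return max(passages, key=lambda p: len(q_tokens & set(p.lower().split())))

# Hypothetical operational-manual excerpts for illustration only.
manual_excerpts = [
    "Flap configuration 3 is required for landing when runway length is below 2000 m.",
    "Takeoff thrust is set to FLEX when outside air temperature permits a derate.",
    "De-icing fluid holdover times depend on precipitation type and intensity.",
]

context = retrieve("What flap configuration is required on a short runway?", manual_excerpts)
print(context)
```

The benchmark question then becomes twofold: did the system fetch the right excerpt, and did the model draw the correct safety-critical conclusion from it?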
A Closer Look at Pre-Flight
Pre-Flight focuses on the operational layer of aviation — the "last mile" of commercial operations where things happen under the wing. The benchmark tests an LLM's understanding of ICAO annex documentation, flight dispatch rules, and critically, airport ground operations safety procedures: ground equipment operations, fuelling safety, aircraft towing, emergency response protocols, and scenarios like managing snow removal during winter operations. The dataset of roughly 300 questions is derived from standard international airline and airport ground operations safety manuals.
V2: Testing AI Agents, Not Just Knowledge
We ran into the same problem that plagues most AI benchmarks: leading models began saturating the top scores. The response was V2, which transforms Pre-Flight from a knowledge test into a tool for validating complex, multi-step AI agents.
V2 requires models to perform multiple logical steps within a single prompt and mandates structured outputs that conform to predefined schemas. This is not a content refresh; it is an engineering requirement. In regulated environments, you cannot have an LLM producing creative, free-form answers when the downstream system expects a specific data format. An AI agent managing a turnaround workflow needs to be predictable and auditable, not eloquent. Pre-Flight V2 validates exactly that: can this model act as a reliable component within an automated operational workflow?
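What schema conformance means in practice can be sketched in a few lines. The field names below are a hypothetical turnaround-task schema of our own, not Pre-Flight's actual schema; the check itself (parse, verify fields, verify types) is the general pattern.

```python
import json

# Hypothetical schema: every agent response must carry exactly these typed fields.
REQUIRED_FIELDS = {"task_id": str, "action": str, "confirmed": bool}

def conforms(raw_output: str) -> bool:
    """Return True only if the model emitted valid JSON with the expected fields and types."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if set(data) != set(REQUIRED_FIELDS):
        return False
    return all(isinstance(data[key], typ) for key, typ in REQUIRED_FIELDS.items())

# A free-form answer fails; a schema-conformant one passes.
print(conforms("The fuelling step is complete, proceed to pushback."))            # False
print(conforms('{"task_id": "T-17", "action": "pushback", "confirmed": true}'))   # True
```

An eloquent but unparseable answer scores zero here, which is exactly the behaviour a downstream automated system needs the benchmark to enforce.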
The Security and Compliance Layer
Pre-Flight is part of a broader risk assessment framework. Beyond domain knowledge, the testing covers compliance against the EU AI Act, GDPR (data leakage risks), OWASP LLM Top 10 (common vulnerabilities), and MITRE ATLAS (adversarial threats against AI systems). The inclusion of MITRE ATLAS is worth highlighting — it means we are testing not just whether the model knows the right answers, but whether it can withstand deliberate attempts to make it produce wrong ones. This adversarial testing is a separate concern from the functional performance that ALUE measures, but equally important for any production deployment.
How They Compare
The honest assessment is that Pre-Flight and ALUE are largely complementary. They address different layers of the aviation ecosystem and different phases of the LLM deployment lifecycle.
ALUE targets the systemic layer — NAS, air traffic management, and real-time decision support. Pre-Flight targets the operational layer — ICAO procedures, dispatch, and ground operations. The overlap is limited to foundational regulatory knowledge and general safety concepts drawn from ICAO Annexes.
The methodological differences are equally significant. ALUE is structurally broad, supporting diverse data formats and focused on ensuring models produce concise, technically accurate outputs rather than verbose noise. Pre-Flight V2 is focused on deterministic correctness through engineered constraints — structured outputs and constrained decoding that validate whether an LLM can function reliably as a software component.
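Constrained decoding of the kind described can be pictured as filtering the model's candidate outputs down to a fixed answer vocabulary. The scores below are an invented stand-in for real model logits; the technique is simply to take the best-scoring token among the permitted ones.

```python
def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
    """Choose the highest-scoring candidate from the allowed answer set only."""
    permitted = {tok: score for tok, score in logits.items() if tok in allowed}
    return max(permitted, key=permitted.get)

# Without the constraint the model would emit "Certainly"; with it, the
# output is guaranteed to be one of the machine-readable answer letters.
scores = {"B": 2.1, "Certainly": 5.0, "A": 1.4, "C": 0.3}
print(constrained_pick(scores, allowed={"A", "B", "C"}))  # B
```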
| Dimension | Pre-Flight (Jan 2025) | ALUE (Sept 2025) | Assessment |
|---|---|---|---|
| Sponsor | Commercial R&D, Security Testing | Government/FFRDC, NAS Safety | Different mandates: commercial agility vs. federal standard |
| Domain | ICAO, Ground Ops, Dispatch (Under the Wing) | Aviation Nomenclature, NAS, ATM (In the Air) | Pre-Flight holds a specific operational niche |
| Method | MCQ (v1), Structured Outputs and complex logic (v2) | Diverse tasks, custom prompts, quantitative metrics | ALUE is broader; Pre-Flight V2 validates agent logic |
| Complexity | Multi-step business tasks, workflow automation | Chart extraction, manual consultation, thrust/flap parameters | ALUE targets multi-modal/RAG; Pre-Flight targets workflow automation |
| Compliance | EU AI Act, OWASP, NIST RMF, MITRE ATLAS | Addresses hallucinations, biases, privacy generally | Pre-Flight has a clear advantage in explicit compliance and audit readiness |
What This Means
For LLM developers targeting aviation, the takeaway is straightforward. If you are building AI tools for commercial airline operations, ground handling, or compliance-sensitive applications, Pre-Flight is where you start. If your ambition is integration into federal airspace management or safety-critical ATM systems, ALUE sets the standard you need to meet. For comprehensive assurance, you will likely need both.
The fact that 2025 produced two serious, purpose-built aviation AI benchmarks from very different corners of the industry is itself a signal. The era of evaluating aviation AI with general-purpose methods is ending. The question is no longer whether domain-specific assurance is necessary, but how quickly the industry can adopt it.

Airside Labs Team
Research & Development
The Airside Labs team comprises aviation experts, AI researchers, and safety-critical systems engineers dedicated to advancing AI evaluation methodologies. Our collective expertise spans air traffic management, ground operations, commercial aviation, and AI security.