Pre-Flight Aviation AI Benchmark
Pre-Flight is an evaluation benchmark that tests a model's understanding of ICAO annex documentation, flight dispatch rules, and airport ground operations safety procedures and protocols. The evaluation consists of multiple-choice questions drawn from international airline and airport ground operations safety manuals.
The benchmark was developed by Airside Labs' founder along with a small community of aviation experts with experience in Air Traffic Management, ground operations and commercial flying.
Dataset Overview
The dataset consists of multiple-choice questions drawn from standard international airport ground operations safety manuals. Each question has four or five answer choices, exactly one of which is correct.
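As a rough illustration of that question format, the sketch below models one record and checks the two stated constraints (four or five choices, one correct answer). The class name `MCQuestion` and its field names are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass

LETTERS = "ABCDE"


@dataclass(frozen=True)
class MCQuestion:
    # Hypothetical record layout; these field names are illustrative only.
    question: str
    choices: tuple[str, ...]
    answer: str  # letter of the single correct choice, e.g. "A"

    def __post_init__(self):
        # The dataset description states 4-5 choices per question.
        if not 4 <= len(self.choices) <= 5:
            raise ValueError("expected 4 or 5 answer choices")
        # The answer must be the letter of one of the listed choices.
        if self.answer not in tuple(LETTERS[: len(self.choices)]):
            raise ValueError("answer must be the letter of one listed choice")

    def correct_text(self) -> str:
        """Return the text of the correct choice."""
        return self.choices[LETTERS.index(self.answer)]


example = MCQuestion(
    question="What is the effect of alcohol consumption on functions of the body?",
    choices=(
        "Alcohol has an adverse effect, especially as altitude increases.",
        "Alcohol has little effect if followed by an ounce of black coffee "
        "for every ounce of alcohol.",
        "Small amounts of alcohol in the human system increase judgment and "
        "decision-making abilities.",
        "no suitable option",
    ),
    answer="A",
)
print(example.correct_text())
```

Validating at construction time keeps malformed records (too few choices, an answer letter with no matching choice) from reaching the evaluation step.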
Topics Covered
- Airport safety procedures
- Ground equipment operations
- Staff training requirements
- Emergency response protocols
- Fueling safety
- Aircraft towing procedures
Dataset Sections
Note: Gaps in ID sequences enable future additions
Example Questions
Note: Actual evaluation uses the multiple_choice solver. These examples have been lightly edited for readability.
Example 1: Basic Aviation Safety
Q: What is the effect of alcohol consumption on functions of the body?
A. "Alcohol has an adverse effect, especially as altitude increases."
B. "Alcohol has little effect if followed by an ounce of black coffee for every ounce of alcohol."
C. "Small amounts of alcohol in the human system increase judgment and decision-making abilities."
D. "no suitable option"
Correct Answer: A
Example 2: Complex Reasoning Scenario
Q: An airport is managing snow removal during winter operations on January 15th. Given the following information:
Current conditions:
- Time: 07:45
- Temperature: -3°C
- Snowfall: Light, expected to continue for 2 hours
- Ground temperature: -6°C
Aircraft movements and stand occupancy:
| Stand | Current/Next Departure | Next Arrival |
|---|---|---|
| A | BA23 with TOBT 07:55 | BA24 at 08:45 |
| B | AA12 with TOBT 08:10 | AA14 at 09:00 |
| C | Currently vacant | ETD112 at 08:15 |
| D | ETD17 with TOBT 08:40 | ETD234 at 09:30 |
Assuming it takes 20 minutes to clear each stand and only one stand can be cleared at a time, what is the most efficient order to clear the stands to maximize the number of stands cleared before they're reoccupied?
A. "A, B, C, D 1 hour 20 minutes"
B. "D, A, C, B 5 hours 5 minutes"
C. "C, A, B, D 1 hour 20 minutes"
D. "no suitable option"
Correct Answer: C
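The reasoning behind answer C can be checked with a small simulation. This is a minimal sketch, not part of the benchmark: it assumes a stand becomes vacant at the departing flight's TOBT, that clearing must finish no later than the next arrival time, and that a stand which cannot be finished in time is skipped. The names `WINDOWS` and `simulate` are illustrative.

```python
from itertools import permutations

CLEAR_MINUTES = 20


def hm(text):
    """Convert 'HH:MM' to minutes after midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


NOW = hm("07:45")

# stand -> (time the stand becomes vacant, time of the next arrival)
# Assumption: the stand is vacant from the departing flight's TOBT.
WINDOWS = {
    "A": (hm("07:55"), hm("08:45")),
    "B": (hm("08:10"), hm("09:00")),
    "C": (NOW, hm("08:15")),  # currently vacant
    "D": (hm("08:40"), hm("09:30")),
}


def simulate(order):
    """Return (stands cleared in time, minutes elapsed since NOW)."""
    t = NOW
    cleared = 0
    for stand in order:
        vacant_from, next_arrival = WINDOWS[stand]
        start = max(t, vacant_from)  # wait until the stand is empty
        end = start + CLEAR_MINUTES
        if end <= next_arrival:
            cleared += 1
            t = end
        # Otherwise the stand is skipped and no time is spent on it.
    return cleared, t - NOW


# Prefer more stands cleared, then less elapsed time.
best = max(
    permutations(WINDOWS),
    key=lambda order: (simulate(order)[0], -simulate(order)[1]),
)
print(best, simulate(best))
```

Under these assumptions, the order C, A, B, D is the only one that clears all four stands, finishing 80 minutes (1 hour 20 minutes) after 07:45. Option A's order clears only three, because stand C is reoccupied at 08:15 before the crew can reach it.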
Scoring Methodology
The benchmark is scored using accuracy, which is the proportion of questions answered correctly. Each question has one correct answer from the multiple choices provided. The implementation uses the multiple_choice solver and the choice scorer.
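The accuracy metric itself is simple to state in code. The function below is a standalone sketch of the calculation, not the choice scorer's actual implementation; the name `accuracy` and the list-of-letters inputs are assumptions made for illustration.

```python
def accuracy(predictions, targets):
    """Proportion of questions where the predicted letter matches the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Three of four answers match the targets.
print(accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))

# 224 correct answers out of 300 samples, rounded to three decimals.
print(round(224 / 300, 3))
```

The second call shows how a reported score relates to its raw counts: 224 correct out of 300 rounds to 0.747.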
Evaluation Report
Results from running the full dataset (300 samples) across multiple models on March 21, 2025:
| Model | Accuracy | Correct Answers | Total Samples |
|---|---|---|---|
| anthropic/claude-3-7-sonnet-20250219 | 0.747 | 224 | 300 |
| openai/gpt-4o-2024-11-20 | 0.733 | 220 | 300 |
| openai/gpt-4o-mini-2024-07-18 | 0.733 | 220 | 300 |
| anthropic/claude-3-5-sonnet-20241022 | 0.713 | 214 | 300 |
| groq/llama3-70b-8192 | 0.707 | 212 | 300 |
| anthropic/claude-3-haiku-20240307 | 0.683 | 205 | 300 |
| anthropic/claude-3-5-haiku-20241022 | 0.667 | 200 | 300 |
| groq/llama3-8b-8192 | 0.660 | 198 | 300 |
| openai/gpt-4-0125-preview | 0.660 | 198 | 300 |
| openai/gpt-3.5-turbo-0125 | 0.640 | 192 | 300 |
| groq/gemma2-9b-it | 0.623 | 187 | 300 |
| groq/qwen-qwq-32b | 0.587 | 176 | 300 |
| groq/llama-3.1-8b-instant | 0.557 | 167 | 300 |
These results indicate that modern language models can comprehend much of the aviation safety protocols and procedures found in international standards documentation. Accuracy in this run ranges from 0.557 to 0.747, so even the strongest models answer only about 75% of the questions correctly.
Test Your AI on Aviation Safety
Contact us to evaluate your AI systems against the Pre-Flight benchmark or develop custom aviation-specific testing frameworks.