Aviation Eval – Flight Test 1: Anthropic Models Compared

With the exciting release of Anthropic's updated Sonnet and Haiku models, we're sharing our first evaluation results from the Pre-Flight benchmark.
25th November 2024
With this week's release of Anthropic's updated Sonnet (new) and Haiku models, it seemed the perfect time to start sharing evaluation results. The Pre-Flight benchmark is still a work in progress and has not yet been peer reviewed, but it already contains over 100 aviation knowledge and reasoning questions across a range of categories.
The Questions
The questions are drawn from:
- ICAO Annexes 1-7 descriptions and documentation
- FAA regulations and website documentation
- FAA Sample questions for regulated aviation operational roles
- Simulated scenarios for "BusyHub airport", including taxiing, gate usage, and weather responses (see the sketch below for how such items might be represented)
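To make the structure concrete, here is a minimal sketch of how a benchmark item might be represented. The schema, field names, and the example question are illustrative assumptions, not the actual Pre-Flight format.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    """One multiple-choice benchmark question (illustrative schema only)."""
    category: str       # e.g. "ICAO Annexes", "FAA regulations", "BusyHub scenario"
    question: str       # the prompt shown to the model
    choices: list[str]  # the answer options
    answer: str         # letter of the correct option, e.g. "B"

# Hypothetical item in the style of the simulated-scenario questions
item = BenchmarkItem(
    category="BusyHub scenario",
    question=(
        "An aircraft is holding short of the active runway while another is "
        "on short final. When may the holding aircraft be cleared to cross?"
    ),
    choices=[
        "A. Immediately",
        "B. After the landing aircraft has vacated the runway",
        "C. Whenever the pilot requests",
        "D. Only when instructed, regardless of traffic",
    ],
    answer="B",
)
```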
The questions range from aviation general knowledge to operational decision-making, testing both factual recall and operational understanding. As might be expected, the updated Sonnet performed best with a score of 69.29% across the test, while Haiku was the weakest at 60%.
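The headline percentages are simple accuracy over the question set. As a rough sketch, assuming exact-match grading of a single option letter (the actual Pre-Flight grading pipeline may differ):

```python
def score_model(answers: list[str], answer_key: list[str]) -> float:
    """Percentage of questions answered correctly under exact-match grading."""
    correct = sum(
        a.strip().upper()[:1] == k.upper()
        for a, k in zip(answers, answer_key)
    )
    return 100.0 * correct / len(answer_key)

# Illustrative only: 3 of 4 hypothetical answers match the key -> 75.0
print(score_model(["B", "a", "C", "D"], ["B", "A", "C", "A"]))  # 75.0
```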
Initial Findings
- The poor performance of OpenAI's o1-mini is surprising and requires further investigation
- Some factual questions are arguably ambiguous
- Older models frequently fail to follow instructions (see the answer-parsing sketch after this list)
- Models struggle with operational understanding, particularly when reasoning about real-world physical constraints and timing separation between aircraft
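Instruction-following failures show up concretely when a harness expects a bare option letter and gets a verbose reply instead. A minimal sketch of a strict answer parser, assuming an "answer with a single letter" prompt (not the actual Pre-Flight harness), might look like this:

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull a single option letter (A-D) from a model response.

    Returns None when the model ignored the single-letter instruction
    and no unambiguous letter can be recovered.
    """
    text = response.strip()
    # Ideal case: the response is just the letter, e.g. "B" or "B."
    m = re.fullmatch(r"([A-D])\.?", text, flags=re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: a leading "Answer: B"-style prefix
    m = re.match(r"(?:answer\s*[:\-]?\s*)([A-D])\b", text, flags=re.IGNORECASE)
    if m:
        return m.group(1).upper()
    return None  # instruction-following failure: mark unscorable or wrong
```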

Airside Labs Team
Research & Development
The Airside Labs team comprises aviation experts, AI researchers, and safety-critical systems engineers dedicated to advancing AI evaluation methodologies. Our collective expertise spans air traffic management, ground operations, commercial aviation, and AI security.
Ready to enhance your AI testing?
Contact us to learn how Airside Labs can help ensure your AI systems are reliable, compliant, and ready for production.
Book A Demo