With the exciting release this week of Anthropic's updated Sonnet (new) and Haiku models, it seemed the perfect time to start sharing evaluation results. The benchmark is still under development and has not yet been peer reviewed, but the test already contains over 100 aviation knowledge and reasoning questions across a range of categories. The questions are drawn from:
- ICAO Annexes 1-7 descriptions and documentation
- FAA regulations and website documentation
- FAA Sample questions for regulated aviation operational roles
- Simulated scenarios for "BusyHub airport" including taxiing, gate usage and weather responses
You can contribute to the eval question set here and receive access and credit.
As may be expected, the new Sonnet model from Anthropic performed the best with a score of 69.29% across the test, and Haiku was the weakest model with a score of 60%. The questions range from aviation general knowledge to operational decision making, so that they test both the recall of facts and operational understanding. For example:
"If a required instrument on a multiengine airplane becomes inoperative during flight, which document required under 14 CFR part 121 dictates whether the flight may continue en route?"
(A) A Master Minimum Equipment List for the airplane.
(B) Certificate holder's manual.
(C) Original dispatch release.
A question derived from the official FAA documentation
The results were interesting and broadly similar, with scores ranging from 60% to 70% across the models.
Initial Findings and Next Steps
- The poor performance of OpenAI's o1-mini is surprising and requires further investigation.
- Some factual questions are arguably ambiguous. After a manual review of the LLM responses, a subset of the questions (approx. 20) was found to require careful review and possible revision.
- The older models frequently fail to follow instructions, e.g. answering with "(B) answer" instead of "B" or adding end-of-line characters. The test assertions were permissive and accepted these answers if the correct option was chosen in the response.
- Models struggle with operational understanding, particularly when they must account for real-world physical constraints and timing separation. Additional questions will be crafted with a focus on defined use cases and varying operational scenarios.
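
To illustrate the permissive assertions mentioned in the findings above, here is a minimal sketch of how a response such as "(B) answer" or "B" followed by a newline could be reduced to a single option letter before comparison. This is not the benchmark's actual test code: the function names `extract_choice`, `is_correct` and `score` are hypothetical, and the sketch assumes three uppercase options labelled A to C, as in the example question.

```python
import re
from typing import Optional, Sequence

# Assumption: options are labelled with uppercase A-C, as in the example question.
# A leading letter, optionally wrapped in parentheses, not followed by another letter
# (so "B", "(B) answer" and "B." match, but "Because ..." does not).
LEADING = re.compile(r"^\(?([A-C])\)?(?![A-Za-z])")
# Fallback: a parenthesized letter anywhere, e.g. "The answer is (C)."
PARENTHESIZED = re.compile(r"\(([A-C])\)")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter chosen in a response, or None if none is found."""
    text = response.strip()  # drops stray end-of-line characters
    match = LEADING.match(text) or PARENTHESIZED.search(text)
    return match.group(1) if match else None


def is_correct(response: str, expected: str) -> bool:
    """Permissive check: accept any response whose extracted option matches."""
    return extract_choice(response) == expected


def score(responses: Sequence[str], answer_key: Sequence[str]) -> float:
    """Percentage of responses whose extracted option matches the answer key."""
    correct = sum(is_correct(r, k) for r, k in zip(responses, answer_key))
    return round(100 * correct / len(answer_key), 2)
```

A sketch like this trades strictness for coverage: it credits a model that picks the right option but ignores the output format, and it can misread a response that happens to begin with the word "A". A stricter variant would compare the stripped response to the expected letter exactly, which would count the formatting failures described above as wrong answers.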