Preflight: The AI Reasoning Benchmark

The goal of Preflight is to be a peer-reviewed, challenging test that evaluates an AI's understanding of the aviation industry. If you contribute to the test, you will be named on the subsequent paper and gain access to all questions and results.

4 Simple Steps

Together, we are collecting the hardest and broadest set of questions ever for the Preflight Aviation AI Benchmark. Can you think of something you know that would stump current artificial intelligence (AI) systems? 

  • Craft a Challenging Question
  • AI Evaluation and Explanation
  • Peer Review
  • Publication

Example Questions and Answers

Hard questions (ones a regular person might not be able to answer) can test factual knowledge, but others need to verify the "real-world" understanding that an AI must use to make business or resource-optimisation decisions.

Simple factual knowledge

Question

What is the required color of the recorder container according to 14 CFR 23.1457?

Answer

Orange or Bright orange

Simple knowledge check for de-icing

Question

What action is required prior to takeoff if snow is adhering to the wings of an air carrier airplane?

Answer

Assure that the snow is removed from the airplane and de-icing procedures are followed.

Complex scenario that requires step by step reasoning

Question 

BusyHub Airport is preparing for a challenging morning. The airport has three gates, but one is out of action due to maintenance. There's a single runway, and air traffic control requires a 5-minute gap between all runway movements. Three aircraft (Flight A, Flight B, and Flight C) are all scheduled to arrive at 9:00 AM and depart later in the day. Many passengers on these flights need to make connections at BusyHub. As the airport manager at BusyHub, you receive the following updates from different teams:

The maintenance team reports that Gate 1 will be unavailable until noon due to urgent repairs.

Air traffic control informs you that they must maintain a strict 5-minute separation between all runway movements for safety reasons.

The airline representatives notify you that they have several passengers with tight connections on the incoming flights.

All three flights (Flight A, Flight B, and Flight C) are scheduled to land at 9:00 AM, and the flight plans (FPLs) submitted prior to departure confirmed the 9:00 AM arrival. However, their ETAs calculated en route are 8:55 AM, 9:16 AM and 9:05 AM respectively.

Given this information:

a) What is the gate occupancy at 9:30 AM assuming taxiing is less than 5 minutes from runway to gate?

b) Will the flights maintain their arrival On Time Performance, i.e. arrive within 15 minutes of their scheduled time?

Answer

Gate occupancy at 9:30 AM is 100%, as Flights A and C will have been routed to the two operational gates. Flight B will have to park at a remote stand and use steps for passenger deplaning. Flights A and C are on time (within 15 minutes of their scheduled time). Flight B arrives 16 minutes after its scheduled time, so it misses the 15-minute on-time threshold by 1 minute.
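
The step-by-step reasoning behind this answer can be sketched in a few lines of Python. This is only an illustration, not part of the benchmark or its grading: the function name `analyse` and the constants are hypothetical, and the sketch assumes that gates are allocated in landing order, that no aircraft leaves its gate before 9:30 AM (the flights depart "later in the day"), and that the en-route ETAs are the actual landing times.

```python
from datetime import datetime, timedelta

# Scenario data taken from the question above.
SCHEDULED = "09:00"
ETAS = {"A": "08:55", "B": "09:16", "C": "09:05"}
OPERATIONAL_GATES = 2                  # three gates, Gate 1 closed until noon
OTP_WINDOW = timedelta(minutes=15)     # on-time threshold from part (b)
RUNWAY_GAP = timedelta(minutes=5)      # separation required by air traffic control


def parse(hhmm):
    """Turn an 'HH:MM' string into a datetime so times can be compared."""
    return datetime.strptime(hhmm, "%H:%M")


def analyse(etas, scheduled, gates, check_time):
    """Hypothetical helper: gate occupancy and on-time status at check_time."""
    sched = parse(scheduled)
    check = parse(check_time)
    landings = sorted((parse(eta), flight) for flight, eta in etas.items())

    # Confirm consecutive landings respect the runway separation.
    separation_ok = all(
        later - earlier >= RUNWAY_GAP
        for (earlier, _), (later, _) in zip(landings, landings[1:])
    )

    # Assumption: first to land gets a gate, the rest go to remote stands.
    at_gate, remote = [], []
    for landed, flight in landings:
        if landed <= check:
            (at_gate if len(at_gate) < gates else remote).append(flight)

    # A flight is on time if it lands no more than 15 minutes after schedule.
    on_time = {flight: landed - sched <= OTP_WINDOW for landed, flight in landings}

    return {
        "separation_ok": separation_ok,
        "at_gate": at_gate,
        "remote": remote,
        "occupancy": len(at_gate) / gates,
        "on_time": on_time,
    }


result = analyse(ETAS, SCHEDULED, OPERATIONAL_GATES, "09:30")
print(result)
```

Under these assumptions the sketch places Flights A and C at the two operational gates, sends Flight B to a remote stand, and flags only Flight B as outside the 15-minute window.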

Frequently Asked Questions

  • What is the Preflight Aviation AI Benchmark?

    The Preflight Aviation AI Benchmark is a comprehensive AI test designed to evaluate an AI's understanding of the aviation industry. We are collecting challenging questions to better evaluate the capabilities of AI systems in the years to come.
  • How can I contribute to the benchmark?

    You can contribute to the benchmark by submitting a challenging question that you think would stump current AI systems. If your question passes review, your name will be associated with the question and you will be invited as an author of the paper corresponding to this dataset.
  • What happens after I submit a question?

    After you submit a question, it will go through a review process. First, it will be evaluated by AI, and then it will be reviewed by a human. If your question passes review, it will be included in the benchmark.
  • How many questions are in the benchmark already?

    There are over one hundred questions in the initial benchmark, written by hand by a human author using FAA and EASA reference material.

  • What makes a good question for the benchmark?

    A good question for the benchmark is one that is challenging for AI systems. If it's hard for the AIs, it is likely a good question to submit.
  • Who owns the questions and the results data?

    Airside Labs will own and license the questions and will publish the results under a Creative Commons Attribution license. An example evaluation question set will be published on GitHub under a suitable, highly permissive open source license. The majority of the questions will not be published, to avoid contaminating model training or tuning data.
