MIA: Shared Task

Final Results (Updated: June 30, 2022)

The final micro F1 scores of the proposed systems in the shared task:

System	Total	XOR QA	MKQA
Baseline	27.55	37.95	17.14
Constrained systems:
Texttron	32.02	45.50	18.54
mLUKE-FID	31.61	40.93	22.29
CMUmQA	31.53	40.20	22.87
ZusammenQA	27.00	37.95	16.04

Our shared task summary paper is available here.

Task Format

Cross-lingual Open Question Answering is a challenging multilingual NLP task, where given questions are written in a user’s preferred language, a system needs to find evidence in large-scale document collections written in many different languages, and return an answer in the user’s preferred language, as indicated by their question. For instance, a system needs to answer in Arabic to answer an Arabic question, but it can use evidence passages written in any language included in a large-document corpus.

The evaluation is based on the macro-averaged scores across different target languages.

Target Languages

Our shared task will evaluate systems in 14 languages, 7 of which will not be covered in our training data. The training and evaluation data is originally from Natural Questions, XOR-TyDi QA, and MKQA (See details below in the Dataset section).

The full list of the languages:

Languages with training data: Arabic, Bengali, English Finnish, Japanese, Korean, Russian, Telugu
Languages without training data: Spanish, Khmer, Malay, Swedish, Turkish, Chinese (simplified)

Evaluations

Participants will run their systems on the evaluation files (without answer data) and then submit their predictions to our competition site hosted at eval.ai. Systems will first be evaluated using automatic metrics: Exact match and token-level F1 (Lee et al., 2019; Asai et al., 2021; Longpre et al., 2021). For non-spacing languages (i.e., Japanese, Khmer and Chinese) we use token-level tokenizers, Mecab, khmernltk and jieba to tokenize both predictions and ground-truth answers.

Due to the difference of the datasets' nature, we will calculate macro-average scores on XOR-TyDi and MKQA datasets, and then take the average of the XOR-TyDi QA average {F1, EM} and MKQA average {F1, EM}.

Although EM is often used as a primarily evaluate metric for English open-retrieval QA, the risk of surface-level mismatching (Min et al., 2021) can be more pervasive in cross-lingual open-retrieval QA. Therefore, we will use F1 as our primary metric and rank systems using their macro averaged F1 scores.

Prizes

We have 3 award categories, each with Google Cloud credit prizes!

The Best {unconstrained, constrained} system: These prizes will be given to the {constrained, unconstrained} systems (see the details in the Dataset Section) obtaining the highest macro-average F1 scores.
Creativity award: We plan to give additional award(s) to systems that employ a creative approach to the problem, or undertake interesting experiments to better understand the problem. This award is designed to encourage interesting contributions even from teams without access to the largest models or computational resources. Examples include attempts to improve generalization ability/language equality, reducing model sizes, or understanding the weaknesses in existing systems.

Important dates

May 15, 2022: Shared Task System & System Description Paper Due Dates
July 15, 2022: MIA Workshop

Training Datasets

We release training data, which consists of English open-QA data from Natural Questions-open (Kwiatkowski et al., 2019; Lee et al., 2019) and the XOR-TyDi QA train data. See our GitHub repository for details:

To keep submissions comparable, even in the unconstrained setup, participants are not allowed to train on the development data and also the subsets of the Natural Questions and TyDi QA data, which are used to create MKQA or XOR-TyDi QA data. We will release the list of the IDs of those prohibited questions before the official start of the shared task. All submissions should explicitly state that this constraint was properly followed.

Constrained Setup

To be considered as constrained setup submission, participants are required to use our official training corpus, which consists of examples pooled from the aforementioned datasets. No other question answering data may be used for training. We allow and encourage participants to use off-the-shelf tools for linguistic annotation (e.g. POS taggers, syntactic parsers), as well as any publicly available unlabeled data and models derived from these (e.g. word vectors, pre-trained language models). NB: In the constrained setup, participants may NOT use external blackbox APIs such as Google Search API / Google Translate API for *inference*, but they are permitted to use them for offline data augmentation or training.

Unconstrained Setup

Although participants can use whatever data, all of the submissions of the models trained on additional human-annotated question answering data will be considered as “unconstrained” setup, and participants need to clarify it and provide the details of the additional resources used during the training. For example, you can use publicly available QA datasets (e.g., CMRC 2018, FQuAD) to create a larger scale of training data. Note that automatically augmenting the original training data using machine translations or additional resources (e.g., Wikidata information) will still be considered as “constrained” as mentioned above. Yet, we encourage participants who use additional training data or augment training data using those tools to release their training corpora. Again, even in the unconstrained setup, participants may not train on the development subsets of Natural Questions and TyDi QA. NB: In the unconstrained setup, participants may use external blackbox APIs such as Google Search API / Google Translate API for inference or training.

Development & Test Datasets

The evaluation data is originally from the XOR-TyDi QA dataset and MKQA data.

XOR-TyDi QA (Asai et al., 2021)
MKQA-answerable (Longpre et al., 2021) (We only include the questions with answer annotations and remove the "no answers" questions)

As the original MKQA dataset does not have a dev/test split, we will randomly split the data into dev (1758 questions) and test (5000) questions.

Although we have multiple ground-truth answers from XOR-TyDi QA, for questions that have Wikipedia entities as answers, we use Wikipedia aliases as valid answers along with the given answer, following Trivia QA (Joshi et al., 2017) and MKQA.

Answer Retrieval Corpus

We will release preprocessed Wikipedia passages in the target 14 languages. If you are participating in the constrained tack, you may not use any additional knowledge sources. If you are participating in the unconstrained track, you can use additional knowledge sources (e.g., non-Wikipedia articles), but again you have to provide detailed descriptions of the knowledge sources. We also encourage you to release your own knowledge sources if possible.

Model Size

We do not have any restrictions on the model size or model type, but participants will be required to provide detailed information about their models and training procedures as well as the answer retrieval corpus.

Baseline Models

We will release easy-to-use baselines for cross-lingual open-retrieval QA in February, with trained models. Please stay tuned!

Data Format and Submission Instructions

Submission instructions will be released in February 2022. We will detail data format and submission instructions, along with our baseline models, in our GitHub repository.

For any inquiry about the shared task and the submission, please make a new issue in the repository.

You can make submissions at the eval.ai platform.

Registration

If you are participating in our shared task, please register your team through this form. It will help us to plan better for human evaluation, etc., as well as occasionally send announcements when there is a major update on baseline or dataset.

Shared Task on Cross-lingual Open-Retrieval QA