Welcome to KSAA-RD: Arabic Reverse Dictionary shared task at ArabicNLP 2023!


Introduction

A Reverse Dictionary (RD) is a type of dictionary that allows users to find words based on their meaning or definition. Unlike a traditional dictionary, where users search for a word by its spelling, a reverse dictionary allows users to enter a description of a word or phrase, and the dictionary generates a list of words matching that description. Reverse dictionaries can be useful for writers, crossword puzzle enthusiasts, non-native language learners, and anyone looking to expand their vocabulary. In particular, they address the Tip-of-the-Tongue (TOT) problem: the situation where a person knows the word they want to say but cannot retrieve it. This shared task includes two subtasks: Arabic RD (Arabic => Arabic) and Cross-lingual Reverse Dictionary (CLRD) (Arabic => English).


Registration

Participants need to register via this link.


Dataset

The dataset includes three main components:


Table 1: Data Statistics.

                                     Train    Dev   Test
Arabic entries                       45200   6400   6410
Arabic-English mapped dictionary      2843    299   1213



Arabic and English dictionaries file structure

Dataset files are in JSON format. A dataset file contains a list of examples. Each example is a JSON object, containing the following keys:
  • "id"
  • "word"
  • "gloss"
  • "pos"
  • "sgns"
  • "electra"
  • "enId"

As a concrete instance, here is an example from the training dataset for the Arabic dictionary:

{
"id":"ar.45",
"word":"عين",
"gloss":"عضو الإبصار في ...",
"pos":"n",
"electra":[0.4, 0.3, …],
"sgns":[0.2, 0.5, …],
"enId": "en.150"
}
The value associated with the "id" key encodes the language and a unique identifier for the example. The value associated with the "gloss" key is a definition, as you would find in a classical dictionary; it serves as the source in the RD task. The value associated with the "enId" key gives the identifier of the mapped entry in the English dictionary. The remaining keys ("sgns", "electra") correspond to embeddings, and the associated values are arrays of floats representing the components. Either can serve as the target for the RD task.
  • "sgns" corresponds to skip-gram embeddings (word2vec).
  • "electra" corresponds to Transformer-based contextualized embeddings.
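The entry structure above can be checked programmatically. The sketch below loads a dataset file and validates an entry against the documented keys; the filename is hypothetical and should be replaced with the path of the released file.

```python
import json

def load_entries(path):
    """Load a dataset file: a JSON list of entry objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Inline sample mirroring the documented structure (gloss and vectors truncated).
sample = json.loads(
    '{"id": "ar.45", "word": "عين", "gloss": "عضو الإبصار في ...", "pos": "n",'
    ' "electra": [0.4, 0.3], "sgns": [0.2, 0.5], "enId": "en.150"}'
)
# Every Arabic entry should carry these keys.
assert {"id", "word", "gloss", "sgns", "electra", "enId"} <= sample.keys()
```

The same loader applies to the English and mapped dictionary files, since all three are JSON lists of objects.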



As a concrete instance, here is an example from the training dataset for the Arabic-English mapped dictionary:
{
"id":"ar.45",
"arword":"عين",
"argloss":"عضو الإبصار في ...",
"arpos":"n",
"electra":[0.4, 0.3, …],
"sgns":[0.2, 0.5, …],
"enId":"en.150",
"word":"eye",
"gloss":"One of the two ...",
"pos":"n"
}
The value associated with the "id" key is the entry's unique identifier in the Arabic dictionary. The values associated with the "argloss" and "gloss" keys are the Arabic and English definitions, as you would find in an Arabic or English dictionary, respectively. The "gloss" serves as the source in the CLRD task. The value associated with the "enId" key gives the identifier of the mapped entry in the English dictionary. The remaining keys ("sgns", "electra") correspond to the Arabic embeddings, and the associated values are arrays of floats representing the components. Either can serve as the target for the CLRD task.
  • "sgns" corresponds to skip-gram embeddings (word2vec).
  • "electra" corresponds to Transformer-based contextualized embeddings.
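Since mapped entries link the two dictionaries through "enId", a common preprocessing step is to join each Arabic entry to its English counterpart. A minimal sketch, using tiny illustrative records in place of the full entries shown above:

```python
def build_en_index(en_entries):
    """Index English entries by their "id" for O(1) lookup."""
    return {e["id"]: e for e in en_entries}

def align_pairs(ar_entries, en_index):
    """Yield (Arabic entry, English entry) pairs linked by "enId"."""
    for ar in ar_entries:
        en = en_index.get(ar.get("enId"))
        if en is not None:
            yield ar, en

# Tiny illustrative records; real files hold the full entries shown above.
ar_entries = [{"id": "ar.45", "word": "عين", "enId": "en.150"}]
en_entries = [{"id": "en.150", "word": "eye", "sgns": [0.2, 0.8]}]
pairs = list(align_pairs(ar_entries, build_en_index(en_entries)))
```

Entries without a valid "enId" mapping are simply skipped, which keeps the aligned set restricted to the mapped dictionary.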



As a concrete instance, here is an example from the training dataset for the English dictionary:
{
"id":"en.150",
"word":"eye",
"gloss":"One of the two ...",
"pos":"n",
"electra":[0.7, 0.1, …],
"sgns":[0.2, 0.8, …]
}
The English dictionary has the same structure as the Arabic dictionary and can be utilized in the second task.

This shared task includes two tracks:
  • Closed Track: participants may only use the provided dataset.
  • Open Track: participants may develop their own datasets using the provided English and Arabic dictionaries.


Tasks


Task 1: Arabic Reverse Dictionary (RD) - Closed Track

Reverse dictionaries invert the structure of traditional dictionaries: they map a sequence (the definition) to a vector. This task focuses on learning to convert an Arabic definition into the embedding vector of the word it defines. Rather than simply retrieving the target word, the model reconstructs its embedding, which lets users search for words based on the definitions or meanings they have in mind. Each training data point pairs a word's vector representation with its definition (gloss). The proposed model should generate new embedding vectors for the unseen definitions in the test set.

In this task, the input for the model is an Arabic word definition (gloss) and the output is an Arabic word embedding.

The baseline repository is available here.


Task 2: Cross-lingual Reverse Dictionary (CLRD) - Open Track

The objective of the cross-lingual reverse dictionary (CLRD) task (sequence-to-vector) is to learn to convert an English definition into an Arabic word embedding: to identify the Arabic word vector that best expresses the same meaning as the English definition or gloss, a process referred to as Arabicization ("تَعْرِيب"). The task involves reconstructing the embedding vector of the Arabic word corresponding to a given English definition. This enables users to search for words in one language based on their anticipated meanings or definitions in another, facilitating cross-lingual search, language understanding, and translation.

In this task, the input for the model is an English word definition (gloss) and the output is an Arabic word embedding.

The baseline repository is available here.


Baselines

The baseline architecture proposed by Mickus et al. (2022) is based on the Transformer model introduced by Vaswani et al. (2017). The input gloss, represented as a sequence starting with a special 'bos' token and ending with a special 'eos' token, is fed into a straightforward Transformer encoder. The encoder generates hidden representations, which are summed to produce the prediction, and a small non-linear feed-forward module further refines that prediction. Both tasks will be evaluated using three metrics: mean squared error (MSE), cosine similarity, and a ranking metric.
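The encoder-sum-refine pipeline described above can be sketched in PyTorch as follows. This is a minimal illustration of the architecture's shape, not the official baseline implementation; all hyperparameter values here (dimensions, head and layer counts) are assumptions, not the published settings.

```python
import torch
import torch.nn as nn

class RDBaseline(nn.Module):
    """Sketch of a Mickus et al. (2022)-style baseline: a Transformer encoder
    over the gloss tokens, summed into a single vector, then refined by a
    small non-linear feed-forward module. Hyperparameters are illustrative."""

    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, d_out=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_out))

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))  # (batch, seq, d_model)
        pooled = hidden.sum(dim=1)                    # sum over positions
        return self.ffn(pooled)                       # predicted embedding

model = RDBaseline(vocab_size=1000)
pred = model(torch.randint(0, 1000, (2, 7)))  # batch of 2 glosses, 7 tokens each
```

Training would then minimize a regression loss (e.g. MSE) between `pred` and the gold "sgns" or "electra" vectors.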

Table 2: Baseline results for Task 1 and Task 2.

                                        Dev                      Test
                         Epochs  Cosine    MSE   Rank    Cosine    MSE   Rank
Task 1 (RD, sgns)           200   35.61   5.03  38.52     40.58   4.49  36.28
Task 1 (RD, electra)        200   48.84  24.94  31.27     50.79  23.04  31.87
Task 2 (CLRD, sgns)         300   26.22   4.92  50.16     25.21   4.85  49.95
Task 2 (CLRD, electra)      300   54.09  22.10  36.22     51.66  23.81  40.72


Submission and evaluation

The model evaluation process follows a hierarchy of metrics. The primary metric is the ranking metric, which is used to assess how well the model ranks predictions compared to ground truth values. If models have similar rankings, the secondary metric, mean squared error (MSE), is considered. Lastly, if further differentiation is needed, the tertiary metric, cosine similarity, provides additional insights. This approach ensures the selection of a top-performing and well-rounded model.
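The three metrics in this hierarchy can be sketched with NumPy. The MSE and cosine computations below are standard; the ranking function shows one common formulation (the position of each prediction's own gold vector among all gold vectors, ordered by cosine similarity), which may differ in detail from the official evaluation script.

```python
import numpy as np

def mse(pred, gold):
    """Mean squared error between predicted and gold embedding matrices."""
    return float(np.mean((pred - gold) ** 2))

def mean_cosine(pred, gold):
    """Mean cosine similarity between row-aligned predictions and golds."""
    num = np.sum(pred * gold, axis=1)
    den = np.linalg.norm(pred, axis=1) * np.linalg.norm(gold, axis=1)
    return float(np.mean(num / den))

def mean_rank(pred, gold):
    """For each prediction, count how many gold vectors are closer (by cosine)
    than its own gold vector (0 = its gold is nearest); average over rows."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    g = gold / np.linalg.norm(gold, axis=1, keepdims=True)
    sims = p @ g.T                                  # (n, n) similarities
    own = np.take_along_axis(sims, np.arange(len(sims))[:, None], axis=1)
    return float(np.mean((sims > own).sum(axis=1)))

preds = np.array([[1.0, 0.0], [0.0, 1.0]])
golds = np.array([[1.0, 0.0], [0.0, 1.0]])         # perfect predictions
```

With perfect predictions, MSE is 0, mean cosine is 1, and the mean rank is 0, which matches the intuition behind the metric hierarchy.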

The evaluation of shared tasks will be hosted through CODALAB. Here are the CODALAB links for each task:


Expected output format

During the evaluation phase, submissions are expected to reconstruct the same JSON format. The test JSON files will contain the "id" and the gloss keys. In both RD and CLRD, participants should construct JSON files that contain at least the following two keys:
  • The original "id"
  • Any of the valid embeddings (the "sgns" or "electra" key, in AR/EN)
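A submission file with that shape can be produced as follows. This is a minimal sketch; the output filename and the helper name are illustrative, not prescribed by the task.

```python
import json

def write_submission(ids, vectors, key="sgns", path="submission.json"):
    """Serialize predictions in the expected format: a JSON list of objects
    carrying the original "id" and one valid embedding key."""
    out = [{"id": i, key: [float(x) for x in v]} for i, v in zip(ids, vectors)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(out, f, ensure_ascii=False)
    return out

# One predicted "sgns" vector for one test id (values truncated for brevity).
records = write_submission(["ar.45"], [[0.2, 0.5]])
```

Each test "id" should appear exactly once, paired with the embedding type ("sgns" or "electra") the model was trained to predict.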


Awards

We are pleased to announce the awards for the Arabic Reverse Dictionary Shared Task at ArabicNLP 2023. The top-ranked teams in each task will receive cash prizes as follows:


Task 1: Arabic Reverse Dictionary (RD) - Closed Track

  • 1st Ranked: $350
  • 2nd Ranked: $250
  • 3rd Ranked: $150


Task 2: Cross-lingual Reverse Dictionary (CLRD) - Open Track

  • 1st Ranked: $350
  • 2nd Ranked: $250
  • 3rd Ranked: $150

The winners will be determined based on the official evaluation metrics specified for each task. Best of luck to all the teams, and we look forward to announcing the winners at the conclusion of the competition!


Important Dates

  • Release of training, dev data, and evaluation scripts: 16th of July 2023.
  • Registration deadline: 14th of August 2023.
  • Release of test data (and final training and dev data): 14th of August 2023.
  • End of the evaluation cycle (test set submission closes): 20th of August 2023.
  • Results released: 21st of August 2023.
  • System description paper submissions due: 5th of September 2023.


Recent Updates

  • 6th June, 2023: Website is up!
  • 16th July, 2023: Release of training and dev data, and evaluation scripts.
  • 14th August, 2023: Registration deadline.
  • 14th August, 2023: Release of test data.
  • 21st August, 2023: Results released.


Contact


Organizers

  • Rawan Almatham, King Salman Global Academy for Arabic Language (KSAA)
  • Waad Alshammari, King Salman Global Academy for Arabic Language (KSAA)
  • Abdulrahman AlOsaimy, Imam Mohammad Ibn Saud Islamic University (IMSIU)
  • Sarah Alhumoud, Imam Mohammad Ibn Saud Islamic University (IMSIU)
  • Asma Al Wazrah, Imam Mohammad Ibn Saud Islamic University (IMSIU)
  • Afrah Altamimi, Imam Mohammad Ibn Saud Islamic University (IMSIU)
  • Halah Alharbi, King Salman Global Academy for Arabic Language (KSAA)
  • Abdullah Alfaifi, Imam Mohammad Ibn Saud Islamic University (IMSIU)