Welcome to KSAA-2026 Shared Task

On Arabic Speech Dictation with Automatic Diacritization



Introduction

The KSAA-2026 Shared Task introduces a new multimodal benchmark focused on transforming Arabic speech and its undiacritized transcripts into fully diacritized text. The challenge targets diacritic restoration, a persistent and unresolved problem in Arabic NLP that stems from lexical ambiguity, syntactic variation, and the absence of diacritics in most written text.

Automatic diacritization of speech dictation remains a challenging task due to the mismatch between speech-based transcriptions and traditional text-only diacritization approaches. While ASR systems often produce undiacritized or partially normalized text, text-based diacritization models are not designed to leverage acoustic information. This shared task aims to bridge this gap by focusing on speech-aware diacritization.

This shared task provides two subtasks:

  1. Data Contribution: Participants contribute at least one hour of recorded Arabic speech through the KSAA VoiceWall platform, enriching the benchmark.
  2. Automatic Diacritization of Speech Dictation: Participants build models that take speech audio + undiacritized transcripts as inputs and generate fully diacritized text.

Registration

Participants need to register via the following link:

https://forms.office.com/r/KF4bvNNASP?origin=lprLink

Codabench competition page:

https://www.codabench.org/competitions/11859/

Dataset

The dataset consists of approximately five hours of Arabic speech audio collected via VoiceWall, a crowdsourcing audio platform developed by the King Salman Global Academy for Arabic Language. The recordings were obtained from male and female speakers and cover Modern Standard Arabic (MSA) as well as Arabic dialectal speech.

All utterances are short, with a maximum duration of nine seconds, to support accurate speech–text alignment and automatic diacritization. The recordings span multiple domains and underwent automatic validation and manual review to ensure audio quality and transcription accuracy.

The annotation process involved aligning the speech with written transcripts and ensuring diacritic accuracy. Multiple layers of quality control were implemented, including file normalization, systematic labeling, and manual reviews of diacritization to guarantee consistency and reliability.

 

Tasks

Task 1: Data Contribution

Each team is required to contribute at least one hour of speech data. All submitted recordings will undergo automatic validation followed by a manual review conducted by another member within the same team, based on a shared evaluation guideline provided to all participants.

Contributors are required to follow predefined transcription and diacritization guidelines to ensure consistency between speech content and textual representation.

After validation, the contributed data will be released to all participating teams to support fair benchmarking and encourage continuous dataset growth and diversity.

 

Task 2: Automatic Diacritization of Speech Dictation

In this task, participants are required to build systems that take speech audio and undiacritized transcripts as input and generate fully diacritized Arabic text.

The task requires predicting full Arabic diacritics at the character level, including fatḥa, ḍamma, kasra, sukūn, shaddah, and tanwīn marks, for each character in the undiacritized text.
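To make the character-level formulation concrete, the following Python sketch (ours, not part of any official task toolkit) separates the Arabic diacritic marks in the Unicode range U+064B–U+0652 from their base characters; the organizers' own preprocessing may differ.

# Illustrative only: split a diacritized string into (base character, diacritics) pairs.
# The marks targeted by the task fall in the range U+064B-U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove all diacritic marks, recovering an undiacritized transcript."""
    return DIACRITICS.sub("", text)

def char_diacritic_pairs(text: str):
    """Pair each base character with the (possibly empty) diacritic string following it."""
    pairs, i = [], 0
    while i < len(text):
        base, i = text[i], i + 1
        marks = ""
        while i < len(text) and DIACRITICS.match(text[i]):
            marks, i = marks + text[i], i + 1
        pairs.append((base, marks))
    return pairs

print(char_diacritic_pairs("أَشْرَبَ"))  # [('أ', 'َ'), ('ش', 'ْ'), ('ر', 'َ'), ('ب', 'َ')]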

The table below illustrates the input–output data structure for the task.

Input:
  • Speech clip
  • Undiacritized transcript: أريد أن أشرب كوبًا من الشاي

Output:
  • Diacritized transcript: أُرِيدُ أَنْ أَشْرَبَ كُوبًا مِنَ الشَّاي

 

Evaluation

Task 1

The data contribution will be evaluated based on the duration and quality of the contributed speech recordings:

  • Each team must submit at least one hour of valid speech data.
  • Data quality will be evaluated through automatic checks and a manual review by another team member, focusing on audio clarity, absence of background noise, correct reading of prompts, and alignment between speech content and the corresponding transcript.
  • The manual review will assess recording clarity, adherence to prompts, and consistency of transcription and diacritization, following a standardized evaluation guideline provided to all participants.
  • Each recording will be linked to both the team name and the individual contributor for full attribution.
  • All validated data will be released to shared-task participants after the official submission deadline, and will be used for benchmarking and future research purposes, ensuring fairness and supporting continuous dataset expansion.

 

Task 2

Systems are evaluated using three complementary metrics:

  • Diacritic Error Rate (DER): Measures the percentage of incorrect diacritics at the character level.
  • Word Error Rate (WER): Counts a word as incorrect if it contains at least one diacritic error.
  • Sentence Error Rate (SER): Considers a sentence incorrect if any diacritic error occurs within it.

Among these metrics, Word Error Rate (WER) is considered the primary evaluation measure, as it requires the full word to be diacritized correctly and therefore provides a stricter assessment of system performance. Diacritic Error Rate (DER) and Sentence Error Rate (SER) are reported as complementary metrics to provide finer-grained and sentence-level analysis.
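The snippet below is a rough, unofficial sketch of how these three error rates could be computed, assuming the system output keeps the reference's base characters and whitespace word boundaries; the official scoring script may handle normalization, alignment, and edge cases differently.

# Unofficial sketch of DER / WER / SER for diacritization, assuming the hypothesis
# differs from the reference only in diacritic marks (U+064B-U+0652).
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")

def diacritic_labels(word: str):
    """One diacritic string per base character of the word (may be empty)."""
    labels, i = [], 0
    while i < len(word):
        i += 1                      # skip the base character
        marks = ""
        while i < len(word) and DIACRITICS.match(word[i]):
            marks += word[i]
            i += 1
        labels.append(marks)
    return labels

def evaluate(references, hypotheses):
    char_err = char_total = word_err = word_total = sent_err = 0
    for ref, hyp in zip(references, hypotheses):
        sent_wrong = False
        for ref_w, hyp_w in zip(ref.split(), hyp.split()):
            ref_d, hyp_d = diacritic_labels(ref_w), diacritic_labels(hyp_w)
            errors = sum(r != h for r, h in zip(ref_d, hyp_d))
            char_err += errors
            char_total += len(ref_d)
            word_total += 1
            if errors:
                word_err += 1
                sent_wrong = True
        sent_err += int(sent_wrong)
    return {"DER": 100 * char_err / char_total,       # character-level diacritic errors
            "WER": 100 * word_err / word_total,       # words with at least one error
            "SER": 100 * sent_err / len(references)}  # sentences with at least one error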

 

To ensure a comprehensive and transparent evaluation, results are reported under two evaluation settings that reflect different levels of linguistic difficulty:

  1. Including case endings (iʿrāb).
  2. Excluding case endings (iʿrāb).

Case endings (iʿrāb) correspond to the final-word diacritics that encode grammatical roles and represent the most challenging aspect of Arabic diacritization due to their strong dependence on syntactic context.
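As a loose illustration of the second setting, the sketch below simply drops the trailing diacritics of each word before scoring. This is our assumption about the setting, not the organizers' exact rule; the official definition of the case-ending position (for instance, how tanwīn fatḥ written before a final alif is treated) may differ.

# Unofficial illustration of the "excluding case endings" setting: the diacritic(s)
# attached to the final base character of each word are removed before comparison.
# NOTE: this is an assumption, not the official evaluation rule.
import re

def drop_case_ending(word: str) -> str:
    """Strip trailing diacritic marks (U+064B-U+0652) from a word."""
    return re.sub(r"[\u064B-\u0652]+$", "", word)

def without_case_endings(sentence: str) -> str:
    return " ".join(drop_case_ending(w) for w in sentence.split())

print(without_case_endings("أُرِيدُ أَنْ أَشْرَبَ"))  # أُرِيد أَن أَشْرَب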

 

Baselines

We provide two baseline systems corresponding to the two participation tracks. These baselines are intended as reference implementations to illustrate the task setup and are not optimized for performance.

  • Text-only baseline: A transformer-based diacritization model that operates solely on undiacritized text.
  • Speech + text baseline (Main Track): A transformer-based model that utilizes ASR outputs together with undiacritized text, without explicitly incorporating acoustic features from the speech signal.

Baseline results are reported under both evaluation settings described above to illustrate the impact of case endings (iʿrāb) on system performance.

 

 

 

Evaluation Setting (%)          Text+ASR              Text-only             Fine-Tuned Text+ASR
                                DER    WER    SER     DER    WER    SER     DER    WER    SER

Including no diacritic
  With case ending              16.16  47.96  86.54   19.38  54.21  95.00   10.70  36.60  90.77
  Without case ending           10.98  28.22  79.62   13.49  32.07  85.38    7.47  21.35  76.92

Excluding no diacritic
  With case ending              17.57  43.33  83.08   22.28  51.44  94.62   12.04  34.32  89.23
  Without case ending           10.72  22.21  73.08   14.35  27.29  82.31    7.78  18.39  73.85

* Lower is better

 

Submission & Expected Output Format

Participants must generate a fully diacritized transcript for each provided audio + undiacritized text pair. The output must follow a simple JSON structure (released with the final data package), illustrated below:

[
  {
    "id": "utt_00123",
    "text_diacritized": "النص المُشَكَّل هنا"
  },
  {
    "id": "utt_00124",
    "text_diacritized": "هَذا نَصٌّ مُشَكَّلٌ آخَر"
  }
]
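For convenience, here is a minimal, unofficial Python snippet that serializes system outputs into the structure shown above. The "id" and "text_diacritized" field names are taken from the example; the final data package remains authoritative.

# Minimal sketch: write predictions to submission.json in the format shown above.
import json

predictions = [
    {"id": "utt_00123", "text_diacritized": "النص المُشَكَّل هنا"},
    {"id": "utt_00124", "text_diacritized": "هَذا نَصٌّ مُشَكَّلٌ آخَر"},
]

with open("submission.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps the Arabic text human-readable in the file
    json.dump(predictions, f, ensure_ascii=False, indent=2)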

Awards

We are pleased to announce the awards for the Shared Task at LREC 2026. The top-ranked teams in each task will receive cash prizes as follows:

Task 1: Data Contribution

  • 1st place: $350
  • 2nd place: $250
  • 3rd place: $150

Task 2: Automatic Diacritization of Speech Dictation

  • 1st place: $350
  • 2nd place: $250
  • 3rd place: $150


The winners will be determined based on the official evaluation metrics specified for each task. Best of luck to all the teams, and we look forward to announcing the winners at the conclusion of the competition!

 

Important Dates

  • 18 December 2025: 1st CFP
  • 10 January 2026: 2nd CFP
  • 15 January 2026: Training set release
  • 15 February 2026: Blind test set release
  • 1 March 2026: System submission deadline
  • 10 March 2026: Release of results
  • 20 March 2026: Paper submission deadline
  • 15 April 2026: Notification of acceptance
  • 30 April 2026: Camera-ready deadline
  • 11–16 May 2026: LREC 2026 workshops (TBC)

 

Contact

Email: aalwazrah@ksaa.gov.sa, ralrasheed@ksaa.gov.sa

 

Organizers

  • Waad Alshammari — King Salman Global Academy for Arabic Language (KSAA).
  • Asma Al Wazrah — King Salman Global Academy for Arabic Language (KSAA).
  • Rawan Almatham — King Salman Global Academy for Arabic Language (KSAA).
  • Afrah Altamimi — King Salman Global Academy for Arabic Language (KSAA).
  • Raghad Al-Rasheed — King Salman Global Academy for Arabic Language (KSAA).
  • Sawsan Alqahtani — Princess Nourah bint Abdulrahman University (PNU).
  • Hanan Aldarmaki — Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
  • Rufael Marew — Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
  • Abdulrahman Alshehri — King Salman Global Academy for Arabic Language (KSAA).
  • Mohamed Assar — King Salman Global Academy for Arabic Language (KSAA).
  • Abdullah Alharbi — King Salman Global Academy for Arabic Language (KSAA).
  • Abdulrahman AlOsaimy — King Salman Global Academy for Arabic Language (KSAA).