Welcome to KSAA-2026 Shared Task

On Arabic Speech Dictation with Automatic Diacritization


Introduction

The KSAA-2026 Shared Task introduces a new multimodal benchmark focused on transforming raw Arabic speech transcripts into fully diacritized text. Unlike conventional ASR tasks that emphasize transcription alone, this challenge targets the restoration of diacritics, a persistent and unresolved problem in Arabic NLP due to lexical ambiguity, syntactic variation, and the absence of diacritics in most written text.
This shared task provides two subtasks:

  1. Data Contribution—Participants contribute at least one hour of recorded Arabic speech through the KSAA VoiceWall platform, enriching the benchmark.
  2. Automatic Diacritization—Participants build models that take speech audio + undiacritized transcripts as inputs and generate fully diacritized text.

The benchmark currently includes 5 hours of Modern Standard Arabic (MSA) and multi-dialectal speech and will expand through community participation.

Registration

Participants need to register via this link:


https://forms.office.com/r/KF4bvNNASP?origin=lprLink


Dataset

The dataset consists of approximately 5 hours of Arabic speech audio collected from male and female speakers across a wide range of dialects (Saudi, Egyptian, Kuwaiti, Bahraini, Sudanese, Qatari, Algerian, Syrian, and Palestinian). Utterances are short (≤9 seconds) and cover diverse domains such as politics, sports, economy, news, and religion.

The annotation process involved aligning the speech with written transcripts and ensuring diacritic accuracy. Multiple layers of quality control were implemented, including file normalization, systematic labeling, and manual reviews of diacritization to guarantee consistency and reliability.

Each team is required to contribute at least one hour of speech, and all submitted recordings will undergo automatic checks and a manual review conducted by another member of the same team. To ensure consistency and transparency, a standard for manual evaluation will be provided to participants. All validated recordings will then be shared with every team for fair benchmarking. To encourage wider contributions, recognition will be given to participants who contribute the largest volume of high-quality recordings with the most accurate diacritization, ensuring continuous dataset growth and broader dialectal and speaker diversity.
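As a concrete illustration of the kind of automatic check involved, the following Python sketch verifies clip length and total duration. The directory layout, WAV format, and thresholds are our assumptions based on the dataset description above, not an official validation script.

import wave
from pathlib import Path

MAX_CLIP_SECONDS = 9.0       # utterances in the dataset are at most 9 seconds
REQUIRED_SECONDS = 3600.0    # each team must contribute at least one hour

total_seconds = 0.0
for path in sorted(Path("recordings").glob("*.wav")):  # hypothetical layout
    with wave.open(str(path), "rb") as w:
        seconds = w.getnframes() / w.getframerate()
    if seconds > MAX_CLIP_SECONDS:
        print(f"{path.name}: {seconds:.1f}s exceeds the 9-second utterance limit")
    total_seconds += seconds

status = "meets" if total_seconds >= REQUIRED_SECONDS else "falls short of"
print(f"Total: {total_seconds / 3600:.2f} h ({status} the 1-hour requirement)")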


Tasks

Task 1: Data Contribution

Each team is required to contribute at least one hour of recorded Arabic speech via the KSAA VoiceWall platform. All submitted recordings will undergo automatic checks and a manual review conducted by another member of the same team.

Task 2: Automatic Diacritization of Speech Dictation

Text-only diacritization systems often fail when applied to speech transcripts due to domain and style mismatches, and ASR systems rarely generate well-diacritized outputs. This subtask directly addresses this gap by requiring participants to build systems that leverage both speech and undiacritized transcripts to produce fully diacritized text.

A sample of the input–output data structure is shown below to illustrate the task format.

Input

Speech clip


Undiacritized transcript

أريد أن أشرب كوبا من الشاي

Output

Diacritized transcript

أُرِيدُ أَنْ أَشْرَبَ كُوبًا مِنَ الشَّاي


The task is to predict the diacritics for each character of the undiacritized text, explicitly addressing the mismatch between ASR output and text-only diacritization.
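To make the input–output relationship concrete: the undiacritized input can be recovered from a diacritized reference by stripping the eight standard Arabic diacritic marks (U+064B–U+0652). A minimal Python sketch:

import re

# Fathatan through sukun: the eight standard Arabic diacritic marks
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text):
    return DIACRITICS.sub("", text)

print(strip_diacritics("أُرِيدُ أَنْ أَشْرَبَ كُوبًا مِنَ الشَّاي"))
# -> أريد أن أشرب كوبا من الشاي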

Evaluation

Task 1

The data contribution will be evaluated based on the duration and quality of the contributed speech recordings:

  • Each team must submit at least one hour of valid speech data.
  • Data quality will be evaluated through automatic checks and a manual review by another team member.
  • The manual review will assess recording clarity, adherence to prompts, and consistency of diacritization.
  • A standardized manual evaluation guideline will be provided to ensure consistency and transparency.
  • Each recording will be linked to both the team name and the individual contributor for full attribution.
  • All validated data will be released to shared-task participants, ensuring fairness and expanding the dataset.


Task 2

Models will be evaluated using:

  • Character Error Rate (CER)
  • Diacritic Error Rate (DER)
  • Word Error Rate (WER)
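CER and WER follow the standard edit-distance definitions over characters and words, while DER measures the fraction of base characters whose diacritics differ from the reference. For intuition, here is a minimal sketch of a DER computation under a strict one-to-one character-alignment assumption; the official scorer may differ in its alignment and in which characters it counts.

ARABIC_DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")

def base_diacritic_pairs(text):
    """Group each base character with the diacritic marks that follow it."""
    pairs = []
    for ch in text:
        if ch in ARABIC_DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs

def diacritic_error_rate(reference, hypothesis):
    """Fraction of base characters whose diacritics differ from the reference."""
    ref = base_diacritic_pairs(reference)
    hyp = base_diacritic_pairs(hypothesis)
    assert [b for b, _ in ref] == [b for b, _ in hyp], "base text must be identical"
    errors = sum(r != h for (_, r), (_, h) in zip(ref, hyp))
    return errors / max(len(ref), 1)

print(diacritic_error_rate("أُرِيدُ", "أُرِيدَ"))  # 1 of 4 base characters wrong -> 0.25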

Baselines

Two evaluation tracks are offered:

  1. Text-Only Baseline

Using the CAMeL Tools MLE Diacritizer, which takes plain, undiacritized transcripts as input and produces diacritized versions.
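For reference, a minimal sketch of running this baseline with the CAMeL Tools API (usage follows the CAMeL Tools documentation at the time of writing; verify against your installed version, and note that a pretrained model download may be required):

from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.word import simple_word_tokenize

mle = MLEDisambiguator.pretrained()

def diacritize(text):
    tokens = simple_word_tokenize(text)
    disambig = mle.disambiguate(tokens)
    # Take the highest-scoring analysis per token; fall back to the raw word
    return " ".join(
        d.analyses[0].analysis["diac"] if d.analyses else d.word
        for d in disambig
    )

print(diacritize("أريد أن أشرب كوبا من الشاي"))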

  2. Speech + Text Baseline (Main Track)

Using the Text+ASR LSTM model (Shatnawi et al., 2024), which integrates acoustic cues from audio with ASR outputs and undiacritized text transcripts to refine diacritization.
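For intuition only, below is a toy sketch of one way to combine acoustic features with character embeddings in a bidirectional LSTM tagger. All dimensions, the mean-pooling strategy, and the label set are illustrative assumptions, not the published baseline's implementation.

import torch
import torch.nn as nn

class SpeechTextDiacritizer(nn.Module):
    def __init__(self, n_chars, n_labels, audio_dim=80, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, 128)
        self.audio_proj = nn.Linear(audio_dim, 128)
        self.lstm = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, char_ids, audio_feats):
        # char_ids: (batch, seq_len); audio_feats: (batch, frames, audio_dim)
        audio_summary = self.audio_proj(audio_feats.mean(dim=1))   # (batch, 128)
        chars = self.char_emb(char_ids)                            # (batch, seq, 128)
        audio = audio_summary.unsqueeze(1).expand(-1, chars.size(1), -1)
        h, _ = self.lstm(torch.cat([chars, audio], dim=-1))
        return self.out(h)  # per-character diacritic logits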

Model                      WER       CER       DER
CAMeL Tools (text-only)    85.36%    19.86%    35.80%
LSTM (speech+text)         soon      soon      soon


Submission & Expected Output Format

Participants must generate a fully diacritized transcript for each provided audio + undiacritized text pair. The output format will follow a simple JSON structure (released with the final data package):

[
  {
    "id": "utt_00123",
    "text_diacritized": "النص المُشَكَّل هنا"
  },
  {
    "id": "utt_00124",
    "text_diacritized": "هَذا نَصٌّ مُشَكَّلٌ آخَر"
  }
]
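A minimal sketch for serializing predictions in this format (the field names come from the example above; the file name "submission.json" is our assumption, so follow the official instructions released with the data package):

import json

predictions = [
    {"id": "utt_00123", "text_diacritized": "أُرِيدُ أَنْ أَشْرَبَ كُوبًا مِنَ الشَّاي"},
]

with open("submission.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps the Arabic text human-readable in the file
    json.dump(predictions, f, ensure_ascii=False, indent=2)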

Awards

We are pleased to announce the awards for the Shared Task at LREC 2026. The top-ranked teams in each task will receive cash prizes as follows:

Task 1: Data Contribution

  • 1st place: $350
  • 2nd place: $250
  • 3rd place: $150

Task 2: Automatic Diacritization of Speech Dictation

  • 1st place: $350
  • 2nd place: $250
  • 3rd place: $150


The winners will be determined based on the official evaluation metrics specified for each task. Best of luck to all the teams, and we look forward to announcing the winners at the conclusion of the competition!


Important Dates

  • 10 December 2025: 1st CFP
  • 10 January 2026: 2nd CFP
  • 15 January 2026: Training set release
  • 15 February 2026: Blind test set release
  • 1 March 2026: System submission deadline
  • 10 March 2026: Release of results
  • 20 March 2026: Paper submission deadline
  • 15 April 2026: Notification of acceptance
  • 30 April 2026: Camera-ready deadline
  • 11–16 May 2026: LREC 2026 workshops (TBC)


Contact

Email: aalwazrah@ksaa.gov.sa, ralrasheed@ksaa.gov.sa


Organizers

  • Waad Alshammari — King Salman Global Academy for Arabic Language (KSAA).
  • Asma Al Wazrah — King Salman Global Academy for Arabic Language (KSAA).
  • Rawan Almatham — King Salman Global Academy for Arabic Language (KSAA).
  • Afrah Altamimi — King Salman Global Academy for Arabic Language (KSAA).
  • Raghad Al-Rasheed — King Salman Global Academy for Arabic Language (KSAA).
  • Sawsan Alqahtani — Princess Nourah bint Abdulrahman University (PNU).
  • Hanan Aldarmaki — Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
  • Rufael Marew — Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
  • Abdulrahman Alshehri — King Salman Global Academy for Arabic Language (KSAA).
  • Mohamed Assar — King Salman Global Academy for Arabic Language (KSAA).
  • Abdullah Alharbi — King Salman Global Academy for Arabic Language (KSAA).
  • Abdulrahman AlOsaimy — King Salman Global Academy for Arabic Language (KSAA).