The KSAA-2026 Shared Task introduces a new multimodal
benchmark focused on transforming raw Arabic speech
transcripts into fully diacritized text. Unlike conventional
ASR tasks that emphasize transcription alone, this challenge
targets the restoration of diacritics, a persistent and
unresolved problem in Arabic NLP caused by lexical ambiguity,
syntactic variation, and the absence of diacritics in most
written text.
This shared task provides two subtasks:
The benchmark currently includes
5 hours of MSA and multi-dialectal speech
and will expand through community participation.
Participants need to register via this link:
https://forms.office.com/r/KF4bvNNASP?origin=lprLink
The dataset consists of approximately
5 hours of Arabic speech audio collected
from male and female speakers across a wide range of
dialects (Saudi, Egyptian, Kuwaiti, Bahraini, Sudanese,
Qatari, Algerian, Syrian, and Palestinian). Utterances are
short (≤9 seconds) and cover diverse domains such as
politics, sports, economy, news, and religion.
The annotation process involved aligning the speech with written transcripts and ensuring diacritic accuracy. Multiple layers of quality control were implemented, including file normalization, systematic labeling, and manual reviews of diacritization to guarantee consistency and reliability.
Task 1: Data Contribution
Each team is required to contribute at least one hour of speech, and all submitted recordings will undergo automatic checks as well as a manual review conducted by another member of the same team. To ensure consistency and transparency, a standard for manual evaluation will be provided to participants. All validated recordings will then be shared with every team for fair benchmarking. To encourage wider contributions, recognition will be given to participants who contribute the largest volume of high-quality recordings and achieve the highest accuracy in diacritic representation, ensuring continuous dataset growth and broader dialectal and speaker diversity.
Task 2: Automatic Diacritization of Speech Dictation
Text-only diacritization systems often fail when applied to speech transcripts due to domain and style mismatches, and ASR systems rarely generate well-diacritized outputs. This subtask directly addresses this gap by requiring participants to build systems that leverage both speech and undiacritized transcripts to produce fully diacritized text.
A sample of the input–output data structure will be provided to illustrate the task format.
Input:
  Speech clip (audio)
  Undiacritized transcript: أريد أن أشرب كوبًا من الشاي
Output:
  Diacritized transcript: أُرِيدُ أَنْ أَشْرَبَ كُوبًا مِنَ الشَّاي
The task is to add diacritics to each character of the undiacritized text, explicitly addressing the mismatch between ASR output and text-only diacritization.
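For intuition, Arabic diacritic marks occupy a small Unicode block (U+064B–U+0652, plus the superscript alef U+0670), so an undiacritized transcript can be approximated by stripping those marks from a diacritized reference. A minimal sketch, with a function name of our own choosing (note that the official preprocessing may differ slightly; the sample input above, for instance, keeps the tanween on كوبًا):

```python
import re

# Arabic diacritic marks: fathatan..sukun (U+064B-U+0652) plus dagger alif (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving only the base letters."""
    return DIACRITICS.sub("", text)

diacritized = "أُرِيدُ أَنْ أَشْرَبَ كُوبًا مِنَ الشَّاي"
print(strip_diacritics(diacritized))  # base letters only, all marks removed
```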
Task 1
The data contribution will be evaluated based on the duration and quality of the contributed speech recordings.
Task 2
Models will be evaluated using word error rate (WER), character error rate (CER), and diacritic error rate (DER).
Two evaluation tracks are offered:
1. CAMeL Tools MLE Diacritizer, which receives plain-text transcripts and produces diacritized versions.
2. Text+ASR LSTM model (Shatnawi et al., 2024), which integrates acoustic cues from audio with ASR outputs and undiacritized text transcripts to refine diacritization.
Model                   | WER    | CER    | DER
CAMeL Tools (text-only) | 85.36% | 19.86% | 35.80%
LSTM (speech+text)      | TBA    | TBA    | TBA
Participants must generate a fully diacritized transcript
for each provided audio + undiacritized text pair.
Output format will follow a simple JSON structure (released
with the final data package):
[
  { "id": "utt_00123", "text_diacritized": "النص المُشَكَّل هنا" },
  { "id": "utt_00124", "text_diacritized": "هَذا نَصٌّ مُشَكَّلٌ آخَر" }
]
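A submission file in that shape can be produced with a few lines of standard-library Python; `write_submission` below is a hypothetical helper of our own (only the field names "id" and "text_diacritized" come from the announced sample):

```python
import json

def write_submission(predictions: dict, path: str) -> None:
    """Write predictions (utterance id -> diacritized transcript) as a JSON list."""
    records = [{"id": utt_id, "text_diacritized": text}
               for utt_id, text in sorted(predictions.items())]
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Arabic text readable in the file
        json.dump(records, f, ensure_ascii=False, indent=2)

write_submission({"utt_00123": "النص المُشَكَّل هنا"}, "submission.json")
```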
We are pleased to announce the awards for the Shared Task at LREC 2026. The top-ranked teams in each task will receive cash prizes as follows:
The winners will be determined based on the official
evaluation metrics specified for each task. Best of luck to
all the teams, and we look forward to announcing the winners
at the conclusion of the competition!
Email: aalwazrah@ksaa.gov.sa, ralrasheed@ksaa.gov.sa