The goal of the ASR task is to automatically produce a verbatim, case-insensitive transcript of all words spoken in an audio recording. ASR performance is measured by the word error rate (WER), calculated as the sum of errors (deletions, insertions and substitutions) divided by the total number of words in the reference.
Manual transcription of the Dev and Eval sets was carried out using hand-annotated SAD labels as a starting point.
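An overall Word Error Rate (WER) will be computed as the fraction of token recognition errors per maximum number of reference tokens (scorable and optionally deletable tokens):

$$\mathrm{WER} = \frac{N_{Del} + N_{Ins} + N_{Subst}}{N_{Ref}}$$

where $N_{Del}$ is the number of deletions, $N_{Ins}$ is the number of insertions, $N_{Subst}$ is the number of substitutions, and $N_{Ref}$ is the maximum number of reference tokens (scorable and optionally deletable tokens).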
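As an illustration of the WER calculation described in this section, here is a minimal Python sketch. It assumes whitespace tokenization and lowercasing for case-insensitive matching, and it does not handle optionally deletable tokens, so it is not the official scoring tool. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (WER, substitutions, deletions, insertions).

    Assumption: tokens are whitespace-separated words, compared
    case-insensitively. Optionally deletable tokens are not handled.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Each cell holds (total_errors, substitutions, deletions, insertions)
    # for aligning a prefix of ref with a prefix of hyp.
    # First row: empty reference, so every hypothesis word is an insertion.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # First column: empty hypothesis, so every reference word is a deletion.
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            t, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # correct word, no error
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = prev[j]
            deletion = (t + 1, s, d + 1, n)      # reference word missed
            t, s, d, n = curr[j - 1]
            insertion = (t + 1, s, d, n + 1)     # extra hypothesis word
            # Tuples compare on total_errors first, so min() picks
            # a minimum-error alignment.
            curr.append(min(diagonal, deletion, insertion))
        prev = curr

    total, subs, dels, ins = prev[-1]
    return total / len(ref), subs, dels, ins


wer, subs, dels, ins = word_error_rate(
    "Houston we have a problem", "houston we had problem"
)
print(f"WER={wer:.1%} S={subs} D={dels} I={ins}")
```

Because the denominator is the reference length, a hypothesis with many insertions can yield a WER above 100%.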
Track 1 for Automatic Speech Recognition consists of audio streams, each 30 minutes long. Each audio file has a corresponding transcript consisting of long audio segments with speaker information. However, users will have no ground-truth diarization labels and are encouraged to build their own SAD/diarization systems.
For information about the FSC Phase 1 baseline results, please click here.
Set type | WER |
---|---|
Development Set | 83.8% |
Evaluation Set | 84.054% |
Coming soon!!
Track 2 for Automatic Speech Recognition consists of audio segments that are diarized.
Set type | WER |
---|---|
Development Set | 80.5% |
Evaluation Set | 82.23% |
Coming soon!!