Podcast Speaker Diarization and Automatic Speech Recognition
Ahmed Othman | Faisal Aldrees | Abdulaziz Alotabi
Supervisor: Dr. Sultan S Aldera
Aims
- How can we maximize the benefit of Arabic podcasts?
- How can we expose Arabic podcasts to more people?
- How can we make podcast episodes more searchable?
Objectives
- Integrating the ASR system with a speaker diarization model
ASR
- Automatic speech recognition (ASR) is the task of converting an audio signal into text. It is important because speech is a natural interface for human communication, and converting it to text enables countless applications and ideas.
Speaker Diarization
- Speaker diarization addresses the problem of “who spoke when”: it partitions a conversation recording into several speech segments, each belonging to a specific speaker.
Combining ASR and Diarization
- Most speech applications dealing with conversations require not only a transcript of the text but also who spoke when. This can be broken down into two tasks, ASR and diarization, typically combined in three steps: first, run speaker diarization to determine who spoke when; second, transcribe the audio using the ASR system; finally, merge the results of the two systems, as sketched below.
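A minimal sketch of those three steps; `diarize` and `transcribe` are hypothetical stand-ins for the diarization and ASR systems described later in this work:

```python
def transcribe_with_speakers(audio_path, diarize, transcribe):
    """Combine 'who spoke when' with 'what was said' (hypothetical interfaces)."""
    results = []
    for segment, speaker in diarize(audio_path):                   # step 1: who spoke when
        text = transcribe(audio_path, segment.start, segment.end)  # step 2: ASR per segment
        results.append((speaker, segment.start, segment.end, text))
    return results                                                 # step 3: combined transcript
```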
How to evaluate the models
Word Error Rate
- WER is a common metric that measures how accurately an ASR system performs.
- S: the number of substitutions, where a word is replaced
- D: the number of deletions, where a word is left out of the transcript entirely
- I: the number of insertions, where a word is added that was not said
- N: the total number of words in the reference
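These quantities combine into the standard WER formula (reconstructed here; the original slide likely showed it as an image):

$$\mathrm{WER} = \frac{S + D + I}{N}$$

For example, if a 10-word reference is transcribed with 1 substitution, 1 deletion, and 0 insertions, WER = (1 + 1 + 0) / 10 = 20%.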
Character Error Rate
- CER counts the minimum number of character-level operations required to transform the ground truth text into the ASR output.
- S = Number of Substitutions
- D = Number of Deletions
- I = Number of Insertions
- N = Number of characters in reference text (aka ground truth)
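CER uses the same formula as WER, applied at the character level:

$$\mathrm{CER} = \frac{S + D + I}{N}$$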
Diarization Error Rate
- The accuracy of a speaker diarization system is measured using the diarization error rate (DER), which is the sum of three error types: false alarm (speech detected where there is none), missed detection of speech, and confusion between speaker labels.
- F: False Alarm
- M: Miss
- C: Confusion
- T: Total Duration Time
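These terms combine into the standard DER formula, with each error measured as a duration:

$$\mathrm{DER} = \frac{F + M + C}{T}$$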
Connectionist Temporal Classification
CTC is a cost function used for sequence-to-sequence tasks where the number of inputs is larger than the number of outputs, as in ASR, where many audio frames map to few output characters.
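A minimal sketch of CTC as a loss in PyTorch; the shapes are illustrative, and index 0 is reserved for the blank symbol:

```python
import torch
import torch.nn as nn

T, B, C = 50, 4, 33  # 50 input frames, batch of 4, 32 characters + 1 blank
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (B, 10), dtype=torch.long)    # label sequences, length 10
input_lengths = torch.full((B,), T, dtype=torch.long)       # 50 inputs per sample
target_lengths = torch.full((B,), 10, dtype=torch.long)     # only 10 outputs per sample

ctc_loss = nn.CTCLoss(blank=0)  # CTC aligns the 50 frames to the 10 labels
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```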
Self-Attention models
‘Attention Is All You Need’ introduced a new way to handle sequence-to-sequence modeling: a self-attention mechanism that can capture long-term dependencies while remaining efficient, since all positions are computed in parallel.
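A minimal sketch of the scaled dot-product self-attention from that paper; the shapes and weight names are illustrative:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, seq, seq)
    weights = scores.softmax(dim=-1)  # each position attends to every other position
    return weights @ v                # all positions computed in parallel

x = torch.randn(2, 100, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (2, 100, 64)
```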
Conformer
The paper ‘Conformer: Convolution-augmented Transformer for Speech Recognition’ combines the strengths of transformers and CNNs. By incorporating convolution layers within the transformer blocks, the model captures local patterns in the input while retaining the ability to model long-term dependencies through multi-head attention.
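As a usage sketch, torchaudio ships a Conformer encoder; the hyperparameters below are illustrative, not the paper's exact configuration:

```python
import torch
from torchaudio.models import Conformer

model = Conformer(
    input_dim=80,                   # e.g. 80 mel filterbank features per frame
    num_heads=4,                    # multi-head attention for long-range context
    ffn_dim=256,
    num_layers=4,
    depthwise_conv_kernel_size=31,  # convolution module captures local patterns
)
features = torch.randn(2, 400, 80)        # (batch, frames, mels)
lengths = torch.tensor([400, 350])        # valid frames per utterance
out, out_lengths = model(features, lengths)
```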
Squeezeformer
In ‘Squeezeformer: An Efficient Transformer for Automatic Speech Recognition’, the authors found that the Conformer architecture’s design choices are not optimal. After re-examining both the macro- and micro-architecture of Conformer, they proposed Squeezeformer, which consistently outperforms state-of-the-art ASR models under the same training schemes.
API Methodology
The API methodology consists of several key steps:
- The first step is to carefully consider the characteristics of the data sources.
- The next step is to process them in a way that is appropriate for the ASR model.
- The final step is to combine the processed audio and text data and feed them to the ASR model.
Data sources characteristic
- Even though Arabic is widely spoken, there is a shortage in Arabic ASR datasets.
- Since our system targets the podcast industry, we can assume that the data we receive will be recorded in clean environments.
- When it comes to Arabic data sources, there are a few different options to consider. Modern Standard Arabic (MSA) is the formal version of the language, and many different dialects are spoken in different regions. We will assume that the data is mostly in MSA or Saudi Arabian dialects.
Standardizing the sources
- Audio Duration: to give the model a chance to detect long-range patterns without exceeding a duration limit that would increase computation costs, we took a sample from LibriSpeech, a well-known English dataset used to train ASR models. From this sample, we determined that audio duration should range between 6 and 16 seconds (see the sketch after this list).
- Text cleaning: removing non-Arabic letters, removing harakat, and normalizing ‘alef’ should be enough.
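A minimal sketch of the duration filter described above, assuming WAV files under a hypothetical data/clean directory; torchaudio.info reads the header without decoding the whole file:

```python
import pathlib
import torchaudio

MIN_SEC, MAX_SEC = 6, 16

def keep(path):
    info = torchaudio.info(str(path))
    duration = info.num_frames / info.sample_rate
    return MIN_SEC <= duration <= MAX_SEC

clips = [p for p in pathlib.Path("data/clean").glob("*.wav") if keep(p)]
```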
Audio Augmentation
The augmentation techniques we will use were presented in a paper from Google entitled “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”.
[Figures: mel-spectrograms of the raw input, after frequency masking, after time masking, and after the combined augmentation policy]
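The frequency and time masking shown in the figures can be reproduced with torchaudio transforms; the mask sizes here are illustrative, not the paper's exact policy:

```python
import torch
import torchaudio.transforms as T

mel = torch.randn(1, 80, 400)  # (channel, mel bins, frames)

freq_mask = T.FrequencyMasking(freq_mask_param=15)  # zero out a band of mel bins
time_mask = T.TimeMasking(time_mask_param=35)       # zero out a span of frames

augmented = time_mask(freq_mask(mel))  # combined policy: apply both masks
```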
Text Processing
Text processing in ASR systems involves converting text into a format that the machine can handle.
Before tokenizing the text, it is important to clean it. Text cleaning removes all non-Arabic letters and diacritics and normalizes the letter ‘alef’, as sketched below.
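A sketch of those cleaning rules; the exact Unicode ranges are our assumption of what counts as harakat and non-Arabic characters:

```python
import re

HARAKAT = re.compile(r"[\u064B-\u0652]")       # Arabic diacritic marks (tashkeel)
NON_ARABIC = re.compile(r"[^\u0621-\u064A ]")  # keep Arabic letters and spaces

def clean(text: str) -> str:
    text = HARAKAT.sub("", text)
    text = NON_ARABIC.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # normalize alef variants
    return re.sub(r"\s+", " ", text).strip()
```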
Characters tokens
Character tokens represent the individual Arabic letters, each converted to an integer token. This allows ASR systems to transcribe out-of-vocabulary words that are not in the dictionary.
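A minimal character-tokenizer sketch; the tiny `training_texts` list is a stand-in for the cleaned training transcripts, and index 0 is reserved for the CTC blank:

```python
training_texts = ["مرحبا بكم", "في البودكاست"]  # stand-in for the real corpus

chars = sorted(set("".join(training_texts)))
char2id = {c: i + 1 for i, c in enumerate(chars)}  # 0 reserved for CTC blank
id2char = {i: c for c, i in char2id.items()}

def encode(text):
    return [char2id[c] for c in text]

def decode(ids):
    return "".join(id2char[i] for i in ids)
```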
Subword tokens
Subword tokens in automatic speech recognition (ASR) are produced by breaking words down into smaller units called subwords, which keeps the vocabulary compact while still covering rare words.
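A hedged sketch of building subword tokens with the SentencePiece library; the corpus path and vocabulary size are illustrative, and the input file is assumed to hold one cleaned transcript per line:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="transcripts.txt",   # assumed path: one cleaned transcript per line
    model_prefix="arabic_bpe",
    vocab_size=1024,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="arabic_bpe.model")
ids = sp.encode("مرحبا بكم في البودكاست", out_type=int)  # subword token ids
```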
ASR model
We will use a Squeezeformer. Sehoon Kim et al. proposed several sizes for the model, and we will select the mid-size since it offers a good balance between computation and performance.
Speaker Diarization model
For speaker diarization, we use the pyannote.audio toolkit, which includes a pre-trained end-to-end neural network to build a speaker diarization pipeline.
However, there are questions about whether a diarization model trained on an English dataset can work with an Arabic dataset. To address this, we conducted experiments to demonstrate that the model can be used with different languages.
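A usage sketch of the pyannote.audio pipeline described above; the pretrained model name follows pyannote's Hugging Face naming and may require an access token:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("episode.wav")  # path to a podcast episode

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")
```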
Experiment steps
To answer this question, we collected 27 voices from 3 languages: 10 Japanese, 10 Arabic, and 7 English, each recorded in 2 different contexts, angry and normal.
Experiment Result
According to the experiment we conducted, the model is not affected by the language.
Data sources
- We obtained our Arabic audio corpus from two different sources: a dataset called MASC, introduced in August 2022, and SADA, introduced in January 2023.
- Massive Arabic Speech Corpus (MASC) is a dataset that contains 1,000 hours of speech sampled at 16,000 Hz and crawled from over 700 YouTube channels.
- The “SADA” dataset, which stands for “Saudi Audio Dataset for Arabic”, contains audio recordings sourced from more than 80 TV shows provided by the Saudi Broadcasting Authority, totaling ~667 hours.
- The characteristics of the data sources we focused on are:
- a clean dataset (no noise in the recordings)
- MSA and Saudi accents.
Standardizing the sources
- The MASC dataset has 2 different environments, noisy and clean. Our model should be trained only on the clean portion of the dataset; after taking the clean part, we ended up with 485 hours.
- After standardizing the data, we ended up with 447 hours in total.
- With the SADA dataset, after taking the clean part, we ended up with 105 hours. There was no need to standardize the data, because the duration and the start-end interval of each audio segment are already provided in each file's description. As mentioned in the report, each segment should be between 6 and 16 seconds; after selecting those segments, we ended up with roughly 40 hours.
Training
- We used PyTorch as the deep learning framework and DataCrunch.io as the GPU cloud provider to build two models. The first model used character tokens and was trained for 160 epochs. The second used subword tokens with a transfer-learning technique, where we trained only the last three layers and replaced the output layer.
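A hedged sketch of that transfer-learning setup; `model`, its `encoder.layers` attribute, and the `output` head are assumptions about the pretrained network's interface:

```python
import torch.nn as nn

def prepare_for_transfer(model, vocab_size):
    """Freeze the pretrained network, then unfreeze the last three
    encoder layers and attach a fresh output head (assumed attribute names)."""
    for p in model.parameters():
        p.requires_grad = False
    for layer in model.encoder.layers[-3:]:
        for p in layer.parameters():
            p.requires_grad = True
    model.output = nn.Linear(model.output.in_features, vocab_size)  # new, trainable
    return model
```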
Results
- Evaluating the performance of a model can be done using Word Error Rate (WER) and Character Error Rate (CER).
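Both metrics can be computed with the jiwer library (our choice here for illustration):

```python
from jiwer import wer, cer

reference = "مرحبا بكم في البودكاست"
hypothesis = "مرحبا بكم البودكاست"

print(wer(reference, hypothesis))  # word error rate
print(cer(reference, hypothesis))  # character error rate
```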
Discussion
- Our ASR model achieved mixed results in our evaluation. We found that the error rate was relatively high, which we attribute to the lack of a language model and the small amount of data used to train the model. Despite this, we observed that the use of subword tokens resulted in a significantly lower error rate compared to the use of character tokens.
- Our analysis suggests that the use of subword tokens, which break words down into smaller units that can be recognized by the model, is a promising approach to improving the accuracy of ASR models. However, we acknowledge that this approach may require additional data and computational resources to be effective. We believe that future research should focus on further exploring the use of subword tokens and other techniques for improving the accuracy of ASR models.
Deployment
- We created a simple live demo for our Automatic Speech Recognition (ASR) and Diarization model using the Gradio library. This demo is designed to showcase the effectiveness of our model in identifying speakers in audio recordings and transcribing their speech.
- As you will see now.
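A minimal sketch of such a Gradio demo; `run_asr_diarization` is a hypothetical stand-in for our combined inference function:

```python
import gradio as gr

def pipeline_fn(audio_path):
    # hypothetical: run diarization + ASR and format "speaker: text" lines
    return run_asr_diarization(audio_path)

demo = gr.Interface(
    fn=pipeline_fn,
    inputs=gr.Audio(type="filepath"),  # upload or record a clip
    outputs="text",
)
demo.launch()
```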
Search strategy
A search strategy is important for several reasons:
- Efficient use of time and resources
- Reduced bias
- Reproducibility
Research Questions
- RQ1: What are the most frequent techniques used in this field?
- RQ2: What is the performance of these models in terms of the following:
- - SRQ 2.1: Word Error Rate (WER).
- - SRQ 2.2: Computation cost.
- - SRQ 2.3: Diarization Error Rate (DER).
- RQ3: What is being used in feature extraction?
Key concepts
We chose key concepts and free-text terms that refer to our topic, as shown below (from the report):
Concepts
C1: “Automatic speech recognition” OR “ASR” OR “Speech to text” OR “Voice to text” OR “recognition” OR “Speech transcription” OR “SST” OR “SRT”.
C2: “Diarization” OR “cluster” OR “segment” OR “identification” OR “change detection” OR “verification”.
C3: “Arabic” OR “Multilingual”.
Our search strings
After combining our concepts using the operators (C1 AND C2 AND C3), we have 3 search strings, as shown below:
S1: (“Automatic speech recognition” OR “ASR” OR “Speech to text” OR “Voice to text” OR “recognition” OR “Speech transcription” OR “SST” OR “SRT”) AND (“Diarization” OR “speaker clustering” OR “speaker segmentation” OR “speaker identification” OR “speaker change detection” OR “speaker verification”) AND (“Arabic” OR “Multilingual”).
S2: (Google scholar): allintitle: (“Automatic speech recognition” OR “ASR” OR “Speech recognition”) AND (“Diarization” OR “cluster” OR “segment” OR “identification” OR “change detection”) AND (“Arabic” OR “Multilingual”).
S3: (ScienceDirect): (“Automatic speech recognition” OR “ASR” OR “Speech to text” OR “speech recognition”) AND (“Diarization” OR “speaker clustering” OR “speaker segmentation” OR “speaker identification”) AND (“Arabic”).
Conclusion and Future work
- Overall, we created an ASR and diarization model, and we built an initial, still-incomplete search strategy.
- In future work, we will add a language model, more data, and additional features to the API.
- Our plan for future work is to enhance the performance of our ASR system by incorporating a language model. A language model will help to improve the accuracy of our system by enabling it to better understand the context and meaning of spoken language. This will be particularly useful for languages with complex grammar and syntax, such as Arabic.
- Additionally, we plan to expand our dataset to include more data sources, which will help to improve the robustness and accuracy of our system. By including data from a wider range of sources and dialects, we can ensure that our system is able to handle a variety of real-world scenarios.
- Finally, we will be adding more features to our API to improve the user experience and make it easier for developers to integrate our system into their applications. This may include features such as real-time transcription and translation, as well as tools for analyzing and visualizing the results of our system.
- Overall, our future work will focus on improving the accuracy and performance of our ASR system, as well as enhancing the user experience and making it easier for developers to use our API in their applications. We believe that these improvements will help to make our system a valuable tool for a wide range of industries and applications.