Podcast Speaker Diarization and Automatic Speech Recognition
Ahmed Othman | Faisal Aldrees | Abdulaziz Alotabi
Supervisor: Dr. Sultan S Aldera
Aims
- How can we maximize the benefit of Arabic podcasts?
- How can we expose Arabic podcasts to more people?
- How can we make podcast episodes more searchable?
Objectives
- Integrating the ASR system with a speaker diarization model
ASR
- Automatic speech recognition (ASR) is the task of converting an audio signal into text. It is important because speech is a natural interface for human communication, and converting it to text enables countless applications and ideas.
Speaker Diarization
- Speaker diarization addresses the problem of “who spoke when”: it partitions a conversation recording into several speech segments, each belonging to a specific speaker.
Combining ASR and Diarization
- Most speech applications dealing with conversations require not only a transcript of the text but also who spoke when. This can be broken down into two tasks, ASR and diarization, typically combined in three steps: first, run speaker diarization to determine who spoke when; second, transcribe the audio using the ASR system; finally, merge the results of the two systems, as sketched below.
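A minimal sketch of those three steps; `diarize` and `transcribe` are hypothetical stand-ins for the diarization and ASR systems described later in this work:

```python
def transcribe_with_speakers(audio_path, diarize, transcribe):
    """Combine 'who spoke when' with 'what was said' (hypothetical interfaces)."""
    results = []
    for segment, speaker in diarize(audio_path):                   # step 1: who spoke when
        text = transcribe(audio_path, segment.start, segment.end)  # step 2: ASR per segment
        results.append((speaker, segment.start, segment.end, text))
    return results                                                 # step 3: combined transcript
```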
How to evaluate the models
Word Error Rate
- WER is a common metric that measures how accurately an ASR system performs.
- S: the number of substitutions, where a word is replaced
- D: the number of deletions, where a word is left out of the transcript entirely
- I: the number of insertions, where a word is added that was not said
- N: the total number of words in the reference
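These quantities combine into the standard WER formula (reconstructed here; the original slide likely showed it as an image):

$$\mathrm{WER} = \frac{S + D + I}{N}$$

For example, if a 10-word reference is transcribed with 1 substitution, 1 deletion, and 0 insertions, WER = (1 + 1 + 0) / 10 = 20%.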
Character Error Rate
- CER counts the minimum number of character-level operations required to transform the ground truth text into the ASR output.
- S = Number of Substitutions
- D = Number of Deletions
- I = Number of Insertions
- N = Number of characters in reference text (aka ground truth)
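CER uses the same formula as WER, applied at the character level:

$$\mathrm{CER} = \frac{S + D + I}{N}$$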
Diarization Error Rate
- The accuracy of a speaker diarization system is measured using the diarization error rate (DER), which is the sum of three error types: false alarm (speech detected where there is none), missed detection of speech, and confusion between speaker labels.
- F: False Alarm
- M: Miss
- C: Confusion
- T: Total Duration Time
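These terms combine into the standard DER formula, with each error measured as a duration:

$$\mathrm{DER} = \frac{F + M + C}{T}$$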
Connectionist Temporal Classification
CTC is a cost function used for sequence-to-sequence tasks where the number of inputs is larger than the number of outputs, as in ASR, where many audio frames map to few output characters.
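A minimal sketch of CTC as a loss in PyTorch; the shapes are illustrative, and index 0 is reserved for the blank symbol:

```python
import torch
import torch.nn as nn

T, B, C = 50, 4, 33  # 50 input frames, batch of 4, 32 characters + 1 blank
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (B, 10), dtype=torch.long)    # label sequences, length 10
input_lengths = torch.full((B,), T, dtype=torch.long)       # 50 inputs per sample
target_lengths = torch.full((B,), 10, dtype=torch.long)     # only 10 outputs per sample

ctc_loss = nn.CTCLoss(blank=0)  # CTC aligns the 50 frames to the 10 labels
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```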
Self-Attention models
‘Attention Is All You Need’ introduced a new way to handle sequence-to-sequence modeling: a self-attention mechanism that can capture long-term dependencies while remaining efficient, since all positions are computed in parallel.
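A minimal sketch of the scaled dot-product self-attention from that paper; the shapes and weight names are illustrative:

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); w_*: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, seq, seq)
    weights = scores.softmax(dim=-1)  # each position attends to every other position
    return weights @ v                # all positions computed in parallel

x = torch.randn(2, 100, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (2, 100, 64)
```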
Conformer
The paper ‘Conformer: Convolution-augmented Transformer for Speech Recognition’ combines the strengths of transformers and CNNs. By incorporating convolution layers within the transformer blocks, the model captures local patterns in the input while retaining the ability to model long-term dependencies through multi-head attention.
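As a usage sketch, torchaudio ships a Conformer encoder; the hyperparameters below are illustrative, not the paper's exact configuration:

```python
import torch
from torchaudio.models import Conformer

model = Conformer(
    input_dim=80,                   # e.g. 80 mel filterbank features per frame
    num_heads=4,                    # multi-head attention for long-range context
    ffn_dim=256,
    num_layers=4,
    depthwise_conv_kernel_size=31,  # convolution module captures local patterns
)
features = torch.randn(2, 400, 80)        # (batch, frames, mels)
lengths = torch.tensor([400, 350])        # valid frames per utterance
out, out_lengths = model(features, lengths)
```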
Squeezeformer
In ‘Squeezeformer: An Efficient Transformer for Automatic Speech Recognition’, the authors found that the Conformer architecture’s design choices are not optimal. After re-examining both the macro- and micro-architecture of Conformer, they proposed Squeezeformer, which consistently outperforms state-of-the-art ASR models under the same training schemes.
API Methodology
The API methodology consists of several key steps:
- The first step is to carefully consider the characteristics of the data sources.
- The next step is to process them in a way that is appropriate for the ASR model.
- The final step is to combine the processed audio and text data and feed them to the ASR model.
Data sources characteristic
- Even though Arabic is widely spoken, there is a shortage in Arabic ASR datasets.
- Since our system targets the podcast industry, we can assume that the data we receive will be recorded in clean environments.
- When it comes to Arabic data sources, there are a few different options to consider. Modern Standard Arabic (MSA) is the formal version of the language, and many different dialects are spoken in different regions. We will assume that the data is mostly in MSA or Saudi Arabian dialects.
Standardizing the sources
- Audio Duration: to give the model a chance to detect long-range patterns without exceeding a duration limit that would increase computation costs, we took a sample from LibriSpeech, a well-known English dataset used to train ASR models. From this sample, we determined that audio duration should range between 6 and 16 seconds (see the sketch after this list).
- Text cleaning: removing non-Arabic letters, removing harakat, and normalizing ‘alef’ should be enough.
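A minimal sketch of the duration filter described above, assuming WAV files under a hypothetical data/clean directory; torchaudio.info reads the header without decoding the whole file:

```python
import pathlib
import torchaudio

MIN_SEC, MAX_SEC = 6, 16

def keep(path):
    info = torchaudio.info(str(path))
    duration = info.num_frames / info.sample_rate
    return MIN_SEC <= duration <= MAX_SEC

clips = [p for p in pathlib.Path("data/clean").glob("*.wav") if keep(p)]
```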
Audio Augmentation
The augmentation techniques we will use were presented in a paper from Google entitled “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition”.
[Figures: mel-spectrograms of the raw input, after frequency masking, after time masking, and after the combined augmentation policy]
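The frequency and time masking shown in the figures can be reproduced with torchaudio transforms; the mask sizes here are illustrative, not the paper's exact policy:

```python
import torch
import torchaudio.transforms as T

mel = torch.randn(1, 80, 400)  # (channel, mel bins, frames)

freq_mask = T.FrequencyMasking(freq_mask_param=15)  # zero out a band of mel bins
time_mask = T.TimeMasking(time_mask_param=35)       # zero out a span of frames

augmented = time_mask(freq_mask(mel))  # combined policy: apply both masks
```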
Text Processing
Text processing in ASR systems involves converting text into a format that the machine can handle.
Before tokenizing the text, it is important to clean it. Text cleaning removes all non-Arabic letters and diacritics and normalizes the letter ‘alef’, as sketched below.
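A sketch of those cleaning rules; the exact Unicode ranges are our assumption of what counts as harakat and non-Arabic characters:

```python
import re

HARAKAT = re.compile(r"[\u064B-\u0652]")       # Arabic diacritic marks (tashkeel)
NON_ARABIC = re.compile(r"[^\u0621-\u064A ]")  # keep Arabic letters and spaces

def clean(text: str) -> str:
    text = HARAKAT.sub("", text)
    text = NON_ARABIC.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # normalize alef variants
    return re.sub(r"\s+", " ", text).strip()
```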
Characters tokens
Character tokens represent the individual Arabic letters, each converted to an integer token. This allows ASR systems to transcribe out-of-vocabulary words that are not in the dictionary.
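A minimal character-tokenizer sketch; the tiny `training_texts` list is a stand-in for the cleaned training transcripts, and index 0 is reserved for the CTC blank:

```python
training_texts = ["مرحبا بكم", "في البودكاست"]  # stand-in for the real corpus

chars = sorted(set("".join(training_texts)))
char2id = {c: i + 1 for i, c in enumerate(chars)}  # 0 reserved for CTC blank
id2char = {i: c for c, i in char2id.items()}

def encode(text):
    return [char2id[c] for c in text]

def decode(ids):
    return "".join(id2char[i] for i in ids)
```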
Subword tokens
Subword tokens in automatic speech recognition (ASR) are produced by breaking words down into smaller units called subwords, which keeps the vocabulary compact while still covering rare words.
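A hedged sketch of building subword tokens with the SentencePiece library; the corpus path and vocabulary size are illustrative, and the input file is assumed to hold one cleaned transcript per line:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="transcripts.txt",   # assumed path: one cleaned transcript per line
    model_prefix="arabic_bpe",
    vocab_size=1024,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="arabic_bpe.model")
ids = sp.encode("مرحبا بكم في البودكاست", out_type=int)  # subword token ids
```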
ASR model
We will use a Squeezeformer. Sehoon Kim et al. proposed several sizes for the model, and we will select the mid-size since it offers a good balance between computation and performance.
Speaker Diarization model
For speaker diarization, we use the pyannote.audio toolkit, which includes a pre-trained end-to-end neural network to build a speaker diarization pipeline.
However, there are questions about whether a diarization model trained on an English dataset can work with an Arabic dataset. To address this, we conducted experiments to demonstrate that the model can be used with different languages.
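A usage sketch of the pyannote.audio pipeline described above; the pretrained model name follows pyannote's Hugging Face naming and may require an access token:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("episode.wav")  # path to a podcast episode

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")
```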
Experiment steps
To answer this question, we collected 27 voices from 3 languages: 10 Japanese, 10 Arabic, and 7 English, each recorded in 2 different contexts, angry and normal.
Experiment Result
According to the experiment we conducted, the model is not affected by the language.
Data sources
- We obtained our Arabic audio corpus from two different sources: a dataset called MASC, introduced in August 2022, and SADA, introduced in January 2023.
- Massive Arabic Speech Corpus (MASC) is a dataset that contains 1,000 hours of speech sampled at 16,000 Hz and crawled from over 700 YouTube channels.
- The “SADA” dataset, which stands for “Saudi Audio Dataset for Arabic”, contains audio recordings sourced from more than 80 TV shows provided by the Saudi Broadcasting Authority, totaling ~667 hours.
- The characteristics of the data sources we focused on are:
- a clean dataset (no noise in the recordings)
- MSA and Saudi accents.
Standardizing the sources
- The MASC dataset has 2 different environments, noisy and clean. Our model should be trained only on the clean portion of the dataset; after taking the clean part, we ended up with 485 hours.
- After standardizing the data, we ended up with 447 hours in total.
- With the SADA dataset, after taking the clean part, we ended up with 105 hours. There was no need to standardize the data, because the duration and the start-end interval of each audio segment are already provided in each file's description. As mentioned in the report, each segment should be between 6 and 16 seconds; after selecting those segments, we ended up with roughly 40 hours.
Training
- We used PyTorch as the deep learning framework and DataCrunch.io as the GPU cloud provider to build two models. The first model used character tokens and was trained for 160 epochs. The second used subword tokens with a transfer-learning technique, where we trained only the last three layers and replaced the output layer.
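A hedged sketch of that transfer-learning setup; `model`, its `encoder.layers` attribute, and the `output` head are assumptions about the pretrained network's interface:

```python
import torch.nn as nn

def prepare_for_transfer(model, vocab_size):
    """Freeze the pretrained network, then unfreeze the last three
    encoder layers and attach a fresh output head (assumed attribute names)."""
    for p in model.parameters():
        p.requires_grad = False
    for layer in model.encoder.layers[-3:]:
        for p in layer.parameters():
            p.requires_grad = True
    model.output = nn.Linear(model.output.in_features, vocab_size)  # new, trainable
    return model
```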
Results
- Evaluating the performance of a model can be done using Word Error Rate (WER) and Character Error Rate (CER).
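Both metrics can be computed with the jiwer library (our choice here for illustration):

```python
from jiwer import wer, cer

reference = "مرحبا بكم في البودكاست"
hypothesis = "مرحبا بكم البودكاست"

print(wer(reference, hypothesis))  # word error rate
print(cer(reference, hypothesis))  # character error rate
```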
Discussion
- Our ASR model achieved mixed results in our evaluation. We found that the error rate was relatively high, which we attribute to the lack of a language model and the small amount of data used to train the model. Despite this, we observed that the use of subword tokens resulted in a significantly lower error rate compared to the use of character tokens.
- Our analysis suggests that the use of subword tokens, which break words down into smaller units that can be recognized by the model, is a promising approach to improving the accuracy of ASR models. However, we acknowledge that this approach may require additional data and computational resources to be effective. We believe that future research should focus on further exploring the use of subword tokens and other techniques for improving the accuracy of ASR models.
Deployment
- We created a simple live demo for our Automatic Speech Recognition (ASR) and Diarization model using the Gradio library. This demo is designed to showcase the effectiveness of our model in identifying speakers in audio recordings and transcribing their speech.
- As you will see now.
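A minimal sketch of such a Gradio demo; `run_asr_diarization` is a hypothetical stand-in for our combined inference function:

```python
import gradio as gr

def pipeline_fn(audio_path):
    # hypothetical: run diarization + ASR and format "speaker: text" lines
    return run_asr_diarization(audio_path)

demo = gr.Interface(
    fn=pipeline_fn,
    inputs=gr.Audio(type="filepath"),  # upload or record a clip
    outputs="text",
)
demo.launch()
```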
Search strategy
A search strategy is important for several reasons:
- Efficient use of time and resources
- Reduced bias
- Reproducibility
Research Questions
- RQ1: What are the most frequent techniques used in this field?
- RQ2: What is the performance of these models in terms of the following:
- - SRQ 2.1: Word Error Rate (WER).
- - SRQ 2.2: Computation cost.
- - SRQ 2.3: Diarization Error Rate (DER).
- RQ3: What is being used in feature extraction?
Key concepts
We chose key concepts and free-text terms that refer to our topic, as shown below (from the report):
Concepts
C1: “Automatic speech recognition” OR “ASR” OR “Speech to text” OR “Voice to text” OR “recognition” OR “Speech transcription” OR “SST” OR “SRT”.
C2: “Diarization” OR “cluster” OR “segment” OR “identification” OR “change detection” OR “verification”.
C3: “Arabic” OR “Multilingual”.
Our search strings
After combining our concepts using the operators (C1 AND C2 AND C3), we have 3 search strings, as shown below:
S1: (“Automatic speech recognition” OR “ASR” OR “Speech to text” OR “Voice to text” OR “recognition” OR “Speech transcription” OR “SST” OR “SRT”) AND (“Diarization” OR “speaker clustering” OR “speaker segmentation” OR “speaker identification” OR “speaker change detection” OR “speaker verification”) AND (“Arabic” OR “Multilingual”).
S2: (Google scholar): allintitle: (“Automatic speech recognition” OR “ASR” OR “Speech recognition”) AND (“Diarization” OR “cluster” OR “segment” OR “identification” OR “change detection”) AND (“Arabic” OR “Multilingual”).
S3: (ScienceDirect): (“Automatic speech recognition” OR “ASR” OR “Speech to text” OR “speech recognition”) AND (“Diarization” OR “speaker clustering” OR “speaker segmentation” OR “speaker identification”) AND (“Arabic”).
Conclusion and Future work
- Overall, we created an ASR and diarization model, and we built an initial, still-incomplete search strategy.
- In future work, we will add a language model, more data, and additional features to the API.
- Our plan for future work is to enhance the performance of our ASR system by incorporating a language model. A language model will help to improve the accuracy of our system by enabling it to better understand the context and meaning of spoken language. This will be particularly useful for languages with complex grammar and syntax, such as Arabic.
- Additionally, we plan to expand our dataset to include more data sources, which will help to improve the robustness and accuracy of our system. By including data from a wider range of sources and dialects, we can ensure that our system is able to handle a variety of real-world scenarios.
- Finally, we will be adding more features to our API to improve the user experience and make it easier for developers to integrate our system into their applications. This may include features such as real-time transcription and translation, as well as tools for analyzing and visualizing the results of our system.
- Overall, our future work will focus on improving the accuracy and performance of our ASR system, as well as enhancing the user experience and making it easier for developers to use our API in their applications. We believe that these improvements will help to make our system a valuable tool for a wide range of industries and applications.