
IESL

21st Embedded SW Contest

Potential Field Extension

Speech Emotion Recognition

Introduction

Robotic Control: Care Robot

Introduction

Why Emotion Recognition?

Emotion recognition technology, which identifies and empathizes with people's hidden emotions, is developing rapidly. It collects and analyzes emotional data, interpreting objective emotional information numerically from a person's face, voice, body movements, and biological signals. Its most crucial role is to automatically recognize emotions, even when individuals do not express them explicitly in a given environment, and to provide real-time personalized environments. It can also recognize vague emotions that even the individual cannot define, creating the most comfortable state for them.

Why Emotion Recognition?

  • Emotion recognition technology identifies and empathizes with hidden emotions by analyzing facial expressions, voice, body movements, and biological signals.

  • It automatically recognizes emotions, including those not outwardly expressed, in different environments.

  • The technology creates real-time personalized environments based on detected emotions.

  • It can identify vague emotions that individuals themselves might not be aware of.

Introduction

Across various industry sectors, emotion recognition technology can be applied in four major ways.

  • It can be used in usability evaluation to improve future-oriented products and services.

  • It can be utilized in developing services that foster empathy and communication.

  • It can be used to provide optimized personalized environments and services.

  • It can be applied in technology for assessing and responding to safety-related situations, such as emergencies.

Speech Emotion Recognition

Pipeline: 1. Audio Signals → 2. Data Augmentation & Pre-Processing → 3. Feature Extraction → 4. Classifier Training → Emotion

  • The RAVDESS audio-visual dataset was used as input.
  • RAVDESS contains 1440 files, recorded by 24 actors, with each actor performing 60 recordings.
  • The recordings are categorized into 8 emotional classes: neutral, calm, happy, sad, angry, fearful, disgusted, surprised.
  • White noise addition: the dataset size was doubled by adding white noise for data augmentation.
  • Normalization: all recordings were adjusted to a consistent volume by applying uniform amplification.
  • Trimming: silent sections at the beginning and end of each recording were removed to eliminate static noise, keeping the effective part of the signal.
  • Zero-padding: padding was added to the right side of each audio file to standardize the length across all voice data (a sketch of these steps follows below).
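
A minimal Python sketch of this pipeline, assuming librosa and numpy; the target length, noise factor, and trim threshold are illustrative values that the slides do not specify:

    import numpy as np
    import librosa

    def preprocess(path, target_len=180000, noise_factor=0.005):
        # Load the recording at its native sampling rate.
        y, sr = librosa.load(path, sr=None)

        # Normalization: scale every recording to the same peak amplitude.
        y = y / np.max(np.abs(y))

        # Trimming: drop silent sections at the beginning and end.
        y, _ = librosa.effects.trim(y, top_db=25)

        # Zero-padding on the right to standardize lengths.
        if len(y) < target_len:
            y = np.pad(y, (0, target_len - len(y)))
        else:
            y = y[:target_len]

        # White-noise addition doubles the dataset: one clean, one noisy copy.
        noisy = y + noise_factor * np.random.normal(size=len(y))
        return y, noisy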

Time-Domain Feature Extraction

Spectral-Domain Feature Extraction

Time-Domain Feature Extraction

RMS (Root Mean Square) was used to calculate the average volume of the voice signal: a value corresponding to the amplitude of the entire signal is determined, and the RMS energy, which captures the intensity of both the voice and any noise, is derived.

The rate at which a voice signal crosses zero, or the rate of change in the signal's sign, is known as the Zero Crossing Rate (ZCR). A higher ZCR, indicating frequent sign changes as the signal crosses zero, can be interpreted as the presence of more noise in the signal. Thus, ZCR can be used to identify sections of voice data that contain only the voice signal and perform endpoint detection.
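
Both time-domain features can be extracted frame by frame with librosa; a minimal sketch, where "speech.wav" is a placeholder path:

    import numpy as np
    import librosa

    y, sr = librosa.load("speech.wav", sr=None)

    # RMS energy per frame: sqrt of the mean squared amplitude in each window.
    rms = librosa.feature.rms(y=y)               # shape: (1, n_frames)

    # Zero Crossing Rate per frame: fraction of sign changes in each window.
    zcr = librosa.feature.zero_crossing_rate(y)  # shape: (1, n_frames)

    # The same RMS formula applied to the whole signal at once:
    rms_full = np.sqrt(np.mean(y ** 2))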

Spectral-Domain Feature Extraction

The Mel-frequency Cepstrum captures characteristics of the signal's frequency content represented on the Mel scale, which closely aligns with the nonlinear nature of human hearing. By extension, the Mel-frequency Cepstral Coefficients (MFCC) represent the "spectrum of the spectrum." MFCCs are derived by mapping the powers of the frequency spectrum onto the Mel scale, taking the log of these powers, and then applying a discrete cosine transform.
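
A minimal librosa sketch of this chain; the coefficient count n_mfcc=40 is an assumption, since the slides do not state how many coefficients were kept:

    import librosa

    # "speech.wav" is a placeholder path.
    y, sr = librosa.load("speech.wav", sr=None)

    # Mel-scaled power spectrum -> log -> discrete cosine transform.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)  # shape: (40, n_frames)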

  • Four-layer LSTM structure:
      • Layer 1: 256 nodes
      • Layer 2: 256 nodes
      • Layer 3: 128 nodes
      • Layer 4: 64 nodes
  • Activation function: tanh (common in LSTMs)
  • Dropout layers: added between LSTM layers to prevent overfitting, with a dropout rate of 0.2
  • Classification layer: Dense layer with a SoftMax function, suitable for multi-class classification
  • Batch size: 32
  • Optimizer: Adam
  • Learning rate: 0.001
  • Data split: train : validation : test = 60 : 20 : 20 (a Keras sketch of this configuration follows below)
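
A Keras sketch of this configuration; the input shape (frames x feature dimension) is an assumption, since the slides do not state the sequence length or the number of features per frame:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dropout, Dense
    from tensorflow.keras.optimizers import Adam

    TIMESTEPS, FEATURES = 300, 40  # assumed input shape

    model = Sequential([
        # Keras LSTM layers use tanh activation by default.
        LSTM(256, return_sequences=True, input_shape=(TIMESTEPS, FEATURES)),
        Dropout(0.2),
        LSTM(256, return_sequences=True),
        Dropout(0.2),
        LSTM(128, return_sequences=True),
        Dropout(0.2),
        LSTM(64),
        Dropout(0.2),
        Dense(8, activation="softmax"),  # one output per emotion class
    ])

    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X_train, y_train, batch_size=32, validation_data=(X_val, y_val))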

Diagram: Standby and Operate states, driven by the eight recognized emotions: Fear, Sad, Neutral, Angry, Happy, Calm, Disgust, Surprised.

PoC

01 Robotic Control: Care Robot

02 SLAM

03 Navigation

04 Object Detection

05 Manipulation

  • The "Care Robot," a combination of the TurtleBot3 Waffle and the OpenMANIPULATOR, is a compact, multifunctional robot designed for autonomous exploration and interaction in various environments.

  • It's well-suited for tasks such as object handling and service roles.
  • SLAM technology enables real-time mapping and location tracking, correcting errors through loop closure to produce a detailed map and location history.

  • In ROS, Gmapping from OpenSLAM uses Occupancy Grid Maps (OGM) and particle filters for SLAM. OGMs depict environments using colors: white for free areas, black for occupied spaces such as walls, and gray for unexplored regions. The color scheme, which represents space-occupancy probabilities, can vary between algorithms (a sketch mapping occupancy values to colors follows below).
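
A small sketch of how ROS-style occupancy values (-1 for unknown, 0-100 for percent occupied, as in nav_msgs/OccupancyGrid) could be mapped to this color scheme; the 50% cutoff is illustrative, since real tools use tunable free/occupied thresholds:

    import numpy as np

    def ogm_to_colors(grid):
        # -1 marks cells the robot has not explored yet.
        grid = np.asarray(grid)
        colors = np.full(grid.shape, "gray", dtype=object)  # unexplored
        colors[(grid >= 0) & (grid < 50)] = "white"         # free space
        colors[grid >= 50] = "black"                        # occupied (walls)
        return colors
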
  • Algorithm: AMCL (Adaptive Monte Carlo Localization)
  • Purpose: used for estimating and correcting the robot's position.
  • Key technique: particle filter
  • Data source: LIDAR data.
  • Function: probabilistically predicts the robot's location.
  • Process for minimizing uncertainty (a simplified particle-filter sketch follows below):
      • Map comparison: compares a pre-made map with current LIDAR data in real time.
      • Calculation of similarity: determines how closely the LIDAR data matches the map.
      • Weight assignment: assigns weights to candidate locations based on similarity.
  • Dynamic weight adjustment:
      • Method: dynamically adjusts weights based on location probability.
      • Outcome: iteratively refines and improves the accuracy of the robot's position estimate.
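
A simplified particle-filter sketch of this weight-update and resampling loop; it is a conceptual toy, not the actual AMCL implementation, and expected_scans (per-particle ranges predicted from the map) and sigma are hypothetical inputs:

    import numpy as np

    def update_weights(weights, scan, expected_scans, sigma=0.2):
        # Similarity: Gaussian likelihood of the measured LIDAR ranges
        # given the ranges each particle's pose predicts from the map.
        for i in range(len(weights)):
            error = scan - expected_scans[i]
            weights[i] *= np.exp(-np.sum(error ** 2) / (2 * sigma ** 2))
        return weights / np.sum(weights)  # normalize to a distribution

    def resample(particles, weights):
        # particles: (N, 3) array of candidate (x, y, theta) poses.
        # Draw particles in proportion to their weights: poses that match
        # the map better are copied more often, refining the estimate.
        idx = np.random.choice(len(particles), size=len(particles), p=weights)
        return particles[idx]
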
  • YOLOv5 (You Only Look Once) is a Convolutional Neural Network (CNN)-based model for object detection.
  • It generalizes well, easily learning and detecting new categories beyond existing ones.
  • The model's lightweight design makes it suitable for real-time processing.
  • Unlike earlier models such as R-CNN, which analyze an image region by region, YOLO looks at the entire image only once.
  • It is known for its real-time object detection capability, offering solid performance and quick processing (a usage sketch follows below).
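
A short usage sketch loading a pretrained YOLOv5 model through torch.hub; "scene.jpg" is a placeholder image path:

    import torch

    # Fetch the small pretrained YOLOv5s model from the Ultralytics hub.
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

    # Run detection on a single image in one forward pass.
    results = model("scene.jpg")
    results.print()               # summary of classes, confidences, boxes
    detections = results.xyxy[0]  # tensor rows: [x1, y1, x2, y2, conf, class]
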
  • A manipulator is a machine that mimics the function of a human arm; primarily used to move objects, it is a basic component of many robots.

  • There are several types:
  • Manipulators operated by humans.
  • Fixed types that automatically repeat the same actions.
  • Numerically controlled types performing complex actions based on computer commands.

  • These technologies are evolving rapidly, becoming more complex and sophisticated.

APPLICATIONS

By using SER:

  • Customer Emotion Diagnosis
  • Empathy Robot for Autistic Children
  • Connected Car
  • Emotional Counseling Robot
