
LSTMSE-Net: Audio-Visual Speech Enhancement

Developed LSTMSE-Net, an audio-visual speech enhancement model to isolate and enhance speaker audio in noisy environments using temporal feature extraction with RNN and LSTM units.

ML, Deep Learning, Audio Processing, Research, Publications

Tech Stack

Python, PyTorch, Deep Learning, LSTM, RNN, Audio Processing, Video Processing

Results

Achieved a 3× reduction in inference time compared to the baseline model, alongside improvements in speech quality. Paper accepted at InterspeechW 2024.

Key Ideas

  • Engineered a temporal feature extraction pipeline using RNN and LSTM units to jointly model audio-visual dependencies
  • Designed architecture to isolate and enhance speaker audio in noisy environments
  • Currently developing an advanced version with a ConvNeXtV2-based video pipeline and an audio decoder inspired by deep state-space modelling

Overview

LSTMSE-Net is a novel audio-visual speech enhancement model developed to isolate and enhance speaker audio in noisy environments. The project focuses on leveraging both audio and visual cues to improve speech quality through deep learning techniques.

Methodology

The model uses a temporal feature extraction pipeline that employs RNN and LSTM units to jointly model audio-visual dependencies. This approach allows the model to leverage visual information (lip movements, facial expressions) to better enhance the corresponding audio signal.
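To make the joint modelling concrete, here is a minimal PyTorch sketch of the general idea: per-frame audio and visual features are concatenated, passed through an LSTM to capture temporal dependencies, and projected to a bounded mask applied to the audio features. All class names, dimensions, and the mask-based output head are illustrative assumptions, not the actual LSTMSE-Net architecture.

```python
import torch
import torch.nn as nn

class AVFusionLSTM(nn.Module):
    """Illustrative sketch: fuse audio and visual features with an LSTM.

    Hypothetical dimensions; not the published LSTMSE-Net design.
    """
    def __init__(self, audio_dim=64, visual_dim=32, hidden_dim=128):
        super().__init__()
        # One LSTM over the concatenated audio-visual feature sequence
        self.lstm = nn.LSTM(audio_dim + visual_dim, hidden_dim, batch_first=True)
        # Project hidden states to a per-bin enhancement mask
        self.mask_head = nn.Linear(hidden_dim, audio_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim)
        # visual_feats: (batch, time, visual_dim), frame-aligned with audio
        fused = torch.cat([audio_feats, visual_feats], dim=-1)
        h, _ = self.lstm(fused)
        mask = torch.sigmoid(self.mask_head(h))  # values in [0, 1]
        return audio_feats * mask                # masked (enhanced) audio features

model = AVFusionLSTM()
a = torch.randn(2, 50, 64)  # 2 clips, 50 frames of audio features
v = torch.randn(2, 50, 32)  # matching visual features (e.g. lip-region embeddings)
out = model(a, v)
print(tuple(out.shape))  # (2, 50, 64)
```

The masking formulation is one common choice for enhancement models: it keeps the output tied to the noisy input while letting the visual stream modulate which time-frequency content is kept.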

Key Innovations

  • Temporal Modeling: RNN and LSTM units capture temporal dependencies in both audio and video streams
  • Joint Audio-Visual Processing: Audio and visual features are processed together, so visual cues (e.g., lip movements) guide enhancement of the audio stream
  • Efficient Architecture: Achieved 3× reduction in inference time compared to baseline while improving quality

Results

The model achieved significant improvements:

  • 3× reduction in inference time compared to baseline
  • Improved speech quality metrics
  • Paper accepted at InterspeechW 2024

Future Work

Currently developing an advanced version using:

  • ConvNeXtV2-based video pipeline for better visual feature extraction
  • Audio decoder inspired by deep state space modeling
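For intuition on the state-space direction, the core building block behind deep state-space models is a linear recurrence x[t+1] = A·x[t] + B·u[t], y[t] = C·x[t], often with diagonal A for efficiency. The NumPy sketch below shows that recurrence only; all values and dimensions are illustrative and unrelated to the planned decoder.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Run a discrete diagonal state-space recurrence over input sequence u.

    x[t+1] = A * x[t] + B * u[t]   (elementwise: A is diagonal)
    y[t]   = sum(C * x[t+1])       (project state to a scalar output)
    """
    state = np.zeros_like(B)
    ys = []
    for u_t in u:
        state = A * state + B * u_t
        ys.append(np.sum(C * state))
    return np.array(ys)

# Two hidden modes with different decay rates (illustrative values)
A = np.array([0.9, 0.5])
B = np.array([1.0, 1.0])
C = np.array([0.5, 0.5])
y = ssm_scan(np.ones(4), A, B, C)
print(y.round(3))  # [1.    1.7   2.23  2.657]
```

Because the recurrence is linear, such layers can also be evaluated as a convolution over the whole sequence, which is what makes deep state-space decoders attractive for fast inference.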