State of the art audio enhancement in 5 minutes
Learn how to leverage an open-source AI audio enhancement app for your vlogs and other media that rivals the best APIs on the market. Try it for yourself!
by Abhi Upadhyay

In the world of multimedia, podcasting, and vlogging, user experience often hinges on audio quality. Superior sound can captivate audiences and encourage prolonged engagement with your content, whether it's a podcast, tutorial, or promo. Beyond user experience, audio quality also plays a crucial role in the accuracy of downstream AI tasks such as automated transcription with OpenAI's Whisper. In fact, an entire industry of startups, including Descript and Krisp AI, is working in this space.

With the help of a few open-source models, namely AudioSR and DeepFilterNet, we launched an audio enhancement app that removes background noise and enhances speech, making sure your audio quality is always up to par.

Here’s a quick example of the sound quality difference using open-source models:

In this post, we’ll go through a quick demonstration of how you might integrate this solution into your AI project and some background on the models we used.

How To Improve Audio Quality

You can try out the pre-built audio enhancement app with your own audio samples in just a few clicks here. Here are a few more samples from podcasts and YouTube videos:

Sample 1

Original
Enhanced

Sample 2

Original
Enhanced

Run via API or Python

You can also integrate the app into your current workflow through an API call or Python call with the following steps:

  1. Sign up for a Sieve account and find your API key here.

  2. Run audio enhancement via API (or see step 3 for Python; a requests-based version of this call is also sketched after the steps)

    curl -X POST https://mango.sievedata.com/v2/push \
    -H "X-API-Key: <your-api-key>" \
    -d '{
      "function": "sieve/audio_enhancement",
      "inputs": {
        "audio": {
          "url": "<your-audio-url>"
        }
      }
    }'
  3. Run via Python client

    • Install the Python client and log in
    pip install sievedata
    
    sieve login
    • Run this Python script with your own audio!
    import sieve
    
    audio_enhancer = sieve.function.get('sieve/audio_enhancement')
    
    # Specify "upsample", "noise" or "all" for the filtering type
    enhanced_audio = audio_enhancer.run(sieve.Audio(path="./speech.wav"), "upsample")
    
    # View results on Sieve dashboard or locally from this path
    print(enhanced_audio.path)

It’s as simple as that! Results will now be viewable on your Sieve dashboard or directly from your Python code.
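
If you would rather not use the Python client, the push call from step 2 can also be made from Python over plain HTTP. Here is a minimal sketch using the requests library; the endpoint, header, and payload are taken directly from the curl example above, while the response handling is an assumption, so inspect the returned JSON before relying on specific fields.

    import requests

    SIEVE_API_KEY = "<your-api-key>"
    AUDIO_URL = "<your-audio-url>"

    # Same endpoint, header, and JSON body as the curl example in step 2
    response = requests.post(
        "https://mango.sievedata.com/v2/push",
        headers={"X-API-Key": SIEVE_API_KEY},
        json={
            "function": "sieve/audio_enhancement",
            "inputs": {"audio": {"url": AUDIO_URL}},
        },
    )

    # Print the raw response; the exact fields returned depend on the API
    print(response.status_code, response.json())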

What Open Source Models Are Most Useful for Improving Audio Quality

In our view, two leading open-source AI models, AudioSR and DeepFilterNet, are best suited to the task of improving audio quality.

AudioSR

AudioSR is a generative model that uses a diffusion-based approach to estimate the high-frequency components of a low-resolution audio signal. It does this by training a latent diffusion model to learn the conditional generation of high-resolution spectrograms from low-resolution spectrograms. The model can handle a flexible input sampling rate between 4 kHz and 32 kHz, covering most real-world scenarios. AudioSR has achieved promising results on speech, music, and sound effects with different input sampling rate settings and has been verified to be a plug-and-play module for enhancing the audio quality of various audio generation models.
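
If you want to experiment with AudioSR directly rather than through the Sieve app, the project also publishes a pip package (audiosr). The snippet below is only a rough sketch of how that package is typically driven from Python; the build_model and super_resolution names, their arguments, and the 48 kHz output rate are assumptions based on the project's documentation, so check the AudioSR repository for the current interface.

    # Rough sketch; function names and arguments are assumptions, verify
    # against the AudioSR repository before use.
    import audiosr
    import soundfile as sf

    # Load the latent diffusion super-resolution model
    model = audiosr.build_model(model_name="basic", device="auto")

    # Estimate the missing high-frequency content of a low-resolution recording
    waveform = audiosr.super_resolution(model, "speech_lowres.wav")

    # Write the upsampled audio (the reference implementation targets 48 kHz)
    sf.write("speech_highres.wav", waveform.squeeze(), 48000)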

DeepFilterNet

DeepFilterNet is a deep learning-based speech enhancement framework that utilizes the harmonic structure of speech to efficiently remove unwanted noise from audio files. It operates in two stages: the first stage enhances the speech envelope in the ERB (equivalent rectangular bandwidth) domain, and the second stage uses deep filtering to enhance the periodic component. Several optimizations to the training procedure, data augmentation, and network structure result in state-of-the-art speech enhancement performance while reducing processing cost, making it possible to run on embedded devices in real time.
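
DeepFilterNet can also be run on its own via its Python package (installed with pip install deepfilternet). The sketch below follows the usage documented in the DeepFilterNet repository at the time of writing; treat the exact imports and signatures as assumptions and confirm them against the project's README.

    # Based on the usage documented in the DeepFilterNet README; verify before use.
    from df.enhance import enhance, init_df, load_audio, save_audio

    # Load the default pretrained model and its state (sampling rate, STFT setup)
    model, df_state, _ = init_df()

    # Load the noisy recording at the sampling rate the model expects
    audio, _ = load_audio("noisy_speech.wav", sr=df_state.sr())

    # Remove background noise and write the enhanced result
    enhanced = enhance(model, df_state, audio)
    save_audio("enhanced_speech.wav", enhanced, df_state.sr())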

Other Models

While other models like Demucs exist for narrower use cases such as music source separation, AudioSR and DeepFilterNet ultimately stand out for their effectiveness at transforming lower-quality audio and their efficient processing, respectively.

Open Source Models Versus Commercial Tools for Audio Quality Improvement

Let's take a look at how the open-source approach stacks up against one of the prominent commercial tools on the market, the Dolby Enhance API:

Original

Dolby

AudioSR + DeepFilterNet

The results are very promising from a purely open-source solution!

What next?

Audio enhancement can be used upstream or downstream of most other AI audio functionality. For example, a feature like speech editing (also featured on Sieve) can use audio enhancement to improve the quality of its generated voices.
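
As a sketch of what that pipelining can look like with the client shown earlier, the snippet below runs the enhancement function on audio produced by some upstream step. Only calls already demonstrated above are used; the upstream output is stood in for by a local file, since the exact interface of the speech editing function is not covered here.

    import sieve

    enhancer = sieve.function.get("sieve/audio_enhancement")

    # `generated` stands in for the audio output of an upstream Sieve function,
    # such as a speech editing step; here it is simply a local file.
    generated = sieve.Audio(path="./generated_speech.wav")

    # "all" applies both filtering types (denoising and upsampling)
    polished = enhancer.run(generated, "all")
    print(polished.path)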

Sieve's cloud platform makes combining functionality in this way easy. To try it, create an account!