For a long time, I have wanted to consolidate basic methods of sound representation used in machine learning tasks. Understanding the properties and representation methods of sound, especially speech, has often helped me build and debug speech synthesis algorithms.

Sound waves

What is sound?

In an elastic medium, particles tend to return to equilibrium. If the pressure at a specific point in space changes, it creates a sound wave. This principle is how acoustic speakers and microphones work: speakers increase or decrease pressure, while microphones detect these changes. For example, a recorded signal might look like this:

Sound Wave

Discretization

In a computer, we store signals digitally, which means we need to convert the analog signal into a digital one through a process called pulse code modulation (PCM). This involves discretizing the signal in both time and amplitude.

Time and amplitude discretization

The number of samples stored per second of signal is called the sample rate, measured in Hertz (Hz). The number of bits used to represent one sample is called bits per sample. The number of bits of information needed to encode one second of signal is called the bitrate. For PCM, it is calculated as num_channels * sample_rate * bits_per_sample, though bitrate becomes more relevant when a compression codec is used.
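
As a quick sanity check of that formula, a mono 16 kHz, 16-bit PCM stream works out to 256 kbit/s:

num_channels = 1        # mono
sample_rate = 16000     # samples per second (Hz)
bits_per_sample = 16    # int16 PCM

bitrate = num_channels * sample_rate * bits_per_sample
print(bitrate)          # 256000 bits per second, i.e. 256 kbit/s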

A signal is usually represented by an array of integer values of type int8, int16, or int32. However, it is sometimes preferable to represent it as a floating-point array. For example, when an ML model requires raw audio waveform as an input or as a target, the values typically range from -1 to 1.

Note: Be cautious of going out of range when converting back to integer types. Some filters or ML models might produce values that exceed the expected floating point range, which can result in clipping or integer overflow.
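
A minimal sketch of a safe round trip between int16 and float (the scale factor 32767 and the explicit clip before converting back are the important parts; dividing by 32767 is one common convention, not the only one):

import numpy as np

int16_signal = np.array([0, 16384, 32767, -32768], dtype=np.int16)

# int16 -> float32, roughly in [-1, 1]
float_signal = int16_signal.astype(np.float32) / 32767

# ... filters or an ML model might push values out of range here ...

# float32 -> int16: clip first to avoid integer overflow / wrap-around
float_signal = np.clip(float_signal, -1.0, 1.0)
int16_back = (float_signal * 32767).astype(np.int16)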

Working with waveforms in Python

Python offers several libraries for working with audio, each with its own strengths. For example, librosa is popular for its extensive set of digital signal processing (DSP) algorithms commonly used in machine learning.

When reading audio files, you can specify the number of channels you need, the desired sample rate, and the data type. Here is an example:

import librosa
import numpy as np

signal, sample_rate = librosa.load(
    "example.wav",
    sr=16000,
    dtype=np.float32, 
    mono=True,
)

However, librosa does not provide audio writing functionality, so you can use soundfile for reading and/or writing, and librosa for processing.

import soundfile as sf

sf.write("example-16kHz.wav", signal, sample_rate)

If the library does not support the given format, you can use the ffmpeg tool to convert it to WAV.

ffmpeg -i <any file containing audio track> audio.wav

Volume and loudness

Volume is a physical measurement of sound pressure, while loudness is a perceptual measure that varies depending on the listener. For instance, two sounds with the same volume can be perceived as having different loudness due to their frequency content.

As you may have heard, volume is measured in decibels. In audio processing, the unit is often dBFS (decibels relative to full scale). For int16 signals, the full scale is 32767 (since int16 ranges from -32768 to 32767). For floating point signals in the range [-1, 1], the full scale is 1. For a signal \( x_{i=1,\dots,N} \), the formula to calculate the level in dBFS is as follows:

\[ dBFS = 20 \log_{10} \left(\frac{RMS}{\text{Full Scale}}\right), \quad RMS = \sqrt{\frac{1}{N} \sum_{i=1}^N x_i^2} \]

To obtain peak dBFS just replace \(RMS\) with \(\max_i |x_i|\).

Loudness – perceived volume

Human perception of sound correlates with a logarithmic scale, but modeling loudness accurately requires more complex, empirically derived formulas. We also have to account for the fact that perceived loudness varies with frequency even at the same energy level. Equal-loudness contours obtained from listening experiments show how:

Equal-loudness contours

There are a bunch of standards for measuring sound loudness, but the most popular is Loudness, K-weighted, relative to full scale, measured in units called LUFS. A 1 LUFS increase or decrease corresponds to a 1 dB increase or decrease in perceived loudness.

Have you ever wondered why you don’t need to adjust the volume while watching YouTube videos, no matter what content you watch? This is because YouTube uses -14 LUFS as a reference level.

The human ear perceives not only loudness but also frequency logarithmically, but we will discuss that later.

Python demo

Estimating volume is straightforward:

# Assumes a floating-point signal in [-1, 1], i.e. full scale = 1
peak_dBFS = 20 * np.log10(np.max(np.abs(signal)))
dBFS = 20 * np.log10(np.sqrt(np.mean(np.square(signal))))

To change the volume by d dB, we need to convert the gain to the linear scale:

new_signal = signal * 10 ** (d / 20)

Estimating loudness is less straightforward, but pyloudnorm provides an implementation of the ITU-R BS.1770-4 standard.

import pyloudnorm

meter = pyloudnorm.Meter(rate=sample_rate)
loudness = meter.integrated_loudness(signal) # LUFS

Normalization and clipping

Be aware of clipping; do not allow dBFS to exceed the 0 limit, as this causes artifacts. The maximum you can set is peak_dBFS = 0, which is called peak normalization.

Peak normalization is sometimes used in machine learning pipelines, but it is not ideal because peaks and loudness are only weakly correlated. A better practice is to normalize the loudness to a specific target (e.g., -16 LUFS). If you need to increase the loudness without exceeding peak limits, consider using compression, limiting, or equalization.
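
A minimal sketch of both approaches, reusing signal and the pyloudnorm meter from above (the -16 LUFS target is just an illustrative choice):

# Peak normalization: scale so the loudest sample reaches full scale (0 dBFS)
peak_normalized = signal / np.max(np.abs(signal))

# Loudness normalization: measure LUFS, then apply the gain (in dB)
# needed to reach the target loudness
target_lufs = -16.0
loudness = meter.integrated_loudness(signal)
gain_db = target_lufs - loudness
loudness_normalized = signal * 10 ** (gain_db / 20)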

Increasing the sample rate or converting a digital signal to an analog signal may cause the peak value to be exceeded, even if the discretized version of the signal has peak_dBFS < 0. This is caused by a phenomenon known as inter-sample peaks. In the broadcasting industry, true peaks are often estimated and controlled.

Time vs frequency domains

Raw waveforms are challenging to work with and often redundant for ML algorithms. For instance, speech recognition algorithms perform equally well on both 16kHz and 24kHz audio samples. Additionally, managing hundreds of thousands of timestamps in a short audio track is quite challenging.

Of course, there are models that work with raw waveforms, such as vocoders, whose main purpose is to generate fine-grained waveforms from some coarse representation.

However, almost any transformation you might apply to a waveform can result in the loss of crucial information. Even naive downsampling (keeping only every n-th sample) can cause an aliasing effect. Reducing the resolution of an image without applying anti-aliasing techniques clearly demonstrates this effect:

Aliasing

Time and frequency domains

A waveform is a time-domain representation of audio, where we have values at each timestamp. However, if we represent a signal as a composition of sine and cosine functions with different amplitudes, the signal can be represented by the amplitude values for given frequencies. The Fourier theorem proves the existence of such decomposition.

Time-frequency domain

The transformation from the time domain to the frequency domain is called the Fourier transform. Conversely, the inverse Fourier transform converts data back from the frequency domain to the time domain. However, there is a caveat: this process involves complex numbers.

Nyquist frequency

Before we go further into the discrete case, it is worth mentioning the Nyquist–Shannon (or Kotelnikov) sampling theorem, which states:

If a signal contains no frequencies higher than \( B \) hertz, then it can be completely determined by sampling it at a rate of \( 2B \) hertz.

This means the highest frequency that can be confidently obtained (i.e., without aliasing) from a signal sampled at a rate \( f_s \) is \( \frac{f_s}{2} \) . This frequency is known as the Nyquist frequency.
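
A minimal sketch of what happens when this condition is violated: a 5 kHz sine sampled at 8 kHz (Nyquist frequency 4 kHz) shows up as a 3 kHz component. The frequencies and duration here are arbitrary illustrative choices:

import numpy as np

fs = 8000                                # Nyquist frequency is 4000 Hz
t = np.arange(0, 1.0, 1 / fs)
aliased = np.sin(2 * np.pi * 5000 * t)   # a 5 kHz tone, above Nyquist

spectrum = np.abs(np.fft.rfft(aliased))
freqs = np.fft.rfftfreq(len(aliased), d=1 / fs)
print(freqs[np.argmax(spectrum)])        # ~3000 Hz: the tone aliases to 8000 - 5000 Hz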

DFT

The technique also works for sequences of values, such as audio signals. This is known as the Discrete Fourier Transform (DFT). In this case, not only is the audio signal time-discrete, but the frequency scale is discretized as well.

DFT

The result of the transformation is called the spectrum. Sometimes it’s referred to as “linear”, in contrast to the log-scaled version that we will cover later. Typically, phase information is omitted and only the magnitudes (absolute values) of the complex results are kept.

As you might have noticed, the plotted spectrum is symmetrical around its center. This symmetry arises from the conjugate symmetry of the DFT result, which occurs due to the real-valued input.

DFT bin indices can be converted to frequencies with the formula: \( f = \frac{k F_{sampling}}{N} \), where \(k = 0, \dots, N - 1\) is the bin index, \(N\) is the number of points in the DFT, and \(F_{sampling}\) is the sampling frequency (sample rate).

DFT bins
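
A quick way to check this formula against NumPy’s helper (the 8-point DFT is an arbitrary toy size):

import numpy as np

N, fs = 8, 16000
k = np.arange(N)
print(k * fs / N)                    # 0, 2000, 4000, ..., 14000
print(np.fft.fftfreq(N, d=1 / fs))   # same bins, with the upper half reported as negative frequencies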

FFT

The straightforward implementation of the DFT has a time complexity of \( O(n^2) \), which is quite slow when working with thousands of samples. The Fast Fourier Transform (FFT) algorithm implements the DFT with a time complexity of \( O(n \log n) \). The most common FFT algorithm is the Cooley-Tukey algorithm, which follows a divide-and-conquer approach.

The DFT can be rewritten in the form of vector-matrix multiplication, where the matrix is constant. To understand the intuition behind the use of a divide-and-conquer approach, we can visualize the magnitude values of this matrix and slice it to observe patterns in its structure.

Visualization of DFT matrix

Have I given you some intuition? I don’t know, I just like these hypno-plots. :)
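
If you want to play with these plots yourself, here is a minimal sketch of the matrix form (N = 256 is an arbitrary size; the assert just confirms that the explicit matrix product matches NumPy’s FFT):

import numpy as np

N = 256
n = np.arange(N)
# DFT matrix: W[k, m] = exp(-2j * pi * k * m / N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)

x = np.random.randn(N)
assert np.allclose(W @ x, np.fft.fft(x))  # the matrix form matches the FFT

# One way to look at the structure mentioned above: plt.imshow(W.real)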

FFT in Python

import numpy as np

magnitude_db = 20 * np.log10(np.abs(np.fft.rfft(signal)))     # spectrum magnitudes in dB
bin_to_Hz = np.fft.rfftfreq(len(signal), d=1 / sample_rate)   # frequency (Hz) of each bin

Short-time Fourier transform

The DFT provides us with a new representation of an audio signal, but it doesn’t significantly simplify it. From \(n\) real numbers, \( \lfloor \frac{n}{2} \rfloor + 1 \) unique complex numbers are produced. It would be beneficial to somehow shrink the signal along the time axis, as longer sequences make it harder to model the relationships between timestamps.

STFT

Short-time Fourier transform (STFT) algorithms solve the problem as follows:

  1. Split a signal into overlapping segments.
  2. For each segment, apply a windowing function (e.g., Hann).
  3. Apply the FFT algorithm to each segment.
  4. Get the magnitudes and perform a log transformation.
  5. Stack everything together to form a matrix.

A typical window size is 1024 or 2048 samples, and the typical hop size is \( \frac{1}{4} \) of the window size. The windowing function is applied to minimize spectral leakage, which we will discuss further below. The log transformation makes the distribution more normal (Gaussian); remember, human perception is logarithmic too.

We discard phase information because it’s uncommon for ML algorithms to work with complex numbers. In speech processing, we primarily work with magnitudes. If you need to reconstruct the phase, you can use the Griffin-Lim algorithm.

Linear and Log Spectrograms

Windowing functions

Let’s consider two waveforms representing sine functions with the same frequency but different sampling conditions. The second waveform is sampled for slightly longer, so it contains a few additional samples. After applying the DFT, we would expect to see a dominant frequency component corresponding to the sine wave’s frequency. This is indeed what we get for one of the signals, but the spectrum of the second waveform looks different:

Leakage example

The observed phenomenon is called “spectral leakage.” It occurs when a signal is not perfectly periodic within the observation window, leading to discontinuities at the window boundaries. These discontinuities cause energy from the sine wave’s main frequency to spread to adjacent frequencies in the frequency domain representation.

We can mitigate this effect somewhat by tapering the signal smoothly to zero at the boundaries using a window function (e.g., Hann).

Leakage mitigated
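
A minimal sketch of the comparison (the tone frequency, sample rate, and window length are arbitrary illustrative choices):

import numpy as np

fs = 1000
t = np.arange(0, 0.513, 1 / fs)       # not a whole number of 50 Hz periods -> leakage
x = np.sin(2 * np.pi * 50 * t)

raw_spectrum = np.abs(np.fft.rfft(x))
windowed_spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
# away from 50 Hz, the windowed spectrum falls off much faster than the raw one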

STFT in Python

import librosa
import numpy as np

min_level_db = -60
window_size = 1024
window_step = window_size // 4

complex_spec = librosa.stft(
    signal,
    n_fft=window_size,
    hop_length=window_step,
    win_length=window_size)
linear_spec = np.abs(complex_spec)              # magnitude (linear) spectrogram

# Clamp small values so the log below never sees zeros;
# min_level equals 10 ** (min_level_db / 20)
min_level = np.exp(min_level_db / 20 * np.log(10))
linear_spec = np.maximum(min_level, linear_spec)

log_spec = 20 * np.log10(linear_spec)           # log (dB) spectrogram
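
As mentioned above, if only the magnitude spectrogram is available, the Griffin-Lim algorithm can recover an approximate phase and waveform. librosa ships an implementation; a minimal sketch reusing the variables above:

reconstructed = librosa.griffinlim(
    linear_spec,
    n_iter=32,
    hop_length=window_step,
    win_length=window_size)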

Mel-scale

Humans perceive frequency logarithmically, meaning that as the frequency increases, it becomes more difficult to perceive changes in it.

One empirically evaluated frequency-to-perception transformation is given by the formula \( m = 2595 \log_{10} (1 + \frac{f}{700}) \) , known as the mel scale. There is also the Bark scale, among others. Changes in these perception scales are perceived equally, regardless of their value.

Mel scale
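
A minimal sketch of the formula and its inverse (just the conversions, nothing library-specific):

import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

print(hz_to_mel(1000))             # ~1000 mel by construction of the scale
print(mel_to_hz(hz_to_mel(440)))   # ~440 Hz, round trip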

Filter bank and mel-transform

The Mel scale allows us to obtain a more compact representation compared to the STFT. The Mel transform distributes information more uniformly across the frequency range than the STFT spectrogram does.

To perform the Mel transformation, a Mel filter bank is constructed: a set of overlapping triangular filters whose centers are spaced uniformly on the mel scale.

Mel Filter Bank

By multiplying the linear spectrogram by the Mel basis (projecting it onto the Mel basis), we obtain the Mel-frequency spectrogram.

Mel spectrogram

The Mel scale is crucial in speech processing because it aligns more closely with human perception of sound, making it useful for creating more effective machine learning models.

Mel-transform in Python

mel_num_bins = 80
min_frequency = 0
max_frequency = 8000   # must not exceed the Nyquist frequency (sample_rate / 2)

mel_basis = librosa.filters.mel(
    sr=sample_rate,
    n_fft=window_size,
    n_mels=mel_num_bins,
    fmin=min_frequency,
    fmax=max_frequency)

mel_spec = np.dot(mel_basis, linear_spec)
mel_spec = np.maximum(min_level, mel_spec)   # clamp as before to keep the log finite
log_mel_spec = 20 * np.log10(mel_spec)

How speech is produced

To better understand future concepts, we should take a look at how speech is produced. Without diving too much into the anatomical aspects of the vocal tract, the process is as follows: the vocal cords oscillate due to pressure from the lungs. This sound travels to the mouth and nasal cavity, where it reflects off surfaces, producing resonances that form vowels and voiced consonants. Additionally, outgoing air encounters obstacles such as the tongue, lips, and teeth, producing fricatives.

By changing the shape of the “resonance chamber,” different vowels and consonants are produced.

Phonetic notation

Assigning a symbol to each phoneme produces phonetic notation. There are many types, with the most popular for English being the International Phonetic Alphabet (IPA) and the American Phonetic Alphabet (APA).

Incorporating phonetic notation into your speech synthesis algorithm can help you precisely control how your Text-to-Speech (TTS) system pronounces rare words. Without it, you might have to resort to awkward transliterations to achieve the desired pronunciation.

Phonetic representations are much rarer in speech recognition. They are almost never used for general tasks like video or phone call dialogue transcription because the times when phonetics were beneficial for speech recognition have passed. However, some tasks might still rely on phonetics, such as precise speech timing alignment.

Speech timings using Montreal Forced Aligner

Some speech synthesis algorithms and dataset pipelines utilize the correspondence between phonemes and their timings, known as phoneme alignment.

The Montreal Forced Aligner provides a convenient Python interface for Kaldi, a powerful toolkit for speech recognition. MFA comes with pretrained acoustic models and phonetic dictionaries for popular languages.

The installation of MFA might be tricky, but you can find resources on the internet. I use the pretrained English model and dictionary with the ARPAbet phoneset.

mfa models download acoustic english_us_arpa
mfa models download dictionary english_us_arpa
ffmpeg -i audio.wav -ac 1 -ar 16000 corpus/recording1.wav
echo "this is a guide to speech representations for m l engineers" > corpus/recording1.txt
mfa align -t tmp corpus english_us_arpa english_us_arpa output

In TextGrid format, you will obtain the following alignment:

Alignment

Fundamental frequency \(F_0\) and harmonics

One interesting feature that is still extensively used (e.g., for voice conversion) is the fundamental frequency, also referred to as pitch or \(F_0\). The fundamental frequency is determined by the frequency at which the vocal cords oscillate.

In Indo-European languages, variations in pitch while speaking relate to intonation. Intonation patterns can convey different meanings, emotions, and grammatical structures (e.g., statements, questions, exclamations). In Chinese, \(F_0\) forms tones, which are integral to the phonetic notation.

It’s worth mentioning that emotion goes beyond just pitch; a lot of emotional information is encoded in how the phonemes are pronounced. Extracting this information is non-trivial.

Part of spec

On a spectrogram, you may see many waves stacked on top of each other, moving synchronously – these are harmonics. They are multiples of the fundamental frequency \(F_0\), such as \(2F_0, 3F_0, 4F_0\) , and so on. The vocal cords act like a string that vibrates at frequencies that are multiples of the main one.

How the fundamental frequency is extracted

F0 and peak

Unfortunately, \(F_0\) is not necessarily the highest energy peak in the spectrum, because resonances in the vocal tract may amplify other harmonics.

There are dozens of methods for estimating the fundamental frequency. The simplest involves autocorrelation: multiplying the signal by shifted copies of itself and summing produces an autocorrelation function, and the lag of its strongest peak (excluding the trivial peak at zero lag) corresponds to the period of the fundamental frequency.
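
A minimal sketch of this idea on a synthetic frame (the 50-400 Hz search range is an assumption that roughly covers human pitch):

import numpy as np

fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)  # F0 = 220 Hz

# Autocorrelation for non-negative lags
autocorr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

# Pick the strongest peak within a plausible pitch range (50-400 Hz)
min_lag, max_lag = fs // 400, fs // 50
best_lag = min_lag + np.argmax(autocorr[min_lag:max_lag])
print(fs / best_lag)   # ~220 Hz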

Extracting F0 in Python

F0

There are many implementations of \(F_0\) detection algorithms. One such implementation (an autocorrelation with some heuristics) is provided by the praat-parselmouth package. Additionally, the librosa library offers an implementation of the Yin algorithm.

# pip install praat-parselmouth
import numpy as np
import parselmouth

snd = parselmouth.Sound(signal, sample_rate)
pitch = snd.to_pitch(pitch_floor=50, pitch_ceiling=500)
pitch_values = pitch.selected_array['frequency']
pitch_values[pitch_values == 0] = np.nan   # zero means unvoiced; mask it out
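
Alternatively, a sketch using librosa’s YIN implementation (the 50-500 Hz range mirrors the parselmouth settings above):

f0 = librosa.yin(
    signal,
    fmin=50,
    fmax=500,
    sr=sample_rate)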

Speech formants

When examining a spectrogram closely, you might notice additional frequency peaks beyond those produced by the fundamental frequency \( F_0 \) and its harmonics. These peaks are due to resonances in the vocal tract and are referred to as formants. Typically, the first three formants, denoted as \(F_1\), \(F_2\), and \(F_3\), are of particular interest.

Formants

Although modern machine learning algorithms often do not directly utilize formants, early speech recognition algorithms relied on them extensively. Each vowel phoneme corresponds to specific values and relationships of \(F_1\), \(F_2\), and \(F_3\), which were critical in distinguishing between different vowel sounds in those early systems.