Many great apps rely on audio processing; it sits at the core of their functionality. What would the mobile world be like without Shazam and SoundHound?
Many mobile apps we work on at Yalantis involve audio processing, and we’ve decided to share some more details about the techniques we used for the Horizon project, our own open-source library for audio visualization. We’re always happy to share Yalantis’s expertise!
How does sound work?
First, let’s brush up on sound theory. As we know, sound is a vibration of the air, and these vibrations can be described with sinusoidal waveforms. So if we want to visualize this sound, we need to draw waves that represent it mathematically.
Every sound we hear consists of a combination of pure tones, and a pure tone has a sinusoidal waveform. A sine wave is characterized by its frequency and amplitude. Frequency is the number of cycles per second, measured in Hertz (Hz). Amplitude is the peak deviation of the wave on the Y axis during each cycle, and it relates to the loudness of the sound. The human ear can hear pure tones from about 20 Hz to 20,000 Hz. Human perception of loudness depends on frequency as well: over much of the audible range, of two tones with the same amplitude, the one with the higher frequency seems louder.
A 440 Hz pure tone looks like this:
Pure tones rarely exist in real life, though, apart from the nearly pure tones of some musical instruments. All sounds we hear are combinations of various pure tones. Here’s an example of a natural sound:
Music typically involves multiple instruments, and each instrument produces its own combination of sine waves with different amplitudes. So a song, in this sense, is a sum of many sine waves. We can visualize this music with a spectrogram.
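To make this idea concrete, here is a small Python sketch (illustrative only, not part of Horizon) that sums two pure tones into one composite waveform. The 440 Hz and 880 Hz frequencies are arbitrary choices:

```python
import math

SAMPLE_RATE = 44100  # CD-quality sample rate, in samples per second

def pure_tone(freq_hz, amplitude, t):
    """Value at time t of a sine wave with the given frequency and amplitude."""
    return amplitude * math.sin(2 * math.pi * freq_hz * t)

def composite(t):
    """A composite sound: a 440 Hz tone plus a quieter 880 Hz overtone."""
    return pure_tone(440, 1.0, t) + pure_tone(880, 0.5, t)

# Sample the first 10 milliseconds of the composite wave
samples = [composite(n / SAMPLE_RATE) for n in range(441)]
```

Plotting `samples` against time would show a wave that is no longer a clean sine curve, which is exactly what natural sounds look like.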
Below we can see a spectrogram and an oscillogram of someone saying “Rice University.” The X axis shows time, and the Y axis shows the frequencies of all tones present in the sound. Color represents the amplitude of the sound: red indicates higher amplitude, blue lower.
Now we can see sound. But how can we calculate all these frequencies and amplitudes over time?
How can you calculate sound?
First, we need to understand how analogue sound is converted into digital form. This process is called digitization. Since analogue sound is a continuous wave, we need to break this wave into small chunks to store it in a digital format. These small chunks are called samples, and each one represents the amplitude of the sound at a certain moment in time, a tiny fraction of a second. The standard sample rate for digital audio is 44,100 samples per second. This value is not random. According to the Nyquist–Shannon sampling theorem, you need to take at least twice as many samples per second as the highest frequency present in the audio to reconstruct the sound accurately, without aliasing. As we remember, the highest frequency that humans can hear is about 20 kHz. Therefore 40,000 is approximately the minimum number of samples per second you need to achieve clear and smooth audio.
Here’s a simple example of how sampling looks when plotted:
The red line shows the sine wave of the sound, and the blue dots are sample values. The more samples per second you store, the smoother the line becomes. Since we’re talking about digital files, the choice of how many samples to take is all about balancing file size and quality.
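A minimal sketch of sampling, assuming a unit-amplitude sine wave; the function and its parameters are illustrative, not from any library:

```python
import math

def sample_sine(freq_hz, sample_rate, n_samples):
    """Digitize a sine wave: take n_samples evenly spaced readings of it."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

# At 44,100 samples per second, a 440 Hz tone gets about 100 samples
# per cycle, so the stored points trace the curve very smoothly.
fine = sample_sine(440, 44100, 1000)

# At only 500 samples per second there is barely one sample per cycle,
# which is below the Nyquist rate: the waveform cannot be reconstructed.
coarse = sample_sine(440, 500, 1000)
```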
But this is only one side of the coin. The other is the volume of the sound. We need to store the loudness of every part of the audio, which means representing many gradations between the quietest and the loudest points in the song. If we had only 4 states of loudness (silent, soft, loud, and loudest), our audio wouldn’t sound natural. Mapping amplitudes onto a fixed set of values like this is called quantization. The standard bit depth for digital audio is 16 bits, which provides 65,536 levels. This is enough to give the human ear the perception of a natural sound wave.
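As a rough illustration of 16-bit quantization (a hypothetical helper, not Horizon code), an amplitude in the range −1.0 to 1.0 can be mapped onto signed 16-bit integer levels:

```python
def quantize_16bit(amplitude):
    """Map an amplitude in [-1.0, 1.0] onto a signed 16-bit integer level.

    16-bit quantization gives 65,536 possible levels (-32,768 to 32,767);
    this simple symmetric mapping uses all but one of them.
    """
    clamped = max(-1.0, min(1.0, amplitude))
    return round(clamped * 32767)

silent = quantize_16bit(0.0)    # maps to level 0
loudest = quantize_16bit(1.0)   # maps to level 32767
```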
How can you calculate the spectrogram for digital sound?
To get frequencies from digital audio, we need to transform a function of time into a function of frequency. In other words, given some finite number of samples, we need to determine the intensity of every frequency among those samples. This can be achieved with the Discrete Fourier Transform (DFT). The result is called a spectrum: the spectrum shows us the amplitude of each frequency, and a spectrogram shows us how the spectrum changes over time.
We have a formula for calculating the amplitude of each band of frequencies, the Discrete Fourier Transform:

X(n) = Σ x[k] · e^(−i·2πkn/N), summed over k from 0 to N − 1

where:

- N is the number of samples
- X(n) represents the nth band of frequencies
- x[k] represents the kth sample.
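The formula above can be implemented directly as a naive DFT. This is an illustrative sketch, not the optimized FFT a real app would use:

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) Discrete Fourier Transform, fine for illustration.

    Returns one complex value X(n) per frequency band; its magnitude
    is the amplitude of that band.
    """
    N = len(samples)
    return [sum(samples[k] * cmath.exp(-2j * math.pi * k * n / N)
                for k in range(N))
            for n in range(N)]

# A pure tone that completes exactly one cycle over 8 samples
tone = [math.sin(2 * math.pi * k / 8) for k in range(8)]
spectrum = [abs(x) for x in dft(tone)]
# The energy concentrates in band 1 (and its mirror image, band N - 1)
```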
The resulting value may be a bit confusing, as we receive a result for a band of frequencies rather than for one specific frequency. This happens because we do all calculations on a discrete spectrum: the frequency range is divided into a number of bins equal to the number of samples in the window. Given a sample rate of 44.1 kHz and a window of 4,096 samples, for example, each frequency bin will be about 10.77 Hz wide. This means that we cannot differentiate between two frequencies that are closer together than 10.77 Hz.
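The bin width follows directly from the sample rate and the window size:

```python
sample_rate = 44100   # samples per second
window_size = 4096    # samples fed into one DFT

bin_width_hz = sample_rate / window_size  # width of each frequency bin
# 44,100 / 4,096 is roughly 10.77 Hz per bin
```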
After applying this formula to every window of samples in our audio, we can calculate an array of spectrums covering the whole frequency range. Now we can compare the strength of the sound (the amplitude of each bin of frequencies in a particular time frame) across groups of frequencies.
According to our Horizon design, we visualize five different waves. So we need to separate our array of spectrums into five equal chunks. For each chunk, we calculate the maximum amplitude. Then we convert these values to a standardized format for comparison: we generate an array of five values from 0 to 1, where 1 is the maximum and 0 is the minimum strength among these bins of frequencies. This array is what we provide to the Bezier renderer.
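A minimal sketch of this banding and normalization step (not the actual Horizon implementation):

```python
def five_wave_strengths(spectrum):
    """Split a spectrum into five equal bands, take each band's peak
    amplitude, and rescale the peaks to [0, 1], where 1 is the strongest
    band and 0 the weakest."""
    n = len(spectrum) // 5
    peaks = [max(spectrum[i * n:(i + 1) * n]) for i in range(5)]
    lo, hi = min(peaks), max(peaks)
    if hi == lo:                      # silence or a perfectly flat spectrum
        return [0.0] * 5
    return [(p - lo) / (hi - lo) for p in peaks]

# Toy spectrum of 10 rising amplitudes; band peaks are 1, 3, 5, 7, 9
strengths = five_wave_strengths(list(range(10)))
```

Each value in `strengths` drives the height of one of the five rendered waves.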
This is one way you can process audio in your app.
Check out how we created the visualization for Horizon, our open-source library for sound visualization. And also see how we drew a cubic Bezier curve with OpenGL ES for Horizon.