Executive Summary

My project was to create a program that reduces the storage space required for audio data while having little perceivable impact on the output sound, so that portable devices supporting the format would be able to store more music in a given amount of memory than devices limited to compression schemes such as MP3 and WMA. The program reads input from a file, splits it into units called frames, calculates the frequency spectrum of the sound, eliminates frequencies the program deems inaudible, scales the remaining frequencies to take up half the space they otherwise would, and sends them to gzip, a lossless compression scheme, for further compression.
My project was to create a program to efficiently store audio data (such as music) by eliminating imperceptible qualities in the original recording.
The problem is choosing which data can be eliminated so as to produce an output that is perceived as high quality in spite of the reduced actual fidelity of the signal.
I chose this project because, owning a portable MP3/WMA player, I was interested in compressing songs in a way that would fit a couple of hours of music in its 32MB of Flash memory. However, that requires a bitrate of 32kbps, certainly not enough for stereo CD-quality sound. Even at 64kbps, CD quality was impossible with the formats the device supported: MP3 dropped the upper pitches in a signal, and WMA's sound became "muddy" at anything below 96kbps while also introducing audible high-pitched noise into the output. This was less than satisfactory, so I decided to create a new format that could store sound more efficiently.
I decided the program would target near-CD quality (as defined by my ears and listening tests on a sampling of the general population) for a single-channel signal at 24-32kbps. The program does not currently handle stereo signals or correctly handle sample rates other than 44100Hz. It reads signed 16-bit native-endian PCM data with no header and writes an output file in its own format. The decoder reverses the process, writing raw PCM as output.
The first step of the compression process is to split the input PCM data into frames. Choosing the frame size involves a trade-off between frequency resolution and the ability to control echo artifacts caused by the limited time resolution. The frame size used is 16384 samples, chosen because it allows enough frequency resolution to make eliminating the clicking noises generated between frames much easier.
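In C, the framing step might look like the following sketch. FRAME_SIZE matches the value above; the function name, and the choice to zero-pad a short final frame, are illustrative assumptions rather than the program's actual code.

    #include <stdint.h>
    #include <stdio.h>

    #define FRAME_SIZE 16384  /* samples per frame, as chosen above */

    /* Read one frame of signed 16-bit native-endian PCM into doubles.
       Returns the number of samples actually read; a short final frame
       is zero-padded so every frame handed to the FFT has FRAME_SIZE
       samples. */
    static size_t read_frame(FILE *in, double frame[FRAME_SIZE])
    {
        int16_t raw[FRAME_SIZE];
        size_t n = fread(raw, sizeof raw[0], FRAME_SIZE, in);
        for (size_t i = 0; i < n; i++)
            frame[i] = (double)raw[i];
        for (size_t i = n; i < FRAME_SIZE; i++)
            frame[i] = 0.0;   /* pad the last partial frame with silence */
        return n;
    }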
The next step is to transform this PCM data into the frequency domain, as this makes compression based on features of human perception possible.
The equation for the DFT (Discrete Fourier Transform) is:

    F(x) = \sum_{n=0}^{N-1} P(n) \, e^{-2\pi i n x / N}

N is the number of samples in each frame, P(n) is the PCM data at sample n within a frame, and x is the frequency for which the function is being evaluated. The reverse transform is very similar:

    P(x) = \frac{1}{N} \sum_{n=0}^{N-1} F(n) \, e^{2\pi i n x / N}

F(n) is the coefficient for frequency n (on a scale determined by the number of samples in each frame and the sample rate), and x describes the point on the output waveform whose amplitude is being computed. The transform is produced by FFTW (the Fastest Fourier Transform in the West), as it provides an efficient implementation of the DFT.
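As an illustration, computing one frame's spectrum with FFTW's real-to-complex interface might look like the sketch below. It assumes the FFTW3 API; the project may have used an earlier FFTW release whose interface differs, and in a real encoder the plan would be created once and reused across all frames rather than rebuilt per call.

    #include <fftw3.h>

    #define FRAME_SIZE 16384

    /* Transform one frame of PCM samples into FRAME_SIZE/2 + 1 complex
       frequency coefficients using FFTW's real-to-complex transform. */
    void frame_to_spectrum(const double *pcm, fftw_complex *spectrum)
    {
        double *in = fftw_malloc(sizeof(double) * FRAME_SIZE);
        fftw_plan plan = fftw_plan_dft_r2c_1d(FRAME_SIZE, in, spectrum,
                                              FFTW_ESTIMATE);
        for (int i = 0; i < FRAME_SIZE; i++)
            in[i] = pcm[i];       /* fill input after planning */
        fftw_execute(plan);
        fftw_destroy_plan(plan);
        fftw_free(in);
    }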
The data returned by the FFT is still unsuitable for further processing, as it consists of complex coefficients. However, magnitude and phase information can be extracted from the FFT output as follows:

    m = \sqrt{r^2 + a^2}
    p = \arctan(a / r)

where m is the magnitude, p is the phase, r is the real part of the FFT output, and a is the imaginary part. Only the magnitude is used by the program, since what is ultimately stored is a compressed form of the FFT output itself.
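The extraction is a direct translation of the formulas above; this sketch assumes FFTW3's representation of fftw_complex as a two-element double array.

    #include <math.h>
    #include <fftw3.h>

    /* fftw_complex is double[2]: element [0] is the real part r and
       element [1] the imaginary part a, matching m and p above. */
    double bin_magnitude(const fftw_complex c)
    {
        return sqrt(c[0] * c[0] + c[1] * c[1]);  /* m = sqrt(r^2 + a^2) */
    }

    double bin_phase(const fftw_complex c)
    {
        return atan2(c[1], c[0]);                /* p = atan2(a, r) */
    }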
There are two forms of masking the program uses to determine frequencies that are safe to eliminate. The first kind is threshold masking, which eliminates frequencies that fall below the threshold of audibility. The second type exploits the fact that quieter sounds similar in pitch to a louder sound are masked by the louder sound. Several sets of equations were tried to see which produced the best results. The current expression used to calculate the mask is:
where N is the number of frequencies output by the FFT, x is the logarithmically scaled magnitude for the frequency in question, y is the logarithmically scaled convolved magnitude for the frequency in question, t is a user-specified multiplier, and r is the RMS (Root Mean Square) of the magnitudes in the frame. The convolution is performed by transforming a spreading function using the FFT, multiplying the FFT output of the magnitudes by it, and then transforming back using the IFFT (Inverse Fast Fourier Transform). The RMS is the square root of the mean of the squares of the magnitudes:

    r = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} m_n^2}
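A sketch of the fast convolution and of the RMS computation follows. The function names are assumptions for illustration: it presumes the spreading function's forward transform has been precomputed and that c2r_plan was created to transform mag_fft back into out. FFTW's transforms are unnormalized, so the result must be scaled by 1/n.

    #include <fftw3.h>
    #include <math.h>

    /* Circularly convolve the magnitude curve with the spreading
       function by multiplying their spectra, then transforming back.
       mag_fft and spread_fft are r2c transforms of length-n arrays. */
    void spread_convolve(fftw_complex *mag_fft,
                         const fftw_complex *spread_fft,
                         double *out, int n, fftw_plan c2r_plan)
    {
        int bins = n / 2 + 1;
        for (int i = 0; i < bins; i++) {
            double re = mag_fft[i][0] * spread_fft[i][0]
                      - mag_fft[i][1] * spread_fft[i][1];
            double im = mag_fft[i][0] * spread_fft[i][1]
                      + mag_fft[i][1] * spread_fft[i][0];
            mag_fft[i][0] = re;
            mag_fft[i][1] = im;
        }
        fftw_execute(c2r_plan);  /* inverse transform: mag_fft -> out */
        for (int i = 0; i < n; i++)
            out[i] /= n;         /* undo FFTW's unnormalized scaling */
    }

    /* RMS of the magnitudes: r = sqrt((1/N) * sum(m[i]^2)). */
    double magnitude_rms(const double *m, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += m[i] * m[i];
        return sqrt(sum / n);
    }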
The application of the mask is simple: the program compares each magnitude against the mask and, if the magnitude is less than the mask, zeroes both the real and imaginary parts of the corresponding point in the FFT output.
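In code, applying the mask amounts to one comparison per frequency bin (a minimal sketch; the names are illustrative):

    #include <fftw3.h>

    /* Zero every FFT bin whose magnitude falls below the mask, exactly
       as described above: both real and imaginary parts are cleared so
       the bin costs nothing in the later lossless stage. */
    void apply_mask(fftw_complex *spectrum, const double *magnitude,
                    const double *mask, int bins)
    {
        for (int i = 0; i < bins; i++) {
            if (magnitude[i] < mask[i]) {
                spectrum[i][0] = 0.0;  /* real part */
                spectrum[i][1] = 0.0;  /* imaginary part */
            }
        }
    }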
The FFT coefficients are stored on an 8-bit exponential scale. This represents lower volumes more accurately than higher ones, achieving 2:1 compression from this step alone with no perceivable quality loss.
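The report does not specify the exact curve, so the following is only a plausible sketch of an 8-bit logarithmic scale; MAX_COEFF is a made-up constant, and sign handling is omitted for brevity.

    #include <math.h>
    #include <stdint.h>

    #define MAX_COEFF 1.0e7  /* assumed largest coefficient magnitude */

    /* Map a coefficient magnitude onto 0..255 logarithmically, so
       quiet values get finer quantization steps than loud ones. */
    uint8_t quantize(double coeff)
    {
        double m = fabs(coeff);
        if (m < 1.0)
            return 0;
        double q = 255.0 * log(m) / log(MAX_COEFF);
        return q > 255.0 ? 255 : (uint8_t)q;
    }

    /* Approximate inverse used by the decoder. */
    double dequantize(uint8_t q)
    {
        return q ? exp((double)q / 255.0 * log(MAX_COEFF)) : 0.0;
    }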
As the final step in compression, the data is sent to zlib to be compressed using the lossless gzip compression scheme.
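With zlib's deflate interface, requesting the gzip wrapper looks roughly like the sketch below (windowBits of 15 + 16 selects gzip framing). A real encoder would stream every frame through a single z_stream rather than finishing the stream per call, as done here for simplicity.

    #include <zlib.h>
    #include <string.h>

    /* Compress one buffer of quantized coefficients into gzip format.
       Returns 0 on success, -1 on error; *out_len is updated to the
       compressed size. */
    int gzip_frame(const unsigned char *in, unsigned in_len,
                   unsigned char *out, unsigned *out_len)
    {
        z_stream strm;
        memset(&strm, 0, sizeof strm);
        if (deflateInit2(&strm, Z_BEST_COMPRESSION, Z_DEFLATED,
                         15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;
        strm.next_in   = (unsigned char *)in;
        strm.avail_in  = in_len;
        strm.next_out  = out;
        strm.avail_out = *out_len;
        int ret = deflate(&strm, Z_FINISH);
        *out_len = strm.total_out;
        deflateEnd(&strm);
        return ret == Z_STREAM_END ? 0 : -1;
    }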
The program currently achieves 8:1 compression with almost no perceivable quality loss. This is, however, short of my goal. Some sets of equations have produced as much as 40:1 compression with lesser but still almost acceptable quality, so I am working on merging the features of several sets to achieve better compression.
For the purpose for which this project was created, 12:1 compression (64kbps) is currently the maximum usable compression without significant quality loss.
A measure of how tonal the signal within a frame is would allow for more intelligent masking, as tonal signals are hard to mask but mask other signals well.
Currently, the quantization scheme allocates a static number of bits to each coefficient. This should be improved by allowing for bands where different numbers of bits are used, giving more attention to important frequencies.
The program currently passes eliminated frequencies (replaced with 0s) on to the lossless compression stage. Fixing this would improve the compression ratio with no quality loss.
The program currently only works on monaural signals. With a stereo signal, it could also take advantage of redundancy between channels to further compress the signal, as well as allow for a better listening experience.