AiS Challenge Team Interim

Team Number: 001

School Name: Alamogordo High School

Area of Science: Behavioral and Social Sciences

Project Title: Heuristic Encoding Algorithm for Effective Audio Resynthesis

I wish to create an audio compression system based on an evolving auditory perception model, that is, a model of how our perception of sound is shaped by factors such as the ear and the brain's processing of sound. Such a model allows perceptually unimportant data to be omitted from the output, so that more of the important data can be stored in a given space. This is useful when streaming media over slower connections or when storing audio on a portable device, where file size is an important factor but high quality is still desired. The goal is to create a program that achieves sound quality equal to that of current MP3 encoders at a lower bitrate. To do this, the program will read PCM data, analyze it with an FFT, DCT, or other algorithm, and then determine which parts of the resulting output are most important to a human listener. The decoder will simply re-synthesize the waveforms from the data in the compressed file. Effectiveness will be measured with blind listening tests: listeners will compare the output of my program against that of LAME and/or BladeEnc without knowing which encoder produced which file. The program will also use listener feedback to improve the algorithm by adjusting its idea of what a human listener considers important.
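As a rough illustration of the decoder side, the re-synthesis step described above could be sketched in Python as follows. The function names and the (frequency, amplitude, phase) component layout are assumptions made for this example only; they are not the format the program actually uses.

```python
import math


def resynthesize(components, num_samples, sample_rate=44100):
    """Rebuild one frame by summing sinusoids.

    components is a list of (frequency_hz, amplitude, phase_radians)
    tuples. This layout is hypothetical, chosen only for illustration.
    """
    out = [0.0] * num_samples
    for freq, amp, phase in components:
        # Phase advance per sample for this frequency.
        step = 2.0 * math.pi * freq / sample_rate
        for t in range(num_samples):
            out[t] += amp * math.cos(step * t + phase)
    return out


def to_pcm16(samples):
    """Scale samples in [-1, 1] to signed 16-bit integers, clipping overshoot."""
    return [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
```

For example, resynthesize([(441.0, 0.5, 0.0)], 100) produces exactly one period of a 441 Hz cosine at half of full scale, which to_pcm16 then converts to integer samples ready to be written out as PCM.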

A basic encoder already exists. It currently uses a DFT to encode the sound it reads, representing the sound in the frequency domain instead of the time domain. It accepts only 44100 Hz, single-channel, signed 16-bit PCM data in native byte order as input, and it writes a text file containing frequency data as output. This approach has problems, however: it cannot detect where a frequency begins and ends within a frame, and overlapping frames is too time-consuming, since even an FFT, a faster way of computing the DFT, is too slow to be run more than once per frame. Because of these limitations, the program also does not deal well with complex sounds. They come out sounding like white noise, because the program indiscriminately omits frequencies to keep the file size down to an acceptable level. This will be corrected when the perception model is created.
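To make the cost of the current approach concrete, here is a minimal Python sketch of a direct DFT over one frame of 16-bit PCM. It is not the program's actual code; the function names and the choice to normalize samples to the range [-1, 1) are assumptions for illustration.

```python
import array
import cmath
import math

SAMPLE_RATE = 44100  # Hz, the only input rate the encoder accepts


def pcm_bytes_to_samples(raw):
    """Interpret raw bytes as signed 16-bit samples in native byte order."""
    samples = array.array('h')  # 'h' = signed short, native byte order
    samples.frombytes(raw)
    return [s / 32768.0 for s in samples]


def naive_dft(frame):
    """Direct DFT of a real frame, returning bins 0..N/2.

    Every output bin loops over every input sample, so the cost grows
    with the square of the frame length.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(acc)
    return spectrum


def bin_frequency(k, n, rate=SAMPLE_RATE):
    """Center frequency in Hz of bin k for an n-sample frame."""
    return k * rate / n
```

Because the inner loop runs once per sample for every bin, doubling the frame length roughly quadruples the work, which is why a direct DFT becomes impractical for long frames.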

Three major tasks remain. First, the DFT implementation needs to be replaced with an optimized FFT implementation to make encoding files more practical, as the current program takes about an hour and forty minutes to encode twelve seconds of CD-quality monaural audio. Second, the perception model needs to be developed so the program can determine which frequencies can be omitted. Currently, to reduce output file size, the program keeps only the peaks of the frequency spectrum, which means it can deal with simple sounds, such as voices, but has trouble with more complex sounds, such as music from an electric guitar. Finally, a back-end compressor needs to be added, either to decrease the output file size without affecting audio quality or to increase the number of frequencies that can be stored at a given bitrate, thus reducing how many frequencies the perception model has to eliminate to produce acceptable results. In addition to these tasks, the program needs to write a binary format instead of text to save space.
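The following Python sketch shows how some of these pieces might fit together: a textbook radix-2 FFT in place of the direct DFT, the peak-picking reduction the program currently relies on, and a compact binary record in place of text output. It is an illustration under stated assumptions, not the planned implementation. The record layout in pack_peaks (a 16-bit count followed by a 16-bit bin index and two 32-bit floats per peak) is hypothetical, and no perception model or back-end compressor is shown.

```python
import cmath
import math
import struct


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectral_peaks(spectrum, max_peaks):
    """Return up to max_peaks (bin, magnitude, phase) local maxima, strongest first."""
    half = len(spectrum) // 2
    mags = [abs(c) for c in spectrum[:half + 1]]
    peaks = []
    for k in range(1, half):
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            peaks.append((k, mags[k], cmath.phase(spectrum[k])))
    peaks.sort(key=lambda p: p[1], reverse=True)
    return peaks[:max_peaks]


def pack_peaks(peaks):
    """Pack peaks into a hypothetical little-endian binary record.

    Layout (an assumption for this sketch): uint16 peak count, then for
    each peak a uint16 bin index, float32 magnitude, and float32 phase.
    """
    data = struct.pack("<H", len(peaks))
    for k, mag, phase in peaks:
        data += struct.pack("<Hff", k, mag, phase)
    return data
```

In this layout each peak takes 10 bytes regardless of its value, whereas a text representation of the same three numbers would typically take two or three times as much. The FFT does on the order of N log N operations per frame instead of the N squared required by a direct DFT.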

Team Email

Team Member: Jeremy Pepper

Sponsoring Teacher: Albert Simon