AiS Challenge Team Interim Report

Team Number: 071

School Name: Sandia Prep School

Area of Science: Biochemistry

Project Title: Identifying Alternative Gene Splicing in Human Chromosome 21

The purpose of this project is to identify alternative splicing in human chromosome 21. Alternative splicing occurs when a single DNA sequence gives rise to multiple mRNA transcripts. This process is involved in immune system function, muscle cell differentiation, and other fundamental biological processes. Understanding the mechanism of alternative splicing may help in developing new treatments for diseases. Our program will find alternative splicing by comparing expressed sequence tags (ESTs) with genomic sequences provided by NCBI (National Center for Biotechnology Information).

When a cell needs a certain protein, it sends an mRNA transcript to the ribosome. The ribosome reads the transcript and manufactures the protein through translation. Because alternative splicing yields multiple distinct mRNAs from one gene, it allows multiple proteins to be formed and so influences which proteins the cell produces.

We have used object-oriented programming to separate the code into objects. Dividing the code this way keeps it in compartments that operate independently of each other, and each object performs a specific task. For example, the object called GenomicFileIO reads files containing the genomic sequence and several EST sequences, then loads these sequences into individual strings. GenomicFileIO will later gain the ability to write the results to a separate file. Another object we have implemented is called DNAHash. It takes a particular 11-mer and indexes it into an array that is 4^11 (4,194,304) entries long, one entry for each possible 11-mer. This allows us to look up any 11-mer in constant time. We then use another object that implements the Smith-Waterman algorithm, a dynamic programming algorithm that allows us to compare EST and genomic DNA sequences. It finds the highest-scoring matches and stores them in a variable. Lastly, we have created an object called DNA, which uses the genetic code to turn a DNA sequence into the protein that it codes for.
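
The 11-mer indexing described above can be illustrated with a minimal sketch. This is not the team's DNAHash object: the class name KmerIndex, its member functions, and the choice to store a list of positions per slot are all assumptions made for illustration. Each base is packed into two bits, so a word of length k maps to a unique integer in the range 0 to 4^k - 1, which is used directly as an array index.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical k-mer index; the class and function names are illustrative only.
class KmerIndex {
public:
    // Number of slots needed for words of length k: 4^k (4^11 = 4,194,304).
    static std::size_t tableSize(std::size_t k) {
        return static_cast<std::size_t>(1) << (2 * k);
    }

    // Record every position in the genome where a valid k-mer starts.
    KmerIndex(const std::string& genome, std::size_t k)
        : k_(k), table_(tableSize(k)) {
        for (std::size_t pos = 0; pos + k_ <= genome.size(); ++pos) {
            std::size_t slot;
            if (encode(genome, pos, slot)) table_[slot].push_back(pos);
        }
    }

    // Return all genome positions where the given word occurs.
    // Computing the slot costs O(k); the table access is a single array lookup.
    std::vector<std::size_t> lookup(const std::string& word) const {
        std::size_t slot;
        if (word.size() != k_ || !encode(word, 0, slot)) {
            return std::vector<std::size_t>();
        }
        return table_[slot];
    }

private:
    // Map a base to a 2-bit code, or -1 for anything other than A, C, G, T.
    static int baseCode(char b) {
        switch (b) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default: return -1;
        }
    }

    // Pack the k bases starting at pos into one integer in [0, 4^k).
    // Fails if the window runs past the end or contains a non-ACGT character.
    bool encode(const std::string& seq, std::size_t pos, std::size_t& slot) const {
        if (pos + k_ > seq.size()) return false;
        slot = 0;
        for (std::size_t i = 0; i < k_; ++i) {
            int code = baseCode(seq[pos + i]);
            if (code < 0) return false;
            slot = slot * 4 + static_cast<std::size_t>(code);
        }
        return true;
    }

    std::size_t k_;
    std::vector<std::vector<std::size_t> > table_;
};
```

With k = 3, building the index over "ACGTACGA" and looking up "ACG" returns positions 0 and 4. At k = 11 the empty vector-of-vectors table alone is likely to occupy on the order of 100 MB on a typical 64-bit build, so a flatter layout may be worth considering for real data.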

To date we have learned about various algorithms and data structures such as merge sort and binary trees. As an exercise, we implemented merge sort and demonstrated its efficiency over other sorting algorithms such as bubble sort. We have also taken advantage of various structures in the Standard Template Library, especially the vector and string classes.
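
For reference, the merge sort exercise mentioned above might look something like the following sketch. It is not the team's code; the function names and the use of std::vector<int> are assumptions. Merge sort runs in O(n log n) time, while bubble sort takes O(n^2) in the worst case, which is the source of the efficiency difference.

```cpp
#include <cstddef>
#include <vector>

// Merge the sorted ranges [lo, mid) and [mid, hi) of v, using tmp as scratch space.
void mergeHalves(std::vector<int>& v, std::vector<int>& tmp,
                 std::size_t lo, std::size_t mid, std::size_t hi) {
    std::size_t i = lo, j = mid, out = lo;
    while (i < mid && j < hi) {
        // Take from the right half only when strictly smaller, which keeps the sort stable.
        if (v[j] < v[i]) tmp[out++] = v[j++];
        else             tmp[out++] = v[i++];
    }
    while (i < mid) tmp[out++] = v[i++];
    while (j < hi)  tmp[out++] = v[j++];
    for (std::size_t n = lo; n < hi; ++n) v[n] = tmp[n];
}

// Recursively sort the range [lo, hi).
void mergeSortRange(std::vector<int>& v, std::vector<int>& tmp,
                    std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;  // zero or one element is already sorted
    std::size_t mid = lo + (hi - lo) / 2;
    mergeSortRange(v, tmp, lo, mid);
    mergeSortRange(v, tmp, mid, hi);
    mergeHalves(v, tmp, lo, mid, hi);
}

// Sort the whole vector in ascending order.
void mergeSort(std::vector<int>& v) {
    std::vector<int> tmp(v.size());
    mergeSortRange(v, tmp, 0, v.size());
}
```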

We still have several important pieces of code to implement. We need to make an object that can compare the EST sequences to the genomic sequence. Once we have compared these sequences, we can construct an object that derives exon and intron lists and then processes these lists into the set of alternative splices.
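
The EST-to-genome comparison could build on Smith-Waterman local alignment. The sketch below shows only the scoring pass and is not the team's implementation. The struct LocalHit, the function name, and the scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration. Each cell holds the best score of any local alignment ending at that pair of positions, and a score is never allowed to fall below zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type: best local score and where that alignment ends.
struct LocalHit {
    int score;
    std::size_t estEnd;     // one past the last aligned EST base
    std::size_t genomeEnd;  // one past the last aligned genomic base
};

// Smith-Waterman scoring pass with a linear gap penalty.
// The scoring parameters are illustrative assumptions, not values from the report.
LocalHit smithWaterman(const std::string& est, const std::string& genome,
                       int match = 2, int mismatch = -1, int gap = -2) {
    // H[i][j] = best score of a local alignment ending at est[i-1], genome[j-1].
    std::vector<std::vector<int> > H(est.size() + 1,
                                     std::vector<int>(genome.size() + 1, 0));
    LocalHit best = {0, 0, 0};
    for (std::size_t i = 1; i <= est.size(); ++i) {
        for (std::size_t j = 1; j <= genome.size(); ++j) {
            int diag = H[i - 1][j - 1] + (est[i - 1] == genome[j - 1] ? match : mismatch);
            int up   = H[i - 1][j] + gap;  // gap in the genome
            int left = H[i][j - 1] + gap;  // gap in the EST
            int cell = std::max(0, std::max(diag, std::max(up, left)));
            H[i][j] = cell;
            if (cell > best.score) {
                best.score = cell;
                best.estEnd = i;
                best.genomeEnd = j;
            }
        }
    }
    return best;
}
```

The full matrix needs memory proportional to the product of the two sequence lengths, so on chromosome-scale data it would probably be applied only to regions first located by shared 11-mers rather than to the whole genomic sequence.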

We hope that our code will enable us to find examples of alternative splicing in human chromosome 21. Once we have done this, we would like to identify many of the proteins that would be created by alternative splicing in chromosome 21. With that information, we would continue with a literature search to see which of these proteins are involved in disease states.
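
Identifying the proteins produced by each splice variant comes down to translating the spliced coding sequence with the standard genetic code. The following is a minimal sketch of that step, not the team's DNA object. The function names are hypothetical, and the choices to read from the first base, to stop at the first stop codon, and to write X for codons containing unknown bases are assumptions.

```cpp
#include <cstddef>
#include <string>

// Map a base to its index in T, C, A, G order; -1 for anything else.
int codonBaseIndex(char b) {
    switch (b) {
        case 'T': case 't': case 'U': case 'u': return 0;
        case 'C': case 'c': return 1;
        case 'A': case 'a': return 2;
        case 'G': case 'g': return 3;
        default: return -1;
    }
}

// Translate a coding sequence into one-letter amino acid codes.
// Reads codons from position 0 and stops at the first stop codon.
std::string translate(const std::string& dna) {
    // Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
    static const char table[] =
        "FFLLSSSSYY**CC*W"
        "LLLLPPPPHHQQRRRR"
        "IIIMTTTTNNKKSSRR"
        "VVVVAAAADDEEGGGG";
    std::string protein;
    for (std::size_t i = 0; i + 3 <= dna.size(); i += 3) {
        int a = codonBaseIndex(dna[i]);
        int b = codonBaseIndex(dna[i + 1]);
        int c = codonBaseIndex(dna[i + 2]);
        if (a < 0 || b < 0 || c < 0) {
            protein += 'X';  // unknown base in this codon
            continue;
        }
        char aminoAcid = table[16 * a + 4 * b + c];
        if (aminoAcid == '*') break;  // stop codon ends the protein
        protein += aminoAcid;
    }
    return protein;
}
```

For example, translate("ATGGCCTAA") yields "MA": ATG codes for methionine, GCC for alanine, and TAA is a stop codon.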

Our current development environment is Microsoft Visual C++ 6 on a personal computer. Even with toy data, our program takes several minutes to run. This suggests that we will need to migrate our code to a supercomputer once we are using real data.

Beyond the immediate purpose of the code, we wish to expand our computational skills. We would also like to become more familiar with scientific literature, with the medical sciences, and with biology as a whole. We are also learning teamwork skills as we complete this project together.

Purves et al. Life: The Science of Biology, fifth edition. W.H. Freeman and Company, p. 326, 1998.

M.S. Waterman. Efficient sequence alignment algorithms. Journal of Theoretical Biology, 108:333-337, 1984.

T.F. Smith and M.S. Waterman. Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks. Nucleic Acids Research, 21:607-613, 1993.

A.A. Mironov, J.W. Fickett, and M.S. Gelfand. Frequent alternative splicing of human genes. Genome Research, 9:1288-1293, 1999.

Team Members

Sponsoring Teacher(s)

Neil McBeth

Project Mentor(s)

Mark Fleharty