New Mexico Supercomputing Challenge
Challenge Team Interim Report
Background: More than 150 subtypes of leukemia are postulated to exist. Researchers are trying to identify them because each has a unique clinical course and responds differently to different treatment regimens. For the health of the patients, it is important that they receive the treatment that is most effective for their specific cancer type; otherwise, they may suffer unnecessarily severe side effects, the cancer may fail to go into remission, or both. Researchers from UNM have already collected blood and tissue samples from leukemia patients across the nation and have begun to analyze them using cDNA microarrays to identify differing levels of gene expression in normal vs. cancer samples.

Visually, the level of expression for each gene is represented on a red/green scale, with red corresponding to the expression level in a patient tumor sample. These data are superimposed on data from normal tissue samples, which are represented by shades of green. As a result, black dots occur wherever genes are equally expressed in the normal and tumor samples. The shade of each color depends on the amount of expression of the gene: the lighter the shade, the less the gene is expressed, and the darker the shade, the more it is expressed.

Neural networking is an approach to problem solving that is useful in situations in which a generalization must be made. The purpose of this project is to use neural networks to classify different subtypes of leukemia based on common patterns of gene expression. The challenge is to do this while staying on the fine line between under-classifying the data and over-fitting it (e.g., fitting noise in the data). Two types of neural network approaches can be taken: supervised learning and unsupervised learning.
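To make the red/green scale from the Background concrete, the sketch below labels a single microarray spot as red, green, or black by comparing the tumor and normal expression values for one gene. This is only an illustration, not the UNM analysis pipeline: the names `SpotColor`, `spot_color`, and `count_differing`, and the tolerance parameter `tol` that decides when two values count as "equally expressed", are all assumptions introduced here.

```c
#include <assert.h>

/* Hypothetical labels for how one microarray spot would appear. */
typedef enum { SPOT_GREEN = -1, SPOT_BLACK = 0, SPOT_RED = 1 } SpotColor;

/* Compare tumor and normal expression for one gene.
 * tol is an assumed tolerance: differences no larger than tol
 * are treated as "equally expressed". */
SpotColor spot_color(double tumor, double normal, double tol)
{
    assert(tol >= 0.0);
    double diff = tumor - normal;
    if (diff > tol)  return SPOT_RED;    /* tumor sample dominates */
    if (diff < -tol) return SPOT_GREEN;  /* normal sample dominates */
    return SPOT_BLACK;                   /* roughly equal expression */
}

/* Count the genes whose expression differs between the two samples. */
int count_differing(const double *tumor, const double *normal,
                    int n_genes, double tol)
{
    int count = 0;
    for (int i = 0; i < n_genes; i++)
        if (spot_color(tumor[i], normal[i], tol) != SPOT_BLACK)
            count++;
    return count;
}
```

A classifier would work on the underlying numbers rather than on the colors, but the same comparison shows which genes carry a difference between the two samples.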
Our preliminary program will use supervised learning, meaning that we will specify the classes into which our data will be sorted and will use a set of training data to determine the characteristics of our neural network before trying to classify new data. Depending on our progress with supervised learning, we may move on to unsupervised learning toward the end of the project.

Genetic programming, also known as evolutionary programming, is a type of programming that applies the principles of natural selection found in biological evolution. The general idea is to start with a collection of functions and randomly combine them into programs, then run the programs and see which give the best results. The best ones are kept (natural selection), some of the others are mutated, and the new generation is tested; this process is repeated until a clear best program emerges.

Problem Definition: A team of researchers from the UNM School of Medicine (SOM), led by Cheryl L. Willman, MD, and the Albuquerque High Performance Computing Center (AHPCC) is working to identify and classify all of the different types of leukemia using gene expression. Our program will be only a first small step toward this larger goal.

Problem Solution: We are going to use neural networking and genetic programming in a C program to try to identify some of these types and subtypes. The program will use raw numerical input data in which each number represents a different shade or intensity of green, red, or the two combined. In our first input data set, we will distinguish only between the two most common general types of leukemia: AML (acute myeloid leukemia) and ALL (acute lymphoblastic leukemia). AML is sometimes called adult leukemia because the majority of patients are adults, although a smaller percentage of patients are children.
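As a rough illustration of supervised learning on the two-class AML vs. ALL problem, the sketch below trains a single perceptron, the simplest form of neural network, on labeled samples and then uses it to classify new ones. It is a minimal sketch under stated assumptions, not the project's actual network, which has not yet been designed: the names `Perceptron`, `predict`, and `train`, the labels `LABEL_AML` and `LABEL_ALL`, and the choice of two genes per sample are all hypothetical.

```c
#include <assert.h>

#define N_GENES 2   /* illustrative only; real arrays measure thousands of genes */

/* Hypothetical class labels for the two leukemia types. */
enum { LABEL_ALL = 0, LABEL_AML = 1 };

typedef struct {
    double w[N_GENES];  /* one weight per gene */
    double bias;
} Perceptron;

/* Predict a label from one sample's expression values. */
int predict(const Perceptron *p, const double *x)
{
    double sum = p->bias;
    for (int g = 0; g < N_GENES; g++)
        sum += p->w[g] * x[g];
    return sum > 0.0 ? LABEL_AML : LABEL_ALL;
}

/* Train with the classic perceptron rule on labeled samples.
 * Returns the number of epochs used, or -1 if some training sample
 * is still misclassified after max_epochs passes. */
int train(Perceptron *p, const double x[][N_GENES], const int *labels,
          int n_samples, double rate, int max_epochs)
{
    assert(n_samples > 0 && rate > 0.0);
    for (int g = 0; g < N_GENES; g++)
        p->w[g] = 0.0;
    p->bias = 0.0;

    for (int epoch = 1; epoch <= max_epochs; epoch++) {
        int errors = 0;
        for (int s = 0; s < n_samples; s++) {
            int err = labels[s] - predict(p, x[s]);  /* -1, 0, or +1 */
            if (err != 0) {
                errors++;
                /* Nudge the weights toward the correct answer. */
                for (int g = 0; g < N_GENES; g++)
                    p->w[g] += rate * err * x[s][g];
                p->bias += rate * err;
            }
        }
        if (errors == 0)
            return epoch;  /* every training sample classified correctly */
    }
    return -1;
}
```

Calling `train` on a small set of labeled samples and then `predict` on an unseen sample mirrors the two phases described above: the training data determine the weights, and only afterward is new data classified. A single perceptron can only separate classes with a straight line (or plane), so a finer subtype classification would likely need a larger network.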
ALL (also referred to as acute lymphocytic leukemia and acute lymphoid leukemia) is the most common form of childhood cancer and is often known as pediatric leukemia for this reason. Once we have gained experience in classifying data using only these two types, we will attempt to progress to a more detailed classification. Our program will attempt to group the data so that each group contains data from a different type of leukemia, according to the specific pattern of genes expressed in each tissue sample in the input data. We will begin by reanalyzing the classification of a small data set recently published by the Lander group at MIT. We will then apply the program to the larger data sets that will be available shortly at UNM. Toward the end of the project, we will develop a simple parallel implementation of our neural network program so that the classification procedure can eventually be scaled up to very large data sets.

Project to Date: As of now, we have completed sufficient research to begin programming. We have learned basic parallel programming with MPI (Message Passing Interface), the basic structure of neural networks needed to create our program, and the UNIX commands necessary to use the computers at AHPCC, where we have obtained accounts. We have yet to determine the specific equation the neural network will use to decide how similar two data sets must be to be classified as the same subtype of leukemia.

Expected Results: Our output data will also be numeric. To display the results of our program, we will use a graphics and analysis program called VxInsight, which was developed by researchers at Sandia National Laboratories who are working with the UNM group.

Applications: This work will contribute to the larger efforts of the UNM researchers, with the long-term goal of sparing leukemia patients from treatments that are less likely to help and that may cause terrible side effects.
Current treatments have proved very effective for certain types of leukemia, but it is important that patients be correctly diagnosed and treated according to their specific genes and cancer types.

Team Members
Sponsoring Teacher
Project Advisor