Statistical Analysis of Benchmarking Code, Year 2

Team: 28

School: MANZANO HIGH

Area of Science: Mathematics, Computer Science


Interim:

SC Team #28

Statistical Analysis of Parallel Code, Year 2

Stephanie McAllister - Manzano High School

Vincent Moore - Eldorado High School


The problem that we are trying to solve is measuring the overall efficiency
of high performance computing platforms and determining what might be changed to
improve that efficiency.

The plan that we have come up with to solve this problem experimentally is to
run benchmarking codes on multiple machines and then statistically process the
data produced by those codes. These codes include the High Performance
Computing (HPC) Challenge Benchmark code[2], which consists of many tests
within one package, and the Numerical Aerodynamic Simulation (NAS) Parallel
Benchmarks[7].

Of the many NAS Parallel Benchmarks, we have begun with Embarrassingly
Parallel (EP), Conjugate Gradient (CG), and LU Decomposition (LU). EP is a
pseudo-random number generator code in which each processor generates its own
stream of random numbers with very little communication. Every run starts by
placing the application code on each of the nodes. Each node receives a different
generation seed, a number that initializes its random number generator. CG is
an iterative method for solving systems of linear equations. This type of code is
well suited to optimization problems because it is quicker and easier to apply than
the method of steepest descent. LU Decomposition is a parallel code that decomposes
an N x N matrix into a lower and an upper triangular matrix; a sketch of that
factorization appears below.
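
To illustrate the factorization that the LU benchmark parallelizes, here is a
minimal serial sketch in Java. This is our own toy example using the Doolittle
method without pivoting, not code from the benchmark suite:

    // Minimal serial sketch of Doolittle LU decomposition (no pivoting),
    // illustrating the factorization the LU benchmark parallelizes.
    public class LuSketch {
        // Factor a into unit-lower-triangular L and upper-triangular U, in
        // place: entries below the diagonal hold L, the rest hold U.
        static void decompose(double[][] a) {
            int n = a.length;
            for (int k = 0; k < n; k++) {
                for (int i = k + 1; i < n; i++) {
                    a[i][k] /= a[k][k];               // multiplier L[i][k]
                    for (int j = k + 1; j < n; j++) {
                        a[i][j] -= a[i][k] * a[k][j]; // update trailing block
                    }
                }
            }
        }

        public static void main(String[] args) {
            double[][] a = { {4, 3}, {6, 3} };
            decompose(a);
            // Expected: L = [[1,0],[1.5,1]], U = [[4,3],[0,-1.5]]
            System.out.println("L21=" + a[1][0] + " U11=" + a[0][0]
                    + " U22=" + a[1][1]);
        }
    }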

The HPC Challenge benchmark codes that we are going to use to assess the overall
efficiency of the supercomputers are High Performance Linpack (HPL), PTRANS,
MPI Random Access (GUPS), and the Fast Fourier Transform (FFT). HPL is a direct
solution method for large, dense systems of linear equations. PTRANS performs
parallel matrix transposition. The GUPS test allocates a large amount of memory
across many different nodes and randomly accesses memory locations among the
processors; a toy sketch of this access pattern follows. The FFT code performs
a fast Fourier transform.
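
The following is a toy single-node sketch in Java of the GUPS access pattern.
The real benchmark spreads the table across nodes with MPI; the table size,
update count, and seed here are arbitrary choices for illustration:

    import java.util.Random;

    // Toy single-node sketch of the GUPS random-access pattern: allocate a
    // large table and apply read-modify-write updates at random indices.
    public class GupsSketch {
        public static void main(String[] args) {
            int tableSize = 1 << 20;          // ~1M entries (real runs use far more)
            long[] table = new long[tableSize];
            Random rng = new Random(42);

            long start = System.nanoTime();
            long updates = 10_000_000L;
            for (long i = 0; i < updates; i++) {
                int idx = rng.nextInt(tableSize);  // random location
                table[idx] ^= i;                   // read-modify-write update
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            // Giga-updates per second (GUPS) is the benchmark's figure of merit
            System.out.printf("GUPS = %.4f%n", updates / seconds / 1e9);
        }
    }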

For every code that is run on one of these high performance computing
platforms, the code produces a file reporting data such as how long the job
took, the size and class of the job, and other measurable quantities. The next
step in the process is collecting the correct data for the analysis. This is
done with a Perl script[4] that goes through a file and picks out all of the
data needed for the statistical analysis. The Perl script creates another file
and writes the data it finds in a specific section of the benchmark output to a
single line of the outfile. The result is a file that is much easier to work
with when computing values and separating values by job size. The outfile
generated by the Perl script will then be processed by a C++ or Java program
that performs the statistical analysis; a sketch of the extraction idea follows.
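
The extraction step itself is written in Perl, but the idea can be sketched in
Java as follows. This is a hypothetical version: the input and output file
names are placeholders, and the field labels are assumptions based on the NAS
report format:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    // Hypothetical Java version of what the Perl extraction script does:
    // scan a benchmark output file for the fields we need and write them
    // as one line of the outfile.
    public class Extract {
        public static void main(String[] args) throws IOException {
            String time = "", mops = "", clazz = "";
            try (BufferedReader in = new BufferedReader(new FileReader("ep.out"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains("Time in seconds")) time = valueOf(line);
                    if (line.contains("Mop/s total"))     mops = valueOf(line);
                    if (line.contains("Class"))           clazz = valueOf(line);
                }
            }
            try (PrintWriter out = new PrintWriter("ep.dat")) {
                out.println(clazz + " " + time + " " + mops); // one run per line
            }
        }

        // Take everything after the '=' sign and trim the whitespace.
        static String valueOf(String line) {
            return line.substring(line.indexOf('=') + 1).trim();
        }
    }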

So far, we have the NAS Parallel Benchmarks running on multiple
platforms[1] and the HPC Challenge benchmark code running on some platforms. We
are still working on bringing these benchmarks up on other platforms in order to
have a better range of data for each system and a better feel for what performs
best on a supercomputer. The Perl[4] script for the NAS Parallel Benchmarks has
been written and works properly. The HPC Challenge Perl script is being modified
to correctly select values for analysis. The Java program has been started; it
has a few errors left in it that we will debug before we can see how well it
works. We are also trying to settle on the correct formulas[5,6] for a proper
statistical analysis. One issue we have run into involves the formula we have
been using to compute the sample mean. Its input is supposed to be a time per
unit of work (e.g., seconds per million operations, or s/Mop), but the
benchmarks report a rate in millions of operations per second (Mop/s). We have
decided to deal with this by converting the Mop/s value into s/Mop, inverting
the numerical value (for example, a reported 250 Mop/s becomes 1/250 = 0.004
s/Mop).

We have also been speaking to other experts in this field through our mentor,
Sue Goudy. She has taken a statistical analysis class and has spoken with the
professor, Rob Easterling[6], about the project. He told her that we are on the
right track and that the first things we should do are the analyses of the
sample mean and the standard deviation. These are standard formulas found in
Lilja's text[5]. We have begun this work and will continue it. He also confirmed
that it is proper to convert millions of operations per second into seconds per
million operations. This has helped us keep pushing forward to the analysis
portion of the project. A sketch of the calculation follows.
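
As a sketch of that first analysis step, the following Java fragment inverts
the reported rates and applies the standard sample mean and sample standard
deviation formulas[5]. The rate values below are made-up placeholders, not
measured data:

    // Minimal sketch of the analysis step: convert reported rates (Mop/s)
    // to times per unit of work (s/Mop) by inverting, then compute the
    // sample mean and sample standard deviation.
    public class Stats {
        public static void main(String[] args) {
            double[] mops = { 250.0, 240.0, 260.0, 255.0 }; // placeholder Mop/s
            int n = mops.length;

            double sum = 0.0;
            double[] secPerMop = new double[n];
            for (int i = 0; i < n; i++) {
                secPerMop[i] = 1.0 / mops[i];  // invert: s/Mop
                sum += secPerMop[i];
            }
            double mean = sum / n;             // sample mean

            double ss = 0.0;
            for (double x : secPerMop) {
                ss += (x - mean) * (x - mean);
            }
            double stdDev = Math.sqrt(ss / (n - 1)); // sample standard deviation

            System.out.printf("mean = %.6f s/Mop, std dev = %.6f s/Mop%n",
                    mean, stdDev);
        }
    }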

Results expected

We have a few hypotheses so far. One is that EP will run the most
efficiently because it requires the least communication between the nodes.
The efficiency of EP should thus depend almost solely on which processor is in
use, unlike other benchmark codes, which also depend on the network fabric. The
other hypotheses we have come up with so far are not yet settled; we still need
to do more research to state them precisely.

We also know that there will be discrepancies[3] between runs. There are
multiple possible causes. Many of the reasons that jobs slow down have to do
with the operating system in use. For example, if an operating system daemon
wakes up and decides that a necessary operation must be performed on a node
with only one processor, the processor stops the benchmark calculation and its
communication, and the job may even be removed from the node completely. If
that job is in communication with other nodes that must wait for a value needed
in their calculations, the entire computation can be interrupted, causing a
massive slowdown within the running benchmark. Another major cause of job
slowdown is bad hardware[3]. If something goes wrong with the hardware on a
node, such as memory errors or a bad hard drive, the job allocated to that node
cannot run as well as on other nodes, causing a communication slowdown or, if
the node fails completely, a total failure.


Citations:
1. Sandia's C-Plant website
2. HPC Challenge Benchmark website: http://icl.cs.utk.edu/hpcc/
3. Interview with Donna Brown (Sandia National Laboratories), with
   contributions from Paula McAllister (Sandia National Laboratories),
   private communication
4. Schwartz, Randal L., Learning Perl, O'Reilly & Associates, Inc., 1993;
   Wall, Larry & Schwartz, Randal L., Programming Perl, O'Reilly &
   Associates, Inc., 1991
5. Lilja, David J., Measuring Computer Performance: A Practitioner's
   Guide, Cambridge University Press, 2000
6. Rob Easterling, private communication
7. NAS Parallel Benchmarks:
   http://www.nas.nasa.gov/Software/NPB/Specs/npb2_0/npb2_0.html
8. Greenberg, "Java Tutorials", from Java class at CEC;
   Sierra, Kathy & Bates, Bert, Head First Java, O'Reilly Media, Inc., 2003


Team Members:

  Stephanie McAllister
  Vincent Moore

Sponsoring Teacher: Stephen Schum