GitHub - sandialabs/app_model: A program that mimics a parallel application's checkpoint/restart behavior on a system with redundant compute nodes.

This repository has been archived by the owner on Oct 1, 2024. It is now read-only.

sandialabs / app_model Public archive

Notifications You must be signed in to change notification settings
Fork 0
Star 0

A program that mimics a parallel application's checkpoint/restart behavior on a system with redundant compute nodes.

View license

0 stars 0 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Search		Search
CHANGELOG		CHANGELOG
LICENSE		LICENSE
Makefile		Makefile
README		README
app.c		app.c
app.h		app.h
data_structs.c		data_structs.c
data_structs.h		data_structs.h
globals.c		globals.c
globals.h		globals.h
input.c		input.c
input.h		input.h
main.c		main.c
phases.c		phases.c
phases.h		phases.h
rMPI_model.c		rMPI_model.c
rMPI_model.h		rMPI_model.h
report.c		report.c
report.h		report.h
rnd.c		rnd.c
rnd.h		rnd.h
timing.c		timing.c
timing.h		timing.h

Repository files navigation

VERSION
1.005

INTRODUCTION
The program in this directory, copyrighted and licensed under
the name app_model, mimics a parallel application's cycles
of performing work, saving occasional checkpoints, getting
interrupted, restarting, redoing lost work, and continuing this
sequence until it has completed the number of work hours specified
at the beginning. It has many command line options that control
the timing of this cycle, when and how often the application
gets interrupted, and the configuration of the system it runs on.

The program was designed to study checkpoint/restart behavior
and its impact on total execution time in light of redundant
computing. The program can mimic running an application on
N nodes. A failure of any one of those N nodes leads to an
application interrupt and a restart. It is possible to specify
that some or all of the N nodes have a redundant partner node. If
a node in such a pair faults, the application is not interrupted
and continues to do work. Only when both nodes in a pair fault,
will the application take an interrupt and has to restart.

Using this program we found that for large-scale systems and
the right conditions, it is sometimes worthwhile to use twice
as many nodes to get a certain amount of work done in much
less than half the time. The Sandia Technical report 2009-6753
documents some of our early findings and describes this program.

The program currently only mimics coordinated checkpoint/restart
and has some other limitations, such as no more than one
redundant node for each active node. We are working on
several extensions and improvements. please contact us if you
are interested.

For comments and questions, please contact the author: Kurt
Fereirra <[email protected]>

COMPILING
The included Makefile should build the code on most Unix
systems. It has been tested on several Linux versions. The
code requires the GNU Scientific Library (gsl). Which is
not always installed by default. You can obtain it from
http://www.gnu.org/software/gsl, if necessary.

The Makefile creates an executable name two_step (as in two
steps forward, one step back, mimicking an application's march
towards completion ;-)

USAGE
The program understand the options below. None of them are
mandatory. Long options; e.g., --soft_reboot, can be truncated
as long as they are still distinct from all other options;
e.g., --soft.

-w, --work_time HOURS
The application will complete work for that many hours
on each node (weak scaling). Default is 168 hours (one
week). The elapsed time of the application will be higher due
to checkpoint/restart overhead and application interrupts.

-n, --num_bundles NUM
Number of nodes the application runs on. Default is 512
nodes. These are the nodes an application "sees", and does
not include any redundant nodes that would not be visible
to the application.

-r, --redundant NUM
Number of redundant nodes: 0...N, where N is the number of
bundles (-n) selected. 0 means no redundant computing at
all, and N means each node in the system has a redundant
partner. Values in between are acceptable if an application
is run in partial redundant mode. Default is no redundant
nodes (0).

-v, --verbose
This option may be repeated for increased verbosity. Mostly
useful for debugging and to observe the inner workings of
the program.

--distribution DIST
Selects the random distribution function for the fault
generator. DIST can be exp (default), gamma, or weibull.

--scale a
Set the scale parameter for the Weibull and gamma
distribution. The default s 43800.000 hours.

--shape b
Set the shape parameter for the gamma and weibull
distribution. The default shape parameter b is 0.5 and
must be > 0.

-s, --seed
Use a fixed seed for the random number generator. This
is useful to repeat experiments with the same start
conditions. Without this option, a random seed based on the
current time and PID of the process is used, which results
in different results for each run.

-c, --checkpoint_time MINUTES
Amount of time needed to checkpoint an application. Default
is 5 minutes. This is currently fixed, although it should
vary with application size and the I/O characteristics of
the system.

-R, --restart_time MINUTES
Time to restart and application. Default is 10 minutes. The
program currently assumes that the application is restarted
right away after an interrupt and spends this many minutes
reading in the last successful checkpoint. Wait time in
a batch queue is not considered here. After this restart
time, any work lost since the previous checkpoint is redone
before the regular work phase can be entered again. This is
currently fixed, although it should vary with application
size and the I/O characteristics of the system.

-t, --tau MINUTES
Checkpoint interval; i.e., the elapsed time before the
next checkpoint is written. Unless specified, this time is
computed using Daly's equation for the optimum checkpoint
interval.

-m, --mtbf_node HOURS
The Mean Time Between Failures (MTBF) of a single
node. Default is 5 years (43800 hours). It is assumed
that the MTBF and failure characteristics is the same for
all nodes.

--mtbf_sys HOURS
The system MTBF is calculated by default and used for
comparison at the end of the simulation. This is the MTBF of
the system; i.e., the time between faults. The application
may have a larger MTBF, if redundant nodes are used.

-a, --mtbf_app HOURS
By default the application MTBF is calculated based on the
node MTBF. It is used to calculate the optimal checkpoint
interval. This is the time between interrupts the application
experiences, which may be different from the system MTBF,
if redundant nodes are used. If provided, it will override
node MTBF for that calculation. The node MTBF will still
be used for the random fault generator.

-d, --delay_ras MINUTES
Multiple faults within this time count as one. Sometimes
multiple components fail at nearly the same time, or
cause other failures. This option allows the cumulation
of all faults within this time interval into a single
event. Individual faults are still reported at the
end, but the application does not go through a full
restart/rework/interrupt cycle for each one of them.

--finterrupts FILENAME
For each application interrupt write the interrupt time
(in hours since application start) and the number of faults
since the last interrupt. If no redundant nodes are present,
each fault leads to an interrupt. Specifying "-" directs
output to stdout.

--ffaults FILENAME
For each node fault write the fault time (in hours since
application start) to the specified file. Specifying "-"
directs output to stdout.

--soft_reboot <success>,<reboot time>
By default failed nodes are not reused. With this option
it is possible to reboot nodes after each fault. The
probability that they reboot successfully and can be
reintegrated into the computation can be specified in the
range 0%...100%. It takes a certain number of minutes
for a node to reboot. Both values must be specified. E.g.,
--soft_reboot 50,10 specifies that on average half of the
nodes can be reintegrated (50%), and it takes 10 minutes
for each node to boot. If the other node receives a fault
during those 10 minutes, the bundle, and the application will
fail. After the 10 minutes and a successful re-integration,
redundancy is restored.

--help
Short information about the command line options.

--input FILENAME
Instead of letting a random number generator create node
faults, application interrupts can be read from an input
file. The file should have three values per line: The time,
in seconds from application start, when the interrupt occurs;
the node that causes the interrupt; and a string describing
the error. The fields are separated by white space.

The node number is checked but not further used. The fault
description string must be present, but is ignored. This
format was chosen to make it easy to process fault logs
that can be found on the Internet. They may need to be
pre-processed, but can be easily converted into the format
required here.

-p, --performance
Display performance data about the simulation itself.

OUTPUT
A typical run and its output are shown here. The line by line
explanation is below.

00 ./two_step -n 100000 -r 100000 -w 720 -p
01 Version 1.005
02 Command line "./two_step -n 100000 -r 100000 -w 720 -p"
03 PARAMETERS
04 Active nodes 100000
05 Redundant nodes 100000
06 Total nodes 200000
07 Checkpoint duration 5.00 minutes
08 Restart duration 10.00 minutes
09 Work to be done 720.00 hours
10 Node MTBF 43800.00 hours
11 File for interrupt times ""
12 File for fault times ""
13 Seed for pseudo random generator random
14 Fault distribution: exponential
15 RAS delay 0.00 minutes
16 Soft reboot time not used
17
18 CALCULATED
19 System MTBF 0.22 hours (13.140 minutes)
20 Application MTBI 122.90 hours (7373.718 minutes)
21 Checkpoint interval 4.47 hours (268.223 minutes)
22 Faults/interrupt 561.17
23 WARNING: checkpoint + restart time > system MTBF! This may take quite a while.
24
25 SIMULATION
26 Application completed 720.00 hours of work (100.00% of work to be done)
27 Elapsed time 740.25 hours (Overhead is 2.81%)
28 Total restart time 0.67 hours ( 0.09%)
29 Total rework time 6.17 hours ( 0.83%)
30 Total work time 720.00 hours ( 97.26%)
31 Total checkpoint time 13.42 hours ( 1.81%)
32 Total RAS delay 0.00 hours ( 0.00%)
33 -----------------------------------------------------
34 Totals 740.25 hours (100.00%)
35
36 Number of restarts: 4 Failed: 0
37 Number of rework: 4 Failed: 0
38 Number of work segments: 162 Failed: 4
39 Number of checkpoints: 161 Failed: 0
40 ----------------------------------------------------
41 Fails: 4
42 Interrupts: 4
43
44 Faults: 3364
45 Failed nodes: 3364 Repaired: 3088 (276 nodes to be repaired after app completion)
46 Successful soft reboots: 0 Failed: 0 (0.00%)
47 Avg faults per int: 841.000, 49.87% over calculated 561.17
48 System MTBF 0.22 hours (13.203 minutes), 0.48% over calculated 0.22 hours
49 App. MTBI 185.06 hours (11103.783 minutes), 50.59% over calculated 122.90 hours
50 Modeled elapsed time 748.19 hours (44891.581 minutes), 1.07%, over simulated 740.25 hours
51
52 PROGRAM PERFORMANCE INFORMATION:
53 Generated 203092 random numbers and 0 random probabilities
54 Calls to rMPI() 5
55 Read 0 faults from input file, accepted 0 (0.00%)
56 Time to model this application: 0h:00m:0.015353

Line by line description.
-------------------------
Line 00 The executable is called two_step. No command line
options are mandatory.

Line 01 Version of the executable.

Line 02 Capture the command line given.

Line 03 Starting here is a list of parameters. These are either
defaults or were specified on the command line.

Line 04 The number of bundles being simulated.

Line 05 The total number of redundant nodes, in addition to
the active nodes in each bundle.

Line 06 The total number of nodes: active nodes + redundant
nodes.

Line 07 The time necessary to write a checkpoint.

Line 08 The time necessary to restart an application.

Line 09 The amount of work to be done (on each node).

Line 10 The node MTBF.

Line 11 File name to write interrupt times and number of
faults. "" means no output will be generated.

Line 12 File name to write fault times. "" means no output
will be generated.

Line 13 Seed used for pseudo random generator. Either "random"
or "fixed".

Line 14 Random number distribution. Exponential, Gamma,
or Weibull.

Line 15 Time interval within which all faults are considered
to be simultaneous. If > 0 avoids an application restart
for each interrupt, if they are only a few seconds apart.

Line 16 Time to reboot a node after a fault and probability
of a successful re-integration.

Line 18 Calculated values from the parameters known so far.

Line 19 Estimated system MTBF; i.e., time between node faults.

Line 20 Estimated application MTBF; i.e., time between
application interrupts.

Line 21 Calculated (optimal) checkpoint interval, based on
Daly's equation.

Line 22 Estimated faults per application interrupt.

Line 23 A (false in this case) warning that the simulation may require
longer than usual.

Line 25 Following are the observed values after the simulation
has completed.

Line 26 The amount of work completed. Should always be 100%.

Line 27 The elapsed time, which includes work, checkpoints,
restarts, and redoing of lost work.

Line 28 Time and percentage spent in restart phase.

Line 29 Time and percentage spent in rework (lost work) phase.

Line 30 Time and percentage spent in work phase. Should be
number of work hours specified on command line
(or default).

Line 31 Time and percentage spent check-pointing.

Line 32 Time and percentage spent waiting for fault bursts to
pass (-d option). (Not well tested.)

Line 34 Total should be equal to elapsed time and 100%.

Line 36 Number of restarts and how many of those failed.

Line 37 Number of rework phases and how many of those failed.

Line 38 Number of work phases and how many of those failed.

Line 39 Number of checkpoints and how many of those failed.

Line 41 The total of failed phases listed above. This and
the next line below should always match.

Line 42 Number of interrupts the application experienced.

Line 44 Total number of node faults the system experienced.

Line 45 How many nodes failed during the entire
simulation. Should match the number of faults seen
above. The number of repaired nodes is the same, if
no redundant nodes are present. It might be less with
redundant nodes, since the application was running
with some broken nodes that can be repaired after the
run completes.

Line 46 Statistics about rebooting nodes and re-integration.

Line 47 Observed and calculated number of faults per interrupt.
For non-redundant runs, this should always be 1.
The large mismatch between calculated and simulation is
due to the small number of total interrupts: only four
in this example.

Line 48 Observed system MTBF and comparison to estimate above.

Line 49 Observed application MTBF and comparison to estimate
above. Again, off by quite a bit due to the low number
of interrupts.

Line 50 Calculated elapsed time using Daly's model and
comparison to simulation run.

Line 52 Some performance data about the simulation
itself. Triggered by the -p option.

Line 53 Number random numbers generated (for faults) and number
of random probabilities generated (for soft reboot
success).

Line 54 Number of calls to the rMPI() function.

Line 55 How many faults (application interrupts) were read
from the input file.

Line 56 Wall-clock time of this simulation run.

DESIGN
Some of the design of this program is described in the Sandia
Technical report 2009-6753. Here we briefly cover what each
source file contains.

main.c
Start of program, command line option processing, calculation
of optimal checkpoint interval, and output of the initial
parameters (banner). Also opens input and output files,
if necessary.

app.c, app.h
Basically a state machine that cycles between work,
checkpoint, restart, and rework until all the work has been
done. It calls rMPI to determine when the next application
interrupt will occur.

rMPI_model.c, rMPI_model.h
Figure out which node dies next and when, and whether it
kills the application.

phases.c, phases.h
Handle each phase of the state machine in app.c. Calculate
how long each phase takes, whether a checkpoint needs
to be written, and whether the phase is cut short by an
application interrupt.

report.c, report.h
Print out the final report after the simulation. Includes
calculations to compare against estimates before the
simulation.

data_structs.c, data_structs.h
Data structure to manage nodes and their state.

globals.c, globals.h
Global variables that are shared among several files and
functions.

input.c, input.h
Functions to read interrupt times from an input file.

rnd.c, rnd.h
Compute next node failure time and other random number
related functions.

timing.c, timing.h
Calculate running time of the simulation.

LICENSE
app_model, version 1.0 - a program that mimics a parallel
application's checkpoint/restart behavior on a system with
redundant compute nodes.

Copyright 2009 - 2011 Sandia Corporation. Under the terms
of Contract DE-AC04-94AL85000 with Sandia Corporation, the
U.S. Government retains certain rights in this software.

This program has been released into the public domain using
the GNU General Public License Version 3: you can redistribute
it and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation, either
version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
file LICENSE in this directory for a copy of the GPL 3.