Computer Science Seminar: Chao Chen

Time

-

Locations

Stuart Building, Room 111, 10 West 31st Street, Chicago, IL 60616

This event is open to all Illinois Tech faculty and staff. 

Abstract

Transient faults are becoming a significant concern for emerging extreme-scale high-performance computing (HPC) systems. This nascent problem is exacerbated by technology trends toward smaller transistor sizes, higher circuit density, and the use of near-threshold voltage techniques to save power. Such faults can corrupt the execution of long-running scientific applications by causing either silent data corruptions (SDCs; incorrect values in outputs) or soft failures (abnormal termination, e.g., process crashes). While SDCs undermine confidence in computations and could lead to inaccurate and untrustworthy scientific insights, soft failures degrade system efficiency and performance, since the impacted jobs must be restarted from their checkpoints and must re-execute the lost computations before continuing normal operation. As a consequence, transient fault detection and recovery must be addressed in HPC system design, for both usability (trust in the output results) and efficiency (speedup and energy efficiency). In particular, solutions must have very low overhead during regular execution and be able to detect (and potentially recover from) a large set of faults with negligible downtime.

In this talk, Chao Chen will present two compiler-driven resilience techniques, LADR and CARE, designed for SDC detection and soft failure (SF) recovery, respectively. By exploiting application knowledge through compiler techniques, both achieve high fault coverage (approximately 80 percent) while incurring negligible or even zero runtime overhead. Chen will first describe LADR, which detects SDCs in scientific applications by watching for data anomalies in their state variables (those of scientific interest) and employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads. The compiler analysis uses the algebraic properties of the underlying data flow to select the variables in which a fault appears in magnified form. The technique maintains a high level of fault coverage with low false positive rates. Chen will then introduce CARE, a compiler-assisted online recovery technique for soft failures. CARE can repair a crashed process on the fly, within milliseconds, allowing applications to continue executing instead of being terminated and restarted, and it incurs zero runtime overhead during normal execution. For recovery, it uses the program's live variables resident in registers to reconstruct the failed computation. Finally, Chen will conclude the talk by describing future directions toward applying compiler technologies for efficient implementation of the desired system properties.

Bio

Chen is a Ph.D. candidate in the School of Computer Science at Georgia Tech, advised by Santosh Pande and Greg Eisenhauer. His research interests are broadly in the areas of compilers and systems, with thesis research on lightweight resilience techniques for HPC applications that exploit application properties. His work has appeared in top-tier HPC venues and was nominated for Best Student Paper at SC ’19.
