Very cool stuff:
Now researchers at Purdue University and high-performance computing experts at the National Nuclear Security Administration’s (NNSA) Lawrence Livermore National Laboratory have solved several problems hindering the use of the ultra-precise simulations. NNSA is the quasi-independent agency within the U.S. Department of Energy that oversees the nation’s nuclear security activities.
The simulations, which are needed to more efficiently certify nuclear weapons, may require 100,000 machines, a level of complexity that is essential to accurately show molecular-scale reactions taking place over milliseconds, or thousandths of a second.
The same types of simulations also could be used in areas such as climate modeling and studying the dynamic changes in a protein’s shape. Such highly complex jobs must be split into many processes that execute in parallel on separate machines in large computer clusters.
“Due to natural faults in the execution environment there is a high likelihood that some processing element will have an error during the application’s execution, resulting in corrupted memory or failed communication between machines,” says Saurabh Bagchi, an associate professor in Purdue University’s School of Electrical and Computer Engineering. “There are bottlenecks in terms of communication and computation.”
These errors compound for as long as the simulation keeps running before the glitch is detected, and they may cause simulations to stall or crash altogether.
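A rough back-of-the-envelope calculation helps show why a fault somewhere in the cluster becomes so likely at this scale. The sketch below is not from the researchers' work. It assumes each machine faults independently with the same probability during a run, and the per-machine probabilities are hypothetical values chosen only for illustration. Only the 100,000-machine figure comes from the article.

```python
def prob_any_failure(n_machines: int, p_single: float) -> float:
    """Probability that at least one of n_machines faults during a run.

    Assumes faults are independent and equally likely on every machine
    (a simplifying assumption, not a claim about real clusters).
    """
    # P(no machine faults) = (1 - p)^n, so P(at least one fault) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_single) ** n_machines


if __name__ == "__main__":
    n = 100_000  # machine count mentioned in the article
    # Hypothetical per-machine fault probabilities for one run
    for p in (1e-6, 1e-5, 1e-4):
        print(f"p = {p:g}: P(at least one fault) = {prob_any_failure(n, p):.4f}")
```

Under these assumptions, even a one-in-100,000 chance of a fault on any single machine gives better-than-even odds that some machine in the cluster faults during the run.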
“We are particularly concerned with errors that corrupt data silently, possibly generating incorrect results with no indication that the error has occurred,” says Bronis R. de Supinski, co-leader of the ASC Application Development Environment Performance Team at Lawrence Livermore. “Errors that significantly reduce system performance are also a major concern since the systems on which the simulations run are very expensive.”
The researchers have developed automated methods to detect a glitch soon after it occurs.