Author of the publication

Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery.

, , and . PDP, page 178-185. IEEE Computer Society, (2018)

Other publications of authors with the same name

Job-Site Level Fault Tolerance for Cluster and Grid environments., , , , , , and . CLUSTER, page 1-9. IEEE Computer Society, (2005)

Blue Gene/L Log Analysis and Time to Interrupt Estimation., , , , , , , and . ARES, page 173-180. IEEE Computer Society, (2009)

Symmetric Active/Active Replication for Dependent Services., , , and . ARES, page 260-267. IEEE Computer Society, (2008)

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance., , , and . IPDPS, page 1-10. IEEE, (2007)

Machine Learning Models for GPU Error Prediction in a Large Scale HPC System., , , , , , and . DSN, page 95-106. IEEE Computer Society, (2018)

Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale., , , and . CLUSTER, page 758-765. IEEE Computer Society, (2017)

Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy., , , , , , and . DSN, page 311-322. IEEE Computer Society, (2016)

Detection and correction of silent data corruption for large-scale high-performance computing., , , , , and . SC, page 78. IEEE/ACM, (2012)

Improving the Performance of the Extreme-Scale Simulator., and . DS-RT, page 198-207. IEEE Computer Society, (2014)

A tunable holistic resiliency approach for high-performance computing systems., , , , , , , , , and 4 other author(s). PPOPP, page 305-306. ACM, (2009)