PUMA publications for /user/clausbraun/ABFT%20linear%20fault-tolerance%20algebra (retrieved Mon Mar 19 16:42:05 CET 2018)

Algorithm-based fault tolerance for matrix operations on graphics processing units: analysis and extension to autonomous operation. 2015.
Keywords: ABFT, GPGPU, GPU, SimTech, algebra, algorithm-based, error, error-detection, fault, fault-tolerance, linear, matrix-operations, myown, simulation

A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14), pages 443--454, 2014.
Keywords: ABFT, GPGPU, GPU, SimTech, adaptivity, algebra, algorithm-based, autonomous, error, error-correction, error-detection, fault-tolerance, linear, matrix, matrix-multiplication, metric, myown, rounding, rounding-error
Abstract: Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To meet the high performance and reliability requirements of scientific computing on GPUs, software-based fault tolerance is attractive. Algorithm-Based Fault Tolerance (ABFT) protects important scientific operations such as matrix multiplications. Its application to floating-point operations, however, requires a runtime classification of errors into inevitable rounding errors, tolerable compute errors of the same magnitude as these rounding errors, and critical errors that are larger and therefore not tolerable. An ABFT scheme hence needs suitable rounding error bounds to detect errors reliably. Determining such bounds is a highly challenging task, especially since it has to be integrated tightly into the algorithm and executed autonomously with low performance overhead. In this work, A-ABFT for matrix multiplications on GPUs is introduced: a new, parallel ABFT scheme that determines rounding error bounds autonomously at runtime with low performance overhead and high error coverage.

Efficient Algorithm-Based Fault Tolerance for Sparse Matrix Operations. In Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'16), pages 251--262, 2016.
Keywords: ABFT, SimTech, algebra, error-detection, fault-tolerance, linear, localization, myown, online, sparse
Abstract: We propose a fault tolerance approach for sparse matrix operations that detects errors in the results and implicitly locates them for efficient local correction. This approach reduces the runtime overhead for fault tolerance and provides high error coverage. Existing algorithm-based fault tolerance approaches for sparse matrix operations detect and correct errors, but they often rely on expensive error localization steps; general checkpointing schemes can incur large recovery costs at high error rates. For sparse matrix-vector multiplication, experimental results show an average reduction in runtime overhead of 43.8%, while error coverage improves on average by 52.2% compared to related work. Practical applicability is demonstrated in a case study with an iterative Preconditioned Conjugate Gradient solver: when the error rate is scaled up by four orders of magnitude, the average runtime overhead increases by only 31.3% compared to low error rates.
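The checksum idea behind ABFT for matrix multiplication can be sketched as follows. This is a minimal illustration of the classic Huang/Abraham-style scheme with a simple a-priori rounding-error tolerance; the function name, the `tol_factor` parameter, and the textbook-style error bound are assumptions for the example and are not the adaptive, autonomously determined bounds of the A-ABFT paper.

```python
import numpy as np

def abft_matmul(A, B, tol_factor=10.0):
    """Compute A @ B with checksum-based error detection (sketch).

    A is extended with a column-sum row and B with a row-sum column;
    the product then carries checksums that are compared against
    freshly computed sums of the result block.
    """
    n = A.shape[1]                                      # inner dimension
    Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum row
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum column
    C = Ac @ Br                                         # full checksum product

    # Assumed, textbook-style rounding-error bound per element:
    # tol_factor * n * eps * (|A| @ |B|). Errors below this bound are
    # treated as tolerable rounding noise, errors above it as critical.
    eps = np.finfo(C.dtype).eps
    bound = tol_factor * n * eps * (np.abs(Ac) @ np.abs(Br))

    result = C[:-1, :-1]
    col_err = np.abs(C[-1, :-1] - result.sum(axis=0)) > bound[-1, :-1]
    row_err = np.abs(C[:-1, -1] - result.sum(axis=1)) > bound[:-1, -1]
    ok = not (col_err.any() or row_err.any())
    return result, ok
```

A silent fault in the multiplication perturbs one element of `result`, which makes exactly one row checksum and one column checksum disagree beyond the bound; the intersection of the failing row and column locates the faulty element.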
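For sparse matrix-vector multiplication, a checksum can likewise be carried alongside the computation. The sketch below shows detection only, for a CSR-format matrix: the column-sum vector c = Aᵀ·1 is built once, and sum(y) is compared against c·x afterwards. The function name, the `tol_factor` parameter, and the simple nnz-scaled rounding tolerance are assumptions for illustration; the paper's approach additionally localizes errors implicitly for local correction, which this sketch omits.

```python
import numpy as np

def csr_spmv_checked(indptr, indices, data, x, tol_factor=10.0):
    """y = A @ x for a CSR matrix A, with a global checksum test (sketch)."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):                 # plain row-wise CSR SpMV
        lo, hi = indptr[i], indptr[i + 1]
        y[i] = data[lo:hi] @ x[indices[lo:hi]]

    # Checksum vector c holds the column sums of A; c @ x must match sum(y).
    c = np.zeros(len(x))
    np.add.at(c, indices, data)
    expected = c @ x

    # Assumed rounding tolerance scaled by the number of nonzeros.
    eps = np.finfo(y.dtype).eps
    bound = tol_factor * len(data) * eps * (np.abs(data) @ np.abs(x[indices]))
    ok = abs(y.sum() - expected) <= bound
    return y, ok
```

Because the checksum vector depends only on A, it can be precomputed once and reused across the many SpMV calls of an iterative solver such as Preconditioned Conjugate Gradient, which keeps the per-iteration overhead low.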