PUMA publications for /tag/abfthttps://puma.ub.uni-stuttgart.de/tag/abftPUMA RSS feed for /tag/abft2024-03-29T09:55:38+01:00Algorithm-based fault tolerance for matrix operations on graphics processing units: analysis and extension to autonomous operation.https://puma.ub.uni-stuttgart.de/bibtex/278feed56c1636b8fcbfd657450c145bd/clausbraunclausbraun2018-03-19T16:42:05+01:00ABFT GPGPU GPU SimTech algebra algorithm-based error error-detection fault fault-tolerance linear matrix-operations myown simulation <meta content="thesis" itemprop="educationalUse"/><span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/1f4aa6bff08e99d1685a2218270cadc80/author/0"><span itemprop="name">C. Braun</span></a></span></span>. </span><span class="additional-entrytype-information"><em>University of Stuttgart, </em>(<em><span>2015<meta content="2015" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:42:05 CET 2018Algorithm-based fault tolerance for matrix operations on graphics processing units: analysis and extension to autonomous operation.2015ABFT GPGPU GPU SimTech algebra algorithm-based error error-detection fault fault-tolerance linear matrix-operations myown simulation A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Unitshttps://puma.ub.uni-stuttgart.de/bibtex/2a6dcd392b900956871dd3cfde89cd481/clausbraunclausbraun2018-03-19T16:15:07+01:00ABFT GPGPU GPU SimTech adaptivity algebra algorithm-based autonompous error error-correction error-detection fault-tolerance linear matrix matrix-multiplication metric myown rounding rounding-error <span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/1f672f35af5f6b825bda005ca703be294/author/0"><span 
itemprop="name">C. Braun</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Sebastian Halder" itemprop="url" href="/person/1f672f35af5f6b825bda005ca703be294/author/1"><span itemprop="name">S. Halder</span></a></span>, </span> and <span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hans-Joachim Wunderlich" itemprop="url" href="/person/1f672f35af5f6b825bda005ca703be294/author/2"><span itemprop="name">H. Wunderlich</span></a></span></span>. </span><span class="additional-entrytype-information"><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14)</span>, </em></span><em>page <span itemprop="pagination">443--454</span>. </em>(<em><span>2014<meta content="2014" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:15:07 CET 2018Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'14)443--454{A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units}2014ABFT GPGPU GPU SimTech adaptivity algebra algorithm-based autonompous error error-correction error-detection fault-tolerance linear matrix matrix-multiplication metric myown rounding rounding-error Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of software-based fault tolerance is attractive. Algorithm-Based Fault Tolerance (ABFT) protects important scientific operations like matrix multiplications. 
However, applying ABFT to floating-point operations requires classifying errors at runtime into inevitable rounding errors, tolerable compute errors on the order of such rounding errors, and critical errors that are larger and cannot be tolerated. Hence, an ABFT scheme needs suitable rounding error bounds to detect errors reliably. Determining such bounds is highly challenging, especially since the computation has to be tightly integrated into the algorithm and executed autonomously with low performance overhead.
In this work, A-ABFT for matrix multiplications on GPUs is introduced, which is a new, parallel ABFT scheme that determines rounding error bounds autonomously at runtime with low performance overhead and high error coverage.Low-Overhead Fault-Tolerance for the Preconditioned Conjugate Gradient Solverhttps://puma.ub.uni-stuttgart.de/bibtex/28c90a682adda1e125eb007f0c70bd70a/clausbraunclausbraun2018-03-19T16:15:07+01:00ABFT CG PCG SimTech conjugate error error-correction error-detection fault fault-tolerance gradient linear myown preconditioned solver sparse system <span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Alexander Schöll" itemprop="url" href="/person/1d133c0d9eda7017c266a9d01721a9c91/author/0"><span itemprop="name">A. Schöll</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/1d133c0d9eda7017c266a9d01721a9c91/author/1"><span itemprop="name">C. Braun</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Michael A. Kochte" itemprop="url" href="/person/1d133c0d9eda7017c266a9d01721a9c91/author/2"><span itemprop="name">M. Kochte</span></a></span>, </span> and <span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hans-Joachim Wunderlich" itemprop="url" href="/person/1d133c0d9eda7017c266a9d01721a9c91/author/3"><span itemprop="name">H. Wunderlich</span></a></span></span>. </span><span class="additional-entrytype-information"><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Proceedings of the International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'15)</span>, </em></span><em>page <span itemprop="pagination">60-65</span>. 
</em>(<em><span>2015<meta content="2015" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:15:07 CET 2018Proceedings of the International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT'15)60-65{Low-Overhead Fault-Tolerance for the Preconditioned Conjugate Gradient Solver}2015ABFT CG PCG SimTech conjugate error error-correction error-detection fault fault-tolerance gradient linear myown preconditioned solver sparse system Linear system solvers are an integral part of many different compute-intensive applications, and they benefit from the compute power of heterogeneous computer architectures. However, the growing spectrum of reliability threats for such nano-scaled CMOS devices makes the integration of fault tolerance mandatory. The preconditioned conjugate gradient (PCG) method is a widely used solver because it typically finds solutions faster than direct methods. Although this iterative approach is able to tolerate certain errors, the latest research shows that the PCG solver is still vulnerable to transient effects. Even single errors caused, for instance, by marginal hardware, harsh environments, or particle radiation can considerably affect execution times or lead to silent data corruption. In this work, a novel fault-tolerant PCG solver with extremely low runtime overhead is proposed. Since the error detection method does not involve expensive operations, it scales very well with increasing problem sizes. In case of errors, the method selects between three different correction methods according to the identified error. Experimental results show a runtime overhead for error detection ranging only from 0.04% to 1.70%. 
Efficient On-Line Fault-Tolerance for the Preconditioned Conjugate Gradient Methodhttps://puma.ub.uni-stuttgart.de/bibtex/27e5f4629e5616459c867bc30d3893e78/clausbraunclausbraun2018-03-19T16:15:07+01:00ABFT CG PCG SimTech conjugate efficiency fault fault-tolerance gradient linear myown preconditioned solver sparse <span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Alexander Schöll" itemprop="url" href="/person/17911878633d0e9fce50d136187cf87a4/author/0"><span itemprop="name">A. Schöll</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/17911878633d0e9fce50d136187cf87a4/author/1"><span itemprop="name">C. Braun</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Michael A. Kochte" itemprop="url" href="/person/17911878633d0e9fce50d136187cf87a4/author/2"><span itemprop="name">M. Kochte</span></a></span>, </span> and <span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hans-Joachim Wunderlich" itemprop="url" href="/person/17911878633d0e9fce50d136187cf87a4/author/3"><span itemprop="name">H. Wunderlich</span></a></span></span>. </span><span class="additional-entrytype-information"><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Proceedings of the 21st IEEE International On-Line Testing Symposium (IOLTS'15)</span>, </em></span><em>page <span itemprop="pagination">95--100</span>. 
</em>(<em><span>2015<meta content="2015" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:15:07 CET 2018Proceedings of the 21st IEEE International On-Line Testing Symposium (IOLTS'15)95--100{Efficient On-Line Fault-Tolerance for the Preconditioned Conjugate Gradient Method}2015ABFT CG PCG SimTech conjugate efficiency fault fault-tolerance gradient linear myown preconditioned solver sparse Linear system solvers are key components of many scientific applications and they can benefit significantly from modern heterogeneous computer architectures. However, such nano-scaled CMOS devices face an increasing number of reliability threats, which make the integration of fault tolerance mandatory. The preconditioned conjugate gradient method (PCG) is a very popular solver since it typically finds solutions faster than direct methods, and it is less vulnerable to transient effects. However, as latest research shows, the vulnerability is still considerable. Even single errors caused, for instance, by marginal hardware, harsh operating conditions or particle radiation can increase execution times considerably or corrupt solutions without indication. In this work, a novel and highly efficient fault-tolerant PCG method is presented. The method applies only two inner products to reliably detect errors. In case of errors, the method automatically selects between roll-back and efficient on-line correction. 
This significantly reduces the error detection overhead and expensive re-computations.Efficacy and Efficiency of Algorithm-Based Fault Tolerance on GPUshttps://puma.ub.uni-stuttgart.de/bibtex/292cad6c6d7a90044e7289f504f6f4cf7/clausbraunclausbraun2018-03-19T16:15:07+01:00ABFT GPGPU SimTech algorithm-based computing errors fault fault-tolerance myown scientific simulation <span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hans-Joachim Wunderlich" itemprop="url" href="/person/1852ec5b9e00df1c4437700418d91759c/author/0"><span itemprop="name">H. Wunderlich</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/1852ec5b9e00df1c4437700418d91759c/author/1"><span itemprop="name">C. Braun</span></a></span>, </span> and <span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Sebastian Halder" itemprop="url" href="/person/1852ec5b9e00df1c4437700418d91759c/author/2"><span itemprop="name">S. Halder</span></a></span></span>. </span><span class="additional-entrytype-information"><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Proceedings of the IEEE International On-Line Testing Symposium (IOLTS'13)</span>, </em></span><em>page <span itemprop="pagination">240--243</span>. </em>(<em><span>2013<meta content="2013" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:15:07 CET 2018Proceedings of the IEEE International On-Line Testing Symposium (IOLTS'13)240--243{Efficacy and Efficiency of Algorithm-Based Fault Tolerance on GPUs}2013ABFT GPGPU SimTech algorithm-based computing errors fault fault-tolerance myown scientific simulation Computer simulations drive innovations in science and industry, and they are gaining more and more importance. 
However, their high computational demand generates extraordinary challenges for computing systems. Typical high-performance computing systems, which provide sufficient performance and high reliability, are extremely expensive.
Modern GPUs offer high performance at very low costs, and they enable simulation applications on the desktop. However, they are increasingly prone to transient effects and other reliability threats. To fulfill the strict reliability requirements in scientific computing and simulation technology, appropriate fault tolerance measures have to be integrated into simulation applications for GPUs. Algorithm-Based Fault Tolerance on GPUs has the potential to meet these requirements.
In this work, we investigate the efficiency and the efficacy of ABFT for matrix operations on GPUs. We compare ABFT against fault tolerance schemes that are based on redundant computations, and we evaluate its error detection capabilities.Algorithmen-basierte Fehlertoleranz für Many-Core-Architekturen;
Algorithm-based Fault-Tolerance on Many-Core Architectureshttps://puma.ub.uni-stuttgart.de/bibtex/22df42dffc60148e2d03c7de55549a3dc/clausbraunclausbraun2018-03-19T16:15:07+01:00ABFT Fehlertoleranz GPU Rechnerarchitekturen SimTech Zuverlässigkeit myown <span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/1ac4312dfc7d2a467ecb4d4c9868fe852/author/0"><span itemprop="name">C. Braun</span></a></span>, </span> and <span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hans-Joachim Wunderlich" itemprop="url" href="/person/1ac4312dfc7d2a467ecb4d4c9868fe852/author/1"><span itemprop="name">H. Wunderlich</span></a></span></span>. </span><span class="additional-entrytype-information"><span itemtype="http://schema.org/PublicationIssue" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="journal">it - Information Technology</span>, </em> <em><span itemtype="http://schema.org/PublicationVolume" itemscope="itemscope" itemprop="isPartOf"><span itemprop="volumeNumber">52 </span></span>(<span itemprop="issueNumber">4</span>):
<span itemprop="pagination">209--215</span></em> </span>(<em><span>2010<meta content="2010" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:15:07 CET 2018it - Information Technology4209--215{Algorithmen-basierte Fehlertoleranz für Many-Core-Architekturen;
Algorithm-based Fault-Tolerance on Many-Core Architectures}522010ABFT Fehlertoleranz GPU Rechnerarchitekturen SimTech Zuverlässigkeit myown Moderne Many-Core-Architekturen bieten ein sehr hohes Potenzial an Rechenleistung. Dies macht sie besonders für Anwendungen aus dem Bereich des wissenschaftlichen Hochleistungsrechnens und der Simulationstechnik attraktiv. Die Architekturen folgen dabei einem Ausführungsparadigma, das sich am besten durch den Begriff „Many-Threading“ beschreiben lässt. Wie alle nanoelektronischen Halbleiterschaltungen leiden auch Many-Core-Prozessoren potentiell unter störenden Einflüssen von transienten Fehlern (soft errors) und diversen Arten von Variationen. Diese Faktoren können die Zuverlässigkeit von Systemen negativ beeinflussen und erfordern Fehlertoleranz auf allen Ebenen, von der Hardware bis zur Software. Auf der Softwareseite stellt die Algorithmen-basierte Fehlertoleranz (ABFT) eine ausgereifte Technik zur Verbesserung der Zuverlässigkeit dar. Der Aufwand für die Anpassung dieser Technik an moderne Many-Threading-Architekturen darf jedoch keinesfalls unterschätzt werden. In diesem Beitrag wird eine effiziente und fehlertolerante Abbildung der Matrixmultiplikation auf eine moderne Many-Core-Architektur präsentiert. Die Fehlertoleranz ist dabei integraler Bestandteil der Abbildung und wird durch ein ABFT-Schema realisiert, das die Leistung nur unwesentlich beeinträchtigt.
Modern many-core architectures provide a high computational potential, which makes them particularly interesting for applications from the fields of scientific high-performance computing and simulation technology. The execution paradigm of these architectures is best described as “Many-Threading”. Like all nano-scaled semiconductor devices, many-core processors are prone to transient errors (soft errors) and different kinds of variations that can have severe impact on the reliability of such systems. Therefore, fault-tolerance has to be incorporated at all levels, from the hardware up to the software. On the software side, Algorithm-based Fault Tolerance (ABFT) is a mature technique to improve the reliability. However, significant effort is required to adapt this technique to modern many-threading architectures. In this article, an efficient and fault-tolerant mapping of the matrix multiplication to a modern many-core architecture is presented. Fault-tolerance is thereby an integral part of the mapping and implemented through an ABFT scheme with marginal impact on the overall performance.Efficient Algorithm-Based Fault Tolerance for Sparse Matrix Operationshttps://puma.ub.uni-stuttgart.de/bibtex/2973313fa8f49996637b903eadac4fc19/clausbraunclausbraun2018-03-19T16:15:07+01:00ABFT SimTech algebra error-detection fault-tolerance linear localization myown online sparse <span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Alexander Schöll" itemprop="url" href="/person/13e8ca3c5bdad028a00edd6b200f98b53/author/0"><span itemprop="name">A. Schöll</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/13e8ca3c5bdad028a00edd6b200f98b53/author/1"><span itemprop="name">C. 
Braun</span></a></span>, </span><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Michael A. Kochte" itemprop="url" href="/person/13e8ca3c5bdad028a00edd6b200f98b53/author/2"><span itemprop="name">M. Kochte</span></a></span>, </span> and <span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hans-Joachim Wunderlich" itemprop="url" href="/person/13e8ca3c5bdad028a00edd6b200f98b53/author/3"><span itemprop="name">H. Wunderlich</span></a></span></span>. </span><span class="additional-entrytype-information"><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'16)</span>, </em></span><em>page <span itemprop="pagination">251--262</span>. </em>(<em><span>2016<meta content="2016" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:15:07 CET 2018Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'16)251--262{Efficient Algorithm-Based Fault Tolerance for Sparse Matrix Operations}2016ABFT SimTech algebra error-detection fault-tolerance linear localization myown online sparse We propose a fault tolerance approach for sparse matrix operations that detects and implicitly locates errors in the results for efficient local correction. This approach reduces the runtime overhead for fault tolerance and provides high error coverage. Existing algorithm-based fault tolerance approaches for sparse matrix operations detect and correct errors, but they often rely on expensive error localization steps. General checkpointing schemes can induce large recovery cost for high error rates. For sparse matrix-vector multiplications, experimental results show an average reduction in runtime overhead of 43.8%, while the error coverage is on average improved by 52.2% compared to related work. 
The practical applicability is demonstrated in a case study using the iterative Preconditioned Conjugate Gradient solver. When scaling the error rate by four orders of magnitude, the average runtime overhead increases only by 31.3% compared to low error rates.Algorithm-Based Fault Tolerance for Many-Core Architectureshttps://puma.ub.uni-stuttgart.de/bibtex/28ae495f897e6e1e00d2159bae7ffc325/clausbraunclausbraun2018-03-19T16:15:07+01:00ABFT GPGPU GPU SimTech fault-tolerance imported myown <span data-person-type="author" class="authorEditorList "><span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Claus Braun" itemprop="url" href="/person/12c870c5be652c307dc565990657ae91c/author/0"><span itemprop="name">C. Braun</span></a></span>, </span> and <span><span itemtype="http://schema.org/Person" itemscope="itemscope" itemprop="author"><a title="Hans-Joachim Wunderlich" itemprop="url" href="/person/12c870c5be652c307dc565990657ae91c/author/1"><span itemprop="name">H. Wunderlich</span></a></span></span>. </span><span class="additional-entrytype-information"><span itemtype="http://schema.org/Book" itemscope="itemscope" itemprop="isPartOf"><em><span itemprop="name">Proceedings of the 15th IEEE European Test Symposium (ETS'10)</span>, </em></span><em>page <span itemprop="pagination">253--253</span>. </em><em><span itemprop="publisher">IEEE Computer Society</span>, </em>(<em><span>2010<meta content="2010" itemprop="datePublished"/></span></em>)</span>Mon Mar 19 16:15:07 CET 2018Proceedings of the 15th IEEE European Test Symposium (ETS'10)253--253{Algorithm-Based Fault Tolerance for Many-Core Architectures}2010ABFT GPGPU GPU SimTech fault-tolerance imported myown Modern many-core architectures with hundreds of cores provide a high computational potential. This makes them particularly interesting for scientific high-performance computing and simulation technology. 
Like all nano-scaled semiconductor devices, many-core processors are prone to reliability-harming factors such as variations and soft errors. One way to improve the reliability of such systems is software-based hardware fault tolerance. Here, the software is able to detect and correct errors introduced by the hardware. In this work, we propose a software-based approach to improve the reliability of matrix operations on many-core processors. These operations are key components in many scientific applications.
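
The abstracts in this feed describe software-based detection and correction of errors in matrix operations only at a high level. As a rough illustration of the classic checksum idea commonly associated with ABFT for matrix multiplication, the following minimal Python sketch may help. It is not the scheme of any of the listed publications: all function names are hypothetical, the code runs on the CPU rather than a GPU, and the fixed tolerance `tol` is an assumption chosen by hand, whereas the A-ABFT work listed above determines rounding error bounds autonomously at runtime.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Return a copy of a with one extra row holding each column's sum."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Return a copy of b with one extra column holding each row's sum."""
    return [row[:] + [sum(row)] for row in b]


def find_mismatches(c_full, tol):
    """Compare recomputed row/column sums of the data part of c_full
    against its checksum column/row; tol is a hand-picked assumption."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols


def correct_single_error(c_full, bad_rows, bad_cols):
    """Repair one faulty data element located by exactly one bad row
    and one bad column; return False if the pattern is anything else."""
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        return False
    i, j = bad_rows[0], bad_cols[0]
    m = len(c_full[0]) - 1
    # The row-sum discrepancy equals the error in element (i, j).
    c_full[i][j] -= sum(c_full[i][:m]) - c_full[i][m]
    return True


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# The product of the checksum-extended operands carries its own checksums.
c_full = matmul(add_column_checksum(a), add_row_checksum(b))
clean = find_mismatches(c_full, tol=1e-9)

c_full[1][0] += 1.0  # simulate a single corrupted result element
faulty = find_mismatches(c_full, tol=1e-9)
repaired = correct_single_error(c_full, *faulty)
```

In this sketch a single corrupted element shows up as exactly one inconsistent row and one inconsistent column, which both locates and sizes the error. With real floating-point data the sums differ slightly even without faults, which is why the choice of tolerance matters and why a fixed value like the one assumed here is generally not sufficient.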