FTI-TMR: A Fault Tolerance and Isolation Algorithm for Interconnected Multicore Systems
By: Yiming Hu
Potential Business Impact:
Keeps computers working even when parts break.
Two-Phase TMR (triple modular redundancy) conserves energy by partitioning redundancy operations into two stages and making the execution of the third task copy optional, yet it remains susceptible to permanent faults. Reactive-TMR (R-TMR) counters this by isolating faulty cores, handling both transient and permanent faults. However, the lightweight hardware required by R-TMR not only increases complexity but also becomes a single point of failure itself. To bypass isolated node constraints, this paper proposes a Fault Tolerance and Isolation TMR (FTI-TMR) algorithm for interconnected multicore systems. FTI-TMR constructs a stability metric to identify the most reliable nodes in the system, which then perform periodic diagnostics to isolate permanent faults. Experimental results show that FTI-TMR reduces task workload by approximately 30% compared with baseline TMR while achieving higher permanent fault coverage.
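The two-stage idea behind Two-Phase TMR can be illustrated with a minimal sketch. This is not the paper's implementation: the function `two_phase_tmr`, the helper `make_task`, and the modeling of a core as an integer ID are all hypothetical names chosen for illustration. The sketch assumes a task is a deterministic function whose results can be compared for equality.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def two_phase_tmr(task: Callable[[int], T], cores: Sequence[int]) -> Tuple[T, int]:
    """Run two copies first; run the third copy only if they disagree.

    Returns (voted_result, number_of_copies_executed).
    All names here are illustrative assumptions, not the paper's API.
    """
    if len(cores) < 3:
        raise ValueError("need at least three cores")

    # Phase 1: execute two copies and compare their results.
    first = task(cores[0])
    second = task(cores[1])
    if first == second:
        return first, 2  # agreement: the third copy is skipped

    # Phase 2: the copies disagree, so the third copy breaks the tie.
    third = task(cores[2])
    if third == first:
        return first, 3
    if third == second:
        return second, 3
    raise RuntimeError("no majority among the three copies")


def make_task(faulty_core: Optional[int]) -> Callable[[int], int]:
    """Hypothetical task: returns 42, except on the faulty core."""
    def task(core: int) -> int:
        return 41 if core == faulty_core else 42
    return task


if __name__ == "__main__":
    print(two_phase_tmr(make_task(None), [0, 1, 2]))  # fault-free run
    print(two_phase_tmr(make_task(1), [0, 1, 2]))     # core 1 is faulty
```

In the fault-free run only two copies execute, which is where the energy saving comes from. If core 1 is permanently faulty, every task falls through to the third copy, which is one way to see why isolating permanently faulty cores matters.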
Similar Papers
FTI-TMR: A Fault Tolerance and Isolation Algorithm for Interconnected Multicore Systems
Distributed, Parallel, and Cluster Computing
Keeps computers working even when parts break.
Fault Tolerant Reconfigurable ML Multiprocessor
Networking and Internet Architecture
Makes computers learn faster and fix themselves.
FTHP-MPI: Towards Providing Replication-based Fault Tolerance in a Fault-Intolerant Native MPI Library
Distributed, Parallel, and Cluster Computing
Keeps supercomputers running when parts break.