FTI-TMR: A Fault Tolerance and Isolation Algorithm for Interconnected Multicore Systems

Published: October 19, 2025 | arXiv ID: 2510.16896v2

By: Yiming Hu

Potential Business Impact:

Keeps computers working even when parts break.

Business Areas:
Transaction Processing, Financial Services, Payments, Software

Two-Phase TMR conserves energy by partitioning redundancy operations into two stages and making the execution of the third task copy optional, yet it remains susceptible to permanent faults. Reactive-TMR (R-TMR) counters this by isolating faulty cores, handling both transient and permanent faults. However, the lightweight hardware required by R-TMR not only increases complexity but also becomes a single point of failure itself. To bypass isolated node constraints, this paper proposes a Fault Tolerance and Isolation TMR (FTI-TMR) algorithm for interconnected multicore systems. The algorithm constructs a stability metric to identify the most reliable nodes in the system, which then perform periodic diagnostics to isolate permanent faults. Experimental results show that FTI-TMR reduces task workload by approximately 30% compared with baseline TMR while achieving higher permanent fault coverage.
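
The two-stage idea behind Two-Phase TMR can be illustrated with a minimal sketch. It is not the paper's implementation, and it does not model FTI-TMR's stability metric or diagnostics, which the abstract does not specify. The function name `two_phase_tmr`, the `task(core_id)` calling convention, and the returned copy count are all assumptions made for illustration: two copies run first, and the third copy runs only when they disagree.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple


def two_phase_tmr(
    task: Callable[[int], Hashable], cores: Sequence[int]
) -> Tuple[Hashable, int]:
    """Hypothetical sketch of two-phase triple modular redundancy.

    Phase 1 runs the task on the first two cores. If the results agree,
    the third copy is skipped. Phase 2 runs the third copy and takes a
    majority vote. Returns (voted_result, copies_executed).
    """
    if len(cores) != 3:
        raise ValueError("expected exactly three cores")

    # Phase 1: two copies only.
    first = task(cores[0])
    second = task(cores[1])
    if first == second:
        return first, 2

    # Phase 2: the optional third copy breaks the tie.
    third = task(cores[2])
    winner, votes = Counter([first, second, third]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among the three copies")
    return winner, 3


def healthy(core: int) -> int:
    # Every core returns the correct value.
    return 42


def faulty_core_1(core: int) -> int:
    # Core 1 always returns a wrong value, mimicking a permanent fault.
    return 41 if core == 1 else 42


print(two_phase_tmr(healthy, [0, 1, 2]))        # (42, 2)
print(two_phase_tmr(faulty_core_1, [0, 1, 2]))  # (42, 3)
```

In the fault-free case only two copies execute, which is where the energy saving comes from. A permanently faulty core, as in the second call, forces the third copy on every task, which is why this scheme alone does not address permanent faults.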

Page Count
10 pages

Category
Computer Science:
Distributed, Parallel, and Cluster Computing