Workshop APRETAF : Algorithmes Parallèles, Répartis Et Tolérance Aux Fautes

Les jeudi 22 et vendredi 23 janvier 2009 à Grenoble

APRETAF est un workshop organisé conjointement par des membres du LIG et de VERIMAG. Il sera organisé les 22 et 23 janvier 2009 à Grenoble. Ce workshop porte sur les différentes approches considérées pour la tolérance aux fautes dans les applications parallèles et réparties. L'accent est mis sur l'algorithmique sous-jacente et les garanties de performance pouvant être obtenues.

Mots-clés :

Algorithmes parallèles/distribués
Tolérance aux fautes
Certification
Codes correcteurs
Auto-stabilisation
Détecteur de défaillances
Ordonnancement

Contact : apretaf@imag.fr

Accès : Les présentations auront lieu dans l'amphithéâtre F018 de l'UFR IMA. A partir de la gare de Grenoble, prendre la ligne B du tramway direction Gière, plaine des sports, descendre à l'arrêt Gabriel Fauré, puis suivre le plan pour aller à l'UFR IMA.

Planning
Jeudi 14h00-15h00 Franck Petit (INRIA / LIP Lyon) Stabilisation et synchronisation d'horloges logiques

15h00-16h00 Alain Girault (LIG / INRIA PopArt) Ordonnancement avec tolérance aux fautes

16h00-16h30 Pause

16h30-17h30 Jean-Louis Roch et Thierry Gautier (LIG / INRIA MOAIS) Certification de performances et de résultats sur une infrastructure de calcul ambiante
17h30-19h00

Yoann Dieudonné (MIS / UPJV), Formation de motifs par une cohorte de robots mobiles et autonomes
Xavier Besseron (LIG / INRIA MOAIS), Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model
Mohamed-Slim Bouguerra (LIG / INRIA MOAIS), A new flexible checkpoint/Restart model

19h30 Diner

Vendredi 09h00-10h00 Hugues Fauconnier (LIAFA / Paris VII) Détecteurs de défaillances

10h00-10h30 Pause

10h30-12h00

Mathieu Cunche (INRIA Grenoble / PLANETE), Codage pour la tolérance aux pertes
Thomas Roche (LIG / INRIA MOAIS), Certification of linear algebra operations on large scale global computing systems
Cédric Tedeschi (INRIA / Sophia Antipolis), Self-Stabilization in Tree-Structured Peer-to-Peer Service Discovery Systems

12h00-13h30 Déjeuner

13h30-14h30 Sébastien Tixeuil (LIP6 / Paris VI) Sur l'auto-stabilisation des robots mobiles dans les graphes

14h30-15h00 Pause

15h00-16h00

Borzoo Bonakdarpour (VERIMAG / Université de Grenoble I), Symbolic synthesis of masking fault-tolerant distributed programs
Stéphane Devismes (VERIMAG / Université Grenoble I), With finite memory consensus is easier than reliable broadcast

**Planning**
Jeudi	14h00-15h00	Franck Petit (INRIA / LIP Lyon)	Stabilisation et synchronisation d'horloges logiques
15h00-16h00	Alain Girault (LIG / INRIA PopArt)	Ordonnancement avec tolérance aux fautes
16h00-16h30	Pause
16h30-17h30	Jean-Louis Roch et Thierry Gautier (LIG / INRIA MOAIS)	Certification de performances et de résultats sur une infrastructure de calcul ambiante
17h30-19h00	Yoann Dieudonné (MIS / UPJV), Formation de motifs par une cohorte de robots mobiles et autonomes Xavier Besseron (LIG / INRIA MOAIS), Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model Mohamed-Slim Bouguerra (LIG / INRIA MOAIS), A new flexible checkpoint/Restart model
19h30	Diner
Vendredi	09h00-10h00	Hugues Fauconnier (LIAFA / Paris VII)	Détecteurs de défaillances
10h00-10h30	Pause
10h30-12h00	Mathieu Cunche (INRIA Grenoble / PLANETE), Codage pour la tolérance aux pertes Thomas Roche (LIG / INRIA MOAIS), Certification of linear algebra operations on large scale global computing systems Cédric Tedeschi (INRIA / Sophia Antipolis), Self-Stabilization in Tree-Structured Peer-to-Peer Service Discovery Systems
12h00-13h30	Déjeuner
13h30-14h30	Sébastien Tixeuil (LIP6 / Paris VI)	Sur l'auto-stabilisation des robots mobiles dans les graphes
14h30-15h00	Pause
15h00-16h00	Borzoo Bonakdarpour (VERIMAG / Université de Grenoble I), Symbolic synthesis of masking fault-tolerant distributed programs Stéphane Devismes (VERIMAG / Université Grenoble I), With finite memory consensus is easier than reliable broadcast

Alain Girault

Automating the Addition of Fault Tolerance with Discrete Controller Synthesis.

Résumé : Discrete controller synthesis (DCS) is a formal approach, based on the same state-space exploration algorithms as model-checking. Its interest lies in the ability to obtain automatically systems satisfying by construction formal properties specified a priori. In this paper, our aim is to demonstrate the feasibility of this approach for fault tolerance. We start with a fault intolerant program, modeled as the synchronous parallel composition of finite labeled transition systems; we specify formally a fault hypothesis; we state some fault tolerance requirements; and we use DCS to obtain automatically a program, having the same behavior as the initial fault intolerant one in the absence of faults, and satisfying the fault tolerance requirements under the fault hypothesis. Our original contribution resides in the demonstration that DCS can be elegantly used to design fault tolerant systems, with guarantees on key properties of the obtained system, such as the fault tolerance level, the satisfaction of quantitative constraints, and so on. We show with numerous examples taken from case studies that our method can address different kinds of failures (crash, value, or Byzantine) affecting different kinds of hardware components (processors, communication links, actuators, or sensors). Besides, we show that our method also offers an optimality criterion very useful to synthesize fault tolerant systems compliant to the constraints of embedded systems, like power consumption.

Yoann Dieudonné

Formation de motifs par une cohorte de robots mobiles et autonomes.

Résumé : Etant donnée une cohorte de robots dont les principales caractéristiques sont d'êtres homogènes, autonomes, anonymes, désorientés, sans memoire et sans medium de communication : quelle tâches peuvent-ils effectuer sous de telle conditions? Nous aborderons le cas du problème de la formation de motifs qui consiste à faire en sorte que les robots s'agencent dans l'espace afin que leurs positions coïncident avec les points d'un motif préalablement connu de tous.

Mots clés : Robot, formation de motif, auto-organisation

Xavier besseron

Optimized Coordinated Checkpoint/Rollback Protocol using a Dataflow Graph Model.

Résumé : Fault-tolerance protocols play an important role in today long runtime scienti\ufb01c parallel applications. The probability of a failure may be important due to the number of unreliable components involved during an execution. We present our approach and preliminary results about a new checkpoint/rollback protocol based on a coordinated scheme. The application is described using a dataflow graph, which is an abstract representation of the execution. Thanks to this representation, the fault recovery in our protocol only requires a partial restart of other processes. Simulations on a domain decomposition application show that the amount of computations required to restart and the number of involved processes are reduced compared to the classical global rollback protocol.

Mots clés : grid, distributed computing, fault-tolerance, data \ufb02ow graph.

Mohamed-Slim Bouguerra

A new flexible checkpoint/Restart model.

Résumé : This work presents a new model of coordinated checkpoint/restart mechanism for several types of computing platforms. Its main feature is that it is independent from the failure law which makes it very flexible. We will show that such a model may be used to determine the optimal periodic checkpoint interval and to reduce the checkpoint overhead through mathematical analysis of reliability. Moreover, unlike most of the existing checkpointing models, the proposed model is able to take into account a variable checkpoint cost.

Mots clés : Parallel processing - Fault tolerance - Checkpointing.

Mathieu Cunche

Codage pour la tolérance aux pertes.

Résumé : Cette présentation s'intéresse à une brique capitale pour de nombreuses applications : les codes correcteurs d'erreurs pour les "canaux à effacement". Ces codes, qui habituellement opèrent au niveau des couches transport ou application , sont complémentaires des codes que l'on trouve au niveau de la couche physique. Grâce à un ajout de redondance ces codes permettent une récupération des données effacées. Les pannes sur un système de stockage distribué de données et les pertes de paquets sur un réseau sont des exemples d'effacements pouvant être corrigés par ces codes. Outre la capacité de correction des codes, la complexité des algorithmes d'encodage et de décodage est particulèrement importante . En effet il n'est pas rare de travailler avec des mots de code longs de plusieurs milliers de symboles. Durant cette présentation nous présenterons les principes du codage pour le canal à effacement et nous ferons un survol des solutions existantes.

Mots clés : codes correcteurs, canal à effacement, tolérance aux pertes, hautes performances.

Hugues Fauconnier

Objets partagés, détecteurs de défaillances et hiérarchie de consensus.

Résumé : On considère généralement deux modèles principaux pour l'algorithmique distribuée asynchrone: celui où les processus partagent des registres atomiques (ou plus généralement des objets comme test&set, compare&swap ...) et celui où les processus communiquent par envoi-réception de messages. En l'absence de pannes ces modèles sont équivalents dans le sens où on peut simuler l'un par l'autre, ce qui n'est plus le cas si un nombre quelconque de processus peuvent tomber en panne crash. Dans cet exposé nous mettrons en évidence des différences importantes entre ces modèles. En particulier nous montrerons que l'implémentation d'objets partagés avec des détecteurs de défaillances dans le modèle d'envoi-réception de messages rend équivalents tous les objets plus forts que des registres. On montrera aussi que déterminer le plus faible détecteur de défaillances peut être beaucoup plus simple dans le modèle avec envoi-réception de messages.

Mots clés: Consensus, Mémoire partagée, mulitcore, tolérance aux défaillances

Thomas Roche

Certification of linear algebra operations on large scale global computing systems.

Résumé : Large scale global computing systems like the Grid and Peer-to-peer computing platforms gather thousands of resources for computing parallel applications. Even if a middleware is used to secure the communications and to manage the resources, the computational nodes operate in an unbounded environment and are subject to a wide range of attacks able to alter the computed results. Of course, global computations are expected to tolerate certain low rates of faults, yet one should consider the possibility of massive attacks resulting in an error rate larger than tolerable by the application. Such massive attacks are especially of concern due to Distributed Denial of Service, virus or Trojan attacks, and more generally orchestrated attacks against widespread vulnerabilities of a specific operating system that may result in the corruption of a large number of resources. Hence the logical fault model of such systems is the byzantine fault one where the least possible hypothesis are made on the attacker power. When the confidence one can have in such system is usually very low, we will see how it can be improved for some target application In this context, we study fault-tolerance for common linear algebra operations such as Matrix-vector multiplication, where the dimensions (n) of the algebra system is high and needs huge computational power (Matrix-vector product complexity is in O(n^2)). Such operations are often found in large scale simulation applications. We then present our work on an Algorithm-Based Fault Tolerant (ABFT) with arbitrary low probability of error (in the resulting vector) and display its complexity.

Mots clés : Byzantine faults, Fault tolerance, Correction codes, Probabilistic certification

Borzoo Bonakdarpour

Symbolic synthesis of masking fault-tolerant distributed programs.

Résumé : We focus on automated addition of masking fault-tolerance to existing fault-intolerant distributed programs. A program is /masking/ fault-tolerant, if it satisfies its safety and liveness specifications in the absence and presence of faults. Masking fault-tolerance is highly desirable in distributed programs, as the structure of such programs are fairly complex and they are often subject to various types of faults. However, the problem of synthesizing masking fault-tolerant distributed programs from their fault-intolerant version is NP-complete in the size of the intolerant program's state space, setting the practicality of the synthesis problem in doubt. We show that synthesizing moderate-sized masking distributed is feasible in practice. In particular, we present and implement a set of BDD-based synthesis heuristics for adding masking fault-tolerance to existing fault-intolerant distributed programs automatically. Our experiments show that synthesizing masking distributed programs with reachable states of size 10^40 and beyond is possible in reasonable amount of time and memory. We also identify several bottlenecks in synthesis of distributed programs depending upon the structure of the intolerant program at hand. We conclude that unlike verification, in program synthesis, the most challenging barrier is not the state explosion problem by itself, but the time complexity of decision procedures.

Sébastien Tixeuil

Sur l'auto-stabilisation des robots mobiles dans les graphes.

Résumé : L'auto-stabilisation est une technique générale permettant le recouvrement d'un comportement correct suite à une défaillance transitoire ayant plongé un système distribué dans un état global arbitraire. Les robots mobiles constituent un nouveau modèle de calcul distribué. Dans cet exposé, je présenterai l'adaptation de l'auto-stabilisation aux modèles de robots mobiles dont les mouvements sont restreints par des graphes. Les résultats d'impossibilité seront discutés au travers d'exemples comme le nommage, l'élection, et les commérages.

Mots clés : auto-stabilisation, robot mobile, impossibilité, nommage, élection, commérage.

Franck Petit

Stabilisation et synchronisation d'horloges logiques

Résumé : Après avoir brièvement rappeler le contexte et les principes fondateurs de l'auto-stabilisation dans les systèmes distribués, nous présentons quelques résultats sur la stabilisation instantanée et la synchronisation de phase logique en environnement asynchrone anonyme. Nous verrons également comment cette dernière permet d'implanter des algorithmes à vagues instantanément stabilisants dans les réseaux anonymes.

Stéphane Devismes

With finite memory consensus is easier than reliable broadcast.

Résumé : We consider asynchronous distributed systems with message losses and process crashes. We study the impact of finite process memory on the solution to consensus, repeated consensus and reliable broadcast. With finite process memory, we show that in some sense consensus is easier to solve than reliable broadcast, and that reliable broadcast is as difficult to solve as repeated consensus: More precisely, with finite memory, consensus can be solved with failure detector S, and P- (a variant of the perfect failure detector which is stronger than S) is necessary and sufficient to solve reliable broadcast and repeated consensus.

Mots clés : Distributed algorithms, failure detectors, reliable broadcast, consensus, repeated consensus.

Cédric Tedeschi

Self-Stabilization in Tree-Structured Peer-to-Peer Service Discovery Systems.

Résumé : The efficiency of service discovery is critical in the development of fully decentralized middleware intended to manage large scale computational grids. This demand influenced the design of many peer-to-peer based approaches. The ability to cope with the expressiveness of the service discovery was behind the design of a new kind of overlay structures that is based on tries, or prefix trees. Although these overlays are well designed, one of their weaknesses is the lack of any concrete fault tolerant mechanism, especially in dynamic platforms; the faults are handled by using preventive and costly mechanisms, e.g., using a high degree of replication. Moreover, those systems cannot handle any arbitrary transient failure. Self-stabilization, which is an efficient approach to design reliable solutions for dynamic systems, was recently suggested to be a good alternative to inject fault-tolerance in peer-to-peer systems. However, most of the previous research on self-stabilization in tree and/or P2P networks was designed in theoretical models, making these approaches hard to implement in practice. We provide a self-stabilizing message passing protocol to maintain prefix trees over practical peer-to-peer networks. A complete correctness proof is provided, as well as simulation results to estimate the practical impact of our protocol.

Mots clés : Peer-to-peer systems, service discovery, fault-tolerance, self-stabilization.

Jean-Louis Roch et Thierry Gautier

Certification de performances et de résultats sur une infrastructure de calcul ambiante.

Résumé : Les infrastructures de calcul ambiantes et les "nébuleuses" informatiques distribuées à grande échelle ("cloud computing", grilles, système pair à pair, ...) permettent de mobiliser des centaines de milliers de ressources pour des calculs applicatifs. Généralement, une exécution met en jeu trois parties : le fournisseur de ressources ; le fournisseur d'application (ou de service) ; l'utilisateur final qui soumet ses entrées pour obtenir les résultats (souvent critiques) grâce à l'exécution de l'application sur l'infrastructure. Même si un intergiciel est utilisé pour sécuriser les communications et gérer les ressources, les noeuds de calcul opèrent dans un environnement non clos et sont sujets à un large éventail d'attaques susceptibles de briser la confidentialité ou d'altérer les ressources ou les résultats. Un défi est alors d'établir un contrat de confiance entre les trois parties, ce qui nécessite de définir et de garantir des mesures qualitatives et quantitatives de la confiance: performance et qualité des résultats obtenus. Centré sur un ordonnancement par vol de travail d'applications à grain potentiellement fin, cet exposé aborde la problématique théorique et pratique de la garantie de performances en fonction des ressources utilisées (protocoles efficaces de tolérance des pannes franches) et de la garantie de la qualité de correction des résultats (protocoles efficaces de correction des résultats).