Proactive Failure Avoidance, Recovery and Maintenance
(PFARM)

Summary of DSN Workshop at Estoril, Lisbon, Portugal, June 29th, 2009

INTRODUCTION

Over the last decade, research on dependable computing has shifted from reactive towards proactive methods. Traditionally, fault tolerance reacted to errors or component failures in order to prevent them from turning into system failures, and maintenance followed fixed, time-based plans. However, due to ever-increasing system complexity, the use of commercial-off-the-shelf components, virtualization, ongoing system patches and updates, etc., such approaches have become difficult to apply. Therefore, a new area in dependability research has emerged, focusing on proactive approaches that start acting before a problem arises in order to increase time-to-failure and/or reduce time-to-repair. These techniques frequently build on the anticipation of upcoming problems based on runtime monitoring. Industry and academia have come up with several terms for such techniques, each focusing on different aspects, including self-* computing, autonomic computing, proactive fault management, trustworthy computing, software rejuvenation, and preventive/proactive maintenance.

PFARM covers a variety of topics, including:

  • Runtime dependability assessment and evaluation (reliability, availability, etc.)
  • Runtime monitoring for online fault detection and diagnosis, including monitoring data processing
  • Prediction methods to anticipate failures, resource exhaustion or other critical situations in complex systems, distributed systems, and adaptive or peer-to-peer networks
  • Predictive diagnosis and fault location as well as root-cause analysis
  • Optimal decision algorithms and policies to manage and schedule the application of actions
  • Downtime minimization or avoidance mechanisms such as preventive failover, state clean-up, proactive reconfiguration, failure-prevention-driven load balancing, prediction-driven restarts, rejuvenation, adaptive checkpointing, or other prediction-driven enhancements of traditional repair methods
  • Proactive fault management and maintenance techniques such as monitoring-based replacement, configuration and management of computer systems and components
  • Dependability evaluation, including models to assess the impact on metrics such as availability, reliability, security and performability, and on user-oriented metrics such as service availability
  • Case studies, applications, experiments, experience reports

GOAL OF THIS WEBSITE

We started the PFARM workshop in order to form a community of people from industry as well as academia working on proactive failure avoidance, recovery and maintenance. This website continues that effort. Its main goals are:

  1. To wrap up the workshop by listing participants, talks, etc.
  2. To serve as a landing page for people who want to participate in the "PFARM community". For this purpose, the website
    • links to a forum, which will be available soon
    • links to a mailing list through which people can communicate


Last updated Sep 25th, 2009