Proactive Failure Avoidance, Recovery and Maintenance
(PFARM)

Summary of DSN Workshop at Estoril, Lisbon, Portugal, June 29th, 2009

THE PFARM CHALLENGES KARAOKE

The goal of this workshop was to bring together researchers from various communities all over the world. With more than 20 participants, the workshop provided a stimulating and fruitful forum to foster collaboration among researchers. One highlight was the PFARM CHALLENGES KARAOKE. Each workshop participant was asked to provide (on little sheets of paper) up to three short statements of what, from his or her perspective, the biggest challenges for PFARM are. Each statement was folded and put into a big bowl. At "showtime", people were selected randomly and asked to pick a challenge from the bowl. Then they had to argue as convincingly as possible why the picked challenge was the most important.
In our opinion, the PFARM CHALLENGES KARAOKE was a big success. The exercise sometimes offered completely new perspectives on topics that seemed to be settled.

As a first step, we decided to record the list of topics here. We are in the process of moving the topics to a forum where people can discuss them, state their opinions, give references to existing work, and start building a comprehensive compendium on proactive failure avoidance, recovery and maintenance.

  • Can uncertainty in measured data lead to bad?
  • Investigating the relation between faults and misconfigurations
  • Measuring system reliability against potential faults
  • Improvement and exploitation of existing monitoring infrastructures for proactive failure recovery / maintenance
  • How do I proactively compose my systems or services in a highly dynamic environment (such as the Web)?
  • Requirements of monitoring infrastructures for proactive actions (metrics ...)
  • Relationships between prediction and monitoring. Should we focus on a "design for prediction" methodology?
  • How to factor in human factors / whole-ecosystem analysis
  • Diagnosis of problems that have never occurred before
  • Blackbox systems
  • Dependability assessment of off-the-shelf / open-source component-based complex systems
  • Dependability assessment of dynamic distributed systems
  • How can we model the network accurately?
  • Deal with diversity of equipment, protocols, etc.
  • Increased diversity & complexity of corporate environments
  • Rapid growth brought about by mergers, acquisitions, integration of new areas
  • In proactive recovery, how can we choose the "best" frequency of recoveries, in order to cause only minimal performance penalty, but also to recover from as many faults as possible?
  • If we use modelling to choose the best frequencies of recoveries, how do we model malicious intruders?
  • Automation scares operators! How to build operators' trust in automated recovery / configuration and reactive tools?
  • Benchmark for PFARM problems
  • Monitoring: How, what, when?
  • Proactive reliable systems design; how can proactive policies be integrated in the development cycle?
  • Methods to estimate the costs of introducing proactive policies vs. reactive policies. How can we assess such costs and compose them with reactive policies' costs?
  • Testing of proactive reliable systems: How can we test the introduced mechanisms for proactive reliability assurance?
  • Diversity injection
  • How do I know that the (service) components I am using from other people are reliable?
  • How to process / select relevant data from all the data available?
  • How to predict and diagnose problems never seen before?
  • How to deal with multiple administrative domains (i.e., not complete visibility) in diagnosis?
  • How to diagnose transient failures?
  • Field data sources: How can we achieve an effective source of failure data?


Last updated Sep 25th, 2009