Proactive Failure Avoidance, Recovery and Maintenance
(PFARM)

Summary of DSN Workshop at Estoril, Lisbon, Portugal, June 29th, 2009

PROGRAM

8:30 - 8:45 Introduction
Monitoring and Evaluation
8:45 - 9:00 Online Reliability Monitoring: a Hybrid Approach,
Roberto Pietrantuono, Stefano Russo (University of Naples Federico II), Kishor Trivedi (Duke University)
9:00 - 9:15 A Runtime Dependability Evaluation Framework for Fault Tolerant Web Services,
Zibin Zheng, Michael Lyu (The Chinese University of Hong Kong)
9:15 - 9:30 An Approach for Assessing Logs by Software Fault Injection,
Roberto Natella, Antonio Pecchia, Domenico Cotroneo, Stefano Russo (University of Naples Federico II)
Reaction Methods
9:30 - 9:45 WEST: Wormhole-Enhanced State Transfer,
Rogério Correia, Paulo Sousa (University of Lisboa)
9:45 - 10:00 Performance Aware Regeneration in Virtualized Multitier Applications,
Kaustubh Joshi, Matti Hiltunen (AT&T Labs Research), Gueyoung Jung (Georgia Institute of Technology)
10:00 - 10:30 Coffee Break
Diagnosis
10:30 - 10:45 Etymon: A Root Cause Analysis System for Large and Complex IT Enterprise Networks,
Tiago Carvalho, Hyong Kim (Carnegie Mellon University)
10:45 - 11:00 Automatic detection of firewall misconfigurations using firewall and network routing policies,
Ricardo Oliveira (Portugal Telecom), Sihyung Lee, Hyong Kim (Carnegie Mellon University)
11:00 - 12:00 PFARM Challenges Karaoke

ABSTRACTS

Online Reliability Monitoring: a Hybrid Approach

Assuring high reliability levels in complex software systems is difficult. The spread of the component-based paradigm brought, along with many advantages, new thorny problems and challenges. Various approaches have been proposed to guarantee high reliability and cope with such problems; among these, proactive policies are particularly effective and inexpensive. The ability to monitor the system at runtime and to give online estimates of the trend of the dependability attribute of interest is the key to implementing strategies that aim at forecasting, and thus proactively preventing, system failures. In this paper, an online reliability monitoring approach is proposed. It combines the benefits of architecture-based reliability models and dynamic analysis, so as to integrate static modeling power with representative operational data. Its usage is illustrated by a prototype implementation, a case study, and preliminary results.
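
The abstract does not spell out the model, so the following is only a minimal sketch of what an architecture-based reliability estimate can look like, assuming a simple Markov-style formulation: each component has a reliability, control passes between components with given transition probabilities, and the system succeeds if execution reaches the final component without a failure. The component names, reliabilities, and transition probabilities are hypothetical; in an online setting they would be re-estimated from runtime observations.

```python
def system_reliability(reliability, transitions, start, final, iterations=1000):
    """Probability of completing successfully when execution starts at `start`.

    reliability: component -> probability that one visit succeeds (assumed values)
    transitions: component -> {next component: transfer probability}
    """
    # x[c] = probability of successful completion starting at component c
    x = {c: 0.0 for c in reliability}
    for _ in range(iterations):  # fixed-point (Jacobi) iteration
        updated = {}
        for c in reliability:
            if c == final:
                updated[c] = reliability[c]
            else:
                successors = transitions.get(c, {})
                updated[c] = reliability[c] * sum(p * x[n] for n, p in successors.items())
        x = updated
    return x[start]


# Hypothetical three-component architecture: A calls B or C, B calls C, C is final.
reliability = {"A": 0.99, "B": 0.95, "C": 0.98}
transitions = {"A": {"B": 0.6, "C": 0.4}, "B": {"C": 1.0}}
estimate = system_reliability(reliability, transitions, start="A", final="C")
```

Re-running the estimate whenever monitoring updates a component's reliability or a transition probability gives a trend that a proactive policy could watch.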

A Runtime Dependability Evaluation Framework for Fault Tolerant Web Services

Service-oriented systems are usually built on top of Web service components, which are distributed across the Internet, making dependability a big challenge. In this paper, we propose a runtime dependability evaluation framework for fault tolerant Web services to attack this crucial problem. We first propose a user-collaborative framework for collecting Web service QoS information from both the service providers and service users. Then, Web service QoS models, fault tolerance strategies, and optimal Web service recommendation approaches are presented. Finally, the benefits of the runtime evaluation framework are demonstrated by real-world experiments. As illustrated by the experimental results, our proposed framework makes fault tolerance for distributed service-oriented systems feasible, reconfigurable, and optimized.
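
As a rough illustration of user-collaborative QoS collection followed by a recommendation step, the sketch below aggregates invocation reports submitted by different users and picks the replica with the lowest observed failure rate, breaking ties by mean response time. The report fields, service names, and ranking rule are assumptions made for this example; the abstract does not describe the framework's actual QoS models or recommendation algorithm.

```python
from collections import defaultdict
from statistics import mean


def summarize(reports):
    """Aggregate per-service QoS from user-contributed invocation reports."""
    by_service = defaultdict(list)
    for report in reports:
        by_service[report["service"]].append(report)
    summary = {}
    for name, items in by_service.items():
        ok_times = [r["rtt_ms"] for r in items if r["ok"]]
        summary[name] = {
            "failure_rate": 1 - len(ok_times) / len(items),
            # No successful call observed: treat response time as unbounded.
            "mean_rtt_ms": mean(ok_times) if ok_times else float("inf"),
        }
    return summary


def recommend(summary):
    """Pick the service with the lowest failure rate, then the lowest mean RTT."""
    return min(summary, key=lambda s: (summary[s]["failure_rate"], summary[s]["mean_rtt_ms"]))


# Hypothetical reports from several users for two functionally equivalent services.
reports = [
    {"service": "svc-a", "ok": True, "rtt_ms": 100},
    {"service": "svc-a", "ok": True, "rtt_ms": 120},
    {"service": "svc-a", "ok": True, "rtt_ms": 110},
    {"service": "svc-a", "ok": False, "rtt_ms": None},
    {"service": "svc-b", "ok": True, "rtt_ms": 200},
    {"service": "svc-b", "ok": True, "rtt_ms": 180},
    {"service": "svc-b", "ok": True, "rtt_ms": 220},
    {"service": "svc-b", "ok": True, "rtt_ms": 200},
]
summary = summarize(reports)
best = recommend(summary)
```

Here the faster service loses to the slower one because one of its four reported calls failed; a different weighting of failure rate against latency would be an equally valid design choice.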

An Approach for Assessing Logs by Software Fault Injection

Nowadays, an increasing number of systems need to be kept running for long periods without showing failures, but several factors compromise their correct behavior during the operational phase. Logs play a key role in addressing dependability issues of current systems and in enabling proactive actions against failures (e.g., proactive maintenance, failure prediction). Nevertheless, they may lack any information in the case of software faults, which escape the testing phase and are activated in the field by complex environmental conditions. In this paper, we evaluate the built-in logging capabilities of a software system, namely the Apache Web Server, by means of an extensive software fault injection campaign. We observe that, in most cases, software faults lead to failures without leaving any information in the Apache logs. For this reason, we provide a few guidelines that developers can use during the development cycle to improve the effectiveness of logs during the operational phase.

WEST: Wormhole-Enhanced State Transfer

This paper presents WEST, an efficient state-transfer protocol for Byzantine fault-tolerant state machine replication systems enhanced with proactive-reactive recovery. Usually the recovery of a stateful replica consumes a considerable amount of time, mostly due to state transfer. As a result, it is essential to reduce the state transfer time while guaranteeing that correct replicas never lose their state. Our approach consists of creating periodic state checkpoints stored in a distributed secure component (wormhole), and relying on this component to manage and control state transfer operations. Preliminary evaluation results show the performance and overhead of the proposed protocol.

Performance Aware Regeneration in Virtualized Multitier Applications

Virtual machine technology enables highly agile system deployments in which components can be cheaply moved, cloned, and allocated controlled hardware resources. In this paper, we examine, in the context of multitier enterprise applications, how these facilities can be used to provide enhanced solutions to the classic problem of ensuring high availability without a loss in performance on a fixed amount of resources. By using virtual machine clones to restore the redundancy of a system whenever component failures occur, we achieve improved availability compared to a system with a fixed redundancy level. By smartly controlling component placement and colocation using information about the multitier system's flows and predictions made by queuing models, we ensure that the resulting performance degradation is minimized. Simulation results show that our proposed approach provides better availability and significantly lower degradation of system response times compared to traditional approaches.

Etymon: A Root Cause Analysis System for Large and Complex IT Enterprise Networks

Localizing the root cause of failures in a large IT infrastructure is a very challenging task. In this paper, we present Etymon, a system that identifies the most relevant network components and metrics to explain performance problems perceived by end users. Our proposed modular architecture identifies performance issues, continuously characterizes dependencies using traffic analysis, and creates a network model based on the resulting dependencies. The probability of each component contributing to the failure is evaluated using a deviation analysis and network behavior patterns. We define causal paths with a root-cause component as an origin. The causal path illustrates a probability model that captures both sequential dependencies and correlation of network traffic flows. Etymon also introduces novel concepts such as environment-specific network models, context-conditioned dependency information, temporal correlation of anomalies, and rankings of root-cause components and metrics. Etymon was deployed in an enterprise IT network of a large European telecom operator, and results from this experiment are discussed in the paper.

Automatic detection of firewall misconfigurations using firewall and network routing policies

Firewalls are the most prevalent and important means of enforcing security policies inside networks and across organizational boundaries. However, effective and fault-free firewall management in large and fast-growing networks is becoming increasingly challenging. Firewall security policies are complex, and their interaction with routing policies and applications further complicates policy configuration. Routing is often ignored in firewall management. Configuration problems can occur in a single device or in multiple devices along several network paths that change over time according to routing. We present an application, Prometheus, which implements mechanisms for automatic detection of firewall configuration problems that are extremely difficult to resolve manually. In addition to firewall configurations, Prometheus incorporates and analyzes dynamic routing information. We believe that routing information is critical to obtaining a complete view of the network and cannot be ignored when configuring firewalls. We test Prometheus in a large production network and report its effectiveness. Prometheus is currently being deployed in the production network.
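
To make concrete why routing matters when checking firewall configurations, the sketch below evaluates one intended flow against every firewall on the path that routing currently selects, and reports the hops that would block it. This is not Prometheus's algorithm, which the abstract does not describe; the rule format, first-match semantics, default-deny policy, and device names are all assumptions made for illustration.

```python
from ipaddress import ip_address, ip_network


def first_match(rules, src, dst, port):
    """Return the action of the first matching rule (assumed first-match semantics)."""
    for rule in rules:
        if (ip_address(src) in ip_network(rule["src"])
                and ip_address(dst) in ip_network(rule["dst"])
                and rule["port"] in (None, port)):  # None means any port
            return rule["action"]
    return "deny"  # assumed default policy


def blocking_hops(path, firewalls, src, dst, port):
    """List the firewalls on a routed path that do not allow the flow."""
    return [hop for hop in path
            if hop in firewalls and first_match(firewalls[hop], src, dst, port) != "allow"]


# Hypothetical devices: routers without rules are simply absent from `firewalls`.
firewalls = {
    "fw-edge": [{"src": "10.0.0.0/8", "dst": "192.168.1.0/24", "port": 443, "action": "allow"}],
    "fw-dc": [
        {"src": "10.1.0.0/16", "dst": "0.0.0.0/0", "port": None, "action": "deny"},
        {"src": "10.0.0.0/8", "dst": "192.168.1.0/24", "port": 443, "action": "allow"},
    ],
    "fw-backup": [{"src": "10.0.0.0/8", "dst": "192.168.1.0/24", "port": 443, "action": "allow"}],
}
flow = ("10.1.2.3", "192.168.1.10", 443)
primary = blocking_hops(["fw-edge", "core-router", "fw-dc"], firewalls, *flow)
backup = blocking_hops(["fw-edge", "fw-backup"], firewalls, *flow)
```

The same flow is blocked on the primary path (a broad deny rule on `fw-dc` precedes the allow rule) but passes on the backup path, so whether the misconfiguration is visible depends on which route is active.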


Last updated Sep 25th, 2009