Principles and Practices of Dependable Distributed Computing

NSF CAREER Award
CCR 9984778

Alex Shvartsman

This research will advance the theoretical foundations and explore practical implementations of dependable distributed system technology. A distributed system is dependable when it provides guarantees regarding its performance, fault-tolerance, correctness and compositionality. The research objectives will be achieved through synergy among research in distributed systems, with its focus on fault-tolerance and correctness; research in parallel computing, with its focus on speed-up and efficiency; and the practical engineering considerations of specification, development, deployment and performance of systems. This proposal encompasses three investigation areas:

(1) Robust Algorithmics: Development of fault-tolerant and efficient distributed algorithms and exploration of limitations on achieving robustness in distributed computing.

(2) Building Blocks: Definition and analysis of dependable distributed building blocks needed by applications requiring precise guarantees; and design of specification frameworks for capturing designs and optimizing distributed system deployment.

(3) Distributed Implementation: Development of exploratory implementations of compositional building blocks and robust algorithms, and evaluation of their performance in realistic and simulated settings; empirical evaluations will complement the analytically established efficiency characterizations.

The educational component includes developing and delivering new courses in distributed computing in support of undergraduate and graduate programs in computer science, and building a research group that attracts graduate students and postdoctoral researchers.