Welcome To CSE 5095! 

Troubleshooting Distributed Systems
(Tuesday and Thursday) @ (5:00 pm - 6:15 pm) @ (Bronwell 124)


Mohammad Maifi Hasan Khan
Assistant Professor
Department of Computer Science and Engineering
University of Connecticut
Email: maifi.khan@gmail.com, mohammad.khan@uconn.edu 

This course focuses on the reliability of distributed systems. It covers concepts of reliability, failure modes, and approaches to troubleshooting large-scale systems. More specifically, the course reviews state-of-the-art technologies (e.g., Cassandra, sensor fusion systems, adaptive distributed policies, heterogeneous systems, energy-aware systems, storage systems), their reliability challenges, and the troubleshooting approaches applied to them. It covers recent work on the reliability and troubleshooting of software systems along with classical work. Students will gain hands-on experience and are expected to participate in research projects that address challenging research problems.

Policies on Ethics and Cheating

         Students are expected to discuss the papers with each other. However, paper reviews must be written individually and should not be shared with other students.

         Reviews must be written in your own words. Copying and pasting text verbatim from a paper will be considered cheating.

         Presentation slides are expected to be prepared by the presenter.

         All university policies and rules regarding cheating, plagiarism, and any other form of misconduct apply.

Course Policies

Paper Reviews

         Students must submit a review of one paper for each class day (two paper reviews per week).

         You must hand in a printout of your review at the beginning of class. If you miss class, you must email the review to the instructor before class.

Paper Presentation

         Each student must present three papers over the course of the semester.

         Each presentation will be 20 minutes long, followed by 10 minutes of discussion.

         Students are expected to create their own slides.

         Slides must be emailed to the instructor at least 48 hours before the presentation (e.g., if the presentation is scheduled for Sept 20 at 5 pm, the slides must be emailed by Sept 18 at 5 pm).

         Presentation slots will be assigned on a first-come, first-served basis.


Research Projects

         Each student must complete a research project.

         Students are encouraged to work in groups. A group may have at most three members.

         The scope of each research project will be determined by the group size.


Office Hours

         The instructor will meet with each project group individually. Appointments will be made by email.

         For anything else, students can email the instructor to set up an appointment.

Recommended Book: 

Distributed Systems: Concepts and Design (4th Edition), by George Coulouris, Jean Dollimore, and Tim Kindberg



Important Dates

         Project proposals are due by Sept 26  

         The midterm progress report is due by Nov 11

         Final reports will be due at the end of the semester, Dec 5, 11:59 pm

         Each group must present its project at the end of the semester. Presentation slots will be assigned on a first-come, first-served basis.


Topics to cover


-          Time synchronization and Global States 

-          Reliable Multicast

-          Distributed Mutual Exclusion

-          Leader Election

-          Transactions and Concurrency Control

-          P2P systems and overlay routing

-          Byzantine Fault Tolerance

-          Replica Management

-          Introduction to TinyOS

-          Introduction to NoSQL Databases

Lecture Schedule, Slides and Papers



Lecture Topic

Assigned Papers

Aug 26, Tuesday



Aug 28, Thursday


         Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, Volume 21, Issue 7, July 1978.

         A sqrt-N algorithm for mutual exclusion in decentralized systems. ACM TOCS. Apr. 1985.

         A practical distributed mutual exclusion protocol in dynamic peer-to-peer systems. IPTPS 2004.

Sept 2, Tuesday

Sept 4, Thursday

Challenges of System Monitoring and Execution Tracing

Sept 9, Tuesday

         Profiling Network Performance for Multi-tier Data Center Applications. NSDI 2011 (Paper to review)

         Improving Software Diagnosability via Log Enhancement. ASPLOS 2011

Sept 11, Thursday

         Dapper, a large-scale distributed systems tracing infrastructure. Google Inc. 2010 (Paper to review) (Prasanna Gautam)

         Carat: Collaborative Energy Diagnosis for Mobile Devices. SenSys 2013

Sept 16, Tuesday

         Juggling the Jigsaw: Towards Automated Problem Inference from Network Trouble Tickets. NSDI 2013 (Paper to review) (Areej Althubaity)

         Volley: Violation Likelihood Based State Monitoring for Datacenters. ICDCS 2013

Sept 18, Thursday

         Detecting Transient Bottlenecks in n-Tier Applications through Fine-Grained Analysis. ICDCS 2013 (Paper to review) (Aljohara Algwaiz)

         Self-Correlating Predictive Information Tracking for Large-Scale Production Systems. ICAC 2009 (Kewen Wang)

Sept 23, Tuesday

         Fay: Extensible Distributed Tracing from Kernels to Clusters. SOSP 2011 (Prasanna Gautam)

         Detecting failures in distributed systems with the FALCON spy network. SOSP 2011 (Paper to review)

Sept 25, Thursday

         Be Conservative: Enhancing Failure Diagnosis with Proactive Logging. OSDI 2012 (Paper to review) (Ramyaa Muthumani)

         Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis. SOSP 2011

Troubleshooting Large Scale Systems

Sept 30, Tuesday

         FChain: Toward Black-box Online Fault Localization for Cloud Systems. ICDCS 2013 (Aljohara Algwaiz)

         Diagnosing Data Center Behavior Flow by Flow. ICDCS 2013 (Paper to review)

Oct 2, Thursday

         DeepDive: transparently identifying and managing performance interference in virtualized environments. ATC 2013 (Paper to review) (Russell Jancewicz)

         Fast Crash Recovery in RAMCloud. SOSP 2011 (Ramyaa Muthumani)

Oct 7, Tuesday


         Adaptive Bug Isolation. Piramanayagam Arumuga Nainar and Ben Liblit. ICSE 2010 (Paper to review)

         Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems. NSDI 2012 (Kewen Wang)

Oct 9, Thursday

         Project Demo

Oct 14, Tuesday

         X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software. OSDI 2012 (Paper to review) (Prasanna Gautam)

         Detecting Large-Scale System Problems by Mining Console Logs. SOSP 2009

Oct 16, Thursday

         EyeQ: Practical Network Performance Isolation at the Edge. NSDI 2013 (Paper to review) (Areej Althubaity)

         Holmes: Effective Statistical Debugging via Efficient Path Profiling. ICSE 2009

Troubleshooting Mobile Systems and Wireless Sensor Network Applications

Oct 21, Tuesday

         Isolating Cause-Effect Chains from Computer Programs. Andreas Zeller. FSE 2002

         AppInsight: Mobile App Performance Monitoring in the Wild. OSDI 2012 (Paper to review) (Ramyaa Muthumani)

Oct 23, Thursday

         Project Demo

Oct 28, Tuesday

         What is keeping my phone awake? Characterizing and detecting no-sleep energy bugs in smartphone apps. MobiSys 2012 (Paper to review) (Russell Jancewicz)

         Minerva: Distributed Tracing and Debugging in Wireless Sensor Networks. SenSys 2013

         Sentomist: Unveiling transient sensor network bugs via symptom mining. ICDCS 2010

Oct 30, Thursday

         Which configuration option should I change? ICSE 2014 (Paper to review) (Kewen Wang)

         Efficient diagnostic tracing for wireless sensor networks. SenSys 2010 (Paper to review)

         Surviving Sensor Network Software Faults. SOSP 2009

Nov 4, Tuesday


         Where is the energy spent inside my app? Fine Grained Energy Accounting on Smartphones with Eprof. EuroSys 2012

         Draco: statistical diagnosis of chronic problems in distributed systems. DSN 2012

         Automating Configuration Troubleshooting with Dynamic Information Flow Analysis. OSDI 2010

Troubleshooting Misconfiguration

Nov 6, Thursday

         Project Demo

Nov 11, Tuesday

         Remus: High Availability via Asynchronous Virtual Machine Replication. NSDI 2008 (Russell Jancewicz)

         Automated diagnosis of software configuration errors. ICSE 2013 (Paper to review)

Nov 13, Thursday

         Do not blame users for misconfigurations. SOSP 2013 (Paper to review)

         NetPrints: Diagnosing Home Network Misconfigurations Using Shared Knowledge. NSDI 2009 (Areej Althubaity)

Error Prevention

Nov 18, Tuesday

         PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems. ICDCS 2012 (Paper to review) (Aljohara Algwaiz)

         MODIST: Transparent Model Checking of Unmodified Distributed Systems. NSDI 2009

         CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. NSDI 2009 (Paper to review)

Nov 20, Thursday

         Project Demo

Nov 25, Tuesday

Thanksgiving Break!

Nov 27, Thursday

Thanksgiving Break!

Dec 2, Tuesday

Final Project Presentation!

Dec 4, Thursday


Final Project Presentation!


Supplemental Random Reading



         X-Trace: A Pervasive Network Tracing Framework. NSDI 2007

         D3S: Debugging Deployed Distributed Systems. NSDI 2008

         Enabling Configuration-Independent Automation by Non-Expert Users. OSDI 2010

         Debugging Reinvented: Asking and Answering Why and Why Not Questions about Program Behavior. ICSE 2008

         PDA: Passive distributed assertions for sensor networks. IPSN 2009

         SherLog: Error Diagnosis by Connecting Clues from Run-time Logs. ASPLOS 2010.