CSE 124 Datacenters and tail latency
2017 November 27: Homework 5

Logistics

Due Monday Dec 4, 5pm

Quantifying the effect of tail latency on networked services

In this exercise, you are going to simulate the overall performance of a distributed, replicated network service. In particular, you are going to quantify the effect of latency and latency variation on delivered performance.

Start by reading Dean and Barroso’s “The Tail at Scale,” Communications of the ACM, Vol. 56, No. 2 (February 2013), pp. 74–80.

Overview

You are going to simulate the performance of three different systems:

  • Scenario 1: A single client issues N independent, sequential, non-overlapping requests to a network service
  • Scenario 2: A single client issues P parallel requests to a network service, waiting until all responses have been completed and the results have been returned before considering the operation complete
  • Scenario 3: A single client issues P parallel requests to a network service, waiting until the first 90% of responses have been completed and returned before considering the operation complete. The remaining 10% of responses are simply ignored.
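Each scenario reduces to a simple aggregate over the per-request delays. As a sketch (the helper names below are illustrative, not part of the assignment), assuming you already have a list of per-request delays in milliseconds:

```python
def scenario1_latency(delays):
    # Sequential requests: the client experiences each delay in turn, so the
    # per-request service latency is the mean of the observed delays.
    return sum(delays) / len(delays)

def scenario2_latency(delays):
    # Parallel requests, wait for all: the operation finishes only when the
    # slowest response arrives.
    return max(delays)

def scenario3_latency(delays):
    # Parallel requests, wait for the first 90%: the operation finishes when
    # the 90th-percentile response arrives; the slowest 10% are ignored.
    s = sorted(delays)
    k = max(int(len(s) * 0.9) - 1, 0)  # index of the 90th-percentile value
    return s[k]
```

For example, with ten delays of 10, 20, ..., 100 ms, Scenario 1 yields 55 ms, Scenario 2 yields 100 ms, and Scenario 3 yields 90 ms (the ninth of ten responses).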

Emulating delay

To emulate this service, you’re going to draw samples from a Gaussian/Normal random variable N(mu, sigma) with mean mu and standard deviation sigma. Each sample represents an observed request/response delay, in milliseconds. Any negative samples should be “rounded up” to 0 (i.e., don’t wait).

One way to generate variables from this distribution in Python is to use the random.gauss() function:

import random

def independent(mu, sigma, num):
    """Draw num request/response delays (in ms) from N(mu, sigma)."""
    vals = []
    for i in range(num):
        v = random.gauss(mu, sigma)
        vals.append(v if v >= 0 else 0)  # round negative delays up to 0
    return sorted(vals)

Thought question: How do you find the average time from a sorted list of numbers? The 90th percentile of a sorted list of length N? The 100th percentile of a list of length N?

Experiments

You are going to implement a factorial-design experiment, in which you measure the effect of variance and scale on the performance of a distributed system.

For each of Scenarios 1, 2, and 3 described above, compute the latency for:

  • A 100-node cluster, mean request time of 250 milliseconds, std. deviation of 250 milliseconds
  • A 100-node cluster, mean request time of 250 milliseconds, std. deviation of 10 milliseconds
  • A 100-node cluster, mean request time of 200 milliseconds, std. deviation of 200 milliseconds
  • A 100-node cluster, mean request time of 200 milliseconds, std. deviation of 10 milliseconds
  • A 10,000-node cluster, mean request time of 250 milliseconds, std. deviation of 250 milliseconds
  • A 10,000-node cluster, mean request time of 250 milliseconds, std. deviation of 10 milliseconds
  • A 10,000-node cluster, mean request time of 200 milliseconds, std. deviation of 200 milliseconds
  • A 10,000-node cluster, mean request time of 200 milliseconds, std. deviation of 10 milliseconds
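One way to structure the full sweep is sketched below. It reuses the independent() helper from above; the per-configuration aggregates (mean, max, 90th percentile) follow the scenario definitions. Averaging over repeated trials is my own addition to smooth out randomness (the assignment does not mandate it), and run_experiments() and its output format are illustrative names, not requirements.

```python
import random

def independent(mu, sigma, num):
    # Draw num request delays (ms) from N(mu, sigma), rounding negatives up to 0.
    return sorted(max(random.gauss(mu, sigma), 0) for _ in range(num))

def run_experiments(trials=20):
    for nodes in (100, 10000):
        for mu, sigma in ((250, 250), (250, 10), (200, 200), (200, 10)):
            s1 = s2 = s3 = 0.0
            for _ in range(trials):
                delays = independent(mu, sigma, nodes)
                s1 += sum(delays) / len(delays)           # Scenario 1: mean delay
                s2 += delays[-1]                          # Scenario 2: wait for all
                s3 += delays[int(0.9 * len(delays)) - 1]  # Scenario 3: first 90%
            print(f"nodes={nodes:5d} mu={mu} sigma={sigma:3d}: "
                  f"S1={s1/trials:6.1f} S2={s2/trials:6.1f} S3={s3/trials:6.1f} ms")

run_experiments()
```

Note that Scenario 2’s latency is just the last (largest) element of the sorted delay list, which is exactly where tail latency bites as the cluster grows.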

Report

For each of the experiments and scenarios, report your results. You should include the mean service latency in the case of Scenario 1, and the operation latency in the case of Scenarios 2 and 3.

In a paragraph, describe the impact that tail latency has on parallelism within a distributed system. What was the effect of “shedding” the slowest 10% of the requests (the so-called stragglers)? Does this shedding effect change depending on the scale of the service?

Submission

To submit this homework as a PDF, go to https://gradescope.com and use entry code MP5RJ6