  1. Daffodil
  2. DFDL-1510

Improve performance report with variance information


    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Normal
    • deferred
    • None
    • Performance, QA
    • None

      A big improvement for these reports would be to make them "self-noise-eliminating": unlike the attached report, they would drop all the red-lights for deltas that are "in the noise".

      We want to attract attention to (i.e., red-light) deltas that represent a statistically significant drop in performance. This can be a drop relative to the prior performance of this branch, or a drop relative to the prior performance of a baseline release.

      To do this you need variance-based statistics such as the Z-score, which is based on the standard deviation. A Z-score means "how many standard deviations away from the mean is this value." Z-scores between -1 and 1 imply "it's ordinary variation, most likely due to noise". A Z-score outside -1 to 1 implies "it's significant, take a look."
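
      As a minimal sketch of that definition, the Python below computes a Z-score against a list of historical values. The names z_score and is_in_the_noise are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the ticket does not say which form is intended.

```python
from statistics import mean, stdev


def z_score(value, history):
    """Return how many standard deviations `value` lies from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation (assumption: n - 1 denominator)
    if sigma == 0:
        raise ValueError("history has zero variance; the Z-score is undefined")
    return (value - mu) / sigma


def is_in_the_noise(z, threshold=1.0):
    """True when the Z-score falls inside [-threshold, threshold]."""
    return abs(z) <= threshold


history = [10, 12, 14]       # mean 12, sample standard deviation 2
z = z_score(16, history)     # (16 - 12) / 2
print(z, is_in_the_noise(z))
```

      For normally distributed noise, roughly a third of ordinary runs still land outside -1 to 1, so the threshold is left as a parameter here in case a stricter cutoff is wanted.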

      We need the mean and standard deviation of (previousVal - baselineVal) over the test's history. We can then compute (currentVal - baselineVal), and if its Z-score against that mean and standard deviation is < -1.0, we red-light the value: it means there is a statistically significant degradation in performance (relative to the baseline) due to this commit's code changes. This only red-lights changes due to this code commit. If a test's performance is relatively unchanged from day to day, but always slow relative to the baseline, this would not red-light that day's delta.
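
      The baseline-relative check could be sketched as follows. This assumes each historical run is stored as a (previousVal, baselineVal) pair; that data shape and the function names are hypothetical, not taken from the existing report code. It follows the rule above that a Z-score below -1.0 is a degradation.

```python
from statistics import mean, stdev


def baseline_delta_z(current, baseline, history):
    """Z-score of (current - baseline) against the history of (previous - baseline).

    `history` is a list of (previous_val, baseline_val) pairs (assumed shape).
    """
    deltas = [prev - base for prev, base in history]
    mu = mean(deltas)
    sigma = stdev(deltas)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("deltas have zero variance; the Z-score is undefined")
    return ((current - baseline) - mu) / sigma


def red_light_vs_baseline(current, baseline, history, threshold=-1.0):
    """True when the current delta is significantly below the usual delta."""
    return baseline_delta_z(current, baseline, history) < threshold


# Deltas are -50, -40, -30: mean -40, sample standard deviation 10.
history = [(150, 200), (160, 200), (170, 200)]
print(red_light_vs_baseline(139, 200, history))  # current delta is -61
```

      Note that if every run is compared against the same fixed baseline value, the baseline cancels out of this Z-score; the pairing only matters when the baseline is re-measured alongside each run.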

      We probably also want to red-light a general degradation in performance, even for tests that are running faster than the baseline. So we would also want the mean and standard deviation of previousVal, and would similarly red-light if the Z-score of currentVal (relative to that mean and standard deviation) is < -1.0.

      And we want to red-light (or pink-light) tests that are simply slower than the baseline by a statistically significant amount as an ongoing trend. So we would include the currentVal in the mean and stdDev of previousVal, and in the mean and stdDev of (previousVal - baselineVal). Like everything else here, the sign conventions assume that higher values are better/faster (e.g., a throughput rate); if the values are time taken, where lower is better, every comparison flips sign. If the mean of (previousVal - baselineVal) is negative by more than stdDev(previousVal - baselineVal), then the trend is that this test is slower than the baseline by a significant amount on an ongoing basis, so we should "pink light" the test results. That particular day's run might or might not reflect a statistically significant improvement or degradation, but the trend is still below the baseline by a statistically significant amount.
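
      A rough sketch of the trend ("pink light") check, assuming the per-run deltas against the baseline have already been computed. The function name pink_light and its inputs are hypothetical. The current delta is folded into the history before the mean and standard deviation are taken, as described above.

```python
from statistics import mean, stdev


def pink_light(current_delta, previous_deltas):
    """True when the mean delta, including today's run, is negative by more
    than one standard deviation of those deltas."""
    deltas = list(previous_deltas) + [current_delta]
    mu = mean(deltas)
    sigma = stdev(deltas)  # sample standard deviation (assumption)
    return mu < -sigma


# Consistently below the baseline: the trend is flagged.
print(pink_light(-20, [-30, -20, -10]))
# Hovering around the baseline: the trend is not flagged.
print(pink_light(-10, [5, -5, 10]))
```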

      This takes most of the noise variability out of the color highlighting.

      Example:
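      Baseline is 200, previous is 150, current is 139. The mean of the previous values is 175, so the mean of (previous - baseline) is -25, and the std-dev of (previous - baseline) is 12.

      So (current - baseline) is -61, which is 36 below the mean of -25. The Z-score of that is -36/12 = -3.0, which is < -1.0. So the red-light goes on.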

      Example 2:
      Current is 120. Mean of previous is 142, standard deviation of previous is 12.
      The delta from the mean is -22. The Z-score is -22/12 = -1.83, which is < -1.0, so we red-light this because it represents a statistically significant drop in performance from the average for that test.

      Example 3:
      Current is 120. Folding that into the mean and standard deviation of (previous - baseline) gives a mean of -20 and a stdDev of 10. That means the test is generally 20 units slower than the baseline. The Z-score of -20 relative to a stdDev of 10 is -2.0, so we would "pink light" the test, as generally being slower than the baseline on an ongoing basis.
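
      The Z-score arithmetic in the three examples (a deviation divided by a standard deviation) can be checked with a few lines of Python. The helper name z_from_deviation is hypothetical.

```python
def z_from_deviation(deviation, std_dev):
    """Z-score for a value that sits `deviation` away from its reference mean."""
    return deviation / std_dev


ex1 = z_from_deviation(-36, 12)        # Example 1: deviation of -36, std-dev 12
ex2 = z_from_deviation(120 - 142, 12)  # Example 2: current 120 vs. mean 142
ex3 = z_from_deviation(-20, 10)        # Example 3: mean delta of -20, std-dev 10

print(ex1)            # -3.0
print(round(ex2, 2))  # -1.83
print(ex3)            # -2.0
```

      All three fall below -1.0, which matches the red-light and pink-light outcomes described.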

      The inverse of these, statistically significant improvements, could generate a green-light (or light-green).
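
      One possible way to map a Z-score onto report colors, covering both the degradation and the improvement side. The function name light_for and the color strings are placeholders, not part of the existing report.

```python
def light_for(z, threshold=1.0):
    """Map a Z-score to a highlight color (placeholder names)."""
    if z < -threshold:
        return "red"    # statistically significant degradation
    if z > threshold:
        return "green"  # statistically significant improvement
    return "none"       # in the noise: no highlight


for z in (-1.83, 0.4, 2.5):
    print(z, light_for(z))
```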

      To compute this you need at least 12 points of history so that you can have a meaningful mean and standard deviation to compute from.
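
      A small guard along these lines could suppress highlighting until enough history exists. The constant MIN_HISTORY and the function summary_or_none are hypothetical names; returning None for short histories is one design choice among several.

```python
from statistics import mean, stdev

MIN_HISTORY = 12  # minimum number of historical points, per the note above


def summary_or_none(history, min_points=MIN_HISTORY):
    """Return (mean, std_dev) of `history`, or None when there is too little data."""
    if len(history) < min_points:
        return None  # caller should skip highlighting for this test
    return mean(history), stdev(history)


print(summary_or_none([140, 150, 145]))          # too short: None
print(summary_or_none(list(range(12))) is None)  # 12 points is enough
```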

              Assignee: Unassigned
              Reporter: Mike Beckerle (mbeckerle.dfdl)