bolo


Data Types

Bolo handles several different types of monitoring data, including metrics (measurements of some numerical value), events, and states.

Metric Types

A metric is something that can be measured. CPU idle time is a metric, measured in seconds. The number of free inodes on a mountpoint is another metric, measured as a count.

Metrics come in several varieties, based on how the measurement is taken and what is done with the raw measurements. These varieties are sample, counter, and rate.

Sample Data

Samples are used to measure steady-state information. Each measurement stands alone and fully represents the thing it measures. Most metrics related to computer system health are treated as samples.

For example, the number of running nginx workers is a sample. A single number provides all the information you need to know about the size of the nginx worker pool at a given point in time.

Highly variable data can be difficult to monitor. The number of connected TCP clients to a given server can change from second to second. Measuring this value once a minute will often lead to skewed results, which in turn leads to misinformation and poorly-informed decision-making. To help combat this, sample metrics can aggregate multiple measurements during the aggregation window, using several statistical methods.

If a system is submitting a measurement every second, and bolo is configured to aggregate minutely, those 60 measurements will be analyzed to ascertain the minimum and maximum values, the sum of all measurements, the mean value measured, and the statistical variance of the sample set.
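A minimal sketch of that aggregation step may help. The function name and the use of Python's statistics module are illustrative, not bolo's actual implementation, and population variance is assumed:

```python
import statistics

def aggregate_samples(measurements):
    """Summarize one window of sample measurements the way the
    aggregator does: min, max, sum, mean, and variance."""
    return {
        "min":      min(measurements),
        "max":      max(measurements),
        "sum":      sum(measurements),
        "mean":     statistics.mean(measurements),
        "variance": statistics.pvariance(measurements),  # population variance assumed
    }

# 60 per-second measurements of, say, connected TCP clients
window = [40, 42, 41] * 20
summary = aggregate_samples(window)
print(summary["min"], summary["max"], summary["mean"])
```

Subscribers then receive the summary values rather than all 60 raw measurements.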

Counter Data

Sometimes you just want to count things. How many failed logins occurred? How many HTTP 404 responses did the web pool give out? How many times are people clicking the "upgrade" link in the application?

For this type of information, a Counter is ideal. Its operation is mundanely simple — every time we get a value, increment the counter. If clients want to, they can batch up these increments and send a value to add to the counter. When the aggregation window closes, the counter value resets to 0.
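The counter lifecycle described above can be sketched in a few lines (the class and method names are hypothetical, not bolo's API):

```python
class Counter:
    """Window-scoped counter: increments accumulate until the
    aggregation window closes, then the value resets to 0."""

    def __init__(self):
        self.value = 0

    def increment(self, by=1):
        # Clients may batch up increments and submit a value to add.
        self.value += by

    def close_window(self):
        # Report the final count for this window, then reset.
        total, self.value = self.value, 0
        return total

c = Counter()
c.increment()            # a single failed login
c.increment(by=4)        # a batched submission of four HTTP 404s
print(c.close_window())  # → 5; the counter is back at 0
```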

Rate Data

Some systems keep track of important quantities through the use of ever-increasing counters. Routers will often increment a 64-bit counter every time they handle a packet. By taking multiple measurements, at different points in time, one can calculate a delta, or change in value, which yields a rate of change measurement.

That's what Rate metrics are.

The bolo aggregator keeps track of the first and last values seen, and when those values were received. During aggregation, this information is used to determine the rate of change over the entire aggregation window, using some extrapolation techniques.
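The bookkeeping here amounts to remembering two (value, time) pairs per window. A sketch, with illustrative names (the timestamps are the seconds-offsets used in the worked example that follows; the :24 offset for the middle measurement is an assumption):

```python
class RateTracker:
    """Keeps the first and last (value, timestamp) pairs seen
    during the current aggregation window."""

    def __init__(self):
        self.first = None
        self.last = None

    def submit(self, value, ts):
        if self.first is None:
            self.first = (value, ts)   # starting value and its timestamp
        self.last = (value, ts)        # the latest measurement always wins

rt = RateTracker()
rt.submit(13, 4)
rt.submit(15, 24)
rt.submit(17, 44)
print(rt.first, rt.last)   # → (13, 4) (17, 44)
```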

An example should help to make this a little clearer.

Given an aggregation window of 60 seconds, we have the following timeline:

[Figure: data submission timeline for a Rate metric]

At t0, the window opens, and we have no previous value for the metric. Then, at t1, a client submits a measurement of v = 13. At this point, we know the starting value and its timestamp.

Then, at t2, a second measurement is submitted, v = 15. We still have our first (value, time) pair, (13, t1), but now we can add to that our last-seen (value, time) pair, (15, t2).

At t4, we get a third measurement, v = 17. Since this supersedes the measurement from t2, and we haven't yet closed the window, we update the last-seen (value, time) pair, leaving us with first = (13, t1) and last = (17, t4).

Finally, at t5, the window closes, we do rate calculation, and broadcast the value. The actual formula for rate calculation is a bit involved:

$$R = \frac{v_L - v_F}{t_L - t_F} \cdot w$$

That is, the value delta, \(v_L - v_F\), divided by the time delta, \(t_L - t_F\), multiplied by the window span, \(w\).

The strategy in play here is to reduce the delta down to a per-second rate of change, and then multiply by the window span (which is always handled in seconds) to get the per-window rate of change.

If we take the naïve approach to calculating rate:

$$R_{naïve} = \frac{v_L - v_F}{w}$$

and then plug in the values from our example, \(v_F = 13\) and \(v_L = 17\), with \(w = 60\), we get:

$$R_{naïve} = \frac{17 - 13}{60} = \frac{4}{60}$$

which equates to one new thing every 15 seconds.

Let's assign some actual times to this timeline: say the window opens at :00, t1 falls at :04, t2 at :24, and t4 at :44.

Between t1 and t2, we had a net change of \(2 / 20\), or 1 new thing every 10 seconds. Likewise, between t2 and t4, we had another net change of \(2 / 20\).

If we were to extrapolate to a minutely basis, there are 6 ten-second periods in a minute, so we intuitively expect a minutely rate of 6 new things per minute.

The naïve formula mistakenly calculates 4/min, precisely because it mis-handles this extrapolation.

Instead, the real formula:

$$R = \frac{v_L - v_F}{t_L - t_F} \cdot w$$

first calculates a per-second rate, and then extrapolates to per-window. Dropping in our values and timestamps (ignoring the hours and minutes because they are irrelevant here), we get:

$$R = \frac{17 - 13}{44 - 4} \cdot 60$$

$$= \frac{4}{40} \cdot 60$$

$$= 6$$
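Both formulas can be checked directly with the example's numbers (the function names here are illustrative, not part of bolo):

```python
def rate(v_f, v_l, t_f, t_l, window):
    """Per-window rate of change: reduce to a per-second delta,
    then extrapolate across the whole window."""
    return (v_l - v_f) / (t_l - t_f) * window

def rate_naive(v_f, v_l, window):
    """The naive version: value delta divided by the window span."""
    return (v_l - v_f) / window

# first = (13, t1 = :04), last = (17, t4 = :44), window = 60 seconds
print(rate(13, 17, 4, 44, 60))   # → 6.0, matching the intuition above
print(rate_naive(13, 17, 60))    # ≈ 0.067 per second, i.e. only 4 per window
```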

State Data

A key function of any monitoring system is the ability to detect problems and react to them, either by notifying a human operator, or attempting a pre-programmed fix. For this, bolo provides States.

Each State has, at a minimum, a status value (such as OK or CRITICAL) and a freshness flag.

The bolo aggregator tracks state submissions and derives the freshness flag accordingly. It also synthesizes transition notifications whenever the status value changes. Therefore, if a previously OK state becomes CRITICAL, the aggregator will broadcast a TRANSITION indicating the change. This edge-triggering of state data can be useful for subscribers wishing to perform notification on problem detection and associated recovery.
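The edge-triggered behavior can be sketched as follows; the class name, method, and tuple shape of the notification are all illustrative assumptions, not bolo's wire format:

```python
class StateTracker:
    """Edge-triggered state tracking: a notification is synthesized
    only when the status value actually changes."""

    def __init__(self):
        self.status = None

    def submit(self, status):
        previous, self.status = self.status, status
        if previous is not None and previous != status:
            # e.g. OK -> CRITICAL: broadcast a TRANSITION
            return ("TRANSITION", previous, status)
        return None             # no change; nothing to broadcast

st = StateTracker()
st.submit("OK")               # first submission; no transition
st.submit("OK")               # no change; stays quiet
print(st.submit("CRITICAL"))  # → ('TRANSITION', 'OK', 'CRITICAL')
```

A subscriber watching these transitions sees exactly one notification per problem and one per recovery, rather than a repeated stream of identical statuses.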

Event Data

State and metric changes rarely happen in a vacuum. Servers reboot, firewalls get reconfigured, processes bounce. These types of events can be tracked in bolo via Events.

An event is little more than a description (what happened?) and a timestamp (when did it happen?). Clients submit these events to bolo, and bolo in turn broadcasts them to interested subscribers. No real aggregation or de-duplication is performed on events.
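Since an event is just a description and a timestamp, a client-side representation is trivial; the field names below are hypothetical:

```python
import time

def make_event(description, ts=None):
    """An event pairs a description (what happened?) with a
    timestamp (when?); bolo relays it to subscribers as-is."""
    if ts is None:
        ts = int(time.time())   # default to "now"
    return {"what": description, "when": ts}

evt = make_event("web-01 rebooted", ts=1400000000)
print(evt)  # → {'what': 'web-01 rebooted', 'when': 1400000000}
```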