Bolo handles several different types of monitoring data, including metrics (measurements of some numerical value), events, and states.

A *metric* is something that can be measured. CPU idle time is a
metric, measured in seconds. The number of free inodes on a mountpoint
is another metric, measured as a count.

Metrics come in several varieties, based on how the measurement is
taken, and the *disposition* of the raw measurement. These varieties
are **sample**, **counter** and **rate**.

*Samples* are used to measure steady-state information. Each
measurement stands alone and fully represents the thing it measures.
Most metrics related to computer system health are treated as samples.

For example, the number of running nginx workers is a *sample*. A
single number provides all the information you need to know about the
size of the nginx worker pool at a given point in time.

Highly variable data can be difficult to monitor. The number of
connected TCP clients to a given server can change from second to
second. Measuring this value once a minute will often lead to skewed
results, which in turn leads to misinformation and poorly-informed
decision-making. To help combat this, *sample* metrics can aggregate
multiple measurements during the aggregation window, using several
statistical methods.

If a system is submitting a measurement every second, and bolo is
configured to aggregate minutely, those 60 measurements will be analyzed to
ascertain the *minimum* and *maximum* values, the *sum* of all
measurements, the *mean* value measured, and the *statistical variance*
of the sample set.
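A minimal sketch of this per-window aggregation, in Python (illustrative only, not bolo's actual implementation; the function name is an assumption):

```python
import statistics

def aggregate_samples(measurements):
    """Summarize one window's worth of sample measurements.

    Produces the statistics described above: minimum, maximum,
    sum, mean, and variance of the sample set.  (Population
    variance is used here; bolo may compute it differently.)
    """
    return {
        "min":  min(measurements),
        "max":  max(measurements),
        "sum":  sum(measurements),
        "mean": statistics.mean(measurements),
        "var":  statistics.pvariance(measurements),
    }

# e.g. 60 once-per-second measurements of connected TCP clients
window = [12, 15, 11, 14, 13] * 12
summary = aggregate_samples(window)
print(summary["min"], summary["max"], summary["sum"])  # 11 15 780
```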

Sometimes you just want to count things. How many failed logins occurred? How many HTTP 404 responses did the web pool give out? How many times are people clicking the "upgrade" link in the application?

For this type of information, a *Counter* is ideal. Its operation is
mundanely simple — every time we get a value, increment the counter.
If clients want to, they can batch up these increments and send a value
to add to the counter. When the aggregation window closes, the counter
value resets to 0.
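The counter behavior described above, including batched increments and the reset at window close, can be sketched like this (Python, illustrative only):

```python
class Counter:
    """Sketch of a window-scoped counter; not bolo's actual code."""

    def __init__(self):
        self.value = 0

    def increment(self, by=1):
        # Clients may batch up increments and submit a value to add.
        self.value += by

    def close_window(self):
        # When the aggregation window closes, report the total
        # and reset the counter to 0.
        total, self.value = self.value, 0
        return total

c = Counter()
for _ in range(3):
    c.increment()        # three failed logins, say
c.increment(by=5)        # a batched update adding 5 more
print(c.close_window())  # 8
print(c.value)           # 0
```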

Some systems keep track of important quantities through the use of
ever-increasing counters. Routers will often increment a 64-bit counter
every time they handle a packet. By taking multiple measurements, at
different points in time, one can calculate a *delta*, or change in
value, which yields a rate of change measurement.

That's what *Rate* metrics are.

The bolo aggregator keeps track of the first and last values seen, and when those values were received. During aggregation, this information is used to determine the rate of change over the entire aggregation window, using some extrapolation techniques.

An example should help to make this a little clearer.

Given an aggregation window of 60 seconds, we have the following timeline:

At \(t_0\), the window opens, and we have no previous value for the metric.

Then, at \(t_1\), the first measurement is submitted, \(v = 13\). At this
point, we know the starting value and its timestamp.

Then, at \(t_2\), a second measurement is submitted, \(v = 15\). We still
have our first (value, time) pair, \((13, t_1)\), and we now record
\((15, t_2)\) as the last pair seen.

At \(t_3\), we get a third measurement, \(v = 17\). Since this supersedes
the measurement from \(t_2\), \((17, t_3)\) becomes the new last pair.

Finally, at \(t_4\), the window closes, we do the rate calculation, and
broadcast the value. The actual formula for rate calculation is a bit
involved:

$$R = \frac{v_L - v_F}{t_L - t_F} \cdot w$$

That is, the value delta, \(v_L - v_F\), divided by the time delta, \(t_L - t_F\), multiplied by the window span, \(w\).

The strategy in play here is to reduce the delta down to a per-second rate of change, and then multiply by the window span (which is always handled in seconds) to get the per-window rate of change.

If we took the naïve approach to calculating rate:

$$R_{naïve} = \frac{v_L - v_F}{w}$$

and then plugged in the values from our example, \(v_F = 13\) and \(v_L = 17\), with \(w = 60\), we would get:

$$R_{naïve} = \frac{17 - 13}{60} = \frac{4}{60}$$

which equates to one new thing every 15 seconds.

Let's assign some actual times to this timeline:

- \(t_0\) = 12:13:00
- \(t_1\) = 12:13:04
- \(t_2\) = 12:13:24
- \(t_3\) = 12:13:44
- \(t_4\) = 12:14:00

Between \(t_1\) and \(t_3\), 40 seconds elapsed and the measured value
increased by 4; that is, one new thing every 10 seconds.

If we were to extrapolate to a minutely basis, there are 6 ten-second periods in a minute, so we intuitively expect a minutely rate of 6 new things per minute.

The naïve formula mistakenly calculates 4/min, precisely because it mis-handles this extrapolation.

Instead, the real formula:

$$R = \frac{v_L - v_F}{t_L - t_F} \cdot w$$

first calculates a per-second rate, and then extrapolates to per-window. Dropping in our values and timestamps (ignoring the hours and minutes because they are irrelevant here), we get:

$$R = \frac{17 - 13}{44 - 4} \cdot 60$$

$$= \frac{4}{40} \cdot 60$$

$$= 6$$
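The rate calculation can be sketched in a few lines of Python (illustrative only; the function name and (value, timestamp) pair representation are assumptions, not bolo's actual API):

```python
def window_rate(first, last, window=60):
    """Compute the per-window rate of change.

    `first` and `last` are (value, timestamp) pairs: the first and
    last measurements seen during the window, as tracked by the
    aggregator.  Implements R = (v_L - v_F) / (t_L - t_F) * w.
    """
    (v_f, t_f), (v_l, t_l) = first, last
    return (v_l - v_f) / (t_l - t_f) * window

# the worked example: v=13 at second 4, v=17 at second 44, 60s window
print(window_rate((13, 4), (17, 44)))  # 6.0
```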

A key function of any monitoring system is the ability to detect
problems and react to them, either by notifying a human operator, or
attempting a pre-programmed fix. For this, bolo provides *States*.

Each *State* has:

- A **name**, to differentiate this state from all the others,
- A **status** of either *OK*, *WARNING*, *CRITICAL* or *UNKNOWN*,
- A **freshness** flag that indicates whether or not bolo has recently received confirmation of the current state, and
- A **summary message** which provides a more detailed explanation of the current state / status (i.e. *why* is the state warning?)

The bolo aggregator tracks state submissions and derives the *freshness*
flag accordingly. It also synthesizes *transition notifications*
whenever the status value changes. Therefore, if a previously OK state
becomes CRITICAL, the aggregator will broadcast a *TRANSITION*
indicating the change. This edge-triggering of state data can be useful
for subscribers wishing to perform notification on problem detection and
associated recovery.
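The edge-triggered transition logic can be sketched as follows (Python, illustrative only; the state name and notification format here are made up for the example):

```python
# Track the last seen status per state name; emit a notification
# only when the status *changes* (edge-triggering), not on every
# submission.
last_status = {}

def track(name, status):
    """Return a TRANSITION notification if the status changed, else None."""
    previous = last_status.get(name)
    last_status[name] = status
    if previous is not None and previous != status:
        return f"TRANSITION {name}: {previous} -> {status}"
    return None

print(track("web01:cpu", "OK"))        # None (first sighting)
print(track("web01:cpu", "OK"))        # None (no change)
print(track("web01:cpu", "CRITICAL"))  # TRANSITION web01:cpu: OK -> CRITICAL
```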

State and metric changes rarely happen in a vacuum. Servers reboot,
firewalls get reconfigured, processes bounce. These types of events can
be tracked in bolo via *Events*.

An event is little more than a description (*what happened?*) and a
timestamp (*when did it happen?*). Clients submit these events to bolo,
and bolo in turn broadcasts them to interested subscribers. No real
aggregation or de-duplication is performed on events.