
Option to dilute unwritten metrics rather than drop the oldest #16568

Open
sammyhori opened this issue Feb 28, 2025 · 1 comment
Labels
feature request · waiting for response

Comments

@sammyhori

Use Case

Allow users to prioritise keeping lower-frequency (longer-interval) metrics over losing the oldest metrics during output downtime.
This would be configured in telegraf.conf, alongside metric_buffer_limit and similar options.

One use case for this would be a system that monitors the health of a number of other systems. There it seems reasonable that 8 days of system health statuses every 80 seconds may provide more valuable insight than 1 day of statuses every 10 seconds.

Expected behavior

Say there is output downtime (e.g. the InfluxDB database can't be reached). If the metric_buffer_limit is reached, the first of every 2 consecutive datapoints is dropped.

For example:

Let metric_buffer_limit = 100 and interval = "10s".
Given the set of datapoints $\{d_1, d_2, d_3, \ldots, d_{99}, d_{100}\}$, the datapoints to be dropped should be $d_{2x-1}$ for $x = 1, 2, \ldots$, starting from the earliest metrics.

When $d_{200}$ has been recorded, the effective interval of the retained data should have doubled (e.g. "10s" $\rightarrow$ "20s"). At this point the dropped values should be $d_{4x-2}$.
This repeats every time the deletion process reaches the last valid datapoint.
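
A rough Python sketch of this dilution behaviour (illustrative only, not Telegraf code; DilutingBuffer and its fields are hypothetical names):

```python
class DilutingBuffer:
    """Illustrative sketch of the proposed dilution behaviour (not Telegraf code).

    Once the buffer is full, each newly arriving metric evicts one existing
    metric by sweeping from the oldest end and removing every second datapoint
    (the d_{2x-1} above). When a sweep reaches the end of the buffer it restarts
    from the oldest end, so the effective interval of the retained history
    doubles on each pass.
    """

    def __init__(self, limit=100):
        self.limit = limit
        self.data = []
        self.cursor = 0          # index of the next datapoint to drop

    def add(self, metric):
        if len(self.data) >= self.limit:
            if self.cursor >= len(self.data):
                self.cursor = 0  # sweep finished: start the next dilution pass
            del self.data[self.cursor]
            self.cursor += 1     # skip over the surviving neighbour
        self.data.append(metric)


buf = DilutingBuffer(limit=100)
for i in range(1, 201):          # simulate d_1 .. d_200 arriving during downtime
    buf.add(f"d_{i}")
print(buf.data[:3], "...", buf.data[-3:])   # only even-numbered points survive
```

With limit = 100 and 200 recorded datapoints, exactly the odd-numbered points $d_{2x-1}$ have been dropped, matching the example above; the next arrival would start the $d_{4x-2}$ pass.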

Someone analysing graphs based on this data would see a lower frequency of data points throughout the downtime period; however, depending on when the downtime ended, the most recent data collected during the downtime would be available at twice the frequency of the rest.

Actual behavior

Currently the behaviour is just to drop the oldest metrics (I believe).

Additional info

Potential implementation issues may arise if InfluxDB expects data at a fixed interval, but I'm unsure whether this is the case.

A feature that could build on this would be allowing the user to choose to dilute starting from the most recent data, keeping the more frequent data from around the point the downtime started. This would require the interval doubling to occur as soon as the limit is hit.

@sammyhori sammyhori added the feature request label Feb 28, 2025
@srebhan
Member

srebhan commented Mar 4, 2025

@sammyhori I think this is a very special use-case, but it can already be done. You need to downsample your series, e.g. using the starlark processor, so that it only contains every second sample. Then add a special tag and add another output plugin that only writes the downsampled metrics, using e.g. the tagpass option. Drop the additional tag using tagexclude and you are done, as buffering is per output instance.
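
A minimal telegraf.conf sketch of this workaround (a non-authoritative example; the measurement name, URLs, bucket names and the downsampled tag are hypothetical placeholders, and the simple counter assumes a single series):

```toml
# Tag every second sample so it can be routed to a separate output buffer.
[[processors.starlark]]
  namepass = ["system_health"]          # hypothetical measurement name
  source = '''
def apply(metric):
    # "state" is the starlark processor's shared dict for keeping data between calls
    state["count"] = state.get("count", 0) + 1
    if state["count"] % 2 == 0:
        metric.tags["downsampled"] = "true"
    return metric
'''

# Full-resolution output: drops the oldest metrics when its buffer fills.
[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]       # hypothetical URL
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "metrics"
  metric_buffer_limit = 10000
  tagexclude = ["downsampled"]          # strip the helper tag before writing

# Downsampled output: only receives the tagged metrics, so its buffer covers
# roughly twice the time window during the same downtime (buffering is per output).
[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "metrics_longterm"           # hypothetical bucket
  metric_buffer_limit = 10000
  tagexclude = ["downsampled"]
  [outputs.influxdb_v2.tagpass]
    downsampled = ["true"]
```

Because the second output only buffers half as many metrics per unit time, the same metric_buffer_limit covers roughly twice the downtime window.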

@srebhan srebhan added the waiting for response label Mar 4, 2025