
Option to dilute unwritten metrics rather than drop the oldest #16568

Open
sammyhori opened this issue Feb 28, 2025 · 1 comment
Labels
feature request · waiting for response

Comments

@sammyhori

Use Case

Allow users to prioritise keeping lower-frequency (longer-interval) metrics over losing the oldest metrics during output downtime.
This would be configured in telegraf.conf, alongside metric_buffer_limit and similar options.

One use case for this would be a system that monitors the health of a number of other systems. There it seems reasonable that 8 days of system health statuses every 80 seconds may provide more valuable insight than 1 day of statuses every 10 seconds.

Expected behavior

Say there is output downtime (e.g. the InfluxDB database can't be reached). If the metric_buffer_limit is reached, the first of every 2 consecutive datapoints is dropped.

For example:

Let metric_buffer_limit = 100 and interval = "10s".
Given the set of datapoints $\{d_1, d_2, d_3, \ldots, d_{99}, d_{100}\}$, the datapoints to be dropped should be $d_{2x-1}$ for $x = 1, 2, \ldots$, starting from the earliest metrics.

When $d_{200}$ has been recorded, the effective interval of the retained data should have doubled (e.g. "10s" $\rightarrow$ "20s"). At this point the dropped values should be $d_{4x-2}$.
This repeats every time the deletion process reaches the last valid datapoint.
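
A rough Python sketch of this dilution behaviour (illustrative only, not Telegraf code; DilutingBuffer and its fields are hypothetical names):

```python
class DilutingBuffer:
    """Illustrative sketch of the proposed dilution behaviour (not Telegraf code).

    Once the buffer is full, each newly arriving metric evicts one existing
    metric by sweeping from the oldest end and removing every second datapoint
    (the d_{2x-1} above). When a sweep reaches the end of the buffer it restarts
    from the oldest end, so the effective interval of the retained history
    doubles on each pass.
    """

    def __init__(self, limit=100):
        self.limit = limit
        self.data = []
        self.cursor = 0          # index of the next datapoint to drop

    def add(self, metric):
        if len(self.data) >= self.limit:
            if self.cursor >= len(self.data):
                self.cursor = 0  # sweep finished: start the next dilution pass
            del self.data[self.cursor]
            self.cursor += 1     # skip over the surviving neighbour
        self.data.append(metric)


buf = DilutingBuffer(limit=100)
for i in range(1, 201):          # simulate d_1 .. d_200 arriving during downtime
    buf.add(f"d_{i}")
print(buf.data[:3], "...", buf.data[-3:])   # only even-numbered points survive
```

With limit = 100 and 200 recorded datapoints, exactly the odd-numbered points $d_{2x-1}$ have been dropped, matching the example above; the next arrival would start the $d_{4x-2}$ pass.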

Someone analysing graphs based on this data would see a lower frequency of data points throughout the downtime period; however, depending on when the downtime ended, the most recent data collected during the downtime would be available at twice the frequency of the rest.

Actual behavior

Currently the behaviour is just to drop the oldest metrics (I believe).

Additional info

Potential implementation issues may arise if InfluxDB expects data at a fixed interval, but I'm unsure whether this is the case.

A feature that could build on this would be allowing the user to choose to dilute starting from the most recent data, keeping the more frequent data from around the point the downtime started. This would require the interval doubling to occur as soon as the limit is hit.

@sammyhori sammyhori added the feature request label Feb 28, 2025
@srebhan
Member

srebhan commented Mar 4, 2025

@sammyhori I think this is a very special use-case, but it can already be done. You need to downsample your series, e.g. using the starlark processor, so that it only contains every second sample. Then add a special tag and add another output plugin that only writes the downsampled metrics, using e.g. the tagpass option. Drop the additional tag using tagexclude and you are done, as buffering is per output instance.
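
A minimal telegraf.conf sketch of this workaround (a non-authoritative example; the measurement name, URLs, bucket names and the downsampled tag are hypothetical placeholders, and the simple counter assumes a single series):

```toml
# Tag every second sample so it can be routed to a separate output buffer.
[[processors.starlark]]
  namepass = ["system_health"]          # hypothetical measurement name
  source = '''
def apply(metric):
    # "state" is the starlark processor's shared dict for keeping data between calls
    state["count"] = state.get("count", 0) + 1
    if state["count"] % 2 == 0:
        metric.tags["downsampled"] = "true"
    return metric
'''

# Full-resolution output: drops the oldest metrics when its buffer fills.
[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]       # hypothetical URL
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "metrics"
  metric_buffer_limit = 10000
  tagexclude = ["downsampled"]          # strip the helper tag before writing

# Downsampled output: only receives the tagged metrics, so its buffer covers
# roughly twice the time window during the same downtime (buffering is per output).
[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]
  token = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket = "metrics_longterm"           # hypothetical bucket
  metric_buffer_limit = 10000
  tagexclude = ["downsampled"]
  [outputs.influxdb_v2.tagpass]
    downsampled = ["true"]
```

Because the second output only buffers half as many metrics per unit time, the same metric_buffer_limit covers roughly twice the downtime window.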

@srebhan srebhan added the waiting for response label Mar 4, 2025