TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90%#15467
TBS: set default sampling.tail.storage_limit to 0 but limit disk usage to 90%#15467carsonip merged 51 commits intoelastic:mainfrom
sampling.tail.storage_limit to 0 but limit disk usage to 90%#15467Conversation
|
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
|
|
|
|
|
|
|
|
|
||
| usage, err := vfs.Default.GetDiskUsage(sm.storageDir) | ||
| if err != nil { | ||
| sm.rateLimitedLogger.With(logp.Error(err)).Warn("failed to get disk usage") |
There was a problem hiding this comment.
what is the fallback in this case?
There was a problem hiding this comment.
In the current implementation it effectively becomes unlimited. It is by design because of how they are 2 separate checkers and RW. But I think it is possible to overwrite dbStorageLimit to a fallback value if we get an error, so that storage_limit checker takes over.
There was a problem hiding this comment.
Added fallback handling. A bit complex, but should cover all cases. All of it is written with the assumption that if GetDiskUsage ever returns an error, it will always return an error.
|
|
| usage, err := sm.getDiskUsage() | ||
| if err != nil { | ||
| sm.logger.With(logp.Error(err)).Warn("failed to get disk usage") | ||
| sm.getDiskUsageFailed.Store(true) |
There was a problem hiding this comment.
nit: While I can't think of what could lead to transient failures, I am not sure about always failing on error bit - something to think about in a future PR.
There was a problem hiding this comment.
This gives me a headache as well. What should existing disk usage threshold checks perform when getDiskUsage has transient failures, should it use a stale number, or should it become unlimited? With this assumption of always failing, it simplifies the implementation.
…age to 90% (#15467) This is a breaking change to the default storage_limit to enable a more user-friendly TBS disk usage handling. This new default will automatically scale with a larger disk. Change sampling.tail.storage_limit default to 0. While 0 means unlimited local tail-sampling database size, it now enforces a max 90% disk usage on the disk where the data directory is located. Any tail sampling writes after this threshold will be rejected, similar to what happens when tail-sampling database size exceeds a non-0 storage limit. Setting sampling.tail.storage_limit to non-0 maintains the existing behavior which limits the tail-sampling database size to sampling.tail.storage_limit and does not have the new disk usage threshold check. (cherry picked from commit d019277) # Conflicts: # changelogs/9.0.asciidoc
…isk usage to 90% (backport #15467) (#15501) * TBS: set default `sampling.tail.storage_limit` to 0 but limit disk usage to 90% (#15467) This is a breaking change to the default storage_limit to enable a more user-friendly TBS disk usage handling. This new default will automatically scale with a larger disk. Change sampling.tail.storage_limit default to 0. While 0 means unlimited local tail-sampling database size, it now enforces a max 90% disk usage on the disk where the data directory is located. Any tail sampling writes after this threshold will be rejected, similar to what happens when tail-sampling database size exceeds a non-0 storage limit. Setting sampling.tail.storage_limit to non-0 maintains the existing behavior which limits the tail-sampling database size to sampling.tail.storage_limit and does not have the new disk usage threshold check. (cherry picked from commit d019277) # Conflicts: # changelogs/9.0.asciidoc * Fix changelog --------- Co-authored-by: Carson Ip <carsonip@users.noreply.github.com> Co-authored-by: Carson Ip <carson.ip@elastic.co>
Fix a missing colon in logs (typo from #15235 ), and remove "storage" in "configured storage limit reached" message to make way for #15467 to avoid confusion (cherry picked from commit 28068bd) Co-authored-by: Carson Ip <carsonip@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
sampling.tail.storage_limit is 0 by default in 9.0. See elastic/apm-server#15467 . As UI validation requires unit (e.g. GB), set apm integration default storage limit to 0GB which carries the same meaning.
Motivation/summary
This is a breaking change to the default storage_limit to enable a more user-friendly TBS disk usage handling. This new default will automatically scale with a larger disk.
Change
sampling.tail.storage_limitdefault to0.While
0means unlimited local tail-sampling database size,it now enforces a max 90% disk usage on the disk where the data directory is located.
Any tail sampling writes after this threshold will be rejected,
similar to what happens when tail-sampling database size exceeds a non-0 storage limit.
Setting
sampling.tail.storage_limitto non-0 maintains the existing behaviorwhich limits the tail-sampling database size to
sampling.tail.storage_limitand does not have the new disk usage threshold check.
Checklist
For functional changes, consider:
How to test these changes
Test in conjunction with #15524
Test cases
setting disk usage threshold to *** of total disk space of ***gbsetting disk usage threshold to *** of total disk space of ***gbsetting database storage limit to %0.1fgblog but nosetting disk usage threshold to *** of total disk space of ***gbRelated issues
Part of #14933
EA-managed apm-server needs elastic/integrations#12543 to change the default.