-
-
Notifications
You must be signed in to change notification settings - Fork 312
Openmetrics encoding support and Exemplars. #388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Openmetrics encoding support and Exemplars. #388
Conversation
Prometheus/Exemplar.cs
Outdated
/// </summary> | ||
public static LabelKey Key(string key) | ||
{ | ||
var si = new StringInfo(key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RFC: Need to find a better way than this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This approach I took above is counting graphemes (see this), so it is incorrect.
We need to count the "Runes", the runes are unicode scalars. Runes are only available in dotnet core 3 and above.
@sandersaares do you think we could drop support for older frameworks ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what if we only allowed only ASCII characters here until we drop support for older frameworks ? I can't see any reason for an exemplar to contain anything but alphanumeric characters. The implementation would enforce ASCII by simply ensuring that the lengths of the encoded byte[]
and source string are the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting case. How important is it to accurately measure the length, do you know? Are there ingestors that hard-reject oversize data or is it more of a "best practice" that might be relaxed?
I would rather not drop support for older frameworks - there is a surprising amount of new .NET Framework code still being written every day.
ASCII only might work, though. Of course, someone might still put non-ASCII data in there occasionally, so I would not want to just throw an exception on non-ASCII input, as that would create a time bomb. However, "sanitizing" non-ASCII data seems like a reasonable compromise to avoid over-engineering for an edge case. I believe if you pass non-ASCII data to Encoding.ASCII.GetBytes() it will just replace those characters with question marks, which should be good enough for now. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting case. How important is it to accurately measure the length, do you know? Are there ingestors that hard-reject oversize data or is it more of a "best practice" that might be relaxed?
I fuzzy searched through the prometheus/prometheus and prometheus/common codebase. The TSDB layer has this function which I assume controls admission.
I believe if you pass non-ASCII data to Encoding.ASCII.GetBytes() it will just replace those characters with question marks, which should be good enough for now. What do you think?
This would be quite defensive. However, is this degree of defensiveness warranted for the intended use-case: encoding distributed trace identifiers. If the industry came up with another field that made sense in the exemplar it would still just be an ASCII safe reference of some form.
If we silently replaced ASCII-unsafe runes with the standard ?
placeholder we are silently corrupting the users intention.
@sandersaares WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sandersaares went with your proposed approach.
Prometheus/Histogram.cs
Outdated
suffix: CountSuffix); | ||
|
||
var cumulativeCount = 0L; | ||
|
||
for (var i = 0; i < _bucketCounts.Length; i++) | ||
{ | ||
ObservedExemplar cp; | ||
lock (_exemplars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RFC: does this look correct? I haven't reasoned about this. Mental model is that structs in C# work similarly to go and a shallow copy in this case is safe.
This pattern is also used in Counter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks correct, yeah - it performs a shallow copy and that's it.
However, I am not very keen on having locks in the hot path (ObserveInternal()
) - the metric objects try to be very lock-free in general because things like HTTP servers can funnel a lot of requests (handled by many threads) into observations on the same metric instances. So far I have managed to keep all except for Summary lock-free and would ideally like to keep it so.
You can see the potential cost of multithreaded access in MeasurementBenchmarks
- already without locks it is rather slow (though artificially slow as there is no "other work" done in the benchmark). I am afraid that adding locks might create even more potential slowdown.
Example of differences on my PC (True/False column indicating use of exemplars; I will commit my exemplar-enhanced benchmark code shortly):
| MeasurementPerformance | 100000 | 16 | Counter | False | 124,351.9 us | 2,163.73 us | 2,023.96 us | 123,847.7 us | - | 13.0000 | 384 B |
| MeasurementPerformance | 100000 | 16 | Counter | True | 230,592.2 us | 3,085.85 us | 2,886.51 us | 229,846.1 us | - | 6124.0000 | 384 B |
|
| MeasurementPerformance | 100000 | 16 | Histogram | False | 212,602.3 us | 1,505.49 us | 1,408.24 us | 212,393.8 us | - | 14.0000 | 384 B |
| MeasurementPerformance | 100000 | 16 | Histogram | True | 242,188.5 us | 2,762.03 us | 2,583.61 us | 241,092.3 us | - | 1445.0000 | 384 B |
|
The total duration here, of course, also includes ObservedExemplar.Update()
and is not only the lock contention overhead. The 2nd column from the right is the lock contention count.
A lock-free approach would be preferable here to try ensure that capturing data is cheap. How to do that, though? What if we turn it into a reference type and just pool them to avoid the constant allocation overhead?
- ObjectPool can manage a set of
ObservedExemplar
instances. - In
ObserveInternal()
we get a new instance from the pool and use a loop over Interlocked.CompareExchange to replace it in_exemplars
, returning the old one to the pool. - In
CollectAndSerializeImplAsync()
we have a bit of a roadblock that we need to prevent our instance from being returned to the pool while we are still using it. However, we can perhaps do a mirrored operation, take an instance from the pool, zero it, and swap it out with the one in_exemplars
. After serializing the exemplar, we can swap it back (if the one in_exemplars
is still our empty instance) and return the empty instance to the pool. - This leaves a small window where the exemplar is left empty, so in a theoretical case two concurrent metrics exports could end up with one not seeing the exemplar. But this seems like a totally OK price to pay for lock-free code - very theoretical situation, and harmless even if it happens.
What do you think? Just off the top of my head, it could be there is a big plot hole here somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sandersaares updated design to use ObjectPool
, Interlocked.CompareExchange
and Exchange
. Agree that concurrent expositions shouldn't really be a thing in practise.
We do have HA prometheus scrapers in our ingestion topology but I have no looked into how this works. In any case even if there is an increased risk due to an ACTIVE/ACTIVE setup I think this should be fine.
(edit: our ingestion is ACTIVE/ACTIVE but the TSDB backend does deduping on admission. The exemplar shouldn't factor into this).
Prometheus/Exemplar.cs
Outdated
/// </summary> | ||
public static LabelKey Key(string key) | ||
{ | ||
var si = new StringInfo(key); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting case. How important is it to accurately measure the length, do you know? Are there ingestors that hard-reject oversize data or is it more of a "best practice" that might be relaxed?
I would rather not drop support for older frameworks - there is a surprising amount of new .NET Framework code still being written every day.
ASCII only might work, though. Of course, someone might still put non-ASCII data in there occasionally, so I would not want to just throw an exception on non-ASCII input, as that would create a time bomb. However, "sanitizing" non-ASCII data seems like a reasonable compromise to avoid over-engineering for an edge case. I believe if you pass non-ASCII data to Encoding.ASCII.GetBytes() it will just replace those characters with question marks, which should be good enough for now. What do you think?
Prometheus/Histogram.cs
Outdated
suffix: CountSuffix); | ||
|
||
var cumulativeCount = 0L; | ||
|
||
for (var i = 0; i < _bucketCounts.Length; i++) | ||
{ | ||
ObservedExemplar cp; | ||
lock (_exemplars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks correct, yeah - it performs a shallow copy and that's it.
However, I am not very keen on having locks in the hot path (ObserveInternal()
) - the metric objects try to be very lock-free in general because things like HTTP servers can funnel a lot of requests (handled by many threads) into observations on the same metric instances. So far I have managed to keep all except for Summary lock-free and would ideally like to keep it so.
You can see the potential cost of multithreaded access in MeasurementBenchmarks
- already without locks it is rather slow (though artificially slow as there is no "other work" done in the benchmark). I am afraid that adding locks might create even more potential slowdown.
Example of differences on my PC (True/False column indicating use of exemplars; I will commit my exemplar-enhanced benchmark code shortly):
| MeasurementPerformance | 100000 | 16 | Counter | False | 124,351.9 us | 2,163.73 us | 2,023.96 us | 123,847.7 us | - | 13.0000 | 384 B |
| MeasurementPerformance | 100000 | 16 | Counter | True | 230,592.2 us | 3,085.85 us | 2,886.51 us | 229,846.1 us | - | 6124.0000 | 384 B |
|
| MeasurementPerformance | 100000 | 16 | Histogram | False | 212,602.3 us | 1,505.49 us | 1,408.24 us | 212,393.8 us | - | 14.0000 | 384 B |
| MeasurementPerformance | 100000 | 16 | Histogram | True | 242,188.5 us | 2,762.03 us | 2,583.61 us | 241,092.3 us | - | 1445.0000 | 384 B |
|
The total duration here, of course, also includes ObservedExemplar.Update()
and is not only the lock contention overhead. The 2nd column from the right is the lock contention count.
A lock-free approach would be preferable here to try ensure that capturing data is cheap. How to do that, though? What if we turn it into a reference type and just pool them to avoid the constant allocation overhead?
- ObjectPool can manage a set of
ObservedExemplar
instances. - In
ObserveInternal()
we get a new instance from the pool and use a loop over Interlocked.CompareExchange to replace it in_exemplars
, returning the old one to the pool. - In
CollectAndSerializeImplAsync()
we have a bit of a roadblock that we need to prevent our instance from being returned to the pool while we are still using it. However, we can perhaps do a mirrored operation, take an instance from the pool, zero it, and swap it out with the one in_exemplars
. After serializing the exemplar, we can swap it back (if the one in_exemplars
is still our empty instance) and return the empty instance to the pool. - This leaves a small window where the exemplar is left empty, so in a theoretical case two concurrent metrics exports could end up with one not seeing the exemplar. But this seems like a totally OK price to pay for lock-free code - very theoretical situation, and harmless even if it happens.
What do you think? Just off the top of my head, it could be there is a big plot hole here somewhere.
} | ||
|
||
if (tally > MaxRunes) | ||
throw new ArgumentException($"exemplar labels has {tally} runes, exceeding the limit of {MaxRunes}."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very comfortable throwing "bad data" type of exception from this code, largely because it is often unexpected to the users of the library - they expect that telemetry APIs are not going to complain about data, it is seen as "harmless" code.
What is the impact if we skip the length check and emit over-length data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The go implementation has a panic in the instrumentation codepath. I feel we should treat it similarly here.
What is the impact if we skip the length check and emit over-length data?
I Suspect warnings from the storage backends in the logs and the Exemplars just being dropped.
Prometheus/Histogram.cs
Outdated
@@ -124,6 +141,9 @@ public void Observe(double val, long count) | |||
if (val <= _upperBounds[i]) | |||
{ | |||
_bucketCounts[i].Add(count); | |||
if (exemplar.Length > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest null-checking exemplar, to avoid surprising callers. Here and elsewhere, so we treat an exemplar of null as equivalent to empty array, to be more tolerant of the form of input that is provided.
Perhaps just exemplar ??= Array.Empty<...>()
at the start of the method or such?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rider suggested if (exemplar is { Length: > 0 })
for a null + length check. That works (I think) ?
…ormat is used as in golang.
This PR:
State: Feature complete.