Openmetrics encoding support and Exemplars. #388


Merged

Conversation

hsyed-dojo
Contributor

@hsyed-dojo hsyed-dojo commented Nov 18, 2022

This PR:

  • Introduces all of the machinery needed to attach an Exemplar to Counter and Histogram instruments.
  • OpenMetrics encoding - aims for feature parity with the Golang implementation (this PR).
  • Protocol negotiation to select the encoding format (manually tested with curl and verified with Prometheus).
  • Wired up end to end in ASP.NET (please let me know if I have missed other ways of starting an HTTP run loop).
  • Bash script to set up Prometheus for scraping and ingesting exemplars, for manual testing.

State: Feature complete.

@hsyed-dojo hsyed-dojo marked this pull request as ready for review November 18, 2022 16:34
/// </summary>
public static LabelKey Key(string key)
{
var si = new StringInfo(key);
Contributor Author

RFC: Need to find a better way than this.

Contributor Author

@hsyed-dojo hsyed-dojo Nov 21, 2022

The approach I took above counts graphemes (see this), so it is incorrect.

We need to count runes instead; runes are Unicode scalar values. The Rune type is only available in .NET Core 3 and above.

@sandersaares do you think we could drop support for older frameworks?

Contributor Author

@hsyed-dojo hsyed-dojo Nov 21, 2022

Hmm, what if we allowed only ASCII characters here until we drop support for older frameworks? I can't see any reason for an exemplar to contain anything but alphanumeric characters. The implementation would enforce ASCII by simply checking that the encoded byte[] and the source string have the same length.
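The length-comparison trick can be sketched in Go (`isASCII` is a hypothetical helper, not part of the PR): every ASCII rune encodes to exactly one UTF-8 byte, so a string is pure ASCII exactly when its byte length equals its rune count.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII is a hypothetical helper: any non-ASCII rune occupies more than
// one UTF-8 byte, so byte length == rune count implies pure ASCII.
func isASCII(s string) bool {
	return len(s) == utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(isASCII("abc123-trace-id")) // true
	fmt.Println(isASCII("héllo"))           // false
}
```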

Member

Interesting case. How important is it to accurately measure the length, do you know? Are there ingestors that hard-reject oversize data or is it more of a "best practice" that might be relaxed?

I would rather not drop support for older frameworks - there is a surprising amount of new .NET Framework code still being written every day.

ASCII only might work, though. Of course, someone might still put non-ASCII data in there occasionally, so I would not want to just throw an exception on non-ASCII input, as that would create a time bomb. However, "sanitizing" non-ASCII data seems like a reasonable compromise to avoid over-engineering for an edge case. I believe if you pass non-ASCII data to Encoding.ASCII.GetBytes() it will just replace those characters with question marks, which should be good enough for now. What do you think?

Contributor Author

@hsyed-dojo hsyed-dojo Nov 30, 2022

Interesting case. How important is it to accurately measure the length, do you know? Are there ingestors that hard-reject oversize data or is it more of a "best practice" that might be relaxed?

I fuzzy-searched through the prometheus/prometheus and prometheus/common codebases. The TSDB layer has this function, which I assume controls admission.

I believe if you pass non-ASCII data to Encoding.ASCII.GetBytes() it will just replace those characters with question marks, which should be good enough for now. What do you think?

This would be quite defensive. But is this degree of defensiveness warranted for the intended use case, encoding distributed trace identifiers? If the industry came up with another field that made sense in an exemplar, it would still just be an ASCII-safe reference of some form.

If we silently replaced ASCII-unsafe runes with the standard ? placeholder, we would be silently corrupting the user's intent.

@sandersaares WDYT?

Contributor Author

@sandersaares I went with your proposed approach.

suffix: CountSuffix);

var cumulativeCount = 0L;

for (var i = 0; i < _bucketCounts.Length; i++)
{
ObservedExemplar cp;
lock (_exemplars)
Contributor Author

@hsyed-dojo hsyed-dojo Nov 18, 2022

RFC: does this look correct? I haven't fully reasoned about this. My mental model is that structs in C# work similarly to Go's, so a shallow copy is safe in this case.

This pattern is also used in Counter.

Member

It looks correct, yeah - it performs a shallow copy and that's it.

However, I am not very keen on having locks in the hot path (ObserveInternal()) - the metric objects try to be very lock-free in general because things like HTTP servers can funnel a lot of requests (handled by many threads) into observations on the same metric instances. So far I have managed to keep all except for Summary lock-free and would ideally like to keep it so.

You can see the potential cost of multithreaded access in MeasurementBenchmarks - already without locks it is rather slow (though artificially slow as there is no "other work" done in the benchmark). I am afraid that adding locks might create even more potential slowdown.

Example of differences on my PC (True/False column indicating use of exemplars; I will commit my exemplar-enhanced benchmark code shortly):

| MeasurementPerformance |           100000 |          16 |          Counter |         False | 124,351.9 us | 2,163.73 us | 2,023.96 us | 123,847.7 us |                    - |          13.0000 |     384 B |
| MeasurementPerformance |           100000 |          16 |          Counter |          True | 230,592.2 us | 3,085.85 us | 2,886.51 us | 229,846.1 us |                    - |        6124.0000 |     384 B |

| MeasurementPerformance |           100000 |          16 |        Histogram |         False | 212,602.3 us | 1,505.49 us | 1,408.24 us | 212,393.8 us |                    - |          14.0000 |     384 B |
| MeasurementPerformance |           100000 |          16 |        Histogram |          True | 242,188.5 us | 2,762.03 us | 2,583.61 us | 241,092.3 us |                    - |        1445.0000 |     384 B |

The total duration here, of course, also includes ObservedExemplar.Update() and is not only the lock contention overhead. The 2nd column from the right is the lock contention count.

A lock-free approach would be preferable here, to try to ensure that capturing data is cheap. How to do that, though? What if we turn it into a reference type and just pool the instances to avoid constant allocation overhead?

  • ObjectPool can manage a set of ObservedExemplar instances.
  • In ObserveInternal() we get a new instance from the pool and use a loop over Interlocked.CompareExchange to replace it in _exemplars, returning the old one to the pool.
  • In CollectAndSerializeImplAsync() we have a bit of a roadblock that we need to prevent our instance from being returned to the pool while we are still using it. However, we can perhaps do a mirrored operation, take an instance from the pool, zero it, and swap it out with the one in _exemplars. After serializing the exemplar, we can swap it back (if the one in _exemplars is still our empty instance) and return the empty instance to the pool.
  • This leaves a small window where the exemplar is left empty, so in a theoretical case two concurrent metrics exports could end up with one not seeing the exemplar. But this seems like a totally OK price to pay for lock-free code - very theoretical situation, and harmless even if it happens.

What do you think? This is just off the top of my head - there could be a big plot hole here somewhere.
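The pool-and-swap scheme in the bullets above can be sketched in Go, with sync/atomic standing in for Interlocked and sync.Pool for ObjectPool. All names here are hypothetical; this is a sketch of the idea, not the PR's implementation, and it deliberately accepts the same small race window described above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ObservedExemplar is a hypothetical stand-in for the library's type,
// made a reference type so instances can be pooled and swapped atomically.
type ObservedExemplar struct {
	Value float64
	Empty bool // placeholder marker used during serialization
}

var pool = sync.Pool{New: func() any { return new(ObservedExemplar) }}

var current atomic.Pointer[ObservedExemplar]

// Observe takes an instance from the pool, fills it, swaps it into place,
// and returns the displaced instance to the pool. No locks on the hot path.
func Observe(v float64) {
	fresh := pool.Get().(*ObservedExemplar)
	fresh.Value, fresh.Empty = v, false
	if old := current.Swap(fresh); old != nil {
		pool.Put(old)
	}
}

// Serialize swaps in an empty placeholder so the observed instance cannot be
// recycled while it is being read, then tries to swap the original back.
func Serialize() (float64, bool) {
	empty := pool.Get().(*ObservedExemplar)
	empty.Empty = true
	observed := current.Swap(empty)
	if observed == nil || observed.Empty {
		if observed != nil {
			pool.Put(observed)
		}
		return 0, false
	}
	v := observed.Value
	if current.CompareAndSwap(empty, observed) {
		pool.Put(empty) // exemplar restored; the placeholder is ours to recycle
	} else {
		pool.Put(observed) // a newer observation won the race; recycle ours
	}
	return v, true
}

func main() {
	Observe(1.5)
	v, ok := Serialize()
	fmt.Println(v, ok) // 1.5 true
}
```

If the CompareAndSwap fails, a concurrent Observe has already replaced the placeholder (and returned it to the pool), so the serializer recycles its own copy instead - the "exemplar briefly appears empty" window the comment above deems an acceptable price.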

Contributor Author

@hsyed-dojo hsyed-dojo Dec 1, 2022

@sandersaares updated the design to use ObjectPool, Interlocked.CompareExchange and Interlocked.Exchange. Agreed that concurrent expositions shouldn't really be a thing in practice.

We do have HA Prometheus scrapers in our ingestion topology, but I have not looked into how that works. In any case, even if an ACTIVE/ACTIVE setup increases the risk, I think this should be fine.

(Edit: our ingestion is ACTIVE/ACTIVE, but the TSDB backend dedupes on admission. The exemplar shouldn't factor into this.)


}

if (tally > MaxRunes)
throw new ArgumentException($"exemplar labels have {tally} runes, exceeding the limit of {MaxRunes}.");
Member

I am not very comfortable throwing "bad data" type of exception from this code, largely because it is often unexpected to the users of the library - they expect that telemetry APIs are not going to complain about data, it is seen as "harmless" code.

What is the impact if we skip the length check and emit over-length data?

Contributor Author

The Go implementation panics in the instrumentation codepath. I feel we should treat it similarly here.

What is the impact if we skip the length check and emit over-length data?

I suspect warnings from the storage backends in the logs, with the exemplars just being dropped.
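For reference, the OpenMetrics spec caps the combined length of an exemplar's label names and values at 128 UTF-8 characters. A rune-counting check along the lines discussed above could look like this (hypothetical helper, not PR code):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxRunes is the OpenMetrics limit on the combined length of exemplar
// label names and values, measured in Unicode characters.
const maxRunes = 128

// exemplarWithinLimit counts runes across all label names and values.
func exemplarWithinLimit(labels map[string]string) bool {
	total := 0
	for k, v := range labels {
		total += utf8.RuneCountInString(k) + utf8.RuneCountInString(v)
	}
	return total <= maxRunes
}

func main() {
	fmt.Println(exemplarWithinLimit(map[string]string{
		"trace_id": "0af7651916cd43dd8448eb211c80319c",
	})) // true
}
```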

@@ -124,6 +141,9 @@ public void Observe(double val, long count)
if (val <= _upperBounds[i])
{
_bucketCounts[i].Add(count);
if (exemplar.Length > 0)
Member

I suggest null-checking exemplar, to avoid surprising callers. Here and elsewhere, we should treat an exemplar of null as equivalent to an empty array, to be more tolerant of the form of input provided.

Perhaps just exemplar ??= Array.Empty<...>() at the start of the method or such?
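For comparison, in Go this tolerance comes for free, since len of a nil slice is zero; the types and names below are illustrative only.

```go
package main

import "fmt"

// Label is a hypothetical name/value pair for an exemplar.
type Label struct{ Name, Value string }

// observe treats a nil exemplar the same as an empty one: len(nil) == 0,
// so no separate null check is needed.
func observe(val float64, exemplar []Label) string {
	if len(exemplar) > 0 {
		return "with exemplar"
	}
	return "no exemplar"
}

func main() {
	fmt.Println(observe(1.0, nil))                          // no exemplar
	fmt.Println(observe(1.0, []Label{{"trace_id", "abc"}})) // with exemplar
}
```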

Contributor Author

Rider suggested if (exemplar is { Length: > 0 }) for a combined null and length check. That works, I think?

@hsyed-dojo hsyed-dojo changed the title Openmetrics API surface and encoding Openmetrics encoding support and Exemplars. Dec 1, 2022
@sandersaares sandersaares merged commit 8cdca07 into prometheus-net:master Dec 29, 2022
@hsyed-dojo hsyed-dojo deleted the hsyed/openmetrics-exemplars branch January 17, 2023 15:53