Openmetrics encoding support and Exemplars. #388


Merged

Conversation

hsyed-dojo
Contributor

@hsyed-dojo hsyed-dojo commented Nov 18, 2022

This PR:

  • Introduces all of the machinery needed to attach an Exemplar to Counter and Histogram instruments.
  • OpenMetrics encoding - aims for feature parity with the Golang implementation (this PR).
  • Protocol negotiation to select the encoding format (manually tested with curl and verified with Prometheus).
  • Wired up end to end in ASP.NET (please let me know if I have missed other ways of starting an HTTP run loop).
  • Bash script to set up Prometheus for scraping and ingesting exemplars, for manual testing.

State: Feature complete.

@hsyed-dojo hsyed-dojo marked this pull request as ready for review November 18, 2022 16:34
/// </summary>
public static LabelKey Key(string key)
{
var si = new StringInfo(key);
Contributor Author

RFC: Need to find a better way than this.

Contributor Author

@hsyed-dojo hsyed-dojo Nov 21, 2022

The approach I took above counts graphemes (see this), so it is incorrect.

We need to count runes instead; runes are Unicode scalar values. The Rune type is only available in .NET Core 3 and above.

@sandersaares do you think we could drop support for older frameworks?

Contributor Author

@hsyed-dojo hsyed-dojo Nov 21, 2022

Hmm, what if we allowed only ASCII characters here until we drop support for older frameworks? I can't see any reason for an exemplar to contain anything but alphanumeric characters. The implementation would enforce ASCII by simply checking that the encoded byte[] and the source string have the same length.
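The length-comparison trick can be sketched in Go (`isASCII` is a hypothetical helper, not part of the PR): every ASCII rune encodes to exactly one UTF-8 byte, so a string is pure ASCII exactly when its byte length equals its rune count.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII is a hypothetical helper: any non-ASCII rune occupies more than
// one UTF-8 byte, so byte length == rune count implies pure ASCII.
func isASCII(s string) bool {
	return len(s) == utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(isASCII("abc123-trace-id")) // true
	fmt.Println(isASCII("héllo"))           // false
}
```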

Member

Interesting case. How important is it to accurately measure the length, do you know? Are there ingestors that hard-reject oversize data or is it more of a "best practice" that might be relaxed?

I would rather not drop support for older frameworks - there is a surprising amount of new .NET Framework code still being written every day.

ASCII only might work, though. Of course, someone might still put non-ASCII data in there occasionally, so I would not want to just throw an exception on non-ASCII input, as that would create a time bomb. However, "sanitizing" non-ASCII data seems like a reasonable compromise to avoid over-engineering for an edge case. I believe if you pass non-ASCII data to Encoding.ASCII.GetBytes() it will just replace those characters with question marks, which should be good enough for now. What do you think?

Contributor Author

@hsyed-dojo hsyed-dojo Nov 30, 2022

Interesting case. How important is it to accurately measure the length, do you know? Are there ingestors that hard-reject oversize data or is it more of a "best practice" that might be relaxed?

I fuzzy-searched through the prometheus/prometheus and prometheus/common codebases. The TSDB layer has this function, which I assume controls admission.

I believe if you pass non-ASCII data to Encoding.ASCII.GetBytes() it will just replace those characters with question marks, which should be good enough for now. What do you think?

This would be quite defensive. But is this degree of defensiveness warranted for the intended use case, encoding distributed trace identifiers? If the industry came up with another field that made sense in an exemplar, it would still just be an ASCII-safe reference of some form.

If we silently replaced ASCII-unsafe runes with the standard ? placeholder, we would be silently corrupting the user's intent.

@sandersaares WDYT?

Contributor Author

@sandersaares I went with your proposed approach.

suffix: CountSuffix);

var cumulativeCount = 0L;

for (var i = 0; i < _bucketCounts.Length; i++)
{
ObservedExemplar cp;
lock (_exemplars)
Contributor Author

@hsyed-dojo hsyed-dojo Nov 18, 2022

RFC: does this look correct? I haven't fully reasoned about this. My mental model is that structs in C# work similarly to Go's, so a shallow copy is safe in this case.

This pattern is also used in Counter.

Member

It looks correct, yeah - it performs a shallow copy and that's it.

However, I am not very keen on having locks in the hot path (ObserveInternal()) - the metric objects try to be very lock-free in general because things like HTTP servers can funnel a lot of requests (handled by many threads) into observations on the same metric instances. So far I have managed to keep all except for Summary lock-free and would ideally like to keep it so.

You can see the potential cost of multithreaded access in MeasurementBenchmarks - already without locks it is rather slow (though artificially slow as there is no "other work" done in the benchmark). I am afraid that adding locks might create even more potential slowdown.

Example of differences on my PC (True/False column indicating use of exemplars; I will commit my exemplar-enhanced benchmark code shortly):

| MeasurementPerformance |           100000 |          16 |          Counter |         False | 124,351.9 us | 2,163.73 us | 2,023.96 us | 123,847.7 us |                    - |          13.0000 |     384 B |
| MeasurementPerformance |           100000 |          16 |          Counter |          True | 230,592.2 us | 3,085.85 us | 2,886.51 us | 229,846.1 us |                    - |        6124.0000 |     384 B |

| MeasurementPerformance |           100000 |          16 |        Histogram |         False | 212,602.3 us | 1,505.49 us | 1,408.24 us | 212,393.8 us |                    - |          14.0000 |     384 B |
| MeasurementPerformance |           100000 |          16 |        Histogram |          True | 242,188.5 us | 2,762.03 us | 2,583.61 us | 241,092.3 us |                    - |        1445.0000 |     384 B |

The total duration here, of course, also includes ObservedExemplar.Update() and is not only the lock contention overhead. The 2nd column from the right is the lock contention count.

A lock-free approach would be preferable here, to try to ensure that capturing data is cheap. How to do that, though? What if we turn it into a reference type and just pool the instances to avoid constant allocation overhead?

  • ObjectPool can manage a set of ObservedExemplar instances.
  • In ObserveInternal() we get a new instance from the pool and use a loop over Interlocked.CompareExchange to replace it in _exemplars, returning the old one to the pool.
  • In CollectAndSerializeImplAsync() we have a bit of a roadblock that we need to prevent our instance from being returned to the pool while we are still using it. However, we can perhaps do a mirrored operation, take an instance from the pool, zero it, and swap it out with the one in _exemplars. After serializing the exemplar, we can swap it back (if the one in _exemplars is still our empty instance) and return the empty instance to the pool.
  • This leaves a small window where the exemplar is left empty, so in a theoretical case two concurrent metrics exports could end up with one not seeing the exemplar. But this seems like a totally OK price to pay for lock-free code - very theoretical situation, and harmless even if it happens.

What do you think? This is just off the top of my head - there could be a big plot hole here somewhere.
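The pool-and-swap scheme in the bullets above can be sketched in Go, with sync/atomic standing in for Interlocked and sync.Pool for ObjectPool. All names here are hypothetical; this is a sketch of the idea, not the PR's implementation, and it deliberately accepts the same small race window described above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// ObservedExemplar is a hypothetical stand-in for the library's type,
// made a reference type so instances can be pooled and swapped atomically.
type ObservedExemplar struct {
	Value float64
	Empty bool // placeholder marker used during serialization
}

var pool = sync.Pool{New: func() any { return new(ObservedExemplar) }}

var current atomic.Pointer[ObservedExemplar]

// Observe takes an instance from the pool, fills it, swaps it into place,
// and returns the displaced instance to the pool. No locks on the hot path.
func Observe(v float64) {
	fresh := pool.Get().(*ObservedExemplar)
	fresh.Value, fresh.Empty = v, false
	if old := current.Swap(fresh); old != nil {
		pool.Put(old)
	}
}

// Serialize swaps in an empty placeholder so the observed instance cannot be
// recycled while it is being read, then tries to swap the original back.
func Serialize() (float64, bool) {
	empty := pool.Get().(*ObservedExemplar)
	empty.Empty = true
	observed := current.Swap(empty)
	if observed == nil || observed.Empty {
		if observed != nil {
			pool.Put(observed)
		}
		return 0, false
	}
	v := observed.Value
	if current.CompareAndSwap(empty, observed) {
		pool.Put(empty) // exemplar restored; the placeholder is ours to recycle
	} else {
		pool.Put(observed) // a newer observation won the race; recycle ours
	}
	return v, true
}

func main() {
	Observe(1.5)
	v, ok := Serialize()
	fmt.Println(v, ok) // 1.5 true
}
```

If the CompareAndSwap fails, a concurrent Observe has already replaced the placeholder (and returned it to the pool), so the serializer recycles its own copy instead - the "exemplar briefly appears empty" window the comment above deems an acceptable price.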

Contributor Author

@hsyed-dojo hsyed-dojo Dec 1, 2022

@sandersaares updated the design to use ObjectPool, Interlocked.CompareExchange and Interlocked.Exchange. Agreed that concurrent expositions shouldn't really be a thing in practice.

We do have HA Prometheus scrapers in our ingestion topology, but I have not looked into how that works. In any case, even if an ACTIVE/ACTIVE setup increases the risk, I think this should be fine.

(Edit: our ingestion is ACTIVE/ACTIVE, but the TSDB backend dedupes on admission. The exemplar shouldn't factor into this.)


}

if (tally > MaxRunes)
throw new ArgumentException($"exemplar labels have {tally} runes, exceeding the limit of {MaxRunes}.");
Member

I am not very comfortable throwing "bad data" type of exception from this code, largely because it is often unexpected to the users of the library - they expect that telemetry APIs are not going to complain about data, it is seen as "harmless" code.

What is the impact if we skip the length check and emit over-length data?

Contributor Author

The Go implementation panics in the instrumentation codepath. I feel we should treat it similarly here.

What is the impact if we skip the length check and emit over-length data?

I suspect warnings from the storage backends in the logs, with the exemplars just being dropped.
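For reference, the OpenMetrics spec caps the combined length of an exemplar's label names and values at 128 UTF-8 characters. A rune-counting check along the lines discussed above could look like this (hypothetical helper, not PR code):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxRunes is the OpenMetrics limit on the combined length of exemplar
// label names and values, measured in Unicode characters.
const maxRunes = 128

// exemplarWithinLimit counts runes across all label names and values.
func exemplarWithinLimit(labels map[string]string) bool {
	total := 0
	for k, v := range labels {
		total += utf8.RuneCountInString(k) + utf8.RuneCountInString(v)
	}
	return total <= maxRunes
}

func main() {
	fmt.Println(exemplarWithinLimit(map[string]string{
		"trace_id": "0af7651916cd43dd8448eb211c80319c",
	})) // true
}
```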

@@ -124,6 +141,9 @@ public void Observe(double val, long count)
if (val <= _upperBounds[i])
{
_bucketCounts[i].Add(count);
if (exemplar.Length > 0)
Member

I suggest null-checking exemplar, to avoid surprising callers. Here and elsewhere, we should treat an exemplar of null as equivalent to an empty array, to be more tolerant of the form of input provided.

Perhaps just exemplar ??= Array.Empty<...>() at the start of the method or such?
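For comparison, in Go this tolerance comes for free, since len of a nil slice is zero; the types and names below are illustrative only.

```go
package main

import "fmt"

// Label is a hypothetical name/value pair for an exemplar.
type Label struct{ Name, Value string }

// observe treats a nil exemplar the same as an empty one: len(nil) == 0,
// so no separate null check is needed.
func observe(val float64, exemplar []Label) string {
	if len(exemplar) > 0 {
		return "with exemplar"
	}
	return "no exemplar"
}

func main() {
	fmt.Println(observe(1.0, nil))                          // no exemplar
	fmt.Println(observe(1.0, []Label{{"trace_id", "abc"}})) // with exemplar
}
```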

Contributor Author

Rider suggested if (exemplar is { Length: > 0 }) for a combined null and length check. That works, I think?

@hsyed-dojo hsyed-dojo changed the title Openmetrics API surface and encoding Openmetrics encoding support and Exemplars. Dec 1, 2022
@sandersaares sandersaares merged commit 8cdca07 into prometheus-net:master Dec 29, 2022
@hsyed-dojo hsyed-dojo deleted the hsyed/openmetrics-exemplars branch January 17, 2023 15:53