CampaignEvents makes an uncached x1 DB query on pageviews
Closed, Resolved (Public)

Description

During a brief incident on April 27th where x1 circuit breaking was triggered, I noticed that a large proportion of the errors reported in the incident timeframe were from CampaignEvents: https://logstash.wikimedia.org/goto/b9a485e0d685bddb22c62ddbbab70207.

It seems that the extension is making an uncached query to x1 on every pageview, which has two implications: any spike in pageviews that miss the CDN cache would cause an x1 load spike, and an x1 outage would immediately take down all pageviews for registered users.

In the interim, we could add caching for this lookup to reduce the potential load on x1.
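As a rough sketch of what such an interim cache could look like, using MediaWiki's WANObjectCache; getEventIdForPage() and the cache key parts are hypothetical names rather than the extension's actual code:

```php
use MediaWiki\MediaWikiServices;

$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$eventId = $cache->getWithSetCallback(
	// Hypothetical key; one entry per (wiki, page) pair.
	$cache->makeGlobalKey( 'CampaignEvents-page-event', $wikiId, $pageId ),
	$cache::TTL_MINUTE * 10,
	static function () use ( $eventStore, $wikiId, $pageId ) {
		// getEventIdForPage() stands in for the existing x1 lookup.
		// Returning null when there is no registration is fine:
		// WANObjectCache caches null, so "no event" pages also skip
		// the x1 query on repeat views.
		return $eventStore->getEventIdForPage( $wikiId, $pageId );
	}
);
```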

Event Timeline

Change #1139200 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/extensions/CampaignEvents@master] EventStore: Add caching for per-page event lookups

https://gerrit.wikimedia.org/r/1139200

Change #1139439 had a related patch set uploaded (by Ladsgroup; author: Máté Szabó):

[mediawiki/extensions/CampaignEvents@wmf/1.44.0-wmf.25] EventStore: Add caching for per-page event lookups

https://gerrit.wikimedia.org/r/1139439

Change #1139200 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] EventStore: Add caching for per-page event lookups

https://gerrit.wikimedia.org/r/1139200

Change #1139439 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@wmf/1.44.0-wmf.25] EventStore: Add caching for per-page event lookups

https://gerrit.wikimedia.org/r/1139439

Mentioned in SAL (#wikimedia-operations) [2025-04-28T11:28:42Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-28T11:33:20Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-28T11:41:57Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1139439|EventStore: Add caching for per-page event lookups (T392784)]] (duration: 13m 15s)

I'm seeing the number of open connections going down:

[Screenshot: grafik.png, graph of open x1 connections]

On 99.99% of pages this query returns nothing, which may be reason to restructure the data such that you don't need a dedicated per-page query to find that out (with or without cache). I see you're already caching null which is good, but even then it's going to have a pretty high cache-miss rate, since there is a long tail of pages, and given that pageviews are behind Varnish, there may not be frequent enough traffic to have a matching level of popularity to keep the data warm in Memcached.

Generally speaking, features that relate to only a small set of pages should not require a database connection on the majority of articles just to find out that they're not in that small subset. There may be a way to statically exclude a majority based on content model or namespace or page property? E.g. could this trigger on a Wikipedia article?

Given how small the set of potential hits is, it might make sense to concentrate this into a lookup map that you range-query and store as a single cache key. This would certainly help with keeping the data warm and reducing the number of database queries. Long-term, it depends on the business logic whether this will scale (i.e. is the set of pages likely to stay constant, or do past campaigns have to be included for this purpose?).
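A sketch of that single-key lookup map, assuming hypothetical table and column names, with $cache (the main WANObjectCache) and $dbr (a replica connection) in scope:

```php
$eventPageIds = $cache->getWithSetCallback(
	$cache->makeKey( 'campaignevents-event-page-ids' ),
	$cache::TTL_HOUR,
	static function () use ( $dbr ) {
		// One range query refreshes the whole map, instead of one
		// point query per pageview. Schema names are illustrative.
		return $dbr->newSelectQueryBuilder()
			->select( 'event_page_id' )
			->from( 'campaign_events' )
			->caller( __METHOD__ )
			->fetchFieldValues();
	}
);
// Membership check answers "does this page have an event?" from cache.
$hasEvent = in_array( (string)$pageId, $eventPageIds, true );
```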

Alternatively, you could invert the per-page campaign structure. Rather than centrally pointing outward to a page (and thus having to check the central store on every page), you could point from the local page to the central store. Page properties have the benefit of being local to each wiki (i.e. spread out over all db clusters, instead of concentrated on x1). Page properties are also batched and process-cached, which means that potentially you wouldn't even add a query at all in most cases since it would come along with the ParserCache/ParserOutput and/or with an existing PageProps batch lookup. You'd insert the data from a Parser or LinksUpdate hook, and if the source changes you can queue RefreshLinks.
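A sketch of the two halves of the page-property approach; the property name and the surrounding hook are hypothetical:

```php
use MediaWiki\MediaWikiServices;

// Write side, inside whichever parse-time hook is chosen: mark event
// pages locally, so the flag lives on each wiki's own DB cluster.
$parserOutput->setPageProperty( 'campaignevents_registration', '1' );

// Read side, on pageview: PageProps lookups are batched and
// process-cached, so in most cases this adds no query at all.
$pageProps = MediaWikiServices::getInstance()->getPageProps();
$props = $pageProps->getProperties( [ $title ], 'campaignevents_registration' );
if ( $props === [] ) {
	return; // not an event page; x1 is never touched
}
// Only pages carrying the property fall through to the x1 lookup.
```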

Lastly, it may be worth catching DB exceptions in this hook so that, no matter what happens, this hook can never again be a SPOF that takes down pageviews. The graceful default can be to assume false and simply not display the campaign in the worst case, while sending a warning to Logstash or incrementing an alert-monitored metric (rather than taking down the page). This would be similar to how SyntaxHighlight falls back to unhighlighted output rather than an HTTP 500 error.
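A sketch of that graceful fallback; getEventForPage() stands in for the extension's actual lookup:

```php
use MediaWiki\Logger\LoggerFactory;
use Wikimedia\Rdbms\DBError;

try {
	$event = $this->eventLookup->getEventForPage( $page );
} catch ( DBError $e ) {
	// x1 is unreachable: log and render the page without the
	// registration header, instead of letting the exception turn
	// the pageview into an error page.
	LoggerFactory::getInstance( 'CampaignEvents' )->warning(
		'Event lookup failed; skipping registration header: {exception}',
		[ 'exception' => $e ]
	);
	$event = null;
}
```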

On 99.99% of pages this query returns nothing, which may be reason to restructure the data such that you don't need a dedicated per-page query to find that out (with or without cache). [...] it's going to have a pretty high cache-miss rate, since there is a long tail of pages, and given that pageviews are behind Varnish, there may not be frequent enough traffic to have a matching level of popularity to keep the data warm in Memcached.

This makes sense to me. If I understand everything correctly, the caching above is mostly useful for mitigating repeated requests to the same page by logged-in users. Which is better than nothing, and actually useful according to T392784#10772367, but surely we can do better.

Generally speaking, features that relate to only a small set of pages should not require a database connection on the majority of articles just to find out that they're not in that small subset. There may be a way to statically exclude a majority based on content model or namespace or page property? E.g. could this trigger on a Wikipedia article?

This used to be behind a namespace check before T387974. TLDR: we now allow individual wikis to configure the namespaces where event registration can be enabled (so it's not necessarily just the Event: namespace). We decided not to make that setting retroactive: it affects newly created events, but existing events can remain in namespaces that are no longer allowed, and those events will continue working as normal. This implies that we can't apply a namespace filter a priori. I knew this would result in an additional query on all pageviews, but I thought it would be acceptable because the query in question is a fast unique-key lookup. The (admittedly obvious) thing I did not consider is that said query goes to x1.wikishared, which means that 1) it hits the same database regardless of the origin wiki, and 2) it will likely require opening a connection to that database.

Given how small the set of potential hits is, it might make sense to concentrate this into a lookup map that you range-query and store as a single cache key. This would certainly help with keeping the data warm and reducing the number of database queries. Long-term, it depends on the business logic whether this will scale (i.e. is the set of pages likely to stay constant, or do past campaigns have to be included for this purpose?).

I'm not sure it really is small. This code needs to work for all events that exist across every wiki. Even if we make the cache per-wiki, it will still need to include all events. That is, to answer your question, past events remain listed, so we will eventually reach a point where the data no longer fits in a single cache key.

Alternatively, you could invert the per-page campaign structure. Rather than centrally pointing outward to a page (and thus having to check the central store on every page), you could point from the local page to the central store.

Minor observation: I think we would not "invert" the relationship, but rather add the missing arrow/direction; i.e., we would have links in both directions rather than just one. I believe something still needs the central -> local part. But obviously this doesn't really affect your proposal.

Page properties have the benefit of being local to each wiki (i.e. spread out over all db clusters, instead of concentrated on x1). Page properties are also batched and process-cached, which means that potentially you wouldn't even add a query at all in most cases since it would come along with the ParserCache/ParserOutput and/or with an existing PageProps batch lookup. You'd insert the data from a Parser or LinksUpdate hook, and if the source changes you can queue RefreshLinks.

So, I did consider page properties for events a few years ago, during the early development of the extension. However, Phabricator search seems determined to prove me wrong: nothing comes up. Off the top of my head, I can't recall whether this thought ever left my mind, and if not, why not (or otherwise, why we did not pursue it). But yes, I believe this would be my preferred approach.

I'm not sure it really is small. This code needs to work for all events that exist across every wiki. Even if we make the cache per-wiki, it will still need to include all events. That is, to answer your question, past events remain listed, so we will eventually reach a point where the data no longer fits in a single cache key.

You could probably shard the key? E.g. take the remainder of page_id modulo 100; say you end up with 35: make a key with 35 in it, put all event pages whose ID ends in 35 under that key, and check against that.
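A sketch of that sharding scheme, with hypothetical table and column names:

```php
$shard = $pageId % 100;
$idsInShard = $cache->getWithSetCallback(
	$cache->makeKey( 'campaignevents-event-pages', $shard ),
	$cache::TTL_HOUR,
	static function () use ( $dbr, $shard ) {
		// Each key holds only the event pages whose ID ends in $shard,
		// keeping every cache value small however many events exist.
		return $dbr->newSelectQueryBuilder()
			->select( 'event_page_id' )
			->from( 'campaign_events' )
			->where( 'event_page_id % 100 = ' . (int)$shard )
			->caller( __METHOD__ )
			->fetchFieldValues();
	}
);
$hasEvent = in_array( (string)$pageId, $idsInShard, true );
```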

Another way could be to use a Bloom filter (or another probabilistic data structure): you might lose a tiny bit of information, but you are sure to be able to store it in a cache key.
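As a rough, self-contained sketch (not extension code), a tiny Bloom filter over page IDs: it answers "definitely no event here" for almost every page, and a rare false positive only costs one extra x1 lookup to confirm.

```php
final class PageIdBloomFilter {
	private string $bits;

	public function __construct(
		private int $numBits = 1 << 20, // 1 Mbit = 128 KiB, fits in one cache value
		private int $numHashes = 4
	) {
		$this->bits = str_repeat( "\0", intdiv( $numBits, 8 ) );
	}

	/** Bit positions for a page ID, from $numHashes salted CRC32 hashes. */
	private function positions( int $pageId ): array {
		$positions = [];
		for ( $i = 0; $i < $this->numHashes; $i++ ) {
			$positions[] = crc32( "$i:$pageId" ) % $this->numBits;
		}
		return $positions;
	}

	public function add( int $pageId ): void {
		foreach ( $this->positions( $pageId ) as $p ) {
			$byte = ord( $this->bits[ $p >> 3 ] ) | ( 1 << ( $p & 7 ) );
			$this->bits[ $p >> 3 ] = chr( $byte );
		}
	}

	public function mightContain( int $pageId ): bool {
		foreach ( $this->positions( $pageId ) as $p ) {
			if ( ( ord( $this->bits[ $p >> 3 ] ) & ( 1 << ( $p & 7 ) ) ) === 0 ) {
				return false; // definitely not an event page
			}
		}
		return true; // maybe an event page: confirm with the real x1 query
	}
}
```

The raw $bits string is what would be stored as the cache value; rebuilding it is a single range query over the events table, and the "definitely not" case never touches the database.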

I'm not sure it really is small. This code needs to work for all events that exist across every wiki. Even if we make the cache per-wiki, it will still need to include all events. That is, to answer your question, past events remain listed, so we will eventually reach a point where the data no longer fits in a single cache key.

You could probably shard the key? E.g. take the remainder of page_id modulo 100; say you end up with 35: make a key with 35 in it, put all event pages whose ID ends in 35 under that key, and check against that.

Another way could be to use a Bloom filter (or another probabilistic data structure): you might lose a tiny bit of information, but you are sure to be able to store it in a cache key.

Yes, but neither of those would be trivial to implement. I think we have a "good first step" which is the caching; now I'd like to look into something permanent and effective.

Besides page properties, another option worth considering is caching the list of events in namespaces that are no longer allowed (ref T392784#10773716); that list is generally empty, unless there's been a config change. We could then restore the namespace check removed in T387974 for any page not covered by that cached list. But even then, we can't in principle guarantee that the length of that list is [close to] 0.

So, maybe a page property would be the best option. Given my previous interest in it, I assume it would also have other benefits for us, but I do not remember what these might be.

Daimona triaged this task as High priority. May 4 2025, 2:38 PM

This seems to be contributing to outages still, see T390510#10790387; basically, as expected. Note, it isn't necessarily the cause of those outages, but it definitely doesn't help. We should probably do something about it...

Alternatively, maybe reconsider our decision from T387974, and restrict lookup to namespaces currently allowed. Acknowledge the limitation and address it "later".
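As a sketch, the restored check would be a cheap local test before any x1 access; the config variable name here is hypothetical:

```php
// Skip the x1 lookup entirely unless the page is in a namespace where
// event registration is currently allowed. The setting name is
// illustrative; the extension's real config may differ.
$allowedNamespaces = $config->get( 'CampaignEventsEventNamespaces' );
if ( !in_array( $title->getNamespace(), $allowedNamespaces, true ) ) {
	return; // the overwhelming majority of pageviews stop here
}
// Only pages in allowed namespaces fall through to the x1 query.
```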

This seems to be contributing to outages still, see T390510#10790387; basically, as expected. Note, it isn't necessarily the cause of those outages, but it definitely doesn't help. We should probably do something about it...

I think the biggest problem is that x1 reads are being triggered on pageviews, which means that when x1 circuit breaking kicks in, pageviews are affected far more than they should be.

Alternatively, maybe reconsider our decision from T387974, and restrict lookup to namespaces currently allowed. Acknowledge the limitation and address it "later".

+1

This is drastically increasing the blast radius of x1 DB connection pile-ups. Please fix or disable that feature for now.

Ladsgroup lowered the priority of this task from Unbreak Now! to High. May 23 2025, 11:39 AM

We fixed the root cause. I'd appreciate it if UBN tickets got more attention from teams, though.

Apologies, I was on vacation for the last couple of weeks and am still going through the backlog of what I missed. The team has been a bit understaffed in the meantime. I definitely want to get this done and will hopefully get to it by the end of the week.

Change #1153375 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@master] Restrict event page decoration to currently allowed namespaces

https://gerrit.wikimedia.org/r/1153375

Change #1153375 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] Restrict event page decoration to currently allowed namespaces

https://gerrit.wikimedia.org/r/1153375

This should now be fixed. For QA, and in general to verify the fix, we should observe a decrease in the number of connections to x1: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=now-2d&to=now&timezone=utc&var-site=eqiad&var-group=core&var-shard=x1&var-role=$__all. It's not guaranteed to be noticeable, though, as there may be other things connecting to x1.

Change #1154886 had a related patch set uploaded (by Ladsgroup; author: Daimona Eaytoy):

[mediawiki/extensions/CampaignEvents@wmf/1.45.0-wmf.4] Restrict event page decoration to currently allowed namespaces

https://gerrit.wikimedia.org/r/1154886

Change #1154886 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@wmf/1.45.0-wmf.4] Restrict event page decoration to currently allowed namespaces

https://gerrit.wikimedia.org/r/1154886

Mentioned in SAL (#wikimedia-operations) [2025-06-09T21:24:28Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1154886|Restrict event page decoration to currently allowed namespaces (T392784)]]

Mentioned in SAL (#wikimedia-operations) [2025-06-09T21:26:26Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1154886|Restrict event page decoration to currently allowed namespaces (T392784)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-06-09T21:35:36Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1154886|Restrict event page decoration to currently allowed namespaces (T392784)]] (duration: 11m 07s)

I backported it to isolate the change. It clearly reduced the connections to x1:

[Screenshot: grafik.png, x1 connections graph]

[Screenshot: grafik.png, x1 connections graph]

Obligatory remark that the Y-axis does not start at 0, but indeed, I was looking at those graphs and the impact is definitely visible. Nice!