User Details
- User Since: Oct 7 2014, 4:49 PM (562 w, 2 d)
- Availability: Available
- LDAP User: EBernhardson
- MediaWiki User: EBernhardson (WMF)
Today
Looks to be working as expected, with data starting to arrive on the 15th. Data exists for all rows:
Yesterday
Construct an awkward thing that checks all the servers for errors:
# For each WDQS host, run the federated query against the rkd.triply.cc endpoint and
# print the "Service URI ... is not allowed" error line for any host that still rejects it.
for host in $(cat /etc/dsh/group/wdqs | grep -v '^#' | grep -v '^$' | sort); do
  echo "$host: $(curl -sk "https://$host/sparql?query=SELECT%20%3Fs%20%3Fp%20%3Fo%20%7B%0A%20%20SERVICE%20%3Chttps%3A%2F%2Frkd.triply.cc%2F_api%2Fdatasets%2Frkd%2FRKD-Knowledge-Graph%2Fservices%2FSPARQL%2Fsparql%3E%20%7B%0A%20%20%20%20%3Fs%20%3Fp%20%3Fo%0A%20%20%7D%0A%7D%0ALIMIT%2010" | grep '^Caused by: java.lang.IllegalArgumentException: Service URI https://rkd.triply.cc/_api/datasets/rkd/RKD-Knowledge-Graph/services/SPARQL/sparql is not allowed$')"
done
Tue, Jul 15
Looks to query successfully
Mon, Jul 14
Fri, Jul 11
The problem here is that we are primarily highlighting against the text field, which is a wide variety of data stuffed together into a single string. The highlighter doesn't know it should treat this as many different strings and pick between them; it highlights it as if it were highlighting paragraphs of content. We do some post-processing on the PHP side to turn that highlighted text field into something more presentable, but solving the problem there is always going to be hacky.
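For illustration only (the host, index, and query below are placeholders, not the real request we issue), the highlight request is essentially pointed at that one concatenated field:
# Hypothetical: with the highlighter aimed at the single "text" field it selects snippets
# as if this were one long document; nothing at this level picks between the source strings.
curl -s 'https://search.example:9200/enwiki_content/_search' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "text": "some search terms" } },
  "highlight": { "fields": { "text": {} } }
}'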
Thu, Jul 10
There are two things going on here. The first is that quotes in the regexp have special meaning: they don't match the quote character, rather they define a part of the string that has to be a literal match. So the search query insource:/"u.a."/ searches for the literal string u.a.; it does not search for the quotation characters and it does not treat the . as a match-all. The second part is that the recent changes to the regex engine didn't take this into account, and are rewriting the . into a supposedly semantically equivalent form.
This is most likely related to the plugin deployment for T317599.
Wed, Jul 9
merge request: https://github.com/wikimedia/tools-global-search/pull/118
Most plausibly the global-search side is picking up indices that are not live on the production side. IIRC this runs queries against the * index pattern; it may have to be a little more selective and use *_content,*_general,*_file.
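Something along these lines, if the global-search side talks to an elasticsearch-compatible endpoint the way we do (the host and port here are placeholders):
# Hypothetical: restrict the search to the live index types instead of everything
# matched by "*", which can pick up stale or non-live indices.
curl -s 'https://cloudelastic.example:9243/*_content,*_general,*_file/_search?size=0'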
Tue, Jul 8
Updated bits were deployed and are now running. I re-ran a few recent hours through and we get the following counts; not sure if these are expected. Essentially it sees a bit under 1k/hr in the couple of hours I re-ran.
Looks like a couple hosts in codfw still need a restart:
Started the test; sadly it's not working as expected. It's been some time since we ran an AB test involving autocomplete, and since then the javascript layer has changed. The end result is that we are not attaching the testing parameter to the autocomplete api requests, meaning no test treatment is being applied.
Mon, Jul 7
In terms of metrics, what we should be looking at:
This looks to have worked as expected. Checked by reviewing the following for the last 7 days:
max(sum by (pod) (container_memory_usage_bytes{namespace="mw-cron", pod=~"cirrus-build-completion-indices-.*", container="mediawiki-main-app"}))
This shows that peak memory usage of an individual cirrus-build-completion-indices pod decreased from >2GB to ~550MB on July 3rd.
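For reference, the same number can be pulled straight from the Prometheus HTTP API (the Prometheus host here is a placeholder):
# Run the expression above as an instant query.
curl -sG 'https://prometheus.example/api/v1/query' \
  --data-urlencode 'query=max(sum by (pod) (container_memory_usage_bytes{namespace="mw-cron", pod=~"cirrus-build-completion-indices-.*", container="mediawiki-main-app"}))'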
We use a custom set of plugins for opensearch, including a number that we build ourselves. Those all get packaged up together into a set and this provides the guarantee that development and production have all the same things with the same versions configured and available.
Tue, Jul 1
General investigation:
I've also been pondering how we might detect this kind of issue in the future. Perhaps, during the daily completion suggester rebuild, we could increment a counter every time we notice that we are missing some externally populated fields. We could then monitor that count for a week or two to find the typical range of values and alert whenever we get outside those limits.
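A very rough sketch of what I mean, done by scraping logs rather than instrumenting the rebuild itself (the log path, log message, and metric name below are all made up):
# Hypothetical: count rebuild log lines that flag missing externally populated fields
# and emit the total as a statsd counter that we can graph and alert on.
missing=$(grep -c 'missing externally populated fields' /var/log/cirrus/comp_suggest_build.log)
echo "cirrussearch.comp_suggest.missing_fields:${missing}|c" | nc -u -w1 statsd.eqiad.wmnet 8125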
Checked the data lake; we have been consistently generating popularity_score data for United States (project=en.wikipedia, page_id=3434750). Checking further down the pipeline, I can also find the same page in the bulk update files that we push into elasticsearch. Essentially I'm reasonably certain the updates are still flowing.
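The bulk update file check amounts to something like this (the HDFS path here is a placeholder, not the real location):
# Hypothetical: scan a day of bulk update files for the page in question.
hdfs dfs -text '/wmf/data/discovery/cirrus_bulk_updates/2025-07-01/*' 2>/dev/null | grep -m1 3434750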
We can look at the scoring information to get an idea of what is going on: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=cirruscompsuggestbuilddoc&titles=United%20States%7CUnited%20Kingdom&formatversion=2
Mon, Jun 30
Thu, Jun 26
While the above is a reduced form of our maintenance script and does trigger an OOM, after more investigation I'm not certain that is the cause of our memory usage.
Wed, Jun 25
Relatively minimal reproduction of the OOM we trigger. It fails at around 2.3M cached entries. At a very general level the problem is that this mediawiki code is assuming a webrequest that ends in a few seconds at most, not a maintenance script that visits millions of pages in a single execution.
Tue, Jun 24
Moving back to reported, as we are tracking resolving the memory issue in T395465
There isn't any particular phabricator task for tracking where search traffic currently flows; I suppose it's tracked in config-master discovery, but that's not human readable. The shift in traffic above was part of T370147. There will potentially be a similar shift when we move to opensearch 2, but that's probably 6+ months away. In general it's not particularly common that we have to disable one of the two search clusters, but it happens from time to time.
Mon, Jun 23
Traffic test complete, moved as expected. Commands have been documented at https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Multi-DC_%2F_Multi-Cluster_Operations
In theory we should be able to depool codfw like so, causing all traffic to move to eqiad:
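(A hedged sketch of the confctl shape only; the dnsdisc service name here is a guess and not necessarily what we actually ran, the documented commands are authoritative.)
# Hypothetical: mark the codfw side of the search read discovery record as depooled,
# so the dns-disc endpoint resolves to eqiad only.
sudo confctl --object-type discovery select 'dnsdisc=search-chi,name=codfw' set/pooled=false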
Search traffic moved between datacenters due to operational concerns, which causes the latency effects noted above. The latency difference is ~30ms per round trip to the search servers when requests have to go cross-datacenter. Typically there are 2 round trips, so on the order of 60ms extra, but that can vary for a number of reasons.
Fri, Jun 20
This is mostly done, read traffic is all flowing through the dns-disc endpoints. Traffic can now move with the same tooling as everything else. Not quite done yet, as T397377 was opened for a change we noticed in the dashboards.
There are a few errors remaining
As part of T143553 we enabled retry_on: gateway-error in the envoy configuration for the new dns-discovery endpoint used for read traffic. Essentially this transparently retries gateway-related 5xx errors at the envoy level. We switched over to the new endpoints a little after 2025-06-18 21:00 UTC. Per logstash, the "upstream connect error" errors were firing 500-1000 times per 12 hours; post-deployment that has dropped to 30-100, roughly a 10x reduction in errors.
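One way to sanity check that the retries are actually happening is the envoy admin interface (the admin port here is a placeholder, and the stat prefixes depend on the cluster name):
# Hypothetical: the upstream_rq_retry / upstream_rq_retry_success counters should be
# climbing if retry_on: gateway-error is kicking in.
curl -s http://localhost:9901/stats | grep upstream_rq_retry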
Jun 17 2025
This will require a full-cluster reindex operation, which is actually pending for a number of reasons (various language analysis chain updates, regex trigram indexing changes, etc.)
With that patch deployed the test is concluded and the updated configuration utilizing glent m1 is in place.
Jun 16 2025
Per the findings in T396779, I think we can greatly simplify this. The initial premise was that the phrase suggester indices would be too large, but the current analysis says we have plenty of headroom. An alternate implementation:
Patch above undeploys the AB test and, per the report recommendations, deploys glent method 1 to all users on the English, French, and German language Wikipedias.
I think the only patch still open is: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1144486
Jun 13 2025
Two primary sets of tests were conducted: an index size analysis and a memory utilization stress test.
Jun 12 2025
Jun 11 2025
Expecting this to be the final report: https://people.wikimedia.org/~ebernhardson/T262612-dym-ab-analysis.html
Jun 10 2025
Plausibly relevant information from setting up codfw: T117714
Jun 9 2025
:) (ebernhardson@deploy1003)-~$ kube_env mw-cron eqiad
:) (ebernhardson@deploy1003)-~$ kubectl get pods | grep OOM
cirrus-build-completion-indices-codfw-s3-29155830-52977   0/3   OOMKilled   0   37h
cirrus-build-completion-indices-codfw-s3-29157270-wnzhd   0/3   OOMKilled   0   13h
cirrus-build-completion-indices-eqiad-s3-29155830-72d5c   0/3   OOMKilled   0   37h
cirrus-build-completion-indices-eqiad-s3-29157270-mn6bp   0/3   OOMKilled   0   13h
Possible solutions:
- Some sort of cheapcat keyword that has a reduced depth
- Some sort of parameter / named parameter passing for the keyword to allow users to change the depth
- Some sort of cheapcat that runs the query but without a sort (won't time out, but doesn't get the n-closest nodes)
It looks to still be having issues, in particular the s3 job has been OOMKilled a few times recently and isn't completing a full build.
Jun 6 2025
Jun 5 2025
Jun 3 2025
This is deployed now. Tests show the example query no longer errors out, although it doesn't give any results either. As far as I can tell not getting results is appropriate; the pages returned by the generator don't have a wikibase_item page prop and thus don't get represented here.
T395677 has been fixed, but it essentially means we were not serving suggestions to a subset of queries. The errors stopped by 2025-06-03 08:00 UTC; we will need to ensure we do not use any data collected prior to that in the analysis.
Glent update built and shipped overnight, the errors look to have gone away now too.
Jun 2 2025
I think we can call this complete. Nothing in the search/dags folder directly creates a DAG anymore; everything goes through create_easy_dag, which I believe suggests everything has been migrated.
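For anyone double checking, something like this (run from the repo root; the path is from memory) should come back empty if nothing was missed:
# List any python files under search/dags that never reference create_easy_dag.
grep -rL 'create_easy_dag' --include='*.py' search/dags/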
This is indeed coming from glent; poking through the index updates we ship, I can clearly see there are negative numbers in there. I think they come from a place where we do score - edit_distance; we should be able to simply shift the scores by a constant value. I suspect we have to apply that constant globally if we don't want to change the ranking: current query patterns simply return the suggestion score, and shifting only m1run would prefer it over the others (e.g. with a +10 shift, an m1 suggestion that scored 2 would come back as 12 and outrank another method's 8).