Oh hi.
User Details
- User Since: Dec 11 2018, 9:39 PM
- Availability: Available
- IRC Nick: sukhe
- LDAP User: Unknown
- MediaWiki User: SSingh (WMF)
Yesterday
[Claiming this as the clinic duty person this week]
Arelion wants to close the ticket as they see no issue; I asked them not to. Perhaps for now we just leave eqsin depooled and the circuit in the active traffic path? If the reported RTT remains steady, we can re-pool after a reasonable period has elapsed, and if it jumps up, we can re-run the iperf tests at that point to try to confirm the packet loss has returned.
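(For reference, a rough sketch of how that check could be re-run, assuming iperf3 is available on a host at each end of the circuit; the hostname below is a placeholder, not one of our actual hosts:)

# On the far end of the circuit (receiver):
iperf3 -s
# On the near end (sender): a 30-second UDP run at a modest rate; the summary
# reports lost/total datagrams, so non-trivial loss here would confirm the problem is back.
iperf3 -c REMOTE_HOST -u -b 100M -t 30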
Wed, Jul 16
Things were stable for a few hours even after @cmooney made the fix above, but starting ~20:00 UTC we had a page for text-https in eqsin and also observed purged and ATS issues talking to their backends, indicating problems on the link.
Tue, Jul 15
Things look fine so marking as resolved; please re-open if there are any issues.
sukhe@krb1002:~$ sudo manage_principals.py reset-password htriedman --email_address=htriedman-ctr@wikimedia.org
Password reset successfully.
Successfully sent email to htriedman-ctr@wikimedia.org
sukhe@krb1002:~$ sudo manage_principals.py create vgutierrez --email_address=vgutierrez@wikimedia.org
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to vgutierrez@wikimedia.org
Expiry has been set to the end of the FY (June 2026) and the contact has been set to Suman to get this request going. We can always update that later.
Mon, Jul 14
@SCherukuwada: Hi, thanks for the approval above. Since the email ends in -ctr, can you let us know the contract period and also the point of contact (which I assume will be you?). Thanks!
sukhe@krb1002:~$ sudo manage_principals.py create addshore --email_address=wikimedia@addshore.com
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to wikimedia@addshore.com
Additionally please note that for logstash-access, you can simply request it via https://idm.wikimedia.org/. See https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access.
Hi. For volunteers requesting access, https://wikitech.wikimedia.org/wiki/SRE/Production_access#Add_a_volunteer_to_an_access_group needs to be followed.
Please re-open if there are any issues.
Resolving this as the above has been completed at least from SRE's side. @aranyap: please re-open if there are any issues.
Thu, Jul 10
One additional point: we also ruled out any Kafka-specific issues; thanks to @brouberol.
Summary of where we are now:
Wed, Jul 9
On the DNS hosts as of today, we have an alert in place if we detect a mismatch between the service state as defined by confd/confctl and the advertisements of the VIPs on the DNS hosts themselves. Here is how it works in case we are interested in expanding this to other places:
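(A minimal sketch of that comparison, assuming the pooled state comes from confctl and the VIP is advertised via bird; this is an illustration of the idea, not the production alert, and the VIP address below is a placeholder:)

#!/bin/bash
# Sketch only: flag a mismatch between the confctl-defined state and the VIP advertisement.
VIP="198.51.100.1"   # placeholder, not the real service VIP
POOLED=$(confctl select "name=$(hostname -f)" get | grep -c '"pooled": "yes"')
ADVERTISED=$(sudo birdc show route | grep -c "$VIP")
if [ "$POOLED" -gt 0 ] && [ "$ADVERTISED" -eq 0 ]; then
  echo "CRITICAL: host is pooled but the VIP is not being advertised"
  exit 2
elif [ "$POOLED" -eq 0 ] && [ "$ADVERTISED" -gt 0 ]; then
  echo "CRITICAL: host is depooled but the VIP is still being advertised"
  exit 2
fi
echo "OK: confctl state and VIP advertisement agree"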
Thu, Jul 3
Happy to collaborate on this, FWIW.
Wed, Jun 25
I am going to tackle this for the DNS hosts at least and then we can revisit a generic solution.
Tue, Jun 24
Mon, Jun 23
I updated Markmonitor to further add the v6 glue records:
Thu, Jun 19
The ECH experiment has been reverted as of today.
Wed, Jun 18
I checked with Brandon and he confirmed that while not strictly required, we can add the AAAA records here so let me know if you want to do that and I can see how to effect that change in Markmonitor.
Ah, thanks for the context. Looking at the domains in question, all of them are delegated at Markmonitor, which explains why they already have the glue records in place. (As compared to wikimedia.cloud, which we delegate from ns[0-2], for obvious reasons, so that we can delegate specific subdomains ourselves.) wikimediacloud.org has NS set to ns[0-2].wikimedia.org and additionally has glue records for ns[01].openstack.eqiad1.wikimediacloud.org set explicitly in Markmonitor.
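(As a quick way to double-check what the parent zone actually serves after such a change, assuming plain dig; the names below are the ones mentioned above:)

# The .org step of the trace shows the delegation NS set and any glue in the additional section.
dig +trace wikimediacloud.org NS
# Check the glue targets directly, v4 and v6:
dig +short A    ns0.openstack.eqiad1.wikimediacloud.org
dig +short AAAA ns0.openstack.eqiad1.wikimediacloud.org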
Jun 18 2025
Jun 17 2025
You can add them if required even though my reading of the other tasks and RFC indicates it is optional. That being said, why only v6 glues and not v4? I don't see v4 glue records in the zone file for wikimedia.cloud but I guess I am missing some more context.
Jun 12 2025
Jun 11 2025
Jun 9 2025
Jun 6 2025
Jun 4 2025
Commenting on this with my own understanding and for review of others. After that, letting @BCornwall handle updating the task description.
@Muehlenhoff: Both of these are decommissioned. Let me know if any other action is required from my end, thanks!
Sounds good, thanks. Let me know if I can help with anything.
Jun 2 2025
May 27 2025
May 26 2025
May 22 2025
(Adding @Fabfur who will lead this from Traffic with Brett.)
May 20 2025
Hi @bd808: we discussed this in the Traffic meeting today and will need some more time to prioritize this. I will follow up again next week. (We are interested but with the end of the quarter, we are still triaging this among the other work.)
We discussed this in the Traffic meeting and there are no concerns from our side in moving ahead. Let us know the date/time for the rollout and we can coordinate!
May 15 2025
May 14 2025
OK, should be fine. We will follow up with a confirmation after discussing this in the Traffic meeting on Tuesday.
@greg: The change has been made and Markmonitor has been updated. NS records have a fairly large TTL of 86400 seconds (one day), so it will be a while (1-2 days) before the various recursors pick up the change, not counting the time it takes for the update to propagate from the registrar to the .gives TLD.
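(A quick way to watch the propagation, assuming plain dig: compare what a public recursor still has cached against what the delegation now serves.)

dig NS wiki.gives @8.8.8.8 +short   # what a public recursor currently returns
dig NS wiki.gives +trace            # what the .gives delegation now points to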
Hi @greg: (Using this task is perfectly fine, thanks.) The delegation for wiki.gives will be done on Markmonitor's (registrar) side, so we will need to update the NS records there and point them to the above, moving them away from ns[0-2].wikimedia.org, where they currently point. There is no other DNS deployment required on our end in that case; we would only need that if we were doing subdomain delegation, which we are not.
Thanks for the task and the detailed description! Mostly sounds good to me for the VMs that I "own" but will need to check with Traffic for others. Is there a tentative date for when you want to get to this?
May 13 2025
May 8 2025
May 7 2025
The host has been depooled so you can reboot or shut it down without checking with us. Thanks for the quick response Rob!
So we have trimmed it down even with a simple git maintenance run:
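(For reference, and assuming a reasonably recent git, the generic sequence for this kind of trim and for measuring its effect looks roughly like:)

git count-objects -vH            # repository size before
git maintenance run --task=gc    # the gc maintenance task repacks and prunes
git count-objects -vH            # repository size after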
May 5 2025
May 1 2025
I have been facing this intermittently over the past week or so. Refreshing helps sometimes but not always. Some additional points that might help with debugging, based on my own observations: which selectors fail appears random, but within a given failure interval, the same selectors fail consistently.