
ssingh (Sukhbir Singh)
SRE/Traffic


User Details

User Since
Dec 11 2018, 9:39 PM (344 w, 2 d)
Availability
Available
IRC Nick
sukhe
LDAP User
Unknown
MediaWiki User
SSingh (WMF) [ Global Accounts ]

Oh hi.

Recent Activity

Yesterday

ssingh claimed T399899: Requesting access to analytics-privatedata-users for resquito.
Thu, Jul 17, 9:16 PM · SRE, SRE-Access-Requests
ssingh added a comment to T398650: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum.

[Claiming this as the clinic duty person this week]

Thu, Jul 17, 9:13 PM · SRE, SRE-Access-Requests
ssingh added a comment to T399419: Warning about /etc/acmecerts/unified contents during puppet run on deployment-cache-text08 & deployment-cache-upload08.

I like the idea of using two passes instead of rm -r, so something like:

  • find /var/lib/acme-chief/certs -type f -mtime +365 -delete
  • find /var/lib/acme-chief/certs -type d -empty -delete

thanks for the debugging @bd808
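A minimal Python sketch of the same two-pass idea, for illustration only: the function name `prune_old_certs` and the age arithmetic are assumptions, not part of acme-chief or the Puppet code, and the cutoff only approximates how `find -mtime +365` rounds ages to whole days. Old files are removed first, then directories left empty are removed bottom-up.

```python
import os
import time
from pathlib import Path


def prune_old_certs(root: str, max_age_days: int = 365) -> None:
    """Two-pass cleanup: delete old files, then remove empty directories."""
    cutoff = time.time() - max_age_days * 86400

    # Pass 1: roughly `find ROOT -type f -mtime +365 -delete`.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Skip symlinks and anything that is not a regular file.
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_mtime < cutoff:
                path.unlink()

    # Pass 2: roughly `find ROOT -type d -empty -delete`. Walking bottom-up
    # means a parent emptied by removing its children is removed as well.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
```

Unlike `rm -r`, neither pass touches a directory that still contains a recent file.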

Thu, Jul 17, 5:08 PM · Acme-chief, Beta-Cluster-reproducible, Beta-Cluster-Infrastructure
ssingh added a comment to T399221: eqsin purged consumers lag.

Arelion want to close the ticket as they see no issue. I asked that they don't. Perhaps for now we just leave eqsin depooled and the circuit in the active traffic path? If the reported RTT remains steady we can re-pool after a reasonable period has elapsed? And if it jumps up we can re-do the iperf tests at the time to try to confirm the packet loss has returned?

Thu, Jul 17, 3:45 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic

Wed, Jul 16

ssingh added a comment to T399221: eqsin purged consumers lag.

Things were stable for a few hours even after @cmooney made the fix above but starting ~20:00 UTC, we had a page for text-https in eqsin and also observed purged and ATS issues in talking to backends, indicating issues on the link.

Wed, Jul 16, 8:34 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic

Tue, Jul 15

ssingh closed T399436: Requesting access to Dashboards in Superset for OKryva-WMF as Resolved.

Things look fine so marking as resolved; please re-open if there are any issues.

Tue, Jul 15, 5:55 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
ssingh closed T375232: Write a cookbook that performs a rolling restart of HAProxy as Resolved.

Done in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1167222.

Tue, Jul 15, 3:55 PM · Traffic
ssingh closed T398501: Requesting a kerberos identity - htriedman as Resolved.
Tue, Jul 15, 3:10 PM · SRE-Access-Requests, Data-Engineering
ssingh updated subscribers of T398501: Requesting a kerberos identity - htriedman.
Tue, Jul 15, 2:45 PM · SRE-Access-Requests, Data-Engineering
ssingh added a comment to T398501: Requesting a kerberos identity - htriedman.
sukhe@krb1002:~$ sudo manage_principals.py reset-password htriedman --email_address=htriedman-ctr@wikimedia.org
Password reset successfully.
Successfully sent email to htriedman-ctr@wikimedia.org
Tue, Jul 15, 2:45 PM · SRE-Access-Requests, Data-Engineering
ssingh closed T399560: Requesting access to analytics_privatedata_users for vgutierrez as Resolved.
sukhe@krb1002:~$ sudo manage_principals.py create vgutierrez --email_address=vgutierrez@wikimedia.org
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to vgutierrez@wikimedia.org
Tue, Jul 15, 2:40 PM · Patch-For-Review, SRE, SRE-Access-Requests
ssingh added a comment to T399436: Requesting access to Dashboards in Superset for OKryva-WMF.

Expiry has been set to end of FY (June 2026) and contact has been set to Suman to get this request going. We can always update that later.

Tue, Jul 15, 2:37 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
ssingh updated the task description for T399560: Requesting access to analytics_privatedata_users for vgutierrez.
Tue, Jul 15, 2:26 PM · Patch-For-Review, SRE, SRE-Access-Requests

Mon, Jul 14

ssingh added a comment to T399436: Requesting access to Dashboards in Superset for OKryva-WMF.

@SCherukuwada: Hi, thanks for the approval above. Since the email ends in -ctr, can you let us know the contract period and also the point of contact (which I assume will be you?). Thanks!

Mon, Jul 14, 6:39 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests
ssingh closed T399114: Remove OCSP monitoring and related bits as Resolved.
Mon, Jul 14, 3:02 PM · Traffic
ssingh closed T399079: Disable OCSP for GTS issued certificates, a subtask of T399114: Remove OCSP monitoring and related bits, as Resolved.
Mon, Jul 14, 3:02 PM · Traffic
ssingh closed T399079: Disable OCSP for GTS issued certificates as Resolved.
Mon, Jul 14, 3:02 PM · Traffic
ssingh closed T399152: Requesting access to analytics-privatedata-users for addshore as Resolved.
sukhe@krb1002:~$ sudo manage_principals.py create addshore --email_address=wikimedia@addshore.com
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to wikimedia@addshore.com
Mon, Jul 14, 2:53 PM · SRE, SRE-Access-Requests
ssingh added a comment to T399421: Logstash Access for gergesshamon.

Additionally please note that for logstash-access, you can simply request it via https://idm.wikimedia.org/. See https://wikitech.wikimedia.org/wiki/SRE/LDAP/Groups/Request_access.

Mon, Jul 14, 2:47 PM · SRE, LDAP-Access-Requests
ssingh added a comment to T399421: Logstash Access for gergesshamon.

Hi. For volunteers requesting access, https://wikitech.wikimedia.org/wiki/SRE/Production_access#Add_a_volunteer_to_an_access_group needs to be followed.

Mon, Jul 14, 2:40 PM · SRE, LDAP-Access-Requests
ssingh updated the task description for T399152: Requesting access to analytics-privatedata-users for addshore.
Mon, Jul 14, 2:29 PM · SRE, SRE-Access-Requests
ssingh closed T398686: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech as Resolved.

Please re-open if there are any issues.

Mon, Jul 14, 2:19 PM · SRE-Access-Requests, SRE
ssingh closed T398650: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum as Resolved.

Resolving this as the above has been completed at least from SRE's side. @aranyap: please re-open if there are any issues.

Mon, Jul 14, 2:09 PM · SRE, SRE-Access-Requests
ssingh updated the task description for T399436: Requesting access to Dashboards in Superset for OKryva-WMF.
Mon, Jul 14, 1:48 PM · LDAP-Access-Requests, SRE, SRE-Access-Requests

Thu, Jul 10

ssingh updated subscribers of T399221: eqsin purged consumers lag.

One additional point: we also ruled out any Kafka-specific issues; thanks to @brouberol.

Thu, Jul 10, 7:09 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
ssingh updated subscribers of T399221: eqsin purged consumers lag.

Summary of where we are now:

Thu, Jul 10, 7:05 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic

Wed, Jul 9

ssingh changed the status of T399114: Remove OCSP monitoring and related bits from Open to In Progress.
Wed, Jul 9, 6:23 PM · Traffic
ssingh added a parent task for T399079: Disable OCSP for GTS issued certificates: T399114: Remove OCSP monitoring and related bits.
Wed, Jul 9, 6:23 PM · Traffic
ssingh added a subtask for T399114: Remove OCSP monitoring and related bits: T399079: Disable OCSP for GTS issued certificates.
Wed, Jul 9, 6:22 PM · Traffic
ssingh created T399114: Remove OCSP monitoring and related bits.
Wed, Jul 9, 6:22 PM · Traffic
ssingh added a comment to T374619: Alert when anycast-healthchecker withdraws BGP route.

On the DNS hosts as of today, we have an alert in place if we detect a mismatch between the service state as defined by confd/confctl and the advertisements of the VIPs on the DNS hosts themselves. Here is how it works in case we are interested in expanding this to other places:
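The details of the actual alert are not reproduced in this feed. Purely as a hypothetical sketch of the kind of comparison described (desired pooled state versus what is actually advertised), the idea might look like the following; the function name, the data shapes, and the example addresses are all assumptions and do not reflect the real confd/confctl or anycast-healthchecker interfaces.

```python
def find_mismatches(pooled: dict, advertised: set) -> dict:
    """Compare desired state (VIP -> pooled?) with the set of advertised VIPs.

    Both inputs are hypothetical: in practice they would be derived from the
    service state and from the routes the host is actually announcing.
    """
    problems = {}
    for vip, is_pooled in pooled.items():
        is_advertised = vip in advertised
        if is_pooled and not is_advertised:
            problems[vip] = "pooled but not advertised"
        elif not is_pooled and is_advertised:
            problems[vip] = "depooled but still advertised"
    return problems


# Example with documentation-range addresses (not real VIPs).
desired = {"198.51.100.1": True, "198.51.100.2": False, "198.51.100.3": True}
announced = {"198.51.100.2", "198.51.100.3"}
mismatches = find_mismatches(desired, announced)
```

An alert would then fire whenever `mismatches` is non-empty.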

Wed, Jul 9, 5:49 PM · Traffic, Infrastructure-Foundations, netops, SRE

Thu, Jul 3

ssingh added a comment to T336754: Document how to deploy changes to DNS repo without Gerrit working.

Happy to collaborate on this, FWIW.

Thu, Jul 3, 2:37 PM · collaboration-services, SRE, Traffic

Wed, Jun 25

ssingh added a comment to T374619: Alert when anycast-healthchecker withdraws BGP route.

I am going to tackle this for the DNS hosts at least and then we can revisit a generic solution.

Wed, Jun 25, 1:45 PM · Traffic, Infrastructure-Foundations, netops, SRE

Tue, Jun 24

ssingh updated the task description for T397456: Upgrade to ATS 9.2.11.
Tue, Jun 24, 6:02 PM · Traffic
ssingh updated the task description for T397456: Upgrade to ATS 9.2.11.
Tue, Jun 24, 3:29 PM · Traffic
ssingh updated the task description for T397456: Upgrade to ATS 9.2.11.
Tue, Jun 24, 2:53 PM · Traffic
ssingh updated the task description for T397456: Upgrade to ATS 9.2.11.
Tue, Jun 24, 1:23 PM · Traffic

Mon, Jun 23

ssingh added a comment to T397185: Add IPv6 glue records for WMCS Designate-hosted domains.

I updated Markmonitor to further add the v6 glue records:

Mon, Jun 23, 3:48 PM · Traffic, Domains, IPv6, cloud-services-team

Thu, Jun 19

ssingh added a comment to T205378: Support ECH on Wikimedia servers.

The ECH experiment has been reverted as of today.

Thu, Jun 19, 3:59 PM · Traffic, Traffic-Icebox, Upstream, HTTPS, SRE
ssingh updated the task description for T397456: Upgrade to ATS 9.2.11.
Thu, Jun 19, 2:10 PM · Traffic
ssingh triaged T397456: Upgrade to ATS 9.2.11 as Medium priority.
Thu, Jun 19, 1:55 PM · Traffic
ssingh added a parent task for T397456: Upgrade to ATS 9.2.11: Unknown Object (Task).
Thu, Jun 19, 1:55 PM · Traffic
ssingh created T397456: Upgrade to ATS 9.2.11.
Thu, Jun 19, 1:55 PM · Traffic

Wed, Jun 18

ssingh updated the task description for T390912: Upgrade to ATS 9.2.10.
Wed, Jun 18, 4:23 PM · Traffic
ssingh added a comment to T397185: Add IPv6 glue records for WMCS Designate-hosted domains.

I checked with Brandon and he confirmed that while not strictly required, we can add the AAAA records here so let me know if you want to do that and I can see how to effect that change in Markmonitor.

Wed, Jun 18, 3:23 PM · Traffic, Domains, IPv6, cloud-services-team
ssingh added a comment to T397185: Add IPv6 glue records for WMCS Designate-hosted domains.

Ah, thanks for the context. Looking at the domains in question, all of them are delegated at Markmonitor, which explains why they have the glue records in place already. (As compared to wikimedia.cloud, which we delegate from ns[0-2] for obvious reasons to delegate specific subdomains). wikimediacloud.org has NS set to ns[0-2].wikimedia.org and further has glue records for ns[01].openstack.eqiad1.wikimediacloud.org set explicitly in Markmonitor.

Wed, Jun 18, 2:14 PM · Traffic, Domains, IPv6, cloud-services-team

Jun 18 2025

ssingh updated the task description for T396581: varnish 7.1.1-2~bpo11+wmf1 crash.
Jun 18 2025, 12:33 AM · Traffic

Jun 17 2025

ssingh added a comment to T397185: Add IPv6 glue records for WMCS Designate-hosted domains.

You can add them if required even though my reading of the other tasks and RFC indicates it is optional. That being said, why only v6 glues and not v4? I don't see v4 glue records in the zone file for wikimedia.cloud but I guess I am missing some more context.

Jun 17 2025, 5:34 PM · Traffic, Domains, IPv6, cloud-services-team
ssingh updated the task description for T390912: Upgrade to ATS 9.2.10.
Jun 17 2025, 4:18 PM · Traffic
ssingh updated the task description for T390912: Upgrade to ATS 9.2.10.
Jun 17 2025, 1:54 PM · Traffic

Jun 12 2025

ssingh assigned T390912: Upgrade to ATS 9.2.10 to BCornwall.
Jun 12 2025, 1:29 PM · Traffic
ssingh updated the task description for T390912: Upgrade to ATS 9.2.10.
Jun 12 2025, 1:21 PM · Traffic

Jun 11 2025

ssingh updated the task description for T390912: Upgrade to ATS 9.2.10.
Jun 11 2025, 3:22 PM · Traffic
ssingh triaged T396611: Package, prepare the upgrade, and deploy ATS 10 in production as Low priority.
Jun 11 2025, 1:23 PM · Traffic
ssingh created T396611: Package, prepare the upgrade, and deploy ATS 10 in production.
Jun 11 2025, 1:23 PM · Traffic

Jun 9 2025

ssingh triaged T396398: Liberica control plane (liberica-cp) should not automatically start on system boot as Low priority.
Jun 9 2025, 6:20 PM · Liberica, Traffic
ssingh created T396398: Liberica control plane (liberica-cp) should not automatically start on system boot.
Jun 9 2025, 6:20 PM · Liberica, Traffic

Jun 6 2025

ssingh added a comment to T387145: Q3:test NIC for lvs1017.

Check with @cmooney for changes required to hieradata/common/lvs/interfaces.yaml (to add lvs1016 there) and also to hieradata/role/eqiad/lvs/balancer.yaml in case profile::lvs::tagged_subnets needs to be updated (don't think so but check).

To my knowledge this now only applies to low-traffic services in eqiad/codfw (behind K8s and some search ones). I believe everything else is using IPIP, so L2 adjacency / extra interfaces are not required on the LVS nodes serving those.

Jun 6 2025, 3:40 PM · SRE, ops-eqiad, Traffic, DC-Ops
ssingh assigned T381608: Upgrade pdns-recursor to 5.x on all prod DNS hosts (all C:dnsrecursor and so possibly WMCS) to CDobbins.
Jun 6 2025, 1:14 PM · Patch-For-Review, SRE, Traffic

Jun 4 2025

ssingh added a comment to T387145: Q3:test NIC for lvs1017.

Commenting on this with my own understanding and for review of others. After that, letting @BCornwall handle updating the task description.

Jun 4 2025, 7:00 PM · SRE, ops-eqiad, Traffic, DC-Ops
ssingh updated subscribers of T396015: Decommission doh7001 and durum7001.

@Muehlenhoff: Both of these are decommissioned. Let me know if any other action is required from my end, thanks!

Jun 4 2025, 2:25 PM · Traffic, Ganeti, SRE
ssingh updated the task description for T396015: Decommission doh7001 and durum7001.
Jun 4 2025, 2:24 PM · Traffic, Ganeti, SRE
ssingh updated the task description for T396015: Decommission doh7001 and durum7001.
Jun 4 2025, 2:06 PM · Traffic, Ganeti, SRE
ssingh added a comment to T396015: Decommission doh7001 and durum7001.

Sounds good, thanks. Let me know if I can help with anything.

Jun 4 2025, 1:13 PM · Traffic, Ganeti, SRE

Jun 2 2025

ssingh added a comment to T395796: Move ncredir7003 into service and decom ncredir7002.

Had a quick chat with Moritz and Sukhbir.
We prefer not to wait for the Bird work to progress on setting up the Routed Ganeti cluster, so we're going to run doh/durum in a non-redundant way (running on only a single Ganeti host) while creating a 3-node routed Ganeti cluster.

This will also permit me to look at setting up a temporary BGP setup on durum7003 (and extend it to doh7003 if it's solid enough).

So next step is for traffic to do all the steps in the task description except the doh7003 steps for now.

Jun 2 2025, 2:57 PM · Traffic, SRE

May 27 2025

ssingh closed T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) as Resolved.
May 27 2025, 6:01 PM · SRE, Traffic
ssingh updated the task description for T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min).
May 27 2025, 6:00 PM · SRE, Traffic

May 26 2025

ssingh added a parent task for T238034: Enable HTTP/3 (QUIC) support on Wikimedia servers: T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections.
May 26 2025, 3:49 PM · Wikimedia-Performance-recommendation, Traffic-Icebox, SRE, HTTPS
ssingh added a subtask for T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections: T238034: Enable HTTP/3 (QUIC) support on Wikimedia servers.
May 26 2025, 3:49 PM · Wikimedia-Performance-recommendation, Traffic-Icebox, SRE
ssingh updated the task description for T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min).
May 26 2025, 1:35 PM · SRE, Traffic

May 22 2025

ssingh updated subscribers of T392851: Q4:rack/setup/install cp20[43-58] codfw.

(Adding @Fabfur who will lead this from Traffic with Brett.)

May 22 2025, 5:44 PM · Patch-For-Review, Traffic, ops-codfw, DC-Ops

May 20 2025

ssingh added a comment to T394263: Migrating magru to routed Ganeti.

We discussed this in the Traffic meeting and there are no concerns from our side in moving ahead. Let us know the date/time we want this rolled out and we can coordinate!

We'd like to start next Monday. The first steps are mostly internal to Ganeti; we'll reach out when VMs need to be shuffled. OK?

May 20 2025, 3:24 PM · Patch-For-Review, Ganeti, Infrastructure-Foundations, SRE
ssingh added a comment to T358887: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes.

Hi @bd808: we discussed this in the Traffic meeting today and will need some more time to prioritize this. I will follow up again next week. (We are interested but with the end of the quarter, we are still triaging this among the other work.)

May 20 2025, 3:20 PM · Traffic, Release-Engineering-Team, Beta-Cluster-Infrastructure
ssingh added a comment to T394263: Migrating magru to routed Ganeti.

We discussed this in the Traffic meeting and there are no concerns from our side in moving ahead. Let us know the date/time we want this rolled out and we can coordinate!

May 20 2025, 3:16 PM · Patch-For-Review, Ganeti, Infrastructure-Foundations, SRE

May 15 2025

ssingh updated the task description for T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min).
May 15 2025, 2:24 PM · SRE, Traffic
ssingh renamed T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) from Lower geodns TTLs for dyna.wm.org from 300s (5 min) to 180s (3 min) to Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min).
May 15 2025, 1:18 PM · SRE, Traffic

May 14 2025

ssingh added a comment to T394263: Migrating magru to routed Ganeti.

OK, should be fine. We will follow up with a confirmation after discussing this in the Traffic meeting on Tuesday.

May 14 2025, 5:26 PM · Patch-For-Review, Ganeti, Infrastructure-Foundations, SRE
ssingh added a comment to T379318: Acoustic SMS: Domain needed for short links.

@greg: The change has been made and Markmonitor has been updated. NS records have a fairly large TTL of 86400 seconds or one day, so it will be a while (1-2 days) before the various recursors pick up the change (not counting the time it takes for the update to propagate from the registrar to the gives TLD).

May 14 2025, 5:04 PM · Wikimedia-Fundraising-CiviCRM, Traffic, Fundraising-Tech-Roadmap, Fundraising-Backlog, fr-acoustic
ssingh added a comment to T379318: Acoustic SMS: Domain needed for short links.

Hi @greg: (Using this task is perfectly fine, thanks). The delegation for wiki.gives will be done at Markmonitor's (registrar) side, so we will need to update the NS records there and point them to the above, moving them away from ns[0-2].wikimedia.org where they currently point to. There is no other DNS deployment required on our end in that case; we only need to do that if we were doing subdomain delegation, which we are not.

May 14 2025, 4:15 PM · Wikimedia-Fundraising-CiviCRM, Traffic, Fundraising-Tech-Roadmap, Fundraising-Backlog, fr-acoustic
ssingh created T394312: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min).
May 14 2025, 2:32 PM · SRE, Traffic
ssingh added a comment to T394263: Migrating magru to routed Ganeti.

Thanks for the task and the detailed description! Mostly sounds good to me for the VMs that I "own" but will need to check with Traffic for others. Is there a tentative date for when you want to get to this?

May 14 2025, 1:05 PM · Patch-For-Review, Ganeti, Infrastructure-Foundations, SRE

May 13 2025

ssingh claimed T358887: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes.
May 13 2025, 6:01 PM · Traffic, Release-Engineering-Team, Beta-Cluster-Infrastructure
ssingh added a comment to T358887: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes.

That being said, if there is a requirement for constantly keeping blocked nets updated in Beta, then we should have the discussion of how to fix it and also to plan for it.

The new T393487: 2025 tracking task for Beta Cluster (deployment-prep) traffic overload protection (blocking unwanted crawlers) parent task and the https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/2RFFHXSI6INQDJ2AQ7U3IQ2HTHT5J4VL/ mailing list post give a bit of context for the current wave of excess traffic combat that we are doing in deployment-prep. In the big picture there are a couple of issues that I would like to resolve:

  • Closer alignment with tools used in the production CDN edge would ideally provide a nicer experience when we need to block traffic manually. It also helps with the "drinking our own champagne" aspects of the Beta Cluster deployment by giving folks a place to practice skills and test features that are eventually bound for production use.
  • More automated protections for Beta Cluster are desirable because there are few to no people dedicated to maintaining the environment vs a number of technical volunteers and Foundation and affiliate staff who use the platform quite regularly for integration, acceptance, and regression testing of various types. This work might lead to learning things that can help other bits of the Wikimedia infrastructure that are not protected by the current CDN for various reasons.

This task and T393481: Add allowlist to make poking holes in abuse_networks:blocked_nets:networks easier are probably more stop gaps than direct movement towards either of the larger goals.

May 13 2025, 5:58 PM · Traffic, Release-Engineering-Team, Beta-Cluster-Infrastructure
ssingh added a comment to T358887: beta cluster: profile::cache::varnish::frontend needs to reload varnish-frontend.service when /etc/varnish/blocked-nets.inc.vcl changes.

If I'm understanding @ssingh's comments on the now abandoned patch, it sounds like setting up the requestctl stack in Beta Cluster might be our better long term solution.

May 13 2025, 5:35 PM · Traffic, Release-Engineering-Team, Beta-Cluster-Infrastructure
ssingh reassigned T393616: lvs3009 NIC HW issue (Broadcom, eno8303) from RobH to BCornwall.
May 13 2025, 1:19 PM · SRE, Traffic, DC-Ops, ops-esams
ssingh added a comment to T393616: lvs3009 NIC HW issue (Broadcom, eno8303).

Thanks @RobH! @cmooney: yeah, I updated the task description to reflect that but we thought we should get this checked out anyway, since it's the integrated NIC. Thanks for checking, both!

May 13 2025, 1:06 PM · SRE, Traffic, DC-Ops, ops-esams

May 8 2025

ssingh updated Other Assignee for T387145: Q3:test NIC for lvs1017, added: BCornwall.
May 8 2025, 6:29 PM · SRE, ops-eqiad, Traffic, DC-Ops
ssingh renamed T393616: lvs3009 NIC HW issue (Broadcom, eno8303) from lvs3009 NIC HW issue (Broadcom, eno12399np0) to lvs3009 NIC HW issue (Broadcom, eno8303).
May 8 2025, 1:21 PM · SRE, Traffic, DC-Ops, ops-esams

May 7 2025

ssingh added a comment to T393616: lvs3009 NIC HW issue (Broadcom, eno8303).

The host has been depooled so you can reboot or shut it down without checking with us. Thanks for the quick response Rob!

May 7 2025, 8:49 PM · SRE, Traffic, DC-Ops, ops-esams
ssingh added a comment to T393616: lvs3009 NIC HW issue (Broadcom, eno8303).

I'll open a case with Dell, which will inevitably require the firmware on the NIC, mainboard, and idrac be updated before they'll authorize a replacement.

Was the server under any particular load before the error we can recreate, or did it just randomly fire?

May 7 2025, 8:23 PM · SRE, Traffic, DC-Ops, ops-esams
ssingh renamed T393616: lvs3009 NIC HW issue (Broadcom, eno8303) from lvs3009 NIC possible HW issue to lvs3009 NIC HW issue (Broadcom, eno12399np0).
May 7 2025, 4:17 PM · SRE, Traffic, DC-Ops, ops-esams
ssingh triaged T393616: lvs3009 NIC HW issue (Broadcom, eno8303) as High priority.
May 7 2025, 4:17 PM · SRE, Traffic, DC-Ops, ops-esams
ssingh created T393616: lvs3009 NIC HW issue (Broadcom, eno8303).
May 7 2025, 4:16 PM · SRE, Traffic, DC-Ops, ops-esams
ssingh added a comment to T393602: Improving the time it takes to run authdns-update.

So we have trimmed it down even with a simple git maintenance run:

May 7 2025, 3:11 PM · Traffic
ssingh triaged T393602: Improving the time it takes to run authdns-update as Medium priority.
May 7 2025, 2:23 PM · Traffic
ssingh created T393602: Improving the time it takes to run authdns-update.
May 7 2025, 2:23 PM · Traffic

May 5 2025

ssingh added a comment to T392848: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state.

No, you're right, current_state in the Icinga status reports the state independently of the downtime. An alternative option to Luca's implementation could be to add a skip_downtimed flag and, when set, ignore the status for the services that have scheduled_downtime_depth > 0. No particularly strong opinion, just that in general it is error-prone to specify service names, as they can easily change dynamically from Puppet's code and hence not match anymore. For this reason the current methods that work on service names use regexes to try to be a bit more flexible and future-proof.
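To make the skip_downtimed suggestion concrete, here is a minimal sketch of the filtering logic. It is not Spicerack's API: the `ServiceStatus` class and `failed_services` function are hypothetical, and only the field names `current_state` and `scheduled_downtime_depth` come from the Icinga status discussed above (where a `current_state` of 0 means OK).

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Hypothetical stand-in for one service entry in the Icinga status."""
    name: str
    current_state: int             # 0 = OK; non-zero = WARNING/CRITICAL/UNKNOWN
    scheduled_downtime_depth: int  # > 0 while a scheduled downtime is active


def failed_services(services, skip_downtimed=False):
    """Return names of services not in OK state.

    With skip_downtimed=True, services currently in a scheduled downtime are
    ignored, so no service names need to be listed explicitly.
    """
    failed = []
    for svc in services:
        if skip_downtimed and svc.scheduled_downtime_depth > 0:
            continue
        if svc.current_state != 0:
            failed.append(svc.name)
    return failed
```

Because the filter keys off downtime state, it keeps working when Puppet renames a service, which is the failure mode that matching on names is prone to.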

May 5 2025, 3:42 PM · Infrastructure-Foundations, Traffic, SRE-tools, Spicerack
ssingh added a comment to T392848: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state.

Have you considered just downtiming the affected service during the operation with downtime_services or the related context manager version of it?

May 5 2025, 2:09 PM · Infrastructure-Foundations, Traffic, SRE-tools, Spicerack

May 1 2025

ssingh added a comment to T393097: Frequent filter timeouts in superset UI.

I have been facing this, intermittently, over the past week or so. Refreshing helps sometimes but not always. Some additional points that might help with debugging and as I have personally observed: the selectors that fail are random but during the interval within which they fail, it's pretty consistent.

May 1 2025, 2:15 PM · Data-Platform-SRE (2025.05.24 - 2025.06.13), superset.wikimedia.org

Apr 30 2025

ssingh added a comment to T393034: Investigate out of date refs following gerrit switchover.

The current hypothesis is that during the DNS change both hosts were considered to be primary and unexpected replication took place for approximately an hour and 20 minutes (at most). The result was gerrit2002 deleting changes made on gerrit1003.

Apr 30 2025, 6:57 PM · Wikimedia-Incident, Release-Engineering-Team, collaboration-services, Gerrit