
SRE (Group) · Active · Public

Recent Activity

Today

Maintenance_bot added a project to T399916: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}): SRE.
Fri, Jul 18, 2:29 AM · SRE, DC-Ops, ops-codfw

Yesterday

Stashbot added a comment to T399221: eqsin purged consumers lag.

Mentioned in SAL (#wikimedia-operations) [2025-07-17T22:01:29Z] <cmooney@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: repool eqsin to test backhaul cct packet loss, T399221]

Thu, Jul 17, 10:01 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
Stashbot added a comment to T399221: eqsin purged consumers lag.

Mentioned in SAL (#wikimedia-operations) [2025-07-17T22:01:25Z] <cmooney@cumin1003> START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: repool eqsin to test backhaul cct packet loss, T399221]

Thu, Jul 17, 10:01 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
aranyap added a comment to T398650: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum.

@cmooney @ssingh I just requested access through the online system. Thank you!

Thu, Jul 17, 9:31 PM · SRE, SRE-Access-Requests
ssingh claimed T399899: Requesting access to analytics-privatedata-users for resquito.
Thu, Jul 17, 9:16 PM · SRE, SRE-Access-Requests
ssingh added a comment to T398650: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum.

[Claiming this as the clinic duty person this week]

Thu, Jul 17, 9:13 PM · SRE, SRE-Access-Requests
cmooney added a comment to T398650: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum.

Hi @aranyap, yeah, you are not in that group:

cmooney@ldap-maint1001:~$ ldapsearch -x cn=wmf | grep aprum 
cmooney@ldap-maint1001:~$
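For context, an empty result from that grep means the directory has no member entry for the user. A positive match would look roughly like the following; this is a sketch only, exampleuser is a placeholder and the member-DN format is an assumption about the directory layout:

ldapsearch -x cn=wmf | grep exampleuser
member: uid=exampleuser,ou=people,dc=wikimedia,dc=org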
Thu, Jul 17, 8:58 PM · SRE, SRE-Access-Requests
aranyap reopened T398650: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum as "Open".

Hi @cmooney! I'm having some trouble trying to access JupyterHub, and after some poking around with @dr0ptp4kt and @BTullis we think it's because I don't have wmf LDAP access, but we aren't 100% sure.

Thu, Jul 17, 8:53 PM · SRE, SRE-Access-Requests
Jclark-ctr moved T399847: Degraded RAID on backup1007 from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.
Thu, Jul 17, 8:05 PM · DC-Ops, SRE, ops-eqiad
HShaikh added a comment to T399899: Requesting access to analytics-privatedata-users for resquito.

Approved. Thank you

Thu, Jul 17, 7:58 PM · SRE, SRE-Access-Requests
REsquito-WMF added a comment to T399899: Requesting access to analytics-privatedata-users for resquito.

This ticket is a prerequisite for https://phabricator.wikimedia.org/T396672. @dr0ptp4kt is also readying a patch for additional access in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1165605, to be taken out of WIP once your initial SSH access is established.

Thu, Jul 17, 7:52 PM · SRE, SRE-Access-Requests
REsquito-WMF created T399899: Requesting access to analytics-privatedata-users for resquito.
Thu, Jul 17, 7:51 PM · SRE, SRE-Access-Requests
Maintenance_bot added a project to T399878: Archive affiliates-l: SRE.
Thu, Jul 17, 7:29 PM · SRE, Wikimedia-Mailing-lists
VRiley-WMF closed T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet as Resolved.

These have been imaged.

Thu, Jul 17, 7:05 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops
VRiley-WMF updated the task description for T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.
Thu, Jul 17, 7:04 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops
ops-monitoring-bot added a comment to T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm completed:

  • ganeti1054 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507171848_vriley_1152194_ganeti1054.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
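For reference, a reimage like this is started from a cumin host by running the spicerack cookbook against the target; a minimal sketch of the invocation (the exact flag names and host form are assumptions, check the cookbook's --help for the real interface):

sudo cookbook sre.hosts.reimage --os bookworm -t T381576 ganeti1054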
Thu, Jul 17, 7:03 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops
Eevans added a comment to T215183: Redundant bootloaders for software RAID.

[ ... ]

I also never spent much time looking at or thinking about RAID10 hosts, as you said. Honestly I don't remember what debian-installer does in the first place for RAID10 and bootloaders.

Thu, Jul 17, 6:45 PM · Infrastructure-Foundations, SRE
Eevans added a comment to T215183: Redundant bootloaders for software RAID.

[ ... ]

For context: We replaced sda in aqs1012 recently (T396970) and were (I believe) bit by this issue. It would seem to have been reimaged since the partman recipe was fixed, and it does not appear in the April 2020 list posted in T215183#6086396, so I'm wondering if a prior replacement didn't get the bootloader installed.

@Eevans, can you check in the BIOS settings of aqs1012 to see if a setting like "Hard drive failover" exists, per T215183#6718961?

Thu, Jul 17, 6:30 PM · Infrastructure-Foundations, SRE
ops-monitoring-bot added a comment to T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.

Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ganeti1054.eqiad.wmnet with OS bookworm

Thu, Jul 17, 6:30 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops
CDanis updated subscribers of T215183: Redundant bootloaders for software RAID.

Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.

Thu, Jul 17, 6:02 PM · Infrastructure-Foundations, SRE
Eevans added a comment to T215183: Redundant bootloaders for software RAID.

As a follow-up, I did find a device with a missing bootloader: aqs1014, which went up after its partman recipe was fixed (it has had SSDs replaced in the years since, though).
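On BIOS/MBR hosts, one way to spot an array member with no bootloader is to look for GRUB's boot code in the first sector of each disk; a sketch only, with illustrative device names:

# GRUB's MBR boot code embeds the string "GRUB"; no match suggests a missing bootloader
for d in /dev/sda /dev/sdb; do
  printf '%s: ' "$d"
  dd if="$d" bs=512 count=1 2>/dev/null | strings | grep -q GRUB && echo present || echo missing
done

The usual fix on such a host is grub-install against each member (grub-install /dev/sdX) followed by update-grub, so that either disk can boot if the other fails.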

Thu, Jul 17, 6:00 PM · Infrastructure-Foundations, SRE
VRiley-WMF updated the task description for T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.
Thu, Jul 17, 5:09 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops
VRiley-WMF added a comment to T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.

ganeti1054 has moved into A4 U38

Thu, Jul 17, 5:09 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops
cmooney added a comment to T395910: cloudcephosd10[48-51] service implementation.

Regarding the jumbo-frame complication with the plan to move to one link: we are arranging to connect a second 25G link on each of these new hosts for the storage VLAN. See the tasks below:
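To illustrate the MTU side of that complication, the storage interface carries jumbo frames, i.e. something like the following on the host; a sketch only, with a made-up interface name (the real configuration is managed via Puppet):

ip link set dev ens3f1 mtu 9000     # raise MTU on the dedicated storage link
ip link show dev ens3f1 | grep mtu  # verify the new MTU took effect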

Thu, Jul 17, 4:58 PM · Patch-For-Review, cloud-services-team, SRE, ops-eqiad, DC-Ops
Eevans added a comment to T215183: Redundant bootloaders for software RAID.

Has there been any progress toward goal #2? I didn't see where anything had been added to the mentioned runbook.

Thu, Jul 17, 4:49 PM · Infrastructure-Foundations, SRE
Eevans added a subtask for T215183: Redundant bootloaders for software RAID: T399875: Cassandra clusters: redundant bootloaders for software RAID followup.
Thu, Jul 17, 4:43 PM · Infrastructure-Foundations, SRE
Jhancock.wm added a comment to T396365: Q4:rack/setup/install sretest2009.

Checked the physical cables and everything lines up right. Couldn't get into the BMC. Re-ran the regular provisioning script and can access the BMC now, but it won't let me set the root password in the script. I can log in to the BMC with the one printed on the luggage tag. I'll DM it to you. I don't wanna add the root user if you still need to test on that.

Thu, Jul 17, 4:31 PM · SRE, DC-Ops, ops-codfw
jcrespo added a comment to T399847: Degraded RAID on backup1007.

I've stopped it anyway; if you could start it up again after finishing, it would help me a lot. Thank you.

Thu, Jul 17, 4:29 PM · DC-Ops, SRE, ops-eqiad
ops-monitoring-bot added a comment to T381576: Q2:rack/setup/install ganeti105[34].eqiad.wmnet.

Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ganeti1053.eqiad.wmnet with OS bookworm completed:

  • ganeti1053 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced UEFI HTTP Boot for next reboot
    • Host rebooted via Redfish
    • Host up (Debian installer)
    • Add puppet_version metadata (7) to Debian installer
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202507162234_vriley_643342_ganeti1053.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Thu, Jul 17, 4:26 PM · SRE, ops-eqiad, Infrastructure-Foundations, DC-Ops
ops-monitoring-bot added a comment to T399847: Degraded RAID on backup1007.

Icinga downtime and Alertmanager silence (ID=ce8f4e27-d454-43c0-b1b5-892d46c710a6) set by jynus@cumin1003 for 1 day, 0:00:00 on 1 host(s) and their services with reason: failed disk

backup1007.eqiad.wmnet
Thu, Jul 17, 4:25 PM · DC-Ops, SRE, ops-eqiad
cmooney added a comment to T399097: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025.

Mostly crickets from Arelion; just one update earlier:

2025-07-17 14:08
Thu, Jul 17, 4:21 PM · SRE, DC-Ops, ops-codfw
jcrespo added a comment to T399847: Degraded RAID on backup1007.

This time it fully failed, so please change it. Do I stop the server first?

Thu, Jul 17, 4:18 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr updated subscribers of T399847: Degraded RAID on backup1007.

@jcrespo just FYI, an automated ticket was opened again for this host.

Thu, Jul 17, 4:16 PM · DC-Ops, SRE, ops-eqiad
Jclark-ctr claimed T399847: Degraded RAID on backup1007.
Thu, Jul 17, 4:15 PM · DC-Ops, SRE, ops-eqiad
Eevans closed T396970: Degraded RAID on aqs1012 as Resolved.

This is now complete.

Thu, Jul 17, 4:14 PM · DC-Ops, SRE, ops-eqiad
cmooney added a comment to T399221: eqsin purged consumers lag.

My vote is to leave eqsin depooled, given that it is off-peak there, and to observe this for a few hours.

Thu, Jul 17, 3:56 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
ssingh added a comment to T399221: eqsin purged consumers lag.

Arelion want to close the ticket as they see no issue. I asked that they don't. Perhaps for now we just leave eqsin depooled and the circuit in the active traffic path? If the reported RTT remains steady we can re-pool after a reasonable period has elapsed? And if it jumps up we can re-do the iperf tests at the time to try to confirm the packet loss has returned?

Thu, Jul 17, 3:45 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
cmooney added a comment to T399221: eqsin purged consumers lag.

Not sure how to progress this one. Still see zero packet loss over the link, even running for a longer period (5 mins this time):

cmooney@cp5017:~$ iperf -s -i10 -u -w512k
------------------------------------------------------------
Server listening on UDP port 5001
UDP buffer size: 1000 KByte (WARNING: requested  500 KByte)
------------------------------------------------------------
[  3] local 10.132.0.17 port 5001 connected with 10.192.48.35 port 39207
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3] 0.0000-10.0000 sec   125 MBytes   105 Mbits/sec   0.023 ms    0/89169 (0%)
[  3] 10.0000-20.0000 sec   125 MBytes   105 Mbits/sec   0.011 ms    0/89165 (0%)
[  3] 20.0000-30.0000 sec   125 MBytes   105 Mbits/sec   0.009 ms    0/89164 (0%)
[  3] 30.0000-40.0000 sec   125 MBytes   105 Mbits/sec   0.025 ms    0/89165 (0%)
[  3] 40.0000-50.0000 sec   125 MBytes   105 Mbits/sec   0.010 ms    0/89164 (0%)
[  3] 50.0000-60.0000 sec   125 MBytes   105 Mbits/sec   0.014 ms    0/89165 (0%)
[  3] 60.0000-70.0000 sec   125 MBytes   105 Mbits/sec   0.022 ms    0/89165 (0%)
[  3] 70.0000-80.0000 sec   125 MBytes   105 Mbits/sec   0.022 ms    0/89164 (0%)
[  3] 80.0000-90.0000 sec   125 MBytes   105 Mbits/sec   0.015 ms    0/89164 (0%)
[  3] 90.0000-100.0000 sec   125 MBytes   105 Mbits/sec   0.038 ms    0/89161 (0%)
[  3] 100.0000-110.0000 sec   125 MBytes   105 Mbits/sec   0.023 ms    0/89169 (0%)
[  3] 110.0000-120.0000 sec   125 MBytes   105 Mbits/sec   0.031 ms    0/89164 (0%)
[  3] 120.0000-130.0000 sec   125 MBytes   105 Mbits/sec   0.040 ms    0/89166 (0%)
[  3] 130.0000-140.0000 sec   125 MBytes   105 Mbits/sec   0.016 ms    0/89164 (0%)
[  3] 140.0000-150.0000 sec   125 MBytes   105 Mbits/sec   0.018 ms    0/89164 (0%)
[  3] 150.0000-160.0000 sec   125 MBytes   105 Mbits/sec   0.054 ms    0/89166 (0%)
[  3] 160.0000-170.0000 sec   125 MBytes   105 Mbits/sec   0.017 ms    0/89163 (0%)
[  3] 170.0000-180.0000 sec   125 MBytes   105 Mbits/sec   0.032 ms    0/89165 (0%)
[  3] 180.0000-190.0000 sec   125 MBytes   105 Mbits/sec   0.033 ms    0/89165 (0%)
[  3] 190.0000-200.0000 sec   125 MBytes   105 Mbits/sec   0.023 ms    0/89164 (0%)
[  3] 200.0000-210.0000 sec   125 MBytes   105 Mbits/sec   0.016 ms    0/89165 (0%)
[  3] 210.0000-220.0000 sec   125 MBytes   105 Mbits/sec   0.040 ms    0/89165 (0%)
[  3] 220.0000-230.0000 sec   125 MBytes   105 Mbits/sec   0.014 ms    0/89165 (0%)
[  3] 230.0000-240.0000 sec   125 MBytes   105 Mbits/sec   0.021 ms    0/89164 (0%)
[  3] 240.0000-250.0000 sec   125 MBytes   105 Mbits/sec   0.017 ms    0/89165 (0%)
[  3] 250.0000-260.0000 sec   125 MBytes   105 Mbits/sec   0.014 ms    0/89165 (0%)
[  3] 260.0000-270.0000 sec   125 MBytes   105 Mbits/sec   0.020 ms    0/89165 (0%)
[  3] 270.0000-280.0000 sec   125 MBytes   105 Mbits/sec   0.017 ms    0/89163 (0%)
[  3] 280.0000-290.0000 sec   125 MBytes   105 Mbits/sec   0.022 ms    0/89165 (0%)
[  3] 290.0000-300.0000 sec   125 MBytes   105 Mbits/sec   0.021 ms    0/89162 (0%)
[  3] 300.0000-300.0001 sec  2.87 KBytes   367 Mbits/sec   0.035 ms    0/    2 (0%)
[  3] 0.0000-300.0001 sec  3.66 GBytes   105 Mbits/sec   0.035 ms    0/2674942 (0%)
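For completeness, the sending side of a test like this is a matching iperf UDP client run from the codfw end; a minimal sketch, assuming iperf 2 and the ~105 Mbit/s rate seen above:

iperf -c 10.132.0.17 -u -b 105M -t 300 -i 10 -w 512k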
Thu, Jul 17, 3:42 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
cmooney reopened T394333: Q4:rack/setup/install cloudcephosd10[48-51] as "Open".

@Jclark-ctr, as discussed in our call on Tuesday, we will be connecting the second SFP port on these hosts to the switches too, as we need to solve the MTU issue before proceeding with T399180: Cloudcephosd: migrate to single network uplink.

Thu, Jul 17, 3:39 PM · Patch-For-Review, SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Stashbot added a comment to T399221: eqsin purged consumers lag.

Mentioned in SAL (#wikimedia-operations) [2025-07-17T15:28:12Z] <topranks> un-drain Arelion transport circuit from codfw -> eqsin to test performance T399221

Thu, Jul 17, 3:28 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
Stashbot added a comment to T399221: eqsin purged consumers lag.

Mentioned in SAL (#wikimedia-operations) [2025-07-17T14:38:23Z] <cmooney@cumin1003> END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: depool eqsin to test backhaul cct packet loss, T399221]

Thu, Jul 17, 2:38 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
Stashbot added a comment to T399221: eqsin purged consumers lag.

Mentioned in SAL (#wikimedia-operations) [2025-07-17T14:38:19Z] <cmooney@cumin1003> START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: depool eqsin to test backhaul cct packet loss, T399221]

Thu, Jul 17, 2:38 PM · SRE, DC-Ops, ops-codfw, Infrastructure-Foundations, netops, Traffic
brouberol moved T399355: Degraded RAID on an-worker1175 from Blocked/Waiting to Done on the Data-Platform-SRE (2025.07.05 - 2025.07.25) board.
Thu, Jul 17, 2:17 PM · Data-Platform-SRE (2025.07.05 - 2025.07.25), SRE, DC-Ops, ops-eqiad
ops-monitoring-bot created T399847: Degraded RAID on backup1007.
Thu, Jul 17, 1:53 PM · DC-Ops, SRE, ops-eqiad
elukey added a comment to T393948: Q4:rack/setup/install ml-serve101[2345].

OK, so I have a provision-script change that seems to work, but it doesn't touch anything in the network PXE / FixedBootOrder config (except ensuring that UEFI Hdd is first).

Thu, Jul 17, 1:43 PM · SRE, Machine-Learning-Team, ops-eqiad, DC-Ops
gerritbot added a comment to T398613: Investigate dead an-worker host an-worker1176.

Change #1170301 merged by Stevemunene:

[operations/puppet@production] hdfs: Add an-worker 1176|1179|1186 to analytics cluster

https://gerrit.wikimedia.org/r/1170301

Thu, Jul 17, 12:30 PM · Patch-For-Review, Data-Platform-SRE (2025.06.13 - 2025.07.04), SRE, ops-eqiad, DC-Ops
Jclark-ctr closed T399671: Degraded RAID on backup1007 as Resolved.

Updated firmware on the iDRAC while logged in. Thanks for the assistance, @jcrespo.

Thu, Jul 17, 12:21 PM · SRE, DC-Ops, ops-eqiad
jcrespo added a comment to T399671: Degraded RAID on backup1007.

Note my prediction is that we will need 3 new disks replaced, not only 1 (but this can be resolved for now).

Thu, Jul 17, 12:20 PM · SRE, DC-Ops, ops-eqiad
jcrespo added a comment to T399671: Degraded RAID on backup1007.

I told @Jclark-ctr not to replace the 13th disk yet, as I was more worried about the jbod ones than the RAID:

root@backup1007:~$ megacli -PDList -aall | grep rro
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 6
Other Error Count: 1
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 0
Media Error Count: 0
Other Error Count: 1339
Media Error Count: 0
Other Error Count: 1339
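To tie those counters back to physical slots, the same output can be filtered a little more broadly; a sketch, assuming the standard megacli -PDList field names:

megacli -PDList -aall | grep -E 'Slot Number|Firmware state|Media Error Count|Other Error Count'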
Thu, Jul 17, 12:18 PM · SRE, DC-Ops, ops-eqiad
Jclark-ctr closed T399355: Degraded RAID on an-worker1175 as Resolved.

Replaced the failed drive. Thanks for the assistance with this, @BTullis.

Thu, Jul 17, 11:50 AM · Data-Platform-SRE (2025.07.05 - 2025.07.25), SRE, DC-Ops, ops-eqiad