User Details
- User Since
- Jun 29 2021, 9:56 AM (211 w, 2 d)
- Availability
- Available
- IRC Nick
- btullis
- LDAP User
- Btullis
- MediaWiki User
- BTullis (WMF) [ Global Accounts ]
Today
I have applied another patch that updates some of the Java options related to garbage collection (GC).
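For illustration only, GC-related JVM options of this kind are usually adjusted via flags such as the following (the specific flags and values here are assumptions, not the contents of the patch):
# Hypothetical GC tuning flags; the actual option names and values in the patch may differ.
export JAVA_OPTS="${JAVA_OPTS} -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+ParallelRefProcEnabled"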
There is one very clear front-runner among Helm charts for the opensearch-operator: the official version.
https://github.com/opensearch-project/opensearch-k8s-operator/tree/main/charts/opensearch-operator
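If we go with it, installation would presumably follow the standard Helm workflow, roughly like this (the release and namespace names below are my assumptions):
# Add the official chart repository and install the operator (names are illustrative).
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm repo update
helm install opensearch-operator opensearch-operator/opensearch-operator \
  --namespace opensearch-operator --create-namespace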
I'll start the work on this now. It might be nice to disable the systemd timers that run the SQL/XML dumps on the snapshot hosts before they start a dump run on July 20th.
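Disabling them would presumably just be a matter of something like the following on each snapshot host (the timer name here is a placeholder, not the real unit name):
# Stop and disable a dump timer ahead of the July 20th run; the unit name is hypothetical.
sudo systemctl disable --now xmldumps.timer
# Confirm that no dump-related timers remain scheduled.
sudo systemctl list-timers | grep -i dump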
@Stevemunene - You'll see that I provisioned these three VMs for you, to save a bit of time. I was able to use the --storage_type plain option when creating them, so they are ready to go.
I think that you can follow these guidelines to build the cluster itself: https://wikitech.wikimedia.org/wiki/Etcd with reference to https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/etcd/v3/dse_k8s_etcd.yaml
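Once the three members are up, a quick sanity check might look like this (the endpoint host name is a placeholder for one of the new VMs, and TLS client options are omitted):
# List members and check endpoint health; the host name is illustrative.
ETCDCTL_API=3 etcdctl --endpoints=https://dse-k8s-etcd2001.codfw.wmnet:2379 member list
ETCDCTL_API=3 etcdctl --endpoints=https://dse-k8s-etcd2001.codfw.wmnet:2379 endpoint health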
We are down to 54 million under-replicated blocks.
One thing to bear in mind is that Sqoop was removed from Bigtop recently: https://issues.apache.org/jira/browse/BIGTOP-3770
So we may need to keep using our Bigtop 1.5 version, or find an alternative.
This is now done.
I have created a new GitLab repository here: https://gitlab.wikimedia.org/repos/data-engineering/bigtop-build
Yesterday
Notice: /Stage[main]/Bigtop::Hive/File[/etc/hive/conf.analytics-test-hadoop/hive-env.sh]/content:
--- /etc/hive/conf.analytics-test-hadoop/hive-env.sh	2023-08-10 12:02:47.190163979 +0000
+++ /tmp/puppet-file20250716-405022-11hcgyi	2025-07-16 13:41:11.608993039 +0000
@@ -8,7 +8,7 @@
 export HIVE_SKIP_SPARK_ASSEMBLY=true
We can see some interesting performance characteristics here, including this:
I thought that I would look at the performance and caching optimizations in different patches, since caching could change the behaviour.
I have prepared the two disks as per the instructions at: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop/Administration#Swapping_broken_disk
btullis@an-worker1189:~$ sudo parted /dev/sdn --script mklabel gpt
btullis@an-worker1189:~$ sudo parted /dev/sdn --script mkpart primary ext4 0% 100%
btullis@an-worker1189:~$ sudo mkfs.ext4 -L hadoop-n /dev/sdn1
mke2fs 1.46.2 (28-Feb-2021)
Creating filesystem with 1953365504 4k blocks and 244170752 inodes
Filesystem UUID: 737d7e51-1f10-4257-babf-2126028c5387
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
	2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
	78675968, 102400000, 214990848, 512000000, 550731776, 644972544,
	1934917632
Thanks @Jclark-ctr and apologies for the delay.
We're OK with deleting the preserved cache on these an-worker data drives, because they are all individual raid0 volumes. Data loss is therefore unavoidable when these drives fail, but recovery is managed by Hadoop itself.
Tue, Jul 15
This is now done, so all of the spark images are now based on golang1.19 and have been rebuilt and published.
root@build2001:/srv/images/production-images# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*spark*'
== Step 0: scanning /srv/images/production-images/images/ ==
Will build the following images:
* docker-registry.discovery.wmnet/spark3.4-build:3.4.1-5
* docker-registry.discovery.wmnet/spark3.1-build:3.1.2-4
* docker-registry.discovery.wmnet/spark3.1:3.1.2-5
* docker-registry.discovery.wmnet/spark3.1-operator:1.3.7-3.1.2-4
* docker-registry.discovery.wmnet/spark3.3-build:3.3.2-4
* docker-registry.discovery.wmnet/spark3.3:3.3.2-5
* docker-registry.discovery.wmnet/spark3.3-operator:1.3.8-3.3.2-4
* docker-registry.discovery.wmnet/spark3.4:3.4.1-5
* docker-registry.discovery.wmnet/spark3.4-operator:1.3.8-3.4.1-5
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/spark3.4-build:3.4.1-5
* Built image docker-registry.discovery.wmnet/spark3.1-build:3.1.2-4
* Built image docker-registry.discovery.wmnet/spark3.1:3.1.2-5
* Built image docker-registry.discovery.wmnet/spark3.1-operator:1.3.7-3.1.2-4
* Built image docker-registry.discovery.wmnet/spark3.3-build:3.3.2-4
* Built image docker-registry.discovery.wmnet/spark3.3:3.3.2-5
* Built image docker-registry.discovery.wmnet/spark3.3-operator:1.3.8-3.3.2-4
* Built image docker-registry.discovery.wmnet/spark3.4:3.4.1-5
* Built image docker-registry.discovery.wmnet/spark3.4-operator:1.3.8-3.4.1-5
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/spark3.4-build:3.4.1-5
Successfully published image docker-registry.discovery.wmnet/spark3.1-operator:1.3.7-3.1.2-4
Successfully published image docker-registry.discovery.wmnet/spark3.1:3.1.2-5
Successfully published image docker-registry.discovery.wmnet/spark3.1-build:3.1.2-4
Successfully published image docker-registry.discovery.wmnet/spark3.3-operator:1.3.8-3.3.2-4
Successfully published image docker-registry.discovery.wmnet/spark3.3:3.3.2-5
Successfully published image docker-registry.discovery.wmnet/spark3.4-operator:1.3.8-3.4.1-5
Successfully published image docker-registry.discovery.wmnet/spark3.3-build:3.3.2-4
Successfully published image docker-registry.discovery.wmnet/spark3.4:3.4.1-5
== Build done! ==
You can see the logs at ./docker-pkg-build.log
Mon, Jul 14
I am checking that they build without the workaround by using this command:
root@build2001:/srv/images/production-images# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*spark*'
I checked, and according to this changelog: https://metadata.ftp-master.debian.org/changelogs//main/o/openjdk-8/openjdk-8_8u452-ga-1_changelog
...the fix mentioned was included in version 8u402-ga-3.
These patches are all deployed, so I think that we can resolve this ticket now.
I checked that https://yarn.wikimedia.org/spark-history (without the trailing slash) now works.
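For a quick check of this kind, something like the following shows the response for the URL without the trailing slash (an illustrative request, not the exact one I used):
# Fetch just the response headers for the non-trailing-slash URL.
curl -sI https://yarn.wikimedia.org/spark-history | head -n 5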
It has dropped from 82 million to 65 million in about 90 hours.
That's not too bad.
Fri, Jul 11
There is an issue at present with reimaging these cloudcephosd nodes back to bullseye.
The problem arises because of this remove_os_md() function that is executed.
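For context, removing the OS software-RAID (md) devices generally boils down to steps like these (a rough, generic sketch only; the real remove_os_md() lives in the reimage tooling and the device names below are placeholders):
# Inspect, stop, and wipe an OS md array (illustrative device names).
cat /proc/mdstat
sudo mdadm --stop /dev/md0
sudo mdadm --zero-superblock /dev/sda2 /dev/sdb2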
Hopefully these patches will address the first three asks:
- Increased timeout to the proxy backend (300 seconds)
- Doubled the CPU limit (with 50% more RAM, too)
- Fixed the trailing slash issue.
Thu, Jul 10
That graph levelled off at 82.6 million and has now started to drop slowly.
I silenced the hadoop-yarn-nodemanager services that were failing on the affected hosts.
We are seeing the under-replicated blocks climbing, as we expect. https://grafana.wikimedia.org/goto/6AhgKjsNg?orgId=1
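The same figure can also be pulled from the NameNode on the command line, for example (an illustrative invocation, using the same kerberos-run-command wrapper as elsewhere):
# Report the current count of under-replicated blocks.
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -report | grep -i 'under replicated'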
I looked at trying to silence all of the alerts before merging the patch, but it seemed too tricky.
Wed, Jul 9
Hi @CDanis - Just to let you know, the cephosd cluster in codfw is now up and running, so if you want to deploy your ceph-csi-rbd plugin there, you are welcome to do so.
T374923: Bring cephosd200[1-3] into service as a new cluster in codfw is complete.
It's there now. I just uploaded a bigger file with s3cmd and it created the bucket.
Then I set the correct crush rule.
btullis@cephosd2001:~$ sudo ceph osd pool ls
.rgw.root
.mgr
dse-k8s-csi-ssd
aux-k8s-csi-rbd-ssd
cephfs.dpe.meta
cephfs.dpe.data-ssd
cephfs.dpe.data-hdd
codfw.rgw.log
codfw.rgw.control
codfw.rgw.meta
codfw.rgw.buckets.index
codfw.rgw.buckets.data
codfw.rgw.buckets.non-ec
btullis@cephosd2001:~$ sudo ceph osd pool set codfw.rgw.buckets.non-ec crush_rule hdd
set pool 16 crush_rule to hdd
Checking all of the rules.
btullis@cephosd2001:~$ for p in $(sudo ceph osd pool ls); do echo -n "$p :" ; sudo ceph osd pool get $p crush_rule; done
.rgw.root :crush_rule: ssd
.mgr :crush_rule: ssd
dse-k8s-csi-ssd :crush_rule: ssd
aux-k8s-csi-rbd-ssd :crush_rule: ssd
cephfs.dpe.meta :crush_rule: ssd
cephfs.dpe.data-ssd :crush_rule: ssd
cephfs.dpe.data-hdd :crush_rule: hdd
codfw.rgw.log :crush_rule: ssd
codfw.rgw.control :crush_rule: ssd
codfw.rgw.meta :crush_rule: ssd
codfw.rgw.buckets.index :crush_rule: ssd
codfw.rgw.buckets.data :crush_rule: hdd
codfw.rgw.buckets.non-ec :crush_rule: hdd
This looks OK. I think that we can call this ticket done, for now. There will be some more work when we get around to connecting up the dse-k8s-codfw cluster to it, but I think we can say that this is ready for use.
I had to rename the zonegroup from dpe_zg to dpe as I had done here: T374447#10144310
I have created the zone.
btullis@cephosd2001:~$ sudo radosgw-admin zone create --rgw-zonegroup=dpe_zg --rgw-zone=codfw --master --default --endpoints=https://rgw.codfw.dpe.anycast.wmnet
{
    "id": "19d9bb4a-2a8b-41be-92fa-53eed71f5254",
    "name": "codfw",
    "domain_root": "codfw.rgw.meta:root",
    "control_pool": "codfw.rgw.control",
    "gc_pool": "codfw.rgw.log:gc",
    "lc_pool": "codfw.rgw.log:lc",
    "log_pool": "codfw.rgw.log",
    "intent_log_pool": "codfw.rgw.log:intent",
    "usage_log_pool": "codfw.rgw.log:usage",
    "roles_pool": "codfw.rgw.meta:roles",
    "reshard_pool": "codfw.rgw.log:reshard",
    "user_keys_pool": "codfw.rgw.meta:users.keys",
    "user_email_pool": "codfw.rgw.meta:users.email",
    "user_swift_pool": "codfw.rgw.meta:users.swift",
    "user_uid_pool": "codfw.rgw.meta:users.uid",
    "otp_pool": "codfw.rgw.otp",
    "system_key": {
        "access_key": "",
        "secret_key": ""
    },
    "placement_pools": [
        {
            "key": "default-placement",
            "val": {
                "index_pool": "codfw.rgw.buckets.index",
                "storage_classes": {
                    "STANDARD": {
                        "data_pool": "codfw.rgw.buckets.data"
                    }
                },
                "data_extra_pool": "codfw.rgw.buckets.non-ec",
                "index_type": 0,
                "inline_data": true
            }
        }
    ],
    "realm_id": "72b51936-86d4-4656-8065-c8ed942ddf47",
    "notif_pool": "codfw.rgw.log:notif"
}
Now moving on to setting up the radosgw side of things. Referring back to here: T330152#10077357
Following what was done here for the cephfs file system: T376405#10206339
I have created the two pools for use with RBD and the CSI interfaces.
btullis@cephosd2001:~$ sudo ceph osd pool create dse-k8s-csi-ssd 800 800 replicated ssd --autoscale-mode=on
pool 'dse-k8s-csi-ssd' created
btullis@cephosd2001:~$ sudo ceph osd pool create aux-k8s-csi-rbd-ssd 800 800 replicated ssd --autoscale-mode=on
pool 'aux-k8s-csi-rbd-ssd' created
Enabled these pools for use with the rbd application.
btullis@cephosd2001:~$ sudo ceph osd pool application enable dse-k8s-csi-ssd rbd
enabled application 'rbd' on pool 'dse-k8s-csi-ssd'
btullis@cephosd2001:~$ sudo ceph osd pool application enable aux-k8s-csi-rbd-ssd rbd
enabled application 'rbd' on pool 'aux-k8s-csi-rbd-ssd'
Configured them to use the ssd crush rule.
btullis@cephosd2001:~$ sudo ceph osd pool set dse-k8s-csi-ssd crush_rule ssd
set pool 6 crush_rule to ssd
btullis@cephosd2001:~$ sudo ceph osd pool set aux-k8s-csi-rbd-ssd crush_rule ssd
set pool 7 crush_rule to ssd
Creating the required crush rules.
btullis@cephosd2001:~$ sudo ceph osd crush rule create-replicated hdd default host hdd
btullis@cephosd2001:~$ sudo ceph osd crush rule create-replicated ssd default host ssd
btullis@cephosd2001:~$ sudo ceph osd crush rule ls
replicated_rule
hdd
ssd
I can't yet delete the replicated_rule because it is still in use.
Now creating the crush maps, so that we get row and rack awareness, as well as host awareness. This is similar to the way it was done in T326945#9074454
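For reference, adding row/rack buckets and moving hosts under them looks roughly like this (the bucket names and host placements below are placeholders, not the actual codfw layout):
# Create a row and a rack bucket, attach them to the hierarchy, and move a host in (illustrative names).
sudo ceph osd crush add-bucket b2 row
sudo ceph osd crush move b2 root=default
sudo ceph osd crush add-bucket b2-rack1 rack
sudo ceph osd crush move b2-rack1 row=b2
sudo ceph osd crush move cephosd2001 rack=b2-rack1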
The cluster is up and running now.
Metrics are available in Grafana.
In order to bootstrap the cluster, I used the same technique that was used in eqiad. T330149#8705054
Namely, I created a monmap file like this, on each of the three servers:
monmaptool --create --fsid 8e69717a-518b-4c00-9f96-0635d9b913c6 --add cephosd2001 10.192.9.17 --add cephosd2002 10.192.26.19 --add cephosd2003 10.192.37.16 --enable-all-features --set-min-mon-release reef monmap
I created a temporary keyring by concatenating the keyrings for mon. and client.admin into a temp file:
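In outline, the keyring and monmap are then used to initialise each monitor, roughly as follows (a generic sketch of the bootstrap steps rather than the exact commands run; the keyring paths are placeholders):
# Combine the mon. and client.admin keyrings, then create the monitor data directory (illustrative).
cat /etc/ceph/ceph.mon.keyring /etc/ceph/ceph.client.admin.keyring > /tmp/ceph.mon.keyring
sudo -u ceph ceph-mon --mkfs -i cephosd2001 --monmap monmap --keyring /tmp/ceph.mon.keyring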
Making good progress on this now.
Tue, Jul 8
I have back-filled to clouddumps100[1-2] with commands like this.
dumpsgen@dumpsdata1003:/data/otherdumps/wikitech$ rsync -av ./ clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/wikitech/
sending incremental file list
./
labswiki-20250701.xml.gz
labswiki-20250702.xml.gz
labswiki-20250703.xml.gz
labswiki-20250704.xml.gz
labswiki-20250705.xml.gz
labswiki-20250706.xml.gz
labswiki-20250707.xml.gz
labswiki-20250708.xml.gz
Apologies, this was an oversight on my part, as I had assumed that the standard XML/SQL dumps for labswiki would be sufficient.
I didn't know that these additional dumps were being consumed in order to keep https://wikitech-static.wikimedia.org up-to-date.
The dump of enwiki on snapshot1012 has completed successfully.
btullis@snapshot1012:~$ tail -f /mnt/dumpsdata/xmldatadumps/private/enwiki/20250701/dumplog.txt
2025-07-08 03:58:43: enwiki Reading enwiki-20250701-pages-articles-multistream.xml.bz2 checksum for md5 from file /mnt/dumpsdata/xmldatadumps/public/enwiki/20250701/md5sums-enwiki-20250701-pages-articles-multistream.xml.bz2.txt
2025-07-08 03:58:43: enwiki Reading enwiki-20250701-pages-articles-multistream.xml.bz2 checksum for sha1 from file /mnt/dumpsdata/xmldatadumps/public/enwiki/20250701/sha1sums-enwiki-20250701-pages-articles-multistream.xml.bz2.txt
2025-07-08 03:58:43: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/latest ...
2025-07-08 03:58:43: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/latest ...
2025-07-08 03:58:43: enwiki adding rss feed file /mnt/dumpsdata/xmldatadumps/public/enwiki/latest/enwiki-latest-pages-articles-multistream-index.txt.bz2-rss.xml
2025-07-08 03:58:43: enwiki Reading enwiki-20250701-pages-articles-multistream-index.txt.bz2 checksum for md5 from file /mnt/dumpsdata/xmldatadumps/public/enwiki/20250701/md5sums-enwiki-20250701-pages-articles-multistream-index.txt.bz2.txt
2025-07-08 03:58:43: enwiki Reading enwiki-20250701-pages-articles-multistream-index.txt.bz2 checksum for sha1 from file /mnt/dumpsdata/xmldatadumps/public/enwiki/20250701/sha1sums-enwiki-20250701-pages-articles-multistream-index.txt.bz2.txt
2025-07-08 03:59:19: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/latest ...
2025-07-08 03:59:37: enwiki Checkdir dir /mnt/dumpsdata/xmldatadumps/public/enwiki/latest ...
2025-07-08 04:02:34: enwiki SUCCESS: done.
I'll carry out another manual sync from dumpsdata1006 to clouddumps100[1-2] to get it published.
Mon, Jul 7
Removed the user's local files.
btullis@cumin1003:~$ sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/xiaoxiao'
13 hosts will be targeted:
an-coord[1003-1004].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-master[1003-1004].eqiad.wmnet,an-test-client1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-master[1001-1002].eqiad.wmnet,stat[1008-1011].eqiad.wmnet
OK to proceed on 13 hosts? Enter the number of affected hosts to confirm or "q" to quit: 13
===== NO OUTPUT =====
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (13/13) [00:11<00:00, 1.12hosts/s]
FAIL |                                                                                                                                                                     | 0% (0/13) [00:11<?, ?hosts/s]
100.0% (13/13) success ratio (>= 100.0% threshold) for command: 'rm -rf /home/xiaoxiao'.
100.0% (13/13) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
btullis@cumin1003:~$
Removed the HDFS home directory.
btullis@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/xiaoxiao
25/07/07 15:51:03 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/user/xiaoxiao' to trash at: hdfs://analytics-hadoop/user/hdfs/.Trash/Current/user/xiaoxiao
btullis@an-launcher1002:~$
The manual sync run has now completed.
This has now finished, so I think that means the whole sqoop process has finished successfully.
analytics@an-launcher1002:/home/btullis$ /usr/local/bin/refinery-sqoop-mediawiki-production-not-history
analytics@an-launcher1002:/home/btullis$ echo $?
0
Resetting the failed systemd service.
btullis@an-launcher1002:~$ systemctl --failed
  UNIT                                   LOAD   ACTIVE SUB    DESCRIPTION
● refinery-sqoop-whole-mediawiki.service loaded failed failed Schedules sqoop to import whole MediaWiki databases into Hadoop monthly.
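Clearing the failed state is then just the following (the unit name is taken from the output above):
# Clear the failed state so the unit no longer appears in systemctl --failed.
sudo systemctl reset-failed refinery-sqoop-whole-mediawiki.service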