
MatthewVernon (Matthew Vernon)
SRE (Data Persistence)

User Details

User Since
Aug 2 2021, 1:52 PM (206 w, 3 d)
Availability
Available
IRC Nick
Emperor
LDAP User
MVernon
MediaWiki User
MVernon (WMF)

Recent Activity

Mon, Jul 7

MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
Mon, Jul 7, 3:38 PM · SRE, Ceph, SRE-swift-storage
MatthewVernon closed T391352: Q4 Thanos hardware refresh, a subtask of T391354: Q4 object storage hardware tasks, as Resolved.
Mon, Jul 7, 3:38 PM · SRE, Ceph, SRE-swift-storage
MatthewVernon closed T391352: Q4 Thanos hardware refresh as Resolved.

All done!

Mon, Jul 7, 3:38 PM · SRE-swift-storage, SRE
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
Mon, Jul 7, 3:38 PM · SRE-swift-storage, SRE
MatthewVernon created T398849: decommission thanos-be200[1-4].codfw.wmnet.
Mon, Jul 7, 3:36 PM · SRE, DC-Ops, SRE-swift-storage, ops-codfw, decommission-hardware

Fri, Jul 4

MatthewVernon closed T398660: Inline image is displayed incorrectly as Resolved.
Fri, Jul 4, 1:16 PM · SRE-swift-storage, Thumbor
MatthewVernon added a comment to T398660: Inline image is displayed incorrectly.

What I assume happened is that some of the thumbnails from the first upload didn't get overwritten when the second upload was made (I purged the image this morning, which should have removed any old thumbnails), and it was those that were still being seen.

Fri, Jul 4, 8:56 AM · SRE-swift-storage, Thumbor
MatthewVernon added a comment to T398660: Inline image is displayed incorrectly.

I think this may have been a caching issue - if I copy your wikitext into the Sandbox, I get what look to me to be correct thumbnails; and indeed the permalink to the cs.wp page has a thumbnail that looks correct to me.

Fri, Jul 4, 8:21 AM · SRE-swift-storage, Thumbor

Wed, Jul 2

MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
Wed, Jul 2, 11:04 AM · SRE, Ceph, SRE-swift-storage
MatthewVernon updated the task description for T393104: Q4:rack/setup/install ms-be109[2-5].
Wed, Jul 2, 10:26 AM · SRE, Data-Persistence, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T393104: Q4:rack/setup/install ms-be109[2-5].

@Jclark-ctr the problem with these two nodes was the same as we've had with every one of this batch of Dell servers - they arrive with junk (an EFI partition with something Windows-related on it) on one of the hard disks.

Wed, Jul 2, 10:26 AM · SRE, Data-Persistence, SRE-swift-storage, ops-eqiad, DC-Ops

Mon, Jun 30

MatthewVernon added a comment to T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption.

These corrupt DBs have all been repaired now.

Mon, Jun 30, 1:07 PM · Upstream, SRE-swift-storage, SRE

Fri, Jun 27

MatthewVernon added a comment to T397385: Artifact storage for updated Zuul CI.

You can achieve automatic expiry via S3, though, using a lifecycle policy - IBM's docs have a worked example, as part of their bucket lifecycle docs. The ceph upstream docs on the topic are a bit bare-bones.
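
For illustration, a minimal lifecycle rule of the sort those docs walk through, applied with s3cmd (the bucket name and the 30-day expiry here are placeholders, not anything agreed for Zuul):

cat > lifecycle.xml <<'EOF'
<LifecycleConfiguration>
  <Rule>
    <ID>expire-artifacts</ID>
    <Prefix></Prefix>
    <Status>Enabled</Status>
    <Expiration><Days>30</Days></Expiration>
  </Rule>
</LifecycleConfiguration>
EOF
s3cmd setlifecycle lifecycle.xml s3://zuul-artifacts   # apply the rule to the (placeholder) bucket
s3cmd getlifecycle s3://zuul-artifacts                 # confirm radosgw accepted it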

Fri, Jun 27, 12:46 PM · Continuous-Integration-Infrastructure (Zuul upgrade)
MatthewVernon added a comment to T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption.

Using test-cookbook and the currently-in-review check-dbs cookbook on all the thumbnail container dbs, we find the following problems:

  • wikipedia-commons-local-thumb.6b
/srv/swift-storage/accounts1/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db on ms-be1083.eqiad.wmnet has errors:
row 1745416 missing from index ix_object_deleted_name
row 1745417 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/sda3/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db on ms-be1063.eqiad.wmnet has errors:
row 1745416 missing from index ix_object_deleted_name
row 1745417 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db on ms-be1074.eqiad.wmnet has errors:
row 1745416 missing from index ix_object_deleted_name
row 1745417 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.6e
/srv/swift-storage/sdb3/containers/26973/1f3/695dccc2f2758a580e42b976670301f3/695dccc2f2758a580e42b976670301f3.db on ms-be2059.codfw.wmnet has errors:
row 1632783 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/sda3/containers/26973/1f3/695dccc2f2758a580e42b976670301f3/695dccc2f2758a580e42b976670301f3.db on ms-be2058.codfw.wmnet has errors:
row 1632783 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/26973/1f3/695dccc2f2758a580e42b976670301f3/695dccc2f2758a580e42b976670301f3.db on ms-be2076.codfw.wmnet has errors:
row 1632783 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.79
/srv/swift-storage/sda3/containers/30531/b33/774332fe929445555e757c805d65eb33/774332fe929445555e757c805d65eb33.db on ms-be1067.eqiad.wmnet has errors:
row 4537890 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/sdb3/containers/30531/b33/774332fe929445555e757c805d65eb33/774332fe929445555e757c805d65eb33.db on ms-be1070.eqiad.wmnet has errors:
row 4537890 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/30531/b33/774332fe929445555e757c805d65eb33/774332fe929445555e757c805d65eb33.db on ms-be1089.eqiad.wmnet has errors:
row 4537890 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.99
/srv/swift-storage/sdb3/containers/53069/95e/cf4db1352e66ef14017ee2d8a280d95e/cf4db1352e66ef14017ee2d8a280d95e.db on ms-be2064.codfw.wmnet has errors:
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.b7
/srv/swift-storage/sdb3/containers/43276/ec7/a90c2201082e532c429f273788ee6ec7/a90c2201082e532c429f273788ee6ec7.db on ms-be1066.eqiad.wmnet has errors:
row 1612987 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts1/containers/43276/ec7/a90c2201082e532c429f273788ee6ec7/a90c2201082e532c429f273788ee6ec7.db on ms-be1090.eqiad.wmnet has errors:
row 1612987 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/43276/ec7/a90c2201082e532c429f273788ee6ec7/a90c2201082e532c429f273788ee6ec7.db on ms-be1087.eqiad.wmnet has errors:
row 1612987 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.bb
/srv/swift-storage/accounts0/containers/25743/3d0/648f2c55d993f0a991ca1cffd66423d0/648f2c55d993f0a991ca1cffd66423d0.db on ms-be2076.codfw.wmnet has errors:
row 5310349 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.d3
/srv/swift-storage/accounts0/containers/35440/a6a/8a7070d66ad3b2c6df2c635eb41e7a6a/8a7070d66ad3b2c6df2c635eb41e7a6a.db on ms-be1079.eqiad.wmnet has errors:
row 4349795 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts1/containers/35440/a6a/8a7070d66ad3b2c6df2c635eb41e7a6a/8a7070d66ad3b2c6df2c635eb41e7a6a.db on ms-be1085.eqiad.wmnet has errors:
row 4234895 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts1/containers/35440/a6a/8a7070d66ad3b2c6df2c635eb41e7a6a/8a7070d66ad3b2c6df2c635eb41e7a6a.db on ms-be1078.eqiad.wmnet has errors:
row 4349795 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
  • wikipedia-commons-local-thumb.ea
/srv/swift-storage/accounts0/containers/18674/f11/48f22e340a0b1d0f141c98e1cc4faf11/48f22e340a0b1d0f141c98e1cc4faf11.db on ms-be1078.eqiad.wmnet has errors:
row 6128861 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
/srv/swift-storage/accounts0/containers/18674/f11/48f22e340a0b1d0f141c98e1cc4faf11/48f22e340a0b1d0f141c98e1cc4faf11.db on ms-be1080.eqiad.wmnet has errors:
row 6128861 missing from index ix_object_deleted_name
wrong # of entries in index ix_object_deleted_name
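
(The errors above are standard SQLite integrity_check output. Checking and repairing a single replica by hand would look roughly like the sketch below, using one of the paths from the list and assuming the replica is quiesced first; the cookbooks may well do more than this:)

sqlite3 /srv/swift-storage/accounts1/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db 'PRAGMA integrity_check;'
# reports the "row ... missing from index" / "wrong # of entries" problems listed above
sqlite3 /srv/swift-storage/accounts1/containers/8721/17d/2211db68672f28e639decfc1f640917d/2211db68672f28e639decfc1f640917d.db 'REINDEX ix_object_deleted_name;'
# rebuilds just the damaged index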
Fri, Jun 27, 10:22 AM · Upstream, SRE-swift-storage, SRE
MatthewVernon added a comment to T397385: Artifact storage for updated Zuul CI.

FWIW, I would advise against using Swift (the protocol) if you want to move storage to Ceph, and rather use only S3 against Ceph. I don't think the Swift protocol support in radosgw gets a lot of love these days, unlike the S3 protocol which is getting actively developed.

Fri, Jun 27, 7:10 AM · Continuous-Integration-Infrastructure (Zuul upgrade)

Mon, Jun 23

MatthewVernon added a project to T376237: Turn down unused swift-r[ow] discovery services: SRE-swift-storage.

FWIW, after today's incident we ended up with both swift-rw resources depooled:

mvernon@cumin2002:~$ confctl --object-type discovery select 'dnsdisc=swift.*' get
{"eqiad": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=swift-rw"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=swift-rw"}
{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=swift"}
{"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=swift"}
{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=swift-ro"}
{"eqiad": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=swift-ro"}
Mon, Jun 23, 10:13 PM · SRE-swift-storage, Datacenter-Switchover, serviceops

Thu, Jun 19

MatthewVernon added a comment to T397385: Artifact storage for updated Zuul CI.

Yes, I'd expect the AccessDenied when trying to list the mcv21test bucket - only the 1-month-old_kittens_32.jpg object within it is globally-readable, not the bucket itself.
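
(For reference, the object-level versus bucket-level ACLs can be inspected and set with s3cmd; a sketch using the names above, assuming a configured s3cmd profile for the cluster:)

s3cmd info s3://mcv21test/1-month-old_kittens_32.jpg                  # ACL line shows *anon*: READ, so the object is world-readable
s3cmd info s3://mcv21test                                             # the bucket's own ACL stays private, hence AccessDenied on listing
s3cmd setacl --acl-public s3://mcv21test/1-month-old_kittens_32.jpg   # how an object gets made public without opening up the bucket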

Thu, Jun 19, 10:15 AM · Continuous-Integration-Infrastructure (Zuul upgrade)
MatthewVernon added a comment to T396584: Create a new bucket for Tegola's tile cache and duplicate its data.

@elukey You could delete each container in parallel (in a separate tmux/screen window or whatever)? That would get us some more parallelism, but less aggressively than bumping the concurrency in swift delete. I'm relaxed about it taking a few weeks, but I can see it would be annoying to wait weeks for one container to finish before starting the next, rather than having them all under way at once...
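
A sketch of the two options (container names as used elsewhere on this task; thread counts are illustrative):

# one delete per container, each in its own tmux/screen window
swift delete tegola-swift-eqiad-v002
swift delete tegola-swift-codfw-v002
# or raise the per-delete concurrency instead (the default object-threads is 10)
swift delete --object-threads 20 tegola-swift-eqiad-v002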

Thu, Jun 19, 8:58 AM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE
MatthewVernon added a comment to T393104: Q4:rack/setup/install ms-be109[2-5].

@Jclark-ctr I've just put in T397414 to decommission (amongst others) thanos-be1003 in C4 and thanos-be1004 in D7; could the last two ms backends from this ticket go into those spaces, do you think, please?

Thu, Jun 19, 8:37 AM · SRE, Data-Persistence, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
Thu, Jun 19, 8:34 AM · SRE-swift-storage, SRE
MatthewVernon created T397414: decommission thanos-be100[1-4].eqiad.wmnet.
Thu, Jun 19, 8:33 AM · SRE-swift-storage, SRE, ops-eqiad, decommission-hardware, DC-Ops
MatthewVernon added a comment to T396584: Create a new bucket for Tegola's tile cache and duplicate its data.

@elukey yes, that should be fine - ping me when the deletion is done, please?

Thu, Jun 19, 8:06 AM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE
MatthewVernon added a comment to T397385: Artifact storage for updated Zuul CI.

I think the answer is "you need a proxy if you want to do it externally", but it's a little "It Depends". If you set a suitable ACL, you can download objects from anywhere internal, for example:
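
A hypothetical illustration (the endpoint here is a placeholder; the object is the test one from earlier on this task):

# from any internal host, once the object carries a public-read ACL:
curl -fO https://<s3-endpoint>/mcv21test/1-month-old_kittens_32.jpg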

Thu, Jun 19, 7:51 AM · Continuous-Integration-Infrastructure (Zuul upgrade)

Wed, Jun 18

MatthewVernon updated the task description for T397343: Disk (sde) failed in ms-be1071.
Wed, Jun 18, 3:18 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon triaged T397343: Disk (sde) failed in ms-be1071 as High priority.

[this is blocking ongoing load/drain operations for the eqiad ms cluster]

Wed, Jun 18, 3:17 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon created T397343: Disk (sde) failed in ms-be1071.
Wed, Jun 18, 3:17 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
Wed, Jun 18, 1:40 PM · SRE-swift-storage, SRE
MatthewVernon added a comment to T396584: Create a new bucket for Tegola's tile cache and duplicate its data.

Silly question while I'm here - do you need 2 buckets, each of which ends up replicated cross-DC? [The answer might be yes, but it seemed worth checking]

Wed, Jun 18, 10:18 AM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE
MatthewVernon added a comment to T396584: Create a new bucket for Tegola's tile cache and duplicate its data.

FWIW, I use sudo bash ; . /etc/swift/accountfile.env, but yes. I expect those commands will take some time to run, so run them in a screen/tmux. The thanos-swift cluster is one cluster stretched between both DCs, so those deletes will remove content from both DCs.

Wed, Jun 18, 8:35 AM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE

Jun 17 2025

MatthewVernon added a comment to T392908: Q4:rack/setup/install thanos-be200[6-9].

@Jhancock.wm thanos-be2006 Just Worked; thanos-be2007 still had a disk with an old EFI setup on it - I knew which one from reading the run-puppet-agent output (which complained about /dev/disk/by-path/pci-0000:50:00.0-scsi-0:2:11:0-part1); blkid told me it was a vfat EFI partition and I checked it wasn't part of the installation. I then used ls -l to tell me it was /dev/sdl1, wiped both that and /dev/sdl with wipefs -a /dev/sdl1 ; wipefs -a /dev/sdl, and then the re-image worked.
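
Consolidated, the sequence was roughly (device names as they appeared on thanos-be2007):

ls -l /dev/disk/by-path/pci-0000:50:00.0-scsi-0:2:11:0-part1   # the symlink pointed at /dev/sdl1
blkid /dev/sdl1                                                # confirmed a stray vfat/EFI partition, not part of the install
wipefs -a /dev/sdl1
wipefs -a /dev/sdl                                             # then the reimage succeeded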

Jun 17 2025, 12:45 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Jun 16 2025

MatthewVernon updated the task description for T354872: Re-IP Swift hosts to per-rack subnets in codfw rows A-D.
Jun 16 2025, 1:24 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
MatthewVernon added a comment to T396584: Create a new bucket for Tegola's tile cache and duplicate its data.

...so ideally, delete all the old data and then you can just go ahead (and maybe let's make a rough plan for "when to delete v002"?) :)

Jun 16 2025, 1:21 PM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE
MatthewVernon added a comment to T396584: Create a new bucket for Tegola's tile cache and duplicate its data.

To put a little more context on that:

root@thanos-fe1004:/home/mvernon# for b in $(swift list); do swift stat --lh  "$b" | grep -E 'Container|Bytes' ; done
                    Container: tegola-swift-codfw-v002
                        Bytes: 82G
                    Container: tegola-swift-codfw-v003
                        Bytes: 0
                    Container: tegola-swift-container
                        Bytes: 59G
                    Container: tegola-swift-eqiad-v002
                        Bytes: 87G
                    Container: tegola-swift-eqiad-v003
                        Bytes: 0
                    Container: tegola-swift-fallback
                        Bytes: 38G
                    Container: tegola-swift-new
                        Bytes: 46G
                    Container: tegola-swift-staging-container
                        Bytes: 32G
                    Container: tegola-swift-v001
                        Bytes: 71G
Jun 16 2025, 1:15 PM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE
MatthewVernon added a project to T396584: Create a new bucket for Tegola's tile cache and duplicate its data: SRE-swift-storage.

Quick question: I'm concerned about the rather vague timeline for deleting tegola-swift-eqiad-v002 and tegola-swift-codfw-v002; it'd be easier to be relaxed about your proposal if there was a timeline for going back to current-ish usage rather than x2 usage...?

Jun 16 2025, 1:11 PM · Patch-For-Review, SRE-swift-storage, Data-Persistence, SRE-Unowned, Maps, SRE
MatthewVernon added a comment to T396573: ms-fe2015 is suffering intermittent errors on port 80.

Thanks @Ladsgroup :)

Jun 16 2025, 7:26 AM · Data-Persistence, SRE-swift-storage
MatthewVernon closed T395990: Disk (sdr) failed on ms-be2066 as Resolved.

I've had a look, and this system looks good to me now (right number of filesystems of the right size, puppet happy, swift-recon -r and swift-dispersion-report both good). Thanks to both @Jhancock.wm and @Eevans for fixing :)

Jun 16 2025, 7:22 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops

Jun 6 2025

MatthewVernon created T396203: Consider increasing swift workers on proxy nodes to 32.
Jun 6 2025, 10:50 AM · SRE-swift-storage, SRE
MatthewVernon added a comment to T395990: Disk (sdr) failed on ms-be2066.

@Jhancock.wm I've had a look at the web iDRAC, and the serial number I found above (9120A025F1QF) does correspond to the device in slot 14, which the web iDRAC describes as "Physical Disk 0:2:14", which is what I'd expect to be lit up by megacli -PDLocate -PhysDrv [32:14] -a0.
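
(For reference, the serial-to-slot mapping can also be cross-checked from the OS; a sketch - MegaCli option spellings vary a little between versions:)

megacli -PDList -a0 | grep -E 'Slot Number|Inquiry Data'   # Inquiry Data lines carry the drive serials, e.g. 9120A025F1QF
megacli -PDLocate -start -PhysDrv '[32:14]' -a0            # blink the LED on enclosure 32, slot 14
megacli -PDLocate -stop -PhysDrv '[32:14]' -a0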

Jun 6 2025, 10:18 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon added a comment to T396186: Commons: fault in storage backend "local-swift-codfw".

My colleagues in SRE applied some filtering to problematic traffic.

Jun 6 2025, 9:04 AM · SRE-swift-storage, Commons
MatthewVernon added a comment to T396186: Commons: fault in storage backend "local-swift-codfw".

This should be resolved now.

Jun 6 2025, 7:29 AM · SRE-swift-storage, Commons

Jun 4 2025

MatthewVernon added a comment to T395990: Disk (sdr) failed on ms-be2066.

(i.e. the kernel thinks sdf got removed, not the bad sdr)

Jun 4 2025, 10:20 PM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon added a comment to T395990: Disk (sdr) failed on ms-be2066.

@Jhancock.wm I fear the wrong disk has gone:

Jun  4 16:22:52 ms-be2066 kernel: [31462174.362949] sd 0:2:5:0: [sdf] tag#606 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s
Jun  4 16:22:52 ms-be2066 kernel: [31462174.362977] sd 0:2:5:0: [sdf] tag#606 CDB: Read(16) 88 00 00 00 00 00 13 0c 24 08 00 00 00 08 00 00
Jun  4 16:22:52 ms-be2066 kernel: [31462174.362985] blk_update_request: I/O error, dev sdf, sector 319562760 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:52 ms-be2066 kernel: [31462174.364433] sd 0:2:5:0: [sdf] tag#380 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=3s
Jun  4 16:22:52 ms-be2066 kernel: [31462174.364462] sd 0:2:5:0: [sdf] tag#584 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=4s
Jun  4 16:22:52 ms-be2066 kernel: [31462174.364486] sd 0:2:5:0: [sdf] tag#584 CDB: Read(16) 88 00 00 00 00 02 b3 30 5a 90 00 00 00 08 00 00
Jun  4 16:22:52 ms-be2066 kernel: [31462174.364495] blk_update_request: I/O error, dev sdf, sector 11596225168 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:52 ms-be2066 kernel: [31462174.364669] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x2b3305290 len 8 error 5
Jun  4 16:22:52 ms-be2066 kernel: [31462174.374145] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x130c1c08 len 8 error 5
Jun  4 16:22:52 ms-be2066 kernel: [31462174.385240] sd 0:2:5:0: [sdf] tag#380 CDB: Read(16) 88 00 00 00 00 00 40 f4 99 a0 00 00 00 20 00 00
Jun  4 16:22:52 ms-be2066 kernel: [31462174.385242] blk_update_request: I/O error, dev sdf, sector 1089771936 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
Jun  4 16:22:52 ms-be2066 kernel: [31462174.385411] XFS (sdf1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x40f491a0 len 32 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.777084] sd 0:2:5:0: [sdf] tag#580 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=5s
Jun  4 16:22:55 ms-be2066 kernel: [31462176.777111] sd 0:2:5:0: [sdf] tag#580 CDB: Read(16) 88 00 00 00 00 00 8c d5 33 50 00 00 00 08 00 00
Jun  4 16:22:55 ms-be2066 kernel: [31462176.777120] blk_update_request: I/O error, dev sdf, sector 2362782544 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:55 ms-be2066 kernel: [31462176.778794] megaraid_sas 0000:18:00.0: scanning for scsi0...
Jun  4 16:22:55 ms-be2066 kernel: [31462176.779347] sd 0:2:5:0: [sdf] tag#584 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=2s
Jun  4 16:22:55 ms-be2066 kernel: [31462176.779353] sd 0:2:5:0: [sdf] tag#584 CDB: Read(16) 88 00 00 00 00 02 b3 30 5a 90 00 00 00 08 00 00
Jun  4 16:22:55 ms-be2066 kernel: [31462176.779359] blk_update_request: I/O error, dev sdf, sector 11596225168 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:55 ms-be2066 kernel: [31462176.779496] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x2b3305290 len 8 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782267] sd 0:2:5:0: [sdf] tag#680 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782273] sd 0:2:5:0: [sdf] tag#680 CDB: Read(16) 88 00 00 00 00 02 ce fe 3b 68 00 00 00 08 00 00
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782278] blk_update_request: I/O error, dev sdf, sector 12062702440 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782454] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x2cefe3368 len 8 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782541] sd 0:2:5:0: [sdf] tag#674 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782545] sd 0:2:5:0: [sdf] tag#674 CDB: Read(16) 88 00 00 00 00 02 ce fe 3b 68 00 00 00 08 00 00
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782550] blk_update_request: I/O error, dev sdf, sector 12062702440 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782641] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x2cefe3368 len 8 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782771] sd 0:2:5:0: [sdf] tag#681 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782775] sd 0:2:5:0: [sdf] tag#681 CDB: Read(16) 88 00 00 00 00 02 ce fe 3b 68 00 00 00 08 00 00
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782779] blk_update_request: I/O error, dev sdf, sector 12062702440 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:55 ms-be2066 kernel: [31462176.782863] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x2cefe3368 len 8 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.785294] sd 0:2:5:0: [sdf] tag#296 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun  4 16:22:55 ms-be2066 kernel: [31462176.785298] sd 0:2:5:0: [sdf] tag#296 CDB: Read(16) 88 00 00 00 00 01 b6 3e 8e e8 00 00 00 20 00 00
Jun  4 16:22:55 ms-be2066 kernel: [31462176.785300] blk_update_request: I/O error, dev sdf, sector 7352520424 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
Jun  4 16:22:55 ms-be2066 kernel: [31462176.785351] XFS (sdf1): metadata I/O error in "xfs_imap_to_bp+0x61/0xb0 [xfs]" at daddr 0x1b63e86e8 len 32 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.786029] sd 0:2:5:0: [sdf] tag#692 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
Jun  4 16:22:55 ms-be2066 kernel: [31462176.786035] sd 0:2:5:0: [sdf] tag#692 CDB: Read(16) 88 00 00 00 00 03 8e a0 7c 98 00 00 00 08 00 00
Jun  4 16:22:55 ms-be2066 kernel: [31462176.786041] blk_update_request: I/O error, dev sdf, sector 15277784216 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
Jun  4 16:22:55 ms-be2066 kernel: [31462176.786137] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x38ea07498 len 8 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.786289] XFS (sdf1): metadata I/O error in "xfs_da_read_buf+0xd9/0x130 [xfs]" at daddr 0x38ea07498 len 8 error 5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.788845] XFS (sdf1): log I/O error -5
Jun  4 16:22:55 ms-be2066 kernel: [31462176.799732] sd 0:2:5:0: SCSI device is removed
Jun  4 16:22:55 ms-be2066 kernel: [31462176.810273] XFS (sdf1): xfs_do_force_shutdown(0x2) called from line 1211 of file fs/xfs/xfs_log.c. Return address = 00000000582ccb94
Jun  4 16:22:55 ms-be2066 kernel: [31462176.934503] XFS (sdf1): Log I/O Error Detected. Shutting down filesystem
Jun  4 16:22:55 ms-be2066 kernel: [31462176.941482] XFS (sdf1): Please unmount the filesystem and rectify the problem(s)
Jun  4 16:23:01 ms-be2066 kernel: [31462182.621563] XFS (sdf1): Unmounting Filesystem
Jun  4 16:23:01 ms-be2066 kernel: [31462182.631276] megaraid_sas 0000:18:00.0: 4339 (802368937s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 05/3
Jun  4 16:23:01 ms-be2066 kernel: [31462182.631339] megaraid_sas 0000:18:00.0: 4340 (802368937s/0x0001/FATAL) - VD 05/3 is now OFFLINE
Jun 4 2025, 10:20 PM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
Jun 4 2025, 3:13 PM · SRE, Ceph, SRE-swift-storage
MatthewVernon added a comment to T393104: Q4:rack/setup/install ms-be109[2-5].

Just to record what we talked about on IRC - the plan is for now to put two nodes into A (one each in A2 and A7), and at the same time drain ms-be1061 (in B2) and ms-be1062 (in C2) with a view to putting the remaining two nodes there if no other space in A-D becomes available beforehand.

Jun 4 2025, 11:50 AM · SRE, Data-Persistence, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon triaged T395990: Disk (sdr) failed on ms-be2066 as High priority.

This is blocking other work on ms-codfw at the moment.

Jun 4 2025, 8:16 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops
MatthewVernon created T395990: Disk (sdr) failed on ms-be2066.
Jun 4 2025, 8:15 AM · SRE-swift-storage, SRE, ops-codfw, DC-Ops

Jun 3 2025

MatthewVernon added a comment to T395103: Disk (sde) failed in moss-be1002.

Thanks :)

Jun 3 2025, 1:59 PM · SRE-swift-storage, SRE, Ceph, ops-eqiad, DC-Ops
MatthewVernon added a comment to T379942: Gradually drop all thumbnails as a one-off clean up.

Looks like you've just finished the codfw 3x ones, so I looked:
wikipedia-commons-local-thumb.30 838,472 objects 96,830,126,986 bytes
wikipedia-commons-local-thumb.3f 836,193 objects 95,360,984,940 bytes
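
(Figures like these can be read straight off swift stat for each container, e.g.:)

swift stat wikipedia-commons-local-thumb.30 | grep -E 'Objects|Bytes'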

Jun 3 2025, 11:36 AM · SRE-swift-storage, Thumbor
MatthewVernon added a comment to T395659: New apus account for Tegola.

I'm not inclined to move this right now, I think - thanos-swift already gives you S3 protocol support and cross-DC replication, and a very large number of small objects is not a super-great fit to apus right now (we don't have solid-state storage for bucket indexes and suchlike; and our experience with gitlab is that cross-DC replication is done per-object, so lots of small objects is slow compared to fewer larger ones).

Jun 3 2025, 9:22 AM · SRE-Unowned, Maps, SRE
MatthewVernon updated subscribers of T392908: Q4:rack/setup/install thanos-be200[6-9].

@Jhancock.wm @Jclark-ctr I've now re-imaged thanos-be2008 and thanos-be2009 OK. The problem in both cases was that one of the hdds had an existing partition on it - a single vfat partition with EFI on it (and subdirectories Boot, Dell, Microsoft, PEBoot); the puppet that sets up the swift (and also thanos) disks won't over-write existing data, so it gets stuck. I wiped the partition and drive and then all was good.

Jun 3 2025, 9:09 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
MatthewVernon added a comment to T392908: Q4:rack/setup/install thanos-be200[6-9].

@Jhancock.wm thanks for this; I tried running puppet on thanos-be2009 and the problem was that /dev/sdi1 had an EFI partition on it (presumably either because it was shipped like that, or maybe the result of a previous failed install); I wiped that and puppet now runs to completion, so I'll try another reimage.

Jun 3 2025, 7:21 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Jun 2 2025

MatthewVernon added a project to T395773: Some uploaded files on Commons show "Create" instead of "Edit"/"View History" tabs on File page: UploadWizard.

I've checked these objects in swift, and they are both present and correct in both clusters (I checked with swift stat wikipedia-commons-local-public.65 6/65/Басков_переулок_19_СПб_01.jpg and swift stat wikipedia-commons-local-public.89 8/89/Басков_переулок_19_СПб_02.jpg respectively), with timestamps of 07:07 UTC this morning.

Jun 2 2025, 9:20 AM · UploadWizard, MediaWiki-Uploading, Commons
MatthewVernon added a comment to T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw.

I think the "check the file is in a consistent (presumed-to-be-absent) state" operation is intentional, and probably replicated across other file changing call paths; not least because we get tickets sometimes when these checks fail...

Jun 2 2025, 9:04 AM · API Platform, MediaWiki-File-management, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), Unstewarded-production-error, MediaWiki-Uploading, Wikimedia-production-error, SRE-swift-storage, Commons

May 28 2025

MatthewVernon added a comment to T379942: Gradually drop all thumbnails as a one-off clean up.

Just apropos the sizes, I took one at random (wikipedia-commons-local-thumb.f9), and whilst eqiad is bigger, it's not a lot bigger:
eqiad: 9,285,690 objects 1,307,136,946,396 bytes
codfw: 7,434,712 objects 1,027,498,001,890 bytes

May 28 2025, 1:35 PM · SRE-swift-storage, Thumbor
MatthewVernon added a comment to T394476: Onboard the Docker Registry to apus.

@elukey OK, I've set that up for you. Quota is 3T, but can be adjusted as needed - it's really there so we can keep track of how much apus storage we've promised to people.
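
(On the radosgw side that sort of quota is set per user; a rough sketch - the uid here is a placeholder, not the real account name:)

radosgw-admin quota set --quota-scope=user --uid=docker-registry --max-size=3T   # recent radosgw-admin accepts size suffixes; otherwise give bytes
radosgw-admin quota enable --quota-scope=user --uid=docker-registry
radosgw-admin user stats --uid=docker-registry --sync-stats                      # check current usage against the quota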

May 28 2025, 9:01 AM · Ceph, SRE-swift-storage, Data-Persistence, serviceops

May 27 2025

MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 27 2025, 12:04 PM · SRE-swift-storage, SRE
MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
May 27 2025, 12:03 PM · SRE, Ceph, SRE-swift-storage
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 27 2025, 12:00 PM · SRE-swift-storage, SRE
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 27 2025, 10:27 AM · SRE-swift-storage, SRE
MatthewVernon added a comment to T394476: Onboard the Docker Registry to apus.

If you want to do some testing, I could set you up with a test account on apus.

That would be swell!

May 27 2025, 9:24 AM · Ceph, SRE-swift-storage, Data-Persistence, serviceops
MatthewVernon added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

Yeah, the apus cluster isn't ideal for buckets with a very large number of objects in (if we wanted to start aiming to support such a use case, we'd want to use some SSD/NVME specifically for bucket indexes, which I think would involve some hardware hassle), so if it's straightforward to keep the artifacts bucket to a smaller number that'd be nice.

May 27 2025, 8:22 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
MatthewVernon added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

Ah, the bucket is gone from eqiad, but codfw is still catching up:

root@moss-be2001:/# radosgw-admin bucket sync status --bucket=gitlab-artifacts
          realm 5d3dbc7a-7bbf-4412-a33b-124dfd79c774 (apus)
      zonegroup 19f169b0-5e44-4086-9e1b-0df871fbea50 (apus_zg)
           zone acc58620-9fac-476b-aee3-0f640100a3bb (codfw)
         bucket :gitlab-artifacts[64f0dd71-48bf-45aa-9741-69a51c083556.75705.1])
   current time 2025-05-22T09:33:40Z

    source zone 64f0dd71-48bf-45aa-9741-69a51c083556 (eqiad)
  source bucket :gitlab-artifacts[64f0dd71-48bf-45aa-9741-69a51c083556.75705.1])
                incremental sync on 11 shards
                bucket is behind on 11 shards
                behind shards: [0,1,2,3,4,5,6,7,8,9,10]

This is annoying (and perhaps my fault for trying to be too clever); I presume that while the removal on the master zone was quick, it still ends up being replicated to the secondary zone as a long series of individual object deletions. I'll monitor how that is progressing (I'm hoping this is a case of "wait, it'll get there"), and also see if I can figure out a better approach for next time we want to delete a bucket with O(100k) objects in it.

Great, thanks! Running the zone sync over the weekend is also fine from my side. I have some concerns it might take much longer than the weekend, but let's see. I also opened a subtask to double-check GitLab's artifacts retention policy (T395014). The policy should delete artifacts older than 7 days, but 400k sounds a bit too many.

May 27 2025, 8:13 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE

May 23 2025

MatthewVernon added a comment to T395049: UploadChunkFileException: Error storing file in '{chunkPath}': backend-fail-internal; local-swift-codfw.

I had a look in the swift logs for the associated item (per logstash, https://ms-fe.svc.codfw.wmnet/v1/AUTH_mw/wikipedia-commons-local-temp.85/8/85/1bszz9c7w23k.h5b3mo.6438344.webm.5), and get exactly one hit, in eqiad:

May 21 16:51:30 ms-fe1013 proxy-server: 10.67.188.58 10.64.48.149 21/May/2025/16/51/30 PUT /v1/AUTH_mw/wikipedia-commons-local-temp.85/8/85/1bszz9c7w23k.h5b3mo.6438344.webm.5 HTTP/1.0 201 - wikimedia/multi-http-client%20v1.1 AUTH_tk3261723f1... 16777216 - d5807b8ad0ae1b0af9f9cf957a42740c txd4becbe641444c8ba1f63-00682e0492 - 0.1807 - - 1747846290.054478645 1747846290.235161781 0

...so that request took 0.1807 seconds in eqiad, and was 16,777,216 bytes.

May 23 2025, 4:39 PM · MediaWiki-Uploading, SRE-swift-storage, User-brennen, Wikimedia-production-error
MatthewVernon added a comment to T379942: Gradually drop all thumbnails as a one-off clean up.

@Ladsgroup can you let me know when one of the current batch has finished, please? Now we've done the thumbnail defrag stuff, I'd like to re-assess (for the purposes of wondering about a cache system) how big a freshly-deleted thumb container is (and how it grows).

May 23 2025, 3:45 PM · SRE-swift-storage, Thumbor
MatthewVernon added projects to T394476: Onboard the Docker Registry to apus: SRE-swift-storage, Ceph.

A few TB of quota shouldn't be a problem; how many objects per bucket are you looking at? We get better performance out of fewer larger objects.

May 23 2025, 3:42 PM · Ceph, SRE-swift-storage, Data-Persistence, serviceops
MatthewVernon added a comment to T393104: Q4:rack/setup/install ms-be109[2-5].

@Jclark-ctr @wiki_willy any idea when that might be or if there's anywhere else these servers could go? Once they're installed I'll be able to drain the nodes they're replacing (but that process typically takes ~3 weeks) and they can be decommissioned, but I'd rather avoid having to drain them first (partly because of the delay, and also because of the capacity reduction that would involve)...

May 23 2025, 1:54 PM · SRE, Data-Persistence, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 23 2025, 1:04 PM · SRE-swift-storage, SRE
MatthewVernon added a comment to T392909: Q4:rack/setup/install thanos-be100[6-9].

@MatthewVernon Thank you for your assistance. In most cases, I am able to image a server and have it successfully pass Puppet validation with only the OS drives configured. This is the typical approach I use when troubleshooting a server, as the issue is sometimes within the Puppet configuration.
Unfortunately, the dcops group does not have the necessary permissions to run the perccli64 command. As a result, we are required to configure each drive manually through the iDRAC GUI, which is a highly time-consuming process.

May 23 2025, 12:50 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
MatthewVernon added a comment to T392909: Q4:rack/setup/install thanos-be100[6-9].

@Jclark-ctr the problem (at least on thanos-be1006 where I started) is that the disks aren't visible to the operating system, because they're "Ready", not non-RAID, and indeed the OS looks to have been installed on virtual disks rather than non-RAID disks. These systems are all meant to be set up as JBOD...
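
(With controller access that conversion can be scripted rather than clicked through drive-by-drive in the iDRAC; a rough sketch with perccli64 - the exact syntax varies by PERC generation, so treat it as an assumption:)

perccli64 /c0 show                  # list drives and their state ("Ready"/UGood vs non-RAID/JBOD)
perccli64 /c0/eall/sall set jbod    # convert every drive on controller 0 to non-RAID, where the controller supports it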

May 23 2025, 8:14 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
MatthewVernon triaged T395103: Disk (sde) failed in moss-be1002 as High priority.
May 23 2025, 7:35 AM · SRE-swift-storage, SRE, Ceph, ops-eqiad, DC-Ops
MatthewVernon created T395103: Disk (sde) failed in moss-be1002.
May 23 2025, 7:35 AM · SRE-swift-storage, SRE, Ceph, ops-eqiad, DC-Ops

May 22 2025

MatthewVernon added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

Ah, the bucket is gone from eqiad, but codfw is still catching up:

root@moss-be2001:/# radosgw-admin bucket sync status --bucket=gitlab-artifacts
          realm 5d3dbc7a-7bbf-4412-a33b-124dfd79c774 (apus)
      zonegroup 19f169b0-5e44-4086-9e1b-0df871fbea50 (apus_zg)
           zone acc58620-9fac-476b-aee3-0f640100a3bb (codfw)
         bucket :gitlab-artifacts[64f0dd71-48bf-45aa-9741-69a51c083556.75705.1])
   current time 2025-05-22T09:33:40Z
May 22 2025, 9:36 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
MatthewVernon placed T328872: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw up for grabs.
May 22 2025, 7:20 AM · API Platform, MediaWiki-File-management, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), Unstewarded-production-error, MediaWiki-Uploading, Wikimedia-production-error, SRE-swift-storage, Commons

May 21 2025

MatthewVernon added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

@Jelto both buckets deleted.

May 21 2025, 10:22 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
MatthewVernon added a comment to T378922: Migrate gitlab storage to apus (also: backups from S3?).

s3://gitlab-packages is empty after several hours (s3cmd del --force --recursive s3://gitlab-packages/). Using the same approach for s3://gitlab-artifacts is not really feasible due to the high latency and the huge number of objects. I was thinking about deleting the bucket directly, but I'm not sure what exactly happens to the objects in that case, and whether there are any runtime benefits.

I'll just leave the artifacts in the bucket for now and start a new sync on the production host once object storage is enabled again.
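
(On what happens if the bucket is deleted directly: radosgw can remove a bucket and its objects server-side in one go, though - as the May 27 follow-up above notes - the deletions still replay object-by-object to the secondary zone. A sketch, run against the master zone:)

radosgw-admin bucket rm --bucket=gitlab-artifacts --purge-objects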

May 21 2025, 9:11 AM · Data-Persistence-Backup, GitLab (Infrastructure), collaboration-services, SRE-swift-storage, Ceph, SRE
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 21 2025, 8:57 AM · SRE-swift-storage, SRE
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 21 2025, 8:56 AM · SRE-swift-storage, SRE
MatthewVernon created T394894: decommission thanos-fe100[1-3].eqiad.wmnet.
May 21 2025, 8:56 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops, decommission-hardware

May 20 2025

MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
May 20 2025, 1:59 PM · SRE, Ceph, SRE-swift-storage
MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
May 20 2025, 9:05 AM · SRE, Ceph, SRE-swift-storage
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 20 2025, 9:02 AM · SRE-swift-storage, SRE

May 17 2025

MatthewVernon created T394584: Broken / outdated links from Api Gateway & application servers docs.
May 17 2025, 3:04 PM · Sustainability (Incident Followup), serviceops

May 16 2025

MatthewVernon added a comment to T392844: Q4:rack/setup/install apus-be1004.

@Jclark-ctr the only BIOS/iDRAC changes I made were to set all the hdds to be non-RAID. Everything else was fixing puppet/preseed configs.

May 16 2025, 2:35 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T392845: Q4:rack/setup/install apus-be2004.

@Jhancock.wm we had some fun with the eqiad equivalent system, but did get it properly installed (and the preseed done such that this system should also preseed OK).

May 16 2025, 2:29 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
May 16 2025, 2:24 PM · SRE, Ceph, SRE-swift-storage
MatthewVernon added a comment to T392844: Q4:rack/setup/install apus-be1004.

@Jclark-ctr system imaged OK now. I don't know if you have more you want to do before closing this ticket out?

May 16 2025, 2:24 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon created P76271 (An Untitled Masterwork).
May 16 2025, 2:03 PM
MatthewVernon added a comment to T392844: Q4:rack/setup/install apus-be1004.

Thanks for this! I was able to get in via install_console, and have a look. None of the hdds were available - I had to boot into BIOS and convert them all to non-RAID from there (for some reason attempting this via the iDRAC doesn't work).

May 16 2025, 9:45 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 16 2025, 9:08 AM · SRE-swift-storage, SRE

May 15 2025

MatthewVernon added a comment to T392844: Q4:rack/setup/install apus-be1004.

Oh, right, yes, we want the OS on that (which I thought was going to be presented to the OS as a single device, doing RAID-1 in hardware), sorry.

May 15 2025, 12:32 PM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon added a comment to T392844: Q4:rack/setup/install apus-be1004.

@Jclark-ctr EFI booting is fine (I thought I'd said as much on a previous ticket, but may have missed it); I don't want the OS on the NVME drive, the OS should go on the SSDs on the boss card.

May 15 2025, 11:56 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 15 2025, 9:31 AM · SRE-swift-storage, SRE

May 14 2025

MatthewVernon added a comment to T392844: Q4:rack/setup/install apus-be1004.

@Jclark-ctr the boss card should be left as RAID 1, thank you (but all the other drives should be JBOD).

May 14 2025, 8:54 AM · SRE, SRE-swift-storage, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
May 14 2025, 8:52 AM · SRE, Ceph, SRE-swift-storage

May 13 2025

MatthewVernon added a comment to T377497: Functional replacement for importImages.php on Kubernetes.

[A brief aside: data-persistence sometimes need to use this script for restoring images we had to fish out of backups (or one of the ms clusters if an image was only uploaded to one); we could "just" upload the image to ms directly with the swift CLI tool, but have avoided doing so in the past because it's not clear what (if any) metadata (either in swift or in MW) would also need to be updated for this to work.]

May 13 2025, 11:03 AM · serviceops, MW-on-K8s
MatthewVernon added a comment to T389635: Q4:rack/setup/install thanos-fe100[5-7].

@VRiley-WMF apologies, but any chance you can get thanos-fe1007 to at least PXE-boot OK, please? I don't think I can usefully make progress here.

May 13 2025, 7:24 AM · Patch-For-Review, SRE-swift-storage, SRE, Data-Persistence, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391354: Q4 object storage hardware tasks.
May 13 2025, 7:17 AM · SRE, Ceph, SRE-swift-storage

May 12 2025

MatthewVernon added a comment to T389635: Q4:rack/setup/install thanos-fe100[5-7].

thanos-fe1007 looks like it's not even trying to PXE at the moment, so maybe the 10g card needs setting up to PXE on this system too?

May 12 2025, 9:37 AM · Patch-For-Review, SRE-swift-storage, SRE, Data-Persistence, ops-eqiad, DC-Ops
MatthewVernon updated the task description for T391352: Q4 Thanos hardware refresh.
May 12 2025, 8:55 AM · SRE-swift-storage, SRE
MatthewVernon created T393870: decommission thanos-fe200[1-3].codfw.wmnet.
May 12 2025, 8:54 AM · DC-Ops, SRE, SRE-swift-storage, ops-codfw, decommission-hardware