Automate the pre/post switchover tasks related to databases
Closed, Resolved (Public)

Description

As Manuel has documented well, we should automate all the steps required to prepare the databases a few days before the switchover, and to clean up the switchover-specific settings after it is completed.

Opening this task for tracking purposes with the datacenter switchover tag.

Event Timeline

There is a minor usability issue (which could be confusing under pressure). I get this text:

==> Run on section test-s4 was manually aborted. Continue with the remaining sections or abort completely?

However, if it is the last or the only section, the question doesn't make much sense, as both options would do basically the same thing. Maybe just change the wording when there are no more sections left, even if you want to keep the pause?

What do you think?
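
The wording suggestion above could be sketched roughly as follows. This is only an illustration: `abort_prompt` and the alternative message text are hypothetical and are not taken from the cookbook.

```python
def abort_prompt(section: str, remaining: list[str]) -> str:
    """Build the message shown after a section run is manually aborted.

    Hypothetical helper: the second message text is an assumption.
    """
    prefix = f"Run on section {section} was manually aborted. "
    if remaining:
        # Other sections are still pending, so the choice is meaningful.
        return prefix + "Continue with the remaining sections or abort completely?"
    # Last (or only) section: there is nothing left to continue with.
    return prefix + "No sections left. Confirm to finish."
```

The pause is kept in both cases; only the question changes when `remaining` is empty.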

I cannot reproduce the bad scenario at the moment, because this is the binlog from, e.g., x1:

# at 1199094
#241017  9:53:45 server id 180360966  end_log_pos 1199136 CRC32 0x45c248cc      GTID 180360966-180360966-115215619 trans
/*!100001 SET @@session.gtid_seq_no=115215619*//*!*/;
START TRANSACTION
/*!*/;
# at 1199136
# at 1199400
#241017  9:53:45 server id 180360966  end_log_pos 1199400 CRC32 0xd37d7cf7      Annotate_rows:
#Q> REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2024-10-17T09:53:45.001510', '180360966', 'db2196-bin.003776', '1199094', NULL, NULL, 'x1', 'codfw')
#241017  9:53:45 server id 180360966  end_log_pos 1199474 CRC32 0xe050a53c      Table_map: `heartbeat`.`heartbeat` mapped to number 23
# at 1199474
#241017  9:53:45 server id 180360966  end_log_pos 1199642 CRC32 0x14ae218b      Update_rows: table id 23 flags: STMT_END_F
### UPDATE `heartbeat`.`heartbeat`
### WHERE
###   @1='2024-10-17T09:53:44.001370' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003775' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1048427332 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### SET
###   @1='2024-10-17T09:53:45.001510' /* VARSTRING(26) meta=26 nullable=0 is_null=0 */
###   @2=180360966 /* INT meta=0 nullable=0 is_null=0 */
###   @3='db2196-bin.003776' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
###   @4=1199094 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=NULL /* VARSTRING(255) meta=255 nullable=1 is_null=1 */
###   @6=NULL /* LONGINT meta=0 nullable=1 is_null=1 */
###   @7='x1' /* VARSTRING(10) meta=10 nullable=1 is_null=0 */
###   @8='codfw' /* STRING(5) meta=65029 nullable=1 is_null=0 */
# Number of rows: 1
# at 1199642
#241017  9:53:45 server id 180360966  end_log_pos 1199673 CRC32 0xc6f0d777      Xid = 3891994318
COMMIT/*!*/;
# at 1199673

And this is the binlog from test-s4, even after changing the format, doing flush tables and restarting pt-heartbeat-wikimedia:

# at 417291
#241017 10:35:31 server id 171978825  end_log_pos 417333 CRC32 0x3f3b8917       GTID 171978825-171978825-13743836 trans
/*M!100001 SET @@session.gtid_seq_no=13743836*//*!*/;
START TRANSACTION
/*!*/;
# at 417333
# at 417412
#241017 10:35:31 server id 171978825  end_log_pos 417412 CRC32 0x025cddf4       Annotate_rows:
#Q> INSERT INTO test (s) VALUES ('ca977ebccb5f6ea211903ebf')
#241017 10:35:31 server id 171978825  end_log_pos 417464 CRC32 0x32152a67       Table_map: `test`.`test` mapped to number 428
# at 417464
#241017 10:35:31 server id 171978825  end_log_pos 417532 CRC32 0x92c81f62       Write_rows: table id 428 flags: STMT_END_F
### INSERT INTO `test`.`test`
### SET
###   @1=877001 /* INT meta=0 nullable=0 is_null=0 */
###   @2='ca977ebccb5f6ea211903ebf' /* VARSTRING(1000) meta=1000 nullable=1 is_null=0 */
###   @3=1729161331 /* TIMESTAMP(0) meta=0 nullable=0 is_null=0 */
# Number of rows: 1
# at 417532
#241017 10:35:31 server id 171978825  end_log_pos 417563 CRC32 0xa5142826       Xid = 22234452
COMMIT/*!*/;
# at 417563
#241017 10:35:32 server id 171978825  end_log_pos 417605 CRC32 0x2cb0630f       GTID 171978825-171978825-13743837 trans
/*M!100001 SET @@session.gtid_seq_no=13743837*//*!*/;
START TRANSACTION
/*!*/;
# at 417605
#241017 10:35:32 server id 171978825  end_log_pos 417922 CRC32 0x93269457       Query   thread_id=1737128       exec_time=0     error_code=0    xid=0
SET TIMESTAMP=1729161332/*!*/;
REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2024-10-17T10:35:32.000810', '171978825', 'db1125-bin.000026', '417563', NULL, NULL, 'test-s4', 'eqiad')
/*!*/;
# at 417922
#241017 10:35:32 server id 171978825  end_log_pos 417953 CRC32 0xfaf53238       Xid = 22234456
COMMIT/*!*/;
# at 417953

You can see that we are in ROW format (because of the test traffic I created), but heartbeat is using STATEMENT, not ROW. That is good, but not what I need to test the issue. I will restart the server, but this may be the source of the issue in T375144 (?)
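
The difference between the two dumps above can also be spotted mechanically. A minimal sketch, assuming the input is decoded `mysqlbinlog` text like the excerpts above; `heartbeat_binlog_formats` is a hypothetical helper, not part of any existing tool.

```python
import re

# Decoded row events are printed as "### UPDATE `db`.`table`" pseudo-SQL lines.
ROW_EVENT = re.compile(r"^### (UPDATE|INSERT INTO|DELETE FROM) `heartbeat`\.`heartbeat`")


def heartbeat_binlog_formats(binlog_text: str) -> set[str]:
    """Return which binlog formats the heartbeat writes appear in."""
    formats = set()
    for line in binlog_text.splitlines():
        if ROW_EVENT.match(line):
            formats.add("ROW")
        # In STATEMENT format the REPLACE is logged as an executable line.
        # The "#Q> REPLACE ..." annotation of a row event starts with "#",
        # so it is not matched here.
        elif line.startswith("REPLACE INTO `heartbeat`.`heartbeat`"):
            formats.add("STATEMENT")
    return formats
```

On the x1 excerpt this would report `{"ROW"}`, and on the test-s4 excerpt `{"STATEMENT"}`.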

Indeed it is. Let me use blame to see when this happened. This is good news because we finally know why this started happening only recently; the sad part is that we may have to partially discard this patch.

Regarding the prompt wording when there are no more sections left: I've updated the last CR of the chain with this. I've also added a "progress" indicator in a log line and in the ask confirmation prompt.

Nice catch! I don't mind discarding the patch if we solved the problem :)

We should keep the check part. What I am not sure about is whether to keep the logic but check the configuration instead, or whether that is too much. I am open to suggestions on whether it is best to just fix the root issue and simplify the script, as we cannot foresee future issues.

Actually, I answered my own question while testing the script. My originally proposed solution doesn't work because it solves the problem for the from server but moves it to the secondary through replication. The only fix when this issue happens is to do it without sql_log_bin on all hosts of one side, which I did manually during the incident, but that is not a good solution unattended and may not be reliable. We need to drop the insertion and monitor that heartbeat always uses STATEMENT.

Change #1074127 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases.prepare: add check

https://gerrit.wikimedia.org/r/1074127

Change #1074128 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.databases: update Phabricator more

https://gerrit.wikimedia.org/r/1074128

As soon as the other patches are merged, this is done for me IMHO. Core section ordering can be discussed afterwards.

All pending patches have been tested and merged. Resolving.

@Volans can we document these steps in the DC switchover page on wikitech? Thanks!

@Marostegui I've added some notes in those two pages and removed one paragraph that I think was obsolete due to active/active mediawiki. I think, though, that some of the steps listed there might be outdated.

Let me know if you want to go in more detail there or the help message of the cookbook is enough (and also self-updated in the future).

Diffs:
https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&diff=2242496&oldid=2233628
https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2FSwitch_Datacenter&diff=2242499&oldid=1932299

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad was aborted for section test-s4:
test-s4 (FAIL)

  • Expected all replicas of MASTER_TO db1125.eqiad.wmnet to be in the same datacenter, got db2230.codfw.wmnet instead
  • Execution for this section was manually aborted

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.
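
The "same datacenter" validation reported in the run above could look roughly like this. It is a sketch under the assumption that hostnames follow the `<host>.<datacenter>.wmnet` shape seen in these logs; both function names are hypothetical.

```python
def datacenter_of(fqdn: str) -> str:
    """Extract the datacenter label from an FQDN such as db1125.eqiad.wmnet."""
    parts = fqdn.split(".")
    if len(parts) < 3:
        raise ValueError(f"Unexpected hostname format: {fqdn}")
    return parts[1]


def replicas_outside_dc(master: str, replicas: list[str]) -> list[str]:
    """Return the replicas that are not in the same datacenter as the master."""
    dc = datacenter_of(master)
    return [replica for replica in replicas if datacenter_of(replica) != dc]
```

An empty result would mean the topology matches the expectation; any returned host would be reported as in the failed run above.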

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (FAIL)

  • MASTER_FROM db2230.codfw.wmnet should be read only
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
  • MASTER_TO db1125.eqiad.wmnet has no replication set.
  • MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db2230.codfw.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db2230.codfw.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db2230.codfw.wmnet and MASTER_FROM db1125.eqiad.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet RESET SLAVE ALL.
  • MASTER_TO db2230.codfw.wmnet has no replication set.
  • MASTER_FROM db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_FROM db1125.eqiad.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db1125.eqiad.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.
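
The "Enabled GTID" verification in the run above presumably relies on the `Using_Gtid` column of MariaDB's `SHOW SLAVE STATUS`. A minimal sketch, assuming the status row is available as a dictionary of strings; `gtid_mode_matches` is a hypothetical helper and not the cookbook's actual code.

```python
def gtid_mode_matches(slave_status: dict[str, str], expected: str) -> bool:
    """Compare Using_Gtid (No, Slave_Pos or Current_Pos) with the expected mode.

    The comparison ignores case because the setting is written as
    MASTER_USE_GTID=slave_pos but reported as Slave_Pos.
    """
    return slave_status.get("Using_Gtid", "").lower() == expected.lower()
```

For example, after `MASTER_USE_GTID=slave_pos` the check would pass `expected="slave_pos"`, and after `MASTER_USE_GTID=no` it would pass `expected="no"`.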

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad was aborted for section test-s4:
test-s4 (FAIL)

  • Validated replication topology for section test-s4 between MASTER_FROM db2230.codfw.wmnet and MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • MASTER_TO db1125.eqiad.wmnet wrong SLAVE STATUS Last_IO_Errno=0, expected 0 instead
  • Failed to verify disabled GTID on db1125.eqiad.wmnet
  • Execution for this section was manually aborted

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (FAIL)

  • Validated replication topology for section test-s4 between MASTER_FROM db2230.codfw.wmnet and MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • MASTER_TO db1125.eqiad.wmnet wrong SLAVE STATUS Last_IO_Errno=0, expected 0 instead
  • Failed to verify disabled GTID on db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet stopped pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER STATUS is stable over time: {'File': 'db1125-bin.000030', 'Position': 899318914, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db2230.codfw.wmnet CHANGE MASTER to ReplicationInfo(primary='db1125.eqiad.wmnet', binlog='db1125-bin.000030', position=899318914, port=3306) and user repl2024
  • MASTER_FROM db2230.codfw.wmnet START SLAVE
  • MASTER_FROM db2230.codfw.wmnet wrong SLAVE STATUS Master_Port=3306, expected 3306 instead
  • MASTER_TO db1125.eqiad.wmnet started pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • MASTER_TO db1125.eqiad.wmnet wrong SLAVE STATUS Master_Port=3306, expected 3306 instead
  • MASTER_FROM db2230.codfw.wmnet wrong SLAVE STATUS Master_Port=3306, expected 3306 instead

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.
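
Messages such as "Master_Port=3306, expected 3306 instead" show two values that print identically yet compare as different. The thread does not state the cause; one plausible explanation (an assumption) is a comparison between values of different types, for example the string `'3306'` returned by the database driver against the integer `3306`. A sketch of a comparison that is not sensitive to that, with the hypothetical helper `check_status_field`:

```python
from typing import Optional


def check_status_field(status: dict, key: str, expected: object) -> Optional[str]:
    """Return an error message if status[key] differs from expected, else None.

    Both sides are converted to strings first, so '3306' and 3306 match.
    A naive `status[key] != expected` would flag them as different.
    """
    actual = status[key]
    if str(actual) != str(expected):
        return f"wrong SLAVE STATUS {key}={actual}, expected {expected} instead"
    return None
```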

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (FAIL)

  • MASTER_FROM db2230.codfw.wmnet should be read only
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
  • MASTER_TO db1125.eqiad.wmnet has no replication set.
  • MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db2230.codfw.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db2230.codfw.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw was aborted for section test-s4:
test-s4 (FAIL)

  • MASTER_FROM db1125.eqiad.wmnet should be read write
  • Execution for this section was manually aborted

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (FAIL)

  • Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • MASTER_TO db2230.codfw.wmnet wrong SLAVE STATUS Last_IO_Errno=0, expected 0 instead
  • Failed to verify disabled GTID on db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet stopped pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER STATUS is stable over time: {'File': 'db2230-bin.000010', 'Position': 472059358, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000010', position=472059358, port=3306) and user repl2024
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE
  • MASTER_FROM db1125.eqiad.wmnet wrong SLAVE STATUS Master_Port=3306, expected 3306 instead
  • MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • MASTER_TO db2230.codfw.wmnet wrong SLAVE STATUS Master_Port=3306, expected 3306 instead
  • MASTER_FROM db1125.eqiad.wmnet wrong SLAVE STATUS Master_Port=3306, expected 3306 instead

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw was aborted for section test-s4:
test-s4 (FAIL)

  • MASTER_FROM db1125.eqiad.wmnet should be read only
  • Execution for this section was manually aborted

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw was aborted for section test-s4:
test-s4 (FAIL)

  • MASTER_TO db2230.codfw.wmnet should be read write
  • Execution for this section was manually aborted

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.
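
The read-only expectations visible in the runs above (prepare: MASTER_FROM read write; finalize: MASTER_FROM read only and MASTER_TO read write) can be summarised as a small lookup table. This is a sketch inferred from the log messages only; the table and `read_only_error` are hypothetical, and the real cookbook may check more than this.

```python
from typing import Optional

# (phase, role) -> expected value of the read_only setting, as seen in the logs.
EXPECTED_READ_ONLY = {
    ("prepare", "MASTER_FROM"): False,
    ("finalize", "MASTER_FROM"): True,
    ("finalize", "MASTER_TO"): False,
}


def read_only_error(phase: str, role: str, host: str, read_only: bool) -> Optional[str]:
    """Return a log-style error if the host's read_only state is unexpected."""
    expected = EXPECTED_READ_ONLY[(phase, role)]
    if read_only == expected:
        return None
    wanted = "read only" if expected else "read write"
    return f"{role} {host} should be {wanted}"
```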

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db2230.codfw.wmnet and MASTER_FROM db1125.eqiad.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet RESET SLAVE ALL.
  • MASTER_TO db2230.codfw.wmnet has no replication set.
  • MASTER_FROM db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_FROM db1125.eqiad.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db1125.eqiad.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_FROM db2230.codfw.wmnet and MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • Disabled GTID on MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet stopped pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER STATUS is stable over time: {'File': 'db1125-bin.000030', 'Position': 906844617, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db2230.codfw.wmnet CHANGE MASTER to ReplicationInfo(primary='db1125.eqiad.wmnet', binlog='db1125-bin.000030', position=906844617, port=3306) and user repl2024
  • MASTER_FROM db2230.codfw.wmnet START SLAVE
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified
  • MASTER_TO db1125.eqiad.wmnet started pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • MASTER_TO db1125.eqiad.wmnet replication from MASTER_FROM db2230.codfw.wmnet verified
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified after pt-heartbeat

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.
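
The `CHANGE MASTER` step in the run above points the old master at the binlog coordinates captured on the new one. A rough sketch of how such a statement could be assembled; the `ReplicationInfo` dataclass here is a stand-in that only mirrors the field names printed in the log, and `change_master_sql` is hypothetical. The password is deliberately left out, and the plain string formatting is only acceptable for trusted values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationInfo:
    """Stand-in mirroring the fields shown in the log line."""

    primary: str
    binlog: str
    position: int
    port: int


def change_master_sql(info: ReplicationInfo, user: str) -> str:
    """Build a binlog-position based CHANGE MASTER TO statement."""
    return (
        f"CHANGE MASTER TO MASTER_HOST='{info.primary}', MASTER_PORT={info.port}, "
        f"MASTER_USER='{user}', MASTER_LOG_FILE='{info.binlog}', "
        f"MASTER_LOG_POS={info.position}"
    )
```

Using file and position instead of GTID matches the preceding `MASTER_USE_GTID=no` step in the same run.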

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db1125.eqiad.wmnet and MASTER_FROM db2230.codfw.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
  • MASTER_TO db1125.eqiad.wmnet has no replication set.
  • MASTER_TO db2230.codfw.wmnet heartbeat server IDs to delete are: [180360463]
  • MASTER_TO db2230.codfw.wmnet DELETED heartbeat rows for server IDs [180360463]
  • MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db2230.codfw.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db2230.codfw.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.
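
The heartbeat cleanup reported in the run above removes rows by server ID. A minimal sketch of the statement involved, assuming the `heartbeat`.`heartbeat` table and `server_id` column seen in the binlog excerpts earlier in this task; `heartbeat_delete_sql` is a hypothetical helper.

```python
def heartbeat_delete_sql(server_ids: list[int]) -> str:
    """Build the DELETE for stale heartbeat rows of the given server IDs."""
    if not server_ids:
        raise ValueError("No server IDs to delete")
    # int() guards against anything that is not a plain number.
    ids = ", ".join(str(int(server_id)) for server_id in server_ids)
    return f"DELETE FROM heartbeat.heartbeat WHERE server_id IN ({ids})"
```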

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw executed by arnaudb@cumin1002 with errors:

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • Disabled GTID on MASTER_TO db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet stopped pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER STATUS is stable over time: {'File': 'db2230-bin.000010', 'Position': 476646927, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000010', position=476646927, port=3306) and user repl2024
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE
  • MASTER_FROM db1125.eqiad.wmnet replication from MASTER_TO db2230.codfw.wmnet verified
  • MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
  • MASTER_FROM db1125.eqiad.wmnet replication from MASTER_TO db2230.codfw.wmnet verified after pt-heartbeat

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.
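
The "MASTER STATUS is stable over time" step in the run above makes sure nothing is still writing to the binlog before its coordinates are used. One way to sketch such a check, assuming a caller-supplied function that returns the current `SHOW MASTER STATUS` row as a dictionary; `is_stable`, the number of checks and the interval are all assumptions.

```python
import time
from typing import Callable


def is_stable(read_status: Callable[[], dict], checks: int = 3, interval: float = 1.0) -> bool:
    """Return True if repeated reads of the status all equal the first one."""
    first = read_status()
    for _ in range(checks - 1):
        time.sleep(interval)
        if read_status() != first:
            # File or Position moved: something is still writing.
            return False
    return True
```

This would only make sense after pt-heartbeat and replication have been stopped, as in the run above, since either would keep advancing the position.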

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db2230.codfw.wmnet and MASTER_FROM db1125.eqiad.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet RESET SLAVE ALL.
  • MASTER_TO db2230.codfw.wmnet has no replication set.
  • MASTER_TO db1125.eqiad.wmnet heartbeat server IDs to delete are: [171978825]
  • MASTER_TO db1125.eqiad.wmnet DELETED heartbeat rows for server IDs [171978825]
  • MASTER_FROM db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_FROM db1125.eqiad.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db1125.eqiad.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad was aborted for section test-s4:
test-s4 (FAIL)

  • MASTER_FROM db2230.codfw.wmnet should be read only
  • Execution for this section was manually aborted

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_FROM db2230.codfw.wmnet and MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • Disabled GTID on MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet stopped pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER STATUS is stable over time: {'File': 'db1125-bin.000030', 'Position': 912391806, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db2230.codfw.wmnet CHANGE MASTER to ReplicationInfo(primary='db1125.eqiad.wmnet', binlog='db1125-bin.000030', position=912391806, port=3306) and user repl2024
  • MASTER_FROM db2230.codfw.wmnet START SLAVE
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified
  • MASTER_TO db1125.eqiad.wmnet started pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • MASTER_TO db1125.eqiad.wmnet replication from MASTER_FROM db2230.codfw.wmnet verified
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified after pt-heartbeat

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db1125.eqiad.wmnet and MASTER_FROM db2230.codfw.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
  • MASTER_TO db1125.eqiad.wmnet has no replication set.
  • MASTER_TO db2230.codfw.wmnet heartbeat server IDs to delete are: [180360463]
  • MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db2230.codfw.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db2230.codfw.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • Disabled GTID on MASTER_TO db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet stopped pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER STATUS is stable over time: {'File': 'db2230-bin.000010', 'Position': 479385816, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000010', position=479385816, port=3306) and user repl2024
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE
  • MASTER_FROM db1125.eqiad.wmnet replication from MASTER_TO db2230.codfw.wmnet verified
  • MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
  • MASTER_FROM db1125.eqiad.wmnet replication from MASTER_TO db2230.codfw.wmnet verified after pt-heartbeat

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db2230.codfw.wmnet and MASTER_FROM db1125.eqiad.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet RESET SLAVE ALL.
  • MASTER_TO db2230.codfw.wmnet has no replication set.
  • MASTER_TO db1125.eqiad.wmnet heartbeat server IDs to delete are: [171978825]
  • MASTER_FROM db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_FROM db1125.eqiad.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db1125.eqiad.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_FROM db2230.codfw.wmnet and MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • Disabled GTID on MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet stopped pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER STATUS is stable over time: {'File': 'db1125-bin.000030', 'Position': 912801410, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db2230.codfw.wmnet CHANGE MASTER to ReplicationInfo(primary='db1125.eqiad.wmnet', binlog='db1125-bin.000030', position=912801410, port=3306) and user repl2024
  • MASTER_FROM db2230.codfw.wmnet START SLAVE
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified
  • MASTER_TO db1125.eqiad.wmnet started pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • MASTER_TO db1125.eqiad.wmnet replication from MASTER_FROM db2230.codfw.wmnet verified
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified after pt-heartbeat

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (FAIL)

  • MASTER_FROM db2230.codfw.wmnet should be read only
  • Validated replication topology for section test-s4 between MASTER_TO db1125.eqiad.wmnet and MASTER_FROM db2230.codfw.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
  • MASTER_TO db1125.eqiad.wmnet has no replication set.
  • MASTER_TO db2230.codfw.wmnet heartbeat server IDs to delete are: [180360463]
  • MASTER_TO db2230.codfw.wmnet DELETED 1 heartbeat rows for server IDs [180360463]
  • MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db2230.codfw.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db2230.codfw.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_FROM db1125.eqiad.wmnet and MASTER_TO db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • Disabled GTID on MASTER_TO db2230.codfw.wmnet
  • MASTER_TO db2230.codfw.wmnet stopped pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet MASTER STATUS is stable over time: {'File': 'db2230-bin.000010', 'Position': 480264183, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db1125.eqiad.wmnet CHANGE MASTER to ReplicationInfo(primary='db2230.codfw.wmnet', binlog='db2230-bin.000010', position=480264183, port=3306) and user repl2024
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE
  • MASTER_FROM db1125.eqiad.wmnet replication from MASTER_TO db2230.codfw.wmnet verified
  • MASTER_TO db2230.codfw.wmnet started pt-heartbeat.
  • MASTER_TO db2230.codfw.wmnet START SLAVE.
  • MASTER_TO db2230.codfw.wmnet replication from MASTER_FROM db1125.eqiad.wmnet verified
  • MASTER_FROM db1125.eqiad.wmnet replication from MASTER_TO db2230.codfw.wmnet verified after pt-heartbeat

cookbooks.sre.switchdc.databases.prepare for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.
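
The "MASTER STATUS is stable over time" step in the prepare report above can be sketched roughly as follows: sample the primary's binlog coordinates several times and accept them only if they never change. This is an illustration under assumptions, not the cookbook's actual code. `get_status` is a hypothetical callable standing in for whatever runs `SHOW MASTER STATUS` on the host, and the number of attempts and the interval are made-up defaults.

```python
import time
from typing import Callable, Mapping


def master_status_is_stable(
    get_status: Callable[[], Mapping[str, object]],
    attempts: int = 3,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if every sample of the master status equals the first one.

    get_status is a hypothetical callable returning a mapping such as
    {'File': 'db2230-bin.000010', 'Position': 480264183, ...}.
    """
    if attempts < 2:
        raise ValueError("at least two samples are needed to compare")

    first = dict(get_status())
    for _ in range(attempts - 1):
        sleep(interval)
        # Any change in file or position means writes are still reaching the binlog.
        if dict(get_status()) != first:
            return False
    return True
```

The ordering in the report matters for this check: pt-heartbeat and replication are stopped on MASTER_TO first, so nothing should be writing to its binlog by the time the coordinates are sampled and passed to `CHANGE MASTER` on MASTER_FROM.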

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db2230.codfw.wmnet and MASTER_FROM db1125.eqiad.wmnet
  • MASTER_TO db2230.codfw.wmnet STOP SLAVE.
  • MASTER_TO db2230.codfw.wmnet RESET SLAVE ALL.
  • MASTER_TO db2230.codfw.wmnet has no replication set.
  • MASTER_TO db1125.eqiad.wmnet heartbeat server IDs to delete are: [171978825]
  • MASTER_TO db1125.eqiad.wmnet DELETED 1 heartbeat rows for server IDs [171978825]
  • MASTER_FROM db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_FROM db1125.eqiad.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db1125.eqiad.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db1125.eqiad.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from eqiad to codfw executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_FROM db2230.codfw.wmnet and MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER_USE_GTID=no.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • Disabled GTID on MASTER_TO db1125.eqiad.wmnet
  • MASTER_TO db1125.eqiad.wmnet stopped pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet MASTER STATUS is stable over time: {'File': 'db1125-bin.000030', 'Position': 914013989, 'Binlog_Do_DB': '', 'Binlog_Ignore_DB': ''}
  • MASTER_FROM db2230.codfw.wmnet CHANGE MASTER to ReplicationInfo(primary='db1125.eqiad.wmnet', binlog='db1125-bin.000030', position=914013989, port=3306) and user repl2024
  • MASTER_FROM db2230.codfw.wmnet START SLAVE
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified
  • MASTER_TO db1125.eqiad.wmnet started pt-heartbeat.
  • MASTER_TO db1125.eqiad.wmnet START SLAVE.
  • MASTER_TO db1125.eqiad.wmnet replication from MASTER_FROM db2230.codfw.wmnet verified
  • MASTER_FROM db2230.codfw.wmnet replication from MASTER_TO db1125.eqiad.wmnet verified after pt-heartbeat

cookbooks.sre.switchdc.databases.prepare for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad started by arnaudb@cumin1002

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad run successfully on section test-s4:
test-s4 (PASS)

  • Validated replication topology for section test-s4 between MASTER_TO db1125.eqiad.wmnet and MASTER_FROM db2230.codfw.wmnet
  • MASTER_TO db1125.eqiad.wmnet STOP SLAVE.
  • MASTER_TO db1125.eqiad.wmnet RESET SLAVE ALL.
  • MASTER_TO db1125.eqiad.wmnet has no replication set.
  • MASTER_TO db2230.codfw.wmnet heartbeat server IDs to delete are: [180360463]
  • MASTER_TO db2230.codfw.wmnet DELETED 1 heartbeat rows for server IDs [180360463]
  • MASTER_FROM db2230.codfw.wmnet STOP SLAVE.
  • MASTER_FROM db2230.codfw.wmnet MASTER_USE_GTID=slave_pos.
  • MASTER_FROM db2230.codfw.wmnet START SLAVE.
  • Enabled GTID on MASTER_FROM db2230.codfw.wmnet

cookbooks.sre.switchdc.databases.finalize for the switch from codfw to eqiad executed by arnaudb@cumin1002 completed.