魅力天空 posted on 2024-1-22 00:36:25

Analysis of the MySQL MHA Failover Process

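Startup
MHA is started with the masterha_manager script (default path after installation: /usr/local/bin/masterha_manager). During startup it actively checks that SSH connectivity to every node and the master-slave replication status are healthy. While running, the manager calls the masterha_master_monitor script (which in turn calls XXX/mha4mysql-manager-0.5?/lib/MHA/MasterMonitor.pm, HealthCheck.pm, and other modules) to probe the state of each node. The probe interval is set by the ping_interval parameter in the manager configuration file; if the master fails to respond to three consecutive probes, it is judged to be down.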
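The "three missed probes" rule can be sketched in a few lines. This is only an illustration of the counting logic described above, not MHA's Perl implementation; the function name, the `probe` callback, and the default values are assumptions made for the example.

```python
import time
from typing import Callable


def monitor_master(
    probe: Callable[[], bool],
    ping_interval: float = 3.0,   # illustrative value for ping_interval, in seconds
    max_failures: int = 3,        # consecutive misses before the master is declared down
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Probe until `max_failures` probes fail in a row.

    Returns the total number of probes that were made.
    """
    failures = 0
    probes = 0
    while True:
        probes += 1
        if probe():
            failures = 0          # any successful probe resets the counter
        else:
            failures += 1
            if failures >= max_failures:
                return probes     # master is judged to be down
        sleep(ping_interval)
```

With this scheme, the time needed to detect a dead master grows with ping_interval, so a shorter interval detects failures sooner at the cost of more probe traffic.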
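Master election on failure
---Check whether the configuration file designates a candidate master with candidate_master=1. If it does, and check_repl_delay=0 is also set, that node is promoted to be the new master.
---If no candidate master is designated, MHA compares how much log each slave has received and promotes the slave that is closest to the old master.
---Otherwise, the new master is chosen according to the order in which the nodes appear in the configuration file.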
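The three rules above can be written as a small selection function. This is a simplified sketch of the order described in this post, not MHA's actual election code, which performs further checks. The `Slave` class, the `log_position` field (a stand-in for "how much log the slave has received"), and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    host: str
    candidate_master: bool = False  # candidate_master=1 in the config file
    check_repl_delay: bool = True   # False models check_repl_delay=0
    log_position: int = 0           # hypothetical measure of received log volume


def choose_new_master(slaves: List[Slave]) -> Optional[Slave]:
    """Pick a new master from `slaves`, listed in config-file order."""
    if not slaves:
        return None
    # Rule 1: a designated candidate with check_repl_delay=0 wins outright.
    for slave in slaves:
        if slave.candidate_master and not slave.check_repl_delay:
            return slave
    # Rule 2: otherwise take the slave closest to the old master.
    best = max(slave.log_position for slave in slaves)
    # Rule 3: ties are broken by order of appearance in the config file.
    for slave in slaves:
        if slave.log_position == best:
            return slave
    return None
```

For example, given slaves "a" (position 10), "b" (candidate, check_repl_delay=0, position 5) and "c" (position 10), the function returns "b"; without "b" it returns "a", the first of the two equally advanced slaves.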
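Data compensation
---Check SSH connectivity to the old master. If it is reachable, the save_binary_logs script sends the missing binlog events to the slaves, where they are applied.
---If the old master is unreachable, the apply_diff_relay_logs script computes the relay log differences between the slaves and applies them to the other slaves.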
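The branch described above comes down to a single decision on SSH reachability. The sketch below only mirrors the two cases as stated in this post; the function name is hypothetical, and the returned strings are simply the script names mentioned above.

```python
def choose_recovery_script(master_ssh_reachable: bool) -> str:
    """Return the name of the helper script this post associates with each case."""
    if master_ssh_reachable:
        # Old master still reachable over SSH: fetch the missing binlog from it.
        return "save_binary_logs"
    # Old master unreachable: fall back to relay log differences between slaves.
    return "apply_diff_relay_logs"
```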
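Role switch
The newly elected master drops its slave role, and the remaining slaves set up replication from the new master.
VIP migration
The virtual IP is bound to the new master.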
 
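Discussion
What happens if the master recovers while a failover is in progress?
It depends: the failover may continue, or it may be aborted. Below is the log of an aborted failover.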
Sat Jan 20 09:27:28 2024 - Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 431.
Sat Jan 20 09:27:28 2024 - Executing SSH check script: exit 0
Sat Jan 20 09:27:32 2018 - Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.171.172.171' (4))
Sat Jan 20 09:27:32 2018 - Connection failed 2 time(s)..
Sat Jan 20 09:27:34 2024 - HealthCheck: Got timeout on checking SSH connection to 172.171.172.171! at /usr/local/share/perl5/MHA/HealthCheck.pm line 342.
Sat Jan 20 09:27:35 2024 - Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.171.172.171' (4))
Sat Jan 20 09:27:35 2024 - Connection failed 3 time(s)..
Sat Jan 20 09:27:38 2024 - Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.171.172.171' (4))
Sat Jan 20 09:27:38 2024 - Connection failed 4 time(s)..
Sat Jan 20 09:27:38 2024 - Master is not reachable from health checker!
Sat Jan 20 09:27:38 2024 - Master 172.171.172.171(172.171.172.171:3307) is not reachable!
Sat Jan 20 09:27:38 2024 - SSH is NOT reachable.
Sat Jan 20 09:27:38 2024 - Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /data/mhacnf/qqweixinod.cnf again, and trying to connect to all servers to check server status..
Sat Jan 20 09:27:38 2024 - Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat Jan 20 09:27:38 2024 - Reading application default configuration from /data/mhacnf/qqweixinod.cnf..
Sat Jan 20 09:27:38 2024 - Reading server configuration from /data/mhacnf/qqweixinod.cnf..
Sat Jan 20 09:27:39 2024 - GTID failover mode = 1
Sat Jan 20 09:27:39 2024 - Dead Servers:
Sat Jan 20 09:27:39 2024 -    172.171.172.171(172.171.172.171:3307)
Sat Jan 20 09:27:39 2024 - Alive Servers:
Sat Jan 20 09:27:39 2024 -    172.171.172.172(172.171.172.172:3307)
Sat Jan 20 09:27:39 2024 -    172.171.172.173(172.171.172.173:3307)
Sat Jan 20 09:27:39 2024 - Alive Slaves:
Sat Jan 20 09:27:39 2024 -    172.171.172.172(172.171.172.172:3307)Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Sat Jan 20 09:27:39 2024 -    GTID ON
Sat Jan 20 09:27:39 2024 -    Replicating from 172.171.172.171(172.171.172.171:3307)
Sat Jan 20 09:27:39 2024 -    Primary candidate for the new Master (candidate_master is set)
Sat Jan 20 09:27:39 2024 -    172.171.172.173(172.171.172.173:3307)Version=5.7.21-log (oldest major version between slaves) log-bin:enabled
Sat Jan 20 09:27:39 2024 -    GTID ON
Sat Jan 20 09:27:39 2024 -    Replicating from 172.171.172.171(172.171.172.171:3307)
Sat Jan 20 09:27:39 2024 - Checking slave configurations..
Sat Jan 20 09:27:39 2024 - Checking replication filtering settings..
Sat Jan 20 09:27:39 2024 - Replication filtering check ok.
Sat Jan 20 09:27:39 2024 - Master is down!
Sat Jan 20 09:27:39 2024 - Terminating monitoring script.
Sat Jan 20 09:27:39 2024 - Got exit code 20 (Master dead).
Sat Jan 20 09:27:39 2024 - MHA::MasterFailover version 0.56.
Sat Jan 20 09:27:39 2024 - Starting master failover.
Sat Jan 20 09:27:39 2024 -
Sat Jan 20 09:27:39 2024 - * Phase 1: Configuration Check Phase..
Sat Jan 20 09:27:39 2024 -
Sat Jan 20 09:27:40 2024 - GTID failover mode = 1
Sat Jan 20 09:27:40 2024 - Dead Servers:
Sat Jan 20 09:27:40 2024 -    172.171.172.171(172.171.172.171:3307)
Sat Jan 20 09:27:40 2018 - Checking master reachability via MySQL(double check)...
Sat Jan 20 09:27:40 2018 - The master 172.171.172.171(172.171.172.171:3307) is reachable via MySQL (error=1:Connection Succeeded) ! Stop failover.
Sat Jan 20 09:27:40 2018 - Got ERROR:at /usr/local/bin/masterha_manager line 65.
Note: 3307 in the log is the database port, so it is nothing unusual.
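If the master is found to have recovered during (or before) the "Checking master reachability via MySQL(double check)" step, the failover is abandoned. The MHA process itself also exits (it is killed), so masterha_manager has to be restarted manually.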
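Because the manager exits after such an abort, it can be useful to spot the condition in the manager log. The sketch below assumes the message format shown in the log excerpt above; the function name and the regular expression are illustrative and are not part of MHA.

```python
import re
from typing import Iterable, Optional

# Matches the "Stop failover" message as it appears in the log excerpt above.
STOP_PATTERN = re.compile(
    r"The master (\S+) is reachable via MySQL .*Stop failover\."
)


def find_aborted_failover(lines: Iterable[str]) -> Optional[str]:
    """Return the host info of the master that came back, or None if no abort is logged."""
    for line in lines:
        match = STOP_PATTERN.search(line)
        if match:
            return match.group(1)
    return None
```

A monitoring job could feed the manager log into this function and raise an alert when it returns a value, since at that point masterha_manager is no longer running.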
The "Checking master reachability via MySQL(double check)" step is implemented in MasterFailover.pm.
The source code is as follows:
# quick check that the dead server is really dead
# not double check when ping_type is insert,
# because check_connection_fast_util can return true if insert-check detects I/O failure.
if ( $servers_config->{ping_type} ne $MHA::ManagerConst::PING_TYPE_INSERT )
{
    $log->info("Checking master reachability via MySQL(double check)...");
    if (
        my $rc = MHA::DBHelper::check_connection_fast_util(
            $dead_master->{hostname}, $dead_master->{port},
            $dead_master->{user},     $dead_master->{password}
        )
      )
    {
        # the master answered, so abort the failover
        $log->error(
            sprintf(
                "The master %s is reachable via MySQL (error=%s) ! Stop failover.",
                $dead_master->get_hostinfo(), $rc
            )
        );
        croak;
    }
    $log->info(" ok.");
}
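For readers less familiar with Perl, the control flow of the excerpt above can be paraphrased in Python. This is a rough sketch rather than a port: `PING_TYPE_INSERT` is a stand-in for `$MHA::ManagerConst::PING_TYPE_INSERT` with an assumed value, `can_connect` stands in for `MHA::DBHelper::check_connection_fast_util`, and `FailoverAborted` plays the role of `croak`.

```python
from typing import Callable

# Assumed placeholder for $MHA::ManagerConst::PING_TYPE_INSERT.
PING_TYPE_INSERT = "INSERT"


class FailoverAborted(Exception):
    """Raised when the supposedly dead master answers again."""


def double_check_master(ping_type: str, can_connect: Callable[[], bool]) -> str:
    """Re-check the dead master before continuing the failover."""
    if ping_type == PING_TYPE_INSERT:
        # The excerpt skips the double check for insert-type pings.
        return "skipped"
    if can_connect():
        # The master is reachable via MySQL: stop the failover.
        raise FailoverAborted("master is reachable via MySQL, stop failover")
    return "ok"
```

The key point is that a successful connection is the failure case here: it means the master is alive, so the failover must not proceed.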


Source: https://www.cnblogs.com/xuliuzai/p/17978546