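MySQL High Availability

MySQL High-Availability Solutions

MySQL itself and the community have released many high-availability solutions. The main ones are listed below for reference only (figures cited from Percona).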

Method	Level of Availability
Simple replication	98-99.9%
Master-Master/MMM	99%
SAN (storage area network)	99.5-99.9%
DRBD, MHA, Tungsten Replicator	99.9%
NDBCluster, Galera Cluster	99.999%
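
To make the percentages concrete, the sketch below converts an availability level into the downtime it allows per year. It assumes a 365-day year (525600 minutes); the function name downtime_minutes_per_year is illustrative and not part of any MySQL or MHA tool.

```shell
#!/usr/bin/env bash
# Convert an availability percentage into allowed downtime per year.
# Assumption: a 365-day year, i.e. 525600 minutes.
downtime_minutes_per_year() {
    # LC_ALL=C keeps the decimal point independent of the user's locale
    LC_ALL=C awk -v pct="$1" 'BEGIN { printf "%.1f\n", (100 - pct) / 100 * 525600 }'
}

# Print the yearly downtime budget for the levels listed in the table
for pct in 98 99 99.5 99.9 99.999; do
    printf '%s%% -> %s minutes/year\n' "$pct" "$(downtime_minutes_per_year "$pct")"
done
```

For example, 99.9% leaves roughly 8.8 hours of downtime per year, while 99.999% leaves only a few minutes.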

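MMM (Multi-Master Replication Manager for MySQL): a flexible set of scripts written in Perl that monitors MySQL replication, performs failover, and manages the configuration of MySQL master-master replication (only one node is writable at any given time).
- Official site: http://www.mysql-mmm.org
  https://code.google.com/archive/p/mysql-master-master/downloads
MHA (Master High Availability): monitors the master node and can fail over automatically by promoting one of the slaves to be the new master. It is built on master-slave replication and also needs cooperation from the client side. MHA currently mainly supports one-master, multi-slave topologies. Building MHA requires at least three database servers in a replication cluster, one master and two slaves: one serves as the master, one as the standby master, and one as a slave. To save on machine cost, Taobao modified it, and Taobao's TMHA now supports one master with one slave.
- Official sites: https://code.google.com/archive/p/mysql-master-ha/
  https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads
  https://github.com/yoshinorim/mha4mysql-manager/releases
  https://github.com/yoshinorim/mha4mysql-node/releases/tag/v0.58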

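The following technologies can meet finance-grade high-availability requirements:

Galera Cluster: wsrep (MySQL extended with the Write Set Replication)
Replicates globally through the wsrep protocol. Any node can serve reads and writes, so no master-slave replication is needed and multi-master read/write is possible.
GR (Group Replication): the group replication technology provided officially by MySQL (introduced in MySQL 5.7.17). It builds on native replication and a Paxos-based algorithm to support multi-master updates. A replication group consists of multiple server members. Each server in the group can execute transactions independently, but every read-write transaction commits only after conflict detection succeeds.

For example, three nodes communicate with each other. Whenever an event occurs, it is propagated to the other nodes and negotiated: if a majority of nodes agree, the event goes through; otherwise it fails or is rolled back. The group can run in single-primary or multi-primary mode. In single-primary mode only one primary accepts writes, and a new primary is elected automatically when it fails. In multi-primary mode every node accepts writes, so there is no master-slave concept.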
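
The majority rule above comes down to simple arithmetic. The sketch below is not part of Group Replication; the function names are made up for illustration. It computes how many members form a majority and how many member failures a group of n can tolerate while still reaching a majority.

```shell
#!/usr/bin/env bash
# Majority (quorum) size for a group of n members: floor(n/2) + 1
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that may fail while a majority still exists: floor((n-1)/2)
tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
    echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

With three members, two must agree and one failure is tolerated. A fourth member raises the quorum to three without tolerating any additional failure, which is why odd group sizes are the usual choice.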

MHA Master High Availability

How MHA Works and Its Architecture

Official documentation

https://github.com/yoshinorim/mha4mysql-manager/wiki

MHA cluster architecture

[Figure: MHA cluster architecture]

How MHA works

[Figure: how MHA works]

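MHA uses the statement SELECT 1 As Value to check the health of the master. Once the master goes down, MHA performs the following steps:
1. Save the binary log events (binlog events) from the crashed master
2. Identify the slave that holds the most recent updates
3. Apply the differing relay log events (relay log) to the other slaves
4. Apply the binary log events saved from the master to all slaves
5. Promote one slave to be the new master
6. Point the other slaves at the new master for replication
7. Remove the failed server from the cluster automatically (masterha_conf_host) by deleting its configuration entry
8. Move the old master's VIP to the new master so that applications can reach the new master
MHA is a one-shot high-availability solution: the Manager exits automatically once the failover is done.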

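Electing the new master

If a priority is set (candidate_master=1), that slave is designated as the new master. By default, however, a slave that is more than 100 MB of relay logs behind the master's binary log loses this priority. With check_repl_delay=0 it is still chosen as the new master even if it lags far behind.
If the slaves' data differ, the slave closest to the master becomes the new master.
If all slaves hold the same data, the one listed first in the configuration file becomes the new master.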
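
The three rules above can be sketched as a small selection function. This is a simplified model of the rules as described here, not MHA's actual implementation. The input format (one slave per line in configuration-file order: host, candidate_master flag, bytes behind the master) and the function name pick_new_master are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Simplified model of the election rules described above (not MHA's real code).
# stdin: one slave per line, in config-file order: "<host> <candidate 0|1> <bytes_behind>"
# $1:    value of check_repl_delay (1 = enforce the 100 MB limit, 0 = ignore lag)
pick_new_master() {
    local check_repl_delay="$1"
    awk -v check="$check_repl_delay" -v limit="$((100 * 1024 * 1024))" '
        # Rule 1: the first candidate_master=1 slave wins, unless the lag check
        # is enabled and the slave is more than 100 MB behind
        $2 == 1 && chosen == "" && (check == 0 || $3 + 0 <= limit) { chosen = $1 }
        # Rules 2 and 3: otherwise the slave closest to the master wins;
        # on a tie the slave listed first in the config file is kept
        NR == 1 || $3 + 0 < best_lag { best_lag = $3 + 0; best = $1 }
        END { print (chosen != "" ? chosen : best) }
    '
}

# A candidate that lags 200 MB is skipped by default but forced with check_repl_delay=0
printf '%s\n' '10.0.0.18 1 209715200' '10.0.0.28 0 0' | pick_new_master 1
printf '%s\n' '10.0.0.18 1 209715200' '10.0.0.28 0 0' | pick_new_master 0
```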

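Data recovery

If the master is still reachable over SSH, the slaves compare their position or GTID with the master's, and the missing binary logs are saved to each slave and applied (done by save_binary_logs).
If the master is not reachable over SSH, the relay log differences between the slaves are compared and applied (done by apply_diff_relay_logs).

Note:

To minimize data loss when the master goes down because of hardware failure, it is recommended to configure MySQL semi-synchronous replication alongside MHA.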

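MHA software

MHA consists of two parts: the Manager package and the Node package.

The Manager package mainly includes the following tools:

masterha_check_ssh              Check MHA's SSH configuration
masterha_check_repl             Check the MySQL replication status
masterha_manager                Start MHA
masterha_check_status           Check the current running state of MHA
masterha_master_monitor         Detect whether the master is down
masterha_master_switch          Failover (automatic or manual)
masterha_conf_host              Add or remove a configured server entry
masterha_stop --conf=app1.cnf   Stop MHA
masterha_secondary_check        Check the availability of the MySQL master over two or more network routes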

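The Node package (these tools are normally triggered by MHA Manager scripts and need no manual operation) mainly includes the following tools:

save_binary_logs        #Save and copy the master's binary logs
apply_diff_relay_logs   #Identify differing relay log events and apply them to the other slaves
filter_mysqlbinlog      #Strip unnecessary ROLLBACK events (MHA no longer uses this tool)
purge_relay_logs        #Purge relay logs (does not block the SQL thread)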

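MHA custom extensions:

secondary_check_script          #Check the master's availability over multiple network routes
master_ip_failover_script       #Update the master IP used by the application
shutdown_script                 #Force the master node to shut down
report_script                   #Send a report
init_conf_load_script           #Load initial configuration parameters
master_ip_online_change_script  #Update the master node's IP address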

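MHA configuration files:

Global configuration: provides defaults for every application; the default path is /etc/masterha_default.cnf
Application configuration: one per master-slave replication cluster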

Hands-on Example: Implementing MHA

Environment: four hosts
10.0.0.7 CentOS7 MHA manager
10.0.0.8 CentOS8 MySQL8.0 Master
10.0.0.18 CentOS8 MySQL8.0 Slave1
10.0.0.28 CentOS8 MySQL8.0 Slave2

              -> master
MHA manager   -> slave1
              -> slave2

Install the two packages mha4mysql-manager and mha4mysql-node on the manager node

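Notes:

mha4mysql-manager-0.58-0.el7.centos.noarch.rpm can be installed only on CentOS 7, not on CentOS 8. It supports MySQL 5.7 and MySQL 8.0 but is not compatible with the MariaDB 10.3.17 shipped with CentOS 8.
mha4mysql-manager-0.56-0.el6.noarch.rpm does not support CentOS 8; it supports only CentOS 7 and earlier.

The two packages

mha4mysql-manager
mha4mysql-node
#Download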
https://github.com/yoshinorim/mha4mysql-manager/wiki/Downloads
https://github.com/yoshinorim/mha4mysql-node/releases/tag/v0.58

Example:

[root@mha-manager ~]#yum -y install mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
[root@mha-manager ~]#yum -y install mha4mysql-node-0.58-0.el7.centos.noarch.rpm

Install the mha4mysql-node package on all MySQL servers

This package supports CentOS 8, 7, and 6

mha4mysql-node

Example:

[root@master ~]#yum -y install mha4mysql-node-0.58-0.el7.centos.noarch.rpm

Set up mutual SSH key authentication between all nodes

[root@mha-manager ~]#ssh-keygen
[root@mha-manager ~]#ssh-copy-id 127.0.0.1
[root@mha-manager ~]#rsync -av .ssh 10.0.0.8:/root/
[root@mha-manager ~]#rsync -av .ssh 10.0.0.18:/root/
[root@mha-manager ~]#rsync -av .ssh 10.0.0.28:/root/

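Create the configuration file on the manager node

Note: do not leave spaces or other characters at the end of any line in this file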

[root@mha-manager ~]#mkdir /etc/mastermha/
[root@mha-manager ~]#vim /etc/mastermha/app1.cnf

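[server default]
user=mhauser #User for remote connections to all MySQL nodes; needs administrator privileges
password=ayaka
manager_workdir=/data/mastermha/app1/ #Created automatically; no need to create it by hand
manager_log=/data/mastermha/app1/manager.log
remote_workdir=/data/mastermha/app1/
ssh_user=root #Used for key-based remote SSH connections to access the binary logs
repl_user=repluser #Replication user
repl_password=ayaka
ping_interval=1 #Interval between health checks, in seconds
master_ip_failover_script=/usr/local/bin/master_ip_failover #Perl script that moves the VIP; does not work across networks; Keepalived can be used instead
report_script=/usr/local/bin/sendmail.sh #Script run to send an alert
check_repl_delay=0 #Default is 1: a slave more than 100 MB of relay logs behind the master is not chosen as the new master, because recovering it would take a long time. With check_repl_delay=0, MHA ignores replication delay when it triggers a failover. This is very useful for a slave with candidate_master=1, because it ensures that slave can become the new master
master_binlog_dir=/data/mysql/ #Directory holding the binary logs; required by mha4mysql-manager-0.58, not needed by earlier versions

[server1]
hostname=10.0.0.8
port=3306
candidate_master=1
[server2]
hostname=10.0.0.18
port=3306
candidate_master=1 #Marks this server as the preferred candidate master; it is preferred even if it is not the slave with the latest events
[server3]
hostname=10.0.0.28
port=3306

#Final file content (explanatory comments removed)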
[root@mha-manager ~]#cat /etc/mastermha/app1.cnf
[server default]
user=mhauser
password=ayaka
manager_workdir=/data/mastermha/app1/
manager_log=/data/mastermha/app1/manager.log
remote_workdir=/data/mastermha/app1/
ssh_user=root
repl_user=repluser
repl_password=ayaka
ping_interval=1
master_ip_failover_script=/usr/local/bin/master_ip_failover
report_script=/usr/local/bin/sendmail.sh
check_repl_delay=0
master_binlog_dir=/data/mysql/

[server1]
hostname=10.0.0.8
candidate_master=1
[server2]
hostname=10.0.0.18
candidate_master=1
[server3]
hostname=10.0.0.28
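
Because lines in this file must not end with spaces or other stray characters (see the note above), a quick check can be run before starting MHA. This is a minimal sketch; the function name has_trailing_whitespace is hypothetical and not an MHA tool.

```shell
#!/usr/bin/env bash
# Print every line (with its line number) that ends in whitespace.
# Returns 0 if at least one such line exists, non-zero otherwise.
# A carriage return from Windows line endings also counts as whitespace.
has_trailing_whitespace() {
    grep -nE '[[:space:]]+$' "$1"
}

# Usage (path taken from the example above):
#   if has_trailing_whitespace /etc/mastermha/app1.cnf; then
#       echo "remove the trailing whitespace on the lines listed above"
#   fi
```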

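Note: which node takes over as the new master when the master goes down

1. If the logs on all slaves are identical, the new master is chosen by default in configuration-file order
2. If the logs on the slaves differ, the slave closest to the master is chosen automatically as the new master
3. If a node has a priority set (candidate_master=1), it is preferred. However, it is not chosen if it is more than 100 MB of logs behind the master. Setting check_repl_delay=0 turns off this lag check and forces the candidate node to be chosen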

Set up the master

[root@master ~]#dnf -y install mysql-server
[root@master ~]#mkdir /data/mysql/
[root@master ~]#chown mysql:mysql /data/mysql/

[root@master ~]#vim /etc/my.cnf
[mysqld]
server_id=1
log-bin=/data/mysql/mysql-bin
skip_name_resolve=1
general_log #For observing the results; optional, not needed in production

[root@master ~]#systemctl enable --now mysqld

[root@master ~]#mysql
mysql>show master logs;

#On MySQL 8.0, run the following
mysql> create user repluser@'10.0.0.%' identified by 'ayaka';
mysql> grant replication slave on *.* to repluser@'10.0.0.%';
mysql> create user mhauser@'10.0.0.%' identified by 'ayaka';
mysql> grant all on *.* to mhauser@'10.0.0.%';

#On versions before MySQL 8.0, run the following
mysql>grant replication slave on *.* to repluser@'10.0.0.%' identified by 'ayaka';
mysql>grant all on *.* to mhauser@'10.0.0.%' identified by 'ayaka';

#Configure the VIP
[root@master ~]#ifconfig eth0:1 10.0.0.100/24

Set up the slaves

[root@slave ~]#dnf -y install mysql-server

[root@slave ~]#mkdir /data/mysql
[root@slave ~]#chown mysql:mysql /data/mysql/

[root@slave ~]#vim /etc/my.cnf
[mysqld]
server_id=2 #Must be different on every node
log-bin=/data/mysql/mysql-bin
read_only
relay_log_purge=0 #Do not automatically purge relay logs that have been applied
skip_name_resolve=1 #Disable reverse DNS resolution
general_log #For easier observation; not needed in production

[root@slave ~]#systemctl enable --now mysqld

[root@slave ~]#mysql
#The log file name and position come from the output of show master logs on the master
mysql>CHANGE MASTER TO MASTER_HOST='10.0.0.8', MASTER_USER='repluser',
MASTER_PASSWORD='ayaka', MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=156;
mysql>START SLAVE;
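
After START SLAVE it is worth confirming on each slave that both replication threads are running. The following is a minimal sketch that reads the output of SHOW SLAVE STATUS\G on standard input and checks the Slave_IO_Running and Slave_SQL_Running fields. The function name replication_ok is an assumption for illustration and is not part of MySQL or MHA.

```shell
#!/usr/bin/env bash
# Succeeds (exit status 0) only if both replication threads report "Yes".
# stdin: the output of  SHOW SLAVE STATUS\G
replication_ok() {
    awk -F': *' '
        /Slave_IO_Running:/  { io  = $2 }   # I/O thread: pulls binlog events from the master
        /Slave_SQL_Running:/ { sql = $2 }   # SQL thread: applies events from the relay log
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Usage on a slave (assumes the mysql client can log in without extra options):
#   mysql -e 'SHOW SLAVE STATUS\G' | replication_ok && echo "replication threads running"
```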

Check the MHA environment

#Check the environment
[root@mha-manager ~]#masterha_check_ssh --conf=/etc/mastermha/app1.cnf
[root@mha-manager ~]#masterha_check_repl --conf=/etc/mastermha/app1.cnf

#Check the status
[root@mha-manager ~]#masterha_check_status --conf=/etc/mastermha/app1.cnf

Example:

[root@mha-manager ~]#masterha_check_ssh --conf=/etc/mastermha/app1.cnf
Wed Jun 17 09:59:41 2020 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Wed Jun 17 09:59:41 2020 - [info] Reading application default configuration from
/etc/mastermha/app1.cnf..
Wed Jun 17 09:59:41 2020 - [info] Reading server configuration from
/etc/mastermha/app1.cnf..
Wed Jun 17 09:59:41 2020 - [info] Starting SSH connection tests..
Wed Jun 17 09:59:42 2020 - [debug]
Wed Jun 17 09:59:41 2020 - [debug] Connecting via SSH from
root@10.0.0.8(10.0.0.8:22) to root@10.0.0.18(10.0.0.18:22)..
Wed Jun 17 09:59:42 2020 - [debug] ok.
Wed Jun 17 09:59:42 2020 - [debug] Connecting via SSH from
root@10.0.0.8(10.0.0.8:22) to root@10.0.0.28(10.0.0.28:22)..
Wed Jun 17 09:59:42 2020 - [debug] ok.
Wed Jun 17 09:59:43 2020 - [debug]
Wed Jun 17 09:59:42 2020 - [debug] Connecting via SSH from
root@10.0.0.18(10.0.0.18:22) to root@10.0.0.8(10.0.0.8:22)..
Wed Jun 17 09:59:42 2020 - [debug] ok.
Wed Jun 17 09:59:42 2020 - [debug] Connecting via SSH from
root@10.0.0.18(10.0.0.18:22) to root@10.0.0.28(10.0.0.28:22)..
Wed Jun 17 09:59:43 2020 - [debug] ok.
Wed Jun 17 09:59:44 2020 - [debug]
Wed Jun 17 09:59:42 2020 - [debug] Connecting via SSH from
root@10.0.0.28(10.0.0.28:22) to root@10.0.0.8(10.0.0.8:22)..
Wed Jun 17 09:59:43 2020 - [debug] ok.
Wed Jun 17 09:59:43 2020 - [debug] Connecting via SSH from
root@10.0.0.28(10.0.0.28:22) to root@10.0.0.18(10.0.0.18:22)..
Wed Jun 17 09:59:43 2020 - [debug] ok.
Wed Jun 17 09:59:44 2020 - [info] All SSH connection tests passed successfully.


[root@mha-manager ~]#masterha_check_repl --conf=/etc/mastermha/app1.cnf
Wed Jun 17 10:00:56 2020 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Wed Jun 17 10:00:56 2020 - [info] Reading application default configuration from
/etc/mastermha/app1.cnf..
Wed Jun 17 10:00:56 2020 - [info] Reading server configuration from
/etc/mastermha/app1.cnf..
Wed Jun 17 10:00:56 2020 - [info] MHA::MasterMonitor version 0.56.
Creating directory /data/mastermha/app1/.. done.
Wed Jun 17 10:00:58 2020 - [info] GTID failover mode = 0
Wed Jun 17 10:00:58 2020 - [info] Dead Servers:
Wed Jun 17 10:00:58 2020 - [info] Alive Servers:
Wed Jun 17 10:00:58 2020 - [info] 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:00:58 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Wed Jun 17 10:00:58 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:00:58 2020 - [info] Alive Slaves:
Wed Jun 17 10:00:58 2020 - [info] 10.0.0.18(10.0.0.18:3306) Version=10.3.17-
MariaDB-log (oldest major version between slaves) log-bin:enabled
Wed Jun 17 10:00:58 2020 - [info] Replicating from 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:00:58 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:00:58 2020 - [info] 10.0.0.28(10.0.0.28:3306) Version=10.3.17-
MariaDB-log (oldest major version between slaves) log-bin:enabled
Wed Jun 17 10:00:58 2020 - [info] Replicating from 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:00:58 2020 - [info] Current Alive Master: 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:00:58 2020 - [info] Checking slave configurations..
Wed Jun 17 10:00:58 2020 - [info] Checking replication filtering settings..
Wed Jun 17 10:00:58 2020 - [info] binlog_do_db= , binlog_ignore_db=
Wed Jun 17 10:00:58 2020 - [info] Replication filtering check ok.
Wed Jun 17 10:00:58 2020 - [info] GTID (with auto-pos) is not supported
Wed Jun 17 10:00:58 2020 - [info] Starting SSH connection tests..
Wed Jun 17 10:01:00 2020 - [info] All SSH connection tests passed successfully.
Wed Jun 17 10:01:00 2020 - [info] Checking MHA Node version..
Wed Jun 17 10:01:01 2020 - [info] Version check ok.
Wed Jun 17 10:01:01 2020 - [info] Checking SSH publickey authentication settings
on the current master..
Wed Jun 17 10:01:01 2020 - [info] HealthCheck: SSH to 10.0.0.8 is reachable.
Wed Jun 17 10:01:01 2020 - [info] Master MHA Node version is 0.56.
Wed Jun 17 10:01:01 2020 - [info] Checking recovery script configurations on
10.0.0.8(10.0.0.8:3306)..
Wed Jun 17 10:01:01 2020 - [info] Executing command: save_binary_logs --
command=test --start_pos=4 --binlog_dir=/var/lib/mysql,/var/log/mysql --
output_file=/data/mastermha/app1//save_binary_logs_test --manager_version=0.56 -
-start_file=mariadb-bin.000002
Wed Jun 17 10:01:01 2020 - [info] Connecting to root@10.0.0.8(10.0.0.8:22)..
Creating /data/mastermha/app1 if not exists.. Creating directory
/data/mastermha/app1.. done.
ok.
Checking output directory is accessible or not..
ok.
Binlog found at /var/lib/mysql, up to mariadb-bin.000002
Wed Jun 17 10:01:02 2020 - [info] Binlog setting check done.
Wed Jun 17 10:01:02 2020 - [info] Checking SSH publickey authentication and
checking recovery script configurations on all alive slave servers..
Wed Jun 17 10:01:02 2020 - [info] Executing command : apply_diff_relay_logs --
command=test --slave_user='mhauser' --slave_host=10.0.0.18 --slave_ip=10.0.0.18
--slave_port=3306 --workdir=/data/mastermha/app1/ --target_version=10.3.17-
MariaDB-log --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-
log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Wed Jun 17 10:01:02 2020 - [info] Connecting to root@10.0.0.18(10.0.0.18:22)..
Creating directory /data/mastermha/app1/.. done.
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to mariadb-relay-bin.000002
Temporary relay log file is /var/lib/mysql/mariadb-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Wed Jun 17 10:01:02 2020 - [info] Executing command : apply_diff_relay_logs --
command=test --slave_user='mhauser' --slave_host=10.0.0.28 --slave_ip=10.0.0.28
--slave_port=3306 --workdir=/data/mastermha/app1/ --target_version=10.3.17-
MariaDB-log --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-
log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Wed Jun 17 10:01:02 2020 - [info] Connecting to root@10.0.0.28(10.0.0.28:22)..
Creating directory /data/mastermha/app1/.. done.
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to mariadb-relay-bin.000002
Temporary relay log file is /var/lib/mysql/mariadb-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Wed Jun 17 10:01:03 2020 - [info] Slaves settings check done.
Wed Jun 17 10:01:03 2020 - [info]
10.0.0.8(10.0.0.8:3306) (current master)
+--10.0.0.18(10.0.0.18:3306)
+--10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:01:03 2020 - [info] Checking replication health on 10.0.0.18..
Wed Jun 17 10:01:03 2020 - [info] ok.
Wed Jun 17 10:01:03 2020 - [info] Checking replication health on 10.0.0.28..
Wed Jun 17 10:01:03 2020 - [info] ok.
Wed Jun 17 10:01:03 2020 - [warning] master_ip_failover_script is not defined.
Wed Jun 17 10:01:03 2020 - [warning] shutdown_script is not defined.
Wed Jun 17 10:01:03 2020 - [info] Got exit code 0 (Not master dead).
MySQL Replication Health is OK.

[root@mha-manager ~]#masterha_check_status --conf=/etc/mastermha/app1.cnf
app1 is stopped(2:NOT_RUNNING).

Start MHA

#Start MHA; it runs in the foreground by default, and in production it is usually run in the background
nohup masterha_manager --conf=/etc/mastermha/app1.cnf --remove_dead_master_conf --ignore_last_failover &> /dev/null &

#Test environment:
#masterha_manager --conf=/etc/mastermha/app1.cnf --remove_dead_master_conf --ignore_last_failover

#To stop MHA running in the background, run the following command
[root@mha-manager ~]#masterha_stop --conf=/etc/mastermha/app1.cnf
Stopped app1 successfully.

#Check the status
masterha_check_status --conf=/etc/mastermha/app1.cnf

Example:

[root@mha-manager ~]#masterha_manager --conf=/etc/mastermha/app1.cnf --remove_dead_master_conf --ignore_last_failover
Wed Jun 17 10:02:58 2020 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Wed Jun 17 10:02:58 2020 - [info] Reading application default configuration from
/etc/mastermha/app1.cnf..
Wed Jun 17 10:02:58 2020 - [info] Reading server configuration from
/etc/mastermha/app1.cnf..

#The health checks can be seen in the master's general log
[root@master ~]#tail -f /var/lib/mysql/centos8.log
200617 20:14:16 28 Query SELECT 1 As Value
200617 20:14:17 28 Query SELECT 1 As Value
200617 20:14:18 28 Query SELECT 1 As Value
200617 20:14:19 28 Query SELECT 1 As Value
200617 20:14:20 28 Query SELECT 1 As Value
200617 20:14:21 28 Query SELECT 1 As Value

[root@mha-manager ~]#masterha_check_status --conf=/etc/mastermha/app1.cnf
app1 (pid:25994) is running(0:PING_OK), master:10.0.0.8
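
For scripting, the status line can be parsed to find out which host MHA currently treats as the master. This is a minimal sketch that assumes the output format shown in the example above; mha_current_master is a hypothetical helper, not an MHA tool.

```shell
#!/usr/bin/env bash
# Extract the master address from a masterha_check_status line such as
#   app1 (pid:25994) is running(0:PING_OK), master:10.0.0.8
# Prints nothing when the line does not report a healthy, running manager.
mha_current_master() {
    sed -n 's/^.* is running(0:PING_OK), master:\(.*\)$/\1/p'
}

# Usage:
#   masterha_check_status --conf=/etc/mastermha/app1.cnf | mha_current_master
```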

Troubleshooting log

tail /data/mastermha/app1/manager.log

Example:

[root@mha-manager ~]#cat /data/mastermha/app1/manager.log
Wed Jun 17 10:02:58 2020 - [info] MHA::MasterMonitor version 0.56.
Wed Jun 17 10:03:00 2020 - [info] GTID failover mode = 0
Wed Jun 17 10:03:00 2020 - [info] Dead Servers:
Wed Jun 17 10:03:00 2020 - [info] Alive Servers:
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:03:00 2020 - [info] Alive Slaves:
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.18(10.0.0.18:3306) Version=10.3.17-
MariaDB-log (oldest major version between slaves) log-bin:enabled
Wed Jun 17 10:03:00 2020 - [info] Replicating from 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.28(10.0.0.28:3306) Version=10.3.17-
MariaDB-log (oldest major version between slaves) log-bin:enabled
Wed Jun 17 10:03:00 2020 - [info] Replicating from 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] Current Alive Master: 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] Checking slave configurations..
Wed Jun 17 10:03:00 2020 - [info] Checking replication filtering settings..
Wed Jun 17 10:03:00 2020 - [info] binlog_do_db= , binlog_ignore_db=
Wed Jun 17 10:03:00 2020 - [info] Replication filtering check ok.
Wed Jun 17 10:03:00 2020 - [info] GTID (with auto-pos) is not supported
Wed Jun 17 10:03:00 2020 - [info] Starting SSH connection tests..
Wed Jun 17 10:03:02 2020 - [info] All SSH connection tests passed successfully.
Wed Jun 17 10:03:02 2020 - [info] Checking MHA Node version..
Wed Jun 17 10:03:03 2020 - [info] Version check ok.
Wed Jun 17 10:03:03 2020 - [info] Checking SSH publickey authentication settings
on the current master..
Wed Jun 17 10:03:03 2020 - [info] HealthCheck: SSH to 10.0.0.8 is reachable.
Wed Jun 17 10:03:03 2020 - [info] Master MHA Node version is 0.56.
Wed Jun 17 10:03:03 2020 - [info] Checking recovery script configurations on
10.0.0.8(10.0.0.8:3306)..
Wed Jun 17 10:03:03 2020 - [info] Executing command: save_binary_logs --
command=test --start_pos=4 --binlog_dir=/var/lib/mysql,/var/log/mysql --
output_file=/data/mastermha/app1//save_binary_logs_test --manager_version=0.56 -
-start_file=mariadb-bin.000002
Wed Jun 17 10:03:03 2020 - [info] Connecting to root@10.0.0.8(10.0.0.8:22)..
Creating /data/mastermha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /var/lib/mysql, up to mariadb-bin.000002
Wed Jun 17 10:03:04 2020 - [info] Binlog setting check done.
Wed Jun 17 10:03:04 2020 - [info] Checking SSH publickey authentication and
checking recovery script configurations on all alive slave servers..
Wed Jun 17 10:03:04 2020 - [info] Executing command : apply_diff_relay_logs --
command=test --slave_user='mhauser' --slave_host=10.0.0.18 --slave_ip=10.0.0.18
--slave_port=3306 --workdir=/data/mastermha/app1/ --target_version=10.3.17-
MariaDB-log --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-
log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Wed Jun 17 10:03:04 2020 - [info] Connecting to root@10.0.0.18(10.0.0.18:22)..
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to mariadb-relay-bin.000002
Temporary relay log file is /var/lib/mysql/mariadb-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Wed Jun 17 10:03:05 2020 - [info] Executing command : apply_diff_relay_logs --
command=test --slave_user='mhauser' --slave_host=10.0.0.28 --slave_ip=10.0.0.28
--slave_port=3306 --workdir=/data/mastermha/app1/ --target_version=10.3.17-
MariaDB-log --manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-
log.info --relay_dir=/var/lib/mysql/ --slave_pass=xxx
Wed Jun 17 10:03:05 2020 - [info] Connecting to root@10.0.0.28(10.0.0.28:22)..
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to mariadb-relay-bin.000002
Temporary relay log file is /var/lib/mysql/mariadb-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Wed Jun 17 10:03:05 2020 - [info] Slaves settings check done.
Wed Jun 17 10:03:05 2020 - [info]
10.0.0.8(10.0.0.8:3306) (current master)
+--10.0.0.18(10.0.0.18:3306)
+--10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:03:05 2020 - [warning] master_ip_failover_script is not defined.
Wed Jun 17 10:03:05 2020 - [warning] shutdown_script is not defined.
Wed Jun 17 10:03:05 2020 - [info] Set master ping interval 1 seconds.
Wed Jun 17 10:03:05 2020 - [warning] secondary_check_script is not defined. It
is highly recommended setting it to check master reachability from two or more
routes.
Wed Jun 17 10:03:05 2020 - [info] Starting ping health check on
10.0.0.8(10.0.0.8:3306)..
Wed Jun 17 10:03:05 2020 - [info] Ping(SELECT) succeeded, waiting until MySQL
doesn't respond..
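
When troubleshooting, most of the log is [info] noise. The sketch below keeps only the lines that are tagged as problems. It assumes that error-level lines carry an [error] tag in the same way that warnings carry [warning] in the log above. The function name mha_log_problems is hypothetical, and the default path is the manager_log value from the example configuration.

```shell
#!/usr/bin/env bash
# Print only the warning and error lines of an MHA manager log.
# $1: log file (defaults to the manager_log path used in this example)
mha_log_problems() {
    grep -E '\[(warning|error)\]' "${1:-/data/mastermha/app1/manager.log}"
}

# Usage:
#   mha_log_problems /data/mastermha/app1/manager.log
```

Long messages are wrapped across several lines in the log, so a matched line may show only the beginning of a message.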

Simulate a failure

#Simulate a failure
[root@master ~]#systemctl stop mysqld
#After the master goes down, the MHA manager program exits automatically

[root@mha-manager ~]#masterha_manager --conf=/etc/mastermha/app1.cnf
Wed Jun 17 10:02:58 2020 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Wed Jun 17 10:02:58 2020 - [info] Reading application default configuration
from /etc/mastermha/app1.cnf..
Wed Jun 17 10:02:58 2020 - [info] Reading server configuration from
/etc/mastermha/app1.cnf..
Wed Jun 17 10:06:37 2020 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Wed Jun 17 10:06:37 2020 - [info] Reading application default configuration
from /etc/mastermha/app1.cnf..
Wed Jun 17 10:06:37 2020 - [info] Reading server configuration from
/etc/mastermha/app1.cnf..

[root@mha-manager ~]#cat /data/mastermha/app1/manager.log
Wed Jun 17 10:02:58 2020 - [info] MHA::MasterMonitor version 0.56.
Wed Jun 17 10:03:00 2020 - [info] GTID failover mode = 0
Wed Jun 17 10:03:00 2020 - [info] Dead Servers:
Wed Jun 17 10:03:00 2020 - [info] Alive Servers:
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:03:00 2020 - [info] Alive Slaves:
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:03:00 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:03:00 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:03:00 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] Current Alive Master:
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:03:00 2020 - [info] Checking slave configurations..
Wed Jun 17 10:03:00 2020 - [info] Checking replication filtering settings..
Wed Jun 17 10:03:00 2020 - [info] binlog_do_db= , binlog_ignore_db=
Wed Jun 17 10:03:00 2020 - [info] Replication filtering check ok.
Wed Jun 17 10:03:00 2020 - [info] GTID (with auto-pos) is not supported
Wed Jun 17 10:03:00 2020 - [info] Starting SSH connection tests..
Wed Jun 17 10:03:02 2020 - [info] All SSH connection tests passed
successfully.
Wed Jun 17 10:03:02 2020 - [info] Checking MHA Node version..
Wed Jun 17 10:03:03 2020 - [info] Version check ok.
Wed Jun 17 10:03:03 2020 - [info] Checking SSH publickey authentication
settings on the current master..
Wed Jun 17 10:03:03 2020 - [info] HealthCheck: SSH to 10.0.0.8 is reachable.
Wed Jun 17 10:03:03 2020 - [info] Master MHA Node version is 0.56.
Wed Jun 17 10:03:03 2020 - [info] Checking recovery script configurations on
10.0.0.8(10.0.0.8:3306)..
Wed Jun 17 10:03:03 2020 - [info] Executing command: save_binary_logs --
command=test --start_pos=4 --binlog_dir=/var/lib/mysql,/var/log/mysql --
output_file=/data/mastermha/app1//save_binary_logs_test --manager_version=0.56 -
-start_file=mariadb-bin.000002
Wed Jun 17 10:03:03 2020 - [info] Connecting to
root@10.0.0.8(10.0.0.8:22)..
Creating /data/mastermha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /var/lib/mysql, up to mariadb-bin.000002
Wed Jun 17 10:03:04 2020 - [info] Binlog setting check done.
Wed Jun 17 10:03:04 2020 - [info] Checking SSH publickey authentication and
checking recovery script configurations on all alive slave servers..
Wed Jun 17 10:03:04 2020 - [info] Executing command :
apply_diff_relay_logs --command=test --slave_user='mhauser' --
slave_host=10.0.0.18 --slave_ip=10.0.0.18 --slave_port=3306 --
workdir=/data/mastermha/app1/ --target_version=10.3.17-MariaDB-log --
manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-log.info --
relay_dir=/var/lib/mysql/ --slave_pass=xxx
Wed Jun 17 10:03:04 2020 - [info] Connecting to
root@10.0.0.18(10.0.0.18:22)..
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to mariadb-relay-bin.000002
Temporary relay log file is /var/lib/mysql/mariadb-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Wed Jun 17 10:03:05 2020 - [info] Executing command :
apply_diff_relay_logs --command=test --slave_user='mhauser' --
slave_host=10.0.0.28 --slave_ip=10.0.0.28 --slave_port=3306 --
workdir=/data/mastermha/app1/ --target_version=10.3.17-MariaDB-log --
manager_version=0.56 --relay_log_info=/var/lib/mysql/relay-log.info --
relay_dir=/var/lib/mysql/ --slave_pass=xxx
Wed Jun 17 10:03:05 2020 - [info] Connecting to
root@10.0.0.28(10.0.0.28:22)..
Checking slave recovery environment settings..
Opening /var/lib/mysql/relay-log.info ... ok.
Relay log found at /var/lib/mysql, up to mariadb-relay-bin.000002
Temporary relay log file is /var/lib/mysql/mariadb-relay-bin.000002
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Wed Jun 17 10:03:05 2020 - [info] Slaves settings check done.
Wed Jun 17 10:03:05 2020 - [info]
10.0.0.8(10.0.0.8:3306) (current master)
+--10.0.0.18(10.0.0.18:3306)
+--10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:03:05 2020 - [warning] master_ip_failover_script is not
defined.
Wed Jun 17 10:03:05 2020 - [warning] shutdown_script is not defined.
Wed Jun 17 10:03:05 2020 - [info] Set master ping interval 1 seconds.
Wed Jun 17 10:03:05 2020 - [warning] secondary_check_script is not defined.
It is highly recommended setting it to check master reachability from two or
more routes.
Wed Jun 17 10:03:05 2020 - [info] Starting ping health check on
10.0.0.8(10.0.0.8:3306)..
Wed Jun 17 10:03:05 2020 - [info] Ping(SELECT) succeeded, waiting until
MySQL doesn't respond..
Wed Jun 17 10:06:31 2020 - [warning] Got timeout on MySQL Ping(SELECT) child
process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line
431.
Wed Jun 17 10:06:31 2020 - [info] Executing SSH check script:
save_binary_logs --command=test --start_pos=4 --
binlog_dir=/var/lib/mysql,/var/log/mysql --
output_file=/data/mastermha/app1//save_binary_logs_test --manager_version=0.56 -
-binlog_prefix=mariadb-bin
Wed Jun 17 10:06:32 2020 - [warning] Got error on MySQL connect: 2003 (Can't
connect to MySQL server on '10.0.0.8' (4))
Wed Jun 17 10:06:32 2020 - [warning] Connection failed 2 time(s)..
Wed Jun 17 10:06:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't
connect to MySQL server on '10.0.0.8' (4))
Wed Jun 17 10:06:33 2020 - [warning] Connection failed 3 time(s)..
Wed Jun 17 10:06:34 2020 - [warning] Got error on MySQL connect: 2003 (Can't
connect to MySQL server on '10.0.0.8' (4))
Wed Jun 17 10:06:34 2020 - [warning] Connection failed 4 time(s)..
Wed Jun 17 10:06:36 2020 - [warning] HealthCheck: Got timeout on checking
SSH connection to 10.0.0.8! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm
line 342.
Wed Jun 17 10:06:36 2020 - [warning] Master is not reachable from health
checker!
Wed Jun 17 10:06:36 2020 - [warning] Master 10.0.0.8(10.0.0.8:3306) is not
reachable!
Wed Jun 17 10:06:36 2020 - [warning] SSH is NOT reachable.
Wed Jun 17 10:06:36 2020 - [info] Connecting to a master server failed.
Reading configuration file /etc/masterha_default.cnf and /etc/mastermha/app1.cnf
again, and trying to connect to all servers to check server status..
Wed Jun 17 10:06:36 2020 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Wed Jun 17 10:06:36 2020 - [info] Reading application default configuration
from /etc/mastermha/app1.cnf..
Wed Jun 17 10:06:36 2020 - [info] Reading server configuration from
/etc/mastermha/app1.cnf..
Wed Jun 17 10:06:37 2020 - [info] GTID failover mode = 0
Wed Jun 17 10:06:37 2020 - [info] Dead Servers:
Wed Jun 17 10:06:37 2020 - [info] 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:37 2020 - [info] Alive Servers:
Wed Jun 17 10:06:37 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Wed Jun 17 10:06:37 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:06:37 2020 - [info] Alive Slaves:
Wed Jun 17 10:06:37 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:37 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:37 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:06:37 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:37 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:37 2020 - [info] Checking slave configurations..
Wed Jun 17 10:06:37 2020 - [info] Checking replication filtering settings..
Wed Jun 17 10:06:37 2020 - [info] Replication filtering check ok.
Wed Jun 17 10:06:37 2020 - [info] Master is down!
Wed Jun 17 10:06:37 2020 - [info] Terminating monitoring script.
Wed Jun 17 10:06:37 2020 - [info] Got exit code 20 (Master dead).
Wed Jun 17 10:06:37 2020 - [info] MHA::MasterFailover version 0.56.
Wed Jun 17 10:06:37 2020 - [info] Starting master failover.
Wed Jun 17 10:06:37 2020 - [info]
Wed Jun 17 10:06:37 2020 - [info] * Phase 1: Configuration Check Phase..
Wed Jun 17 10:06:37 2020 - [info]
Wed Jun 17 10:06:38 2020 - [info] GTID failover mode = 0
Wed Jun 17 10:06:38 2020 - [info] Dead Servers:
Wed Jun 17 10:06:38 2020 - [info] 10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:38 2020 - [info] Checking master reachability via
MySQL(double check)...
Wed Jun 17 10:06:39 2020 - [info] ok.
Wed Jun 17 10:06:39 2020 - [info] Alive Servers:
Wed Jun 17 10:06:39 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Wed Jun 17 10:06:39 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:06:39 2020 - [info] Alive Slaves:
Wed Jun 17 10:06:39 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:39 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:39 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:06:39 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:39 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:39 2020 - [info] Starting Non-GTID based failover.
Wed Jun 17 10:06:39 2020 - [info]
Wed Jun 17 10:06:39 2020 - [info] ** Phase 1: Configuration Check Phase
completed.
Wed Jun 17 10:06:39 2020 - [info]
Wed Jun 17 10:06:39 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Wed Jun 17 10:06:39 2020 - [info]
Wed Jun 17 10:06:39 2020 - [info] Forcing shutdown so that applications
never connect to the current master..
Wed Jun 17 10:06:39 2020 - [warning] master_ip_failover_script is not set.
Skipping invalidating dead master IP address.
Wed Jun 17 10:06:39 2020 - [warning] shutdown_script is not set. Skipping
explicit shutting down of the dead master.
Wed Jun 17 10:06:40 2020 - [info] * Phase 2: Dead Master Shutdown Phase
completed.
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 3: Master Recovery Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] The latest binary log file/position on all
slaves is mariadb-bin.000002:3062073
Wed Jun 17 10:06:40 2020 - [info] Latest slaves (Slaves that received relay
log files to the latest):
Wed Jun 17 10:06:40 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:40 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:40 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:06:40 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:40 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:40 2020 - [info] The oldest binary log file/position on all
slaves is mariadb-bin.000002:3062073
Wed Jun 17 10:06:40 2020 - [info] Oldest slaves:
Wed Jun 17 10:06:40 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:40 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:40 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:06:40 2020 - [info] 10.0.0.28(10.0.0.28:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:40 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 3.2: Saving Dead Master's Binlog
Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [warning] Dead Master is not SSH reachable. Could
not save it's binlogs. Transactions that were not sent to the latest slave
(Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 3.3: Determining New Master
Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] Finding the latest slave that has all
relay logs for recovering other slaves..
Wed Jun 17 10:06:40 2020 - [info] All slaves received relay logs to the same
position. No need to resync each other.
Wed Jun 17 10:06:40 2020 - [info] Searching new master from slaves..
Wed Jun 17 10:06:40 2020 - [info] Candidate masters from the configuration
file:
Wed Jun 17 10:06:40 2020 - [info] 10.0.0.18(10.0.0.18:3306)
Version=10.3.17-MariaDB-log (oldest major version between slaves) log-
bin:enabled
Wed Jun 17 10:06:40 2020 - [info] Replicating from
10.0.0.8(10.0.0.8:3306)
Wed Jun 17 10:06:40 2020 - [info] Primary candidate for the new Master
(candidate_master is set)
Wed Jun 17 10:06:40 2020 - [info] Non-candidate masters:
Wed Jun 17 10:06:40 2020 - [info] Searching from candidate_master slaves
which have received the latest relay log events..
Wed Jun 17 10:06:40 2020 - [info] New master is 10.0.0.18(10.0.0.18:3306)
Wed Jun 17 10:06:40 2020 - [info] Starting master failover..
Wed Jun 17 10:06:40 2020 - [info]
From:
10.0.0.8(10.0.0.8:3306) (current master)
+--10.0.0.18(10.0.0.18:3306)
+--10.0.0.28(10.0.0.28:3306)
To:
10.0.0.18(10.0.0.18:3306) (new master)
+--10.0.0.28(10.0.0.28:3306)
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 3.3: New Master Diff Log
Generation Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] This server has all relay logs. No need
to generate diff files from the latest slave.
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 3.4: Master Log Apply Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] *NOTICE: If any error happens from this
phase, manual recovery is needed.
Wed Jun 17 10:06:40 2020 - [info] Starting recovery on
10.0.0.18(10.0.0.18:3306)..
Wed Jun 17 10:06:40 2020 - [info] This server has all relay logs. Waiting
all logs to be applied..
Wed Jun 17 10:06:40 2020 - [info] done.
Wed Jun 17 10:06:40 2020 - [info] All relay logs were successfully applied.
Wed Jun 17 10:06:40 2020 - [info] Getting new master's binlog name and
position..
Wed Jun 17 10:06:40 2020 - [info] mariadb-bin.000002:344
Wed Jun 17 10:06:40 2020 - [info] All other slaves should start replication
from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.0.18',
MASTER_PORT=3306, MASTER_LOG_FILE='mariadb-bin.000002', MASTER_LOG_POS=344,
MASTER_USER='repluser', MASTER_PASSWORD='xxx';
Wed Jun 17 10:06:40 2020 - [warning] master_ip_failover_script is not set.
Skipping taking over new master IP address.
Wed Jun 17 10:06:40 2020 - [info] Setting read_only=0 on
10.0.0.18(10.0.0.18:3306)..
Wed Jun 17 10:06:40 2020 - [info] ok.
Wed Jun 17 10:06:40 2020 - [info] ** Finished master recovery successfully.
Wed Jun 17 10:06:40 2020 - [info] * Phase 3: Master Recovery Phase
completed.
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 4: Slaves Recovery Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] * Phase 4.1: Starting Parallel Slave Diff
Log Generation Phase..
Wed Jun 17 10:06:40 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] -- Slave diff file generation on host
10.0.0.28(10.0.0.28:3306) started, pid: 24706. Check tmp log
/data/mastermha/app1//10.0.0.28_3306_20200617100637.log if it takes time..
Wed Jun 17 10:06:41 2020 - [info]
Wed Jun 17 10:06:41 2020 - [info] Log messages from 10.0.0.28 ...
Wed Jun 17 10:06:41 2020 - [info]
Wed Jun 17 10:06:40 2020 - [info] This server has all relay logs. No need
to generate diff files from the latest slave.
Wed Jun 17 10:06:41 2020 - [info] End of log messages from 10.0.0.28.
Wed Jun 17 10:06:41 2020 - [info] -- 10.0.0.28(10.0.0.28:3306) has the
latest relay log events.
Wed Jun 17 10:06:41 2020 - [info] Generating relay diff files from the
latest slave succeeded.
Wed Jun 17 10:06:41 2020 - [info]
Wed Jun 17 10:06:41 2020 - [info] * Phase 4.2: Starting Parallel Slave Log
Apply Phase..
Wed Jun 17 10:06:41 2020 - [info]
Wed Jun 17 10:06:41 2020 - [info] -- Slave recovery on host
10.0.0.28(10.0.0.28:3306) started, pid: 24708. Check tmp log
/data/mastermha/app1//10.0.0.28_3306_20200617100637.log if it takes time..
Wed Jun 17 10:06:42 2020 - [info]
Wed Jun 17 10:06:42 2020 - [info] Log messages from 10.0.0.28 ...
Wed Jun 17 10:06:42 2020 - [info]
Wed Jun 17 10:06:41 2020 - [info] Starting recovery on
10.0.0.28(10.0.0.28:3306)..
Wed Jun 17 10:06:41 2020 - [info] This server has all relay logs. Waiting
all logs to be applied..
Wed Jun 17 10:06:41 2020 - [info] done.
Wed Jun 17 10:06:41 2020 - [info] All relay logs were successfully applied.
Wed Jun 17 10:06:41 2020 - [info] Resetting slave 10.0.0.28(10.0.0.28:3306)
and starting replication from the new master 10.0.0.18(10.0.0.18:3306)..
Wed Jun 17 10:06:41 2020 - [info] Executed CHANGE MASTER.
Wed Jun 17 10:06:42 2020 - [info] Slave started.
Wed Jun 17 10:06:42 2020 - [info] End of log messages from 10.0.0.28.
Wed Jun 17 10:06:42 2020 - [info] -- Slave recovery on host
10.0.0.28(10.0.0.28:3306) succeeded.
Wed Jun 17 10:06:42 2020 - [info] All new slave servers recovered
successfully.
Wed Jun 17 10:06:42 2020 - [info]
Wed Jun 17 10:06:42 2020 - [info] * Phase 5: New master cleanup phase..
Wed Jun 17 10:06:42 2020 - [info]
Wed Jun 17 10:06:42 2020 - [info] Resetting slave info on the new master..
Wed Jun 17 10:06:42 2020 - [info] 10.0.0.18: Resetting slave info
succeeded.
Wed Jun 17 10:06:42 2020 - [info] Master failover to
10.0.0.18(10.0.0.18:3306) completed successfully.
Wed Jun 17 10:06:42 2020 - [info]
----- Failover Report -----
app1: MySQL Master failover 10.0.0.8(10.0.0.8:3306) to
10.0.0.18(10.0.0.18:3306) succeeded
Master 10.0.0.8(10.0.0.8:3306) is down!
Check MHA Manager logs at mha-manager:/data/mastermha/app1/manager.log for
details.
Started automated(non-interactive) failover.
The latest slave 10.0.0.18(10.0.0.18:3306) has all relay logs for recovery.
Selected 10.0.0.18(10.0.0.18:3306) as a new master.
10.0.0.18(10.0.0.18:3306): OK: Applying all logs succeeded.
10.0.0.28(10.0.0.28:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
10.0.0.28(10.0.0.28:3306): OK: Applying all logs succeeded. Slave started,
replicating from 10.0.0.18(10.0.0.18:3306)
10.0.0.18(10.0.0.18:3306): Resetting slave info succeeded.
Master failover to 10.0.0.18(10.0.0.18:3306) completed successfully.
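
The key facts in a long failover log like the one above are the new master, the CHANGE MASTER statement that the remaining slaves should use, and whether the run succeeded. As a rough sketch (not part of MHA itself), they can be pulled out of manager.log with a few regular expressions. The function name `summarize_failover` and the sample text are illustrative assumptions; the message wording is taken from the log shown here and may differ in other MHA versions.

```python
import re

def summarize_failover(log_text: str) -> dict:
    """Extract the new master and the CHANGE MASTER statement from manager.log text."""
    # Collapse whitespace so that statements wrapped across lines match as one string.
    flat = re.sub(r"\s+", " ", log_text)
    new_master = re.search(r"New master is (\S+?)\(", flat)
    change = re.search(r"Statement should be: (CHANGE MASTER TO .*?;)", flat)
    return {
        "new_master": new_master.group(1) if new_master else None,
        "change_master": change.group(1) if change else None,
        "succeeded": "completed successfully" in flat,
    }

# Hypothetical excerpt, copied from the log above.
sample = """\
Wed Jun 17 10:06:40 2020 - [info] New master is 10.0.0.18(10.0.0.18:3306)
Wed Jun 17 10:06:40 2020 - [info] All other slaves should start replication
from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.0.0.18',
MASTER_PORT=3306, MASTER_LOG_FILE='mariadb-bin.000002', MASTER_LOG_POS=344,
MASTER_USER='repluser', MASTER_PASSWORD='xxx';
Wed Jun 17 10:06:42 2020 - [info] Master failover to
10.0.0.18(10.0.0.18:3306) completed successfully.
"""

print(summarize_failover(sample))
```

Note that MHA masks the replication password as 'xxx' in the log, so the extracted statement cannot be executed until the real password is substituted.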

[root@mha-manager ~]#masterha_check_status --conf=/etc/mastermha/app1.cnf
app1 is stopped(2:NOT_RUNNING).

#Verify that the VIP has moved to the new master
[root@slave1 ~]#ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP
group default qlen 1000
link/ether 00:0c:29:e1:0e:53 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.18/24 brd 10.0.0.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet 10.0.0.100/8 brd 10.255.255.255 scope global eth0:1
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fee1:e53/64 scope link
valid_lft forever preferred_lft forever

#The configuration file on the manager node is modified automatically and the failed master is removed from it
[root@mha-manager ~]#cat /etc/mastermha/app1.cnf
[server2]
hostname=10.0.0.18
port=3306
candidate_master=1
[server3]
hostname=10.0.0.28
port=3306
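
To confirm which servers are still registered after the failover, the configuration file can be read with Python's standard `configparser`. This is only a sketch: the helper name `remaining_servers` is made up for illustration, and it assumes the file keeps the simple `key=value` layout shown above.

```python
import configparser

def remaining_servers(cnf_text: str) -> list:
    """Return (section, hostname, is_candidate_master) for each [serverN] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cnf_text)
    servers = []
    for name in parser.sections():
        # Skip the shared settings section; only per-host sections are of interest.
        if not name.startswith("server") or name == "server default":
            continue
        section = parser[name]
        is_candidate = section.get("candidate_master", fallback="0") == "1"
        servers.append((name, section.get("hostname"), is_candidate))
    return servers

# The server sections as shown above, after MHA removed the failed master.
cnf = """\
[server2]
hostname=10.0.0.18
port=3306
candidate_master=1
[server3]
hostname=10.0.0.28
port=3306
"""

for entry in remaining_servers(cnf):
    print(entry)
```

In practice the text would come from reading /etc/mastermha/app1.cnf on the manager node. The section for 10.0.0.8 no longer appears, which is why the file has to be repaired by hand before MHA is started again.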

An alert email is received.

Note: if the failover fails, delete the following file before running MHA again.

[root@mha-manager ~]#rm -f /data/mastermha/app1/app1.failover.error

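Repairing master-slave replication

Repair the failed master and make sure its data is back in sync
Repair replication: manually attach the repaired server to the new master and configure it as a slave
Repair the manager's configuration file
Clean up the related directories
Check that SSH mutual trust and replication both work
Check the VIP and reconfigure it if there is a problem
Run MHA again and query its status to make sure it is running normally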
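
The step of attaching the repaired server to the new master comes down to a CHANGE MASTER TO statement followed by START SLAVE. The sketch below only builds that SQL text; `change_master_sql` is a hypothetical helper, and the host, user, and binlog coordinates are the ones printed in the failover log above. It assumes the repaired server's data already matches the new master at exactly those coordinates. If the server was restored from a later backup, use the coordinates recorded with that backup instead.

```python
def change_master_sql(host, log_file, log_pos, user, password, port=3306):
    """Build a CHANGE MASTER TO statement for a server joining as a slave."""
    def quote(value):
        # Minimal escaping for a single-quoted SQL string literal.
        return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"
    return (
        f"CHANGE MASTER TO MASTER_HOST={quote(host)}, MASTER_PORT={int(port)}, "
        f"MASTER_USER={quote(user)}, MASTER_PASSWORD={quote(password)}, "
        f"MASTER_LOG_FILE={quote(log_file)}, MASTER_LOG_POS={int(log_pos)};"
    )

# Values from the failover log; 'xxx' stands in for the real replication password.
statement = change_master_sql(
    host="10.0.0.18",
    log_file="mariadb-bin.000002",
    log_pos=344,
    user="repluser",
    password="xxx",
)
print(statement)
print("START SLAVE;")
```

After running the statements on the repaired server, SHOW SLAVE STATUS should report both the I/O and SQL threads as running before the server is added back to the manager's configuration file.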

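Files to delete before running MHA again

MHA performs only one failover per run. To use it again, the following files must be deleted first; otherwise MHA cannot be reused.

[root@mha-manager ~]#rm -rf /data/mastermha/app1/ #working directory of the MHA manager
[root@mha-manager ~]#rm -rf /data/mastermha/app1/manager.log #log file of the MHA manager
[root@master ~]#rm -rf /data/mastermha/app1/ #working directory on each remote host, i.e. all three nodes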
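
Deleting the whole working directory also discards manager.log. A narrower alternative, sketched below, removes only the failover marker files and leaves everything else in place. The helper name `clear_failover_markers` is hypothetical. The `app1.failover.error` name comes from the note earlier in this section, while `app1.failover.complete` is an assumption about how MHA names its completion marker, so check the actual directory contents before relying on it.

```python
import tempfile
from pathlib import Path

def clear_failover_markers(workdir, app="app1"):
    """Delete <app>.failover.complete / <app>.failover.error and return their names."""
    removed = []
    for suffix in ("failover.complete", "failover.error"):
        marker = Path(workdir) / f"{app}.{suffix}"
        if marker.exists():
            marker.unlink()
            removed.append(marker.name)
    return removed

# Demonstration in a throwaway directory instead of /data/mastermha/app1.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "app1.failover.complete").touch()
    (Path(demo_dir) / "manager.log").touch()
    print(clear_failover_markers(demo_dir))
    print(sorted(p.name for p in Path(demo_dir).iterdir()))
```

On the manager node the call would be `clear_failover_markers("/data/mastermha/app1")`, run before starting MHA again.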