Monitoring a MySQL Galera Cluster
I. Checking cluster replication status
1. SHOW GLOBAL STATUS LIKE 'wsrep_%';
+------------------------------+-------------------------------------------------------------+
| Variable_name | Value |
+------------------------------+-------------------------------------------------------------+
| wsrep_local_state_uuid | 9f6a992a-7dd9-11e5-9f85-f760745ffb39 |
| wsrep_protocol_version | 7 |
| wsrep_last_committed | 53 |
| wsrep_replicated | 6 |
| wsrep_replicated_bytes | 1368 |
| wsrep_repl_keys | 9 |
| wsrep_repl_keys_bytes | 210 |
| wsrep_repl_data_bytes | 774 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 37 |
| wsrep_received_bytes | 23347 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.125000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 2 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0.027027 |
| wsrep_local_cached_downto | 14 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 1.000000 |
| wsrep_apply_oooe | 0.100000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.250000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.250000 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 10 |
| wsrep_cert_bucket_count | 22 |
| wsrep_gcache_pool_size | 27144 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.325000 |
| wsrep_incoming_addresses | |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 5e28860a-829e-11e5-9c06-665d7fe4003d |
| wsrep_cluster_conf_id | 3 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 9f6a992a-7dd9-11e5-9f85-f760745ffb39 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 3.12(rXXXX) |
| wsrep_ready | ON |
+------------------------------+-------------------------------------------------------------+
wsrep_notify_cmd.sh: monitors changes in status. For usage, see http://galeracluster.com/documentation-webpages/notificationcmd.html
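For scripted monitoring, the tabular output of the status query can be turned into a dictionary. The following is a minimal sketch, assuming the text comes from the mysql command-line client in its default table layout; the function name parse_status_table is hypothetical and not part of any MySQL or Galera tooling.

```python
def parse_status_table(text):
    """Parse mysql-client table output of SHOW GLOBAL STATUS into a dict.

    Assumes rows look like "| name | value |"; border and header rows
    are skipped. Values are kept as strings, and may be empty.
    """
    status = {}
    for line in text.splitlines():
        line = line.strip()
        # Only data rows start and end with a pipe character.
        if not (line.startswith("|") and line.endswith("|")) or len(line) < 2:
            continue
        name, sep, value = line[1:-1].partition("|")
        name, value = name.strip(), value.strip()
        if not sep or name == "Variable_name":
            continue
        status[name] = value
    return status
```

The resulting dictionary maps each variable name to its value as a string, so numeric values such as wsrep_cluster_size still need converting before comparison.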
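2. wsrep_cluster_state_uuid shows the cluster's state UUID, which tells you whether a node is still part of the cluster. The value should be identical on every node; a node reporting a different value is no longer in the cluster.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_state_uuid';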
+--------------------------+--------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------+
| wsrep_cluster_state_uuid | 9f6a992a-7dd9-11e5-9f85-f760745ffb39 |
+--------------------------+--------------------------------------+
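3. wsrep_cluster_conf_id shows the total number of cluster membership changes. It should be the same on all nodes; a node reporting a different value has become disconnected from the cluster.
4. wsrep_cluster_size shows the number of nodes in the cluster.
5. wsrep_cluster_status shows the primary status of the cluster component the node belongs to. The normal value is Primary. A value of non-Primary, or anything else, indicates node loss or split-brain caused by several membership changes at once. If no node returns Primary, the quorum has to be reset; see http://galeracluster.com/documentation-webpages/quorumreset.html for details. If all of these checks return normal values, replication is working on every node, and the next step is to check the status of each node to make sure they all receive write-sets.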
show global status like 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name | Value |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+
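The cluster-level checks above compare the same variables across all nodes, which is easy to automate. Here is a minimal sketch, assuming the status of each node has already been collected into a dictionary of strings; the function name check_cluster and the node names are hypothetical.

```python
def check_cluster(statuses, expected_size):
    """Return a list of problems found in cluster-level wsrep variables.

    statuses maps a node name to that node's status dict (string values).
    expected_size is the number of nodes the cluster should have.
    """
    problems = []
    # The state UUID and configuration ID must match on every node.
    for var in ("wsrep_cluster_state_uuid", "wsrep_cluster_conf_id"):
        values = sorted({str(s.get(var)) for s in statuses.values()})
        if len(values) > 1:
            problems.append(f"{var} differs between nodes: {values}")
    for node in sorted(statuses):
        status = statuses[node]
        size = status.get("wsrep_cluster_size")
        if size != str(expected_size):
            problems.append(
                f"{node}: wsrep_cluster_size is {size}, expected {expected_size}"
            )
        state = status.get("wsrep_cluster_status")
        if state != "Primary":
            problems.append(f"{node}: wsrep_cluster_status is {state}")
    return problems
```

An empty list means every node agrees on the cluster state and belongs to the Primary component.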
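II. Checking node status
Node status shows whether a node in the cluster is receiving and applying write-sets, and reveals problems that may be blocking replication.
1. wsrep_ready shows whether the node can accept queries. ON is normal. When it is OFF, almost every query fails with the error "ERROR 1047 (08S01) Unknown Command".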
SHOW GLOBAL STATUS LIKE 'wsrep_ready';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_ready | ON |
+---------------+-------+
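2. wsrep_connected shows whether the node has network connectivity to the other nodes. (In a test, after the network interface of one node was brought down, the value was still ON, suggesting the network was still considered up.) Lost connectivity may be caused by a misconfigured wsrep_cluster_address or wsrep_cluster_name.
SHOW GLOBAL STATUS LIKE 'wsrep_connected';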
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| wsrep_connected | ON |
+-----------------+-------+
3. wsrep_local_state_comment shows the node state in human-readable form. Normal values are Joining, Waiting on SST, Joined, Synced, or Donor. A value of Initialized means the node is no longer operating normally.
+---------------------------+--------+
| Variable_name | Value |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
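The three node-level checks can be combined into one function. This is a sketch under stated assumptions rather than a definitive health check: check_node is a hypothetical name, the input is a dictionary of status strings, and a suffixed donor state such as "Donor/Desynced" is treated as Donor.

```python
NORMAL_STATES = {"Joining", "Waiting on SST", "Joined", "Synced", "Donor"}


def check_node(status):
    """Return a list of problems for one node's wsrep status dict."""
    problems = []
    for var in ("wsrep_ready", "wsrep_connected"):
        value = status.get(var)
        if value != "ON":
            problems.append(f"{var} is {value}, expected ON")
    comment = status.get("wsrep_local_state_comment", "")
    # Assumption: a value like "Donor/Desynced" counts as "Donor".
    state = comment.split("/")[0].strip()
    if state not in NORMAL_STATES:
        problems.append(f"wsrep_local_state_comment is {comment!r}")
    return problems
```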
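III. Checking replication health
Replication is regulated by the Flow Control feedback mechanism. When the write-sets a node has received locally exceed a certain threshold, the node triggers flow control to pause replication until it catches up. The following variables monitor the local received queue and flow control:
1. wsrep_local_recv_queue_avg: the average length of the local received queue. A value greater than 0 means the node applies write-sets more slowly than it receives them, so write-sets are waiting. A large backlog can trigger flow control.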
+----------------------------+----------+
| Variable_name | Value |
+----------------------------+----------+
| wsrep_local_recv_queue_avg | 0.027027 |
+----------------------------+----------+
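wsrep_local_recv_queue_max and wsrep_local_recv_queue_min show the largest and smallest lengths the queue has reached.
2. wsrep_flow_control_paused shows the fraction of time since the last status query during which the node was paused by flow control. Overall it reflects how far the node lags behind the cluster. A value of 1 means the node has been paused for the entire time since the last query. If a node frequently falls behind the cluster, adjust wsrep_slave_threads or remove the node from the cluster.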
+---------------------------+----------+
| Variable_name | Value |
+---------------------------+----------+
| wsrep_flow_control_paused | 0.000000 |
+---------------------------+----------+
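3. wsrep_cert_deps_distance shows the average distance between the lowest and highest sequence numbers (seqno) that can be applied in parallel. It represents the potential degree of parallelism on the node and is related to the number of applier threads.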
+--------------------------+----------+
| Variable_name | Value |
+--------------------------+----------+
| wsrep_cert_deps_distance | 1.000000 |
+--------------------------+----------+
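The receive queue and flow control variables lend themselves to simple threshold alerts. The sketch below assumes a status dictionary of strings; the function name check_replication and the default thresholds 0.5 and 0.1 are illustrative assumptions, not values recommended by Galera, and should be tuned for the workload.

```python
def check_replication(status, max_recv_queue_avg=0.5, max_paused=0.1):
    """Return warnings about a node's replication health.

    Both thresholds are assumed example values, not Galera defaults.
    """
    warnings = []
    recv_avg = float(status.get("wsrep_local_recv_queue_avg", 0))
    paused = float(status.get("wsrep_flow_control_paused", 0))
    if recv_avg > max_recv_queue_avg:
        warnings.append(
            f"receive queue is backing up (avg {recv_avg:.3f}); "
            "write-sets are applied more slowly than they arrive"
        )
    if paused > max_paused:
        warnings.append(
            f"node was paused by flow control {paused:.0%} of the time"
        )
    return warnings
```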
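IV. Detecting a slow network
Check the send queue to see how outgoing connections are doing.
1. wsrep_local_send_queue_avg shows the average send queue length since the last status query. A value well above 0 points to a problem on the outgoing side; network bottlenecks and flow control are both possible causes.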
+----------------------------+----------+
| Variable_name | Value |
+----------------------------+----------+
| wsrep_local_send_queue_avg | 0.033333 |
+----------------------------+----------+
wsrep_local_send_queue_max and wsrep_local_send_queue_min show the largest and smallest lengths the send queue has reached.
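To spot which nodes have a send queue that is building up, the average can be compared across nodes. A minimal sketch, assuming per-node status dictionaries of strings; nodes_with_slow_send is a hypothetical name and the threshold of 0.5 is an assumed example value.

```python
def nodes_with_slow_send(statuses, threshold=0.5):
    """Return (node, avg) pairs whose send queue average exceeds threshold.

    The result is sorted with the longest average queue first.
    """
    slow = []
    for node, status in statuses.items():
        avg = float(status.get("wsrep_local_send_queue_avg", 0))
        if avg > threshold:
            slow.append((node, avg))
    return sorted(slow, key=lambda item: item[1], reverse=True)
```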
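V. Log monitoring
Add the following settings to my.cnf:
# wsrep Log Options
wsrep_log_conflicts=ON # writes conflict information to the error log, for example when two nodes write the same row at the same time
wsrep_provider_options="cert.log_conflicts=ON" # logs error information from the replication process
wsrep_debug=ON # writes debug information to the log; this includes authentication details such as accounts and passwords, so do not enable it in production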
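VI. Additional logs
When a node fails to apply an event on a slave node, the database server creates a special binary log file for that event. By default these files are named GRA_*.log.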
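A monitoring job can watch for these files appearing. The following is a minimal sketch, assuming the files are written to the MySQL data directory and that its path is passed in by the caller; find_gra_logs is a hypothetical helper name.

```python
import glob
import os


def find_gra_logs(datadir):
    """Return paths of GRA_*.log files in datadir, newest first.

    datadir is an assumption: pass the data directory of your server.
    """
    pattern = os.path.join(datadir, "GRA_*.log")
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)
```

A non-empty result means at least one event failed to apply on this node and is worth investigating.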