pacemaker(heartbeat)の切り替えがうまく動作しない件について (Linux-ha-jp) - Linux-HA Japan

根本です。

岩崎さん早速のご返事ありがとうございます。
そして、返信おくれてすいません。

>  ・Primary側が落とせないプロセスがDRBDのマウントポイントを掴んでいる時（8.3.4だともしかするとシェルでログインしてるだけでも駄目）
>  ・OutDatedステータスになっている。（スプリットブレイン発生を防ぐ安全装置）
>  ・メタデータ領域が壊れている。（ほぼ無いですが外部メタデータにしててうっかりDDで消したりするとアクセスできなったり）
>  多くの場合が、相手側ノードのPrimaryをSecondary状態に降格できない時にSecondaryからPrimaryに昇格できない症状が出ます。
>  今回の場合、リブートしてしまった時に起きているという事なので、OutDated状態になっていてフェイルオーバーできずにいて、
>  セカンダリノードが上がったときにDRBDも立ち上がり、再接続することでステータスが正常化してしまっている可能性があります。
>  これが試験環境であれば、sysrq-triggerを使って再起動を行い正常に再起動されるかの確認をすると状況を再現できるかと思います。
>  あと、chkconfigでheartbeatとDRBDはoffにしておいたほうが良いです。
>  この２つは中途半端な状態で上がられると困るものなので、落ちたなら安全が確認できるまで上がってくれないほうが良いものになります。
▼
heartbeatとDRBDは、chkconfigでoffにしております。
なので、2号機が落ちた時点でDRBDの接続が切れているので、
マウントポイントを掴んでいるプロセスは無いかと思います。
さすがに本番環境なので、sysrq-triggerを使った再起動実験は難しいと思います。
一応、KVM上のテスト環境では、ちゃんと切り替わっているのですが・・・
やはり、どこかでOutDatedステータスになったままだったのでしょうか。


また、ログが見あたらなかったとのことですが、ログファイルを添付したつもりでしたが
本文に貼り付けた方がよろしかったでしょうか。取り急ぎ、本文末に貼り付けて起きます。


>  ＃ OpenVZのリソースエージェントに興味津々（実は私もコンテナ単位でF/OできるRAを作ってます）
▼
前任者がheartbeat1.xx時代に作成したものを、数ヶ月前まで「シェルスクリプトってなに？」て言っていた
自分が改良したものなので、かなり頼りない参考にならないスクリプトですが・・・
よろしければ、MLに投げましょうか？(冗談抜きでやっつけスクリプトですよ)


以上、よろしくお願い致します。


以下、ha-debug.log
----start----
Apr 30 05:02:47 02a.ss.jp heartbeat: [8613]: WARN: node 02b.ss.jp: is dead
Apr 30 05:02:47 02a.ss.jp heartbeat: [8613]: info: Link 02b.ss.jp:eth1 dead.
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: notice: crmd_ha_status_callback: Status update: Node 02b.ss.jp now has status [dead] (DC=true)
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: crm_update_peer_proc: 02b.ss.jp.ais is now offline
Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: quorum plugin: majority
Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: total_node_count=2, total_quorum_votes=200
Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: quorum plugin: twonodes
Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: cluster:linux-ha, member_count=1, member_quorum_votes=100
Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: total_node_count=2, total_quorum_votes=200
Apr 30 05:02:47 02a.ss.jp ccm: [8621]: info: Break tie for 2 nodes cluster
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event: no mbr_track info
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event: instance=5, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=5)
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ccm_event_detail: NEW MEMBERSHIP: trans=5, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ccm_event_detail:     CURRENT: 02a.ss.jp [nodeid=0, born=5]
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ccm_event_detail:     LOST:    02b.ss.jp [nodeid=1, born=4]
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ais_status_callback: status: 02b.ss.jp is now lost (was member)
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: crm_update_peer: Node 02b.ss.jp: id=1 state=lost (new) addr=(null) votes=-1 born=4 seen=4 proc=00000000000000000000000000000200
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: erase_node_from_join: Removed node 02b.ss.jp from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event: no mbr_track info
Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event: instance=5, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=5)
Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: crm_update_peer: Node 02b.ss.jp: id=1 state=lost (new) addr=(null) votes=-1 born=4 seen=4 proc=00000000000000000000000000000302
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: WARN: match_down_event: No match for shutdown action on 588eee4d-dbdf-471f-92b6-2b9789c5531a
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: te_update_diff: Stonith/shutdown of 588eee4d-dbdf-471f-92b6-2b9789c5531a not matched
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: abort_transition_graph: te_update_diff:191 - Triggered transition abort (complete=1, tag=node_state, id=588eee4d-dbdf-471f-92b6-2b9789c5531a, magic=NA, cib=0.33.45) : Node failure
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_pe_invoke: Query 2999: Requesting the current CIB: S_POLICY_ENGINE
Apr 30 05:02:48 02a.ss.jp cib: [8622]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/2997, version=0.33.45): ok (rc=0)
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_pe_invoke_callback: Invoking the PE: query=2999, ref=pe_calc-dc-1304107368-2964, seq=5, quorate=1
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: determine_online_status: Node 02a.ss.jp is online
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_vps1_monitor_0 found resource res_vps1 active on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_vps2_monitor_0 found resource res_vps2 active on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_drbd1:1_monitor_0 found resource res_drbd1:1 active in master mode on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_drbd2:1_monitor_0 found resource res_drbd2:1 active on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:  Resource Group: group1
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_Filesystem1  (ocf::heartbeat:Filesystem):    Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_ip1  (ocf::heartbeat:IPaddr2):   Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_vps1 (ocf::heartbeat:vps):   Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_MailTo1  (ocf::heartbeat:MailTo):    Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:  Resource Group: group2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_Filesystem2  (ocf::heartbeat:Filesystem):    Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_ip2  (ocf::heartbeat:IPaddr2):   Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_vps2 (ocf::heartbeat:vps):   Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_MailTo2  (ocf::heartbeat:MailTo):    Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:  Master/Slave Set: ms_drbd1
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Masters: [ 02a.ss.jp ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Stopped: [ res_drbd1:0 ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:  Master/Slave Set: ms_drbd2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Slaves: [ 02a.ss.jp ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Stopped: [ res_drbd2:0 ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_drbd1:0 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: Promoting res_drbd1:1 (Master 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd1: Promoted 1 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_drbd2:0 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd2: Promoted 0 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: rsc_merge_weights: res_Filesystem2: Rolling back scores from res_ip2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_Filesystem2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: rsc_merge_weights: res_ip2: Rolling back scores from res_vps2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_ip2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: rsc_merge_weights: res_vps2: Rolling back scores from res_MailTo2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_vps2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_MailTo2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: Promoting res_drbd1:1 (Master 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd1: Promoted 1 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd2: Promoted 0 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_Filesystem1  (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_ip1  (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_vps1 (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_MailTo1  (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_Filesystem2  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_ip2  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_vps2 (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_MailTo2  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd1:0  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd1:1  (Master 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd2:0  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd2:1  (Slave 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: unpack_graph: Unpacked transition 2908: 0 actions in 0 synapses
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_te_invoke: Processing graph 2908 (ref=pe_calc-dc-1304107368-2964) derived from /var/lib/pengine/pe-input-5430.bz2
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: run_graph: ====================================================
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: notice: run_graph: Transition 2908 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-5430.bz2): Complete
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: te_graph_trigger: Transition 2908 is now complete
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: notify_crmd: Transition 2908 status: done - <null>
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: Starting PEngine Recheck Timer
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: process_pe_message: Transition 2908: PEngine Input stored in: /var/lib/pengine/pe-input-5430.bz2
Apr 30 05:02:48 02a.ss.jp attrd: [8625]: info: attrd_trigger_update: Sending flush op to all hosts for: 02b.ss.jp-eth1 (<null>)
Apr 30 05:02:48 02a.ss.jp attrd: [8625]: info: attrd_perform_update: Sent delete 103: node=f6bd0d01-7e10-4dc6-af77-2fd6d772ed0c, attr=02b.ss.jp-eth1, id=<n/a>, set=(null), section=status
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: abort_transition_graph: te_update_diff:157 - Triggered transition abort (complete=1, tag=transient_attributes, id=f6bd0d01-7e10-4dc6-af77-2fd6d772ed0c, magic=NA, cib=0.33.47) : Transient attribute: removal
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_pe_invoke: Query 3000: Requesting the current CIB: S_POLICY_ENGINE
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_pe_invoke_callback: Invoking the PE: query=3000, ref=pe_calc-dc-1304107368-2965, seq=5, quorate=1
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: determine_online_status: Node 02a.ss.jp is online
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_vps1_monitor_0 found resource res_vps1 active on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_vps2_monitor_0 found resource res_vps2 active on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_drbd1:1_monitor_0 found resource res_drbd1:1 active in master mode on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op: Operation res_drbd2:1_monitor_0 found resource res_drbd2:1 active on 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:  Resource Group: group1
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_Filesystem1  (ocf::heartbeat:Filesystem):    Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_ip1  (ocf::heartbeat:IPaddr2):   Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_vps1 (ocf::heartbeat:vps):   Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_MailTo1  (ocf::heartbeat:MailTo):    Started 02a.ss.jp
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:  Resource Group: group2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_Filesystem2  (ocf::heartbeat:Filesystem):    Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_ip2  (ocf::heartbeat:IPaddr2):   Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_vps2 (ocf::heartbeat:vps):   Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:      res_MailTo2  (ocf::heartbeat:MailTo):    Stopped 
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:  Master/Slave Set: ms_drbd1
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Masters: [ 02a.ss.jp ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Stopped: [ res_drbd1:0 ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:  Master/Slave Set: ms_drbd2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Slaves: [ 02a.ss.jp ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:      Stopped: [ res_drbd2:0 ]
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_drbd1:0 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: Promoting res_drbd1:1 (Master 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd1: Promoted 1 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_drbd2:0 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd2: Promoted 0 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: rsc_merge_weights: res_Filesystem2: Rolling back scores from res_ip2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_Filesystem2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: rsc_merge_weights: res_ip2: Rolling back scores from res_vps2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_ip2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: rsc_merge_weights: res_vps2: Rolling back scores from res_MailTo2
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_vps2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color: Resource res_MailTo2 cannot run anywhere
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: Promoting res_drbd1:1 (Master 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd1: Promoted 1 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color: ms_drbd2: Promoted 0 instances of a possible 1 to master
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_Filesystem1  (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_ip1  (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_vps1 (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_MailTo1  (Started 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_Filesystem2  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_ip2  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_vps2 (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_MailTo2  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd1:0  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd1:1  (Master 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd2:0  (Stopped)
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: Leave resource res_drbd2:1  (Slave 02a.ss.jp)
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: unpack_graph: Unpacked transition 2909: 0 actions in 0 synapses
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_te_invoke: Processing graph 2909 (ref=pe_calc-dc-1304107368-2965) derived from /var/lib/pengine/pe-input-5431.bz2
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: run_graph: ====================================================
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: notice: run_graph: Transition 2909 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-5431.bz2): Complete
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: te_graph_trigger: Transition 2909 is now complete
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: notify_crmd: Transition 2909 status: done - <null>
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition: Starting PEngine Recheck Timer
Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: process_pe_message: Transition 2909: PEngine Input stored in: /var/lib/pengine/pe-input-5431.bz2
Apr 30 05:02:49 02a.ss.jp attrd: [8625]: info: attrd_ha_callback: flush message from 02a.ss.jp
---- end ----


岩崎＠サードウェア <iwasa****@3ware*****>さん：
>  岩崎＠サードウェアです
> 
>  OpenVZいいですねー。
>  DRBDがSecondaryからPrimaryに出来ない理由について、ログが見あたらなかったものでよくわからないのですがいくつかパターンがあるので確認してみてはいかがでしょうか。
> 
>  ・Primary側が落とせないプロセスがDRBDのマウントポイントを掴んでいる時（8.3.4だともしかするとシェルでログインしてるだけでも駄目）
>  ・OutDatedステータスになっている。（スプリットブレイン発生を防ぐ安全装置）
>  ・メタデータ領域が壊れている。（ほぼ無いですが外部メタデータにしててうっかりDDで消したりするとアクセスできなったり）
> 
>  多くの場合が、相手側ノードのPrimaryをSecondary状態に降格できない時にSecondaryからPrimaryに昇格できない症状が出ます。
>  今回の場合、リブートしてしまった時に起きているという事なので、OutDated状態になっていてフェイルオーバーできずにいて、
>  セカンダリノードが上がったときにDRBDも立ち上がり、再接続することでステータスが正常化してしまっている可能性があります。
> 
>  これが試験環境であれば、sysrq-triggerを使って再起動を行い正常に再起動されるかの確認をすると状況を再現できるかと思います。
>  あと、chkconfigでheartbeatとDRBDはoffにしておいたほうが良いです。
>  この２つは中途半端な状態で上がられると困るものなので、落ちたなら安全が確認できるまで上がってくれないほうが良いものになります。
> 
>  ＃ OpenVZのリソースエージェントに興味津々（実は私もコンテナ単位でF/OできるRAを作ってます）
> 
> 
>  On Mon, 16 May 2011 10:52:52 +0900, nemo****@zuku***** wrote:
> > 根本と申します。
> >
> > 4/30にDRBD+pacemakerで構成していたサーバの片方が
> > 謎のリブートをした時に、サービスがうまく切り替わ
> > らない問題が発生しました。現在原因を調査しており
> > ますが、原因が掴めておりません。
> >
> > [環境]
> > OS : CentOS 5.6
> > ※ 正確には、Openvz Kernelです。
> > # uname -a
> > Linux 02a.ss.jp 2.6.18-194.26.1.el5.028stab079.2 #1 SMP Fri Jan 28
> > 20:28:13 JST 2011 x86_64 x86_64 x86_64 GNU/Linux)
> >
> > 構成：姉妹構成(2台構成)
> > pacemaker-1.0.10-1.4.el5(Linux-HA Japan提供)
> > DRBD領域を2つ用意して、お互いがPrimary且つSecondaryにしています。
> >
> > 02aから見た/proc/drbd
> > [root ＠ 02a ~]# cat /proc/drbd
> > version: 8.3.4 (api:88/proto:86-91)
> > GIT-hash: 70a645ae080411c87b4482a135847d69dc90a6a2 build by
> > xemul****@ovzco*****, 2009-10-12 19:29:01
> >
> >  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
> >     ns:1138230316 nr:0 dw:912902096 dr:411045257 al:2333158 bm:118286
> > lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
> >  2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
> >     ns:687291028 nr:0 dw:339801384 dr:409325753 al:17077 bm:29038
> > lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
> >
> > /dev/drbd1 が /vz1 として
> > /dev/drbd2 が /vz2 としてマウントしてます。
> >
> > ログを見る限り、DRBDをSecondaryからPrimaryするところでこけている感じがするのですが
> > それ以上の原因がよくわかりません。そのときのログを添付しますので、ご意見を頂ければと思います。
> > なお、ログ内の res_vps1 と res_vps2は、Openvzの起動、停止用に自作したリソース・エージェントです。
> >
> > 第二回勉強会の準備などで忙しい方々もいらっしゃると思いますが、よろしくお願い致します。。
> > また、勉強会は私も参加させて頂ければと思います。
> >
> >
> > 不明点などがあれば、ご指摘頂下さい。
> > 以上、よろしくお願い致します。
> >
> >
> >
> >
> > 根本 稔也
> > ----
> > nemo****@zuku*****
> > ----
> 
> -- 
>  ----------------------------------------------------------------------
>  岩崎　　登               (株)サードウェア
> 
>  Noboru Iwasaki           274-0815 千葉県船橋市西習志野3-39-8
>  iwasa****@3ware*****      URL: http://www.3ware.co.jp/
>  Phone: 047-496-3341      Fax: 047-496-3370
> 
> _______________________________________________
> Linux-ha-japan mailing list
> Linux****@lists*****
> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan

Linux-HA Japan
Fork
pm_logconv-cs
pm_diskd
pm_logconv-hb
pm_extras
doc
pm_crmgen
vm-ctl
pm_kvm_tools

[Linux-ha-jp] pacemaker(heartbeat)の切り替えがうまく動作しない件について

Linux-HA Japan Forkpm_logconv-cspm_diskdpm_logconv-hbpm_extrasdocpm_crmgenvm-ctlpm_kvm_tools

[Linux-ha-jp] pacemaker(heartbeat)の切り替えがうまく動作しない件について

Linux-HA Japan
Fork
pm_logconv-cs
pm_diskd
pm_logconv-hb
pm_extras
doc
pm_crmgen
vm-ctl
pm_kvm_tools