[Linux-ha-jp] pacemaker(heartbeat)の切り替えがうまく動作しない件について

Back to archive index

nemo****@zuku***** nemo****@zuku*****
2011年 5月 18日 (水) 18:55:56 JST


根本です。

DRBDのNICに対しては、下記コマンドにより
起動時に特に制限をしてません。
# iptables -A INPUT -i eth2 -j ACCEPT
# iptables -A OUTPUT -o eth2 -j ACCEPT
# /etc/rc.d/init.d/iptables save

となると、メタデータ壊れたか物理デバイスでの
IOエラーかですがね。
となると、現状これ以上の調査は難しいですね。
とりあえず、リブートしたサーバのメモリが
怪しいので、全取っ替えしてDRBDのRAを詳細な
記録を取るように改良するなどして対応したいと
思います。

おのれD○LLマシンめ!!ヽ(`Д´)ノ 

どうもありがとうございました。



岩崎@サードウェア <iwasa****@3ware*****>さん:
>  岩崎@サードウェアです
> 
>  似たような状態になったことがありますね。
>  原因はいくつか考えられますが
> 
>  ・再起動の時iptablesが起動してポート閉じてるとき
>  ・メタデータ壊れてるとき
>  ・物理デバイスでioエラー出てるとき
> 
>  まず、iptablesの設定確認してみてください。
>  あとは…最終的にはStandAloneでPrimaryになっている感じだと思うのですが
>  いかがでしょう。
> 
> 
> 
> 
> 
> 
>  On Wed, 18 May 2011 15:56:42 +0900, nemo****@zuku***** wrote:
> > 根本です。
> >
> > 岩崎さんすいません。かなり早とちりでした。
> > 貼り付けたログは、ha-logでしたね。以下、drbdのログです。
> > なお、05:20:21 以降のログは手動でPrimaryにした結果なので
> > 無視して下さい。
> >
> > ----start----
> > Apr 30 05:02:28 02a kernel: block drbd2: PingAck did not arrive in 
> > time.
> > Apr 30 05:02:28 02a kernel: block drbd2: peer( Primary -> Unknown )
> > conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> > Apr 30 05:02:28 02a kernel: block drbd2: asender terminated
> > Apr 30 05:02:28 02a kernel: block drbd2: Terminating asender thread
> > Apr 30 05:02:28 02a kernel: block drbd2: short read expecting header
> > on sock: r=-512
> > Apr 30 05:02:28 02a kernel: block drbd2: Connection closed
> > Apr 30 05:02:28 02a kernel: block drbd2: conn( NetworkFailure ->
> > Unconnected )
> > Apr 30 05:02:28 02a kernel: block drbd2: receiver terminated
> > Apr 30 05:02:28 02a kernel: block drbd2: Restarting receiver thread
> > Apr 30 05:02:28 02a kernel: block drbd2: receiver (re)started
> > Apr 30 05:02:28 02a kernel: block drbd2: conn( Unconnected -> 
> > WFConnection )
> > Apr 30 05:20:21 02a kernel: block drbd2: role( Secondary -> Primary )
> > Apr 30 05:20:21 02a kernel: block drbd2: Creating new current UUID
> > Apr 30 05:20:28 02a kernel: block drbd2: role( Primary -> Secondary )
> > Apr 30 05:20:28 02a kernel: block drbd2: conn( WFConnection ->
> > Disconnecting )
> > Apr 30 05:20:28 02a kernel: block drbd2: Discarding network 
> > configuration.
> > Apr 30 05:20:28 02a kernel: block drbd2: Connection closed
> > Apr 30 05:20:28 02a kernel: block drbd2: conn( Disconnecting -> 
> > StandAlone )
> > Apr 30 05:20:28 02a kernel: block drbd2: receiver terminated
> > Apr 30 05:20:28 02a kernel: block drbd2: Terminating receiver thread
> > Apr 30 05:20:28 02a kernel: block drbd2: disk( UpToDate -> Diskless )
> > Apr 30 05:20:28 02a kernel: block drbd2: drbd_bm_resize called with
> > capacity == 0
> > Apr 30 05:20:28 02a kernel: block drbd2: worker terminated
> > Apr 30 05:20:28 02a kernel: block drbd2: Terminating worker thread
> > Apr 30 05:20:28 02a kernel: block drbd2: Starting worker thread (from
> > cqueue/9 [517])
> > Apr 30 05:20:28 02a kernel: block drbd2: disk( Diskless -> Attaching 
> > )
> > Apr 30 05:20:28 02a kernel: block drbd2: Found 6 transactions (128
> > active extents) in activity log.
> > Apr 30 05:20:28 02a kernel: block drbd2: Method to ensure write
> > ordering: barrier
> > Apr 30 05:20:28 02a kernel: block drbd2: max_segment_size ( = BIO
> > size ) = 32768
> > Apr 30 05:20:28 02a kernel: block drbd2: drbd_bm_resize called with
> > capacity == 786407928
> > Apr 30 05:20:28 02a kernel: block drbd2: resync bitmap: bits=98300991
> > words=1535953
> > Apr 30 05:20:28 02a kernel: block drbd2: size = 375 GB (393203964 KB)
> > Apr 30 05:20:28 02a kernel: block drbd2: recounting of set bits took
> > additional 13 jiffies
> > Apr 30 05:20:28 02a kernel: block drbd2: 0 KB (0 bits) marked
> > out-of-sync by on disk bit-map.
> > Apr 30 05:20:28 02a kernel: block drbd2: disk( Attaching -> UpToDate 
> > )
> > Apr 30 05:20:28 02a kernel: block drbd2: Barriers not supported on
> > meta data device - disabling
> > Apr 30 05:20:28 02a kernel: block drbd2: conn( StandAlone -> 
> > Unconnected )
> > Apr 30 05:20:28 02a kernel: block drbd2: Starting receiver thread
> > (from drbd2_worker [16783])
> > Apr 30 05:20:28 02a kernel: block drbd2: receiver (re)started
> > Apr 30 05:20:28 02a kernel: block drbd2: conn( Unconnected -> 
> > WFConnection )
> > Apr 30 05:20:29 02a kernel: block drbd2: role( Secondary -> Primary )
> > Apr 30 05:20:32 02a kernel: EXT3 FS on drbd2, internal journal
> > Apr 30 05:20:34 02a kernel: EXT3-fs: drbd2: 226 orphan inodes deleted
> > ---- end ----
> >
> >
> >
> >
> >
> > 岩崎@サードウェア <iwasa****@3ware*****>さん:
> >>  岩崎@サードウェアです
> >>
> >>  根本さんのログを拝見しましたけど、DRBD周りのログが出ていないようですね。
> >>  DRBDの操作が行われるとmessageログにDRBDのログも出るかと思うんですが、いかがでしょうか。
> >>  はっつけてもらったのはha-logのほうでしょうかね。
> >>
> >>
> >>
> >>  On Wed, 18 May 2011 15:07:18 +0900, nemo****@zuku***** wrote:
> >> > 根本です。
> >> >
> >> > 岩崎さん早速のご返事ありがとうございます。
> >> > そして、返信おくれてすいません。
> >> >
> >> >>
> >> >> 
> >> ・Primary側が落とせないプロセスがDRBDのマウントポイントを掴んでいる時(8.3.4だともしかするとシェルでログインしてるだけでも駄目)
> >> >>  ・OutDatedステータスになっている。(スプリットブレイン発生を防ぐ安全装置)
> >> >>  ・メタデータ領域が壊れている。(ほぼ無いですが外部メタデータにしててうっかりDDで消したりするとアクセスできなったり)
> >> >>
> >> >> 
> >> 多くの場合が、相手側ノードのPrimaryをSecondary状態に降格できない時にSecondaryからPrimaryに昇格できない症状が出ます。
> >> >>  今回の場合、リブートしてしまった時に起きているという事なので、OutDated状態になっていてフェイルオーバーできずにいて、
> >> >>  セカンダリノードが上がったときにDRBDも立ち上がり、再接続することでステータスが正常化してしまっている可能性があります。
> >> >>  
> >> これが試験環境であれば、sysrq-triggerを使って再起動を行い正常に再起動されるかの確認をすると状況を再現できるかと思います。
> >> >>  あと、chkconfigでheartbeatとDRBDはoffにしておいたほうが良いです。
> >> >>  この2つは中途半端な状態で上がられると困るものなので、落ちたなら安全が確認できるまで上がってくれないほうが良いものになります。
> >> > ▼
> >> > heartbeatとDRBDは、chkconfigでoffにしております。
> >> > なので、2号機が落ちた時点でDRBDの接続が切れているので、
> >> > マウントポイントを掴んでいるプロセスは無いかと思います。
> >> > さすがに本番環境なので、sysrq-triggerを使った再起動実験は難しいと思います。
> >> > 一応、KVM上のテスト環境では、ちゃんと切り替わっているのですが・・・
> >> > やはり、どこかでOutDatedステータスになったままだったのでしょうか。
> >> >
> >> >
> >> > また、ログが見あたらなかったとのことですが、ログファイルを添付したつもりでしたが
> >> > 本文に貼り付けた方がよろしかったでしょうか。取り急ぎ、本文末に貼り付けて起きます。
> >> >
> >> >
> >> >>  # OpenVZのリソースエージェントに興味津々(実は私もコンテナ単位でF/OできるRAを作ってます)
> >> > ▼
> >> > 前任者がheartbeat1.xx時代に作成したものを、数ヶ月前まで「シェルスクリプトってなに?」て言っていた
> >> > 自分が改良したものなので、かなり頼りない参考にならないスクリプトですが・・・
> >> > よろしければ、MLに投げましょうか?(冗談抜きでやっつけスクリプトですよ)
> >> >
> >> >
> >> >
> >> > 以上、よろしくお願い致します。
> >> >
> >> >
> >> >
> >> > 以下、ha-debug.log
> >> > ----start----
> >> > Apr 30 05:02:47 02a.ss.jp heartbeat: [8613]: WARN: node 02b.ss.jp: 
> >> is
> >> > dead
> >> > Apr 30 05:02:47 02a.ss.jp heartbeat: [8613]: info: Link
> >> > 02b.ss.jp:eth1 dead.
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: notice:
> >> > crmd_ha_status_callback: Status update: Node 02b.ss.jp now has 
> >> status
> >> > [dead] (DC=true)
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: 
> >> crm_update_peer_proc:
> >> > 02b.ss.jp.ais is now offline
> >> > Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: quorum plugin: 
> >> majority
> >> > Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: cluster:linux-ha,
> >> > member_count=1, member_quorum_votes=100
> >> > Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: total_node_count=2,
> >> > total_quorum_votes=200
> >> > Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: quorum plugin: 
> >> twonodes
> >> > Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: cluster:linux-ha,
> >> > member_count=1, member_quorum_votes=100
> >> > Apr 30 05:02:47 02a.ss.jp ccm: [8621]: debug: total_node_count=2,
> >> > total_quorum_votes=200
> >> > Apr 30 05:02:47 02a.ss.jp ccm: [8621]: info: Break tie for 2 nodes
> >> > cluster
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event: 
> >> Got
> >> > an event OC_EV_MS_INVALID from ccm
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event: no
> >> > mbr_track info
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event: 
> >> Got
> >> > an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: mem_handle_event:
> >> > instance=5, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: 
> >> crmd_ccm_msg_callback:
> >> > Quorum (re)attained after event=NEW MEMBERSHIP (id=5)
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ccm_event_detail: 
> >> NEW
> >> > MEMBERSHIP: trans=5, nodes=1, new=0, lost=1 n_idx=0, new_idx=1,
> >> > old_idx=3
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ccm_event_detail:
> >> > CURRENT: 02a.ss.jp [nodeid=0, born=5]
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ccm_event_detail:
> >> > LOST:    02b.ss.jp [nodeid=1, born=4]
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: ais_status_callback:
> >> > status: 02b.ss.jp is now lost (was member)
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: crm_update_peer: 
> >> Node
> >> > 02b.ss.jp: id=1 state=lost (new) addr=(null) votes=-1 born=4 
> >> seen=4
> >> > proc=00000000000000000000000000000200
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: 
> >> erase_node_from_join:
> >> > Removed node 02b.ss.jp from join calculations: welcomed=0 
> >> itegrated=0
> >> > finalized=0 confirmed=1
> >> > Apr 30 05:02:47 02a.ss.jp crmd: [8626]: info: 
> >> populate_cib_nodes_ha:
> >> > Requesting the list of configured nodes
> >> > Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event: Got 
> >> an
> >> > event OC_EV_MS_INVALID from ccm
> >> > Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event: no
> >> > mbr_track info
> >> > Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event: Got 
> >> an
> >> > event OC_EV_MS_NEW_MEMBERSHIP from ccm
> >> > Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: mem_handle_event:
> >> > instance=5, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> >> > Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: cib_ccm_msg_callback:
> >> > Processing CCM event=NEW MEMBERSHIP (id=5)
> >> > Apr 30 05:02:47 02a.ss.jp cib: [8622]: info: crm_update_peer: Node
> >> > 02b.ss.jp: id=1 state=lost (new) addr=(null) votes=-1 born=4 
> >> seen=4
> >> > proc=00000000000000000000000000000302
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: WARN: match_down_event: No
> >> > match for shutdown action on 588eee4d-dbdf-471f-92b6-2b9789c5531a
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: te_update_diff:
> >> > Stonith/shutdown of 588eee4d-dbdf-471f-92b6-2b9789c5531a not 
> >> matched
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: 
> >> abort_transition_graph:
> >> > te_update_diff:191 - Triggered transition abort (complete=1,
> >> > tag=node_state, id=588eee4d-dbdf-471f-92b6-2b9789c5531a, magic=NA,
> >> > cib=0.33.45) : Node failure
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> >> > cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > All 1 cluster nodes are eligible to run resources.
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_pe_invoke: Query
> >> > 2999: Requesting the current CIB: S_POLICY_ENGINE
> >> > Apr 30 05:02:48 02a.ss.jp cib: [8622]: info: cib_process_request:
> >> > Operation complete: op cib_modify for section nodes
> >> > (origin=local/crmd/2997, version=0.33.45): ok (rc=0)
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: 
> >> do_pe_invoke_callback:
> >> > Invoking the PE: query=2999, ref=pe_calc-dc-1304107368-2964, 
> >> seq=5,
> >> > quorate=1
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_config: 
> >> On
> >> > loss of CCM Quorum: Ignore
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: unpack_config: 
> >> Node
> >> > scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info:
> >> > determine_online_status: Node 02a.ss.jp is online
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_vps1_monitor_0 found resource res_vps1 active on
> >> > 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_vps2_monitor_0 found resource res_vps2 active on
> >> > 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_drbd1:1_monitor_0 found resource res_drbd1:1 active 
> >> in
> >> > master mode on 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_drbd2:1_monitor_0 found resource res_drbd2:1 active 
> >> on
> >> > 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:
> >> > Resource Group: group1
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_Filesystem1  (ocf::heartbeat:Filesystem):    Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_ip1  (ocf::heartbeat:IPaddr2):   Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_vps1 (ocf::heartbeat:vps):   Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_MailTo1  (ocf::heartbeat:MailTo):    Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:
> >> > Resource Group: group2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_Filesystem2  (ocf::heartbeat:Filesystem):    Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_ip2  (ocf::heartbeat:IPaddr2):   Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_vps2 (ocf::heartbeat:vps):   Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_MailTo2  (ocf::heartbeat:MailTo):    Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:
> >> > Master/Slave Set: ms_drbd1
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Masters: [ 02a.ss.jp ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Stopped: [ res_drbd1:0 ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:
> >> > Master/Slave Set: ms_drbd2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Slaves: [ 02a.ss.jp ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Stopped: [ res_drbd2:0 ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_drbd1:0 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > Promoting res_drbd1:1 (Master 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd1: Promoted 1 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_drbd2:0 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd2: Promoted 0 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> rsc_merge_weights:
> >> > res_Filesystem2: Rolling back scores from res_ip2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_Filesystem2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> rsc_merge_weights:
> >> > res_ip2: Rolling back scores from res_vps2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_ip2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> rsc_merge_weights:
> >> > res_vps2: Rolling back scores from res_MailTo2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_vps2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_MailTo2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > Promoting res_drbd1:1 (Master 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd1: Promoted 1 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd2: Promoted 0 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_Filesystem1  (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_ip1  (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_vps1 (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_MailTo1  (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_Filesystem2  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_ip2  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_vps2 (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_MailTo2  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd1:0  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd1:1  (Master 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd2:0  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd2:1  (Slave 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> >> > input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: unpack_graph: 
> >> Unpacked
> >> > transition 2908: 0 actions in 0 synapses
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_te_invoke:
> >> > Processing graph 2908 (ref=pe_calc-dc-1304107368-2964) derived 
> >> from
> >> > /var/lib/pengine/pe-input-5430.bz2
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: run_graph:
> >> > ====================================================
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: notice: run_graph: 
> >> Transition
> >> > 2908 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> >> > Source=/var/lib/pengine/pe-input-5430.bz2): Complete
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: te_graph_trigger:
> >> > Transition 2908 is now complete
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: notify_crmd: 
> >> Transition
> >> > 2908 status: done - <null>
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > State transition S_TRANSITION_ENGINE -> S_IDLE [ 
> >> input=I_TE_SUCCESS
> >> > cause=C_FSA_INTERNAL origin=notify_crmd ]
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > Starting PEngine Recheck Timer
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> process_pe_message:
> >> > Transition 2908: PEngine Input stored in:
> >> > /var/lib/pengine/pe-input-5430.bz2
> >> > Apr 30 05:02:48 02a.ss.jp attrd: [8625]: info: 
> >> attrd_trigger_update:
> >> > Sending flush op to all hosts for: 02b.ss.jp-eth1 (<null>)
> >> > Apr 30 05:02:48 02a.ss.jp attrd: [8625]: info: 
> >> attrd_perform_update:
> >> > Sent delete 103: node=f6bd0d01-7e10-4dc6-af77-2fd6d772ed0c,
> >> > attr=02b.ss.jp-eth1, id=<n/a>, set=(null), section=status
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: 
> >> abort_transition_graph:
> >> > te_update_diff:157 - Triggered transition abort (complete=1,
> >> > tag=transient_attributes, id=f6bd0d01-7e10-4dc6-af77-2fd6d772ed0c,
> >> > magic=NA, cib=0.33.47) : Transient attribute: removal
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> >> > cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > All 1 cluster nodes are eligible to run resources.
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_pe_invoke: Query
> >> > 3000: Requesting the current CIB: S_POLICY_ENGINE
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: 
> >> do_pe_invoke_callback:
> >> > Invoking the PE: query=3000, ref=pe_calc-dc-1304107368-2965, 
> >> seq=5,
> >> > quorate=1
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_config: 
> >> On
> >> > loss of CCM Quorum: Ignore
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: unpack_config: 
> >> Node
> >> > scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info:
> >> > determine_online_status: Node 02a.ss.jp is online
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_vps1_monitor_0 found resource res_vps1 active on
> >> > 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_vps2_monitor_0 found resource res_vps2 active on
> >> > 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_drbd1:1_monitor_0 found resource res_drbd1:1 active 
> >> in
> >> > master mode on 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: unpack_rsc_op:
> >> > Operation res_drbd2:1_monitor_0 found resource res_drbd2:1 active 
> >> on
> >> > 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:
> >> > Resource Group: group1
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_Filesystem1  (ocf::heartbeat:Filesystem):    Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_ip1  (ocf::heartbeat:IPaddr2):   Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_vps1 (ocf::heartbeat:vps):   Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_MailTo1  (ocf::heartbeat:MailTo):    Started 02a.ss.jp
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: group_print:
> >> > Resource Group: group2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_Filesystem2  (ocf::heartbeat:Filesystem):    Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_ip2  (ocf::heartbeat:IPaddr2):   Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_vps2 (ocf::heartbeat:vps):   Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: native_print:
> >> > res_MailTo2  (ocf::heartbeat:MailTo):    Stopped
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:
> >> > Master/Slave Set: ms_drbd1
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Masters: [ 02a.ss.jp ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Stopped: [ res_drbd1:0 ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: clone_print:
> >> > Master/Slave Set: ms_drbd2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Slaves: [ 02a.ss.jp ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: short_print:
> >> > Stopped: [ res_drbd2:0 ]
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_drbd1:0 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > Promoting res_drbd1:1 (Master 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd1: Promoted 1 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_drbd2:0 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd2: Promoted 0 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> rsc_merge_weights:
> >> > res_Filesystem2: Rolling back scores from res_ip2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_Filesystem2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> rsc_merge_weights:
> >> > res_ip2: Rolling back scores from res_vps2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_ip2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> rsc_merge_weights:
> >> > res_vps2: Rolling back scores from res_MailTo2
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_vps2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: native_color:
> >> > Resource res_MailTo2 cannot run anywhere
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > Promoting res_drbd1:1 (Master 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd1: Promoted 1 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: master_color:
> >> > ms_drbd2: Promoted 0 instances of a possible 1 to master
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_Filesystem1  (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_ip1  (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_vps1 (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_MailTo1  (Started 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_Filesystem2  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_ip2  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_vps2 (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_MailTo2  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd1:0  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd1:1  (Master 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd2:0  (Stopped)
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: notice: LogActions: 
> >> Leave
> >> > resource res_drbd2:1  (Slave 02a.ss.jp)
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [
> >> > input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: unpack_graph: 
> >> Unpacked
> >> > transition 2909: 0 actions in 0 synapses
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_te_invoke:
> >> > Processing graph 2909 (ref=pe_calc-dc-1304107368-2965) derived 
> >> from
> >> > /var/lib/pengine/pe-input-5431.bz2
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: run_graph:
> >> > ====================================================
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: notice: run_graph: 
> >> Transition
> >> > 2909 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> >> > Source=/var/lib/pengine/pe-input-5431.bz2): Complete
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: te_graph_trigger:
> >> > Transition 2909 is now complete
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: notify_crmd: 
> >> Transition
> >> > 2909 status: done - <null>
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > State transition S_TRANSITION_ENGINE -> S_IDLE [ 
> >> input=I_TE_SUCCESS
> >> > cause=C_FSA_INTERNAL origin=notify_crmd ]
> >> > Apr 30 05:02:48 02a.ss.jp crmd: [8626]: info: do_state_transition:
> >> > Starting PEngine Recheck Timer
> >> > Apr 30 05:02:48 02a.ss.jp pengine: [23132]: info: 
> >> process_pe_message:
> >> > Transition 2909: PEngine Input stored in:
> >> > /var/lib/pengine/pe-input-5431.bz2
> >> > Apr 30 05:02:49 02a.ss.jp attrd: [8625]: info: attrd_ha_callback:
> >> > flush message from 02a.ss.jp
> >> > ---- end ----
> >> >
> >> >
> >> >
> >> > 岩崎@サードウェア <iwasa****@3ware*****>さん:
> >> >>  岩崎@サードウェアです
> >> >>
> >> >>  OpenVZいいですねー。
> >> >>
> >> >> 
> >> DRBDがSecondaryからPrimaryに出来ない理由について、ログが見あたらなかったものでよくわからないのですがいくつかパターンがあるので確認してみてはいかがでしょうか。
> >> >>
> >> >>
> >> >> 
> >> ・Primary側が落とせないプロセスがDRBDのマウントポイントを掴んでいる時(8.3.4だともしかするとシェルでログインしてるだけでも駄目)
> >> >>  ・OutDatedステータスになっている。(スプリットブレイン発生を防ぐ安全装置)
> >> >>  ・メタデータ領域が壊れている。(ほぼ無いですが外部メタデータにしててうっかりDDで消したりするとアクセスできなったり)
> >> >>
> >> >>
> >> >> 
> >> 多くの場合が、相手側ノードのPrimaryをSecondary状態に降格できない時にSecondaryからPrimaryに昇格できない症状が出ます。
> >> >>  今回の場合、リブートしてしまった時に起きているという事なので、OutDated状態になっていてフェイルオーバーできずにいて、
> >> >>  セカンダリノードが上がったときにDRBDも立ち上がり、再接続することでステータスが正常化してしまっている可能性があります。
> >> >>
> >> >>  
> >> これが試験環境であれば、sysrq-triggerを使って再起動を行い正常に再起動されるかの確認をすると状況を再現できるかと思います。
> >> >>  あと、chkconfigでheartbeatとDRBDはoffにしておいたほうが良いです。
> >> >>  この2つは中途半端な状態で上がられると困るものなので、落ちたなら安全が確認できるまで上がってくれないほうが良いものになります。
> >> >>
> >> >>  # OpenVZのリソースエージェントに興味津々(実は私もコンテナ単位でF/OできるRAを作ってます)
> >> >>
> >> >>
> >> >>  On Mon, 16 May 2011 10:52:52 +0900, nemo****@zuku***** wrote:
> >> >> > 根本と申します。
> >> >> >
> >> >> > 4/30にDRBD+pacemakerで構成していたサーバの片方が
> >> >> > 謎のリブートをした時に、サービスがうまく切り替わ
> >> >> > らない問題が発生しました。現在原因を調査しており
> >> >> > ますが、原因が掴めておりません。
> >> >> >
> >> >> > [環境]
> >> >> > OS : CentOS 5.6
> >> >> > ※ 正確には、Openvz Kernelです。
> >> >> > # uname -a
> >> >> > Linux 02a.ss.jp 2.6.18-194.26.1.el5.028stab079.2 #1 SMP Fri Jan 
> >> 28
> >> >> > 20:28:13 JST 2011 x86_64 x86_64 x86_64 GNU/Linux)
> >> >> >
> >> >> > 構成:姉妹構成(2台構成)
> >> >> > pacemaker-1.0.10-1.4.el5(Linux-HA Japan提供)
> >> >> > DRBD領域を2つ用意して、お互いがPrimary且つSecondaryにしています。
> >> >> >
> >> >> > 02aから見た/proc/drbd
> >> >> > [root @ 02a ~]# cat /proc/drbd
> >> >> > version: 8.3.4 (api:88/proto:86-91)
> >> >> > GIT-hash: 70a645ae080411c87b4482a135847d69dc90a6a2 build by
> >> >> > xemul****@ovzco*****, 2009-10-12 19:29:01
> >> >> >
> >> >> >  1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C 
> >> r----
> >> >> >     ns:1138230316 nr:0 dw:912902096 dr:411045257 al:2333158
> >> >> bm:118286
> >> >> > lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
> >> >> >  2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C 
> >> r----
> >> >> >     ns:687291028 nr:0 dw:339801384 dr:409325753 al:17077 
> >> bm:29038
> >> >> > lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
> >> >> >
> >> >> > /dev/drbd1 が /vz1 として
> >> >> > /dev/drbd2 が /vz2 としてマウントしてます。
> >> >> >
> >> >> > ログを見る限り、DRBDをSecondaryからPrimaryするところでこけている感じがするのですが
> >> >> > それ以上の原因がよくわかりません。そのときのログを添付しますので、ご意見を頂ければと思います。
> >> >> > なお、ログ内の res_vps1 と res_vps2は、Openvzの起動、停止用に自作したリソース・エージェントです。
> >> >> >
> >> >> > 第二回勉強会の準備などで忙しい方々もいらっしゃると思いますが、よろしくお願い致します。。
> >> >> > また、勉強会は私も参加させて頂ければと思います。
> >> >> >
> >> >> >
> >> >> > 不明点などがあれば、ご指摘頂下さい。
> >> >> > 以上、よろしくお願い致します。
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > 根本 稔也
> >> >> > ----
> >> >> > nemo****@zuku*****
> >> >> > ----
> >> >>
> >> >> --
> >> >>
> >> >> 
> >> ----------------------------------------------------------------------
> >> >>  岩崎  登               (株)サードウェア
> >> >>
> >> >>  Noboru Iwasaki           274-0815 千葉県船橋市西習志野3-39-8
> >> >>  iwasa****@3ware*****      URL: http://www.3ware.co.jp/
> >> >>  Phone: 047-496-3341      Fax: 047-496-3370
> >> >>
> >> >> _______________________________________________
> >> >> Linux-ha-japan mailing list
> >> >> Linux****@lists*****
> >> >> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
> >> >
> >> > _______________________________________________
> >> > Linux-ha-japan mailing list
> >> > Linux****@lists*****
> >> > http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
> >>
> >> --
> >>  
> >> ----------------------------------------------------------------------
> >>  岩崎  登               (株)サードウェア
> >>
> >>  Noboru Iwasaki           274-0815 千葉県船橋市西習志野3-39-8
> >>  iwasa****@3ware*****      URL: http://www.3ware.co.jp/
> >>  Phone: 047-496-3341      Fax: 047-496-3370
> >>
> >> _______________________________________________
> >> Linux-ha-japan mailing list
> >> Linux****@lists*****
> >> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
> >
> > _______________________________________________
> > Linux-ha-japan mailing list
> > Linux****@lists*****
> > http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan
> 
> -- 
>  ----------------------------------------------------------------------
>  岩崎  登               (株)サードウェア
> 
>  Noboru Iwasaki           274-0815 千葉県船橋市西習志野3-39-8
>  iwasa****@3ware*****      URL: http://www.3ware.co.jp/
>  Phone: 047-496-3341      Fax: 047-496-3370
> 
> _______________________________________________
> Linux-ha-japan mailing list
> Linux****@lists*****
> http://lists.sourceforge.jp/mailman/listinfo/linux-ha-japan





Linux-ha-japan メーリングリストの案内
Back to archive index