Did the Database Instance Crash Because of High CPU?

Unless marked as a repost, all articles on this site are original. Reposted from love wife love life, Roger's Oracle/MySQL/PostgreSQL data recovery blog


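On one system, the database instance on one of the RAC nodes was killed and then restarted automatically. The alert log for that instance is shown below: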

Wed Oct 08 12:59:51 CST 2014
Thread 1 advanced to log sequence 543 (LGWR switch)
Current log# 1 seq# 543 mem# 0: +DGSYS/hisdb/onlinelog/group_1.261.856010381
Wed Oct 08 13:13:45 CST 2014
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 08 13:16:56 CST 2014
LMON (ospid: 17329) is not heartbeating for 204 seconds.
LMON is not healthy and has no heartbeat.
Please check LMD0/LMS0 and DIAG trace files for detail.
Wed Oct 08 13:17:02 CST 2014
LMS0 (ospid: 17333) is terminating the instance.
LMS0: terminating instance due to error 484
Wed Oct 08 13:17:05 CST 2014
System state dump is made for local instance
System State dumped to trace file /opt/oracle/admin/hisdb/bdump/hisdb1_diag_17325.trc
Wed Oct 08 13:17:06 CST 2014
Shutting down instance (abort)
License high water mark = 1068
Wed Oct 08 13:17:07 CST 2014
Instance terminated by LMS0, pid = 17333
Wed Oct 08 13:17:11 CST 2014
Instance terminated by USER, pid = 15788
Wed Oct 08 13:17:17 CST 2014
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 10.0.0.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 10.3.36.192 configured from OCR for use as  a public interface
Picked latch-free SCN scheme 3
Wed Oct 08 13:17:27 CST 2014
Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/oracle/product/10.2.0/db/dbs/arch
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
......
Wed Oct 08 13:17:39 CST 2014
ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Wed Oct 08 13:17:39 CST 2014
SUCCESS: diskgroup MSDATA was mounted
SUCCESS: diskgroup DGDATA was mounted
Wed Oct 08 13:17:39 CST 2014
Thread 1 opened at log sequence 544
Current log# 2 seq# 544 mem# 0: +DGSYS/hisdb/onlinelog/group_2.262.856010381
Successful open of redo thread 1
Wed Oct 08 13:17:39 CST 2014
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Oct 08 13:17:39 CST 2014
SMON: enabling cache recovery
Wed Oct 08 13:17:40 CST 2014
Successfully onlined Undo Tablespace 1.
Wed Oct 08 13:17:40 CST 2014
SMON: enabling tx recovery
Wed Oct 08 13:17:40 CST 2014
Database Characterset is ZHS16GBK
Opening with internal Resource Manager plan
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=28, OS id=16432
Wed Oct 08 13:17:42 CST 2014
Completed: ALTER DATABASE OPEN

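We can see that the instance was forcibly terminated by the LMS0 process at 13:17:07, and was then started normally at 13:17:17.
Before the instance was terminated, the log contains some key information that deserves attention: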

Wed Oct 08 13:13:45 CST 2014
WARNING: inbound connection timed out (ORA-3136)
Wed Oct 08 13:16:56 CST 2014
LMON (ospid: 17329) is not heartbeating for 204 seconds.
LMON is not healthy and has no heartbeat.
Please check LMD0/LMS0 and DIAG trace files for detail.
Wed Oct 08 13:17:02 CST 2014
LMS0 (ospid: 17333) is terminating the instance.
LMS0: terminating instance due to error 484

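From this excerpt we can infer that at 13:16:56 Oracle found that the LMON process had gone a long time without a heartbeat: 204 seconds.
Counting back 204 seconds from that point gives 13:13:32, so LMON had in fact already started to misbehave by then. We can also see an ORA-3136 error at 13:13:45.
Generally speaking, this warning is closely tied to system load: when resource usage is extremely high, for example, inbound connections can time out.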
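
The back-calculation above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `parse_alert_ts` is hypothetical, and the time zone token is simply dropped because both timestamps come from the same alert log.

```python
from datetime import datetime, timedelta


def parse_alert_ts(line: str) -> datetime:
    """Parse an alert log timestamp such as 'Wed Oct 08 13:16:56 CST 2014'."""
    parts = line.split()
    # Drop the time zone token (index 4); both values share the same zone,
    # and strptime's handling of zone abbreviations is platform dependent.
    cleaned = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(cleaned, "%a %b %d %H:%M:%S %Y")


# Time at which the alert log reported the missing heartbeat.
reported_at = parse_alert_ts("Wed Oct 08 13:16:56 CST 2014")
# "LMON ... is not heartbeating for 204 seconds."
last_heartbeat = reported_at - timedelta(seconds=204)
# Time of the ORA-3136 warning.
ora_3136_at = parse_alert_ts("Wed Oct 08 13:13:45 CST 2014")

print("last LMON heartbeat:", last_heartbeat.time())
print("seconds from last heartbeat to ORA-3136:",
      (ora_3136_at - last_heartbeat).total_seconds())
```

Under these assumptions, the ORA-3136 warning falls only a few seconds after the last recorded heartbeat, which fits the idea that both symptoms share one cause.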

The alert log tells us to check the LMD0/LMS0 and DIAG trace files for further analysis. Let's start with the trace of the LMD0 process:

*** 2014-10-08 12:47:22.077
Setting 3-way CR grants to 1 global-lru off? 0
*** 2014-10-08 13:16:58.621
KJM_HISTORY: RCVR STALL OP(12) context 5 elapsed 202974048 us
KJM HIST LMD0:
12:5:202974048 7:6 6:3 10:4 2:6 1:2 11:2 15:2 12:80846 7:4
6:4 10:9 11:2 15:4 12:78700 7:8 6:1 10:8 11:2 15:2
12:79509 7:3 6:1 10:3 11:2 15:2 12:78710 7:2 6:2 10:4
11:3 15:1 12:79073 7:3 6:2 10:5 11:2 15:2 12:79846 7:6
6:1 10:14 11:2 15:5 12:78560 7:6 6:3 10:9 11:2 15:5
12:79708 7:5 6:1 10:6 11:0 15:1 12:79261 7:3 6:1 10:10
11:1 15:4 12:79432 7:4
----------------------------------------
SO: 0x478e7ff10, type: 4, owner: 0x4798e0088, flag: INIT/-/-/0x00
(session) sid: 3301 trans: (nil), creator: 0x4798e0088, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
last wait for 'ges remote message' wait_time=3 min 22 sec, seconds since wait started=258
waittime=40, loop=0, p3=0
blocking sess=0x(nil) seq=8230
Dumping Session Wait History
for 'ges remote message' count=1 wait_time=3 min 22 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.080830 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078682 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079500 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078702 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079063 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079834 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.078517 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079689 sec
waittime=40, loop=0, p3=0
for 'ges remote message' count=1 wait_time=0.079250 sec
waittime=40, loop=0, p3=0

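The LMD0 trace shows that the process (lmd0) had been waiting for 258 seconds ("seconds since wait started=258") on the wait event 'ges remote message', and that its last wait on that event lasted 3 min 22 sec. Beyond that, the trace gives no further information.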
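
To put the numbers in the LMD0 trace side by side, here is a small sketch that converts them all to seconds. The helper `wait_seconds` is hypothetical, the excerpt is a shortened copy of the trace above, and the "elapsed" value is read as microseconds purely on the basis of its "us" suffix.

```python
import re

# Shortened copy of the LMD0 trace lines shown above.
trace_excerpt = """\
KJM_HISTORY: RCVR STALL OP(12) context 5 elapsed 202974048 us
for 'ges remote message' count=1 wait_time=3 min 22 sec
for 'ges remote message' count=1 wait_time=0.080830 sec
for 'ges remote message' count=1 wait_time=0.078682 sec
"""


def wait_seconds(text: str) -> float:
    """Convert '3 min 22 sec' or '0.080830 sec' into seconds."""
    m = re.fullmatch(r"(?:(\d+) min )?([\d.]+) sec", text.strip())
    if m is None:
        raise ValueError(f"unrecognised wait_time: {text!r}")
    minutes = int(m.group(1) or 0)
    return minutes * 60 + float(m.group(2))


# Stall duration reported by KJM_HISTORY, converted from microseconds.
stall_us = int(re.search(r"elapsed (\d+) us", trace_excerpt).group(1))
stall_s = stall_us / 1_000_000

# Every wait_time entry in the session wait history, in seconds.
waits = [wait_seconds(w) for w in re.findall(r"wait_time=(.+)", trace_excerpt)]

print(f"receiver stall: {stall_s:.1f} s")
print(f"longest wait:   {max(waits):.1f} s")
print(f"typical wait:   {min(waits) * 1000:.0f} ms")
```

Read this way, the stall of roughly 203 seconds lines up with the single "3 min 22 sec" wait and is close to the 204 seconds for which LMON had no heartbeat, while the earlier waits are all around 80 ms.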

In that case, let's move on to the trace of the LMS0 process:

*** 2014-09-30 10:01:46.825
DRM(3) win(1) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(2) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(3) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(4) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(5) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(6) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(7) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
DRM(3) win(8) lms 0 finished replaying gcs resources
lms 0 finished fixing gcs write protocol
*** 2014-10-08 13:16:56.712
kjfdchklmon: LMON is not heartbeating for 204 sec (1412745209.1412745413.0).
kjfdchklmon: Dumping callstack of lmon
Submitting asynchronized dump request [20]
kjfdchklmon: LMON is not healthy. kill instance.
ksuitm: waiting up to [5] seconds before killing DIAG(17325)
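
The two large numbers in the kjfdchklmon line look like Unix epoch seconds. The sketch below checks that reading. It rests on two assumptions that the trace itself does not confirm: that the values really are epoch seconds, and that "CST" in the alert log means China Standard Time (UTC+8).

```python
import re
from datetime import datetime, timedelta, timezone

line = "kjfdchklmon: LMON is not heartbeating for 204 sec (1412745209.1412745413.0)."

# Assumption: the server clock is China Standard Time, UTC+8.
CST = timezone(timedelta(hours=8))

m = re.search(r"for (\d+) sec \((\d+)\.(\d+)\.", line)
gap, first, second = map(int, m.groups())

# If both values are epoch seconds, their difference should equal the gap.
print("difference matches reported gap:", second - first == gap)

last_beat = datetime.fromtimestamp(first, CST)
checked_at = datetime.fromtimestamp(second, CST)
print("first value :", last_beat.strftime("%Y-%m-%d %H:%M:%S"))
print("second value:", checked_at.strftime("%Y-%m-%d %H:%M:%S"))
```

Under these assumptions the two values differ by exactly 204 and map to 13:13:29 and 13:16:53 on 2014-10-08, a few seconds before the corresponding alert log entries, so the last LMON heartbeat would fall at about 13:13:29.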
