The database instance Crash because the CPU High ?
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
某系统的其中一个RAC节点的db实例被干掉并自动重启了,如下是该实例的alert log信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |
Wed Oct 08 12:59:51 CST 2014 Thread 1 advanced to log sequence 543 (LGWR switch) Current log# 1 seq# 543 mem# 0: +DGSYS/hisdb/onlinelog/group_1.261.856010381 Wed Oct 08 13:13:45 CST 2014 WARNING: inbound connection timed out (ORA-3136) Wed Oct 08 13:16:56 CST 2014 LMON (ospid: 17329) is not heartbeating for 204 seconds. LMON is not healthy and has no heartbeat. Please check LMD0/LMS0 and DIAG trace files for detail. Wed Oct 08 13:17:02 CST 2014 LMS0 (ospid: 17333) is terminating the instance. LMS0: terminating instance due to error 484 Wed Oct 08 13:17:05 CST 2014 System state dump is made for local instance System State dumped to trace file /opt/oracle/admin/hisdb/bdump/hisdb1_diag_17325.trc Wed Oct 08 13:17:06 CST 2014 Shutting down instance (abort) License high water mark = 1068 Wed Oct 08 13:17:07 CST 2014 Instance terminated by LMS0, pid = 17333 Wed Oct 08 13:17:11 CST 2014 Instance terminated by USER, pid = 15788 Wed Oct 08 13:17:17 CST 2014 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Interface type 1 eth1 10.0.0.0 configured from OCR for use as a cluster interconnect Interface type 1 eth0 10.3.36.192 configured from OCR for use as a public interface Picked latch-free SCN scheme 3 Wed Oct 08 13:17:27 CST 2014 Using LOG_ARCHIVE_DEST_1 parameter default value as /opt/oracle/product/10.2.0/db/dbs/arch LICENSE_MAX_USERS = 0 SYS auditing is disabled ksdpec: called for event 13740 prior to event group initialization Starting up ORACLE RDBMS Version: 10.2.0.5.0. System parameters with non-default values: 。。。。。。。 Wed Oct 08 13:17:39 CST 2014 ALTER DATABASE OPEN Picked broadcast on commit scheme to generate SCNs Wed Oct 08 13:17:39 CST 2014 SUCCESS: diskgroup MSDATA was mounted SUCCESS: diskgroup DGDATA was mounted Wed Oct 08 13:17:39 CST 2014 Thread 1 opened at log sequence 544 Current log# 2 seq# 544 mem# 0: +DGSYS/hisdb/onlinelog/group_2.262.856010381 Successful open of redo thread 1 Wed Oct 08 13:17:39 CST 2014 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Wed Oct 08 13:17:39 CST 2014 SMON: enabling cache recovery Wed Oct 08 13:17:40 CST 2014 Successfully onlined Undo Tablespace 1. Wed Oct 08 13:17:40 CST 2014 SMON: enabling tx recovery Wed Oct 08 13:17:40 CST 2014 Database Characterset is ZHS16GBK Opening with internal Resource Manager plan replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC QMNC started with pid=28, OS id=16432 Wed Oct 08 13:17:42 CST 2014 Completed: ALTER DATABASE OPEN |
我们可以看到,该实例在13:17:07 秒被LMS0进程强行终止掉了,接着该实例在13:17:17 被正常启动。
在该实例被强行终止之前,有一点关键信息是值得我们注意的,如下:
1 2 3 4 5 6 7 8 9 |
Wed Oct 08 13:13:45 CST 2014 WARNING: inbound connection timed out (ORA-3136) Wed Oct 08 13:16:56 CST 2014 LMON (ospid: 17329) is not heartbeating for 204 seconds. LMON is not healthy and has no heartbeat. Please check LMD0/LMS0 and DIAG trace files for detail. Wed Oct 08 13:17:02 CST 2014 LMS0 (ospid: 17333) is terminating the instance. LMS0: terminating instance due to error 484 |
从这部分信息,我们可以大致判断,在13:16:56时,Oracle已经发现LMON进程长时间没有检测到心跳了,这个时间长达204秒。
如果根据时间向前推进,在13:13:32时间点,实际上Lmon进程就开始出现异常了。 我们也可以看到在13:13:45时间点,出现了
一个ora-3136错误。一般来讲,这个waring跟系统的负载可能有极大的关系,例如资源使用极高,可能出现超时的情况。
从alert log信息来看,Oracle 让我们去查看LMD0/LMS0 以及diag的信息来进行进一步的分析。那么我们首先就来看一下LMD0进程的信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
*** 2014-10-08 12:47:22.077 Setting 3-way CR grants to 1 global-lru off? 0 *** 2014-10-08 13:16:58.621 KJM_HISTORY: RCVR STALL OP(12) context 5 elapsed 202974048 us KJM HIST LMD0: 12:5:202974048 7:6 6:3 10:4 2:6 1:2 11:2 15:2 12:80846 7:4 6:4 10:9 11:2 15:4 12:78700 7:8 6:1 10:8 11:2 15:2 12:79509 7:3 6:1 10:3 11:2 15:2 12:78710 7:2 6:2 10:4 11:3 15:1 12:79073 7:3 6:2 10:5 11:2 15:2 12:79846 7:6 6:1 10:14 11:2 15:5 12:78560 7:6 6:3 10:9 11:2 15:5 12:79708 7:5 6:1 10:6 11:0 15:1 12:79261 7:3 6:1 10:10 11:1 15:4 12:79432 7:4 ---------------------------------------- SO: 0x478e7ff10, type: 4, owner: 0x4798e0088, flag: INIT/-/-/0x00 (session) sid: 3301 trans: (nil), creator: 0x4798e0088, flag: (51) USR/- BSY/-/-/-/-/- DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: (nil) oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS last wait for 'ges remote message' wait_time=3 min 22 sec, seconds since wait started=258 waittime=40, loop=0, p3=0 blocking sess=0x(nil) seq=8230 Dumping Session Wait History for 'ges remote message' count=1 wait_time=3 min 22 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.080830 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.078682 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.079500 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.078702 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.079063 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.079834 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.078517 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.079689 sec waittime=40, loop=0, p3=0 for 'ges remote message' count=1 wait_time=0.079250 sec waittime=40, loop=0, p3=0 |
从LMD0进程的信息来看,可以发现该进程(lmd0)已经等待了258秒,等待事件为ges remote message,除此之外,没有其他的信息了。
既然这样,那我们继续来看下LMS0进程的信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
*** 2014-09-30 10:01:46.825 DRM(3) win(1) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol DRM(3) win(2) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol DRM(3) win(3) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol DRM(3) win(4) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol DRM(3) win(5) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol DRM(3) win(6) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol DRM(3) win(7) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol DRM(3) win(8) lms 0 finished replaying gcs resources lms 0 finished fixing gcs write protocol *** 2014-10-08 13:16:56.712 kjfdchklmon: LMON is not heartbeating for 204 sec (1412745209.1412745413.0). kjfdchklmon: Dumping callstack of lmon Submitting asynchronized dump request [20] kjfdchklmon: LMON is not healthy. kill instance. ksuitm: waiting up to [5] seconds before killing DIAG(17325) |
LMS0进程的信息似乎也没有太多的价值。最后我们只能看看diag的信息了,Oracle在将所有进程kill之前进行了一次dump。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 |
*** 2014-10-08 13:16:56.894 Process diagnostic dump for oracle@hisdbs1 (LMON), OS id=17329, pid: 5, proc_ser: 1, sid: 3302, sess_ser: 1 ------------------------------------------------------------------------------- loadavg : 62.17 35.29 18.12 Memory (Avail / Total) = 5976.90M / 64455.59M Swap (Avail / Total) = 21215.88M / 24003.36M F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD Short stack dump: ORA-00072: ?? "Unix process pid: 17329, image: oracle@hisdbs1 (LMON)" ??? ------------------------------------------------------------------------------- Process diagnostic dump actual duration=8.010000 sec (max dump time=30.000000 sec) *** 2014-10-08 13:17:04.908 ---------------------------------------- SO: 0x47e8f2a38, type: 2, owner: (nil), flag: INIT/-/-/0x00 (process) Oracle pid=5, calls cur/top: 0x4762511b8/0x4762511b8, flag: (6) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 91 last post received-location: kjfzack: post pmon for dead process cleanup last process to post me: 4798e0088 1 6 last post sent: 0 0 108 last post sent-location: kjmpost: post lmd last process posted by me: 4798e0088 1 6 (latch info) wait_event=0 bits=0 Location from where call was made: ksqgel: create enqueue: parent obj Context saved from call: 19162010040 waiting for 600098c8 enqueues level=5 Location from where latch is held: ksqgel: create enqueue: parent obj Context saved from call: 16790760304 state=busy, wlstate=free waiters [orapid (seconds since: put on list, posted, alive check)]: 5 (216, 1412745425, 6) ---LMON 774 (216, 1412745425, 216) 15 (216, 1412745425, 153) ---ckpt 813 (216, 1412745425, 216) 18 (216, 1412745425, 141) ---CJQ0 25 (216, 1412745425, 216) ---ASMB 886 (27, 1412745425, 27) 280 (15, 1412745425, 15) 783 (9, 1412745425, 9) 800 (9, 1412745425, 9) waiter count=10 gotten 100670517 times wait, failed first 9774 sleeps 746 gotten 0 times nowait, failed: 0 possible holder pid = 792 ospid=15095 on wait list for 600098c8 Process Group: DEFAULT, pseudo proc: 0x478a24a58 O/S info: user: oracle, term: UNKNOWN, ospid: 17329 (DEAD) OSD pid info: Unix process pid: 17329, image: oracle@hisdbs1 (LMON) PSO child state object changes : Dump of memory from 0x000000047E8930D8 to 0x000000047E8932E0 47E8930D0 00000008 00000000 [........] 。。。。。。 Repeat 23 times ---------------------------------------- SO: 0x477e7c9c0, type: 4, owner: 0x47e8f2a38, flag: INIT/-/-/0x00 (session) sid: 3302 trans: (nil), creator: 0x47e8f2a38, flag: (51) USR/- BSY/-/-/-/-/- DID: 0001-0005-00000006, short-term DID: 0000-0000-00000000 txn branch: (nil) oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS waiting for 'latch free' wait_time=0, seconds since wait started=210 address=600098c8, number=13, tries=2b blocking sess=0x(nil) seq=28098 Dumping Session Wait History for 'latch free' count=1 wait_time=4.885477 sec address=600098c8, number=13, tries=2a for 'latch free' count=1 wait_time=4.902795 sec address=600098c8, number=13, tries=29 for 'latch free' count=1 wait_time=4.884726 sec address=600098c8, number=13, tries=28 for 'latch free' count=1 wait_time=4.883317 sec address=600098c8, number=13, tries=27 for 'latch free' count=1 wait_time=4.887379 sec address=600098c8, number=13, tries=26 for 'latch free' count=1 wait_time=4.885649 sec address=600098c8, number=13, tries=25 for 'latch free' count=1 wait_time=4.886556 sec address=600098c8, number=13, tries=24 for 'latch free' count=1 wait_time=4.888166 sec address=600098c8, number=13, tries=23 for 'latch free' count=1 wait_time=4.885367 sec address=600098c8, number=13, tries=22 for 'latch free' count=1 wait_time=4.885849 sec address=600098c8, number=13, tries=21 。。。。。。 ------------process 0x0x474316de0-------------------- proc version : 0 Local node : 0 pid : 17329 lkp_node : 0 svr_mode : 0 proc state : KJP_NORMAL Last drm hb acked : 0 Total accesses : 2 Imm. accesses : 1 Locks on ASTQ : 0 Locks Pending AST : 0 Granted locks : 0 AST_Q: PENDING_Q: GRANTED_Q: Performing diagnostic data dump for this instance |
从前面的load来看,load信息如下:
loadavg : 62.17 35.29 18.12
这里简单解释一下,这3个loadavg值表示该主机分别在1分钟,5分钟,15分钟内的平均load值。 显然,这里我们不看后面2个指标了。
根据时间点,我们这里看第一个loadavg值是62.17. 这个值已经超出CPU的数倍了,由此我们不难判断在13:16:56向前推进1分钟的时间内
系统的load是极高的。
对于LMON进程,主要是监控RAC的GES信息,当然其作用不仅仅局限于此,其还负责检查集群中每个节点的健康情况,当有节点出现故障是,
负责进行reconfig以及GRD(global resource Directory)的恢复等等。我们知道AC的脑裂机制,如果IO fencing是Oracle本身来完成,也
就是说由CLusterware来完成。那么Lmon进程检查到实例级别出现脑裂时,会通知clusterware来进行脑裂操作,然而其并不会等待clusterware
的处理结果。当等待超过一定时间,那么LMON进程会自动触发IMR(instance membership recovery),这实际上也就是我们所说的
Instance membership reconfig。
对于报错中提到的LMON进程无法获得心跳了,那么LMON进程的心跳是什么呢? lmon进程主要通过2种心跳机制来判断集群节点的状态:
1) 网络心跳 (主要是通过ping进行检测)
2) 控制文件磁盘心跳,类似ckpt进程每3s更新一次controlfile的机制。
那么什么情况下,会导致LMON无法获得心跳呢?
一般来讲有可能是该进程无法获得latch或者系统的资源不足,导致LMON进程无法正常工作。而这里我们也确实发现系统的load极高。
从上面的进程详细信息来看,我们也确实发现该进程一直在等待一个latch,同时有其他的10个进程正在等待LMON进程,这其中的等待
就包括LMON,ASMB,ckpt进程等等。从diag的dump来看,提示lmon进程已经等待了216秒。
从process dump我们还可以看到该进程一直无法获得的latch是可能是被如下进程所持有:
possible holder pid = 792 ospid=15095
不过可惜的是,我搜索diag日志,居然找不到pid 792这个会话。 我们知道diag的trace中的process state dump信息是根据PID排序的。
我发现dump内容到process 69就结束了。从PID来看,该实例在出问题的时候,起码都800个process以上,dump的内容不可能只有这么一点。
从2节点的核心进程的日志也看不到有价值的信息。后面监控了一段时间发现该节点cpu几乎都是90%以上,很多时候甚至100%。
检查发现是CBC latch,而且都是一些报表的SQL。
对于这个case,我仍然感觉可能是资源异常导致LMON进程异常,最后Oracle 检测到该进程长时间没有心跳,最终终止了该实例。
基于这个假设,我打算在自己的VM Rac环境(Linux +10.2.0.5)测试一下,如果某个节点CPU使用率极高的情况下,是否会导致LMON进程异常。
不过遗憾的是,我并没有模拟出这种情况,如下是我的模拟过程:
利用shell 来模拟cpu的消耗:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
[root@rac1 ~]# cat kill_cpu.sh #! /bin/sh # filename killcpu.sh for i in `seq $1` do echo -ne " i=0; while true do i=i+1; done" | /bin/sh & pid_array[$i]=$! ; done for i in "${pid_array[@]}"; do echo 'kill ' $i ';'; done |
调用该脚本即可:
[root@rac1 ~]# ./kill_cpu.sh 1
kill 17449 ;
这样测试了10分钟,发现Rac毫无反应,于是我又模拟了latch: cache buffer chains. 以此来尽可能的将该节点的CPU消耗光。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 |
SQL> create table killdb(no int,object_name varchar2(50)); Table created. SQL> declare i int; 2 begin 3 for i in 1..5 loop 4 insert into killdb select rownum as no,object_name from dba_objects; 5 end loop; 6 end; 7 / PL/SQL procedure successfully completed. SQL> select count(1) from killdb; COUNT(1) ---------- 250645 SQL> commit; Commit complete. SQL> create or replace procedure full_scan is 2 i int; 3 icount int; 4 begin 5 for i in 1..1000 loop 6 select count(*) into icount from killdb; 7 end loop; 8 end; 9 / Procedure created. SQL> var job_no number; SQL> begin 2 for idx in 1..30 loop 3 dbms_job.submit(:job_no,'full_scan;'); 4 commit; 5 end loop ; 6 end; 7 / PL/SQL procedure successfully completed. SQL> select event,count(1) from v$session where wait_class#<>6 group by event; EVENT COUNT(1) ---------------------------------------------------------------- ---------- os thread startup 1 null event 1 latch: cache buffers chains 26 gc current request 1 SQL*Net message to client 1 |
如下是该节点的CPU使用情况:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
08:42:47 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 08:42:48 PM all 99.01 0.00 0.99 0.00 0.00 0.00 0.00 0.00 1005.94 08:42:49 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1044.44 ......省略部分内容 08:43:08 PM all 99.01 0.00 0.99 0.00 0.00 0.00 0.00 0.00 1003.96 08:43:09 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1038.38 08:43:10 PM all 99.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1019.00 08:43:11 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1032.00 08:43:12 PM all 99.01 0.00 0.99 0.00 0.00 0.00 0.00 0.00 1000.99 08:43:13 PM all 95.96 0.00 4.04 0.00 0.00 0.00 0.00 0.00 1202.02 08:43:14 PM all 93.07 0.00 5.94 0.00 0.00 0.99 0.00 0.00 1057.43 08:43:15 PM all 86.32 0.00 12.63 0.00 0.00 1.05 0.00 0.00 1080.00 08:43:16 PM all 96.97 0.00 3.03 0.00 0.00 0.00 0.00 0.00 1058.59 08:44:48 PM all 99.25 0.00 0.61 0.00 0.08 0.07 0.00 0.00 1026.50 08:46:16 PM all 99.24 0.00 0.61 0.00 0.05 0.10 0.00 0.00 1024.49 08:46:44 PM all 99.06 0.00 0.73 0.00 0.07 0.15 0.00 0.00 1042.57 08:46:45 PM all 91.43 0.00 7.62 0.00 0.00 0.95 0.00 0.00 1140.95 08:46:46 PM all 82.61 0.00 17.39 0.00 0.00 0.00 0.00 0.00 1042.39 08:46:47 PM all 57.14 0.00 42.86 0.00 0.00 0.00 0.00 0.00 1068.25 08:46:48 PM all 52.00 0.00 48.00 0.00 0.00 0.00 0.00 0.00 1092.00 08:46:49 PM all 84.34 0.00 14.46 0.00 0.00 1.20 0.00 0.00 1090.36 08:46:50 PM all 90.82 0.00 7.14 0.00 1.02 1.02 0.00 0.00 1137.76 08:46:51 PM all 89.69 0.00 10.31 0.00 0.00 0.00 0.00 0.00 1040.21 08:46:52 PM all 95.65 0.00 4.35 0.00 0.00 0.00 0.00 0.00 1052.17 08:46:53 PM all 99.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 1012.00 08:46:54 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1028.00 08:46:55 PM all 99.01 0.00 0.99 0.00 0.00 0.00 0.00 0.00 1007.92 08:46:56 PM all 98.92 0.00 1.08 0.00 0.00 0.00 0.00 0.00 1052.69 |
我们可以看到在08:43:16 到 08:46:44 之间,rac 节点1其实hang了很长时间,可惜这个数据库节点都没有挂。
我只能说Oracle 太强悍了,我檫。
One Response to “The database instance Crash because the CPU High ?”
根据时间点,我们这里看第一个loadavg值是62.17. 这个值已经超出CPU的数倍了,由此我们不难判断在13:16:56向前推进1分钟的时间内系统的load是极高的。
Hi,Roger。怎么由loadavg值是62.17, 这个值已经超出CPU的数倍了得出系统的load是极高的这个结论?不明白,请指教。
Leave a Reply
You must be logged in to post a comment.