love wife & love life —Roger的Oracle&MySQL技术博客

Phone:18180207355 提供专业Oracle&MySQL数据恢复、性能优化、迁移升级、紧急救援等服务

某客户的5TB RAC 恢复小记

本站文章除注明转载外,均为本站原创: 转载自love wife & love life —Roger的Oracle&MySQL技术博客

本文链接地址: 某客户的5TB RAC 恢复小记

某客户的核心数据库存储问题导致数据库重启后无法正常启动,根据客户反馈最开始在启动数据库时
报错控制文件IO错误,如下:

Sun Mar 15 11:59:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '/xxx/xxxx/control01.ctl'
ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
Sun Mar 15 11:59:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '/xxx/xxxx/control01.ctl'
ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
Sun Mar 15 11:59:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_arc1_630876.trc:
ORA-00204: error in reading (block 1, # blocks 1) of control file
ORA-00202: control file: '/xxx/xxxx/control01.ctl'
ORA-17500: ODM err:ODM ERROR V-41-4-2-43-6 No such device or address
Sun Mar 15 11:59:37 2015
Master background archival failure: 204
Sun Mar 15 11:59:49 2015
Termination issued to instance processes. Waiting for the processes to exit
Sun Mar 15 15:40:09 2015
Starting ORACLE instance (normal)

上述的问题本质上都跟控制文件有关系,替换掉损坏的控制文件就行。当替换掉控制文件之后,在open数据库时发现报如下错误:

Sun Mar 15 16:10:48 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_626734.trc:
ORA-00600: internal error code, arguments: [kcrfr_update_nab_2], [0x70000038F8C94E0], [2], [], [], [], [], []
Abort recovery for domain 0
Sun Mar 15 16:10:49 2015
Aborting crash recovery due to error 600
Sun Mar 15 16:10:49 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_626734.trc:
ORA-00600: internal error code, arguments: [kcrfr_update_nab_2], [0x70000038F8C94E0], [2], [], [], [], [], []
ORA-600 signalled during: ALTER DATABASE OPEN...
Sun Mar 15 16:10:49 2015
Trace dumping is performing id=[cdmp_20150315161049]
Sun Mar 15 16:12:35 2015
Shutting down instance: further logons disabled
Sun Mar 15 16:12:35 2015

该错误本质上是因为redo的问题,即有redo log损坏。通过在RMAN进行recover,发现报如下类似错误:

Sun Mar 15 16:47:59 2015
Beginning crash recovery of 2 threads
 parallel recovery setup failed: using serial mode
Sun Mar 15 16:47:59 2015
Started redo scan
Sun Mar 15 16:47:59 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00313: open failed for members of log group 5 of thread 2
ORA-00312: online log 5 thread 2: '/xxx/xxxx/redo05a.log'
ORA-17503: ksfdopn:4 Failed to open file /xxx/xxxx/redo05a.log
ORA-17500: ODM err:File does not exist
Sun Mar 15 16:47:59 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00313: open failed for members of log group 4 of thread 1
ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04a.log'
ORA-17503: ksfdopn:4 Failed to open file /xxx/xxxx/redo04a.log
ORA-17500: ODM err:File does not exist
Sun Mar 15 17:03:03 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2009344 change 14160745159583 time 03/15/2015 11:56:29
ORA-00334: archived log: '/xxx/xxxx/redo04b.log'
Sun Mar 15 17:03:03 2015
Abort recovery for domain 0
Sun Mar 15 17:03:03 2015
Aborting crash recovery due to error 354
Sun Mar 15 17:03:03 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_299470.trc:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 2009344 change 14160745159583 time 03/15/2015 11:56:29
ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04b.log'
ORA-354 signalled during: ALTER DATABASE OPEN...
Sun Mar 15 17:08:02 2015

上述过程大致是客户之前的处理过程。我在18点左右介入之后,进行了相关的操作。我最开始尝试在利用RMAN 进行恢复,发现报错:

RMAN> recover database;

Starting recover at 15-MAR-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=3268 instance=xxxx2 devtype=DISK

starting media recovery

media recovery failed
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 03/15/2015 18:39:26
ORA-00283: recovery session canceled due to errors
RMAN-11003: failure during parse/execution of SQL statement: alter database recover if needed
 start
ORA-00283: recovery session canceled due to errors
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 1788672 change 14160744248478 time 03/15/2015 11:54:46
ORA-00312: online log 4 thread 1: '/xxx/xxxx/redo04a.log'

从上面的错误来看,初步可以判断redo04a.log文件已经损坏,而且是block 1788672的问题。为了验证该block是否损坏,我通过类似如下的dump 命令进行dump,发现报错:

alter system dump logfile 'xxx' scn min xxxx scn max xxxx;

由此判断,该block损坏无疑。 由于客户的需求是尽可能快的将数据库拉起来,因此对应redo损坏的情况之下。
通常只能进程不完全恢复并强制打开,这里我使用了如下的参数:

*._allow_resetlogs_corruption=TRUE
*._allow_error_simulation=TRUE

在open resetlogs之前,我已经将redo备份,resetlogs打开时,发现数据库报错如下:

Sun Mar 15 19:43:36 2015
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Sun Mar 15 19:43:36 2015
SMON: enabling cache recovery
Sun Mar 15 19:43:37 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sun Mar 15 19:43:37 2015
ORA-01555 caused by SQL statement below (SQL ID: 5wc2915k44m38, Query Duration=0 sec, SCN: 0x0ce1.0e2d8971):
Sun Mar 15 19:43:37 2015
select user#,type# from user$ where name=:1
Sun Mar 15 19:43:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_667814.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 25 with name "_SYSSMU25$" too small
Error 704 happened during db open, shutting down database
USER: terminating instance due to error 704
Instance terminated by USER, pid = 667814
ORA-1092 signalled during: alter database open resetlogs...

从日志来看,大致判断可能是_SYSSMU25$ 回滚段的问题,因此尝试先通过如下隐含参数屏蔽回滚段:

_corrupted_rollback_segments=_SYSSMU25$
_offline_rollback_segments=_SYSSMU25$

屏蔽回滚段之后,尝试打开数据库,发现错误依旧,通过10046 trace跟踪,发现递归SQL在如下的block上执行失败:

PARSING IN CURSOR #3 len=43 dep=1 uid=0 oct=3 lid=0 tim=37951056727245 hv=1682066536 ad='8cb74a90'
select user#,type# from user$ where name=:1
END OF STMT
PARSE #3:c=0,e=372,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=37951056727243
BINDS #3:
kkscoacd
 Bind#0
 oacdty=01 mxl=32(03) mxlc=00 mal=00 scl=00 pre=00
 oacflg=18 fl2=0001 frm=01 csi=852 siz=32 off=0
 kxsbbbfp=1126d4b70  bln=32  avl=03  flg=05
 value="XDB"
EXEC #3:c=0,e=465,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=37951056727780
WAIT #3: nam='db file sequential read' ela= 1582 file#=1 block#=282 blocks=1 obj#=44 tim=37951056729421
WAIT #3: nam='db file sequential read' ela= 6642 file#=1 block#=91 blocks=1 obj#=22 tim=37951056736126

通过dump file 1 block 91,发现该block上第2个ITL确认存在一个活跃事务。原本计划直接bbed提交该事务,但是当我编译好bbed之后,查看发现该block为一个cluster block.

对于cluster block的事务修改,相对复杂一些,我的博客有文章描述,大家可以参考,这里不多说。考虑到生产库使用bbed有一定的风险,我并没有使用bbed。

接着使用undo_management参数启动数据库,然后强制open数据库,发现错误变成如下:

Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_778430.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
Sun Mar 15 20:50:52 2015
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Sun Mar 15 20:50:53 2015
Redo thread 1 internally disabled at seq 1 (CKPT)
Sun Mar 15 20:50:53 2015
ARC1: Archiving disabled thread 1 sequence 1
Sun Mar 15 20:50:54 2015
Trace dumping is performing id=[cdmp_20150315205054]
Sun Mar 15 20:50:54 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_778430.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
Sun Mar 15 20:50:54 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 778430
ORA-1092 signalled during: alter database open  resetlogs...

从错误来看,我们就可以知道,这应该是SCN的问题。如果要手工推进SCN,那么level应该待遇3297*4才行,由于这里的238091117/1024/1024/1024小于1,因此推进scn时,level=3297*4+2 就差不多了。 这里我再次进行了10046 trace,发现了如下信息:

WAIT #5: nam='db file sequential read' ela= 1021 file#=1 block#=400 blocks=1 obj#=0 tim=37953716489772
EXEC #5:c=0,e=2969,p=1,cr=2,cu=3,mis=1,r=1,dep=1,og=4,tim=37953716490098
STAT #5 id=1 cnt=1 pid=0 pos=1 obj=0 op='UPDATE  UNDO$ (cr=2 pr=1 pw=0 time=1542 us)'
STAT #5 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=2 pr=0 pw=0 time=11 us)'
WAIT #1: nam='row cache lock' ela= 71 cache id=3 mode=0 request=5 obj#=0 tim=37953716490369
WAIT #1: nam='db file sequential read' ela= 5783 file#=2 block#=25 blocks=1 obj#=0 tim=37953716496201
........
........
GLOBAL CACHE ELEMENT DUMP (address: 700000011f91498):
 id1: 0x19 id2: 0x20000 obj: US#2 block: (2/25)
 lock: SL rls: 0x0000 acq: 0x0000 latch: 3
 flags: 0x41 fair: 0 recovery: 0 fpin: 'ktuwh02: ktugus'
 bscn: 0x0.0 bctx: 0 write: 0 scan: 0x0 xflg: 0 xid: 0x0.0.0
 lcp: 0 lnk: [NULL] lch: [70000023bf4bc20,70000023bf4bc20]
 seq: 3 hist: 143:0 208 352
 LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT:
 flg: 0x00080000 state: READING mode: EXCL
 pin: 'ktuwh02: ktugus'
 addr: 70000023bf4bb10 obj: INVALID cls: UNDO HEAD bscn: 0xce1.e379b05  ---这里的bscn即scn值
 GCS SHADOW 700000011f91508,1 sq[70000035fb339f8,70000035fb339f8] resp[70000035fb339d0,0x19.20000] pkey 4294950914
 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
 master 1 owner 1 sid 0 remote[0,0] hist 0x106
 history 0x6.0x4.0x0.0x0.0x0.0x0. cflag 0x0 sender 0 flags 0x0 replay# 0
 disk: 0x0000.00000000 write request: 0x0000.00000000
 pi scn: 0x0000.00000000
msgseq 0x0 updseq 0x0 reqids[1,0,0] infop 0x0
 GCS RESOURCE 70000035fb339d0 hashq [70000038cbc6658,70000038cbc6658] name[0x19.20000] pkey 4294950914
 grant 700000011f91508 cvt 0 send 0,0 write 0,0@65535
 flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL
 disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0
 xid 0x0000.000.00000000 sid 0 pkwait 59s
 pkey 4294950914
 hv 0 [stat 0x0, 1->1, wm 32767, RMno 0, remxxx 0, dom 0]
 kjga st 0x4, step 0.0.0, cxxx 2, rmno 0, flags 0x0
 lb 0, hb 0, myb 6147, drmb 6147, apifrz 0
 GCS SHADOW 700000011f91508,1 sq[70000035fb339f8,70000035fb339f8] resp[70000035fb339d0,0x19.20000] pkey 4294950914
 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL
 master 1 owner 1 sid 0 remote[0,0] hist 0x106
 history 0x6.0x4.0x0.0x0.0x0.0x0. cflag 0x0 sender 0 flags 0x0 replay# 0
 disk: 0x0000.00000000 write request: 0x0000.00000000
 pi scn: 0x0000.00000000
msgseq 0x0 updseq 0x0 reqids[1,0,0] infop 0x0
kjbmbassert [0x19.20000]
*** 2015-03-15 20:54:54.385
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238525189], [3297], [238091117], [], [], []
Current SQL statement for this session:
alter database open resetlogs

bscn: 0xce1.e379b05 将该scn进行转换,我们可以发现:0xce1 为3297,e379b05为238525189. 与上述报错信息一致。同时我发现这里使用了第2号回滚段,如下:

id1: 0x19 id2: 0x20000 obj: US#2 block: (2/25)

因此,尝试继续使用隐含参数屏蔽这第2号回滚段,并尝试打开数据库,但是错误依旧。看来还是需要调整SCN才行,如下:

Sun Mar 15 21:23:20 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
Sun Mar 15 21:23:20 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
Sun Mar 15 21:23:21 2015
Errors in file /oracle/app1/oracle/admin/xxxx/udump/xxxx2_ora_774222.trc:
ORA-00600: internal error code, arguments: [kclchkblk_4], [3297], [238958669], [3297], [238091118], [], [], []
Sun Mar 15 21:23:21 2015
Error 600 happened during db open, shutting down database
USER: terminating instance due to error 600
Instance terminated by USER, pid = 774222

首先我尝试了在会话级别设置:

alter session set events '10015 trace name adjust_scn level 13190';

发现alter database open失败,尝试使用*._minimum_giga_scn参数,但是在启动的时候,提示说该参数不支持。从此判断,该环境可能是安装了比较新的PSU,Oracle将该参数废弃掉了,这么说前面的10015 event根本就没起作用。 无奈只能通过oradebug手工修改SCN来启动数据库了,如下:

SQL> startup mount pfile='/tmp/gb.ora';
ORACLE instance started.

Total System Global Area 1.5032E+10 bytes
Fixed Size                  2110096 bytes
Variable Size            5704256880 bytes
Database Buffers         9311354880 bytes
Redo Buffers               14663680 bytes
Database mounted.
SQL> oradebug setmypid
Statement processed.
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [7000000100122A8, 7000000100122D8) = 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 07000000 ...
SQL>
SQL> oradebug poke 0x7000000100122A8 4 3300
BEFORE: [7000000100122A8, 7000000100122AC) = 00000000
AFTER:  [7000000100122A8, 7000000100122AC) = 00000CE4
SQL> oradebug DUMPvar SGA kcsgscn_
kcslf kcsgscn_ [7000000100122A8, 7000000100122D8) = 00000CE4 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 07000000 ...
SQL>

修改SCN之后,顺利打开了数据库,但是数据库很快就crash掉,如下是日志信息:

Sun Mar 15 21:47:31 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
......
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:33 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
QMNC started with pid=32, OS id=520520
Sun Mar 15 21:47:35 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
Sun Mar 15 21:47:35 2015
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (44, 26) on object 47098.
Sun Mar 15 21:47:35 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [6006], [1], [], [], [], [], [], []
Sun Mar 15 21:47:36 2015
LOGSTDBY: Validating controlfile with logical metadata
Sun Mar 15 21:47:36 2015
LOGSTDBY: Validation complete
Sun Mar 15 21:47:36 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:36 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:36 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:37 2015
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (48, 25).
Sun Mar 15 21:47:37 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:39 2015
Completed: alter database open
Sun Mar 15 21:47:39 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:39 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:39 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:40 2015
ORACLE Instance xxxx2 (pid = 22) - Error 600 encountered while recovering transaction (65, 7).
Sun Mar 15 21:47:40 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:40 2015
Trace dumping is performing id=[cdmp_20150315214740]
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
Sun Mar 15 21:47:41 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_smon_774386.trc:
ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], []
Sun Mar 15 21:47:42 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j008_586068.trc:
ORA-12012: error on auto execute of job 526
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_SEG_xxx", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:42 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'xxx_ADMIN'
ORA-06512: at "SYS.xxx_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:42 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-12012: error on auto execute of job 524
ORA-01552: cannot use system rollback segment for non-system tablespace 'XXXX_ADMIN'
ORA-06512: at "SYS.XXXX_LOGINHISTORY", line 3
ORA-06512: at line 1
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j009_790778.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08102: index key not found, obj# 239, file 1, block 1674 (2)
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/xxx/xxxx/redo02a.log'
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_j002_475534.trc:
ORA-00600: internal error code, arguments: [], [], [], [], [], [], [], []
ORA-06512: at "xxxx.PKG_XXXXX", line 126
ORA-06512: at line 3
Sun Mar 15 21:47:43 2015
Errors in file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_pmon_565700.trc:
ORA-00474: SMON process terminated with error
Sun Mar 15 21:47:43 2015
PMON: terminating instance due to error 474
Sun Mar 15 21:47:47 2015
Dump system state for local instance only
System State dumped to trace file /oracle/app1/oracle/admin/xxxx/bdump/xxxx2_diag_377122.trc
Sun Mar 15 21:47:48 2015
Instance terminated by PMON, pid = 565700

从上述日志信息来看,主要出现了如下几个错误:

ORA-00600 [6006],ORA-00600 [4137],ORA-00600 [kdsgrp1]

对于前面2个错误,很明显是Oracle SMON进程在进行利用回滚段进行事务rollback时失败导致,如下:
ORACLE Instance xxxx2 (pid = 22) – Error 600 encountered while recovering transaction (44, 26) on object 47098.
ORACLE Instance xxxx2 (pid = 22) – Error 600 encountered while recovering transaction (48, 25).

因此,不难看出,数据库中还有部分的回滚段存在活跃事务。

对于ORA-00600 [kdsgrp1]错误,通常是出现在Index上,即Index数据和表的数据不一致,一般来说可以通过重建解决。

其次,对于后面的ORA-08102: index key not found, obj# 239, file 1, block 1674 (2) 错误,主要是job调用出现,因此
我们可以暂时屏蔽job的调度。

对于ORA-08102错误,我的博客几年前也写过相关的文章,本质上也是Index block中的相关键值不存在导致。

与其如此,最后我感觉将数据库的所有回滚段都屏蔽掉,并重建数据库undo 表空间,如下是获取回滚段的命令:

strings system01.dbf | grep _SYSSMU | cut -d $ -f 1 | sort -u

经过整理,发现该库存在大约2600个回滚段,我了个去,先不管这么多,重启实例后,重建undo表空间:

SQL> conn /as sysdba
Connected to an idle instance.
SQL> startup upgrade pfile='/tmp/gb2.ora';
ORACLE instance started.

Total System Global Area 1.5032E+10 bytes
Fixed Size                  2110096 bytes
Variable Size            5704256880 bytes
Database Buffers         9311354880 bytes
Redo Buffers               14663680 bytes
Database mounted.
Database opened.
SQL> create undo tablespace undotbs11 datafile '/xxx/xxxx/undotbs11_01.dbf' size 100m;

Tablespace created.

SQL> create undo tablespace undotbs22 datafile '/xxx/xxxx/undotbs22_01.dbf' size 100m;

Tablespace created.

SQL> drop tablespace undotbs1 xxxluding contents and datafiles;

Tablespace dropped.

SQL> drop tablespace undotbs2 xxxluding contents and datafiles;

Tablespace dropped.

最后重启数据库实例,让客户将关键核心的配置表导出,先进行业务恢复,如果需要数据,直接从库中抽取。
这里要补充一点,该库约为5TB多一点,虽然有备份,但是恢复时间太长,如果有个dataguard是多么的重要啊!

2 Responses to “某客户的5TB RAC 恢复小记”

  1. kelvin Says:

    报道,已经看了roger的文章

  2. ivanwu Says:

    Roger大师,我有一个问题,在做这些不可逆的操作,怎么备份,谢谢

Leave a Reply

You must be logged in to post a comment.