清明节加班恢复的一个11gR2 rac恢复案例

本站文章除注明转载外，均为本站原创： 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客

这是昨天节假如接到的某客户的紧急救援数据恢复案例。大致的情况是由于掉电导致数据库无法open。经过初步排查,确认数据库版本为Oracle 11.2.0.3(linux RAC),数据量比较小，

大约200G左右。整个恢复过程开始看上去很顺利，仅30分钟就顺利打开了数据库,后续发现其中确实有少坑，这里跟大家简单分享一下这个清明节加班的恢复case。

首先我们来看下数据库无法open所报的错误是什么？

Sun Apr 03 20:55:36 2016
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x0000.5edc85a7):
select ctime, mtime, stime from obj$ where obj# = :1
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 19990): terminating the instance due to error 704
Instance terminated by USER, pid = 19990
ORA-1092 signalled during: alter database open...
opiodr aborting process unknown ospid (19990) as a result of ORA-1092

Sun Apr 03 20:55:36 2016

SMON: enabling cache recovery

ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x0000.5edc85a7):

select ctime, mtime, stime from obj$ where obj# = :1

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:

ORA-00704: bootstrap process failure

ORA-00604: error occurred at recursive SQL level 1

ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc:

ORA-00704: bootstrap process failure

ORA-00604: error occurred at recursive SQL level 1

ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small

Error 704 happened during db open, shutting down database

USER (ospid: 19990): terminating the instance due to error 704

Instance terminated by USER, pid = 19990

ORA-1092 signalled during: alter database open...

opiodr aborting process unknown ospid (19990) as a result of ORA-1092

这个错误其实很常见，已经遇到很多次了,处理方式也不难;大致上有两种.
1、通过10046 trace定位到有问题的数据块，然后手工去屏蔽事务；

2、推进数据库SCN
这里我选择使用推进scn的方式来进行处理。

直接通过oradebug poke修改scn；第一次修改可能是增加的scn不够大；第一次报错一样；第二次报错改变了；变成我们更加熟悉的错误：

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc  (incident=2108431):
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_2108431/orcl1_ora_23188_i2108431.trc
Sun Apr 03 21:09:46 2016
Dumping diagnostic data in directory=[cdmp_20160403210946], requested by (instance=1, osid=23188), summary=[incident=2108431].
Sun Apr 03 21:09:46 2016
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:
ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 23188): terminating the instance due to error 600

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc (incident=2108431):

ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []

Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_2108431/orcl1_ora_23188_i2108431.trc

Sun Apr 03 21:09:46 2016

Dumping diagnostic data in directory=[cdmp_20160403210946], requested by (instance=1, osid=23188), summary=[incident=2108431].

Sun Apr 03 21:09:46 2016

Use ADRCI or Support Workbench to package the incident.

See Note 411.1 at My Oracle Support for error and packaging details.

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:

ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []

Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc:

ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], []

Error 600 happened during db open, shutting down database

USER (ospid: 23188): terminating the instance due to error 600

上述的这个错误处理方式其实也有2种，大致如下：
1、由于scn差距很小，因此直接适当推进scn即可。

2、bbed修改dba地址20971648 中的事务来绕过该错误。

很明显，这里我选择第1种方法更简单；这里我再次修改scn，稍微增加大一点即可；很顺利的打开了数据库。

SQL&gt; startup mount pfile='/tmp/pfile.ora';
ORACLE instance started.
Total System Global Area 2.0243E+10 bytes
Fixed Size		    2237088 bytes
Variable Size		 7449087328 bytes
Database Buffers	 1.2751E+10 bytes
Redo Buffers		   41189376 bytes
Database mounted.
SQL&gt; oradebug setmypid
Statement processed.
SQL&gt; alter system set job_queue_processes=0;
System altered.
SQL&gt; oradebug poke 0x060019598  4 0x832B8852
BEFORE: [060019598, 06001959C) = 00000000
AFTER:	[060019598, 06001959C) = 832B8852
SQL&gt; alter database open;
Database altered.
SQL&gt;

SQL> startup mount pfile='/tmp/pfile.ora';

ORACLE instance started.

Total System Global Area 2.0243E+10 bytes

Fixed Size 2237088 bytes

Variable Size 7449087328 bytes

Database Buffers 1.2751E+10 bytes

Redo Buffers 41189376 bytes

Database mounted.

SQL> oradebug setmypid

Statement processed.

SQL> alter system set job_queue_processes=0;

System altered.

SQL> oradebug poke 0x060019598 4 0x832B8852

BEFORE: [060019598, 06001959C) = 00000000

AFTER: [060019598, 06001959C) = 832B8852

SQL> alter database open;

Database altered.

SQL>

看上去整个恢复过程很简单，也就不到半小时就打开了数据库。可是当我检查数据库文件状态时，整个数据库一共有23个数据文件，其中有11个数据文件状态为missing，

这也就是说都无法识别到数据库文件。实际上此时数据库alert log中也在报如下的错误，告诉我们这部分数据文件无法识别：

Sun Apr 03 21:26:09 2016
minact-scn: Inst 1 is now the master inc#:2 mmon proc-id:24523 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
[24583] Successfully onlined Undo Tablespace 2.
Undo initialization finished serial:0 start:123081664 end:123083044 diff:1380 (13 seconds)
Dictionary check beginning
Tablespace 'NORMING_DATA' #10 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMING_TEMP' #11 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_TEMP' #12 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGTEST_DATA' #13 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_TEMP' #14 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'NORMINGLJ_DATA' #15 found in data dictionary,
but not in the controlfile. Adding to controlfile.
Tablespace 'TABLESPACE_XYZH' #16 found in data dictionary,
but not in the controlfile. Adding to controlfile.
File #13 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00013' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #14 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00014' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #15 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00015' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #16 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00016' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #17 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00017' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #18 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00018' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #19 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00019' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #20 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00020' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #21 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00021' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #22 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00022' in the controlfile.
This file can no longer be recovered so it must be dropped.
File #23 found in data dictionary but not in controlfile.
Creating OFFLINE file 'MISSING00023' in the controlfile.
This file can no longer be recovered so it must be dropped.
Dictionary check complete

Sun Apr 03 21:26:09 2016

minact-scn: Inst 1 is now the master inc#:2 mmon proc-id:24523 status:0x7

minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000

[24583] Successfully onlined Undo Tablespace 2.

Undo initialization finished serial:0 start:123081664 end:123083044 diff:1380 (13 seconds)

Dictionary check beginning

Tablespace 'NORMING_DATA' #10 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace 'NORMING_TEMP' #11 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace 'NORMINGTEST_TEMP' #12 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace 'NORMINGTEST_DATA' #13 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace 'NORMINGLJ_TEMP' #14 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace 'NORMINGLJ_DATA' #15 found in data dictionary,

but not in the controlfile. Adding to controlfile.

Tablespace 'TABLESPACE_XYZH' #16 found in data dictionary,

but not in the controlfile. Adding to controlfile.

File #13 found in data dictionary but not in controlfile.

Creating OFFLINE file 'MISSING00013' in the controlfile.