清明节加班恢复的一个11gR2 rac恢复案例
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 清明节加班恢复的一个11gR2 rac恢复案例
这是昨天节假如接到的某客户的紧急救援数据恢复案例。大致的情况是由于掉电导致数据库无法open。经过初步排查,确认数据库版本为Oracle 11.2.0.3(linux RAC),数据量比较小,
大约200G左右。整个恢复过程开始看上去很顺利,仅30分钟就顺利打开了数据库,后续发现其中确实有少坑,这里跟大家简单分享一下这个清明节加班的恢复case。
首先我们来看下数据库无法open所报的错误是什么?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
Sun Apr 03 20:55:36 2016 SMON: enabling cache recovery ORA-01555 caused by SQL statement below (SQL ID: 4krwuz0ctqxdt, SCN: 0x0000.5edc85a7): select ctime, mtime, stime from obj$ where obj# = :1 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc: ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00604: error occurred at recursive SQL level 1 ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_19990.trc: ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00604: error occurred at recursive SQL level 1 ORA-01555: snapshot too old: rollback segment number 29 with name "_SYSSMU29_3872709797$" too small Error 704 happened during db open, shutting down database USER (ospid: 19990): terminating the instance due to error 704 Instance terminated by USER, pid = 19990 ORA-1092 signalled during: alter database open... opiodr aborting process unknown ospid (19990) as a result of ORA-1092 |
这个错误其实很常见,已经遇到很多次了,处理方式也不难;大致上有两种.
1、通过10046 trace定位到有问题的数据块,然后手工去屏蔽事务;
2、推进数据库SCN
这里我选择使用推进scn的方式来进行处理。
直接通过oradebug poke修改scn;第一次修改可能是增加的scn不够大;第一次报错一样;第二次报错改变了;变成我们更加熟悉的错误:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc (incident=2108431): ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_2108431/orcl1_ora_23188_i2108431.trc Sun Apr 03 21:09:46 2016 Dumping diagnostic data in directory=[cdmp_20160403210946], requested by (instance=1, osid=23188), summary=[incident=2108431]. Sun Apr 03 21:09:46 2016 Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc: ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], [] Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_23188.trc: ORA-00600: internal error code, arguments: [2662], [0], [2200563965], [0], [2200568242], [20971648], [], [], [], [], [], [] Error 600 happened during db open, shutting down database USER (ospid: 23188): terminating the instance due to error 600 |
上述的这个错误处理方式其实也有2种,大致如下:
1、由于scn差距很小,因此直接适当推进scn即可。
2、bbed修改dba地址20971648 中的事务来绕过该错误。
很明显,这里我选择第1种方法更简单;这里我再次修改scn,稍微增加大一点即可;很顺利的打开了数据库。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
SQL> startup mount pfile='/tmp/pfile.ora'; ORACLE instance started. Total System Global Area 2.0243E+10 bytes Fixed Size 2237088 bytes Variable Size 7449087328 bytes Database Buffers 1.2751E+10 bytes Redo Buffers 41189376 bytes Database mounted. SQL> oradebug setmypid Statement processed. SQL> alter system set job_queue_processes=0; System altered. SQL> oradebug poke 0x060019598 4 0x832B8852 BEFORE: [060019598, 06001959C) = 00000000 AFTER: [060019598, 06001959C) = 832B8852 SQL> alter database open; Database altered. SQL> |
看上去整个恢复过程很简单,也就不到半小时就打开了数据库。可是当我检查数据库文件状态时,整个数据库一共有23个数据文件,其中有11个数据文件状态为missing,
这也就是说都无法识别到数据库文件。实际上此时数据库alert log中也在报如下的错误,告诉我们这部分数据文件无法识别:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 |
Sun Apr 03 21:26:09 2016 minact-scn: Inst 1 is now the master inc#:2 mmon proc-id:24523 status:0x7 minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000 [24583] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:123081664 end:123083044 diff:1380 (13 seconds) Dictionary check beginning Tablespace 'NORMING_DATA' #10 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace 'NORMING_TEMP' #11 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace 'NORMINGTEST_TEMP' #12 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace 'NORMINGTEST_DATA' #13 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace 'NORMINGLJ_TEMP' #14 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace 'NORMINGLJ_DATA' #15 found in data dictionary, but not in the controlfile. Adding to controlfile. Tablespace 'TABLESPACE_XYZH' #16 found in data dictionary, but not in the controlfile. Adding to controlfile. File #13 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00013' in the controlfile. This file can no longer be recovered so it must be dropped. File #14 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00014' in the controlfile. This file can no longer be recovered so it must be dropped. File #15 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00015' in the controlfile. This file can no longer be recovered so it must be dropped. File #16 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00016' in the controlfile. This file can no longer be recovered so it must be dropped. File #17 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00017' in the controlfile. This file can no longer be recovered so it must be dropped. File #18 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00018' in the controlfile. This file can no longer be recovered so it must be dropped. File #19 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00019' in the controlfile. This file can no longer be recovered so it must be dropped. File #20 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00020' in the controlfile. This file can no longer be recovered so it must be dropped. File #21 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00021' in the controlfile. This file can no longer be recovered so it must be dropped. File #22 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00022' in the controlfile. This file can no longer be recovered so it must be dropped. File #23 found in data dictionary but not in controlfile. Creating OFFLINE file 'MISSING00023' in the controlfile. This file can no longer be recovered so it must be dropped. Dictionary check complete |
由于此时数据库已经打开了,因此为产生了一个重建控制文件的脚本,发现脚本内容如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
STARTUP NOMOUNT CREATE CONTROLFILE REUSE DATABASE "ORCL" RESETLOGS ARCHIVELOG MAXLOGFILES 192 MAXLOGMEMBERS 3 MAXDATAFILES 1024 MAXINSTANCES 32 MAXLOGHISTORY 9344 LOGFILE GROUP 1 '+DATA/orcl/onlinelog/group_1.273.850670135' SIZE 50M BLOCKSIZE 512, GROUP 2 '+DATA/orcl/onlinelog/group_2.274.850670135' SIZE 50M BLOCKSIZE 512 -- STANDBY LOGFILE DATAFILE '+DATA/orcl/datafile/system.268.850670033', '+DATA/orcl/datafile/sysaux.269.850670033', '+DATA/orcl/datafile/undotbs1.270.850670033', '+DATA/orcl/datafile/users.271.850670033', '+DATA/orcl/datafile/undotbs2.276.850670237', '+DATA/orcl/datafile/datacenter', '+DATA/orcl/datafile/partner_platform', '+DATA/orcl/datafile/sw_portal', '+DATA/orcl/datafile/system.dbf', '+DATA/orcl/datafile/system_02.dbf', '+DATA/orcl/datafile/user_02.dbf', '+DATA/orcl/datafile/user_03.dbf', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00013', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00014', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00015', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00016', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00017', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00018', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00019', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00020', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00021', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00022', '/u01/app/oracle/product/11.2.0/db_1/dbs/MISSING00023' CHARACTER SET ZHS16GBK; |
实际上我问客户,他们的反馈是之前由于控制文件损坏,客户也重建了控制文件,进行了多次恢复,而且也进行了resetlogs操作。
从上面的信息来看,不难看出客户重建控制文件的时候漏掉了11个数据文件。由于这部分文件的信息在数据字典中存在,因此在open的时候Oracle 会自动进行offline drop。
或许有人要说,直接找到文件然后重建控制文件不就行了吗?确实如此,然而实际上这里却并没有这么简单。
我进入到asm磁盘组检查文件发现有几个文件名称很奇怪,例如user_02.dbf 实际上link到了system,类似这样的情况。
这种情况下极容易出错。争取的做法查询dba_data_files进行数据文件的挨个确认。
确认好asm磁盘组漏掉的4个文件之后,还有7个文件位于文件系统中。全部添加到脚本中进行创建时发现这些文件和之前到文件到resetlogs已经完全不同了。
其实创建控制文件会报错ora-01189。
因此这里还必须手工去修改这11个数据文件头的resetlogs信息;等我将resetlogs信息全部修改完毕后,可以顺利创建控制文件。
但是当我进行reconver时却发现需要之前等archivelog,进一步检查发现归档日志都全部被删掉了。
因此最后还必须的再次修改这部分数据文件的checkpoint信息,将其改成与其他正常的文件一致,最后可以顺利打开数据库,
且检查所有的数据库文件状态均为online状态,如下所示:
最后再将文件系统的文件迁移到asm磁盘组,然后添加redo信息,启动rac节点2.
Leave a Reply
You must be logged in to post a comment.