I ran into a case where the database instance on the first node of a RAC stopped abnormally because its ASM instance failed. This is a record of the incident.

1. Symptom

On a two-node RAC, the database instance on node 1 had stopped. On inspection, the ASM instance on that node had also terminated abnormally.

2. Analysis

Check the alert logs of the database instance and the ASM instance for leads.

1) Database alert log

    Sun May 8 06:59:06 2011
    Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_asmb_21478.trc:
    ORA-15064: communication failure with ASM instance
    ORA-03113: end-of-file on communication channel
    Sun May 8 06:59:06 2011
    ASMB: terminating instance due to error 15064
    Sun May 8 06:59:06 2011
    Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_lms1_21275.trc:
    ORA-15064: communication failure with ASM instance
    Sun May 8 06:59:06 2011
    Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_lgwr_21283.trc:
    ORA-15064: communication failure with ASM instance
    Sun May 8 06:59:06 2011
    Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_lms0_21271.trc:
    ORA-15064: communication failure with ASM instance
    Sun May 8 06:59:06 2011
    Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_lmon_21267.trc:
    ORA-15064: communication failure with ASM instance
    Sun May 8 06:59:06 2011
    Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_lmd0_21269.trc:
    ORA-15064: communication failure with ASM instance
    Sun May 8 06:59:06 2011
    System state dump is made for local instance
    System State dumped to trace file /oracle/app/oracle/admin/racdb/bdump/racdb1_diag_21263.trc
    Sun May 8 06:59:06 2011
    Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_mman_21279.trc:
    ORA-15064: communication failure with ASM instance
    Sun May 8 06:59:07 2011
    Shutting down instance (abort)
    License high water mark = 7
    Sun May 8 06:59:07 2011
    Trace dumping is performing id=[cdmp_20110508065906]
    Sun May 8 06:59:11 2011
    Instance terminated by ASMB, pid = 21478
    Sun May 8 06:59:12 2011
    Instance terminated by USER, pid = 4110
    Mon May 9 13:44:05 2011

2) Failure details captured from the trace file

    kjctseventdump-end tail 14 heads 0 @ 0 14 @ -1115894656
    DEFER MSG QUEUE ON LMS1 IS EMPTY
    SEQUENCES: 0:0.0 1:2933.0
    error 15064 detected in background process
    ORA-15064: communication failure with ASM instance

3) ASM alert log

    Thu Feb 10 19:17:58 2011
    NOTE: cache recovered group 1 to fcn 0.20162635
    Thu Feb 10 19:17:58 2011
    NOTE: opening chunk 1 at fcn 0.20162635 ABA
    NOTE: seq=79 blk=1597
    Thu Feb 10 19:17:58 2011
    NOTE: cache mounting group 1/0xBA97DAE1 (ORADATA) succeeded
    SUCCESS: diskgroup ORADATA was mounted
    Thu Feb 10 19:18:01 2011
    NOTE: recovering COD for group 1/0xba97dae1 (ORADATA)
    SUCCESS: completed COD recovery for group 1/0xba97dae1 (ORADATA)
    Thu Feb 10 19:18:01 2011
    Starting background process ASMB
    ASMB started with pid=17, OS id=7767
    Thu Feb 10 19:21:06 2011
    NOTE: ASMB process exiting due to lack of ASM file activity
    Sun May 8 06:48:33 2011
    Shutting down instance (abort)
    License high water mark = 6
    Instance terminated by USER, pid = 20819

My preliminary conclusion is that an ASM problem caused this failure: the ASM instance was aborted at 06:48:33, and the database instance reported ORA-15064 and was terminated by ASMB at 06:59:06. The message "NOTE: ASMB process exiting due to lack of ASM file activity" is unrelated. It is purely informational and appears many times elsewhere in the ASM log.

3. First recovery attempt

1) The initial attempt to start ASM did not succeed.

2) Starting the ASM instance manually from SQL*Plus did succeed.

    racdb1@racdb1 /home/oracle$ export ORACLE_SID=+ASM1
    +ASM1@racdb1 /home/oracle$ sqlplus / as sysdba

    SQL*Plus: Release 10.2.0.3.0 - Production on Sun May 8 13:43:06 2011
    Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

    Connected to:
    Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
    With the Partitioning, Real Application Clusters and Data Mining options

    NotConnected@> shutdown immediate;
    ASM diskgroups dismounted
    ASM instance shutdown
    NotConnected@> startup;
    ASM instance started
    Total System Global Area 130023424 bytes
    Fixed Size 2071000 bytes
    Variable Size 102786600 bytes
    ASM Cache 25165824 bytes

3) Starting the database instance, however, raised ORA-01105 and ORA-38767.

    racdb1@racdb1 /home/oracle$ sqlplus / as sysdba

    SQL*Plus: Release 10.2.0.3.0 - Production on Sun May 8 13:43:53 2011
    Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

    Connected to an idle instance.

    NotConnected@> startup;
    ORACLE instance started.
    Total System Global Area 8388608000 bytes
    Fixed Size 2086096 bytes
    Variable Size 1644170032 bytes
    Database Buffers 6727663616 bytes
    Redo Buffers 14688256 bytes
    ORA-01105: mount is incompatible with mounts by other instances
    ORA-38767: flashback retention target parameter mismatch

4. Second recovery attempt

I restarted the CRS resources other than the VIP. The ASM instance and the database instance still could not be started.

5. Final fix

As a last resort I restarted all CRS resources on the first node. This time every resource on node 1 of the RAC came up.

6. Summary

After a series of recovery attempts, the RAC database was eventually restored.

Good luck.

secooler
11.05.08

-- The End --
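The analysis above comes down to lining up timestamps from two alert logs: when the ASM instance was aborted and when the database instance first reported ORA-15064. Below is a minimal Python sketch of that comparison. It assumes the 10g-style alert log layout seen in the excerpts, where a timestamp line with English day and month names is followed by one or more message lines. The helper names `parse_alert_log` and `first_match` are hypothetical, and the sample strings are short excerpts of the logs quoted in this post.

```python
import re
from datetime import datetime

# A timestamp line such as "Sun May 8 06:59:06 2011" (assumed alert log format).
TIMESTAMP = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$"
)


def parse_alert_log(text):
    """Return (timestamp, message) pairs.

    Each message line is tagged with the most recent timestamp line above it.
    Message lines that appear before any timestamp are skipped.
    """
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TIMESTAMP.match(line):
            # Collapse repeated spaces (e.g. "May  8") before parsing.
            normalized = " ".join(line.split())
            current = datetime.strptime(normalized, "%a %b %d %H:%M:%S %Y")
        elif current is not None:
            entries.append((current, line))
    return entries


def first_match(entries, needle):
    """Return the timestamp of the first message containing needle, or None."""
    for timestamp, message in entries:
        if needle in message:
            return timestamp
    return None


# Short excerpts of the two logs quoted in this post.
ASM_LOG = """\
Thu Feb 10 19:21:06 2011
NOTE: ASMB process exiting due to lack of ASM file activity
Sun May 8 06:48:33 2011
Shutting down instance (abort)
License high water mark = 6
Instance terminated by USER, pid = 20819
"""

DB_LOG = """\
Sun May 8 06:59:06 2011
Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_asmb_21478.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Sun May 8 06:59:06 2011
ASMB: terminating instance due to error 15064
"""

asm_abort = first_match(parse_alert_log(ASM_LOG), "Shutting down instance (abort)")
db_error = first_match(parse_alert_log(DB_LOG), "ORA-15064")
gap_seconds = int((db_error - asm_abort).total_seconds())

print("ASM abort:       ", asm_abort)
print("First ORA-15064: ", db_error)
print("Gap in seconds:  ", gap_seconds)
```

On these excerpts the ASM abort precedes the first ORA-15064 by 633 seconds, which fits the reading that the ASM instance went down first. On a real system the same functions could be fed the full alert log files instead of the inline samples.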
This article is reposted from einyboy's blog on cnblogs (博客园). Original link: http://www.cnblogs.com/einyboy/archive/2012/08/23/2651960.html. To repost it, please contact the original author.