帮朋友check一次中国银联数据库的坏块问题(并不是严格意义上的坏块),大致情况为smon需要trascation recover —>特定的block 发现无法读取 —>transaction recover出现问题 —>smon terminated
Wed Aug 15 19:02:57 2012
SMON: Parallel transaction recovery slave got internal error
SMON: Downgrading transaction recovery to serial
Wed Aug 15 19:02:57 2012
Errors in file /oracle/admin/settora/bdump/settora_smon_4755904.trc:
ORA-00600: internal error code, arguments: [kdourp_inorder2], [43], [43], [7], [44], [], [], []
Doing block recovery for file 10 block 176769
Doing block recovery for file 26 block 3259716
Block recovery from logseq 108030, block 49455 to scn 41320651147
Wed Aug 15 19:02:58 2012
Recovery of Online Redo Log: Thread 1 Group 5 Seq 108030 Reading mem 0
Mem# 0: /orasvr/settora/redo05.log
Block recovery completed at rba 108030.49555.16, scn 9.2665945485
ORACLE Instance settora (pid = 8) – Error 600 encountered while recovering transaction (7, 19) on object 331910.
过一段时间数据库就会down掉,从日志信息里面可以看到:
Wed Aug 15 19:23:18 2012
SMON: Parallel transaction recovery slave got internal error
SMON: Downgrading transaction recovery to serial
Wed Aug 15 19:23:18 2012
Errors in file /oracle/admin/settora/bdump/settora_smon_4755904.trc:
ORA-00600: internal error code, arguments: [kdourp_inorder2], [43], [43], [7], [44], [], [], []
Wed Aug 15 19:23:20 2012
Errors in file /oracle/admin/settora/bdump/settora_pmon_4690184.trc:
ORA-00474: SMON process terminated with error
Wed Aug 15 19:23:20 2012
PMON: terminating instance due to error 474
Instance terminated by PMON, pid = 4690184
执行下面SQL
select /*+full(a)*/ from PRO_SETTLE3.SHOPACCOUNT a;
returns:ORA-00600: internal error code, arguments: [kdourp_inorder2], [43], [43], [7], [44], [], [], []
使用dbv工具来检测:
$ dbv file=/orasvr1/settle5/settle000.dbf blocksize=8192
DBVERIFY: Release 10.2.0.4.0 - Production on Thu Aug 16 09:40:49 2012
Copyright (c) 1982, 2007, Oracle. All rights reserved.
DBVERIFY - Verification starting : FILE = /orasvr1/settle5/settle000.dbf
kdrchk: row is marked as both Last and Next continue
prow=0x1101aec98 flag=0x07
Block Checking: DBA = 42119809, Block Type = KTB-managed data block
data header at 0x1101ae07c
kdbchk: bad row tab 0, slot 51
Page 176769 failed with check code 6253
DBVERIFY - Verification complete
Total Pages Examined : 386688
Total Pages Processed (Data) : 282817
Total Pages Failing (Data) : 1
Total Pages Processed (Index): 98760
Total Pages Failing (Index): 0
Total Pages Processed (Other): 2707
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 2404
Total Pages Marked Corrupt : 0
Total Pages Influx : 0
Highest block SCN : 2666197038 (9.2666197038)
Total Pages Failing (Data) : 1
Page 176769 被标识为corruption
Total Pages Examined : 512 Number of blocks looked at Total Pages Processed (Data) : 1 Number of TABLE blocks seen Total Pages Failing (Data) : 1 Number of TABLE blocks with internal inconsistencies Total Pages Processed (Index): 0 Number of INDEX blocks seen Total Pages Failing (Index): 0 Number of INDEX block with internal inconsistencies Total Pages Empty : 507 Number of unused blocks seen Total Pages Marked Corrupt : 2 Number of blocks with corrupt cache wrappers Total Pages Influx : 0 Number of pages we re-read as the page looked like it was being modified when it was first read.
SQL> select file#,block#,blocks from v$database_block_corruption; no rows selected
在数据库中并没有关于坏块的记录存在。这并不是一个普通意义上的坏块。对于这种块使用corrupt skip是没有用的skip_corrupt_blocks只能使oracle跳过Oracle能够读出的块,而如果在操作系统层read调用就失败的,则不能跳过该块。
oracle 在read page 176769时发生了error 6253 (cannot read arguments from address file,其实这并不是真正的root cause,我在和huangyong讨论过后基本认定为一个内部的inconsistent block) oracle 不能 mark 为 corrupt block.
这里的思路为使用rowid来跳过这些有问题的block 大致方法如下
SQL> SELECT dbms_rowid.rowid_create(1,331910,10,176769,0) LOW_RID from DUAL;
LOW_RID
——————
AABRCGAAKAAArKBAAA
SQL> SELECT dbms_rowid.rowid_create(1,331910,10,176770,0) HI_RID from dual;
HI_RID
——————
AABRCGAAKAAArKCAAA
SQL> SELECT dbms_rowid.rowid_create(1,331910,26,3259716,0) LOW_RID from DUAL;
LOW_RID
——————
AABRCGAAaAAMb1EAAA
SQL> SELECT dbms_rowid.rowid_create(1,331910,26,3259717,0) HI_RID from DUAL;
HI_RID
——————
AABRCGAAaAAMb1FAAA
SQL> create table PRO_SETTLE3.SHOPACCOUNT_1 as select /*+ rowid(a) */ * from PRO_SETTLE3.SHOPACCOUNT a where rowid < 'AABRCGAAKAAArKBAAA'; Table created. SQL> insert into PRO_SETTLE3.SHOPACCOUNT_1 select /*+ rowid(a) */ * from PRO_SETTLE3.SHOPACCOUNT a where rowid >= ‘AABRCGAAKAAArKCAAA’ and rowid < 'AABRCGAAaAAMb1EAAA'; 49626 rows created. SQL> insert into PRO_SETTLE3.SHOPACCOUNT_1 select /*+ rowid(a) */ * from PRO_SETTLE3.SHOPACCOUNT a where rowid >= ‘AABRCGAAaAAMb1FAAA’;
208837 rows created.
SQL> select count(*) from PRO_SETTLE3.SHOPACCOUNT_1;
COUNT(*)
———-
262256
之后drop掉这张有问题的table restart database 恢复正常.
同样我们可以采用Kerry Osborne的脚本来实现.这里稍微修改了下 增加了选择tablespace 的功能temp:
eg:
[oracle@testdb ~]$ dd if=/dev/zero of=/data/oracle/oradata/wuxuan1/liu.dbf bs=8192 seek=1798 count=1 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 5.2e-05 seconds, 158 MB/s
[oracle@testdb ~]$ sqlplus liu/liu
SQL*Plus: Release 11.2.0.3.0 Production on Fri Aug 17 02:32:20 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Error accessing PRODUCT_USER_PROFILE
Warning: Product user profile information not loaded!
You may need to run PUPBLD.SQL as SYSTEM
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> select count(*) from temp;
select count(*) from temp
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'
SQL> @temp
WARNING: This script may issue a DROP TABLE command. Do not execute it unless you have read through it
and are comfortable you know what it does.
Ready? (hit ctl-C to quit)
Enter value for owner_name: LIU
old 8: v_owner_name varchar2(30) := upper('&owner_name');
new 8: v_owner_name varchar2(30) := upper('LIU');
Enter value for table_name: TEMP
old 9: v_table_name varchar2(30) := upper('&table_name');
new 9: v_table_name varchar2(30) := upper('TEMP');
Enter value for tablespace: LIU
old 11: v_tablespace varchar2(30) := upper('&tablespace');
new 11: v_tablespace varchar2(30) := upper('LIU');
Saved 13247 records in TEMP_SAVED.
201 bads records in TEMP_BAD.
PL/SQL procedure successfully completed.
SQL>
SQL> select count(*) from temp_saved;
COUNT(*)
----------
13247
SQL> select * from v$database_block_corruption;
FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO
---------- ---------- ---------- ------------------ ---------
4 258 1 0 ALL ZERO
4 1798 1 0 ALL ZERO
block 1798 已经被标识为media corruption
手动构造rowid
SQL> select * from temp_bad where rownum<5;
OLD_ROWID OLD_FILE OLD_OBJECT OLD_BLOCK OLD_ROW
------------------------------ ---------- ---------- ---------- ----------
ERROR_MESSAGE
--------------------------------------------------------------------------------
4.1798.181 13646 4 1798 181
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'
4.1798.182 13646 4 1798 182
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'
4.1798.183 13646 4 1798 183
OLD_ROWID OLD_FILE OLD_OBJECT OLD_BLOCK OLD_ROW
------------------------------ ---------- ---------- ---------- ----------
ERROR_MESSAGE
--------------------------------------------------------------------------------
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'
4.1798.184 13646 4 1798 184
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'
SQL> select dbms_rowid.rowid_create(1,13646,4,1798,181) from dual;
DBMS_ROWID.ROWID_C
------------------
AAADVOAAEAAAAcGAC1
SQL> select dbms_rowid.rowid_create(1,13646,4,1798,182) from dual;
DBMS_ROWID.ROWID_C
------------------
AAADVOAAEAAAAcGAC2
SQL> select * from temp where rowid ='AAADVOAAEAAAAcGAC1';
select * from temp where rowid ='AAADVOAAEAAAAcGAC1'
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'
SQL> select * from temp where rowid ='AAADVOAAEAAAAcGAC2';
select * from temp where rowid ='AAADVOAAEAAAAcGAC2'
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 4, block # 1798)
ORA-01110: data file 4: '/data/oracle/oradata/wuxuan1/liu.dbf'
Reference:
Extracting Data from a Corrupt Table using ROWID Range Scans in Oracle8 and higher [ID 61685.1]
Extracting Data from a Corrupt Table using DBMS_REPAIR or Event 10231 [ID 33405.1]
Extract rows from a CORRUPT table creating ROWID from DBA_EXTENTS [ID 422547.1]
就一个逻辑坏块啊,bbed估计就搞定了,基于rowid,block内的数据就丢了。
不是逻辑坏块 bbed估计够呛