
The gc freelist problem reappears

Early this morning a customer reported in our WeChat group that one of their systems had hung for a while at the start of the workday and then recovered on its own. The ASH data the customer provided makes the problem fairly clear (a sketch of the kind of ASH query that produces this sort of breakdown follows the table):

Slot Time (Duration) Slot Count Event Event Count % Event
09:26:01 (2.0 min) 56,948 gc freelist 42,070 9.28
read by other session 9,985 2.20
enq: HW – contention 3,332 0.73
09:28:00 (2.0 min) 59,045 gc freelist 43,374 9.56
read by other session 10,188 2.25
enq: HW – contention 3,773 0.83
09:30:00 (2.0 min) 59,432 gc freelist 43,156 9.52
read by other session 9,253 2.04
enq: HW – contention 5,238 1.16
09:32:00 (2.0 min) 59,812 gc freelist 43,037 9.49
read by other session 9,296 2.05
enq: HW – contention 5,449 1.20
09:34:00 (2.0 min) 60,775 gc freelist 43,675 9.63
read by other session 9,619 2.12
enq: HW – contention 5,428 1.20
09:36:00 (2.0 min) 63,090 gc freelist 44,403 9.79
read by other session 11,003 2.43
enq: HW – contention 5,520 1.22
09:38:00 (2.0 min) 63,576 gc freelist 44,636 9.84
read by other session 11,286 2.49
enq: HW – contention 5,474 1.21
09:40:00 (1.0 min) 30,798 gc freelist 20,947 4.62
read by other session 5,530 1.22
enq: HW – contention 2,688 0.59
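
As promised above, here is a minimal sketch of an ASH query that produces a similar per-event count, assuming you have access to GV$ACTIVE_SESSION_HISTORY. The 60-minute look-back and the one-minute slots are illustrative; the report above uses two-minute slots over the actual incident window.

-- Minimal sketch: ASH samples per one-minute slot and wait event
SELECT TRUNC(sample_time, 'MI') AS slot_time,
       event,
       COUNT(*)                 AS event_count
FROM   gv$active_session_history
WHERE  sample_time > SYSTIMESTAMP - INTERVAL '60' MINUTE
AND    event IS NOT NULL
GROUP  BY TRUNC(sample_time, 'MI'), event
ORDER  BY slot_time, event_count DESC;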

I have written about the gc freelist wait event before. It is actually quite a rare event; if you find it among your top events, you have most likely hit a bug.

Back to this case. From the ASH data the waits boil down to gc freelist, enq: HW – contention and read by other session. My instinct was that the real problem most likely lay with gc freelist, which slowed the whole cluster down to a near-hung state. And of course, once one node is flooded with gc freelist waits and the whole system crawls, the other node's ASH is bound to show heavy lock waits as well:

Slot Time (Duration) Slot Count Event Event Count % Event
09:22:36 (1.4 min) 6,506 enq: HW – contention 2,859 3.02
enq: TX – row lock contention 1,380 1.46
gc buffer busy acquire 1,128 1.19
09:24:00 (2.0 min) 10,361 enq: HW – contention 4,266 4.51
enq: TX – index contention 2,679 2.83
enq: TX – row lock contention 2,452 2.59
09:26:00 (2.0 min) 11,352 enq: HW – contention 4,550 4.81
enq: TX – index contention 3,030 3.20
enq: TX – row lock contention 2,802 2.96
09:28:00 (2.0 min) 12,415 enq: HW – contention 5,000 5.29
enq: TX – index contention 3,411 3.61
enq: TX – row lock contention 3,078 3.25
09:30:00 (2.0 min) 13,300 enq: HW – contention 5,271 5.57
enq: TX – index contention 3,184 3.37
enq: TX – row lock contention 3,079 3.25
09:32:00 (2.0 min) 13,638 enq: HW – contention 5,423 5.73
enq: TX – row lock contention 3,296 3.48
gc buffer busy acquire 3,216 3.40
09:34:00 (2.0 min) 14,283 enq: HW – contention 5,647 5.97
enq: TX – row lock contention 3,678 3.89
gc buffer busy acquire 3,307 3.50
09:36:00 (1.7 min) 12,751 enq: HW – contention 5,083 5.37
enq: TX – row lock contention 3,327 3.52
gc buffer busy acquire 3,108 3.29

So at this point ASH by itself does not tell us much more. The problem appears to be mainly on node 2, so let's look at a diag trace from node 2, covering the incident window, that the customer provided:
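
As an aside, if you ever need to capture a similar hang analysis yourself while the problem is still ongoing, a common approach on RAC is a cluster-wide hanganalyze from SQL*Plus as SYSDBA; the level and the repetition below are conventional choices rather than anything specific to this case, and the output lands in the diagnostic trace directories of each instance.

oradebug setmypid
oradebug -g all hanganalyze 3
REM wait one to two minutes, then take a second snapshot so the wait chains can be compared
oradebug -g all hanganalyze 3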

The trace clearly shows that a large number of sessions were blocked by one particular session, and that this session was itself waiting on gc freelist and had already been waiting for more than ten minutes.
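
For context, here is a minimal sketch of how you might list the victims and the suspected blocker on the live system; the :blocker_sid and :blocker_inst binds are assumptions, to be replaced with the session identified in the diag trace.

-- Sessions currently blocked by the suspected blocker
SELECT inst_id, sid, serial#, event, seconds_in_wait, sql_id
FROM   gv$session
WHERE  blocking_session  = :blocker_sid
AND    blocking_instance = :blocker_inst;

-- What the suspected blocker itself is waiting on, and for how long
SELECT inst_id, sid, event, seconds_in_wait, sql_id
FROM   gv$session
WHERE  sid     = :blocker_sid
AND    inst_id = :blocker_inst;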

Digging further into that session's trace shows what it was executing, and it turns out to be related to auditing:
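
Two quick checks can confirm the auditing angle; the :sql_id bind is an assumption, taken from the trace or from the GV$SESSION query above.

-- What the blocking session was actually executing
SELECT sql_id, sql_text
FROM   gv$sql
WHERE  sql_id = :sql_id;

-- Which tablespace AUD$ currently lives in (the MOS notes cited below tie
-- the problem to audit inserts into an ASSM tablespace)
SELECT owner, table_name, tablespace_name
FROM   dba_tables
WHERE  table_name = 'AUD$';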

The symptoms look very much like those described in the Oracle MOS note INSERT INTO SYS.AUD$ Is Blocking Due to Waiting on 'gc freelist' (Doc ID 2207798.1).

I also consulted a few similar notes, for example Unified And/Or Traditional Auditing Inserts Waiting On GC FREELIST (In ASSM Tablespace) (Doc ID 2596033.1).

Overall, the recommendation is to adjust the parameter _gc_element_percent, setting it somewhere between 140 and 200 (the default is 110).
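
For illustration only, since underscore parameters should normally only be changed under Oracle Support's guidance; 150 is just one value inside the suggested range, and the change takes effect once the instances are restarted.

-- Illustrative only: takes effect after the instances are restarted
ALTER SYSTEM SET "_gc_element_percent" = 150 SCOPE = SPFILE SID = '*';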

Of course, for the case described here you can also work around the problem by moving AUD$ to a different tablespace, or by disabling auditing altogether.
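
A minimal sketch of the supported way to relocate the traditional audit trail is DBMS_AUDIT_MGMT.SET_AUDIT_TRAIL_LOCATION; the tablespace name AUDIT_TBS below is hypothetical and needs to be created first. The other workaround, disabling auditing, means setting audit_trail to NONE (spfile scope plus a restart) or issuing NOAUDIT for the offending statements.

BEGIN
  -- Move the standard audit trail (SYS.AUD$) into a dedicated tablespace;
  -- 'AUDIT_TBS' is a hypothetical name
  DBMS_AUDIT_MGMT.SET_AUDIT_TRAIL_LOCATION(
    audit_trail_type           => DBMS_AUDIT_MGMT.AUDIT_TRAIL_AUD_STD,
    audit_trail_location_value => 'AUDIT_TBS');
END;
/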
