I went to the systemstate dump to examine these in more detail. I started off by
looking for resources that were being held and resources being waited for.
Ignoring idle waits, the processes in the state dump are waiting for either
latch free, enqueue, row cache lock, or library cache pin. The state dump showed
one process holding both the (single) shared pool latch and the (single) row
cache objects latch, with other processes holding some of the library cache
child latches (the library cache latch has 5 children).
Looking at our waits in more detail we found the following sources for
contention:
-Contention on the library cache latch: all sessions holding a library cache
child latch were waiting for the shared pool latch.
-Contention for enqueues: all enqueue waits in the systemstate dump occurred
from the attempt to obtain a space management (ST) enqueue in exclusive (X)
mode. This enqueue was currently held by a process waiting for either the row
cache lock or the row cache object latch (depends which dump we look at)
-Contention for library cache pin: all processes are requesting a Share on the
same handle. The process which holds that pin in X mode is waiting for the row
cache object latch.
-Contention for row cache lock: we have waits for this, but nothing in the
systemstate dump is shown holding it. (Can you wait on the row cache lock if you
are not holding the row cache object latch?)
-Contention for global cache lock busy: still researching. (well, not really,
but I sort of ignored this one)
So we see that ultimately the bulk of the contention points to a single process
which is holding both the shared pool latch and the row cache object latch. What
is this process doing? Why the long hold of these latches?
The process (which is not specific to a particular username or piece of SQL. I
know this since I have multiple state dumps) displays:
holding 8000ebec shared pool level=7
Location from where latch is held: kghfrunp: alloc: clatch nowait:
Context saved from call: 0
state=busy
holding 8000c114 row cache objects level=4
Location from where latch is held: kghfrunp: clatch: nowait:
Context saved from call: 0
state=busy
Since the final statspack report completed after the slowdown cleared, I can
look for changes in certain pools rather than look at the end points. The
largest changes occurred in the following areas of the SGA.
POOL NAME BEGIN VALUE (early slice) END (early)=Beg (slow)
END VALUE (slow)
Shared pool dictionary cache 44,700,000 46,000,000
27,900,000
Shared pool sql area 34,500,000 35,300,000
53,600,000
Shared pool free memory 24,500,000 27,900,000
16,700,000
Shared pool KQLS heap 34,000,000 28,700,000
40,700,000
My thought is that the system needs more space for the dictionary cache, so the
process holding the row cache object latch also takes the shared pool latch.
This latch then tries to free unpinned (frunp) chunks. The scanning of the
shared pool and freeing additional space is taking a long time. (similar to
Riyaj 's response to this thread)
All this matches fairly well with the description of the Bug at the beginning of
this thread. In the meantime, another DBA here increased the shared_pool size
(which delays the onset of this problem, but it should hit harder when it hits),
and upgraded from 8.1.7.2 to 8.1.7.4 (Oracle thought it was a different bug
[1819214, 1647363] and that this upgrade would fix them. I am tracking down
whether our bug is also covered by this upgrade). It seems reducing shared pool
contention (reduce shared_pool size, use session_cached_cursors, ...) will also
help.
Thanks to Ross, Mladen, and Riyaj for the help.
Henry
-- --Original Message-- --
From: oracle-l-bounce@(protected)
[mailto:oracle-l-bounce@(protected)]On Behalf Of Mladen Gogala
Sent: Wednesday, April 28, 2004 12:52 AM
To: oracle-l@(protected)
Subject: Re: systemstate dump
I only now saw this post. Sorry for being late. There is an oracle
bug in 8i with DLM. To make the long story short, a global enqueue
"loses " owner, and, consequently, cannot be relased. That causes all
kinds of beautiful events. Here is what I 'm talking about and what Mogens
would call CTNOM:
Article-ID: <Note:157766.1 >
Circulation: PUBLISHED (EXTERNAL)
Folder: server.HA.RAC
Topic: Performance, Tuning and Hanging Issues
Title: Sessions Wait Forever for 'global cache cr request ' Wait
Event in OPS or RAC
Document-Type: PROBLEM
Impact: MEDIUM
Skill-Level: NOVICE
Server-Version: 08.01 to 09.00
Updated-Date: 18-FEB-2003 11:35:45
References: <BUG:1916409 >
Shared-Refs:
Authors: MPOLASKI.US
Attachments: NONE
Content-Type: TEXT/PLAIN
Products: 5/RDBMS;
Platforms: GENERIC;
Problem Description
-- ---- ---- ------
Sessions hang indefinately in an OPS environment. Session(s) wait forever
for
the 'global cache cr request ' wait event. Other sessions could end up
waiting
for the 'buffer busy due to global cache ' or 'global cache lock busy ' wait
events.
...
Explanation
-- ---- ---
Due to Bug 1916409 a PCM lock with the expired timeout might not be ever
cancelled.
Possible Symptoms:
- The 'WAITING SESSIONS ' section of an opsdiag.sql output (Note 135714.1)
shows
a session waiting for 'global cache cr request ' with a high value for
'seconds '
- The 'PCM LOCK WAITERS ' section of an opsdiag.sql output (Note 135714.1)
shows
a lock upgrading (for example: granted in 'Null ' mode and requesting
'Share '
mode with a high value for 'sec ' (seconds)).
- 'global cache cr request ' wait event never finishes. Seq# and
seconds_in_wait continue to increase.
select sid, event, seq#, seconds_in_wait
from gv$session_wait
where event like 'global cache% ';
- The lock dump (from an lkdebug dump or systemstate dump) for the offending
PCM lock has On_timer_q? : Y and a negative timeout value. Example:
- 'buffer busy due to global cache ' or 'global cache lock busy ' wait events
appearing on the system (although this is more of a side effect).
> -- --Original Message-- --
> From: Henry Poras [mailto:hporas@(protected)]
> Sent: Wed 4/28/2004 2:08 AM
> To: oracle-l@(protected)
> Cc:
> Subject: systemstate dump
>
>
>
>
> >From statspack, I compared the 15 (closer to 20) minute snapshot where the
> issue arose to the 15 minute snapshot just prior to that. What I found was:
>
> -CPU time almost unchanged
> -a huge change in the Total Wait Time (cs) for the following events (the
> number of waits were all within an order of 2 of each other except for latch
> free which jumped from 200 to 150,000).
>
> NAME TIME-initial (cs) TIME-final (cs)
>
> latch free 200 2,800,000
> enqueue 900 280,000
> global cache lock busy 8,200 30,000
> library cache pin 500 18,000
> row cache lock 600 15,000
> library cache lock 600 8,000
> DFS lock handle 30 5,000
> PX Deq: Execution Msg -- 5,000
>
-- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ------
Please see the official ORACLE-L FAQ: http://www.orafaq.com
-- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ------
To unsubscribe send email to: oracle-l-request@(protected)
put 'unsubscribe ' in the subject line.
--
Archives are at http://www.freelists.org/archives/oracle-l/
FAQ is at http://www.freelists.org/help/fom-serve/cache/1.html
-- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- --