Monday, March 13, 2006

ORA-04031 errors

Starting getting ORA-04031 today in one of our 24x7 databases. That is a good way to ruin your day, but this was a 9i database - that isn't supposed to happen. I run scripts from our central database that monitors possible shared pool fragmentation (among other things) every ½ hour, charts it all and more importantly sends an email when things are looking like it could be a problem. I was amazed, the utility never reported a problem via email. I started looking at the issue, had 15 phone calls in 5 minutes, so I flushed the shared pool to at least get people into the system. I checked the shared pool right after I flushed it, in only a few seconds, was already up to 70% usage (450meg shared pool).

No way, that just isn't right. The application designers have never heard of the term "bind variable" before coming to our shop, one of these "database independent" , middle tier independent, we just run java server pages applications. A horror show and a constant source of "its too slow" complaints from all 500+ users. After a few minutes investigation, I found a session that had been running for 105 minutes and was still going. No way, my utility would have told me that, it looks for things like that. I am getting phone calls again that people are getting 04031 errors when opening screens. Take a look, 100% shared pool, so I flush it again.


I can now deal with the session, turns out it is pulling massive amounts of data over a DB link from our financials systems (a couple million rows) with thousands and thousands of queries. I hate integration done by developers. I kill both sessions from both databases and flush the shared pool again. 10 minutes later, 55% usage, the balance is restored. No more complaints. Turns out the guy that started the report, went HOME.


Now, onto my utility, it can't be broken, I wrote it, and I normally write really good stuff.It is accessed via an APEX front end using LDAP authentication… thing of beauty. I pop straight to the logging screen, yup, there are the connections to the problem database. Drill into the results, yup, there is a possible shared pool fragmentation alert and a long running query alert. Go to the outgoing email logs, yup 3 emails sent 12 minutes before all of the phone calls.

Check my inbox, and my blackberry no messages. So, I think email server problems. Go check the email server, I can ping it and send an email from my desktop, check the SMTP servers, they are all working, I can connect and do a POP command. Now what… go to the machine the utility DB runs on, manually send an email, wait, wait, wait Hmmm not getting the email. Try from another machine. Not getting the email. Try from gmail, no email.

I get one of our tech guys to check the mail server. Find out that the lotus notes mail server has decided to stop forwarding email, actually turns out all 3 notes servers decide that forwarding mail is a bad thing, no reason why just "Failure to process message", over 3000 messages lost, gone, poof but luckily reboot fixes it (must be a Microsoft server).

Those emails appear to be gone forever though, I certainly hope the CEO wasn't expecting anything important. I then proceed to calmly explain to my manager the he is going to have dozens of phone calls, and other managers coming to him for the next week complaining that their site lost money and people can't do their work because the "database crashed" and how could we let this happen, why do they pay so much of their monthly department earnings to the IT department?

The managers of course haven't a foggiest idea what a "database" is, or care when it is explained to them it was the fact some goober accountant decided that he really and truly needed all of the detail data in our financials system over the last 7 years in his crystal report, so he actually caused all of the problems. My poor manager is going to get raked over the coals, he is going to get all of this, because, LOTUS NOTES SUCKS.

4 comments:

Anonymous said...

Nice story and a good reminder -

Never attribute to poor code what can be explained by dumb users.

alter system flush accountant;

Anonymous said...

You can also get this error if you're using a SERVER=SHARED connection in TNSNAMES; if the "large" pool is referenced in the error text.

Don said...

Mind sharing your SQL to analyze shared pool usage?

Herod T said...

Nothing exciting

All based on SQL and notes found in metalink note

396940.1