With the conversion of '07, staffing issues, hardware failures, fires, and the day to day grind, my job satisfaction over the last year has well, been at an all time low. I can never seem to get anything done properly, everything is patched together and done in haste to move onto the next fire. One of the fine ladies that works with me actually bought everybody in the group a little fireman's hat and a tiny little working fire hydrant and hose squirt gun, because it seems to be how we are viewed by the company as of late, simply running around putting out fires. A necessary, expensive evil entity who never has time to talk, and is only around when there is an emergency, or a fresh pot of coffee.
Well, today was a little different. 4 of us IT people had taken a group trek to the lunch room for a refill, we bumped into a few of our sales staff, once we all realized we were on hallowed neutral ground and no marking of territory was needed we started to chit chat, the weather around here has been record breaking and that is what we were talking about, all relaxing nicely. Then one of the sales guys pipes up, completely out of the blue and says "So, any of you ever look at our order processing and shipping system?". Well the system is a bolt on system to our oracle e-business suite and is relatively new. Development was outsourced and then the application support was laid in the lap of our haggard sales support team. None us IT people at that time were part of the sales support team, so the answer was a resounding "No". I monitor the database, but it is quiet and never complains about anything.
The sales guy goes on to talk about how over the last 2 months, it takes longer and longer to process an order, to the point of the sales guys queue up their orders and process let the system process them over night, because if they process the order manually, it locks up the application for as long as the order takes to process. The sales guy, lets call him Henry, goes on to say when the system was first in place, an order would process in about 1-2 seconds, now an order he processed this morning while a customer was on the other end of the phone took 7 minutes. I know his time is valid because the screen actually tells the sales person how long the processing took.
Henry continues on saying all of the sales folks have raised the issue with their manager, who was supposed to pass this along to the IT group, but it appears never has. I decide to return to my cube and take a look. I do the normal thing of right off the bat of starting a snapshot every 10 minutes and dig up some of the baseline snapshots so I have something to compare too. I call Henry and ask him to process an order so I can trace his session. I find his session and start a trace, he processes an order, and his session goes inactive after only a couple of seconds. I have Henry on the other end of the phone, and he says the process is still running. I can plainly see, that according to the DB, the process is over.
The app is a Delphi fat client, so I can rule out any connection pooling or anything like that. I go up to Henry's desk, the process is still going. I wait until it finishes, it takes almost 7 minutes, and he reviews the PDF that pop's up of the order and then he clicks the forward button and lets the work flow take over and push the order off to the next people in line. Those people happen to be in the cube right next to Henry, so I check with them. They say yes, they see they have an alert that the order has come in, but the lady is not wanting to open the order because at this time of day, it can take almost 10 minutes for the form to open., she says, first thing in the morning she can process a hundred forms in just a few minutes and I should come back tomorrow. I push the issue and she opens the form, and yes it takes - get this, 7 minutes.
Speculating to myself on what the problem might be, I go down to the sales support group and talk to their team lead, they know about the problem and it has been a ticket in the help desk for awhile but they are short staffed by 3 and do not have time to deal with it any time soon. I ask if I can, and of course am promptly handed the keys to the system and told to "have fun, don't change anything". I take a look and I find the PL/SQL that generates the PDF, it is a simply external call to a java program via the client that reads the DB and generates the PDF. I can run it in test no problem, generates a PDF in a couple of seconds. I ask Henry to try test, he does, takes just a few seconds.
Well, this is turning into a really long story, but I needed to lay the groundwork, sorry about that. Continuing the story, I decide to check out the DB server. lots of free space, minimal CPU usage, our reporting software shows that the machine only averages about 70% CPU usage over the course of a few days. Well that isn't it. I then cd into the temporary holding location for the PDF files and do an ls. Then I wait, and wait, and wait, and wait. I open another session and check the server. No CPU usage, no nothing. Being the server is a Linux box I run top and take a look, nothing out of the ordinary, I run iostat, nothing out of the ordinary. Very confused I check the other session and it is completed, I run "time ls" and wait. I am sure you can guess it, time comes back to just under 7 minutes. I check the scroll back on my terminal, it goes back to the max of the buffer, 2000 lines. I do a ls | wc -l. I wait the 7 minutes and get the outstanding number of 532,932. Wow, I didn't think that was possible, that is a lot of files. Then I realize with the file name sizes, ls is 4 columns wide. That is 2,131,728 files in that one directory assuming they are all the same file name length. I go back up one directory, do a ls -lrt and the directory is only 2 bytes in size. That doesn't seem right, with that many files in there, the directory header should be massive.
I call the SA group and have them check. They come back, surprisingly about 10 minutes later saying "Yeah, the directory is f*cked" it is a mount point on the NAS and we can't even see it from console. The NAS is mounted to the DB server and as a share that the sales people can access. A few phone calls later, and we confirm that the PDF's are temporary and not needed, we down the DB with the user's consent, the SA's unmount the drive and remount a new empty one immediately. Once we bring it back, we call Henry and he tries it and an order processes in under 2 seconds. We look at the directory and there is a single PDF in there. We know we haven't written 2+ million orders in 2 months (we wish!!), so the SA's work on getting some files out of the directory so we can take a look. While waiting on the SA's I monitor the directory, there are now 3 files, but Henry is the only user in the system still, we haven't released it to the public yet. I call Henry, he processed 1 other order, not 2 more - he swears by it. I asked Henry to process another, he does. now 7 files, and another one, now 15 files. I tell Henry to stop and call the support team and tell them about the problem.
The sales support group immediately drop everything and dig into the issue. Takes the team lead about 10 minutes to find the problem. The Delphi, just in case, backs up all files submitted by that user before it processes the current order. It assigns a sequence number and copies the files, effectively doubling the number of orders issued by the sales person before the new order is generated. The user's PC does all the work copying the files. The code to remove the backup files is instead a simple comment saying something to the effect of "delete was taking too long, investigate using a cron task to delete, the directory can fill up fast", made by one of the sales team support people who left awhile back. A quick change, and bingo, the code will delete order's older than 24 hours and not make any backups. Since the app is ran from a network drive, the sales group compiles a new version and pushes it out. After letting a few more users in, we verify the problem is fixed and go home for the day.
This made me feel good. Nothing to do with problems in an oracle database, no meetings, no hours of planning and discussion. Just suit up, get to work, and put the fire out...
sigh.