I'm beginning to really hate the old SCO systems that are left out there. However, when they call with a problem, I have to help, right? Well of course I do. But I can't help cringing sometimes..
This call was from a Boston client who honestly is trying to get off this creaking old boat. They can't seem to find off-the-shelf software that will meet their needs, but they have hired some company to write them a new app. Unfortunately it will be Windows, but that is what it is, right? Anyway, while the app development proceeds, the old SCO has to keep running. Early on a Friday morning, it stopped.
Well, not exactly. It stopped letting people log in. People who were already logged in could work, but received a message described to me as "something like 'system database not allocated'".
Well, OK - normally I'd complain and say "Please give me the EXACT message you saw", but for this I knew it had to be something in the TCB (Trusted Computing Base) being messed up. SCO has built-in tools to examine and even fix up minor problems, so I felt I might be able to lead someone through fixing it, but on the other hand this stuff can sometimes get nasty so I wasn't sure. I had another problem too: I had to be somewhere else in fifteen minutes and I wouldn't be able to talk anyone through anything until I was done with that.
"How many people are still logged in?", I asked.
"Four"
Well, heck, that's not so bad. It's Friday, a slow day for them as it is for most businesses, and they only usually have about twelve or so people working on that system anyway. I asked if they could limp along for a few hours. Yes, they could.. but please hurry.
So I went off to my meeting, but couldn't help thinking about how I'd approach this problem. Most likely, I could just have them run "integrity" to get a list of damaged files and restore them from backup. Most likely..
That's assuming the backup is good, of course. What if the backup has messed up files or the problem is really somewhere else? I can look at a tcb file and know what I'm looking at, but it would just be gobbledy-gook to the person at their end, so I'd have to get them to print things and fax them to me.. yuck. I shortly convinced myself that it was better to go in to the job.
After finishing up my other business, I called and explained that. I said that I probably could fix this over the phone, but I was a little hesitant because it could get nasty, and I'd rather just come in. The people at the other end immediately agreed: they really didn't want to be led through anything anyway. By the way: they were now down to two people working because the other two had "accidentally" logged out.. no, I don't know how you "accidentally" log out either.
An hour later I was there. I ran "integrity" and it pointed to /etc/auth/system/default as the problem. I looked at it and found it zero length.. how could that happen? That file should look something like this:
I've never seen a system glitch just zero that, so I was suspicious.
"Who's got root?"
It was a short list.
"I think somebody meant to clear something else and somehow cleared this.. or.."
The "or" would be file system corruption. Possible, but there were no other indications of that. Nothing in "syslog" or "messages" indicated any other problems. I restored the file from backup and people immediately could log in. I watched log files for any other problem and ran "integrity" again. No problems..
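For what it's worth, the "is anything else zero length?" part of that checking is easy to script. Here is a minimal sketch in Python of walking a directory tree and listing empty files. It is not a replacement for "integrity", which checks far more than file size, and it assumes a machine that actually has Python, which an old SCO box probably doesn't. The function name find_empty_files is just something I made up for illustration, and the default path is only the directory that held the damaged file in this story.

```python
import os
import sys


def find_empty_files(root):
    """Return a sorted list of regular files under root whose size is zero."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
            except OSError:
                # File vanished or can't be read; skip it rather than abort.
                continue
    return sorted(empty)


if __name__ == "__main__":
    # Assumed default: the directory that held the zeroed file in this story.
    root = sys.argv[1] if len(sys.argv) > 1 else "/etc/auth/system"
    for path in find_empty_files(root):
        print(path)
```

An empty file isn't always a problem, of course, so anything this turns up is just a candidate to compare against the backup.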
I hung out a while checking this and that.. nothing seemed odd, so I really think someone had to have done this by mistake.. I have no idea what they thought they were doing.. it couldn't have been sabotage because it was too limited.. surely anyone wanting to cause damage would have done more.. well, I would think so anyway.
So that's where I left it. I'll check in on it a few times over the next week, but I really don't think there is anything wrong.. just an "accident".