Before The Storm
A few years back, I was working in a manufacturing company as IT manager. As in many industries, we had a number of machines with embedded computer systems. For the sake of convenience, we called these “production machines” because they produce stuff. By and large, the PCs inside the production machines were just normal desktop PCs with a bunch of data acquisition cards in them.
Invariably these PCs are purchased and configured when the production machine is being commissioned, and then just left as-is until the production machine is retired. In some cases, this can be as long as 20 years. Please bear in mind that this is 20 years inside a dusty, hot factory environment.
I’ve been in manufacturing environments before, and this concept is not new to me. Thanks to a number of poignant lessons in the past, I make it my business to understand these PCs inside and out. I like to keep them on a tight refresh cycle, or when it’s not practical (in the case of archaic hardware or software), keep as many spares as possible.
Also, regular backups are important. You just have to understand that, unlike with a normal PC, backing up one of these can be difficult, so plan it well in advance. More often than not, these PCs aren’t IT’s responsibility; they fall under engineering or facilities. Even so, these guys understand that IT runs just about every other PC in the business and welcome any advice or assistance that IT can provide. Finally, these PCs are usually tightly integrated into a production machine, and failure of the PC means the machine stops.
And so we have today’s stars: Me, the new IT manager, and Aaron, the site’s facilities manager. He’s in charge of the maintenance of the site, including all of these production machines. He’s super paranoid about people trying to take his job, so he guards all his responsibilities jealously and doesn’t communicate anything lest they get the drop on his efforts.
Oh, and he has a fixation about not spending company money, even to the point of shafting the lawn-mowing guy out of a few hours’ pay. Then there’s the vice president of operations, the factory boss who’s a no-nonsense sort of guy, plus one of the old boys of the factory, Dale, who’s a man in his 70s.
I’m new, but in my first few weeks I’ve already had a number of run-ins with Aaron. I’m a fairly relaxed guy, but I have no qualms about letting someone dig their own grave and fall into it, and in Aaron’s case, I’d be happy to lend him my shovel. My pet hate involved new network drops. When organising them, I would always run a double even when we only needed a single. We’re paying working-at-heights money already, and a double drop is material cost only. He’d invariably countermand all my orders and insist on singles.
Then a few weeks or months later, I’d have the sparkie guy in again to install the second drop, at another $4k. And then there was the time that Aaron was getting shirty because I was holding up a project of his. Well sorry, if you are running a project that requires 12-16 network ports, you’d better at least talk to the IT guys prior to the day of installation.
Not only will you not have drops, you won’t have switch ports. And if you didn’t budget for them, or advise far enough in advance that I could, then you can wait until I get around to it. Failure to plan is not an emergency. So you can see that we didn’t exactly gel together well. Which brings us to these production machines, and the PCs nested within.
Every attempt I made to document these PCs, or even understand them, was shut down by Aaron.
Me: Hardware and software specifications?
Aaron: That’s my job, get lost.
Me: Startup and shutdown procedures?
Aaron: That’s my job, get lost.
Me: Backup?
Aaron: That’s my job, get lost.
Me: Emergency contacts?
Aaron: That’s my job, get lost.
You get the picture. It resulted in a strong and terse email from Aaron telling me to leave it alone. He had all the documentation, contacts, and backups, and didn’t need or want my meddling. I was not to touch any production machine’s PC under any circumstance. Move forward a few months, and I’m helping one of the factory workers on their area’s shared PC.
It’s located right next to one of these production machines. It’s old. The machine itself was nearly an antique, but the control system had been “recently” upgraded. I had actually seen this software in a different company, so I had some basic familiarity with it. Still, these particular production machines are rare; only a few of them exist in the world.
We bought this one from a company that had gone out of business a few years earlier. It was Test and Tag day, and Aaron was running a sparkie around to do the testing. My earlier instruction to the sparkie was to not disconnect any computer equipment unless it was powered off. And so it came time to test this production machine’s PC.
The sparkie wasn’t going to touch it while it was on. Luckily Aaron came prepared with his thoroughly documented shutdown procedure: Yank the power cords. The test passed, new labels were applied to the power cord, he plugged it back in and turned it back on, then ran off to his next conquest without waiting for the boot to finish. This was the beginning of the trouble.
10 minutes later, the machine operator starts grumbling. I have a quick peek and see that the control software has started, but the screen is garbled and none of the right measurements are showing. Aaron is called over. He takes one look, pales, and then runs off. Another 10 minutes later, the operator looks at me and asks for help.
I call Aaron’s mobile, and it’s off. I call the Vice President’s mobile and suggest that he come over immediately. 10 minutes later, the operator, the Vice President, and I are looking at this machine. It’s screwed. There’s the better part of a million dollars’ worth of product to be processed by this machine, and the nearest alternate machine is in Singapore, belonging to a different company.
If the processing isn’t done soon, the product will expire and be scrapped. 40% of revenue is from product processed by this machine. We’re screwed. 10 minutes later, we still can’t get a hold of Aaron. We can’t talk to him about the “backups” or any emergency contacts that he knows about. We can’t even get his phone to ring.
So as I have said, I have used this software before and have a basic understanding. I know enough that the configuration is everything, and the configuration is matched to the machine. But I also knew a guy who did some of the implementations. A call to him gave me a lead, and I followed the leads until about four calls later, I had the guy who implemented this particular machine.
This is the old boy from above. He had retired 10 years earlier, but the Vice President had persuaded him to come out of retirement for an eyewatering sum of money. A few hours later, this guy took one look at the machine and confirmed that the database was screwed. We’d need to restore it from backup. Aaron is still not contactable.
Me: Let’s assume for a moment that there is no backup. What do we need to do?
Old Guy: Normally I’d say pray, but you must have done that already because I haven’t kicked the bucket yet.
To cut a long story short, we had to rebuild the database. But not from scratch. Old Guy’s MO was that, once he had finished setting up a machine, he’d create a backup database and store it on the machine. The only issue was that 20 years of machine updates needed to be worked out. It also just so happens that, through sheer effort, I am able to compare a corrupted database file to a good one, and fool with it enough to get it to load in the configuration editor.
It’s still mangled, but we are able to use that as a reference to build the lost configuration. All told, it took four days to bring this machine back online. But we did. To be honest, I certainly wasn’t capable of doing this solo, and without my efforts to patch the corrupted database file, Old Guy would not have been able to restore 20 years of patches that we had no documentation for.
And what of Aaron? After we started working on the problem, he showed up again. He ignored any questions about a backup (because obviously there wasn’t one), and instead demanded regular status updates for him to report to the Vice President. The little jerk had screwed up the machine and run off to hide, and now that a solution was in progress, he was trying to claim the credit.
When it was all running again, the Vice President came to talk to me.
VP: Thanks for your help. Your efforts have un-screwed us.
Me: No worries.
VP: And now we get to the unpleasant bit. Aaron claims that you didn’t follow procedure when shutting down the machine, causing it to crash. He also claims that you hadn’t taken any backups, and it was effectively your fault.
Me: And when we tried to call him?
VP: He claims he was busy contacting his emergency contacts.
Me: I see.
VP: I don’t believe a word of that. Unfortunately, it’s your word versus his. If I had the evidence, I’d fire him.
Me: (opening, on my phone, the email Aaron had sent me about meddling) You mean this evidence?
Half an hour later, I got the call to lock Aaron’s account and disable his access card.