Archive for the ‘Disaster Recovery’ Category

Clouds, on-Demand, and ASPs

Tuesday, October 27th, 2009

I recall one of the big buzzwords in the computing industry in the late 1990s: “ASPs” or “Application Service Providers.” The idea, which isn’t a bad one, was to take all those applications you find on a typical PC and host them from central locations on the Web. Rather than paying $400 for a copy of MS Office, for instance, you’d just run an equivalent application remotely on an as-needed basis. The cost model was to be based on connection/usage time.

Sound familiar? It should.

The term “ASP” largely fizzled out following the Dot-Bomb bust of 2001/02, but the idea continued gaining steam. As marketers do, they just changed the name to reflect current trends, while avoiding the toxic, oh-all-those-places-went-bankrupt ASP moniker. At first, they started using “on demand” (as in IBM’s “eBusiness on demand”) as a replacement term. It evoked the same sort of concept, while being sufficiently nebulous to avoid offering any actual meaning. A marketing dream term!

Nowadays, “cloud computing” is the Next Big Thing. It’s sort of a next generation ASP/on-demand model, and can mean different things to different people. For companies like Amazon and Google, it can mean providing virtual hosts that are totally controlled by individual customers (to the point of having the ability to reboot and rebuild systems to suit specific needs). IBM offers something similar, and of course Microsoft is doing their thing as well.

In general, it all comes down to the same thing: centralizing application and computing horsepower, then charging customers by units of work. C-level executives love this idea, because they see dollar signs (smaller ones, i.e. lower costs) in outsourcing even more IT functions. IT professionals, naturally, see it as a threat. If lots of companies outsource IT management, lots of professionals stand the chance of being laid off.

I’m on the fence about the Cloud/on-demand/ASP model. On one hand, centralizing this sort of thing is an excellent management strategy. I recall taking all my university’s high-use applications and moving them to dedicated application servers (with licensing management built in) — a move that saved us massive amounts of time when performing upgrades. On the other hand, network and vendor availability become ultra-critical in a cloud/on-demand environment. If the network goes down (or has insufficient bandwidth) you’re going to be in trouble. Likewise, if your Cloud vendor’s site goes on the blink.

You may want to think hard about the meaning of the term SPoF (single point of failure) before you or your company decide to make this sort of business decision.
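To put rough numbers on the SPoF worry, here is a minimal Python sketch. The availability figures are made up for illustration, and the redundancy math assumes failures are independent, which real networks rarely guarantee.

```python
def serial_availability(*parts):
    """Availability of a chain in which every part must be up at once."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def redundant_availability(*parts):
    """Availability when any one of several redundant parts is enough.

    Assumes the parts fail independently of one another.
    """
    all_down = 1.0
    for availability in parts:
        all_down *= 1.0 - availability
    return 1.0 - all_down


# Hypothetical figures, not measurements of any real link or vendor.
OFFICE_LINK = 0.999
CLOUD_VENDOR = 0.999

# One network link in front of one vendor: both are single points of failure.
single_path = serial_availability(OFFICE_LINK, CLOUD_VENDOR)

# Two independent links in front of the same vendor.
two_links = serial_availability(
    redundant_availability(OFFICE_LINK, OFFICE_LINK), CLOUD_VENDOR
)

print(f"single link + vendor: {single_path:.6f}")
print(f"two links + vendor:   {two_links:.6f}")
```

Chaining two dependencies always yields lower availability than either one alone. Doubling up the link helps, but the vendor stays a single point of failure that caps the total.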

PC Resurrection

Friday, July 3rd, 2009

As most of you know, I’ve been having problems with my primary Windows PC’s C drive. It’s turned out to be a fairly complex problem, and the solution has been challenging. The process also shows how easy it is for multiple problems to turn up simultaneously.

The initial problem was slow read performance on the C drive (lots of solid disk activity lights, random slowness), which was finally traced to bad blocks on the disk. Since that’s an easy swap, I ordered a replacement 500GB Seagate drive, opened the case, connected it to a spare SATA port, and powered on the system. It would no longer boot, and wasn’t even running its self tests. I theorized that the board itself was going bad, and this was causing the disk errors.

Since the board was a 2003 model, a full upgrade seemed useful. My new Intel board and 3.0GHz processor, along with 2GB of 800MHz Corsair memory, arrived a few days later. After a quick hardware swap, I discovered that the system disk had apparently lost some Windows executive files. I really wanted my original OS back and didn’t just want to re-install, so I decided to try an experiment.

The original XP installation CD was put in, and Recovery Console was booted. I tried both fixboot and fixmbr to see if they’d restore the drive’s boot blocks, but they were inadequate to the task. The XP CD was again booted, and I told it to Repair the existing Windows installation. An hour later, the system booted on its own. Phase one was accomplished.

I then installed the new motherboard’s drivers and a shiny new GeForce 9500GT SLI video card, and checked performance again. The original disk is still throwing errors, so we now know the problem was probably a combination of a failing motherboard and a flaky disk. The original drive is now being cloned onto a brand new 500GB Seagate Barracuda. Hopefully this will be the end of the diagnostic process and my (completely refurbished) system will be back to normal.

The object lessons are as follows. First, good backups (which I have) are key. Second, not all performance issues are software related. Third, Event Viewer can be your friend. Finally, keep a copy of a partitioning package (I use Partition Commander) around at all times. It might just save the day when a disk decides it’s fed up with life.

End of an Era

Thursday, June 25th, 2009

Were I a superstitious person, I’d be suspicious. The disk problems I described on my primary XP machine a few days ago turned out to be more deeply rooted than originally suspected. I now believe the motherboard was also developing a problem, and it may have contributed to the dead disk. Here’s how I came to this conclusion.

The new drive arrived today. I eagerly hooked it up in place of the secondary SATA drive, believing I could just boot the machine using my copy of Partition Magic’s recovery CD. Then I’d simply examine the defective partition and (hopefully) copy it intact to the new disk. I could then re-arrange partition sizing, copy the extended partition into place on the new drive as well, then shut down. Some quick drive re-swapping would put the newly rebuilt 500GB drive into place, and all would be well.

But the machine now no longer powers on. This came as a surprise, and not a pleasant one, since it hadn’t exhibited a problem before. The power LED on the motherboard is on, the disks seem to spin up, and even the CPU cooler fan comes on. But the actual power indicator never glows, and video does not appear on the monitor. I now suspect the original physical disk itself is fine, but lost data due to the knock-on effect from the (apparently) failing motherboard.

The superstitious aspect of this is that I built the original machine back in June-July 1998. It was originally installed with Windows 98, and has been upgraded in place ever since (the only original parts are the case and floppy drive). The OS was never re-installed, only upgraded. So it had an 11-year run, nearly to the month.

This said, it’s giving me an opportunity to do a full upgrade. I haven’t seriously modified the hardware in several years (aside from a new video controller and extra memory), so I’m shopping. A new motherboard (Intel DG35EC Motherboard CPU Bundle with Intel Core 2 Duo E8400 Processor 3.0GHz) is under consideration, along with 4GB of Corsair RAM and a shiny new SLI video card. It’ll be loaded with a fresh copy of XP, so there’ll be no cruft onboard from prior installations. But I will use the old machine’s case, floppy, and power supply. The legacy lives on. And I have all my critical data, so it’s not a major disaster.

The only thing I’m not looking forward to is re-installing all my applications. But I have several other systems I can use in the meantime, so it’s not a huge priority. And the new machine will be really, really fast.

When Disks Go Bad

Monday, June 22nd, 2009

Recently I’ve been having a difficult performance issue on my primary Windows PC. Basically it’s involved sporadic system hangs with lots of disk activity that didn’t seem related to a specific application. The HD activity light would come on for 5-15 seconds, hanging the machine. Then it would recover and all would be well for a random amount of time.

I put the system through all the usual checks — spyware, viruses, and so forth. Nothing showed up (as it should not, since the machine is pretty heavily protected). I tried shutting down various services and applications, like the firewall and various System Tray applications. No effect. Next, I updated video drivers and made sure there were no known issues involving compatibility or Windows updates that occurred recently. The problem persisted.

Finally it occurred to me that I was over-thinking the problem, and that it might lie at a much lower level. So I opened Event Viewer, cleared all the logs, worked for a while, then opened the System log and took a look. The problem was immediately visible: a disk is going bad. The log is showing multiple cases of bad blocks on HardDisk0, which is the system drive. That’s not good, and it definitely explains everything. Now the problem is to get a new disk, open the box, hook up the new drive temporarily while the old one is still in place, and use Partition Magic to clone the partitions from the failing drive onto the new one.

Of course, I need to do all this before the existing disk decides it’s time to go to the Great Silicon Graveyard. In the meantime, I’m trying to pull a backup from both partitions on the failing drive, just in case it fails completely before the replacement shows up in the mail.

The lesson is clear: you can’t blame all performance problems on spyware or disk fragmentation. Sometimes the problem is much more fundamental. If you’re having a problem like this, check the basics. Make sure no errors are showing up in the system logs, or in another hardware-related location. The data you save may be your own.
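If you export the System log to a text file, a few lines of Python can do the same eyeballing for you. This is only a sketch: the sample line layout below is an assumption on my part (exports vary by tool and Windows version), and the only thing it relies on is the familiar “has a bad block” message naming a \Device\HarddiskN device.

```python
import re
from collections import Counter

# Matches e.g. "\Device\Harddisk0\D, has a bad block" and captures the disk number.
BAD_BLOCK = re.compile(r"\\Device\\Harddisk(\d+).*bad block", re.IGNORECASE)


def count_bad_blocks(lines):
    """Return a Counter mapping disk number -> number of bad-block entries."""
    counts = Counter()
    for line in lines:
        match = BAD_BLOCK.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical exported log lines; the column layout is illustrative only.
sample_log = [
    r"Error  Disk  7  The device, \Device\Harddisk0\D, has a bad block.",
    r"Information  Service Control Manager  7036  A service entered the running state.",
    r"Error  Disk  7  The device, \Device\Harddisk0\D, has a bad block.",
]

for disk, hits in count_bad_blocks(sample_log).items():
    print(f"Harddisk{disk}: {hits} bad-block entries")
```

Any non-zero count on the system drive is a good reason to start backing up before doing anything else.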

Google’s Glitch

Thursday, May 14th, 2009

As many people on Google’s Gmail and search services certainly noticed, the huge provider experienced a severe service provision problem between roughly 10:30 and 11:30AM Thursday (US Eastern time). The issue, which apparently involved a routing problem that pushed too much traffic toward servers based in Asia, caused delays in the all-important search function, as well as problems on YouTube and other services. Fundamentally, “many Web sites took twice as long to load and were twice as likely to fail during Google’s disruption” according to one report.

The main issue here is that Google has largely become the primary go-to service in terms of search and other services. Since its services are “used by hundreds of millions of people, even a breakdown affecting a small percentage of its audience can have a huge impact. Google’s search engine, by far the most popular on the Internet, fields more than 9 billion monthly search requests in the United States alone.” Ergo, if Google goes down — even for a short period — a lot of services simply stop working.

As is the case with many large Internet and technology companies in general, the company has distributed its services worldwide to guard against a major meltdown at a particular facility. However, in this case nothing actually went down. Instead, a routing issue overloaded one data center until the problem was corrected. Basically, they’re “doing it right” from an overall design standpoint, but somehow managed to put too much of a load on one location.

The bad thing is that this interrupted services worldwide for about an hour, and certainly caused a lot of user frustration. The good thing is that annoyed users simply did the right thing by going elsewhere. Rather than using Google, they used Yahoo or another search engine until the problem was resolved.

Another good thing (though bad from a production computing standpoint, since outages like this are embarrassing) is that this sort of problem helps big companies test the resiliency of their services. Companies can run as many simulated disaster drills as they want; sometimes it takes a real emergency to find holes in the recovery process and areas for improvement.

Will someone lose their job over this? Possibly. Will Google learn from it? Hopefully.

Too Much Dependence on the Internet?

Monday, April 20th, 2009

In the past, I’ve expressed some cynicism regarding buzzwords like “cloud computing” and so-called Grid applications. To my mind, most processing should happen locally whenever possible. Why? Reliability, that’s what. As soon as you start relying on a network for delivering not only data, but the application that generates it, you’re asking for trouble.

It seems I’m not the only writer who thinks so. Another guy who’s been thinking about this problem recently expressed his own skepticism, saying that a fully distributed, on-demand software environment is “viable only if consumers have a wired or wireless connection to these services working all the time (or should I say, when they really need it?)” The point was illustrated most recently by the vandal-perpetrated fiber-optic cable cut on the West coast that took hundreds of thousands of customers off the air for the better part of a day. And this event didn’t just affect Internet services. Cell lines were also taken down (guess what? they run on the same infrastructure!) during the same period.

We depend on the Internet for many of our daily services. What’s worse is that the power grid, national defense infrastructure, seismic sensors, tsunami detection systems, and other critical functions are also largely Internet-based these days. It wouldn’t take a terrorist event to knock all these services offline, either. A major fire, earthquake, asteroid impact, or flood could take many services off the air — just when they’re most needed.

As the article states, “we’re already dependent on the Internet for information, communications, and commerce, and we’re starting to rely on it for real-time delivery of applications. And now we’re putting all of our digital bits in one Internet basket and becoming more reliant on the cloud without even realizing it.” I’m not sure about others, but I really think this is a bad idea. Decentralization is the key to survivability. The current model of buying and installing software locally may be annoying to some. The lure of easily accessible, “cloud” applications is strong for a number of reasons. But it’s brittle. The swipe of a backhoe or a major natural disaster could take far too many critical services offline, very very suddenly.

Google Labels the Internet Evil

Tuesday, February 3rd, 2009

Okay, it was definitely an accident. And it was fixed very quickly. But still, Google’s recent error in marking every site on the whole Internet as “potentially malicious” shows how human error can cause a major technical faux pas.

The error itself involved the accidental addition of what’s known as a “wildcard” character to a file used to identify suspect websites. While this file was being updated, “the URL of ‘/’ was mistakenly checked in as a value to the file, and ‘/’ expands to all URLs.” This is a very simple mistake, and probably just involved finger failure on the part of the administrator. It was clearly not malicious on Google’s part, but it was certainly very embarrassing.
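Here is a small Python sketch of how one stray “/” can swallow everything. It assumes a simple prefix-matching block list, which is my guess at the general idea and not a description of Google’s actual file format or code. The paths and the validate_patterns helper are hypothetical.

```python
def is_flagged(path, patterns):
    """Flag a URL path if it starts with any pattern in the block list."""
    return any(path.startswith(pattern) for pattern in patterns)


def validate_patterns(patterns):
    """Drop entries so broad they would match every path (a hypothetical safeguard)."""
    return [p for p in patterns if p.strip("/") != ""]


# Hypothetical block list entries.
good_list = ["/malware-downloads/", "/phishing-kit/"]
bad_list = good_list + ["/"]  # the one-character slip

for path in ["/malware-downloads/setup.exe", "/index.html", "/news/today"]:
    print(path, is_flagged(path, good_list), is_flagged(path, bad_list))
```

With the stray entry in place every path is flagged, since every path begins with “/”. A sanity check like validate_patterns, run before the file goes live, is one cheap way to catch that sort of finger failure.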

As a knock-on effect, the error also may have caused some email to be mis-classified. According to Google’s statement “the block list is also used in its spam filters, so legitimate messages may have been classified as spam.” They’re now working on ways to correctly re-classify any mail that was incorrectly labeled, but this is only a retrospective correction since those initial messages were almost certainly discarded.

This isn’t the first time administrator error has caused outages or other problems across the Internet, and it won’t be the last. The complexity of the systems and processes involved increases all the time. It’s also obvious that such outages will receive increased media attention. The overall importance of the ‘Net increases daily as more individuals and organizations turn to it as a work-saving and communications-enhancing tool.

According to my own Second Law of Computing, an increase in the importance of a given system mandates a parallel increase in complexity. You can’t run a “five nines” operation without putting serious safeguards and massive redundancy into place. Redundancy makes the system harder to manage. And as “Scotty” from Star Trek once said, “the more they overthink the plumbing, the easier it is to stop up the drain.”
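For a sense of what “five nines” actually demands, this short Python sketch converts a number of nines into the downtime allowed per year. The only assumption is a 365.25-day year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of N nines (e.g. 5 -> 99.999%)."""
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


for nines in (2, 3, 4, 5):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime per year, which is why a single fat-fingered config change can blow the whole budget.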

The simplest error can take whole systems down. It doesn’t matter whether a construction backhoe cuts through your fiber-optic cable or a janitor accidentally unplugs a critical router: the results are the same.

Seagate’s Disk Issue

Thursday, January 22nd, 2009

Seagate Technology has been in the storage business as long as I can remember, and generally their drives have been pretty good. I’ve owned a few, and have had no complaints. I can’t recall one of their devices ever going belly-up on me, and their warranties are reasonable for the industry. Storage is a critical area, and no company that sells disks with a high failure rate will survive long (for obvious reasons).

That’s why it was all the more surprising when, earlier this week, one of my informants told me about a Seagate firmware fix that actually seems to have broken more drives than it fixed. Considering the care drive manufacturers have to take when issuing fixes (and which, according to a Seagate engineer, is the usual method of doing business), it was very surprising to hear the fix caused so much damage.

Effectively, the firmware update actually caused more drives to “brick” (this is an industry term describing what’s left when a disk dies — it becomes nothing more than a brick). What happens is that the drive’s BIOS stops responding altogether when the system boots, so the drive becomes invisible. The only recourse is to send the drive to a data recovery service.

Even worse, one user on Slashdot claims that several data recovery services have known about the BIOS issue for a long time, and have “figured out an easy way to fix the firmware and kept it secret. They made a great profit, charging prices as if it was a hardware failure.” He goes on to say that “Seagate Datarecovery did the same by quoting up to 1800 USD for a 10 minute fix. Although I am sure that they were the only ones not aware of the easy fix.”

The CNet article notes specific drive types and firmware levels affected by this problem. If you’ve encountered a bricked drive from Seagate, check it out. You might be entitled to a free fix, and your data is probably intact.

When to Kill Your PC

Tuesday, January 13th, 2009

Recently a former colleague sent a message to a mailing list, asking what the options were for a 4-year-old PC that was apparently in the process of eating its own disk. The whining and grinding noises it was making, along with occasional Blue Screens of Death, indicated the system’s hard drive was on its last legs. What should she do: replace the disk or buy a new machine?

Replies began appearing almost immediately. Initially, people tried to guide her through various processes designed to preserve the current machine. The first suggestion involved backing up all her user data to an offline device, then buying a new disk and re-installing the OS (Windows XP) on it. Then she could restore her data and continue with her work.

Others suggested installing a second drive in the PC, then using Ghost or System Commander to clone the soon-to-be-deceased disk onto the new drive. This solution had the advantage of not requiring lots of user intervention or manual copying of files; instead she’d just start cloning the drive and wait until the process finished. I performed this process on an old laptop just last spring, in fact. It works great as long as you’re reasonably knowledgeable about disks and what steps should be performed. It’s not so good for someone who has little technical knowledge.

Someone else actually suggested building a RAID array in order to guard against such failures in the future. That’s not a bad idea, but again it’s not really a solution for a generic user. Of course, a PC shop could set all this up for a fee, but the bench charge and disk hardware could easily cost as much as a new system.

Finally the talk turned to replacing the whole system. As I’ve pointed out in recent months, this is a great time to buy a new system. Most makers are offering deep discounts and extended (often interest-free!) payment plans. You can buy a good system with 4GB of RAM, a big disk, and a fast graphics card for under $1000.

The former colleague in question hasn’t made her decision yet, but the whole discussion process just shows how wide-open the options are these days. Cheap upgrades are still a good choice for anyone on a budget, but a brand new machine (with a brand new warranty) could also be just the ticket.

Surfing Without Power

Thursday, December 18th, 2008

I live in the Northeast US, and was in the midst of the major ice storm last week. Thursday night temperatures hovered around the magic freezing mark, and a shiver went down my spine as ice accumulated on our power lines. I shut down my PCs, and suggested my wife do the same to her desktop system. Sure enough, we lost power a few hours later. It didn’t come back on for four days.

Happily, we had a working phone line and I’m a DSL guy. I also have knowledge of electrical systems, and managed to score a generator Friday morning before local stores ran dry. I won’t recommend you try rewiring your house in order to get your gear back online during such an event — only attempt it if you know what you’re doing, since a mistake can be costly in any number of ways — but I can say that we had heat, some lights, and DSL access while our neighbors were sitting in the dark feeding stacks of wood to their fireplaces.

We use UPS units for all our PCs so slight power issues aren’t much of a problem. If the generator sags, the UPS picks up the load and keeps the voltage to the PC (or modem) stable at 120 VAC. Once I’d identified the critical circuits, I added them to the generator’s output and all was well. Obviously I didn’t spend a lot of time online because there were other things to attend to, but I was able to grab weather reports and other useful information periodically.

Also, don’t try to use a laser printer if you’re running on generator power. I say this because most of these devices require a lot of current, like running a blow dryer on the “high” setting. Unless you have a fairly beefy generator, it probably won’t be able to support the laser at the same time as, say, your refrigerator and furnace motor. Overloading a generator will either shut it down or destroy it, and neither is a great option when it’s 25 degrees outside. Wait until the grid comes back up, or find a Kinkos that still has power.
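The load budgeting is simple arithmetic, sketched here in Python. Every wattage below is a made-up illustrative figure, as is the 80% headroom rule of thumb. Check the nameplates on your own gear and your generator’s rated output before trusting any of it.

```python
def fits_on_generator(loads_watts, capacity_watts, headroom=0.8):
    """Return (ok, total): ok is True if the total load stays within the
    chosen fraction of the generator's rated capacity."""
    total = sum(loads_watts.values())
    return total <= capacity_watts * headroom, total


# Hypothetical running loads in watts; real figures vary widely by model.
essentials = {
    "refrigerator": 700,
    "furnace blower": 600,
    "PC, UPS and DSL modem": 300,
}
with_laser = dict(essentials, **{"laser printer": 1000})

GENERATOR_WATTS = 3000  # hypothetical rated output

for label, loads in (("essentials", essentials), ("with laser printer", with_laser)):
    ok, total = fits_on_generator(loads, GENERATOR_WATTS)
    print(f"{label}: {total} W -> {'OK' if ok else 'over budget'}")
```

Motors also draw a brief surge when they start, which this sketch ignores. That is one more reason to leave headroom instead of loading the generator to its rated limit.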

Staying online during a power outage can be fun. Or you can take time off and drink coffee while watching the snow fall. Plan well, and you’ll have the choice.