

Archiving the Web

Ever wonder what happened to all those early pages everyone wrote back in the late 1990s, when the web was just getting started? Many of them might still be out there, lurking silently on a site known as the Internet Archive, whose “Wayback Machine” (a nod to the old Bullwinkle cartoon series) lets you browse them. The site has been crawling web pages for years, saving copies of the pages and their files so that they can be displayed much as they originally appeared.

The archive recently got a major upgrade. It needed one, since it basically re-indexes the whole web every two months to look for new and changed pages. It’s maintaining one of the biggest (possibly the biggest!) databases on the planet, and the new datacenter “fits in a 20-foot-long outdoor metal cargo container filled with 63 server clusters that offer 4.5 million gigabytes of data storage capacity and 1TB of memory.” That works out to about 4.5 petabytes of storage. The container is installed at a facility belonging to Sun Microsystems, which provided the hardware.

The system is pretty cool overall, and it’s invaluable as a tool for showing how the web has developed over time. One of the hazards of digital libraries (and the web is one, when you think about it) is that in many cases there’s no preservation system behind them. The US has the Library of Congress, plus all those other “analog” libraries that keep copies of books permanently. But once a web page is altered or taken down, the old version is gone. Hence the Archive: it’s a means of keeping all those old pages around for future reference.

Some might ask why anyone would care about millions of really badly designed pages (several of my early efforts are out there, and I cringe when I look at them). But that’s the point, really. Someone should keep copies of such things so we can see how far we’ve come in a very short period of time. Only 10 years ago, we were padding pages with invisible spacer images in various sizes just to achieve a layout. It was pretty horrible. Nowadays, with the increased use of CSS and improvements in the XHTML standard, life is a lot better for web developers.

Plus, who knows how many useful documents were once on the web but have been taken down over the years? Remember: computing is all about data.
