Browse through 55 billion web pages archived from 1996 to a few months ago.
http://www.archive.org/web/web.php
The archive is even copied (mirrored) to the Great Library of Alexandria in Egypt at http://archive.bibalex.org.
This has implications for all of us. Long-'dead' sites can be resurrected and unwanted references restored.
Technically it is rather cool, and sometimes two versions of a page are captured on one day. Check the site to make sure you have an appropriate 'robots.txt' file if you do not want to be publicly archived there. (Makes you think, though, about other agencies that may be collecting data...)
Here are a couple of 'tidbits' from the FAQ.
How large is the Wayback Machine?
The Internet Archive Wayback Machine contains almost 2 petabytes of data
and is currently growing at a rate of 20 terabytes per month. This
eclipses the amount of text contained in the world's largest libraries,
including the Library of Congress.
What type of machinery is used in this Internet Archive?
Much of the Internet Archive is stored on hundreds of slightly modified x86
servers. The computers run on the Linux operating system. Each computer
has 512Mb of memory and can hold just over 1 Terabyte of data on ATA
disks. However we are developing a new way of storing our data on a
smaller machine. Each machine will store 1 terabyte. For more
information go to www.petabox.org.
How do you archive dynamic pages?
There are many different kinds of dynamic pages, some of which are easily
stored in an archive and some of which fall apart completely. When a
dynamic page renders standard html, the archive works beautifully. When
a dynamic page contains forms, JavaScript, or other elements that
require interaction with the originating host, the archive will not
contain the original site's functionality.
Why are some sites harder to archive than others?
If you look at our collection of archived sites, you will find some broken
pages, missing graphics, and some sites that aren't archived at all.
Here are some things that make it difficult to archive a web site:
- Robots.txt -- We respect robot exclusion headers.
- Javascript -- Javascript elements are often hard to archive, but especially if they generate links without having the full name in the page. Plus, if javascript needs to contact the originating server in order to work, it will fail when archived.
- Server side image maps -- Like any functionality on the web, if it needs to contact the originating server in order to work, it will fail when archived.
- Unknown sites -- The archive contains crawls of the Web completed by Alexa Internet. If Alexa doesn't know about your site, it won't be archived. Use the Alexa Toolbar (available at www.alexa.com), and it will know about your page. Or you can visit Alexa's Archive Your Site page at http://pages.alexa.com/help/webmasters/index.html#crawl_site.
- Orphan pages -- If there are no links to your pages, the robot won't find it (the robots don't enter queries in search boxes.)
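
The orphan-page point is easy to see with a toy example. The sketch below is only an illustration, not how the Alexa crawler actually works: it assumes a hypothetical site described as a dictionary of page -> outgoing links, follows links from the home page the way a simple robot would, and reports any pages it never reaches.

```python
from collections import deque


def find_orphans(links, start):
    """Return the pages in `links` that cannot be reached from `start`
    by following links (a breadth-first walk, like a very simple robot)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(links) - seen)


# Hypothetical site map: each key is a page, each value lists the pages it links to.
site = {
    "/": ["/about", "/news"],
    "/about": ["/"],
    "/news": ["/news/2006"],
    "/news/2006": [],
    "/old-promo": ["/"],  # links out, but nothing links to it
}

print(find_orphans(site, "/"))  # ['/old-promo']
```

Here /old-promo exists on the server, but because no other page links to it, a link-following robot starting from the home page would never discover it.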


Comments
Exclude with robots.txt
I looked into the Wayback Machine again and realized, as Mike has pasted above, that you can easily exclude your site with a simple robots.txt file:
#Keep the site from being indexed by the way back machine, web.archive.org
User-agent: ia_archiver
Disallow: /
A search for our site now only brings up a screen saying that the webmaster has excluded it from the archive. Not even the older archived versions are available anymore.
Absolutely worth the effort if you want to hide content in the future.
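
If you want to double-check the rules before uploading the file, a minimal sketch using Python's standard urllib.robotparser module might look like this. The helper name is_blocked and the second user agent "SomeOtherBot" are made up for the illustration, and this only shows how a well-behaved robot would read the rules, not what the archive itself does internally.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt file above.
ROBOTS_TXT = """\
#Keep the site from being indexed by the way back machine, web.archive.org
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt, user_agent, path="/"):
    """Return True if robots_txt forbids user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


print(is_blocked(ROBOTS_TXT, "ia_archiver"))   # True: the archive's robot is excluded
print(is_blocked(ROBOTS_TXT, "SomeOtherBot"))  # False: other robots are unaffected
```

Note that the rule only names ia_archiver, so other robots are still free to crawl the site.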
Robots work
I've just added the robots.txt file to my personal site and another ministry site I look after. Went to check on the wayback website and it was gone from there... Thanks