Thursday, January 15, 2009

Cleansing The Wayback Time Machine at Archive.org

The Wayback Machine (sometimes written "Waybackmachine") is an archiving website that captures your website at points in time. Periodically, though not on any predictable schedule, the Wayback Machine visits websites, crawls their content, and archives much of it as it stood on that date. Its robot returns from time to time to capture a new snapshot. Some websites have snapshots dating back a number of years, even as far back as 2002 or earlier.

Check out the two links at the bottom of this article for robots.txt examples and a list of archiver User-agents with their details!

Why is this a consideration for those looking to Cleanse Internet Footprints? Because the Wayback Machine will be perceived as an unbiased third party that declares what your site said, what images it displayed, and what your site looked like overall on a specific date. If a Bad Company is pursuing you for infringing on their content or images, the Wayback Machine acts like a witness that establishes the starting date of your infringement.

This is bad in two ways. The first is that it may not help your case, should you be served with legal papers or end up heading to court. "Your honor, the Wayback Machine declares that the defendant has infringed on our copyrights since way the heck back then! Just look at this web site and tell me they are lying to you."

The second issue related to the Wayback Machine is that it may hold content or images that really are copyright protected and need to be removed for multiple reasons. Whether you are covering your Internet trail or simply applying due diligence in complying with a cease and desist letter, the Wayback Machine must be addressed to Cleanse Your Internet Footprint.

The Wayback Machine is a good robot, and respects your robots.txt file declarations. First, address your robots.txt file and configure it for this archiver robot. Archive.org does not obfuscate their intent to respect privacy, and they state clearly what you must include in your robots.txt file to keep their robot away from files, directories, or your whole domain and/or subdomain. Their site states:

How can I remove my site's pages from the Wayback Machine?

The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.

Internet Archive uses the exclusion policy intended for use by both academic and non-academic digital repositories and archivists. See our exclusion policy.

Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.

The directions link above is a simple resource to read carefully. Remember, you have a few ways to approach robots.txt file configurations: Triage, Surgical, or Casual. Assuming you are in triage, you should refer to the robots.txt article and blanket your whole site. The surgical approach revisits the robots.txt article with some more detailed configurations.

By placing or appending the text below in your robots.txt file, you will tell the "Internet Archiver" robot to stay away from the whole site. In Archive.org's words:

"The robots.txt file will do two things:


It will remove all documents from your domain from the Wayback Machine.
It will tell us not to crawl your site in the future."

User-agent: ia_archiver
Disallow: /

Now, if you are strategically barring access and archiving for specific files, images, directories, or subdomains, you'd make a list of the relative paths and enter each one as its own Disallow line right below the "User-agent" declaration, as follows:

User-agent: ia_archiver
Disallow: /private
Disallow: /images/not-my-logo.jpg
Disallow: /video/not-my-movie.mp4
Disallow: /not-my-information.aspx
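Before uploading a surgical file like this, it can help to check which paths it actually blocks. Here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the test paths are made up for illustration, and Python's parser is only one reading of the rules, so the Archive.org robot may interpret an edge case differently.

```python
from urllib.robotparser import RobotFileParser

# The surgical example from this article, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private
Disallow: /images/not-my-logo.jpg
Disallow: /video/not-my-movie.mp4
Disallow: /not-my-information.aspx
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt text.

    is_allowed is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules above.
    for path in ["/private/notes.html", "/images/not-my-logo.jpg", "/index.html"]:
        print(path, is_allowed(ROBOTS_TXT, "ia_archiver", path))
```

Keep in mind that each Disallow line is matched as a path prefix, so "Disallow: /private" covers everything whose path starts with /private, not just that one directory.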

If you already have a robots.txt file, you do not need to replace existing content. In fact, you may be appending many lines, should you take the surgical approach. Each robot reads the file looking for a User-agent: declaration that matches its own name, and observes the line items below it until it reaches a User-agent: declaration that doesn't apply.
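To see that appending leaves your existing rules alone, here is a small sketch, again leaning on Python's standard urllib.robotparser. The existing file content, the helper names append_group and can_fetch, and the Googlebot check are all assumptions invented for the example. It appends an ia_archiver section after a blank line and then asks what each robot may fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical existing robots.txt content, invented for this example.
EXISTING = """\
User-agent: *
Disallow: /cgi-bin/
"""

# The section to append for the Archive.org robot (whole-site block).
IA_GROUP = """\
User-agent: ia_archiver
Disallow: /
"""


def append_group(existing, group):
    """Return the existing text with the new section added after a blank line."""
    return existing.rstrip("\n") + "\n\n" + group


def can_fetch(robots_txt, user_agent, path):
    """Ask Python's parser whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


combined = append_group(EXISTING, IA_GROUP)

if __name__ == "__main__":
    print(combined)
    for agent in ["ia_archiver", "Googlebot"]:
        print(agent, "/index.html", can_fetch(combined, agent, "/index.html"))
```

In this sketch the archiver is shut out of everything, while other robots still follow only the original section. That is how Python's parser reads the file; any given crawler may handle unusual files differently.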

I'll work on a more comprehensive list of User-agent names and compile them into a single robots.txt file that you can download, and remove whatever doesn't apply. The Wayback Machine has two archiver names, the second one escaping me at the moment.

For a more comprehensive explanation with examples of robots.txt files, visit http://www.robotstxt.org/robotstxt.html

For a comprehensive list of user-agents and their details, visit http://www.robotstxt.org/db.html
