Wednesday, January 14, 2009

Handling robots.txt Files To Block Search Engines And Web Archives

Assuming you have your hitlist in hand and are ready to start cleansing your Internet footprint, we should start with the robots.txt file. "What the heck is that?" you may ask. Wikipedia defines the robots.txt file as follows...

The robot exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is otherwise publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. The standard complements Sitemaps, a robot inclusion standard for websites.

In essence, your robots.txt file says, "Stay the heck away from this stuff, and/or you may look at this other stuff." Most engines honor this method of keeping information out of their indexes. But do not expect them all to adhere to the same moral ground as Google and Yahoo. If you have sensitive media, intranet content that must remain private, or publicly accessible media that could inadvertently infringe on someone else's rights, DO NOT RELY on the robots.txt file. This is just a tool, not a weapon.

For detailed information about robots.txt files, their use, and their formatting, you should visit The Web Robots Page and pay attention. For brevity, I'll explain the essentials and get you on your way. You can spark up the laptop later to explore the finer elements of robots.txt files. Our goal here is to declare to search engines and web archives that you have content they must stay away from, not index, and certainly NOT CACHE.

If you are in triage mode and want to blanket your site with a rejection notice, you will build your robots.txt file differently than someone in less of a crisis who wants a surgical approach. Yes, you can specify individual pages, images, and all media types. Let's get triage underway.

To ask everyone to stay the heck away from absolutely everything, you declare a simple statement in your robots.txt file. In your HTML editor or plain text editor, create a file named "robots.txt". It should start with absolutely nothing in it. Yes, a totally blank file. It should NOT be in RTF format or some other format like a Microsoft Word document. Just a plain .txt file. Enter the following content exactly as shown below...

User-agent: *
Disallow: /

The text above was bolded just for blog display. If your file is plain text, you won't be able to make it bold, italicized, or apply any other style to it. You're done creating your robots.txt file. Upload it to the "root level" of your web server and your robots.txt file is ready to do some work for you. The "root level" just means right next to your homepage, not in a subdirectory within your site. A robots.txt file placed at a lower level will not work and will be ignored. You should be able to browse your robots.txt file by going to http://www.mydomainname.com/robots.txt and seeing the User-agent text above. If you see your file content, this stage is done.
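If you'd like a second opinion before you upload, here is a minimal sketch in Python that feeds the blanket rules above to the standard library's urllib.robotparser and asks whether a crawler would be turned away. The helper name is_blocked and the "Googlebot" agent string are my own illustrative assumptions, not part of the robots.txt convention, and this only shows how a cooperating robot would read the file. It says nothing about bad robots.

```python
from urllib.robotparser import RobotFileParser

# The blanket "stay away from everything" rules from this post.
BLANKET_RULES = """\
User-agent: *
Disallow: /
"""


def is_blocked(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if a cooperating robot named `agent` may NOT fetch `path`.

    `is_blocked` and the default agent name are hypothetical, for illustration.
    """
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines,
    # so no network access is needed for this check.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/", "/index.html", "/images/photo.jpg"):
        print(path, "blocked" if is_blocked(BLANKET_RULES, path) else "allowed")
```

With the blanket rules, every path you try should come back as blocked. If something comes back as allowed, recheck the file for stray formatting.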

If you are not in triage mode, you can specify whole directories and their entire contents, as well as specific files by name. The Web Robots Pages website has great information beyond the scope of this blog, and web surgeons should consult it. The example shown below illustrates 1) a directory called "cgi-bin" that should not be indexed, 2) an HTML file inside a "private" directory that should not be indexed, and 3) an image inside the "images" directory that should not be indexed. The first line, "User-agent: *", means that the rules apply to ALL robots, including Google, Yahoo, and others, no exceptions.

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/my-passwords.html
Disallow: /images/inappropriate-image.jpg
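To see which pages the surgical rules above actually cover, here is a small Python sketch using the standard library's urllib.robotparser. The helper name blocked_paths, the "Googlebot" agent string, and the extra sample paths (form.pl, other.html, logo.jpg, index.html) are made up for illustration. Only the three Disallow lines come from the example above.

```python
from urllib.robotparser import RobotFileParser

# The surgical rules from this post.
SURGICAL_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/my-passwords.html
Disallow: /images/inappropriate-image.jpg
"""


def blocked_paths(robots_txt: str, paths: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the subset of `paths` that a cooperating robot may NOT fetch.

    `blocked_paths` and the default agent name are hypothetical, for illustration.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


if __name__ == "__main__":
    # Hypothetical sample paths: a mix of ones the rules cover and ones they don't.
    sample = [
        "/cgi-bin/form.pl",
        "/private/my-passwords.html",
        "/private/other.html",
        "/images/inappropriate-image.jpg",
        "/images/logo.jpg",
        "/index.html",
    ]
    for path in blocked_paths(SURGICAL_RULES, sample):
        print("blocked:", path)
```

Notice that a directory rule like /cgi-bin/ covers everything inside that directory, while a file rule covers only that one file. The other page in "private" and the other image in "images" stay open to robots.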

If you are not stressed or upset right now, don't restrict your guidance to this blog for robots.txt files! Visit The Web Robots Page and take your time. It could save your job, your business, and maybe your sanity. Bear in mind that once posted, the robots.txt file will start to exclude your site from search engine indexing, web archiving, and image collection. Previously indexed and cached content should also start to disappear as if you were never there. BAD ROBOTS will ignore the robots.txt file and continue indexing and caching regardless, and we think they suck! If it's sensitive information, get it off your public server right away and do some reading about what is public and what's private. If you are desperate, email me at 211kleaner@gmail.com and I'll help, time permitting.

Caveat: If Google or others revisit your site infrequently, your content will not disappear until their next visit. Some sites are indexed only a few times a year, others every day. The robots.txt file is a great tool, but we must forge ahead aggressively and approach the engines and archives directly. Moving right along...
