Thursday, January 15, 2009

Cleansing The Wayback Time Machine at Archive.org

The Wayback Machine at Archive.org is an archiving website that captures your website at points in time. Periodically, but not with noticeable consistency, the Wayback Machine will visit websites, crawl their content, and archive much of it for that point in time. Their robot returns from time to time to capture a new footprint in time. Some websites have snapshots in time dating back a number of years, even as far back as 2002 or earlier.

Check out the two links at the bottom of this article for a detailed list of archiver User-agents and their details!

Why is this a consideration for those looking to Cleanse Internet Footprints? Because the Wayback Machine will be perceived as an unbiased third party that declares what your site said, what images it displayed, and what your site looked like overall on a specific date. If a Bad Company is pursuing you for infringing on their content or images, the Wayback Machine is like an advocate to declare the starting date of your infringement.

This is bad in two ways. The first is that it may not help your case, should you be served with legal papers or end up heading to court. "Your honor, the Wayback Machine declares that the defendant has infringed on our copyrights since way the heck back then! Just look at this web site and tell me they are lying to you."

The second issue related to the Wayback Machine is that it may have content or images that really are copyright protected and need to be removed for multiple reasons. Regardless of whether you are covering your Internet trail or simply applying due diligence in complying with a cease and desist letter, the Wayback Machine must be addressed to Cleanse Your Internet Footprint.

The Wayback Machine is a good robot, and it respects your robots.txt file declarations. First, address your robots.txt file and configure it for this archiver robot. Archive.org does not obfuscate their intent to respect privacy, and they state clearly what you must include in your robots.txt file to keep their robot away from files, directories, or your whole domain and/or subdomain. Their site states:

How can I remove my site's pages from the Wayback Machine?

The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.

Internet Archive uses the exclusion policy intended for use by both academic and non-academic digital repositories and archivists. See our exclusion policy.

Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.

The directions link above is a simple resource to read carefully. Remember, you have a few ways to treat robots.txt file configurations: Triage, Surgical, or Casual. If you are in triage, refer to the robots.txt article and blanket your whole site. The surgical approach revisits the robots.txt article with some more detailed configurations.

By placing or appending the following text in your robots.txt file, you will tell the "Internet Archiver" robot to stay away from the whole site.

"The robots.txt file will do two things:


It will remove all documents from your domain from the Wayback Machine.
It will tell us not to crawl your site in the future."

User-agent: ia_archiver
Disallow: /

Now, if you are strategically barring access and archiving for specific files, images, or directories, you'd make a list of the relative paths and enter them as line items right below the "User-agent" declaration, as follows. (Note that a subdomain serves its own robots.txt file, so to cover a subdomain you must place a separate copy at that subdomain's root.)

User-agent: ia_archiver
Disallow: /private
Disallow: /images/not-my-logo.jpg
Disallow: /video/not-my-movie.mp4
Disallow: /not-my-information.aspx

If you already have a robots.txt file, you do not need to replace existing content. In fact, you may be appending many lines, should you take the surgical approach. A robot reads the file looking for a User-agent: declaration that applies to its own name, observes the line items below it, and stops at the next User-agent: declaration that doesn't apply to it.
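For example, a combined file might look like this sketch (the /private path is a placeholder for illustration, not a path from your site):

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private

Here, the ia_archiver robot finds the block naming it and stays away from the whole site, while every other well-behaved robot reads the wildcard "*" block and skips only the /private directory.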

I'll work on a more comprehensive list of User-agent names and compile them into a single robots.txt file that you can download, and remove whatever doesn't apply. The Wayback Machine has two archiver names, the second one escaping me at the moment.

For a more comprehensive explanation with examples of robots.txt files, visit http://www.robotstxt.org/robotstxt.html

For a comprehensive list of user-agents and their details, visit http://www.robotstxt.org/db.html

How Do Companies Find Images With Copyright Issues?

Well, here's a very interesting subject. In the early years of the Internet, nobody would know you existed unless you told them. In fact, you pretty much had to beat people over the head to get them to use their computer, let alone the "Internet." Today, anything that connects to the Internet, including Intranets and private networks, can become publicly discoverable.

Google is a dominant behemoth of finding stuff, giving rise to Google Hacking. It doesn't take long for Google to rush through your site and collect enough information to make you blush. You might even search Google to find out more about yourself than you currently know. Let's just assume you understand how Google collects information and "crawls" and "spiders" the Internet looking for stuff to index and cache. Google is not the only company to invent such a mechanism.

Bad Robots are engines similar to Google in terms of their desire to find Online content. Google is a Good Robot, because they respect our wishes and try to work nicely with us. Google rocks! Bad Robots disrespect your wishes and couldn't care less about your privacy, safety, or concerns. Placing a robots.txt file will be absolutely pointless against Bad Robots. In fact, Bad Robots will use your robots.txt file to find stuff they may otherwise miss. This is an important caveat regarding robots.txt files: the file itself is public, so you must NOT list anything in it that you want kept out of sight, unless you don't mind it becoming public.
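To make that caveat concrete, here's a hypothetical robots.txt entry (the directory name is invented for illustration):

User-agent: *
Disallow: /unpublished-contracts/

A Good Robot reads that line and stays out. A Bad Robot fetches the very same file, which anyone can read at yourdomain.com/robots.txt, and heads straight for /unpublished-contracts/, a directory it might never have found otherwise.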

If it's sensitive information you seek to protect, you must protect it via authentication and through server features that limit access. We'll discuss this subject elsewhere.

Bad Robots are simply packages of software that are designed to find and sometimes retrieve content. From what I have gathered in my reading, an Israeli company wrote a complex search engine (robot) that specifically handles image comparisons. Bad Companies that manage image distribution and write nasty letters to often nice people use such software to find as many victims as they can.

The first thing they do is load all of their managed images into the Bad Robot and set it to hunt like a bulldog. It spiders sites collecting link structures just as Google does to index the World Wide Web. Where Google respects your privacy requests, the bad image robot steps on your toes and indexes everything. The links are reduced to a unique set of places to revisit for inspection. The crawler then browses all of the links and loads all of the web page images into the comparison routine. It compares each retrieved image against its database of "protected" images for the Bad Company. If a match is found, a screenshot is generated of the "infringing" web page, and a report is made to the "owner" of the image copyrights.

The Bad Robot most likely uses crafted programming that maximizes its effect, minimizes repetition, and uses some reduction theory for subsetting by category or color palette. I'm just pulling this out of my wooly hat, but I'm sure it's a crazy program.

Word is that the bad image distributors split the profits 50-50 with the software developers. This is not a fact that can be supported to date. But, if it is true, you can see how inspired the companies with Bad Robots would be to find "offenders." In fact, I think this would inspire them to step over the line of reasonable discovery and be too inclusive rather than reasonably exclusive. If you possibly fit the bill as "a sucker who will pay the demanded amount from an extortion letter," you'll get one.

Word on the street is that Bad Image Companies will try to use Online resources like "The Wayback Time Machine" and that leads us to another article to explain how to Cleanse Your Internet Footprint from them too. From what I can tell, these Bad Image Companies use automation to prepare their letters, and there are supposedly thousands of them pouring out to the benefit of FedEx. [Not sure if we should be unhappy with FedEx, but who shoots the messenger anymore?] It remains to be determined how much legitimate research these Bad Image Companies do to substantiate and prove their claims.

Consider one scenario: someone bought images from an old stock image company, and a Bad Image Company later bought that old stock image company, then laid claim to its images by sending out Copyright Infringement Demand Extortion Letters. Does the Bad Image Company know their dates are screwed up? Does the automation find the potential copyright date, or perhaps the date they acquired the old stock image company, and state that as the date, for convenience? I don't know, but would love to find out.

If you have information about how Bad Companies are finding images that potentially infringe on their "rights," I'd like to hear about it, so we can share it with the people in need of information. Hopefully the Good Image Companies will see what's going on and use it to their advantage, by staying nice and making us proponents of their businesses.

.htaccess tips and tricks

http://corz.org/serv/tricks/htaccess.php

For the more technically savvy, there are cool ways to prevent unwanted eyes from browsing your content. A .htaccess file can be placed inside any directory on your server and dictates certain aspects of how that directory (or your main site) is handled. This file will let you restrict access using various features, preventing some visitors from viewing the content while allowing others.

Why is this important? If you are a writer, graphic designer, webmaster, or simply want to share some sort of design or content with a remote third party, you can get into trouble for copyright infringements. Most often, rights are granted to use images or text in the context of a "comp" (something that is intended for review during development, but not for production). Some Bad Companies will say this is a copyright infringement regardless of your intent or even your technical rights, because they want to make money from your fear.

.htaccess files are one avenue of preventing unwanted eyes, or at least limiting access to the intended audience. It's not for those who fear technology or writing even little chunks of code-like text.
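As a small taste, here is a minimal .htaccess sketch for an Apache server that password-protects a directory. Treat it as a starting point, not gospel: the AuthUserFile path is a placeholder, and your host's configuration may differ.

# Ask for a valid login before serving anything in this directory
AuthType Basic
AuthName "Client review area"
# Placeholder path; point this at your real .htpasswd file, stored outside the web root
AuthUserFile /home/youraccount/.htpasswd
Require valid-user

You create the matching .htpasswd file with Apache's htpasswd command-line tool, and only visitors who supply a valid user name and password will see the directory's contents. If you'd rather limit access to a known address, Apache's Deny from all / Allow from directives do the trick.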

introduction to .htaccess

This work in constant progress is some collected wisdom, stuff I've learned on the topic of .htaccess hacking, commands I've used successfully in the past, on a variety of server setups, and in most cases still do. You may have to tweak the examples some to get the desired result, though, and a reliable test server is a powerful ally, preferably one with a very similar setup to your "live" server. Okay, to begin..

There's a good reason why you won't see .htaccess files on the web; almost every web server in the world is configured to ignore them, by default. Same goes for most operating systems. Mainly it's the dot "." at the start, you see?

If you don't see, you'll need to disable your operating system's invisible file functions, or use a text editor that allows you to open hidden files, something like bbedit on the Mac platform. On Windows, showing invisibles in Explorer should allow any text editor to open them, and most decent editors to save them too. Linux dudes know how to find them without any help from me.

A Great Canadian Resource For Copyright Infringement Information

http://excesscopyright.blogspot.com/2008/05/watching-getty-images-watching.html

"COPYRIGHT IS GOOD. EXCESS IN COPYRIGHT IS NOT. THERE ARE MANY PARTIES IN THE COPYRIGHT CONSTRUCT. ALL OF THEM MUST AVOID EXCESS IN ORDER FOR COPYRIGHT TO BE VIABLE AND SUSTAINABLE. I PRACTICE IP LAW WITH MACERA & JARZYNA, LLP IN OTTAWA, CANADA. I'VE ALSO BEEN IN GOVERNMENT AND ACADEME. MY VIEWS ARE PURELY PERSONAL AND DON'T NECESSARILY REFLECT THOSE OF MY FIRM OR ANY OF ITS CLIENTS. NOTHING ON THIS BLOG SHOULD BE TAKEN AS LEGAL ADVICE."

Good Companies We Can Trust

In all fairness to the nice companies that do Good Business, I suggest we start a list of the companies that have notably worked in good faith with the world and have great reputations. It seems like most forums are for complaining, yet the good companies need more public promotion beyond Yelp and similar review sites. If you have had consistently great interactions with a company in an industry listed in the Do Not Use These Companies blog page, please let us know! We'd like to give people alternatives to the Bad Companies.

1. Shutterstock: "Shutterstock is the largest subscription-based stock photo agency in the world." Wow, I had great feedback from at least 7 graphic designers, some webmasters, and some professional photographers. The consensus has been great on both sides. The photographers say the review process to be accepted by Shutterstock is more rigorous than that of the image distributors on the Bad Companies list, which probably means Shutterstock is more careful about absorbing catalogs. People buying images from Shutterstock say the rates are extremely low, the quality is easily as good as the Bad Companies', and the rights are significantly fairer (to the tune of 250,000 impressions and 6 months of stockpiling rights). Plus, I haven't heard of a single case of Shutterstock beating anyone up for supposed infringements. Sounds like a superb deal, but read the Shutterstock Photo Licensing Terms and Conditions before you buy anything, and make sure you DOCUMENT YOUR LICENSE AND IMAGE USE.

Do Not Use These Bad Companies

I suggest we start a list of companies that are notorious for beating up the good guys. If you have received a demand letter that goes beyond a polite, initial Cease And Desist request, email me the details at 211kleaner@gmail.com and I'll append the list. Just send the crappy companies here and we'll keep your name and detailed info anonymous.

1. Getty Images: The Getty Images Settlement Demand Letter is a deliberate attempt by Getty Images to intimidate and bully recipients into paying an extravagant "settlement fee" in exchange for Getty Images' agreement NOT to sue the recipient. Recipients of this letter have allegedly infringed on the copyrights allegedly owned by Getty Images. Google Search for this topic, without results from gettyimages.com

2. Masterfile Corporation: The Masterfile Corporation Settlement Demand Letter - Masterfile Corporation has cloned the extortion scheme started by Getty Images, but ramped up the aggressiveness and attitude. Very much like the letter sent by Getty Images, Masterfile Corporation makes outrageous monetary demands for even trivial supposed infractions. The letters are sent Signature Required by Federal Express to your door. They offer an extremely short time to respond with payment and compliance, else they threaten to quietly take you to court. Google Search for this topic, without results from masterfile.com

Original Release Dates May Be Important

Comments I have read on numerous websites keep crossing my mind, comments related to copyright infringement and the legal problems some people are having with it. The issue is the original date of content release by the infringer combined with the stated date of infringement by the "owner" of the copyright.

From what I have read, the owner of the registered copyrights must have completed the official copyright registration PRIOR to the date of the infringement. To me this means that if you put some text or an image on your website and got a letter saying you're in trouble for copyright infringement over that media, you may have an argument to limit the legal impact.

If you posted an image on your site in 1999 and the letter came to you in 2008 stating you used a protected image, you should ask to see the registered copyright documents. Moreover, you should inspect the letter sent to you for the supposed dates of infringement. If the letter states you infringed in 2005, it may be a ploy to work you over. They may state 2005 because that's when they feel they gained copyright-protected ownership of the image in question. If you can establish the original date that the image was released, and it precedes the date stated in the letter, you have preceded the copyright registration date, and may not be so guilty.

This is not to say you shouldn't remove the image. Heck, there's no sense in beating an issue like this to death unless you are forced to or know you are right. There's always an alternative to a fight, including image substitution. Apparently there are many images (amongst other media types) available publicly, as well as through various distributors, "without copyright registration protection," that are later bought by stock houses. Once an image stock house gains control of an image, it may seek copyright protection, or else simply claim copyright protection, and then send you a letter threatening legal action.

I'd be interested to hear legal confirmation from a trained and experienced copyright attorney. Beware the following caveat:

Image stock houses hire crappy people to pretend they are attorneys. These pseudo-attorneys browse places like this blog and try to defame such tips and comments. Beware web posts that make your situation sound horrific and persuade you to give up and give in to ridiculous demands.