Wednesday, January 14, 2009

Preparing Your Plan To Cleanse Your Internet Footprints

There is a list of work ahead, but before we address the world, we need to address our local content. Hopefully you have identified what you want to prevent the world from seeing, including both expected and unexpected URLs. You could scour your web server's file structure, look through your own code, or enlist a programmer or web developer to get involved. Holy crap, that sounds exhausting! Well, there's a faster method to this madness.

Before you start deleting external content, you should use it to your advantage. Go to Google and ask what content it has indexed and cached. Google loves to share information, and we're going to make use of that. Affectionately referred to as Google hacking, this simply means using the commands the Google engine responds to in order to return precise, targeted results.

Let's say your domain is http://www.extortionletterinfo.com and you want to know everything Google knows about your site. A novice Google user might use the keywords "extortion letter info" and be happy with the results. My search just now returned "about 996,000" results, shown ten web pages at a time. Yikes, what the heck are we going to do? No need for caffeinated, doughnut-filled, sleepless nights. We turn to the art of Google hacking, coined and defined by Johnny Long, whose website http://johnny.ihackstuff.com has been a fabulous online resource for ethical hackers and hardcore Googlers over the years. He appears to be using Twitter for his site at http://twitter.com/ihackstuff, but that's not the point, just a resource. If you really want to learn more about Google hacking, pick up his book "Google Hacking for Penetration Testers" from your local book store.

Onward... One trick from Google hacking is the query that returns all indexed content for one specific website. We are asking Google, "Please show me everything for the website www.extortionletterinfo.com and no other domains or subdomains." A good start, but to be sure your domain is covered for all subdomains, drop the "www." part to get broader results.

If you have a specific subdomain of concern, and not all subdomains, include the subdomain to refine your search, such as "secure.extortionletterinfo.com". The text you submit to Google is exactly as shown below, no spaces, and note that's a colon between "site" and the address...

site:www.extortionletterinfo.com


Submit this phrase and Google will check its archives for all listings known under this domain and subdomain. My search just now returned "about 153" results. To make life easier, we can ask Google to show us 100 results per page, so we only have 2 pages to handle in this case. The URL in the address field of my browser is:

http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=hsD&q=site%3Awww.extortionletterinfo.com&btnG=Search

...so insert "num=100&" as shown below, and hit RETURN on your keyboard...

http://www.google.com/search?num=100&hl=en&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=hsD&q=site%3Awww.extortionletterinfo.com&btnG=Search
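If you'd rather not hand-edit that long URL, you can splice the "num" parameter in with a few lines of Python's standard library. This is just a sketch: the "num" and "q" parameter names mirror Google's query string as it appears above, and Google may change them at any time.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def add_num_param(url, per_page=100):
    """Return the same search URL with num=<per_page> added to the query string."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))
    params["num"] = str(per_page)  # Google's results-per-page parameter
    return urlunsplit(parts._replace(query=urlencode(params)))

url = ("http://www.google.com/search?hl=en"
       "&q=site%3Awww.extortionletterinfo.com&btnG=Search")
print(add_num_param(url))
```

Paste the printed URL into your browser's address field and you get the same 100-per-page view described above.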

Your results page will now have 100 results to review. Before you move on, or assume this will still be around tomorrow, print the web page to PDF, and maybe to paper if you like. You need to keep records as you move forward, so you can check off the files you have handled and see clearly what remains.
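One way to keep those records in check-off form is a simple spreadsheet. Here's a hedged sketch using Python's csv module; the URLs and the "cleanup_checklist.csv" filename are hypothetical placeholders for whatever your own results pages show.

```python
import csv

# Hypothetical URLs copied from your printed/PDF'd results pages.
found_urls = [
    "http://www.extortionletterinfo.com/forum/index.php",
    "http://www.extortionletterinfo.com/letters.html",
]

# Write a checklist so you can tick off each URL as it is handled.
with open("cleanup_checklist.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["url", "cached", "handled"])
    for url in found_urls:
        writer.writerow([url, "unknown", "no"])
```

Open the file in any spreadsheet program and update the "cached" and "handled" columns as you work through the list.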

Note that not every listing in search engine results has been cached. This is a list of "indexed" pages, not necessarily pages where someone else can view your historical content. That doesn't mean you can leave them on your server; it just means you might have less work to do at the search engine. If a listing says "Cached," you know it's higher on your priority list.

At this point you have a comprehensive list of what Google knows. Nice, but not enough. You must repeat this process with the top 4 search engines: Google, Yahoo, MSN Live Search, and Ask. The remaining hundreds of search engines are almost all parasitic on the results of these primary four, or else don't have enough impact on the Internet to be worried about. It'll be a little work to correlate which sites have which pages, but that's not the main goal. If you have been served a letter or legal paperwork, it probably states clearly what caused the problem. Otherwise, you are probably aware of what you need to remove of a sensitive or personal nature.
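Since the same site: query gets submitted to each engine, you can generate all four search URLs in one pass. A minimal sketch follows; the endpoint addresses are my assumptions based on what these engines used around this time, so verify them in your own browser before relying on the links.

```python
from urllib.parse import quote_plus

# Assumed search endpoints circa 2009; these may have changed since.
engines = {
    "Google": "http://www.google.com/search?num=100&q=",
    "Yahoo":  "http://search.yahoo.com/search?p=",
    "Live":   "http://search.live.com/results.aspx?q=",
    "Ask":    "http://www.ask.com/web?q=",
}

def site_queries(domain):
    """Build one site: query URL per engine for the given domain."""
    q = quote_plus("site:" + domain)  # URL-encode the colon
    return {name: base + q for name, base in engines.items()}

for name, url in site_queries("extortionletterinfo.com").items():
    print(name, url)
```

Visit each printed URL, print the results to PDF, and add the new pages to your running checklist.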

Holy crap, you thought your list was complete, but hey... we have to be sure your images list is ready too! Browse to http://images.google.com and try the site:www.extortionletterinfo.com search. This site has no indexed images. So, for an example, let's try another domain that does...

site:www.photobucket.com

Only "about 13,000" images here, but it tells you what Google has indexed for this site. Hopefully you know what you're looking for and this is a fast tool to see any images in need of removal. It may also aid those searching for images in websites suspected of infringement or improper use. Make notes by printing to PDF and/or printing to paper. This expands your hitlist well.

With these lists in hand, you are now prepared to forge ahead with cleansing your Internet footprint, and you can keep your sanity during the process. It would really suck to start killing content and lose track of who, what, where, and when.
