<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Free Xenon Consulting &#187; Crawlers</title>
	<atom:link href="http://www.freexenon.com/tag/crawlers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.freexenon.com</link>
	<description>PSD or Image to Site with Accessibility Built In</description>
	<lastBuildDate>Mon, 17 May 2010 15:59:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Robots, Spiders, and other Crawlers&#8230; Oh My!!</title>
		<link>http://www.freexenon.com/2005/10/17/robots-spiders-and-other-crawlers-oh-my/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=robots-spiders-and-other-crawlers-oh-my</link>
		<comments>http://www.freexenon.com/2005/10/17/robots-spiders-and-other-crawlers-oh-my/#comments</comments>
		<pubDate>Mon, 17 Oct 2005 20:07:00 +0000</pubDate>
		<dc:creator>FreeXenon</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[Development]]></category>
		<category><![CDATA[Crawlers]]></category>
		<category><![CDATA[Robots]]></category>
		<category><![CDATA[Robots.txt]]></category>
		<category><![CDATA[Spiders]]></category>

		<guid isPermaLink="false">http://www.fx.serenitystudios.com/?p=4</guid>
		<description><![CDATA[Robots are automated programs spawned by search engines to find and index content. Each search engine has its own robot, sometimes referred to as a spider, that (usually) sends a customized User Agent string to identify itself to web servers. Here are example user agent strings found in our server logs:
Googlebot/2.1+(+http://www.google.com/bot.html)  // Google's [...]]]></description>
			<content:encoded><![CDATA[<p>Robots are automated programs spawned by search engines to find and index content. Each search engine has its own robot, sometimes referred to as a spider, that (usually) sends a customized User Agent string to identify itself to web servers. Here are example user agent strings found in our server logs:</p>
<pre><code>Googlebot/2.1+(+http://www.google.com/bot.html)  // Google's Spider
Mozilla/5.0+(compatible;+Yahoo!+Slurp;+http://help.yahoo.com/help/us/ysearch/slurp)  // Yahoo's Spider</code></pre>
<p>These two are nice spiders, as they identify themselves appropriately. User agents do not have to identify themselves correctly, so spotting them (if you are looking for them) can sometimes be difficult. Most play nice &#8211; yea!</p>
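<p>If you want to pick the well-behaved spiders out of your own logs, a minimal sketch in Python might look like the following. The marker table and the function name <code>identify_crawler</code> are assumptions made for illustration, and the check only catches crawlers that choose to identify themselves.</p>

```python
# Hypothetical lookup table: a lowercase substring that a self-identifying
# crawler puts in its User-Agent string, mapped to the search engine's name.
KNOWN_BOT_MARKERS = {
    "googlebot": "Google",
    "yahoo! slurp": "Yahoo",
}


def identify_crawler(user_agent):
    """Return the engine name for a self-identifying crawler, else None."""
    # The sample log lines show '+' where a space would normally appear,
    # so convert those back before comparing.
    normalized = user_agent.replace("+", " ").lower()
    for marker, engine in KNOWN_BOT_MARKERS.items():
        if marker in normalized:
            return engine
    return None
```

<p>A spider that lies about its User Agent string will simply come back as <code>None</code> here, which is why this kind of check is a convenience and not a security measure.</p>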
<h3>I, Robot(.txt)</h3>
<p>The robots.txt that I have created is as follows:</p>
<pre><code>User-agent: *
Disallow: /_css/
Disallow: /_images/
Disallow: /_scripts/
Disallow: /_test/
Disallow: /data/stuff/*.htm
Disallow: /mm2css/
Disallow: /mm2img/
Disallow: /mm2script/
</code></pre>
<p>There is not much to it; creating one is a really simple process. The first part of a robots.txt is the User-agent line, which specifies the user agents that the following rules apply to. We specified *, which means all user agents. We can write rules for a specific user agent such as Googlebot with <code>User-agent: Googlebot</code>, and the rules following it would be picked up by Googlebot. Nothing forces a bot or spider to follow the rules specified in the robots.txt file. They are followed by choice.</p>
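<p>To see how a polite crawler might apply these groups, here is a small sketch using Python's standard <code>urllib.robotparser</code> module. It uses only a subset of the directory rules from my file, and the Googlebot-only group is hypothetical, added just to show a specific User-agent line being picked over the * group. The helper name <code>allowed</code> is also an assumption.</p>

```python
from urllib.robotparser import RobotFileParser

# A subset of the directory rules from the article's robots.txt, plus a
# hypothetical Googlebot-only group added purely for illustration.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /_test/

User-agent: *
Disallow: /_css/
Disallow: /_images/
Disallow: /_scripts/
Disallow: /_test/
"""

parser = RobotFileParser()
# parse() accepts the file's lines, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, path):
    """Ask the parsed rules whether user_agent may fetch path."""
    return parser.can_fetch(user_agent, path)
```

<p>A well-behaved crawler would make a check like <code>allowed("Googlebot", "/_test/page.html")</code> before each request. A crawler that finds a group naming it uses that group and ignores the * group, and a crawler that does not care never runs the check at all.</p>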
<p>The next statements are exclusion statements. We tell the user agents which directories to exclude from indexing. Some bots also support file-level exclusions. I use both here. I have disallowed the image, script, and <abbr title="Cascading Style Sheets">CSS</abbr> directories, as robots cannot index them, so I will just save them the time. I have also excluded the _test directory, as it does not have indexable data. The last thing I excluded was *.htm in the stuff directory. We do not want people landing directly on the pages inside stuff. The intent is that index.html is still indexed for the stuff directory because it has the .html extension, while the &#8220;subordinate&#8221; pages are not, as they have the .htm extension. Be aware that crawlers that support wildcards, Googlebot among them, generally treat a pattern as a prefix match, so <code>*.htm</code> can also match <code>.html</code>; ending the pattern with <code>$</code>, as in <code>/data/stuff/*.htm$</code>, limits it to the .htm pages. Crawlers look for the robots.txt file at the site&#8217;s root.</p>
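<p>For the file-level rule, the sketch below shows one way a wildcard-aware crawler might match a Disallow pattern against a path. It assumes the common extension where <code>*</code> matches any run of characters, a trailing <code>$</code> anchors the end of the path, and everything else is a prefix match. The function name <code>rule_matches</code> is hypothetical, and real crawlers may differ in the details.</p>

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern applies to path.

    Assumed semantics: '*' matches any characters, a trailing '$' anchors
    the end of the path, and without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the pattern, then turn each escaped '*' back into "anything".
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start only, which gives the prefix behaviour.
    return re.match(regex, path) is not None
```

<p>Under these assumed rules, <code>/data/stuff/*.htm</code> applies to <code>/data/stuff/index.html</code> as well as to the .htm pages, while <code>/data/stuff/*.htm$</code> applies only to paths that end in .htm.</p>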
<h4 id="WaysToRestrict">Other Means to Restrict Access</h4>
<p>No matter what you put in your robots.txt file, a bot can always access your resources, as the robots.txt file is not enforced on user agents. If you want to ensure user agents do not have access to your resources, secure them with permissions.</p>
<p>To reduce the chance that a bot will try to index your site, you can try to ensure that there are no links to it. If there are no links, a bot has very little reason to index it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.freexenon.com/2005/10/17/robots-spiders-and-other-crawlers-oh-my/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

