Where to put robots.txt for crawler?

Support for webhosts that use CentOS
MajorNewbie
Posts: 127
Joined: 2008/04/17 00:04:43

Where to put robots.txt for crawler?

Post by MajorNewbie » 2008/11/18 17:49:55

I use Redhat5 and Firefox5 and I want the Google web crawler to read my robots.txt.

Which directory should I put it in?

In DocumentRoot that I have specified in httpd.conf?

Is it advisable to set robots.txt to apply to all bots, e.g. User-agent: * ?

TIA.

NedSlider
Forum Moderator
Posts: 2896
Joined: 2005/10/28 13:11:50
Location: UK

Where to put robots.txt for crawler?

Post by NedSlider » 2008/11/18 21:17:33

Even the most basic Google search would have instantly found you this:

http://www.robotstxt.org/

If you want people to answer your questions then you need to demonstrate you've at least made a minimum of effort to try to find an answer yourself. Google will find you answers to questions like these way quicker than posting on a forum ever will so you're wasting your own time as well as mine in responding. I don't mean to sound harsh, but really.

AlanBartlett
Forum Moderator
Posts: 9324
Joined: 2007/10/22 11:30:09
Location: ~/Earth/UK/England/Suffolk

Re: Where to put robots.txt for crawler?

Post by AlanBartlett » 2008/11/18 22:15:21

[quote]I use Redhat5 and Firefox5 and I want the Google web crawler to read my robots.txt.[/quote]
To be honest, as I read the first sentence, above, I performed a mental [b]sed 's/Redhat5/RHEL 5 == CentOS/'[/b] command. However, after reading [i]"Firefox5"[/i] I gave up -- not having access to the [b]Major[/b]'s time machine.

MajorNewbie
Posts: 127
Joined: 2008/04/17 00:04:43

Re: Where to put robots.txt for crawler?

Post by MajorNewbie » 2008/11/20 00:26:25

Thx for the link

MajorNewbie
Posts: 127
Joined: 2008/04/17 00:04:43

Re: Where to put robots.txt for crawler?

Post by MajorNewbie » 2008/11/21 22:46:41

[quote]
NedSlider wrote:
Even the most basic Google search would have instantly found you this:

http://www.robotstxt.org/
If you want people to answer your questions then you need to demonstrate you've at least made a minimum of effort to try to find an answer yourself. Google will find you answers to questions like these way quicker than posting on a forum ever will so you're wasting your own time as well as mine in responding. I don't mean to sound harsh, but really.[/quote]

I went through the documentation, and it has this to say:

[i][b][color=000099]Why do I find entries for /robots.txt in my log files?[/color][/b][/i]
[i][color=000099]
They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion, see also below.
If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.
Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)
[/color][/i]

Surely they don't mean using root to create robots.txt in the directory / ?
If so, is it safe to set the owner to nobody:nobody and chmod it to 774?
How do I test whether it is working properly?

I read somewhere (not in the link) that I have to put robots.txt in the document root specified in Apache, where index.html normally resides. In my case that is [b][color=990033]"var/www/html/products"[/color][/b], where I have my ColdFusion documents and where index.cfm resides.

index.cfm in this directory is set to 744 and owned by nobody:nobody.

I have some reservations about a crawler traversing those directories.

NedSlider
Forum Moderator
Posts: 2896
Joined: 2005/10/28 13:11:50
Location: UK

Re: Where to put robots.txt for crawler?

Post by NedSlider » 2008/11/21 23:12:01

That's correct, robots.txt lives in your [b]document root[/b] directory, not the system's root (/) directory.

Set its permissions to match those of your index.html.

If you don't want robots to traverse particular directories, then disallow them in robots.txt. Note that this isn't foolproof: only robots that respect robots.txt will adhere to your wishes. Compliance is voluntary, not mandatory.
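
On the earlier question of how to test it: a minimal sketch, assuming Python is available, that uses the standard-library urllib.robotparser to check what a compliant robot would be allowed to fetch. The /private/ path, the ExampleBot name, and www.example.com are made up for illustration. It only shows how a well-behaved robot reads the rules; nothing forces a badly behaved one to read them at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
# Paths here are URL paths, i.e. relative to the web server's document root.
RULES = """\
User-agent: *
Disallow: /private/
"""


def allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if a robot that honours robots.txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/private/notes.html",
    ):
        print(url, allowed(RULES, "ExampleBot", url))
```

To check a live site instead of a string, RobotFileParser also has set_url() and read(), which fetch the file over HTTP before can_fetch() is called.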

MajorNewbie
Posts: 127
Joined: 2008/04/17 00:04:43

Re: Where to put robots.txt for crawler?

Post by MajorNewbie » 2008/11/24 15:03:21

Thx for that Ned.

How would "bad" robots come into my system if they are not explicitly allowed in robots.txt? Wouldn't that mean they have circumvented my firewall and normal file and directory protections?

MajorNewbie
Posts: 127
Joined: 2008/04/17 00:04:43

Re: Where to put robots.txt for crawler?

Post by MajorNewbie » 2008/11/24 16:44:50

I have enabled Googlebot like this:

[color=003300][b]
User-agent: Googlebot
Disallow: /bin/
Disallow: /boot/
Disallow: /dev/
... etc...
[/b][/color]


Do I need to add another line to explicitly disallow all other bots, like this:

[color=003300][b]
User-agent: *
Disallow: /
[/b][/color]

Thx

NedSlider
Forum Moderator
Posts: 2896
Joined: 2005/10/28 13:11:50
Location: UK

Re: Where to put robots.txt for crawler?

Post by NedSlider » 2008/11/24 22:26:54

[quote]
MajorNewbie wrote:
Thx for that Ned.

How would "bad" robots come into my system if they are not explicitly allowed in robots.txt? Wouldn't that mean they have circumvented my firewall and normal file and directory protections?[/quote]

Exactly the same way a user enters your system. By definition, a web server serves content, so it has to be publicly accessible. A robot cannot access anything that a regular user couldn't also reach by following links and rummaging around your site. Nor can it access anything that's secured by file permissions (or SELinux): if the web server can't serve it, then it's not accessible to a robot either. So you only need to be concerned about content in [b]document root[/b], not your whole file system.

I think you're misconstruing what a web robot is and what it can and can't access.
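
To make that concrete, here is a rough sketch using Python's standard-library urllib.robotparser, fed the two groups posted above (with only the /bin/ line kept). The host www.example.com, the /products/index.cfm URL, and the SomeOtherBot name are placeholders, not anything from your setup. It illustrates two things as this parser interprets them: Disallow paths are matched against URL paths, so Disallow: /bin/ refers to http://www.example.com/bin/ under the document root and has nothing to do with the system's /bin directory; and a compliant robot uses the group that names it, falling back to the * group otherwise.

```python
from urllib.robotparser import RobotFileParser

# The two groups from the posts above, trimmed to a single Disallow line each.
RULES = """\
User-agent: Googlebot
Disallow: /bin/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def may_fetch(agent: str, path: str) -> bool:
    """Ask whether a compliant robot named agent may fetch the given URL path."""
    # www.example.com is a placeholder host; only the path part is matched.
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/products/index.cfm", "/bin/anything"):
            print(agent, path, may_fetch(agent, path))
```

In this sketch Googlebot is allowed everything except URLs starting with /bin/, while any other compliant robot is shut out of the whole site by the * group.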

WhatsHisName
Posts: 1547
Joined: 2005/12/19 20:21:43
Location: /earth/usa/nj

Re: Where to put robots.txt for crawler?

Post by WhatsHisName » 2008/11/24 23:30:11

Paraphrasing Ned...

From [url=http://en.wikipedia.org/wiki/Robots.txt]wikipedia.org - Robots Exclusion Standard[/url]:

[quote][b]The protocol ... relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt [i]does not guarantee privacy[/i].[/b]

[b]Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, [i]but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.[/i][/b][/quote]
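
As a small illustration of why the file is so easy for anyone to find: robots.txt always sits at the top level of the site, so its address can be derived from any page URL. A minimal sketch in Python, where robots_url is a made-up helper name and www.example.com is a placeholder host:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("http://www.example.com/products/index.cfm"))
```

Anything listed under Disallow is therefore visible to anyone who requests that one URL, which is why the file is no substitute for real access controls.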
