Googlebots
From its-wiki.no
Wiki for ITS
Googlebot crawling
We had problems being visible in Google:
- a search for site:cwi.unik.no reveals no results in Google, while all pages are found by Microsoft's search engine
Searching for misconfiguration
- the robots.txt file is visible: http://cwi.unik.no/robots.txt
- installed the User Agent Switcher extension in Firefox; switching the user agent to Googlebot still shows the robots.txt and the site "as normal".
- Thanks to jamesattard.com for information on curl
Use of curl to find the pages
"curl" can fetch all other web pages, but not cwi.unik.no. Why?
- virtual hosts, as defined in /etc/apache2/sites-available. The default contains blocked web pages, e.g. deny from 180.76.0.0/255; 66.249, 62.142, 152.94, 38.101, 83.103, 208.115, 193.37.0.0....
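The effect of such a deny list can be sketched with a small shell helper. The prefixes below are taken from the excerpt above (kept truncated the same way), and is_blocked is a hypothetical helper, not part of Apache:

```shell
# Sketch: check whether a client IP falls under one of the "deny from"
# prefixes listed above. is_blocked is a hypothetical helper for
# illustration only.
blocked_prefixes="180.76. 66.249. 62.142. 152.94. 38.101. 83.103. 208.115. 193.37."

is_blocked() {
  for p in $blocked_prefixes; do
    case "$1" in
      "$p"*) echo "blocked"; return 0 ;;
    esac
  done
  echo "allowed"
  return 0
}

is_blocked 66.249.66.1   # an address in Googlebot's 66.249 range -> blocked
is_blocked 128.39.70.10  # an unrelated address -> allowed
```

Note that 66.249 appears in the deny list, so any crawler coming from Google's 66.249 range would be refused by the virtual host.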
$ curl -A "Googlebot" cwi.unik.no
- does not show anything
$ curl -A "Googlebot" aftenposten.no
- works nicely for aftenposten
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
Moved Permanently
The document has moved <a href="http://www.aftenposten.no/">here</a>.
<address>Apache Server at <a href="mailto:mnowebadmin@medianorge.no">aftenposten.no</a> Port 80</address>
</body></html>
$ curl -A "Googlebot" wiki.unik.no
- works nicely for wiki.unik.no
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
</body></html>
Googlebot is not allowed to crawl the machine
What does that mean?
- IP-specific firewall rules on ports 80 and 443 could block the Googlebot
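Such rules can be looked for by scanning a dump of the firewall configuration (as produced by iptables-save) for DROP/REJECT rules on the web ports. The rules below are a made-up example, not the actual firewall of cwi.unik.no:

```shell
# Sketch: scan a firewall rule dump for IP-specific DROP/REJECT rules on
# ports 80/443. The two rules below are hypothetical examples, not the
# real configuration of cwi.unik.no.
rules='-A INPUT -s 66.249.0.0/16 -p tcp --dport 80 -j DROP
-A INPUT -p tcp --dport 443 -j ACCEPT'

# Count rules that target port 80 or 443 and end in DROP or REJECT.
drops=$(printf '%s\n' "$rules" | grep -cE -- '--dport (80|443) .*-j (DROP|REJECT)')
echo "$drops blocking rule(s) on the web ports"
```

On the real server one would run iptables-save as root and pipe it through the same grep instead of using the sample variable.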
Error Message 403
Googlebot receives error 403, which means: The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated.
Others
- check your page head for an incorrect robots meta tag,
- and your server headers for an incorrect configuration.
- a ".htaccess" file with the content: deny from 66.249
Analysis of robots.txt file
A very good tool is provided at: http://tool.motoricerca.info/robots-checker.phtml
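As a quick local alternative, a copy of the robots.txt can be scanned for Disallow rules in the explicit Googlebot section. The robots.txt content below is a made-up example, not the real cwi.unik.no file:

```shell
# Sketch: scan a robots.txt copy for Disallow lines that apply to Googlebot.
# Only checks the explicit "User-agent: Googlebot" section; a full parser
# would also fall back to the "*" section. The content is hypothetical.
robots='User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:'

out=$(printf '%s\n' "$robots" | awk '
  tolower($1) == "user-agent:" { ua = tolower($2) }
  tolower($1) == "disallow:" && ua == "googlebot" && $2 != "" {
    print "Googlebot blocked from: " $2
  }')
echo "$out"
```

For the real site, the same awk script can be fed with: curl -s http://cwi.unik.no/robots.txt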