Googlebots

From its-wiki.no

Googlebots crawling

We had problems being visible in Google:

  • a search for site:cwi.unik.no returns no results in Google, while all pages show up in Microsoft's search

Searching for misconfiguration

  • the robots.txt file is visible: http://cwi.unik.no/robots.txt
  • installed the User Agent Switcher extension in Firefox; switching the user agent to Googlebot still shows robots.txt and the site "as normal".
  • Thanks to jamesattard.com for information on curl

Using curl to fetch the pages

"curl" can see all other web pages, but not cwi.unik.no. why?

  • virtual hosts, as defined in /etc/apache2/sites-available. The default host blocks a number of address ranges, e.g. deny from 180.76.0.0/255; 66.249, 62.142, 152.94, 38.101, 83.103, 208.115, 193.37.0.0....
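The deny lists above can be pulled out of the configuration mechanically instead of being read by hand. A minimal sketch (the helper name `find_deny_rules` is mine; Apache 2.2-style `Deny from` / `Allow from` syntax is assumed):

```shell
# Scan a directory of Apache vhost configs for access-control
# directives, so a rule covering Google's crawler ranges
# (e.g. 66.249.*) stands out.
find_deny_rules() {
    grep -RinE '^[[:space:]]*(deny|allow)[[:space:]]+from' "$1"
}
```

On this machine it would be run as `find_deny_rules /etc/apache2/sites-available`.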


$ curl -A "Googlebot" cwi.unik.no

returns nothing (the response body is empty)
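An empty body alone does not distinguish "blocked" from "empty page"; printing the HTTP status code does. A sketch using curl's `-w '%{http_code}'` write-out (the `fetch_status` helper name is mine):

```shell
# Fetch a URL with a Googlebot user agent, discard the body, and
# print only the HTTP status code the server answered with.
fetch_status() {
    curl -s -o /dev/null -w '%{http_code}' -A "Googlebot" "$1"
}
```

`fetch_status http://cwi.unik.no/` printing 403 would confirm an access-control block, while 200 would point elsewhere.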

$ curl -A "Googlebot" aftenposten.no

works nicely for aftenposten
  <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
  <html><head>
  <title>301 Moved Permanently</title>
  </head><body>
  <h1>Moved Permanently</h1>
  <p>The document has moved <a href="http://www.aftenposten.no/">here</a>.</p>
  <address>Apache Server at <a href="mailto:mnowebadmin@medianorge.no">aftenposten.no</a> Port 80</address>
  </body></html>
 $

$ curl -A "Googlebot" wiki.unik.no

works nicely for wiki.unik.no
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml">
 </body></html>

Googlebot is not allowed to crawl the machine

What does that mean?

  • IP-specific firewall rules on ports 80 and 443 could block Googlebot
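One way to look for such rules is to dump the firewall once and grep the dump. A sketch (the helper name and the dump file are mine; `iptables-save` output format is assumed):

```shell
# Search an iptables-save dump for rules touching Googlebot's
# 66.249.0.0/16 range or the web ports.  Create the dump with:
#   sudo iptables-save > rules.txt
find_blocking_rules() {
    grep -E '66\.249\.|dport (80|443)' "$1"
}
```

Run as `find_blocking_rules rules.txt`; any DROP or REJECT line it prints is a candidate for the block.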

Error Message 403

The server answers Googlebot with error 403, which (per RFC 2616) means: the server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated.

Others

  • check the page head for an incorrect robots meta tag,
  • and the server headers for an incorrect configuration,
    • e.g. an .htaccess file with the content: deny from 66.249
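Both checks can be scripted. A sketch (the helper names are mine; the meta-tag pattern assumes an attribute written as `name="robots"`):

```shell
# 1. Does the page head carry a robots meta tag (e.g. "noindex")?
check_robots_meta() {
    curl -s -A "Googlebot" "$1" | grep -i 'meta[^>]*name="robots"'
}

# 2. Does any .htaccess file under a docroot deny Google's 66.249 range?
check_htaccess() {
    grep -Rin 'deny from 66\.249' "$1"
}
```

For example, `check_robots_meta http://cwi.unik.no/` and `check_htaccess /var/www` (path illustrative).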

Analysis of the robots.txt file

A very good tool is provided at http://tool.motoricerca.info/robots-checker.phtml
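Alongside the online checker, the directives that matter can be pulled out locally. A sketch (`robots_rules` is my name for the helper):

```shell
# Print the User-agent / Disallow / Allow lines of a site's robots.txt,
# which is what a crawler actually acts on.
robots_rules() {
    curl -s "$1/robots.txt" | grep -iE '^(user-agent|disallow|allow)'
}
```

Run as `robots_rules http://cwi.unik.no`; a `Disallow: /` under `User-agent: *` (or Googlebot) would explain missing search results.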