Googlebots
From its-wiki.no
Wiki for ITS
Googlebot crawling
We had problems being visible in Google:
- a search for site:cwi.unik.no reveals no results in Google, while all pages are found by Microsoft's search engine
Searching for misconfiguration
- the robots.txt file is visible: http://cwi.unik.no/robots.txt
- installed the User Agent Switcher extension in Firefox; switching the user agent to Googlebot still shows the robots.txt and the site "as normal".
- Thanks to jamesattard.com for information on curl
Use of curl to find the pages
"curl" can fetch all other web pages, but not cwi.unik.no. Why?
- virtual hosts, as defined in /etc/apache2/sites-available. The default contains blocked web pages, e.g. deny from 180.76.0.0/255; 66.249, 62.142, 152.94, 38.101, 83.103, 208.115, 193.37.0.0....
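The effect of such a deny list can be sketched with a small shell helper. The prefixes below are taken from the excerpt above (kept truncated the same way), and is_blocked is a hypothetical helper, not part of Apache:

```shell
# Sketch: check whether a client IP falls under one of the "deny from"
# prefixes listed above. is_blocked is a hypothetical helper for
# illustration only.
blocked_prefixes="180.76. 66.249. 62.142. 152.94. 38.101. 83.103. 208.115. 193.37."

is_blocked() {
  for p in $blocked_prefixes; do
    case "$1" in
      "$p"*) echo "blocked"; return 0 ;;
    esac
  done
  echo "allowed"
  return 0
}

is_blocked 66.249.66.1   # an address in Googlebot's 66.249 range -> blocked
is_blocked 128.39.70.10  # an unrelated address -> allowed
```

Note that 66.249 appears in the deny list, so any crawler coming from Google's 66.249 range would be refused by the virtual host.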
$ curl -A "Googlebot" cwi.unik.no
- does not show anything
$ curl -A "Googlebot" aftenposten.no
- works nicely for aftenposten
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
Moved Permanently
The document has moved <a href="http://www.aftenposten.no/">here</a>.
<address>Apache Server at <a href="mailto:mnowebadmin@medianorge.no">aftenposten.no</a> Port 80</address>
</body></html>
$ curl -A "Googlebot" wiki.unik.no
- works nicely for wiki.unik.no
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
</body></html>
Googlebot is not allowed to crawl the machine
What does that mean?
- IP-specific firewall rules on ports 80 and 443 could block the Googlebot
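Such rules can be looked for by scanning a dump of the firewall configuration (as produced by iptables-save) for DROP/REJECT rules on the web ports. The rules below are a made-up example, not the actual firewall of cwi.unik.no:

```shell
# Sketch: scan a firewall rule dump for IP-specific DROP/REJECT rules on
# ports 80/443. The two rules below are hypothetical examples, not the
# real configuration of cwi.unik.no.
rules='-A INPUT -s 66.249.0.0/16 -p tcp --dport 80 -j DROP
-A INPUT -p tcp --dport 443 -j ACCEPT'

# Count rules that target port 80 or 443 and end in DROP or REJECT.
drops=$(printf '%s\n' "$rules" | grep -cE -- '--dport (80|443) .*-j (DROP|REJECT)')
echo "$drops blocking rule(s) on the web ports"
```

On the real server one would run iptables-save as root and pipe it through the same grep instead of using the sample variable.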
Error Message 403
Googlebot receives error 403, which means: The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated.
Others
- check your page head for an incorrect robots meta tag,
- and your server headers for an incorrect configuration.
- a ".htaccess" file with the content: deny from 66.249
Analysis of robots.txt file
A very good tool is provided at: http://tool.motoricerca.info/robots-checker.phtml
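As a quick local alternative, a copy of the robots.txt can be scanned for Disallow rules in the explicit Googlebot section. The robots.txt content below is a made-up example, not the real cwi.unik.no file:

```shell
# Sketch: scan a robots.txt copy for Disallow lines that apply to Googlebot.
# Only checks the explicit "User-agent: Googlebot" section; a full parser
# would also fall back to the "*" section. The content is hypothetical.
robots='User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow:'

out=$(printf '%s\n' "$robots" | awk '
  tolower($1) == "user-agent:" { ua = tolower($2) }
  tolower($1) == "disallow:" && ua == "googlebot" && $2 != "" {
    print "Googlebot blocked from: " $2
  }')
echo "$out"
```

For the real site, the same awk script can be fed with: curl -s http://cwi.unik.no/robots.txt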