Webmasters who control automated web-crawler access to their sites using
'robots.txt' files have a bias that favours Google over other search engines,
according to new research.
The claim was made by researchers at Penn State University based on the results
of a study of more than 7,500 websites.
C. Lee Giles, David Reese Professor of Information Sciences and Technology at
Penn State, who led the research team that developed the BotSeer search engine
for the study, described the pro-Google bias as "surprising".
"We expected that 'robots.txt' files would treat all search engines equally,
or maybe disfavour certain obnoxious bots," he said.
"So we were surprised to discover a strong correlation between the favoured
robots and search engine market share."
'Robots.txt' files are not an official standard but, by informal agreement,
regulate web-crawlers, also known as 'spiders' and 'bots', which mine the web
continuously.
Web policy makers use the file, placed in a website's root directory, to
restrict crawler access to non-public information.
'Robots.txt' files are also used to reduce the server load from crawlers, which
can otherwise result in denial of service and force a website to shut down. But
some web policy makers and administrators are writing 'robots.txt' files that
do not block access uniformly.
Instead, those files give access to Google, Yahoo and MSN while restricting
other search engines, the researchers found.
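
As an illustration only, the kind of file the researchers describe might look
like the sketch below. The 'robots.txt' content is hypothetical and is not taken
from the study; 'Googlebot', 'Slurp' and 'msnbot' are the crawler names used by
Google, Yahoo and MSN, while 'ExampleBot' and the example.com address are
made-up stand-ins for any other crawler and site. The check uses Python's
standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that favours three named crawlers (not from the study).
# An empty "Disallow:" permits everything; "Disallow: /" blocks everything.
BIASED_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return a dict mapping each crawler name to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# "ExampleBot" is an assumed name standing in for any unlisted crawler.
AGENTS = ["Googlebot", "Slurp", "msnbot", "ExampleBot"]
result = allowed_agents(BIASED_ROBOTS_TXT, AGENTS, "http://example.com/index.html")

for agent, ok in result.items():
    print(agent, "allowed" if ok else "blocked")
```

Under these assumptions the three named crawlers are allowed and 'ExampleBot'
is blocked. Passing an empty string in place of the file content allows every
crawler, which matches the case of a site with no rules at all.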
While the study does not include explanations for why web policy makers have
opted to favour Google, the researchers know that the choice was made
consciously, because a site with no 'robots.txt' file gives all robots equal
access.
"'Robots.txt' files are written by web policy makers and administrators who
have to intentionally specify Google as the favoured search engine," said
Professor Giles.
Not every site has a 'robots.txt' file, although the number is growing. About
four in 10 of the 7,500 sites analysed by the researchers had such a file, up
from fewer than one in 10 in 1996.