Google bots get the red carpet treatment

by Robert Jaques

19 Nov 2007

'Robots.txt' files do not treat all search engines equally

Webmasters who control automated web-crawler access to their sites using 'robots.txt' files have a bias that favours Google over other search engines, according to new research.

The claim was made by researchers at Penn State University based on the results of a study of more than 7,500 websites.

C. Lee Giles, the David Reese Professor of Information Sciences and Technology at Penn State, led the research team that developed the BotSeer search engine for the study. He described the pro-Google bias as "surprising".

"We expected that 'robots.txt' files would treat all search engines equally, or maybe disfavour certain obnoxious bots," he said.

"So we were surprised to discover a strong correlation between the favoured robots and search engine market share."

'Robots.txt' files are not an official standard but, by informal agreement, regulate web-crawlers, also known as 'spiders' and 'bots', which mine the web continuously.

Web policy makers place the files in a website's root directory to restrict crawler access to non-public information.

'Robots.txt' files are also used to reduce the server load caused by crawlers, which can otherwise result in denial of service and force a website to shut down. But some web policy makers and administrators are writing 'robots.txt' files that do not block access uniformly.

Instead, those files give access to Google, Yahoo and MSN while restricting other search engines, the researchers found.
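To illustrate the kind of file the researchers describe, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The 'robots.txt' content, the `allowed_agents` helper, the `example.com` URL and the `OtherBot` crawler name are all invented for illustration and are not taken from the study. Googlebot, Slurp and msnbot are the user-agent names commonly associated with the Google, Yahoo and MSN crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical 'robots.txt' that favours three large search engines.
# An empty "Disallow:" line means "nothing is off limits" for that robot,
# while "Disallow: /" under "User-agent: *" blocks every other robot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="http://example.com/index.html"):
    """Return a dict mapping each user-agent name to True if it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["Googlebot", "Slurp", "msnbot", "OtherBot"]
    for agent, ok in allowed_agents(ROBOTS_TXT, agents).items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Under these assumptions the three named crawlers are allowed and the unnamed one is blocked. Passing an empty string as the file content lets every robot through, which mirrors the case of a site with no restrictions at all.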

Although the study does not explain why web policy makers have opted to favour Google, the researchers know that the choice was made consciously, because not using a 'robots.txt' file gives all robots equal access to a website.

"'Robots.txt' files are written by web policy makers and administrators who have to intentionally specify Google as the favoured search engine," said Professor Giles.

Not every site has a 'robots.txt' file, although the number is growing. About four in 10 of the 7,500 sites analysed by the researchers had such a file, up from fewer than one in 10 in 1996.
