At the last count, there were more than 800 million pages on the world wide web. It’s no wonder that finding what you want is becoming increasingly laborious. We look for help from the second generation of search engines.
Surfing the web for mild distraction is one thing. Using it to find useful information is quite another.
For while the Internet’s informal, university and research-lab roots lend themselves nicely to the vagaries of the former, its mushroom-like growth and inherent anarchy make the latter an increasingly tough proposition.
Until recently the only answer was the search engine. And anyone who has typed in ‘UK distributors’ and got back 6,789,421 alleged ‘hits’ – most of which mention the word ‘erotic’ – will know how useful they are (not).
Step forward a possible answer: the second generation of search engines, which leverage Extensible Markup Language (XML). XML lets elements within a web page be marked up precisely, so colours are defined as colours, graphs as graphs and text as text.
Second-generation search engines are expected to produce quicker and more efficient searches on the web than their first-generation cousins. IBM is making loud noises about its product, Clever, and the developers of US-based second-generation search engine Google said last month that they had received $25 million (£15m) in equity funding from venture capitalists who believe in the concept.
IBM and Google expect to attract growing attention as the existing search engine players face either an uncertain future (AltaVista being spun off from Compaq) or a change of direction (Yahoo! refocusing on consumers by becoming a portal).
But it’s not all good news for these search engine upstarts. Clever is nearly three years old, yet still has no delivery date. This means corporate customers may have to continue to look to existing niche search engines to speed up their web searches instead.
That matters, because first-generation search engines are already creaking under the strain of the web’s growth. Two years ago, the web contained around 320 million pages, one third of which were indexed by search engines.
Data from the US-based NEC Research Institute, published this month, puts the figure at around 800 million publicly available pages, of which even the best search engines can index only about 16%.
And even then there is no guarantee the results will be comprehensive: NEC found there is now a six-month lag between a new page being published and its appearance in a search engine’s index.
“It's imperative that the quality of search engines improves. If not, the web will become unusable because the quantity of information out there is far beyond the ability of any human to consume,” says Sridhar Rajagopalan, a member of IBM's research staff.
The Holy Grail of search technology is an engine that will find all the pages desired. But we still have a long way to go. At present, the goal of comprehensiveness – finding every page that might matter – is in direct conflict with the goal of relevance – returning only the pages you actually want.
Ironically, developments designed to make search engines more comprehensive have in practice added to the problem. Newer search engines such as AltaVista use web crawlers: software agents that automatically hunt out new pages and add them to the engine’s index.
AltaVista says automatic crawlers allow it to provide the web's most comprehensive and up-to-date Internet index: the engine claims it now covers 140 million pages, refreshed every 28 days. But although automatic crawlers have increased the number of pages visible to search engines, and thus increased the number of hits for each search, they also mean end users are deluged with search responses.
The starting point for any search engine is to carry out a first pass through its index to find all the documents containing the search terms. It then uses a formula to prioritise the hits.
At the most popular first-generation sites, such as AltaVista, Infoseek, Lycos and Excite, that formula is based on the search term itself: a document is ranked according to how frequently the term appears in it, where in the document it appears, or a combination of both.
Some products, such as AltaVista, go further by allowing the searcher to constrain the search to pages where the term appears in the document’s title or a hyperlink, as well as searching only for particular types of pages, such as those containing image files.
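To see how easily term-based ranking can be led astray, here is a deliberately simple sketch in Python; the documents and the scoring formula are invented for illustration and do not correspond to any real engine’s algorithm.

```python
# Toy illustration of first-generation ranking: score documents by how often a
# query term appears and how early it appears. Invented example, not a real formula.

documents = {
    "doc1": "uk distributors of industrial pumps and valves",
    "doc2": "holiday photos and random links with nothing useful",
    "doc3": "uk distributors uk distributors uk distributors directory",  # keyword stuffing
}

def score(text, term):
    words = text.split()
    positions = [i for i, w in enumerate(words) if w == term]
    if not positions:
        return 0.0
    frequency = len(positions) / len(words)   # how often the term occurs
    earliness = 1.0 / (1 + positions[0])      # bonus for appearing near the start
    return frequency + earliness

def search(term):
    hits = [(name, score(text, term)) for name, text in documents.items()]
    return sorted([h for h in hits if h[1] > 0], key=lambda h: h[1], reverse=True)

print(search("distributors"))
# The keyword-stuffed doc3 outranks the genuinely useful doc1.
```

Note how the page that simply repeats the term comes out on top.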
But these methods aren't always a good guide to a page's relevance. Web authors often deliberately cheat the search engine by repeating the same keywords over and over in the metadata they supply to describe a page (a favourite trick of pornographic sites).
Researchers now have a trick of their own: looking beyond a document’s content and judging its importance according to its relationships with other documents. This approach, known as the popularity method, underlies Directhit, Google and IBM's Clever, and analysts think it may prove fruitful.
"These methods definitely improve on the ability to provide a relevant result," notes Bridget Leach, industry analyst at Giga Group.
One new engine, Directhit, analyses users’ behaviour directly. It measures how often they select one link rather than another, and how long they spend at the sites they visit, to determine which pages best meet a search criterion. The downside is that the results are limited to subjects that have already been extensively visited by other web users.
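Directhit has not published its scoring formula, but the click-popularity principle can be sketched in a few lines of Python; the URLs, visit data and weighting below are purely illustrative.

```python
# Illustrative click-popularity scoring: pages chosen often, and held on to for
# longer, score higher. The weighting is invented, not Directhit's real method.
import math
from collections import defaultdict

clicks = defaultdict(int)     # how many searchers chose each result
dwell = defaultdict(float)    # total seconds spent at each result

def record_visit(url, seconds):
    clicks[url] += 1
    dwell[url] += seconds

def popularity(url):
    if clicks[url] == 0:
        return 0.0
    average_dwell = dwell[url] / clicks[url]
    # Combine how often a page is chosen with how long visitors stay there.
    return clicks[url] * math.log(1 + average_dwell)

record_visit("useful-directory.example/uk-distributors", 240)
record_visit("useful-directory.example/uk-distributors", 180)
record_visit("keyword-stuffed.example/page", 5)

results = ["useful-directory.example/uk-distributors", "keyword-stuffed.example/page"]
print(sorted(results, key=popularity, reverse=True))
```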
In contrast, Clever and Google aim to identify the most relevant information by analysing the hyperlinks between web pages.
Google uses a technique it calls PageRank, which does just that. Having come up with an initial list of hits, the engine then allocates an importance weighting to each page by finding out how many other pages point to it. The most important pages are those with links from pages which are themselves rated as important.
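The recursion behind that idea can be sketched briefly; the miniature ‘web’, damping factor and iteration count below follow the published description of PageRank rather than whatever Google actually runs in production.

```python
# Simplified PageRank: a page is important if important pages link to it.
# Hypothetical four-page web; damping factor is the textbook default.

links = {          # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each page passes its rank, split evenly, along its outgoing links.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

print(pagerank(links))
# "C" scores highest: it has the most incoming links, including one from the
# already well-linked "A"; "D", which nothing points to, scores lowest.
```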
IBM's Clever works in a similar way. It looks at how pages interconnect and sorts them into two categories: authorities, the core sources of information that are passed back to the searcher, and hubs, pages that point to many good authorities.
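A bare-bones version of that mutual reinforcement (good hubs point to good authorities; good authorities are pointed to by good hubs) might look like this; the pages are invented and the refinements IBM has added to Clever are not shown.

```python
# Minimal hubs-and-authorities iteration over a hypothetical miniature web.
import math

links = {          # page -> pages it links to
    "portal1": ["paperA", "paperB"],
    "portal2": ["paperA", "paperB", "paperC"],
    "paperA": [],
    "paperB": ["paperA"],
    "paperC": [],
}

def hubs_and_authorities(links, iterations=30):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # Normalise so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get))  # "paperA": the page the strongest hubs agree on
```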
“A first generation search engine will typically return thousands of pages. If you have a high precision search engine that can return just 60 or 70 relevant ones, then searching becomes a much more productive process,” explains Rajagopalan.
But all engines do have a downside. John Snyder, founder and director of information retrieval specialist Muscat, says they all assume you know exactly what you're looking for in the first place. He claims Muscat's search engine Empower operates differently, aimed at knowledge retrieval rather than basic word matching.
“Our engine assumes you don't know exactly what you're after. It works out the relationship of words and ideas on the fly. If you go into it and enter ‘Zippergate’, it will immediately throw up related ideas.”
Empower – developed at Cambridge University – scores the value of the search terms themselves, rather than documents, based on a probability analysis.
For example, if you are searching on law suits, the word ‘lawyer’ in a legal database will have a lower value than it would if it appeared in a news story about contract fraud. This means you would not be deluged with search results that point to lots of databases which contain text only a lawyer would understand or use.
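Muscat’s own model is proprietary, but the general principle, that a word’s weight depends on how distinctive it is within the collection being searched, can be illustrated with the standard inverse-document-frequency weighting used in probabilistic retrieval. The toy ‘databases’ below are invented for the purpose.

```python
# A word that appears in nearly every document of a collection carries little
# weight there; the same word in a different collection can be highly distinctive.
import math

legal_db = [
    "lawyer files motion in court",
    "lawyer and counsel argue precedent",
    "court appoints lawyer for defendant",
]
news_db = [
    "contract fraud uncovered at supplier",
    "lawyer comments on contract fraud case",
    "football transfer completed",
]

def term_weight(term, collection):
    containing = sum(1 for doc in collection if term in doc.split())
    # Rare-in-collection terms get high weight; ubiquitous terms get low weight.
    return math.log((len(collection) + 1) / (containing + 1))

print(term_weight("lawyer", legal_db))  # ~0.0: present in every legal document
print(term_weight("lawyer", news_db))   # higher: distinctive within a news collection
```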
Empower also analyses how often the search terms are associated with other words, and presents the searcher with a list of subcategories. ‘Zippergate’ will give you categories including US Presidents Clinton and Nixon, scandal, women and America. Depending on which categories or documents you choose, the software will then refine the search to focus on the particular areas in which you expressed interest.
In this case, it could be scandals involving US presidents, Bill Clinton's career, women's relationships with their bosses, or the American constitution.
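One way such subcategories could be generated (this is an illustrative technique, not a description of Empower’s internals) is simply to count which words keep turning up alongside the search term.

```python
# Surface candidate subcategories by counting words that co-occur with the query
# term in the same documents. Invented documents, illustrative method only.
from collections import Counter

documents = [
    "zippergate scandal engulfs clinton white house",
    "clinton denies zippergate allegations",
    "nixon watergate scandal compared to zippergate",
]

def related_terms(query, documents, top=3):
    counts = Counter()
    for doc in documents:
        words = doc.split()
        if query in words:
            counts.update(w for w in words if w != query)
    return [word for word, _ in counts.most_common(top)]

print(related_terms("zippergate", documents))
# e.g. ['scandal', 'clinton', ...] -- candidate subcategories for refining the search
```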
Again, there are limitations. Snyder admits that Empower would not work well when applied to the whole of the web. However, he says, it is no longer appropriate to have search engines that try to be all things to all people: “Users are too overloaded with information already. We need to find ways of restricting searches to a particular domain.”
Certainly, the technology that eventually gets it right will be in a sweet spot in the market. Entering the fray are a host of smaller players, seizing the opportunity as the search engine ‘dinosaurs’ shuffle off into portal territory and away from their original business.
These newcomers include Muscat, Inktomi and Autonomy, all of which tackle searches on corporate intranet and Internet sites. Muscat, for instance, provides specialist search facilities for the Electronic Telegraph, DHL, Shell and the BBC.
As the web grows, one future strategy is likely to be smaller domains. Leach says these are probably the only realistic route to effective searching. She predicts an increase in vertical, community-based sites, containing specialist links and search engines geared to a single subject.
For example, www.beersite.com claims to be the first search engine focused specifically on the brewing industry. Those searching for data about the brewing industry may be better served by using this search engine instead of typing the words ‘beer’ and ‘industry’ into either a first or second-generation search engine.
The Clever project is also moving in this direction. Its engineers have begun building lists of web pages similar to the directories compiled manually by services such as Yahoo!, but generated automatically, and they are working on improved algorithms to do so. IBM’s search for partners is thought to be aimed at this, but there is still no date for the commercial delivery of Clever.
The ultimate search engine, one that can scour the whole web and come up with accurate answers to your business query, is unlikely to appear. Attempts to get near that Holy Grail seem to have passed from the hands of the industry heavyweights to niche vendors and to research efforts such as IBM’s.
Either way, the days of the all-purpose search engine look numbered. Engines for specific tasks, and even specific subject areas, are going to be the only feasible way forward for web information retrieval.
As Muscat’s Snyder says: “Having just one kind of search engine is like having just one kind of car. You can’t expect to satisfy everyone with the same strategy.”