Microsoft has said it plans to launch its own algorithmic search engine for MSN Search as early as the end of this year. MSN currently uses search results from Yahoo Inc.'s engine.
The Web page spam research is based on two different crawls of the Web conducted almost two years ago, Najork said. Using the results from the crawl of 150 million Web pages conducted over the course of 11 weeks, researchers found that 8.1 percent of the pages were spam and that various statistical techniques could identify about 75 percent of those spam pages. The statistical techniques look for such anomalies as a high number of host names associated with the same IP address, a large number of characters or words in a host name, and an unusual distribution of links.
Researchers also are working to use natural language processing to automatically write summaries of news stories and items in a newsbot application. The ability of a computer to generate a summary could be important as more search sites attempt to crawl and sort news sources. MSN, for instance, is planning to launch a new news search service later this year.
To thwart Internet worms, researchers are proposing a line of defense in the network stack that could prevent the spread of worms even before software patches are available or deployed. Called Shield, the project uses network filters to monitor the incoming and outgoing traffic of vulnerable applications in order to stop traffic that uses an exploit.
The Microsoft Research team plans to present its findings in a paper called "Spam, Damn Spam, and Statistics" during a Paris workshop next week. Next up is analyzing Web page content and words to weed out spamlike patterns, Najork said.
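To make two of the statistical signals mentioned above more concrete (many host names sharing one IP address, and unusually long host names), here is a minimal Python sketch. It is an illustration only, not the researchers' method: the function name `flag_suspicious_hosts`, both thresholds, and the sample data are assumptions made for this example, and the article does not say what cutoffs the team actually used.

```python
from collections import defaultdict


def flag_suspicious_hosts(host_to_ip, max_hosts_per_ip=3, max_hostname_chars=40):
    """Return the set of host names that trip either heuristic.

    Both thresholds are arbitrary illustrative values, not figures
    from the Microsoft Research paper.
    """
    # Group host names by the IP address they resolve to.
    hosts_by_ip = defaultdict(set)
    for host, ip in host_to_ip.items():
        hosts_by_ip[ip].add(host)

    flagged = set()
    for host, ip in host_to_ip.items():
        crowded_ip = len(hosts_by_ip[ip]) > max_hosts_per_ip  # many hosts, one IP
        long_name = len(host) > max_hostname_chars            # unusually long host name
        if crowded_ip or long_name:
            flagged.add(host)
    return flagged


# Made-up sample data; 192.0.2.x and 198.51.100.x are documentation-only ranges.
sample = {
    "cheap-pills-1.example": "192.0.2.10",
    "cheap-pills-2.example": "192.0.2.10",
    "cheap-pills-3.example": "192.0.2.10",
    "cheap-pills-4.example": "192.0.2.10",
    "news.example": "198.51.100.7",
    "buy-cheap-discount-pills-online-now-free-shipping.example": "198.51.100.8",
}

for host in sorted(flag_suspicious_hosts(sample)):
    print(host)
```

In this sample, the four hosts sharing one address and the one very long host name are flagged, while `news.example` is not. A real system would presumably combine many such signals statistically rather than apply fixed cutoffs, which would fit the reported detection rate of about 75 percent rather than 100 percent.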