| |  | Abelson, Hal | Blown to Bits: Your Life, Liberty, and Happiness After the Digital Explosion read moreAbstract: Sorry no abstract available for this article | 2008 |
| |  | Han, H. | Two supervised learning approaches for name disambiguation in author citations read moreAbstract: Due to name abbreviations, identical names, name misspellings, and pseudonyms in publications or bibliographies (citations), an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, Web search, database integration, and may cause improper attribution to authors. We investigate two supervised learning approaches to disambiguate authors in the citations. One approach uses the naive Bayes probability model, a generative model; the other uses support vector machines (SVMs) [V. Vapnik (1995)] and the vector space representation of citations, a discriminative model. Both approaches utilize three types of citation attributes: coauthor names, the title of the paper, and the title of the journal or proceeding. We illustrate these two approaches on two types of data, one collected from the Web, mainly publication lists from homepages, the other collected from the DBLP citation databases. | 2004 |
| |  | Ntoulas, Alexandros | Whats new on the web?: the evolution of the web from a search engine perspective read moreAbstract: We seek to gain improved insight into how Web search engines shouldcope with the evolving Web, in an attempt to provide users with themost up-to-date results possible. For this purpose we collectedweekly snapshots of some 150 Web sites over the course of one year,and measured the evolution of content and link structure. Our measurements focus on aspects of potential interest to search engine designers: the evolution of link structure over time, the rate ofcreation of new pages and new distinct content on the Web, and the rate of change of the content of existing pages under search-centric measures of degree of change.Our findings indicate a rapid turnover rate of Web pages, i.e.,high rates of birth and death, coupled with an even higher rate ofturnover in the hyperlinks that connect them. For pages that persistover time we found that, perhaps surprisingly, the degree of contentshift as measured using TF.IDF cosine distance does not appear to beconsistently correlated with the frequency of contentupdating. Despite this apparent non-correlation, the rate of content shift of a given page is likely to remain consistent over time. That is, pages that change a great deal in one week will likely change by a similarly large degree in the following week. Conversely, pages that experience little change will continue to experience little change. We conclude the paper with a discussion of the potential implications ofour results for the design of effective Web search engines. | 2004 |
| |  | Broder, Andrei | A taxonomy of web search read moreAbstract: Classic IR (information retrieval) is inherently predicated on users searching for information, the so-called "information need". But the need behind a web search is often not informational -- it might be navigational (give me the url of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map). We explore this taxonomy of web searches and discuss how global search engines evolved to deal with web-specific needs. | 2002 |