Introduction to LAW
The Laboratory for Web Algorithmics (LAW) was established in 2002 at the Dipartimento di Scienze dell'Informazione (now merged in the Computer Science Department) of the Università degli studi di Milano.
Research at LAW concerns all algorithmic aspects of the study of the web and of social networks. More in detail…
High-performance web crawling
UbiCrawler is a scalable, fault-tolerant and fully distributed web crawler developed in collaboration with the Istituto di Informatica e Telematica. The first report on the design of UbiCrawler won the Best Poster Award at the Tenth World Wide Web Conference.
BUbiNG is a high-performance, scalable, distributed, open-source crawler built on our experience with UbiCrawler and on the last ten years of research on the topic. You can find it on the Maven Central Repository or download sources and binaries.
Compression of web graphs and social networks
Once a part of the web has been crawled, the resulting graph is very large—you need a compact representation. WebGraph is a framework built to this purpose. Among other things, WebGraph uses new instantaneous codes for the integers and new aggressive algorithmic compression techniques. The techniques developed for web graphs found application also in the compression of social networks.
Presently, the best results obtained for the compression of this kind of graphs are obtained by coupling WebGraph with LLP, a new algorithm that permutes a graph and increases its locality.
We use our compression framework to distribute easily a corpus of datasets comprising both web graphs and social networks. Thanks to WebGraph, a user can download and immediately start exploring the datasets. Several scientific papers have been written using our datasets and software.
Analysis of web graphs and social networks
Web graphs have special properties whose study requires a sizeable amount of mathematics, but also a careful study of actual web graphs. We have studied, for instance, the paradoxical way PageRank evolves during a crawl, and the way PageRank changes depending on the damping factor.
Recently, we developed a new algorithm, called HyperBall, that can compute quickly an approximation of the neighbourhood function of large graphs and several centrality measures based on distances. The algorithm was used to compute for the first time second-order statistics about the distance distribution of a large number of web and social graphs: the results can be accessed at the LAW web site. In particular, HyperBall was used to show that Facebook has just four degrees of separation.
On a more theoretical side, we proposed an axiomatic framework for understanding centrality and studied harmonic centrality, a variant of Bavelas's classical closeness centrality designed to take unreachable nodes into account in a natural way.
Often, the purpose of a crawl is the contruction of a full-text index of the text contained in the crawled pages. Such an index is at the basis of all existing commercial search engines such as Google.
The research about search-engine construction is based on MG4J, a system for full-text indexing of large-scale document collections.