This page describes the Java software developed by members of the laboratory, as well as software developed by third parties but hosted by the LAW.
BUbiNG is a next-generation web crawler built upon the authors' experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is a fully distributed, open-source Java crawler (with no central coordination); single agents, using sizable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (such as MapReduce), BUbiNG's job distribution is based on modern high-speed protocols, achieving very high throughput.
If you are a webmaster and want to stop the crawler from accessing your site, please follow these instructions.
Fast Computation of the Neighbourhood Function and Distance Distribution
We have designed and engineered an algorithm that quickly computes the neighbourhood function and a number of geometric centralities of very large graphs. We provide a Java implementation in WebGraph. Details can be found in our papers “In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond” and “HyperANF: Approximating the neighbourhood function of very large graphs on a budget”. It has been used to measure Facebook.
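The underlying iterative scheme is easy to sketch: keep one set per node, initialized to the node itself, and at each step replace it with its union with the successors' sets; the neighbourhood function at distance t is then the sum of the set sizes. The real algorithm replaces the exact sets below with HyperLogLog counters, whose unions are cheap, which is what makes it scale; the exact-set version here only conveys the idea, and all class and method names are illustrative, not part of WebGraph.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch of the iteration scheme behind ANF/HyperANF, using
// exact BitSets in place of the HyperLogLog counters that make the real
// algorithm scale. The graph is a plain adjacency list.
public class NeighbourhoodFunctionSketch {
    // successors[x] lists the successors of node x.
    public static long[] neighbourhoodFunction(int[][] successors, int maxDistance) {
        int n = successors.length;
        BitSet[] ball = new BitSet[n];
        for (int x = 0; x < n; x++) {
            ball[x] = new BitSet(n);
            ball[x].set(x); // ball of radius 0: just the node itself
        }
        long[] nf = new long[maxDistance + 1];
        nf[0] = n; // n pairs at distance 0
        for (int t = 1; t <= maxDistance; t++) {
            BitSet[] next = new BitSet[n];
            for (int x = 0; x < n; x++) {
                next[x] = (BitSet) ball[x].clone();
                for (int y : successors[x]) next[x].or(ball[y]); // union with successors' balls
            }
            ball = next;
            long count = 0;
            for (int x = 0; x < n; x++) count += ball[x].cardinality();
            nf[t] = count; // number of pairs (x, y) with d(x, y) <= t
        }
        return nf;
    }

    public static void main(String[] args) {
        // Directed path 0 -> 1 -> 2 -> 3
        int[][] g = { { 1 }, { 2 }, { 3 }, {} };
        System.out.println(Arrays.toString(neighbourhoodFunction(g, 3))); // [4, 7, 9, 10]
    }
}
```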
Compression by Layered Label Propagation
In Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks (Proceedings of the 20th International Conference on World Wide Web, 2011) we propose a new way to permute graphs so as to increase their locality, which in turn yields much better compression ratios when using WebGraph. The software written for this purpose is distributed in the LAW software under the GNU General Public License.
Computing PageRank and its Derivatives
In PageRank: Functional dependencies (ACM Trans. Inf. Sys., 27(4):1–23, 2009) we proved that there is a simple way to compute the derivatives of PageRank (with respect to the damping factor) of any order, and also to compute the Maclaurin polynomial coefficients of PageRank during the PageRank computation for a fixed damping factor. As a result, once the coefficients have been stored, it is possible to directly compute the value of PageRank (approximated by the same number of iterations) for any other value of the damping factor. The software written for this purpose is distributed in the LAW software under the GNU General Public License.
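Concretely, the idea can be sketched as follows (a minimal toy version that ignores dangling nodes and every other correction; all names are illustrative, and this is not the LAW code): writing PageRank as the power series r(α) = (1 − α) Σ_k α^k v P^k, the coefficients c_k = v P^k do not depend on α, so they can be computed and stored once, after which r(α) is a simple weighted sum for any damping factor.

```java
import java.util.Arrays;

// Minimal sketch: PageRank as a power series in the damping factor.
// Writing r(alpha) = (1 - alpha) * sum_k alpha^k * v P^k, the coefficients
// c_k = v P^k do not depend on alpha: store them once, then evaluate
// PageRank for any damping factor as a weighted sum. Dangling-node and
// convergence details are deliberately omitted.
public class PageRankSeriesSketch {
    // Returns the coefficients c_0 .. c_k, where p is the row-stochastic
    // transition matrix and v the preference vector.
    public static double[][] coefficients(double[][] p, double[] v, int k) {
        int n = v.length;
        double[][] c = new double[k + 1][];
        c[0] = v.clone();
        for (int t = 1; t <= k; t++) {
            c[t] = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) c[t][j] += c[t - 1][i] * p[i][j]; // c_t = c_{t-1} P
        }
        return c;
    }

    // Evaluates r(alpha) ~= (1 - alpha) * sum_t alpha^t c_t from stored coefficients.
    public static double[] evaluate(double[][] c, double alpha) {
        int n = c[0].length;
        double[] r = new double[n];
        double pow = 1;
        for (double[] ct : c) {
            for (int j = 0; j < n; j++) r[j] += pow * ct[j];
            pow *= alpha;
        }
        for (int j = 0; j < n; j++) r[j] *= 1 - alpha;
        return r;
    }

    public static void main(String[] args) {
        // Toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
        double[][] p = { { 0, .5, .5 }, { 0, 0, 1 }, { 1, 0, 0 } };
        double[] v = { 1. / 3, 1. / 3, 1. / 3 };
        double[][] c = coefficients(p, v, 200);
        System.out.println(Arrays.toString(evaluate(c, 0.85)));
        System.out.println(Arrays.toString(evaluate(c, 0.50))); // no recomputation needed
    }
}
```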
In Paradoxical Effects in PageRank Incremental Computations (Internet Math., 2(3):387–404, 2005), we studied how PageRank evolves during a crawl under several different strategies. The study required the computation of Kendall's τ on a large data set.
Evaluating Kendall's τ on large data samples is not so common, and there is a remarkable scarcity of literature on computing τ efficiently. Clearly, the brute-force O(n²) approach is easy to implement, but useless on a web graph. Knight presented in the sixties an O(n log n) algorithm for the computation of τ (William R. Knight, A Computer Method for Calculating Kendall's Tau with Ungrouped Data, J. Amer. Statist. Assoc., vol. 61, no. 314, 1966, pages 436–439), but the only implementation we are aware of is part of the SAS system. Ideally, sophisticated data structures such as those described by Dietz (Paul F. Dietz, Maintaining order in a linked list. In Proceedings of the 14th Annual ACM Symposium on Theory of Computing, pages 122–127, San Francisco, California, 1982) could give an O(n log n / log log n) bound, but this approach is infeasible in practice when the data set is so large that the rankings occupy a significant fraction of the available core memory.
Since we are interested in evaluating Kendall's τ on large graphs, we decided to write a direct, efficient implementation of Knight's algorithm, with a special variant needed to treat ties. This variant was indeed briefly described at the end of Knight's paper, but the details were completely omitted.
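To illustrate, the tie-free core of Knight's method is just inversion counting: sort the pairs by the first ranking, then count, via mergesort, the pairs that are out of order in the second ranking (the discordant pairs). The sketch below omits the tie corrections entirely and is not the LAW implementation; all names are illustrative.

```java
import java.util.Arrays;

// Sketch of the tie-free core of Knight's O(n log n) method for Kendall's
// tau: sort by the first ranking, then count discordant pairs as inversions
// of the second ranking via mergesort. Tie corrections are omitted.
public class KendallTauSketch {
    public static double tau(double[] x, double[] y) {
        int n = x.length;
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(x[a], x[b])); // sort by first ranking
        double[] ys = new double[n];
        for (int i = 0; i < n; i++) ys[i] = y[idx[i]];
        long discordant = sortAndCount(ys, 0, n, new double[n]);
        long pairs = (long) n * (n - 1) / 2;
        return (pairs - 2.0 * discordant) / pairs; // (concordant - discordant) / pairs
    }

    // Mergesort of a[lo..hi), returning the number of inversions.
    private static long sortAndCount(double[] a, int lo, int hi, double[] buf) {
        if (hi - lo <= 1) return 0;
        int mid = (lo + hi) >>> 1;
        long count = sortAndCount(a, lo, mid, buf) + sortAndCount(a, mid, hi, buf);
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) {
            if (a[i] <= a[j]) buf[k++] = a[i++];
            else { count += mid - i; buf[k++] = a[j++]; } // a[i..mid) all exceed a[j]
        }
        while (i < mid) buf[k++] = a[i++];
        while (j < hi) buf[k++] = a[j++];
        System.arraycopy(buf, lo, a, lo, hi - lo);
        return count;
    }

    public static void main(String[] args) {
        double[] x = { 1, 2, 3, 4 };
        System.out.println(tau(x, new double[] { 1, 2, 3, 4 })); // 1.0
        System.out.println(tau(x, new double[] { 4, 3, 2, 1 })); // -1.0
        System.out.println(tau(x, new double[] { 2, 1, 4, 3 })); // 1/3
    }
}
```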
UbiCrawler is a scalable, fault-tolerant and fully distributed web crawler, described in UbiCrawler: A scalable fully distributed web crawler (Software: Practice & Experience, 34(8):711-726, 2004). The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and more in general the complete decentralization of every task. The classes computing consistent hashing are part of the LAW software.
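To give the flavour of such an assignment function (this is only a sketch, not the LAW classes, and every name in it is illustrative): each agent is mapped to many points on a hash ring, and each host is assigned to the agent owning the first ring point at or after the host's hash, so that adding or removing an agent only moves the hosts that agent owned.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Illustrative consistent-hashing sketch (not the LAW implementation): each
// agent owns many points ("replicas") on a hash ring; a host is assigned to
// the agent owning the first ring point at or after the host's hash.
// Removing an agent only reassigns the hosts that agent owned.
public class ConsistentHashSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    public ConsistentHashSketch(int replicas) { this.replicas = replicas; }

    public void addAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.put(hash(agent + "#" + i), agent);
    }

    public void removeAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.remove(hash(agent + "#" + i));
    }

    public String assign(String host) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around the ring
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes
            return h;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        ConsistentHashSketch ch = new ConsistentHashSketch(100);
        ch.addAgent("agent-a"); ch.addAgent("agent-b"); ch.addAgent("agent-c");
        System.out.println("example.com -> " + ch.assign("example.com"));
    }
}
```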
UbiCrawler (a joint effort of the LAW and of the Istituto di Informatica e Telematica of the Italian National Research Council) is not distributed publicly; BUbiNG, on the other hand, is our new crawler, released as open source.
WebGraph is a framework to study the web graph. It provides simple ways to manage very large graphs, exploiting modern compression techniques. Using WebGraph is as easy as installing a few jar files and downloading a data set. This makes it very easy to study phenomena such as PageRank and the distribution of graph properties of the web graph.
MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. MG4J is a highly customisable, high-performance, full-fledged search engine providing state-of-the-art features (such as BM25/BM25F scoring) and new research algorithms.
- Voting in social networks. This project is related to recent work on voting in social networks published in CIKM (Conference on Information and Knowledge Management, ACM Press, 2009) and in Communications of the ACM (to appear). This page refers to the code needed to run the DBLP experiments described in the paper. The software is available as Java source, along with a Ruby script needed to process the raw XML data from DBLP.
- Weighted graphs. This package implements a simple arc-weighted graph in which each arc has a weight expressed as a float. The package also contains a class for computing a PageRank scoring that is sensitive to these weights. The score of each node is distributed to its successors in proportion to the relative weights of the out-links to such successors. If all the weights are equal, this gives the usual PageRank. The software is available as Java source.
- Approximate triangle counting. This software is a joint effort (developed by members of the LAW, people from Yahoo! Research, and by Sapienza Università di Roma) to build tools for exact and approximate triangle counting in large (Web-scale) graphs, as described in Efficient Algorithms for Large-Scale Triangle Counting (ACM Transactions on Knowledge Discovery from Data, 2010). The software is available as Java source.
- Triangular random walks. This package implements the classes related to triangular random walks, as described in an as-yet-unpublished paper by Paolo Boldi and Marco Rosa. The software is available as Java source and binary (i.e., a jar with docs).
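The weight-sensitive distribution rule of the weighted-graphs package above can be sketched in a few lines (a toy, dense-matrix sketch, not the LAW code; all names are illustrative): node y contributes x_y · w(y, z)/W(y) of its score to each successor z, where W(y) is the total weight on y's out-links.

```java
import java.util.Arrays;

// Toy sketch of arc-weight-sensitive PageRank (not the LAW implementation):
// node y passes the fraction w(y, z) / W(y) of its score to successor z,
// where W(y) is the total weight on y's out-links. With equal weights this
// reduces to ordinary PageRank. Dangling nodes are not handled.
public class WeightedPageRankSketch {
    // w[y][z] is the weight of the arc y -> z (0 if absent); alpha is the damping factor.
    public static double[] pageRank(float[][] w, double alpha, int iterations) {
        int n = w.length;
        double[] outWeight = new double[n];
        for (int y = 0; y < n; y++) for (int z = 0; z < n; z++) outWeight[y] += w[y][z];
        double[] x = new double[n];
        Arrays.fill(x, 1.0 / n);
        for (int t = 0; t < iterations; t++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - alpha) / n);
            for (int y = 0; y < n; y++)
                for (int z = 0; z < n; z++)
                    if (w[y][z] > 0) next[z] += alpha * x[y] * w[y][z] / outWeight[y];
            x = next;
        }
        return x;
    }

    public static void main(String[] args) {
        // 0 links to 1 (weight 3) and to 2 (weight 1); 1 and 2 link back to 0.
        float[][] w = { { 0, 3, 1 }, { 1, 0, 0 }, { 1, 0, 0 } };
        System.out.println(Arrays.toString(pageRank(w, 0.85, 100)));
    }
}
```

Since node 0 sends three quarters of its score along the heavier arc, node 1 ends up with a higher score than node 2, while equal weights reproduce the unweighted ranking.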