# LAW Software

This page describes the Java software developed by the members of the laboratory, or hosted by the LAW albeit developed by third parties.

## BUbiNG

BUbiNG is the next-generation web crawler built upon the authors' experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is an open-source Java fully distributed crawler (no central coordination); single agents, using sizable hardware, can crawl several thousands pages per second respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols so to achieve very high throughput.

BUbiNG is distributed as part of the LAW software under the GNU General Public License.

In case you are a webmaster, and want to **stop the crawler** from accessing your site, please follow these instructions.

## Fast Computation of the Neighbourhood Function and Distance Distribution

We have designed and engineered an algorithm that computes quickly the neighbourhood function and a number of*geometric centralities*of very large graphs. We provide a Java implementation in WebGraph. Details can be found in our papers “In-core computation of geometric centralities with HyperBall: A hundred billion nodes and beyond” and “HyperANF: Approximating the neighbourhood function of very large graphs on a budget”. It has been used to measure Facebook.

## Compression by Layered Label Propagation

In *Layered label propagation: A multiresolution
coordinate-free ordering for compressing social networks* (*Proceedings of the 20th international conference on World Wide
Web*, 2011) we propose a new way to permute graphs so to increase their *locality*, which in turn entails much
better compression ratios when using WebGraph. The software written to that purpose is
distributed in the LAW software under the GNU General Public License.

## Computing PageRank and its Derivatives

In *PageRank: Functional dependencies*
(ACM Trans. Inf. Sys., 27(4):1-23, 2009) we proved that there is a simple way to compute the derivatives of
PageRank (w.r.t. the damping factor) of any order, and also to compute the MacLaurin polynomial coefficients of PageRank during
the PageRank computation for a fixed damping factor. As a result, once the coefficient have been stored it is possible to compute
directly the value of PageRank (approximated by the same number of iterations) for *any* other value of the damping factor.
The software written to that purpose is distributed in the LAW software under the
GNU General Public License.

## Kendall's τ

In *Paradoxical Effects in PageRank Incremental Computations* (Internet Math., 2(3):387-404, 2005),
we studied the way PageRank evolves during a crawl following
several different strategies. The study required the computation of Kendall's τ on a large data set.

Evaluating Kendall's τ on large data samples is not so common, and there is a
remarkable scarcity of literature on the subject of computing τ in an
efficient way. Clearly, the brute-force *O*(`n`^{2})
approach is easy to implement, but useless on a web graph.
Knight presented in the sixties an *O*(`n` log `n`) algorithm for the computation of τ (William R. Knight,
A Computer Method for Calculating Kendall's Tau with Ungrouped Data, *J. Amer. Statist. Assoc.*, vol. 61, N. 314,
1966, pages 436−439), but the only implementation we are aware of is part of the SAS system.
Ideally, sophisticated data structures such as those described
by Dietz (Paul F. Dietz, Maintaining order in a linked list. In *Proceedings of the 14th Annual ACM Symposium on Theory of
Computing*, pages 122−127, San Francisco, California, 1982)
could give an *O*(`n` log `n` / log log `n`) bound, but this approach is unfeasible in practice when the data
set is so large that the rankings occupy a significant fraction of the
available core memory.

Since we are interested in evaluating Kendall's τ on large graphs, we decided to write a direct, efficient implementation of Knight's algorithm, with a special variant needed to treat ties. This variant was indeed briefly described at the end of Knight's paper, but the details were completely omitted.

We distribute the classes computing Kendall's τ in the LAW software under the GNU General Public License.

## UbiCrawler

UbiCrawler is a scalable, fault-tolerant and fully distributed web crawler, described in
*UbiCrawler: A scalable fully distributed web crawler* (Software: Practice & Experience,
34(8):711-726, 2004).
The main features of UbiCrawler are platform independence,
linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing)
for partitioning the domain to crawl, and more in general the complete decentralization of every task.
The classes computing consistent hashing are part of the LAW software.

UbiCrawler (a joint effort of the LAW and of the Istituto di Informatica e Telematica of the italian National Council of Research) is not distributed publicly, on the other hand, BUbiNG is our new crawler released as open-source.

## WebGraph

WebGraph is a framework to study the web graph. It provides simple ways to manage very large graphs, exploiting modern compression techniques. Using WebGraph is as easy as installing a few jar files and downloading a data set. This makes studying phenomena such as PageRank, distribution of graph properties of the web graph, etc. very easy.

## MG4J

MG4J (*Managing
Gigabytes* for Java) is a free full-text search engine
for large document collections written in Java. MG4J is a
highly customisable, high-performance, full-fledged search
engine providing state-of-the-art features (such as BM25/BM25F
scoring) and new research algorithms.

## Satellite projects

**Voting in social networks**. This project is related to recent work on voting in social networks in CIKM (Conference on Information and Knowledge Management, ACM Press, 2009) and Communications of the ACM (to appear). This page refers to the code needed to run the DBLP experiments described in the paper. The software is available as Java source, along with a Ruby script needed to process the XML raw data from DBLP.**Weighted graphs**. This package implements a simple arc-weighted graph in which each arc has a weight expressed as a float. The package also contains a class for computing a PageRank scoring that is sensitive to these weights. The score of each node is distributed to its successors in proportion to the relative weights of the out-links to such predecessors. If all the weights are equal, this gives the normal PageRank. The software is available as Java source.**Approximate triangle counting.**This software is a joint effort (developed by members of the LAW, people from Yahoo! Research and by Sapienza Università di Roma) to build up tools for exact and approximate triangle counting in large (Web-scale) graphs, as described in:*Efficient Algorithms for Large-Scale Triangle Counting*(ACM Transactions on Knowledge Discovery from Data, 2010). The software is available as Java source.**Triangular random walks.**This package implements the classes related to triangular random walks as described in a (yet unpublished) paper by Paolo Boldi and Marco Rosa. The software is available as Java source and binary (i.e., jar with docs).