This page describes the Java software developed by the members of the LAW.
BUbiNG is the next-generation web crawler built upon the authors' experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed Java crawler (no central coordination); single agents, using sizable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG's job distribution is based on modern high-speed protocols so as to achieve very high throughput.
If you are a webmaster and want to stop the crawler from accessing your site, please follow these instructions.
Fast Computation of the Neighbourhood Function and Distance Distribution
We have designed and engineered an algorithm that quickly computes the neighbourhood function and a number of geometric centralities of very large graphs. We provide a Java implementation in WebGraph. It has been used to measure Facebook.
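For intuition, the neighbourhood function N(t) of a graph counts the pairs of nodes (x, y) such that y is reachable from x in at most t steps; the differences N(t) − N(t − 1) give the number of pairs at distance exactly t. The following minimal Java sketch computes N(t) exactly with one breadth-first visit per node on a tiny graph. It is not the algorithm implemented in WebGraph: exact visits like these do not scale to very large graphs. The class name, method name, and adjacency-array representation are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

// Illustrative only: exact neighbourhood function by repeated BFS (hypothetical names).
final class NeighbourhoodFunctionSketch {

    /** Returns N(0), N(1), ..., N(d), where d is the largest finite distance in the graph. */
    static long[] neighbourhoodFunction(int[][] successors) {
        int n = successors.length;
        // countAtDistance[t] = number of ordered pairs (x, y) with d(x, y) == t.
        long[] countAtDistance = new long[Math.max(n, 1)];
        int maxDistance = 0;
        int[] dist = new int[n];
        ArrayDeque<Integer> queue = new ArrayDeque<>();

        for (int source = 0; source < n; source++) {
            Arrays.fill(dist, -1); // -1 marks nodes not yet reached from this source
            dist[source] = 0;
            queue.add(source);
            while (!queue.isEmpty()) {
                int x = queue.poll();
                countAtDistance[dist[x]]++;
                maxDistance = Math.max(maxDistance, dist[x]);
                for (int y : successors[x]) {
                    if (dist[y] == -1) {
                        dist[y] = dist[x] + 1;
                        queue.add(y);
                    }
                }
            }
        }

        // Prefix sums turn per-distance counts into the cumulative function N(t).
        long[] result = new long[maxDistance + 1];
        long running = 0;
        for (int t = 0; t <= maxDistance; t++) {
            running += countAtDistance[t];
            result[t] = running;
        }
        return result;
    }
}
```

On the directed path 0 → 1 → 2, given as `new int[][] { {1}, {2}, {} }`, the method returns 3, 5, 6: three pairs at distance 0, two more at distance 1, and one more at distance 2.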
Compression by Layered Label Propagation
We devised a new way to permute graphs so as to increase their locality, which in turn entails much better compression ratios when using WebGraph. The software written for that purpose is distributed in the LAW software under the GNU General Public License.
Computing PageRank and its Derivatives
We proved that there is a simple way to compute the derivatives of PageRank (w.r.t. the damping factor) of any order, and also to compute the Maclaurin polynomial coefficients of PageRank during the PageRank computation for a fixed damping factor. As a result, once the coefficients have been stored, it is possible to compute directly the value of PageRank (approximated by the same number of iterations) for any other value of the damping factor. The software written for that purpose is distributed in the LAW software under the GNU General Public License.
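To make the idea concrete, here is a minimal Java sketch under simplifying assumptions; it is not the code distributed in the LAW software, and all names in it are hypothetical. It assumes a small dense row-stochastic matrix P (no dangling nodes) and a preference vector v. Writing PageRank as r(α) = (1 − α) Σ α^t v P^t, the coefficient of α^0 is v and the coefficient of α^t, for t ≥ 1, is v P^t − v P^(t−1). As a further simplification, the sketch computes the coefficients directly from the vectors v P^t rather than extracting them from a run at a fixed damping factor.

```java
// Illustrative only: Maclaurin coefficients of PageRank on a tiny dense matrix.
final class PageRankMaclaurinSketch {

    /** Multiplies the row vector x by the matrix p. */
    static double[] times(double[] x, double[][] p) {
        int n = x.length;
        double[] y = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[j] += x[i] * p[i][j];
        return y;
    }

    /** Returns a[0..k], with a[0] = v and a[t] = v P^t - v P^(t-1) for t >= 1. */
    static double[][] coefficients(double[][] p, double[] v, int k) {
        double[][] a = new double[k + 1][];
        a[0] = v.clone();
        double[] previous = v.clone(); // holds v P^(t-1)
        for (int t = 1; t <= k; t++) {
            double[] current = times(previous, p); // v P^t
            a[t] = new double[v.length];
            for (int i = 0; i < v.length; i++) a[t][i] = current[i] - previous[i];
            previous = current;
        }
        return a;
    }

    /** Evaluates the truncated polynomial: the sum over t of alpha^t * a[t]. */
    static double[] evaluate(double[][] a, double alpha) {
        double[] r = new double[a[0].length];
        double power = 1; // alpha^t
        for (double[] coefficient : a) {
            for (int i = 0; i < r.length; i++) r[i] += power * coefficient[i];
            power *= alpha;
        }
        return r;
    }

    /** Runs k power-method steps x <- alpha x P + (1 - alpha) v, starting from x = v. */
    static double[] powerMethod(double[][] p, double[] v, double alpha, int k) {
        double[] x = v.clone();
        for (int t = 0; t < k; t++) {
            double[] y = times(x, p);
            for (int i = 0; i < y.length; i++) y[i] = alpha * y[i] + (1 - alpha) * v[i];
            x = y;
        }
        return x;
    }
}
```

Under these assumptions, `evaluate(coefficients(p, v, 20), 0.5)` yields the same vector, up to rounding, as `powerMethod(p, v, 0.5, 20)`, and the stored coefficients can be reused for any other damping factor without touching the matrix again.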
In "Paradoxical Effects in PageRank Incremental Computations" we studied the way PageRank evolves during a crawl under several different strategies. The study required the computation of Kendall's τ on a large data set.
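Kendall's τ compares two score vectors over the same items by counting concordant and discordant pairs. The minimal Java sketch below uses the simplest variant (often called τ-a), which is adequate only when there are no ties; tied pairs are counted as neither concordant nor discordant. It runs in quadratic time, so it illustrates the definition only and would not be practical on a large data set, where an O(n log n) method is needed. The class and method names are hypothetical, and this is not the implementation used in the study.

```java
// Illustrative only: quadratic-time Kendall's tau (tau-a variant, hypothetical names).
final class KendallTauSketch {

    /** Returns (concordant - discordant) / (n (n - 1) / 2) for two score vectors. */
    static double tau(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2)
            throw new IllegalArgumentException("Need two vectors of equal length >= 2");
        int n = x.length;
        long concordant = 0;
        long discordant = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // The product of the signs is +1 if the pair is ordered the same way
                // in both vectors, -1 if ordered oppositely, and 0 if tied in either.
                double s = Math.signum(x[i] - x[j]) * Math.signum(y[i] - y[j]);
                if (s > 0) concordant++;
                else if (s < 0) discordant++;
            }
        }
        double pairs = (double) n * (n - 1) / 2; // computed in double to avoid int overflow
        return (concordant - discordant) / pairs;
    }
}
```

Two identical rankings give τ = 1, and a ranking compared with its reverse gives τ = −1.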
UbiCrawler is a scalable, fault-tolerant and fully distributed web crawler. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. The classes computing consistent hashing are part of the LAW software.
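The general idea of a consistent-hashing assignment function can be sketched as follows. This is a generic textbook-style illustration in Java, not the classes shipped in the LAW software: the class name, the method names, the number of replicas, and the hash function (FNV-1a followed by a 64-bit mixing step) are all assumptions chosen for the example. Each agent is placed at several pseudo-random points on a ring, and a host is assigned to the agent owning the first point at or after the host's own hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a generic consistent-hashing ring (hypothetical names).
final class ConsistentHashSketch {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    ConsistentHashSketch(int replicas) {
        this.replicas = replicas;
    }

    /** An arbitrary 64-bit string hash: FNV-1a, then a final mixing step. */
    static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Places the agent at `replicas` points on the ring. */
    void addAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.put(hash(agent + "#" + i), agent);
    }

    /** Removes all points of the agent, e.g. after a fault. */
    void removeAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.remove(hash(agent + "#" + i));
    }

    /** Returns the agent owning the first point at or after the host's hash, wrapping around. */
    String agentFor(String host) {
        if (ring.isEmpty()) throw new IllegalStateException("No agents available");
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(host));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }
}
```

The property that matters for a decentralized crawler is that when an agent disappears, only the hosts it was responsible for are reassigned; every other host keeps its agent, so no global reshuffling is required.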
UbiCrawler (a joint effort of the LAW and of the Istituto di Informatica e Telematica of the Italian National Council of Research) is not distributed publicly; BUbiNG, on the other hand, is our new crawler, released as open source.
WebGraph is a framework to study the web graph. It provides simple ways to manage very large graphs, exploiting modern compression techniques. Using WebGraph is as easy as installing a few jar files and downloading a data set. This makes studying phenomena such as PageRank, distribution of graph properties of the web graph, etc. very easy.
MG4J (Managing Gigabytes for Java) is a free full-text search engine for large document collections written in Java. MG4J is a highly customisable, high-performance, full-fledged search engine providing state-of-the-art features (such as BM25/BM25F scoring) and new research algorithms.