Publications
Journals
 Paolo Boldi, Flavio Furia, and
Sebastiano Vigna.
Monotonicity in undirected networks.
Network Science, 11(3):351−373, 2023.
 Abstract

Is it always beneficial to create a new relationship (have a new follower/friend) in a social network? This question can be formally stated as a property of the centrality measure that defines the importance of the actors of the network. Score monotonicity means that adding an arc increases the centrality score of the target of the arc; rank monotonicity means that adding an arc improves the importance of the target of the arc relatively to the remaining nodes. It is known that most centralities are both score and rank monotone on directed, strongly connected graphs. In this paper, we study the problem of score and rank monotonicity for classical centrality measures in the case of undirected networks: in this case, we require that score, or relative importance, improve at both endpoints of the new edge. We show that, surprisingly, the situation in the undirected case is very different, and in particular that closeness, harmonic centrality, betweenness, eigenvector centrality, Seeley's index, Katz's index, and PageRank are not rank monotone; betweenness and PageRank are not even score monotone. In other words, while it is always a good thing to get a new follower, it is not always beneficial to get a new friend.
 doi:10.1017/nws.2022.42; download from arXiV.
 David Blackman and Sebastiano
Vigna.
A new test for Hammingweight dependencies.
ACM Trans. Model. Comput. Simul., 32(3):1−19, 2022.
 Abstract

We describe a new statistical test for pseudorandom number generators (PRNGs). Our test can find bias induced by dependencies among the Hamming weights of the outputs of a PRNG, even for PRNGs that pass stateoftheart tests of the same kind from the literature, and in particular for generators based on F_{2}linear transformations such as the dSFMT, xoroshiro1024+, and WELL512.
 doi:10.1145/3527582; PDF version; more data can be found here.
 Guy L. Steele Jr. and
Sebastiano Vigna.
LXM: Better splittable pseudorandom number generators (and almost as fast).
Proc. ACM Program. Lang., 5(OOPSLA):1−31, 2021.
 Abstract

In 2014, Steele, Lea, and Flood presented SplitMix, an objectoriented pseudorandom number generator (prng) that is quite fast (9 64bit arithmetic/logical operations per 64 bits generated) and also splittable. A conventional prng object provides a generate method that returns one pseudorandom value and updates the state of the prng; a splittable prng object also has a second operation, split, that replaces the original prng object with two (seemingly) independent prng objects, by creating and returning a new such object and updating the state of the original object. Splittable prng objects make it easy to organize the use of pseudorandom numbers in multithreaded programs structured using forkjoin parallelism. This overall strategy still appears to be sound, but the specific arithmetic calculation used for generate in the SplitMix algorithm has some detectable weaknesses, and the period of any one generator is limited to 2^{64}.
Here we present the LXM family of prng algorithms. The idea is an old one: combine the outputs of two independent prng algorithms, then (optionally) feed the result to a mixing function. An LXM algorithm uses a linear congruential subgenerator and an F_{2}linear subgenerator; the examples studied in this paper use a linear congruential generator (LCG) of period 2^{16}, 2^{32}, 2^{64}, or 2^{128} with one of the multipliers recommended by L’Ecuyer or by Steele and Vigna, and an F_{2}linear xorbased generator (XBG) of the
xoshiro
family orxoroshiro
family as described by Blackman and Vigna. For mixing functions we study the MurmurHash3 finalizer function; variants by David Stafford, Doug Lea, and degski; and the null (identity) mixing function.Like SplitMix, LXM provides both a generate operation and a split operation. Also like SplitMix, LXM requires no locking or other synchronization (other than the usual memory fence after instance initialization), and is suitable for use with simd instruction sets because it has no branches or loops.
We analyze the period and equidistribution properties of LXM generators, and present the results of thorough testing of specific members of this family, using the TestU01 and PractRand test suites, not only on single instances of the algorithm but also for collections of instances, used in parallel, ranging in size from 2 to 2^{24}. Single instances of LXM that include a strong mixing function appear to have no major weaknesses, and LXM is significantly more robust than SplitMix against accidental correlation in a multithreaded setting. We believe that LXM, like SplitMix, is suitable for “everyday” scientific and machinelearning applications (but not cryptographic applications), especially when concurrent threads or distributed processes are involved.
 doi:10.1145/3485525; code
can be found in the
java.util.random
package (Java 17+).
 Pauli Virtanen, Ralf Gommers,
Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni
Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al.
SciPy 1.0: fundamental algorithms for scientific computing in Python.
Nature methods, 17(3):261−272, 2020.
 Abstract

SciPy is an opensource scientific computing library for the Python programming language. Since its initial release in 2001, SciPy has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories and millions of downloads per year. In this work, we provide an overview of the capabilities and development practices of SciPy 1.0 and highlight some recent technical developments.
 Nature Methods link.
 Guy L. Steele Jr. and
Sebastiano Vigna.
Computationally easy, spectrally good multipliers for congruential
pseudorandom number generators.
Software: Practice and Experience, 52(2):443−458, 2022.
 Abstract

Congruential pseudorandom number generators rely on good multipliers, that is, integers that have good performance with respect to the spectral test. We provide lists of multipliers with a good lattice structure up to dimension eight and up to lag eight for generators with typical poweroftwo moduli, analyzing in detail multipliers close to the square root of the modulus, whose product can be computed quickly.
 doi:10.1002/spe.3030; code and data can be found on GitHub.
 David Blackman and Sebastiano
Vigna.
Scrambled linear pseudorandom number generators.
ACM Trans. Math. Softw., 47:1−32, 2021.
 Abstract

F_{2}linear pseudorandom number generators are very popular due to their high speed, to the ease with which generators with a sizable state space can be created, and to their provable theoretical properties. However, they suffer from linear artifacts that show as failures in linearityrelated statistical tests such as the binaryrank and the linearcomplexity test. In this article, we give two new contributions. First, we introduce two new F_{2}linear transformations that have been handcrafted to have good statistical properties and at the same time to be programmable very efficiently on superscalar processors, or even directly in hardware. Then, we describe some scramblers, that is, nonlinear functions applied to the state array that reduce or delete the linear artifacts, and propose combinations of linear transformations and scramblers that give extremely fast pseudorandom number generators of high quality. A novelty in our approach is that we use ideas from the theory of filtered linearfeedback shift registers to prove some properties of our scramblers, rather than relying purely on heuristics. In the end, we provide simple, extremely fast generators that use a few hundred bits of memory, have provable properties, and pass strong statistical tests.
 doi:10.1145/3460772; PDF
version; more data can be found at the
xoshiro
/xoroshiro
generators and the PRNG shootout page;xoroshiro128++
andxoshiro256++
are now part of thejava.util.random
package (Java 17+).
 Marco Genuzio, Giuseppe
Ottaviano, and Sebastiano Vigna.
Fast scalable construction of ([compressed] static  minimal perfect hash)
functions.
Information and Computation, 273:104517, 2020.
 Abstract

Recent advances in the analysis of random linear systems on finite fields have paved the way for the construction of constanttime data structures representing static functions and minimal perfect hash functions using less space with respect to existing techniques. The main obstacle for any practical application of these results is the time required to solve such linear systems: despite they can be made very small, the computation is still too slow to be feasible. In this paper, we describe in detail a number of heuristics and programming techniques to speed up the solution of these systems by orders of magnitude, making the overall construction competitive with the standard and widely used MWHC technique, which is based on hypergraph peeling. In particular, we introduce broadword programming techniques for fast equation manipulation and a lazy Gaussian elimination algorithm. We also describe a number of technical improvements to the data structure which further reduce space usage and improve lookup speed. Our implementation of these techniques yields a minimal perfect hash function data structure occupying 2.24 bits per element, compared to 2.68 for MWHCbased ones, and a static function data structure which reduces the multiplicative overhead from 1.23 to 1.024. For functions whose output has low entropy, we are able to implement feasibly for the first time the Hreinsson–Krøyer–Pagh approach, which makes it possible, for example, to store a function with an output of 10^{6} values distributed following a power law of exponent 2 in just 2.76 bits per key instead of 20.
 doi:10.1016/j.ic.2020.104517. The algorithms described in the paper are implemented in Sux4J.
 Sebastiano Vigna.
On the probability of overlap of random subsequences of pseudorandom number
generators.
Information Processing Letters, 158, 2020.
 Abstract

We analyze in detail the probability that sequences of equal length generated by a pseudorandom number generator starting from random points of the state space overlap, providing for the first time an exact result and manageable bounds. While the computation of the probability is almost elementary, the value has been reported erroneously several times in the literature.
 The combinatorial part of this paper has actually been proved in 1968 by Joseph I. Naus (thanks to Samuel Neves for reporting this reference).
 doi:10.1016/j.ipl.2020.105939; PDF version.
 Stefano Marchini and
Sebastiano Vigna.
Compact Fenwick trees for dynamic ranking and selection.
Software: Practice & Experience, 50(7):1184−1202, 2020.
 Abstract

The Fenwick tree is a classical implicit data structure that stores an array in such a way that modifying an element, accessing an element, computing a prefix sum and performing a predecessor search on prefix sums all take logarithmic time. We introduce a number of variants which improve the classical implementation of the tree: in particular, we can reduce its size when an upper bound on the array element is known, and we can perform much faster predecessor searches. Our aim is to use our variants to implement an efficient dynamic bit vector: our structure is able to perform updates, ranking and selection in logarithmic time, with a space overhead in the order of a few percents, outperforming existing data structures with the same purpose. Along the way, we highlight the pernicious interplay between the arithmetic behind the Fenwick tree and the structure of current CPU caches, suggesting simple solutions that improve performance significantly.
 Download from arXiV; the algorithms described in the paper are implemented in Sux.
 Paolo Boldi, Andrea Marino,
Massimo Santini, and Sebastiano Vigna.
BUbiNG: Massive crawling for the masses.
ACM Trans. Web, 12(2):12:1−12:26, 2019.
 Abstract

Although web crawlers have been around for twenty years by now, there is virtually no freely available, opensource crawling software that guarantees high throughput, overcomes the limits of singlemachine systems, and, at the same time, scales linearly with the amount of resources available. This article aims at filling this gap, through the description of BUbiNG, our nextgeneration web crawler built upon the authors’ experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is an opensource Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second respecting strict politeness constraints, both host and IPbased. Unlike existing opensource distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern highspeed protocols to achieve very high throughput.
 PDF version; online version; BUbiNG can be downloaded from the LAW web site or from Maven.
 Paolo Boldi, Alessandro
Luongo, and Sebastiano Vigna.
Rank monotonicity in centrality measures.
Network Science, 5(4):529−550, 2017.
 Abstract

A measure of centrality is rank monotone if after adding an arc x → y, all nodes with a score smaller than (or equal to) y have still a score smaller than (or equal to) y. If in particular all nodes with a score smaller than or equal to y get a score smaller than y (i.e., all ties with y are broken in favor of y) the measure is called strictly rank monotone. We prove that harmonic centrality is strictly rank monotone, whereas closeness is just rank monotone on strongly connected graphs, and that some other measures, including betweenness, are not rank monotone at all (sometimes not even on strongly connected graphs). Among spectral measures, damped scores such as Katz's index and PageRank are strictly rank monotone on all graphs, whereas the dominant eigenvector is strictly monotone on strongly connected graphs only.
 PDF version; doi:10.1017/nws.2017.21.
 Paolo Boldi and Sebastiano
Vigna.
On the lattice of antichains of finite intervals.
Order, 38(1):57−81, 2018.
 Abstract

Motivated by applications to information retrieval, we study the lattice of antichains of finite intervals of a locally finite, totally ordered set. Intervals are ordered by reverse inclusion; the order between antichains is induced by the lower set they generate. We discuss in general properties of such antichain completions; in particular, their connection with Alexandrov completions. We prove the existence of a unique, irredundant ∧representation by ∧irreducible elements, which makes it possible to write the relative pseudocomplement in closed form. We also discuss in details properties of additional interesting operators used in information retrieval. Finally, we give a formula for the rank of an element and for the height of the lattice.
 doi:10.1007/s1108301694188; PDF version; a Java™ package to manipulate lattices of interval antichains is available at the LaMa4J home page.
 Sebastiano Vigna.
Spectral ranking.
Network Science, 4(4):433−445, 2016.
 Abstract

We sketch the history of spectral ranking—a general umbrella name for techniques that apply the theory of linear maps (in particular, eigenvalues and eigenvectors) to matrices that do not represent geometric transformations, but rather some kind of relationship between entities. Albeit recently made famous by the ample press coverage of Google's PageRank algorithm, spectral ranking was devised more than a century ago, and has been studied in psychology, social sciences, bibliometrics, economy, and choice theory. We describe the contribution given by previous scholars in precise and modern mathematical terms: Along the way, we show how to express in a general way damped rankings, such as Katz's index, as dominant eigenvectors of perturbed matrices, and then use results on the Drazin inverse to go back to the dominant eigenvectors by a limit process. The result suggests a regularized definition of spectral ranking that yields for a general matrix a unique vector depending on a boundary condition.
 Current version; Network Science version.
 Sebastiano Vigna.
Further scramblings of Marsaglia's
xorshift
generators. Journal of Computational and Applied Mathematics, 315:175−181, 2016. Abstract

xorshift*
generators are a variant of Marsaglia'sxorshift
generators that eliminate linear artifacts typical of generators based on Z/2Zlinear operations using multiplication by a suitable constant. Shortly after highdimensionalxorshift*
generators were introduced, Saito and Matsumoto suggested a different way to eliminate linear artifacts based on addition in Z/2^{32}Z, leading to theXSadd
generator. Starting from the observation that the lower bits ofXSadd
are very weak, as its reverse fails systematically several statistical tests, we explore variants ofXSadd
using 64bit operations, and describe in detailxorshift128+
, an extremely fast generator that passes strong statistical tests using only three shifts, four xors and an addition.  PDF version;
doi:10.1016/j.cam.2016.11.006;
more data can be found at the
xoshiro
/xoroshiro
generators and the PRNG shootout page.
 Paolo Boldi and Sebastiano
Vigna.
Efficient optimally lazy algorithms for minimalinterval semantics.
Theoretical Computer Science, 648:8−25, 2016.
 Abstract
Minimalinterval semantics associates with each query over a document a set of intervals, called witnesses, that are incomparable with respect to inclusion (i.e., they form an antichain): witnesses define the minimal regions of the document satisfying the query. Minimalinterval semantics makes it easy to define and compute several sophisticated proximity operators, provides snippets for user presentation, and can be used to rank documents. In this paper we provide algorithms for computing conjunction and disjunction that are linear in the number of intervals and logarithmic in the number of operands; for additional operators, such as ordered conjunction and Brouwerian difference, we provide linear algorithms. In all cases, space is linear in the number of operands. More importantly, we define a formal notion of optimal laziness, and either prove it, or prove its impossibility, for each algorithm. We cast our results in a general framework of antichains of intervals on total orders, making our algorithms directly applicable to other domains.
 PDF version; doi:10.1016/j.tcs.2016.07.036; the algorithms described in the paper are implemented in MG4J and LaMa4J.
 Robert Meusel, Sebastiano
Vigna, Oliver Lehmberg, and Christian Bizer.
The graph structure in the web—Analyzed on different aggregation levels.
The Journal of Web Science, 1(1):33−47, 2015.
 Abstract

Knowledge about the general graph structure of theWorldWideWeb is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3:5 billion web pages and 128:7 billion hyperlinks. We analyze and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components. We conduct our analysis on three different levels of aggregation: page, host, and paylevel domain (PLD) (one “dot level” above public suffixes). Our analysis shows that, as evidenced by previous research (Serrano et al., 2007), some of the features previously observed by Broder et al., 2000 are very dependent on artifacts of the crawling process, whereas other appear to be more structural. We confirm the existence of a giant strongly connected component; we however find, as observed by other researchers (Donato et al., 2005; Boldi et al., 2002; BaezaYates and Poblete, 2003), very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bowtie structure” as described by Broder et al. is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the Web. More importantly, statistical testing and visual inspection of sizerank plots show that the distributions of indegree, outdegree and sizes of strongly connected components of the page and host graph are not power laws, contrarily to what was previously reported for much smaller crawls, although they might be heavy tailed. If we aggregate at paylevel domain, however, a power law emerges. We also provide for the first time accurate measurement of distancebased features, using recently introduced algorithms that scale to the size of our crawl (Boldi and Vigna, 2013).
 PDF version; doi:10.1561/106.00000003.
 Sebastiano Vigna.
An experimental exploration of Marsaglia's
xorshift
generators, scrambled. ACM Trans. Math. Software, 42(4), 2016. Abstract

Marsaglia proposed recently
xorshift
generators as a class of very fast, goodquality pseudorandom number generators. Subsequent analysis by Panneton and L'Ecuyer has lowered the expectations raised by Marsaglia's paper, showing several weaknesses of such generators, verified experimentally using the TestU01 suite. Nonetheless, many of the weaknesses ofxorshift
generators fade away if their result is scrambled by a nonlinear operation (as originally suggested by Marsaglia). In this paper we explore the space of possible generators obtained by multiplying the result of axorshift
generator by a suitable constant. We sample generators at 100 equispaced points of their state space and obtain detailed statistics that lead us to choices of parameters that improve on the current ones. We then explore for the first time the space of highdimensionalxorshift
generators, following another suggestion in Marsaglia's paper, finding choices of parameters providing periods of length 2^{1024} − 1 and 2^{4096} − 1. The resulting generators are of extremely high quality, faster than current similar alternatives, and generate longperiod sequences passing strong statistical tests using only eight logical operations, one addition and one multiplication by a constant.  PDF
Version; doi.org:10.1145/2845077; more data
can be found at the
xoshiro
/xoroshiro
generators and the PRNG shootout page.
 Paolo Boldi and Sebastiano
Vigna.
Axioms for centrality.
Internet Math., 10(34):222−262, 2014.
 Abstract

Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. In this paper, we try to provide a mathematically sound survey of the most important classic centrality measures known from the literature and propose an axiomatic approach to establish whether they are actually doing what they have been designed for. Our axioms suggest some simple, basic properties that a centrality measure should exhibit.
Surprisingly, only a new simple measure based on distances, harmonic centrality, turns out to satisfy all axioms; essentially, harmonic centrality is a correction to Bavelas's classic closeness centrality designed to take unreachable nodes into account in a natural way.
As a sanity check, we examine in turn each measure under the lens of information retrieval, leveraging stateoftheart knowledge in the discipline to measure the effectiveness of the various indices in locating web pages that are relevant to a query. While there are some examples of such comparisons in the literature, here for the first time we also take into consideration centrality measures based on distances, such as closeness, in an informationretrieval setting. The results closely match the data we gathered using our axiomatic approach.
Our results suggest that centrality measures based on distances, which in the last years have been neglected in information retrieval in favor of spectral centrality measures, do provide highquality signals; moreover, harmonic centrality pops up as an excellent generalpurpose centrality index for arbitrary directed graphs.
 PDF version.
 Paolo Boldi, Marco Rosa, and
Sebastiano Vigna.
Robustness of social and web graphs to node removal.
Social Network Analysis and Mining, 3(4):829−842, 2013.
 Abstract

Given a social network, which of its nodes have a stronger impact in determining its structure? More precisely: which noderemoval order has the greatest impact on the network structure? We approach this wellknown problem for the first time in a setting that combines both web graphs and social networks. Our experiments are performed on datasets that are orders of magnitude larger than those appearing in the previous literature: this is possible thanks to some recently developed algorithms and software tools that approximate accurately the number of reachable pairs and the distribution of distances in large graphs. Our experiments highlight deep differences in the structure of social networks and web graphs, show significant limitations of previous experimental results; at the same time, they reveal clustering by label propagation as a new and very effective way of locating nodes that are important from a structural viewpoint.
 Public datasets are available at the LAW site; Java™ implementations are available as free software at the WebGraph home page.
 Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. Viscous democracy for social networks. Commun. ACM, 54(6):129−137, June 2011.
 Paolo Boldi, Francesco Bonchi,
Carlos Castillo, and Sebastiano Vigna.
Query reformulation mining: models, patterns, and applications.
Information Retrieval, 14(3):257−289, 2011.
 Abstract

Understanding query reformulation patterns is a key task towards next generation web search engines. If we can do that, then we can build systems able to understand and possibly predict user intent, providing the needed assistance at the right time, and thus helping users locate information more effectively and improving their websearch experience. As a step in this direction, we build a very accurate model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92% accuracy. We then apply the model to automatically label two very large query logs sampled from different geographic areas, and containing a total of approximately 17 million query reformulations. We study the resulting reformulation patterns, matching some results from previous studies performed on smaller manually annotated datasets, and discovering new interesting reformulation patterns, including connections between reformulation types and topical categories. We annotate two large queryflow graphs with reformulation type information, and run several graphcharacterization experiments on these graphs, extracting new insights about the relationships between the different query reformulation types. Finally we study query recommendations based on short random walks on the queryflow graphs. Our experiments show that these methods can match in precision, and often improve, recommendations based on queryclick graphs, without the need of users' clicks. Our experiments also show that it is important to consider transitiontype labels on edges for having recommendations of good quality.
 Paolo Boldi, Massimo Santini,
and Sebastiano Vigna.
Permuting web and social graphs.
Internet Math., 6(3):257−283, 2010.
 Abstract

Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The authors of the LINK database, for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinatefree) orderings based on the rows of the adjacency matrix (lexicographic and Gray code); they concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework, and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transpose web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods. In particular, in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.
 PDF version; datasets and Java™ implementations are available as free software at the LAW site.
 Paolo Boldi, Massimo Santini,
and Sebastiano Vigna.
PageRank: Functional dependencies.
ACM Trans. Inf. Sys., 27(4):1−23, 2009.
 Abstract

PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α=0.85 by Brin and Page is still used. In this paper, we give a mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for realworld graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closedform formulae for PageRank derivatives of any order, and by proving that the kth iteration of the Power Method gives exactly the value obtained by truncating the PageRank power series at degree k, we show how to obtain an approximation of the derivatives. Finally, we view PageRank as a linear operator acting on the preference vector and show a tight connection between iterated computation and derivation.
 PDF version; doi.org:10.1145/1629096.1629097.
 Ricardo Baeza Yates, Paolo
Boldi, and Carlos Castillo.
Generic damping functions for propagating importance in linkbased ranking.
Internet Math., 3(4):445−478, 2006.
 Abstract

This paper introduces a family of linkbased ranking algorithms that propagate page importance through links. The algorithms include a damping function which decreases with distance, thus a direct link implies greater endorsement that a link via a longer path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques is of some interest per se, and how different choices for the damping function affect rank quality and convergence speed. Even though our results suggest that PageRank can be approximated with other more simple forms of rankings that may be computed more efficiently, our focus is more speculative in nature, given that it aims at separating the kernel of PageRank, that is, linkbased importance propagation, from the way propagation decays over paths. We focus on three damping functions that have linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. The work we carry includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Amongst other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to PageRank using a fixed number of iterations. Comparisons were made using Kendall's τ on large domain datasets.
 PDF version
 Luca Becchetti, Paolo
Boldi, Carlos Castillo, and Aristides Gionis.
Efficient algorithms for largescale local triangle counting.
ACM Transactions on Knowledge Discovery from Data, 2010.
 Abstract

In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v of V in the graph. We consider the question both for undirected and directed graphs. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first contribution that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs and that also considers the directed case. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in largescale Web graphs, as well as to provide useful features for content quality assessment in social networks. For computing the local number of triangles (undirected and directed) we propose two approximation algorithms, which are based on the idea of minwise independent permutations (Broder et al. 1998). Our algorithms operate in a semistreaming fashion, using O(V) space in main memory and performing O(log V) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(E) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in large graphs, demonstrating the practical efficiency of our approach.
 PDF version
 Paolo Boldi, Massimo Santini,
and Sebastiano Vigna.
Paradoxical effects in PageRank incremental computations.
Internet Math., 2(3):387−404, 2005.
 Abstract

Deciding which kind of visit accumulates highquality pages more quickly is one of the most often debated issue in the design of web crawlers. It is known that breadthfirst visits work well, as they tend to discover pages with high PageRank early on in the crawl. Indeed, this visit order is much better than depth first, which is in turn even worse than a random visit; nevertheless, breadthfirst can be superseded using an omniscient visit that chooses, at every step, the node of highest PageRank in the frontier.
This paper discusses a related, and previously overlooked, measure of effectivity for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields relative ranks that agree with the ones the nodes have in the complete graph; ranks are compared using Kendall's τ.
We describe a number of largescale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highestqualityfirst) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using wellknown webgraph models: the results are almost opposite to those obtained on real web graphs.
 PDF version; doi.org:10.1080/15427951.2005.10129106; Java™ classes for computing Kendall's τ efficiently are available as free software at the LAW site.
 Paolo Boldi, Violetta Lonati,
Massimo Santini, and Sebastiano Vigna.
Graph fibrations, graph isomorphism, and PageRank.
RAIRO Inform. Théor., 40:227−253, 2006.
 Abstract

PageRank is a ranking method that assigns scores to web pages using the limit distribution of a random walk on the web graph. A fibration of graphs is a morphism that is a local isomorphism of inneighbourhoods, much in the same way a covering projection is a local isomorphism of neighbourhoods. We show that a deep connection relates fibrations and Markov chains with restart, a particular kind of Markov chains that include the PageRank one as a special case. This fact provides constraints on the values that PageRank can assume. Using our results, we show that a recently defined class of graphs that admit a polynomialtime isomorphism algorithm based on the computation of PageRank is really a subclass of fibrationprime graphs, which possess simple, entirely discrete polynomialtime isomorphism algorithms based on classical techniques for graph isomorphism. We discuss efficiency issues in the implementation of such algorithms for the particular case of web graphs, in which O(n) space occupancy (where n is the number of nodes) may be acceptable, but O(m) is not (where m is the number of arcs).
 PDF version; doi.org:10.1051/ita:2006004; Java™ classes for computing a minimum base are available as free software at the LAW site.
 Paolo Boldi, Bruno Codenotti,
Massimo Santini, and Sebastiano Vigna.
UbiCrawler: A scalable fully distributed web crawler.
Software: Practice & Experience, 34(8):711−726, 2004.
 Abstract

We report our experience in implementing UbiCrawler, a scalable distributed web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and more in general the complete decentralization of every task. The necessity of handling very large sets of data has highlighted some limitation of the Java APIs, which prompted the authors to partially reimplement them.
 Paolo Boldi and Sebastiano
Vigna.
Codes for the World−Wide Web.
Internet Math., 2(4):405−427, 2005.
 Abstract

We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from webgraph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against powerlaw distributions, and compare the results with analogous estimates for the more classical γ, δ and variablelength block codes.
 PDF version; doi.org:10.1080/15427951.2005.10129113; to download software and data sets, please look at the WebGraph home page.
Conference Proceedings
 Paolo Boldi, Flavio Furia,
and Sebastiano Vigna.
Spectral rank monotonicity in undirected networks.
In Rosa Maria Benito, Chantal Cherifi, Hocine Cherifi, Esteban Moro, Luis M.
Rocha, and Marta Sales Pardo, editors, Complex Networks & Their
Applications X, volume 1014 of Studies in Computational
Intelligence, pages 234−246. Springer, 2022.
 Abstract

We study the problem of score and rank monotonicity for spectral ranking methods, such as eigenvector centrality and PageRank, in the case of undirected networks. Score monotonicity means that adding an edge increases the score at both ends of the edge. Rank monotonicity means that adding an edge improves the relative position of both ends of the edge with respect to the remaining nodes. It is known that common spectral rankings are both score and rank monotone on directed, strongly connected graphs. We show that, surprisingly, the situation is very different for undirected graphs, and in particular that PageRank is neither score nor rank monotone.
 PDF version; doi:10.1007/9783030934095_20; you can download the Sage code used in the paper.
 Paolo Boldi, Antoine
Pietri, Sebastiano Vigna, and Stefano Zacchiroli.
Ultralargescale repository analysis via graph compression.
In 2020 IEEE 27th International Conference on Software Analysis, Evolution
and Reengineering (SANER), pages 184−194. IEEE, 2020.
 Abstract

We consider the problem of mining the development history—as captured by modern version control systems—of ultralargescale software archives (e.g., tens of millions software repositories corresponding).
We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine similarlysized corpus. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects— encompassing a full mirror of GitHub.
The resulting compressed graph fits in less than 100GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed inmemory representation of the full corpus can be accessed with excellent performances, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.
 PDF version.
 Paolo Boldi and Sebastiano
Vigna.
The case for Kendall’s assortativity.
In Complex Networks and Their Applications VIII. Volume 2 Proceedings of
the Eighth International Conference on Complex Networks and Their
Applications COMPLEX NETWORKS 2019, volume 882 of Studies in
Computational Intelligence. Springer, 2020.
 Abstract

Since the seminal work of Litvak and van der Hofstad, it has been known that Newman’s assortativity, being based on Pearson’s correlation, is subject to a pernicious size effect which makes large networks with heavytailed degree distributions always unassortative. Usage of Spearman’s ρ, or even Kendall’s τ was suggested as a replacement but the treatment of ties was problematic for both measures. In this paper we first argue analytically that the tieaware version of τ solves the problems observed, and we show that Newman’s assortativity is heavily influenced by tightly knit communities. Then, we perform for the first time a set of largescale computational experiments on a variety of networks, comparing assortativity based on Kendall’s τ and assortativity based on Pearson’s correlation, showing that the pernicious effect of size is indeed very strong on realworld large networks, whereas the tieaware Kendall’s τ can be a practical, principled alternative.
 doi:10.1007/9783030366834_24. PDF version.
 Emmanuel Esposito, Thomas
Mueller Graf, and Sebastiano Vigna.
Recsplit: Minimal perfect hashing via recursive splitting.
In 2020 Proceedings of the Symposium on Algorithm Engineering and
Experiments (ALENEX), pages 175−185. SIAM, 2020.
 Abstract

A minimal perfect hash function bijectively maps a key set S out of a universe U into the first S natural numbers. Minimal perfect hash functions are used, for example, to map irregularlyshaped keys, such as strings, in a compact space so that metadata can then be simply stored in an array. While it is known that just 1.44 bits per key are necessary to store a minimal perfect hash function, no published technique can go below 2 bits per key in practice. We propose a new technique for storing minimal perfect hash functions with expected linear construction time and expected constant lookup time that makes it possible to build for the first time, for example, structures which need 1.56 bits per key, that is, within 8.3% of the lower bound, in less than 2ms per key. We show that instances of our construction are able to simultaneously beat the construction time, space usage and lookup time of the stateoftheart data structure reaching 2 bits per key. Moreover, we provide parameter choices giving structures which are competitive with alternative, largersize data structures in terms of space and lookup time. The construction of our data structures can be easily parallelized or mapped on distributed computational units (e.g., within the MapReduce framework), and structures larger than the available RAM can be directly built in mass storage.
 Download from arXiV. The algorithms described in the paper are implemented in Sux.
 Paolo Boldi and Sebastiano
Vigna.
Kings, name days, lazy servants and magic.
In Hiro Ito, Stefano Leonardi, Linda Pagli, and Giuseppe Prencipe, editors,
9th International Conference on Fun with Algorithms (FUN 2018),
volume 100 of Leibniz International Proceedings in Informatics
(LIPIcs), pages 10:1−10:13, Dagstuhl, Germany, 2018. Schloss
Dagstuhl−LeibnizZentrum fuer Informatik.
 Abstract

Once upon a time, a king had a very, very long list of names of his subjects. The king was also a bit obsessed with name days: every day he would ask his servants to look the list for all persons having their name day. Reading every day the whole list was taking an enormous amount of time to the king's servants. One day, the chancellor had a magnificent idea: he wrote a book with instructions. The number of pages in the book was equal to the number of names, but following the instructions one could find all people having their name day by looking at only a few pages—in fact, as many pages as the length of the name—and just glimpsing at the list. Everybody was happy, but in time the king's servants got lazy: when the name was very long they would find excuses to avoid looking at so many pages, and some name days were skipped. Desperate, the king made a call through its reign, and a fat sorceress answered. There was a way to look at much, much fewer pages using an additional magic book. But sometimes, very rarely, it would not work (magic does not always work). The king accepted the offer, and name days parties restarted. Only, once every a few thousand years, the magic book fails, and the assistants have to go by the chancellor book. So the parties start a bit later. But they start anyway.
 PDF version. The algorithms described in the paper are implemented in Sux4J.
 Marco Genuzio and Sebastiano
Vigna.
Engineering compressed static functions.
In 2018 Data Compression Conference, pages 52−61. IEEE, 2018.
 Abstract

Recent advances in the compact representation of static functions (with constant access time) have made it possible to fully exploit constructions based on random linear system. Such constructions, albeit theoretically appealing, were previously too slow to be usable. In this paper, we extend such techniques to the problem of storing compressed static functions, in the sense that the space used per key should be close to the entropy of the list of values. From a theoretical viewpoint, we are inspired by the approach of Hreinsson, Krøyer and Pagh. Values are represented using a nearoptimal instantaneous code. Then, a bit array is created so that by XOR'ing its content at a fixed number of positions depending on the key one obtains the value, represented by its associated codeword. In the construction phase, every bit of the array is associated with an equation on Z/2Z, and solving the associated system provides the desired representation. Thus, we pass from one equation per key (the noncompressed case) to one equation per bit: the size of the system is thus approximately multiplied by the empirical entropy of the values, making the problem much more challenging. We show that by carefully engineering the value representation we can obtain a practical data structure. For example, we can store a function with geometrically distributed output in just 2.28 bits per key, independently of the key set, with a construction time double with respect to that of a stateoftheart noncompressed function, which requires ≈log log n bits per key, where n is the number of keys, and slightly improved lookup time. We can also store a function with an output of 10^{6} values distributed following a power law of exponent 2 in just 2.75 bits per key, whereas a noncompressed function would require more than 20, with a threefold increase in construction time and significantly faster lookups.
 PDF version. The algorithms described in the paper are implemented in Sux4J.
 Marco Genuzio, Giuseppe
Ottaviano, and Sebastiano Vigna.
Fast scalable construction of (minimal perfect hash) functions.
In Andrew V. Goldberg and Alexander S. Kulikov, editors, Experimental
Algorithms: 15th International Symposium, SEA 2016, St. Petersburg, Russia,
June 58, 2016, Proceedings, number 9685 in Lecture Notes in Computer
Science, pages 339−352. Springer, 2016.
 Abstract

Recent advances in random linear systems on finite fields have paved the way for the construction of constanttime data structures representing static functions and minimal perfect hash functions using less space with respect to existing techniques. The main obstruction for any practical application of these results is the cubictime Gaussian elimination required to solve these linear systems: despite they can be made very small, the computation is still too slow to be feasible.
In this paper we describe in detail a number of heuristics and programming techniques to speed up the resolution of these systems by several orders of magnitude, making the overall construction competitive with the standard and widely used MWHC technique, which is based on hypergraph peeling. In particular, we introduce broadword programming techniques for fast equation manipulation and a lazy Gaussian elimination algorithm. We also describe a number of technical improvements to the data structure which further reduce space usage and improve lookup speed.
Our implementation of these techniques yields a minimal perfect hash function data structure occupying 2.24 bits per element, compared to 2.68 for MWHCbased ones, and a static function data structure which reduces the multiplicative overhead from 1.23 to 1.03.
 Jimmy Lin, Matt Crane, Andrew
Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll,
Craig Macdonald, and Sebastiano Vigna.
Toward reproducible baselines: The opensource IR reproducibility challenge.
In Nicola Ferro, Fabio Crestani, Marie Francine Moens, Josiane Mothe, Fabrizio
Silvestri, Maria Giorgio Di Nunzio, Claudia Hauff, and Gianmaria Silvello,
editors, Advances in Information Retrieval: 38th European Conference on
IR Research, ECIR 2016, Padua, Italy, March 2023, 2016. Proceedings,
pages 408−420. Springer International Publishing, 2016.
 Abstract

The OpenSource IR Reproducibility Challenge brought together developers of opensource search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can reproduce the submitted runs. Our vision is that these results would serve as widely accessible points of comparison in future IR research. This project represents an ongoing effort, but we describe the first phase of the challenge that was organized as part of a workshop at SIGIR 2015. We have succeeded modestly so far, achieving our main goals on the Gov2 collection with seven opensource search engines. In this paper, we describe our methodology, share experimental results, and discuss lessons learned as well as next steps.
 Github repository.
 Sebastiano Vigna.
A weighted correlation index for rankings with ties.
In Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa
Bertino, and Ravi Kumar, editors, Proceedings of the 24th international
conference on World Wide Web, pages 1166−1176. ACM, 2015.
 Abstract

Understanding the correlation between two different scores for the same set of items is a common problem in graph analysis and information retrieval. The most commonly used statistics that quantifies this correlation is Kendall's τ however, the standard definition fails to capture that discordances between items with high rank are more important than those between items with low rank. Recently, a new measure of correlation based on average precision has been proposed to solve this problem, but like many alternative proposals in the literature it assumes that there are no ties in the scores. This is a major deficiency in a number of contexts, and in particular when comparing centrality scores on large graphs, as the obvious baseline, indegree, has a very large number of ties in social networks and web graphs. We propose to extend Kendall's definition in a natural way to take into account weights in the presence of ties. We prove a number of interesting mathematical properties of our generalization and describe an O(n log n) algorithm for its computation. We also validate the usefulness of our weighted measure of correlation using experimental data on social networks and web graphs.
 PDF version.
 Robert Meusel, Sebastiano
Vigna, Oliver Lehmberg, and Christian Bizer.
Graph structure in the web — Revisited, or a trick of the heavy tail.
In WWW'14 Companion, pages 427−432. International World Wide Web
Conferences Steering Committee, 2014.
 Abstract

Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we describe and analyse a large, publicly accessible crawl of the web that was gathered by the Common Crawl Foundation in 2012 and that contains over 3.5 billion web pages and 128.7 billion links. This crawl makes it possible to observe the evolution of the underlying structure of the World Wide Web within the last 10 years: we analyse and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components.
Our analysis shows that, as evidenced by previous research, some of the features previously observed by Broder et al. are very dependent on artefacts of the crawling process, whereas other appear to be more structural. We confirm the existence of a giant strongly connected component; we however find, as observed by other researchers, very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bowtie structure” as described by Broder et al. is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the web.
More importantly, statistical testing and visual inspection of sizerank plots show that the distributions of indegree, outdegree and sizes of strongly connected components are not power laws, contrarily to what was previously reported for much smaller crawls, although they might be heavy tailed. We also provide for the first time accurate measurement of distancebased features, using recently introduced algorithms that scale to the size of our crawl.
 Djamal Belazzougui, Paolo
Boldi, Giuseppe Ottaviano, Rossano Venturini, and Sebastiano Vigna.
Cacheoblivious peeling of random hypergraphs.
In 2014 Data Compression Conference (DCC 2014), pages 352−361. IEEE,
2014.
 Abstract

The computation of a peeling order in a randomly generated hypergraph is the most timeconsuming step in a number of constructions, such as perfect hashing schemes, random rSAT solvers, errorcorrecting codes, and approximate set encodings. While there exists a straightforward linear time algorithm, its poor I/O performance makes it impractical for hypergraphs whose size exceeds the available internal memory.
We show how to reduce the computation of a peeling order to a small number of sequential scans and sorts, and analyze its I/O complexity in the cacheoblivious model. The resulting algorithm requires O(sort(n)) I/Os and O(n log n) time to peel a random hypergraph with n edges.
We experimentally evaluate the performance of our implementation of this algorithm in a realworld scenario by using the construction of minimal perfect hash functions (MPHF) as our test case: our algorithm builds a MPHF of 7.6 billion keys in less than 21 hours on a single machine. The resulting data structure is both more spaceefficient and faster than that obtained with the current stateoftheart MPHF construction for largescale key sets.
 Download from arXiV.
 Paolo Boldi, Andrea Marino,
Massimo Santini, and Sebastiano Vigna.
BUbiNG: Massive crawling for the masses.
ACM Trans. Web, 12(2):12:1−12:26, 2019.
 Abstract

Although web crawlers have been around for twenty years by now, there is virtually no freely available, opensource crawling software that guarantees high throughput, overcomes the limits of singlemachine systems, and, at the same time, scales linearly with the amount of resources available. This article aims at filling this gap, through the description of BUbiNG, our nextgeneration web crawler built upon the authors’ experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is an opensource Java fully distributed crawler; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second respecting strict politeness constraints, both host and IPbased. Unlike existing opensource distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern highspeed protocols to achieve very high throughput.
 PDF version; online version; BUbiNG can be downloaded from the LAW web site or from Maven.
 Sebastiano Vigna.
Quasisuccinct indices.
In Stefano Leonardi, Alessandro Panconesi, Paolo Ferragina, and Aristides
Gionis, editors, Proceedings of the 6th ACM International Conference on
Web Search and Data Mining, WSDM'13, pages 83−92. ACM, 2013.
 Abstract

Compressed inverted indices in use today are based on the idea of gap compression: documents pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes which represent smaller gaps using less bits. Additional data such as counts and positions is stored using similar techniques. A large body of research has been built in the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings which increase the efficiency of instantaneous codes. This paper proposes to represent an index using a different architecture based on quasisuccinct representation of monotone sequences. We show that, besides being theoretically elegant and simple, the new index provides expected constanttime operations, space savings, and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.
 PDF version. Quasisuccinct indices are now the default indices in MG4J.
 Paolo Boldi and Sebastiano
Vigna.
Four degrees of separation, really.
In Proceedings of the 2012 International Conference on Advances in Social
Networks Analysis and Mining (ASONAM 2012), pages 1222−1227. IEEE
Computer Society, 2012.
 Abstract

We recently measured the average distance of users in the Facebook graph, spurring comments in the scientific community as well as in the general press (“Four Degrees of Separation”). A number of interesting criticisms have been made about the meaningfulness, methods and consequences of the experiment we performed. In this paper we want to discuss some methodological aspects that we deem important to underline in the form of answers to the questions we have read in newspapers, magazines, blogs, or heard from colleagues. We indulge in some reflections on the actual meaning of “average distance” and make a number of side observations showing that, yes, 3.74 “degrees of separation” are really few.
 PDF version.
 Paolo Boldi and Sebastiano
Vigna.
Incore computation of geometric centralities with HyperBall: A hundred
billion nodes and beyond.
In Proc. of 2013 IEEE 13th International Conference on Data Mining
Workshops (ICDMW 2013). IEEE, 2013.
 Abstract

Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. In this paper, we approach the problem of computing geometric centralities, such as closeness and harmonic centrality, on very large graphs; traditionally this task requires an allpairs shortestpath computation in the exact case, or a number of breadthfirst traversals for approximated computations, but these techniques yield very weak statistical guarantees on highly disconnected graphs. We rather assume that the graph is accessed in a semistreaming fashion, that is, that adjacency lists are scanned almost sequentially, and that a very small amount of memory (in the order of a dozen bytes) per node is available in core memory. We leverage the newly discovered algorithms based on HyperLogLog counters, making it possible to approximate a number of geometric centralities at a very high speed and with high accuracy. While the application of similar algorithms for the approximation of closeness was attempted in the MapReduce framework, our exploitation of HyperLogLog counters reduces exponentially the memory footprint, paving the way for incore processing of networks with a hundred billion nodes using “just” 2TiB of RAM. Moreover, the computations we describe are inherently parallelizable, and scale linearly with the number of available cores.
 PDF version.
 Lars Backstrom, Paolo Boldi,
Marco Rosa, Johan Ugander, and Sebastiano Vigna.
Four degrees of separation.
In ACM Web Science 2012: Conference Proceedings, pages 45−54. ACM
Press, 2012.
Best paper award.
 Abstract

Frigyes Karinthy, in his 1929 short story “Láancszemek” (“Chains”) suggested that any two persons are distanced by at most six friendship links. (The exact wording of the story is slightly ambiguous: “He bet us that, using no more than five individuals, one of whom is a personal acquaintance, he could contact the selected individual […]”. It is not completely clear whether the selected individual is part of the five, so this could actually allude to distance five or six in the language of graph theory, but the “six degrees of separation” phrase stuck after John Guare's 1990 eponymous play. Following Milgram's definition and Guare's interpretation, we will assume that “degrees of separation” is the same as “distance minus one”, where “distance” is the usual path length—the number of arcs in the path.) Stanley Milgram in his famous experiment challenged people to route postcards to a fixed recipient by passing them only through direct acquaintances. The average number of intermediaries on the path of the postcards lay between 4.4 and 5.7, depending on the sample of people chosen.
We report the results of the first worldscale socialnetwork graphdistance computations, using the entire Facebook network of active users (≈721 million users, ≈69 billion friendship links). The average distance we observe is 4.74, corresponding to 3.74 intermediaries or “degrees of separation”, showing that the world is even smaller than we expected, and prompting the title of this paper. More generally, we study the distance distribution of Facebook and of some interesting geographic subgraphs, looking also at their evolution over time.
The networks we are able to explore are almost two orders of magnitude larger than those analysed in the previous literature. We report detailed statistical metadata showing that our measurements (which rely on probabilistic algorithms) are very accurate.
 PDF version; Java™ implementations are available as free software at the WebGraph home page. You can download the property files, HyperBall runs and degree distributions presented in the paper. The graphs are also described on the LAW dataset page.
 Paolo Boldi, Francesco Bonchi,
Carlos Castillo, and Sebastiano Vigna.
Voting in social networks.
In Proc. of ACM 18th Conference on Information and Knowledge Management
(CIKM), Napa Valley, CA, USA, 2009. ACM Press.
 Abstract

A voting system is a set of rules that a community adopts to take collective decisions. In this paper we study voting systems for a particular kind of community: electronically mediated social networks. In particular, we focus on delegative democracy (a.k.a. proxy voting) that has recently received increased interest for its ability to combine the benefits of direct and representative systems, and that seems also perfectly suited for electronically mediated social networks. In such a context, we consider a voting system in which users can only express their preference for one among the people they are explicitly connected with, and this preference can be propagated transitively, using an attenuation factor.
We present this system and we study its properties. We also take into consideration the problem of missing votes, which is particularly relevant in online networks, as some recent case shows. Our experiments on realworld networks provide interesting insight into the significance and stability of the results obtained with the suggested voting system.
 Paolo Boldi, Marco Rosa,
Massimo Santini, and Sebastiano Vigna.
Layered label propagation: A multiresolution coordinatefree ordering for
compressing social networks.
In Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa
Bertino, and Ravi Kumar, editors, Proceedings of the 20th international
conference on World Wide Web, pages 587−596. ACM, 2011.
 Abstract

We continue the line of research on graph compression started with WebGraph, but we move our focus to the compression of social networks in a proper sense (e.g., LiveJournal): the approaches that have been used for a long time to compress web graphs rely on a specific ordering of the nodes (lexicographical URL ordering) whose extension to general social networks is not trivial. In this paper, we propose a solution that mixes clusterings and orders, and devise a new algorithm, called Layered Label Propagation, that builds on previous work on scalable clustering and can be used to reorder very large graphs (billions of nodes). Our implementation uses decomposition to perform aggressively on multicore architecture, making it possible to reorder graphs of more than 600 millions nodes in a few hours. Experiments performed on a wide array of web graphs and social networks show that combining the order produced by the proposed algorithm with the WebGraph compression framework provides a major increase in compression with respect to all currently known techniques, both on web graphs and on social networks. These improvements make it possible to analyse in main memory significantly larger graphs.
 PDF version; public datasets and Java™ implementations are available as free software at the LAW site.
 Paolo Boldi, Marco Rosa, and
Sebastiano Vigna.
HyperANF: Approximating the neighbourhood function of very large graphs on a
budget.
In Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa
Bertino, and Ravi Kumar, editors, Proceedings of the 20th international
conference on World Wide Web, pages 625−634. ACM, 2011.
 Abstract

The neighbourhood function N_{G}(t) of a graph G gives, for each t, the number of pairs of nodes <x, y> such that y is reachable from x in less that t hops. The neighbourhood function provides a wealth of information about the graph (e.g., it easily allows one to compute its diameter), but it is very expensive to compute it exactly. Recently, the ANF algorithm (approximate neighbourhood function) has been proposed with the purpose of approximating N_{G}(t) on large graphs. We describe a breakthrough improvement over ANF in terms of speed and scalability. Our algorithm, called HyperANF, uses the new HyperLogLog counters and combines them efficiently through broadword programming; our implementation uses decomposition to exploit multicore parallelism. With HyperANF, for the first time we can compute in a few hours the neighbourhood function of graphs with billions of nodes with a small error and good confidence using a standard workstation. Then, we turn to the study of the distribution of the distances between reachable nodes (that can be efficiently approximated by means of HyperANF), and discover the surprising fact that its index of dispersion provides a clearcut characterisation of proper social networks vs. web graphs. We thus propose the spid (ShortestPaths Index of Dispersion) of a graph as a new, informative statistics that is able to discriminate between the above two types of graphs. We believe this is the first proposal of a significant new nonlocal structural index for complex networks whose computation is highly scalable.
 Implementations are available as free software at the WebGraph home page; public datasets are available at the LAW site.
 Roi Blanco, Peter Mika, and
Sebastiano Vigna.
Effective and efficient entity search in RDF data.
In Lora Aroyo, Chris Welty, Harith Alani, Jamie Taylor, Abraham Bernstein,
Lalana Kagal, Natasha Noy, and Eva Blomqvist, editors, The Semantic Web
— ISWC 2011. 10th International Semantic Web Conference, Proceedings, Part
I, volume 7031 of Lecture Notes in Computer Science, pages
83−97. Springer, 2011.
 Abstract
 Triple stores have long provided RDF storage as well as data access using expressive, formal query languages such as SPARQL. The new end users of the Semantic Web, however, are mostly unaware of SPARQL and overwhelmingly prefer imprecise, informal keyword queries for searching over data. At the same time, the amount of data on the Semantic Web is approaching the limits of the architectures that provide support for the full expressivity of SPARQL. These factors com bined have led to an increased interest in semantic search, i.e., access to RDF data using Information Retrieval methods. In this work, we propose a method for effective and efficient entity search over RDF data. We describe an adaptation of the BM25F ranking function for RDF data, and demonstrate that it outperforms other stateoftheart methods in ranking RDF resources. We also propose a set of new index structures for efficient retrieval and ranking of results. We implement these results using the opensource MG4J framework.
 Paolo Boldi, Francesco
Bonchi, Carlos Castillo, Debora Donato, and Sebastiano Vigna.
Query suggestions using queryflow graphs.
In Proceedings of the 2009 workshop on Web Search Click Data, WSCD
'09, pages 56−63. ACM, 2009.
 Abstract

The queryflow graph is an aggregated representation of the latent querying behavior contained in a query log. Intuitively, in the queryflow graph a directed edge from query q_{i} to query q_{j} means that the two queries are likely to be part of the same search mission. Any path over the queryflow graph may be seen as a possible search task, whose likelihood is given by the strength of the edges along the path. An edge (q_{i}, q_{j}) is also labelled with some information: e.g., the probability that user moves from q_{i} to q_{j}, or the type of the transition, for instance, the fact that q_{j} is a specialization of q_{i}.
In this paper we propose, and experimentally study, query recommendations based on short random walks on the queryflow graph. Our experiments show that these methods can match in precision, and often improve, recommendations based on queryclick graphs, without using users' clicks. Our experiments also show that it is important to consider transitiontype labels on edges for having good quality recommendations.
Finally, one feature that we had in mind while devising our methods was that of providing diverse sets of recommendations: the experimentation that we conducted provides encouraging results in this sense.
 Paolo Boldi, Massimo Santini,
and Sebastiano Vigna.
Permuting web graphs.
In Konstantin Avrachenkov, Debora Donato, and Nelly Litvak, editors,
Algorithms and Models for the WebGraph, 6th International Workshop, WAW
2009, volume 5427 of Lecture Notes in Computer Science, pages
116−126. Springer, 2009.
 Abstract

Since the first investigations on web graph compression, it has been clear that the ordering of the nodes of the graph has a fundamental influence on the compression rate (usually expressed as the number of bits per link). The authors of the LINK database, for instance, investigated three different approaches: an extrinsic ordering (URL ordering) and two intrinsic (or coordinatefree) orderings based on the rows of the adjacency matrix (lexicographic and Gray code); they concluded that URL ordering has many advantages in spite of a small penalty in compression. In this paper we approach this issue in a more systematic way, testing some old orderings and proposing some new ones. Our experiments are made in the WebGraph framework, and show that the compression technique and the structure of the graph can produce significantly different results. In particular, we show that for the transpose web graph URL ordering is significantly less effective, and that some new orderings combining host information and Gray/lexicographic orderings outperform all previous methods. In particular, in some large transposed graphs they yield the quite incredible compression rate of 1 bit per link.
 Djamal Belazzougui, Paolo
Boldi, Rasmus Pagh, and Sebastiano Vigna.
Monotone minimal perfect hashing: Searching a sorted table with O(1)
accesses.
In Proceedings of the 20th Annual ACMSIAM Symposium On Discrete
Mathematics (SODA), pages 785−794, New York, 2009. ACM Press.
 Abstract

A minimal perfect hash function maps a set S of n keys into the set {0,1,…, n − 1} bijectively. Classical results state that minimal perfect hashing is possible in constant time using a structure occupying space close to the lower bound of log e bits per element. Here we consider the problem of monotone minimal perfect hashing, in which the bijection is required to preserve the lexicographical ordering of the keys. A monotone minimal perfect hash function can be seen as a very weak form of index that provides ranking just on the set S (and answers randomly outside of S). Our goal is to minimise the description size of the hash function: we show that, for a set S of n elements out of a universe of 2^{w} elements, O(n log log w) bits are sufficient to hash monotonically with evaluation time O(log w). Alternatively, we can get space O(n log w) bits with O(1) query time. Both of these data structures improve a straightforward construction with O(n log w) space and O(log w) query time. As a consequence, it is possible to search a sorted table with O(1) accesses to the table (using additional O(n log log w) bits). Our results are based on a structure (of independent interest) that represents a trie in a very compact way, but admits errors. As a further application of the same structure, we show how to compute the predecessor (in the sorted order of S) of an arbitrary element, using O(1) accesses in expectation and an index of O(n log w) bits, improving the trivial result of O(nw) bits. This implies an efficient index for searching a blocked memory.
 PDF version. The algorithms described in the paper are implemented in Sux4J.
 Ilaria Bordino, Paolo Boldi,
Debora Donato, Massimo Santini, and Sebastiano Vigna.
Temporal evolution of the UK web.
In ICDM Workshops, pages 909−918. IEEE Computer Society, 2008.
 Abstract

Recently, a new temporal dataset has been made public: it is made of a series of twelve 100M pages snapshots of the .uk domain. The Web graphs of the twelve snapshots have been merged into a single timeaware graph that provide constanttime access to temporal information. In this paper we present the first statistical analysis performed on this graph, with the goal of checking whether the information contained in the graph is reliable (i.e., whether it depends essentially on appearance and disappearance of pages and links, or on the crawler behaviour). We perform a number of tests that show that the graph is actually reliable, and provide the first public data on the evolution of the Web that use a large scale and a significant diversity in the sites considered.
 Djamal Belazzougui, Paolo
Boldi, Rasmus Pagh, and Sebastiano Vigna.
Theory and practise of monotone minimal perfect hashing.
In Proceedings of the Tenth Workshop on Algorithm Engineering and
Experiments (ALENEX), pages 132−144. SIAM, 2009.
 Abstract

Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, orderpreserving minimal perfect hash functions have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given order leads to an unavoidable Ω(n log n) lower bound on the number of bits required to store the function. Recently, it was observed that very frequently the keys to be hashed are sorted in their intrinsic (i.e., lexicographical) order. This is typically the case of dictionaries of search engines, list of URLs of web graphs, etc. We refer to this restricted version of the problem as monotone minimal perfect hashing. We analyse experimentally the data structures proposed in our paper “Monotone Minimal Perfect Hashing: Searching a Sorted Table with O(1) Accesses”, and along our way we propose some new methods that, albeit asymptotically equivalent or worse, perform very well in practise, and provide a balance between access speed, ease of construction, and space usage.
 The algorithms described in the paper are implemented in Sux4J.
 Paolo Boldi, Francesco Bonchi,
Carlos Castillo, and Sebastiano Vigna.
From “dango” to “japanese cakes”: Query reformulation models and
patterns.
In Proc. of the 2009 IEEE/WIC/ACM International Conferences on Web
Intelligence (WI'09). ACM Press, 2009.
Winner of the best paper award.
 Abstract

Understanding query reformulation patterns is a key step towards next generation web search engines: it can help improving users' websearch experience by predicting their intent, and thus helping them to locate information more effectively.
As a step in this direction, we build an accurate model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92% accuracy. We apply the model to automatically label two large query logs, creating annotated queryflow graphs. We study the resulting reformulation patterns, finding results consistent with previous studies done on smaller manually annotated datasets, and discovering new interesting patterns, including connections between reformulation types and topical categories.
Finally, applying our findings to a third query log that is publicly available for research purposes, we demonstrate that our reformulation classifier leads to improved recommendations in a query recommendation system.
 Paolo Boldi, Francesco Bonchi,
Carlos Castillo, Debora Donato, Aristides Gionis, and Sebastiano Vigna.
The queryflow graph: model and applications.
In Proc. of ACM 17th Conference on Information and Knowledge Management
(CIKM), pages 609−618, Napa Valley, CA, USA, 2008. ACM Press.
 Abstract

Query logs record the queries and the actions of the users of search engines, and as such they contain valuable information about the interests, the preferences, and the behavior of the users, as well as their implicit feedback to searchengine results. Mining the wealth of information available in the query logs has many important applications including querylog analysis, user profiling and personalization, advertising, query recommendation, and more.
In this paper we introduce the queryflow graph, a graph representation of the interesting knowledge about latent querying behavior. Intuitively, in the queryflow graph a directed edge from query q_{i} to query q_{j} means that the two queries are likely to be part of the same “search mission”. Any path over the queryflow graph may be seen as a searching behavior, whose likelihood is given by the strength of the edges along the path.
The queryflow graph is an outcome of querylog mining and, at the same time, a useful tool for it. We propose a methodology that builds such a graph by mining time and textual information as well as aggregating queries from different users. Using this approach we build a realworld queryflow graph from a largescale query log and we demonstrate its utility in concrete applications, namely, finding logical sessions, and query recommendation. We believe, however, that the usefulness of the queryflow graph goes beyond these two applications.
 Paolo Boldi, Roberto Posenato,
Massimo Santini, and Sebastiano Vigna.
Traps and pitfalls of topicbiased PageRank.
In William Aiello, Andrei Broder, Jeannette Janssen, and Evangelos Milios,
editors, WAW 2006. Fourth Workshop on Algorithms and Models for the
WebGraph, volume 4936 of Lecture Notes in Computer Science,
pages 107−116. Springer−Verlag, 2008.
 Abstract

We discuss a number of issues in the definition, computation and comparison of PageRank values that have been addressed sparsely in the literature, often with contradictory approaches. We study the difference between weakly and strongly preferential PageRank, which patch the dangling nodes with different distributions, extending analytical formulae known for the strongly preferential case, and corroborating our results with experiments on a snapshot of 100 millions of pages of the .uk domain. The experiments show that the two PageRank versions are poorly correlated, and results about each one cannot be blindly applied to the other; moreover, our computations highlight some new concerns about the usage of exchangebased correlation indices (such as Kendall's τ) on approximated rankings.
 Ricardo Baeza Yates, Paolo
Boldi, and Carlos Castillo.
Generalizing pagerank: Damping functions for linkbased ranking algorithms.
In Proceedings of ACM SIGIR, pages 308−315, Seattle, Washington,
USA, August 2006. ACM Press.
 Abstract
This paper studies a family of linkbased algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with the distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family.
We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our analysis includes a comparison among them and experiments for studying their behavior under different parameters.
 PDF (preliminary) version; PostScript (preliminary) version.
 Paolo Boldi, Massimo Santini,
and Sebastiano Vigna.
PageRank as a function of the damping factor.
In Proc. of the Fourteenth International World Wide Web Conference (WWW
2005), pages 557−566, Chiba, Japan, 2005. ACM Press.
 Abstract

PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α=0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with respect to changes in α was discovered to be useful in linkspam detection. Moreover, an analytical justification of the value chosen for α is still missing. In this paper, we give the first mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for realworld graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closedform formulae for PageRank derivatives of any order, and an extension of the Power Method that approximates them with convergence O(t^{k} α^{t}) for the kth derivative. Finally, we show a tight connection between iterated computation and analytical behaviour by proving that the kth iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank.
 PDF version.
 Paolo Boldi and
Sebastiano Vigna.
Compressed perfect embedded skip lists for quick invertedindex lookups.
In Proc. SPIRE 2005, volume 3772 of Lecture Notes in Computer
Science, pages 25−28. Springer−Verlag, 2005.
 Abstract
Large inverted indices are by now common in the construction of webscale search engines. For faster access, inverted indices are indexed internally so that it is possible to skip quickly over unnecessary documents. The classical approach to skipping dictates that a skip should be positioned every f^{1/2} document pointers, where f is the overall number of documents where the term appears. We argue that due to the growing size of the web more refined techniques are necessary, and describe how to embed a compressed perfect skip list in an inverted list. We provide statistical models that explain the empirical distribution of the skip data we observe in our experiments, and use them to devise good compression techniques that allow us to limit the waste in space, so that the resulting data structure increases the overall index size by just a few percents, still making it possible to index pointers with a rather fine granularity.
 PDF version of the extended draft.
 Paolo Boldi and Sebastiano
Vigna.
The WebGraph framework I: Compression techniques.
In Proc. of the Thirteenth International World Wide Web Conference (WWW
2004), pages 595−601, Manhattan, USA, 2004. ACM Press.
 Abstract

Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This papers presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link.
 PDF version; to download software and data sets, please look at the WebGraph home page.
Posters
 Sebastiano Vigna.
TruRank: Taking PageRank to the limit.
In Fourteenth International World Wide Web Conference (WWW 2005), Special
Interest Tracks & Posters, pages 976−977, Chiba, Japan, 2005. ACM
Press.
 Abstract

PageRank is defined as the stationary state of a Markov chain depending on a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. It is common belief that values of α closer to 1 give a “truer to the web” PageRank, but a small α accelerates convergence. Recently, however, it has been shown that when α = 1 all pages in the core component are very likely to have rank 0. This behaviour makes it difficult to understand PageRank when α ≈ 1, as it converges to a meaningless value for most pages. We propose a simple and natural modification to the standard preprocessing performed on the adjacency matrix of the graph, resulting in a ranking scheme we call TruRank. TruRank ranks the web with principles almost identical to PageRank, but it gives meaningful values also when α ≈ 1.
 Paolo Boldi.
TotalRank: Ranking without damping.
In Fourteenth International World Wide Web Conference (WWW 2005), Special
Interest Tracks & Posters, pages 898−899, Chiba, Japan, 2005. ACM
Press.
 Abstract

PageRank is defined as the stationary state of a Markov chain obtained by perturbing the transition matrix of a web graph with a damping factor α that spreads part of the rank. The choice of α is eminently empirical, but most applications use α = 0.85; nonetheless, the selection of α is critical, and some believe that link farms may use this choice adversarially. Recent results prove that the PageRank of a page is a rational function of α, and that this function can be approximated quite efficiently: this fact can be used to define a new form of ranking, TotalRank, that averages PageRanks over all possible α's. We show how this rank can be computed efficiently, and provide some preliminary experimental results on its quality and comparisons with PageRank.
 PDF version.
 Yoshiki Mikami, Pavol Zavarsky, Mohd Zaidi Abd Rozan, Izumi Suzuki, Masayuki Takahashi, Tomohide Maki, Irwan Nizan Ayob, Paolo Boldi, Massimo Santini, and Sebastiano Vigna. The language observatory project (LOP). In Fourteenth International World Wide Web Conference (WWW 2005), Special Interest Tracks & Posters, pages 990−991, Chiba, Japan, 2005. ACM Press.
 Paolo Boldi, Bruno Codenotti,
Massimo Santini, and Sebastiano Vigna.
Structural properties of the African web.
In Poster Proc. of Eleventh International World Wide Web Conference,
Honolulu, USA, 2002.
 Abstract

In this poster we illustrate some data about the African web. These data have been collected using UbiCrawler, a distributed Web crawler designed and developed by the authors.
 Paolo Boldi, Bruno Codenotti,
Massimo Santini, and Sebastiano Vigna.
UbiCrawler: Scalability and faulttolerance issues.
In Poster Proc. of Eleventh International World Wide Web Conference,
Honolulu, USA, 2002.
 Abstract

We report on the implementation of UbiCrawler, a scalable distributed web crawler, analyze its performance, and describe its fault tolerance.
 Paolo Boldi, Bruno Codenotti,
Massimo Santini, and Sebastiano Vigna.
Trovatore: Towards a highly scalable distributed web crawler.
In Poster Proc. of Tenth International World Wide Web Conference,
pages 140−141, Hong Kong, China, 2001.
Winner of the Best Poster Award.
 Abstract

Trovatore is an ongoing project aimed at realizing an efficient distributed and highly scalable web crawler. This poster illustrates the main ideas behind its design.
Misc
 Paolo Boldi, Massimo Santini,
and Sebastiano Vigna.
A large timeaware graph.
SIGIR Forum, 42(2):33−38, 2008.
 Abstract

We describe the techniques developed to gather and distribute in a highly compressed, yet accessible, form a series of twelve snapshot of the .uk web domain. Ad hoc compression techniques made it possible to store the twelve snapshots using just 1.9 bits per link, with constanttime access to temporal information. Our collection makes it possible to study the temporal evolution linkbased scores (e.g., PageRank), the growth of online communities, and in general timedependent phenomena related to the link structure.
 Datasets and Java™ implementations are available as free software at the LAW site.
 Alessio Orlandi and Sebastiano
Vigna.
Compressed collections for simulated crawling.
SIGIR Forum, 42(2):39−44, 2008.
 Abstract

Collections are a fundamental tool for reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (a.k.a. withindocument frequencies) of a web snapshot in a highly compressed and nonetheless quickly accessible form. Our main application is reproducibility of the behaviour of focused crawlers: by coupling our collection with the corresponding web graph compressed with WebGraph we make it possible to apply textbased machine learning tools to the collection, while keeping the data set footprint small. Finally, we describe a collection based on a crawl of 100Mpages of the .uk domain, publicly available in bundle with a Java opensource implementation of our techniques.