altavista-2002

If you publish results based on this graph, please cite the references suggested on the dataset page.

This graph is distributed by Yahoo! within the Webscope program (AltaVista webpage connectivity dataset, version 1.0), so we cannot distribute it directly, although it can easily be generated from the data provided by Yahoo!. It is a fairly large graph that was crawled by the AltaVista search engine around 2002.

We urge researchers to stop using this dataset. It has problems similar to the Stanford Matrix: the “giant” strongly connected component is preposterously small (less than 4% of the whole graph). Moreover, 49% of the nodes have degree zero, that is, they are isolated: they are just there to enlarge the id space, and the graph actually has around 700 million nodes. The density is incredibly low (also because of the very high number of nodes of outdegree zero), so testing algorithms on this graph might give a false sense of speed (the number of links in a recent snapshot would be at least an order of magnitude bigger).

A good alternative is the ClueWeb12 dataset.
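The arithmetic behind the warning above can be sketched in a few lines of Python. This is only an illustration built from the figures quoted on this page: the 49% isolated share is the rounded value from the text, the function and constant names are made up for the example, and it assumes that the "largest component" count in the basic data is the giant strongly connected component.

```python
# Figures quoted on this page; names are illustrative, not part of any tool.
NODES = 1_413_511_390        # size of the id space
ARCS = 6_636_600_779
ISOLATED_SHARE = 0.49        # rounded share of degree-zero nodes, from the text
LARGEST_SCC = 50_092_626     # assumed to be the giant strongly connected component


def connected_nodes(nodes: int, isolated_share: float) -> int:
    """Estimate how many nodes have at least one arc."""
    return round(nodes * (1.0 - isolated_share))


def summarize() -> dict:
    """Recompute the quantities the warning relies on."""
    connected = connected_nodes(NODES, ISOLATED_SHARE)
    return {
        "connected_nodes": connected,
        # Average degree once the isolated padding nodes are ignored.
        "avg_degree_connected": ARCS / connected,
        # Share of the giant component among non-isolated nodes only.
        "scc_share_connected": LARGEST_SCC / connected,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.4f}")
```

Even when the isolated nodes are left out, the giant component stays a small fraction of the graph, which is the point the warning makes.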

Basic data
nodes                     1 413 511 390
arcs                      6 636 600 779
bits/link                 3.932 (13.28%)
bits/link (transpose)     3.099 (10.47%)
average degree            4.695
maximum indegree          7 637 656
maximum outdegree         2 513
dangling nodes            53.74%
buckets                   0.16%
largest component         50 092 626 (3.54%)
average distance          16.51 (± 0.114)
reachable pairs           1.61% (± 0.039)
median distance
harmonic diameter         928.62 (± 21.328)
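Several rows of the basic data can be cross-checked against each other. The following Python sketch recomputes the average degree and the largest-component percentage from the node and arc counts, and gives a rough estimate of the compressed size. The helper names are hypothetical, and the size estimate assumes that bits/link times the number of arcs approximates the compressed successor data alone, without offsets or other auxiliary files.

```python
# Values copied from the basic data table; helper names are illustrative.
NODES = 1_413_511_390
ARCS = 6_636_600_779
LARGEST_COMPONENT = 50_092_626
BITS_PER_LINK = 3.932
BITS_PER_LINK_TRANSPOSE = 3.099


def average_degree(nodes: int, arcs: int) -> float:
    """Arcs per node, counting every id (isolated nodes included)."""
    return arcs / nodes


def percentage(part: int, whole: int) -> float:
    """Size of `part` as a percentage of `whole`."""
    return 100.0 * part / whole


def compressed_gib(arcs: int, bits_per_link: float) -> float:
    """Rough compressed size in GiB, assuming arcs * bits/link bits in total."""
    return arcs * bits_per_link / 8 / 2**30


if __name__ == "__main__":
    print(f"average degree:      {average_degree(NODES, ARCS):.3f}")
    print(f"largest component %: {percentage(LARGEST_COMPONENT, NODES):.2f}")
    print(f"graph (GiB):         {compressed_gib(ARCS, BITS_PER_LINK):.2f}")
    print(f"transpose (GiB):     {compressed_gib(ARCS, BITS_PER_LINK_TRANSPOSE):.2f}")
```

The recomputed average degree and component percentage should agree with the table up to rounding; the size figures are estimates only.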
Random access (recommended)
Filename                              Size
altavista-2002.properties             4.0K
altavista-2002-t.properties           4.0K
altavista-2002.stats                  4.0K
altavista-2002.indegree               15M
altavista-2002.outdegree              12K

Sequential access (high compression)
Filename                              Size
altavista-2002-hc.properties          4.0K
altavista-2002-hc-t.properties        4.0K

Natural order (random access)
Filename                              Size
altavista-2002-nat.properties         4.0K

Plots
Indegree-frequency plot (also with Fibonacci binning)
Outdegree-frequency plot (also with Fibonacci binning)
Indegree-rank plot (cumulative)
Outdegree-rank plot (cumulative)
Distance probability mass function
Connected-components size distribution
Large connected components
Distribution of the logarithm of the successor gaps
Distribution of successor gaps