altavista-2002
If you publish results based on this graph, please cite the references suggested in the dataset page.
This graph is distributed by Yahoo! within the Webscope program (AltaVista webpage connectivity dataset, version 1.0), so we cannot distribute it directly, although it can be easily generated from the data provided by Yahoo!. It is a fairly large graph that was crawled by the AltaVista search engine around 2002.
We urge researchers to stop using this dataset. It has problems similar to those of the Stanford Matrix: the “giant” strongly connected component is preposterously small (less than 4% of the whole graph). Moreover, 49% of the nodes have degree zero, that is, they are isolated: they are there only to enlarge the id space, and the graph actually has around 700 million nodes. The density is also incredibly low (in part because of the very high number of nodes of outdegree zero), so testing algorithms on this graph might give a false sense of speed: the number of links in a recent snapshot would be at least an order of magnitude larger.
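To make the three quantities mentioned above concrete (fraction of isolated nodes, relative size of the largest strongly connected component, and average outdegree), here is a minimal Python sketch that computes them on a toy edge list. The function name `graph_stats` and the in-memory edge-list input are assumptions made for illustration only; this is not how the figures above were obtained, and a graph of this size would need a compressed or out-of-core representation rather than Python lists.

```python
from collections import defaultdict


def graph_stats(num_nodes, edges):
    """Return (isolated_fraction, largest_scc_fraction, avg_outdegree).

    Hypothetical helper: nodes are ids 0..num_nodes-1 and `edges` is a
    list of (source, target) pairs held entirely in memory.
    """
    succ = defaultdict(list)   # successors (outgoing arcs)
    pred = defaultdict(list)   # predecessors (arcs of the transpose)
    touched = set()            # nodes with at least one arc
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        touched.add(u)
        touched.add(v)
    isolated = num_nodes - len(touched)

    # Kosaraju's algorithm, pass 1: iterative DFS recording finish order.
    visited = [False] * num_nodes
    order = []
    for s in range(num_nodes):
        if visited[s]:
            continue
        visited[s] = True
        stack = [(s, iter(succ.get(s, ())))]
        while stack:
            node, it = stack[-1]
            for w in it:
                if not visited[w]:
                    visited[w] = True
                    stack.append((w, iter(succ.get(w, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: explore the transpose in reverse finish order; each
    # exploration visits exactly one strongly connected component.
    assigned = [False] * num_nodes
    largest = 0
    for s in reversed(order):
        if assigned[s]:
            continue
        assigned[s] = True
        size = 0
        todo = [s]
        while todo:
            node = todo.pop()
            size += 1
            for w in pred.get(node, ()):
                if not assigned[w]:
                    assigned[w] = True
                    todo.append(w)
        largest = max(largest, size)

    return isolated / num_nodes, largest / num_nodes, len(edges) / num_nodes


# Toy graph: a 3-cycle, two extra arcs, and four ids that are never used.
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (4, 5)]
iso, scc, avg = graph_stats(10, toy_edges)
print(f"isolated: {iso:.0%}, largest SCC: {scc:.0%}, avg outdegree: {avg}")
```

On the toy graph, four of the ten ids never appear in any arc, so they inflate the node count without contributing structure, which is the same effect described above on a much larger scale.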
A good alternative is the ClueWeb12 dataset.