clueweb12
If you publish results based on this graph, please cite the references suggested on the dataset page.
This is the graph underlying the ClueWeb12 dataset. The graph we distribute contains all crawled pages (not just the 730M pages in the collection), but not the frontier (pages that have been discovered, but not crawled).
The original graph contains a fake isolated node with index zero (which we labelled with the empty string) so that node indices align with the numbering within the collection. As usual, we distribute the graph under a permutation computed by LLP (Layered Label Propagation).
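The index alignment and the relabelling described above can be illustrated on a toy graph. This is only a minimal sketch: it assumes the collection numbers its documents from 1, and the names build_aligned_graph, apply_permutation, and perm are hypothetical. The permutation is an arbitrary stand-in, not one computed by LLP.

```python
def build_aligned_graph(edges, num_docs):
    """Build successor lists for nodes 0..num_docs.

    Node 0 is an isolated placeholder, so node i corresponds to
    document i (assumed 1-based numbering in the collection).
    """
    successors = [[] for _ in range(num_docs + 1)]
    for src, dst in edges:
        successors[src].append(dst)
    return successors


def apply_permutation(successors, perm):
    """Relabel every node x as perm[x] and return the relabelled graph."""
    permuted = [[] for _ in successors]
    for x, succ in enumerate(successors):
        permuted[perm[x]] = sorted(perm[y] for y in succ)
    return permuted


# Toy collection of three documents, numbered 1..3.
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]
original = build_aligned_graph(edges, num_docs=3)

# Arbitrary stand-in permutation (original index -> distributed index).
perm = [2, 0, 3, 1]
distributed = apply_permutation(original, perm)

# The placeholder stays isolated, but it no longer sits at index 0.
print(original)
print(distributed)
print("placeholder is now node", perm[0])
```

In this sketch, mapping a node of the relabelled graph back to a document number requires the inverse of the permutation.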
This is one of the first available snapshots of the web graph in the billion range that actually looks like a web graph.