clueweb12

If you publish results based on this graph, please cite the references suggested on the dataset page.

This is the graph underlying the ClueWeb12 dataset. The graph we distribute contains all crawled pages (not just the 730M pages in the collection), but not the frontier (pages that have been discovered, but not crawled).

The original graph contains a fake isolated node with index zero (which we labelled with the empty string) to align the node indices with the numbering within the collection. As usual, we distribute the graph under a permutation computed by LLP (Layered Label Propagation).

This is one of the first available snapshots of the web graph in the billion range that actually looks like a web graph.

Basic data

nodes                    978 408 098
arcs                     42 574 107 469
bits/link                1.756 (6.79%)
bits/link (transpose)    1.312 (5.07%)
average degree           43.514
maximum indegree         75 611 690
maximum outdegree        7 447
dangling nodes           9.50%
buckets                  1.69%
largest component        774 373 029 (79.15%)
average distance         12.48 (± 0.027)
reachable pairs          75.44% (± 0.657)
median distance          13 (53.05%)
harmonic diameter        15.71 (± 0.137)
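
As a quick sanity check on the basic data above, two of the derived figures can be recomputed from the raw counts. This is only a sketch: it assumes that the average degree is the number of arcs divided by the number of nodes, and that the percentage next to the largest component is its share of all nodes. The variable names are illustrative.

```python
# Raw counts copied from the "Basic data" figures.
nodes = 978_408_098
arcs = 42_574_107_469
largest_component = 774_373_029

# Assumption: average degree = arcs / nodes.
average_degree = arcs / nodes

# Assumption: the percentage is the fraction of nodes in the largest component.
largest_component_share = 100 * largest_component / nodes

print(f"average degree: {average_degree:.3f}")
print(f"largest component: {largest_component_share:.2f}%")
```

Under those assumptions, both printed values agree with the listed figures (43.514 and 79.15%).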

Random access (recommended)

Filename                 Size
clueweb12.graph          12G
clueweb12.properties     4.0K
clueweb12-t.graph        7.0G
clueweb12-t.properties   4.0K
clueweb12.map            12G
clueweb12.smap           12G
clueweb12.md5sums        4.0K
clueweb12.lmap           40G
clueweb12.fcl            36G
clueweb12.urls.gz        10G
clueweb12.stats          4.0K
clueweb12.indegree       145M
clueweb12.outdegree      24K
clueweb12.scc            3.7G
clueweb12.sccsizes       516M
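
The file list includes clueweb12.md5sums, which can be used to check multi-gigabyte downloads for corruption. The following is a minimal sketch, assuming the file uses the common `md5sum` layout (one hex digest followed by a filename per line); that layout is an assumption and is not confirmed on this page. The function names `md5_of_file` and `verify_md5sums` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5sums(md5sums_path):
    """Check every locally present file listed in an md5sum-style file.

    Returns a dict mapping filename -> True/False (digest matches or not).
    Files that have not been downloaded are skipped.
    Assumption: each line looks like "<hex digest>  <filename>".
    """
    md5sums_path = Path(md5sums_path)
    base = md5sums_path.parent
    results = {}
    for line in md5sums_path.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        target = base / name
        if target.exists():
            results[name] = md5_of_file(target) == expected.lower()
    return results
```

Calling `verify_md5sums("clueweb12.md5sums")` in the download directory would then report which of the files already present match their listed digests.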

Sequential access (high compression)

Filename                    Size
clueweb12-hc.graph          8.8G
clueweb12-hc.properties     4.0K
clueweb12-hc-t.graph        6.6G
clueweb12-hc-t.properties   4.0K

Natural order (random access)

Filename                    Size
clueweb12-nat.graph         24G
clueweb12-nat.properties    4.0K
clueweb12-nat.fcl           32G
clueweb12-nat.urls.gz       13G
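
The .urls.gz files are large enough that streaming them is more practical than decompressing them in full. The sketch below assumes that each file holds one URL per line and that line i (counting from zero) is the URL of node i in the corresponding ordering; this layout is an assumption rather than something stated on this page, and `urls_for_nodes` is a hypothetical helper.

```python
import gzip


def urls_for_nodes(urls_gz_path, wanted):
    """Stream a gzip-compressed URL list and return {node: url} for `wanted`.

    Assumption: line i (zero-based) holds the URL of node i.
    """
    wanted = set(wanted)
    found = {}
    if not wanted:
        return found
    last = max(wanted)
    with gzip.open(urls_gz_path, "rt", encoding="utf-8", errors="replace") as f:
        for node, line in enumerate(f):
            if node in wanted:
                found[node] = line.rstrip("\n")
            if node >= last:
                break  # no need to read the rest of the file
    return found
```

If the assumption holds, looking up node 0 in the natural-order file should return the empty string, since the fake node of index zero is labelled that way.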

Figure: Indegree-frequency plot (with Fibonacci binning)
Figure: Outdegree-frequency plot (with Fibonacci binning)
Figure: Indegree-rank plot (cumulative)
Figure: Outdegree-rank plot (cumulative)
Figure: Distance probability mass function
Figure: Connected-components size distribution
Figure: Large connected components
Figure: Distribution of the logarithm of the successor gaps
Figure: Distribution of successor gaps