clueweb12

If you publish results based on this graph, please cite the references suggested on the dataset page.

This is the graph underlying the ClueWeb12 dataset. The graph we distribute contains all crawled pages (not just the 730M pages in the collection), but not the frontier (pages that have been discovered, but not crawled).

The original graph contains a dummy isolated node with index zero (labelled with the empty string), which aligns the node indices with the numbering used within the collection. As usual, we distribute the graph under a permutation computed by Layered Label Propagation (LLP).
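To make the effect of the permutation concrete, here is a toy sketch in Python. It is not LLP and does not use the real data: the four-node graph, the permutation `perm`, and the helper `permute_arcs` are all hypothetical, and the convention that `perm[old] = new` is an assumption made for illustration. It only shows that once a permutation is applied, the isolated node of the original graph no longer needs to sit at index zero.

```python
def permute_arcs(arcs, perm):
    """Relabel every arc (u, v) as (perm[u], perm[v]) and return them sorted.

    Assumption: perm maps an ORIGINAL node index to its NEW index.
    """
    return sorted((perm[u], perm[v]) for u, v in arcs)


# Toy original graph on 4 nodes: node 0 is isolated, nodes 1..3 form a cycle.
arcs = [(1, 2), (2, 3), (3, 1)]

# Hypothetical permutation (old index -> new index); NOT computed by LLP.
perm = [2, 0, 3, 1]

permuted = permute_arcs(arcs, perm)

# The isolated node keeps having no arcs, but it now lives at a new index.
fake_node_new_index = perm[0]

print(permuted)
print(fake_node_new_index)
```

Under this assumed convention, mapping a node of the permuted graph back to the original numbering would require the inverse permutation.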

This is one of the first available snapshots of the web graph in the billion-node range that actually looks like a web graph.

Basic data

nodes	978 408 098
arcs	42 574 107 469
bits/link	1.756 (6.79%)
bits/link (transpose)	1.312 (5.07%)
average degree	43.514
maximum indegree	75 611 690
maximum outdegree	7 447
dangling nodes	9.50%
buckets	1.69%
largest component	774 373 029 (79.15%)
average distance	12.48 (± 0.027)
reachable pairs	75.44% (± 0.657)
median distance	13 (53.05%)
harmonic diameter	15.71 (± 0.137)
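Two of the figures above can be rederived from the others, which is a quick way to check that the table was read correctly. The sketch below recomputes the average degree (arcs divided by nodes) and the share of nodes in the largest component; the constant and function names are ours, while the numbers are copied from the table.

```python
# Values copied from the "Basic data" table.
NODES = 978_408_098
ARCS = 42_574_107_469
LARGEST_COMPONENT = 774_373_029


def average_degree(nodes, arcs):
    """Average number of arcs per node."""
    return arcs / nodes


def fraction_percent(part, whole):
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


avg = average_degree(NODES, ARCS)
pct = fraction_percent(LARGEST_COMPONENT, NODES)

print(f"average degree: {avg:.3f}")        # 43.514
print(f"largest component: {pct:.2f}%")    # 79.15%
```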

Random access (recommended)

Filename	Size
clueweb12.graph	12G
clueweb12.properties	4.0K
clueweb12-t.graph	7.0G
clueweb12-t.properties	4.0K
clueweb12.map	12G
clueweb12.smap	12G
clueweb12.md5sums	4.0K
clueweb12.lmap	40G
clueweb12.fcl	36G
clueweb12.vf	3.8G
clueweb12.svf	12G
clueweb12.urls.gz	10G
clueweb12.stats	4.0K
clueweb12.indegree	145M
clueweb12.outdegree	24K
clueweb12.scc	3.7G
clueweb12.sccsizes	516M
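Since several of these files are tens of gigabytes, it may be worth checking downloads against clueweb12.md5sums. The following Python sketch assumes that this file uses the usual `md5sum` layout (one `checksum  filename` pair per line); that layout is an assumption, not something stated on this page, and the function names are ours.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sums(text):
    """Parse md5sum-style lines into a {filename: checksum} dictionary.

    Assumption: each non-empty line is "<checksum>  <filename>", optionally
    with a "*" before the filename (md5sum's binary-mode marker).
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify(directory, md5sums_name="clueweb12.md5sums"):
    """Return {filename: True/False} for every file listed in the checksum file.

    A file that is missing from the directory is reported as False.
    """
    directory = Path(directory)
    expected = parse_md5sums((directory / md5sums_name).read_text())
    results = {}
    for name, checksum in expected.items():
        path = directory / name
        results[name] = path.exists() and md5_of_file(path) == checksum
    return results


# Example use, once the files are in the current directory:
#   for name, ok in verify(".").items():
#       print(name, "OK" if ok else "FAILED or missing")
```

Reading in fixed-size chunks keeps memory use constant, which matters for files such as clueweb12.lmap (40G).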