Datasets

We provide large graphs compressed using LLP + WebGraph. Please read the tutorial to learn how to download and use them.

Acknowledgements

If you publish results based on these graphs, please acknowledge the use of WebGraph and LLP by citing the following papers:

@inproceedings{BoVWFI,
  author = "Paolo Boldi and Sebastiano Vigna",
  title = "The {W}eb{G}raph Framework {I}: {C}ompression Techniques",
  year = 2004,
  booktitle = "Proc. of the Thirteenth International World Wide Web Conference (WWW 2004)",
  address = "Manhattan, USA",
  pages = "595--601",
  publisher = "ACM Press"
}

@inproceedings{BRSLLP,
  author = "Paolo Boldi and Marco Rosa and Massimo Santini and Sebastiano Vigna",
  title = "Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks",
  booktitle = "Proceedings of the 20th international conference on World Wide Web",
  year = 2011,
  publisher = "ACM Press"
}

If the graphs you are using were gathered by UbiCrawler, please acknowledge the use of UbiCrawler by citing the following paper:

@article{BCSU3,
  author = "Paolo Boldi and Bruno Codenotti and Massimo Santini and Sebastiano Vigna",
  title = "UbiCrawler: A Scalable Fully Distributed Web Crawler",
  journal = "Software: Practice \& Experience",
  year = 2004,
  volume = 34,
  number = 8,
  pages = "711--726"
}

The graphs

We have gathered our datasets in reasonably homogeneous groups. Besides a brief description, each group has a table providing basic information about the available graphs, such as the crawl date, the number of nodes and arcs, and the number of bits per link of the highly compressed version (the version we provide for general usage has faster random access, but a worse compression ratio). The percentage after the number of bits per link gives the compression ratio with respect to the information-theoretical lower bound (that is, the logarithm of “n² choose m” divided by m, where n is the number of nodes and m is the number of arcs). We report, when available, the maximum depth per host and the maximum number of URLs per host. The last column shows the institution that provided support for the crawl.
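As a worked example of how the percentages in the tables relate to the lower bound, the following sketch computes the bound for a graph with n nodes and m arcs, and the resulting ratio. It assumes base-2 logarithms (so that the bound is expressed in bits per link) and evaluates the binomial coefficient through the log-gamma function; the function names are illustrative and are not part of WebGraph.

```python
import math


def lower_bound_bits_per_link(n: int, m: int) -> float:
    """Return log2(C(n*n, m)) / m, the lower bound in bits per link."""
    total = n * n  # number of possible arcs among n nodes, loops included
    # ln C(total, m) computed via log-gamma, to avoid huge factorials.
    ln_binom = (math.lgamma(total + 1)
                - math.lgamma(m + 1)
                - math.lgamma(total - m + 1))
    return ln_binom / math.log(2) / m


def compression_ratio(bits_per_link: float, n: int, m: int) -> float:
    """Return bits per link as a fraction of the lower bound."""
    return bits_per_link / lower_bound_bits_per_link(n, m)


# Figures for cnr-2000 taken from the "Small graphs" table.
ratio = compression_ratio(2.448, 325_557, 3_216_152)
print(f"{100 * ratio:.2f}%")  # should be close to the 14.88% reported in the table
```

For very large graphs the log-gamma differences lose some floating-point precision, which is acceptable here because the percentages are reported with only two decimal digits.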

For each graph, we provide a wealth of statistical data and graphs, ranging from plots of the distribution of indegrees and outdegrees to the size of the giant component and a drawing of the largest components. In particular, thanks to HyperBall we can provide the average and median distance, the harmonic diameter and the spid (shortest-paths index of dispersion) of the graph. The statistics are obtained by applying the jackknife method to 100 statistically independent approximations of the neighbourhood function.
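To illustrate the idea, here is a minimal sketch of a standard delete-one jackknife applied to a statistic derived from several approximations of the neighbourhood function. It is not necessarily the exact procedure used to produce the published figures. It assumes that each sample is a list nf where nf[t] approximates the number of pairs of nodes at distance at most t, and that all samples have been padded to the same length; the function names and the sample values are made up for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple


def average_distance(nf: Sequence[float]) -> float:
    """Average distance over reachable pairs at distance >= 1."""
    reachable = nf[-1] - nf[0]
    weighted = sum(t * (nf[t] - nf[t - 1]) for t in range(1, len(nf)))
    return weighted / reachable


def jackknife(samples: List[Sequence[float]],
              statistic: Callable[[Sequence[float]], float]) -> Tuple[float, float]:
    """Return (bias-corrected estimate, standard error) of `statistic`."""
    k = len(samples)
    length = len(samples[0])
    totals = [sum(s[t] for s in samples) for t in range(length)]
    # Statistic computed on the pointwise mean of all samples.
    full = statistic([x / k for x in totals])
    # Statistic computed on the mean of all samples but one, for each sample.
    loo = [statistic([(totals[t] - s[t]) / (k - 1) for t in range(length)])
           for s in samples]
    loo_mean = sum(loo) / k
    estimate = k * full - (k - 1) * loo_mean
    variance = (k - 1) / k * sum((v - loo_mean) ** 2 for v in loo)
    return estimate, math.sqrt(variance)


# Three made-up noisy approximations of the neighbourhood function [3, 5, 6].
samples = [[3.0, 5.1, 6.2], [3.0, 4.9, 5.9], [3.0, 5.0, 5.9]]
estimate, std_err = jackknife(samples, average_distance)
print(estimate, std_err)
```

The jackknife is useful here because statistics such as the average distance are nonlinear functions of the neighbourhood function, so averaging the per-sample statistics directly would give a biased estimate.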

Small graphs

The graphs in this section are very small web graphs that were crawled for testing purposes, or to have a look at some weird domain. We provide them as testing and debugging tools, as their size makes them easily manageable, but clearly they cannot be used as the only testbed for compression.

| Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
|---|---|---|---|---|---|---|---|---|
| cnr-2000 | 2000 | 325 557 | 3 216 152 | 2.448 (14.88%) | 2.157 (13.11%) |  |  | IIT |
| in-2004 | 2004 | 1 382 908 | 16 917 053 | 1.767 (9.69%) | 1.646 (9.03%) |  |  | NagaokaUT |
| eu-2005 | 2005 | 862 664 | 19 235 140 | 3.159 (18.94%) | 2.667 (15.99%) |  |  | DSI |
| uk-2007-05@100000 | 2007 | 100 000 | 3 050 615 | 1.58 (12.04%) | 1.312 (10.00%) |  |  | DSI-DELIS |
| uk-2007-05@1000000 | 2007 | 1 000 000 | 41 247 159 | 1.249 (7.80%) | 0.972 (6.07%) |  |  | DSI-DELIS |

Larger crawls

The graphs in this section are larger graphs from different domains. With the exception of the series of .uk datasets, these are the largest graphs available from us.

| Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
|---|---|---|---|---|---|---|---|---|
| clueweb12 | 2012 | 978 408 098 | 42 574 107 469 | 1.756 (6.79%) | 1.312 (5.07%) |  |  |  |
| uk-2002 | 2002 | 18 520 486 | 298 113 762 | 1.805 (8.37%) | 1.55 (7.18%) |  |  | IIT |
| indochina-2004 | 2004 | 7 414 866 | 194 109 311 | 1.118 (5.72%) | 0.958 (4.90%) |  |  | NagaokaUT |
| it-2004 | 2004 | 41 291 594 | 1 150 725 436 | 1.41 (6.43%) | 1.143 (5.21%) |  |  | IIT |
| arabic-2005 | 2005 | 22 744 080 | 639 999 458 | 1.383 (6.56%) | 1.131 (5.37%) | 16 | 10 000 | NagaokaUT |
| sk-2005 | 2005 | 50 636 154 | 1 949 412 601 | 1.576 (7.24%) | 1.289 (5.92%) | 16 | 100 000 | IIT |
| uk-2005 | 2005 | 39 459 925 | 936 364 282 | 1.463 (6.62%) | 1.156 (5.23%) |  |  | IIT |

Social networks

The graphs in this section are not web graphs, but rather (directed and undirected) networks generated by human activities, such as a collaboration network, a co-starring network, LiveJournal, etc.

| Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
|---|---|---|---|---|---|---|---|---|
| enwiki-2013 | 2013 | 4 206 785 | 101 355 853 | 12.639 (67.03%) | 10.841 (57.49%) |  |  | NADINE |
| itwiki-2013 | 2013 | 1 016 867 | 25 619 926 | 11.039 (65.93%) | 9.736 (58.15%) |  |  | NADINE |
| eswiki-2013 | 2013 | 972 933 | 23 041 488 | 10.519 (62.73%) | 8.855 (52.81%) |  |  | NADINE |
| frwiki-2013 | 2013 | 1 352 053 | 34 378 431 | 11.296 (65.90%) | 10.049 (58.63%) |  |  | NADINE |
| dewiki-2013 | 2013 | 1 532 354 | 36 722 696 | 13.024 (74.82%) | 11.349 (65.20%) |  |  | NADINE |
| enron |  | 69 244 | 276 143 | 5.874 (37.83%) | 7.388 (47.58%) |  |  |  |
| amazon-2008 | 2008 | 735 323 | 5 158 388 | 9.153 (50.51%) | 8.819 (48.67%) |  |  |  |
| ljournal-2008 | 2008 | 5 363 260 | 79 023 142 | 10.909 (54.77%) | 10.898 (54.72%) |  |  |  |
| orkut-2007 | 2007 | 3 072 626 | 234 370 166 | 11.046 (65.98%) | 11.046 (65.98%) |  |  |  |
| hollywood-2009 | 2009 | 1 139 905 | 113 891 327 | 4.804 (32.20%) | 4.804 (32.20%) |  |  |  |
| hollywood-2011 | 2011 | 2 180 759 | 228 985 632 | 4.885 (30.95%) | 4.885 (30.95%) |  |  |  |
| dblp-2010 | 2010 | 326 186 | 1 615 400 | 6.776 (38.83%) | 6.776 (38.83%) |  |  |  |
| dblp-2011 | 2011 | 986 324 | 6 707 236 | 8.475 (45.59%) | 8.475 (45.59%) |  |  |  |
| hu-tel-2006 | 2006 | 2 317 492 | 46 126 952 | 15.361 (84.07%) | 14.723 (80.58%) |  |  |  |
| twitter-2010 | 2010 | 41 652 230 | 1 468 365 182 | 13.897 (64.29%) | 13.148 (60.83%) |  |  |  |
| wordassociation-2011 | 2011 | 10 617 | 72 172 | 10.599 (87.95%) | 10.13 (84.06%) |  |  |  |

Facebook graphs

These graphs have been studied in the “Four degrees of separation” paper by Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander and Sebastiano Vigna. The series sections the Facebook graph geographically and temporally: please see the paper for the details.

We cannot distribute the graphs for obvious reasons, but you can download aggregated data (WebGraph properties and HyperBall runs) following the link above. If you publish results based on our data, please acknowledge our work by citing the “Four degrees of separation” paper.

| Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
|---|---|---|---|---|---|---|---|---|
| fb_it-2007 | 2007 | 159 801 | 209 998 |  |  |  |  | Facebook |
| fb_se-2007 | 2007 | 11 156 | 43 508 |  |  |  |  | Facebook |
| fb_itse-2007 | 2007 | 172 118 | 257 510 |  |  |  |  | Facebook |
| fb_us-2007 | 2007 | 8 849 896 | 1 058 512 594 |  |  |  |  | Facebook |
| fb-2007 | 2007 | 12 957 703 | 1 289 268 655 |  |  |  |  | Facebook |
| fb_it-2008 | 2008 | 335 769 | 1 975 722 |  |  |  |  | Facebook |
| fb_se-2008 | 2008 | 1 005 074 | 46 326 530 |  |  |  |  | Facebook |
| fb_itse-2008 | 2008 | 1 350 503 | 48 622 390 |  |  |  |  | Facebook |
| fb_us-2008 | 2008 | 20 128 672 | 2 134 602 645 |  |  |  |  | Facebook |
| fb-2008 | 2008 | 56 045 072 | 4 267 607 760 |  |  |  |  | Facebook |
| fb_it-2009 | 2009 | 4 566 919 | 232 070 608 |  |  |  |  | Facebook |
| fb_se-2009 | 2009 | 1 595 655 | 111 054 342 |  |  |  |  | Facebook |
| fb_itse-2009 | 2009 | 6 157 764 | 344 271 392 |  |  |  |  | Facebook |
| fb_us-2009 | 2009 | 41 531 899 | 4 642 488 625 |  |  |  |  | Facebook |
| fb-2009 | 2009 | 139 066 164 | 12 332 220 710 |  |  |  |  | Facebook |
| fb_it-2010 | 2010 | 11 828 018 | 1 453 871 140 |  |  |  |  | Facebook |
| fb_se-2010 | 2010 | 2 973 416 | 299 877 604 |  |  |  |  | Facebook |
| fb_itse-2010 | 2010 | 14 820 162 | 1 756 721 114 |  |  |  |  | Facebook |
| fb_us-2010 | 2010 | 92 442 765 | 11 920 357 832 |  |  |  |  | Facebook |
| fb-2010 | 2010 | 332 316 968 | 37 551 427 745 |  |  |  |  | Facebook |
| fb_it-2011 | 2011 | 17 131 210 | 3 395 374 924 |  |  |  |  | Facebook |
| fb_se-2011 | 2011 | 3 958 347 | 556 329 804 |  |  |  |  | Facebook |
| fb_itse-2011 | 2011 | 21 108 892 | 3 957 504 632 |  |  |  |  | Facebook |
| fb_us-2011 | 2011 | 131 415 278 | 24 745 022 096 |  |  |  |  | Facebook |
| fb-2011 | 2011 | 562 368 789 | 95 057 125 765 |  |  |  |  | Facebook |
| fb_it-current |  | 19 781 745 | 4 471 227 570 | 10.343 (57.91%) |  |  |  | Facebook |
| fb_se-current |  | 4 345 050 | 671 464 768 | 10.225 (63.03%) |  |  |  | Facebook |
| fb_itse-current |  | 24 144 165 | 5 149 982 270 | 10.269 (56.33%) |  |  |  | Facebook |
| fb_us-current |  | 149 068 552 | 31 865 069 210 | 11.63 (55.77%) |  |  |  | Facebook |
| fb-current |  | 721 100 043 | 137 325 050 412 | 12.298 (52.79%) |  |  |  | Facebook |

The .uk DELIS series

From 2006 to 2007, we gathered twelve monthly snapshots of about 100 million pages each for the DELIS project. If you want a large recent graph, we suggest using uk-2007-05. We made a special effort to merge all the snapshots into the dataset uk-union, labelling each node and arc, thus obtaining a large time-aware graph.

| Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
|---|---|---|---|---|---|---|---|---|
| uk-2006-05 | 2006 | 77 741 046 | 2 965 197 340 | 1.489 (6.65%) | 1.246 (5.56%) | 16 | 50 000 | DSI-DELIS |
| uk-2006-06 | 2006 | 80 644 902 | 2 481 281 617 | 1.402 (6.16%) | 1.148 (5.04%) | 16 | 50 000 | DSI-DELIS |
| uk-2006-07 | 2006 | 96 395 298 | 3 030 665 444 | 1.375 (5.98%) | 1.128 (4.91%) | 16 | 50 000 | DSI-DELIS |
| uk-2006-08 | 2006 | 100 751 978 | 3 250 153 746 | 1.353 (5.88%) | 1.107 (4.81%) | 16 | 50 000 | DSI-DELIS |
| uk-2006-09 | 2006 | 106 288 541 | 3 871 625 613 | 1.194 (5.21%) | 0.975 (4.25%) | 16 | 50 000 | DSI-DELIS |
| uk-2006-10 | 2006 | 93 463 772 | 3 130 910 405 | 1.26 (5.51%) | 1.016 (4.45%) | 16 | 50 000 | DSI-DELIS |
| uk-2006-11 | 2006 | 106 783 458 | 3 479 400 938 | 1.307 (5.66%) | 1.047 (4.54%) | 16 | 50 000 | DSI-DELIS |
| uk-2006-12 | 2006 | 103 098 631 | 3 768 836 665 | 1.158 (5.06%) | 0.928 (4.06%) | 16 | 50 000 | DSI-DELIS |
| uk-2007-01 | 2007 | 108 563 230 | 3 929 837 236 | 1.183 (5.15%) | 0.948 (4.13%) | 16 | 50 000 | DSI-DELIS |
| uk-2007-02 | 2007 | 110 123 614 | 3 944 932 566 | 1.182 (5.14%) | 0.947 (4.12%) | 16 | 50 000 | DSI-DELIS |
| uk-2007-03 | 2007 | 107 565 084 | 3 642 701 825 | 1.274 (5.53%) | 1.023 (4.44%) | 16 | 50 000 | DSI-DELIS |
| uk-2007-04 | 2007 | 106 867 191 | 3 790 305 474 | 1.203 (5.24%) | 0.96 (4.18%) | 16 | 50 000 | DSI-DELIS |
| uk-2007-05 | 2007 | 105 896 555 | 3 738 733 648 | 1.207 (5.26%) | 0.965 (4.20%) | 16 | 50 000 | DSI-DELIS |
| uk-union-2006-06-2007-05 | 2007 | 133 633 040 | 5 507 679 822 |  |  |  |  | DSI-DELIS |

Other datasets

In 2001, we downloaded a graph from the Stanford WebBase project site, which we distribute in compact form. We also provide some statistical data about the AltaVista dataset distributed by Yahoo! within the Webscope program (AltaVista webpage connectivity dataset, version 1.0). The “nd” (no-dangling) version was obtained by pruning (one level of) dangling nodes.

| Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
|---|---|---|---|---|---|---|---|---|
| webbase-2001 | 2001 | 118 142 155 | 1 019 903 190 | 2.784 (11.07%) | 2.561 (10.18%) |  |  | WebBase |
| altavista-2002 | 2002 | 1 413 511 390 | 6 636 600 779 | 3.932 (13.28%) | 3.099 (10.47%) |  |  | Webscope |
| altavista-2002-nd | 2002 | 653 912 338 | 4 226 882 364 | 4.024 (14.35%) | 2.917 (10.40%) |  |  | Webscope |
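
The pruning of one level of dangling nodes mentioned above can be sketched as follows. This is an illustration only: it assumes that a dangling node is a node without outgoing arcs, that the graph is given as a list of successor lists, and that the surviving nodes are renumbered consecutively; the function name is hypothetical, and the details of how the “nd” dataset was actually produced may differ.

```python
from typing import Dict, List


def prune_dangling_once(successors: List[List[int]]) -> List[List[int]]:
    """Remove the nodes with no successors in the input graph, exactly once."""
    # A node is kept if and only if it has at least one outgoing arc.
    keep = [bool(succ) for succ in successors]
    # Map each surviving node to its new, consecutive identifier.
    new_id: Dict[int, int] = {}
    for old, kept in enumerate(keep):
        if kept:
            new_id[old] = len(new_id)
    # Drop the arcs pointing to removed nodes and renumber the rest.
    return [[new_id[y] for y in succ if keep[y]]
            for x, succ in enumerate(successors) if keep[x]]


# Node 2 is dangling and is removed; node 1 becomes dangling as a result,
# but it is kept because only one level is pruned.
graph = [[1, 2], [2], []]
print(prune_dangling_once(graph))  # [[1], []]
```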