Datasets
We provide large graphs compressed using LLP + WebGraph. Please read the tutorial to learn how to download and use them.
The graphs can also be loaded and analyzed in JGraphT using suitable adapters. Moreover, for graphs with fewer than 2³¹ arcs we provide a serialized succinct JGraphT representation.
Acknowledging data
If you publish results based on these graphs, please acknowledge the usage of WebGraph and LLP by quoting the following papers:
@inproceedings{BoVWFI, author = "Paolo Boldi and Sebastiano Vigna", title = "The {W}eb{G}raph Framework {I}: {C}ompression Techniques", year = 2004, booktitle = "Proc. of the Thirteenth International World Wide Web Conference (WWW 2004)", address = "Manhattan, USA", pages = "595--601", publisher = "ACM Press" }
@inproceedings{BRSLLP, author = "Paolo Boldi and Marco Rosa and Massimo Santini and Sebastiano Vigna", title = "Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks", booktitle = "Proceedings of the 20th international conference on World Wide Web", editor = "Sadagopan Srinivasan and Krithi Ramamritham and Arun Kumar and M. P. Ravindra and Elisa Bertino and Ravi Kumar", publisher = "ACM Press", year = 2011, pages = "587--596" }
Acknowledging UbiCrawler
If the graphs you are using were gathered by UbiCrawler, please acknowledge the usage of UbiCrawler by quoting the following paper:
@article{BCSU3, author = "Paolo Boldi and Bruno Codenotti and Massimo Santini and Sebastiano Vigna", title = "UbiCrawler: A Scalable Fully Distributed Web Crawler", journal = "Software: Practice \& Experience", year = 2004, volume = 34, number = 8, pages = "711--726" }
Acknowledging BUbiNG
If the graphs you are using were gathered by BUbiNG, please acknowledge the usage of BUbiNG by quoting the following paper:
@inproceedings{BMSB, author = "Boldi, Paolo and Marino, Andrea and Santini, Massimo and Vigna, Sebastiano", title = "{BUbiNG}: Massive Crawling for the Masses", booktitle = "Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web", year = 2014, pages = "227--228", publisher = "International World Wide Web Conferences Steering Committee" }
The graphs
We have gathered our datasets in reasonably homogeneous groups. Besides a brief description, for each group there is a table providing basic information about the available graphs, such as the crawl date, the number of nodes and arcs, and the number of bits per link of the highly compressed version (the version we provide for general usage has faster random access, but a worse compression ratio). The percentage after the number of bits per link gives the compression ratio with respect to the information-theoretical lower bound (that is, the base-2 logarithm of “n² choose m” divided by m, where n is the number of nodes and m the number of arcs). We report, when available, the maximum depth per host and the maximum number of URLs per host. The last column shows the institution that provided support for the crawl.
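As a rough illustration of the percentage described above, the following Python sketch evaluates the lower bound (the base-2 logarithm of “n² choose m” divided by m) and the resulting ratio. The function names are hypothetical and not part of WebGraph. The binomial coefficient is evaluated through `math.lgamma`, so the result is a floating-point approximation rather than an exact value.

```python
import math


def lower_bound_bits_per_link(n: int, m: int) -> float:
    """Approximate log2(C(n*n, m)) / m for a graph with n nodes and m arcs."""
    total = n * n  # number of possible arcs (ordered pairs of nodes)
    # ln C(total, m) = lgamma(total + 1) - lgamma(m + 1) - lgamma(total - m + 1)
    ln_binom = (
        math.lgamma(total + 1)
        - math.lgamma(m + 1)
        - math.lgamma(total - m + 1)
    )
    # Convert from natural logarithm to bits, then divide by the number of arcs.
    return ln_binom / (m * math.log(2))


def compression_ratio(bits_per_link: float, n: int, m: int) -> float:
    """Ratio between the achieved bits per link and the lower bound."""
    return bits_per_link / lower_bound_bits_per_link(n, m)


if __name__ == "__main__":
    # Values taken from the cnr-2000 row of the "Small graphs" table.
    ratio = compression_ratio(2.448, 325_557, 3_216_152)
    print(f"{100 * ratio:.2f}%")
```

With the cnr-2000 figures the computed ratio is close to the 14.88% reported in the table. For the largest graphs the cancellation between the `lgamma` terms costs some precision, so an exact big-integer computation may be preferable if many digits are needed.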
For each graph, we provide a wealth of statistical data and graphs, ranging from plots of the distribution of indegrees and outdegrees to the size of the giant component and a drawing of the largest components. In particular, thanks to HyperBall we can provide the average and median distance, the harmonic diameter and the spid (shortest-paths index of dispersion) of the graph. The statistics are obtained by applying the jackknife method to 100 statistically independent approximations of the neighbourhood function.
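The jackknife step can be sketched as follows, assuming each approximation of the neighbourhood function is available as a list whose entry t counts the pairs of nodes at distance at most t, with all lists having the same length. This is a generic leave-one-out jackknife in Python with hypothetical function names, not necessarily the exact procedure used to produce the published statistics. The average distance serves as the example statistic.

```python
import math
from typing import Callable, List, Sequence, Tuple


def average_distance(nf: Sequence[float]) -> float:
    """Average distance derived from a neighbourhood function.

    nf[t] is the number of pairs (x, y) with d(x, y) <= t. Pairs at
    distance 0 are excluded here, which is a convention of this sketch.
    """
    reachable = nf[-1] - nf[0]
    weighted = sum(t * (nf[t] - nf[t - 1]) for t in range(1, len(nf)))
    return weighted / reachable


def jackknife(
    samples: Sequence[Sequence[float]],
    statistic: Callable[[List[float]], float],
) -> Tuple[float, float]:
    """Return (bias-corrected estimate, standard error) of `statistic`.

    The statistic is applied to the pointwise mean of the samples and to
    each of the k leave-one-out means.
    """
    k = len(samples)
    length = len(samples[0])
    totals = [sum(s[t] for s in samples) for t in range(length)]

    full = statistic([x / k for x in totals])
    leave_one_out = [
        statistic([(totals[t] - s[t]) / (k - 1) for t in range(length)])
        for s in samples
    ]
    loo_mean = sum(leave_one_out) / k

    estimate = k * full - (k - 1) * loo_mean
    variance = (k - 1) / k * sum((v - loo_mean) ** 2 for v in leave_one_out)
    return estimate, math.sqrt(variance)


if __name__ == "__main__":
    # Three toy neighbourhood functions on a 4-node graph (made-up numbers).
    runs = [[4, 10, 12], [4, 9, 12], [4, 11, 12]]
    print(jackknife(runs, average_distance))
```

Other statistics, such as the harmonic diameter or the spid, could be plugged in by passing a different function as `statistic`.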
Small graphs
The graphs in this section are very small web graphs that were crawled for testing purposes, or to have a look at some weird domain. We provide them as testing and debugging tools, as their size makes them easily manageable, but clearly they cannot be used as the only testbed for compression.
Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
---|---|---|---|---|---|---|---|---|
cnr-2000 | 2000 | 325 557 | 3 216 152 | 2.448 (14.88%) | 2.157 (13.11%) | | | IIT |
in-2004 | 2004 | 1 382 908 | 16 917 053 | 1.767 (9.69%) | 1.646 (9.03%) | | | NagaokaUT |
eu-2005 | 2005 | 862 664 | 19 235 140 | 3.159 (18.94%) | 2.667 (15.99%) | | | DSI |
uk-2007-05@100000 | 2007 | 100 000 | 3 050 615 | 1.58 (12.04%) | 1.312 (10.00%) | | | DSI-DELIS |
uk-2007-05@1000000 | 2007 | 1 000 000 | 41 247 159 | 1.249 (7.80%) | 0.972 (6.07%) | | | DSI-DELIS |
Larger crawls
The graphs in this section are larger graphs from different domains. With the exclusion of the series of .uk datasets, these are the largest graphs available from us.
Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
---|---|---|---|---|---|---|---|---|
uk-2014 | 2014 | 787 801 471 | 47 614 527 250 | 1.027 (4.10%) | 0.818 (3.26%) | | | BUbiNG |
eu-2015 | 2015 | 1 070 557 254 | 91 792 261 600 | 0.899 (3.59%) | 0.723 (2.89%) | | | BUbiNG |
gsh-2015 | 2015 | 988 490 691 | 33 877 399 152 | 1.606 (6.12%) | 1.427 (5.44%) | | | BUbiNG |
uk-2014-host | 2014 | 4 769 354 | 50 829 923 | 6.866 (33.97%) | 7.027 (34.76%) | | | BUbiNG |
eu-2015-host | 2015 | 11 264 052 | 386 915 963 | 3.133 (15.85%) | 3.434 (17.37%) | | | BUbiNG |
gsh-2015-host | 2015 | 68 660 142 | 1 802 747 600 | 5.929 (26.05%) | 5.802 (25.49%) | | | BUbiNG |
uk-2014-tpd | 2014 | 1 766 010 | 18 244 650 | 9.983 (53.03%) | 10.963 (58.23%) | | | BUbiNG |
eu-2015-tpd | 2015 | 6 650 532 | 170 145 510 | 4.426 (22.78%) | 5.033 (25.90%) | | | BUbiNG |
gsh-2015-tpd | 2015 | 30 809 122 | 602 119 716 | 8.795 (39.92%) | 8.976 (40.74%) | | | BUbiNG |
clueweb12 | 2012 | 978 408 098 | 42 574 107 469 | 1.756 (6.79%) | 1.312 (5.07%) | | | |
uk-2002 | 2002 | 18 520 486 | 298 113 762 | 1.805 (8.37%) | 1.55 (7.18%) | | | IIT |
indochina-2004 | 2004 | 7 414 866 | 194 109 311 | 1.118 (5.72%) | 0.958 (4.90%) | | | NagaokaUT |
it-2004 | 2004 | 41 291 594 | 1 150 725 436 | 1.41 (6.43%) | 1.143 (5.21%) | | | IIT |
arabic-2005 | 2005 | 22 744 080 | 639 999 458 | 1.383 (6.56%) | 1.131 (5.37%) | 16 | 10 000 | NagaokaUT |
sk-2005 | 2005 | 50 636 154 | 1 949 412 601 | 1.576 (7.24%) | 1.289 (5.92%) | 16 | 100 000 | IIT |
uk-2005 | 2005 | 39 459 925 | 936 364 282 | 1.463 (6.62%) | 1.156 (5.23%) | | | IIT |
Social networks
The graphs in this section are not web graphs, but rather (directed and undirected) networks generated by human activities, such as a collaboration network, a co-starring network, LiveJournal, etc.
Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
---|---|---|---|---|---|---|---|---|
enwiki-2013 | 2013 | 4 206 785 | 101 355 853 | 12.639 (67.03%) | 10.841 (57.49%) | | | NADINE |
enwiki-2015 | 2015 | 4 853 050 | 106 314 263 | 13.08 (68.13%) | 11.459 (59.68%) | |||
enwiki-2016 | 2016 | 5 096 287 | 113 095 771 | 13.185 (68.49%) | 11.56 (60.05%) | |||
enwiki-2017 | 2017 | 5 409 498 | 122 008 994 | 13.153 (68.10%) | 11.506 (59.57%) | |||
enwiki-2018 | 2018 | 5 616 717 | 128 835 798 | 13.167 (68.07%) | 11.491 (59.40%) | |||
enwiki-2019 | 2019 | 5 839 060 | 135 700 782 | 13.096 (67.57%) | 11.488 (59.27%) | |||
enwiki-2020 | 2020 | 6 047 510 | 142 691 609 | 13.128 (67.63%) | 11.515 (59.32%) | |||
enwiki-2021 | 2021 | 6 261 502 | 150 124 927 | 13.139 (67.60%) | 11.577 (59.56%) | |||
enwiki-2022 | 2022 | 6 492 490 | 159 047 205 | 13.156 (67.61%) | 11.564 (59.43%) | |||
enwiki-2023 | 2023 | 6 625 370 | 165 206 104 | 13.191 (67.78%) | 11.662 (59.92%) | |||
enwiki-2024 | 2024 | 6 790 971 | 172 762 484 | 13.129 (67.44%) | 11.574 (59.45%) | |||
itwiki-2013 | 2013 | 1 016 867 | 25 619 926 | 11.039 (65.93%) | 9.736 (58.15%) | | | NADINE |
eswiki-2013 | 2013 | 972 933 | 23 041 488 | 10.519 (62.73%) | 8.855 (52.81%) | | | NADINE |
frwiki-2013 | 2013 | 1 352 053 | 34 378 431 | 11.296 (65.90%) | 10.049 (58.63%) | | | NADINE |
dewiki-2013 | 2013 | 1 532 354 | 36 722 696 | 13.024 (74.82%) | 11.349 (65.20%) | | | NADINE |
enron | | 69 244 | 276 143 | 5.874 (37.83%) | 7.388 (47.58%) | | | |
amazon-2008 | 2008 | 735 323 | 5 158 388 | 9.153 (50.51%) | 8.819 (48.67%) | |||
ljournal-2008 | 2008 | 5 363 260 | 79 023 142 | 10.909 (54.77%) | 10.898 (54.72%) | |||
orkut-2007 | 2007 | 3 072 626 | 234 370 166 | 11.046 (65.98%) | 11.046 (65.98%) | |||
hollywood-2009 | 2009 | 1 139 905 | 113 891 327 | 4.804 (32.20%) | 4.804 (32.20%) | |||
hollywood-2011 | 2011 | 2 180 759 | 228 985 632 | 4.885 (30.95%) | 4.885 (30.95%) | |||
imdb-2021 | 2021 | 2 996 317 | 10 739 291 | 8.684 (41.13%) | 8.684 (41.13%) | |||
dblp-2010 | 2010 | 326 186 | 1 615 400 | 6.776 (38.83%) | 6.776 (38.83%) | |||
dblp-2011 | 2011 | 986 324 | 6 707 236 | 8.475 (45.59%) | 8.475 (45.59%) | |||
hu-tel-2006 | 2006 | 2 317 492 | 46 126 952 | 15.361 (84.07%) | 14.723 (80.58%) | |||
twitter-2010 | 2010 | 41 652 230 | 1 468 365 182 | 13.897 (64.29%) | 13.148 (60.83%) | |||
wordassociation-2011 | 2011 | 10 617 | 72 172 | 10.599 (87.95%) | 10.13 (84.06%) |
Facebook graphs
These graphs have been studied in the “Four degrees of separation” paper by Lars Backstrom, Paolo Boldi, Marco Rosa, Johan Ugander and Sebastiano Vigna. The series sections the Facebook graph geographically and temporally: please see the paper for the details. Note that years up to 2011 are timed at the first of January, but the “current” graphs are timed at May 2011. In particular, the latest whole Facebook graph prompted the title of the paper.
We cannot distribute the graphs for obvious reasons, but you can download aggregated data (WebGraph properties, HyperANF runs and some degree distributions). If you publish results based on our data, please acknowledge our work by quoting the “Four degrees of separation” paper.
Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
---|---|---|---|---|---|---|---|---|
fb_it-2007 | 2007 | 159 801 | 209 998 | |||||
fb_se-2007 | 2007 | 11 156 | 43 508 | |||||
fb_itse-2007 | 2007 | 172 118 | 257 510 | |||||
fb_us-2007 | 2007 | 8 849 896 | 1 058 512 594 | |||||
fb-2007 | 2007 | 12 957 703 | 1 289 268 655 | |||||
fb_it-2008 | 2008 | 335 769 | 1 975 722 | |||||
fb_se-2008 | 2008 | 1 005 074 | 46 326 530 | |||||
fb_itse-2008 | 2008 | 1 350 503 | 48 622 390 | |||||
fb_us-2008 | 2008 | 20 128 672 | 2 134 602 645 | |||||
fb-2008 | 2008 | 56 045 072 | 4 267 607 760 | |||||
fb_it-2009 | 2009 | 4 566 919 | 232 070 608 | |||||
fb_se-2009 | 2009 | 1 595 655 | 111 054 342 | |||||
fb_itse-2009 | 2009 | 6 157 764 | 344 271 392 | |||||
fb_us-2009 | 2009 | 41 531 899 | 4 642 488 625 | |||||
fb-2009 | 2009 | 139 066 164 | 12 332 220 710 | |||||
fb_it-2010 | 2010 | 11 828 018 | 1 453 871 140 | |||||
fb_se-2010 | 2010 | 2 973 416 | 299 877 604 | |||||
fb_itse-2010 | 2010 | 14 820 162 | 1 756 721 114 | |||||
fb_us-2010 | 2010 | 92 442 765 | 11 920 357 832 | |||||
fb-2010 | 2010 | 332 316 968 | 37 551 427 745 | |||||
fb_it-2011 | 2011 | 17 131 210 | 3 395 374 924 | |||||
fb_se-2011 | 2011 | 3 958 347 | 556 329 804 | |||||
fb_itse-2011 | 2011 | 21 108 892 | 3 957 504 632 | |||||
fb_us-2011 | 2011 | 131 415 278 | 24 745 022 096 | |||||
fb-2011 | 2011 | 562 368 789 | 95 057 125 765 | |||||
fb_it-current | | 19 781 745 | 4 471 227 570 | 10.343 (57.91%) | | | | |
fb_se-current | | 4 345 050 | 671 464 768 | 10.225 (63.03%) | | | | |
fb_itse-current | | 24 144 165 | 5 149 982 270 | 10.269 (56.33%) | | | | |
fb_us-current | | 149 068 552 | 31 865 069 210 | 11.63 (55.77%) | | | | |
fb-current | | 721 100 043 | 137 325 050 412 | 12.298 (52.79%) | | | | |
The .uk DELIS series
From 2006 to 2007, we gathered twelve monthly snapshots of about 100 million pages each for the DELIS project. If you want a large recent graph we suggest using uk-2007-05. A special effort has been made to merge all the snapshots into the uk-union dataset by labelling each node and arc, thus obtaining a large time-aware graph.
Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
---|---|---|---|---|---|---|---|---|
uk-2006-05 | 2006 | 77 741 046 | 2 965 197 340 | 1.489 (6.65%) | 1.246 (5.56%) | 16 | 50 000 | DSI-DELIS |
uk-2006-06 | 2006 | 80 644 902 | 2 481 281 617 | 1.402 (6.16%) | 1.148 (5.04%) | 16 | 50 000 | DSI-DELIS |
uk-2006-07 | 2006 | 96 395 298 | 3 030 665 444 | 1.375 (5.98%) | 1.128 (4.91%) | 16 | 50 000 | DSI-DELIS |
uk-2006-08 | 2006 | 100 751 978 | 3 250 153 746 | 1.353 (5.88%) | 1.107 (4.81%) | 16 | 50 000 | DSI-DELIS |
uk-2006-09 | 2006 | 106 288 541 | 3 871 625 613 | 1.194 (5.21%) | 0.975 (4.25%) | 16 | 50 000 | DSI-DELIS |
uk-2006-10 | 2006 | 93 463 772 | 3 130 910 405 | 1.26 (5.51%) | 1.016 (4.45%) | 16 | 50 000 | DSI-DELIS |
uk-2006-11 | 2006 | 106 783 458 | 3 479 400 938 | 1.307 (5.66%) | 1.047 (4.54%) | 16 | 50 000 | DSI-DELIS |
uk-2006-12 | 2006 | 103 098 631 | 3 768 836 665 | 1.158 (5.06%) | 0.928 (4.06%) | 16 | 50 000 | DSI-DELIS |
uk-2007-01 | 2007 | 108 563 230 | 3 929 837 236 | 1.183 (5.15%) | 0.948 (4.13%) | 16 | 50 000 | DSI-DELIS |
uk-2007-02 | 2007 | 110 123 614 | 3 944 932 566 | 1.182 (5.14%) | 0.947 (4.12%) | 16 | 50 000 | DSI-DELIS |
uk-2007-03 | 2007 | 107 565 084 | 3 642 701 825 | 1.274 (5.53%) | 1.023 (4.44%) | 16 | 50 000 | DSI-DELIS |
uk-2007-04 | 2007 | 106 867 191 | 3 790 305 474 | 1.203 (5.24%) | 0.96 (4.18%) | 16 | 50 000 | DSI-DELIS |
uk-2007-05 | 2007 | 105 896 555 | 3 738 733 648 | 1.207 (5.26%) | 0.965 (4.20%) | 16 | 50 000 | DSI-DELIS |
uk-union-2006-06-2007-05 | 2007 | 133 633 040 | 5 507 679 822 | | | | | DSI-DELIS |
Other datasets
In 2001, we downloaded from the Stanford Webbase project site a graph that we distribute in compact form. We also provide some statistical data about the Altavista dataset distributed by Yahoo! within the Webscope program (AltaVista webpage connectivity dataset, version 1.0). The “nd” (no-dangling) version was obtained by pruning (one level of) dangling nodes.
Graph | Crawl date | Nodes | Arcs | Bits/link | Bits/link (transpose) | Max depth | Max URLs | Thanks |
---|---|---|---|---|---|---|---|---|
webbase-2001 | 2001 | 118 142 155 | 1 019 903 190 | 2.784 (11.07%) | 2.561 (10.18%) | | | WebBase |
altavista-2002 | 2002 | 1 413 511 390 | 6 636 600 779 | 3.932 (13.28%) | 3.099 (10.47%) | | | Webscope |
altavista-2002-nd | 2002 | 653 912 338 | 4 226 882 364 | 4.024 (14.35%) | 2.917 (10.40%) | | | Webscope |