twitter-2010

If you publish results based on this graph, please cite the references suggested on the dataset page.

Twitter is a website, owned and operated by Twitter Inc., which offers a social networking and microblogging service, enabling its users to send and read messages called tweets. Tweets are text-based posts of up to 140 characters displayed on the user's profile page.

This is a crawl presented by Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon in “What is Twitter, a Social Network or a News Media?”, Proceedings of the 19th International World Wide Web (WWW) Conference, pages 591–600, 2010, ACM Press. Nodes are users, and there is an arc from x to y if y is a follower of x. In other words, arcs follow the direction of tweet transmission.
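The arc convention above is easy to get backwards, so here is a minimal sketch on a toy example. The user names and the `follows_to_arcs` helper are hypothetical and only illustrate the direction: a pair (follower, followee) becomes an arc from the followee to the follower, so a node's outdegree counts its followers.

```python
from collections import Counter

def follows_to_arcs(follows):
    """Turn (follower, followee) pairs into arcs (x, y) where y follows x."""
    return [(followee, follower) for follower, followee in follows]

# Toy data (hypothetical): bob and carol follow alice; alice follows bob.
follows = [("bob", "alice"), ("carol", "alice"), ("alice", "bob")]
arcs = follows_to_arcs(follows)

# Outdegree = number of followers; indegree = number of accounts followed.
outdegree = Counter(x for x, _ in arcs)
indegree = Counter(y for _, y in arcs)

print(outdegree["alice"])  # alice has two followers
print(indegree["alice"])   # alice follows one account
```

Under this convention, the maximum outdegree in the table below is the largest follower count, and the maximum indegree is the largest number of accounts followed by a single user.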

Note that the distance distribution reported in the paper (Figure 4) is quite different from the one we computed (with relative standard error ≤0.4% on all points of the cumulative distribution function). Correspondingly, the average distance reported in the paper (4.12) is quite different from our estimate. We have verified that we obtain the distribution reported here even when using breadth-first sampling.

The node numbering of the original graph was not compact, as the node identifiers were Twitter's actual internal identifiers. We therefore renumbered the nodes contiguously. The original identifiers can be found in the file with extension .ids. You can thus access the Twitter page associated with a node by looking up the corresponding identifier and then using the Twitter API. For instance, the node of maximum outdegree is node 2997469, corresponding to identifier 19058681, whose owner can be found at http://api.twitter.com/1/users/show.xml?user_id=19058681.
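The lookup from a renumbered node to its original identifier might be sketched as follows. This assumes, without confirmation from the dataset page, that the decompressed .ids file is plain text with one identifier per line and that line k (counting from 0) belongs to node k. The function name `original_id` is hypothetical.

```python
import gzip
from itertools import islice

def original_id(ids_path, node):
    """Return the original identifier of a renumbered node.

    Assumption: line k (0-based) of the decompressed file holds the
    identifier of node k, written as a decimal integer.
    """
    if node < 0:
        raise ValueError("node must be non-negative")
    # Stream the file so the full identifier list is never held in memory.
    with gzip.open(ids_path, "rt", encoding="ascii") as f:
        line = next(islice(f, node, None), None)
    if line is None:
        raise IndexError(f"node {node} is beyond the end of {ids_path}")
    return int(line)
```

If the layout assumption holds, `original_id("twitter-2010.ids.gz", 2997469)` should return the identifier quoted in the text above for that node. The scan is linear in the node number, which is acceptable for occasional lookups; for many lookups it would be better to load the identifiers into an array once.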

Basic data
nodes                   41 652 230
arcs                    1 468 365 182
bits/link               13.897 (64.29%)
bits/link (transpose)   13.148 (60.83%)
average degree          35.253
maximum indegree        770 155
maximum outdegree       2 997 469
dangling nodes          3.72%
buckets                 0.09%
largest component       33 479 734 (80.38%)
average distance        4.46 (± 0.002)
reachable pairs         81.08% (± 0.257)
median distance         5 (73.16%)
harmonic diameter       5.29 (± 0.016)
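Two of the figures above can be cross-checked from the node and arc counts. The sketch below assumes that the average degree is simply arcs divided by nodes and that the largest-component percentage is taken relative to the total number of nodes. The page does not state either definition, but both reproduce the tabulated values.

```python
# Counts copied from the "Basic data" table.
nodes = 41_652_230
arcs = 1_468_365_182
largest_component = 33_479_734

# Assumed definitions: arcs per node, and component size as a share of all nodes.
average_degree = arcs / nodes
largest_share = 100 * largest_component / nodes

print(f"average degree: {average_degree:.3f}")      # average degree: 35.253
print(f"largest component: {largest_share:.2f}%")   # largest component: 80.38%
```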
Random access (recommended)
Filename                    Size
twitter-2010.graph          2.5G
twitter-2010.properties     4.0K
twitter-2010-t.graph        2.3G
twitter-2010-t.properties   4.0K
twitter-2010.map            462M
twitter-2010.smap           462M
twitter-2010.md5sums        4.0K
twitter-2010.lmap           576M
twitter-2010.fcl            434M
twitter-2010.ids.gz         158M
twitter-2010.stats          4.0K
twitter-2010.indegree       1.5M
twitter-2010.outdegree      5.8M
twitter-2010.scc            159M
twitter-2010.sccsizes       31M
Sequential access (high compression)
Filename                       Size
twitter-2010-hc.graph          2.4G
twitter-2010-hc.properties     4.0K
twitter-2010-hc-t.graph        2.3G
twitter-2010-hc-t.properties   4.0K
Natural order (random access)
Filename                      Size
twitter-2010-nat.graph        2.8G
twitter-2010-nat.properties   4.0K
twitter-2010-nat.fcl          275M
twitter-2010-nat.ids.gz       141M
JGraphT serialized succinct representation
Filename              Size
twitter-2010.suxdir   8.0G
twitter-2010.suxmap   456M
Plots
Indegree-frequency plot (with Fibonacci binning)
Outdegree-frequency plot (with Fibonacci binning)
Indegree-rank plot (cumulative)
Outdegree-rank plot (cumulative)
Distance probability mass function
Connected-components size distribution
Large connected components
Distribution of the logarithm of the successor gaps
Distribution of successor gaps