Using the LAW datasets
In this tutorial we illustrate the minimal steps required to download and use a LAW dataset. Such a dataset is composed by a graph and, in some cases, a list of node identifiers (e.g., URLs in the case of web graphs).
We assume that you have correctly installed WebGraph and its dependencies. Please refer to the WebGraph site for further details on installation.
We also assume that you add JVM-specific option so to allocate
enough memory: in case the JVM throws a
java.lang.OutOfMemoryError
, you will probably
need to set a larger maximum Java heap size with the option
-Xmx (e.g., -Xmx2G; please use man java, or java
-X, for further help).
For sake of simplicity, the following examples will be based on a Unix environment endowed with the usual utilities (as provided, for example, by an up-to-date GNU/Linux distribution). The same results can be obtained in any other environment supporting Java and a reasonable set of standard utilities.
Every file in a dataset shares a common basename: in the following discussion and examples we will use basename to denote it.
Structural information
The structural part of a dataset is contained in the files basename.properties and basename.graph. These two files are sufficient to scan the graph sequentially. If you want to access the graph randomly, we will offsets, which can be easily generated starting from the first two files.
First of all, download the files basename.properties, basename.graph, and basename.md5sums; since some files are very large, please use a download tool that can restart the download in case of failure (such as, for example, cURL, or GNU/wget). Assuming that the data is located at http://some.url/some/path/, the download can be performed as
for ext in .properties .graph .md5sums; do wget -c http://some.url/some/path/basename$ext done
Then, verify the MD5 sum of the downloaded file to check for possible download problems:
md5sum -c basename.md5sums
Don't worry about can't open error messages (they simply mean that the file basename.md5sums contains many MD5 sums for file you don't have already), but a MD5 check failed error message means that the related file is corrupted and can't be used (you need to download it again, or to rebuild it).
Warning: On very large files, download errors are not uncommon, as well as storage problems: a file that was once correct can become corrupted. Please verify often file integrity using MD5 sums.
You can now build the offsets:
java it.unimi.dsi.webgraph.BVGraph -o -O -L basename
The command above will create a basename.offsets file containing offsets, and a basename.obl file containing a serialised big list of offsets. If WebGraph finds such a file, and its modification date is later than that of the offsets file, it will load (much more quickly) the former rather than the latter.
Accessing a graph
It is now trivial to load and access the graph. The class documentation of ImmutableGraph explains in detail which methods are available. See also the examples.
Other versions
Note that each graph is available in different versions. As explained on the web site, we also provide a highly compressed version (-hc) that is useful for transmission or for purely sequential access (it turns to be actually faster for sequential access—smaller size, better caching), and a natural version, in which the graph is presented in a “natural” way. For web graphs, the natural presentation is in lexicographical URL ordering (i.e., nodes are numbered following the lexicographical order of their URLs). For social networks, the same criterion is applied when identifiers are available (e.g., names for DBLP): otherwise, we just give the graph in the form it has been given to us from the source.
Recompressing
If you are interested in experimenting with different compression parameters, you can easily recompress a graph.
For instance, here we recompress a graph with a very low maximum backward reference:
java it.unimi.dsi.webgraph.BVGraph -o -m 1 basename-hc basename-fast
The recompression will generate the files
basename-fast.properties, basename-fast.graph
and basename-fast.offsets. You can generate
the usual big list with java it.unimi.dsi.webgraph.BVGraph -o -L basename-fast
.
Mapping identifiers to nodes and viceversa
In a LAW dataset graph nodes are simply represented by natural numbers, but they correspond to actual entities (e.g., web pages) which have an identifier (e.g., URLs). If you are interested to map node numbers to identifiers and viceversa, you need suitable data structures. These structures can be downloaded in a pre-packaged form, but we will explain later how to rebuild them in case you want to tune their parameters.
Mapping nodes to identifiers
The space occupied by the identifiers is usually very large—in most cases, significantly larger than the associated graph. We store identifiers using front-coded lists, which compress by prefix omission. A string in a list is stored by noting the common prefix with the previous string, and then by writing the rest of the string. Using sorted identifiers, the resulting compression is very high (see the file basename-nat.fcl). Since the recommended format is a permutation of the natural order, we provide a sligtly bigger list basename.fcl, which wraps a front-coded list over the sorted identifiers using a PermutedFrontCodedStringList, and provide the correct answers for the permuted graph.
Mapping identifiers to nodes
The situation is completely different for the inverse mappings, due to the availability of extremely compact data structures that map strings to numbers. We provide a succinct function basename.map that implements Object2LongFunction and maps an identifier to the corresponding node, but will return random numbers when applied to strings that are not identifiers, and a signed succinct function basename.map that implements StringMap (albeit not some optional operations) and will be able to detect non-identifiers with high probability.
Mapping back and forth
Finally, if you need to map back and forth we provide a StringMap combining a succinct function and a front-coded list, with name basename.lmap (for “literally signed map”). The two methods you need are getLong(), which provides mapping from strings to indices, and list(), which returns a list inverting the mapping.
Strongly connected components
Each graph comes with files describing the structure of its strongly connected components. basename.scc is a sequence of integers in Java binary format assigning to each node its component (components are sorted by size, to component 0 is the largest component). Another file in the same format, basename.sccsizes, indexed by components, provides the size of each component. You can easily load these files using fastutil's BinIO facilities:
int[] component = BinIO.loadInts( "basename.scc" ); int[] size = BinIO.loadInts( "basename.sccsizes" );
Statistics
For each graph, we provide a number of statistics. Some of them are already listed in the properties file associated with the graph. Other (self-descriptive) statistics are contained in the basename.stats file. The files basename.indegree and basename.outdegree contain, in ASCII format, the indegree and outdegree frequency distributions.
In some cases, we provide the average distance, the harmonic diameter and other structural properties. These figures require a significant computational effort to be estimated reliably (we use HyperBall to that purpose). We also provide the empirical standard deviation (the computation is probabilistic).
Rebuilding string maps
All string-mapping structures are built starting from the file basename.urls (or basename.ids), whose i-th line contains the identifier of the i-th node (lines are numbered starting from zero), or from the file basename-nat.urls (or basename-nat.ids), which contains (usually) the identifiers in lexicographical order.
Warning: Usually URLs from our dataset are in ASCII, so you can use the -e option on all classes to speed up reading. This is not true of identifiers, which are UTF-8 encoded.
First of all, you need to download the files, uncompress them and check their MD5 (our example is a web graph):
wget -c http://some.url/some/path/basename.urls.gz wget -c http://some.url/some/path/basename-nat.urls.gz gunzip basename.urls.gz gunzip basename-nat.urls.gz md5sum -c basename.md5sums
We build a succinct function mapping the URLs to their position using an
GOV3Function
provided by
Sux4J:
java it.unimi.dsi.sux4j.mph.GOV3Function -z basename.map basename.urls.gz
If you want to be able not only to map URLs to node numbers but also to determine if a given string is actually a URL present in the map, you will need a signed the map:
java it.unimi.dsi.sux4j.mph.GOV3Function -s 32 -z basename.smap basename.urls.gzThe signing process stores for each URL a signature (32 bits, in the example above) that will be used to check that the result of the map is correct. The longer the signature, the more precise the map (the probability of error is 2−s, where s is the size of the signature).
Once you have created these files, you can use the map they
contain as in the folowing Java code snippet, which uses the
convenient
loadObject()
method from
BinIO
, which opens a file
given its filename and retrieves the stored object:
... String url; Object2LongFunction<? extends CharSequence> url2node = (Object2LongFunction<? extends CharSequence>) BinIO.loadObject( "basename.map" ); ... long node = url2node.getLong( url ); ...where if
url
is a URL present in the map, then
node
will be equal to the corresponding node number,
or
... String url; StringMap<? extends CharSequence> url2node = (StringMap<? extends CharSequence>) BinIO.loadObject( "basename.smap" ); ... int node = url2node.get( url ); ...where if
url
is a URL present in the map, then
node
will be equal to the corresponding node number,
else node
will be equal to -1.
To map node numbers to URLs, instead, you must
use a
FrontCodedStringList
, available in the
DSI utilities. The process is slightly
tricky, as we need first to build a front-coded list for the sorted URLs,
and them wrap them in a PermutedFrontCodedStringList.
java it.unimi.dsi.util.FrontCodedStringList \ -z -u -r 32 basename-nat.fcl < basename-nat.urls.gz
This command needs a few comments. The option -r sets the ratio, that is, how often a complete string (without prefix omission) will be stored. The largest the value, the smaller and slower the map. The option -u compacts significantly the map because it stores the strings internally as UTF-8 bytes (also in this case the map is slightly slower, but not that much).
You will need now to download the permutation basename-nat2llpa.perm. Then,
java it.unimi.dsi.util.PermutedFrontCodedStringList -i basename-nat.fcl \ basename-nat2llpa.perm basename.fcl
Once you have created the file, you can use the map it contains as in the following Java code snippet
... CharSequence url; List<? extends CharSequence> node2url = (List<? extends CharSequence>) BinIO.loadObject( "basename.fcl" ); ... url = node2url.get( node ); ...
Finally, if you want to generate a literally signed map you have to combine the front-coded list with a map:
java it.unimi.dsi.util.LiterallySignedStringMap -z basename.map basename.fcl basename.smap
Smaller but slower maps
A front-coded list can be very large: if you do not have enough memory, or you can afford a slower access, you can build an external prefix map instead, which provides a bidirectional mapping from URLs to nodes, and also from URL prefixes to intervals of nodes:
java it.unimi.dsi.util.ImmutableExternalPrefixMap \ -z -b4Ki -o basename.urls.gz basename.pmap