This package is a companion to the paper "Arc-similarity via triangular random walks" by Paolo Boldi and Marco Rosa (2011). It is distributed, under the GPL license, within the Satellite Projects of the Laboratory for Web Algorithmics (DSI, Università degli Studi di Milano, Italy). The classes of this package allow anyone to reproduce the experiments outlined in the paper.
Essentially, the paper describes how one can, for a given (directed) graph G, build another graph (called the line graph of G), which is itself directed and has weights on its arcs. The nodes of the line graph are in bijective correspondence with the arcs of G. The paper proposes two ways to weight the arcs of the line graph:
Whichever weighting scheme is used, a clustering tool can be applied to the weighted line graph to obtain a clustering of its nodes (i.e., of the arcs of G); in particular, we suggest the use of this implementation of the Louvain method.
The classes provided here depend strongly on the WebGraph framework, and we assume that the reader is well aware of how that framework works.
Suppose you start from a given graph, represented using the WebGraph framework. From now on, let graph denote the basename of the graph itself (obtained in some way, e.g., by downloading it from the http://law.dsi.unimi.it/ website; see below for more details).
To build the (unweighted) line graph, use the it.unimi.dsi.webgraph.Transform class of the WebGraph toolbox; a typical invocation is:
java it.unimi.dsi.webgraph.Transform line graph graph-line graph-map 2000000000
Here line is the transformation to be applied (the line-graph transformation) and graph is the input graph. The following two parameters are the basename of the resulting graph (graph-line in this example) and the basename of the map files (graph-map): the map files are two files (with extensions .source and .target) that contain one integer per node of the line graph; the k-th integer in each file represents the id of the source (respectively, target) node of the arc (in the original graph) that node k (in the line graph) represents.
The last argument in the above invocation (which is optional) is the size of the batch used for the transformation itself. For large input graphs, we suggest keeping it as large as possible (say, close to Integer.MAX_VALUE), compatibly with the available core memory (which may have to be provided as an extra argument to the JVM).
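To make the role of the two map files concrete, here is a minimal in-memory sketch of the standard directed line-graph construction, in which the node for arc (x, y) has an arc to the node of every arc (y, z). It is an illustration only, not the WebGraph implementation: the class name LineGraphSketch, the arc-array input and the node numbering (arcs numbered in input order) are assumptions made for this example, and the actual transformation may number the nodes differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative in-memory line-graph construction (hypothetical class, not part of WebGraph). */
final class LineGraphSketch {
    /** source[k] and target[k] are the endpoints, in the original graph, of the arc that line-graph node k represents. */
    final int[] source;
    final int[] target;
    /** successors[k] lists the line-graph nodes that node k points to. */
    final int[][] successors;

    /** arcs[k] = {x, y} is the k-th arc of the original graph, which has numNodes nodes. */
    LineGraphSketch(int numNodes, int[][] arcs) {
        int m = arcs.length;
        source = new int[m];
        target = new int[m];
        // Group arc indices by their source node in the original graph.
        List<List<Integer>> outgoing = new ArrayList<>();
        for (int x = 0; x < numNodes; x++) outgoing.add(new ArrayList<>());
        for (int k = 0; k < m; k++) {
            source[k] = arcs[k][0];
            target[k] = arcs[k][1];
            outgoing.get(source[k]).add(k);
        }
        // The arc k = (x, y) is followed by every arc that leaves y.
        successors = new int[m][];
        for (int k = 0; k < m; k++) {
            List<Integer> next = outgoing.get(target[k]);
            successors[k] = new int[next.size()];
            for (int i = 0; i < next.size(); i++) successors[k][i] = next.get(i);
        }
    }
}
```

In this sketch the arrays source and target play the role of the .source and .target map files: entry k of each array identifies the original arc behind line-graph node k.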
The weighted line graph (whose weights are those indicated in the paper as wT) is built using the BuildTriangularWeightedLineGraph class. The typical invocation is:
java it.unimi.dsi.law.triangrw.BuildTriangularWeightedLineGraph graph graph-line graph-map graph-weighted
The first three parameters are the basenames of the input files (the original graph, the unweighted line graph and the map files). The last parameter is the basename of the output file. The output is going to be an ArcLabelledImmutableGraph, whose underlying graph is the line graph itself, and whose labels are 4-byte floats.
The above class accepts an option to set the value of β (the default value is 0.2), and a switch to obtain the ratio-triangular version (the mass-triangular version is used by default).
To compute the PageRank on the weighted line graph so obtained, one can use the machinery of the graph_weighted package. Typically:
java es.yrbcn.graph.weighted.WeightedPageRankPowerMethod -a 0.99 graph-weighted graph-rank
This call computes the weighted PageRank on the weighted (line) graph, with α=0.99. The output is a rank file called graph-rank.ranks.
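For readers unfamiliar with weighted PageRank, the following is a minimal sketch of a power iteration in which a random walker follows each outgoing arc with probability proportional to its weight. It is not the code of the graph_weighted package: the class name WeightedPageRankSketch, the arc-array input, the fixed iteration count and the uniform treatment of nodes without outgoing arcs are all assumptions made for this illustration.

```java
import java.util.Arrays;

/** Illustrative weighted PageRank by power iteration (hypothetical class, not the graph_weighted implementation). */
final class WeightedPageRankSketch {
    /**
     * arcs[i] = {source, target}; weights[i] is the positive weight of that arc.
     * Returns a rank vector over n nodes whose entries sum to 1.
     */
    static double[] rank(int n, int[][] arcs, double[] weights, double alpha, int iterations) {
        // Total outgoing weight per node, used to turn arc weights into transition probabilities.
        double[] outWeight = new double[n];
        for (int i = 0; i < arcs.length; i++) outWeight[arcs[i][0]] += weights[i];

        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            // Rank held by nodes without outgoing arcs is spread uniformly (an assumption of this sketch).
            double dangling = 0;
            for (int v = 0; v < n; v++) if (outWeight[v] == 0) dangling += rank[v];
            double[] next = new double[n];
            Arrays.fill(next, (1 - alpha) / n + alpha * dangling / n);
            // Each arc forwards a share of its source's rank proportional to the arc weight.
            for (int i = 0; i < arcs.length; i++) {
                int s = arcs[i][0], t = arcs[i][1];
                next[t] += alpha * rank[s] * weights[i] / outWeight[s];
            }
            rank = next;
        }
        return rank;
    }
}
```

With α close to 1, as in the invocation above, the walker rarely teleports, so the resulting scores are dominated by the arc weights rather than by the uniform jump.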
The final weighted line graph (whose weights are those indicated in the paper as vT) is built using the BuildTriangularWeightedLineGraph class again. The typical invocation is now:
java it.unimi.dsi.law.triangrw.BuildTriangularWeightedLineGraph -s graph-rank.ranks graph graph-line graph-map graph-prweighted
The last parameter is the basename of the output file. The output is going to be again an ArcLabelledImmutableGraph, whose underlying graph is the line graph itself, and whose labels are 4-byte floats.
The final weighted line graph is stored in binary format, but it can be inspected with:
java es.yrbcn.graph.weighted.DumpWeightedGraph graph-prweighted
This will output the arcs of the line graph, one per line; each arc will be specified as source node, target node and weight, TAB-separated. Nodes of the line graph can be mapped to the nodes of the original graph using the map files (graph-map.source and graph-map.target, in our example).
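As an illustration of how a dumped line relates to the original graph, the sketch below parses one TAB-separated line and translates both line-graph nodes into arcs of the original graph. The class DumpLineMapper is hypothetical, and the sketch assumes that the contents of the two map files have already been loaded into int arrays; the on-disk encoding of those files is not described here.

```java
/** Illustrative translation of one dumped line into original-graph arcs (hypothetical helper). */
final class DumpLineMapper {
    /**
     * dumpLine is "sourceNode TAB targetNode TAB weight", as produced by the dump.
     * source[k] and target[k] are assumed to hold the k-th integers of the two map files.
     */
    static String describe(String dumpLine, int[] source, int[] target) {
        String[] fields = dumpLine.split("\t");
        int from = Integer.parseInt(fields[0]);
        int to = Integer.parseInt(fields[1]);
        float weight = Float.parseFloat(fields[2]);
        // Each line-graph node stands for one arc (source, target) of the original graph.
        return "(" + source[from] + "," + target[from] + ") -> ("
                + source[to] + "," + target[to] + ") weight " + weight;
    }
}
```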
If you want to feed the resulting graph to the Louvain package, you can use its convert tool (which starts from the dumped version of the graph, obtained as explained in the previous paragraph).
The main disadvantage of the strategy presented above is that it requires building the line graph explicitly and working on it (both to compute the weighted PageRank values, if needed, and to perform the clustering). Alternatively, as explained in the paper, we provide a new arc-clustering method that does the job directly (internally, it creates the weighted line graph implicitly and works on it).
To use it on the graph graph, you will typically invoke it as follows:
java it.unimi.dsi.law.triangrw.ALP graph grapht clust
where grapht is the transpose of the graph (which can be built using it.unimi.dsi.webgraph.Transform) and clust is the result file (one integer per arc, denoting the label of the arc at the end of the clustering). Used as above, the class will do the clustering using mass-triangular wT weights; the parameter β can be set using a suitable option, and another switch can be used to obtain ratio-triangular weights instead.
If you want to use the vT weights, you will first have to run PageRank on the weighted line graph, which you can do implicitly (i.e., without computing the weighted line graph explicitly), by invoking:
java it.unimi.dsi.law.triangrw.LinePageRank graph scores
This will compute the PageRank scores (written in scores): the above class can optionally receive both α and β as parameters, and it also has a switch to determine whether to use the mass- or ratio-triangular version.
After running LinePageRank, you can use its resulting scores as follows:
java it.unimi.dsi.law.triangrw.ALP graph grapht -s scores clust
In this package, we provide the following other classes:
ArcSimilarity: an interface that allows one to compute the similarity between two arcs of a given graph.
TextTFIDFArcSimilarity: a concrete implementation of the above interface that computes the TF/IDF similarity of two arcs, provided that there is a text file containing one line per arc; every line is a TAB-separated triple, where the first element is the source-node id, the second element is the target-node id and the third element is a sequence of space-separated (possibly stemmed) words. The class needs to be fed with a map (mapping terms into numbers, typically obtained by building a monotone minimal perfect hash function) and a file of counts, which records how many lines contain each given term.
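To give an idea of what a TF/IDF similarity between two arcs may look like, here is a minimal sketch that weights each word by its frequency in the line times the logarithm of the inverse fraction of lines containing it, and then compares two lines by cosine similarity. The exact formula used by TextTFIDFArcSimilarity is not stated here, so this should be read as one common variant under stated assumptions; the class TfIdfSketch and its use of string-keyed maps (instead of a term map built from a hash function) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative TF/IDF cosine similarity between two word sequences (hypothetical class). */
final class TfIdfSketch {
    /**
     * Builds a TF/IDF vector for a space-separated sequence of words.
     * counts maps each term to the number of lines containing it; numLines is the total number of lines.
     * Terms missing from counts are treated as occurring in one line (an assumption of this sketch).
     */
    static Map<String, Double> vector(String words, Map<String, Integer> counts, int numLines) {
        Map<String, Double> v = new HashMap<>();
        // Term frequency: number of occurrences of each word in this line.
        for (String w : words.split(" ")) v.merge(w, 1.0, Double::sum);
        // Scale by the inverse document frequency.
        for (Map.Entry<String, Double> e : v.entrySet()) {
            double idf = Math.log((double) numLines / counts.getOrDefault(e.getKey(), 1));
            e.setValue(e.getValue() * idf);
        }
        return v;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double x : b.values()) normB += x * x;
        return normA == 0 || normB == 0 ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```

Two arcs whose lines share rare words score close to 1, while arcs with no words in common score 0.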
The remaining classes can be used to help parse the DBLP.xml file. Here is an example of how they can be used:
echo "*** Extracting authors"
java -DentityExpansionLimit=1000000 ExtractAuthors dblp.xml >authors.txt
echo "*** Sorting authors"
LC_ALL=C sort authors.txt >authors-sorted.txt
echo "*** Producing author map"
java it.unimi.dsi.sux4j.mph.LcpMonotoneMinimalPerfectHashFunction authors.mmph authors-sorted.txt
echo "*** Extracting titles"
java -Xmx6G -DentityExpansionLimit=1000000 ExtractTitles dblp.xml authors.mmph >titles.txt
echo "*** Stemming titles"
cat titles.txt | java StemTitles >titles-stemmed.txt
echo "*** Sorting terms"
cut -f 3 titles-stemmed.txt | tr ' ' '\n' | LC_ALL=C sort -u >terms.txt
echo "*** Producing term map"
java it.unimi.dsi.sux4j.mph.LcpMonotoneMinimalPerfectHashFunction terms.mmph terms.txt
echo "*** Counting term frequency"
java CountTermFrequency terms.mmph terms.frequency <titles-stemmed.txt
echo "*** Producing arc list"
cut -f 1-2 titles-stemmed.txt | sort -nk 2 | sort -snk 1 >coauth.arclist
echo "*** Building co-authorship graph"
java it.unimi.dsi.webgraph.BVGraph -g ArcListASCIIGraph coauth.arclist coauth-temp
java it.unimi.dsi.webgraph.Transform symmetrizeOffline coauth-temp coauth
rm coauth-temp*
echo "*** Building the similarity object"
java TextTFIDFArcSimilarity -s titles-stemmed.txt terms.mmph terms.frequency coauth.similarity