Package it.unimi.dsi.law.warc.tool
Class ExtractLinks
java.lang.Object
it.unimi.dsi.law.warc.tool.ExtractLinks
public class ExtractLinks extends Object
Extracts links from a WARC file.
This class scans a WARC file, parsing pages and extracting links. Links are resolved using a given StringMap. The resulting successor lists are given one per line, with the index of the source node followed by the outdegree, followed by the successor indices.
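As an illustration of the line format described above, the following is a minimal sketch of a reader for one output line. The class name SuccessorLine is hypothetical and not part of this package, and the sketch assumes that fields are separated by whitespace, which this page does not state.

```java
// Hypothetical helper, not part of it.unimi.dsi.law.warc.tool.
// Assumption: the fields of a line are separated by whitespace.
public final class SuccessorLine {
    /** Index of the source node. */
    public final long source;
    /** Indices of the successors, in the order they appear on the line. */
    public final long[] successors;

    private SuccessorLine(long source, long[] successors) {
        this.source = source;
        this.successors = successors;
    }

    /** Parses a line of the form "source outdegree succ_1 ... succ_outdegree". */
    public static SuccessorLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Expected at least source and outdegree: " + line);
        }
        long source = Long.parseLong(tokens[0]);
        int outdegree = Integer.parseInt(tokens[1]);
        // The declared outdegree must match the number of successor fields.
        if (tokens.length != outdegree + 2) {
            throw new IllegalArgumentException("Outdegree does not match successor count: " + line);
        }
        long[] successors = new long[outdegree];
        for (int i = 0; i < outdegree; i++) {
            successors[i] = Long.parseLong(tokens[i + 2]);
        }
        return new SuccessorLine(source, successors);
    }
}
```

For example, SuccessorLine.parse("3 2 7 9") yields source 3 with successors 7 and 9, while a line whose outdegree disagrees with the number of successors is rejected.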
Optionally, it is possible to specify a secondary StringMap for duplicates. It must return, for each duplicate, the number of the corresponding archetype (the page that it is equal to). This map is usually a RemappedStringMap.
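The role of the two maps can be sketched as follows. This is only an illustration under several assumptions, not the tool's actual code: plain java.util.Map instances stand in for the StringMap arguments, the lookup order (URL map first, then the duplicates map) is a guess, unresolved links are simply skipped, and fields are tab-separated. The class LinkResolver and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of it.unimi.dsi.law.warc.tool.
public final class LinkResolver {
    private LinkResolver() {}

    /**
     * Returns the node index for a URL, or -1 if it cannot be resolved.
     * Assumption: the URL map is consulted first, then the duplicates map,
     * which maps a duplicate URL to the index of its archetype.
     */
    public static long resolve(String url, Map<String, Long> urls, Map<String, Long> duplicates) {
        Long node = urls.get(url);
        if (node != null) return node;
        if (duplicates != null) {
            Long archetype = duplicates.get(url);
            if (archetype != null) return archetype;
        }
        return -1;
    }

    /** Builds one output line: source index, outdegree, then successor indices. */
    public static String successorLine(long source, List<String> links,
            Map<String, Long> urls, Map<String, Long> duplicates) {
        List<Long> successors = new ArrayList<>();
        for (String link : links) {
            long target = resolve(link, urls, duplicates);
            if (target != -1) successors.add(target); // assumption: unresolved links are dropped
        }
        StringBuilder sb = new StringBuilder();
        sb.append(source).append('\t').append(successors.size());
        for (long s : successors) sb.append('\t').append(s);
        return sb.toString();
    }
}
```

In this sketch a link to a duplicate page contributes the index of its archetype, so a page and its duplicate end up as the same successor.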
-
Field Summary
Fields
Modifier and Type    Field                  Description
static String        DEFAULT_BUFFER_SIZE
-
Constructor Summary
Constructors
Constructor       Description
ExtractLinks()
-
Method Summary
Modifier and Type    Method    Description
static void    main(String[] arg)
static void    run(FastBufferedInputStream in, boolean isGZipped, Filter<HttpResponse> filter, PrintWriter pw, StringMap<? extends CharSequence> urls, StringMap<? extends CharSequence> duplicates)    Extracts links from a WARC file.
-
Field Details
-
DEFAULT_BUFFER_SIZE
- See Also:
- Constant Field Values
-
-
Constructor Details
-
ExtractLinks
public ExtractLinks()
-
-
Method Details
-
run
public static void run(FastBufferedInputStream in, boolean isGZipped, Filter<HttpResponse> filter, PrintWriter pw, StringMap<? extends CharSequence> urls, StringMap<? extends CharSequence> duplicates) throws IOException
Extracts links from a WARC file.
- Parameters:
in - the WARC file as an input stream.
isGZipped - whether in is compressed.
filter - the filter.
pw - a print writer where the links will be printed in ASCII format (node number followed by successors).
urls - the term map for URLs.
duplicates - the term map for duplicate URLs.
- Throws:
IOException
-
main
- Throws:
Exception
-