Package it.unimi.dsi.law.warc.tool
Class ExtractLinks
java.lang.Object
it.unimi.dsi.law.warc.tool.ExtractLinks
public class ExtractLinks extends Object
Extracts links from a WARC file.
This class scans a WARC file, parsing pages and extracting links. Links are resolved using
a given StringMap. The resulting successor lists are written one
per line: the index of the source node, followed by its outdegree, followed by the successor indices.
Optionally, a secondary StringMap can be specified for duplicates. For each
duplicate, it must return the number of the corresponding archetype (the page that the duplicate is equal to). This
map is usually a RemappedStringMap.
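The line format and the role of the duplicates map described above can be sketched as follows. This is only an illustration, not the ExtractLinks implementation: plain java.util maps stand in for StringMap, the class and method names (SuccessorLineSketch, resolve, formatLine) are hypothetical, and the tab separator, the -1 value for unknown URLs, the dropping of unresolved links, and the sorting and deduplication of successors are all assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative stand-in only: java.util maps replace StringMap here.
class SuccessorLineSketch {

    // Resolves a URL to a node index. A duplicate is mapped to the number of
    // its archetype; an unknown URL yields -1 (assumed convention).
    static long resolve(String url, Map<String, Long> urls, Map<String, Long> duplicates) {
        Long archetype = duplicates.get(url);
        if (archetype != null) return archetype;
        Long index = urls.get(url);
        return index != null ? index : -1;
    }

    // Builds one output line: source index, outdegree, successor indices.
    static String formatLine(long source, List<String> links,
                             Map<String, Long> urls, Map<String, Long> duplicates) {
        // Assumption: successors are deduplicated and sorted.
        TreeSet<Long> successors = new TreeSet<>();
        for (String link : links) {
            long target = resolve(link, urls, duplicates);
            if (target >= 0) successors.add(target); // assumption: unresolved links are dropped
        }
        StringBuilder line = new StringBuilder();
        line.append(source).append('\t').append(successors.size());
        for (long s : successors) line.append('\t').append(s);
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, Long> urls = Map.of(
                "http://example.com/a", 0L,
                "http://example.com/b", 1L,
                "http://example.com/c", 2L);
        // The copy of page c resolves to the archetype, node 2.
        Map<String, Long> duplicates = Map.of("http://example.com/c-copy", 2L);
        List<String> links = List.of(
                "http://example.com/b",
                "http://example.com/c-copy",
                "http://example.com/missing");
        System.out.println(formatLine(0, links, urls, duplicates));
    }
}
```

In this sketch the duplicates map is consulted first, so a link to a duplicate page is recorded as a link to its archetype.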
Field Summary
static String DEFAULT_BUFFER_SIZE
Constructor Summary
ExtractLinks()
Method Summary
static void main(String[] arg)
static void run(FastBufferedInputStream in, boolean isGZipped, Filter<HttpResponse> filter, PrintWriter pw, StringMap<? extends CharSequence> urls, StringMap<? extends CharSequence> duplicates)
    Extracts links from a WARC file.
Field Details
DEFAULT_BUFFER_SIZE
See Also:
Constant Field Values
Constructor Details
ExtractLinks
public ExtractLinks()
Method Details
run
public static void run(FastBufferedInputStream in, boolean isGZipped, Filter<HttpResponse> filter, PrintWriter pw, StringMap<? extends CharSequence> urls, StringMap<? extends CharSequence> duplicates) throws IOException
Extracts links from a WARC file.
Parameters:
in - the WARC file as an input stream.
isGZipped - whether in is compressed.
filter - the filter.
pw - a print writer where the links will be printed in ASCII format (node number followed by successors).
urls - the term map for URLs.
duplicates - the term map for duplicate URLs.
Throws:
IOException
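A consumer of the text written to pw might read the lines back as in the sketch below. It follows the layout given in the class description (source index, outdegree, successor indices) and assumes the fields are separated by whitespace; the class and method names (SuccessorListReaderSketch, parse) are hypothetical and not part of this package.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for the ASCII successor lists; not part of the library.
class SuccessorListReaderSketch {

    // Parses lines of the form "source outdegree s1 s2 ..." into a map from
    // each source index to its successor array. Blank lines are skipped.
    static Map<Long, long[]> parse(Iterable<String> lines) {
        Map<Long, long[]> graph = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            String[] fields = line.split("\\s+"); // assumption: whitespace-separated fields
            if (fields.length < 2) {
                throw new IllegalArgumentException("Missing outdegree: " + line);
            }
            long source = Long.parseLong(fields[0]);
            int outdegree = Integer.parseInt(fields[1]);
            // The declared outdegree must match the number of successors listed.
            if (fields.length != outdegree + 2) {
                throw new IllegalArgumentException("Outdegree mismatch: " + line);
            }
            long[] successors = new long[outdegree];
            for (int i = 0; i < outdegree; i++) {
                successors[i] = Long.parseLong(fields[i + 2]);
            }
            graph.put(source, successors);
        }
        return graph;
    }
}
```

Checking the declared outdegree against the number of fields is a cheap way to detect truncated or corrupted lines before the graph is used.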
main
public static void main(String[] arg) throws Exception
Throws:
Exception