Class ExtractLinks

java.lang.Object
it.unimi.dsi.law.warc.tool.ExtractLinks

public class ExtractLinks
extends Object
Extracts links from a WARC file.

This class scans a WARC file, parses its pages, and extracts their links. Links are resolved using a given StringMap. The resulting successor lists are written one per line: the index of the source node, followed by the outdegree, followed by the successor indices.
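
To illustrate the per-line output format described above, here is a minimal sketch of a reader for one such line. It is not part of this class: SuccessorLineSketch, SuccessorList and parseLine are hypothetical names, and the sketch assumes that fields are separated by whitespace, which the description above does not specify.

```java
import java.util.Arrays;

public class SuccessorLineSketch {
    /** Hypothetical holder for one parsed line: a source node and its successors. */
    public static final class SuccessorList {
        public final long source;
        public final long[] successors;

        SuccessorList(long source, long[] successors) {
            this.source = source;
            this.successors = successors;
        }
    }

    /** Parses "source outdegree succ_0 succ_1 ..." (whitespace separation is an assumption). */
    public static SuccessorList parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Expected at least a source index and an outdegree: " + line);
        }
        long source = Long.parseLong(tokens[0]);
        int outdegree = Integer.parseInt(tokens[1]);
        // The outdegree must match the number of successor indices that follow it.
        if (tokens.length != outdegree + 2) {
            throw new IllegalArgumentException("Outdegree does not match the number of successors: " + line);
        }
        long[] successors = new long[outdegree];
        for (int i = 0; i < outdegree; i++) {
            successors[i] = Long.parseLong(tokens[i + 2]);
        }
        return new SuccessorList(source, successors);
    }

    public static void main(String[] args) {
        // Node 3 has outdegree 2 and links to nodes 7 and 12.
        SuccessorList list = parseLine("3 2 7 12");
        System.out.println(list.source + " -> " + Arrays.toString(list.successors)); // prints "3 -> [7, 12]"
    }
}
```

Checking the outdegree against the number of remaining fields gives a cheap consistency test when reading the output back.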

Optionally, a secondary StringMap can be specified for duplicates. For each duplicate, it must return the number of the corresponding archetype (the page the duplicate is equal to). This map is usually a RemappedStringMap.
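
The role of the duplicates map can be sketched as follows. This is an illustration under stated assumptions, not the actual logic of this class: plain java.util.Map instances stand in for the two StringMap objects, the names DuplicateResolutionSketch, resolve and UNKNOWN are hypothetical, and both the lookup order (duplicates first, then the main map) and the use of -1 for unknown URLs are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateResolutionSketch {
    /** Returned when a URL is in neither map (an assumption of this sketch). */
    public static final long UNKNOWN = -1;

    /**
     * Resolves a URL to a node number. A duplicate resolves to the number of
     * its archetype; any other known URL resolves to its own number.
     */
    public static long resolve(String url, Map<String, Long> nodes, Map<String, Long> duplicates) {
        Long archetype = duplicates.get(url);
        if (archetype != null) {
            return archetype;
        }
        Long node = nodes.get(url);
        return node != null ? node : UNKNOWN;
    }

    public static void main(String[] args) {
        // Stand-in for the main StringMap: URL -> node number.
        Map<String, Long> nodes = new HashMap<>();
        nodes.put("http://example.com/", 0L);
        nodes.put("http://example.com/a", 1L);

        // Stand-in for the duplicates map: duplicate URL -> number of its archetype.
        Map<String, Long> duplicates = new HashMap<>();
        duplicates.put("http://example.com/index.html", 0L);

        System.out.println(resolve("http://example.com/index.html", nodes, duplicates)); // prints 0
        System.out.println(resolve("http://example.com/a", nodes, duplicates));          // prints 1
        System.out.println(resolve("http://example.com/missing", nodes, duplicates));    // prints -1
    }
}
```

With this arrangement, a link pointing to a duplicate page is recorded as a link to its archetype, so duplicates do not appear as separate successors.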

  • Field Details

  • Constructor Details

    • ExtractLinks

      public ExtractLinks()
  • Method Details