it.unimi.di.law.bubing.frontier (BUbiNG 0.9.15)

A set of classes orchestrating the movement of URLs to be fetched next by a BUbiNG agent.

To start understanding how the frontier works, we should consider three levels of aggregation of the jobs that a BUbiNG agent is to perform:

at the ground level, we have URLs, as they come out of the sieve: these are known unvisited URLs that should eventually be considered;
URLs are grouped by their scheme+authority part, to form a VisitState: the visit state collects the statistics related to how the visit of that scheme+authority is going on;
VisitStates are grouped by IP, forming a WorkbenchEntry: the reason behind this is that we do not want more than one thread fetching pages from the same IP at the same time, even if the IP is accessed with different host names.

We now describe how these structures are created and move around, inviting the reader to consult the documentation of the appropriate classes for more information.

How visit states are born. URLs come out of the sieve and are accumulated in the Frontier.readyURLs queue. URLs from this queue are read (by a process called the distributor) and handled one at a time. Every time a URL is considered, the distributor must know whether the URL belongs to an already-known scheme+authority. This is done using the Distributor.schemeAuthority2VisitState map, which contains (as values) all the visit states ever created. After a visit state is created, and throughout its life, new URLs may arrive for that scheme+authority and are enqueued to the visit state (or, possibly, to an on-disk virtualization of its tail, if the visit state becomes too large; the virtualization aspects are not described in this document, but we invite the reader to consult the documentation of the distributor to learn more).
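The lookup step described above can be sketched as follows. This is a minimal illustration, not BUbiNG's code: `DistributorSketch`, `SketchVisitState`, and `distributeOne` are hypothetical names, the queue and map are plain in-memory collections, and virtualization to disk is left out entirely.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified stand-in for the distributor; not the real BUbiNG class.
final class DistributorSketch {
    /** Minimal visit state: a scheme+authority plus the URLs waiting for it. */
    static final class SketchVisitState {
        final String schemeAuthority;
        final Queue<String> urls = new ArrayDeque<>();

        SketchVisitState(String schemeAuthority) {
            this.schemeAuthority = schemeAuthority;
        }
    }

    // Field names mirror the text; the types are assumptions made for this sketch.
    final Queue<String> readyURLs = new ArrayDeque<>();
    final Map<String, SketchVisitState> schemeAuthority2VisitState = new HashMap<>();

    /** Extracts the grouping key, e.g. "http://example.com" from "http://example.com/a". */
    static String schemeAuthority(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getRawAuthority();
    }

    /** Handles one ready URL; returns false when the queue is empty. */
    boolean distributeOne() {
        String url = readyURLs.poll();
        if (url == null) return false;
        String key = schemeAuthority(url);
        SketchVisitState visitState = schemeAuthority2VisitState.get(key);
        if (visitState == null) {
            // First URL for this scheme+authority: a visit state is born.
            visitState = new SketchVisitState(key);
            schemeAuthority2VisitState.put(key, visitState);
        }
        // Known or new, the URL is appended to the visit state's queue.
        visitState.urls.add(url);
        return true;
    }
}
```

Calling `distributeOne()` in a loop until it returns `false` drains the ready queue; URLs that differ only in path end up in the same visit state.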

When a new visit state is created. If necessary, a new visit state is created and added to a special queue, Frontier.newVisitStates, for a DNS thread to eventually resolve its host name into an IP address (a necessary step to be able to aggregate the visit state into an appropriate workbench entry). Note that once the host name has been resolved, the new visit state is either put in the appropriate existing workbench entry, or a new workbench entry is created for it.
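As a rough sketch of the resolution step, the code below takes visit states from a queue, resolves each host through a pluggable function, and attaches the visit state to the workbench entry for that IP, creating the entry on first use. All class and method names (`DnsPlacementSketch`, `resolveOne`, `ip2Entry`) are hypothetical, and the resolver is an assumed stand-in for real DNS so that the example stays deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical, simplified stand-in for the DNS step; not the real BUbiNG classes.
final class DnsPlacementSketch {
    static final class SketchWorkbenchEntry {
        final String ip;
        final List<SketchVisitState> visitStates = new ArrayList<>();

        SketchWorkbenchEntry(String ip) {
            this.ip = ip;
        }
    }

    static final class SketchVisitState {
        final String host;
        SketchWorkbenchEntry entry; // null until the host has been resolved

        SketchVisitState(String host) {
            this.host = host;
        }
    }

    final Queue<SketchVisitState> newVisitStates = new ArrayDeque<>();
    final Map<String, SketchWorkbenchEntry> ip2Entry = new HashMap<>();
    private final Function<String, String> resolver; // host name -> IP (assumed stub)

    DnsPlacementSketch(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Resolves one new visit state; returns false when none is waiting. */
    boolean resolveOne() {
        SketchVisitState visitState = newVisitStates.poll();
        if (visitState == null) return false;
        String ip = resolver.apply(visitState.host);
        // Reuse the entry for this IP if it exists, otherwise create it.
        SketchWorkbenchEntry entry = ip2Entry.computeIfAbsent(ip, SketchWorkbenchEntry::new);
        entry.visitStates.add(visitState);
        visitState.entry = entry;
        return true;
    }
}
```

Two host names that resolve to the same address share one entry, which is what later allows per-IP politeness to be enforced.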

Acquisition of visit states. A visit state can be acquired only when its workbench entry is not acquired (i.e., no one is trying to fetch pages from any of the visit states contained in it), and in that case it is assumed to be nonempty (empty visit states are temporarily removed from their workbench entry until a new URL is put in them again). A further condition should be considered for a visit state to be acquired, which is related to politeness: enough time must have elapsed since the last fetch of a page for the same visit state (i.e., for the same scheme+authority) and moreover enough time must have elapsed since the last fetch of a page for the same workbench entry (i.e., for the same IP). When both conditions are satisfied, the visit state is moved onto a queue called Frontier.todo (by a special thread called TodoThread). At that point both the visit state and the workbench entry it belongs to are declared as being acquired.
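The acquisition rules can be condensed into a single predicate. The sketch below is an illustration under stated assumptions: the classes, the `nextFetch` timestamps, and the `hostDelay`/`ipDelay` parameters are hypothetical names chosen for this example, and time is passed in explicitly (in milliseconds) rather than read from a clock.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of acquisition; not the real BUbiNG classes.
final class AcquisitionSketch {
    static final class SketchWorkbenchEntry {
        boolean acquired;
        long nextFetch; // earliest time at which this IP may be contacted again
    }

    static final class SketchVisitState {
        final SketchWorkbenchEntry entry;
        boolean acquired;
        int pendingURLs;
        long nextFetch; // earliest time at which this scheme+authority may be contacted again

        SketchVisitState(SketchWorkbenchEntry entry, int pendingURLs) {
            this.entry = entry;
            this.pendingURLs = pendingURLs;
        }
    }

    /** Entry free, visit state nonempty, and both politeness delays elapsed. */
    static boolean canAcquire(SketchVisitState vs, long now) {
        return !vs.entry.acquired
            && vs.pendingURLs > 0
            && now >= vs.nextFetch
            && now >= vs.entry.nextFetch;
    }

    /** Moves the visit state to the todo queue and marks it and its entry as acquired. */
    static boolean tryAcquire(SketchVisitState vs, long now, Queue<SketchVisitState> todo) {
        if (!canAcquire(vs, now)) return false;
        vs.acquired = true;
        vs.entry.acquired = true;
        todo.add(vs);
        return true;
    }

    /** Releases the visit state and schedules the next allowed fetch times. */
    static void release(SketchVisitState vs, long now, long hostDelay, long ipDelay) {
        vs.acquired = false;
        vs.entry.acquired = false;
        vs.nextFetch = now + hostDelay;
        vs.entry.nextFetch = now + ipDelay;
    }
}
```

With two visit states on the same entry, acquiring one blocks the other until the first is released and the per-IP delay has passed, even though the second visit state has never been fetched itself.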

Fetching threads. When a fetching thread is free, it tries to get a visit state from the Frontier.todo queue. It then fetches one or more URLs from the host. Every time a page has been fetched, it is put in the Frontier.results queue so that it can be parsed by a parsing thread. Only after the page has been parsed can the fetching thread proceed to fetch another URL (because every fetching thread has a single buffer to store the page, which is reused every time). When the fetching thread is done with fetching, it puts the visit state in the Frontier.done queue, so that it is then released to the workbench.
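A single pass of a fetching thread might look like the sketch below. It is deliberately single-threaded: the hand-off through the results queue is modeled as a direct call to a parser callback that returns only when the page has been consumed, which is an assumption made to keep the example deterministic. `FetchingSketch`, `serveOne`, the `fetcher` function, and the `maxURLs` limit are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical, single-threaded model of one fetching thread; not the real BUbiNG class.
final class FetchingSketch {
    static final class SketchVisitState {
        final Queue<String> urls = new ArrayDeque<>();
    }

    // The one page buffer owned by this fetcher and reused for every page.
    private final StringBuilder buffer = new StringBuilder();
    private final Function<String, String> fetcher;  // URL -> page content (stub for HTTP)
    private final Consumer<CharSequence> parser;     // stands in for results queue + parsing thread

    FetchingSketch(Function<String, String> fetcher, Consumer<CharSequence> parser) {
        this.fetcher = fetcher;
        this.parser = parser;
    }

    /** Takes one visit state from todo, fetches up to maxURLs pages, then hands it to done. */
    int serveOne(Queue<SketchVisitState> todo, Queue<SketchVisitState> done, int maxURLs) {
        SketchVisitState visitState = todo.poll();
        if (visitState == null) return 0;
        int fetched = 0;
        while (fetched < maxURLs && !visitState.urls.isEmpty()) {
            String url = visitState.urls.poll();
            buffer.setLength(0);               // reuse the buffer for the next page
            buffer.append(fetcher.apply(url));
            parser.accept(buffer);             // must finish before the buffer is overwritten
            fetched++;
        }
        done.add(visitState);                  // to be released to the workbench later
        return fetched;
    }
}
```

Because the parser sees the shared buffer rather than a copy, it has to finish with each page before the loop continues, which is the constraint the paragraph above attributes to buffer reuse.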

Class Summary
Class	Description
Distributor	A thread that distributes ready URLs (coming out of the sieve) into the `Workbench` queues with the help of a `WorkbenchVirtualizer`.
DNSThread	A thread that continuously dequeues a `VisitState` from the queue of new visit states (those that still need a DNS resolution), resolves its host and puts it on the `Workbench`.
DoneThread	A thread that continuously dequeues a `VisitState` from the `Frontier.done` queue and `releases` it to the `Workbench`.
FetchingThread	A thread fetching pages that will be then analyzed by a `ParsingThread`.
FetchingThread.BasicHttpClientConnectionManagerWithAlternateDNS	A support class that makes it possible to plug in a custom DNS resolver.
Frontier	The BUbiNG frontier: a class structure that encompasses most of the logic behind the way BUbiNG fetches URLs.
MessageThread	A thread that takes care of pouring the content of `Frontier.receivedURLs` into the `Frontier` itself (via the `Frontier.enqueue(it.unimi.dsi.fastutil.bytes.ByteArrayList)` method).
ParsingThread	A thread parsing pages retrieved by a `FetchingThread`.
ParsingThread.FrontierEnqueuer	A small gadget used to insert links in the frontier.
QuickMessageThread	A thread that takes care of pouring the content of `Frontier.quickReceivedURLs` into `Frontier.receivedURLs`.
StatsThread	A class isolating a number of `ProgressLogger` instances keeping track of a number of quantities of interest related to the `Distributor`, e.g., requests, transferred bytes, etc.
TodoThread	A thread that continuously acquires a `VisitState` from the `Workbench` and adds it to the `Frontier.todo` queue.
VisitState	A class maintaining the current state of the visit of a specific scheme+authority.
VisitStateSet	A data structure representing the set of visit states created so far.
Workbench	The workbench is a `DelayQueue` of `WorkbenchEntry` instances, each associated with an IP address.
WorkbenchEntry	An element of the `Workbench`.
WorkbenchEntrySet	A data structure representing the set of workbench entries created so far.
WorkbenchVirtualizer	A workbench virtualizer based on a Berkeley DB database.

Enum Summary
Enum	Description
Frontier.PropertyKeys	Names of the scalar fields saved by `Frontier.snap()`.
