To start understanding how the frontier works, we should consider three levels of aggregation of the jobs that a BUbiNG agent is to perform:
- VisitState: a visit state collects statistics on how the visit of a given scheme+authority is progressing;
- VisitStates are grouped by IP, forming a WorkbenchEntry: the reason is that we do not want more than one thread fetching pages from the same IP at the same time, even if the IP is accessed under different host names.
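The grouping described above can be pictured with a minimal sketch. The classes below are simplified, hypothetical stand-ins (nested in an illustrative `AggregationSketch` class), not the actual BUbiNG classes; field names such as `urls` and `acquired` are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative only: simplified stand-ins for VisitState and WorkbenchEntry.
class AggregationSketch {
    static class VisitState {
        final String schemeAuthority;                   // e.g. "http://example.com"
        final Queue<String> urls = new ArrayDeque<>();  // URLs still to be fetched
        WorkbenchEntry entry;                           // set once the IP is known

        VisitState(String schemeAuthority) {
            this.schemeAuthority = schemeAuthority;
        }
    }

    static class WorkbenchEntry {
        final String ip;                                // shared by all visit states below
        final List<VisitState> visitStates = new ArrayList<>();
        boolean acquired;                               // true while a thread fetches from this IP

        WorkbenchEntry(String ip) {
            this.ip = ip;
        }

        // Attaches a visit state to this entry and records the back-reference.
        void add(VisitState visitState) {
            visitState.entry = this;
            visitStates.add(visitState);
        }
    }
}
```

Keeping the acquisition flag at the level of the IP is what makes it possible to prevent two threads from hitting the same machine under different host names.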
We now describe how these structures are created and move around, inviting the reader to consult the documentation of the appropriate classes for more information.
How visit states are born.
URLs come out of the sieve and accumulate in the Frontier.readyURLs queue.
The distributor reads URLs from this queue and handles them one at a time.
Every time a URL is considered, the distributor must know whether the URL belongs to an already-known scheme+authority.
This is done using the Distributor.schemeAuthority2VisitState map, which contains (as values) all the visit states ever created.
After a visit state is created, and at any moment of its life, new URLs may arrive for that scheme+authority
and are enqueued to the visit state (or, possibly, to an on-disk virtualization of its tail, if the visit
state becomes too large; the virtualization aspects are not described in this document, but we invite
the reader to consult the documentation of the distributor to know more about this).
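As a rough illustration of the lookup just described, the following sketch keeps a map from scheme+authority to visit state and enqueues each incoming URL to the matching state, creating the state on first sight. `DistributorSketch`, `distribute` and `schemeAuthority` are hypothetical names; a plain `HashMap` stands in for the real map, and virtualization to disk is ignored.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative only: not the actual Distributor implementation.
class DistributorSketch {
    static class VisitState {
        final String schemeAuthority;
        final Queue<String> urls = new ArrayDeque<>();

        VisitState(String schemeAuthority) {
            this.schemeAuthority = schemeAuthority;
        }
    }

    // Plays the role of the schemeAuthority2VisitState map.
    final Map<String, VisitState> schemeAuthority2VisitState = new HashMap<>();

    // Extracts the scheme+authority part, e.g. "http://example.com:8080".
    static String schemeAuthority(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getRawAuthority();
    }

    // Handles one URL: finds (or creates) its visit state and enqueues the URL.
    VisitState distribute(String url) {
        String key = schemeAuthority(url);
        VisitState visitState = schemeAuthority2VisitState.get(key);
        if (visitState == null) {
            // First URL ever seen for this scheme+authority.
            visitState = new VisitState(key);
            schemeAuthority2VisitState.put(key, visitState);
        }
        visitState.urls.add(url);
        return visitState;
    }
}
```

Note that `http://example.com` and `https://example.com` yield two distinct keys, and hence two distinct visit states.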
When a new visit state is created.
If necessary, a new visit state is created and added to a special queue (Frontier.newVisitStates),
from which a DNS thread will eventually resolve its host name into an IP address (a necessary step
for attaching the visit state to the appropriate workbench entry).
Once the host name has been resolved, the new visit state is either put in the existing
workbench entry for that IP, or a new workbench entry is created for it.
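The resolution step can be sketched as follows. This is a simplified model under stated assumptions: the resolver is passed in as a plain function from host name to IP string (no real DNS lookup is performed), and `DnsStepSketch`, `resolveNext` and `ip2Entry` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

// Illustrative only: not the actual DNSThread implementation.
class DnsStepSketch {
    static class VisitState {
        final String host;
        WorkbenchEntry entry;   // null until the host has been resolved

        VisitState(String host) {
            this.host = host;
        }
    }

    static class WorkbenchEntry {
        final String ip;
        final List<VisitState> visitStates = new ArrayList<>();

        WorkbenchEntry(String ip) {
            this.ip = ip;
        }
    }

    // Plays the role of the queue of new visit states awaiting resolution.
    final Queue<VisitState> newVisitStates = new ArrayDeque<>();
    // Workbench entries created so far, indexed by IP.
    final Map<String, WorkbenchEntry> ip2Entry = new HashMap<>();

    // Resolves one pending visit state; returns false if none was waiting.
    boolean resolveNext(Function<String, String> resolver) {
        VisitState visitState = newVisitStates.poll();
        if (visitState == null) return false;
        String ip = resolver.apply(visitState.host);
        // Reuse the entry for this IP if it exists, otherwise create it.
        WorkbenchEntry entry = ip2Entry.computeIfAbsent(ip, WorkbenchEntry::new);
        visitState.entry = entry;
        entry.visitStates.add(visitState);
        return true;
    }
}
```

Two host names that resolve to the same address end up in the same workbench entry, which is the property the rest of the frontier relies on.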
Acquisition of visit states.
A visit state can be acquired only when its workbench entry is not acquired (i.e., no one is trying to
fetch pages from any of the visit states contained in it), and in that case it is assumed to be nonempty
(empty visit states are temporarily removed from their workbench entry until a new URL is put in them again).
A further condition, related to politeness, must hold for a visit state to be acquired:
enough time must have elapsed since the last fetch of a page for the same visit state (i.e., for the same
scheme+authority) and moreover enough time must have elapsed since the last fetch of a page for the same
workbench entry (i.e., for the same IP).
When both conditions are satisfied, the visit state is moved onto a queue called Frontier.todo
(by a special thread called TodoThread). At that point both the visit state and the workbench
entry it belongs to are declared as being acquired.
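The acquisition conditions can be summarized in a small predicate. The sketch below is an assumption-laden simplification: the `nextFetch` timestamps (in milliseconds) and the `pendingUrls` counter are hypothetical fields, and the explicit nonemptiness check replaces the removal of empty visit states from their workbench entry.

```java
// Illustrative only: not the actual Workbench or TodoThread logic.
class PolitenessSketch {
    static class WorkbenchEntry {
        boolean acquired;   // some visit state of this IP is being fetched
        long nextFetch;     // earliest time at which this IP may be contacted again
    }

    static class VisitState {
        final WorkbenchEntry entry;
        boolean acquired;
        int pendingUrls;    // number of URLs waiting in this visit state
        long nextFetch;     // earliest time for this scheme+authority

        VisitState(WorkbenchEntry entry) {
            this.entry = entry;
        }
    }

    // True if the entry is free, there is work to do, and both delays have elapsed.
    static boolean canAcquire(VisitState visitState, long now) {
        return !visitState.entry.acquired
            && visitState.pendingUrls > 0
            && now >= visitState.nextFetch
            && now >= visitState.entry.nextFetch;
    }

    // Marks both the visit state and its workbench entry as acquired, if allowed.
    static boolean tryAcquire(VisitState visitState, long now) {
        if (!canAcquire(visitState, now)) return false;
        visitState.acquired = true;
        visitState.entry.acquired = true;
        return true;
    }
}
```

Because the flag is set on the workbench entry as well, a second visit state sharing the same IP cannot be acquired until the first one is released, no matter how much time has elapsed.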
Fetching threads.
When a fetching thread is free, it tries to get a visit state from the Frontier.todo queue.
It then fetches one or more URLs from the host. Every time a page has been fetched, it is
put in the Frontier.results queue so that it can be parsed by a parsing thread. Only after
the page has been parsed can the fetching thread proceed to fetch another URL (because every
fetching thread has a buffer to store the page, which is reused every time). When the fetching
thread is done fetching, it puts the visit state in the Frontier.done queue, so that it is
then released to the workbench.
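The movement of a visit state through the three queues can be sketched as below. This is a single-threaded simplification with hypothetical names (`FetchLoopSketch`, `fetchOnce`, `maxUrls`): the HTTP fetch is replaced by a placeholder string, and the wait for the parsing thread before the buffer is reused is not modelled.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: not the actual FetchingThread implementation.
class FetchLoopSketch {
    static class VisitState {
        final Queue<String> urls = new ArrayDeque<>();
    }

    // Stand-ins for the todo, results and done queues of the frontier.
    final BlockingQueue<VisitState> todo = new LinkedBlockingQueue<>();
    final BlockingQueue<String> results = new LinkedBlockingQueue<>();
    final BlockingQueue<VisitState> done = new LinkedBlockingQueue<>();

    // One iteration of a fetching thread; returns the number of URLs fetched.
    int fetchOnce(int maxUrls) {
        VisitState visitState = todo.poll();
        if (visitState == null) return 0;   // nothing to do right now
        int fetched = 0;
        while (fetched < maxUrls && !visitState.urls.isEmpty()) {
            String url = visitState.urls.remove();
            // Placeholder for the real fetch; the result goes to the parsers.
            results.add("content of " + url);
            fetched++;
        }
        // Hand the visit state back so that it can be released to the workbench.
        done.add(visitState);
        return fetched;
    }
}
```

A visit state is handed back even if it still contains URLs; it will simply be acquired again later, once the politeness delays allow it.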
Class | Description |
---|---|
Distributor | A thread that distributes ready URLs (coming out of the sieve) into the Workbench queues with the help of a WorkbenchVirtualizer. |
DNSThread | A thread that continuously dequeues a VisitState from the queue of new visit states (those that still need a DNS resolution), resolves its host and puts it on the Workbench. |
DoneThread | A thread that continuously dequeues a VisitState from the Frontier.done queue and releases it to the Workbench. |
FetchingThread | A thread fetching pages that will then be analyzed by a ParsingThread. |
FetchingThread.BasicHttpClientConnectionManagerWithAlternateDNS | A support class that makes it possible to plug in a custom DNS resolver. |
Frontier | The BUbiNG frontier: a class structure that encompasses most of the logic behind the way BUbiNG fetches URLs. |
MessageThread | A thread that takes care of pouring the content of Frontier.receivedURLs into the Frontier itself (via the Frontier.enqueue(it.unimi.dsi.fastutil.bytes.ByteArrayList) method). |
ParsingThread | A thread parsing pages retrieved by a FetchingThread. |
ParsingThread.FrontierEnqueuer | A small gadget used to insert links in the frontier. |
QuickMessageThread | A thread that takes care of pouring the content of Frontier.quickReceivedURLs into Frontier.receivedURLs. |
StatsThread | A class isolating a number of ProgressLogger instances keeping track of a number of quantities of interest related to the Distributor, e.g., requests, transferred bytes, etc. |
TodoThread | A thread that continuously acquires a VisitState from the Workbench and adds it to the Frontier.todo queue. |
VisitState | A class maintaining the current state of the visit of a specific scheme+authority. |
VisitStateSet | A data structure representing the set of visit states created so far. |
Workbench | The workbench is a DelayQueue of WorkbenchEntry instances, each associated with an IP address. |
WorkbenchEntry | An element of the Workbench. |
WorkbenchEntrySet | A data structure representing the set of workbench entries created so far. |
WorkbenchVirtualizer | A workbench virtualizer based on a Berkeley DB database. |
Enum | Description |
---|---|
Frontier.PropertyKeys | Names of the scalar fields saved by Frontier.snap(). |