java.lang.Iterable<WorkbenchEntry>
public class Workbench extends java.lang.Object implements java.lang.Iterable<WorkbenchEntry>
The workbench is a DelayQueue of WorkbenchEntry instances, each associated with an IP address. Each WorkbenchEntry contains a priority queue of VisitState instances ordered by VisitState.nextFetch, the earliest moment in time at which the scheme+authority stored in the VisitState is fetchable (according to politeness policies). Each workbench entry also has a WorkbenchEntry.nextFetch field with a similar meaning, but associated with the IP address rather than with the scheme+authority, the idea being that the politeness policy for an IP can be more tolerant (i.e., have a shorter delay) than that for an authority.
Observe that there exist, in general, workbench entries that are not on the workbench, typically because all the visit states they are associated with are empty.
The workbench is sorted by the maximum between WorkbenchEntry.nextFetch and the VisitState.nextFetch of the top of the queue of visit states. This guarantees that the top of the visit-state queue of the workbench entry at the top of the workbench contains a URL that can be fetched under both authority and IP politeness if and only if a fetchable URL exists. By setting the value returned by Delayed.getDelay(TimeUnit) to the truncated difference between the maximum above and System.currentTimeMillis(), we can just wait on DelayQueue.take() until the next WorkbenchEntry is ready (this is what acquire() does).
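The ordering and delay computation described above can be sketched as follows. This is only an illustration, not the actual implementation: SketchEntry, DelaySketch, and the field names ipNextFetch and topVisitNextFetch are hypothetical stand-ins for WorkbenchEntry and its two nextFetch values.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a workbench entry; not the real WorkbenchEntry class. */
final class SketchEntry implements Delayed {
    final String label;
    final long ipNextFetch;       // plays the role of WorkbenchEntry.nextFetch (ms since the epoch)
    final long topVisitNextFetch; // plays the role of the top VisitState.nextFetch (ms since the epoch)

    SketchEntry(String label, long ipNextFetch, long topVisitNextFetch) {
        this.label = label;
        this.ipNextFetch = ipNextFetch;
        this.topVisitNextFetch = topVisitNextFetch;
    }

    /** The earliest time at which both the IP and the authority constraints are satisfied. */
    long readyAt() {
        return Math.max(ipNextFetch, topVisitNextFetch);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // Truncated difference between the maximum above and the current time.
        return unit.convert(readyAt() - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        // The queue in this sketch only ever contains SketchEntry instances.
        return Long.compare(readyAt(), ((SketchEntry) other).readyAt());
    }
}

final class DelaySketch {
    /** Returns the labels of two entries in the order in which they become fetchable. */
    static String[] drainOrder() {
        long now = System.currentTimeMillis();
        DelayQueue<SketchEntry> queue = new DelayQueue<>();
        // The IP is ready immediately, but the authority must wait: ready at now + 60.
        queue.add(new SketchEntry("a", now, now + 60));
        // Both constraints expire within 20 ms: ready at now + 20.
        queue.add(new SketchEntry("b", now + 20, now + 10));
        try {
            // take() blocks until the delay of the head of the queue has expired.
            return new String[] { queue.take().label, queue.take().label };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch entry "b" is taken first even though entry "a" has the earlier IP-level time, because the sort key is the maximum of the two values.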
A basic invariant is that all workbench entries on the workbench contain nonempty queues of visit states, and every visit state in a workbench entry has a nonempty URL queue. Workbench entries can be out of the workbench either because their top visit state has been acquired by a thread, or because their visit-state queue is empty. Visit states can also be out of the workbench if they caused connection problems.
A VisitState must be acquired and then released. When a VisitState is acquired, its WorkbenchEntry is removed from the workbench, so it is not possible to acquire any other VisitState with the same IP address. This mechanism guarantees that we never download data using two different FetchingThread instances from the same IP (and, a fortiori, from the same host or scheme+authority).
Once a visit state has been acquired, a visit can be performed by calling VisitState.dequeue() to obtain (ready) URLs and Frontier.enqueue(it.unimi.dsi.fastutil.bytes.ByteArrayList) to filter newly discovered URLs. More information can be found in the documentation of VisitState.
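The acquire-then-release discipline described above can be illustrated with a toy model. ToyWorkbench and AcquireReleaseSketch are hypothetical classes written for this example; they use plain strings instead of VisitState and WorkbenchEntry, and only mimic the fact that an acquired entry is absent from the queue until it is released.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy model of the workbench: a queue of IP addresses ready to be visited. */
final class ToyWorkbench {
    private final BlockingQueue<String> ready = new LinkedBlockingQueue<>();

    void add(String ip) {
        ready.add(ip);
    }

    /** Removes an entry from the queue, so no other thread can obtain the same IP. */
    String acquire() throws InterruptedException {
        return ready.take();
    }

    /** Puts a previously acquired entry back on the queue. */
    void release(String ip) {
        ready.add(ip);
    }

    int size() {
        return ready.size();
    }
}

final class AcquireReleaseSketch {
    /** Returns the queue size while the entry is acquired and after it is released. */
    static int[] run() {
        ToyWorkbench workbench = new ToyWorkbench();
        workbench.add("192.0.2.1");
        int sizeWhileAcquired;
        try {
            String ip = workbench.acquire();
            try {
                // A real visit would fetch URLs here; the entry is off the queue meanwhile.
                sizeWhileAcquired = workbench.size();
            } finally {
                // Always release, even if the visit fails.
                workbench.release(ip);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { sizeWhileAcquired, workbench.size() };
    }
}
```

Releasing in a finally block is a design choice of this sketch: it keeps an entry from being lost to other threads when a visit throws.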
Modifier and Type | Field | Description |
---|---|---|
protected WorkbenchEntrySet | address2WorkbenchEntry | The set of workbench entries. |
java.util.concurrent.atomic.AtomicLong | broken | The number of entirely broken entries (i.e., entries containing only broken visit states). |
Constructor | Description |
---|---|
Workbench() | Creates the workbench. |
Modifier and Type | Method | Description |
---|---|---|
VisitState | acquire() | Acquires a visit state for a scheme+authority accessible by politeness. |
void | add(WorkbenchEntry entry) | Adds a nonempty, not acquired workbench entry to the workbench. |
long | approximatedSize() | Returns an approximation of the workbench size (in number of entries present on the workbench). |
WorkbenchEntry | getWorkbenchEntry(byte[] address) | Returns a workbench entry for the given address, possibly creating one. |
java.util.Iterator<WorkbenchEntry> | iterator() | Returns an (unmodifiable) iterator over the entries currently on the workbench. |
int | numberOfWorkbenchEntries() | Returns the number of existing workbench entries (in and out of the workbench). |
void | release(VisitState visitState) | Releases a previously acquired visit state. |
WorkbenchEntry[] | workbenchEntries() | Returns a vector containing the known workbench entries and nulls. |
protected final WorkbenchEntrySet address2WorkbenchEntry
public final java.util.concurrent.atomic.AtomicLong broken
public java.util.Iterator<WorkbenchEntry> iterator()
Returns an (unmodifiable) iterator over the entries currently on the workbench.
Specified by: iterator in interface java.lang.Iterable<WorkbenchEntry>
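public WorkbenchEntry[] workbenchEntries()
Returns a vector containing the known workbench entries and nulls. During the crawl, the vector might change concurrently.
Returns: a vector containing the known workbench entries and nulls.
public WorkbenchEntry getWorkbenchEntry(byte[] address)
Returns a workbench entry for the given address, possibly creating one. The returned entry is not necessarily on the Workbench currently.
Parameters: address - an IP address in byte-array form.
Returns: a workbench entry for address (possibly a new one).
public int numberOfWorkbenchEntries()
Returns the number of existing workbench entries (in and out of the workbench).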
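public void add(WorkbenchEntry entry)
Adds a nonempty, not acquired workbench entry to the workbench.
Parameters: entry - a nonempty, not acquired workbench entry.
public VisitState acquire() throws java.lang.InterruptedException
Acquires a visit state for a scheme+authority accessible by politeness. You must call release(VisitState) when you have finished.
Throws: java.lang.InterruptedException
public void release(VisitState visitState)
Releases a previously acquired visit state.
Parameters: visitState - a previously acquired visit state.
public long approximatedSize()
Returns an approximation of the workbench size (in number of entries present on the workbench).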
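The "byte-array form" of an IP address expected by getWorkbenchEntry(byte[]) can be obtained with the standard library, assuming the expected form is the one returned by java.net.InetAddress.getAddress() (raw bytes in network order). AddressBytesSketch and toBytes are hypothetical names used only for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class AddressBytesSketch {
    /**
     * Converts a literal IP address to its raw bytes in network byte order.
     * For a literal address, InetAddress.getByName only checks the format
     * and performs no name lookup.
     */
    static byte[] toBytes(String literalIp) {
        try {
            return InetAddress.getByName(literalIp).getAddress();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Not a valid IP literal: " + literalIp, e);
        }
    }
}
```

An IPv4 literal yields a four-byte array and an IPv6 literal a sixteen-byte one; the result could then be passed as the address argument.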