All Implemented Interfaces:
java.io.Serializable, java.lang.Comparable<java.util.concurrent.Delayed>, java.util.concurrent.Delayed
public class VisitState
extends java.lang.Object
implements java.util.concurrent.Delayed, java.io.Serializable
A visit state is attached to a WorkbenchEntry after the resolution of the hostname by a DNSThread has taken place. Note, though, that the visit state may be temporarily removed from the entry (albeit still keeping a reference to it) when it becomes empty.
An instance of this class also records the nextFetch available for its schemeAuthority, a queue of path+queries and a cached robot filter. The nextFetch field is used to implement the Delayed interface.
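
As an illustration of how a nextFetch timestamp can back the Delayed interface, here is a minimal sketch. The class DelayedStateSketch and its String schemeAuthority are hypothetical simplifications, not the actual VisitState code; the sketch assumes nextFetch is expressed in milliseconds since the epoch.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified stand-in for VisitState: only the nextFetch-based Delayed logic. */
class DelayedStateSketch implements Delayed {
    final String schemeAuthority;
    /** Earliest time (milliseconds since the epoch) at which this state may be fetched again. */
    volatile long nextFetch;

    DelayedStateSketch(String schemeAuthority, long nextFetch) {
        this.schemeAuthority = schemeAuthority;
        this.nextFetch = nextFetch;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // Remaining time until nextFetch; zero or negative means "ready".
        return unit.convert(nextFetch - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed o) {
        // Order by nextFetch, so that a DelayQueue hands out the earliest state first.
        return Long.compare(nextFetch, ((DelayedStateSketch) o).nextFetch);
    }

    public static void main(String[] args) throws InterruptedException {
        long now = System.currentTimeMillis();
        DelayQueue<DelayedStateSketch> queue = new DelayQueue<>();
        queue.add(new DelayedStateSketch("http://slow.example", now + 200));
        queue.add(new DelayedStateSketch("http://fast.example", now + 50));
        // take() blocks until the delay of the head element has expired.
        System.out.println(queue.take().schemeAuthority);
        System.out.println(queue.take().schemeAuthority);
    }
}
```

With this ordering, a java.util.concurrent.DelayQueue releases each state only once its politeness interval has elapsed.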
Instances of this class are synchronized, and enqueue/dequeue operations can happen concurrently: this is necessary because firstPath() is called by a FetchingThread, dequeuing is performed by a ParsingThread, whereas enqueuing is performed by the Distributor.
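
The following sketch shows one way such a synchronized queue could look. SynchronizedPathQueueSketch is a hypothetical class built on java.util.ArrayDeque, not the actual implementation; the two threads in main merely play the roles of the enqueuing and dequeuing sides described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

/** Hypothetical sketch: a path+query queue whose operations are all synchronized. */
class SynchronizedPathQueueSketch {
    private final ArrayDeque<byte[]> pathQueries = new ArrayDeque<>();

    /** Appends a path+query (the enqueuing side). */
    synchronized void enqueuePathQuery(byte[] pathQuery) { pathQueries.addLast(pathQuery); }

    /** Peeks at the first path+query; throws NoSuchElementException if the queue is empty. */
    synchronized byte[] firstPath() { return pathQueries.getFirst(); }

    /** Removes the first path+query; throws NoSuchElementException if the queue is empty. */
    synchronized byte[] dequeue() { return pathQueries.removeFirst(); }

    synchronized int size() { return pathQueries.size(); }

    synchronized boolean isEmpty() { return pathQueries.isEmpty(); }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedPathQueueSketch state = new SynchronizedPathQueueSketch();
        // One thread enqueues while the main thread peeks and dequeues.
        Thread enqueuer = new Thread(() -> {
            for (int i = 0; i < 1000; i++)
                state.enqueuePathQuery(("/page" + i).getBytes(StandardCharsets.US_ASCII));
        });
        enqueuer.start();
        int dequeued = 0;
        while (dequeued < 1000) {
            // Check-then-dequeue is safe here only because this thread is the sole consumer.
            if (!state.isEmpty()) {
                state.firstPath();
                state.dequeue();
                dequeued++;
            }
        }
        enqueuer.join();
        System.out.println(dequeued + " path+queries dequeued, " + state.size() + " left");
    }
}
```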
A visit state is declared broken when some exception (e.g., socket timeout, connection timeout, etc.) takes place while trying to download a URL coming from the VisitState, or even when the hostname of the VisitState cannot be resolved by the DNS. The kind of exception that took place is stored in lastExceptionClass, and a counter is kept of the number of attempts performed so far. Every attempt causes an increase in the interval that must elapse before trying again: the starting interval is determined by ParsingThread.EXCEPTION_TO_WAIT_TIME, and then it is doubled every time, until the number of attempts exceeds a fixed maximum (ParsingThread.EXCEPTION_TO_MAX_RETRIES).
Two things can then happen, depending on ParsingThread.EXCEPTION_HOST_KILLER: if the exception is not very serious, the URL that caused the exception is thrown away and the next one is attempted (the visit state is not broken any more, at this point, unless an exception takes place with the new URL); for serious exceptions, the visit state is scheduled for purge: it gets emptied, the number of URLs fetched is set to Integer.MAX_VALUE to avoid downloading more pages, and eventually all the URLs found on disk for that host will also be discarded.
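
The retry policy described above can be sketched as follows. This is an illustration under stated assumptions, not the crawler's code: the maps WAIT_TIME and MAX_RETRIES and the set HOST_KILLER are hypothetical stand-ins for the ParsingThread constants, their values are invented for the example, and resetting the counter when the exception class changes is an assumption.

```java
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of doubling retry intervals with a per-exception maximum. */
class RetryPolicySketch {
    enum Outcome { RETRY_LATER, DROP_URL, PURGE_HOST }

    // Illustrative values only; they are not the crawler's actual settings.
    static final Map<Class<? extends Throwable>, Long> WAIT_TIME = Map.of(
        SocketTimeoutException.class, 10_000L,
        UnknownHostException.class, 3_600_000L);
    static final Map<Class<? extends Throwable>, Integer> MAX_RETRIES = Map.of(
        SocketTimeoutException.class, 3,
        UnknownHostException.class, 2);
    static final Set<Class<? extends Throwable>> HOST_KILLER = Set.of(UnknownHostException.class);

    Class<? extends Throwable> lastExceptionClass;
    int retries;
    long nextFetch;

    /** Records a failure at time now (milliseconds) and decides what to do next. */
    Outcome onException(Class<? extends Throwable> exceptionClass, long now) {
        if (exceptionClass != lastExceptionClass) {
            // Assumption: a different kind of exception restarts the count.
            lastExceptionClass = exceptionClass;
            retries = 0;
        }
        if (retries < MAX_RETRIES.getOrDefault(exceptionClass, 1)) {
            // The interval doubles at each attempt: wait, 2 * wait, 4 * wait, ...
            nextFetch = now + (WAIT_TIME.getOrDefault(exceptionClass, 1_000L) << retries);
            retries++;
            return Outcome.RETRY_LATER;
        }
        // Retries exhausted: the state is no longer broken, and the exception
        // class decides between dropping one URL and purging the whole host.
        lastExceptionClass = null;
        retries = 0;
        return HOST_KILLER.contains(exceptionClass) ? Outcome.PURGE_HOST : Outcome.DROP_URL;
    }

    public static void main(String[] args) {
        RetryPolicySketch state = new RetryPolicySketch();
        for (int attempt = 1; attempt <= 4; attempt++) {
            Outcome outcome = state.onException(SocketTimeoutException.class, 0);
            System.out.println("attempt " + attempt + ": " + outcome + ", nextFetch=" + state.nextFetch);
        }
    }
}
```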
Modifier and Type | Field | Description |
---|---|---|
protected boolean | acquired | Whether this visit state is currently acquired. |
org.apache.http.cookie.Cookie[] | cookies | The cookies of this visit state. |
static org.apache.http.cookie.Cookie[] | EMPTY_COOKIE_ARRAY | A singleton empty cookie array. |
Frontier | frontier | A reference to the frontier. |
java.lang.Class<? extends java.lang.Throwable> | lastExceptionClass | If not null, this field contains the class of the exception that was thrown during the last attempt to access this scheme+authority. |
long | lastRobotsFetch | System.currentTimeMillis() when we fetched the robots we are caching. |
static long | MOVING_AVERAGE_WINDOW | The window over which we compute the moving average. |
long | nextFetch | The minimum time at which this visit state can be accessed because of host-based politeness, zero at creation. |
int | retries | If lastExceptionClass is not null, the count of the retries for this type of exception. |
static byte[] | ROBOTS_PATH | A special path marking a robots.txt refresh request. |
char[][] | robotsFilter | The robots-forbidden prefixes we are caching, as returned from the URLRespectsRobots.parseRobotsReader(Reader, String) method. |
byte[] | schemeAuthority | The scheme and authority visited by this visit state. |
float | spammicity | The spammicity score, if computed; -1, otherwise. |
Short2ShortOpenHashMap | termCount | A map from term indices to counts for the pages of this host. |
int | termCountUpdates | The number of calls performed to updateTermCount(Short2ShortMap). |
WorkbenchEntry | workbenchEntry | The workbench entry this visit state belongs to. |
Constructor | Description |
---|---|
VisitState(Frontier frontier, byte[] schemeAuthority) | Creates a visit state. |
Modifier and Type | Method | Description |
---|---|---|
void | checkRobots(long time) | Checks whether the current robots information has expired and, if necessary, schedules a new robots.txt download. |
void | clear() | Empties this visit state of all the URLs that it contains. |
int | compareTo(java.util.concurrent.Delayed o) | |
byte[] | dequeue() | Removes the first path in the queue. |
void | enqueuePathQuery(byte[] pathQuery) | Enqueues a path+query in byte-array representation, possibly putting this visit state in its entry. |
void | enqueueRobots() | Enqueues the /robots.txt path as the first element of the queue, if the queue is empty, or as the second element otherwise, possibly putting this visit state in its entry. |
byte[] | firstPath() | Peeks at the first path in the queue. |
void | forciblyEnqueueRobotsFirst() | Forcibly enqueues the /robots.txt path as the first element of the queue. |
long | getDelay(java.util.concurrent.TimeUnit unit) | |
boolean | isAlive(long time) | Returns whether this visit state is fetchable (i.e., if there is at least one URL and it is allowed by politeness to fetch it). |
boolean | isEmpty() | Returns whether this visit state is empty. |
int | pathQueryLimit() | Returns an estimate of the number of path+queries that this visit state should keep in memory. |
void | putInEntryIfNotEmpty() | Puts this visit state in its entry, if it is not empty. |
void | schedulePurge() | Permanently disables this visit state and schedules its purge by setting the count associated with its schemeAuthority to Integer.MAX_VALUE, clearing the internal queue and setting nextFetch to Long.MAX_VALUE. |
void | setWorkbenchEntry(WorkbenchEntry workbenchEntry) | Sets the workbench entry and puts this visit state in its entry if it is nonempty. |
int | size() | Computes the size (i.e., the number of URLs) of this visit state. |
java.lang.String | toString() | |
void | updateTermCount(Short2ShortMap termCount) | |
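public static final byte[] ROBOTS_PATH
A special path marking a robots.txt refresh request.
public static final org.apache.http.cookie.Cookie[] EMPTY_COOKIE_ARRAY
A singleton empty cookie array.
public static final long MOVING_AVERAGE_WINDOW
The window over which we compute the moving average.
public final byte[] schemeAuthority
The scheme and authority visited by this visit state.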
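public volatile long nextFetch
The minimum time at which this visit state can be accessed because of host-based politeness, zero at creation. If Long.MAX_VALUE, the maximum number of URLs has been reached for this visit state.
public volatile long lastRobotsFetch
System.currentTimeMillis() when we fetched the robots we are caching.
public volatile char[][] robotsFilter
The robots-forbidden prefixes we are caching, as returned from the URLRespectsRobots.parseRobotsReader(Reader, String) method.
public transient volatile WorkbenchEntry workbenchEntry
The workbench entry this visit state belongs to. This field is not null, regardless of whether this visit state is actually in the queue of its workbench entry, unless lastExceptionClass is not null.
public org.apache.http.cookie.Cookie[] cookies
The cookies of this visit state.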
public volatile java.lang.Class<? extends java.lang.Throwable> lastExceptionClass
If not null, this field contains the class of the exception that was thrown during the last attempt to access this scheme+authority. If this happens, it might be the case that workbenchEntry is null.
public volatile int retries
If lastExceptionClass is not null, the count of the retries for this type of exception.
protected transient volatile boolean acquired
Whether this visit state is currently acquired.
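public transient Frontier frontier
A reference to the frontier.
public final Short2ShortOpenHashMap termCount
A map from term indices to counts for the pages of this host.
RuntimeConfiguration.spamDetector is not null.
public volatile int termCountUpdates
The number of calls performed to updateTermCount(Short2ShortMap).
public volatile float spammicity
The spammicity score, if computed; -1, otherwise.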
public VisitState(Frontier frontier, byte[] schemeAuthority)
Creates a visit state.
Parameters:
frontier - the frontier for which the state is being created.
schemeAuthority - the scheme+authority of this VisitState.
public void setWorkbenchEntry(WorkbenchEntry workbenchEntry)
Sets the workbench entry and puts this visit state in its entry if it is nonempty.
acquired.
public void putInEntryIfNotEmpty()
Puts this visit state in its entry, if it is not empty.
This method is called only by Workbench.release(VisitState).
Note that the associated entry is not put back on the workbench. Use WorkbenchEntry.putOnWorkbenchIfNotEmpty(Workbench) for that purpose.
workbenchEntry ≠ null, acquired.
acquired.
public void enqueueRobots()
Enqueues the /robots.txt path as the first element of the queue, if the queue is empty, or as the second element otherwise, possibly putting this visit state in its entry.
This behaviour is necessary: if this visit state is acquired, the first path+query might still need to be dequeued by a ParsingThread.
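
A minimal sketch of the "first if empty, second otherwise" insertion follows. RobotsEnqueueSketch is a hypothetical class built on java.util.ArrayDeque, and the byte content chosen for ROBOTS_PATH here is an assumption made for the example, not the actual constant.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;

/** Hypothetical sketch of where the robots request lands in the path+query queue. */
class RobotsEnqueueSketch {
    // Assumed marker value, for illustration only.
    static final byte[] ROBOTS_PATH = "/robots.txt".getBytes(StandardCharsets.US_ASCII);

    final ArrayDeque<byte[]> pathQueries = new ArrayDeque<>();

    /** First element if the queue is empty, second element otherwise. */
    synchronized void enqueueRobots() {
        if (pathQueries.isEmpty()) {
            pathQueries.addFirst(ROBOTS_PATH);
        } else {
            // Keep the current head in place: a parsing thread may still have to dequeue it.
            byte[] first = pathQueries.removeFirst();
            pathQueries.addFirst(ROBOTS_PATH);
            pathQueries.addFirst(first);
        }
    }

    /** Always the first element; meant for restoring a saved state. */
    synchronized void forciblyEnqueueRobotsFirst() {
        pathQueries.addFirst(ROBOTS_PATH);
    }

    public static void main(String[] args) {
        RobotsEnqueueSketch state = new RobotsEnqueueSketch();
        state.pathQueries.addLast("/a".getBytes(StandardCharsets.US_ASCII));
        state.pathQueries.addLast("/b".getBytes(StandardCharsets.US_ASCII));
        state.enqueueRobots();
        for (byte[] pathQuery : state.pathQueries)
            System.out.println(new String(pathQuery, StandardCharsets.US_ASCII));
    }
}
```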
public void forciblyEnqueueRobotsFirst()
Forcibly enqueues the /robots.txt path as the first element of the queue.
This method is useful only when restoring a saved state.
public void enqueuePathQuery(byte[] pathQuery)
Enqueues a path+query in byte-array representation, possibly putting this visit state in its entry.
Note that if you enqueue the same URL more than once to a visit state using this method, it will be fetched twice. Thus, this method should be called only by the receiver of new flows from the sieve.
Note that the scheme+authority of the enqueued URL must be schemeAuthority.
Parameters:
pathQuery - a path+query in byte-array representation.
public byte[] firstPath()
Peeks at the first path in the queue.
The result of this call should be passed to BURL.fromNormalizedSchemeAuthorityAndPathQuery(String, byte[]) to obtain the final BUbiNG URL.
Throws:
java.util.NoSuchElementException - if the queue of path+queries is empty.
public byte[] dequeue()
Removes the first path in the queue.
Throws:
java.util.NoSuchElementException - if the queue of path+queries is empty.
public void clear()
Empties this visit state of all the URLs that it contains.
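public int size()
Computes the size (i.e., the number of URLs) of this visit state.
public boolean isEmpty()
Returns whether this visit state is empty.
Returns:
true iff this visit state does not contain any URL.
public boolean isAlive(long time)
Returns whether this visit state is fetchable (i.e., if there is at least one URL and it is allowed by politeness to fetch it).
Parameters:
time - the current time.
Returns:
true if the state is fetchable.
public java.lang.String toString()
Overrides:
toString in class java.lang.Object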
public long getDelay(java.util.concurrent.TimeUnit unit)
Specified by:
getDelay in interface java.util.concurrent.Delayed
public int compareTo(java.util.concurrent.Delayed o)
Specified by:
compareTo in interface java.lang.Comparable<java.util.concurrent.Delayed>
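public void schedulePurge()
Permanently disables this visit state and schedules its purge by setting the count associated with its schemeAuthority to Integer.MAX_VALUE, clearing the internal queue and setting nextFetch to Long.MAX_VALUE.
public void checkRobots(long time)
Checks whether the current robots information has expired and, if necessary, schedules a new robots.txt download.
Parameters:
time - the current time.
public int pathQueryLimit()
Returns an estimate of the number of path+queries that this visit state should keep in memory.
public void updateTermCount(Short2ShortMap termCount)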
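
To show how schedulePurge() and isAlive(long) interact, here is a minimal sketch. PurgeSketch is a hypothetical class; the fetchedCount field stands in for the per-schemeAuthority count that the documentation says is kept elsewhere, and the test nextFetch <= time is an assumption about how politeness is checked.

```java
import java.util.ArrayDeque;

/** Hypothetical sketch: a purged state is empty and never becomes fetchable again. */
class PurgeSketch {
    private final ArrayDeque<String> pathQueries = new ArrayDeque<>();
    /** Minimum time at which the state can be accessed; zero at creation. */
    volatile long nextFetch;
    /** Stand-in for the URL count associated with the scheme+authority. */
    int fetchedCount;

    synchronized void enqueue(String pathQuery) { pathQueries.addLast(pathQuery); }

    synchronized boolean isEmpty() { return pathQueries.isEmpty(); }

    /** Fetchable iff there is at least one URL and politeness allows fetching it now. */
    synchronized boolean isAlive(long time) {
        return !pathQueries.isEmpty() && nextFetch <= time;
    }

    /** Saturates the count, clears the queue and pushes nextFetch to the far future. */
    synchronized void schedulePurge() {
        fetchedCount = Integer.MAX_VALUE;
        pathQueries.clear();
        nextFetch = Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        PurgeSketch state = new PurgeSketch();
        state.enqueue("/index.html");
        System.out.println("alive before purge: " + state.isAlive(System.currentTimeMillis()));
        state.schedulePurge();
        state.enqueue("/late.html");
        System.out.println("alive after purge: " + state.isAlive(System.currentTimeMillis()));
    }
}
```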