All Implemented Interfaces: java.lang.Runnable
public class ParsingThread
extends java.lang.Thread
A thread parsing pages fetched by a FetchingThread.
Instances of this class iteratively extract from Frontier.results (using polling and exponential backoff) a FetchData that has been previously enqueued by a FetchingThread.
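The polling loop with exponential backoff described above can be sketched as follows. This is a minimal illustration and not BUbiNG's actual code: the class `BackoffPoller`, the method `pollWithBackoff`, and the delay parameters are hypothetical, and a plain `java.util.Queue` stands in for Frontier.results.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical sketch: polls a queue, doubling the sleep after each empty poll. */
final class BackoffPoller {
    /**
     * Returns the next element, or null if nothing arrives within maxAttempts polls
     * (or if the calling thread is interrupted while sleeping).
     * The delay starts at minDelayMs and doubles after each empty poll, capped at maxDelayMs.
     */
    static <T> T pollWithBackoff(Queue<T> queue, long minDelayMs, long maxDelayMs, int maxAttempts) {
        long delay = minDelayMs;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T item = queue.poll();
            if (item != null) return item; // something was enqueued: hand it to the caller
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                return null;
            }
            delay = Math.min(delay * 2, maxDelayMs); // exponential backoff with a cap
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: a concurrent queue plays the role of Frontier.results.
        Queue<String> results = new ConcurrentLinkedQueue<>();
        results.add("http://example.com/");
        System.out.println(pollWithBackoff(results, 1, 64, 5)); // the enqueued element
        System.out.println(pollWithBackoff(results, 1, 64, 3)); // queue is empty: null
    }
}
```

Backing off keeps an idle parsing thread from spinning on an empty queue, while the cap bounds the latency once results start arriving again.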
The content of the response is analyzed, the body of the response is possibly parsed, and its digest is computed. Newly discovered (during parsing) URLs are enqueued to the frontier. Then, a signal is issued on the FetchData, so that the owner (a FetchingThread) can work on a different URL or possibly release the visit state.
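The hand-off between the two kinds of threads can be illustrated with a small wait/notify sketch. It assumes nothing about how FetchData actually implements its signal: the class `Data`, its methods `signalDone` and `awaitDone`, and the helper `roundTrip` are hypothetical names chosen for this example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class HandshakeSketch {
    /** Hypothetical stand-in for FetchData: a payload plus a "done" flag guarded by its monitor. */
    static final class Data {
        final String body;
        private boolean done;

        Data(String body) { this.body = body; }

        /** Called by the parsing side once it has finished with this object. */
        synchronized void signalDone() { done = true; notifyAll(); }

        /** Called by the owner; blocks until the parsing side has signalled. */
        synchronized void awaitDone() throws InterruptedException {
            while (!done) wait(); // the loop guards against spurious wakeups
        }
    }

    /** Runs one enqueue/parse/signal round trip and returns the length seen by the parser. */
    static int roundTrip(String body) {
        BlockingQueue<Data> results = new LinkedBlockingQueue<>();
        int[] parsedLength = new int[1];
        Thread parser = new Thread(() -> {
            try {
                Data d = results.take();
                parsedLength[0] = d.body.length(); // stands in for the real parsing work
                d.signalDone();                    // let the owner move on
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();
        Data data = new Data(body);
        try {
            results.put(data);  // the owner enqueues the fetched data...
            data.awaitDone();   // ...and waits until the parser is finished with it
            parser.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return parsedLength[0];
    }
}
```

Only after `awaitDone()` returns is it safe for the owner to reuse the data object for another URL, which is the point of the signal.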
At each step (fetching, parsing, following the URLs of a page, scheduling new URLs, storing) a configurable Filter selects whether a URL is eligible for a specific activity. This makes a very fine-grained configuration of the crawl possible. BUbiNG will also check that it is respecting the robots.txt protocol, and this behavior cannot be disabled programmatically. If you want to crawl the web without respecting the protocol, you'll have to write your own BUbiNG variant (for which we will not be responsible).
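To give an idea of how per-activity filters compose, here is a sketch that uses `java.util.function.Predicate<URI>` as a stand-in for a filter. This is an assumption made for illustration: it does not reproduce BUbiNG's Filter interface or its configuration syntax, and the names `HTTP_ONLY`, `hostEndsWith`, `SCHEDULE_FILTER`, and `PARSE_FILTER` are hypothetical.

```java
import java.net.URI;
import java.util.function.Predicate;

final class FilterSketch {
    /** Accepts only http and https URLs. */
    static final Predicate<URI> HTTP_ONLY =
            u -> "http".equals(u.getScheme()) || "https".equals(u.getScheme());

    /** Accepts URLs whose host ends with the given suffix. */
    static Predicate<URI> hostEndsWith(String suffix) {
        return u -> u.getHost() != null && u.getHost().endsWith(suffix);
    }

    // One filter per activity: scheduling is restricted to one domain,
    // while parsing accepts anything fetched over HTTP(S).
    static final Predicate<URI> SCHEDULE_FILTER = HTTP_ONLY.and(hostEndsWith(".example.org"));
    static final Predicate<URI> PARSE_FILTER = HTTP_ONLY;

    public static void main(String[] args) {
        URI inDomain = URI.create("http://www.example.org/a");
        URI offDomain = URI.create("https://example.com/");
        System.out.println(SCHEDULE_FILTER.test(inDomain));  // true
        System.out.println(SCHEDULE_FILTER.test(offDomain)); // false
        System.out.println(PARSE_FILTER.test(offDomain));    // true
    }
}
```

Keeping a separate predicate per activity is what allows, for instance, parsing a page whose outgoing links will never be scheduled.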
Modifier and Type | Class | Description |
---|---|---|
protected static class | ParsingThread.FrontierEnqueuer | A small gadget used to insert links in the frontier. |
Modifier and Type | Field | Description |
---|---|---|
protected static ObjectOpenHashSet<java.lang.Class<?>> | EXCEPTION_HOST_KILLER | A set of exception types; judging by the field name, presumably those that cause the offending host to be dropped. |
protected static Object2IntOpenHashMap<java.lang.Class<?>> | EXCEPTION_TO_MAX_RETRIES | A map recording for each type of exception the number of retries. |
protected static Object2LongOpenHashMap<java.lang.Class<?>> | EXCEPTION_TO_WAIT_TIME | A map recording for each type of exception a timeout. Note that 0 means the standard politeness time. |
java.util.ArrayList<Parser<?>> | parsers | The parsers used by this thread. |
boolean | stop | Whether we should stop (used also to reduce the number of threads). |
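The two exception maps suggest a lookup of the following kind. The sketch below is a guess at the general idea, not BUbiNG's logic: it uses `java.util.HashMap` instead of the fastutil maps, the entries and their values are invented for illustration, and treating a missing entry as "no retries, standard politeness time" is an assumption.

```java
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

final class ExceptionPolicySketch {
    // Illustrative values only; they are not taken from BUbiNG.
    static final Map<Class<?>, Long> WAIT_TIME_MS = new HashMap<>();
    static final Map<Class<?>, Integer> MAX_RETRIES = new HashMap<>();
    static {
        WAIT_TIME_MS.put(SocketTimeoutException.class, 0L);   // 0 = standard politeness time
        MAX_RETRIES.put(SocketTimeoutException.class, 3);
        WAIT_TIME_MS.put(UnknownHostException.class, 60_000L);
        MAX_RETRIES.put(UnknownHostException.class, 1);
    }

    /** Wait before the next attempt; 0 (or, by assumption, no entry) means the politeness time. */
    static long waitTimeMs(Throwable t, long politenessMs) {
        long wait = WAIT_TIME_MS.getOrDefault(t.getClass(), 0L);
        return wait == 0 ? politenessMs : wait;
    }

    /** Whether another attempt is allowed after attemptsSoFar failures of this exception type. */
    static boolean shouldRetry(Throwable t, int attemptsSoFar) {
        return attemptsSoFar < MAX_RETRIES.getOrDefault(t.getClass(), 0);
    }
}
```

Keying the maps by exception class keeps the retry policy declarative: adding a new failure mode means adding an entry rather than another branch.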
Constructor | Description |
---|---|
ParsingThread(Frontier frontier, Store store, int index) | Creates a thread. |
Modifier and Type | Method | Description |
---|---|---|
void | run() | |

Methods inherited from class java.lang.Object: equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from class java.lang.Thread: activeCount, checkAccess, clone, countStackFrames, currentThread, destroy, dumpStack, enumerate, getAllStackTraces, getContextClassLoader, getDefaultUncaughtExceptionHandler, getId, getName, getPriority, getStackTrace, getState, getThreadGroup, getUncaughtExceptionHandler, holdsLock, interrupt, interrupted, isAlive, isDaemon, isInterrupted, join, join, join, onSpinWait, resume, setContextClassLoader, setDaemon, setDefaultUncaughtExceptionHandler, setName, setPriority, setUncaughtExceptionHandler, sleep, sleep, start, stop, stop, suspend, toString, yield
protected static final Object2LongOpenHashMap<java.lang.Class<?>> EXCEPTION_TO_WAIT_TIME
protected static final Object2IntOpenHashMap<java.lang.Class<?>> EXCEPTION_TO_MAX_RETRIES
protected static final ObjectOpenHashSet<java.lang.Class<?>> EXCEPTION_HOST_KILLER
public volatile boolean stop
public final java.util.ArrayList<Parser<?>> parsers
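A `public volatile boolean stop` field is typically used for cooperative shutdown: another thread sets the flag and the worker notices it at the top of its loop. The sketch below shows that pattern in isolation; `StoppableWorker` and `startAndStop` are hypothetical names, and the loop body is a placeholder rather than the real parsing loop.

```java
/** Hypothetical worker that runs until another thread sets its volatile stop flag. */
final class StoppableWorker extends Thread {
    /** volatile guarantees that a write by the controlling thread becomes visible to run(). */
    public volatile boolean stop;

    @Override
    public void run() {
        while (!stop) {
            Thread.yield(); // placeholder for one unit of work
        }
    }

    /** Starts a worker, asks it to stop, and reports whether it terminated within two seconds. */
    static boolean startAndStop() {
        StoppableWorker worker = new StoppableWorker();
        worker.start();
        worker.stop = true; // request termination; the worker exits at its next loop check
        try {
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }
}
```

Because each thread has its own flag, a pool can be shrunk by setting `stop` on just a few workers, which matches the "used also to reduce the number of threads" remark in the field description.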