public class MercatorSieve<K,V> extends AbstractSieve<K,V>
An AbstractSieve that stores hash data on disk, much as suggested by Allan Heydon and Marc Najork in “Mercator: A Scalable, Extensible Web Crawler”, World Wide Web, 2(4):219–229, 1999, Springer.
Each key known to the sieve is stored as a 64-bit hash in a disk file. At each
enqueue(Object, Object) the hash of the
enqueued key is added to a bucket, and the key is saved in an auxiliary file. When the bucket is full, it is sorted and compared with
the set of keys known to the sieve. Note that the output order is guaranteed to be the same as the input order (i.e., keys
are appended in the same order in which they appeared the first time).
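The bucket-and-merge behavior described above can be illustrated with a minimal in-memory sketch. This is not the code of this class: the class name MercatorSieveSketch, the FNV-1a hash in hash64, and the use of in-memory arrays and lists in place of the on-disk hash store and auxiliary file are all assumptions made for illustration. It only shows how sorting a full bucket and merging it against a sorted store of known hashes can detect new keys while still emitting them in first-arrival order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative in-memory sketch of a Mercator-style sieve (hypothetical, not the library class). */
final class MercatorSieveSketch {
    private long[] store = new long[0];                     // sorted hashes of known keys (a disk file in the real sieve)
    private final long[] bucket;                            // hashes enqueued since the last flush
    private final List<String> aux = new ArrayList<>();     // enqueued keys in arrival order (stand-in for the auxiliary file)
    private final List<String> output = new ArrayList<>();  // stand-in for the receiver of new keys
    private int size;                                       // number of hashes currently in the bucket

    MercatorSieveSketch(int bucketSize) {
        bucket = new long[bucketSize];
    }

    /** Hypothetical 64-bit hash (FNV-1a over chars); the real sieve takes a hashing strategy. */
    static long hash64(String key) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Adds the hash to the bucket and the key to the auxiliary list; flushes when the bucket is full. */
    void enqueue(String key) {
        bucket[size++] = hash64(key);
        aux.add(key);
        if (size == bucket.length) flush();
    }

    /** Sorts the bucket, merges it with the store, and emits new keys in arrival order. */
    void flush() {
        // Sort bucket positions by (hash, position) so the first occurrence of each hash comes first.
        Integer[] order = new Integer[size];
        for (int i = 0; i < size; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> {
            int c = Long.compare(bucket[a], bucket[b]);
            return c != 0 ? c : Integer.compare(a, b);
        });

        boolean[] isNew = new boolean[size];
        long[] merged = new long[store.length + size];
        int s = 0, m = 0;
        for (int i = 0; i < size; i++) {
            long h = bucket[order[i]];
            if (i > 0 && bucket[order[i - 1]] == h) continue;   // duplicate inside the bucket
            while (s < store.length && store[s] < h) merged[m++] = store[s++];
            if (s < store.length && store[s] == h) continue;    // already known; copied by a later step
            isNew[order[i]] = true;                             // first time this hash is seen
            merged[m++] = h;
        }
        while (s < store.length) merged[m++] = store[s++];
        store = Arrays.copyOf(merged, m);

        // Scan the auxiliary list in arrival order, so output order follows input order.
        for (int i = 0; i < size; i++) {
            if (isNew[i]) output.add(aux.get(i));
        }
        aux.clear();
        size = 0;
    }

    List<String> output() {
        return output;
    }

    int numberOfKnownHashes() {
        return store.length;
    }

    public static void main(String[] args) {
        MercatorSieveSketch sieve = new MercatorSieveSketch(4);
        for (String k : new String[] {"a", "b", "a", "c", "b", "d", "e"}) sieve.enqueue(k);
        sieve.flush(); // force the check of the pairs still waiting in the bucket
        System.out.println(sieve.output()); // prints [a, b, c, d, e]
    }
}
```

Because only hashes are compared, two distinct keys with the same 64-bit hash would be treated as duplicates in this sketch. Keys enqueued after the last automatic flush stay pending until flush() is called explicitly.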
|Constructor and Description|
Creates a new Mercator-like sieve.
|Modifier and Type||Method and Description|
Closes (forever) this sieve.
Adds the given (key,value) pair to the store.
Forces the check+update of all pairs that have been enqueued.
public MercatorSieve(boolean sieveIsNew, File sieveDir, int sieveSize, int storeIOBufferSize, int auxFileIOBufferSize, AbstractSieve.NewFlowReceiver<K> newFlowReceiver, ByteSerializerDeserializer<K> keySerDeser, ByteSerializerDeserializer<V> valueSerDeser, it.unimi.dsi.sux4j.mph.AbstractHashFunction<K> hashingStrategy, AbstractSieve.UpdateStrategy<K,V> updateStrategy) throws IOException
sieveIsNew - whether we are creating a new sieve or opening an old one.
sieveDir - a directory for storing the sieve files.
sieveSize - the size of the sieve in longs.
storeIOBufferSize - the size in bytes of the buffer used to read and write the hash store during flushes (allocated twice during flushes).
auxFileIOBufferSize - the size in bytes of the buffer used to read and write the auxiliary file (always allocated; another allocation happens during flushes).
newFlowReceiver - a receiver for the flow of new keys.
keySerDeser - a serializer/deserializer for keys.
valueSerDeser - a serializer/deserializer for values.
hashingStrategy - a hashing strategy for keys.
updateStrategy - the strategy used to update the values associated with duplicate keys.
public void close() throws IOException
public boolean enqueue(K key, V value) throws IOException, InterruptedException
public int numberOfItems()