All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable
public class MercatorSieve<K,V> extends AbstractSieve<K,V>
An AbstractSieve that stores hash data on disk, in much the same way as suggested by Allan Heydon and Marc Najork in
“Mercator: A Scalable, Extensible Web Crawler”,
World Wide Web, (2)4:219−229, 1999, Springer.
Each key known to the sieve is stored as a 64-bit hash in a disk file. At each enqueue(Object, Object),
the hash of the enqueued key is added to a bucket, and the key is saved in an auxiliary file. When the bucket is full, it is sorted and compared with
the set of keys known to the sieve. Note that the output order is guaranteed to be the same as the input order (i.e., keys
are appended in the same order in which they first appeared).
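The bucket-and-merge cycle described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the class's actual implementation: `ToySieve`, its FNV-1a stand-in hash, and the use of arrays and lists in place of the disk store and auxiliary file are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory model of a Mercator-style sieve (illustration only). */
final class ToySieve {
    private long[] store = new long[0];                      // sorted hashes of known keys (stands in for the disk store)
    private final long[] bucket;                             // hashes of pending keys
    private final List<String> aux = new ArrayList<>();      // pending keys in arrival order (stands in for the auxiliary file)
    private final List<String> newKeys = new ArrayList<>();  // keys reported as new, in output order
    private int filled;

    ToySieve(int bucketSize) { bucket = new long[bucketSize]; }

    /** Stand-in 64-bit hash (FNV-1a); an assumption made for this sketch. */
    private static long hash(String key) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < key.length(); i++) { h ^= key.charAt(i); h *= 0x100000001b3L; }
        return h;
    }

    void enqueue(String key) {
        bucket[filled++] = hash(key);
        aux.add(key);
        if (filled == bucket.length) flush(); // a full bucket triggers the check
    }

    void flush() {
        final int n = filled;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort positions by hash; ties are broken by arrival position,
        // so the first occurrence of a hash comes first.
        Arrays.sort(order, (a, b) -> bucket[a] != bucket[b]
                ? Long.compare(bucket[a], bucket[b]) : Integer.compare(a, b));
        boolean[] isNew = new boolean[n];
        long[] merged = new long[store.length + n];
        int s = 0, m = 0;
        for (int i = 0; i < n; i++) {
            long h = bucket[order[i]];
            if (i > 0 && h == bucket[order[i - 1]]) continue;  // duplicate inside the bucket
            while (s < store.length && store[s] < h) merged[m++] = store[s++];
            if (s < store.length && store[s] == h) continue;   // already known to the sieve
            isNew[order[i]] = true;
            merged[m++] = h;
        }
        while (s < store.length) merged[m++] = store[s++];
        store = Arrays.copyOf(merged, m);
        // Emit new keys in arrival order, which preserves the input order.
        for (int i = 0; i < n; i++) if (isNew[i]) newKeys.add(aux.get(i));
        aux.clear();
        filled = 0;
    }

    List<String> newKeys() { return newKeys; }
}
```

In this model the sorted merge decides which hashes are new, while the arrival-ordered replay of the pending keys is what keeps the output in input order.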
AbstractSieve.DefaultUpdateStrategy<K,V>, AbstractSieve.DiskNewFlow<T>, AbstractSieve.NewFlowReceiver<K>, AbstractSieve.SieveEntry<K,V>, AbstractSieve.UpdateStrategy<K,V>
CHAR_SEQUENCE_HASHING_STRATEGY, hashingStrategy, keySerDeser, newFlowReceiver, updateStrategy, valueSerDeser
Constructor | Description |
---|---|
MercatorSieve(boolean sieveIsNew, java.io.File sieveDir, int sieveSize, int storeIOBufferSize, int auxFileIOBufferSize, AbstractSieve.NewFlowReceiver<K> newFlowReceiver, ByteSerializerDeserializer<K> keySerDeser, ByteSerializerDeserializer<V> valueSerDeser, it.unimi.dsi.sux4j.mph.AbstractHashFunction<K> hashingStrategy, AbstractSieve.UpdateStrategy<K,V> updateStrategy) | Creates a new Mercator-like sieve. |
Modifier and Type | Method | Description |
---|---|---|
void | close() | Closes (forever) this sieve. |
boolean | enqueue(K key, V value) | Add the given (key,value) pair to the store. |
void | flush() | Forces the check+update of all pairs that have been enqueued. |
int | numberOfItems() | |
setNewFlowRecevier
public MercatorSieve(boolean sieveIsNew, java.io.File sieveDir, int sieveSize, int storeIOBufferSize, int auxFileIOBufferSize, AbstractSieve.NewFlowReceiver<K> newFlowReceiver, ByteSerializerDeserializer<K> keySerDeser, ByteSerializerDeserializer<V> valueSerDeser, it.unimi.dsi.sux4j.mph.AbstractHashFunction<K> hashingStrategy, AbstractSieve.UpdateStrategy<K,V> updateStrategy) throws java.io.IOException
Parameters:
sieveIsNew - whether we are creating a new sieve or opening an old one.
sieveDir - a directory for storing the sieve files.
sieveSize - the size of the sieve in longs.
storeIOBufferSize - the size in bytes of the buffer used to read and write the hash store during flushes (allocated twice during flushes).
auxFileIOBufferSize - the size in bytes of the buffer used to read and write the auxiliary file (always allocated; another allocation happens during flushes).
newFlowReceiver - a receiver for the flow of new keys.
keySerDeser - a serializer/deserializer for keys.
valueSerDeser - a serializer/deserializer for values.
hashingStrategy - a hashing strategy for keys.
updateStrategy - the strategy used to update the values associated with duplicate keys.
Throws:
java.io.IOException
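The size parameters above suggest a rough way to budget memory. The following is a back-of-the-envelope sketch, assuming that the bucket occupies exactly sieveSize longs and that the buffers named above are the only other significant allocations; the real class may allocate more. `SieveMemoryEstimate` and its method names are hypothetical helpers, not part of the API.

```java
/** Hypothetical helper that estimates memory use from the constructor parameters. */
final class SieveMemoryEstimate {
    private SieveMemoryEstimate() {}

    /** Bytes held between flushes: the bucket plus the always-allocated auxiliary-file buffer. */
    static long steadyStateBytes(int sieveSize, int auxFileIOBufferSize) {
        return (long) sieveSize * Long.BYTES + auxFileIOBufferSize;
    }

    /**
     * Bytes held during a flush: the bucket, two store buffers
     * (the store buffer is allocated twice during flushes), and two
     * auxiliary-file buffers (the permanent one plus the flush-time one).
     */
    static long flushPeakBytes(int sieveSize, int storeIOBufferSize, int auxFileIOBufferSize) {
        return (long) sieveSize * Long.BYTES
                + 2L * storeIOBufferSize
                + 2L * auxFileIOBufferSize;
    }
}
```

For example, under these assumptions a sieve of 2^20 longs with 64 KiB buffers would hold about 8 MiB for the bucket and an extra 256 KiB of buffers at flush time.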
public void close() throws java.io.IOException
Description copied from class: AbstractSieve
Closes (forever) this sieve.
close in interface java.lang.AutoCloseable
close in interface java.io.Closeable
close in class AbstractSieve<K,V>
Throws:
java.io.IOException
public boolean enqueue(K key, V value) throws java.io.IOException, java.lang.InterruptedException
Description copied from class: AbstractSieve
Add the given (key,value) pair to the store.
enqueue in class AbstractSieve<K,V>
Throws:
java.io.IOException
java.lang.InterruptedException
public int numberOfItems()
public void flush() throws java.io.IOException
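Description copied from class: AbstractSieve
Forces the check+update of all pairs that have been enqueued.
flush in class AbstractSieve<K,V>
Throws:
java.io.IOException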
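Because the class is Closeable, a typical calling pattern is to enqueue pairs, force a final flush, and let try-with-resources close the sieve. The sketch below illustrates that pattern against a hypothetical stand-in interface, `SieveLike`, which only mirrors the enqueue, flush, and close signatures listed above; `SieveLifecycle` and `RecordingSieve` are likewise illustrative names, and no real sieve is constructed.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in mirroring the method signatures listed above. */
interface SieveLike<K, V> extends Closeable {
    boolean enqueue(K key, V value) throws IOException, InterruptedException;
    void flush() throws IOException;
}

final class SieveLifecycle {
    private SieveLifecycle() {}

    /** Enqueues every pair, flushes once at the end, then closes the sieve. */
    static <K, V> void enqueueAll(SieveLike<K, V> sieve, Map<K, V> pairs)
            throws IOException, InterruptedException {
        try (SieveLike<K, V> s = sieve) {
            for (Map.Entry<K, V> e : pairs.entrySet()) {
                // The meaning of the boolean result is not stated in this page, so it is ignored here.
                s.enqueue(e.getKey(), e.getValue());
            }
            s.flush(); // pairs still sitting in a partly filled bucket are checked now
        } // close() runs here, even if enqueue or flush throws
    }
}

/** Toy implementation that records the order of calls (for demonstration only). */
final class RecordingSieve implements SieveLike<String, Integer> {
    final List<String> log = new ArrayList<>();

    @Override public boolean enqueue(String key, Integer value) { log.add("enqueue:" + key); return false; }
    @Override public void flush() { log.add("flush"); }
    @Override public void close() { log.add("close"); }
}
```

The explicit flush before closing is a design choice of this sketch: the page does not say whether close() itself flushes pending pairs, so the sketch does not rely on it.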