So a few notes on implementation.
The real benefits of this particular formulation of hashing
are that our look-ups are again constant time,
and that's terrific for parallelism for a machine like the GPU,
because we keep all of our threads busy.
There's no thread divergence at all, and the construction algorithm is actually pretty simple and pretty fast.
It's actually a very efficient hash table construction when it's done in shared memory with a small number of items to hash,
but it turns out that it actually works fairly well with large data sets
when the hash table needs to be constructed in main memory.
The really important thing to keep in mind there
is that the operation of "write my item into the hash table and kick the other item out" needs to be atomic--
in our implementation we used atomic exchange
to make sure the 2 competing threads don't have any possibility of stomping on top of each other.
And a final note, there's more than one way we might solve this general problem of checking whether an element is in a set.
Instead of using hash tables, we could choose a fairly brute force way to do this.
We could choose to sort all the elements in the set as a construction step,
and then, for the look-up step, we could do a binary search into the set
to see if that particular element is present.
So for instance, we might have 5 keys here in a set.
To construct this particular data structure, we sort these in order of the key,
and to do look-up we just do binary search within this set.
So for instance, if we're looking up value k6 with one thread,
we'd start at k5--our key is bigger than that,
so we'd move to the right, and so on, until we get to k6.
I might have another thread that's looking up k3.
It would start in the middle at k5 and then do binary search until it finds k3.
Now sorting is quite fast on GPUs, and even though hash tables are faster for this particular computation,
they're not so much faster that you should never consider a sort to be a good idea.
Sometimes on the GPU, a brute force approach
like sort and binary search might be your best option.