bgtrees.finite_gpufields.cuda_operators package

Submodules

bgtrees.finite_gpufields.cuda_operators.check_inverse module

bgtrees.finite_gpufields.cuda_operators.check_inverse.check_galois(x, pmod=2147483629, nmax=1000)
bgtrees.finite_gpufields.cuda_operators.check_inverse.wrapper_inverse(x)
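
A minimal usage sketch, assuming wrapper_inverse takes an array of nonzero field elements and returns their element-wise modular inverses with respect to the default prime 2147483629; the NumPy dtype and the element values below are illustrative assumptions:

    import numpy as np

    from bgtrees.finite_gpufields.cuda_operators.check_inverse import wrapper_inverse

    p = 2147483629  # default prime modulus used throughout this package

    # Hypothetical input: a small batch of nonzero field elements.
    x = np.array([1, 2, 12345, p - 1], dtype=np.int64)

    # Assuming the wrapper returns element-wise inverses mod p,
    # multiplying back should give 1 for every element.
    inv_x = np.asarray(wrapper_inverse(x), dtype=np.int64)
    assert np.all((x * inv_x) % p == 1)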

bgtrees.finite_gpufields.cuda_operators.dot_product module

bgtrees.finite_gpufields.cuda_operators.inverse module

bgtrees.finite_gpufields.cuda_operators.py_dotproduct module

Script to test and benchmark the dot_product kernels.

The C++ version is competitive with NumPy (which is probably doing the same thing under the hood) when using the @ operator. When using einsum, the C++ version is about 3 to 4 times faster.

The CUDA version is faster than NumPy, with a speed-up that scales with the number of elements:

- 1e5 elements: 2 times faster
- 1e6 elements: 10 times faster
- 1e7 elements: 60 times faster

Note that this comparison is unfair to us, since in NumPy the % reduction is applied only once, at the end. When NumPy is used through einsum, these numbers can be multiplied by the same factor of 3 to 4.
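
For concreteness, a NumPy-only sketch of the point above: the baseline applies a single % at the end, whereas a finite-field implementation reduces mod p at every step. The shapes, dtype, and contraction below are illustrative assumptions, not the actual benchmark code:

    import numpy as np

    p = 2147483629

    # Hypothetical batched inputs: n events, each carrying a length-d vector of field elements.
    # Values are kept small here so that the unreduced intermediate products fit in int64;
    # for full 31-bit field elements this baseline would need a wider accumulator.
    n, d = 100_000, 4
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2**15, size=(n, d), dtype=np.int64)
    y = rng.integers(0, 2**15, size=(n, d), dtype=np.int64)

    # NumPy baseline from the comparison: the % reduction is applied only once, at the end.
    baseline = np.einsum("bi,bi->b", x, y) % p

    # Reducing mod p after every multiply-accumulate keeps intermediates bounded,
    # which is the extra work the finite-field kernels are paying for.
    reduced = np.zeros(n, dtype=np.int64)
    for i in range(d):
        reduced = (reduced + (x[:, i] * y[:, i]) % p) % p

    assert np.array_equal(baseline, reduced)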

Our operations carry some overhead, roughly the cost of computing ~1e4 events. It is unclear how this would scale in a situation with _many_ operations; if the overhead is not per-operation (i.e., once the events are on-device they stay there), this might not be a problem.

bgtrees.finite_gpufields.cuda_operators.py_dotproduct.check_galois(x, y, pmod=2147483629, nmax=1000)
bgtrees.finite_gpufields.cuda_operators.py_dotproduct.fully_python_dot_product(x, y)
bgtrees.finite_gpufields.cuda_operators.py_dotproduct.wrapper_dot_product(x, y)
bgtrees.finite_gpufields.cuda_operators.py_dotproduct.wrapper_dot_product_single_batch(x, y)
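
A hedged usage sketch of the helpers above, assuming fully_python_dot_product is the pure-Python reference and wrapper_dot_product dispatches to the compiled kernel, with both taking batched integer arrays and returning the per-event dot products reduced mod 2147483629; the shapes and dtype below are assumptions:

    import numpy as np

    from bgtrees.finite_gpufields.cuda_operators.py_dotproduct import (
        fully_python_dot_product,
        wrapper_dot_product,
    )

    p = 2147483629

    # Hypothetical batched inputs.
    rng = np.random.default_rng(1)
    x = rng.integers(0, p, size=(1000, 4), dtype=np.uint64)
    y = rng.integers(0, p, size=(1000, 4), dtype=np.uint64)

    # The two implementations should agree element-wise on the reduced result.
    reference = np.asarray(fully_python_dot_product(x, y))
    result = np.asarray(wrapper_dot_product(x, y))
    assert np.array_equal(reference, result)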

Module contents

bgtrees.finite_gpufields.cuda_operators.wrapper_dot_product(x, y)
bgtrees.finite_gpufields.cuda_operators.wrapper_dot_product_single_batch(x, y)
bgtrees.finite_gpufields.cuda_operators.wrapper_inverse(x)
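
Since these wrappers are re-exported at the package level, they can presumably be imported directly from bgtrees.finite_gpufields.cuda_operators rather than from the individual submodules:

    from bgtrees.finite_gpufields.cuda_operators import (
        wrapper_dot_product,
        wrapper_dot_product_single_batch,
        wrapper_inverse,
    )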