## Describe the changes
This PR adds the capability to pin host memory in golang bindings
allowing data transfers to be quicker. Memory can be pinned once for
multiple devices by passing the flag
`cuda_runtime.CudaHostRegisterPortable` or
`cuda_runtime.CudaHostAllocPortable` depending on how pinned memory is
called
## Describe the changes
This PR fixes an issue when `RunOnDevice` is called for multi-gpu while
other goroutines calling device operations are run outside of
`RunOnDevice`. The issue comes from setting a device other than the
default device (device 0) on a host thread within `RunOnDevice` and not
unsetting that host threads device when `RunOnDevice` finishes.
When `RunOnDevice` locks a host thread to ensure that all other calls in
the go routine are on the same device, it never unsets that thread’s
device. Once the thread is unlocked, other go routines can get scheduled
to it but it still has the device set to whatever it was before while it
was locked so its possible that the following sequence happens:
1. NTT domain is initialized on thread 2 via a goroutine on device 0
2. MSM multiGPU test runs and is locked on thread 3 setting its device
to 1
3. Other tests run concurrently on threads other than 3 (since it is
locked)
4. MSM multiGPU test finishes and release thread 3 back to the pool but
its device is still 1
5. NTT test runs and is assigned to thread 3 --> this will fail because
the thread’s device wasn’t released back
We really only want to set a thread's device while the thread is locked.
But once we unlock a thread, it’s device should return to whatever it
was set at originally. In theory, it should always be 0 if `SetDevice`
is never used outside of `RunOnDevice` - which it shouldn’t be in most
situations
## Describe the changes
This PR adds multi gpu support in the golang bindings.
Tha main changes are to DeviceSlice which now includes a `deviceId`
attribute specifying which device the underlying data resides on and
checks for correct deviceId and current device when using DeviceSlices
in any operation.
In Go, most concurrency can be done via Goroutines (described as
lightweight threads - in reality, more of a threadpool manager),
however, there is no guarantee that a goroutine stays on a specific host
thread. Therefore, a function `RunOnDevice` was added to the
cuda_runtime package which locks a goroutine into a specific host
thread, sets a current GPU device, runs a provided function, and unlocks
the goroutine from the host thread after the provided function finishes.
While the goroutine is locked to the hsot thread, the Go runtime will
not assign other goroutines to that host thread