
Abstract
Modular arithmetic is the computational backbone of many cryptographic and scientific algorithms.
In particular, modular multiplication in a large prime field is computationally expensive and dominates the runtime of many algorithms. While it is relatively easy to use vectorization to accelerate batches of independent modular multiplications, we show how vectorization can significantly reduce the latency of a single modular multiplication modulo a generic prime. We achieve this with a new RNS (residue number system) Montgomery multiplication method that has a simpler structure than prior art and that we conjecture performs no unnecessary elementwise modular multiplications. As a result, virtually all of (number-theoretic) cryptography can be sped up using vectorization.
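
To illustrate why a residue number system lends itself to vectorization, the following minimal sketch represents an integer by its residues modulo several small pairwise-coprime moduli, so that one large multiplication becomes independent elementwise multiplications. The moduli and the helper names (`to_rns`, `rns_mul`, `from_rns`) are hypothetical and chosen only for illustration. The sketch shows plain RNS multiplication modulo the product of the moduli; it does not perform reduction modulo a prime and is not the Montgomery method proposed in this work.

```python
from math import prod

# Hypothetical pairwise-coprime moduli, chosen only for illustration.
MODULI = [13, 17, 19, 23]
M = prod(MODULI)  # dynamic range of this toy RNS base


def to_rns(x):
    """Represent x by its residues modulo each base modulus."""
    return [x % m for m in MODULI]


def rns_mul(a, b):
    """Multiply two RNS values lane by lane.

    Each lane depends only on its own modulus, so the lanes could be
    mapped to SIMD lanes and computed in parallel.
    """
    return [(ai * bi) % m for ai, bi, m in zip(a, b, MODULI)]


def from_rns(r):
    """Reconstruct the integer in [0, M) via the Chinese remainder theorem."""
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi mod m (Python 3.8+).
        x += ri * Mi * pow(Mi, -1, m)
    return x % M


a, b = 123, 456
product = from_rns(rns_mul(to_rns(a), to_rns(b)))
print(product == (a * b) % M)
```

The elementwise step is the part that vectorizes trivially. Reducing the result modulo a generic prime without leaving the residue representation is the harder step that an RNS Montgomery method has to address.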