viktor: Efficient Vectorized Computations in Kotlin

viktor: Efficient Vectorized Computations in Kotlin

Introducing viktor


is an open-source Kotlin library developed by JetBrains Research that aims to make array calculations more efficient. We achieve this by avoiding nested arrays, delegating expensive operations to JNI + SIMD, and providing built-in support for arithmetics on logarithmically-stored numbers.

This post is in celebration of the 1.1.0 release. We will discuss what the library does (and what it doesn’t), how it came to be, and what lessons we’ve learned while developing it.

Please try the library out! It’s as simple as looking at some examples and adding a line to your Gradle script. If you’re still not convinced, take a look at our benchmark results.

What we do


has been optimized to work with probability arrays, since it is primarily intended for model-based machine learning tasks. For example, we use it in our peak analyzer Span (a bioinformatics tool that detects enriched regions in genomic sequencing data) to fit the underlying hidden Markov model. In particular, we offer:

  • Idiomatic multidimensional array access (rows, columns, slices, views, etc.).
  • Extremely fast element-wise operations (arithmetic, exponent, logarithm, and the like)
    utilizing modern CPU cores to their full extent.
  • Really fast aggregate operations (sum, mean, standard deviation, etc.).
  • Built-in logarithmic storage support: you can convert your values to logarithms and
    work with them without having to convert them back.

What we don’t

  • viktor

    isn’t a linear algebra library – at least not yet. You can get it to multiply matrices, but it’s not optimized for that. If you need lightning-fast matrix multiplication, you’d be better off using Multik, nd4j, or another linear algebra library.

  • viktor

    doesn’t currently have out-of-the-box concurrency (like


    has), though it can be parallelized on the client side.

  • viktor

    doesn’t do GPU computations. Many researchers use their laptops for work, while others have access to multi-core servers and even computational clusters. What unites all these setups is that they all have little to no GPU capabilities.


    is intended for exactly those kinds of cases.



sources are hosted on GitHub, together with a feature overview and some instructional examples.

Starting with version 1.1.0,


binaries (together with sources and Javadoc) are distributed via Maven Central, so it’s really easy to add to any Maven-like JVM project:


Gradle / Gradle Kotlin:


The older versions were published on Bintray (currently in the sunset phase) and can be downloaded manually from GitHub Releases.


The main structure,


, was inspired by NumPy’s ndarray.

Inside, it’s a flat




) endowed with an


and two n-element integer arrays containing the n-dimensional array’s




. This structure allows us to easily express rows, columns, and other slices of an


as other



For instance, a 2×3 array will be stored as a 6-element



offset = 0


shape = {2, 3}

, and

strides = {3, 1}

. If we want to view the second column (the one indexed 1), we just create an array with the same


, but with

offset = 1


shape = {2}

, and

strides = {3}

. It is simplicity itself!



comes with a sizable set of arithmetic and mathematical operations. While the cheap arithmetic operations are performed in a loop, the more expensive mathematical ones are delegated to the Java Native Interface. Moreover, we make sure to utilize the SIMD (single instruction, multiple data) extension sets available on most modern CPUs. The performance gains depend on the operation, the array size, the JVM version, and so on, but can reach 900% even in real-world cases (as it turns out the JVM’s logarithm is pretty slow).

Another useful feature is the built-in support of log-stored values. When working with probabilities, floating-point underflows are a frequent occurrence, since sometimes you have to multiply so many small numbers together that the result can no longer be expressed as a positive number and instead is rounded to zero, losing any utility. To overcome this, people frequently store not the probability itself, but rather its logarithm. Instead of multiplying the probabilities, they can then sum the logarithms. However, they may occasionally need to sum probabilities as well, and this operation is much less natural with logarithmic storage. So


provides a function named


, which does exactly that:

a logAddExp b = log(exp(a) + exp(b))

but in a way that prevents underflows. It is also possible to sum all the values in a log-stored array with


. These operations are also SIMDized whenever possible, achieving even better performance.


binary distribution currently supports one CPU architecture (

amd64 / x86_64

) with two extensions (SSE2, AVX) on Windows, Linux, and macOS. We use TeamCity for the multi-platform build.

Why we needed it

This whole project started because we just wanted to train some mathematical models for our research. However, the training was too slow for our tastes, even after adding concurrency. We used a profiler, and the results surprised us: most of the time was spent calculating logarithms. Just logarithms, over and over again. We replaced the built-in logarithm with Apache Commons Math‘s FastMath logarithm, but it didn’t help. We tried some other existing libraries, but none of them had the exact features that we needed. So, naturally, we had to write our own library.

With that having been decided, we ran into some design issues. We wanted our code to be idiomatic, like in Python’s NumPy, but we also wanted blazing-fast performance like what can be achieved with C++. Oh, and we also needed it in a JVM-compatible language, since our project was JVM-based. It took a little trial-and-error, but eventually,


, a hybrid of C++, Kotlin, and Python, an idiomatic library with a native backend, was released in November 2019.

Incredible stories of (native) optimization

The path ad astra seems to always go per aspera. The following are a few trial-and-error examples, which could be educational, amusing, or both.

At first, we delegated every operation to JNI + SIMD. Then we did the benchmarks, rejoiced at the result, and released the library. However, in our pride, we didn’t think to benchmark the arithmetic operations; we only did mathematics (exponent etc.) and reduction (sum etc.). When we later added the arithmetic benchmarks, we were surprised to see that


performed poorly compared to a naïve loop. A couple of JITWatch sessions later, we learned that even the ancient JDK 1.8 is capable of SIMDizing the most primitive patterns. Like, say, element-wise multiplication of two arrays. Only the JDK calls its native code with much less overhead. Therefore, we dropped the SIMD arithmetic and replaced it with naïve loop arithmetic, and suddenly, performance increased.

At first, we tried to reduce the amount of code by only writing the in-place operations (like




), and then defining the copying operations as copy + in-place. Therefore,

a + b

was defined as

a.copy().apply { this += b }

However, it turned out that copying is stupidly expensive, up to the point that copying takes the same amount of time as the actual calculation. In one of the benchmarks, the system spent 7 ms allocating the new array to hold the results, 10 ms copying the first argument over, and another 10 ms actually performing the addition. So we turned everything around and rewrote each method using a source-destination pattern (in-place methods write back to the source, while copying methods write to a separate destination array), and suddenly, performance increased. (This led to a non-negligible improvement even for computationally expensive operations.)

At first, we wrote our code like this (simplified example):

val data: DoubleArray
// ...
operator fun plusAssign(other: F64Array) {
    for (i in 0 until size) {
        data[i] +=[i]

However, it turned out that even the final references are not extracted, and the JVM invokes


two times per loop iteration. We wrote instead:

operator fun plusAssign(other: F64Array) {
    val dst = data
    val src =
    for (i in 0 until size) {
        dst[i] += src[i]

and suddenly, performance increased.


We have run a series of benchmarks on three JDKs: Oracle JDK 1.8, Oracle JDK 15, and GraalVM 20.3 (which implements OpenJDK 11). Each benchmark measured the performance of an operation using built-in JVM features (baseline) and using


methods. Each benchmark was run for different array sizes, from 1000 to 10 million elements.

The following plots show the performance gain (or loss) of


over the baseline for different operation groups.

Unary operations include exponent, logarithm, and their specialized versions:

expm1 = exp(x) – 1

log1p = log(1 + x)

On the graph above, all the lines are well above 1, which means that the


implementations are much, much faster than their JVM counterparts. The effect is most pronounced on the older JDK 1.8 with performance gain of up to 15x, but is still significant on the more modern ones, too.

Binary operations include the usual arithmetic ones (addition and multiplication) and our specialized log-storage addition operation

log-add-exp = log(exp(a) + exp(b))

The performance of the basic arithmetic operations is very close to the baseline. This is owing to the fact that even the ancient JDK 1.8 is able to SIMDize them natively. The log-storage addition naturally benefits from the exponent and logarithm speedup seen on the previous plot.

Reduction operations include the sum of all elements, the specialized log-storage sum

log-sum-exp = log(∑iexp(xi))

and the scalar product of two vectors (


). After the blazing-fast unary operations, this might not seem like much, but the 5x speedup for


and 3x one for


are not insignificant either.


Our library provides significant speedup for various array operations on multiple platforms. It also offers built-in log-storage support. We’d love to have new users and we welcome your feedback (just email Aleksei Dievskii directly, or use GitHub Issues).