viktor: Efficient Vectorized Computations in Kotlin
Introducing viktor
is an opensource Kotlin library developed by JetBrains Research that aims to make array calculations more efficient. We achieve this by avoiding nested arrays, delegating expensive operations to JNI + SIMD, and providing builtin support for arithmetics on logarithmicallystored numbers.
This post is in celebration of the 1.1.0 release. We will discuss what the library does (and what it doesn’t), how it came to be, and what lessons we’ve learned while developing it.
Please try the library out! It’s as simple as looking at some examples and adding a line to your Gradle script. If you’re still not convinced, take a look at our benchmark results.
What we do
has been optimized to work with probability arrays, since it is primarily intended for modelbased machine learning tasks. For example, we use it in our peak analyzer Span (a bioinformatics tool that detects enriched regions in genomic sequencing data) to fit the underlying hidden Markov model. In particular, we offer:
 Idiomatic multidimensional array access (rows, columns, slices, views, etc.).
 Extremely fast elementwise operations (arithmetic, exponent, logarithm, and the like)
utilizing modern CPU cores to their full extent.  Really fast aggregate operations (sum, mean, standard deviation, etc.).
 Builtin logarithmic storage support: you can convert your values to logarithms and
work with them without having to convert them back.
What we don’t

viktor
isn’t a linear algebra library – at least not yet. You can get it to multiply matrices, but it’s not optimized for that. If you need lightningfast matrix multiplication, you’d be better off using Multik, nd4j, or another linear algebra library.

viktor
doesn’t currently have outofthebox concurrency (like
nd4jhas), though it can be parallelized on the client side.

viktor
doesn’t do GPU computations. Many researchers use their laptops for work, while others have access to multicore servers and even computational clusters. What unites all these setups is that they all have little to no GPU capabilities.
viktoris intended for exactly those kinds of cases.
Distribution
sources are hosted on GitHub, together with a feature overview and some instructional examples.
Starting with version 1.1.0,
binaries (together with sources and Javadoc) are distributed via Maven Central, so it’s really easy to add to any Mavenlike JVM project:
<dependency> <groupId>org.jetbrains.bio</groupId> <artifactId>viktor</artifactId> <version>1.1.0</version> </dependency>
Gradle / Gradle Kotlin:
implementation("org.jetbrains.bio:viktor:1.1.0")
The older versions were published on Bintray (currently in the sunset phase) and can be downloaded manually from GitHub Releases.
Features
The main structure,
, was inspired by NumPy’s ndarray.
Inside, it’s a flat
(
) endowed with an
and two nelement integer arrays containing the ndimensional array’s
and
. This structure allows us to easily express rows, columns, and other slices of an
as other
instances.
For instance, a 2×3 array will be stored as a 6element
with
,
, and
. If we want to view the second column (the one indexed 1), we just create an array with the same
, but with
,
, and
. It is simplicity itself!
The
comes with a sizable set of arithmetic and mathematical operations. While the cheap arithmetic operations are performed in a loop, the more expensive mathematical ones are delegated to the Java Native Interface. Moreover, we make sure to utilize the SIMD (single instruction, multiple data) extension sets available on most modern CPUs. The performance gains depend on the operation, the array size, the JVM version, and so on, but can reach 900% even in realworld cases (as it turns out the JVM’s logarithm is pretty slow).
Another useful feature is the builtin support of logstored values. When working with probabilities, floatingpoint underflows are a frequent occurrence, since sometimes you have to multiply so many small numbers together that the result can no longer be expressed as a positive number and instead is rounded to zero, losing any utility. To overcome this, people frequently store not the probability itself, but rather its logarithm. Instead of multiplying the probabilities, they can then sum the logarithms. However, they may occasionally need to sum probabilities as well, and this operation is much less natural with logarithmic storage. So
provides a function named
, which does exactly that:
a logAddExp b = log(exp(a) + exp(b))
but in a way that prevents underflows. It is also possible to sum all the values in a logstored array with
. These operations are also SIMDized whenever possible, achieving even better performance.
binary distribution currently supports one CPU architecture (
) with two extensions (SSE2, AVX) on Windows, Linux, and macOS. We use TeamCity for the multiplatform build.
Why we needed it
This whole project started because we just wanted to train some mathematical models for our research. However, the training was too slow for our tastes, even after adding concurrency. We used a profiler, and the results surprised us: most of the time was spent calculating logarithms. Just logarithms, over and over again. We replaced the builtin logarithm with Apache Commons Math‘s FastMath logarithm, but it didn’t help. We tried some other existing libraries, but none of them had the exact features that we needed. So, naturally, we had to write our own library.
With that having been decided, we ran into some design issues. We wanted our code to be idiomatic, like in Python’s NumPy, but we also wanted blazingfast performance like what can be achieved with C++. Oh, and we also needed it in a JVMcompatible language, since our project was JVMbased. It took a little trialanderror, but eventually,
, a hybrid of C++, Kotlin, and Python, an idiomatic library with a native backend, was released in November 2019.
Incredible stories of (native) optimization
The path ad astra seems to always go per aspera. The following are a few trialanderror examples, which could be educational, amusing, or both.
At first, we delegated every operation to JNI + SIMD. Then we did the benchmarks, rejoiced at the result, and released the library. However, in our pride, we didn’t think to benchmark the arithmetic operations; we only did mathematics (exponent etc.) and reduction (sum etc.). When we later added the arithmetic benchmarks, we were surprised to see that
performed poorly compared to a naïve loop. A couple of JITWatch sessions later, we learned that even the ancient JDK 1.8 is capable of SIMDizing the most primitive patterns. Like, say, elementwise multiplication of two arrays. Only the JDK calls its native code with much less overhead. Therefore, we dropped the SIMD arithmetic and replaced it with naïve loop arithmetic, and suddenly, performance increased.
At first, we tried to reduce the amount of code by only writing the inplace operations (like
or
), and then defining the copying operations as copy + inplace. Therefore,
was defined as
a.copy().apply { this += b }
However, it turned out that copying is stupidly expensive, up to the point that copying takes the same amount of time as the actual calculation. In one of the benchmarks, the system spent 7 ms allocating the new array to hold the results, 10 ms copying the first argument over, and another 10 ms actually performing the addition. So we turned everything around and rewrote each method using a sourcedestination pattern (inplace methods write back to the source, while copying methods write to a separate destination array), and suddenly, performance increased. (This led to a nonnegligible improvement even for computationally expensive operations.)
At first, we wrote our code like this (simplified example):
val data: DoubleArray // ... operator fun plusAssign(other: F64Array) { for (i in 0 until size) { data[i] += other.data[i] } }
However, it turned out that even the final references are not extracted, and the JVM invokes
two times per loop iteration. We wrote instead:
operator fun plusAssign(other: F64Array) { val dst = data val src = other.data for (i in 0 until size) { dst[i] += src[i] } }
and suddenly, performance increased.
Benchmarks
We have run a series of benchmarks on three JDKs: Oracle JDK 1.8, Oracle JDK 15, and GraalVM 20.3 (which implements OpenJDK 11). Each benchmark measured the performance of an operation using builtin JVM features (baseline) and using
methods. Each benchmark was run for different array sizes, from 1000 to 10 million elements.
The following plots show the performance gain (or loss) of
over the baseline for different operation groups.
Unary operations include exponent, logarithm, and their specialized versions:
expm1 = exp(x) – 1
log1p = log(1 + x)
On the graph above, all the lines are well above 1, which means that the
implementations are much, much faster than their JVM counterparts. The effect is most pronounced on the older JDK 1.8 with performance gain of up to 15x, but is still significant on the more modern ones, too.
Binary operations include the usual arithmetic ones (addition and multiplication) and our specialized logstorage addition operation
logaddexp = log(exp(a) + exp(b))
The performance of the basic arithmetic operations is very close to the baseline. This is owing to the fact that even the ancient JDK 1.8 is able to SIMDize them natively. The logstorage addition naturally benefits from the exponent and logarithm speedup seen on the previous plot.
Reduction operations include the sum of all elements, the specialized logstorage sum
logsumexp = log(∑_{i}exp(x_{i}))
and the scalar product of two vectors (
). After the blazingfast unary operations, this might not seem like much, but the 5x speedup for
and 3x one for
are not insignificant either.
Summary
Our library provides significant speedup for various array operations on multiple platforms. It also offers builtin logstorage support. We’d love to have new users and we welcome your feedback (just email Aleksei Dievskii directly, or use GitHub Issues).