Performance issues with my Ktor web service (profiling?)

I’ve been working on a web service (now built on Ktor) at work for the past few years. It’s about 13k LOC and handles roughly 2-3 billion req/day; at peak I see it flooded with 60-120k req/sec. It’s hosted on DigitalOcean (which is also causing me a lot of grief with networking issues at this scale), with the main Ktor app spread across 170-250 auto-scaled 4 vCPU nodes.

My main issue: I’m running into some serious performance woes. I don’t want to keep buying 4 vCPU instances, but that’s the biggest size my app will reliably use. If I halve the instance count and double the vCPUs per instance, it just chokes badly: CPU usage stays moderately low while the load average goes through the roof, and our API latency skyrockets. If I treat it like a Node project and run two app replicas per 8 vCPU node, it does slightly better, but the overhead of two instances ruins the cost effectiveness.

I’m guessing some of the issues come from the non-async code behind some APIs. I also have a couple of components that use JDBC behind withContext(Dispatchers.IO) {...} blocks, and some java.io.File I/O is mixed in there, again on the IO dispatcher. We call a LOT of 3rd-party REST APIs; right now we’re using the Java 11 HttpClient in async mode, but we’re not tied to it if we benchmark it and find that it sucks. We primarily use jasync-postgresql and lettuce for Redis, and those have seemingly worked fine, with < 5ms latency at the 99.5th percentile up to 50k QPS.
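
For context, the blocking bits and the outbound HTTP calls look roughly like this. This is a simplified sketch with made-up names (lookupUser, fetchPartnerData); the real code has connection pooling, timeouts, error handling, and so on:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext
import kotlinx.coroutines.future.await
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import javax.sql.DataSource

class UserRepository(private val dataSource: DataSource) {
    // Blocking JDBC pushed onto the IO dispatcher so it doesn't stall the
    // default dispatcher that the Ktor request handlers run on.
    suspend fun lookupUser(id: Long): String? = withContext(Dispatchers.IO) {
        dataSource.connection.use { conn ->
            conn.prepareStatement("SELECT name FROM users WHERE id = ?").use { stmt ->
                stmt.setLong(1, id)
                stmt.executeQuery().use { rs -> if (rs.next()) rs.getString(1) else null }
            }
        }
    }
}

// Java 11 HttpClient in async mode: sendAsync returns a CompletableFuture,
// bridged into coroutines with await() from kotlinx-coroutines' future integration.
private val httpClient: HttpClient = HttpClient.newHttpClient()

suspend fun fetchPartnerData(url: String): String {
    val request = HttpRequest.newBuilder(URI.create(url)).GET().build()
    return httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
        .await()
        .body()
}
```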

What I want to know is: for anyone running a large, high-traffic Kotlin web app, how do you get it to scale properly per CPU core without spiking the load average?

Or more importantly, how the heck can I run a profiler on a coroutine-based app? I’ve tried the built-in IntelliJ profiler, as well as JProfiler, which claims it can deal with coroutines, and neither really tells me anything useful. I understand that coroutines get broken apart across the dispatchers’ backing thread pools, so it’s really difficult to take a trace and reconstruct it into an actual call tree the way you can with traditional thread-per-worker models. All the profilers I’ve tried lump every coroutine call under the backing pool’s threads in a seemingly random fashion, with no obvious hotspots sticking out at all.
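
Is the expected answer here something like the kotlinx-coroutines-debug probes, plus the -Dkotlinx.coroutines.debug JVM flag that tags thread names with the current coroutine? Rough sketch of what I mean below (not something I have running in prod, and I assume the probes are far too heavy to leave on at 60-120k req/sec) — or is there actual profiler tooling that understands suspension points?

```kotlin
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.debug.DebugProbes

@OptIn(ExperimentalCoroutinesApi::class)
fun main() {
    // From the kotlinx-coroutines-debug artifact: instruments coroutine
    // creation/suspension so live coroutines can be inspected later.
    DebugProbes.install()

    // ... start the Ktor server here ...

    // Later, e.g. from a diagnostics endpoint or a shutdown hook, dump all
    // live coroutines with their creation stack traces to stdout.
    DebugProbes.dumpCoroutines(System.out)
}
```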

At the moment I have zero visibility into what is actually eating CPU, what is sitting around waiting on data and driving the load average up without using CPU, and what to work on to improve scalability. I’m poking around in the dark, coding on best practices and experience to get this far, and I’m hitting a brick wall and feeling like I might need to port the whole thing to Rust.

I realize this is kind of a vague post, but is there any direction someone could point me in? Thanks for reading.
