I'm currently facing a very troublesome native memory leak in a Java application running on Java 1.8.0_302. The leak only appears in my production environment; I haven't been able to reproduce it anywhere else, even with the same JAR, configuration, Java version, and so on.
For confidentiality reasons (and because of the above), I unfortunately cannot share the JAR itself or a minimal reproducible example, but here are some details on the problem and what I've tried so far:
- The app runs on a Windows server, registered as a Windows service.
- The process grows by around 400 MB per day. As far as I can tell the growth is unbounded; it had reached roughly 6 GB before I had to restart it.
- The heap is capped at 512 MB, a limit it essentially never reaches; usage hovers around 300 MB on average.
- I have verified through JConsole that the non-heap categories listed there (Metaspace, Compressed Class Space, etc.) aren't problematic either; the entire non-heap category tops out at around 150 MB (the first sketch after this list shows an equivalent programmatic check).
- Triggering a GC doesn't release any of the surplus memory.
- I have enabled NMT in summary mode and saw that the problematic category is the one labeled "Internal"; it is the only category that grows at a constant rate (via malloc). The second sketch after this list shows how that category can be sampled over time.
- I wanted to dig deeper with NMT detail mode, but it doesn't show the function/class names making the allocations, so it's impossible to tell which exact class or method is reserving the space.
- The process's thread count hovers around 40, with a maximum of 45.
- I've taken a thread dump; none of the threads appear to hold an unusual amount of memory.
- While the app uses Netty, I don't believe it is the cause: after enabling Netty's leak detection (roughly as in the third sketch after this list) and checking all ByteBuf instances in a heap dump I made, I couldn't find anything pointing to Netty as the source.
- Using VisualVM, I've checked the capacity of every DirectByteBuffer, ByteBuf, and similar instance in that heap dump, but none of them seem to be the cause (the fourth sketch after this list shows a related cross-check via the JVM's buffer pool MXBeans).
- A couple of instances of the same app run on the same production server with different configurations, and they don't suffer from the leak. As far as I can tell, the only new dependency in the problematic instance is version 6.3.0 of the com.microsoft.graph package.
- The app does use a custom classloader and some JNI libraries/calls, but the exact same classloader and JNI code are used by the other instances without issues, so I don't think they can be the source either.
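For reference, the heap/non-heap numbers above can be confirmed without JConsole. This is just a minimal sketch using the standard MemoryMXBean, not code from the app itself:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Minimal sketch: prints the same heap / non-heap totals that JConsole displays.
public class JvmMemorySnapshot {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.println("Heap:     " + mem.getHeapMemoryUsage());    // ~300 MB used, 512 MB max in my case
        System.out.println("Non-heap: " + mem.getNonHeapMemoryUsage()); // ~150 MB used at most in my case
    }
}
```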
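To watch the "Internal" category over time without attaching anything heavier to the production process, something along these lines can log the relevant NMT line periodically. It's only a sketch: it assumes the JVM was started with -XX:NativeMemoryTracking=summary, that jcmd is on the PATH, and the PID and polling interval are placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Sketch: polls "jcmd <pid> VM.native_memory summary" and logs the "Internal" line,
// so the growth rate of that category can be tracked over days.
public class NmtInternalPoller {
    public static void main(String[] args) throws Exception {
        String pid = args.length > 0 ? args[0] : "1234"; // placeholder: PID of the leaking JVM
        while (true) {
            Process p = new ProcessBuilder("jcmd", pid, "VM.native_memory", "summary").start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                r.lines()
                 .filter(line -> line.contains("Internal"))
                 .forEach(line -> System.out.println(System.currentTimeMillis() + " " + line.trim()));
            }
            p.waitFor();
            Thread.sleep(60L * 60L * 1000L); // sample once per hour
        }
    }
}
```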
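The Netty leak detection mentioned above can be turned on either via the io.netty.leakDetection.level system property or programmatically; a minimal sketch of the programmatic variant:

```java
import io.netty.util.ResourceLeakDetector;

// Sketch: force Netty's most aggressive leak tracking (tracks every ByteBuf).
// Equivalent to starting the JVM with -Dio.netty.leakDetection.level=paranoid.
public class EnableNettyLeakDetection {
    public static void main(String[] args) {
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
        // ... bootstrap the Netty application here; un-released buffers are reported
        // in the logs with a "LEAK:" prefix once they are garbage collected.
    }
}
```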
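And as a cross-check on the direct-buffer numbers from the heap dump, the JVM's built-in buffer pool MXBeans report how much off-heap memory NIO buffers have actually reserved, independently of what VisualVM shows per instance. Another small sketch:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Sketch: prints the "direct" and "mapped" NIO buffer pools (count, used, capacity).
public class BufferPoolCheck {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d, used=%d bytes, capacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

If the "direct" pool stays flat while the process keeps growing, that would point away from NIO buffers entirely, which matches what I'm seeing so far.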
Ordinarily I'd disable certain parts of the app to narrow down the source by process of elimination, but I can't afford to take the app, or parts of it, down for prolonged periods of time.
I've reached a point where I no longer know how to diagnose this further, or what else could even be included in the "Internal" category. Any insights into possible causes, or avenues to explore for diagnosing the issue, would be greatly appreciated. I'll make sure to update this post regularly with any new findings.