Researching strange issue with weird delays in the app took me to the line where logging is called. At that line data received from communication channel are simply dumped to the log on trace level (they are text, or rather xml). Delay occurs when data to be dumped are of considerable size (unintentionally of course) - e.g. 100+ kilobytes. Delay is surprisingly of precisely 120-125 seconds and it seemingly do not depend on exact data size (e.g. 100kb or 300kb do not matter).
It uses zap logger under the hood and I found it simply stuck on underlying (*os.File) Write
method.
Killing the app with kill -3 <pid>
we collect stacktrace containing the following (I believe it is go v1.22.5
to match line numbers):
goroutine 660 gp=0xc000603180 m=nil [IO wait]:
runtime.gopark(0xc0002be1a0?, 0x10101139100?, 0x80?, 0x92?, 0xb?)
/usr/lib/go/src/runtime/proc.go:402 +0xce fp=0xc0006515e0 sp=0xc0006515c0 pc=0x44358e
runtimepollblock(0x48cf3b?, 0x40b146?, 0x0?)
/usr/lib/go/src/runtime/netpoll.go:573 +0xf7 fp=0xc000651618 sp=0xc0006515e0 pc=0x43c1d7
internal/poll.runtime_pollWait(0x7bca99ea4d48, 0x77)
/usr/lib/go/src/runtime/netpoll.go:345 +0x85 fp=0xc000651638 sp=0xc000651618 pc=0x4714e5
internal/poll.(*pollDesc).wait(0xc000059320?, 0xc000dde000?, 0x1)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000651660 sp=0xc000651638 pc=0x4ef447
internal/poll.(*pollDesc).waitWrite(...)
/usr/lib/go/src/internal/poll/fd_poll_runtime.go:93
internal/poll.(*FD).Write(0xc000059320, {0xc000dbe000, 0x351d7, 0x36000})
/usr/lib/go/src/internal/poll/fd_unix.go:388 +0x2d9 fp=0xc000651710 sp=0xc000651660 pc=0x4f27d9
os.(*File).write(...)
/usr/lib/go/src/os/file_posix.go:46
os.(*File).Write(0xc00078e210, {0xc000dbe000?, 0x351d7, 0xc0003c6a40?})
/usr/lib/go/src/os/file.go:189 +0x51 fp=0xc000651770 sp=0xc000651710 pc=0x4fc031
go.uber/zap/zapcore.(*lockedWriteSyncer).Write(0xc0000ea0a8, {0xc000dbe000?, 0x2b06a37fd?, 0x200a460?})
/home/rodion/my/oam-agent/out/.cache/go-pkg/go.uber/[email protected]/zapcore/write_syncer.go:66 +0x6c fp=0xc0006517b8 sp=0xc000651770 pc=0xe1956c
go.uber/zap/zapcore.(*ioCore).Write(0xc00048f260, {0xfc, {0xc1e40920854e5dcd, 0x2b06a37fd, 0x200a460}, {0x0, 0x0}, {0xc000d88000, 0x351b2}, {0x0, ...}, ...}, ...)
/home/rodion/my/oam-agent/out/.cache/go-pkg/go.uber/[email protected]/zapcore/core.go:99 +0xb5 fp=0xc000651888 sp=0xc0006517b8 pc=0xe0c975
go.uber/zap/zapcore.(*CheckedEntry).Write(0xc0009a00d0, {0x0, 0x0, 0x0})
/home/rodion/my/oam-agent/out/.cache/go-pkg/go.uber/[email protected]/zapcore/entry.go:253 +0x11c fp=0xc000651a18 sp=0xc000651888 pc=0xe0ebdc
I don't see or understand the reason this netpoll
is waiting in this situation and how this could be related to the data size passed to Write
. Thus any explanations, hints and advice on further debugging is welcome. Not sure if it matters, but the app is executed in docker (64 bit amd linux).
It is some custom built linux I believe so there is a small chance something is unusual or buggy about it - in which case it would be good to think of some way to figure this out too.