I'm wrapping a terraform binary in a script, as a part of an enterprise solution. Therefore I need to take care of:
- file log capture (separately STDOUT, STDERR for some post-processing analytics)
- live log capture (this runs within the Jenkins job, again separate for STDOUT and STDERR)
- PID capture (as the process needs to run in the background, and in the next step I'll add traps for SIGTERM handling)
Currently, the core construct of the script looks like this:
#!/bin/bash
...
...
terraform "$@" > >(tee "${STDOUT_LOG}") 2> >(tee "${STDERR_LOG}" >&2) & TF_PID="$!"
wait "$TF_PID"
EXIT_CODE="$?"
...
wait
exit "$EXIT_CODE"
This script is called several hundred times in one container. We've noticed that it leaves zombie processes behind: the subshells within which the `tee` commands are executed.
Adding a general `wait` before exiting the script doesn't help; the shell doesn't wait for these child processes, so they are never reaped. I couldn't read much about the internals of process substitution, would you have a hint what might be going on here?
EDIT:
> ps aux --forest
11295 ? S 0:00 | \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
11376 ? Sl 0:01 | | \_ terraform init
11377 ? S 0:00 | | \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
11379 ? S 0:00 | | | \_ /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee -a /home/jenkins/workspace/build-1629/src/vnet-01/stdout.log
11378 ? S 0:00 | | \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
11380 ? S 0:00 | | \_ /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee -a /home/jenkins/workspace/build-1629/src/vnet-01/stderr.log
and after a few moments:
> ps aux
...
11377 ? Z 0:00 [terraform.sh] <defunct>
11378 ? Z 0:00 [terraform.sh] <defunct>
...
Asked Mar 11 at 13:15 by Bernard Halas; edited Mar 11 at 16:28.
2 Answers
You can skip process substitution, roll up your sleeves, and do it yourself. That way all of the processes are children of the current shell, so it can track their lifetimes and `wait` for each of them. Also, variables don't have to scream UPPERCASE. I think with `coproc` you could get away with one fifo less.
{
# set up the stdout and stderr fifos
stdout_fifo=$(mktemp -u)
stderr_fifo=$(mktemp -u)
mkfifo "$stdout_fifo" "$stderr_fifo"
trap 'rm "$stdout_fifo" "$stderr_fifo"' EXIT
}
{
# start it
terraform "$@" >"$stdout_fifo" 2>"$stderr_fifo" &
tf_pid=$!
tee "$STDOUT_LOG" <"$stdout_fifo" &
tee_stdout_pid=$!
# send the live copy of stderr to stderr, so the two streams stay separate
tee "$STDERR_LOG" <"$stderr_fifo" >&2 &
tee_stderr_pid=$!
}
{
# wait for it
wait "$tf_pid"
tf_exit_code="$?"
wait "$tee_stdout_pid"
wait "$tee_stderr_pid"
}
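For what it's worth, here is the same idea folded into a reusable function, as a minimal sketch rather than a drop-in replacement. The name `run_logged` and its argument order are my own invention, the fifos live in a `mktemp -d` directory instead of using `mktemp -u`, and the readers are started before the writer. I have not tried it against terraform itself.

```bash
#!/bin/bash
# Sketch only: run_logged is a hypothetical helper, not part of the answer above.
# Usage: run_logged STDOUT_LOG STDERR_LOG COMMAND [ARGS...]
run_logged() {
    local stdout_log=$1 stderr_log=$2
    shift 2

    local tmpdir rc=0
    tmpdir=$(mktemp -d) || return 1
    mkfifo "$tmpdir/out" "$tmpdir/err" || { rm -rf "$tmpdir"; return 1; }

    # Start both readers first; each blocks in open() until the writer appears.
    tee "$stdout_log" <"$tmpdir/out" &
    local tee_out_pid=$!
    tee "$stderr_log" <"$tmpdir/err" >&2 &
    local tee_err_pid=$!

    # The command is a direct child of this shell, and so are both tee processes.
    "$@" >"$tmpdir/out" 2>"$tmpdir/err" &
    local cmd_pid=$!

    # Collect the command's exit status, then reap both tee processes.
    wait "$cmd_pid" || rc=$?
    wait "$tee_out_pid" "$tee_err_pid"

    rm -rf "$tmpdir"
    return "$rc"
}
```

In the wrapper script this would be called as `run_logged "$STDOUT_LOG" "$STDERR_LOG" terraform "$@"`, followed by `exit "$?"`. Because every process is started by the same shell, nothing is left for PID 1 to clean up.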
"I couldn't read much about the internals of process substitution, would you have a hint what might be going on here?"
Your `ps` data shows that Bash is making the `tee` subshells children of the `terraform-1.4` command. I can see some reasons to do that, but perhaps more not to. In particular, I can imagine situations in which having children that it didn't know about would make a process misbehave.*
The subshells being children of the terraform command takes them out of the original shell's sphere of responsibility, even after the `terraform-1.4` parent process terminates. Thus, yes, it is to be expected that the parent shell cannot successfully `wait` for them. The `terraform-1.4` command doesn't know about these children so is unlikely to try to collect them, but even if it did try, they probably outlive it.
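If you want to see the parentage for yourself without terraform, something like the following sketch should do. It assumes Linux (it reads `/proc`), uses `sleep 1` as a stand-in for the real command, and the helper name `ppid_of` is made up for the illustration.

```bash
#!/bin/bash
# Sketch: find out which process is the parent of a process-substitution subshell.
# Assumes Linux /proc; ppid_of is a hypothetical helper.
ppid_of() {
    local key value
    while read -r key value; do
        if [[ $key == PPid: ]]; then
            echo "$value"
            return
        fi
    done <"/proc/$1/status"
}

report=$(mktemp)

# Same shape as the construct in the question: a backgrounded command whose
# stdout goes to a process substitution. The subshell records its own parent.
sleep 1 > >(ppid_of "$BASHPID" >"$report") &
cmd_pid=$!
wait "$cmd_pid"

procsub_ppid=$(<"$report")
rm -f "$report"

echo "script shell PID:          $$"
echo "background command PID:    $cmd_pid"
echo "procsub subshell's parent: $procsub_ppid"
```

With the behaviour visible in your `ps --forest` output, I would expect the last line to match the background command's PID rather than the script's.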
The issue, then, seems to be that whatever process in the container inherits responsibility for cleaning up zombies (PID 1, ordinarily) is not doing that job. Evidently this is a relatively well-known issue with containers. As I observed in comments, there is a longstanding issue against Kubernetes asking for a means of mitigation. That issue also suggests how you could (re)build your container image so that it is not susceptible to this problem: give it an initial process that does handle zombies (and signals) in the way that a Unix PID 1 is responsible for doing. I don't have experience with any of the specific minimal init programs mentioned there (`tini`, `dumb-init`), and certainly not with integrating them with Kubernetes, but that's the general direction you probably want to go.
* Unlike with people, where it's more likely to work the other way around. :-)
- […] `tee` subshells are not collected even after the shell that launched them terminates? Or to put it another way: please be more specific about what you observe, and how you observe it. – John Bollinger, commented Mar 11 at 13:28
- […] `terraform-1.4` binary (PID 11376), not the parent script (PID 11295). – Bernard Halas, commented Mar 11 at 15:17
- […] `kubelet`, I guess you are using Kubernetes. See this Kubernetes issue: github/kubernetes/kubernetes/issues/84210 for some discussion. – John Bollinger, commented Mar 11 at 17:26