bash - How to capture STDOUT, STDERR and process PID, without creating zombies - Stack Overflow

I'm wrapping a terraform binary in a script, as a part of an enterprise solution. Therefore I need to take care of:

  • file log capture (separately STDOUT, STDERR for some post-processing analytics)
  • live log capture (this runs within the Jenkins job, again separate for STDOUT and STDERR)
  • PID capture (as the process needs to run in the background, and in the next step I'll add traps for SIGTERM handling)

Currently, the core construct of the script looks like this:

#!/bin/bash
...
...
terraform "$@" > >(tee "${STDOUT_LOG}") 2> >(tee "${STDERR_LOG}" >&2) & TF_PID="$!"
wait "$TF_PID"
EXIT_CODE="$?"
...
wait
exit "$EXIT_CODE"

This script is called several hundred times in one container. We've noticed that it leaves zombie processes behind: the subshells in which the tee commands run.

Adding a general wait before exiting the script doesn't help, the shell won't wait for these child processes to be reaped. I couldn't read much about the internals of process substitution, would you have a hint what might be going on here?

EDIT:

> ps aux --forest
  11295 ?        S      0:00      |       \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
  11376 ?        Sl     0:01      |       |   \_ terraform init
  11377 ?        S      0:00      |       |       \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
  11379 ?        S      0:00      |       |       |   \_ /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee -a /home/jenkins/workspace/build-1629/src/vnet-01/stdout.log
  11378 ?        S      0:00      |       |       \_ /bin/bash /home/jenkins/workspace/build-1629@tmp/terraform.sh init
  11380 ?        S      0:00      |       |           \_ /usr/bin/coreutils --coreutils-prog-shebang=tee /usr/bin/tee -a /home/jenkins/workspace/build-1629/src/vnet-01/stderr.log

and after a few moments:

> ps aux
...
  11377 ?        Z      0:00 [terraform.sh] <defunct>
  11378 ?        Z      0:00 [terraform.sh] <defunct>
...

Asked Mar 11 at 13:15 by Bernard Halas; edited Mar 11 at 16:28.
  • Please clarify: are you saying that the tee subshells are not collected even after the shell that launched them terminates? Or to put it another way: please be more specific about what you observe, and how you observe it. – John Bollinger Commented Mar 11 at 13:28
  • Hello @JohnBollinger, added the required details. What surprised me is that the subshells (processes 11377 and 11378) are children of the terraform-1.4 binary (PID 11376), not the parent script (PID 11295). – Bernard Halas Commented Mar 11 at 15:17
  • Thanks for the additional info. I have some followups. Do the zombies eventually get cleaned up? Are enough accumulating to cause a practical problem? – John Bollinger Commented Mar 11 at 15:52
  • What's presumably happening is that the shell forks a child to run terraform. But before exec'ing terraform, the child forks its own children for the process substitutions, analogous to the way it opens files for redirection. – Barmar Commented Mar 11 at 16:02
  • If zombies accumulate then that is probably because of unhelpful container construction / configuration. Since you mention kubelet, I guess you are using Kubernetes. See this Kubernetes issue: github/kubernetes/kubernetes/issues/84210 for some discussion. – John Bollinger Commented Mar 11 at 17:26

2 Answers

You can skip process substitution, roll up your sleeves, and do it yourself. That way all processes will be children of the current shell, so the current shell can track all of their lifetimes. Also, variables don't have to scream UPPERCASE. I think with coproc you could get away with one FIFO fewer.

{
  # set up the stdout and stderr FIFOs
  stdout_fifo=$(mktemp -u)
  stderr_fifo=$(mktemp -u)
  mkfifo "$stdout_fifo" "$stderr_fifo"
  trap 'rm -f "$stdout_fifo" "$stderr_fifo"' EXIT
}
{
  # start terraform and both tee readers as direct children of this shell
  terraform "$@" >"$stdout_fifo" 2>"$stderr_fifo" &
  tf_pid=$!
  tee "$STDOUT_LOG" <"$stdout_fifo" &
  tee_stdout_pid=$!
  # send the live copy back to stderr so the two streams stay separate
  tee "$STDERR_LOG" <"$stderr_fifo" >&2 &
  tee_stderr_pid=$!
}
{
  # wait for terraform first, then reap both tee processes
  wait "$tf_pid"
  tf_exit_code="$?"
  wait "$tee_stdout_pid"
  wait "$tee_stderr_pid"
}
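
As a variation on the snippet above, here is a minimal sketch that wraps the same steps in a function. The name run_logged and its argument order are made up for illustration, and it assumes that creating the FIFOs inside a private mktemp -d directory is acceptable in place of mktemp -u. It starts both tee readers as direct children of the shell, keeps the live copy of stderr on stderr, and returns the wrapped command's exit status.

```shell
#!/bin/bash
# Hypothetical helper: run_logged STDOUT_LOG STDERR_LOG COMMAND [ARGS...]
run_logged() {
  local stdout_log=$1 stderr_log=$2
  shift 2

  # Private directory for the FIFOs (assumption: used instead of mktemp -u).
  local fifo_dir
  fifo_dir=$(mktemp -d) || return 1
  mkfifo "$fifo_dir/out" "$fifo_dir/err" || { rm -rf "$fifo_dir"; return 1; }

  # Readers: each tee writes its log file and passes the stream through.
  local tee_out_pid tee_err_pid cmd_pid status
  tee "$stdout_log" <"$fifo_dir/out" &
  tee_out_pid=$!
  tee "$stderr_log" <"$fifo_dir/err" >&2 &
  tee_err_pid=$!

  # Writer: the wrapped command, also a direct child of this shell.
  "$@" >"$fifo_dir/out" 2>"$fifo_dir/err" &
  cmd_pid=$!

  # Collect the command's status first, then reap both tee processes.
  wait "$cmd_pid"
  status=$?
  wait "$tee_out_pid" "$tee_err_pid"

  rm -rf "$fifo_dir"
  return "$status"
}
```

With this in place, the core of the wrapper would be something like run_logged "$STDOUT_LOG" "$STDERR_LOG" terraform "$@" followed by exit "$?". Note that the command's PID is then only known inside the function, so a SIGTERM trap would have to be set up there as well.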


"I couldn't read much about the internals of process substitution, would you have a hint what might be going on here?"

Your ps data shows that Bash is making the tee subshells children of the terraform-1.4 command. I can see some reasons to do that, but perhaps more reasons not to. In particular, I can imagine situations in which having children that it didn't know about would make a process misbehave.*

The subshells being children of the terraform command takes them out of the original shell's sphere of responsibility, even after the terraform-1.4 parent process terminates. Thus, yes, it is to be expected that the parent shell cannot successfully wait for them. The terraform-1.4 command doesn't know about these children so is unlikely to try to collect them, but even if it did try, they probably outlive it.

The issue, then, seems to be that whatever process in the container inherits responsibility for cleaning up zombies (PID 1, ordinarily) is not doing that job. Evidently this is a relatively well-known issue with containers. As I observed in the comments, there is a longstanding issue against Kubernetes asking for a means of mitigation. That issue also suggests how you could (re)build your container image so that it is not susceptible to this problem: give it an initial process that does handle zombies (and signals) in the way that a Unix PID 1 is responsible for doing. I don't have experience with any of the specific minimal init programs mentioned there (tini, dumb-init), and certainly not with integrating them with Kubernetes, but that's the general direction you probably want to go.
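
To check which process the zombies end up parented to, a quick diagnostic along these lines may help. It is only a sketch: list_zombies is a hypothetical helper name, and it assumes a ps that accepts -e and -o pid,ppid,stat,comm (the procps version does). If the PPID column shows 1 for the defunct entries, the container's PID 1 is the process that is failing to reap them.

```shell
#!/bin/bash
# Hypothetical helper: print the ps header plus every zombie process.
list_zombies() {
  # Columns are PID, PPID, STAT, COMMAND; a STAT starting with Z is a zombie.
  ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
}

list_zombies
```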


* Unlike with people, where it's more likely to work the other way around. :-)
