
Cloud composer 2 is not responding after environment size changed - Stack Overflow


I'm using a Cloud Composer environment on GCP.

Last month, we tried to change our environment size from ENVIRONMENT_SIZE_MEDIUM to ENVIRONMENT_SIZE_SMALL (using Terraform).
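For reference, the Terraform change was essentially just flipping the size enum (resource name, region, and the rest of the config are simplified placeholders here, not our real values):

```hcl
resource "google_composer_environment" "main" {
  name   = "my-composer-env"   # placeholder
  region = "europe-west1"      # placeholder

  config {
    # changed from "ENVIRONMENT_SIZE_MEDIUM"
    environment_size = "ENVIRONMENT_SIZE_SMALL"

    # ... rest of the config (software_config, workloads_config, ...) unchanged
  }
}
```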

After 45 minutes, the operation failed with a timeout.

We reverted this modification, and the revert operation timed out as well. Since then, Composer monitoring has indicated that the Airflow webserver is not healthy.

Nevertheless, the web UI still works and all DAGs run perfectly.

After some analysis, I can see in GKE that the workload "airflow-webserver" has two revisions: the previous one and the latest one.

  • On the previous revision, the health check responds correctly.
  • On the latest revision, the health check doesn't respond: the "agent" container logs `Health Check request failed: Get "http://localhost:8080/_ah/health": dial tcp 127.0.0.1:8080: connect: connection refused`, so the container restarts continuously (CrashLoopBackOff).
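For anyone who wants to reproduce this analysis, these are roughly the commands I used (the namespace, ReplicaSet names, and the `app=airflow-webserver` label are placeholders/assumptions; check what your cluster actually uses):

```
# List the ReplicaSets (revisions) behind the webserver Deployment
kubectl get replicaset -n <composer-namespace> -l app=airflow-webserver

# Compare the resource requests/limits of the two revisions
kubectl get replicaset <old-rs-name> -n <composer-namespace> -o yaml | grep -A 6 resources
kubectl get replicaset <new-rs-name> -n <composer-namespace> -o yaml | grep -A 6 resources

# Watch the pods of the new revision crash-loop
kubectl get pods -n <composer-namespace> -l app=airflow-webserver
```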

The fact that the previous revision still works explains why the web server keeps working: it is never replaced, because the new revision never becomes healthy.

The fact that the latest revision doesn't work explains why monitoring reports the webserver as KO. It also explains why the Composer operation doesn't finish correctly.

For now, every modification to the Cloud Composer configuration ends with a timeout and has no effect.
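The failed operations are also visible from the CLI (environment location and operation ID are placeholders):

```
# List recent Composer operations and their states
gcloud composer operations list --locations=<location>

# Inspect the error detail of one failed operation
gcloud composer operations describe <operation-id> --location=<location>
```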

Does anyone have a clue how to repair my environment?

Technical information:

This is an extract of the Composer Environment description:

{
    "config": {
      ...
      "environmentSize": "ENVIRONMENT_SIZE_MEDIUM",
      ...
      "softwareConfig": {
        "airflowConfigOverrides": {
          "core-dag_concurrency": "2",
          "core-max_active_tasks_per_dag": "6",
          "core-test_connection": "Enabled",
          "secrets-backend": "airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend"
        },
        ...
        "imageVersion": "composer-2.9.4-airflow-2.9.3",
        "pypiPackages": {
          "apache-airflow-providers-microsoft-mssql": "==3.9.1",
          "apache-airflow-providers-sftp": "==4.11.1"
        }
      },
      "webServerNetworkAccessControl": {
         ...
      },
      "workloadsConfig": {
        "scheduler": {
          "count": 1,
          "cpu": 1.0,
          "memoryGb": 2.0,
          "storageGb": 1.0
        },
        "triggerer": {
          "count": 1,
          "cpu": 0.5,
          "memoryGb": 0.5
        },
        "webServer": {
          "cpu": 0.5,
          "memoryGb": 2.0,
          "storageGb": 1.0
        },
        "worker": {
          "cpu": 2.0,
          "maxCount": 4,
          "memoryGb": 12.0,
          "minCount": 1,
          "storageGb": 2.0
        }
      }
    },
    ...
    "state": "RUNNING",
    "storageConfig": {
      "bucket": "..."
    },
    ...
  }

The last revision Yaml gives this information:

name: airflow-webserver
resources:
  limits:
    cpu: 700m
    ephemeral-storage: 819Mi
    memory: 718Mi
  requests:
    cpu: 700m
    ephemeral-storage: 819Mi
    memory: 718Mi

It looks like the YAML generation went completely haywire and produced wrong sizing values (the environment is configured with 0.5 CPU / 2 GB memory for the webserver, but the generated revision requests 700m CPU / 718Mi memory), so the container can't start properly.

The container logs show warnings, but no real errors (despite the log level).

[screenshot: container log]

I tried making a minor modification to the environment, but the operation ended with a timeout.

I tried editing the revision YAML directly to force the CPU and memory values, but the next generated revision shows values different from the ones I forced.
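For completeness, this is roughly how I tried to force the values; it's only a sketch (namespace is a placeholder, and the patch path assumes the webserver container is at index 0), and in any case GKE immediately reconciles the Deployment back to whatever the Composer control plane generates:

```
kubectl -n <composer-namespace> patch deployment airflow-webserver --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/memory","value":"2Gi"}]'
```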
