I'm using a Cloud Composer environment on GCP.
Last month, we tried to change our environment size from ENVIRONMENT_SIZE_MEDIUM to ENVIRONMENT_SIZE_SMALL (using Terraform).
After 45 minutes, the operation failed with a timeout.
We reverted the modification, and that operation timed out as well. Since then, Composer monitoring has reported the Airflow webserver as unhealthy.
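For reference, the failed operations are visible from the CLI (a sketch; LOCATION and OPERATION_ID are placeholders for my environment's region and the operation in question):

```shell
# List recent Composer operations and their states; the timed-out
# resize and revert both show up here.
gcloud composer operations list --locations=LOCATION

# Inspect one operation in detail (OPERATION_ID taken from the list above).
gcloud composer operations describe OPERATION_ID --location=LOCATION
```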
Nevertheless, the web UI still works and all DAGs run perfectly.
After some analysis, I can see in GKE that the workload "airflow-webserver" has 2 revisions: the previous one and the latest one.
- On the previous one, the health check responds correctly.
- On the latest one, the health check doesn't respond; the "agent" container logs
Health Check request failed: Get "http://localhost:8080/_ah/health": dial tcp 127.0.0.1:8080: connect: connection refused
and the pod restarts continuously (CrashLoopBackOff).
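This is roughly how I inspected the two revisions in the cluster (a sketch; CLUSTER_NAME, REGION, COMPOSER_NAMESPACE and POD_NAME are placeholders for my environment's values):

```shell
# Fetch credentials for the environment's GKE cluster.
gcloud container clusters get-credentials CLUSTER_NAME --region=REGION

# The two revisions appear as two ReplicaSets behind the Deployment.
kubectl get replicasets -n COMPOSER_NAMESPACE | grep airflow-webserver

# The crash-looping pod belongs to the newest ReplicaSet.
kubectl get pods -n COMPOSER_NAMESPACE | grep airflow-webserver

# Logs of the "agent" container show the failing health check.
kubectl logs POD_NAME -c agent -n COMPOSER_NAMESPACE
```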
The fact that the previous revision still works explains why the web server keeps working (it is not replaced because the new revision is not healthy).
The fact that the latest revision doesn't work explains why monitoring reports the webserver as KO. It also explains why the Composer operation doesn't finish correctly.
For now, every modification to the Cloud Composer configuration ends with a timeout and takes no effect.
Does anyone have a clue how to repair my environment?
Technical information:
This is an extract of the Composer environment description:
{
  "config": {
    ...
    "environmentSize": "ENVIRONMENT_SIZE_MEDIUM",
    ...
    "softwareConfig": {
      "airflowConfigOverrides": {
        "core-dag_concurrency": "2",
        "core-max_active_tasks_per_dag": "6",
        "core-test_connection": "Enabled",
        "secrets-backend": "airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend"
      },
      ...
      "imageVersion": "composer-2.9.4-airflow-2.9.3",
      "pypiPackages": {
        "apache-airflow-providers-microsoft-mssql": "==3.9.1",
        "apache-airflow-providers-sftp": "==4.11.1"
      }
    },
    "webServerNetworkAccessControl": {
      ...
    },
    "workloadsConfig": {
      "scheduler": {
        "count": 1,
        "cpu": 1.0,
        "memoryGb": 2.0,
        "storageGb": 1.0
      },
      "triggerer": {
        "count": 1,
        "cpu": 0.5,
        "memoryGb": 0.5
      },
      "webServer": {
        "cpu": 0.5,
        "memoryGb": 2.0,
        "storageGb": 1.0
      },
      "worker": {
        "cpu": 2.0,
        "maxCount": 4,
        "memoryGb": 12.0,
        "minCount": 1,
        "storageGb": 2.0
      }
    }
  },
  ...
  "state": "RUNNING",
  "storageConfig": {
    "bucket": "..."
  },
  ...
}
The last revision's YAML gives this information:
name: airflow-webserver
resources:
  limits:
    cpu: 700m
    ephemeral-storage: 819Mi
    memory: 718Mi
  requests:
    cpu: 700m
    ephemeral-storage: 819Mi
    memory: 718Mi
It looks like the YAML generation completely lost its mind and produced bad sizing data, so the container can't start properly (the environment configuration requests 2 GB of memory for the web server, but the revision only requests 718Mi).
The container logs show warnings, but no real errors (despite the log level):
[container log screenshot]
I tried to make a minor modification to the environment, but the operation again ended with a timeout.
I tried to modify the revision YAML directly to force the CPU and memory values, but the newly generated revision YAML shows values different from the ones I forced.
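For reference, the kind of edit I attempted looked like this (a sketch; COMPOSER_NAMESPACE is a placeholder, and the container name "airflow-webserver" is an assumption based on my cluster). My understanding is that values forced on the revision (ReplicaSet) itself get reverted, because the Deployment controller regenerates revisions from the Deployment spec, and Composer in turn manages the Deployment:

```shell
# Attempted fix (sketch): patch the resources on the Deployment so the next
# ReplicaSet inherits them. The default strategic merge patch merges the
# containers list by name, leaving the other containers untouched.
kubectl -n COMPOSER_NAMESPACE patch deployment airflow-webserver \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"airflow-webserver","resources":{"requests":{"cpu":"500m","memory":"2Gi"},"limits":{"cpu":"500m","memory":"2Gi"}}}]}}}}'
```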