Context
A microservice built on the FastAPI framework, with a SocketIO server wrapped as an ASGI application and run on the Uvicorn web server.
The service has several socket event handlers for a single feature, plus some supporting RESTful APIs (which are rarely used).
- There are SocketIO event handlers that invoke async tasks containing await calls, so that a single worker can orchestrate all events sent to the service.
- There is a keep-alive async function that emits a keep-alive message every 15 seconds to each connected sid, so that idle connections are not disconnected automatically. This is needed because some events invoke long-running tasks (up to 15 minutes) and would otherwise not be able to emit messages once processing completes; a rough sketch of this loop is shown below.
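The loop looks roughly like this (the event name, payload and sid tracking are illustrative, not the exact code):

import asyncio
import socketio

sio = socketio.AsyncServer(async_mode="asgi")
connected_sids: set[str] = set()

@sio.event
async def connect(sid, environ):
    connected_sids.add(sid)

@sio.event
async def disconnect(sid):
    connected_sids.discard(sid)

async def keep_alive():
    # Emit a heartbeat to every connected sid every 15 seconds so idle
    # connections are not dropped while a long-running task is still in progress.
    while True:
        for sid in list(connected_sids):
            await sio.emit("keep_alive", {"status": "alive"}, to=sid)
        await asyncio.sleep(15)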
We run it on Kubernetes with memory (request 8 GB, limit 16 GB) and CPU (request 800m, limit 4). The event loop used is asyncio.
Uvicorn is configured with port=8000, workers=1, limit_max_requests=1000, limit_concurrency=100, timeout_graceful_shutdown=30:
uvicorn.run(socket_app, host="0.0.0.0", port=8000, workers=1, limit_max_requests=1000, limit_concurrency=100, timeout_graceful_shutdown=30)
Kubernetes has a readiness probe that curls the health check API, with an initial delay of 60s, a timeout of 30s, a period of 30s, a success threshold of 1 and a failure threshold of 4:
Readiness: exec [curl localhost:8000/healthz] delay=60s timeout=30s period=30s #success=1 #failure=4
Problem
As load on the service increases, the readiness probe fails and Kubernetes marks the pods as unhealthy, so clients connected via SocketIO can no longer send or receive messages for the feature's events. Further investigation showed that TCP connections were stuck in CLOSE_WAIT, increasing linearly until the pod could no longer respond to existing connections. Most of these CLOSE_WAIT connections are between local address 127.0.0.1:8000 and foreign address 127.0.0.1:xxxxx, with no PID attached. There are also some connections in FIN_WAIT2 between 127.0.0.1:xxxxx and 127.0.0.1:8000, and some in CLOSE_WAIT between PodIP:8000 and PodIP:xxxxx of a different service; these could be API calls between services. My understanding is that the application (probably Uvicorn) has to close these connections, since the client side has already sent its FIN packets.
What configurations or implementation changes can I add, for both Uvicorn and SocketIO, to ensure these connections are handled properly? Or is there another solution I should look into to make the service function properly?
Edit1: Adding the libraries used.
fastapi==0.112.0
uvicorn==0.30.5
python-socketio==5.11.3
asyncio==3.4.3
1 Answer
Below is an approach that has helped us resolve similar issues. The key is to ensure that both Uvicorn and SocketIO properly detect and clean up idle or disconnected clients, especially when long-running tasks might block the event loop.
Modify Uvicorn’s connection settings
If idle HTTP connections aren’t needed (e.g. for your health check or non-streaming endpoints), you can shorten the keep-alive period so that the server closes idle connections more quickly. Also, Uvicorn supports multiple HTTP protocol implementations (e.g. h11 and httptools). Depending on the version and your workload, switching or upgrading might affect how FIN packets are handled.

Tip: Try explicitly setting the HTTP protocol if you suspect differences in connection handling.
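As an illustration, both settings can be passed straight to uvicorn.run alongside your existing options (the values below are examples, not recommendations; socket_app is the ASGI app from your question):

import uvicorn

uvicorn.run(
    socket_app,
    host="0.0.0.0",
    port=8000,
    workers=1,
    limit_max_requests=1000,
    limit_concurrency=100,
    timeout_graceful_shutdown=30,
    timeout_keep_alive=5,   # close idle keep-alive connections after 5 seconds
    http="httptools",       # or "h11"; pin the HTTP protocol implementation explicitly
)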
Configure SocketIO
SocketIO uses heartbeat messages (ping/pong) to check that the connection is still alive. Make sure these intervals are configured aggressively enough so that stale or half-closed connections are detected quickly:
import socketio

sio = socketio.AsyncServer(
    ping_interval=10,  # seconds between pings
    ping_timeout=5,    # wait time for a pong before disconnecting
)
This ensures that even if the underlying TCP connection lingers in CLOSE_WAIT for a bit, your application isn’t holding onto extra resources.
Offload long-running tasks
When you have tasks that may run for up to 15 minutes, they can block the event loop and delay the processing of disconnects and other events. Consider offloading them, either as background tasks via asyncio.create_task() or to an external task queue.
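As a rough sketch (the event names and heavy_computation below are placeholders, not from your code), the handler can schedule the work as a background task and emit the result when it finishes, keeping the event loop free for pings, disconnects and other events:

import asyncio
import socketio

sio = socketio.AsyncServer(async_mode="asgi")

def heavy_computation(data):
    ...  # placeholder for the blocking work that can take up to 15 minutes

async def process_job(sid, data):
    # run_in_executor keeps CPU-bound or blocking work off the event loop
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, heavy_computation, data)
    await sio.emit("job_done", {"result": result}, to=sid)

@sio.event
async def start_job(sid, data):
    asyncio.create_task(process_job(sid, data))  # don't await the long task here
    return {"status": "accepted"}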
Adjust the health check endpoint and readiness probe
Your Kubernetes readiness probe is calling /healthz. Ensure that this endpoint does not maintain keep-alive connections by sending a Connection: close header:

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/healthz")
async def healthz():
    response = JSONResponse({"status": "ok"})
    response.headers["Connection"] = "close"  # force the connection to close
    return response
If possible, consider serving health checks on a dedicated port or endpoint so that they do not clash with the SocketIO traffic.
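One way to do that (the second port and the app split here are assumptions, not part of your current setup) is to serve a minimal health-only app on its own port within the same process and point the readiness probe at that port instead:

import asyncio
import socketio
import uvicorn
from fastapi import FastAPI

api = FastAPI()
sio = socketio.AsyncServer(async_mode="asgi")
socket_app = socketio.ASGIApp(sio, other_asgi_app=api)

# Tiny health-only app on a separate port so probe traffic never competes
# with SocketIO connections on port 8000.
health_app = FastAPI()

@health_app.get("/healthz")
async def healthz():
    return {"status": "ok"}

async def main():
    servers = [
        uvicorn.Server(uvicorn.Config(socket_app, host="0.0.0.0", port=8000)),
        uvicorn.Server(uvicorn.Config(health_app, host="0.0.0.0", port=8001)),
    ]
    await asyncio.gather(*(s.serve() for s in servers))

if __name__ == "__main__":
    asyncio.run(main())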
By making these changes, you ensure that once a client sends a FIN packet, your server promptly completes the teardown, preventing the gradual accumulation of half-closed sockets that eventually overwhelms your pod.