I'm using Vertex AI's online prediction endpoint with a custom container. It is set to a maximum of 4 replicas and a minimum of 1 (Vertex online endpoints require at least 1 anyway). My workload's inference is not instant: there is a lot of processing to do on each document before running inference, so a single request can take a long time (processing can take > 5 minutes on an n1-highcpu-16) - basically downloading PDFs, converting them to images, performing OCR with pytesseract, and then running inference.
To make this work, I spin up a background thread when a new instance is received and let that thread do the processing and inference (all the heavy lifting), while the main thread keeps listening for requests. The background thread updates Firestore with the predictions when it's done. I've also implemented a shutdown handler and keep track of pending requests:
def shutdown_handler(signal: int, frame: FrameType) -> None:
    """Gracefully shutdown app."""
    global waiting_requests
    logger.info(f"Signal received, safely shutting down - HOSTNAME: {HOSTNAME}")
    payload = {"text": f"Signal received - {signal}, safely shutting down. HOSTNAME: {HOSTNAME}, has {waiting_requests} pending requests, container ran for {time.time() - start_time} seconds"}
    call_slack_webhook(WEBHOOK_URL, payload)
    if frame:
        frame_info = {
            "function": frame.f_code.co_name,
            "file": frame.f_code.co_filename,
            "line": frame.f_lineno,
        }
        logger.info(f"Current function: {frame.f_code.co_name}")
        logger.info(f"Current file: {frame.f_code.co_filename}")
        logger.info(f"Line number: {frame.f_lineno}")
        payload = {"text": f"Frame info: {frame_info} for hostname: {HOSTNAME}"}
        call_slack_webhook(WEBHOOK_URL, payload)
    logger.info(f"Exiting process - HOSTNAME: {HOSTNAME}")
    sys.exit(0)
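For reference, the handler is registered in the server module with the standard signal module, roughly like this (a minimal sketch; the SIGINT line is only there for local Ctrl+C testing and isn't essential):

import signal

signal.signal(signal.SIGTERM, shutdown_handler)  # SIGTERM is what the platform sends on shutdown/scale-in
signal.signal(signal.SIGINT, shutdown_handler)   # optional: convenient for local Ctrl+C testing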
Scaling was set up when deploying to the endpoint as follows:
--autoscaling-metric-specs=cpu-usage=70
--max-replica-count=4
My problem is that, while a replica still has pending requests - when it is finishing inference or even mid-inference - it receives a SIGTERM and shuts down. How long each worker stays up varies:
Signal received - 15, safely shutting down. HOSTNAME: pgcvj, has 829 pending requests, container ran for 4675.025427341461 seconds
Signal received - 15, safely shutting down. HOSTNAME: w5mcj, has 83 pending requests, container ran for 1478.7322800159454 seconds
Signal received - 15, safely shutting down. HOSTNAME: n77jh, has 12 pending requests, container ran for 629.7684991359711 seconds
Why is this happening, and how can I prevent my container from shutting down? Background threads are being spawned as:
thread = Thread(
    target=inference_wrapper,
    args=(run_inference_single_document, record_id, document_id, image_dir),
    daemon=False,  # False so the process doesn't terminate while the thread is running
)
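The waiting_requests counter is maintained inside the wrapper, roughly like this (a sketch of the bookkeeping only - the lock is an assumption, and the real wrapper also writes the predictions to Firestore when done):

from threading import Lock

waiting_requests = 0
_counter_lock = Lock()  # assumption: guards the counter against concurrent request threads

def inference_wrapper(run_fn, record_id, document_id, image_dir):
    """Heavy lifting: download PDF, OCR, inference; updates Firestore when done."""
    global waiting_requests
    with _counter_lock:
        waiting_requests += 1
    try:
        run_fn(record_id, document_id, image_dir)
    finally:
        with _counter_lock:
            waiting_requests -= 1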
Dockerfile:
ENTRYPOINT ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "300", "--graceful-timeout", "300", "--keep-alive", "65", "server:app"]
Does the container shut down when its CPU usage drops or when it stops receiving prediction requests? Or is this related to the gunicorn workers timing out?
How could I debug this? All I'm seeing in the logs is that the shutdown handler is called and then, later, Worker Exiting.
1 Answer
If your application is still predicting when it receives the SIGTERM, the container is being terminated before it can complete the prediction.
To handle this, you can implement a SIGTERM handler in your application. When the handler receives SIGTERM, it should start failing the readiness probe. This prevents new requests from being routed to the container and gives it time to finish processing its current requests before shutting down.
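A minimal sketch of that pattern, assuming a Flask app whose /health route is configured as the container's health route (the route name and module-level flag here are illustrative, not from the question):

import signal
from types import FrameType
from typing import Optional

from flask import Flask

app = Flask(__name__)
shutting_down = False  # flipped once SIGTERM arrives

@app.route("/health")
def health():
    # Fail the health check after SIGTERM so no new requests are routed here.
    if shutting_down:
        return "shutting down", 503
    return "ok", 200

def handle_sigterm(signum: int, frame: Optional[FrameType]) -> None:
    global shutting_down
    shutting_down = True
    # ...finish or hand off in-flight work here, then exit...

signal.signal(signal.SIGTERM, handle_sigterm)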
For writing a component that cancels the underlying resources, you can refer to the public documentation, which includes sample code showing how to attach a SIGTERM handler.