I have a few hundred AWS Batch jobs that run on AWS EC2 on-demand instances. The jobs carry out some computation to generate two parquet files, A and B, and upload A and B to separate paths in the same bucket.
When I run these AWS Batch jobs, about 60-70% of them fail. Inspecting the logs, some of these jobs upload file A successfully while file B fails with the following exception:
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:
Sometimes both files A and B fail to upload with the same exception.
Both parquet files are about 200 MB each.
The other 30-40% of jobs, which do succeed, do not experience this network issue.
What could be the cause of this intermittent failure? How would one go about debugging this?
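For reference, each job's upload step is essentially plain boto3 calls along these lines (a simplified sketch with placeholder bucket/key names, not the actual code):

    import boto3

    # Each job computes two ~200 MB parquet files and uploads them to
    # separate prefixes in the same bucket (placeholder names).
    s3 = boto3.client("s3")
    s3.upload_file("A.parquet", "my-bucket", "prefix-a/A.parquet")
    s3.upload_file("B.parquet", "my-bucket", "prefix-b/B.parquet")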
EDIT - I'll mark this closed. For anyone else running into this issue: the cause was the self-hosted NAT throttling bandwidth. I had set up too small an instance (fck-nat), which couldn't handle the ~100 jobs running at the same time.
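For anyone wanting to verify a bottleneck like this: one way (a sketch, assuming the NAT is a single EC2 instance whose instance ID you know) is to pull the NAT instance's NetworkOut CloudWatch metric while the jobs run and see whether it plateaus at the instance's bandwidth limit:

    import boto3
    from datetime import datetime, timedelta, timezone

    # The instance ID below is a placeholder for the fck-nat instance.
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Sum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"], "bytes out")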
Comment from jarmod: Are you uploading from a custom boto3 app or awscli? What endpoint are you using? Does this traffic route via IGW or NAT or VPC Endpoint to S3? Which EC2 instance type? How many concurrent uploads to the same S3 bucket are you making?
1 Answer
Will need some code snippets to dig further... There are some similar answers pointing to causes such as:
- a networking issue / VPN
- boto3/botocore settings you can tune, such as the connect and read timeouts and retry behaviour, which might help (see the sketch after this list)
- multiprocessing in Python issues
- a number of other issues involving things like packet drops, API calls taking too long, etc.
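As an illustration of the second point, here is a minimal sketch of tuning those botocore settings (the specific values are illustrative assumptions, not recommendations):

    import boto3
    from botocore.config import Config

    # Longer timeouts and adaptive retries give a slow or saturated network
    # path more room before botocore gives up on a connection.
    s3 = boto3.client(
        "s3",
        config=Config(
            connect_timeout=60,   # seconds to establish the connection
            read_timeout=120,     # seconds to wait for a response
            retries={"max_attempts": 10, "mode": "adaptive"},
        ),
    )

    s3.upload_file("A.parquet", "my-bucket", "prefix-a/A.parquet")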
Questions to ask:
- Are they all in the same VPC?
- Is there any difference between scripts?
- Is there any data skew (where some files are much larger than others)?
- If you're doing some processing on input files, and the input files are all 200 MB but the transformations create new data, those transforms might create skew in the final output, but that's speculation.
- Are you sure they're all on-demand and not being dropped as spot instances?
- Lastly, are you using long-lived connections throughout? I.e. do you have something like the following:

    cnxn = boto3.client("s3")            # client and its connection pool created up front
    process_data_for_a_while()           # long-running computation
    cnxn.upload_file(file, bucket, key)  # the pooled connection may be stale by now

If you do, then maybe the cnxn is too long-lived.
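If that turns out to be the problem, a simple thing to try (a sketch with placeholder names, assuming a plain boto3 upload) is to create the client right before the uploads so it doesn't sit idle through the long computation phase:

    import boto3

    def upload_results(path_a: str, path_b: str, bucket: str) -> None:
        # Create the client only after the computation has finished, so no
        # idle pooled connection is reused for the uploads.
        s3 = boto3.client("s3")
        s3.upload_file(path_a, bucket, f"prefix-a/{path_a}")
        s3.upload_file(path_b, bucket, f"prefix-b/{path_b}")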