I have a few hundred AWS Batch jobs that run on AWS EC2 on-demand instances. The jobs carry out some computation to generate two parquet files, A and B, and upload A and B to separate paths in the same bucket.
When I run these AWS Batch jobs, about 60-70% of them fail. Inspecting the logs, some of these jobs upload file A successfully while file B fails with the following exception:
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:
Sometimes both files A and B fail to upload with the same exception.
Both parquet files are about 200 MB each.
The other 30-40% of jobs, which do succeed, do not experience this network issue.
What could be the cause of this intermittent failure? How would one go about debugging this?
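For reference, each job's upload step is essentially plain boto3 calls along these lines (a simplified sketch with placeholder bucket/key names, not the actual code):

    import boto3

    # Each job computes two ~200 MB parquet files and uploads them to
    # separate prefixes in the same bucket (placeholder names).
    s3 = boto3.client("s3")
    s3.upload_file("A.parquet", "my-bucket", "prefix-a/A.parquet")
    s3.upload_file("B.parquet", "my-bucket", "prefix-b/B.parquet")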
EDIT - I'll mark this closed. For anyone else running into this issue: the cause was the self-hosted NAT throttling bandwidth. I had set up too small an instance (fck-nat), which couldn't handle the ~100 jobs running at the same time.
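For anyone wanting to verify a bottleneck like this: one way (a sketch, assuming the NAT is a single EC2 instance whose instance ID you know) is to pull the NAT instance's NetworkOut CloudWatch metric while the jobs run and see whether it plateaus at the instance's bandwidth limit:

    import boto3
    from datetime import datetime, timedelta, timezone

    # The instance ID below is a placeholder for the fck-nat instance.
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="NetworkOut",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Sum"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Sum"], "bytes out")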
Comment from jarmod: Are you uploading from a custom boto3 app or awscli? What endpoint are you using? Does this traffic route via IGW or NAT or VPC Endpoint to S3? Which EC2 instance type? How many concurrent uploads to the same S3 bucket are you making?
1 Answer
Will need some code snippets to dig further... There are some similar answers pointing to causes such as:
- a networking issue / VPN
- boto3/botocore settings you can tune, such as the connect and read timeouts and retry behaviour, which might help (see the sketch after this list)
- multiprocessing in Python issues
- a number of other issues involving things like packet drops, API calls taking too long, etc.
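As an illustration of the second point, here is a minimal sketch of tuning those botocore settings (the specific values are illustrative assumptions, not recommendations):

    import boto3
    from botocore.config import Config

    # Longer timeouts and adaptive retries give a slow or saturated network
    # path more room before botocore gives up on a connection.
    s3 = boto3.client(
        "s3",
        config=Config(
            connect_timeout=60,   # seconds to establish the connection
            read_timeout=120,     # seconds to wait for a response
            retries={"max_attempts": 10, "mode": "adaptive"},
        ),
    )

    s3.upload_file("A.parquet", "my-bucket", "prefix-a/A.parquet")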
Questions to ask:
- Are they all in the same VPC?
- Is there any difference between scripts?
- Is there any data skew (where some files are much larger than others)?
- If you're doing some processing on input files, and the input files are all 200 MB but the transformations create new data, those transforms might create skew in the final output, but that's speculation.
- Are you sure they're all on-demand and not being dropped as spot instances?
- Lastly, are you using long-lived connections throughout? I.e. do you have something like the following:

    cnxn = boto3.client("s3")            # client and its connection pool created up front
    process_data_for_a_while()           # long-running computation
    cnxn.upload_file(file, bucket, key)  # the pooled connection may be stale by now

If you do, then maybe the cnxn is too long-lived.
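If that turns out to be the problem, a simple thing to try (a sketch with placeholder names, assuming a plain boto3 upload) is to create the client right before the uploads so it doesn't sit idle through the long computation phase:

    import boto3

    def upload_results(path_a: str, path_b: str, bucket: str) -> None:
        # Create the client only after the computation has finished, so no
        # idle pooled connection is reused for the uploads.
        s3 = boto3.client("s3")
        s3.upload_file(path_a, bucket, f"prefix-a/{path_a}")
        s3.upload_file(path_b, bucket, f"prefix-b/{path_b}")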