
amazon web services - EC2 to S3 uploads fail randomly - Stack Overflow


I have a few hundred AWS Batch jobs that run on AWS EC2 on-demand instances. The jobs carry out some computation to generate two parquet files, A and B, and upload A and B to separate paths in the same bucket.

When I run these AWS Batch jobs, about 60-70% of them fail. Inspecting the logs, for some of these jobs file A uploads successfully and file B fails with the following exception:

botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:

Sometimes both files A and B fail to upload with the same exception.

The parquet files are each about 200 MB.

The other 30-40% of the jobs succeed and do not experience this network issue.

What could be the cause of this intermittent failure? How would one go about debugging this?

EDIT - I'll mark this closed. For anyone else running into this issue, this was due to a self-hosted NAT that was throttling the bandwidth. I had set it up on too small an instance (fck-nat) to handle the 100-odd jobs running at the same time.
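
A gateway VPC endpoint for S3 would also have kept these uploads off the NAT entirely. A minimal sketch of creating one with boto3, where the VPC ID, route table ID and region are placeholders:

import boto3

# Placeholder IDs and region; a gateway endpoint lets instances in private
# subnets reach S3 directly instead of going through the NAT instance.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)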

asked Feb 1 at 20:31 by AGS, edited Feb 10 at 14:59
  • Are you uploading from a custom boto3 app or awscli? What endpoint are you using? Does this traffic route via IGW or NAT or VPC Endpoint to S3? Which EC2 instance type? How many concurrent uploads to the same S3 bucket are you making? – jarmod Commented Feb 3 at 15:05

1 Answer


We'd need some code snippets to dig further. There are some similar questions with answers pointing at possible causes, including:

  • a networking issue / VPN
  • boto3/botocore configuration, where you can set things like connection and read timeouts and the retry mode, which might help (see the sketch just after this list)
  • Python multiprocessing issues (e.g. sharing one boto3 session or client across processes)
  • a number of other issues such as packet drops, the API taking too long to respond, etc.
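
On the boto3/botocore configuration point, this is roughly what tightening timeouts and retries looks like; the specific values here are placeholders rather than recommendations:

import boto3
from botocore.config import Config

# Placeholder values; tune for your workload.
cfg = Config(
    connect_timeout=10,    # seconds to establish the connection
    read_timeout=120,      # seconds to wait for a response
    retries={"max_attempts": 10, "mode": "adaptive"},
)
s3 = boto3.client("s3", config=cfg)
s3.upload_file("a.parquet", "my-bucket", "outputs/a.parquet")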

Questions to ask:

  • Are they all in the same VPC?
  • Is there any difference between the scripts?
  • Is there any data skew (where some output files are much larger than others)?
    • If you're doing some processing on input files that are all ~200 MB but the transformations create new data, those transforms might introduce skew in the final output sizes.
  • Are you sure they're all on-demand instances and not spot instances being reclaimed?
  • Lastly, are you using long-lived connections throughout? For example, do you have something like the following:
import boto3

s3 = boto3.client("s3")        # client created long before the upload
process_data_for_a_while()     # long-running computation
s3.upload_file("a.parquet", "my-bucket", "outputs/a.parquet")

If you do, then maybe the connection is too long-lived.
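
If that's the pattern, one option is to create the client right before the upload and use boto3's managed transfer settings for the ~200 MB files. A rough sketch, with the bucket and key names as placeholders:

import boto3
from boto3.s3.transfer import TransferConfig

def upload_result(path, bucket, key):
    # Create a fresh client right before the upload rather than
    # long before the compute step.
    s3 = boto3.client("s3")
    transfer_cfg = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,  # use multipart for ~200 MB files
        max_concurrency=4,
    )
    s3.upload_file(path, bucket, key, Config=transfer_cfg)

upload_result("a.parquet", "my-bucket", "outputs/a.parquet")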
