I am creating a dataset on AWS S3 for their Open Data program.
I fetch audio files that are already stored on S3, segment them into smaller audio chunks, and put those chunks back on S3. The problem occurs when I PUT the segments (if I comment out the upload, no error is raised).
The S3 bucket is in a US region.
To test whether it was a connectivity problem on my end, I tried:
- SageMaker free lab, in a US region: it hung without an error, and after 4 hours there was no progress
- Google Colab, in a US region: same error, but they temporarily restricted my resources because of the data volume, so I cannot try again
- local environment, in an EU region: it raises TimeoutError, no progress
Can you please help me avoid the error and, if possible, speed up the operations? I can only use basic S3 services, not Lambda or other AWS services, to stay within the allowed budget.
Below is what I tried:
import io

import boto3
import soundfile

# boto3 client for uploading (signed requests)
s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name=REGION_NAME
)

%%time
import functools
cache_audio = {}
for o, (lb, up) in enumerate(batches[6:]):
    for ix, row in annotated_segments.loc[lb:up].iterrows():
        # clear cache to keep memory safe
        if len(cache_audio) > 5:
            cache_audio.clear()
        # audio props
        file_name = row['File name']
        # path props
        file_folder = row['File folder']
        # segment props
        segment_name = row['segment_name']
        start = row['voice_start']
        end = row['voice_end']
        # read from cache; the sample rate is cached with the audio so it
        # cannot go stale when the loop switches between files
        if file_name not in cache_audio:
            cache_audio[file_name] = fetch_audio(row)
        audio, rate = cache_audio[file_name]
        # store segment on S3
        audio_segment = audio[start:end]
        try:
            s3_path = f"data/annotated_segments/{file_folder}/{file_name}/{segment_name}"
            # initialise the binary file
            file_obj = io.BytesIO()
            # write the audio segment
            # https://python-soundfile.readthedocs.io/en/latest/index.html#soundfile.write
            soundfile.write(file_obj, audio_segment, samplerate=rate, format='WAV')  # norm=False for raw data
            # reset the file pointer to the beginning
            file_obj.seek(0)
            # put annotated segments in S3
            put_audio_to_s3(file_obj, s3_path)
        except Exception as e:
            print(f"Error uploading file: {e}. File name: {file_name}. Batch: {lb} - {up}")
    print(f'Success! Completed {o}-th batch: {lb} - {up}')
Error raised after a while:
---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/urllib3/response.py:754, in HTTPResponse._error_catcher(self)
    753 try:
--> 754     yield
    756 except SocketTimeout as e:
    757     # FIXME: Ideally we'd like to include the url in the ReadTimeoutError but
    758     # there is yet no clean way to get at it from this context.

File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/urllib3/response.py:879, in HTTPResponse._raw_read(self, amt, read1)
    878 with self._error_catcher():
--> 879     data = self._fp_read(amt, read1=read1) if not fp_closed else b""
    880     if amt is not None and amt != 0 and not data:
    881         # Platform-specific: Buggy versions of Python.
    882         # Close the connection when no data is returned
   (...)
    887         # not properly close the connection in all cases. There is
    888         # no harm in redundantly calling close.

File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/urllib3/response.py:862, in HTTPResponse._fp_read(self, amt, read1)
    860 else:
    861     # StringIO doesn't like amt=None
--> 862     return self._fp.read(amt) if amt is not None else self._fp.read()

File ~/miniconda3/envs/fruitbats/lib/python3.10/http/client.py:482, in HTTPResponse.read(self, amt)
    481 try:
--> 482     s = self._safe_read(self.length)
    483 except IncompleteRead:

File ~/miniconda3/envs/fruitbats/lib/python3.10/http/client.py:631, in HTTPResponse._safe_read(self, amt)
    625 """Read the number of bytes requested.
    626
    627 This function should be used when <amt> bytes "should" be present for
    628 reading. If the bytes are truly not available (due to EOF), then the
    629 IncompleteRead exception can be used to detect the problem.
    630 """
--> 631 data = self.fp.read(amt)
    632 if len(data) < amt:

File ~/miniconda3/envs/fruitbats/lib/python3.10/socket.py:717, in SocketIO.readinto(self, b)
    716 try:
--> 717     return self._sock.recv_into(b)
    718 except timeout:

File ~/miniconda3/envs/fruitbats/lib/python3.10/ssl.py:1307, in SSLSocket.recv_into(self, buffer, nbytes, flags)
   1304     raise ValueError(
   1305       "non-zero flags not allowed in calls to recv_into() on %s" %
   1306       self.__class__)
-> 1307     return self.read(nbytes, buffer)
   1308 else:

File ~/miniconda3/envs/fruitbats/lib/python3.10/ssl.py:1163, in SSLSocket.read(self, len, buffer)
   1162 if buffer is not None:
-> 1163     return self._sslobj.read(len, buffer)
   1164 else:

TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

ReadTimeoutError                          Traceback (most recent call last)
File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/botocore/response.py:99, in StreamingBody.read(self, amt)
     98 try:
---> 99     chunk = self._raw_stream.read(amt)
    100 except URLLib3ReadTimeoutError as e:
    101     # TODO: the url will be None as urllib3 isn't setting it yet

File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/urllib3/response.py:955, in HTTPResponse.read(self, amt, decode_content, cache_content)
    953     return self._decoded_buffer.get(amt)
--> 955 data = self._raw_read(amt)
    957 flush_decoder = amt is None or (amt != 0 and not data)

File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/urllib3/response.py:878, in HTTPResponse._raw_read(self, amt, read1)
    876 fp_closed = getattr(self._fp, "closed", False)
--> 878 with self._error_catcher():
    879     data = self._fp_read(amt, read1=read1) if not fp_closed else b""

File ~/miniconda3/envs/fruitbats/lib/python3.10/contextlib.py:153, in _GeneratorContextManager.__exit__(self, typ, value, traceback)
    152 try:
--> 153     self.gen.throw(typ, value, traceback)
    154 except StopIteration as exc:
    155     # Suppress StopIteration *unless* it's the same exception that
    156     # was passed to throw(). This prevents a StopIteration
    157     # raised inside the "with" statement from being suppressed.

File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/urllib3/response.py:759, in HTTPResponse._error_catcher(self)
    756 except SocketTimeout as e:
    757     # FIXME: Ideally we'd like to include the url in the ReadTimeoutError but
    758     # there is yet no clean way to get at it from this context.
--> 759     raise ReadTimeoutError(self._pool, None, "Read timed out.") from e  # type: ignore[arg-type]
    761 except BaseSSLError as e:
    762     # FIXME: Is there a better way to differentiate between SSLErrors?

ReadTimeoutError: AWSHTTPSConnectionPool(host='fruitbat-vocalizations.s3.us-west-2.amazonaws', port=443): Read timed out.

During handling of the above exception, another exception occurred:

ReadTimeoutError                          Traceback (most recent call last)
File <timed exec>:25

Cell In[25], line 15, in fetch_audio(row, sr)
     12 s3_object_key = str(s3_path.relative_to(DSLOC))
     14 response = s3_client.get_object(Bucket=BUCKET_NAME, Key=s3_object_key)
---> 15 file_content = response['Body'].read()
     17 # https://stackoverflow/questions/73350508/read-audio-file-from-s3-directly-in-python
     18 # this will read in float64 by default and multichannel if any
     19 data, rate = soundfile.read(io.BufferedReader(io.BytesIO(file_content)), always_2d=True)

File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/botocore/httpchecksum.py:240, in StreamingChecksumBody.read(self, amt)
    239 def read(self, amt=None):
--> 240     chunk = super().read(amt=amt)
    241     self._checksum.update(chunk)
    242     if amt is None or (not chunk and amt > 0):

File ~/miniconda3/envs/fruitbats/lib/python3.10/site-packages/botocore/response.py:102, in StreamingBody.read(self, amt)
     99     chunk = self._raw_stream.read(amt)
    100 except URLLib3ReadTimeoutError as e:
    101     # TODO: the url will be None as urllib3 isn't setting it yet
--> 102     raise ReadTimeoutError(endpoint_url=e.url, error=e)
    103 except URLLib3ProtocolError as e:
    104     raise ResponseStreamingError(error=e)

ReadTimeoutError: Read timeout on endpoint URL: "None"
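Note that the last frames of the traceback are in `fetch_audio`, at `response['Body'].read()`, so the timeout shown here is raised while downloading a source file rather than during the PUT. A minimal sketch of retrying that read with exponential backoff is given below. `with_retries` and its parameters are hypothetical names, not part of the original code, and the delays are illustrative.

```python
import time


def with_retries(fn, attempts=5, base_delay=1.0, exceptions=(TimeoutError,), sleep=time.sleep):
    """Call fn() and retry on the given exceptions with exponential backoff.

    Hypothetical helper: `attempts` and `base_delay` are illustrative defaults.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

In the loop above this could presumably be used as `with_retries(lambda: fetch_audio(row), exceptions=(ReadTimeoutError,))`, with `ReadTimeoutError` imported from `botocore.exceptions` (the exception type raised in the last frame of the traceback).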
asked Mar 24 at 15:01
user305883
1 Answer
I have encountered the same problem. In my case, it seems the AWS endpoint was not resolving over IPv6.
First, debug with an AWS CLI command, for example S3 ls:
aws s3 ls --debug
Check whether the command works or whether it fails to resolve over the IPv6 address.
If it fails to resolve, you can switch back to IPv4 or configure use_dualstack_endpoint.
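The same check can be sketched from Python, which may help when the AWS CLI is not installed in the notebook environment. This is a rough sketch using only the standard library: `resolved_families` is a hypothetical helper name, and it only tests DNS resolution per address family, not whether the endpoint is actually reachable.

```python
import socket


def resolved_families(host, port=443, resolver=socket.getaddrinfo):
    """Return the address families a host name resolves to, e.g. {'IPv4', 'IPv6'}."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    found = set()
    for family, label in names.items():
        try:
            if resolver(host, port, family, socket.SOCK_STREAM):
                found.add(label)
        except socket.gaierror:
            pass  # the name has no usable address for this family
    return found


# Example call (needs network access), using the regional S3 endpoint:
# resolved_families("s3.us-west-2.amazonaws.com")
```

If a family is missing from the result, the name does not resolve for that family on this machine, which is consistent with the IPv4/IPv6 suggestion above.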
Verify IPv6 Status
Follow these steps:
Log in with root privileges.
Execute the following command:
$ ip a | grep inet6
If the output looks like the following, IPv6 is enabled:
inet6 ::1/128 scope host
inet6 fe80::e922:bcdf:e150:labb/64 scope link
If IPv6 is disabled, the command prints nothing.
Disable IPv6
Perform the following steps to disable IPv6:
Log in with root privileges.
Open the /etc/sysctl.conf file with the following command:
$ sudo vim /etc/sysctl.conf
Add the following lines to it:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
Reload sysctl.conf with the following command:
$ sysctl -p
Verify that IPv6 is disabled by running the following command:
$ ip a | grep inet6
Alternatively, you can enable use_dualstack_endpoint in the AWS config:
$ cat ~/.aws/config
[default]
region = eu-west-1
output = json
use_dualstack_endpoint = true
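When the code uses boto3 directly, as in the question, roughly the same setting can be passed through `botocore.config.Config` instead of the config file. The sketch below is an assumption about how one might wire it up: `s3_config_options` and `make_s3_client` are hypothetical helper names, and the timeout and retry numbers are illustrative, not recommendations. Credentials are left to boto3's default lookup.

```python
def s3_config_options(dualstack=True, read_timeout=120, connect_timeout=20, max_attempts=10):
    """Keyword arguments for botocore.config.Config (values are illustrative)."""
    return {
        "use_dualstack_endpoint": dualstack,
        "read_timeout": read_timeout,        # seconds to wait for response data
        "connect_timeout": connect_timeout,  # seconds to wait for the connection
        "retries": {"max_attempts": max_attempts, "mode": "standard"},
    }


def make_s3_client(region_name, **options):
    """Build an S3 client with the options above (requires boto3 to be installed)."""
    import boto3
    from botocore.config import Config

    config = Config(**s3_config_options(**options))
    return boto3.client("s3", region_name=region_name, config=config)
```

For example, `make_s3_client(REGION_NAME, read_timeout=300)` would give a client with a longer read timeout than the botocore default.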
https://techdocs.broadcom/us/en/ca-enterprise-software/it-operations-management/network-flow-analysis/23-3/installing/system-recommendations-and-requirements/linux-servers/disable-ipv6-networking-on-linux-servers.html
REGION_NAME? – John Rotenstein, commented Mar 24 at 22:17