I have written a simple function that converts a string consisting of only digits ('0123456789') to a list of bytes. More specifically, it finds the longest possible sequences of digits that, when converted to an integer, are less than 256.
For example, given the decimal digits in order, the output should be (0, 123, 45, 67, 89).
I wrote this function very quickly:
from typing import List

DIGITS = {str(i): i for i in range(10)}

def get_bytes(s: str) -> List[int]:
    byte_list = []
    chunk = 0
    for c in f'{s}0':
        if (d := DIGITS.get(c)) is None:
            raise ValueError(f"{c} is not a digit")

        if (new := chunk * 10 + d) < 256:
            chunk = new
        else:
            byte_list.append(chunk)
            chunk = d

    return byte_list[not byte_list[0]:]
In [143]: get_bytes('0000')
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[143], line 1
----> 1 get_bytes('0000')
Cell In[142], line 14, in get_bytes(s)
11 byte_list.append(chunk)
12 chunk = d
---> 14 return byte_list[not byte_list[0]:]
IndexError: list index out of range
In [144]: get_bytes('100100100')
Out[144]: [100, 100, 100]
In [145]: get_bytes('12301230123')
Out[145]: [123, 123, 123]
I tried to fix the code but couldn't. Only one problem remains, and it is the exact problem this question is about.
In the first example, the expected output is [0, 0, 0, 0], and in the third example the correct output should be [123, 0, 123, 0, 123]. How can I fix this?
I want every isolated 0 to be its own number in the output. In short, I want to make sure that ''.join(str(i) for i in output) == original.
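The round-trip requirement above can be written down as a small check. This is only a sketch: `is_valid_split` is a hypothetical helper name, and it verifies that every value fits in a byte and that the output reproduces the input, but it does not verify that each chunk is the longest possible one.

```python
from typing import Iterable


def is_valid_split(output: Iterable[int], original: str) -> bool:
    """Return True if `output` is a lossless split of `original` into byte values."""
    values = list(output)
    # Every value must fit in a single byte.
    if any(not 0 <= v < 256 for v in values):
        return False
    # Concatenating the decimal forms must reproduce the input exactly.
    return ''.join(str(v) for v in values) == original


print(is_valid_split([123, 0, 123, 0, 123], '12301230123'))  # True
print(is_valid_split([123, 123, 123], '12301230123'))        # False: the zeros were lost
```

Running a check like this over the sample inputs makes the dropped zeros show up immediately.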
According to my tests, the code from the accepted answer performs better than the regex-based approach, so I accepted it:
In [384]: get_bytes1(string) == get_bytes2(string)
Out[384]: True
In [385]: %timeit get_bytes1(string)
179 ms ± 588 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [386]: %timeit get_bytes2(string)
161 ms ± 748 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [387]: len(string)
Out[387]: 1000001
get_bytes2 is from the accepted answer, and get_bytes1 is the regex-based approach. string is the first 1000000 decimal places of π with a '3' in front.
3 Answers
I did one simple thing to fix the code. In the original code, zeros are skipped because when there is an isolated zero, chunk is of course 0, and chunk * 10 is still 0, so the next digit is simply absorbed into the same chunk.
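To see this concretely, here is a small trace of the accumulator on its own. `trace_chunks` is a hypothetical helper written only for this illustration; it is not part of the fix.

```python
from typing import List, Tuple


def trace_chunks(s: str) -> List[Tuple[str, int]]:
    """Record the integer accumulator after each digit is folded in."""
    chunk = 0
    steps = []
    for c in s:
        # Multiplying 0 by 10 gives 0, so a leading zero leaves no trace.
        chunk = chunk * 10 + int(c)
        steps.append((c, chunk))
    return steps


print(trace_chunks('0012'))
# [('0', 0), ('0', 0), ('1', 1), ('2', 12)]
```

After the two zeros the accumulator is still 0, and the final value 12 carries no record that they were ever there, which is why the zero case needs explicit handling.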
I have added a condition that catches this single case:
from typing import List

DIGITS = {str(i): i for i in range(10)}

def get_bytes(s: str) -> List[int]:
    byte_list = []
    chunk = 0
    for c in s:
        if (d := DIGITS.get(c)) is None:
            raise ValueError(f"{c} is not a digit")

        if (new := chunk * 10 + d) < 256:
            chunk = new
        else:
            byte_list.append(chunk)
            chunk = d

        if chunk == d == 0:
            byte_list.append(chunk)

    if chunk:
        byte_list.append(chunk)

    return byte_list
In [159]: get_bytes('0000')
Out[159]: [0, 0, 0, 0]
In [160]: get_bytes('1234567890123456789')
Out[160]: [123, 45, 67, 89, 0, 123, 45, 67, 89]
In [161]: get_bytes('10010010000')
Out[161]: [100, 100, 100, 0, 0]
Though of course, the fool-proof approach is to use re, and it is faster, but only a little bit:
import re

BYTES = re.compile(r'1\d\d|2[0-4]\d|25[0-5]|[1-9]\d?|0')

def get_bytes1(s: str) -> List[int]:
    return list(map(int, BYTES.findall(s)))
In [282]: %timeit get_bytes(string)
198 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [283]: %timeit get_bytes1(string)
173 ms ± 503 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [284]: len(string)
Out[284]: 1000001
I have fixed the regex. Now it works as intended, with no edge cases.
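One way to gain confidence in a pattern like this is to check it exhaustively against small numbers. The sketch below assumes the pattern shown above; `is_byte_token` and `get_bytes_strict` are hypothetical names. It also adds an explicit digit check, because `findall` skips characters that match no alternative instead of raising.

```python
import re
from typing import List

BYTES = re.compile(r'1\d\d|2[0-4]\d|25[0-5]|[1-9]\d?|0')
ONLY_DIGITS = re.compile(r'[0-9]*')


def is_byte_token(text: str) -> bool:
    """True if the whole of `text` is one token the pattern accepts."""
    return BYTES.fullmatch(text) is not None


def get_bytes_strict(s: str) -> List[int]:
    """Regex-based split that rejects non-digit input instead of skipping it."""
    if ONLY_DIGITS.fullmatch(s) is None:
        raise ValueError("input must contain only the digits 0-9")
    return [int(token) for token in BYTES.findall(s)]


# The accepted tokens among 0..999 should be exactly 0..255.
accepted = [n for n in range(1000) if is_byte_token(str(n))]
print(accepted == list(range(256)))     # True
print(is_byte_token('01'))              # False: no leading zeros
print(BYTES.findall('12a3'))            # ['12', '3'] (the 'a' is silently skipped)
print(get_bytes_strict('12301230123'))  # [123, 0, 123, 0, 123]
```

The order of the alternatives matters here: the three-digit branches come first so that the longest valid token is tried before the shorter ones.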
This may not be as concise as it could be but it does work for the test cases presented in the question.
DIGITS = {str(d): d for d in range(10)}

def get_bytes(s: str) -> list[int]:
    result = []
    for i, c in enumerate(s):
        if c == "0":
            result.append(0)
        else:
            break
    else:
        return result
    a = 0
    for c in s[i:]:
        x = DIGITS[c]
        t = a * 10 + x
        if t > 255:
            result.append(a)
            if x == 0:
                result.append(0)
            a = x
        elif x == 0 and a == 0:
            result.append(0)
        else:
            a = t
    return result if a == 0 else result + [a]

samples = [
    "0000",
    "12301230123",
    "100100100",
    "1234567890123456789",
    "10010010000"
]

for sample in samples:
    print(sample, get_bytes(sample))
Output:
0000 [0, 0, 0, 0]
12301230123 [123, 0, 123, 0, 123]
100100100 [100, 100, 100]
1234567890123456789 [123, 45, 67, 89, 0, 123, 45, 67, 89]
10010010000 [100, 100, 100, 0, 0]
def get_bytes2( input ):
    word = input[ 0 ]
    listOut = []
    for i in range( 1, len( input )):
        letter = input[ i ]
        if word != "0" and int( word + letter ) < 256:
            word += letter
        else:
            listOut.append( int( word ))
            word = letter
    if word != "":
        listOut.append( int( word ))
    return listOut

input = [ "1234567890", "0123456789", "12301230123", "100100100", "0000" ]
for str in input:
    print( get_bytes2( str ))
"""
Outputs obtained:
[123, 45, 67, 89, 0]
[0, 123, 45, 67, 89]
[123, 0, 123, 0, 123]
[100, 100, 100]
[0, 0, 0, 0]
"""
First, we initialize word with the first digit of the string, then we enter the for loop, which iterates through the string starting with the second digit.
In the if, we check that word is not "0" and that the numeric value of word concatenated with letter is less than 256; if both hold, we append letter to word, otherwise we move on to the else.
In the else, we add the numeric value of word to listOut and assign the contents of letter to word.
Finally, after the for loop, word still holds the last chunk (it is never empty at this point), so we add its numeric value to listOut.
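One case the walkthrough above does not cover is an empty string: taking the first character of an empty input raises an IndexError. Below is a minimal sketch of the same string-accumulating idea that starts from an empty word instead. `split_bytes` and its parameter name are hypothetical, chosen here to avoid shadowing the built-ins.

```python
from typing import List


def split_bytes(digits: str) -> List[int]:
    """Greedy split into values below 256, keeping every isolated zero."""
    out = []
    word = ""
    for letter in digits:
        # Extend the current word only if it exists, is not a lone "0",
        # and the extended value still fits in a byte.
        if word and word != "0" and int(word + letter) < 256:
            word += letter
        else:
            # Flush the finished word (if any) and start a new one.
            if word:
                out.append(int(word))
            word = letter
    # Flush whatever is left after the last digit.
    if word:
        out.append(int(word))
    return out


for sample in ["", "0000", "12301230123", "1234567890"]:
    print(repr(sample), split_bytes(sample))
```

For non-empty digit strings this should produce the same results as the listing above; only the empty-input behaviour differs.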