I'm developing a geocoding script that uses Google's Geocoding API, sending each address as a 'str'.
My workflow is as follows: I receive a table with columns like 'street_name', 'house_number', 'state'.
In Chile, the keyword 'Villa' indicates that a place is residential. 'Villa' acts like a stop word here: it adds no value and can lead to confusion, so my script has these replacement rules:
reemplazos = {
    r'_x002d_': '',
    r'/': '',
    r'_xa0_': ' ',
    # Address information that comes after 'Villa' hinders geocoding
    r'\bVILLA\b.*': '',
}
When I receive an address like "Arquimides 123 Villa nueva Lote 3", it returns 'Arquimides 123', which is the intended behavior. The problem I'm facing now is with an address like "Marga 33, Villa Alemana, Chile": this is a valid address (Villa Alemana is the name of a city), but because I delete 'Villa' and everything after it, I can no longer locate any address within 'Villa Alemana'.
I apply the regexes above to each address with this function:
import re

# Function that applies the replacements to an address
def reemplazar_palabras(direccion):
    direccion = str(direccion)  # Make sure it is a string
    for patron, reemplazo in reemplazos.items():
        direccion = re.sub(patron, reemplazo, direccion, flags=re.IGNORECASE).strip(",")
    return direccion
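To make the problem concrete, here is a minimal, self-contained reproduction. It assumes `reemplazos` contains only the four entries shown above, and the name `casos` plus the two sample strings are just the examples from this question:

```python
import re

# Only the four rules shown in the question; the real dict may hold more.
reemplazos = {
    r'_x002d_': '',
    r'/': '',
    r'_xa0_': ' ',
    r'\bVILLA\b.*': '',
}

def reemplazar_palabras(direccion):
    direccion = str(direccion)  # Make sure it is a string
    for patron, reemplazo in reemplazos.items():
        direccion = re.sub(patron, reemplazo, direccion, flags=re.IGNORECASE).strip(",")
    return direccion

# The two example addresses from the question.
casos = [
    "Arquimides 123 Villa nueva Lote 3",
    "Marga 33, Villa Alemana, Chile",
]
resultados = {caso: reemplazar_palabras(caso) for caso in casos}

for original, limpio in resultados.items():
    print(repr(original), "->", repr(limpio))
```

The first address is trimmed as intended, but the second one loses 'Villa Alemana' (and 'Chile') entirely, because the pattern cannot tell a residential 'Villa' from a place name that starts with 'Villa'.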
So I'm trying to find a different approach for the project, because it will become tedious and brittle if I keep adding hard-coded patterns.
Is there a model or modelling technique that would give a scalable solution?
Personally, I think the code was good, but this kind of obstacle is a high risk: my method deletes parts of valid addresses.
UPDATE: This is how I 'create' the address:
"""
DIRECCION_F is the concatenated address (Direccion_Final; 'Direccion' = 'Address').
DIRE_CALLE is meant to contain only the street name, e.g. dire_calle: Grecia, dire_numero: 1322, dire_comuna: Santiago
"""
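df["DIRECCION_F"] = (
    df["DIRE_CALLE"].astype(str).apply(reemplazar_palabras) + ' ' +
    # Render whole floats such as 1322.0 as '1322'; everything else goes through str().
    # (int.is_integer() only exists on Python 3.12+, so only floats are checked.)
    df["DIRE_NUMERO"].apply(lambda x: f"{int(x)}" if isinstance(x, float) and x.is_integer() else str(x)).apply(reemplazar_palabras) + ', ' +
    df["DIRE_COMUNA"].astype(str).apply(reemplazar_palabras) + ', ' +
    'CHILE')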
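One direction I'm considering, since the address is built from separate columns: apply the 'Villa' rule only to the street column and never to the comuna column. This is just a sketch under that assumption, not a tested solution. The names `reemplazos_comunes`, `reemplazos_calle`, `limpiar`, `formatear_numero` and `construir_direccion` are hypothetical, and the values in the usage lines are made up from the examples in this question:

```python
import re

# Rules that look safe for every column (taken from the dict above).
reemplazos_comunes = {r'_x002d_': '', r'/': '', r'_xa0_': ' '}
# Hypothetical split: the 'Villa' rule is applied to the street column only.
reemplazos_calle = {**reemplazos_comunes, r'\bVILLA\b.*': ''}

def limpiar(valor, reglas):
    """Apply every regex in `reglas` to `valor`, then trim spaces and commas."""
    texto = str(valor)
    for patron, reemplazo in reglas.items():
        texto = re.sub(patron, reemplazo, texto, flags=re.IGNORECASE)
    return texto.strip(" ,")

def formatear_numero(numero):
    """Render whole floats such as 1322.0 as '1322'; otherwise use str()."""
    if isinstance(numero, float) and numero.is_integer():
        return str(int(numero))
    return str(numero)

def construir_direccion(calle, numero, comuna):
    """Build 'street number, comuna, CHILE' with column-specific rules."""
    calle_limpia = limpiar(calle, reemplazos_calle)
    numero_limpio = limpiar(formatear_numero(numero), reemplazos_comunes)
    comuna_limpia = limpiar(comuna, reemplazos_comunes)
    return f"{calle_limpia} {numero_limpio}, {comuna_limpia}, CHILE"

# Made-up values for illustration only.
print(construir_direccion("Marga", 33, "Villa Alemana"))
print(construir_direccion("Arquimides Villa nueva Lote 3", 123.0, "Santiago"))
```

With the DataFrame above this could be applied row-wise, for example with `df.apply(lambda fila: construir_direccion(fila["DIRE_CALLE"], fila["DIRE_NUMERO"], fila["DIRE_COMUNA"]), axis=1)`. It only helps when the columns are reliably separated; a street that is itself named 'Villa ...' would still be truncated, so I'm not sure this scales either.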