I am trying to read an MLTable in my pipeline, but I am getting:
Error Code: ScriptExecution.StreamAccess.NotFound
Native Error: error in streaming from input data sources
StreamError(NotFound)
=> stream not found
NotFound
Error Message: The requested stream was not found. Please make sure the request uri is correct.| session_id=1f8669ce-5a60-494b-a8dd-fd07fee8b186
When I try to read in the mltable in an interactive session, it works fine:
import mltable
tbl = mltable.load(f'azureml:/{flight_data.id}')
tbl.to_pandas_dataframe()
The above code works fine.
I am reading in the data using:
flight_data = ml_client.data.get(name='flightdelaydata1', version='2')
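For clarity, here is roughly what the two attributes of the returned asset look like in my case. This is a hypothetical stand-in class, not the real SDK object; the actual values are redacted and the placeholder segments (`<sub>`, `<rg>`, `<ws>`, `<datastore>`) are illustrative only:

```python
# Hypothetical stand-in for the azure.ai.ml Data asset returned by
# ml_client.data.get -- real values redacted, shapes illustrative.
class DataAssetStub:
    def __init__(self, name, version):
        # .id: full ARM resource ID of the registered data asset
        self.id = ('/subscriptions/<sub>/resourceGroups/<rg>/providers/'
                   'Microsoft.MachineLearningServices/workspaces/<ws>/'
                   f'data/{name}/versions/{version}')
        # .path: URI of the storage location the asset points at
        self.path = f'azureml://datastores/<datastore>/paths/{name}/'

flight_data = DataAssetStub('flightdelaydata1', '2')
# The form that works in my interactive session:
print(f'azureml:/{flight_data.id}')
```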
Below is the .py code:
%%writefile {data_prep_folder}/data_prep.py
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import mlflow
import mltable
import os
import argparse

def main():
    # Main function of the file
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', help='Input data for flight delay model', type=str)
    parser.add_argument('--train_test_split_ratio', help='Test data proportion', type=float, default=.20)
    parser.add_argument('--train_data', help='Training Data', type=str)
    parser.add_argument('--test_data', help='Test Data', type=str)
    args = parser.parse_args()

    # Start logging
    mlflow.start_run()

    tbl = mltable.load(args.data)
    df = tbl.to_pandas_dataframe()

    mlflow.log_metric('Number of observations', df.shape[0])
    mlflow.log_metric('Number of features', df.shape[1])

    df = df.dropna()
    df = df.loc[:, ['Month', 'DayofMonth', 'DayOfWeek', 'DepDelay', 'DepDel15', 'ArrDel15', 'Cancelled', 'ArrDelay']]

    Train_data, Test_data = train_test_split(df, test_size=args.train_test_split_ratio)
    Train_data.to_csv(os.path.join(args.train_data, 'train.csv'))
    Test_data.to_csv(os.path.join(args.test_data, 'test.csv'))

    mlflow.end_run()

if __name__ == "__main__":
    main()
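As a side note, the split logic itself behaves as expected when I test it locally on toy data, with no Azure involved (toy DataFrame below is made up for the check):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the flight data, just to sanity-check the split
toy = pd.DataFrame({'DepDelay': range(100), 'ArrDel15': [0, 1] * 50})
train, test = train_test_split(toy, test_size=0.20, random_state=0)
print(len(train), len(test))  # 80 20
```

So the failure is specific to loading the MLTable input inside the job, not to the preparation code itself.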
and the command code is as below:
from azure.ai.ml.constants import AssetTypes
data_prep_component = command(
    name='flight_delay_model_data_prep',
    description='Flight Delay Model Prediction Data Preparation Component',
    display_name='Flight Delay Data Prep',
    inputs={
        'data': Input(type=AssetTypes.MLTABLE, path=flight_data.path),
        'train_test_split_ratio': Input(type='number')
    },
    outputs={
        'train_data': Output(type='uri_folder'),
        'test_data': Output(type='uri_folder')
    },
    command='''python {data_prep_folder}/data_prep.py \
        --data ${{inputs.data}} --train_test_split_ratio ${{inputs.train_test_split_ratio}} \
        --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}}''',
    environment=f'{envt.name}:{envt.version}'
)
I'm not sure whether I am making an error while reading the MLTable in the job code. Can someone please point out where I am going wrong?
2 Answers
Looking at the documentation for mltable.load and at usage examples, I am not sure the command call you are using to run the Azure job supports AssetTypes.
The example syntax provided for mltable is as follows:
job = command(
    command="python train.py --input ${{inputs.green}}",
    inputs={"green": Input(type="mltable", path=data_asset.id)},
    compute="cpu-cluster",
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./job-env/conda_dependencies.yml",
    ),
    code="./src",
)
Maybe you can try switching your command job to the following and see if that fixes the issue:
data_prep_component = command(
    name='flight_delay_model_data_prep',
    description='Flight Delay Model Prediction Data Preparation Component',
    display_name='Flight Delay Data Prep',
    inputs={
        'data': Input(type="mltable", path=flight_data.id),
        'train_test_split_ratio': Input(type='number')
    },
    outputs={
        'train_data': Output(type='uri_folder'),
        'test_data': Output(type='uri_folder')
    },
    command='''python {data_prep_folder}/data_prep.py \
        --data ${{inputs.data}} --train_test_split_ratio ${{inputs.train_test_split_ratio}} \
        --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}}''',
    environment=f'{envt.name}:{envt.version}'
)
Sorry! I don't have an active Azure instance at the moment to test it, but do give it a shot.
Looking at the documentation for azure.ai.ml.Input, you might need to change the value of path.
Note their example:
from azure.ai.ml import Input
...
Input(
    path="azureml:list_data_v2_test:2", type="uri_folder", description="registered data asset"
)
Their example uses a uri_folder type rather than MLTable, so it doesn't work exactly the same, but notice the azureml: prefix on the path value, and the format.
Check out also the Remarks section at the bottom of the mltable documentation, as it has some examples and information about the expected URIs of MLTables, in particular:
There must be a valid MLTable YAML file named 'MLTable' present at the given uri.
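For reference, a minimal MLTable file for a folder of CSVs looks roughly like the following. The field names follow the mltable YAML format as I understand it; the path pattern is illustrative and would need to match your actual data layout:

```yaml
# File must be named exactly 'MLTable' (no extension) and live at the asset's URI
paths:
  - pattern: ./*.csv
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
```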
In short, try changing:
Input(type=AssetTypes.MLTABLE, path = flight_data.path)
to:
Input(type=AssetTypes.MLTABLE, path = f'azureml:/{flight_data.id}')
since that is the value that is working in your mltable interactive session.
Comments:
- Maybe the --data argument is not being passed in the right format. – Razzi Abuissa, Mar 7 at 19:48
- …command to the code, are you referring to something else? – Karthik S, Mar 8 at 15:52
- In the interactive session you're using mltable.load(f'azureml:/{flight_data.id}'), however in the command script you're using Input(type=AssetTypes.MLTABLE, path = flight_data.path) - that is, flight_data.id (with an azureml prefix) versus flight_data.path. In both cases it's not clear how the flight_data object is defined, so it would help to see how that is declared / imported. – Razzi Abuissa, Mar 9 at 7:14
- I am defining the flight_data object by: flight_data = ml_client.data.get(name='flightdelaydata1', version='2') – Karthik S, Mar 9 at 9:34