I am trying to read an MLTable in my pipeline, but I am getting:
Error Code: ScriptExecution.StreamAccess.NotFound
Native Error: error in streaming from input data sources
StreamError(NotFound)
=> stream not found
NotFound
Error Message: The requested stream was not found. Please make sure the request uri is correct.| session_id=1f8669ce-5a60-494b-a8dd-fd07fee8b186
When I try to read in the mltable in an interactive session, it works fine:
import mltable
tbl = mltable.load(f'azureml:/{flight_data.id}')
tbl.to_pandas_dataframe()
The above code works fine.
I am reading in the data using:
flight_data = ml_client.data.get(name='flightdelaydata1', version='2')
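For clarity, here is roughly what the two attributes of the returned asset look like in my case. This is a hypothetical stand-in class, not the real SDK object; the actual values are redacted and the placeholder segments (`<sub>`, `<rg>`, `<ws>`, `<datastore>`) are illustrative only:

```python
# Hypothetical stand-in for the azure.ai.ml Data asset returned by
# ml_client.data.get -- real values redacted, shapes illustrative.
class DataAssetStub:
    def __init__(self, name, version):
        # .id: full ARM resource ID of the registered data asset
        self.id = ('/subscriptions/<sub>/resourceGroups/<rg>/providers/'
                   'Microsoft.MachineLearningServices/workspaces/<ws>/'
                   f'data/{name}/versions/{version}')
        # .path: URI of the storage location the asset points at
        self.path = f'azureml://datastores/<datastore>/paths/{name}/'

flight_data = DataAssetStub('flightdelaydata1', '2')
# The form that works in my interactive session:
print(f'azureml:/{flight_data.id}')
```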
Below is the .py code:
%%writefile {data_prep_folder}/data_prep.py
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import mlflow
import mltable
import os
import argparse

def main():
    # Main function of the file
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', help='Input data for flight delay model', type=str)
    parser.add_argument('--train_test_split_ratio', help='Test data proportion', type=float, default=.20)
    parser.add_argument('--train_data', help='Training Data', type=str)
    parser.add_argument('--test_data', help='Test Data', type=str)
    args = parser.parse_args()

    # Start logging
    mlflow.start_run()

    tbl = mltable.load(args.data)
    df = tbl.to_pandas_dataframe()

    mlflow.log_metric('Number of observations', df.shape[0])
    mlflow.log_metric('Number of features', df.shape[1])

    df = df.dropna()
    df = df.loc[:, ['Month', 'DayofMonth', 'DayOfWeek', 'DepDelay', 'DepDel15', 'ArrDel15', 'Cancelled', 'ArrDelay']]

    Train_data, Test_data = train_test_split(df, test_size=args.train_test_split_ratio)
    Train_data.to_csv(os.path.join(args.train_data, 'train.csv'))
    Test_data.to_csv(os.path.join(args.test_data, 'test.csv'))

    mlflow.end_run()

if __name__ == "__main__":
    main()
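As a side note, the split logic itself behaves as expected when I test it locally on toy data, with no Azure involved (toy DataFrame below is made up for the check):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the flight data, just to sanity-check the split
toy = pd.DataFrame({'DepDelay': range(100), 'ArrDel15': [0, 1] * 50})
train, test = train_test_split(toy, test_size=0.20, random_state=0)
print(len(train), len(test))  # 80 20
```

So the failure is specific to loading the MLTable input inside the job, not to the preparation code itself.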
and the command code is as below:
from azure.ai.ml.constants import AssetTypes
data_prep_component = command(
    name='flight_delay_model_data_prep',
    description='Flight Delay Model Prediction Data Preparation Component',
    display_name='Flight Delay Data Prep',
    inputs={
        'data': Input(type=AssetTypes.MLTABLE, path=flight_data.path),
        'train_test_split_ratio': Input(type='number')
    },
    outputs={
        'train_data': Output(type='uri_folder'),
        'test_data': Output(type='uri_folder')
    },
    command='''python {data_prep_folder}/data_prep.py \
        --data ${{inputs.data}} --train_test_split_ratio ${{inputs.train_test_split_ratio}} \
        --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}}''',
    environment=f'{envt.name}:{envt.version}'
)
I'm not sure whether I am making an error while reading the MLTable in the job code. Can someone please point out where I am going wrong?
2 Answers
Looking at the documentation for mltable.load and at usage examples, I am not sure the command call you are using to run the Azure job supports AssetTypes.
The example syntax provided for mltable is as follows:
job = command(
    command="python train.py --input ${{inputs.green}}",
    inputs={"green": Input(type="mltable", path=data_asset.id)},
    compute="cpu-cluster",
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./job-env/conda_dependencies.yml",
    ),
    code="./src",
)
Maybe you can try switching your command job to the following and see if that fixes the issue:
data_prep_component = command(
    name='flight_delay_model_data_prep',
    description='Flight Delay Model Prediction Data Preparation Component',
    display_name='Flight Delay Data Prep',
    inputs={
        'data': Input(type="mltable", path=flight_data.id),
        'train_test_split_ratio': Input(type='number')
    },
    outputs={
        'train_data': Output(type='uri_folder'),
        'test_data': Output(type='uri_folder')
    },
    command='''python {data_prep_folder}/data_prep.py \
        --data ${{inputs.data}} --train_test_split_ratio ${{inputs.train_test_split_ratio}} \
        --train_data ${{outputs.train_data}} --test_data ${{outputs.test_data}}''',
    environment=f'{envt.name}:{envt.version}'
)
Sorry! I don't have an active Azure instance at the moment to test it, but do give it a shot.
Looking at the documentation for azure.ai.ml.Input, you might need to change the value of path.
Note their example:
from azure.ai.ml import Input
...
Input(
    path="azureml:list_data_v2_test:2", type="uri_folder", description="registered data asset"
)
Their example uses a uri_folder type rather than MLTable, so it doesn't work exactly the same, but notice the azureml: prefix on the path value, and the format.
Check out also the Remarks section at the bottom of the mltable documentation, as it has some examples and information about the expected URIs of MLTables, in particular:
There must be a valid MLTable YAML file named 'MLTable' present at the given uri.
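For reference, a minimal MLTable file for a folder of CSVs looks roughly like the following. The field names follow the mltable YAML format as I understand it; the path pattern is illustrative and would need to match your actual data layout:

```yaml
# File must be named exactly 'MLTable' (no extension) and live at the asset's URI
paths:
  - pattern: ./*.csv
transformations:
  - read_delimited:
      delimiter: ','
      header: all_files_same_headers
```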
In short, try changing:
Input(type=AssetTypes.MLTABLE, path = flight_data.path)
to:
Input(type=AssetTypes.MLTABLE, path = f'azureml:/{flight_data.id}')
since that is the value that is working in your mltable interactive session.
Comments:
- Maybe the --data argument is not being passed in the right format. – Razzi Abuissa, Mar 7 at 19:48
- …command to the code, are you referring to something else? – Karthik S, Mar 8 at 15:52
- In the interactive session you're using mltable.load(f'azureml:/{flight_data.id}'), however in the command script you're using Input(type=AssetTypes.MLTABLE, path = flight_data.path) - that is, flight_data.id (with an azureml prefix) versus flight_data.path. In both cases it's not clear how the flight_data object is defined, so it would help to see how that is declared / imported. – Razzi Abuissa, Mar 9 at 7:14
- I am defining the flight_data object by: flight_data = ml_client.data.get(name='flightdelaydata1', version='2') – Karthik S, Mar 9 at 9:34