
python - ClickHouse GCS Integration in Air-Gapped AWS Cluster via Proxy - Stack Overflow


I need help with a ClickHouse query that needs to access Google Cloud Storage (GCS) data from an air-gapped AWS environment. Here's a detailed description of the problem:

Current Setup

Environment: Air-gapped AWS cluster.

Proxy: Custom Golang proxy running on an EC2 instance with internet access

Proxy Configuration: HTTPS_PROXY=http://10.0.x.x:8080

The Query

We're trying to execute the following ClickHouse query to insert data from a GCS bucket:

INSERT INTO my_clickhouse_table (
    column_1, column_2, column_3 ... column_n
)
SELECT
    column_1, column_2, column_3 ... column_n

FROM gcs(
    'https://storage.googleapis.com/test_bucket/my_folder/16-01-2025_18:08:35/data-127823782.parquet',
    'access_key_id',
    'secret_access_key'
)

The Problem

The proxy configuration works when creating BigQuery and GCS clients in Python-based applications within the pod, but it fails when executing the above ClickHouse query. The query is unable to access the GCS bucket: the request to https://storage.googleapis.com does not go through the proxy and is blocked.

Requirement

The query should execute and fetch the data from the GCS file. Since this is an air-gapped environment, the communication with https://storage.googleapis.com must go through the proxy.

Are there any specific settings or configurations needed in ClickHouse to route GCS requests through the proxy? I couldn't find any in the documentation.

Here is the relevant passage from the documentation, which wasn't useful in this scenario:

HTTP Proxy Support

ClickHouse Connect adds basic HTTP proxy support using the urllib3 library. It recognizes the standard HTTP_PROXY and HTTPS_PROXY environment variables. Note that using these environment variables will apply to any client created with the clickhouse_connect.get_client method. Alternatively, to configure per client, you can use the http_proxy or https_proxy arguments to the get_client method. For details on the implementation of HTTP Proxy support, see the urllib3 documentation.

Link: https://clickhouse.com/docs/en/integrations/python#http-proxy-support


In AWS (air-gapped: no direct internet access, but outbound access is possible via the proxy): the application runs in a K8s pod, and the ClickHouse server runs in a K8s setup.

In GCP: the GCS bucket where the data resides.

The use case is to ingest data from this GCS bucket into ClickHouse via the proxy.


I also tried the code below, but it didn't work:

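    import clickhouse_connect

    # Connect to the ClickHouse server over HTTP, passing the proxy settings to the client.
    ch_client = clickhouse_connect.get_client(
        host='localhost',
        port=8123,
        username='default',
        password='',
        http_proxy='http://10.0.x.x:8080',
        https_proxy='http://10.0.x.x:8080'
    )

    # The URL is kept on one line: a line break inside the quoted string would become part of the URL.
    query = """
        SELECT * FROM gcs(
            'https://storage.googleapis.com/test_bucket/my_folder/16-01-2025_18:08:35/data-127823782.parquet',
            'access-key', 'secret-key')
    """

    result = ch_client.query(query)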

I also tried setting the environment variable, but it didn't help:

os.environ['HTTPS_PROXY'] = "http://10.0.x.x:8080"
Asked Jan 20 at 18:21 by Abhinavece; edited Jan 20 at 18:57.

1 Answer

To set up a working connection between your clickhouse-server and your GCS bucket via the proxy, you need to set the HTTP_PROXY=http://10.0.x.x:8080 environment variable on clickhouse-server, not in the ClickHouse client.

If you run ClickHouse in a container, set the variable inside the container; see https://docs.docker.com/reference/compose-file/services/#environment for details.

If you run a standalone clickhouse-server as a systemd unit, use the following commands:

sudo mkdir -p /etc/systemd/system/clickhouse-server.service.d
sudo nano /etc/systemd/system/clickhouse-server.service.d/http-proxy.conf

and paste the following file content (then run sudo systemctl daemon-reload and restart clickhouse-server so the change takes effect):

[Service]
Environment="HTTP_PROXY=http://10.0.x.x:8080"
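
Once the server has been restarted, it can be worth confirming that the running clickhouse-server process really sees the proxy variables. The following is a minimal sketch for Linux, assuming you know the server's PID and have permission to read /proc/<pid>/environ; the helper names (parse_environ, proxy_settings, server_proxy_settings) are hypothetical and not part of ClickHouse. It matches variable names case-insensitively and also reports the HTTPS variant, because ClickHouse's proxy documentation describes per-protocol http_proxy and https_proxy variables, so an HTTPS endpoint such as storage.googleapis.com may need https_proxy to be set as well.

```python
from pathlib import Path

# Variable names are compared in lower case, so HTTP_PROXY and http_proxy both match.
PROXY_KEYS = ("http_proxy", "https_proxy", "no_proxy")


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=VALUE format used by /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skips the empty entry after the trailing NUL
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def proxy_settings(env: dict[str, str]) -> dict[str, str]:
    """Keep only the proxy-related variables."""
    return {k: v for k, v in env.items() if k.lower() in PROXY_KEYS}


def server_proxy_settings(pid: int) -> dict[str, str]:
    """Read the environment of a running process (Linux only)."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return proxy_settings(parse_environ(raw))
```

Calling server_proxy_settings(pid) with the PID of clickhouse-server returns a dictionary of the proxy variables the process was started with; an empty dictionary suggests the drop-in file or container environment was not picked up.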

P.S. The proxy-related parameters of clickhouse_connect.get_client apply only to the connection between the ClickHouse client and clickhouse-server.
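
To illustrate the split between client and server, here is a rough sketch of what the Python side might look like once the proxy is configured on the server. The helpers build_gcs_insert and run_insert are hypothetical names introduced for illustration; the table, columns, URL, and keys are the placeholders from the question. The client is created without any proxy arguments, since the gcs() request is issued by clickhouse-server itself.

```python
def build_gcs_insert(table, columns, url, access_key, secret_key):
    """Build an INSERT ... SELECT that reads from GCS via the gcs() table function.

    Values are interpolated directly, so they are assumed to be trusted.
    """
    cols = ", ".join(columns)
    return (
        f"INSERT INTO {table} ({cols}) "
        f"SELECT {cols} FROM gcs('{url}', '{access_key}', '{secret_key}')"
    )


def run_insert(query, host="localhost", port=8123):
    """Send the query to clickhouse-server over a direct (non-proxied) connection."""
    # Imported here so the query builder works without the driver installed.
    import clickhouse_connect

    # No http_proxy/https_proxy arguments: the server, not this client,
    # is the process that contacts storage.googleapis.com.
    client = clickhouse_connect.get_client(
        host=host, port=port, username="default", password=""
    )
    return client.command(query)
```

For example, run_insert(build_gcs_insert("my_clickhouse_table", ["column_1", "column_2"], url, "access_key_id", "secret_access_key")) sends the statement to the server, which then fetches the Parquet file through whatever proxy its own environment defines.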
