I need help with a ClickHouse query that needs to access Google Cloud Storage (GCS) data from an air-gapped AWS environment. Here's a detailed description of the problem:
Current Setup
Environment: Air-gapped AWS cluster.
Proxy: Custom Golang proxy running on an EC2 instance with internet access
Proxy Configuration: HTTPS_PROXY=http://10.0.x.x:8080
The Query
We're trying to execute the following ClickHouse query to insert data from a GCS bucket:
INSERT INTO my_clickhouse_table (
    column_1, column_2, column_3 ... column_n
)
SELECT
    column_1, column_2, column_3 ... column_n
FROM gcs(
    'https://storage.googleapis.com/test_bucket/my_folder/16-01-2025_18:08:35/data-127823782.parquet',
    'access_key_id',
    'secret_access_key'
)
The Problem
While the proxy configuration works for creating BigQuery and GCS clients in Python-based applications within the pod, it fails when executing the above ClickHouse query. The query is unable to access the GCS bucket: the request to https://storage.googleapis.com doesn't go through the proxy, so it is blocked.
Requirement
The query should execute and fetch the data from the GCS file. Since this is an air-gapped environment, the communication with https://storage.googleapis.com has to go through the proxy.
Are there any specific settings or configurations needed in ClickHouse to route GCS requests through the proxy? I couldn't find any in the docs.
Here is the reference from the docs, which wasn't useful in the above scenario:
HTTP Proxy Support
ClickHouse Connect adds basic HTTP proxy support using the urllib3 library. It recognizes the standard HTTP_PROXY and HTTPS_PROXY environment variables. Note that using these environment variables will apply to any client created with the clickhouse_connect.get_client method. Alternatively, to configure per client, you can use the http_proxy or https_proxy arguments to the get_client method. For details on the implementation of HTTP Proxy support, see the urllib3 documentation.
Link: https://clickhouse.com/docs/en/integrations/python#http-proxy-support
In AWS (air-gapped: no direct internet access, but outside access is possible via the proxy): the application runs in a K8S pod, and the ClickHouse server runs in a K8S setup.
In GCP: GCS bucket where data resides.
The use case is to ingest data from this GCS bucket into ClickHouse via the proxy.
I also tried the code below, but it didn't work:
import clickhouse_connect

ch_client = clickhouse_connect.get_client(
    host='localhost',
    port=8123,
    username='default',
    password='',
    http_proxy='http://10.0.x.x:8080',
    https_proxy='http://10.0.x.x:8080',
)
query = """
SELECT * FROM
gcs('https://storage.googleapis.com/test_bucket/my_folder/16-01-2025_18:08:35/data-127823782.parquet', 'access-key', 'secret-key')
"""
result = ch_client.query(query)
I also tried setting the environment variable, but it didn't help:
os.environ['HTTPS_PROXY'] = "http://10.0.x.x:8080"
asked Jan 20 at 18:21 by Abhinavece, edited Jan 20 at 18:57
1 Answer
To set up a proper connection between your clickhouse-server and your GCS bucket via the proxy, you need to set the HTTP_PROXY=http://10.0.x.x:8080 environment variable on clickhouse-server, not in the ClickHouse client.
If you use a container, set this inside the container; see https://docs.docker.com/reference/compose-file/services/#environment for details.
If you run a standalone clickhouse-server as a systemd unit, use the following commands:
sudo mkdir -p /etc/systemd/system/clickhouse-server.service.d
sudo nano /etc/systemd/system/clickhouse-server.service.d/http-proxy.conf
and paste the following content into the file:
[Service]
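Environment="HTTP_PROXY=http://10.0.x.x:8080"
Then reload systemd and restart the server so that the new environment takes effect:
sudo systemctl daemon-reload
sudo systemctl restart clickhouse-server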
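Once the server is running with the drop-in in place, it can help to confirm that the variable actually reached the clickhouse-server process rather than only your shell. Below is a minimal sketch, assuming a Linux host where you are allowed to read /proc/<pid>/environ; the helper names (parse_environ, proxy_settings, server_proxy_settings) are made up for illustration and are not part of ClickHouse. Since the GCS endpoint is an https:// URL, it may also be worth checking for an HTTPS_PROXY / https_proxy entry; whether the server needs that one too is an assumption on my side, not something the answer above confirms.

```python
from pathlib import Path

# Proxy-related variable names, compared case-insensitively.
PROXY_KEYS = ("http_proxy", "https_proxy", "no_proxy")


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE format used by /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the trailing empty entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def proxy_settings(env: dict) -> dict:
    """Keep only the proxy-related variables from an environment mapping."""
    return {k: v for k, v in env.items() if k.lower() in PROXY_KEYS}


def server_proxy_settings(pid: int) -> dict:
    """Read the proxy variables of a running process (Linux only)."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return proxy_settings(parse_environ(raw))
```

For a systemd-managed server the PID can be obtained with `systemctl show -p MainPID clickhouse-server`; calling server_proxy_settings with that PID (as root or as the clickhouse user) should list the proxy variables if the drop-in was picked up, and an empty dict otherwise.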
P.S. clickhouse_connect.get_client uses its proxy-related parameters only for the connection between the ClickHouse client and clickhouse-server, not for the requests the server itself makes to GCS.
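To illustrate the point about the client: the gcs() table function runs on the server, so the client-side call should not need any proxy arguments once the server has its proxy environment. This is only a sketch under that assumption; gcs_select and run_via_server are hypothetical helper names, and the host, port, and credentials are the placeholders from the question.

```python
def gcs_select(url: str, access_key: str, secret_key: str) -> str:
    """Build the SELECT from the question; all values are placeholders."""
    return f"SELECT * FROM gcs('{url}', '{access_key}', '{secret_key}')"


def run_via_server(query: str):
    """Send the query to clickhouse-server without any client-side proxy."""
    # Imported lazily so this module loads even without clickhouse_connect.
    import clickhouse_connect

    client = clickhouse_connect.get_client(
        host="localhost",
        port=8123,
        username="default",
        password="",
    )
    # The server, not this process, opens the connection to GCS.
    return client.query(query)
```

Usage would be run_via_server(gcs_select(url, 'access-key', 'secret-key')) with the bucket URL from the question; if it still fails, the error returned by the server is the place to look, not the Python environment.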