Apache Beam recently introduced Managed I/O APIs for Java and Python. What is the difference between Managed I/O and the regular Apache Beam connectors (sources and sinks) ?
asked Mar 12 at 21:51 by chamikara · 1 Answer
Managed I/O offers a simplified API for using Apache Beam connectors (sources and sinks). It includes several key features.
Simplified API
Managed I/O offers a simplified Java and Python API for using Apache Beam connectors in your pipelines. For example, the Java Iceberg source available via Managed I/O can be instantiated using the following syntax.
Map<String, Object> configMap = …
Managed.read(Managed.ICEBERG).withConfig(configMap)
The Java Iceberg sink can be instantiated using the following syntax.
Managed.write(Managed.ICEBERG).withConfig(configMap)
Similarly, a Python Iceberg source can be instantiated using the following syntax.
config_map = {...}
managed.Read(managed.ICEBERG, config=config_map)
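For illustration, here is a minimal sketch of what an Iceberg read configuration might look like. The key names (`table`, `catalog_name`, `catalog_properties`) and values are assumptions for the sketch; check the Managed I/O documentation for the actual supported options.

```python
# Hypothetical Iceberg read configuration for Managed I/O.
# The key names ("table", "catalog_name", "catalog_properties") and all
# values below are assumptions; consult the Beam Managed I/O docs for the
# options your Beam version actually supports.
config_map = {
    "table": "db.events",                         # assumed table identifier
    "catalog_name": "my_catalog",                 # assumed catalog name
    "catalog_properties": {
        "warehouse": "gs://my-bucket/warehouse",  # assumed warehouse location
    },
}

# In a pipeline, this dict would be handed to the managed transform, e.g.:
#   from apache_beam.transforms import managed
#   rows = p | managed.Read(managed.ICEBERG, config=config_map)
```

The point of the pattern is that the connector itself is selected by the transform identifier (`managed.ICEBERG`), while everything connector-specific lives in a plain configuration map.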
This makes it easy to switch from one connector to another and to start using new Apache Beam connectors.
Note that for established connectors such as Kafka and BigQuery, Managed I/O uses the same Apache Beam transforms underneath, so you get the same battle-tested connector implementations through the new simplified API.
Runner specific features
When running Apache Beam pipelines on Google Cloud Dataflow, Managed I/O provides additional features that make connectors safer, more performant, and better suited for Dataflow.
Dataflow automatically upgrades Managed I/O connectors to the latest vetted version during job submission, and during streaming updates via a replacement job. This ensures that these key connectors always include the latest improvements and fixes at job execution time, even if your pipeline as a whole uses an older version of Apache Beam.
Dataflow may also change the behavior of connectors to better optimize for the Dataflow runtime. For example, when operating in at-least-once streaming mode, Dataflow automatically switches the BigQuery sink to the more performant Storage Write API with at-least-once delivery semantics.
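As a sketch of how little the pipeline code changes when the runner makes that choice for you: switching the sink from one connector to another only swaps the transform identifier and the config map. The `table` key and its value below are assumptions for illustration; the Storage Write API routing happens inside Dataflow, not in your code.

```python
# Hypothetical BigQuery sink configuration for Managed I/O.
# The "table" key and its value are assumptions; check the Managed I/O
# docs for the exact supported options.
bq_config = {
    "table": "my-project.my_dataset.my_table",
}

# In an at-least-once streaming pipeline this config would be passed as, e.g.:
#   pcoll | managed.Write(managed.BIGQUERY, config=bq_config)
# Dataflow may then route the write through the Storage Write API's
# at-least-once delivery mode automatically; no config change is needed.
```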
Please see the Dataflow documentation for more details about the Managed I/O features offered by Dataflow. The documentation also includes information specific to supported I/O connectors and example pipelines.
Connector Support
Managed I/O supports key Beam connectors, including Iceberg, Kafka, and BigQuery. To determine the latest set of supported connectors, please refer to the Apache Beam I/O Connectors listing. For additional information, please reach out to the Apache Beam community or Google Cloud Dataflow support.