Apache Beam recently introduced Managed I/O APIs for Java and Python. What is the difference between Managed I/O and the regular Apache Beam connectors (sources and sinks) ?
asked Mar 12 at 21:51 by chamikara · 1 Answer
Managed I/O offers a simplified API for using Apache Beam connectors (sources and sinks). It includes several key features.
Simplified API
Managed I/O offers a simplified Java and Python API for using Apache Beam connectors in your pipelines. For example, the Java Iceberg source available via Managed I/O can be instantiated using the following syntax.
Map<String, Object> configMap = …
Managed.read(Managed.ICEBERG).withConfig(configMap)
The Java Iceberg sink can be instantiated using the following syntax.
Managed.write(Managed.ICEBERG).withConfig(configMap)
Similarly, a Python Iceberg source can be instantiated using the following syntax.
config_map = {...}
managed.Read(managed.ICEBERG, config=config_map)
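For illustration, here is a minimal sketch of what an Iceberg read configuration might look like. The key names (`table`, `catalog_name`, `catalog_properties`) and values are assumptions for the sketch; check the Managed I/O documentation for the actual supported options.

```python
# Hypothetical Iceberg read configuration for Managed I/O.
# The key names ("table", "catalog_name", "catalog_properties") and all
# values below are assumptions; consult the Beam Managed I/O docs for the
# options your Beam version actually supports.
config_map = {
    "table": "db.events",                         # assumed table identifier
    "catalog_name": "my_catalog",                 # assumed catalog name
    "catalog_properties": {
        "warehouse": "gs://my-bucket/warehouse",  # assumed warehouse location
    },
}

# In a pipeline, this dict would be handed to the managed transform, e.g.:
#   from apache_beam.transforms import managed
#   rows = p | managed.Read(managed.ICEBERG, config=config_map)
```

The point of the pattern is that the connector itself is selected by the transform identifier (`managed.ICEBERG`), while everything connector-specific lives in a plain configuration map.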
This makes it easy to switch from one connector to another and to start using new Apache Beam connectors.
Note that for established connectors such as Kafka and BigQuery, Managed I/O uses the same Apache Beam transforms underneath, so you get the same battle-tested connector implementations through the new simplified API.
Runner specific features
When running Apache Beam pipelines on Google Cloud Dataflow, Managed I/O provides additional features that make connectors safer, more performant, and better suited for Dataflow.
Dataflow automatically upgrades Managed I/O connectors to the latest vetted version during job submission, and during streaming updates via a replacement job. This ensures that these key connectors always include the latest improvements and fixes at job execution time, even if your pipeline as a whole uses an older version of Apache Beam.
Dataflow may also change the behavior of connectors to better optimize for the Dataflow runtime. For example, when operating in at-least-once streaming mode, Dataflow automatically switches the BigQuery sink to the more performant Storage Write API with at-least-once delivery semantics.
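As a sketch of how little the pipeline code changes when the runner makes that choice for you: switching the sink from one connector to another only swaps the transform identifier and the config map. The `table` key and its value below are assumptions for illustration; the Storage Write API routing happens inside Dataflow, not in your code.

```python
# Hypothetical BigQuery sink configuration for Managed I/O.
# The "table" key and its value are assumptions; check the Managed I/O
# docs for the exact supported options.
bq_config = {
    "table": "my-project.my_dataset.my_table",
}

# In an at-least-once streaming pipeline this config would be passed as, e.g.:
#   pcoll | managed.Write(managed.BIGQUERY, config=bq_config)
# Dataflow may then route the write through the Storage Write API's
# at-least-once delivery mode automatically; no config change is needed.
```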
Please see the Dataflow documentation for more details about the Managed I/O features offered by Dataflow. The documentation also includes information specific to supported I/O connectors and example pipelines.
Connector Support
Managed I/O supports key Beam connectors, including Iceberg, Kafka, and BigQuery. To determine the latest set of supported connectors, please refer to the Apache Beam I/O Connectors listing. For additional information, please reach out to the Apache Beam community or Google Cloud Dataflow support.