Google Cloud Dataflow Usage & Best Practices

Ritul Rai
Jan 14, 2021

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem.

The following rules, practices, and guidelines help you get the most out of it at minimal cost:

1. Reduce the disk size

  • By default, the disk size for a Dataflow pipeline is set to 250 GB for a batch pipeline and 400 GB for a streaming pipeline.
  • If you process incoming events in memory, most of this disk is wasted, so reduce the parameter to 30 GB or less (the minimum recommended value is 30 GB, but we faced no issues running the pipeline with 9–10 GB of persistent disk).
  • To do so, specify the disk size when deploying your pipeline: --diskSizeGb=30
  • According to the Google Cloud Pricing Calculator, reducing this value saves around $20 per month per worker.
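The saving depends on the per-GB price in your region, so it is worth redoing the arithmetic yourself. Below is a minimal sketch of that calculation. The function name, the 730 hours per month, and the price per GB-hour are all assumptions for illustration; the price in particular is a placeholder, not a quoted Google Cloud price, so take the real figure from the pricing calculator.

```python
# Rough estimate of the monthly saving per worker from a smaller disk.
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_disk_saving(default_gb, reduced_gb, price_per_gb_hour,
                        hours=HOURS_PER_MONTH):
    """Return the estimated monthly saving per worker, in the price's currency."""
    if reduced_gb > default_gb:
        raise ValueError("reduced_gb must not exceed default_gb")
    return (default_gb - reduced_gb) * price_per_gb_hour * hours


# Placeholder price, NOT an actual Google Cloud price.
ASSUMED_PRICE_PER_GB_HOUR = 0.0001

batch = monthly_disk_saving(250, 30, ASSUMED_PRICE_PER_GB_HOUR)
streaming = monthly_disk_saving(400, 30, ASSUMED_PRICE_PER_GB_HOUR)
print(f"batch: {batch:.2f} per worker per month")
print(f"streaming: {streaming:.2f} per worker per month")
```

Multiply the result by the number of workers to get the saving for the whole job.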

2. Specify a custom machine type

  • By default, Dataflow runs pipelines on the predefined n1 machine types. These cover a variety of use cases, but it is worth trying a custom machine type of your own with either more CPU or more RAM.
  • To do this, add the following parameter when deploying the pipeline: --workerMachineType=custom-8-7424
  • The value above corresponds to 8 cores and 7424 MB of memory. You can tweak it as needed instead of being locked into the presets.
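Not every combination of cores and memory is accepted. Here is a small sketch of a helper that builds the custom-CORES-MEMORY string and rejects obviously invalid values. The helper name is hypothetical, and the limits it checks (1 or an even number of vCPUs, memory in multiples of 256 MB, 0.9 to 6.5 GB per vCPU) are my understanding of the Compute Engine rules for N1 custom machines; verify them against the current documentation before relying on them.

```python
def n1_custom_machine_type(vcpus, memory_mb):
    """Build an N1 custom machine type name such as 'custom-8-7424'."""
    # Assumed rule: vCPU count is 1 or an even number.
    if vcpus < 1 or (vcpus != 1 and vcpus % 2 != 0):
        raise ValueError("vCPU count must be 1 or an even number")
    # Assumed rule: total memory is a multiple of 256 MB.
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    # Assumed rule: between 0.9 GB and 6.5 GB of memory per vCPU.
    min_mb = 0.9 * 1024 * vcpus
    max_mb = 6.5 * 1024 * vcpus
    if not min_mb <= memory_mb <= max_mb:
        raise ValueError(
            f"memory must be between {min_mb:.0f} and {max_mb:.0f} MB "
            f"for {vcpus} vCPUs"
        )
    return f"custom-{vcpus}-{memory_mb}"


print("--workerMachineType=" + n1_custom_machine_type(8, 7424))
```

Under these assumed limits, 7424 MB is the smallest valid memory size for 8 cores, which makes custom-8-7424 a CPU-heavy configuration.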

3. Disable public IPs

  • By default, the Dataflow service assigns your pipeline both public and private IP addresses.
  • If you don't want your workers to be reachable from the public internet, it's a good idea to disable public IPs. This not only makes your pipeline more secure but might also save you a few bucks on network costs.
  • Adding the following flag to the pipeline execution disables public IPs: --usePublicIps=false. Workers without public IPs can reach Google APIs only if Private Google Access is enabled on their subnetwork.
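The three flags from sections 1 to 3 can be passed together. As a rough sketch, the helper below assembles them into one argument list that you could append to whatever command launches your pipeline. The function name and its defaults are hypothetical; only the flag names and values come from the sections above.

```python
def dataflow_cost_flags(disk_size_gb=30,
                        machine_type="custom-8-7424",
                        use_public_ips=False):
    """Return the cost-related pipeline flags as a list of strings."""
    return [
        f"--diskSizeGb={disk_size_gb}",
        f"--workerMachineType={machine_type}",
        # The flag expects lowercase true/false.
        f"--usePublicIps={str(use_public_ips).lower()}",
    ]


print(" ".join(dataflow_cost_flags()))
# prints: --diskSizeGb=30 --workerMachineType=custom-8-7424 --usePublicIps=false
```

Keeping the flags in one place makes it harder to forget one of them when deploying a new pipeline.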

4. Clean up unused resources and jobs

  • Delete jobs and resources immediately after local development, deployment, and testing. Do not leave them running idle.
  • Follow all the guidelines above when load testing or deploying locally, so that the run stays controlled and does not incur extra cost.
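One way to make cleanup a habit is to script it. The sketch below only builds and prints the gcloud commands for listing active jobs and cancelling one of them; it does not run anything. The helper names, the region, and the JOB_ID placeholder are assumptions for illustration, so substitute your own values.

```python
import shlex


def list_active_jobs_command(region):
    """Command that lists Dataflow jobs still running in the given region."""
    return ["gcloud", "dataflow", "jobs", "list",
            "--status=active", f"--region={region}"]


def cancel_job_command(job_id, region):
    """Command that cancels a single Dataflow job."""
    return ["gcloud", "dataflow", "jobs", "cancel",
            job_id, f"--region={region}"]


if __name__ == "__main__":
    region = "us-central1"  # hypothetical region
    job_id = "JOB_ID"       # placeholder: copy the real ID from the list output
    print(shlex.join(list_active_jobs_command(region)))
    print(shlex.join(cancel_job_command(job_id, region)))
```

Run the printed commands in a shell, or pass the lists to subprocess.run once you are happy with them.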
