Prometheus metrics for unused GCP resources

  • By tom
  • Last update: May 1, 2022
  • Comments: 12

gcp-idle-resources-metrics

Identify unused Google Cloud Platform resources through Prometheus metrics

Currently supported services

  • Google Compute Engine
    • Instances
    • Disks

Usage

Set up a service account in the project you want to monitor and grant it the roles/compute.viewer role.
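
For reference, a minimal sketch of that setup with gcloud; the service account name monitoring-exporter, the project ID my-project, and the key file name are placeholders:

# Create a service account for the exporter
gcloud iam service-accounts create monitoring-exporter \
  --project=my-project \
  --display-name="gcp-idleness-exporter"

# Grant it read-only access to Compute Engine resources
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:monitoring-exporter@my-project.iam.gserviceaccount.com" \
  --role="roles/compute.viewer"

# Optionally, create a JSON key for Application Default Credentials
gcloud iam service-accounts keys create credentials.json \
  --iam-account="monitoring-exporter@my-project.iam.gserviceaccount.com"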

You can authenticate by setting the Application Default Credentials (e.g., placing the service account's JSON key and setting the environment variable GOOGLE_APPLICATION_CREDENTIALS=path-to-credentials.json) or by letting the application automatically load credentials from the metadata server (Workload Identity is recommended).
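
For example, when running outside GCP with a key file (the path is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=$PWD/credentials.json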

You must set at least the project ID and the regions you want to monitor, either by:

  • Passing the command-line arguments --project_id --regions us-east1,us-central1
  • Setting the environment variables GCP_PROJECT_ID= GCP_REGIONS=us-east1,us-central1 (when authenticating through metadata, the project doesn't need to be specified); see the example below
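
For example, a hypothetical local invocation; the binary name follows the project's later rename to gcp-idleness-exporter, and my-project is a placeholder:

# Option 1: command-line arguments
./gcp-idleness-exporter --project_id my-project --regions us-east1,us-central1

# Option 2: environment variables
GCP_PROJECT_ID=my-project GCP_REGIONS=us-east1,us-central1 ./gcp-idleness-exporter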

Docker

# Make your Application Default Credentials available to the container
cp ~/.config/gcloud/application_default_credentials.json ./credentials.json
chmod 444 credentials.json

# Build the image
docker build -t gcp-idle-resources-metrics .

# Run the exporter (fill in GCP_PROJECT_ID with your project ID)
docker run -it --rm --network=host \
  -v $(pwd)/credentials.json:/credentials.json \
  -e GOOGLE_APPLICATION_CREDENTIALS=/credentials.json \
  -e GCP_PROJECT_ID= \
  -e GCP_REGIONS=us-east1,us-central1,southamerica-east1 \
  gcp-idle-resources-metrics

Check the exported metrics. The exporter listens on port 5000 (see the startup logs in the comments below), so they are typically available at http://localhost:5000/metrics.
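
For instance, to spot-check a couple of metrics (names taken from the startup logs below):

curl -s http://localhost:5000/metrics | grep -E 'gce_is_machine_running|gce_is_disk_attached'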

Download

gcp-idle-resources-metrics.zip

Comments (12)

  • 1

    Improve logging

    Why is this pull request necessary, and what does it do? Logs are now like this:

    ts=2022-05-01T22:13:03.466Z caller=main.go:137 level=info msg="Starting gcp-idleness-exporter" version="(version=, branch=, revision=)"
    ts=2022-05-01T22:13:03.466Z caller=main.go:138 level=info msg="Build context" build_context="(go=go1.18, user=, date=)"
    ts=2022-05-01T22:13:03.466Z caller=main.go:175 level=info msg="Starting exporter for project REDACTED at [us-central1 us-east1 southamerica-east1]"
    ts=2022-05-01T22:13:03.466Z caller=main.go:177 level=info msg="Listening on :5000"
    ts=2022-05-01T22:14:46.254Z caller=main.go:93 level=info collector=gce_disk_snapshot metrics="[gce_disk_snapshot_age_days gce_disk_snapshot_amount]"
    ts=2022-05-01T22:14:46.254Z caller=main.go:93 level=info collector=gce_is_disk_attached metrics=[gce_is_disk_attached]
    ts=2022-05-01T22:14:46.254Z caller=main.go:93 level=info collector=gce_is_machine_running metrics=[gce_is_machine_running]
    ts=2022-05-01T22:14:46.254Z caller=main.go:93 level=info collector=dataproc_is_cluster_running metrics=[dataproc_is_cluster_running]
    

    Special notes for your reviewer: https://github.com/7onn/gcp-idleness-exporter/issues/19

  • 2

    Improve collectors description in logs

    Describe the problem/challenge you have Since the snapshot metrics were bundled together in the gce_disk_snapshot collector, it became unclear which metrics are available there.

    Describe the solution you'd like I'd like to improve the startup logs by enriching them with information about each collector's available metrics.

    Anything else you would like to add:

    ts=2022-05-01T20:21:51.566Z caller=main.go:144 level=info msg="Starting gcp-idleness-exporter" version="(version=2.0.0, branch=HEAD, revision=fa38de847a6baea471b10bbf0ea187cf5f620115)"
    ts=2022-05-01T20:21:51.566Z caller=main.go:145 level=info msg="Build context" build_context="(go=go1.18, user=root@f5ce1f300c35, date=20220501-18:12:33)"
    ts=2022-05-01T20:21:51.568Z caller=main.go:182 level=info msg="Starting exporter for project REDACTED at [us-central1 southamerica-east1]"
    ts=2022-05-01T20:21:51.568Z caller=main.go:184 level=info msg="Listening on :5000"
    ts=2022-05-01T20:24:40.057Z caller=main.go:93 level=info msg="Enabled collectors"
    ts=2022-05-01T20:24:40.057Z caller=main.go:100 level=info collector=dataproc_is_cluster_running
    ts=2022-05-01T20:24:40.057Z caller=main.go:100 level=info collector=gce_disk_snapshot
    ts=2022-05-01T20:24:40.057Z caller=main.go:100 level=info collector=gce_is_disk_attached
    ts=2022-05-01T20:24:40.057Z caller=main.go:100 level=info collector=gce_is_machine_running
    

    Environment: n/a

  • 3

    Rename project to gcp-idleness-exporter

    Why is this pull request necessary, and what does it do? To rename this repository and the application module to gcp-idleness-exporter.

    Special notes for your reviewer: https://github.com/7onn/gcp-idle-resources-metrics/issues/17

  • 4

    Rename project to make the search for it more intuitive

    Describe the problem/challenge you have gcp-idle-resources-metrics is a very verbose name; the same idea would be communicated more effectively by calling the app gcp-idleness-exporter, since "exporter" is the conventional term for Prometheus tooling.

    Describe the solution you'd like Rename the whole thing, both in the app repository and in the Helm charts.

    Anything else you would like to add: n/a

    Environment: n/a

  • 5

    Move snapshot metrics into gce_disk_snapshot collector

    Why is this pull request necessary, and what does it do? To avoid wasting memory by instantiating multiple Google clients and Compute services, this pull request unifies the snapshot-related metrics into a single collector.

    Special notes for your reviewer: https://github.com/7onn/gcp-idle-resources-metrics/issues/15

  • 6

    Unify redundant requests to GCE Disks

    Describe the problem/challenge you have We currently have three collectors making API calls to retrieve disk-snapshot-related information.

    • gce_disk_snapshot_age_days
    • gce_disk_snapshot_amount
    • gce_is_old_snapshot

    As they run asynchronously, each of them instantiates its own GCP client and Compute service.

    Describe the solution you'd like For performance, I'd like these metrics to share a single instantiation of the aforementioned client and service. This also removes gce_is_old_snapshot, as that information is now available through the snapshot age.

    Anything else you would like to add: n/a

    Environment: n/a

  • 7

    Export gce_disk_snapshot_age_days metrics

    Why is this pull request necessary, and what does it do? To enable the user to alert, for instance, when the latest snapshot is older than a given number of days, this pull request implements the gce_disk_snapshot_age_days collector.

    Special notes for your reviewer: https://github.com/7onn/gcp-idle-resources-metrics/issues/3
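
    As an illustration, a minimal Prometheus alerting-rule sketch built on that metric; the alert name, the 30-day threshold, and the name label are placeholder assumptions:

    groups:
      - name: gcp-idleness
        rules:
          - alert: StaleDiskSnapshot
            # Fires when a disk's newest snapshot is older than 30 days
            expr: gce_disk_snapshot_age_days > 30
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "Latest snapshot of disk {{ $labels.name }} is over 30 days old"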

  • 8

    Remove unnecessary sort in GCEDiskSnapshotAmountCollector

    Why is this pull request necessary, and what does it do? To avoid wasting computational effort, this pull request removes a sort that was meaningless to the algorithm.

    Special notes for your reviewer: n/a

  • 9

    Export gcp_disk_snapshot_amount metrics

    Why is this pull request necessary, and what does it do? So we can spot disks with many snapshot versions, this pull request exports a new metric called gcp_disk_snapshot_amount.

    Special notes for your reviewer: https://github.com/7onn/gcp-idle-resources-metrics/issues/3
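
    For example, a hypothetical PromQL query to surface the ten disks with the most snapshots (using the metric's current name from the startup logs above):

    topk(10, gce_disk_snapshot_amount)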

  • 10

    Prevent application from crashing due to Collector errors

    Why is this pull request necessary, and what does it do? If, for instance, the service account lacks permissions for some API, the app crashes instead of logging the error. This pull request prevents that crashing behavior.

    Special notes for your reviewer: https://github.com/7onn/gcp-idle-resources-metrics/issues/8

  • 11

    Label region on Dataproc metrics

    Why is this pull request necessary, and what does it do? The URL for accessing a Dataproc cluster through the GCP Console requires the region. Therefore, the metric must carry a region label so that an alert can assemble a valid link to the cluster.

    Special notes for your reviewer: https://github.com/7onn/gcp-idle-resources-metrics/issues/5#issuecomment-1108937719

    This is how the URL will be assembled in the alert: https://console.cloud.google.com/dataproc/clusters/{{ $labels.name }}?region={{ $labels.region }}&project={{ $labels.project }}
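
    For instance, a sketch of how that template might appear in an alerting rule's annotations; the alert name and expression are placeholders, while the metric name comes from the startup logs above:

    - alert: DataprocClusterIdle
      # Placeholder expression; adjust to the metric's actual semantics
      expr: dataproc_is_cluster_running == 0
      annotations:
        cluster_url: https://console.cloud.google.com/dataproc/clusters/{{ $labels.name }}?region={{ $labels.region }}&project={{ $labels.project }}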

  • 12

    Improve Dataproc collector's accuracy

    Describe the problem/challenge you have The current way of retrieving a Dataproc cluster's status is through its Status.State. The issue with this approach is that the default interval (5m) may miss small, quick jobs and erroneously consider the cluster idle.

    Describe the solution you'd like I'd like to consider the statuses of Dataproc jobs in addition to the cluster status as a whole.