ConfigSync - used to sync Git, OCI and Helm charts to your clusters. Part of KPT.

  • Last update: Dec 24, 2022
  • Comments: 15

Config Sync

Config Sync lets cluster operators and platform administrators deploy consistent configurations and policies across multiple clusters. This simplifies and automates configuration and policy management at scale.

Start using Config Sync

Follow the installation guide to install OSS Config Sync. If you are using GKE or Anthos, you can also install Config Sync through the Google Cloud GUI or Google Cloud CLI.

Start contributing to Config Sync

We welcome contributions to Config Sync from the community. Take a look at our contribution guide to get started.

Download

kpt-config-sync.zip

Comments(15)

  • 1

    add watch for privateCertSecret

    This updates the watch function to map changes to the user-managed secret in the RepoSync namespace to the upserted secret in the config-management-system namespace. This ensures the secret is kept up to date if the user updates the secret.

    Change is based on https://github.com/GoogleContainerTools/kpt-config-sync/pull/11
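
    The mapping idea can be sketched as follows, in Python rather than the project's Go. The `RepoSync` record, its `private_cert_secret` field, and the `upserted_name` naming scheme are all hypothetical stand-ins, not Config Sync's real types. Given a changed Secret, the function finds the RepoSyncs in the same namespace that reference it and returns the keys of the upserted copies that need refreshing.

```python
from __future__ import annotations

from dataclasses import dataclass

# Namespace that holds the upserted secret copies.
CMS_NAMESPACE = "config-management-system"


@dataclass(frozen=True)
class RepoSync:
    """Hypothetical stand-in for a RepoSync object; not the real API type."""
    namespace: str
    name: str
    private_cert_secret: str  # name of the user-managed secret, "" if unset


def upserted_name(rs: RepoSync) -> str:
    """Hypothetical naming scheme for the copy in config-management-system."""
    return f"ns-reconciler-{rs.namespace}-{rs.name}-{rs.private_cert_secret}"


def map_secret_change(
    secret_ns: str, secret_name: str, repo_syncs: list[RepoSync]
) -> list[tuple[str, str]]:
    """Return (namespace, name) keys of upserted secrets affected by a change.

    Only RepoSyncs that live in the changed secret's namespace and reference
    it by name are considered.
    """
    return [
        (CMS_NAMESPACE, upserted_name(rs))
        for rs in repo_syncs
        if rs.namespace == secret_ns and rs.private_cert_secret == secret_name
    ]
```

    A change to a secret that no RepoSync references maps to an empty list, so unrelated secrets trigger no work.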

  • 2

    upsert privateCertSecret from RepoSync

    This follows the pattern of the user-provided secrets for git credentials, where the user is expected to create the RepoSync secret in the same namespace as the RepoSync. The secrets are then upserted to the config-management-system namespace by the Reconciler. This is to support use cases where the RepoSync user does not have access to the c-m-s namespace.
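
    A rough sketch of the upsert step, in Python rather than the project's Go. A plain dict keyed by (namespace, name) stands in for the cluster, and the function name and return values are assumptions made for illustration only.

```python
# Namespace the secret is copied into.
CMS_NAMESPACE = "config-management-system"


def upsert_secret(store: dict, source_ns: str, source_name: str, target_name: str) -> str:
    """Copy a user-managed secret into config-management-system.

    `store` maps (namespace, name) to the secret's data dict. Returns
    "created", "updated", or "unchanged". Raises KeyError when the user has
    not created the source secret yet.
    """
    source = store[(source_ns, source_name)]
    target_key = (CMS_NAMESPACE, target_name)
    existing = store.get(target_key)
    if existing == source:
        return "unchanged"
    # Store a copy so later edits to the source do not alias the target.
    store[target_key] = dict(source)
    return "created" if existing is None else "updated"
```

    The user only ever writes to their own namespace; the copy in config-management-system is produced on their behalf.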

  • 3

    Send correct number of declared_resources

    • This fixes an edge case where declared_resources was zero when more than zero resources were declared.
    • This wasn't caught before because TestDontDeleteAllNamespaces expects zero resources when it tests deletion of all namespaces.
    • This unblocks adding a new cluster-scoped safety resource in a future change.
  • 4

    Fix metrics validation

    • Enable resource_to_telemetry_conversion on the otel-collector.
    • Use the k8s.deployment.name resource attribute to filter and validate metrics from specific reconcilers.
    • Add a safety ClusterRole to e2e tests. This cluster-scoped resource is required to allow deletion of all other cluster-scoped resources in mono-repo mode. This allows for generic cleanup code, rather than needing each test to work around the deletion validation code.
    • Rewrite e2e validation of the declared_resources metric to dynamically account for default resources, whether the test is using centralized or delegated setup.
    • Fix a race condition in the TestAddUpdateRemoveClusterScopedCRDV1 tests when deleting both the CRD and CR in the same commit.
    • Shorten the metrics wait timeout to 1m from 6m.
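
    The deployment-name filter from the list above could look roughly like this. It is a Python sketch, not the e2e test code, and it assumes the k8s.deployment.name resource attribute is exposed in Prometheus text output as a `k8s_deployment_name` label once resource_to_telemetry_conversion is enabled. The function name is hypothetical.

```python
import re

# Matches one Prometheus text-format sample: name{labels} value
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
# Matches one label pair: key="value"
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def samples_for_deployment(text: str, deployment: str) -> list:
    """Return (metric name, labels, value) for samples from one deployment."""
    out = []
    for line in text.splitlines():
        m = _SAMPLE.match(line.strip())
        if not m:
            continue  # skips comments, blank lines, and label-less samples
        labels = dict(_LABEL.findall(m.group("labels")))
        if labels.get("k8s_deployment_name") == deployment:
            out.append((m.group("name"), labels, float(m.group("value"))))
    return out
```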

    Depends on:

    • https://github.com/GoogleContainerTools/kpt-config-sync/pull/114
    • https://github.com/GoogleContainerTools/kpt-config-sync/pull/115
    • https://github.com/GoogleContainerTools/kpt-config-sync/pull/117
    • https://github.com/GoogleContainerTools/kpt-config-sync/pull/118
    • https://github.com/GoogleContainerTools/kpt-config-sync/pull/119
    • https://github.com/GoogleContainerTools/kpt-config-sync/pull/120
  • 5

    Cleanup logs & errors in reconciler-manager

    • Make managed object and sync object log messages more consistent
    • Use ObjectKeys/NamespacedNames where possible to avoid making assumptions about which namespace to use, and make it easier to log the namespace & name together.
    • Separate the auth secret and CA cert secret upserts to make the code easier to read and to give each a distinct error message that explains what the secret is for, since the user needs to supply it.
  • 6

    run tidy as part of presubmit check

    The license script runs tidy with -compat=1.17 flag, which results in a different set of modules from what is currently committed to the repo. This adds license as a presubmit check to ensure the command does not result in a dirty repo.

  • 7

    add terraform config for dev/ci clusters

    This terraform config is intended to automate the provisioning of test infra resources needed to run the e2e tests. It can be used for the use case of our prow periodic jobs as well as development workflows.

  • 8

    Make client-side timeout when talking to API Server configurable

    We have a few hundred CRDs in our clusters, and are seeing Config Sync reconciliation fail with API Server timeouts due to the client-side timeout parameter being too aggressive.

    This makes the client-side timeout for API server requests configurable under .spec.override (next to things like gitSyncDepth and statusMode; if you'd like it to be somewhere else, feel free to recommend something and I'll adjust the PR). I tried to mimic patterns already in use for other parameters, both in implementation and tests (but oh, god, does this test suite need some work...!) so I hope it looks OK.

    I added two new tests for the actual expansion of config into environment variables, because I realized that all the existing tests of the public API were also using the production implementation to create the expected results, which in practice means that a bug there would never be exposed by the tests. There's room to add a lot more test cases in these two, to ensure that the set of expected env vars in all the other tests matches the intention. I'll leave that work for someone else 😅
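
    To illustrate the testing point, here is a small Python sketch (the project itself is Go). The field name `apiServerTimeout` and all three environment variable names are placeholders, since neither is spelled out here. What matters is that the test writes its expected env vars out by hand instead of generating them with the code under test.

```python
from __future__ import annotations

# Hypothetical names: the override field for the new timeout and the
# environment variable names are placeholders for illustration.
ENV_BY_OVERRIDE = {
    "apiServerTimeout": "API_SERVER_TIMEOUT",
    "gitSyncDepth": "GIT_SYNC_DEPTH",
    "statusMode": "STATUS_MODE",
}


def override_env(override: dict) -> list[dict]:
    """Expand the set override fields into env var entries, sorted by name."""
    env = [
        {"name": ENV_BY_OVERRIDE[field], "value": str(value)}
        for field, value in override.items()
        if field in ENV_BY_OVERRIDE and value is not None
    ]
    return sorted(env, key=lambda e: e["name"])


def test_override_env_uses_literal_expectations():
    # The expected list is a literal, not built by calling production
    # helpers, so a bug in the expansion cannot cancel itself out.
    got = override_env({"apiServerTimeout": "30s", "gitSyncDepth": 1})
    assert got == [
        {"name": "API_SERVER_TIMEOUT", "value": "30s"},
        {"name": "GIT_SYNC_DEPTH", "value": "1"},
    ]
```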

    A couple of questions:

    • Given that 1.13 was released quite recently, I expect that 1.14 is still a few weeks out. What's the easiest way I can re-generate all the manifests and locally build a Docker image to use in a deployment of Config Sync to test this out in one of our clusters, and see if it works?
    • I've run make generate; are there any other code-generation commands I should run?
    • Should I - and if so, how do I - write a changelog entry somewhere?
  • 9

    Add GCP+GKE metric attributes

    • Enable the GCP resource processor on the opentelemetry agent. This adds the following resource attributes:
      • cloud.provider
      • cloud.platform
      • cloud.account.id
      • cloud.region OR cloud.availability_zone
      • host.id
      • host.name (when not using workload identity)
    • When the GCE metadata server is unavailable, the resource processor exits without modifying the resource attributes. So it's safe to enable globally.
    • Enabling the resource processor on the agent sidecars allows for more accurate attributes, like the host.id of the node the reconciler is on, instead of the host.id of the otel-collector.
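
    The fallback behavior described above can be modeled in a few lines of Python. This is only a sketch of the idea, not the processor's code: `fetch_metadata` is a hypothetical callable standing in for a query to the GCE metadata server, and the merge is deliberately simplistic.

```python
from __future__ import annotations

from typing import Callable, Optional


def detect_gcp_attributes(
    attributes: dict,
    fetch_metadata: Callable[[], Optional[dict]],
) -> dict:
    """Return attributes enriched with GCP metadata, or unchanged on failure.

    fetch_metadata returns None, or raises OSError, when the metadata
    server cannot be reached.
    """
    try:
        metadata = fetch_metadata()
    except OSError:
        metadata = None
    if not metadata:
        # Not on GCP (or server unreachable): leave the attributes as they are.
        return dict(attributes)
    enriched = dict(attributes)
    enriched.update(metadata)
    return enriched
```

    Because the unavailable case returns the input unchanged, enabling detection everywhere costs nothing on clusters outside GCP.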

    For more details, see the docs: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourcedetectionprocessor

  • 10

    Clean up scheme usage and type conversion

    • Add core.Scheme, to be used everywhere in this code base. This scheme should have all required types registered.
    • Add utility functions for type lookup and conversion:
      • kinds.ToUnstructured
      • kinds.ToTypedObject
      • kinds.ToTypedWithVersion
      • kinds.ToUnstructuredWithVersion
      • kinds.ToUnstructureds
      • kinds.Lookup
    • Refactor most instances of type conversion and detection to use the new conversion code, for consistency.
    • Fix nomos hydrate output to skip the empty metadata field. This was a bonus from using a scheme with the List type registered.
  • 11

    Filter some metrics by pod name for testing

    • Fix some tests that were failing because their expectations assumed all the metrics for the same deployment would be reported with the same tags. Now they filter by pod name instead of deployment name, using the latest reconciler pod for that deployment.
    • Add debug logging for SyncMetricOptions
    • Fix status updates that were being skipped too aggressively. They now record metrics and status if the status has never been updated before.
    • Remove unnecessary metrics diff on update
    • Add NONE value for commit tag in last_sync_timestamp and last_apply_timestamp metrics, to work around a bug in the otel-collector which causes invalid output by the Prometheus exporter. Without this fix, the commit value gets parsed to the status value, and the status tag is ignored as invalid.
    • Fix metric expectation in TestCRDDeleteBeforeRemoveCustomResourceV1Beta1 and TestCRDDeleteBeforeRemoveCustomResourceV1 to expect source but not sync error. Only non-blocking source errors cause sync errors. Blocking source errors cause early exit, which skips reporting of sync status.
    • Fix metric expectation in TestInvalidRepoSyncBranchStatus to expect source error in the RepoSync, not the RootSync.
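
    Two of the ideas in this list, picking the latest reconciler pod and defaulting the commit tag, can be sketched in Python as below. The `Pod` record, the `pod` tag key, and the function names are hypothetical and chosen for illustration only. The real tests are written in Go.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pod:
    """Hypothetical minimal pod record for this sketch."""
    name: str
    deployment: str
    created: float  # creation time as a Unix timestamp


def latest_pod_name(pods: list[Pod], deployment: str) -> Optional[str]:
    """Return the name of the most recently created pod of a deployment."""
    candidates = [p for p in pods if p.deployment == deployment]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.created).name


def samples_for_pod(samples: list[dict], pod_name: str) -> list[dict]:
    """Keep only the metric samples tagged with the given pod name."""
    return [s for s in samples if s.get("pod") == pod_name]


def commit_tag(commit: str) -> str:
    """Use NONE instead of an empty commit so the tag is never blank."""
    return commit if commit else "NONE"
```

    Filtering by the newest pod avoids mixing samples from a replaced reconciler pod with those of its successor.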
  • 12

    Add Config Sync resource related attributes to GCM

    This change converts the non-k8s-pod typed resource attributes added in this PR that are related to Config Sync resources into metric labels.

    Adding these labels at the reconciler level requires the shared Otel Collector to remove them from all metrics before exporting to the internal (Monarch) pipeline.

    This change also adds the memory_limiter processor, the recommended way to control the check interval and memory limits that the Otel Collector operates under.

    Labels are viewable in the group-by drop-down list and can be selected to aggregate metrics: http://screen/ZtkKWoEx6bxnyiZ

    The Prometheus pipeline remains functional: http://screen/4tS9qSfoaFLSvAA

    Context: https://github.com/GoogleCloudPlatform/opentelemetry-operations-go/issues/534

  • 13

    [WIP] Add test for GIT_SSL_CAINFO env var removal

    This confirms that the new SSA used to update the reconciler Deployment correctly deletes the GIT_SSL_CAINFO env var (as long as no one else has modified it).

  • 14

    Improve rendering status e2e validation

    • Test the actual API output, instead of using an internal nomos CLI function.
    • Validate source and sync status too, to ensure other errors aren't being populated.
    • Validate the Syncing status condition values too
  • 15

    Switch to new googlecloud exporter config with e2e test

    Test looks for RPC errors in otel-collector deployment log, which catches any failure the collector has when exporting metrics.

    Tested [pass] with the legacy feature flag on ("--feature-gates=-exporter.googlecloud.OTLPDirect"). Tested [fail] without the legacy feature flag, where the Otel Collector generates a large number of errors.
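
    The log check can be approximated with a short Python sketch. It assumes gRPC failures appear in the collector log with the text "rpc error", which is how gRPC status errors are commonly formatted. The function name is hypothetical and the real e2e test is written in Go.

```python
import re

# Assumed marker for export failures in the collector log.
RPC_ERROR = re.compile(r"rpc error", re.IGNORECASE)


def find_rpc_errors(log_text: str) -> list:
    """Return the log lines that mention an RPC error."""
    return [line for line in log_text.splitlines() if RPC_ERROR.search(line)]
```

    A test would fetch the otel-collector deployment log and fail if this list is non-empty.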

    Context: b/244597838