kepler
Kepler (Kubernetes Efficient Power Level Exporter) uses eBPF to probe energy-related system stats and exports them as Prometheus metrics.
Architecture
Talk and Demo
Open Source Summit NA 2022 talk and demo
Requirement
Kernel 4.18+
Installation and Configuration for Prometheus
Prerequisites
You need access to a Kubernetes cluster.
Deploy the Kepler exporter
Deploy the Kepler exporter as a DaemonSet so that it runs on all nodes. The following deployment also creates a service listening on port 9102.
# kubectl create -f manifests/kubernetes/deployment.yaml
Deploy the Prometheus operator and the whole monitoring stack
- Clone the kube-prometheus project to your local folder.
# git clone https://github.com/prometheus-operator/kube-prometheus
- Deploy the whole monitoring stack using the config in the manifests directory. Create the namespace and CRDs, and then wait for them to be available before creating the remaining resources.
# cd kube-prometheus
# kubectl apply --server-side -f manifests/setup
# until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
# kubectl apply -f manifests/
Configure Prometheus to scrape Kepler-exporter endpoints.
# cd ../kepler
# kubectl create -f manifests/kubernetes/keplerExporter-serviceMonitor.yaml
Sample Grafana dashboard
Import the pre-generated Kepler Dashboard into Grafana.
To start developing Kepler
To set up a development environment, please read our Getting Started Guide.
Cannot start up exporter with Kind
Describe the bug I tried kepler with Kind, but the exporter cannot start up and the logs show the errors below. I am not sure whether Kind is on the supported list.
To Reproduce Steps to reproduce the behavior:
Make kepler metrics conform to the Prometheus metrics guideline
Why this PR is needed? Currently it is hard to understand the Prometheus metrics, or even to know which metrics we are exporting. The metric naming is complex and does not follow the Prometheus metric naming guidelines. More details are in issue #286.
What this PR does? This PR updates the Prometheus metrics, along with some changes needed to enable the new metrics.
Additionally, Prometheus recommends reporting energy in joules rather than watts. Given that, we don't need to report the current power consumption, since it can be calculated with PromQL. There are more details about this in issue #286.
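As a rough illustration of why a joules counter is enough, the Go sketch below derives average power from two scrapes of a cumulative energy counter. The type and function names are hypothetical and not part of Kepler. PromQL's rate function performs a similar calculation with extra handling (such as counter resets) that this sketch leaves out.

```go
package main

import (
	"errors"
	"fmt"
)

// Sample is one scrape of a cumulative energy counter (hypothetical type).
type Sample struct {
	UnixSeconds float64 // time of the scrape
	Joules      float64 // cumulative energy reported at that time
}

// AveragePowerWatts returns the mean power between two scrapes.
// Since 1 W = 1 J/s, power is the energy delta divided by the elapsed time.
// Counter resets are not handled in this sketch.
func AveragePowerWatts(first, last Sample) (float64, error) {
	elapsed := last.UnixSeconds - first.UnixSeconds
	if elapsed <= 0 {
		return 0, errors.New("samples must be in increasing time order")
	}
	delta := last.Joules - first.Joules
	if delta < 0 {
		return 0, errors.New("counter decreased; resets are not handled here")
	}
	return delta / elapsed, nil
}

func main() {
	// Made-up values: 300 J consumed over 60 s.
	first := Sample{UnixSeconds: 0, Joules: 1200}
	last := Sample{UnixSeconds: 60, Joules: 1500}
	watts, err := AveragePowerWatts(first, last)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.1f W\n", watts) // prints "5.0 W"
}
```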
For the sake of compatibility with other modules, we keep some deprecated metrics and will remove them later.
Additional comments The changes are carefully separated into different commits to make them easier to review.
I will update the Grafana dashboard in another PR. There are already many updates in this PR.
Signed-off-by: Marcelo Amaral [email protected]
[WIP][don't merge] Dev:1st impl for integration test
Details for this PR Adds an e2e folder with end-to-end tests, which will run with two models based on the kepler_address environment variable.
Signed-off-by: Sam Yuan [email protected]
dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
Describe the bug After rolling the DaemonSet over to the latest image on the quay.io registry (sha256:01a86339a8acb566ddcee848640ed4419ad0bffac98529e9b489a3dcb1e671f5), the message in the title is shown constantly. Example output of the problem:
Is estimator.sock expected to be missing in the current state of the project?
Each node is reporting the same error. As a side note, since then the nodes have not been logging any new Kepler metrics to Prometheus. I am in no position to say that these issues are connected; the missing metrics might be some other local issue, but it seemed worth mentioning.
To Reproduce Steps to reproduce the behavior:
Expected behavior The /tmp/estimator.sock error is not reported.
Exclude VM node when deploy kepler exporter
Currently the Kepler exporter cannot get data successfully on VM nodes. With this PR, the Kepler exporter is no longer deployed on VM nodes; only bare-metal nodes are scheduled for the Kepler deployment.
Signed-off-by: Hao, Ruomeng [email protected]
Energy consumption of CPU is 0
Describe the bug In the "Pod Current Energy Consumption" panel of the Grafana dashboard, the CPU energy consumption of every pod is 0. In Prometheus, "pod_curr_energy_in_core_millijoule" is 0 for all pods. "Total" and "DRAM" have data, but "CPU" is 0.
The issue exists on both RHEL 8.6 and Ubuntu 22.04 hosts.
Update Grafana dashboards with the new container metrics
Why this PR is needed? PR #287 will update the Prometheus metrics and affect the current Grafana dashboard. The new metrics will report energy per container and have more meaningful names. More details are in issue #286.
Currently, it is difficult to understand all the queries in the existing Grafana dashboard. There are some constant values that are not obvious and some queries that are wrong. For example:
The sum_over_time(pod_curr_energy_in_core_millijoule{pod_namespace=\"$namespace\", pod_name=\"$pod\"}[24h])*15/3/3600000000 metric: sum_over_time sums the metric within the time frame (the value in the square brackets), producing a cumulative number from the gauge. The problem here is granularity: we know the gauge is reported every 3s, so the query will not sum the aggregation across each 3s interval. Instead of a gauge, a counter should be used, e.g., pod_aggr_energy_in_core_millijoule, but then of course it won't make sense to use sum_over_time. If we use the counter, to get the kWh we will need to use the increase function:
So, in Prometheus, metrics are based on averages and approximations. In fact, the increase function takes the average over the time period and multiplies it by the interval.
Also, in case we are using a counter, division by 3 makes no sense, as the rate function already returns values per second, and increase just takes the rate and multiplies it by the interval.
Additionally, I didn't understand the multiplication by 15 and the division by 3600000000.
Another example: the rate(pod_curr_energy_in_gpu_millijoule{}[1m])/3 metric. The previous metric pod_curr_energy_in_gpu_millijoule was a gauge, and rate over a gauge metric doesn't make sense. Again, it would make sense to use the counter pod_aggr_energy_in_core_millijoule, but not to divide by 3.
What this PR does? This PR updates the Grafana dashboard with the new metrics and the proper queries.
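The granularity problem with summing a gauge can be illustrated with a small Go simulation. This is only a sketch under stated assumptions: the 3s update period comes from the description above, while the 15s scrape interval, the energy values, and the Observe function are hypothetical. It compares the sum of the scraped gauge samples with the increase of a counter over the same period.

```go
package main

import "fmt"

// Observe replays per-window energy values (in millijoules) through a gauge
// and a counter. The exporter publishes one value every updateEvery seconds;
// the scraper reads both metrics every scrapeEvery seconds.
// It returns the sum of the scraped gauge samples and the increase of the
// counter between time zero and the last scrape.
func Observe(windowsMJ []float64, updateEvery, scrapeEvery int) (gaugeSum, counterIncrease float64) {
	var counter float64
	for i, energy := range windowsMJ {
		counter += energy // the counter accumulates every window
		gauge := energy   // the gauge only remembers the latest window
		now := (i + 1) * updateEvery
		if now%scrapeEvery == 0 {
			gaugeSum += gauge
			counterIncrease = counter // the counter started at 0
		}
	}
	return gaugeSum, counterIncrease
}

func main() {
	// 20 windows of 3 s (60 s in total). Made-up load: 40 mJ per window,
	// except every fifth window (the one that gets scraped), which uses 10 mJ.
	windows := make([]float64, 20)
	for i := range windows {
		if (i+1)%5 == 0 {
			windows[i] = 10
		} else {
			windows[i] = 40
		}
	}
	gaugeSum, increase := Observe(windows, 3, 15)
	fmt.Printf("sum of scraped gauge samples: %.0f mJ\n", gaugeSum) // 40 mJ
	fmt.Printf("counter increase:             %.0f mJ\n", increase) // 680 mJ
}
```

In this made-up scenario the scraper sees only four of the twenty gauge values, so summing them misses most of the energy, whereas the counter retains the full total regardless of how often it is scraped.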
For the query that will return watts, we will have:
And another query will return kWh per day:
Note that, to calculate the kWh, we need to multiply the kilowatts by the hours of daily use; therefore we count how many hours within a day the container is running.
I have also fixed other minor issues in the dashboard, such as the All value in the namespace and pod variables.
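The kWh arithmetic described above can be checked with a short Go sketch. The function names and sample numbers are hypothetical; the only facts relied on are that 1 kWh = 1000 W x 3600 s = 3.6e6 J (equivalently 3.6e9 mJ) and that kWh equals kilowatts multiplied by hours.

```go
package main

import "fmt"

// joulesPerKWh is the energy in one kilowatt-hour: 1000 W for 3600 s.
const joulesPerKWh = 3.6e6

// JoulesToKWh converts an energy amount in joules to kilowatt-hours.
func JoulesToKWh(joules float64) float64 {
	return joules / joulesPerKWh
}

// MillijoulesToKWh converts an energy amount in millijoules to kilowatt-hours.
func MillijoulesToKWh(millijoules float64) float64 {
	return millijoules / 1000 / joulesPerKWh
}

// DailyKWh multiplies the average power in kilowatts by the number of hours
// the container was running within the day.
func DailyKWh(averageWatts, hoursRunning float64) float64 {
	return averageWatts / 1000 * hoursRunning
}

func main() {
	// Made-up example: a container averaging 50 W that runs 6 hours per day.
	fmt.Printf("%.3f kWh\n", DailyKWh(50, 6))
	// The same energy expressed in joules: 50 W * 6 h * 3600 s/h.
	fmt.Printf("%.3f kWh\n", JoulesToKWh(50*6*3600))
}
```

Both lines print 0.300 kWh, showing that the power-times-hours form and the joules form agree.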
Signed-off-by: Marcelo Amaral [email protected]
Fix CI error
Resolves https://github.com/sustainable-computing-io/kepler/issues/193
Change log:
- Add a commit push condition for the main branch.
- Move test coverage to the default unit test (to avoid test coverage that depends on a specific build tag such as bcc).
- Fix the missing test coverage file.
Signed-off-by: Sam Yuan [email protected]
VM: all node / pod energy report 0 (again..)
Describe the bug After updating to the latest version, which includes a few enhancements, my pods and nodes report 0 energy again. After switching to v0.3, I can see the data reported correctly.
latest
v0.3
getKernelVersion doesn't work at all
Describe the bug See https://github.com/sustainable-computing-io/kepler/blob/main/pkg/config/config.go#L65
Paste that code into https://go.dev/play/ and run it.
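For discussion, here is a minimal Go sketch of one way to parse a kernel release string (such as the output of uname -r) and compare it with the 4.18 requirement from the README. It is not the code at the linked line, and ParseKernelRelease and AtLeast are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// releasePattern captures the leading "major.minor" of a kernel release.
var releasePattern = regexp.MustCompile(`^(\d+)\.(\d+)`)

// ParseKernelRelease extracts the major and minor version numbers from a
// release string such as "5.15.0-48-generic".
func ParseKernelRelease(release string) (major, minor int, err error) {
	m := releasePattern.FindStringSubmatch(release)
	if m == nil {
		return 0, 0, fmt.Errorf("unrecognized kernel release %q", release)
	}
	if major, err = strconv.Atoi(m[1]); err != nil {
		return 0, 0, err
	}
	if minor, err = strconv.Atoi(m[2]); err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

// AtLeast reports whether major.minor is at least wantMajor.wantMinor.
func AtLeast(major, minor, wantMajor, wantMinor int) bool {
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

func main() {
	for _, release := range []string{"5.15.0-48-generic", "4.18.0-372.el8.x86_64", "4.9.0"} {
		major, minor, err := ParseKernelRelease(release)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %d.%d, meets 4.18+: %t\n", release, major, minor, AtLeast(major, minor, 4, 18))
	}
}
```

The sketch keeps major and minor as separate integers because comparing versions as decimal numbers would rank 4.9 above 4.18.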
implement model-based power estimator
This PR introduces a dynamic way to estimate power through the Estimator class (pkg/model/estimator.go).
- data/model
- .h5 of keras model, .sav of scikit-learn model, and a simple ratio model that computes metric importance by correlation to power
There are three additional dependent points for integrating this class into Kepler:
- exporter.go
- GetPower function in reader.go
- /data/model, which contains metadata.json giving the remaining details of the model, such as the model file, feature engineering pkl files, features, error, and so on (the minimum-error model is auto-selected if it is empty, "")
- data/model of the container folder (can be done by statically adding it to the Docker image or through deployment manifest volumes)
Check the example use in pkg/model/estimator_test.go
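The auto-selection rule mentioned above (pick the minimum-error model when no model is named) could look roughly like the following Go sketch. The struct, the JSON keys, and the sample values are assumptions for illustration only; the actual layout of metadata.json is not shown in this PR description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ModelMetadata mirrors a hypothetical metadata.json entry.
// The JSON keys are assumptions, not the real schema.
type ModelMetadata struct {
	Name      string   `json:"name"`
	ModelFile string   `json:"model_file"`
	Features  []string `json:"features"`
	Error     float64  `json:"error"`
}

// SelectModel returns the model named by want. When want is "", it returns
// the model with the smallest reported error.
func SelectModel(models []ModelMetadata, want string) (ModelMetadata, error) {
	if len(models) == 0 {
		return ModelMetadata{}, errors.New("no models available")
	}
	if want != "" {
		for _, m := range models {
			if m.Name == want {
				return m, nil
			}
		}
		return ModelMetadata{}, fmt.Errorf("model %q not found", want)
	}
	best := models[0]
	for _, m := range models[1:] {
		if m.Error < best.Error {
			best = m
		}
	}
	return best, nil
}

func main() {
	// Made-up metadata for two models.
	raw := `[
		{"name": "keras", "model_file": "model.h5", "features": ["cpu_cycles"], "error": 0.12},
		{"name": "sklearn", "model_file": "model.sav", "features": ["cpu_cycles"], "error": 0.08}
	]`
	var models []ModelMetadata
	if err := json.Unmarshal([]byte(raw), &models); err != nil {
		panic(err)
	}
	best, err := SelectModel(models, "")
	if err != nil {
		panic(err)
	}
	fmt.Println(best.Name, best.ModelFile) // prints "sklearn model.sav"
}
```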
If you agree with this direction, we can modify estimator.py to
Signed-off-by: Sunyanan Choochotkaew [email protected]
why use dummy Impl of power component instead of estimated?
Is your feature request related to a problem? Please describe. We use the dummy implementation when neither RAPL nor MSR is available: https://github.com/sustainable-computing-io/kepler/blob/main/pkg/power/components/power.go#L60
But we do have an estimated implementation: https://github.com/sustainable-computing-io/kepler/blob/main/pkg/power/components/source/estimate.go
Judging from the names, isn't the estimated implementation more suitable than the dummy?
Consume hardware power metrics from Hardware Sentry
Is your feature request related to a problem? Please describe. Solutions exist to collect power metrics, and even semantic conventions. It would be nice if Kepler would leverage that.
Describe the solution you'd like Example of a solution that collects hardware power metrics: Hardware Sentry. It's free but it's not yet open-source.
It would be greatly beneficial to Kepler if it could use Hardware Sentry as a source for hardware power metrics (notably hw_host_energy_joules_total and hw_energy_joules_total{hw_type="cpu|gpu|memory|physical_disk|network"}).
Also, OpenTelemetry has defined semantic conventions for hardware, including for power and energy metrics. Kepler should follow these conventions.
Describe alternatives you've considered None, really.
Additional context I am the CEO of the company who develops Hardware Sentry. We're pushing for solutions that help companies reduce the carbon footprint of their data centers (notably with temperature optimization). I'm very happy to discover Kepler and sustainable-computing-io!
containerIDToContainerInfo should be updated to reflect removed container?
Describe the bug See https://github.com/sustainable-computing-io/kepler/blob/main/pkg/cgroup/resolve_container.go#L56
It seems containerIDToContainerInfo is defined there and updated when a container is created, but there seems to be no place that updates it when a pod is destroyed.
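One possible shape for such a cleanup is sketched below in Go: a mutex-guarded map with a Prune method that drops entries for containers that no longer exist. All names and fields here are hypothetical and do not reflect Kepler's actual types. How the set of live container IDs would be obtained is left open.

```go
package main

import (
	"fmt"
	"sync"
)

// ContainerInfo holds hypothetical per-container details.
type ContainerInfo struct {
	PodName       string
	ContainerName string
	Namespace     string
}

// Cache maps container IDs to their info and supports removal of stale IDs.
type Cache struct {
	mu   sync.Mutex
	byID map[string]ContainerInfo
}

// NewCache returns an empty cache.
func NewCache() *Cache {
	return &Cache{byID: make(map[string]ContainerInfo)}
}

// Add records or overwrites the info for a container ID.
func (c *Cache) Add(id string, info ContainerInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.byID[id] = info
}

// Len returns the number of cached containers.
func (c *Cache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.byID)
}

// Prune removes every entry whose ID is not in alive and returns how many
// entries were removed.
func (c *Cache) Prune(alive map[string]bool) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for id := range c.byID {
		if !alive[id] {
			delete(c.byID, id) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	cache := NewCache()
	cache.Add("abc123", ContainerInfo{PodName: "web-0", ContainerName: "nginx", Namespace: "default"})
	cache.Add("def456", ContainerInfo{PodName: "job-1", ContainerName: "worker", Namespace: "batch"})
	// Suppose only abc123 is still running.
	removed := cache.Prune(map[string]bool{"abc123": true})
	fmt.Printf("removed %d, %d left\n", removed, cache.Len()) // prints "removed 1, 1 left"
}
```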
e2e tests for Kepler, estimator, and model server
Is your feature request related to a problem? Please describe. Having all of the components e2e tested on baremetal and VM (especially on CI)
Describe the solution you'd like The tests should verify that:
Build manifest deployment with options
Describe the solution you'd like The solution is to build the manifest from the same base, applying the defined patches and additional resources.