Monitoring OKE with Prometheus and Grafana
There are various options available when it comes to monitoring OKE. Open source tools such as Grafana and Kibana can be deployed in different ways: installed entirely within each cluster, or consolidated behind a central dashboard. It is straightforward to install Prometheus and Grafana into a cluster and expose the Grafana interface for that cluster; there is a guided example of how to do that on Oracle Cloud Native Labs (Here).
There is also a need to understand what is happening with the underlying OCI resources used by an OKE cluster, and to enable this there is an OCI data source for Grafana. Again, guidance is available on Oracle Cloud Native Labs (Here).
This is great for individual dev/test clusters or small numbers of clusters, but in reality customers will either already have, or will want, a centralised dashboard to monitor both Kubernetes and OCI resources for multiple clusters, possibly in multiple OCI regions. This post covers the steps required to set up a central Grafana server to monitor OKE clusters running private worker nodes.
Note:
It is assumed that all necessary pre-reqs for running OKE clusters have been completed here, and that the audience is familiar with both OCI and OKE concepts and operations.
Overview
The main concept is to provide a central Grafana instance that can monitor multiple OKE clusters running private worker nodes. We will first create a private OKE cluster using the OCI Console Quick Create feature, which creates the cluster and all required networking. Next we will create a Monitoring VCN with a regional subnet and a Local Peering Gateway, along with the required routing and security settings.
The next step is to add a Local Peering Gateway in the OKE VCN, connect it to the Monitoring gateway, and set up the extra routing and security settings.
Then we will install Prometheus into this cluster and expose the kube-prometheus service so we can access the Prometheus data from a Grafana server running in the Monitoring compartment.
The final steps are to create a Grafana server, add the kube-prometheus service as a datasource and configure a Dashboard.
The following diagram shows the high-level solution:
Note:
For simplicity, we will use an internet gateway in the worked example. The details of setting up VPN access depend on the specific equipment being used on premises. For information on using a VPN connection please see the OCI documentation (Here).
Step by Step
One of the key elements of this solution is making use of the ability to securely peer Virtual Cloud Networks. This allows OKE worker nodes to run in a private subnet while still being accessible to a Grafana server running in a different dedicated VCN and compartment. The following diagram highlights the key networking elements required in this solution.
Pre-Reqs
Compartments
Create two compartments, one for the OKE cluster and one for the Grafana server. In this example I am using OKE and Monitoring.
Create OKE Cluster
Select an appropriate instance shape and number as required/allowed within tenancy limits.
This will set up all required networking and node pools for the cluster.
This will take you to the cluster details screen, which will show when all resources are up and running.
Make a note of the name of the VCN and the CIDR block (this defaults to 10.0.0.0/16 in Quick Create).
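If you prefer to work from the CLI, a quick way to confirm the VCN created by Quick Create and its CIDR block is to list the VCNs in the OKE compartment (the compartment OCID below is a placeholder):
oci network vcn list --compartment-id <oke-compartment-ocid> \
  --query 'data[*].{name:"display-name", cidr:"cidr-block"}' --output table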
Create a VCN in the Monitoring Compartment
Click Create Virtual Cloud Network.
Select Create Virtual Cloud Network Only and enter a CIDR block of 11.0.0.0/16.
Click Create Virtual Cloud Network.
When the network has been created, create a public regional subnet.
Name = Grafana, CIDR Block = 11.0.10.0/24 and select the default Security List
Create an Internet Gateway to allow access to the Grafana dashboard.
Update the default route table to add an internet gateway destination of 0.0.0.0/0
Create a Local Peering Gateway - this will be used to connect to the OKE VCN.
Add a route rule to direct 10.0.0.0/16 traffic to the OKE VCN via the to-oke LPG
Create Egress Rules for the Monitoring VCN
0.0.0.0/0 to allow all internet traffic
10.0.0.0/16 on port 9090 to allow traffic from the Grafana server to the Prometheus service in the OKE VCN
Create Ingress Rules for the Monitoring VCN
Port 3000 on 0.0.0.0/0 to accept Grafana requests from the internet. (This would be changed to suit the VPN CIDR block in a VPN set-up.)
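For reference, the core Monitoring VCN components can also be created with the OCI CLI. This is a minimal sketch assuming the CLI is already configured; the OCIDs and display names are placeholders, and the route table and security list rules above are still easiest to review in the Console:
# VCN and regional subnet for the Grafana server
oci network vcn create --compartment-id <monitoring-compartment-ocid> --cidr-block 11.0.0.0/16 --display-name Monitoring
oci network subnet create --compartment-id <monitoring-compartment-ocid> --vcn-id <monitoring-vcn-ocid> --cidr-block 11.0.10.0/24 --display-name Grafana
# internet gateway for dashboard access and the LPG that will peer to the OKE VCN
oci network internet-gateway create --compartment-id <monitoring-compartment-ocid> --vcn-id <monitoring-vcn-ocid> --is-enabled true --display-name monitoring-igw
oci network local-peering-gateway create --compartment-id <monitoring-compartment-ocid> --vcn-id <monitoring-vcn-ocid> --display-name to-oke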
Set-Up Additional Networking in OKE Compartment
Select Networking, Virtual Cloud Networks
Select the VCN created in the OKE compartment by the Quick Create.
Create a Local Peering Gateway
Establish a peering connection to the Monitoring gateway
After a few moments the connection will be established.
Select route tables for the OKE VCN
Add a route for the Monitoring VCN via the Local Peering Gateway in the load balancer subnet's route table.
Destination 11.0.0.0/16, Target to-grafana LPG
Add an ingress security rule for the same subnet to accept traffic from the Grafana server on port 9090
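The peering itself can also be driven from the CLI. A sketch with placeholder OCIDs; the connect call only needs to be issued from one side of the peering:
# LPG in the OKE VCN, then connect it to the Monitoring VCN's LPG
oci network local-peering-gateway create --compartment-id <oke-compartment-ocid> --vcn-id <oke-vcn-ocid> --display-name to-grafana
oci network local-peering-gateway connect --local-peering-gateway-id <to-grafana-lpg-ocid> --peer-id <to-oke-lpg-ocid>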
At this point all required networking to allow access from the Monitoring VCN to the OKE VCN via private local peering gateways has been completed. The next steps are to install Prometheus in the OKE cluster and create a Grafana server in the Monitoring compartment/VCN.
Installing Prometheus
It is assumed that there is a current kubeconfig available (Here) to allow kubectl commands to be run against the OKE cluster.
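If you do not have one yet, a kubeconfig can be generated with the OCI CLI; the cluster OCID and region below are placeholders:
oci ce cluster create-kubeconfig --cluster-id <oke-cluster-ocid> --file $HOME/.kube/config --region <region> --token-version 2.0.0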
There are many examples of how to install Prometheus and Grafana, but this one is based on the Oracle Cloud Native Labs guide on Monitoring OKE (Here), with a couple of changes.
Firstly, we will not install Grafana onto the OKE cluster (the lab installs it by default), and secondly we will change the kube-prometheus service to use a private load balancer.
First create a role binding to cluster-admin for your OCI user:
kubectl create clusterrolebinding admin --clusterrole=cluster-admin --user=ocid1.user.oc1..nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Then follow the lab instructions.
Add the repo for the Prometheus Operator:
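Assuming the CoreOS charts repository the lab was based on at the time, the command looks something like this (the repo URL is an assumption, so use the one given in the lab if it differs):
# repo URL is an assumption based on the CoreOS charts location used by the lab
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-charts/stable/
helm repo update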
We can skip the steps to install Tiller, as it is enabled by default in Quick Create, and go straight to installing the operator into a separate namespace called monitoring:
helm install --namespace monitoring coreos/prometheus-operator
Download a set of values used to install Prometheus:
Before installing, edit the values.yaml file so that Grafana is not installed on the OKE cluster, by changing deployGrafana: True to False (see the sketch below for one way to do this).
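If the lab's download link is not to hand, a minimal alternative is to pull the chart's default values with Helm 2 and flip the flag; the exact capitalisation of the setting in your values.yaml may differ, so check the file after editing:
# write the chart's default values to a local file, then disable the bundled Grafana
helm inspect values coreos/kube-prometheus > values.yaml
sed -i 's/deployGrafana: True/deployGrafana: False/' values.yaml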
Now, install Prometheus to the OKE cluster using the changed values.yaml file.
helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring --values values.yaml
After a few moments you can see the newly started pods:
kubectl get po --namespace monitoring
The kube-prometheus service, which exposes the Prometheus data, is defined as ClusterIP by default, which means the data is not available outside of the cluster. This must be changed to allow external access.
The simplest way to do this is to change the service to NodePort, which makes the data accessible on each worker node. However, this is dependent on each worker node's IP address, so if we want to be able to withstand node pool upgrades and node failures we need a more consistent way to access the data. To do this, we edit the kube-prometheus service to use a private OCI load balancer.
kubectl edit svc kube-prometheus -n monitoring
Add a private LB annotation (service.beta.kubernetes.io/oci-load-balancer-internal is the standard OCI annotation for an internal load balancer):
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-internal: "true"
and change the type under spec to LoadBalancer:
  type: LoadBalancer
After a few moments you can check the services by issuing the following:
kubectl get svc -n monitoring
You will see that kube-prometheus now has a type of LoadBalancer with a private IP address (10.0.20.4) in the load balancer subnet.
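To capture just that private IP address, for example to build the Grafana data source URL later, the following works:
kubectl get svc kube-prometheus -n monitoring -o jsonpath='{.status.loadBalancer.ingress[0].ip}'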
Creating a Grafana Server
We will install a Grafana server in the Monitoring compartment/VCN and add the OKE cluster's kube-prometheus service, exposed in the previous steps, as a data source.
There are guides available to do this (Here). However, an OCI custom image was created after following these instructions to make it quicker and easier to get up and running. The following instructions make use of this custom image.
Select Compute in the monitoring compartment.
Select Custom Images.
Click Import Image, enter a name of Grafana, an OS of Linux, an Image Type of OCI, and an object storage URL of https://objectstorage.uk-london-1.oraclecloud.com/p/yNfon08ieUoDnMQRjzArEDR538dF9D6jlkbVK0h3TIg/n/intpaulj/b/Images/o/grafana
This pre-authenticated request will be live until 12/31/2020.
Click Import Image.
After a few minutes the work request will complete.
Return to the compute dashboard and create a new instance in the Monitoring compartment: select an appropriate shape, assign a public IP address, use the custom image option to select the image we imported earlier, and supply a key file to allow SSH access to the instance.
Click create.
When the instance is up and running, make a note of its public IP address. At this point I would recommend running a yum update to make sure the image has the latest fixes.
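For example, over SSH (assuming the image is based on Oracle Linux with the default opc user and runs Grafana as the standard grafana-server service):
# user name and service name are assumptions about the custom image
ssh opc@<grafana-public-ip>
sudo yum -y update
sudo systemctl status grafana-server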
Browse to the instance's public IP address on port 3000 and you will see the Grafana log-in screen.
Log in with user admin and password passw0rd. Please change this after logging in.
You will see a portal with some existing dashboards. These are included as examples using the OCI data source.
Click on Settings and select Add Data Source.
Add the details for the kube-prometheus service defined when we set up Prometheus on the OKE cluster (in this case http://10.0.20.4:9090).
Click Save and test.
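The same data source can also be added through the Grafana HTTP API, which is handy when scripting this for several clusters. A sketch using the default credentials and the private Prometheus address from earlier; the data source name is illustrative:
curl -X POST http://admin:passw0rd@<grafana-public-ip>:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name": "OKE", "type": "prometheus", "url": "http://10.0.20.4:9090", "access": "proxy"}'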
Return to the home screen and select Home in the top Left.
Select Import Dashboard
Enter 10000 as the Grafana.com Dashboard ID and click Load. This will select a sample Kubernetes monitoring dashboard from Grafana.com. You can browse and select different pre-built dashboards, but this is a very popular one.
Then click Import
Select the data source we added above (OKE Toronto).
You will now see a Kubernetes dashboard displaying Prometheus data from your private OKE cluster.
Multiple VCNs
Many installations will have multiple clusters – maybe dev, test, QA etc – which will all need to be monitored. Multiple VCNs can be peered using Local Peering Gateways, but these are a one-to-one relationship, so each VCN-to-VCN peering will need its own dedicated LPG pairing.
The following diagrams show an example of how this could be set up:
Multiple Regions
It is also possible to securely and privately peer VCNs across regions using remote peering. This facility extends the solution to a single Grafana server monitoring multiple OKE clusters in multiple OCI regions. The concept is very similar to the solution above and is shown in the following diagram.
Multiple Node-Pools
Where either performance or security requirements are high, it is also possible to further separate the Prometheus deployment by creating a separate node pool for it to run in. One way of doing this is to use label selectors, as sketched below. This limits the impact of any overhead that running Prometheus incurs, but increases the compute required and therefore the cost of running your cluster.
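As an illustration, the nodes in a dedicated monitoring pool could be labelled and Prometheus constrained to them with a nodeSelector. The label key and value and the node name are assumptions, and the exact values.yaml key depends on the chart version, so check the chart documentation before applying this:
# label each node in the dedicated monitoring node pool (node name is a placeholder)
kubectl label node <monitoring-node-name> pool=monitoring
# add a matching nodeSelector under the Prometheus section of values.yaml, e.g.
#   prometheus:
#     nodeSelector:
#       pool: monitoring
# then upgrade the release so the Prometheus pods are rescheduled onto that pool
helm upgrade kube-prometheus coreos/kube-prometheus --namespace monitoring --values values.yaml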