Tuesday 22 October 2019

Monitoring OKE with Prometheus and Grafana



There are various options available when it comes to monitoring OKE. There are OSS tools such as Grafana and Kibana, and different ways to install the various components - either everything in the cluster itself, or a central dashboard. It is very easy to install something like Prometheus and Grafana into a cluster and expose the Grafana interface for that cluster - there is a guided example of how to do that on Oracle Cloud Native Labs here.
There is also a need to understand what is happening with the underlying OCI resources used by an OKE cluster, and to enable this there is an OCI data source for Grafana. Again, guidance is available on Oracle Cloud Native Labs here.
This is great for individual dev/test or small numbers of clusters, but in reality customers will either already have, or will want, a centralised dashboard to monitor both Kubernetes and OCI resources for multiple clusters, possibly in multiple OCI regions. This post covers the steps required to set up a central Grafana server to monitor OKE clusters running private worker nodes.

Note:

It is assumed that all necessary pre-reqs for running OKE clusters have been completed (see here), and that the audience is familiar with both OCI and OKE concepts and operations.


Overview


The main concept is to provide a central Grafana instance that can monitor multiple OKE clusters running private worker nodes. We will create a private OKE cluster using the OCI Console Quick Create feature, which creates the cluster and all required networking. Next we will create a Monitoring VCN with a regional subnet and a Local Peering Gateway, along with the required routing and security settings.
The next step is to add a Local Peering Gateway in the OKE VCN, connect it to the Monitoring gateway, and set up the extra routing and security settings.
Then we will install Prometheus into this cluster and expose the kube-prometheus service so we can access the Prometheus data from a Grafana server running in the Monitoring compartment.
The final steps are to create a Grafana server, add the kube-prometheus service as a datasource and configure a Dashboard.
The following diagram shows the high-level solution:

 

Note:

For simplicity, we will use an internet gateway in the worked example. The details of setting up VPN access depend on the specific equipment being used on premises. For information on using a VPN connection please see the OCI documentation here.



Step by Step


One of the key elements of this solution is making use of the ability to securely peer Virtual Cloud Networks. This allows OKE worker nodes to run in a private subnet but still be accessible to a Grafana server running in a different dedicated VCN and compartment. The following diagram highlights the key networking elements required in this solution.



For an overview of local peering refer to the OCI documentation here.
   

Pre-Reqs

Ensure that all OKE pre-requisites are in place (see here).

 

Compartments

Create two compartments, one for the OKE cluster and one for the Grafana server. In this example I am using OKE and Monitoring.

 

Create OKE Cluster

Create a private OKE cluster in the OKE compartment using the console Quick Create feature (see here).
Select an appropriate instance shape and number as required/allowed within tenancy limits.




This will set up all required networking and node pools for the cluster.



This will take you to the cluster details screen and show when all resources are up and running.


Make a note of the name of the VCN and the CIDR block (this defaults to 10.0.0.0/16 in quick create)

Create a VCN in the Monitoring Compartment



Click Create Virtual Cloud Network.
Select Create Virtual Cloud Network Only and select a CIDR block of 11.0.0.0/16

Click Create Virtual Cloud Network.


When the network has been created, create a public regional subnet.



Name = Grafana, CIDR Block = 11.0.10.0/24 and select the default Security List
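If you prefer to script these steps rather than use the console, the OCI CLI can create the same resources. A minimal sketch, assuming the CLI is configured and that the OCIDs shown are placeholders for the Monitoring compartment and the new VCN:

# Create the Monitoring VCN and the Grafana regional subnet (values as used above)
oci network vcn create --compartment-id <monitoring-compartment-ocid> --cidr-block 11.0.0.0/16 --display-name Monitoring
oci network subnet create --compartment-id <monitoring-compartment-ocid> --vcn-id <monitoring-vcn-ocid> --cidr-block 11.0.10.0/24 --display-name Grafana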


Create an Internet Gateway to allow access to the Grafana dashboard.


Update the default route table, adding a route rule with destination 0.0.0.0/0 and the internet gateway as the target.




Create a Local Peering Gateway - this will be used to connect to the OKE VCN.





Add a route rule to direct 10.0.0.0/16 traffic to the OKE VCN via the to-oke LPG
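For reference, the equivalent route rule can also be added with the OCI CLI. A sketch, where the route table and to-oke LPG OCIDs are placeholders; note that update replaces the whole rule list, so include any existing rules (such as the internet gateway rule) as well:

# Route 10.0.0.0/16 (the OKE VCN) via the to-oke local peering gateway
oci network route-table update --rt-id <monitoring-route-table-ocid> --route-rules '[{"destination": "10.0.0.0/16", "destinationType": "CIDR_BLOCK", "networkEntityId": "<to-oke-lpg-ocid>"}]'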



Create Egress Rules for the Monitoring VCN



0.0.0.0/0 to route all internet traffic
10.0.0.0/16 on port 9090 to route traffic from Grafana to the Prometheus service in the OKE VCN


Create ingress Rules for the Monitoring VCN



Port 3000 on 0.0.0.0/0 to accept Grafana requests from the internet. (This would be changed to suit the VPN CIDR block in a VPN set-up.)
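The same ingress rule can be expressed as OCI CLI input. A sketch, with the security list OCID as a placeholder; again, update replaces the full rule list, so merge this with the existing rules:

# Allow Grafana (port 3000) from the internet into the Monitoring VCN
oci network security-list update --security-list-id <monitoring-security-list-ocid> --ingress-security-rules '[{"protocol": "6", "source": "0.0.0.0/0", "tcpOptions": {"destinationPortRange": {"min": 3000, "max": 3000}}}]'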


Set-Up Additional Networking in OKE Compartment

Select Networking, Virtual Cloud Networks



Select the VCN created in the OKE compartment by the Quick Create.




Create a Local Peering Gateway



Establish a peering connection to the Monitoring gateway
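This can be done through the console as shown below, or with the OCI CLI. A sketch, where the two placeholder OCIDs are the OKE-side and Monitoring-side local peering gateways:

# Connect the OKE-side LPG to the Monitoring-side LPG
oci network local-peering-gateway connect --local-peering-gateway-id <oke-lpg-ocid> --peer-id <monitoring-lpg-ocid>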





 After a few moments the connection will be established.



Select route tables for the OKE VCN




Add a route for the Monitoring VCN via the local peering gateway from the load balancer subnet.
Destination 11.0.0.0/16, Target to-grafana LPG



Add an ingress security rule for the same subnet to accept traffic from the Grafana server on port 9090 (the Prometheus endpoint).


At this point all required networking to allow access from the Monitoring VCN to the OKE VCN via private local peering gateways has been completed. The next steps are to install Prometheus in the OKE cluster and create a Grafana server in the Monitoring compartment/VCN.


Installing Prometheus

It is assumed that there is a current kubeconfig available (see here) to allow kubectl commands to be run against the OKE cluster.
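If you still need to generate one, the OCI CLI can create it for you. A sketch, with the cluster OCID and region left as placeholders:

# generate a kubeconfig for the private OKE cluster
oci ce cluster create-kubeconfig --cluster-id <oke-cluster-ocid> --file $HOME/.kube/config --region <region>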
There are many examples of how to install Prometheus and Grafana, but this one is based on the Oracle Cloud Native Labs guide to monitoring OKE (see here), with a couple of changes.
Firstly, we do not install Grafana onto the OKE cluster (the lab installs it by default) and secondly, we change the kube-prometheus service to use a private load balancer.
First create a role binding to cluster-admin for your OCI user:

kubectl create clusterrolebinding admin --clusterrole=cluster-admin --user=ocid1.user.oc1..nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

Then follow the lab instructions.
Add the repo for the Prometheus operator:
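As a sketch, assuming the CoreOS chart repository URL that the lab used at the time (it may have moved since):

# register the CoreOS chart repository that hosts prometheus-operator and kube-prometheus
helm repo add coreos https://s3-eu-west-1.amazonaws.com/coreos-kubernetes/charts
helm repo update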


We can skip the steps to install Tiller, as this is selected by default in Quick Create, and go straight to installing the operator into a separate namespace called monitoring:

helm install --namespace monitoring coreos/prometheus-operator

Download a set of values used to install prometheus:



Before installing, edit the values.yaml file so that Grafana is not installed on the OKE cluster, by changing deployGrafana: true to false.
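After the change, the relevant excerpt of values.yaml is just this flag (everything else in the downloaded file stays as it is):

# values.yaml (excerpt) - do not deploy Grafana inside the OKE cluster
deployGrafana: false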


Now, install Prometheus to the OKE cluster using the changed yaml file.

helm install coreos/kube-prometheus --name kube-prometheus --namespace monitoring --values values.yaml

After a few moments you can see the newly started pods:

kubectl get po --namespace monitoring


The kube-prometheus service, which exposes the Prometheus data, is by default defined as ClusterIP; this means that the data is not available outside of the cluster. This must be changed to allow external access.
The simplest way to do this is to change the service to NodePort, which makes the data accessible on each worker node. However, this is dependent on each worker node's IP address, so if we want to be able to withstand node pool upgrades and node failures we need a more consistent way to access the data. To do this we edit the kube-prometheus service to use a private OCI load balancer.

kubectl edit svc kube-prometheus -n monitoring

Add the OCI private load balancer annotation:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/oci-load-balancer-internal: "true"

and change the type to LoadBalancer.

type: LoadBalancer

After a few moments you can check the services by issuing the following:

kubectl get svc -n monitoring

You will see that kube-prometheus has a type of LoadBalancer with a private IP address (10.0.20.4) in the load balancer subnet.





Creating a Grafana Server


We will install a Grafana server in the Monitoring compartment/VCN and create a data source pointing at the OKE cluster Prometheus service we exposed in the previous steps.
There are guides available to do this here. However, an OCI custom image was created after following these instructions to make it quicker and easier to get up and running. The following instructions make use of this custom image.

Select Compute in the monitoring compartment.


Select Custom Images.



Click Import Image and select a name - Grafana, OS of Linux, Image Type of OCI - and an object storage URL of https://objectstorage.uk-london-1.oraclecloud.com/p/yNfon08ieUoDnMQRjzArEDR538dF9D6jlkbVK0h3TIg/n/intpaulj/b/Images/o/grafana
This pre-authenticated request will be live until 12/31/2020.



Click Import Image.


After a few minutes the work request will complete



Return to the compute dashboard and create a new instance in the Monitoring compartment: select an appropriate shape, assign a public IP address, use the custom image option to select the image imported earlier, and provide an SSH key file to allow access to the instance.






Click create.

When the image is up and running, make a note of its public IP address. At this point I would recommend running a yum update to make sure the image has the latest fixes.
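For example, over SSH to the new instance:

# apply the latest OS package updates on the Grafana server
sudo yum update -y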

Open up a browser and navigate to http://your.public.ip.address:3000

You will see the Grafana Log-in screen.


Log in with user admin and password passw0rd. Please change this after logging in.

You will see a portal with some existing dashboards. These are included as examples using the OCI data source.



Click on settings and Add Data source



Add the details for the kube-prometheus service defined when we set up Prometheus on the OKE cluster (in this case http://10.0.20.4:9090).


Click Save and test.
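As an alternative to the UI, Grafana can also provision the data source from a file on the server. A minimal sketch, using the name and private IP address from this example (the file name itself is arbitrary), placed in Grafana's provisioning directory:

# /etc/grafana/provisioning/datasources/oke-toronto.yaml
apiVersion: 1
datasources:
  - name: OKE Toronto
    type: prometheus
    access: proxy
    url: http://10.0.20.4:9090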


Return to the home screen and select Home in the top Left.



Select Import Dashboard


Enter 10000 as the Grafana.com Dashboard ID and click Load. This will select a sample Kubernetes monitoring dashboard from Grafana.com. You can browse and select different pre-built dashboards, but this is a very popular one.




Then click Import



Select the data source we added above (OKE Toronto).



You will now see a Kubernetes dashboard displaying Prometheus data from your Private OKE cluster.



Multiple VCNs


Many installations will have multiple clusters – maybe dev, test, QA etc – which will all need to be monitored. Multiple VCNs can be peered using Local Peering Gateways, but peering is a one-to-one relationship, so each VCN-to-VCN peering will need its own dedicated LPG pairing.

The following diagrams show an example of how this could be set up:





Multiple Regions


It is also possible to securely and privately peer VCNs across regions using remote peering (Remote Peering Connections on Dynamic Routing Gateways). This can extend the solution so that a single Grafana server monitors multiple OKE clusters in multiple OCI regions. The concept is very similar to the solution above and is shown in the following diagram.





Please refer to the OCI documentation for more information on remote peering here.


Multiple Node-Pools


Where either performance or security requirements are high, it is also possible to further separate the Prometheus deployment by running it in a dedicated node pool. One way of doing this is to use label selectors. This isolates the overhead of running Prometheus from your application workloads, but increases the compute required and therefore the cost of running your cluster.
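As a sketch of the label selector approach: label the workers in the dedicated node pool and give the Prometheus pods a matching node selector. The label key and value here are just examples, and the exact values.yaml key for setting this depends on the chart version being used:

# label each worker node in the dedicated pool
kubectl label node <worker-node-name> nodepool=monitoring

# excerpt of the Prometheus pod spec - schedule only onto the labelled pool
nodeSelector:
  nodepool: monitoring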