Overview
Rook is an open-source storage orchestrator for Kubernetes that runs Ceph to provide block, object, and shared file storage. It is only supported on Kubernetes 1.7 or higher.
...
Source code is also available here: https://github.com/rook/rook
Prerequisites
Minimum Version: Kubernetes v1.7 or higher is supported by Rook.
...
You will also need to set up RBAC, and ensure that the Flex volume plugin has been configured.
Set the dataDirHostPath
If you are using dataDirHostPath to persist Rook data on Kubernetes hosts, make sure your host has at least 5GB of space available on the specified path.
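A quick pre-flight check for this space requirement can be sketched in shell. This is only a sketch and not part of Rook: the helper name has_free_gb is made up for illustration, it treats the 5GB figure as GiB, and the /var/lib/rook path in the usage comment is an assumption based on the directory examined later on this page. Substitute whatever dataDirHostPath you actually configured.

```shell
# Hypothetical helper (not part of Rook): succeed if the filesystem that
# holds path $1 has at least $2 GiB available, according to df.
has_free_gb() {
  path="$1"
  need_gb="$2"
  # -P forces single-line POSIX output; -k reports sizes in 1 KiB blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example usage on a storage node (the path is an assumption):
#   has_free_gb /var/lib/rook 5 || echo "not enough space for Rook data"
```

Running this on each node before deploying the cluster is cheaper than debugging a monitor that ran out of disk afterwards.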
Setting up RBAC
On Kubernetes 1.7+, you will need to configure Rook to use RBAC appropriately.
See https://rook.github.io/docs/rook/master/rbac.html
Flex Volume Configuration
The Rook agent requires setup as a Flex volume plugin to manage the storage attachments in your cluster. See the Flex Volume Configuration topic to configure your Kubernetes deployment to load the Rook volume plugin.
Getting Started
Now that we've examined each of the pieces, let's zoom out and see what we can do with the whole cluster.
For the quickest quick start, check out the Rook QuickStart guide: https://rook.github.io/docs/rook/master/quickstart.html
Getting Started without an Existing Kubernetes cluster
The easiest way to deploy a new Kubernetes cluster with Rook support on OpenStack (Nebula / SDSC) is to use the https://github.com/nds-org/kubeadm-terraform repository.
This may work for other cloud providers as well, but has not yet been thoroughly tested.
Getting Started on an Existing Kubernetes cluster
If you’re feeling lucky, a simple Rook cluster can be created with the following kubectl commands. For the more detailed install, skip to the next section to deploy the Rook operator.
...
For a more detailed look at the deployment process, see below.
Deploy the Rook Operator
The first step is to deploy the Rook system components, which include the Rook agent running on each node in your cluster as well as the Rook operator pod.
...
You can also deploy the operator with the Rook Helm Chart.
Restart Kubelet (Kubernetes 1.7.x only)
For versions of Kubernetes prior to 1.8, the Kubelet process on all nodes will require a restart after the Rook operator and Rook agents have been deployed. As part of their initial setup, the Rook agents deploy and configure a Flexvolume plugin in order to integrate with Kubernetes’ volume controller framework. In Kubernetes v1.8+, the dynamic Flexvolume plugin discovery will find and initialize our plugin, but in older versions of Kubernetes a manual restart of the Kubelet will be required.
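Whether a node needs the manual restart depends only on the Kubernetes minor version, so the decision is easy to script. The following is a minimal sketch: the helper name needs_kubelet_restart is hypothetical, and it expects a version string in the usual vMAJOR.MINOR.PATCH form (for example v1.7.11). The restart command in the usage comment assumes the Kubelet is managed by systemd, which may not hold on every deployment.

```shell
# Hypothetical helper: succeed if the given Kubernetes version predates 1.8,
# i.e. it lacks dynamic Flexvolume plugin discovery.
needs_kubelet_restart() {
  ver="${1#v}"                  # drop a leading "v"
  major="${ver%%.*}"            # text before the first dot
  rest="${ver#*.}"              # text after the first dot
  minor="${rest%%[!0-9]*}"      # leading digits of the remainder
  [ "$major" -lt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -lt 8 ]; }
}

# Example usage on a node (assumes a systemd-managed Kubelet):
#   needs_kubelet_restart v1.7.11 && sudo systemctl restart kubelet
```

Comparing the major and minor parts numerically matters here: a plain string comparison would sort v1.10 before v1.8.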
Create a Rook Cluster
Now that the Rook operator and agent pods are running, we can create the Rook cluster. For the cluster to survive reboots, make sure you set the dataDirHostPath property. For more settings, see the documentation on configuring the cluster.
...
    $ kubectl -n rook get pod
    NAME                              READY     STATUS    RESTARTS   AGE
    rook-ceph-mgr0-1279756402-wc4vt   1/1       Running   0          5m
    rook-ceph-mon0-jflt5              1/1       Running   0          6m
    rook-ceph-mon1-wkc8p              1/1       Running   0          6m
    rook-ceph-mon2-p31dj              1/1       Running   0          6m
    rook-ceph-osd-0h6nb               1/1       Running   0          5m
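To check this state from a script rather than by eye, the pod listing can be piped through a small filter. This is a sketch under assumptions: the helper name all_pods_running is made up, and it relies on the default column layout of kubectl get pod, with a header row and STATUS as the third column.

```shell
# Hypothetical helper: read "kubectl get pod" output on stdin and succeed
# only if every listed pod reports a STATUS of Running.
all_pods_running() {
  # Skip the header row (NR == 1); flag any row whose third column differs.
  awk 'NR > 1 && $3 != "Running" { bad = 1 } END { exit bad }'
}

# Example usage:
#   kubectl -n rook get pod | all_pods_running && echo "Rook cluster is up"
```

Note that an empty listing also passes this check, so it is best used after you have confirmed the pods exist at all.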
Monitoring Your Rook Cluster
A glimpse into setting up Prometheus for monitoring Rook: https://rook.github.io/docs/rook/master/monitoring.html
Advanced Configuration
Advanced Configuration options are also documented here: https://rook.github.io/docs/rook/master/advanced-configuration.html
- Log Collection
- OSD Information
- Separate Storage Groups
- Configuring Pools
- Custom ceph.conf Settings
- OSD CRUSH Settings
- Phantom OSD Removal
Debugging
For common issues, see https://github.com/rook/rook/blob/master/Documentation/common-issues.md
For more help debugging, see https://github.com/rook/rook/blob/master/Documentation/toolbox.md
Cluster Teardown
See https://rook.github.io/docs/rook/master/teardown.html for thorough steps on destroying / cleaning up your Rook cluster
Components
Rook runs a number of smaller microservices that run on different nodes in your Kubernetes cluster:
- The Rook Operator + API
- Ceph Managers / Monitors / OSDs
- Rook Agents
The Rook Operator
The Rook operator is a simple container that has all that is needed to bootstrap and monitor the storage cluster.
...
The Rook operator also creates the Rook agents as a daemonset, which runs a pod on each node.
Ceph Managers / Monitors / OSDs
The operator will start and monitor Ceph monitor pods and a daemonset for the OSDs, which provide basic Reliable Autonomic Distributed Object Store (RADOS) storage.
...
Ceph monitors (aka "Ceph mons") will be started or failed over when necessary, and other adjustments are made as the cluster grows or shrinks.
Rook Agents
Each agent is a pod deployed on a different Kubernetes node, which configures a Flexvolume plugin that integrates with Kubernetes’ volume controller framework.
The agent handles all storage operations required on the node, such as attaching network storage devices, mounting volumes, and formatting the filesystem.
Storage
Rook provides three types of storage to the Kubernetes cluster:
- Block Storage: Mount storage to a single pod
- Object Storage: Expose an S3 API to the storage cluster for applications to put and get data that is accessible from inside or outside the Kubernetes cluster
- Shared File System: Mount a file system that can be shared across multiple pods
Custom Resource Definitions
Rook also allows you to create and manage your storage cluster through custom resource definitions (CRDs). Each type of resource has its own CRD defined.
- Cluster: A Rook cluster provides the basis of the storage platform to serve block storage, object stores, and shared file systems.
- Pool: A pool manages the backing store for a block store. Pools are also used internally by object and file stores.
- Object Store: An object store exposes storage with an S3-compatible interface.
- File System: A file system provides shared storage for multiple Kubernetes pods.
Shared Storage Example
Shamelessly stolen from https://rook.github.io/docs/rook/master/filesystem.html
Prerequisites
This guide assumes you have created a Rook cluster as explained in the main Kubernetes guide
Multiple File Systems Not Supported
By default only one shared file system can be created with Rook. Multiple file system support in Ceph is still considered experimental and can be enabled with the environment variable ROOK_ALLOW_MULTIPLE_FILESYSTEMS defined in rook-operator.yaml.
Please refer to cephfs experimental features page for more information.
Create the File System
Create the file system by specifying the desired settings for the metadata pool, data pools, and metadata server in the Filesystem CRD. In this example we create the metadata pool with replication of three and a single data pool with erasure coding. For more options, see the documentation on creating shared file systems.
...
    $ ceph status
      ...
      services:
        mds: myfs-1/1/1 up {[myfs:0]=mzw58b=up:active}, 1 up:standby-replay
Consume the Shared File System: Busybox + NGINX Example
As an example, we will start the kube-registry pod with the shared file system as the backing store. Save the following spec as kube-registry.yaml:
...
NOTE: I had to explicitly specify clusterName in the YAML above. Newer versions of Rook will fall back to clusterNamespace.
Kernel Version Requirement
If the Rook cluster has more than one filesystem and the application pod is scheduled to a node with kernel version older than 4.7, inconsistent results may arise since kernels older than 4.7 do not support specifying filesystem namespaces.
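Checking a node against this kernel requirement can be sketched as follows. The helper name kernel_at_least is hypothetical; it compares the major and minor numbers of a release string such as the one printed by uname -r, and defaults to the local kernel when no release string is passed.

```shell
# Hypothetical helper: succeed if kernel release $3 (default: this host's
# "uname -r") is at least version $1.$2.
kernel_at_least() {
  want_major="$1"
  want_minor="$2"
  rel="${3:-$(uname -r)}"
  major="${rel%%.*}"            # text before the first dot
  rest="${rel#*.}"              # text after the first dot
  minor="${rest%%[!0-9]*}"      # leading digits of the remainder
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Example usage on a node:
#   kernel_at_least 4 7 || echo "kernel too old for multiple filesystems"
```

With a single filesystem in the cluster the check is moot, since the limitation only applies when more than one exists.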
Testing Shared Storage
After creating our above example, we should now have 2 pods each with 2 containers running on 2 separate nodes:
...
You have just set up your first shared filesystem under Rook!
Under the Hood
For more information on the low-level processes involved in the above example, see https://github.com/rook/rook/blob/master/design/filesystem.md
...
The directories section is supposed to list the paths that will be included in the storage cluster. (Note that using two directories on the same physical device can cause a negative performance impact.)
Investigating Storage directories
Checking the logs for one of the Rook agents, we can see a success message that shows where the data really lives:
...
Obviously this is not where we want the shared filesystem data stored long-term, so I'll need to figure out why these files are persisted into /var/lib/kubelet and not into the directories specified in the Cluster configuration.
Digging Deeper into dataDirHostPath
Checking the /var/lib/rook directory, we see a few sub-directories:
...
As you can see, these metadata files do not appear to be readable on disk and would likely need to be un-mangled by Rook to properly perform a backup.
Checking the kubelet logs...
Digging into the systemctl logs for kubelet, we can see it's complaining about the volume configuration:
...
Sadly, even setting this value explicitly did not fix my immediate issue.
Hacking Terraform
At this point, I decided to start hacking the Terraform deployment to get Rook working to the level we'll need for Workbench.
...
- Rook has been upgraded from v0.6.2 to v0.7.1, in the helm install and in rook-cluster.yaml
- Expanded the storage section to include a nodes subsection, which specifies which machines / directories should be part of the storage cluster
- Turned off useAllNodes
Checking the rook-operator logs...
Now, with the new version of rook up and running, I attempted to make a filesystem as before. This time, however, no pods were spawned following my filesystem's creation.
...
Changing /vol_b to /volb solved this problem. This must be adjusted both in the deploy-rook.sh script above and in the bootstrap-rook.sh script alongside it.
Now we're getting somewhere...
After changing the volume path and redeploying (again), myfs pods were now being spawned after creating the filesystem in Kubernetes, as they should be:
...
- Upgrading to Rook 0.7.1
- The adjustments to the directories configuration in rook-cluster.yaml are now writing data to the correct drive (/volb), but that drive may be improperly formatted for use with Rook
Narrowing it down...
I adjusted my rook-cluster.yaml to include only the changes to the directories configuration, and to use storeType: filestore instead of bluestore.
...
Confirmed by this GitHub issue: https://github.com/rook/rook/issues/1604
Back to bluestore...
Switching back to storeType: bluestore on Rook v0.6.2 with the correct nodes/directories configuration:
...
I have noticed that clusters with pods in an error state such as this one will fail to terraform destroy (the operation never completes, even after waiting 15+ minutes).
Resolution
After poring over the docs and GitHub issues and tediously reading the source code, we found a concerning comment in a GitHub issue: https://github.com/rook/rook/issues/1220#issuecomment-343342515
...
    apiVersion: rook.io/v1alpha1
    kind: Filesystem
    metadata:
      name: myfs
      namespace: rook
    spec:
      metadataPool:
        replicated:
          size: 2
      dataPools:
        - erasureCoded:
            dataChunks: 2
            codingChunks: 1
      metadataServer:
        activeCount: 1
        activeStandby: true
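As a rough guide to what the pool settings in this spec cost in raw capacity: a replicated pool of size N stores N full copies of the data, while an erasure-coded pool with k data chunks and m coding chunks stores k+m chunks for every k chunks of data. The sketch below is just that arithmetic in shell. The function names are made up for illustration, and nothing here talks to Rook or Ceph.

```shell
# Hypothetical helpers: percentage of raw capacity left for user data.

# Replicated pool: every object is stored "size" times.
replicated_usable_percent() {
  size="$1"
  echo $((100 / size))
}

# Erasure-coded pool: data_chunks of every (data_chunks + coding_chunks)
# chunks written hold user data.
ec_usable_percent() {
  data_chunks="$1"
  coding_chunks="$2"
  echo $((100 * data_chunks / (data_chunks + coding_chunks)))
}

replicated_usable_percent 2   # metadataPool, size 2: prints 50
ec_usable_percent 2 1         # dataPools, 2 data + 1 coding: prints 66
```

The results use integer division, so 66 stands for roughly two thirds.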
Recovering from backup
This feature is currently in the planning stages: https://github.com/rook/rook/issues/1552
Unofficial Python script for creating / restoring backups from Rook: https://gitlab.com/costrouc/kubernetes-rook-backup
Edge Cases and Quirks
There are many pitfalls here, particularly around what I perceive as the fragility of the shared filesystem.
DO NOT delete the filesystem before shutting down all of the pods consuming it
Deleting the shared filesystem out from under the pod will confuse the kubelet, and prevent it from being able to properly unmount and terminate your containers.
...
This will hopefully be improved in later versions of Kubernetes (1.9+?)
You must follow these cleanup steps before terraform destroy will work
Expanding on the above topic, terraform destroy will hang on destroying your cluster if you fail to clean up your filesystems properly:
...
WARNING: this seems like a very tenuous/tedious process... I am hoping that later versions of terraform/rook will improve the stability of cleanup under these scenarios. Perhaps we can expand their cleanup to first drain all nodes of their running pods (if this is not already the case), although this would not fix the case of a user deleting the filesystem before a running pod that is consuming it - in this case, the pods will fail to terminate indefinitely, which I think is what is leading terraform to fail.
Kill (or hide) a hanging pod
There are a few ways to kill a pod with fire:
...
Only use this as a last resort on test clusters, and NEVER use --grace-period=0 on a production cluster.
Cleaning up failed runs of terraform destroy
Here is a quick checklist of the items that you will need to clean up manually if you are unable to terraform destroy your cluster:
...