Storage System Setup: Ceph Cluster

I. Introduction to Ceph

NFS-based storage is only good enough for development and testing; it must never be used in production, so we choose Ceph.

1. Ceph overview

2. Choosing how to install the Ceph cluster

Official installation docs: https://docs.ceph.com/en/pacific/install/#

  • Cephadm install (recommended): deploys Ceph directly on the hosts;

  • Rook install (recommended): use Rook to deploy a system that manages the Ceph cluster for you, all in one go. Rook website: https://rook.io/

  • Other methods (manual install, not recommended)

3. Rook overview: https://www.rook.io/docs/rook/v1.7/quickstart.html

  • Rook: a storage orchestration system;

  • K8s: a container orchestration system;

How Rook works:

[figure: how Rook works]

Rook's architecture:

[figure: Rook architecture]


II. Installing Rook + Ceph

1. Check the prerequisites: hardware requirements

  • Raw devices (no partitions or formatted filesystems)

  • Raw partitions (no formatted filesystem)

In other words, do not hand Ceph disks the system is already using; ideally provide a few clean disks, because Ceph lays down its own storage format on the devices (Rook provisions the OSDs on top of LVM2 rather than on a regular filesystem).

fdisk -l
# Find the disk(s) you attached, e.g.:
# Disk /dev/vdc: 107.4 GB, 107374182400 bytes, 209715200 sectors

# Check which block devices qualify
lsblk -f
## Output:
vda
└─vda1 xfs               9cff3d69-3769-4ad9-8460-9c54050583f9 /
vdb    swap   YUNIFYSWAP 48eb1df6-1663-4a52-ab30-040d552c2d76
vdc    # no filesystem at all: a clean disk, usable

# If the disk has been used before, zero it out (cloud providers wipe disks this way)
dd if=/dev/zero of=/dev/vdc bs=1M status=progress
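
Zeroing the whole device works but is slow. If the disk only carries leftover partition tables or filesystem/LVM signatures, the following is a quicker cleanup sketch (assuming /dev/vdc is the disk you intend to hand to Ceph, and that gdisk and util-linux are installed):

# Destructive! Double-check the device name before running these commands
sgdisk --zap-all /dev/vdc    # wipe GPT/MBR partition tables
wipefs --all /dev/vdc        # clear filesystem, swap and LVM signatures
lsblk -f /dev/vdc            # verify the disk now shows no filesystem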

2. Clone the Rook git repository

git clone --single-branch --branch release-1.7 https://github.com/rook/rook.git
cd rook/cluster/examples/kubernetes/ceph

3. Edit operator.yaml:

Replace the default images in the file with my mirrored copies:

## First, replace the rook/ceph image with our own mirror:
rook/ceph:v1.6.3   ->   registry.cn-hangzhou.aliyuncs.com/jgqk8s/rook-ceph:v1.6.3

## It is also recommended to change the following CSI images in operator.yaml:
ROOK_CSI_CEPH_IMAGE: "registry.cn-hangzhou.aliyuncs.com/jgqk8s/cephcsi:v3.3.1"
ROOK_CSI_REGISTRAR_IMAGE: "registry.cn-hangzhou.aliyuncs.com/jgqk8s/csi-node-driver-registrar:v2.0.1"
ROOK_CSI_RESIZER_IMAGE: "registry.cn-hangzhou.aliyuncs.com/jgqk8s/csi-resizer:v1.0.1"
ROOK_CSI_PROVISIONER_IMAGE: "registry.cn-hangzhou.aliyuncs.com/jgqk8s/csi-provisioner:v2.0.4"
ROOK_CSI_SNAPSHOTTER_IMAGE: "registry.cn-hangzhou.aliyuncs.com/jgqk8s/csi-snapshotter:v4.0.0"
ROOK_CSI_ATTACHER_IMAGE: "registry.cn-hangzhou.aliyuncs.com/jgqk8s/csi-attacher:v3.0.2"
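
If you prefer not to edit the file by hand, the image swap can be scripted; a sed sketch (the v1.6.3 tag is the one shown above, so adjust it if your copy of operator.yaml pins a different version):

# Replace the default rook/ceph image with the Aliyun mirror, in place
sed -i 's#rook/ceph:v1.6.3#registry.cn-hangzhou.aliyuncs.com/jgqk8s/rook-ceph:v1.6.3#g' operator.yaml
# Double-check which images the file references afterwards
grep -E 'image:|_IMAGE' operator.yaml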

4. Install the operator:

kubectl create -f crds.yaml -f common.yaml -f operator.yaml
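
Before continuing, confirm the operator pod is up (plain kubectl; app=rook-ceph-operator is the label used by the upstream Rook manifests):

kubectl -n rook-ceph get pod -l app=rook-ceph-operator
# Proceed once rook-ceph-operator reports STATUS Running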

5. Edit the cluster settings in cluster.yaml:

By default the cluster uses every node and every device. If we do not want to use all nodes, we need to make the following changes:

  storage: # cluster level storage configuration and selection
    useAllNodes: false
    useAllDevices: false
    config:
      osdsPerDevice: "3" # number of OSDs per device
    nodes:
    - name: "k8s-node3"
      devices: 
      - name: "vdc"
    - name: "k8s-node1"
      devices: 
      - name: "vdc"
    - name: "k8s-node2"
      devices: 
      - name: "vdc"

Also, replace the following Ceph image:

ceph/ceph:v15.2.11   ->   registry.cn-hangzhou.aliyuncs.com/jgqk8s/ceph-ceph:v15.2.11

6. Deploy the cluster:

kubectl apply -f cluster.yaml
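
The rollout takes a while; you can watch the pods come up with plain kubectl:

# Watch the rook-ceph namespace until the mon, mgr and osd pods are all Running
kubectl -n rook-ceph get pod -o wide -w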

After a long wait, a successful deployment has the following pods:

NAME                                               READY   STATUS      RESTARTS   AGE    IP             NODE     
csi-cephfsplugin-74xkg                             3/3     Running     0          125m   10.233.96.16   node2    
csi-cephfsplugin-98xmt                             3/3     Running     0          125m   10.233.90.14   node1    
csi-cephfsplugin-provisioner-5d498c4bdd-vq2pc      6/6     Running     0          125m   10.233.90.15   node1    
csi-cephfsplugin-provisioner-5d498c4bdd-x97g7      6/6     Running     0          125m   10.233.96.17   node2    
csi-cephfsplugin-zknsc                             3/3     Running     0          118m   10.233.70.14   master   
csi-rbdplugin-provisioner-7bd657db4c-6cl6r         6/6     Running     0          125m   10.233.96.15   node2    
csi-rbdplugin-provisioner-7bd657db4c-tg4d2         6/6     Running     0          125m   10.233.90.13   node1    
csi-rbdplugin-skx2m                                3/3     Running     0          118m   10.233.70.12   master   
csi-rbdplugin-t86nx                                3/3     Running     0          125m   10.233.96.14   node2    
csi-rbdplugin-vww89                                3/3     Running     0          125m   10.233.90.12   node1    
rook-ceph-crashcollector-master-58b49b5db6-m9m5v   1/1     Running     0          11m    10.233.70.21   master   
rook-ceph-crashcollector-node1-5965f9db96-ncgm8    1/1     Running     0          12m    10.233.90.28   node1    
rook-ceph-crashcollector-node2-7d67bb865-sh2ch     1/1     Running     0          11m    10.233.96.39   node2    
rook-ceph-mgr-a-7bd599b466-bgkqg                   1/1     Running     0          12m    10.233.96.35   node2    
rook-ceph-mon-a-6ff5f6b6cb-99mcz                   1/1     Running     0          18m    10.233.90.25   node1    
rook-ceph-mon-c-5cf8b995b5-pxssc                   1/1     Running     0          14m    10.233.96.33   node2    
rook-ceph-mon-d-6cd9c57459-pr5dp                   1/1     Running     0          11m    10.233.70.22   master   
rook-ceph-operator-5bbbb569df-5nbh9                1/1     Running     0          150m   10.233.96.12   node2    
rook-ceph-osd-0-64b57fd54c-trmt5                   1/1     Running     0          11m    10.233.90.29   node1    
rook-ceph-osd-1-6b7568d5c7-bzt9v                   1/1     Running     0          11m    10.233.96.38   node2    
rook-ceph-osd-prepare-node1-2vhbd                  0/1     Completed   0          10m    10.233.90.33   node1    
rook-ceph-osd-prepare-node2-646sj                  0/1     Completed   0          10m    10.233.96.41   node2

7. Make sure the mgr-dashboard Service is reachable

[root@master ceph]# kubectl get svc -n rook-ceph
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
csi-cephfsplugin-metrics   ClusterIP   10.233.24.242   <none>        8080/TCP,8081/TCP   129m
csi-rbdplugin-metrics      ClusterIP   10.233.1.189    <none>        8080/TCP,8081/TCP   129m
rook-ceph-mgr              ClusterIP   10.233.57.190   <none>        9283/TCP            15m
rook-ceph-mgr-dashboard    ClusterIP   10.233.59.74    <none>        8443/TCP            15m
rook-ceph-mon-a            ClusterIP   10.233.2.171    <none>        6789/TCP,3300/TCP   22m
rook-ceph-mon-c            ClusterIP   10.233.51.168   <none>        6789/TCP,3300/TCP   18m
rook-ceph-mon-d            ClusterIP   10.233.58.171   <none>        6789/TCP,3300/TCP   14m

curl the corresponding IP and port:

[root@master ceph]# curl -k https://10.233.59.74:8443
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>Ceph</title>
  <base href="/">

  <script>
    window['base-href'] = window.location.pathname;
  </script>
  ...truncated...

This means the management Dashboard is up and reachable.

Pitfall: if we deploy multiple mgr instances, they run in active/standby mode rather than as a load-balanced cluster, so only the active one answers. We can expose the active mgr ourselves with a dedicated Service: mgr.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    app: rook-ceph-mgr
    ceph_daemon_id: a
    rook_cluster: rook-ceph
  name: rook-ceph-mgr-dashboard-active
  namespace: rook-ceph
spec:
  ports:
  - name: dashboard
    port: 8443
    protocol: TCP
    targetPort: 8443
  selector:  # which Pods this Service selects
    app: rook-ceph-mgr
    ceph_daemon_id: a
    rook_cluster: rook-ceph
  sessionAffinity: None
  type: NodePort

After applying it, the mgr Service can be reached with curl.
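
Applying the Service and finding its NodePort looks roughly like this (the node IP and the assigned port will differ in your environment):

kubectl apply -f mgr.yaml
kubectl -n rook-ceph get svc rook-ceph-mgr-dashboard-active
# Then hit the NodePort listed in the PORT(S) column from any node, e.g.:
# curl -k https://<node-ip>:<node-port>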

8. Create an Ingress for the mgr Dashboard

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ceph-rook-dash
  namespace: rook-ceph
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/server-snippet: |
      proxy_ssl_verify off;
spec:
  # tls:  no need to configure certificate info in every namespace
  # - hosts:
  #     - itdachang.com
  #     - (future hosts)
  #   secretName: testsecret-tls
  rules:
  - host: ceph.jiguiquan.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: rook-ceph-mgr-dashboard
            port:
              number: 8443

After that, we can reach the Ceph admin dashboard at the domain ceph.jiguiquan.com.
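
For the domain to work, apply the Ingress and make sure ceph.jiguiquan.com resolves to your ingress-nginx controller; a sketch (the file name ingress.yaml and the controller address are assumptions about your environment):

kubectl apply -f ingress.yaml
kubectl -n rook-ceph get ingress ceph-rook-dash
# Without a public DNS record, point the name at the ingress controller locally:
# echo "<ingress-controller-ip>  ceph.jiguiquan.com" >> /etc/hosts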

[screenshot: Ceph dashboard]


III. Using Ceph

The Ceph dashboard's admin account is admin. For convenience, I switched back to exposing the Dashboard as a NodePort.

1. Get the default dashboard password:

[root@master ceph]# kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode && echo
JXr"@`8)_?]7%GwP,(C^

The dashboard looks like this:

[screenshot: Ceph dashboard login]

After logging in, change the password via the menu in the top-right corner.

2. Create a storage pool + StorageClass (block-ceph.yaml): a StorageClass for block storage

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host  # failure domain: host or osd
  replicated:
    size: 2  # number of data replicas
---
apiVersion: storage.k8s.io/v1
kind: StorageClass  # the storage driver
metadata:
   name: rook-ceph-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
    # clusterID is the namespace where the rook cluster is running
    clusterID: rook-ceph
    # Ceph pool into which the RBD image shall be created
    pool: replicapool

    # (optional) mapOptions is a comma-separated list of map options.
    # For krbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
    # For nbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
    # mapOptions: lock_on_read,queue_depth=1024

    # (optional) unmapOptions is a comma-separated list of unmap options.
    # For krbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd/#kernel-rbd-krbd-options
    # For nbd options refer
    # https://docs.ceph.com/docs/master/man/8/rbd-nbd/#options
    # unmapOptions: force

    # RBD image format. Defaults to "2".
    imageFormat: "2"

    # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
    imageFeatures: layering

    # The secrets contain Ceph admin credentials.
    csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
    csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
    csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
    csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
    csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

    # Specify the filesystem type of the volume. If not specified, csi-provisioner
    # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
    # in hyperconverged settings where the volume is mounted on the same node as the osds.
    csi.storage.k8s.io/fstype: ext4

# Delete the rbd volume when a PVC is deleted
reclaimPolicy: Delete
allowVolumeExpansion: true
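
Apply the manifest and check that the pool and the StorageClass were created (CephBlockPool is the kind defined in crds.yaml and used in the manifest above):

kubectl apply -f block-ceph.yaml
kubectl -n rook-ceph get cephblockpool replicapool
kubectl get sc rook-ceph-block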

3. Create a storage pool + StorageClass (cephfs-ceph.yaml): a StorageClass for shared file storage

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph # namespace:cluster
spec:
  # The metadata pool spec. Must use replication.
  metadataPool:
    replicated:
      size: 2
      requireSafeReplicaSize: true
    parameters:
      # Inline compression mode for the data pool
      # Further reference: https://docs.ceph.com/docs/nautilus/rados/configuration/bluestore-config-ref/#inline-compression
      compression_mode:
        none
        # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
      # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
      #target_size_ratio: ".5"
  # The list of data pool specs. Can use replication or erasure coding.
  dataPools:
    - failureDomain: host
      replicated:
        size: 2
        # Disallow setting pool with replica 1, this could lead to data loss without recovery.
        # Make sure you're *ABSOLUTELY CERTAIN* that is what you want
        requireSafeReplicaSize: true
      parameters:
        # Inline compression mode for the data pool
        # Further reference: https://docs.ceph.com/docs/nautilus/rados/configuration/bluestore-config-ref/#inline-compression
        compression_mode:
          none
          # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
        # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
        #target_size_ratio: ".5"
  # Whether to preserve filesystem after CephFilesystem CRD deletion
  preserveFilesystemOnDelete: true
  # The metadata service (mds) configuration
  metadataServer:
    # The number of active MDS instances
    activeCount: 1
    # Whether each active MDS instance will have an active standby with a warm metadata cache for faster failover.
    # If false, standbys will be available, but will not have a warm cache.
    activeStandby: true
    # The affinity rules to apply to the mds deployment
    placement:
      #  nodeAffinity:
      #    requiredDuringSchedulingIgnoredDuringExecution:
      #      nodeSelectorTerms:
      #      - matchExpressions:
      #        - key: role
      #          operator: In
      #          values:
      #          - mds-node
      #  topologySpreadConstraints:
      #  tolerations:
      #  - key: mds-node
      #    operator: Exists
      #  podAffinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - rook-ceph-mds
            # topologyKey: kubernetes.io/hostname will place MDS across different hosts
            topologyKey: kubernetes.io/hostname
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-mds
              # topologyKey: */zone can be used to spread MDS across different AZ
              # Use <topologyKey: failure-domain.beta.kubernetes.io/zone> in k8s cluster if your cluster is v1.16 or lower
              # Use <topologyKey: topology.kubernetes.io/zone>  in k8s cluster is v1.17 or upper
              topologyKey: topology.kubernetes.io/zone
    # A key/value list of annotations
    annotations:
    #  key: value
    # A key/value list of labels
    labels:
    #  key: value
    resources:
    # The requests and limits set here, allow the filesystem MDS Pod(s) to use half of one CPU core and 1 gigabyte of memory
    #  limits:
    #    cpu: "500m"
    #    memory: "1024Mi"
    #  requests:
    #    cpu: "500m"
    #    memory: "1024Mi"
    # priorityClassName: my-priority-class
  mirroring:
    enabled: false
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
#  annotations:
#    storageclass.kubernetes.io/is-default-class: "true"
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  # clusterID is the namespace where operator is deployed.
  clusterID: rook-ceph

  # CephFS filesystem name into which the volume shall be created
  fsName: myfs
  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: myfs-data0
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

reclaimPolicy: Delete
allowVolumeExpansion: true
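
Apply it and confirm that the filesystem and its MDS pods come up (app=rook-ceph-mds is the label referenced by the podAntiAffinity rules above):

kubectl apply -f cephfs-ceph.yaml
kubectl -n rook-ceph get cephfilesystem myfs
kubectl -n rook-ceph get pod -l app=rook-ceph-mds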

Once both StorageClasses above are created, the pools show up in the dashboard:

[screenshot: storage pools listed in the Ceph dashboard]

4. Once the StorageClasses are ready, we can list them:

[root@master ceph]# kubectl get sc -A
NAME              PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local (default)   openebs.io/local                Delete          WaitForFirstConsumer   false                  24h
rook-ceph-block   rook-ceph.rbd.csi.ceph.com      Delete          Immediate              true                   22m
rook-cephfs       rook-ceph.cephfs.csi.ceph.com   Delete          Immediate              true                   3m26s

From now on, when requesting a PVC we only need to specify the corresponding StorageClass, as in the sketch below.
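
A minimal sketch of both kinds of claims (the names and sizes are just examples): a block volume backed by rook-ceph-block and a shared volume backed by rook-cephfs.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-rbd-pvc              # example name
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce               # RBD block volumes are single-node read-write
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-cephfs-pvc           # example name
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany               # CephFS volumes can be shared across pods and nodes
  resources:
    requests:
      storage: 5Gi

After kubectl apply, kubectl get pvc should show both claims as Bound; the matching PVs are created automatically because both provisioners are dynamic.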
