Skip to content

Commit

Permalink
Address review comments
Browse files Browse the repository at this point in the history
  • Loading branch information
xing-yang committed Sep 30, 2020
1 parent ca16599 commit f09401f
Showing 1 changed file with 40 additions and 19 deletions.
59 changes: 40 additions & 19 deletions keps/sig-storage/177-volume-snapshot/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -265,10 +265,10 @@ The following changes are either implemented or to be implemented in preparation

### Additional Changes Planned

* Add metrics support in the snapshot controller (metrics is already added to the external-snapshotter sidecar).
* snapshot_operation_total_seconds
* snapshot_operation_count
* snapshot_csi_total_seconds
* Add metrics support in the snapshot controller. Metrics is already added to the external-snapshotter sidecar.
* snapshot_operation_total_seconds (Snapshot operation end to end duration in number of seconds. Reported from the snapshot controller.)
* snapshot_operation_count (Total number of operations conducted by the snapshot controller with state changes. Includes an error code to indicate success/failure. Reported from the snapshot controller.)
* snapshot_csi_total_seconds (End to end duration of each operation which invokes CSI call in seconds. Reported from CSI external-snapshotter sidecard.)
* Tighten the CRD schema validation to enforce immutability. See details in "keps/sig-storage/1900-volume-snapshot-validation-webhook".

## Production Readiness Review Questionnaire
Expand Down Expand Up @@ -313,11 +313,12 @@ _This section must be completed when targeting alpha to a release._
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).

* **Does enabling the feature change any default behavior?**
Yes. Enabling the feature gate will enable the volume snapshot as data source to PVC and disabling the feature gate will disable the volume snapshot as data source to PVC. This feature gate can only prevent users from creating a PVC using volume snapshot as a data source. It can't prevent users from creating snapshots because snapshot APIs and controller are out of tree.
Snapshot is an opt-in feature so it should not change any default behavior.
Enabling the feature gate will enable the volume snapshot as data source to PVC and disabling the feature gate will disable the volume snapshot as data source to PVC. This feature gate can only prevent users from creating a PVC using volume snapshot as a data source. It can't prevent users from creating snapshots because snapshot APIs and controller are out of tree.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?**
Yes. The feature gate can only control creating a new PVC using snapshot as the data source because the snapshot APIs live out of tree. If the feature is disabled after it has been enabled, existing snapshots will still be there but they cannot be used as data source to provision new PVCs.
Yes. The feature gate can only control creating a new PVC using snapshot as the data source because the snapshot APIs live out of tree. If the feature is disabled after it has been enabled, existing snapshots will still be there but they cannot be used as data source to provision new PVCs and existing PVCs that were created with the snapshot data source will continue to function.

* **What happens if we reenable the feature if it was previously rolled back?**
If we reenable the feature that was previously rolled back, the existing snapshots can be used again as data source to provision new PVCs.
Expand All @@ -329,6 +330,7 @@ _This section must be completed when targeting alpha to a release._
conversion tests if API types are being modified.
Here's a unit test that tests dropping disabled snapshot data source:
https://github.com/kubernetes/kubernetes/blob/v1.20.0-alpha.1/pkg/api/persistentvolumeclaim/util_test.go#L31
The original PR for this test is here: https://github.com/kubernetes/kubernetes#72666

### Rollout, Upgrade and Rollback Planning

Expand All @@ -337,28 +339,33 @@ _This section must be completed when targeting beta graduation to a release._
* **How can a rollout fail? Can it impact already running workloads?**
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
The feature gate only affects kube-apiserver. If enabling the feature gate fails, already running workloads will not be able to use the snapshot feature.
The feature gate only affects kube-apiserver. If enabling the feature gate fails, already running workloads will not be able to use the snapshot feature to rehydrate PVCs, however, creation of volume snapshot is irrelevant to feature gate.
Also failed snapshot operations will be added back to a rate limited queue for retries.

* **What specific metrics should inform a rollback?**
Snapshots can't be mounted by a pod. Snapshot can only be used as a data source to create a new volume.
There are metrics on failed create/delete snapshot operations.
* How many snapshots haven't been taken?
This is a metric that should be issued by the snapshot controller. This work is in-progress. The metric would cover number of failures, number of snapshot operations finished, and latencies.

* How many pods using a PVC provisioned from a snapshot datasource are stuck in pending or failing?
This is not part of the snapshot controller's logic but rather sits in the PV controller and the external provisioner because snapshots can't be mounted directly by a pod but can be used as a datasource to create a new volume. There are metrics around PVC provisioning that shows success/error/time for create volume operations. It currently doesn't distinguish between volumes created from a snapshot vs a blank volume though.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
We plan to do this testing after the code is implemented that moves snapshot API from v1beta1 to v1. We will test the following:
* Upgrade from v1beta1 to v1. Take a snapshot and create a PVC from a snapshot.
* Downgrade from v1 to v1beta1. Check the snapshot and PVC created above. Take a snapshot and create a PVC from a snapshot.
* Upgrade from v1beta1 to v1 again. Check the snapshot and PVC created earlier. Take a snapshot and create a PVC from a snapshot.
* Upgrade from v1beta1 to v1. Take a snapshot and create a PVC from a snapshot. Also test delete snapshot and PVC.
* Downgrade from v1 to v1beta1. Check the snapshot and PVC created above. Take a snapshot and create a PVC from a snapshot. Also test delete snapshot and PVC.
* Upgrade from v1beta1 to v1 again. Check the snapshot and PVC created earlier. Take a snapshot and create a PVC from a snapshot. Also test delete snapshot and PVC.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.

We have added a validation webhook that will tighten the API validation. See details in "keps/sig-storage/1900-volume-snapshot-validation-webhook".
This validating admission webhook should be applied first before going GA to prevent invalid API objects from being created. If the webhook is not applied before going GA, it means there could be invalid API objects created.
The featuregate is only for creating PVC from a volume snapshot data source. The validation webhook is validating VolumeSnapshot and VolumeSnapshotContent API objects. So if the validation webhook is installed and then the featuregate is disabled, it will just prevent PVC from being created from a snapshot.
This validating admission webhook should be applied first before going GA to prevent invalid API objects from being created. If the webhook is not applied before going GA, it means there could be invalid API objects created. It also means existing v1beta1 invalid objects will error on any update, including delete. A distribution/admin must not add v1 API until they are sure no invalid v1beta1 objects exist.

The feature gate is only for creating PVC from a volume snapshot data source. The snapshot controller and validating webhook service are out-of-tree controllers which implement the snapshot feature, and their lifecycle/management/rollback is irrelevant to feature gate. The validation webhook is validating VolumeSnapshot and VolumeSnapshotContent API objects. So if the validation webhook is installed and then the feature gate is disabled, it will just prevent PVC from being created from a snapshot.

### Monitoring Requirements

Expand All @@ -377,7 +384,9 @@ the health of the service?**
- [x] Metrics
- Metric name: snapshot_operation_total_seconds, snapshot_operation_count, snapshot_csi_total_seconds
snapshot_operation_count has a status field that shows Error code. So from that we can tell how many failures have occurred in create/delete snapshot operations.
Another metric we are considering is around the validating webhook itself for conducting validation requests from API server.
We don't support mounting a snapshot to a pod. Only a volume can be mounted to a pod.
We also have metrics on volume provisioning, but it doesn't distinguish between datasource. We will consider adding something to differetiate them.
- [Optional] Aggregation method:
- Components exposing the metric: snapshot-controller and CSI external-snapshotter sidecar
- [ ] Other (treat as last resort)
Expand Down Expand Up @@ -413,7 +422,8 @@ _This section must be completed when targeting beta graduation to a release._
and creating new ones, as well as about cluster-level services (e.g. DNS):
- [Dependency name]: installation of snapshot CRDs, snapshot-controller, snapshot validation webhook, CSI external-snapshotter sidecar
- Usage description:
- Impact of its outage on the feature: installation of snapshot CRDs, snapshot-controller, and CSI external-snapshotter sidecar are required for the volume snapshot feature to work. Snapshot validation webhook is optional. If the validation webhook is not running, API validation will not happen which means there may be invalid snapshot API objects being created.
- Impact of its outage on the feature: Installation of snapshot CRDs, snapshot-controller, and CSI external-snapshotter sidecar are required for the volume snapshot feature to work. Snapshot validation webhook is optional. If the validation webhook is not running, API validation will not happen which means there may be invalid snapshot API objects being created.
The invalid API objects are not usable. We have logic in the controller to check that. That logic was added before the validation web hook was implemented. Since the validation web hook, the snapshot controller, and the snapshot CRDs are all out-of-tree, we don't have a way to make the web hook required. We recommend K8S distro to install them.
- Impact of its degraded performance or high-error rates on the feature:


Expand Down Expand Up @@ -450,6 +460,7 @@ previous answers based on experience in the field._
```

- estimated throughput
We have plan to add stress tests which can help us get maximum throughput achievable.
- originating component(s) (e.g. Kubelet, Feature-X-controller)
focusing mostly on:
- components listing and/or watching resources they didn't before
Expand All @@ -474,15 +485,21 @@ provider?**
the existing API objects?**
Describe them, providing:
- API type(s): New CRDs volumesnapshots, volumesnapshotcontents, and volumesnapshotclasses are added
PVC object also increases because of datasource.
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
Each PV can have multiple snapshots. The limit depends on individual storage systems.
* Each PV can have multiple snapshots. The limit depends on individual storage systems.
* There's quota support for namespaced custom resources as described here:
https://kubernetes.io/docs/concepts/policy/resource-quotas/#object-count-quota
* We can have quota support for VolumeSnapshot resource which is namespaced. This is created by a user.
* VolumeSnapshotContent is cluster scoped and created by the cluster admin, but there is usually a 1 to 1 mapping between the VolumeSnapshot and VolumeSnapshotContent, although admin can create more VolumeSnapshotContents if he/she wants to.
* VolumeSnapshotClass is cluster scoped and created by cluster-admin.

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
This is a new feature so it will not increase time taken by existing operations.
This is a new feature so it will not increase time taken by existing operations. Howver, creating volumes from snapshots will probably take longer than an empty disk, especially if the snapshot needs to be downloaded from an object store.

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
Expand All @@ -491,7 +508,11 @@ resource usage (CPU, RAM, disk, IO, ...) in any components?**
volume), significant amount of data sent and/or received over network, etc.
This through this both in small and large cases, again with respect to the
[supported limits].
This will add new snapshot API objects to the API server. IO depends on the underlying storage system. We only provide a snapshot API at control layer. Copy on write is typically used by snapshot technologies, but our snapshot API does not make any requirements on that because it depends on individual storage systems.
This will add new snapshot API objects to the API server.
The snapshot controller, validation webhook service, and snapshotter sidecar will consume cpu/memory from the control plane.
Copy on write is typically used by snapshot technologies, but our snapshot API does not make any requirements on that because it depends on individual storage systems.
The application is usually quiesced before taking a snapshot to ensure consistency, therefore taking a snapshot should not drive more IO on the nodes.
Some storage systems upload the snapshot to an object store after the snapshot is cut. That could take a very long time.

### Troubleshooting

Expand All @@ -503,7 +524,7 @@ _This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
If API server is unavailable, an API update will fail due to timeout. Failed operations will be added back to a rate limited queue for retries.
A CreateSnapshot call to CSI driver is idemponent. So when API server is back and the same request is sent to the CSI driver again, the CSI driver should return the results from the same snapshot.
A CreateSnapshot call to CSI driver is idempotent. So when API server is back and the same request is sent to the CSI driver again, the CSI driver should return the results from the same snapshot.

* **What are other known failure modes?**
For each of them, fill in the following information by copying the below template:
Expand Down

0 comments on commit f09401f

Please sign in to comment.