Kevin Han edited this page Aug 6, 2020 · 4 revisions

Pravega Healthcheck

Summary

This PDP (Pravega Design Proposal) proposes a design for the Pravega Health Check. It covers the requirements of the feature, the main considerations and concerns behind the design, the architecture of the subsystem, typical usage at various levels, and the implementation of some HealthAspects.

Main Considerations & Functional Requirements

  1. Support Health (Readiness) Check
  2. Support Liveness Check
  3. (Optional) Support Wellness Check
  4. Easily consumable by both machines and humans
  5. Extensible and maintainable
  6. Standards-based approach instead of an ad hoc implementation
  7. Accessible from all levels: aspect, process (localhost), pod, service (cluster), database, and metrics backend
  8. Lightweight with minimal overhead; operates in PULL mode

Healthcheck Usage Examples

| Access Level | Query Example | Response Example | Use Cases |
| --- | --- | --- | --- |
| Local | curl http://localhost:10080/health | {"health": 0} or {"health": -1} | Fundamental Healthcheck for SegmentStore |
| Local | curl http://localhost:10080/live | {"liveness": 0} | Fundamental Liveness Check for SegmentStore |
| Local | curl http://localhost:10090/healthDetails | {"health": -1, "details": "No Active SegmentContainer"} | Fundamental Healthcheck for Controller |
| K8S pod | curl -v http://10.100.200.125:10080/health; curl -v http://10.100.200.125:10090/health | | Troubleshooting inside K8S |
| Operator | {LivenessProbe: exec: Command: curl -v /live ReadinessProbe: Exec: Command: curl -v /health | | Operator exposes Pravega liveness and readiness check to K8S |
| Service (CLI) | health -[Segmentstore Controller All] | | |
| Database (Influxdb) | SELECT healthevents from SegmentstoreHealthEvents ... | | History and distribution of health information available now |
| Metrics (Grafana) | Metrics backend User Interface | | Integration with BK/ZK metrics possible now |
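
The local responses above are small enough for a script or tool to interpret directly. The following is a minimal sketch of such a consumer, not part of the proposal. It assumes that a code of 0 means healthy (or live) and that any other code, such as the -1 shown with "No Active SegmentContainer", means unhealthy; the proposal does not define the codes explicitly. The class name HealthResponse and its methods are hypothetical, and a regular expression is used instead of a JSON library only to keep the example self-contained.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Pravega) that reads a healthcheck response body. */
final class HealthResponse {
    // Matches "health": <int> or "liveness": <int>, as in the response examples.
    private static final Pattern CODE =
            Pattern.compile("\"(health|liveness)\"\\s*:\\s*(-?\\d+)");
    // Matches an optional "details": "<text>" field.
    private static final Pattern DETAILS =
            Pattern.compile("\"details\"\\s*:\\s*\"([^\"]*)\"");

    private HealthResponse() { }

    /** Returns the numeric status code, or empty if the body has no recognizable code. */
    static Optional<Integer> statusCode(String body) {
        Matcher m = CODE.matcher(body);
        return m.find() ? Optional.of(Integer.parseInt(m.group(2))) : Optional.empty();
    }

    /** Assumption: 0 means OK; any other code, or a missing code, is treated as not OK. */
    static boolean isOk(String body) {
        return statusCode(body).map(code -> code == 0).orElse(false);
    }

    /** Returns the "details" text if the body carries one. */
    static Optional<String> details(String body) {
        Matcher m = DETAILS.matcher(body);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Treating a missing or unparseable code as not OK is a deliberately conservative choice for this sketch; a real consumer might prefer to report that case separately.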

API & Architecture

Key Concepts

  1. HealthAspect - An aspect of the system's health. E.g., Cache is an aspect of SegmentStore service health. Note that an aspect may have multiple instances. A HealthAspect provides the functions that run the healthcheck and perform aspect-level aggregation
  2. HealthInfo - The object that stores the final healthcheck result, such as the status code and details
  3. HealthAspectProvider - A system component is a HealthAspectProvider if it registers a HealthAspect upon its initialization and closes the aspect when the component closes
  4. HealthAspectRegistry - A container that holds references to all the active HealthAspects (it appears as HealthRegistry in the flow and code below). The references are weak references, which prevents memory leaks
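
The weak-reference idea behind the registry can be illustrated with a small sketch. This is an assumption about how such a registry could be built, not the proposal's implementation: the names WeakAspectRegistry and IdentifiedAspect are hypothetical, and IdentifiedAspect stands in for HealthAspect with only the instance id.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Minimal stand-in for the HealthAspect concept; only the instance id is needed here. */
interface IdentifiedAspect {
    String getAspectInstanceId();
}

/** Hypothetical registry sketch that holds aspects through weak references only. */
final class WeakAspectRegistry {
    private final Map<String, WeakReference<IdentifiedAspect>> aspects = new ConcurrentHashMap<>();

    /** Called by a provider when its component is initialized. */
    void register(IdentifiedAspect aspect) {
        aspects.put(aspect.getAspectInstanceId(), new WeakReference<>(aspect));
    }

    /** Called by a provider when its component closes. */
    void close(IdentifiedAspect aspect) {
        aspects.remove(aspect.getAspectInstanceId());
    }

    /** Returns the aspects that are still reachable and prunes entries whose aspect was collected. */
    List<IdentifiedAspect> activeAspects() {
        List<IdentifiedAspect> active = new ArrayList<>();
        Iterator<WeakReference<IdentifiedAspect>> it = aspects.values().iterator();
        while (it.hasNext()) {
            IdentifiedAspect aspect = it.next().get();
            if (aspect == null) {
                it.remove();        // owner went away without closing; drop the stale entry
            } else {
                active.add(aspect);
            }
        }
        return active;
    }
}
```

Because the registry never holds a strong reference, a component that is dropped without calling close does not keep its aspect alive; the stale entry is simply pruned the next time the active aspects are listed.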

Basic Flow

  1. Each system component with a health concern implements the HealthAspectProvider interface: it registers its own HealthAspect upon component initialization and closes the aspect when the component closes
  2. HealthAspect holds the functions that run the healthcheck and perform aspect-level aggregation
  3. HealthRegistry holds weak references to all the active HealthAspects to avoid memory leaks
  4. When a healthcheck is requested, HealthRegistry iterates over all the active HealthAspects to get their HealthInfo. It also performs the aspect-level aggregation to determine the health of the entire aspect
  5. The aspect-level aggregation function is provided by HealthAspect. E.g., for SegmentContainerHealthAspect, if more than half of the Segment Containers are unhealthy, then SegmentContainer is considered unhealthy at the aspect level
  6. The existing REST endpoint inside the Controller is used to expose the healthcheck result
  7. A new REST endpoint will be created for the SegmentStore service (SSS)
  8. Pravega Operator will expose those healthcheck endpoints to Kubernetes
  9. The Pravega command-line tool will do the service-level healthcheck: the CLI will query all the service pods and perform the service-level aggregation
  10. Each healthcheck call is also a metrics event, so users can view the Pravega healthcheck history and distribution in a metrics backend such as Grafana
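
Steps 4 and 5 of the flow can be sketched as follows. This is a simplified illustration under stated assumptions, not the proposed implementation: HealthInfo and HealthAspect here are cut-down stand-ins for the types defined in the HealthCheck API section, and getAspectName(), HealthCheckFlow, check, and fixed are hypothetical names introduced only because the proposal does not say how instances are grouped into an aspect.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

/** Simplified stand-in for the proposal's HealthInfo. */
final class HealthInfo {
    enum Status { HEALTH, UNHEALTH }

    final Status status;
    final String details;

    HealthInfo(Status status, String details) {
        this.status = status;
        this.details = details;
    }
}

/** Simplified stand-in for HealthAspect; getAspectName() is a hypothetical grouping key. */
interface HealthAspect {
    String getAspectName();
    Supplier<HealthInfo> getHealthInfoSupplier();
    Function<Iterable<HealthInfo>, HealthInfo.Status> getHealthAspectAggregator();
}

final class HealthCheckFlow {
    private HealthCheckFlow() { }

    /** Collects one HealthInfo per active instance, then aggregates per aspect. */
    static Map<String, HealthInfo.Status> check(List<HealthAspect> activeAspects) {
        Map<String, List<HealthInfo>> infos = new LinkedHashMap<>();
        Map<String, Function<Iterable<HealthInfo>, HealthInfo.Status>> aggregators = new LinkedHashMap<>();
        for (HealthAspect aspect : activeAspects) {
            String name = aspect.getAspectName();
            infos.computeIfAbsent(name, k -> new ArrayList<>()).add(aspect.getHealthInfoSupplier().get());
            // Assumption: all instances of one aspect share the same aggregator, so the first one wins.
            aggregators.putIfAbsent(name, aspect.getHealthAspectAggregator());
        }
        Map<String, HealthInfo.Status> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<HealthInfo>> entry : infos.entrySet()) {
            result.put(entry.getKey(), aggregators.get(entry.getKey()).apply(entry.getValue()));
        }
        return result;
    }

    /** Demo helper: an aspect instance that always reports the given status. */
    static HealthAspect fixed(String name, HealthInfo.Status status,
                              Function<Iterable<HealthInfo>, HealthInfo.Status> aggregator) {
        return new HealthAspect() {
            @Override public String getAspectName() { return name; }
            @Override public Supplier<HealthInfo> getHealthInfoSupplier() {
                return () -> new HealthInfo(status, name + " reports " + status);
            }
            @Override public Function<Iterable<HealthInfo>, HealthInfo.Status> getHealthAspectAggregator() {
                return aggregator;
            }
        };
    }
}
```

Calling HealthCheckFlow.check with the active instances yields one status per aspect name, which is the shape a REST endpoint or the CLI could then report or aggregate further.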

Architecture Diagram

[Figure: Architecture Diagram]

HealthCheck API

import lombok.Data;

@Data
public class HealthInfo {
    enum Status {
        HEALTH,
        UNHEALTH
    }

    /**
     * The status of the HealthInfo
     */
    final Status status;

    /**
     * The details of the HealthInfo
     */
    final String details;
}

import java.util.function.Function;
import java.util.function.Supplier;

public interface HealthAspect {

    /**
     * Return the unique ID of the aspect instance
     * @return the aspect instance id
     */
    String getAspectInstanceId();

    /**
     * Each HealthAspect should provide a Supplier for HealthInfo
     *
     * @return a Supplier that produces the HealthInfo of the health aspect
     */
    Supplier<HealthInfo> getHealthInfoSupplier();

    /**
     * There might be multiple instances of the same HealthAspect, so an aspect-level aggregator
     * should be provided as well to reach a conclusion at the aspect level.
     *
     * @return Function to make the aspect-level conclusion based on all the available HealthInfo of the aspect
     */
    Function<Iterable<HealthInfo>, HealthInfo.Status> getHealthAspectAggregator();
}

public interface HealthAspectProvider {

    /**
     * Register the HealthAspect upon its initialization
     */
    void registerHealthAspect();

    /**
     * Close the HealthAspect upon its shutdown
     */
    void closeHealthAspect();
}

public class AggregatorUtil {
    /**
     * Given a group of HealthInfo, determines the overall health status using majority rule.
     * A tie, or an empty input, resolves to HEALTH.
     *
     * @param healthInfos the HealthInfo of every instance of one aspect
     * @return the overall health status
     */
    public static HealthInfo.Status majority(Iterable<HealthInfo> healthInfos) {
        int healthCount = 0;
        int unhealthCount = 0;
        for (HealthInfo info: healthInfos) {
            if (info.getStatus() == HealthInfo.Status.HEALTH) {
                healthCount++;
            } else {
                unhealthCount++;
            }
        }
        return unhealthCount > healthCount ? HealthInfo.Status.UNHEALTH : HealthInfo.Status.HEALTH;
    }   
}

Health Aspects and Corresponding Providers

SegmentStore Container Aspect

import java.util.function.Function;
import java.util.function.Supplier;

public class SegmentContainerHealthAspect implements HealthAspect {

    // Shared by all instances: aspect-level health is decided by majority rule.
    private static final Function<Iterable<HealthInfo>, HealthInfo.Status> ASPECT_LEVEL_AGGREGATOR = AggregatorUtil::majority;

    private final String aspectInstanceId;
    private final StreamSegmentContainer container;
    private final Supplier<HealthInfo> healthInfoSupplier;

    public SegmentContainerHealthAspect(String aspectInstanceId, StreamSegmentContainer container, Supplier<HealthInfo> healthInfoSupplier) {
        this.aspectInstanceId = aspectInstanceId;
        this.container = container;
        this.healthInfoSupplier = healthInfoSupplier;
    }

    @Override
    public String getAspectInstanceId() {
        return this.aspectInstanceId;
    }

    @Override
    public Supplier<HealthInfo> getHealthInfoSupplier() {
        return this.healthInfoSupplier;
    }

    @Override
    public Function<Iterable<HealthInfo>, HealthInfo.Status> getHealthAspectAggregator() {
        return ASPECT_LEVEL_AGGREGATOR;
    }

    // Builds a supplier that inspects the container each time it is invoked.
    Supplier<HealthInfo> createHealthInfoSupplier() {
        return () -> {
            HealthInfo.Status status = container.isClosed() ? HealthInfo.Status.UNHEALTH : HealthInfo.Status.HEALTH;
            String details = container.getActiveSegments().toString();
            return new HealthInfo(status, details);
        };
    }
}

public class StreamSegmentContainer implements HealthAspectProvider, AutoCloseable {

    private final HealthRegistry healthRegistry;
    private final HealthAspect healthAspect;

    public StreamSegmentContainer(int containerId, HealthRegistry healthRegistry) {
        this.healthRegistry = healthRegistry;
        // The container id doubles as the aspect instance id; this placeholder supplier always reports HEALTH.
        this.healthAspect = new SegmentContainerHealthAspect(String.valueOf(containerId), this,
                () -> new HealthInfo(HealthInfo.Status.HEALTH, "OK"));
        registerHealthAspect();
    }

    @Override
    public void registerHealthAspect() {
        healthRegistry.registerHealthAspect(this.healthAspect);
    }

    @Override
    public void closeHealthAspect() {
        healthRegistry.closeHealthAspect(this.healthAspect);
    }

    @Override
    public void close() {
        // Closing the component also closes its HealthAspect.
        closeHealthAspect();
    }
}
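
The register-on-initialization, close-on-shutdown lifecycle pairs naturally with try-with-resources. The sketch below illustrates that pattern in isolation, under loud assumptions: TrackingRegistry is a hypothetical in-memory stand-in for HealthRegistry that tracks only aspect instance ids, and DemoContainer is a hypothetical component, not the real StreamSegmentContainer.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for HealthRegistry; it only tracks aspect instance ids. */
final class TrackingRegistry {
    private final Set<String> activeIds = ConcurrentHashMap.newKeySet();

    void register(String aspectInstanceId) {
        activeIds.add(aspectInstanceId);
    }

    void close(String aspectInstanceId) {
        activeIds.remove(aspectInstanceId);
    }

    boolean isActive(String aspectInstanceId) {
        return activeIds.contains(aspectInstanceId);
    }
}

/** Hypothetical component that registers on construction and deregisters on close. */
final class DemoContainer implements AutoCloseable {
    private final String aspectInstanceId;
    private final TrackingRegistry registry;

    DemoContainer(int containerId, TrackingRegistry registry) {
        this.aspectInstanceId = String.valueOf(containerId);
        this.registry = registry;
        registry.register(aspectInstanceId);    // register upon initialization
    }

    String getAspectInstanceId() {
        return aspectInstanceId;
    }

    @Override
    public void close() {
        registry.close(aspectInstanceId);       // close the aspect upon shutdown
    }
}
```

With this shape, a caller that writes try (DemoContainer c = new DemoContainer(3, registry)) { ... } gets the aspect removed from the registry when the block exits, including on an exception.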

System

Cache

Long Term Storage

Discarded approaches

https://github.com/pravega/pravega/pull/1902
