Kevin Han edited this page Sep 4, 2020 · 4 revisions

Pravega Health Check

Summary

This PDP (Pravega Design Proposal) proposes a design for Pravega HealthCheck. It covers the feature requirements, the main considerations and concerns behind the design, the system architecture, the Java and REST APIs of the framework, typical integrations and usages of HealthCheck at different levels, and the implementations of some HealthAspects.

Readiness Check

The Readiness Check is invoked periodically to determine whether the target service instance should receive requests. If the Readiness Check fails, the service instance is not killed; instead, the request routing mechanism stops sending service requests to the instance.

Health Check (aka Liveness Check)

The Health Check is invoked periodically to determine whether the target service instance is functioning as expected. If the Health Check fails, the service instance is killed by the managing process, such as Kubernetes or a system operator.

HealthInfo Aggregation

HealthInfos are usually collected from various components, so aggregation rules are needed to determine the health of each HealthAspect and of the entire service instance.

Main Considerations & Functional Requirements

  1. Supports Health (Liveness) Check
  2. Supports Readiness Check
  3. The returned HealthInfo should contain both a specific status for machines to read and details for humans to read
  4. The design should reflect the layered notion of Pravega health: from the individual unit (e.g. a segment container), to the aspect (e.g. all the segment containers in a SegmentStore instance form an aspect), to the individual service instance (e.g. one SegmentStore instance), up to the service level (e.g. all the SegmentStore instances as the SegmentStore service)
  5. REST API client interface
  6. PULL mode: HealthCheck is invoked only upon request from a client
  7. HealthInfo is cached to reduce the consumption of system resources

Architecture Diagram

[Figure: Architecture Diagram]

API of the HealthCheck Framework

Key Concepts of the Design

  1. HealthInfo - Object that stores a HealthCheck result, comprising a status code and details
  2. HealthUnit - The smallest system component that provides HealthInfo
  3. HealthAspect - A HealthAspect comprises zero, one, or more HealthUnits with the same health concern. E.g. all Segment Containers form the SegmentContainer HealthAspect; the Metrics HealthAspect may contain zero HealthUnits if metrics are turned off; the System HealthAspect is formed by one and only one HealthUnit
  4. Aspect Level Aggregation - Each HealthAspect must define an aggregation rule that combines the potentially multiple HealthInfos received into the overall health of the aspect. E.g. for the SegmentContainer HealthAspect we could apply a majority rule to determine the health of the Segment Containers as an aspect
  5. Instance Level Aggregation - The rule that determines the overall health of the service instance from the HealthInfo returned by all HealthAspects
  6. HealthRegistry - A container that holds references to all the HealthUnits. Upon a HealthCheck request, the registry polls all the HealthUnits for HealthInfo, then aggregates the HealthInfos at the Aspect and Instance levels and returns the final HealthInfo to the HealthCheck client

Basic Flow

  1. System components that can provide HealthInfo create one or more HealthUnit objects and store them inside the component. The component must register its HealthUnit objects when it is initialized and unregister them when it is closed.
  2. Each HealthUnit object must specify which HealthAspect it belongs to
  3. Upon a HealthCheck request, the HealthRegistry polls all the registered HealthUnits to retrieve HealthInfo
  4. After all the available HealthInfo has been received, aggregation rules are applied at the HealthAspect level and then at the instance level to get the final HealthInfo
  5. A REST interface is provided to clients
  6. Each HealthCheck is also a metrics event, so users can view the HealthCheck history and HealthInfo distribution in a metrics backend such as Grafana
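
The polling and two-level aggregation steps above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the proposed implementation: the class name `HealthCheckFlowSketch`, the reduced `Aspect` enum, the use of a bare status instead of a full HealthInfo, and both aggregation rules (strict majority per aspect, every aspect healthy per instance) are hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of steps 3 and 4 of the basic flow. */
final class HealthCheckFlowSketch {

    enum Status { HEALTH, UNHEALTH, UNKNOWN }

    /** Reduced set of aspects, for illustration only. */
    enum Aspect { SYSTEM, SEGMENT_CONTAINER, CACHE }

    /** A registered unit: the aspect it belongs to plus a supplier of its current status. */
    static final class Unit {
        final Aspect aspect;
        final Supplier<Status> supplier;

        Unit(Aspect aspect, Supplier<Status> supplier) {
            this.aspect = aspect;
            this.supplier = supplier;
        }
    }

    /** Step 3: poll every registered unit and group the results by aspect. */
    static Map<Aspect, List<Status>> poll(List<Unit> units) {
        Map<Aspect, List<Status>> byAspect = new EnumMap<>(Aspect.class);
        for (Unit unit : units) {
            Status status;
            try {
                status = unit.supplier.get();
            } catch (RuntimeException e) {
                // A failing supplier is reported as UNKNOWN instead of failing the whole check.
                status = Status.UNKNOWN;
            }
            byAspect.computeIfAbsent(unit.aspect, a -> new ArrayList<>()).add(status);
        }
        return byAspect;
    }

    /** Step 4, aspect level. Assumed rule: strictly more than half of the units are healthy. */
    static Status aggregateAspect(List<Status> statuses) {
        long healthy = statuses.stream().filter(s -> s == Status.HEALTH).count();
        return healthy * 2 > statuses.size() ? Status.HEALTH : Status.UNHEALTH;
    }

    /** Step 4, instance level. Assumed rule: every polled aspect must be healthy. */
    static Status check(List<Unit> units) {
        for (List<Status> statuses : poll(units).values()) {
            if (aggregateAspect(statuses) != Status.HEALTH) {
                return Status.UNHEALTH;
            }
        }
        return Status.HEALTH;
    }
}
```

With these assumed rules, one unhealthy Segment Container out of three does not make the instance unhealthy, while a single Cache unit whose supplier throws does.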

Resource Analysis

Each HealthUnit holds only a HealthInfo Supplier lambda for returning HealthInfo. It holds no other resources for HealthCheck purposes.
The HealthInfo Supplier should be lightweight and non-blocking, using information immediately available to the component as much as possible.
HealthRegistry holds weak references to the registered HealthUnits, so if a HealthUnit becomes eligible for garbage collection, it is also removed automatically from the HealthRegistry, which prevents memory leaks.
HealthUnits are polled periodically on a separate thread, placing no burden on the system components that hold them.
The final HealthInfo is also cached, so HealthCheck is effectively throttled. By default the cache interval could be set to 10 seconds.
Given the above measures and considerations, the HealthCheck process should be lightweight and non-blocking, using minimal memory and CPU resources.
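
The caching behaviour described above can be illustrated with a small wrapper around any supplier. This is a sketch only: the class name `CachedHealthCheck`, its constructor parameters, and the injectable clock are hypothetical and are not part of the proposed API. The clock is passed in so that the interval can be exercised without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical helper: caches the result of a health check for a fixed interval. */
final class CachedHealthCheck<T> implements Supplier<T> {

    private final Supplier<T> delegate;
    private final long intervalMillis;
    private final LongSupplier clock;

    private T cached;
    private long lastRefreshMillis;
    private boolean loaded;

    /**
     * @param delegate       the actual (possibly expensive) health check
     * @param intervalMillis how long a result stays valid, e.g. 10_000 for 10 seconds
     * @param clock          source of the current time in milliseconds
     */
    CachedHealthCheck(Supplier<T> delegate, long intervalMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.intervalMillis = intervalMillis;
        this.clock = clock;
    }

    /** Returns the cached result, refreshing it only when the interval has elapsed. */
    @Override
    public synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - lastRefreshMillis >= intervalMillis) {
            cached = delegate.get();
            lastRefreshMillis = now;
            loaded = true;
        }
        return cached;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`; requests arriving within the interval are then served from the cached value without invoking the underlying check.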

HealthCheck Java API

import lombok.Data;

@Data
public class HealthInfo {
    public enum Status {
        /* The result of the health-check is considered healthy */
        HEALTH,
        /* The result of the health-check is considered unhealthy */
        UNHEALTH,
        /* The result of the health-check is unknown, due to a time-out, an interruption or another exception */
        UNKNOWN
    }

    /*
     * The status of the health-check
     */
    private final Status status;

    /*
     * The details of the health-check
     */
    private final String details;
}

import com.google.common.base.Preconditions;
import java.util.Collection;
import java.util.Optional;
import java.util.function.Function;

/**
 * HealthAspect is Pravega's notion of a health-check on top of individual HealthUnits.
 *
 * For a highly distributed system such as Pravega, the failure of one or more components is completely expected,
 * sometimes even by design, so in addition to the health-check of each individual HealthUnit,
 * we have to aggregate all the health-check results from the aspect to determine the health of the aspect.
 *
 * For example, during a scale-up we may see some Segment Containers being shut down while others are
 * being created. We have to collect the HealthInfo from all the Segment Containers (HealthUnits) to determine
 * the health of the overall Segment Container aspect.
 */
public enum HealthAspect {

    SYSTEM("System", healthInfos -> {
        return HealthInfoAggregationRules.singleOrNone(healthInfos);
    }),
    CONTROLLER("Controller", healthInfos -> {
        return HealthInfoAggregationRules.majority(healthInfos);
    }),
    SEGMENT_CONTAINER("Segment Container", healthInfos -> {
        return HealthInfoAggregationRules.majority(healthInfos);
    }),
    CACHE("Cache Manager", healthInfos -> {
        return HealthInfoAggregationRules.oneVeto(healthInfos);
    }),
    LONG_TERM_STORAGE("Long Term Storage", healthInfos -> {
        return HealthInfoAggregationRules.singleOrNone(healthInfos);
    }),
    METRICS("Metrics", healthInfos -> {
        return HealthInfoAggregationRules.singleOrNone(healthInfos);
    });

    private final String name;
    private final Function<Collection<HealthInfo>, Optional<HealthInfo>> aspectAggregationRule;

    /**
     *
     * @param name - the name of the aspect
     * @param aspectAggregationRule - the rule to determine aspect level healthiness
     */
    HealthAspect(String name, Function<Collection<HealthInfo>, Optional<HealthInfo>> aspectAggregationRule) {
        Preconditions.checkArgument(aspectAggregationRule != null, "Aspect Aggregation Rule cannot be null");
        this.name = name;
        this.aspectAggregationRule = aspectAggregationRule;
    }

    /**
     * Get the HealthAspect name.
     *
     * @return the HealthAspect name
     */
    public String getName() {
        return this.name;
    }

    /**
     * Get the rule for the aggregation of all the HealthInfo from the HealthAspect.
     *
     * @return the Function to aggregate all HealthInfo from the HealthAspect
     */
    public Function<Collection<HealthInfo>, Optional<HealthInfo>> getAspectAggregationRule() {
        return this.aspectAggregationRule;
    }
}
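
The enum above refers to `HealthInfoAggregationRules`, which this page does not define. The following is a minimal sketch of what such rules might look like. The semantics are assumptions inferred from the method names only: `majority` requires strictly more than half of the units to be healthy, `oneVeto` reports unhealthy as soon as one unit is unhealthy, and `singleOrNone` passes through the only HealthInfo, if any. `HealthInfo` is re-declared here without Lombok so that the example is self-contained.

```java
import java.util.Collection;
import java.util.Optional;

/** Simplified stand-in for the HealthInfo class above (no Lombok), for illustration only. */
final class HealthInfo {
    enum Status { HEALTH, UNHEALTH, UNKNOWN }

    final Status status;
    final String details;

    HealthInfo(Status status, String details) {
        this.status = status;
        this.details = details;
    }
}

/** Hypothetical aggregation rules; the semantics are assumptions, not the proposal's definitions. */
final class HealthInfoAggregationRules {

    private HealthInfoAggregationRules() { }

    private static long countHealthy(Collection<HealthInfo> infos) {
        return infos.stream().filter(i -> i.status == HealthInfo.Status.HEALTH).count();
    }

    /** Healthy only if strictly more than half of the reported HealthInfos are healthy. */
    static Optional<HealthInfo> majority(Collection<HealthInfo> infos) {
        if (infos.isEmpty()) {
            return Optional.empty();
        }
        long healthy = countHealthy(infos);
        HealthInfo.Status status = healthy * 2 > infos.size()
                ? HealthInfo.Status.HEALTH : HealthInfo.Status.UNHEALTH;
        return Optional.of(new HealthInfo(status,
                healthy + " healthy, " + (infos.size() - healthy) + " not healthy"));
    }

    /** Returns the first unhealthy HealthInfo if there is one; healthy only if all units are healthy. */
    static Optional<HealthInfo> oneVeto(Collection<HealthInfo> infos) {
        if (infos.isEmpty()) {
            return Optional.empty();
        }
        Optional<HealthInfo> veto = infos.stream()
                .filter(i -> i.status == HealthInfo.Status.UNHEALTH)
                .findFirst();
        if (veto.isPresent()) {
            return veto;
        }
        boolean allHealthy = countHealthy(infos) == infos.size();
        return Optional.of(new HealthInfo(
                allHealthy ? HealthInfo.Status.HEALTH : HealthInfo.Status.UNKNOWN,
                infos.size() + " unit(s), none unhealthy"));
    }

    /** Passes through the only HealthInfo, or empty if there is none; rejects more than one. */
    static Optional<HealthInfo> singleOrNone(Collection<HealthInfo> infos) {
        if (infos.size() > 1) {
            throw new IllegalArgumentException("Expected at most one HealthInfo, got " + infos.size());
        }
        return infos.stream().findFirst();
    }
}
```

All three rules return an empty Optional for an aspect without units, which matches the `Optional<HealthInfo>` return type used by the enum and covers the case of a Metrics aspect with metrics turned off.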

import java.util.function.Supplier;
import lombok.Data;

@Data
public class HealthUnit {

    /**
     * Id that uniquely identifies the HealthUnit within the aspect it belongs to.
     * Usually this id can be derived from an existing id, such as the id of the hosting component.
     */
    final String healthUnitId;

    /**
     * The HealthAspect this HealthUnit belongs to.
     */
    final HealthAspect healthAspect;

    /**
     * Supplier to supply HealthInfo of the hosting component upon health-check request.
     */
    final Supplier<HealthInfo> healthInfoSupplier;
}

/**
 * Interface of the container that holds HealthUnit references. It must provide the ability to
 * register and unregister HealthUnits.
 */
public interface HealthRegistry {

    /**
     * Register a HealthUnit.
     *
     * @param unit HealthUnit object
     */
    void registerHealthUnit(HealthUnit unit);

    /**
     * Unregister a HealthUnit.
     *
     * @param unit HealthUnit object
     */
    void unregisterHealthUnit(HealthUnit unit);
}
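
One way to satisfy the weak-reference requirement from the Resource Analysis is a set backed by `java.util.WeakHashMap`. The sketch below is an assumption about how a registry could be built, not the proposed implementation: the class name `WeakHealthRegistry`, the `snapshot` method, and the generic parameter `U` (standing in for HealthUnit) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.WeakHashMap;

/** Hypothetical registry that holds its units weakly; U stands in for HealthUnit. */
final class WeakHealthRegistry<U> {

    // Entries disappear once a unit is no longer strongly reachable elsewhere.
    // WeakHashMap compares keys with equals()/hashCode(), so value-equal units collapse into one entry.
    private final Set<U> units =
            Collections.synchronizedSet(Collections.newSetFromMap(new WeakHashMap<U, Boolean>()));

    void registerHealthUnit(U unit) {
        units.add(unit);
    }

    void unregisterHealthUnit(U unit) {
        units.remove(unit);
    }

    /** Returns a snapshot of the units that are currently registered and still alive. */
    List<U> snapshot() {
        // Iterating a synchronized collection requires holding its lock.
        synchronized (units) {
            return new ArrayList<>(units);
        }
    }
}
```

Explicit unregistration on `close()` remains the primary cleanup path. The weak references act only as a safety net, since the moment at which the garbage collector clears an entry is not deterministic.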

REST API

| Request | Response |
|---|---|
| /health | {"health": 0} or {"health": -1} |
| /ready | {"ready": 0} or {"ready": -1} |
| /healthDetails | {"health status": -1, "SegmentContainerAspect": "5 healthy, 1 unhealthy", "SystemAspect": "memory 12G/16G", "Long Term Storage Aspect": "ECS, storage full", "Cache Aspect": "throttling at 10s", "Operation Log": "Unknown"} |

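The /health and /ready endpoints can be illustrated with the HTTP server that ships with the JDK (`com.sun.net.httpserver`). This is a sketch under assumptions, not the proposed REST layer: the class `HealthEndpointSketch` and its methods are hypothetical, the mapping of 0 to "ok" and -1 to "not ok" is inferred from the examples on this page, and the HTTP status codes (200 and 503) are not specified by this proposal.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the /health and /ready endpoints. */
final class HealthEndpointSketch {

    private HealthEndpointSketch() { }

    /** Renders the response body, assuming 0 means "ok" and -1 means "not ok". */
    static String body(String key, boolean ok) {
        return "{\"" + key + "\": " + (ok ? 0 : -1) + "}";
    }

    /** Starts a server exposing /health and /ready; port 0 lets the OS pick a free port. */
    static HttpServer start(int port, BooleanSupplier healthy, BooleanSupplier ready) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        register(server, "/health", "health", healthy);
        register(server, "/ready", "ready", ready);
        server.start();
        return server;
    }

    private static void register(HttpServer server, String path, String key, BooleanSupplier check) {
        // Note: contexts match by path prefix, so "/health" would also receive "/healthDetails"
        // unless a more specific context is registered for it.
        server.createContext(path, exchange -> {
            boolean ok = check.getAsBoolean();
            byte[] bytes = body(key, ok).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            // Assumption: 503 on failure, so HTTP-based probes can rely on the status code alone.
            exchange.sendResponseHeaders(ok ? 200 : 503, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
    }
}
```

A caller would pass suppliers backed by the cached instance-level HealthInfo, for example `HealthEndpointSketch.start(10080, () -> true, () -> true)`, and stop the returned server with `stop(0)` on shutdown.
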
Implementations of HealthUnits and HealthAspects

Reference Implementation of HealthUnit

public class SampleSystemComponent implements AutoCloseable {

    final String componentId;
    final HealthRegistry healthRegistry;
    final HealthUnit systemHealthUnit;
    final HealthUnit segmentContainerHealthUnit;

    public SampleSystemComponent(String id, HealthRegistry healthRegistry) {

        this.componentId = id;
        this.healthRegistry = healthRegistry;
        systemHealthUnit = new HealthUnit(this.componentId, HealthAspect.SYSTEM, () -> new HealthInfo(...));
        segmentContainerHealthUnit = new HealthUnit(this.componentId, HealthAspect.SEGMENT_CONTAINER, () -> new HealthInfo(...));
        this.healthRegistry.registerHealthUnit(systemHealthUnit);
        this.healthRegistry.registerHealthUnit(segmentContainerHealthUnit);
    }

    @Override
    public void close() {
        healthRegistry.unregisterHealthUnit(systemHealthUnit);
        healthRegistry.unregisterHealthUnit(segmentContainerHealthUnit);
    }
}
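
As an illustration of a lightweight, non-blocking HealthInfo supplier, the sketch below reports JVM heap usage, in the spirit of the "memory 12G/16G" detail shown in the REST API section. Everything in it is an assumption made for the example: the class `HeapHealthSketch`, the simplified `Info` type (standing in for HealthInfo), and the usage-ratio threshold are hypothetical and are not part of the proposal.

```java
import java.util.function.Supplier;

/** Hypothetical supplier for a system-level health unit based on JVM heap usage. */
final class HeapHealthSketch {

    enum Status { HEALTH, UNHEALTH, UNKNOWN }

    /** Simplified stand-in for HealthInfo. */
    static final class Info {
        final Status status;
        final String details;

        Info(Status status, String details) {
            this.status = status;
            this.details = details;
        }
    }

    private HeapHealthSketch() { }

    /** Pure evaluation of the rule, so it can be tested without touching the JVM. */
    static Info evaluate(long usedBytes, long maxBytes, double maxRatio) {
        if (maxBytes <= 0) {
            return new Info(Status.UNKNOWN, "heap limit unavailable");
        }
        Status status = (double) usedBytes / maxBytes <= maxRatio ? Status.HEALTH : Status.UNHEALTH;
        // Sizes are reported in mebibytes (bytes shifted right by 20 bits).
        return new Info(status, "memory " + (usedBytes >> 20) + "M/" + (maxBytes >> 20) + "M");
    }

    /** Supplier that reads only counters the JVM already maintains: no I/O and no waiting. */
    static Supplier<Info> heapSupplier(double maxRatio) {
        return () -> {
            Runtime rt = Runtime.getRuntime();
            return evaluate(rt.totalMemory() - rt.freeMemory(), rt.maxMemory(), maxRatio);
        };
    }
}
```

A component could hand a supplier like `HeapHealthSketch.heapSupplier(0.9)` to its HealthUnit; the 0.9 threshold is an arbitrary example value.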

SegmentContainer HealthAspect

System HealthAspect

Cache HealthAspect

Long Term Storage HealthAspect

Healthcheck Usages or Integrations

| Access Level | Query Example | Response Example | Use Cases |
|---|---|---|---|
| Local | curl http://localhost:10080/health | {"health": 0} or {"health": -1} | Fundamental Healthcheck for SegmentStore |
| Local | curl http://localhost:10080/ready | {"ready": 0} | Fundamental Readiness Check for SegmentStore |
| Local | curl http://localhost:10090/healthDetails | {"health": -1, "details": "No Active SegmentContainer"} | Fundamental Healthcheck for Controller |
| K8S pod | curl -v http://10.100.200.125:10080/health; curl -v http://10.100.200.125:10090/ready | | Troubleshooting inside K8S |
| Operator | {LivenessProbe: exec: Command: curl -v /health ReadinessProbe: Exec: Command: curl -v /ready | | Operator exposes Pravega Liveness and Readiness check to K8S |
| Service (CLI) | health -[Segmentstore\|Controller\|All] | {"health": 0, "details": "5 stores healthy, 1 store unhealthy"} | Service level healthcheck aggregation |
| Database (Influxdb) | SELECT health-check-events from PravegaMetricsStore ... | | History and distribution of health information available now |
| Metrics (Grafana) | Metrics backend User Interface | | Integration with BK/ZK metrics possible now |

Discarded approaches

https://github.com/pravega/pravega/pull/1902