Understanding your GHES graphs part 1 - System Health, Processes, and Authentication #42146

djdefi · 2022-12-19T18:09:48Z

GitHub Enterprise Server collects metrics from its services and the base Linux operating system. This data is useful for troubleshooting performance issues and for understanding how your system is being used. The metrics are stored in a time series database and displayed in the GitHub Enterprise Management Console, which includes a Monitor dashboard located at http(s)://[hostname]/setup/monitor. This dashboard displays graphs created with data gathered by the built-in collectd service. Data used in the graphs is sampled every 10 seconds.

Each graph has an informational tooltip describing the graph, which is accessible by hovering over or clicking on the i in the upper left corner of each graph.

Graph data can also be forwarded to an external receiver by enabling collectd forwarding within the GitHub Enterprise Management Console. This allows you to build customized dashboards and alerts from your GitHub Enterprise graph data.

This article series will explore what each of the dashboard sections covers and which specific graph trends to watch out for. As each GitHub Enterprise system is unique in user patterns and integrations, we encourage administrators to reach out to the GitHub support team for help interpreting their instance's monitor graphs if questions arise. The graph data is included within appliance Support Bundles, which can be shared with our support team.

System Health

The system health graphs provide a general overview of services and system resource utilization. CPU, Memory, and Load Average graphs are useful for identifying trends or times where provisioned resource saturation has occurred.

CPU

Abnormally high CPU utilization or prolonged spikes can mean your instance is under-provisioned.
In the above example, the CPU was nearly 100% consumed by user for a period of time.
Presence of CPU "steal" time on the CPU graph can be an indication that other virtual machines running on the same host system are saturating the underlying resources, causing the GitHub Enterprise system to wait for CPU cycles.
User and System are generally the largest consumers of CPU time.
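To make the user, system, and steal categories concrete, here is a minimal sketch of how such percentages can be derived on a generic Linux host from two readings of the aggregate "cpu" line in /proc/stat. This is not how the Monitor dashboard is implemented; the function names and the sample values are assumptions made up for illustration.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return cumulative tick counts from a /proc/stat 'cpu' line."""
    parts = line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_shares(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Two made-up samples taken a few seconds apart (hypothetical values).
first = parse_cpu_line("cpu 1000 0 500 8000 100 0 0 400")
second = parse_cpu_line("cpu 1600 0 700 8100 100 0 0 500")
shares = cpu_shares(first, second)
print(f"user={shares['user']:.1f}% steal={shares['steal']:.1f}%")
```

In this made-up interval, user time dominates and a noticeable share is lost to steal, which is the pattern the notes above describe for a busy instance on a contended host.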

Memory

The Linux kernel provides a layer of in-memory disk caching, which is represented by "cache" on the graph. It is perfectly normal, and recommended, to have at least a few GB of cache overhead. The system will attempt to cache as much as possible, but applications can take this memory on demand. Because of this, we consider the total amount of available memory to be the sum of the "cached" and "free" values.
Running out of available (free + cached) memory can lead to out-of-memory (OOM) events, causing services to terminate and the application to behave unexpectedly.
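The "free + cached" sum can be sketched as follows, assuming input in the format of /proc/meminfo on a generic Linux host. The helper names and the sample figures are hypothetical and are not taken from a real appliance.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def available_kb(meminfo):
    """Free plus cached memory, the sum the article treats as available."""
    return meminfo["MemFree"] + meminfo["Cached"]


# Hypothetical sample in /proc/meminfo format.
sample = """MemTotal:       32000000 kB
MemFree:         1500000 kB
Cached:         12000000 kB
"""
info = parse_meminfo(sample)
print(f"available: {available_kb(info)} kB of {info['MemTotal']} kB")
```

On a live host the same functions could be fed the contents of /proc/meminfo; a low "free" value on its own is not alarming as long as the combined figure stays comfortable.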

Load

System Load Average is a measurement showing the running task demand on the system.
We recommend monitoring the fifteen-minute (long-term) system load average for values nearing or exceeding the number of CPU cores allocated to the virtual machine.
When the load average rises above the number of CPU cores, it generally means that tasks must wait for resources before they can run.
Assuming the above example graph is a GitHub Enterprise system with 2 CPU cores, we can determine that processes are often waiting for resources.
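The comparison between load average and core count is simple enough to sketch. The threshold of 1.0 per core and the sample load value below are assumptions for illustration; the live-host helper relies on os.getloadavg(), which is available on Unix-like systems.

```python
import os


def load_per_core(load_average, cpu_cores):
    """Normalize a load average by the number of CPU cores."""
    return load_average / cpu_cores


def is_saturated(load_average, cpu_cores, threshold=1.0):
    """True when the load average meets or exceeds threshold * cores."""
    return load_per_core(load_average, cpu_cores) >= threshold


def fifteen_minute_status():
    """Check the live host using the fifteen-minute load average."""
    _, _, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    return fifteen, cores, is_saturated(fifteen, cores)


# Hypothetical reading: a fifteen-minute load of 3.2 on a 2-core system.
print(is_saturated(3.2, 2))
```

A load of 3.2 on 2 cores works out to 1.6 runnable tasks per core, so on average some tasks are waiting rather than running.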

Processes

By clicking on running in the legend at the bottom of the graph, we can isolate different process states. In the above example we have selected running processes.
The running process count will fluctuate with system activity. Sharp changes or drops could be expected depending on usage trends.
Large or consistent numbers of blocked or zombie processes may indicate a service problem.
It is expected to have processes in the sleeping state during normal operation.
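For readers who want to cross-check these states on a generic Linux host, the sketch below tallies process states from lines in the format of /proc/[pid]/stat. It assumes that "blocked" on the graph corresponds to the kernel's uninterruptible sleep state (D); that mapping, the function names, and the sample lines are assumptions rather than something stated in this article.

```python
from collections import Counter

# Assumed mapping from kernel state codes to the graph's legend names.
STATE_NAMES = {"R": "running", "S": "sleeping", "D": "blocked", "Z": "zombie"}


def state_of(stat_line):
    """Extract the state code from a /proc/[pid]/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def tally_states(stat_lines):
    """Count processes per state name; unknown codes are grouped as 'other'."""
    counts = Counter()
    for line in stat_lines:
        counts[STATE_NAMES.get(state_of(line), "other")] += 1
    return counts


# Hypothetical, truncated stat lines.
samples = [
    "1 (systemd) S 0 1 1 0",
    "42 (my worker) D 1 42 42 0",
    "77 (defunct) Z 1 77 77 0",
    "90 (ruby) R 1 90 90 0",
    "91 (ruby) S 1 91 91 0",
]
print(dict(tally_states(samples)))
```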

Files

This graph represents the max number of open files, as well as the current number of used open files.
On a healthy system, the number of used files should never reach the max value. Reaching the max can indicate problems with a GitHub Enterprise service.
Limiting maximum open files is a protection built into Linux to prevent runaway processes from impacting other services on the system.
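As a rough illustration of used versus maximum open files, the sketch below parses text in the format of /proc/sys/fs/file-nr, which on Linux holds three numbers: allocated file handles, allocated but unused handles, and the system-wide maximum. Whether the dashboard graph is built from exactly this file is an assumption, and the sample values are made up.

```python
def parse_file_nr(text):
    """Return (used, maximum) from /proc/sys/fs/file-nr content."""
    allocated, unused, maximum = (int(v) for v in text.split())
    return allocated - unused, maximum


def usage_ratio(text):
    """Fraction of the system-wide file handle limit currently in use."""
    used, maximum = parse_file_nr(text)
    return used / maximum


# Hypothetical reading: 51200 handles in use out of 1048576.
sample = "51200\t0\t1048576\n"
print(f"{usage_ratio(sample):.1%} of the open file limit in use")
```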

Forks

The fork_rate trend greatly depends on system activity, and will reach values upwards of 1000-2000 on busy systems.
Large spikes beyond the observed averages should be investigated.
This metric is related to Linux process forking, and is not related to forking a repository in the GitHub application.
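A fork rate is the change in a cumulative counter divided by the time between samples. On Linux, /proc/stat exposes a "processes" line with the number of forks since boot; the sketch below assumes that counter as the source and uses the 10-second sampling interval mentioned earlier. The function names and sample counter values are hypothetical.

```python
def fork_count(proc_stat_text):
    """Return the cumulative fork counter from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        if line.startswith("processes "):
            return int(line.split()[1])
    raise ValueError("no 'processes' line found")


def fork_rate(before_text, after_text, interval_seconds):
    """Forks per second between two /proc/stat samples."""
    delta = fork_count(after_text) - fork_count(before_text)
    return delta / interval_seconds


# Hypothetical samples taken 10 seconds apart.
before = "ctxt 900000\nprocesses 120000\nprocs_running 3\n"
after = "ctxt 990000\nprocesses 135000\nprocs_running 2\n"
print(fork_rate(before, after, 10))
```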

Processes

The processes graph section looks deeper into the major individual services which make up the GitHub Enterprise appliance. Looking at these services individually can show how usage trends impact system resources over time.

Processes

Process counts will fluctuate with usage trends.

Memory

The unicorn and aqueduct process groups normally consume the highest amount of memory, followed by memcached, mysql, and elasticsearch.
unicorn, babeld, and git-daemon processes are most influenced by user activity.
Factors such as the size of the repositories being accessed and the frequency of requests affect memory use. Because of this, these process graphs can have peaks and valleys.

CPU (Kernel)

CPU time consumed by processes accessing hardware directly, via trusted lower level operating system functions.
The unicorn process normally consumes the most CPU time.
Often has lower values than the following CPU (Application) graph.

CPU (Application)

CPU time consumed by processes running their own application code in user space, which reaches hardware only through Linux kernel interfaces.
The majority of GitHub Enterprise service CPU time occurs here.
On busy systems, unicorn consumes the most CPU time, followed by babeld, git, git-daemon, and aqueduct.

I/O operations (Read IOPS)

git-daemon and babeld read Input/Output Operations Per Second (IOPS) values are influenced by Git fetch / pull activity.
unicorn read IOPS are influenced by web application or API GET requests.
aqueduct read IOPS are from background jobs, such as regular repository maintenance and repacking.

I/O operations (Write IOPS)

git-daemon write IOPS reflect Git push activity, and git-daemon is most often the largest consumer of write IOPS.
aqueduct background jobs, such as search indexing and repository repacking, are also large consumers of write IOPS.

Storage traffic (Read)

Read throughput trends are a counterpart to Read IOPS. These values used together can help determine if storage system read performance is as expected.
User activity such as fetching and retrieving API data will result in read activity.

Storage traffic (Written)

Write throughput trends are a counterpart to Write IOPS. These values used together can help determine if storage system write performance is as expected.
Pushes, background repacks, and API POST operations are often the largest influencers of this graph.
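One way to use throughput and IOPS together, for reads or writes alike, is to divide one by the other to get the average size of each I/O operation. The sketch below shows that arithmetic; the function name and the sample figures are assumptions for illustration, not measurements from a real instance.

```python
def average_io_size_kib(bytes_per_second, iops):
    """Average size of one I/O operation in KiB, or 0.0 when idle."""
    if iops == 0:
        return 0.0
    return bytes_per_second / iops / 1024


# Hypothetical reading: 50 MiB/s of throughput at 3200 IOPS.
throughput = 50 * 1024 * 1024
print(average_io_size_kib(throughput, 3200), "KiB per operation")
```

If IOPS sit near the storage system's limit while throughput stays low, the workload is dominated by small operations; high throughput at modest IOPS points to large sequential transfers instead.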

Authentication

The authentication graphs break down the rates at which users and applications are authenticating to the GitHub Enterprise appliance. We also track the protocol or service type such as Git or API for the authentications, which is useful in identifying broad user activity trends. The authentication graphs can help find interesting trends or timeframes to look at when diving deeper into authentication and API request logs.

Authentication Totals

Displays which methods users are authenticating with, and if they are successful in those attempts.
Large numbers of failures usually indicate misconfigured clients which are failing repeatedly.

Authentication Rate

Large numbers of authentications per second can cause authentication worker saturation.
Automated requests or "polling" can be identified by a flat baseline, or intervals of authentications which occur regularly, even during off-peak times such as weekends or holidays.
Human user authentication trends typically follow a bell curve, more closely matching your organization's daily business hours.
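The difference between a flat polling baseline and a business-hours curve can be approximated with the coefficient of variation (standard deviation divided by mean) of the authentication rate over a period. This is only a rough heuristic sketched under assumptions: the 0.1 threshold and the sample series are made up, and real traffic is usually a mix of both patterns.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Standard deviation relative to the mean; 0.0 for an all-zero series."""
    average = mean(samples)
    if average == 0:
        return 0.0
    return pstdev(samples) / average


def looks_like_polling(samples, threshold=0.1):
    """True when the series is flat enough to suggest automated requests."""
    return coefficient_of_variation(samples) < threshold


# Hypothetical authentications per second, sampled across a day.
polling = [20, 21, 20, 19, 20, 20, 21, 20]
human = [1, 2, 10, 30, 45, 28, 9, 2]
print(looks_like_polling(polling), looks_like_polling(human))
```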

LDAP

LDAP graphs will only display data if LDAP Authentication is enabled on the GitHub Enterprise appliance. These graphs can help to identify slow responses from your LDAP server, as well as the overall volume of LDAP password based authentications.

LDAP authentications

If any timeouts appear in the graph, GitHub Enterprise was unable to communicate with the LDAP server in time for an authentication request to take place.
Failures indicate that users or clients are attempting to authenticate with an invalid LDAP username or password.
Having users authenticate with a personal access token instead of a username and password can help reduce the number and frequency of requests which rely on the LDAP server.

LDAP authentication response time

Useful for tracking LDAP server performance trends, from the perspective of the GitHub Enterprise appliance.
LDAP responses which take longer than 10 seconds will result in a timeout for the authentication request.
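When reviewing response times exported from this graph, it can help to count how many samples approach the 10-second timeout before they start failing. The sketch below does that; the 50% warning fraction, the function name, and the sample values are assumptions for illustration.

```python
LDAP_TIMEOUT_SECONDS = 10.0  # timeout stated in the article


def summarize_response_times(samples, warn_fraction=0.5):
    """Bucket response times into ok, slow (near the timeout), and timed out."""
    warn_at = LDAP_TIMEOUT_SECONDS * warn_fraction
    timed_out = sum(1 for s in samples if s > LDAP_TIMEOUT_SECONDS)
    slow = sum(1 for s in samples if warn_at <= s <= LDAP_TIMEOUT_SECONDS)
    ok = len(samples) - timed_out - slow
    return {"ok": ok, "slow": slow, "timed_out": timed_out}


# Hypothetical response times in seconds.
times = [0.2, 0.4, 6.5, 11.0, 0.3]
print(summarize_response_times(times))
```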

LDAP Sync Totals

Reflects the number of user, team, and net new_members records which were synchronized via the LDAP Synchronization feature, when the feature is enabled.

LDAP Sync Runtime

If the runtime of team or user sync cycles exceeds the current LDAP Sync interval, the interval should be increased to allow completion before the next cycle.
Long run times may indicate poor LDAP server performance, or suboptimal configuration of Domain Bases and restricted groups.
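The comparison between sync runtime and sync interval can be sketched as a headroom check. Both values are taken in seconds here purely for illustration; the function name, the units, and the sample runtimes are assumptions and do not reflect how the interval is configured on the appliance.

```python
def sync_headroom(runtimes_seconds, interval_seconds):
    """Summarize how sync runtimes compare with the configured interval."""
    longest = max(runtimes_seconds)
    overruns = sum(1 for r in runtimes_seconds if r > interval_seconds)
    return {
        "longest": longest,
        "overruns": overruns,
        "headroom": interval_seconds - longest,
    }


# Hypothetical runtimes for four sync cycles against a one-hour interval.
report = sync_headroom([1200, 1500, 3900, 1400], 3600)
print(report)
```

A negative headroom means at least one cycle was still running when the next one was due, which is the situation where increasing the interval is recommended above.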

Continue the conversation

There's more in "Understanding your GHES graphs" part 2. Please let us know if you have any questions in the comments!

Madugu1990 · 2023-01-01T14:36:35Z

I need guidance and mentorship on how I will run this beautiful opportunity. Who can help please?

3 replies

tuves Jan 3, 2023
Maintainer

Welcome to the GitHub Community, @Madugu1990. Here are a few resources to get you started:

0schweps0 Jan 10, 2023

It's a good idea

tylerlawrence2023 · 2023-01-04T01:18:00Z

I can't find the "button"

2 replies

walkersevenonethreenineone Jan 4, 2023

Which?

Junifa · 2023-01-08T07:01:10Z
