Monitoring

Developers and administrators can monitor application, system, and infrastructure metrics for Grainite. Grainite exposes a Prometheus endpoint on port 5064 for gathering and processing monitoring data. Metrics created with the counters and gauges APIs (described in the API section of the documentation) are exposed to Prometheus by Grainite and can be visualized in a tool such as Grafana. More details here.
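
For a quick check of what the endpoint exposes without a full Prometheus setup, the following minimal sketch fetches and parses the Prometheus text format using only the Python standard library. The `/metrics` path, the host name passed to `scrape`, and the helper names `parse_metrics` and `scrape` are assumptions made for illustration and are not taken from the Grainite documentation; only port 5064 comes from the text above.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, raw_labels, value) tuples.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. Labels are
    kept as the raw string between the braces rather than parsed further.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # name{label="value",...} value [timestamp]
            name, rest = line.split("{", 1)
            labels, tail = rest.rsplit("}", 1)
        else:
            # name value [timestamp]
            name, _, tail = line.partition(" ")
            labels = ""
        value = float(tail.split()[0])
        samples.append((name, labels, value))
    return samples


def scrape(host, port=5064, path="/metrics", timeout=5.0):
    """Fetch and parse metrics. The default path is an assumption."""
    with urlopen(f"http://{host}:{port}{path}", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `scrape("my-grainite-host")` (a hypothetical host name) would return a list of samples that can be filtered by metric name, for example to look only at names starting with `gxsapp_`.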

Developer-defined application metrics

Developers can define application metrics using the counters and gauges APIs. The Userflows example included in our Samples demonstrates the following developer-defined application metrics.

Description	Metric Used	Metric Type
Completed flows	`gxsapp_completed_flows_total`	Counter
Abandoned flows	`gxsapp_abandoned_flows_total`	Counter
Current flows	`gxsapp_current_flow_counts_current`	Gauge
Current flows by type	`gxsapp_current_flow_counts_current`	Gauge

Grainite-defined metrics

Grainite provides built-in metrics that allow developers and administrators to monitor:

Application runtime metrics
Event processing metrics
Database metrics

Application Runtime Metrics

Description	Metric Used	Metric Type
Rate of action invocations per minute	`gxssys_action_count_total`	Gauge
Count of action errors	`gxssys_action_errors_total`	Counter
Average action execution latency	`gxssys_grain_execution_us_total`	Counter
Paused endpoints due to failures	`gxssys_endpoint_paused_total`	Gauge
Task execution errors	`gxsapp_gxtask_execution_errors_total`	Counter
Task instance execution errors	`gxsapp_gxtask_instance_execution_errors_total`	Counter
Task execution status	`gxsapp_gxtask_execution_status_current`	Gauge

Event processing metrics

Description	Metric Used	Metric Type
Message delivery latency over the last 30-second window of data, published for the 50th, 95th, and 99th percentiles	`gxssys_message_delay_ms_total`	Counter
Topic consumption latency over the last 30-second window of data, published for the 50th, 95th, and 99th percentiles	`gxssys_subscription_delay_ms_total`	Counter
Batch size of fetched requests	`gxdtopic_tot_fetch_batch_size_total` `gxdtopic_tot_batch_size_cnt_total`	Counter
Total events fetched from Topic	`gxdtopic_tot_fetched_events_total`	Counter
Total events fetched and processed	`gxdtopic_tot_consumed_events_total`	Counter
Total Grain to Grain messages fetched	`gxdg2g_tot_fetched_messages_total`	Counter
Total Grain to Grain messages fetched and processed	`gxdg2g_tot_consumed_messages_total`	Counter
Number of events that have been pulled from a topic but not yet processed	`gxdexec_cur_inflight_events_current`	Gauge
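
The cumulative counters in this table are usually easier to read as rates. The sketch below shows one common way to turn two successive samples of a counter (for example `gxdtopic_tot_fetched_events_total`) into a per-second rate, treating a drop in value as a counter reset. The function name `counter_rate` and the reset handling are illustrative assumptions, not part of Grainite; in practice Prometheus's own `rate()` function does this for you.

```python
def counter_rate(prev_value, curr_value, elapsed_seconds):
    """Return the per-second increase of a cumulative counter.

    If the counter went backwards, assume it was reset to zero (for example
    after a restart) and count only the increase since the reset.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        delta = curr_value  # counter restarted from zero
    return delta / elapsed_seconds


# Two samples taken 30 seconds apart (made-up values).
fetched_rate = counter_rate(1000, 1600, 30)   # events fetched per second
consumed_rate = counter_rate(900, 1350, 30)   # events consumed per second
```

Comparing the fetched and consumed rates over time gives a rough indication of whether processing is keeping up with fetching, alongside the in-flight gauge listed above.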

Database Metrics

Description	Metric Used	Metric Type
Average latency to process requests to database	`gxsctl_tot_work_process_latency_total` `gxsctl_tot_work_total`	Counter
Cumulative count of writes to Grains	`gxggrain_tot_update_total`	Counter
Disk currently used by apps and system	`gxsdat_cur_disk_used_size_current`	Gauge
Number of Grain updates that have materialized	`gxpmatr_tot_fetched_logs_total`	Counter
Number of Grain updates pending	`gxggrain_cur_update_current`	Gauge
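
Where a row lists two counters for an average, such as the database request latency above, a reasonable reading is that one counter accumulates the total and the other accumulates the number of observations. Under that assumption (the table does not spell it out, and the latency unit is not stated), the average over an interval is the ratio of the two deltas. The helper name `average_over_interval` is hypothetical.

```python
def average_over_interval(sum_prev, sum_curr, count_prev, count_curr):
    """Average per-observation value between two scrapes.

    sum_*   : samples of the cumulative-sum counter
              (assumed: gxsctl_tot_work_process_latency_total)
    count_* : samples of the cumulative-count counter
              (assumed: gxsctl_tot_work_total)
    Returns None when no new observations were recorded in the interval.
    """
    count_delta = count_curr - count_prev
    if count_delta <= 0:
        return None
    return (sum_curr - sum_prev) / count_delta


# Made-up samples: latency total grew by 3000 while 60 requests completed.
avg_latency = average_over_interval(5000, 8000, 100, 160)
```

The same ratio-of-deltas pattern would apply to any other pair of sum and count counters, again assuming that is how the pair is meant to be combined.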

Cluster Health

Description	Metric Used	Metric Type
The total high load and total stalled metrics indicate the health of the Grainite cluster's compute capacity	`gxssrvr2_tot_highload_hz` `gxssrvr2_tot_stalled_hz`	Counter
The WAL disk used metric reports the current utilization of the Grainite Write Ahead Log (WAL)	`gxwwal_cur_disk_rlused_size_current`	Gauge
The current allowed rate and current target rate help determine whether there is a continuous event execution overload on the cluster	`gxdfctrl_cur_allowed_rate_current` `gxdfctrl_cur_target_rate_current`	Gauge

Infrastructure Metrics

Cloud providers' monitoring solutions can be used to gather infrastructure-level metrics. We recommend monitoring the following metrics:

CPU usage: CPU used by each Kubernetes node, measured in CPU cores
CPU utilization: CPU used by each node, measured as a percentage of available CPU resources
Bytes transmitted: Throughput of network traffic sent by each node, measured in bytes
Bytes received: Throughput of network traffic received by each node, measured in bytes
Memory usage: Memory used by each node, measured in GiB
Disk read: Read throughput, in IOPS, from each node's persistent disk
Disk write: Write throughput, in IOPS, to each node's persistent disk
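
CPU usage and CPU utilization in the list above describe the same measurement in different units, and memory is often reported in bytes rather than GiB. As a small illustration of how the values relate, the sketch below converts between them. The function names and sample numbers are hypothetical, and cloud providers may define "available CPU resources" slightly differently.

```python
def cpu_utilization_percent(used_cores, available_cores):
    """Convert CPU usage in cores to utilization as a percentage."""
    if available_cores <= 0:
        raise ValueError("available_cores must be positive")
    return 100.0 * used_cores / available_cores


def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / (1024 ** 3)


# A node using 1.5 of its 4 available cores is at 37.5% utilization.
utilization = cpu_utilization_percent(1.5, 4)
memory_gib = bytes_to_gib(8 * 1024 ** 3)
```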

Additional metrics can be added as desired for your deployments within the cloud provider's monitoring console.
