
Monitoring

Developers and administrators can monitor application, system, and infrastructure metrics for Grainite. Grainite exposes a Prometheus endpoint on port 5064 for gathering and processing monitoring data. Metrics created with the counters and gauges APIs (discussed in the API section of the documentation) are exposed to Prometheus by Grainite and can be visualized in a tool such as Grafana. More details here.
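As a quick check that the endpoint is reachable, the metrics can also be read directly without a full Prometheus setup. The following is a minimal sketch, not Grainite's own tooling: it assumes the endpoint serves the standard Prometheus text exposition format at the conventional `/metrics` path (the path is an assumption; only port 5064 comes from this page), and the helper names are hypothetical.

```python
import re
import urllib.request

# Matches label pairs such as key="value" inside the {...} part of a sample line.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Parse Prometheus text exposition format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            name = line[: line.index("{")]
            labels_str = line[line.index("{") + 1 : line.rindex("}")]
            labels = dict(_LABEL_RE.findall(labels_str))
            rest = line[line.rindex("}") + 1 :].split()
        else:
            parts = line.split()
            name, labels, rest = parts[0], {}, parts[1:]
        # The first token after the name/labels is the value; an optional
        # timestamp may follow and is ignored here.
        samples.append((name, labels, float(rest[0])))
    return samples


def fetch_metrics(host, port=5064, path="/metrics"):
    """Fetch and parse metrics. The '/metrics' path is an assumption."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling `fetch_metrics("<grainite-host>")` and filtering on the `gxsapp_` or `gxssys_` name prefixes gives a quick view of application and system metrics, assuming the host is reachable from where the script runs.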

Developer-defined application metrics

Developers can define application metrics using the counters and gauges APIs. The Userflows example included in our Samples demonstrates the following developer-defined application metrics.
| Description | Metric Used | Metric Type |
| Completed flows | gxsapp_completed_flows_total | Counter |
| Abandoned flows | gxsapp_abandoned_flows_total | Counter |
| Current flows | gxsapp_current_flow_counts_current | Gauge |
| Current flows by type | gxsapp_current_flow_counts_current | Gauge |
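Counters such as gxsapp_completed_flows_total only ever increase, so they are usually read as a rate rather than as a raw value. The sketch below shows one common way to turn two scraped counter values into a per-minute rate. It assumes the usual Prometheus convention that a counter restarts from zero when the process restarts; the function name is hypothetical.

```python
def per_minute_rate(prev_value, curr_value, elapsed_seconds):
    """Approximate a counter's per-minute rate between two scrapes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if curr_value < prev_value:
        # The counter went backwards, which is treated as a reset to zero:
        # everything counted since the reset is the increase.
        increase = curr_value
    else:
        increase = curr_value - prev_value
    return increase * 60.0 / elapsed_seconds
```

For example, if the counter read 100 at one scrape and 130 thirty seconds later, `per_minute_rate(100, 130, 30)` gives 60 completed flows per minute. In a Prometheus and Grafana setup, the built-in `rate()` function does the equivalent work.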

Grainite-defined metrics

Grainite provides built-in metrics that allow developers and administrators to monitor:
  • Application runtime metrics
  • Event processing metrics
  • Database metrics
Application Runtime Metrics
| Description | Metric Used | Metric Type |
| Rate of action invocation per minute | gxssys_action_count_total | Gauge |
| Count of action errors | gxssys_action_errors_total | Counter |
| Average action execution latency | gxssys_grain_execution_us_total | Counter |
| Endpoints paused due to failures | gxssys_endpoint_paused_total | Gauge |
| Task execution errors | gxsapp_gxtask_execution_errors_total | Counter |
| Task instance execution errors | gxsapp_gxtask_instance_execution_errors_total | Counter |
| Task execution status | gxsapp_gxtask_execution_status_current | Gauge |

Event processing metrics

| Description | Metric Used | Metric Type |
| Message delivery latency for the last 30s window of data. Published for the 50th, 95th, and 99th percentiles | gxssys_message_delay_ms_total | Counter |
| Topic consumption latency for the last 30s window of data. Published for the 50th, 95th, and 99th percentiles | gxssys_subscription_delay_ms_total | Counter |
| Batch size of fetched requests | gxdtopic_tot_fetch_batch_size_total, gxdtopic_tot_batch_size_cnt_total | Counter |
| Total events fetched from topic | gxdtopic_tot_fetched_events_total | Counter |
| Total events fetched and processed | gxdtopic_tot_consumed_events_total | Counter |
| Total Grain-to-Grain messages fetched | gxdg2g_tot_fetched_messages_total | Counter |
| Total Grain-to-Grain messages fetched and processed | gxdg2g_tot_consumed_messages_total | Counter |
| Number of events that have been pulled from a topic but have not yet been processed | gxdexec_cur_inflight_events_current | Gauge |
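Some rows list two counters for a single quantity, for example the batch size of fetched requests. A plausible reading, based only on the metric names, is that one counter accumulates a sum (gxdtopic_tot_fetch_batch_size_total) and the other accumulates the number of observations (gxdtopic_tot_batch_size_cnt_total), so that an average over an interval is the ratio of their increases. That interpretation is an assumption, not something this page confirms, and the function below is a hypothetical sketch of it.

```python
def average_over_interval(sum_prev, sum_curr, count_prev, count_curr):
    """Average value per observation between two scrapes of a sum/count pair.

    Assumes the first counter accumulates a sum and the second accumulates
    the number of observations (an interpretation of the metric names).
    Returns None when no new observations were recorded in the interval.
    """
    count_delta = count_curr - count_prev
    if count_delta <= 0:
        # No new observations (or a counter reset): no meaningful average.
        return None
    return (sum_curr - sum_prev) / count_delta
```

With a sum that moves from 1000 to 1600 while the count moves from 10 to 16, the average batch size over the interval would be 100. If the same reading holds for gxsctl_tot_work_process_latency_total and gxsctl_tot_work_total in the Database Metrics table, the average request latency could be derived the same way.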

Database Metrics

| Description | Metric Used | Metric Type |
| Average latency to process requests to database | gxsctl_tot_work_process_latency_total, gxsctl_tot_work_total | Counter |
| Cumulative count of writes to Grains | gxggrain_tot_update_total | Counter |
| Disk currently used by apps and system | gxsdat_cur_disk_used_size_current | Gauge |
| Number of Grain updates that have materialized | gxpmatr_tot_fetched_logs_total | Counter |
| Number of Grain updates pending | gxggrain_cur_update_current | Gauge |

Cluster Health

| Description | Metric Used | Metric Type |
| Total high load and total stalled metrics indicate the health of the compute capability of the Grainite cluster | gxssrvr2_tot_highload_hz, gxssrvr2_tot_stalled_hz | Counter |
| WAL disk used metric provides the current utilization of the Grainite Write Ahead Log (WAL) | gxwwal_cur_disk_rlused_size_current | Gauge |
| The current allowed rate and current target rate help determine whether there is a continuous event execution overload on the cluster | gxdfctrl_cur_allowed_rate_current, gxdfctrl_cur_target_rate_current | Gauge |

Infrastructure Metrics

Cloud providers' monitoring solutions can be used to gather infrastructure-level metrics. We recommend monitoring the following metrics:
  • CPU usage: CPU usage by each Kubernetes node, measured in number of CPU cores
  • CPU utilization: CPU utilization by each node, measured as a percentage of available CPU resources
  • Bytes transmitted: Throughput of network traffic sent out of each node, measured in bytes
  • Bytes received: Throughput of network traffic received by each node, measured in bytes
  • Memory usage: Memory usage by each node, measured in GiB
  • Disk read: Rate of read operations (IOPS) issued by each node to its persistent disk
  • Disk write: Rate of write operations (IOPS) issued by each node to its persistent disk
Additional metrics can be added as desired for your deployments within the cloud provider's monitoring console.