Monitoring
Developers and administrators can monitor application, system, and infrastructure metrics for Grainite. Grainite exposes a Prometheus endpoint on port 5064 for gathering and processing monitoring data. Metrics created with the counters and gauges APIs (described in the API section of the documentation) are exposed to Prometheus by Grainite and can be visualized in a tool such as Grafana.
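As a quick check that the endpoint is reachable, the metrics can also be read directly without a Prometheus server. The following is a minimal sketch, not an official client: it assumes the endpoint serves the standard Prometheus text exposition format at the path `/metrics` (the path is an assumption, only the port is documented), and the `flow_type` label used in the usage note is hypothetical.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text exposition format into {series: value}.

    The series key keeps the label set verbatim, e.g. 'name{label="x"}'.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comments.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels present: everything up to the last '}' is the series.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
            value = rest[0]
        else:
            fields = line.split()
            series, value = fields[0], fields[1]
        # An optional trailing timestamp is ignored.
        samples[series] = float(value)
    return samples


def fetch_metrics(host, port=5064, path="/metrics"):
    """Fetch and parse metrics. The '/metrics' path is an assumption."""
    with urlopen(f"http://{host}:{port}{path}", timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, `parse_metrics('gxsapp_current_flow_counts_current{flow_type="signup"} 3')` returns a dictionary with a single series whose value is `3.0`.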
Developers can define application metrics using the counters and gauges APIs. The Userflows example included in our Samples demonstrates the following developer-defined application metrics.
Description | Metric Used | Metric Type |
---|---|---|
Completed flows | gxsapp_completed_flows_total | Counter |
Abandoned flows | gxsapp_abandoned_flows_total | Counter |
Current flows | gxsapp_current_flow_counts_current | Gauge |
Current flows by type | gxsapp_current_flow_counts_current | Gauge |
Grainite provides built-in metrics that allow developers and administrators to monitor:
- Application runtime metrics
- Event processing metrics
- Database metrics
Application Runtime Metrics
Description | Metric Used | Metric Type |
---|---|---|
Rate of action invocations per minute | gxssys_action_count_total | Gauge |
Count of action errors | gxssys_action_errors_total | Counter |
Average action execution latency | gxssys_grain_execution_us_total | Counter |
Endpoints paused due to failures | gxssys_endpoint_paused_total | Gauge |
Task execution errors | gxsapp_gxtask_execution_errors_total | Counter |
Task instance execution errors | gxsapp_gxtask_instance_execution_errors_total | Counter |
Task execution status | gxsapp_gxtask_execution_status_current | Gauge |
Event Processing Metrics
Description | Metric Used | Metric Type |
---|---|---|
Message delivery latency over the last 30s window of data, published for the 50th, 95th, and 99th percentiles | gxssys_message_delay_ms_total | Counter |
Topic consumption latency over the last 30s window of data, published for the 50th, 95th, and 99th percentiles | gxssys_subscription_delay_ms_total | Counter |
Batch size of fetched requests | gxdtopic_tot_fetch_batch_size_total, gxdtopic_tot_batch_size_cnt_total | Counter |
Total events fetched from Topic | gxdtopic_tot_fetched_events_total | Counter |
Total events fetched and processed | gxdtopic_tot_consumed_events_total | Counter |
Total Grain to Grain messages fetched | gxdg2g_tot_fetched_messages_total | Counter |
Total Grain to Grain messages fetched and processed | gxdg2g_tot_consumed_messages_total | Counter |
Number of events that have been pulled from a topic but have not yet been processed | gxdexec_cur_inflight_events_current | Gauge |
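Counters such as `gxdtopic_tot_fetched_events_total` only ever increase, so they are usually read as a rate over a time window rather than as raw values. The sketch below shows one way to compute a per-second rate from a series of scraped samples, treating a drop in value as a counter restart. This mirrors how Prometheus-style counters are conventionally handled and is an assumption about Grainite's counters, not documented behavior. The function name is hypothetical.

```python
def counter_rate(samples):
    """Per-second rate of a counter from (timestamp_seconds, value) samples.

    Samples must be ordered by time. Returns None when fewer than two
    samples are given or when no time has elapsed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A lower value than before is treated as a restart from zero,
        # so the whole new value counts as the increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None
```

Comparing the rate of `gxdtopic_tot_fetched_events_total` with the rate of `gxdtopic_tot_consumed_events_total` over the same window gives a rough view of whether processing is keeping up with fetching.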
Database Metrics
Description | Metric Used | Metric Type |
---|---|---|
Average latency to process requests to database | gxsctl_tot_work_process_latency_total, gxsctl_tot_work_total | Counter |
Cumulative count of writes to Grains | gxggrain_tot_update_total | Counter |
Disk currently used by apps and system | gxsdat_cur_disk_used_size_current | Gauge |
Number of Grain updates that have materialized | gxpmatr_tot_fetched_logs_total | Counter |
Number of Grain updates pending | gxggrain_cur_update_current | Gauge |
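Where a row in the tables above lists two counters for an average, the average is presumably derived by dividing one by the other. The sketch below assumes that `gxsctl_tot_work_process_latency_total` is a cumulative latency sum and `gxsctl_tot_work_total` is the matching request count. That reading is inferred from the metric names and is not confirmed by this page. Taking the ratio of the deltas between two scrapes gives the average for that window rather than since startup.

```python
def average_over_window(sum_start, sum_end, count_start, count_end):
    """Average per-request value between two scrapes of a sum/count pair.

    Returns None when no requests were counted in the window (or when a
    counter restarted, which makes the delta negative).
    """
    delta_count = count_end - count_start
    delta_sum = sum_end - sum_start
    if delta_count <= 0 or delta_sum < 0:
        return None
    return delta_sum / delta_count
```

With a latency sum that grows from 5000 to 8000 while the request count grows from 100 to 160, the function returns 50.0, in whatever unit the latency counter uses. The same calculation would apply to `gxdtopic_tot_fetch_batch_size_total` and `gxdtopic_tot_batch_size_cnt_total` under the same assumption.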
Description | Metric Used | Metric Type |
---|---|---|
Total high load and total stalled metrics indicate the health of the compute capability of the Grainite cluster | gxssrvr2_tot_highload_hz, gxssrvr2_tot_stalled_hz | Counter |
WAL disk used metric provides the current utilization of Grainite Write Ahead Log (WAL) | gxwwal_cur_disk_rlused_size_current | Gauge |
The current allowed rate and current target rate help to determine if there is a continuous event execution overload on the cluster | gxdfctrl_cur_allowed_rate_current, gxdfctrl_cur_target_rate_current | Gauge |
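A "continuous" overload is a condition that persists across many scrapes rather than a single spike, so an alert would typically require the condition to hold for several consecutive samples. The sketch below illustrates that idea only. It assumes overload appears as the allowed rate staying below the target rate, which this page does not confirm, so check the direction of the comparison against your cluster's behavior. The function name and the default of five samples are hypothetical.

```python
def sustained_shortfall(allowed, target, min_samples=5):
    """True if allowed < target for each of the last `min_samples` scrapes.

    `allowed` and `target` are equal-length sequences of gauge readings
    taken at the same scrape times, oldest first.
    ASSUMPTION: allowed below target signals overload (not documented).
    """
    pairs = list(zip(allowed, target))
    if len(pairs) < min_samples:
        return False
    return all(a < t for a, t in pairs[-min_samples:])
```

Requiring consecutive samples keeps a brief dip from triggering an alert; the window length trades detection delay against noise.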
Cloud providers' monitoring solutions can be used to gather infrastructure-level metrics. We recommend monitoring the following metrics:
- CPU usage: CPU usage by each Kubernetes node, measured in number of CPU cores
- CPU utilization: CPU utilization by each node, measured as a percentage of available CPU resources
- Bytes transmitted: Throughput of network traffic sent out of each node, measured in bytes
- Bytes received: Throughput of network traffic received by each node, measured in bytes
- Memory usage: Memory usage by each node, measured in GiB
- Disk read: Read operations (IOPS) performed by each node on its persistent disk
- Disk write: Write operations (IOPS) performed by each node on its persistent disk
Additional metrics can be added as desired for your deployments within the cloud provider's monitoring console.