Monitoring

The Union.ai data plane deploys a static Prometheus instance that collects metrics required for platform features like cost tracking, task-level resource monitoring, and execution observability. This Prometheus instance is pre-configured and requires no additional setup.

For operational monitoring of the cluster itself (node health, API server metrics, CoreDNS, etc.), the data plane chart includes an optional kube-prometheus-stack instance that can be enabled separately.

Architecture overview

The data plane supports two independent monitoring concerns:

| Concern | What it monitors | How it’s deployed | Configurable |
|---|---|---|---|
| Union features | Task execution metrics, cost tracking, GPU utilization, container resources | Static Prometheus with pre-built scrape config | Retention, resources, scheduling |
| Cluster health (optional) | Kubernetes components, node health, alerting, Grafana dashboards | kube-prometheus-stack via monitoring.enabled | Full kube-prometheus-stack values |

                    ┌─────────────────────────────────────┐
                    │         Data Plane Cluster          │
                    │                                     │
                    │  ┌──────────────────────┐           │
                    │  │  Static Prometheus   │           │
                    │  │  (Union features)    │           │
                    │  │  ┌────────────────┐  │           │
                    │  │  │ Scrape targets │  │           │
                    │  │  │ - kube-state   │  │           │
                    │  │  │ - cAdvisor     │  │           │
                    │  │  │ - propeller    │  │           │
                    │  │  │ - opencost     │  │           │
                    │  │  │ - dcgm (GPU)   │  │           │
                    │  │  │ - envoy        │  │           │
                    │  │  └────────────────┘  │           │
                    │  └──────────────────────┘           │
                    │                                     │
                    │  ┌──────────────────────┐           │
                    │  │  kube-prometheus     │           │
                    │  │  -stack (optional)   │           │
                    │  │  - Prometheus        │           │
                    │  │  - Alertmanager      │           │
                    │  │  - Grafana           │           │
                    │  │  - node-exporter     │           │
                    │  └──────────────────────┘           │
                    └─────────────────────────────────────┘

Union features Prometheus

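The static Prometheus instance is always deployed and pre-configured to scrape the metrics that Union.ai requires. No Prometheus Operator or CRDs are needed. This instance is a platform dependency: do not replace it or alter its scrape configuration. Only the settings listed under Configuration (retention, resources, scheduling, and similar) are meant to be tuned.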

Scrape targets

The following targets are scraped automatically:

| Job | What it scrapes | Used for |
|---|---|---|
| kube-state-metrics | Pod/node resource requests, limits, status, capacity | Cost calculations, resource tracking |
| kubernetes-cadvisor | Container CPU and memory usage via kubelet | Task-level resource monitoring |
| flytepropeller | Execution round info, fast task duration | Execution observability |
| opencost | Node hourly cost rates (CPU, RAM, GPU) | Cost tracking |
| gpu-metrics | DCGM exporter metrics (when dcgm-exporter.enabled) | GPU utilization |
| serving-envoy | Envoy upstream request counts and latency (when serving.enabled) | Inference serving metrics |

Configuration

The static Prometheus instance is configured under the prometheus key in your data plane values:

prometheus:
  image:
    repository: prom/prometheus
    tag: v3.3.1
  # Data retention period
  retention: 3d
  # Route prefix for the web UI and API
  routePrefix: /prometheus/
  resources:
    limits:
      cpu: "3"
      memory: "3500Mi"
    requests:
      cpu: "1"
      memory: "1Gi"
  serviceAccount:
    create: true
    annotations: {}
  priorityClassName: system-cluster-critical
  nodeSelector: {}
  tolerations: []
  affinity: {}
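
As a sketch of the scheduling keys above, the following override pins the static Prometheus to dedicated nodes. The node label and the taint used here (node-role.example.com/monitoring and dedicated=monitoring) are hypothetical; substitute whatever your cluster actually uses.

```yaml
prometheus:
  # Hypothetical node label: schedule only onto nodes reserved for monitoring
  nodeSelector:
    node-role.example.com/monitoring: "true"
  # Hypothetical taint: tolerate the taint placed on those nodes
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
```
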

Retention and storage

The default 3-day retention is sufficient for Union.ai features. Increase prometheus.retention only if you query historical feature metrics directly from this instance.
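
For example, a values override along these lines would keep two weeks of data. The 14d figure and the memory sizes are illustrative assumptions, not recommendations. Longer retention increases the amount of data kept on disk, and queries over longer ranges may need more memory, so size resources from what you observe.

```yaml
prometheus:
  # Illustrative: extend retention from the 3d default
  retention: 14d
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"   # assumed value, tune to your workload
    limits:
      cpu: "3"
      memory: "6Gi"   # assumed value, tune to your workload
```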

Internal service endpoint

Other data plane components reach Prometheus at:

http://union-operator-prometheus.<NAMESPACE>.svc:80/prometheus

OpenCost is pre-configured to use this endpoint. You do not need to change it unless you rename the Helm release.
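
If you want to explore these metrics from a Grafana instance that you run yourself, a datasource provisioning file along these lines could point at the endpoint. This is only a sketch: it assumes the data plane is installed in the union namespace under the default release name, and the datasource name is arbitrary.

```yaml
# Grafana datasource provisioning file (for example under provisioning/datasources/)
apiVersion: 1
datasources:
  - name: Union Features Prometheus   # arbitrary display name
    type: prometheus
    access: proxy
    # Assumes namespace "union"; the /prometheus path matches routePrefix
    url: http://union-operator-prometheus.union.svc:80/prometheus
    isDefault: false
```

Treat this datasource as read-only. The instance itself remains a platform dependency.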

Enabling cluster health monitoring

To enable operational monitoring with Prometheus Operator, Alertmanager, Grafana, and node-exporter, set monitoring.enabled in your data plane values:

monitoring:
  enabled: true

This deploys a full kube-prometheus-stack instance with sensible defaults:

  • Prometheus with 7-day retention
  • Grafana with default admin credentials (override monitoring.grafana.adminPassword in production)
  • Node exporter, kube-state-metrics, kubelet, CoreDNS, API server, etcd, and scheduler monitoring
  • Default alerting and recording rules

Prometheus Operator CRDs

The kube-prometheus-stack uses the Prometheus Operator, which discovers scrape targets and alerting rules through Kubernetes CRDs (ServiceMonitor, PodMonitor, PrometheusRule, etc.). If you prefer to use static scrape configs with your own Prometheus instead, see Scraping Union services from your own Prometheus.

To install the CRDs, use the dataplane-crds chart:

# dataplane-crds values
crds:
  flyte: true
  prometheusOperator: true  # Install Prometheus Operator CRDs

Then install or upgrade the CRDs chart before the data plane chart:

helm upgrade --install union-dataplane-crds unionai/dataplane-crds \
  --namespace union \
  --set crds.prometheusOperator=true

CRD installation order

CRDs must be installed before the data plane chart, so deploy the dataplane-crds chart first. The monitoring stack’s own CRD installation is disabled (monitoring.crds.enabled: false) to avoid conflicts with it.
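
Put together, the monitoring-related values in the data plane chart might look like the sketch below. Setting monitoring.crds.enabled explicitly is an assumption made here for clarity; per the note above, it is already disabled.

```yaml
# Data plane values (applied after the dataplane-crds chart is installed)
monitoring:
  enabled: true
  crds:
    # Shown explicitly for clarity: CRDs come from the dataplane-crds chart
    enabled: false
```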

Customizing the monitoring stack

The monitoring stack accepts all kube-prometheus-stack values under the monitoring key. Common overrides:

monitoring:
  enabled: true

  # Grafana
  grafana:
    enabled: true
    adminPassword: "my-secure-password"
    ingress:
      enabled: true
      ingressClassName: nginx
      hosts:
        - grafana.example.com

  # Prometheus retention and resources
  prometheus:
    prometheusSpec:
      retention: 30d
      resources:
        requests:
          memory: "2Gi"

  # Alertmanager
  alertmanager:
    enabled: true
    # Configure receivers, routes, etc.
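
To illustrate the receivers and routes mentioned above, here is a minimal sketch that sends alerts to a webhook. It assumes the standard kube-prometheus-stack alertmanager.config value is passed through unchanged under the monitoring key. The receiver name platform-webhook and the URL are hypothetical.

```yaml
monitoring:
  alertmanager:
    enabled: true
    config:
      route:
        receiver: platform-webhook        # hypothetical default receiver
        group_by: [alertname, namespace]
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 4h
        routes:
          # Keep the always-firing Watchdog alert out of the webhook
          - receiver: "null"
            matchers:
              - alertname = "Watchdog"
      receivers:
        - name: "null"
        - name: platform-webhook
          webhook_configs:
            - url: http://alert-receiver.example.com/hook   # hypothetical URL
```

The "null" receiver is kept because Helm replaces lists wholesale when merging values, and any route that still references it would otherwise fail validation.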

The monitoring stack’s Prometheus supports remote write for forwarding metrics to external time-series databases (Amazon Managed Prometheus, Grafana Cloud, Thanos, etc.):

monitoring:
  prometheus:
    prometheusSpec:
      remoteWrite:
        - url: "https://aps-workspaces.<REGION>.amazonaws.com/workspaces/<WORKSPACE_ID>/api/v1/remote_write"
          sigv4:
            region: <REGION>

For the full set of configurable values, see the kube-prometheus-stack chart documentation.

Scraping Union services from your own Prometheus

If you already run Prometheus in your cluster, you can scrape Union.ai data plane services for operational visibility. The examples in this section scrape the port named debug on services labeled app.kubernetes.io/instance: union-dataplane.

Union features Prometheus

The built-in static Prometheus handles all metrics required for Union.ai platform features. Scraping from your own Prometheus is for additional operational visibility only; it does not replace the built-in instance.

Static scrape configs

Add these jobs to your Prometheus configuration:

scrape_configs:
  # Data plane service metrics (operator, propeller, etc.)
  - job_name: union-dataplane-services
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [union]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
        regex: union-dataplane
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: debug
        action: keep
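
Series scraped this way carry only the default job and instance labels. As an optional extension, the same job can copy Kubernetes discovery metadata onto each series. The target label names below (namespace, service, pod) are a common convention and an assumption here, not something the data plane requires.

```yaml
scrape_configs:
  - job_name: union-dataplane-services
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [union]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
        regex: union-dataplane
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: debug
        action: keep
      # Optional: attach discovery metadata as labels (default action is replace)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```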

ServiceMonitor (Prometheus Operator)

If you run the Prometheus Operator, create a ServiceMonitor instead:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: union-dataplane-services
  namespace: union
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: union-dataplane
  namespaceSelector:
    matchNames:
      - union
  endpoints:
    - port: debug
      path: /metrics
      interval: 30s

This requires the Prometheus Operator CRDs. Install them via the dataplane-crds chart with crds.prometheusOperator: true.
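
A ServiceMonitor is only picked up if the Prometheus resource's serviceMonitorSelector matches it. With a stock kube-prometheus-stack install, that selector typically matches the Helm release label. The sketch below assumes your own stack was installed under a release named kube-prometheus-stack, which is a hypothetical name; check the selector on your Prometheus resource before relying on it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: union-dataplane-services
  namespace: union
  labels:
    # Hypothetical: must match your Prometheus serviceMonitorSelector
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: union-dataplane
  namespaceSelector:
    matchNames:
      - union
  endpoints:
    - port: debug
      path: /metrics
      interval: 30s
```

Alternatively, kube-prometheus-stack can be told to select ServiceMonitors regardless of labels by setting prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false in that stack's values.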

Further reading