Monitoring Lakekeeper

Lakekeeper exposes Prometheus metrics and per-project endpoint statistics. We recommend integrating these into your Kubernetes/Grafana/Prometheus stack.

Key Metrics

HTTP Request Metrics

Three metrics cover all HTTP traffic:

Metric	Labels	Description
`axum_http_requests_total`	`method`, `status`, `endpoint`	Request count broken down by HTTP method, status code, and endpoint path
`axum_http_requests_pending`	`method`, `endpoint`	Requests currently in-flight per endpoint and method
`axum_http_requests_duration_seconds`	`method`, `status`, `endpoint`, `le`	Response time histogram; use the `le=1` bucket as a baseline health indicator

Interpreting HTTP request metrics

Visualize `axum_http_requests_total` by status code for overall API health. A rising 4XX rate usually indicates client-side issues such as bad requests or failed authentication; a rising 5XX rate indicates server or database problems and needs urgent attention. A high `axum_http_requests_pending` count signals a backend bottleneck; consider scaling Lakekeeper horizontally. For latency, track the share of requests that fall into the `le="1"` bucket of `axum_http_requests_duration_seconds` (requests completed within one second) as a baseline; a drop in that share typically points to Postgres or upstream service issues.
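The guidance above can be sketched as Prometheus alerting rules. This is a minimal sketch, not a recommended production rule set: the group name, alert names, severities, and all thresholds (5 %, 100 pending requests, 95 %) are hypothetical placeholders, and it assumes the `status` label holds the numeric status code and that the one-second bucket is rendered as `le="1"`. Check both against your actual `/metrics` output before use.

```yaml
groups:
  - name: lakekeeper-http          # hypothetical group name
    rules:
      # Share of requests answered with a 5XX status over the last 5 minutes.
      # Assumes the status label holds numeric codes such as "500".
      - alert: LakekeeperHigh5xxRatio
        expr: |
          sum(rate(axum_http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(axum_http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical

      # Requests currently in flight, summed over all endpoints and methods.
      # The threshold of 100 is a placeholder; tune it to your traffic.
      - alert: LakekeeperManyPendingRequests
        expr: sum(axum_http_requests_pending) > 100
        for: 5m
        labels:
          severity: warning

      # Share of requests completed within one second (the le="1" bucket).
      - alert: LakekeeperSlowRequests
        expr: |
          sum(rate(axum_http_requests_duration_seconds_bucket{le="1"}[5m]))
            /
          sum(rate(axum_http_requests_duration_seconds_count[5m])) < 0.95
        for: 10m
        labels:
          severity: warning
```

The ratios evaluate to no result when there is no traffic, so these alerts stay silent on an idle instance.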

Cache Metrics

Lakekeeper maintains in-memory caches for Short-Term Credentials, Warehouses, Namespaces, Secrets, Roles, User Assignments, and Role Members. All caches share three metric names, differentiated by the cache_type label:

Metric	Type	Labels	Description
`lakekeeper_cache_size`	Gauge	`cache_type`	Current number of entries in the cache
`lakekeeper_cache_hits_total`	Counter	`cache_type`	Total cache hits
`lakekeeper_cache_misses_total`	Counter	`cache_type`	Total cache misses

`cache_type` values: `stc`, `warehouse`, `namespace`, `secrets`, `role`, `user_assignments`, `role_members`. A persistently low hit rate, computed as hits / (hits + misses) per `cache_type`, suggests that the cache capacity should be increased. See Configuration > Caching for details.
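One way to track the hit rate per cache is a Prometheus recording rule with an alert on top of it. The sketch below uses only the two counters listed in the table; the rule name `cache_type:lakekeeper_cache_hit_ratio:rate5m`, the alert name, and the 0.8 threshold are hypothetical and should be adapted to your workload.

```yaml
groups:
  - name: lakekeeper-cache         # hypothetical group name
    rules:
      # Hit ratio per cache_type over the last 5 minutes: hits / (hits + misses).
      - record: cache_type:lakekeeper_cache_hit_ratio:rate5m
        expr: |
          sum by (cache_type) (rate(lakekeeper_cache_hits_total[5m]))
            /
          (
            sum by (cache_type) (rate(lakekeeper_cache_hits_total[5m]))
              +
            sum by (cache_type) (rate(lakekeeper_cache_misses_total[5m]))
          )

      # Fires only when the ratio stays low for a long time, to match the
      # "persistently low" wording above. The 0.8 threshold is a placeholder.
      - alert: LakekeeperLowCacheHitRatio
        expr: cache_type:lakekeeper_cache_hit_ratio:rate5m < 0.8
        for: 30m
        labels:
          severity: warning
```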

Role Provider Metrics

When a Role Provider (e.g. LDAP) is configured, Lakekeeper emits the following metrics, each labelled by provider_id:

Metric	Type	Labels	Description
`lakekeeper_role_provider_up`	Gauge	`provider_id`	`1` when the provider is reachable, `0` when unreachable. Updated by the periodic health-check loop.
`lakekeeper_role_provider_get_roles_duration_seconds`	Histogram	`provider_id`, `outcome`	Duration of each role-lookup call. The `outcome` label reflects how the request was served (see table below).
`lakekeeper_role_provider_sync_errors_total`	Counter	`provider_id`	Number of failures writing fresh roles back to the Postgres catalog cache.
`lakekeeper_role_provider_ldap_reconnects_total`	Counter	`provider_id`, `outcome`	LDAP reconnect attempts (LDAP providers only), labelled `success` or `error`.

outcome values for lakekeeper_role_provider_get_roles_duration_seconds (histogram label):

Value	Meaning
`cache_hit`	All applicable providers were fresh; the external provider was not contacted.
`success`	Fresh roles were fetched from the external provider and synced to Postgres.
`stale_fallback`	The external provider was unreachable, but previously cached roles from Postgres were served instead. Authorization continues to work.
`error`	Unrecoverable error — the provider failed and no cached roles were available.

Health probe behavior. Role provider health is intentionally excluded from the /health endpoint. The periodic health-check loop still calls update_health on every cycle (to drive reconnection attempts and keep lakekeeper_role_provider_up current), but an unreachable provider does not cause the pod to fail its liveness or readiness probe. Lakekeeper continues serving the roles it last synced to Postgres (stale_fallback), so authorization keeps working during a provider outage — at the cost of potentially stale group memberships.

This contrasts with the Postgres connection: if Postgres becomes unreachable, the pod will fail its health check (see Database Monitoring below).

Alerting on role provider health

Alert on `lakekeeper_role_provider_up == 0` to detect provider outages early. A sustained rate of `lakekeeper_role_provider_get_roles_duration_seconds` observations with `outcome="stale_fallback"` confirms that Lakekeeper is actively falling back to cached roles. A rising `lakekeeper_role_provider_sync_errors_total` while the provider is healthy indicates a separate Postgres write problem; investigate database connectivity or permissions.
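The three signals described above could be expressed as alerting rules along these lines. The metric and label names come from the tables in this section, and the `_count` suffix is the standard counter series that Prometheus histograms expose. The group name, alert names, time windows, and `for` durations are hypothetical.

```yaml
groups:
  - name: lakekeeper-role-provider   # hypothetical group name
    rules:
      # The provider has been unreachable for several health-check cycles.
      - alert: LakekeeperRoleProviderDown
        expr: lakekeeper_role_provider_up == 0
        for: 5m
        labels:
          severity: warning

      # Role lookups are being served from previously synced (stale) roles.
      - alert: LakekeeperRoleProviderStaleFallback
        expr: |
          sum by (provider_id) (
            rate(lakekeeper_role_provider_get_roles_duration_seconds_count{outcome="stale_fallback"}[10m])
          ) > 0
        for: 15m
        labels:
          severity: warning

      # Sync errors while the provider itself is reachable point at the
      # Postgres write path rather than at the provider.
      - alert: LakekeeperRoleProviderSyncErrors
        expr: |
          increase(lakekeeper_role_provider_sync_errors_total[15m]) > 0
          and on (provider_id)
          lakekeeper_role_provider_up == 1
        labels:
          severity: warning
```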

Prometheus Integration

Lakekeeper listens on LAKEKEEPER__BIND_IP:LAKEKEEPER__METRICS_PORT (defaults: 0.0.0.0:9000). The bind address 0.0.0.0 means "listen on all interfaces" — it is not a valid scrape target. Configure Prometheus to scrape a reachable address such as http://localhost:9000/metrics or http://<service-or-pod-ip>:9000/metrics.

Variable	Description
`LAKEKEEPER__METRICS_PORT`	Port Lakekeeper listens on for the metrics endpoint (default `9000`)
`LAKEKEEPER__BIND_IP`	Listener bind address for metrics, REST API, and Management API (default `0.0.0.0`; use a specific IP to restrict access)

Example Prometheus scrape configuration

scrape_configs:
  - job_name: "lakekeeper"
    static_configs:
      - targets: ["lakekeeper-host:9000"]

Database (Postgres) Monitoring

Postgres is Lakekeeper's primary backend. Use postgres_exporter for database-internal signals — kube-state-metrics covers Kubernetes API object state (pods, deployments, nodes) but not Postgres internals.

Signal	Recommended tool
Free connection pool slots	`postgres_exporter`
Connection failures / pool exhaustion	`postgres_exporter`
Query latency	`postgres_exporter`
Replication lag	`postgres_exporter`
Disk usage and IOPS	Cloud provider metrics or `node_exporter`
Pod restarts, deployment health	kube-state-metrics

If you run Postgres via the CloudNativePG operator, its built-in per-instance exporter (port 9187, metrics prefixed cnpg_collector_*) covers WAL file counts and size, archive status, sync replica state, and basic liveness — complementing postgres_exporter for those signals. Connection pool slots, query latency, and replication lag are available as user-defined custom queries in CloudNativePG; disk and IOPS still require node_exporter or cloud provider metrics.

Warning

Lakekeeper's liveness probe checks the database connection. If Postgres becomes unreachable or runs out of connections, the pod will fail its health check and be marked unhealthy.
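Because an unreachable or exhausted database takes Lakekeeper pods down with it, it can help to alert on Postgres before the health check fails. The sketch below assumes `postgres_exporter` is scraped with its default collectors, which commonly expose `pg_up`, `pg_stat_activity_count`, and `pg_settings_max_connections`. Metric names can differ between exporter versions, so verify them against your exporter's output. Alert names and the 80 % threshold are hypothetical.

```yaml
groups:
  - name: lakekeeper-postgres      # hypothetical group name
    rules:
      # The exporter cannot connect to Postgres.
      - alert: PostgresDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical

      # Server-side connections in use approach max_connections.
      # The 0.8 threshold is a placeholder.
      - alert: PostgresConnectionsNearLimit
        expr: |
          sum by (instance) (pg_stat_activity_count)
            /
          max by (instance) (pg_settings_max_connections) > 0.8
        for: 5m
        labels:
          severity: warning
```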

Kubernetes and Resource Monitoring

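Monitor pod restart counts and deployment health with kube-state-metrics. It reports Kubernetes object state, not resource usage, so take pod CPU and memory consumption from the kubelet's cAdvisor metrics, metrics-server, or equivalent tooling.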
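As a rough sketch, restart and memory alerts for the Lakekeeper pods might look as follows. It assumes that both kube-state-metrics and the kubelet's cAdvisor metrics are scraped, that the pods have a memory limit set, and that pod names start with `lakekeeper-`. The pod selector, alert names, and thresholds are all hypothetical.

```yaml
groups:
  - name: lakekeeper-kubernetes    # hypothetical group name
    rules:
      # More than two container restarts within 30 minutes (kube-state-metrics).
      - alert: LakekeeperPodRestarting
        expr: increase(kube_pod_container_status_restarts_total{pod=~"lakekeeper-.*"}[30m]) > 2
        labels:
          severity: warning

      # Working-set memory (cAdvisor) above 90 % of the container's memory
      # limit (kube-state-metrics).
      - alert: LakekeeperMemoryNearLimit
        expr: |
          sum by (namespace, pod, container) (
            container_memory_working_set_bytes{pod=~"lakekeeper-.*", container!=""}
          )
            /
          sum by (namespace, pod, container) (
            kube_pod_container_resource_limits{pod=~"lakekeeper-.*", resource="memory"}
          ) > 0.9
        for: 10m
        labels:
          severity: warning
```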

Endpoint Statistics

Lakekeeper aggregates per-request statistics in memory and flushes them to the database periodically (default every 30 s). Each record captures the HTTP method, endpoint path, response status code, project, and warehouse (where applicable). This data is stored internally by Lakekeeper and is accessible without a Prometheus setup.

These statistics can be viewed in the UI under the Project View's Statistics tab. The Management API also exposes them directly:

POST /management/v1/endpoint-statistics — query endpoint-level usage data, filterable by warehouse, status code, and time window.
GET /management/v1/warehouse/{warehouse_id}/statistics — query warehouse-level table and view counts.

For real-time traffic visibility, the HTTP request metrics expose counters and latency histograms via Prometheus, from which per-second rates can be derived. They carry only `method`, `status`, and `endpoint` labels and no project or warehouse dimensions, so they cannot be used for tenant-scoped analysis. Endpoint statistics are the only source of per-project and per-warehouse breakdowns, making them the right tool for chargeback, abuse detection, and per-customer analytics in multi-tenant deployments.

The flush interval is controlled by LAKEKEEPER__ENDPOINT_STAT_FLUSH_INTERVAL (supports s and ms units):

LAKEKEEPER__ENDPOINT_STAT_FLUSH_INTERVAL=60s

See Configuration - Endpoint Statistics for details.

Best Practices

Split Grafana dashboards by concern: API health (status codes, pending requests, latency), database health, cache hit/miss ratios, role provider health, and Kubernetes resource utilization. Alert on sustained increases in 5XX or 4XX rates, persistently high pending request counts, low cache hit rates, and `lakekeeper_role_provider_up == 0`.

Troubleshooting

If Grafana shows stale or missing metrics, verify that Prometheus can reach the metrics endpoint and that the scrape target is consistent with the configured `LAKEKEEPER__BIND_IP` and `LAKEKEEPER__METRICS_PORT`. For historical analysis beyond Prometheus retention, query the endpoint statistics that Lakekeeper stores in its database.