Real-Time Metric Streaming Setup

Real-time metric streaming is the sub-minute ingestion tier that turns database engine telemetry into a live, tenant-attributed cost feed so quota enforcement can act on spend before it accrues rather than after the invoice lands.

This dimension of Metric Extraction & Aggregation Pipelines covers how Cloud DBA and FinOps teams move cost signals off a periodic reconciliation cadence and onto a continuous log, where each validated usage event is published, windowed, and evaluated against quota policy within seconds of emission. Batch reconciliation is authoritative but slow: a nightly job that discovers a runaway analytical warehouse at 06:00 has already let eight hours of credits burn. Streaming closes that latency gap by treating extraction, validation, and enforcement as a single flow keyed on an ordered log — but only if partitioning, exactly-once delivery, and watermarking are designed in from the first producer rather than retrofitted after the first double-billed tenant.

The diagram below traces metrics from database engines through validation and Kafka routing to live dashboards and quota alerts.

Billing Model & Attribution Challenges

Streaming cost attribution inherits every reconciliation trap from the billing API and adds one of its own: the record you enforce on is provisional. Provider cost data — AWS Cost Explorer line items, Azure Cost Management amortized rows — finalizes over 24–48 hours, so a real-time feed can never wait for it. Instead the streaming tier prices usage at the moment of emission using a resolved rate card, then lets batch processing for historical metrics correct the estimate once the authoritative invoice settles. The design consequence is that every streamed event must carry enough context to be re-priced later: a bare dollar figure with no usage_type, usage_unit, or quantity cannot be reconciled and becomes silent drift.

Blended versus disaggregated billing is the first source of skew unique to the live path. A CloudWatch metric such as CPUUtilization or WriteIOPS is a usage signal, not a cost — it must be multiplied by a per-unit rate before it means anything financial. If that rate is a blended organization average, a per-tenant stream systematically misprices reserved-instance and Savings Plan coverage. The streaming producer should resolve UnblendedCost-equivalent unit rates per resource class, and where reservation amortization matters, carry both the on-demand and amortized unit cost so the consumer can pick the correct one per policy. This is the same split worked through in how compute and storage costs break down per resource, pushed to the edge so no downstream window ever averages a mixed-rate event.
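A minimal sketch of carrying both unit costs on an event and letting the consumer choose per policy. The `UnitRates` type, the `pick_unit_cost` helper, the basis names, and the rate values are all illustrative assumptions, not part of the canonical record described in this section.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class UnitRates:
    """Both unit costs travel with the event (hypothetical field names)."""
    on_demand: Decimal
    amortized: Decimal


def pick_unit_cost(rates: UnitRates, policy_basis: str) -> Decimal:
    """Return the unit cost a policy asks for; reject unknown bases loudly."""
    if policy_basis == "on_demand":
        return rates.on_demand
    if policy_basis == "amortized":
        return rates.amortized
    raise ValueError(f"unknown pricing basis: {policy_basis!r}")


# Placeholder rates for illustration only.
rates = UnitRates(on_demand=Decimal("0.00001200"), amortized=Decimal("0.00000800"))
chargeback_cost = pick_unit_cost(rates, "amortized") * Decimal("3600")
```

Raising on an unknown basis, rather than defaulting to one of the two rates, keeps a misconfigured policy from silently mispricing a tenant.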

Temporal semantics are the second trap, and streaming makes them acute. A metric has an event time (when the database emitted it) and a processing time (when the consumer read it), and network jitter, broker rebalances, and retry backoff mean the two diverge. A window keyed on processing time will attribute a spike to the wrong minute and, worse, can fire a quota breach against a tenant whose usage actually occurred inside a previously closed window. Every event therefore carries an engine-assigned event_timestamp at UTC nanosecond resolution, and all aggregation is watermark-driven on that field, never on arrival order.

Currency and unit normalization must happen before the event is published, not after. A record priced in USD per GB-month cannot share a window with one priced in Snowflake credits per second until both resolve to a unit cost against a canonical resource model — the same alignment enforced at the account boundary when normalizing provider billing exports into a unified schema. The canonical streamed record is a flat, tag-enriched row keyed by tenant_id, resource_id, usage_type, usage_unit, quantity, unit_cost, cost_center, and event_timestamp — identical to the shape emitted by the batch and async tiers, so a downstream service can consume all three interchangeably.
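As a rough illustration of resolving two differently denominated rates to a common per-second USD unit cost before publishing, the sketch below assumes a 30-day billing month and a placeholder credit price. Both constants and both helper names are hypothetical; a real rate card would supply them.

```python
from decimal import Decimal

SECONDS_PER_MONTH = Decimal(30 * 24 * 3600)  # assumption: 30-day billing month
USD_PER_CREDIT = Decimal("3.00")             # assumption: placeholder contract rate


def usd_per_gb_second(usd_per_gb_month: Decimal) -> Decimal:
    """Convert a storage rate in USD per GB-month to USD per GB-second."""
    return usd_per_gb_month / SECONDS_PER_MONTH


def usd_per_second_from_credits(credits_per_second: Decimal) -> Decimal:
    """Convert a credit burn rate to USD per second at the assumed credit price."""
    return credits_per_second * USD_PER_CREDIT
```

Once both rates are expressed in USD per second of a canonical unit, events from either source can share a window.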

Attribution of a windowed cost-center total from these events is deterministic:

$$C_{cc}(w) = \sum_{i \in cc,\; t_i \in w} q_i \cdot u_i$$

where the sum runs over every event i mapped to cost center cc whose event_timestamp $t_i$ falls inside window w, $q_i$ is the parsed quantity, and $u_i$ the resolved per-unit cost. Because the window is closed by a watermark rather than by wall-clock arrival, any variance between $C_{cc}(w)$ and the eventual invoice is a data-quality signal — a dropped partition, a late event past the allowed lateness, or a dead-lettered payload — not window misalignment.
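The sum above can be sketched directly in Python. This is an illustrative helper, not the pipeline's aggregator: the function name `window_totals`, the dict-shaped events, and the half-open window convention `[start, start + size)` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from decimal import Decimal


def window_totals(events, window_start, window_size):
    """Sum quantity * unit_cost per cost center for events inside the window."""
    window_end = window_start + window_size
    totals = defaultdict(Decimal)
    for ev in events:
        # Membership is decided by event time, never by arrival order.
        if window_start <= ev["event_timestamp"] < window_end:
            totals[ev["cost_center"]] += ev["quantity"] * ev["unit_cost"]
    return dict(totals)


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = [
    {"cost_center": "cc-a", "event_timestamp": t0 + timedelta(seconds=10),
     "quantity": Decimal("100"), "unit_cost": Decimal("0.02")},
    {"cost_center": "cc-a", "event_timestamp": t0 + timedelta(seconds=50),
     "quantity": Decimal("50"), "unit_cost": Decimal("0.02")},
    {"cost_center": "cc-b", "event_timestamp": t0 + timedelta(seconds=30),
     "quantity": Decimal("10"), "unit_cost": Decimal("0.5")},
    {"cost_center": "cc-a", "event_timestamp": t0 + timedelta(seconds=70),
     "quantity": Decimal("999"), "unit_cost": Decimal("1")},  # next window
]
totals = window_totals(events, t0, timedelta(seconds=60))
```

The last event falls in the following window and is excluded, so `cc-a` totals 3 and `cc-b` totals 5 for the first minute.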

Telemetry Extraction & Metric Normalization

Extraction on the live path is a bounded, continuous poll rather than a one-shot fan-out. For CloudWatch-backed engines like RDS and Aurora, the source is get_metric_data, which returns dense time series for many metrics in one call; for engine-internal signals it is the system view querying patterns that isolate billable tenant compute from autovacuum and background workers. Either way the governing constraint is the provider’s API ceiling, so the poll interval and page size are sized against the quota, and the loop advances a cursor so no window is fetched twice.

import asyncio
from datetime import datetime, timedelta, timezone

import aioboto3

async def poll_rds_metrics(session, semaphore, db_instance_id, period=60):
    """Yield (metric, timestamp, value) triples for one RDS instance.

    Uses CloudWatch GetMetricData, which batches multiple queries and
    paginates via NextToken. Period is the aggregation granularity in seconds.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(seconds=period * 5)  # small look-back window
    queries = [
        {
            "Id": mid,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
                    ],
                },
                "Period": period,
                "Stat": "Average",
            },
        }
        for mid, name in (("cpu", "CPUUtilization"), ("wiops", "WriteIOPS"),
                          ("riops", "ReadIOPS"), ("store", "FreeStorageSpace"))
    ]

    async with session.client("cloudwatch") as cw:
        token = None
        while True:
            kwargs = {"MetricDataQueries": queries, "StartTime": start,
                      "EndTime": end, "ScanBy": "TimestampAscending"}
            if token:
                kwargs["NextToken"] = token
            # Hold the permit only for the API call: the semaphore bounds
            # concurrent GetMetricData requests, and releasing it before the
            # yields keeps a slow consumer from starving other pollers.
            async with semaphore:
                resp = await cw.get_metric_data(**kwargs)
            for series in resp["MetricDataResults"]:
                for ts, val in zip(series["Timestamps"], series["Values"]):
                    yield series["Id"], ts, val
            token = resp.get("NextToken")
            if not token:
                break

Pagination is the detail to get right here: get_metric_data returns NextToken, and an absent token, not an empty Values list, is the terminal condition, because a metric with no datapoints in the window still returns a result object with empty arrays. Treating an empty page as the end silently truncates high-cardinality polls.

Normalization runs immediately on each datapoint, while it is still in memory, so a malformed or unpriced reading never reaches the broker. The translation maps the raw CloudWatch metric id to a canonical usage_type/usage_unit, resolves the unit rate, and attaches tenant metadata; validating that every emitted row satisfies the contract is the responsibility of the schema validation rules applied at the billing-data boundary. Using pydantic gives fail-fast rejection and a structured error to route to the dead-letter topic:

from datetime import datetime
from decimal import Decimal
from pydantic import BaseModel, Field

RATE_CARD = {  # resolved unit cost per canonical usage_type
    "wiops": Decimal("0.00000020"),   # $ per write IO
    "riops": Decimal("0.00000020"),   # $ per read IO
    "cpu":   Decimal("0.00001200"),   # $ per vCPU-second equivalent
}

class MetricEvent(BaseModel):
    tenant_id: str
    resource_id: str
    usage_type: str
    usage_unit: str
    quantity: Decimal = Field(ge=0)
    unit_cost: Decimal = Field(ge=0)
    cost_center: str
    event_timestamp: datetime

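USAGE_UNITS = {  # canonical usage_unit per CloudWatch query id
    "cpu": "vcpu_second",
    "wiops": "io",
    "riops": "io",
    "store": "byte",
}

def normalize_datapoint(metric_id, ts, value, *, resource_id, tenant_id, cost_center):
    """Map one CloudWatch datapoint to a canonical, priced MetricEvent.

    Raises ValueError when the metric has no rate or unit, so an unpriced
    reading is dead-lettered by the caller instead of published at zero cost.
    """
    if metric_id not in RATE_CARD or metric_id not in USAGE_UNITS:
        # e.g. "store" has no rate in the sample card above until one is added
        raise ValueError(f"no rate or unit resolved for metric {metric_id!r}")
    return MetricEvent(
        tenant_id=tenant_id,
        resource_id=resource_id,
        usage_type=metric_id,
        usage_unit=USAGE_UNITS[metric_id],
        quantity=Decimal(str(value)),
        unit_cost=RATE_CARD[metric_id],
        cost_center=cost_center,
        event_timestamp=ts,  # CloudWatch returns tz-aware UTC datetimes
    )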

Using Decimal rather than float is not optional for financial data: binary floating point silently corrupts summed cent-level costs across the millions of events a busy fleet emits per hour, and the resulting variance is indistinguishable from a real attribution error during reconciliation.
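A small demonstration of the drift in question, using only the standard library. The values are arbitrary; the point is that ten float additions of 0.1 do not sum to exactly 1.0, while the same sum in `Decimal` does.

```python
from decimal import Decimal

# Ten events of 0.1 each, summed in binary floating point and in Decimal.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.1")] * 10, Decimal(0))

float_is_exact = float_total == 1.0                 # False: rounding error accumulated
decimal_is_exact = decimal_total == Decimal("1.0")  # True
```

Across millions of events the float error is no longer confined to the last digit, which is why it surfaces as unexplained variance during reconciliation.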

Python Automation Patterns

The idiomatic structure is a bounded producer that publishes each validated event to a partitioned Kafka topic without introducing duplicates on retry. aiokafka exposes idempotent and transactional producers directly. Enabling idempotence with acks="all" is the minimum bar: it stops a broker-side retry from writing the same batch twice, but on its own it does not cover a producer restart, which needs a transactional producer or consumer-side deduplication to approach exactly-once behavior. Keying every record on tenant_id sends all events for one tenant to the same partition of a topic, so they stay ordered within that topic for as long as the partition count is unchanged.

from aiokafka import AIOKafkaProducer
import msgspec

async def make_producer(bootstrap_servers):
    """Idempotent producer: no duplicate events on retry, ordered per key."""
    producer = AIOKafkaProducer(
        bootstrap_servers=bootstrap_servers,
        enable_idempotence=True,   # dedupe on broker-side sequence numbers
        acks="all",                # wait for all in-sync replicas
        linger_ms=20,              # small batching window for throughput
        compression_type="lz4",
    )
    await producer.start()
    return producer

async def publish_event(producer, event):
    """Publish one canonical MetricEvent, partitioned by tenant for ordering."""
    payload = msgspec.json.encode(
        {**event.model_dump(mode="json")}  # Decimals serialize as strings
    )
    await producer.send_and_wait(
        topic=f"cost.metrics.{event.usage_type}",
        value=payload,
        key=event.tenant_id.encode(),   # co-locate a tenant on one partition
    )

Topic design mirrors financial granularity: separate cost.metrics.cpu, cost.metrics.wiops, and cost.metrics.store streams let compute, I/O, and storage consumers scale independently and let alerting subscribe to only the dimension it governs. Partition count is derived, not guessed: it must cover the target throughput while keeping each partition under the broker’s per-partition write ceiling. A simple capacity bound (peak rate divided by per-partition capacity, rounded up) gives the minimum number of partitions needed to sustain a target event rate:

$$P_{\min} = \left\lceil \frac{\lambda}{\rho} \right\rceil$$

where $\lambda$ is the fleet’s peak events-per-second and $\rho$ the sustained events-per-second a single partition absorbs. Sizing to $P_{\min}$ (rounded up for headroom and rebalance tolerance) keeps writes from hot-spotting without over-fragmenting consumer state.
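The bound is a one-line calculation. The sketch below adds an optional `headroom` multiplier as an assumption (the formula itself has none) to model the extra capacity reserved for rebalances; the function name and the example rates are illustrative.

```python
import math


def min_partitions(peak_eps, per_partition_eps, headroom=1.0):
    """Smallest partition count that sustains peak_eps, with optional headroom."""
    if per_partition_eps <= 0:
        raise ValueError("per_partition_eps must be positive")
    return math.ceil(peak_eps * headroom / per_partition_eps)


baseline = min_partitions(45_000, 10_000)               # ceil(4.5)
padded = min_partitions(45_000, 10_000, headroom=1.5)   # ceil(6.75)
```

With a peak of 45,000 events per second and 10,000 per partition, the bare bound is 5 partitions, and 50% headroom raises it to 7.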

The full poll → normalize → publish loop composes the extraction generator, the validator, and the producer, dead-lettering anything that fails the contract instead of dropping it:

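import asyncio

async def stream_instance(session, producer, semaphore, instance, dead_letter):
    """Continuously poll one instance, validate, and publish live cost events."""
    last_seen = {}  # metric_id -> newest timestamp already handled (the cursor)
    while True:
        async for metric_id, ts, value in poll_rds_metrics(
            session, semaphore, instance["db_instance_id"]
        ):
            # Each poll looks back several periods, so skip datapoints that a
            # previous iteration already published or dead-lettered. This
            # relies on the poller scanning in ascending timestamp order.
            if metric_id in last_seen and ts <= last_seen[metric_id]:
                continue
            last_seen[metric_id] = ts
            try:
                event = normalize_datapoint(
                    metric_id, ts, value,
                    resource_id=instance["db_instance_id"],
                    tenant_id=instance["tenant_id"],
                    cost_center=instance["cost_center"],
                )
            except Exception as exc:  # validation / pricing failure
                await dead_letter(metric_id, value, str(exc))
                continue
            await publish_event(producer, event)
        await asyncio.sleep(instance.get("poll_interval", 60))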

Retries belong on the individual poll and publish, not the whole loop, so a transient CloudWatch throttle on one instance never restarts a fleet-wide stream. A tenacity-decorated coroutine with exponential backoff and jitter keeps retry storms from synchronizing across instances, and the same connection-reuse and backpressure discipline used for async semaphore-controlled concurrency in the parsing tier applies unchanged to the streaming producer.

from tenacity import (
    retry, stop_after_attempt, wait_exponential_jitter, retry_if_exception,
)
from botocore.exceptions import ClientError

def is_throttle(exc):
    """True only for CloudWatch throttling errors, which are safe to retry."""
    return (
        isinstance(exc, ClientError)
        and exc.response["Error"]["Code"] in {"Throttling", "ThrottlingException"}
    )

@retry(
    retry=retry_if_exception(is_throttle),  # non-throttle errors propagate at once
    wait=wait_exponential_jitter(initial=1, max=30),
    stop=stop_after_attempt(6),
    reraise=True,
)
async def get_metric_data_resilient(cw, **kwargs):
    """Call GetMetricData, retrying with jittered backoff on throttling only."""
    return await cw.get_metric_data(**kwargs)

Quota Enforcement Integration

A stream is only useful if it drives a decision. The enforcement service is a Kafka consumer that tails the cost-metric topics, maintains a windowed running total per cost center, and emits breach events the moment a watermark-closed window crosses a boundary. Because the producer keyed on tenant_id, each consumer instance owns a stable set of tenants between rebalances and can hold their window state in memory without cross-partition coordination, provided the topics are co-partitioned and every tenant of a cost center is assigned to the same instance.

from aiokafka import AIOKafkaConsumer
from collections import defaultdict
from decimal import Decimal
import msgspec

async def enforce(bootstrap_servers, policies):
    """Tail cost-metric topics, window per cost center, yield breach events.
    `policies` maps cost_center -> {"soft": Decimal, "hard": Decimal}."""
    consumer = AIOKafkaConsumer(
        # one topic per usage_type the producer emits, including read I/O
        "cost.metrics.cpu", "cost.metrics.wiops", "cost.metrics.riops",
        "cost.metrics.store",
        bootstrap_servers=bootstrap_servers,
        group_id="quota-enforcer",
        enable_auto_commit=False,       # commit only after state is durable
        auto_offset_reset="latest",
    )
    await consumer.start()
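    # Cumulative per cost center for the consumer's lifetime; a production
    # consumer would key this on (cost_center, window) and reset each entry
    # once the watermark closes that window.
    window_totals = defaultdict(Decimal)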
    try:
        async for msg in consumer:
            rec = msgspec.json.decode(msg.value)
            spend = Decimal(rec["quantity"]) * Decimal(rec["unit_cost"])
            cc = rec["cost_center"]
            window_totals[cc] += spend
            policy = policies.get(cc)
            if not policy:
                continue
            if window_totals[cc] >= policy["hard"]:
                yield {"cost_center": cc, "level": "hard",
                       "spend": window_totals[cc]}
            elif window_totals[cc] >= policy["soft"]:
                yield {"cost_center": cc, "level": "soft",
                       "spend": window_totals[cc]}
            await consumer.commit()      # advance offset past enforced record
    finally:
        await consumer.stop()

What a soft-warning threshold and a hard-stop ceiling mean per tenant — and how a breach translates into throttling, storage tiering, or a suspend action — is governed by database quota boundary design, which the streaming layer feeds rather than reimplements. The month-end complement, where completeness matters more than latency and late-arriving cost is folded in via idempotent upserts, runs through batch processing for historical metrics; the two tiers reconcile against the same canonical record so a real-time hard-stop and a nightly chargeback never disagree on what a tenant actually spent.

Because a hard enforcement action can suspend a live workload, the consumer must never fire a breach from an incomplete window. A window whose partition experienced a rebalance or whose watermark has not yet passed the allowed lateness is partial, and the enforcement layer should warn rather than hard-stop on it. Carrying a complete: bool derived from consumer-group assignment stability and watermark progress is the difference between a defensible quota cut-off and an outage caused by a lagging partition. Offsets are committed only after the enforcement decision is durable, so a consumer crash re-reads the uncommitted tail and converges instead of skipping an unenforced breach.
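One way to express that gate, as a sketch: derive `complete` from assignment stability and watermark progress, then downgrade a hard breach to a warning when the window is partial. The function names and the action labels `"suspend"` and `"warn"` are hypothetical; the real actions belong to the quota boundary design.

```python
from datetime import datetime, timedelta, timezone


def window_complete(window_end, watermark, allowed_lateness, assignment_stable):
    """A window is complete once the watermark has passed its grace period
    and the consumer's partition assignment did not change while it was open."""
    return assignment_stable and watermark >= window_end + allowed_lateness


def enforcement_action(level, complete):
    """Map a breach level to an action, never hard-stopping on a partial window."""
    if level == "hard" and not complete:
        return "warn"
    return {"hard": "suspend", "soft": "warn"}[level]


end = datetime(2024, 1, 1, 0, 1, tzinfo=timezone.utc)
grace = timedelta(seconds=30)
partial = window_complete(end, end + timedelta(seconds=10), grace, True)
settled = window_complete(end, end + timedelta(seconds=45), grace, True)
```

A window that saw a rebalance stays incomplete regardless of the watermark, so the enforcement layer warns until a later, stable window confirms the breach.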

Failure Modes & Troubleshooting

ThrottlingException on CloudWatch GetMetricData. The most common failure at fleet scale, triggered when the poll fan-out exceeds the account’s per-second ceiling. Lower the concurrency semaphore, widen the poll interval for low-volatility metrics, and batch more metrics per get_metric_data call rather than issuing one call per metric. Deeper treatment of adaptive backpressure and token buckets lives in handling rate limits when pulling database metrics.
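Batching more metrics per call can be as simple as chunking the query list. The sketch below assumes the documented GetMetricData limit of 500 `MetricDataQueries` per request; confirm the current value for your account before relying on it. The helper name is illustrative.

```python
def chunk_queries(queries, max_per_call=500):
    """Split a list of MetricDataQueries into the fewest calls the API allows."""
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    return [
        queries[i:i + max_per_call]
        for i in range(0, len(queries), max_per_call)
    ]


# 1,200 per-metric queries collapse into three requests instead of 1,200.
batches = chunk_queries([{"Id": f"m{i}"} for i in range(1200)])
```

Each batch is then passed as `MetricDataQueries` to a single call, which cuts the request rate by the batch size without changing what is fetched.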

Consumer lag and unbounded window state. If the enforcement consumer falls behind the producers, breach decisions arrive too late to matter and in-memory window totals grow without bound. Alert on records-lag-max per partition, checkpoint window state to compacted topics or an external store so a slow consumer can be scaled out without losing accumulated totals, and shed load by widening the window granularity before the lag becomes an outage.

Duplicate billing events from non-idempotent producers. A producer configured with acks="1" and no idempotence re-sends on a broker timeout and double-counts on the consumer. Always set enable_idempotence=True with acks="all", and for cross-topic atomicity wrap multi-topic writes in a transactional producer so a partial publish never leaves one dimension counted and another dropped.

Out-of-order and late events corrupting windows. Keying windows on processing time attributes a spike to the wrong minute and can fire a breach against a closed window. Aggregate strictly on event_timestamp, define an explicit allowed-lateness grace period, and route events later than the grace boundary to a late-data topic for the batch tier to reconcile rather than forcing them into a live window.
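A compact sketch of event-time tumbling windows with a grace period and a late-data route. It is a simplification under stated assumptions: the watermark is just the newest event time seen so far, the `late` list stands in for a late-data topic, and the class and method names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from decimal import Decimal

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class EventTimeWindows:
    """Tumbling event-time windows closed by a watermark (illustrative only)."""

    def __init__(self, size, allowed_lateness):
        self.size = size
        self.allowed_lateness = allowed_lateness
        self.totals = defaultdict(Decimal)  # (cost_center, window_start) -> spend
        self.watermark = datetime.min.replace(tzinfo=timezone.utc)
        self.late = []                      # stand-in for a late-data topic

    def _expired(self, window_start):
        """True once the watermark has passed the window end plus the grace period."""
        return window_start + self.size + self.allowed_lateness <= self.watermark

    def add(self, cost_center, ts, spend):
        """Add one event; return the windows the advancing watermark closed."""
        start = ts - ((ts - EPOCH) % self.size)  # align to the window grid
        if self._expired(start):
            self.late.append((cost_center, ts, spend))  # leave to the batch tier
            return []
        self.totals[(cost_center, start)] += spend
        self.watermark = max(self.watermark, ts)
        ready = sorted(k for k in self.totals if self._expired(k[1]))
        return [(key, self.totals.pop(key)) for key in ready]
```

An out-of-order event that arrives inside the grace period still lands in its correct window; one that arrives after the window has closed is routed to `late` rather than reopening a total that enforcement may already have acted on.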

Partition hot-spotting from skewed keys. When one tenant emits far more events than the rest, its partition saturates while others idle, and the enforcement consumer owning it lags. Detect skew from per-partition throughput, and for pathological tenants use a composite tenant_id:resource_id key to spread load while preserving per-resource ordering, accepting that per-tenant global ordering then requires a merge at read time.
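The composite key can be applied selectively, as in this sketch. The `hot_tenants` set is an assumption: it would be populated from whatever skew detection the team runs, and the helper name is illustrative.

```python
def partition_key(tenant_id, resource_id, hot_tenants):
    """Return the Kafka record key: composite for skewed tenants, plain otherwise."""
    if tenant_id in hot_tenants:
        # Spreads one tenant across partitions; ordering holds per resource only.
        return f"{tenant_id}:{resource_id}".encode()
    return tenant_id.encode()


hot = {"tenant-42"}
spread_key = partition_key("tenant-42", "db-1", hot)
plain_key = partition_key("tenant-7", "db-9", hot)
```

Moving a tenant into or out of the hot set changes where its events land, so the switch is best made at a window boundary to avoid splitting one window's events across two ordering regimes.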

Broker outage stalling the whole pipeline. A full broker partition blocks producers and starves enforcement. The producer should buffer to a bounded local queue and shed to a dead-letter sink when it fills, while the consumer serves the last durable window state; the graceful-degradation and cached-fallback behavior that keeps cost governance serving stale-but-valid figures is covered by fallback routing for cost APIs, and the structured recovery of partial runs by error handling in cost pipelines.
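A minimal sketch of the bounded buffer with shedding, built on `asyncio.Queue`. The `BoundedBuffer` class and the `dead_letter` callable are hypothetical; in practice the dead-letter sink would be durable storage rather than an in-memory list.

```python
import asyncio


class BoundedBuffer:
    """Hold events while the broker is unavailable; shed when the buffer fills."""

    def __init__(self, maxsize, dead_letter):
        self.queue = asyncio.Queue(maxsize=maxsize)
        self.dead_letter = dead_letter  # callable that receives each shed event

    def offer(self, event):
        """Enqueue without blocking; return False if the event was shed."""
        try:
            self.queue.put_nowait(event)
            return True
        except asyncio.QueueFull:
            self.dead_letter(event)
            return False


shed = []
buffer = BoundedBuffer(maxsize=2, dead_letter=shed.append)
results = [buffer.offer(n) for n in ("e1", "e2", "e3")]
```

Shedding the newest event keeps memory bounded and preserves the order of what is already queued; whether to shed newest or oldest is a policy choice this sketch does not settle.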

Related

Async Usage Parsing Workflows: the non-blocking, semaphore-governed parser tier that feeds validated records into this stream.
Schema Validation for Billing Data: enforce the canonical usage-record contract before any event is published to a topic.
Batch Processing for Historical Metrics: the reconciliation tier that corrects provisional streamed estimates against finalized invoices.
System View Querying Patterns: extract billable tenant compute from pg_stat_activity, Oracle V$SESSION, and Snowflake metering views as a streaming source.
Error Handling in Cost Pipelines: dead-letter malformed events, recover partial windows, and keep enforcement decisions defensible.
