Ordered delivery at scale
Once you accept that engagements can arrive late, the next question is whether they can arrive out of order. The honest answer is “sometimes.” Here is how to design for it.
Ordered delivery is one of those properties that sounds simple, then quietly costs you a quarter of engineering when you actually try to enforce it across a real network.
What “ordered” actually means
Three definitions show up in design reviews, and people argue past each other because they assume the same one:
- Per-user ordering. Engagement B is delivered after engagement A for the same user. This is usually what product teams want.
- Global ordering. Across all users, engagements are delivered in the order they were produced. This is almost never what anyone needs, and it is very expensive because all producers have to agree on a single sequence.
- Causal ordering. If event B was produced in response to event A, B is delivered after A. This is what you actually need when engagements depend on prior state.
Stream Sync Engage defaults to per-user ordering with causal ordering as an opt-in.
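To make the causal definition concrete, here is a minimal sketch of a buffer that holds an event back until the event it responds to has been delivered. The `id` and `caused_by` fields and the `CausalBuffer` class are hypothetical names chosen for illustration; this is not the Stream Sync Engage API, and it assumes each event has at most one cause.

```python
class CausalBuffer:
    """Holds back events whose cause has not been delivered yet.

    Assumption: events are dicts with a unique "id" and an optional
    "caused_by" field naming the id of the event they respond to.
    """

    def __init__(self):
        self.delivered = set()   # ids of events already released
        self.waiting = {}        # cause id -> events held back for it

    def offer(self, event):
        """Return the list of events that become deliverable, in causal order."""
        cause = event.get("caused_by")
        if cause is not None and cause not in self.delivered:
            # The cause has not been delivered, so park this event.
            self.waiting.setdefault(cause, []).append(event)
            return []

        released = []
        ready = [event]
        while ready:
            current = ready.pop(0)
            self.delivered.add(current["id"])
            released.append(current)
            # Anything that was waiting on this event can now go out.
            ready.extend(self.waiting.pop(current["id"], []))
        return released
```

Offering a reply before its cause returns an empty list; offering the cause afterwards releases both, cause first. A production version would also need a bound on how long an event may wait for a cause that never arrives.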
How we enforce per-user ordering
A user’s stream is partitioned by user_id and routed to a single worker for the duration of a session. Within a partition, delivery is ordered by produce time. Across partitions, there is no guarantee, but the user only sees their own partition, so the guarantee they care about holds.
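The routing step can be sketched roughly as follows. This is an illustration under stated assumptions, not the production router: the partition count, the SHA-256 hash choice, and the `produced_at` field name are all made up for the example.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8  # assumed value, for illustration only


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user_id to a stable partition index.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized per process and would break routing stability.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(engagements, num_partitions: int = NUM_PARTITIONS):
    """Group engagements by partition and order each partition by produce time."""
    partitions = defaultdict(list)
    for engagement in engagements:
        index = partition_for(engagement["user_id"], num_partitions)
        partitions[index].append(engagement)
    for queue in partitions.values():
        # Ordering is only enforced inside a partition, never across them.
        queue.sort(key=lambda e: e["produced_at"])
    return dict(partitions)
```

Because every engagement for a given `user_id` hashes to the same index, a single worker per partition is enough to keep that user's engagements in produce-time order.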
The trade-off: a slow user can hold up their own queue. We bound this with per-user timeouts. If an engagement has been pending for more than 30 seconds, it expires and the next one moves up.
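A minimal sketch of that expiry rule, assuming “pending” is measured from the moment the engagement was enqueued and that the caller supplies the current time. The `UserQueue` class and its method names are hypothetical.

```python
from collections import deque

PENDING_TIMEOUT_SECONDS = 30.0  # the 30-second bound described above


class UserQueue:
    """Per-user queue that expires engagements stuck at the head."""

    def __init__(self, timeout: float = PENDING_TIMEOUT_SECONDS):
        self.timeout = timeout
        self._pending = deque()  # (enqueued_at, engagement) pairs

    def enqueue(self, engagement, now: float) -> None:
        self._pending.append((now, engagement))

    def head(self, now: float):
        """Expire stale head entries, then return (next_engagement, expired).

        next_engagement is None when the queue is empty. It is not removed;
        the caller is assumed to acknowledge delivery separately.
        """
        expired = []
        while self._pending and now - self._pending[0][0] > self.timeout:
            expired.append(self._pending.popleft()[1])
        next_engagement = self._pending[0][1] if self._pending else None
        return next_engagement, expired
```

Passing `now` in explicitly keeps the sketch deterministic and easy to test; a real worker would more likely read a monotonic clock.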
When to relax it
Marketing announcements do not need ordering. A “new feature” banner showing before a “welcome back” toast is fine. Tagging engagements with an unordered: true flag lets them skip the partition queue entirely, which improves throughput.
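The bypass could look something like the sketch below. The `dispatch` function, the `send` callback, and the dict-of-lists queue are assumptions made for illustration; only the `unordered` flag comes from the text.

```python
def dispatch(engagement, ordered_queues, send):
    """Send unordered engagements immediately; queue the rest per user.

    ordered_queues: dict mapping user_id to a list acting as that user's queue.
    send: callable that delivers an engagement right away.
    Returns "sent" or "queued" so callers can observe which path was taken.
    """
    if engagement.get("unordered", False):
        # No ordering requirement, so skip the partition queue entirely.
        send(engagement)
        return "sent"
    ordered_queues.setdefault(engagement["user_id"], []).append(engagement)
    return "queued"
```

Defaulting to the ordered path when the flag is missing keeps the safer behavior as the fallback, so an untagged engagement never loses its ordering guarantee by accident.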
The rule we use: if a user could plausibly notice the order, enforce it. If not, do not pay for it.
- #delivery
- #ordering
- #distributed-systems
- #kafka