Architecture · Engineering

Offline-First Architecture: A Practical Guide

Blue Neon · 20 December 2025 · 8 min read

"Just use the cloud" is excellent advice until your users are on a mine site in the Pilbara with satellite connectivity that drops every afternoon. Or on a naval vessel with bandwidth measured in kilobits. Or in a disaster response scenario where the cell towers are the disaster. Offline-first architecture is a critical requirement for a surprising number of real-world applications.

We've built offline-first systems for defence field operations, mining logistics, emergency services, and agricultural monitoring. The consistent lesson is that offline-first is harder than online-first, but the architecture patterns are well-established. You need to commit to them from the start, not bolt them on later.

The Core Problem: Conflict Resolution

The easy part of offline-first is storing data locally. The hard part is what happens when two users modify the same data while offline and then both sync. This is the conflict resolution problem, and it's the reason most offline-first projects fail. Teams generally build the local storage layer without trouble; what they're missing is a strategy for conflicts.

There are three main approaches, each with different tradeoffs. Last-write-wins (LWW) is the simplest: whichever change has the later timestamp wins. It's easy to implement and guarantees convergence, but it silently discards data. If two field workers update the same inspection report offline, one set of changes disappears without notice. For non-critical data, this is often acceptable. For anything important, it's not.
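Part of why LWW is tempting is that the merge is only a few lines. Here's a minimal sketch; the record shape and field names are illustrative, not from any particular library.

```typescript
// Minimal last-write-wins merge. The record shape is illustrative;
// real systems also need to account for client clock skew.
interface VersionedRecord<T> {
  value: T;
  updatedAt: number; // milliseconds since epoch, set by the writer
  clientId: string;  // tie-breaker when timestamps collide
}

function mergeLww<T>(a: VersionedRecord<T>, b: VersionedRecord<T>): VersionedRecord<T> {
  if (a.updatedAt !== b.updatedAt) {
    return a.updatedAt > b.updatedAt ? a : b;
  }
  // Deterministic tie-break so every replica converges to the same winner.
  return a.clientId > b.clientId ? a : b;
}
```

Note what the losing branch gets: nothing. The discarded version isn't logged or surfaced anywhere, which is exactly the silent data loss described above.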

CRDTs (Conflict-free Replicated Data Types) are mathematical structures that guarantee convergence without coordination. They're the gold standard for offline-first when you need automatic conflict resolution. Libraries like Yjs and Automerge provide CRDT implementations for common data types: text, arrays, maps, counters. The tradeoff is complexity: CRDTs can be memory-intensive for large documents, and not every data structure maps cleanly to a CRDT type.
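Here's a minimal Yjs example of two replicas converging after exchanging updates; Automerge's API differs, but the shape of the flow is similar.

```typescript
import * as Y from 'yjs';

// Two replicas edit the same document while "offline".
const alice = new Y.Doc();
const bob = new Y.Doc();

alice.getText('report').insert(0, 'Pump seal worn. ');
bob.getText('report').insert(0, 'Vibration above limits. ');

// On reconnect, exchange state updates in both directions.
Y.applyUpdate(bob, Y.encodeStateAsUpdate(alice));
Y.applyUpdate(alice, Y.encodeStateAsUpdate(bob));

// Both replicas now hold the same merged text, without coordination.
console.log(
  alice.getText('report').toString() === bob.getText('report').toString(), // true
);
```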

Application-level merge is the third option: present conflicts to the user and let them resolve manually. This sounds primitive, but for some domains it's the right call. A doctor reviewing conflicting medication records should absolutely see both versions and decide. An automated merge could be clinically dangerous.
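In code, application-level merge usually means modelling the conflict itself as data rather than resolving it in the sync layer. The types below are an illustrative sketch, not any real library's API.

```typescript
// All names here are illustrative.
interface MedicationRecord {
  drug: string;
  doseMg: number;
  updatedBy: string;
}

interface Conflict<T> {
  recordId: string;
  base: T | null; // common ancestor, if the sync layer tracks one
  local: T;       // this device's version
  remote: T;      // the version that arrived during sync
}

// The sync layer parks conflicts instead of merging; the review UI
// shows both versions side by side and records the human's choice.
const pendingConflicts: Conflict<MedicationRecord>[] = [];

function resolveConflict<T>(conflict: Conflict<T>, chosen: T): T {
  // The resolution is written back as a new operation, so the
  // decision itself syncs to every other replica like any write.
  return chosen;
}
```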

"The right conflict resolution strategy is a domain decision, not a technical one. Ask the users what should happen when two people change the same thing."

Local Data Storage

For web-based offline apps, IndexedDB is the foundation. It's the only browser storage API with enough capacity and query capability for real applications. The raw IndexedDB API is miserable to work with, though. We use Dexie.js as a wrapper, which provides a clean query API and built-in support for schema versioning. For React applications, TanStack Query with a persistence plugin gives you a clean data layer that works identically online and offline.
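A minimal Dexie schema for the kind of app we're describing might look like this; the store names and fields are illustrative.

```typescript
import Dexie, { type Table } from 'dexie';

interface Inspection {
  id: string; // client-generated UUID so records can be created offline
  siteId: string;
  status: 'draft' | 'submitted';
  updatedAt: number;
}

class FieldDb extends Dexie {
  inspections!: Table<Inspection, string>;

  constructor() {
    super('field-db');
    // Dexie handles IndexedDB schema versioning: bump the version
    // number and add an upgrade function when the schema changes.
    this.version(1).stores({
      inspections: 'id, siteId, updatedAt', // primary key first, then indexes
    });
  }
}

const db = new FieldDb();

// Query example: all draft inspections for a site.
const drafts = await db.inspections
  .where('siteId').equals('pilbara-07')
  .filter((i) => i.status === 'draft')
  .toArray();
```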

For native mobile or desktop applications, SQLite is the default choice and it's excellent. SQLite handles concurrent reads well, supports full-text search, and is bulletproof in terms of data integrity. On the sync layer, PowerSync and ElectricSQL are both strong options that handle the SQLite-to-server synchronisation with built-in conflict resolution.

For defence and high-security applications where data-at-rest encryption is mandatory, SQLCipher (encrypted SQLite) provides AES-256 encryption with minimal performance overhead. Combined with hardware-backed key storage (TPM on Windows, Secure Enclave on Apple devices), you get data protection that meets most classification requirements.

Sync Architecture

The sync layer is where the real architectural decisions live. You need to decide: push vs. pull vs. bidirectional? Full sync vs. delta sync? Eager vs. lazy? The answers depend on your data volume, connectivity profile, and freshness requirements.

Our standard pattern is delta sync with operation-based replication. Instead of syncing the full state of every record, we sync the operations (create, update, delete) that produced the state. Each operation has a logical timestamp (a Lamport clock or hybrid logical clock) and a client ID. The server receives operations from all clients, orders them, resolves conflicts according to the strategy, and distributes the resolved operations to all other clients.
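Here's a sketch of that operation shape and the Lamport clock update rules in TypeScript; the field names and `Op` variants are illustrative, not a wire format we're prescribing.

```typescript
// Shape of a replicated operation; names are illustrative.
type Op =
  | { kind: 'create'; table: string; id: string; data: unknown }
  | { kind: 'update'; table: string; id: string; patch: Record<string, unknown> }
  | { kind: 'delete'; table: string; id: string };

interface StampedOp {
  op: Op;
  lamport: number;  // logical timestamp
  clientId: string; // ties are broken by client ID
  opId: string;     // unique ID, used for idempotent replay
}

// Lamport clock rules: increment on every local op; on receipt,
// take max(local, remote) + 1 so causality is never reordered.
let clock = 0;

function stamp(op: Op, clientId: string): StampedOp {
  clock += 1;
  return { op, lamport: clock, clientId, opId: crypto.randomUUID() };
}

function receive(remote: StampedOp): void {
  clock = Math.max(clock, remote.lamport) + 1;
}

// The server orders operations totally: by Lamport clock, then client ID.
function compareOps(a: StampedOp, b: StampedOp): number {
  return a.lamport - b.lamport || a.clientId.localeCompare(b.clientId);
}
```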

The protocol is designed for unreliable networks. Operations are queued locally and retried automatically. The sync is idempotent: replaying the same operation twice has no effect. Progress is checkpointed so that a failed sync resumes where it left off, not from the beginning. And the whole thing runs in a background thread (Web Worker in browsers, background service on native) so it never blocks the UI.
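A sketch of what idempotent, checkpointed apply looks like on the client, reusing the `StampedOp` type from the sketch above; the storage and server helpers here are assumptions, not real APIs.

```typescript
// Assumed helper that writes one op into the local store.
declare function applyToLocalStore(op: StampedOp): void;

// Persisted alongside the data in practice; in-memory here for brevity.
const appliedOpIds = new Set<string>();

async function syncDown(
  fetchOpsSince: (checkpoint: number) => Promise<StampedOp[]>, // assumed server call
  loadCheckpoint: () => Promise<number>,
  saveCheckpoint: (c: number) => Promise<void>,
): Promise<void> {
  let checkpoint = await loadCheckpoint(); // resume where the last sync stopped
  const ops = await fetchOpsSince(checkpoint);

  for (const stamped of ops) {
    // Idempotence: replaying an op we've already applied is a no-op.
    if (!appliedOpIds.has(stamped.opId)) {
      applyToLocalStore(stamped);
      appliedOpIds.add(stamped.opId);
    }
    checkpoint = stamped.lamport;
    await saveCheckpoint(checkpoint); // a crash here resumes from this op
  }
}
```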

Service Workers and Progressive Web Apps

For web-based offline apps, the service worker is the critical piece of infrastructure. It intercepts network requests and serves cached responses when offline. But a naive implementation (caching everything) leads to stale data and bloated storage. We use a Workbox-based strategy with different caching policies per resource type: cache-first for static assets, network-first for API responses with cache fallback, and stale-while-revalidate for data that tolerates brief staleness.
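In Workbox terms, that policy split looks roughly like this; the strategy classes are Workbox's own, but the route patterns and cache names are illustrative.

```typescript
// service-worker.ts — per-resource-type caching policies with Workbox.
import { registerRoute } from 'workbox-routing';
import {
  CacheFirst,
  NetworkFirst,
  StaleWhileRevalidate,
} from 'workbox-strategies';

// Static assets: cache-first, since hashed filenames never change.
registerRoute(
  ({ request }) => ['style', 'script', 'font'].includes(request.destination),
  new CacheFirst({ cacheName: 'static-assets' }),
);

// API responses: try the network, fall back to cache when offline.
registerRoute(
  ({ url }) => url.pathname.startsWith('/api/'),
  new NetworkFirst({ cacheName: 'api-cache', networkTimeoutSeconds: 5 }),
);

// Data that tolerates brief staleness: serve the cached copy
// immediately and refresh it in the background.
registerRoute(
  ({ url }) => url.pathname.startsWith('/reports/'),
  new StaleWhileRevalidate({ cacheName: 'reports' }),
);
```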

The service worker also handles background sync, queuing failed API writes and replaying them when connectivity returns. The Background Sync API makes this straightforward for simple cases. For complex sync requirements, we implement custom retry logic in the service worker with exponential backoff and conflict detection.
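For the simple case, Workbox's background-sync plugin handles the queue-and-replay; the queue name and route pattern below are illustrative.

```typescript
// Queue failed API writes and replay them when connectivity returns.
import { registerRoute } from 'workbox-routing';
import { NetworkOnly } from 'workbox-strategies';
import { BackgroundSyncPlugin } from 'workbox-background-sync';

const writeQueue = new BackgroundSyncPlugin('api-writes', {
  maxRetentionTime: 24 * 60, // retry for up to 24 hours (in minutes)
});

// POSTs go straight to the network; on failure the plugin stores the
// request in IndexedDB and replays it when the 'sync' event fires.
registerRoute(
  ({ url }) => url.pathname.startsWith('/api/'),
  new NetworkOnly({ plugins: [writeQueue] }),
  'POST',
);
```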

Testing Offline Behaviour

The biggest mistake teams make is testing offline behaviour manually. "I turned off Wi-Fi and clicked around" is not a test strategy. You need automated tests that simulate specific network conditions: complete disconnection, intermittent connectivity, high latency, packet loss, and mid-request failures.

Playwright's network emulation is good for browser-based testing. For system-level testing, we use Toxiproxy to simulate network conditions between services. The test suite includes specific scenarios: a user creates a record offline, goes online, and finds another user has modified the same record (verify conflict resolution); a user starts a large upload, loses connectivity mid-transfer, and reconnects (verify resumption); two users both work offline for 8 hours, then sync simultaneously (verify all data is preserved).
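The first of those scenarios translates roughly into a Playwright test like this; the URL, selectors, and sync-status element are illustrative, assuming the app exposes a visible sync indicator.

```typescript
import { test, expect } from '@playwright/test';

test('record created offline syncs after reconnect', async ({ page, context }) => {
  await page.goto('https://app.example.com/inspections');

  // Simulate complete disconnection.
  await context.setOffline(true);
  await page.getByRole('button', { name: 'New inspection' }).click();
  await page.getByLabel('Notes').fill('Created while offline');
  await page.getByRole('button', { name: 'Save' }).click();

  // The record must exist locally even with no network.
  await expect(page.getByText('Created while offline')).toBeVisible();

  // Restore connectivity and wait for the sync to complete.
  await context.setOffline(false);
  await expect(page.getByTestId('sync-status')).toHaveText('Synced');
});
```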

Offline-first is hard. But the architecture patterns exist, the tooling has matured, and for users in connectivity-challenged environments, reliable offline support is the difference between a usable system and a paperweight. That's worth the engineering investment.