
The Hidden Cost of Bad Data Sanitisation

Blue Neon · 8 January 2026 · 7 min read

Data sanitisation is one of those topics that everyone agrees is important and almost nobody does properly. It sits in the category of work that's invisible when done right and catastrophic when done wrong, like plumbing or seatbelts. And like plumbing, most organisations don't think about it until sewage is backing up into the living room.

This goes beyond SQL injection. That's sanitisation 101, and if you're still vulnerable to Bobby Tables in 2026, you have bigger problems. We're talking about the broader discipline of ensuring that data entering, moving through, and leaving your systems is valid, safe, and appropriately handled at every boundary.

The Surface Area Is Larger Than Expected

Most developers think of sanitisation as "clean user input before it hits the database." That's one boundary. In a modern system, there are dozens. API inputs from external services. Webhook payloads. File uploads. Data imported from partner systems via SFTP. CSV uploads from non-technical users. Message queue payloads. Configuration files. Environment variables. Log data that gets ingested into analytics pipelines. Every one of these is an entry point where malformed, malicious, or unexpected data can enter your system.

We audited a government system last year that had robust input validation on its web forms: parameterised queries, HTML encoding, the works. But it also ingested data from four partner agencies via nightly batch files. Those files were loaded directly into the database with no validation beyond a basic schema check. One agency started including pipe characters in a free-text field. The ETL process used pipes as delimiters. Three months of data was silently corrupted before anyone noticed.

"The most dangerous data doesn't trigger an error. It looks valid enough to make it through, then silently corrupts everything downstream."

Beyond Injection: The Real Threats

Cross-System Encoding Issues

Unicode normalisation attacks are real and underappreciated. Visually identical strings can have different byte representations. A username that looks like "admin" but uses a Cyrillic 'а' instead of a Latin 'a' can bypass string comparison checks while looking identical on screen. We enforce Unicode normalisation (NFC form) at every input boundary and use canonical comparison for any security-relevant string matching.
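A minimal stdlib sketch of both halves of this defence. NFC normalisation unifies different byte representations of the same visible string, but it does not unify lookalikes across scripts, so security-relevant comparison also needs a script check. The helper names here are ours for illustration, not from any library:

```python
import unicodedata

def canonicalise(value: str) -> str:
    # NFC so composed and decomposed forms of the same text compare equal
    return unicodedata.normalize("NFC", value)

# "é" typed as one codepoint vs. "e" + combining accent: NFC unifies them
assert canonicalise("\u00e9") == canonicalise("e\u0301")

latin = "admin"
homoglyph = "\u0430dmin"  # Cyrillic 'а' + "dmin" -- identical on screen

# NFC does NOT unify different scripts; the strings still differ...
assert canonicalise(latin) != canonicalise(homoglyph)

# ...so canonical comparison must also reject unexpected scripts.
# A crude allow-list check: every letter must be a Latin codepoint.
def is_latin_only(value: str) -> bool:
    return all("LATIN" in unicodedata.name(ch, "")
               for ch in value if ch.isalpha())

assert is_latin_only(latin)
assert not is_latin_only(homoglyph)
```

In practice the normalisation call sits in the boundary layer so nothing downstream ever sees an un-normalised string.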

File Upload Attacks

File extension checks are theatre. A file named "report.pdf" can contain executable code. MIME type headers can be spoofed. We validate file content by inspecting magic bytes, process uploads through a sanitisation pipeline (LibreOffice for document conversion, ImageMagick with a restrictive policy.xml for images), and serve uploaded files from a separate domain with restrictive Content-Security-Policy headers. Files are never served directly from the upload location. They're always processed, re-encoded, and served from a CDN or object store with appropriate headers.

Log Injection

User-controlled data that ends up in log files can be used to forge log entries, confuse log analysis tools, or inject terminal escape sequences that execute code when an admin views logs in a terminal. We sanitise all user-controlled data before logging: stripping control characters, escaping newlines, and truncating to prevent log flooding. Structured logging (JSON format via Pino or Serilog) helps because the data is always in a quoted value field, never interpreted as structure.
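The three operations named above (strip control characters, neutralise newlines, truncate) fit in one small function. This is a stdlib sketch; the truncation limit and function name are our choices for illustration:

```python
import re

# Control characters, including \n, \r, and the ESC (\x1b) that starts
# terminal escape sequences, plus DEL.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")
MAX_LOG_FIELD = 200  # illustrative cap to prevent log flooding

def sanitise_for_log(value: str, limit: int = MAX_LOG_FIELD) -> str:
    """Strip control characters, then truncate, before a value is logged."""
    return CONTROL_CHARS.sub("", value)[:limit]

# Forged-entry attempt: injected newline, fake log line, screen-clear escape
evil = "alice\n2026-01-08 INFO login ok user=admin\x1b[2J"
safe = sanitise_for_log(evil)
assert "\n" not in safe and "\x1b" not in safe
```

With structured JSON logging the sanitised value then lands inside a quoted field, so even a missed character cannot be parsed as log structure.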

The Compliance Dimension

In regulated environments, data sanitisation intersects with data governance in ways that create legal liability. PII (personally identifiable information) that leaks into log files violates the Privacy Act. Health data that ends up in analytics systems without proper de-identification violates the My Health Records Act. Credit card numbers in database fields that aren't encrypted violate PCI DSS.

We implement what we call "sanitisation by classification." Every data field in the system has a sensitivity classification: public, internal, sensitive, restricted. The sanitisation rules are tied to the classification, not the field name. A field classified as "restricted" is automatically encrypted at rest, excluded from log output, masked in non-production environments, and scrubbed from analytics exports. This is enforced by middleware, not by individual developer discipline, because discipline fails first under deadline pressure.
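The core of the idea is that handling rules key off the classification, never the field name. A minimal sketch of the log-output rule, with all names (the enum, the redaction marker, the field type) ours for illustration:

```python
from dataclasses import dataclass
from enum import Enum

class Classification(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    SENSITIVE = "sensitive"
    RESTRICTED = "restricted"

# The rule lives here, once, in the middleware -- not in call sites.
LOGGABLE = {Classification.PUBLIC, Classification.INTERNAL}

@dataclass
class Field:
    name: str
    value: str
    classification: Classification

def render_for_log(fields: list[Field]) -> dict[str, str]:
    """Mask anything whose classification is not cleared for log output."""
    return {f.name: f.value if f.classification in LOGGABLE else "***REDACTED***"
            for f in fields}

record = [
    Field("order_id", "A-1001", Classification.PUBLIC),
    Field("card_number", "4111111111111111", Classification.RESTRICTED),
]
assert render_for_log(record) == {
    "order_id": "A-1001",
    "card_number": "***REDACTED***",
}
```

The same dispatch-on-classification shape drives the other rules (encryption at rest, masking in non-production, exclusion from analytics exports), which is what keeps the policy enforceable rather than a matter of developer discipline.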

Tools like Microsoft Presidio can automatically detect and redact PII in text fields. For structured data, we define sanitisation transformers per classification level and apply them at the boundary between trusted and untrusted zones. This creates a clear audit trail: data classified as sensitive was sanitised by transformer X at boundary Y at timestamp Z.

Practical Implementation

The implementation pattern we use everywhere is "validate at the boundary, trust inside the boundary." Define your trust boundaries explicitly. At each boundary, data passes through a validation and sanitisation layer that is separate from business logic. This layer uses schema validation (Zod in TypeScript, Pydantic in Python, JSON Schema for API contracts) to enforce structure, custom validators for business rules, and sanitisation functions for security-relevant transformations.

Critically, validation and sanitisation are different operations. Validation rejects invalid data; sanitisation transforms data to make it safe. The right approach is to do both: validate first, then sanitise what passes. Sanitising without validating can mask errors. Validating without sanitising leaves you vulnerable to data that's structurally valid but semantically dangerous.
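The distinction can be shown in a few lines. This is a hand-rolled sketch of the pattern (in a real codebase the validation half would be a Zod or Pydantic schema); the field names and rules are illustrative:

```python
import re

class ValidationError(ValueError):
    """Raised at the boundary; invalid data never enters the system."""

USERNAME_RE = re.compile(r"^[a-z0-9_]{3,32}$")

def validate_username(raw: str) -> str:
    # Validation REJECTS: structurally invalid input stops here.
    if not USERNAME_RE.fullmatch(raw):
        raise ValidationError(f"invalid username: {raw!r}")
    return raw

def sanitise_comment(raw: str) -> str:
    # Sanitisation TRANSFORMS: valid-but-risky input is made safe for its sink.
    return raw.strip().replace("<", "&lt;").replace(">", "&gt;")

def ingest(username: str, comment: str) -> dict[str, str]:
    # Order matters: validate first, then sanitise what passed.
    return {"username": validate_username(username),
            "comment": sanitise_comment(comment)}

assert ingest("alice_01", " <b>hi</b> ")["comment"] == "&lt;b&gt;hi&lt;/b&gt;"
```

Keeping this layer separate from business logic means the rest of the code can trust its inputs, which is the whole point of the boundary.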

Test your sanitisation. Not with a few happy-path examples, but with fuzzing. Tools like AFL++ and Atheris (for Python) generate millions of malformed inputs and verify that your system handles them gracefully. Property-based testing with Hypothesis or fast-check is another excellent approach. Define the invariants your sanitisation should maintain, and let the testing framework find inputs that violate them.
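The property-based idea, sketched with nothing but the stdlib (a real project would use Hypothesis or fast-check, which shrink failing inputs for you). The sanitiser and its invariants here are illustrative:

```python
import random
import string

def sanitise_for_log(value: str) -> str:
    """Sanitiser under test. Invariants: printable output, bounded length."""
    return "".join(ch for ch in value if ch.isprintable())[:200]

def random_input(rng: random.Random) -> str:
    # Nasty alphabet: printables plus NUL, ESC, a combining mark, Cyrillic 'а'
    alphabet = string.printable + "\x00\x1b\u0301\u0430"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 300)))

rng = random.Random(42)  # seeded so any failure is reproducible
for _ in range(10_000):
    out = sanitise_for_log(random_input(rng))
    # Assert the invariants, not specific outputs:
    assert all(ch.isprintable() for ch in out)
    assert len(out) <= 200
```

The shift in mindset is the valuable part: instead of enumerating inputs you thought of, you state what must always hold and let randomness hunt for the input you didn't think of.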

Bad data sanitisation is a slow-motion catastrophe. It doesn't crash your system spectacularly. It corrodes it gradually, creating security vulnerabilities, compliance violations, and data quality issues that compound over time. The cost of fixing it after the fact is always higher than the cost of doing it right from the start, because by the time you notice, the bad data is everywhere.