Research: How Surveillance Systems Build Durable Identities
Technical and structural analysis of fingerprinting, identity graphs, metadata, and the economics of modern tracking infrastructure.
Fingerprinting Is a Correlation Problem, Not a Single Data Point
Browser fingerprinting is often described as though it extracts one unique identifier. The reality is more subtle — and harder to defend against. Fingerprinting works by combining many semi-stable parameters, each individually common, into a combination that is rare or unique.
A 2020 survey by Laperdrix et al. in ACM Transactions on the Web analyzed 37 fingerprinting attributes and found that combinations of screen resolution, installed fonts, canvas rendering, and timezone alone achieve identification rates above 90% on desktop browsers. The key insight: you don't need high entropy in any single attribute; you need low correlation between attributes.
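This entropy arithmetic can be made concrete. The following sketch, using a small invented population (the values are illustrative, not measured), shows how three attributes that each carry roughly one bit of information combine into a far more identifying joint signal:

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy population of 8 browsers, three attributes each. Real surveys
# measure thousands of browsers and dozens of attributes.
screens = ["1920x1080", "1920x1080", "1920x1080", "1366x768",
           "1366x768", "1920x1080", "1366x768", "1920x1080"]
zones   = ["UTC-5", "UTC+1", "UTC-5", "UTC+1",
           "UTC-5", "UTC+1", "UTC+1", "UTC-5"]
cores   = ["8", "4", "4", "8", "4", "8", "8", "4"]

# Each attribute alone carries about 1 bit; the joint tuple carries
# 2.5 bits here, because the attributes are only weakly correlated.
combined = list(zip(screens, zones, cores))
```

Scaling the same arithmetic to dozens of attributes over millions of browsers is what pushes real-world fingerprints toward uniqueness.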
The five fingerprinting layers
- Passive (HTTP headers): User-Agent, Accept-Language, Accept-Encoding — transmitted with every request, no JavaScript required.
- Active (JavaScript API): Screen dimensions, color depth, timezone, platform, hardware concurrency (CPU core count), device memory.
- Canvas fingerprinting: A hidden HTML5 canvas element is drawn and read back — subtle GPU/driver rendering differences create a unique hash per device.
- WebGL fingerprinting: Queries GPU vendor strings, renderer details, and 3D rendering behavior. More stable than canvas across browser updates.
- Behavioral biometrics: Mouse movement velocity, keystroke timing, scroll patterns, and touch pressure (mobile) — these change slowly and are very difficult to spoof.
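The layers above all feed a combiner that reduces many semi-stable signals to one stable identifier. A minimal sketch of that final step, with hypothetical active-layer attribute values (real trackers also mix in canvas and WebGL readbacks):

```python
import hashlib

def fingerprint_hash(attributes: dict) -> str:
    """Canonicalize attribute key/value pairs and hash them into one ID.

    Sorting the keys makes the hash independent of collection order,
    so the identifier stays stable across visits.
    """
    canonical = "|".join(f"{k}={v}" for k, v in sorted(attributes.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Hypothetical values for the active (JavaScript) layer listed above.
fp = fingerprint_hash({
    "screen": "1920x1080",
    "color_depth": "24",
    "timezone": "America/New_York",
    "platform": "Win32",
    "hardware_concurrency": "8",
    "device_memory": "8",
})
```

Note the defensive implication: changing any single attribute changes the hash entirely, which is why anti-fingerprinting tools normalize values rather than randomize them per-site (a randomized value that is rare is itself identifying).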
Mitigation requires attacking all five layers simultaneously. The Tor Browser does this by normalizing most API responses to fixed values and routing traffic through the Tor network. Firefox with privacy.resistFingerprinting addresses most active-layer attacks but not behavioral biometrics.
Identity Graphs: Why Data Becomes More Valuable When Joined
One of the least intuitive aspects of modern data systems is that economic value increases with linkage rather than precision. A rough location pattern, a stable device fingerprint, repeated session timing, and a payment-adjacent event stream become more valuable together than any single accurately known field.
Identity graphs (also called "identity resolution" or "customer stitching" in industry language) work by probabilistically linking different data fragments to the same real-world person. Key techniques:
- Deterministic matching: Uses shared known identifiers — email addresses, phone numbers, device IDs — to merge records with high confidence.
- Probabilistic matching: Uses shared behavioral signals — IP ranges, timing patterns, movement correlations — to infer linkage without a shared identifier. Confidence is expressed as a probability score, not a binary match.
- Cross-device bridging: Your laptop and phone share IP ranges, Wi-Fi access points, and time-correlated browsing sessions. Ad platforms use these to build a unified cross-device profile.
- Third-party data enrichment: Data broker records — voter registration, purchase history, property records, magazine subscriptions — are joined with behavioral data to fill gaps and increase profile completeness.
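The deterministic/probabilistic split above can be sketched as a simple scoring function. The weights and the 0.99/0.95 caps here are illustrative assumptions, not values from any real system; production identity-resolution pipelines learn such weights from labeled data (e.g. with Fellegi-Sunter style models):

```python
# Hard identifiers: a shared match short-circuits to near-certainty.
DETERMINISTIC = {"email", "phone", "device_id"}

# Illustrative evidence weights for behavioral signals.
WEIGHTS = {"ip_prefix": 0.35, "session_hours": 0.25,
           "user_agent": 0.15, "timezone": 0.10}

def link_confidence(a: dict, b: dict) -> float:
    """Score whether two data fragments describe the same person.

    Deterministic identifiers dominate; otherwise weakly identifying
    behavioral overlaps accumulate into a probabilistic score.
    """
    shared = {k for k in a.keys() & b.keys() if a[k] == b[k]}
    if shared & DETERMINISTIC:
        return 0.99
    return min(sum(WEIGHTS.get(k, 0.0) for k in shared), 0.95)
```

The point of the sketch: no single behavioral signal is conclusive, but two or three overlapping ones push the score high enough to merge profiles at commercial confidence levels.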
"The question is not whether a person is named. The question is whether that person can be recognized, scored, influenced, or filtered with commercially meaningful confidence." — Paul Ohm, Georgetown Law, "Broken Promises of Privacy" (2010)
Metadata Survives Successful Encryption
This is the most persistently misunderstood aspect of communications privacy. End-to-end encryption protects the content of a message. It does not protect the fact that a message was sent, between which parties, at what time, from what location, and of what size.
The 2014 Stanford study "MetaPhone: The Sensitivity of Telephone Metadata" analyzed call metadata from over 500 volunteers and found:
- From call patterns alone, researchers could identify medical conditions (calls to oncologists, cardiologists), legal situations (calls to bankruptcy attorneys), and personal relationships — with no access to call content.
- 15 minutes of call metadata per person was sufficient to infer sensitive personal attributes with statistical confidence.
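The inference mechanism in the findings above is simple: map callee numbers to business categories via public reverse lookups, then flag sensitive categories. A toy sketch with an invented directory (the numbers and categories are hypothetical):

```python
from collections import Counter

# Hypothetical reverse-lookup directory, standing in for the public
# directories used to categorize callees.
DIRECTORY = {
    "+1-555-0101": "oncology clinic",
    "+1-555-0102": "bankruptcy attorney",
    "+1-555-0103": "pizza delivery",
}

SENSITIVE = {"oncology clinic", "bankruptcy attorney"}

def sensitive_inferences(call_log):
    """Infer sensitive attributes from who was called - never from content.

    `call_log` is a list of (callee_number, duration_seconds) pairs,
    i.e. pure metadata of the kind encryption does not hide.
    """
    categories = Counter(DIRECTORY.get(num, "unknown") for num, _dur in call_log)
    return {cat: n for cat, n in categories.items() if cat in SENSITIVE}

calls = [("+1-555-0101", 620), ("+1-555-0101", 300),
         ("+1-555-0103", 90), ("+1-555-0199", 45)]
```

Repeated calls to a single sensitive category are what make the inference statistically confident: one call to an oncology clinic is ambiguous, five are not.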
Messaging metadata analysis is equally revealing. Even with Signal's sealed sender feature, network-level traffic analysis (observing when packets of what size are sent and received, from a vantage point such as an ISP or a compromised relay) can reconstruct conversation timing and frequency with sufficient resources.
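A toy version of that end-to-end correlation attack: an observer who sees only timestamps at two vantage points, never payloads, can still link the endpoints when send and receive times line up. The timestamps below are invented for illustration:

```python
def timing_correlation(sent, received, window=2.0):
    """Fraction of sender events matched by a receive within `window` seconds.

    A score near 1.0 suggests the two observed endpoints are talking to
    each other - inferred purely from metadata.
    """
    if not sent:
        return 0.0
    matched = sum(1 for t in sent if any(0.0 <= r - t <= window for r in received))
    return matched / len(sent)

# Hypothetical timestamps (seconds) observed at two network vantage points.
alice_sends = [0.0, 10.0, 20.0, 31.0]
bob_receives = [0.4, 10.6, 21.3, 31.9]
```

Real attacks must handle cover traffic, batching, and clock skew, which is why the text qualifies this with "with sufficient resources" - but the underlying signal is exactly this timing alignment.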
Decentralization's Hidden Choke Points
The assumption that decentralized systems are immune to surveillance or censorship is technically incorrect. Every distributed system has dependencies that create de facto centralization:
- DNS: Domain name resolution is hierarchical and controlled. A decentralized app with a .com address can be de-platformed at the DNS level.
- CDN and hosting: Cloudflare serves over 20% of all internet traffic. A "decentralized" app hosted behind Cloudflare is centralized at the network layer.
- App stores: Mobile applications — even those connecting to decentralized networks — must pass through the Apple App Store or Google Play, both of which can remove apps at the request of governments or under their own policies.
- Bootstrap nodes: P2P networks require initial connection points ("bootstrap nodes" or "seed nodes") that are typically operated by the core development team.
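The bootstrap dependency is easy to see in code. Most P2P clients ship with a hard-coded node list like the hypothetical one below (the hostnames are invented); if every listed node is blocked, seized, or shut down, a new peer cannot join the network at all:

```python
# Hypothetical hard-coded bootstrap list - the de facto choke point.
BOOTSTRAP_NODES = [
    ("bootstrap1.example.net", 4001),
    ("bootstrap2.example.net", 4001),
]

def join_network(connect):
    """Try each bootstrap node in order using the supplied connect function.

    Raises ConnectionError when all nodes are unreachable: however
    decentralized the steady-state network is, joining it depends on
    this small, centrally operated list.
    """
    for host, port in BOOTSTRAP_NODES:
        try:
            return connect(host, port)
        except ConnectionError:
            continue
    raise ConnectionError("all bootstrap nodes unreachable")
```

Some networks mitigate this with peer caching, DNS seeds, or local peer discovery, but first contact still tends to route through infrastructure the core team controls.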
True censorship resistance requires operating at every layer simultaneously: decentralized naming (ENS, OpenNIC), decentralized hosting (IPFS, Freenet), direct P2P protocols without app store dependency, and funding that doesn't rely on traditional payment rails.