Privacy-Preserving Process Mining

Collecting event traces from employees' desktop activity in a privacy-first way
May 27th, 2026
Dei Vilkinsons
CEO & Founder

Why process mining matters

Process mining is the closest thing most organizations have to ground truth about how they actually operate. Where interviews, workshops and process diagrams capture the idealized version of a workflow, mined event logs capture what people and systems actually do — including the workarounds, exception paths, rework loops, and informal handoffs that determine whether a process succeeds or fails.

That ground truth is the foundation for almost everything we care about at HASH. It feeds the decision intelligence work we're doing in safety-critical domains; it grounds the agent-based simulations and digital twins of enterprises that are built on HASH; it provides conformance signal for the Petri net-based orchestration of agentic workflows; and — at scale — it's the raw material from which a process foundation model might one day learn the latent state of real organizations. Without good event data, all of these projects are working from a sketch rather than a photograph.

Why process mining sucks

Despite this, adoption of process mining inside many of the organizations that would benefit from it the most remains patchy. Process mining technology has existed in commercial form for well over a decade, and yet many of the firms we work with either don't use it at all, or only employ it on a narrow slice of their ERP-resident workflows where events are generally already centralized.

In our own research with biopharmaceutical supply-chain leaders, the single most consistent objection to outside AI tooling was opacity around data security, privacy and governance. Practitioners told us they routinely passed on solutions that required sensitive operational data to be shipped to an external cloud — not because they believed vendors untrustworthy, but because the certification and procurement burden was prohibitive, and the consequences of getting it wrong were severe. The same dynamic plays out across every regulated industry we've spoken to: life sciences, financial services, defence, healthcare, and critical infrastructure.

The constraints aren't only commercial. In the EU, comprehensive workplace monitoring runs straight into the GDPR's data-minimization principle and into the co-determination rights of works councils, both of which require a credible answer to "what exactly are you collecting, on whom, and why?" before deployment is even possible. Most incumbent process-mining tools — and the newer wave of agent-monitoring and "employee productivity" stacks — were architected from the opposite direction: capture everything that moves, ship it to a SaaS backend, and figure out the governance story afterwards. That posture is a non-starter in any environment where the cost of a mistake is measured in patient harm, regulatory exposure, or breach of trust.

The result is that the organizations who would derive the most value from honest, end-to-end process visibility are precisely the ones who cannot deploy the tools that currently provide it.

A privacy-first architecture

Utilizing the provenance-aware data engine at the heart of HASH, we've been exploring what an alternative might look like: a desktop agent that collects rich event traces from the applications and websites users interact with, while making it structurally difficult — and in many cases impossible — to abuse.

The agent is built around a small number of architectural commitments. None of them is novel on its own, but combining all of them in one tool is, as far as we're aware, new.

HASH's desktop task/process mining agent is only available to enterprise customers who have agreed to additional privacy and security safeguards. This post explores the technical measures we take to protect end-users beyond these contractual requirements placed on firms deploying our technology.

Local-first semantic extraction

Raw screen content, window titles, and keystrokes never leave the device. A small on-device model continuously converts these raw signals into typed events — instances of semantic types such as "invoice opened", "approval submitted", "shipment rerouted", or "ticket closed" — and only the typed events are ever transmitted to the user's web on their organization's HASH instance.

This matters because the typed events carry only the information needed to reconstruct the process; they don't carry the underlying screen pixels, document contents, or keystroke streams from which they were derived. A leak of typed events is a very different incident from a leak of raw recordings, and because raw recordings are never transmitted, the architecture makes the latter category of leak from the organization's HASH instance technically impossible. Semantic typing also enables semantic filtering, which goes beyond crude "structural" checks on content: overly zealous structural checks cause important information to be missed, and incomplete ones risk exposing sensitive information.
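To make the raw-versus-typed distinction concrete, here is a minimal sketch in Python. The names `RawSignal`, `TypedEvent`, and `extract` are hypothetical and not taken from the agent itself, and the keyword rule is only a stand-in for the on-device model, whose workings aren't described here.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RawSignal:
    """Hypothetical raw capture. In the design above this stays on the device."""
    window_title: str
    screen_text: str
    keystrokes: str


@dataclass(frozen=True)
class TypedEvent:
    """Hypothetical typed event: the only record that would be transmitted."""
    event_type: str                      # e.g. "invoice opened"
    actor: str
    occurred_at: str                     # ISO 8601 timestamp
    attributes: dict = field(default_factory=dict)


def extract(signal: RawSignal, actor: str, now: datetime) -> Optional[TypedEvent]:
    """Stand-in for the on-device model: a trivial keyword rule, for illustration only."""
    if "invoice" in signal.window_title.lower():
        return TypedEvent("invoice opened", actor, now.isoformat())
    return None  # nothing recognized, so nothing is emitted


signal = RawSignal("Invoice 123 - ERP", "confidential body text", "abc")
event = extract(signal, "user-1", datetime(2026, 1, 5, 10, 0, tzinfo=timezone.utc))
payload = asdict(event)  # what would be sent; it has no raw fields at all
```

The point of the sketch is structural: the transmitted payload has no field in which screen text or keystrokes could travel.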

Optional on-device PII stripping

For organizations that want to go further, the agent can apply an additional on-device pass that scrubs names, email addresses, internal identifiers, and other PII from typed events before they're transmitted. This is off by default, because for many process-mining use cases the identity of the actor is precisely what matters (segregation-of-duties checks, individual coaching, workload analysis), but it's a one-flag opt-in for orgs operating under stricter privacy regimes, and a useful complement to the aggregation-level k-anonymity work described below.
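As a rough illustration of what an opt-in scrubbing pass might look like, the sketch below applies regular expressions to the string fields of a typed event. Everything in it is an assumption made for the example: the `strip_pii` function, the `enabled` flag name, the `EMP-` identifier format, and the idea of supplying a list of known names. A real pass would likely need something more robust than pattern matching.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical internal identifier format, e.g. "EMP-004217".
EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")


def strip_pii(event: dict, known_names: list, enabled: bool = False) -> dict:
    """Return a scrubbed copy of the event; a no-op unless the flag is set."""
    if not enabled:
        return event  # off by default, mirroring the opt-in described above

    def scrub(value):
        if not isinstance(value, str):
            return value  # numbers and other non-text values pass through
        value = EMAIL.sub("[email]", value)
        value = EMPLOYEE_ID.sub("[id]", value)
        for name in known_names:
            value = re.sub(re.escape(name), "[name]", value, flags=re.IGNORECASE)
        return value

    return {key: scrub(val) for key, val in event.items()}


event = {
    "event_type": "approval submitted",
    "actor": "Jane Doe",
    "note": "Sent to jane.doe@example.com by EMP-004217",
    "amount": 120,
}
scrubbed = strip_pii(event, ["Jane Doe"], enabled=True)
```

With the flag left at its default the event is returned unchanged, which keeps actor identity available for use cases such as segregation-of-duties checks.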

Whitelist and blacklist modes

Organizations deploying the agent choose between opt-in (whitelist) and opt-out (blacklist) operation, applied independently to applications, websites, and — experimentally — project and activity types (i.e. the kind and granularity of data collected in the first place).

In blacklist mode, the agent observes everything except an enumerated set of excluded surfaces (password managers, HR portals, personal banking, healthcare apps, and anything else an org wants to guarantee is never observed). In whitelist mode, the agent observes nothing unless the surface has been explicitly approved. Whitelist mode is the right default for organizations subject to works-council co-determination or operating in highly regulated environments; blacklist mode is more practical for broad discovery work.

In both cases users are made aware of exactly what's on the relevant allow- or deny-list, so it's never a surprise what information ends up in the event log.
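The two modes reduce to a small decision rule, sketched below. `Mode`, `SurfacePolicy`, and `may_observe` are hypothetical names chosen for the example, and matching surfaces by lower-cased name is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    WHITELIST = "whitelist"   # opt-in: observe nothing unless listed
    BLACKLIST = "blacklist"   # opt-out: observe everything except what is listed


@dataclass(frozen=True)
class SurfacePolicy:
    mode: Mode
    surfaces: frozenset  # the allow-list or deny-list, depending on mode

    def may_observe(self, surface: str) -> bool:
        listed = surface.lower() in self.surfaces
        return listed if self.mode is Mode.WHITELIST else not listed


# Policies are applied independently per kind of surface.
applications = SurfacePolicy(Mode.BLACKLIST, frozenset({"password manager", "hr portal"}))
websites = SurfacePolicy(Mode.WHITELIST, frozenset({"erp.example.com"}))
```

Because the policy is plain data, the same object that drives collection could also be rendered to users as the list of what is and isn't observed.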

Time-of-day restrictions

Collection can be constrained to defined work hours — typically weekday 09:00–17:00 in each user's local timezone, but configurable per user, per team, or geographically. The intent is straightforward: a work device used for personal browsing in the evening should not produce events in the corporate process graph. The active and inactive windows are presented in the agent's user-facing UI, so subjects can see at a glance exactly when the agent is and isn't awake and observing.
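A work-hours check of this kind can be sketched in a few lines. The function name `within_work_hours` and its parameters are assumptions for the example. The caller passes the user's timezone as a `tzinfo` object; a fixed offset is used here to keep the sketch self-contained, whereas a real deployment would presumably use a named zone so that daylight-saving changes are handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def within_work_hours(
    instant: datetime,
    user_tz: tzinfo,
    start: time = time(9, 0),
    end: time = time(17, 0),
    workdays: frozenset = frozenset({0, 1, 2, 3, 4}),  # Monday=0 .. Friday=4
) -> bool:
    """True if the instant falls inside the user's local collection window."""
    local = instant.astimezone(user_tz)
    return local.weekday() in workdays and start <= local.time() < end


# Illustrative fixed offset (UTC+1); an assumption, not a real user's zone.
user_tz = timezone(timedelta(hours=1))
monday_morning = datetime(2026, 1, 5, 8, 30, tzinfo=timezone.utc)   # 09:30 local
monday_evening = datetime(2026, 1, 5, 16, 30, tzinfo=timezone.utc)  # 17:30 local
```

The end of the window is treated as exclusive, so an event at exactly 17:00 local time falls outside it.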

User-facing, redactable logs

Every event the agent produces is visible for review before it's shared with anyone else. Users can inspect events, redact fields, or drop events entirely. Withholding information leaves an auditable placeholder — recording that something was removed and by whom, but not what — rather than disappearing silently. This preserves enough integrity in the downstream event log to support honest process discovery, while keeping the user firmly in control of what their employer (or their own future self) gets to see.

Because entities are semantically typed, this review process is practical. In traditional task mining daemons, by contrast, the format of events renders them all but unreadable to the average end-user.
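The "auditable placeholder" idea described above can be sketched as follows. The names `Withheld`, `redact_fields`, and `drop_events`, and the shape of the placeholder, are hypothetical; the only property taken from the text is that a removal records that something was withheld and by whom, but not what.

```python
from dataclasses import dataclass

REDACTED = "[redacted]"


@dataclass(frozen=True)
class Withheld:
    """Placeholder left behind when a user drops an event entirely."""
    position: int      # where in the log the event sat
    withheld_by: str   # who removed it; the content itself is not kept


def redact_fields(event: dict, fields: set, user: str) -> dict:
    """Blank the chosen fields and note which were redacted, and by whom."""
    present = sorted(f for f in fields if f in event)
    out = {key: (REDACTED if key in present else value) for key, value in event.items()}
    out["redactions"] = [{"field": f, "by": user} for f in present]
    return out


def drop_events(log: list, positions: set, user: str) -> list:
    """Replace dropped events with placeholders so the log keeps its length."""
    return [Withheld(i, user) if i in positions else e for i, e in enumerate(log)]


log = [
    {"event_type": "invoice opened", "actor": "u1", "invoice_id": "INV-9"},
    {"event_type": "ticket closed", "actor": "u1"},
]
first = redact_fields(log[0], {"invoice_id"}, "u1")
reviewed = drop_events(log, {1}, "u1")
```

Keeping the placeholder in position means downstream discovery can tell a gap in the record from the absence of activity.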

The agent has no silent mode. Its current state — observing, paused, restricted to a whitelist, restricted to work hours — is always visible in the operating system's status bar, and users can pause collection from the same UI at any time. Initial consent is explicit, scoped, and re-confirmed on a regular cadence rather than buried in a one-time onboarding flow.

No third-party processors

The full pipeline runs in only two places: on the user's own device (raw capture, on-device semantic extraction, optional PII stripping) and inside the organization's own HASH instance, hosted or self-hosted (storage and downstream mining), with transmission only between the two. There are no third-party SaaS sub-processors involved in the handling of event data, and nothing about the architecture requires there to be in the future. For organizations that need to review our default posture, our overall list of sub-processors is published as part of our privacy statement.

Provenance and permissions

Every event is stored in a user's or organization's HASH web, carrying consent context as first-class metadata. Permissions are revocable at any granularity: a single event, an application, a project, a day, a quarter. Revocation propagates through the graph and is reflected in any downstream mined model.

This is the part of the architecture we have invested in the most. Process mining only earns trust if the consent trail behind each event is legible and durable, and that's exactly what we've worked hard to make true.
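One way to picture revocation at several granularities is as a filter in which each unset field acts as a wildcard. This is only a sketch under stated assumptions: the `Event` and `Revocation` shapes and the `consented` function are hypothetical, and it does not attempt to show how propagation through the graph actually works.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    application: str
    project: str
    day: date


@dataclass(frozen=True)
class Revocation:
    """Each field left as None is a wildcard. A revocation with every field
    None would cover everything, which a real system would want to validate."""
    event_id: Optional[str] = None
    application: Optional[str] = None
    project: Optional[str] = None
    start: Optional[date] = None   # inclusive; a day or a quarter is a date range
    end: Optional[date] = None     # inclusive

    def covers(self, e: Event) -> bool:
        return (
            (self.event_id is None or self.event_id == e.event_id)
            and (self.application is None or self.application == e.application)
            and (self.project is None or self.project == e.project)
            and (self.start is None or e.day >= self.start)
            and (self.end is None or e.day <= self.end)
        )


def consented(events: list, revocations: list) -> list:
    """Events still covered by consent; mined models would be re-derived from these."""
    return [e for e in events if not any(r.covers(e) for r in revocations)]


events = [
    Event("e1", "ERP", "alpha", date(2026, 1, 5)),
    Event("e2", "Email", "alpha", date(2026, 1, 5)),
    Event("e3", "ERP", "beta", date(2026, 1, 6)),
]
```

For example, `consented(events, [Revocation(application="ERP")])` leaves only the `Email` event, and any model mined afterwards would no longer reflect the revoked ones.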

K-anonymity (in development)

For organizations that want population-level process insight without any individually-linkable trace, we're currently exploring aggregation primitives that enforce k-anonymity at the point of mining — guaranteeing that any reported pattern is supported by traces from at least k distinct users. This work is in progress rather than shipped, but it pairs naturally with the optional on-device PII stripping, and we expect it to be the deployment mode of choice for the most sensitive cross-team process discovery work.
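The "at least k distinct users" condition can be illustrated with a simple support threshold over trace variants. This is a sketch rather than the aggregation primitives under development: the function name `k_anonymous_patterns` is hypothetical, and treating a whole trace as the "pattern" is a simplifying assumption.

```python
from collections import defaultdict


def k_anonymous_patterns(traces, k: int) -> dict:
    """traces: iterable of (user_id, sequence_of_activities) pairs.

    Returns {pattern: number_of_distinct_users} for patterns supported by
    at least k distinct users; everything else is suppressed.
    """
    supporters = defaultdict(set)
    for user_id, activities in traces:
        supporters[tuple(activities)].add(user_id)  # repeat runs by one user count once
    return {p: len(users) for p, users in supporters.items() if len(users) >= k}


traces = [
    ("u1", ("invoice opened", "approval submitted")),
    ("u2", ("invoice opened", "approval submitted")),
    ("u3", ("invoice opened", "approval submitted")),
    ("u1", ("invoice opened", "approval submitted")),
    ("u4", ("invoice opened", "ticket closed")),
]
reportable = k_anonymous_patterns(traces, k=3)
```

Counting distinct users rather than traces matters: one prolific user repeating a pattern many times should not be enough to make it reportable.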

What we considered, and where we want to go next

Our first instinct was to do the process mining itself on-device, so that no individual events would ever need to be centralized in the first place. We spent a meaningful amount of time exploring this design, and ultimately concluded it doesn't generalize.

The problem is straightforward: on-device mining only works if every event in a given process trace happens on the same machine. That's true for some narrow, single-user workflows, but the cross-functional processes that matter most (an invoice approval, a clinical batch release, a supply-chain reroute, a customer escalation) typically touch a dozen people across an even greater number of devices. Mining therefore has to operate against a corpus that has been centralized somewhere, even when every other privacy guarantee in the architecture above holds. The on-device commitment moves to extraction, transformation, redaction, and consent; the mining itself happens server-side.
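A toy example shows why a single device cannot reconstruct a cross-functional trace. The event keys (`case_id`, `device`, `timestamp`, `activity`), the `assemble_traces` function, and the "approval requested" activity are all hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def assemble_traces(events) -> dict:
    """Group events by case and order them by time to form process traces."""
    by_case = defaultdict(list)
    for e in events:
        by_case[e["case_id"]].append(e)
    return {
        case: [e["activity"] for e in sorted(es, key=lambda e: e["timestamp"])]
        for case, es in by_case.items()
    }


# One invoice approval touching two people's machines.
events = [
    {"case_id": "INV-1", "device": "laptop-a", "timestamp": "2026-01-05T09:00:00Z", "activity": "invoice opened"},
    {"case_id": "INV-1", "device": "laptop-b", "timestamp": "2026-01-05T11:00:00Z", "activity": "approval submitted"},
    {"case_id": "INV-1", "device": "laptop-a", "timestamp": "2026-01-05T09:30:00Z", "activity": "approval requested"},
]

central_view = assemble_traces(events)
device_view = assemble_traces(e for e in events if e["device"] == "laptop-a")
```

The view from `laptop-a` alone ends before the approval happens; only the combined corpus contains the full trace.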

That leaves an open research direction we're actively interested in: process mining over homomorphically encrypted event streams. In such a system, individual events would remain opaque even to the system performing the mining, with only aggregate process structure decryptable by authorized parties. It's a hard problem (fully homomorphic encryption is still expensive, and graph-shaped computations over encrypted data are harder still), but it's a natural complement to our other work, and consistent with the broader thesis of our Safeguarded AI involvement: building infrastructure that lets safety-critical AI operate without anyone having to surrender control of the underlying data.

How it fits with the rest of HASH

The desktop agent is one of several event-collection surfaces that feed a HASH web. It sits alongside the browser extension (which already supports passive entity capture from authenticated web applications) and the broader catalog of system integrations we maintain for ERP, CRM, ticketing, communications, and workflow systems. For most organizations the right deployment combines several of these surfaces, weighted by where the highest-value processes actually live.

Once events have arrived in the web, they're available to the same process mining, conformance checking and simulation tooling we use everywhere else. By default, organizational event data stays private to organizations themselves, and is not used in the training of our process foundation models. To learn more about our approach to security and confidentiality at HASH, visit hash.ai/security.

Event data that is as accurate, complete, and up-to-date as possible, informing process representations as comprehensive and nuanced as the real world itself, is a critical prerequisite for developing many Safeguarded AI applications, whose probabilistic safety guarantees are contingent on faithfully representing how the operations being modeled are actually run today.

Try it out

Our desktop agent is currently available to enterprise customers only. If you represent an organization interested in testing the technology out alongside us and feeding back on what works, what doesn't, and what's missing, please feel free to get in touch.

While we believe our approach to privacy goes beyond any of the affordances provided by existing task mining daemons… they're also not our benchmark to beat. Your expectations are, and we're open to further suggestions on how the privacy model can be improved.

Pilot participants get early access to the agent, hands-on support from our team in configuring its consent and collection policies for their environment, and direct input into the roadmap (including the k-anonymity and homomorphic-encryption work described above). If that sounds like you, fill in the form below and we'll be in touch within 24 hours.

For broader questions about the agent, the pilot, or the underlying technology, please get in touch.
