Research

Developing a foundation model for business processes

What we’re building

A process foundation model (PFM): a model pretrained to learn a single, transferable representation of how work actually flows through an organization, drawn from the events, entities, and relationships in its data. Rather than hand-building a model for every process or company, a PFM applies across monitoring, anomaly detection, simulation, and decision support out of the box.

Value, even from immature data

Most firms don’t have clean, complete process logs — and a PFM shouldn’t require them. By learning the latent state of a process (open obligations, missing dependencies, likely bottlenecks, reachable futures), it can flag risks and recommend interventions precisely where incomplete information defeats conventional analytics and bespoke, hand-built models.

Why no one has built one yet

Real processes are bitemporal and multi-object: facts arrive late, get corrected, and unfold concurrently. Approaches that flatten event logs into a single timeline learn to peek at the future and miss the structure that matters. Capturing valid time, transaction time, and concurrency without leaking hindsight is the gap that has kept a genuine PFM out of reach. This is a problem we’ve solved.

Why we’re well-placed

Since 2019, we’ve been building the substrate to tackle this problem — typed, multi-temporal, provenance-aware graphs (hgres), a query language for them (HashQL), and process autoformalizers and simulation tooling (Petrinaut). Now, we’re developing DG-JEPA, the representation-learning objective for process state — so we can train on data in the right shape instead of waiting for it to exist.

Premise: An Introduction to Graph-JEPA
How it works: Towards Process Foundation Models
Technical deep-dive: How Not To Build a Process Foundation Model

Open Areas of Research

We’re approaching process foundation modeling as a stack of small, decomposable bets — learn more below.

Challenge
Our central conjecture is that JEPA-style masked latent prediction on typed temporal graphs produces a more reusable representation than next-event, link-prediction, or feature-reconstruction objectives. Everything downstream of that conjecture depends on it being true — so it needs to be tested before any architectural elaboration is layered on top.
Mitigation
We plan to train families of DG-JEPA objectives that progressively mask events, object lifecycle phases, joins, handoffs, delay regions, exception paths, Petri firings, and future suffixes, comparing them against next-event, feature-reconstruction, link-prediction, graph-transformer-from-scratch, and Petri-only baselines on the same input substrate.
Success will be judged by linear probes over the learned representations, rather than by loss curves: can we recover known latent process state — enabled actions, missing dependencies, branch type, conformance status, bottleneck class, feasible future subgraphs — at materially higher fidelity than the baselines do? Where early masks teach shortcuts (e.g. predicting source-system identifiers, exploiting label leakage), we’ll tighten the mask distribution, strip identifier features, and weight targets toward Petri/temporal supervision rather than raw labels.
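A linear probe in this sense fits only a linear classifier on frozen representations and reports held-out accuracy, so any gain must come from the representation rather than from the probe. The following is a minimal sketch, assuming embeddings are plain lists of floats and the probed property is binary (for instance, a hypothetical "has a missing dependency" flag). The function names and toy vectors are illustrative assumptions, not part of DG-JEPA.

```python
import math

def train_linear_probe(embeddings, labels, epochs=200, lr=0.5):
    """Fit logistic regression on frozen embeddings; the encoder is never updated."""
    dim = len(embeddings[0])
    n = len(embeddings)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to z
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def probe_accuracy(weights, bias, embeddings, labels):
    """Fraction of examples whose linear score falls on the correct side of zero."""
    correct = 0
    for x, y in zip(embeddings, labels):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((z > 0) == bool(y))
    return correct / len(labels)

# Hypothetical frozen embeddings; label 1 = latent property present.
train_x = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, -0.1]]
train_y = [1, 1, 0, 0]
test_x = [[1.0, 0.0], [-1.0, 0.2]]
test_y = [1, 0]

weights, bias = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(weights, bias, test_x, test_y)
```

Running the same probe over representations from each baseline, on the same held-out split, gives the like-for-like comparison described above.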
Challenge
Real-world processes are bitemporal: facts arrive late, prior events get corrected, and any model that ignores the difference between “what was true at time t” and “what we now know was true at time t” will learn to peek at the future during training. Most prior work skirts this problem by treating event logs as flat sequences over a single time axis. For a foundation model, we cannot afford to.
This data gap is itself one of the central reasons no PFM exists yet — the bitemporal analogue of the broader data-shape problem set out earlier. Public process-mining datasets — BPI Challenge logs, OCEL corpora — all record a single time axis per event; none of them capture when a fact was first known versus when it was later corrected.
Mitigation
Unfortunately, solving this isn’t just a matter of minor data-cleaning. Rather, it’s the kind of “data not existing in the right shape” gap that, until very recently, made even contemplating a model like this impossible. hgres — our typed, multi-temporal, provenance-aware graph substrate — is the infrastructure we’ve spent the past seven years building specifically to close it. Now that the substrate exists, we intend to be the first through the door it has opened. Our bitemporal training and evaluation sources are accordingly hgres-recorded production data, industrial partner audit logs (which many enterprise systems maintain natively), and Petrinaut-generated synthetic logs with controlled correction patterns — complementing the public non-bitemporal corpora used elsewhere in our benchmark.
On top of those data sources, the modeling risk is whether we can actually exploit the bitemporal structure without leaking hindsight back into the model. We’ll construct explicit “as-known-at-time-t” training views and matched later-corrected target views, then compare three input regimes on the same downstream tasks: single-case flattening, object-centric views, and full bitemporal multi-object graphs. Our judgment criterion will be twofold: hindsight-leakage detection (does performance collapse when late corrections are hidden during training?) and head-to-head performance on tasks involving late events, concurrent branches, cross-object joins, and long-range dependencies. If full bitemporal graphs prove too noisy or too large to train on directly, we’ll fall back to typed neighborhood extraction around cases, objects, resources, policies, and Petri markings — preserving bitemporal semantics where they matter the most.
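The "as-known-at-time-t" view can be illustrated with a small sketch: keep only fact versions whose transaction time is at or before t, and let later recordings supersede earlier ones. The record layout, field names, and toy log below are assumptions made for illustration; they are not the hgres data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FactVersion:
    entity: str
    attribute: str
    value: str
    valid_at: int      # valid time: when the fact held in the world
    recorded_at: int   # transaction time: when the system learned of it

def as_known_at(versions, t):
    """Latest value of each fact among the versions recorded at or before t."""
    view = {}
    for v in sorted(versions, key=lambda v: v.recorded_at):
        if v.recorded_at <= t:
            # Later recordings overwrite earlier ones for the same fact.
            view[(v.entity, v.attribute, v.valid_at)] = v.value
    return view

# Hypothetical toy log: an amount that is later corrected,
# and an approval that is recorded well after it took effect.
log = [
    FactVersion("invoice-7", "amount", "100", valid_at=5, recorded_at=6),
    FactVersion("invoice-7", "amount", "120", valid_at=5, recorded_at=20),
    FactVersion("invoice-7", "approved", "yes", valid_at=8, recorded_at=15),
]

training_view = as_known_at(log, t=10)  # what the model is allowed to see
target_view = as_known_at(log, t=30)    # later-corrected view used as the target
```

In this toy log the approval is valid at time 8 but absent from the t=10 view, because it was only recorded at time 15. A flattened single-axis log would place it at time 8 and leak it into training.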
Challenge
A DG-JEPA-anchored hybrid pulls in transformer embeddings for text, SSM embeddings for long timelines, Petri markings for symbolic state, and graph encoders for relational structure. The risk is not that any one of these modalities is wrong, but that, mixed naively, the model leans on whichever modality is easiest to overfit (schema strings, source IDs, timestamps, Petri labels), and DG-JEPA stops learning process state at all.
Mitigation
We’ll train and probe each modality separately before composing them, then run controlled early-, late-, and gated-fusion variants, with leave-one-modality-out ablations. The judgment criterion is whether the model can still recover latent process state when individual modalities are masked at evaluation time, and whether removing the DG-JEPA objective itself causes the representation to collapse into the cheapest available modality.
If fusion is unstable, we’ll fall back to a more modular architecture, with DG-JEPA owning the graph-state representation and other modalities entering only through adapters or downstream task heads. That buys us robustness at the cost of some end-to-end optimization.
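To make the ablation protocol concrete, here is a minimal sketch of gated fusion with leave-one-modality-out masking. It assumes each modality has already been encoded to a vector of the same width and that the gate is a softmax over one scalar logit per modality. Both are simplifying assumptions, and the names are hypothetical.

```python
import math

def gated_fusion(embeddings, gate_logits, masked=frozenset()):
    """Softmax-gated sum of per-modality vectors, skipping masked modalities."""
    active = [m for m in embeddings if m not in masked]
    if not active:
        raise ValueError("at least one modality must remain unmasked")
    peak = max(gate_logits[m] for m in active)  # subtract the max for stability
    exps = {m: math.exp(gate_logits[m] - peak) for m in active}
    total = sum(exps.values())
    fused = [0.0] * len(embeddings[active[0]])
    for m in active:
        weight = exps[m] / total  # gates renormalize over the active modalities
        for i, x in enumerate(embeddings[m]):
            fused[i] += weight * x
    return fused

def leave_one_modality_out(embeddings, gate_logits):
    """Fused vector for each ablation, keyed by the modality that was dropped."""
    return {m: gated_fusion(embeddings, gate_logits, masked={m}) for m in embeddings}

# Hypothetical two-dimensional embeddings for three modalities.
embeddings = {"graph": [1.0, 0.0], "text": [0.0, 1.0], "petri": [1.0, 1.0]}
gate_logits = {"graph": 0.0, "text": 0.0, "petri": 0.0}

full = gated_fusion(embeddings, gate_logits)
ablated = leave_one_modality_out(embeddings, gate_logits)
```

Feeding each ablated vector to the same frozen probe shows which modality the representation depends on. A sharp drop when one cheap modality is removed is the failure mode described in this challenge.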
Challenge
Enterprise process graphs are large, but most of their structure is repetitive. The interesting paths — anomalies, exceptions, compliance violations, recoveries — are by construction rare. Naïve graph sampling (random walks, uniform ego-graphs) systematically under-represents exactly the paths a PFM most needs to learn from.
Mitigation
We plan to use stratified bitemporal graph sampling: common paths for coverage, rare conformance and anomaly paths for sensitivity, and Petri-guided sampling around token deficits, disabled transitions, joins, and rework loops.
Scaling experiments will vary graph size, temporal horizon, object count, mask size, and Petri-supervision density, mapping out the empirical scaling behavior — including whether rare-path recall improves disproportionately under Petri-guided sampling, and whether scaling is monotonic in graph and supervision density.
If global sampling proves too expensive, we’ll move to hierarchical views — event neighborhoods, object lifecycles, case subgraphs, and process-level summaries — recomposed at inference time rather than fitted into a single training pass.
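The stratified part of the sampling scheme can be sketched as per-stratum quotas, so that rare paths are drawn at a fixed count rather than in proportion to their frequency. This sketch assumes paths are already labeled with a stratum. The labels, quotas, and function name are illustrative, and Petri-guided sampling is not shown.

```python
import random
from collections import defaultdict

def stratified_sample(paths, stratum_of, quotas, seed=0):
    """Draw up to quotas[stratum] paths from each stratum, without replacement."""
    rng = random.Random(seed)  # seeded so that experiments are repeatable
    by_stratum = defaultdict(list)
    for path in paths:
        by_stratum[stratum_of(path)].append(path)
    sample = []
    for stratum, quota in quotas.items():
        pool = by_stratum.get(stratum, [])
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample

# Hypothetical corpus: 95 common paths, 4 rework loops, 1 compliance violation.
paths = (
    [(i, "common") for i in range(95)]
    + [(95 + i, "rework") for i in range(4)]
    + [(99, "violation")]
)
quotas = {"common": 5, "rework": 4, "violation": 1}
sample = stratified_sample(paths, stratum_of=lambda p: p[1], quotas=quotas)
```

A uniform draw of ten paths from this corpus would usually contain no violation at all, whereas the quota guarantees that every rare path present in the corpus is seen.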
Challenge
Public process-mining benchmarks largely test next-event prediction on a single log. That is precisely the wrong question for a foundation model, and it rewards exactly the behavior we critiqued in our technical deep-dive: overfitting local vocabularies. A benchmark that asks the right question, explicitly testing transfer, robustness, missing-structure inference, and constraint awareness, does not yet exist at scale. We must, therefore, build it.
Mitigation
The benchmark will cover zero- and few-shot transfer across domains, missing-state inference, conformance-aware prediction, future-subgraph prediction, bottleneck diagnosis, fragment retrieval, and robustness to renamed labels, missing events, late corrections, and schema shifts.
Every claim of our PFM’s advantage will be tested against sequence models, GNNs, graph transformers, OCPM-native methods, Petri-only baselines, graph autoencoders, and hybrid variants. Negative results will be treated as informative: a loss to a sequence model on cross-domain transfer would tell us next-token prediction is more useful than we currently think; a loss to a Petri-only baseline on conformance would tell us we’re under-using symbolic structure. Each negative result narrows the design space.
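One way a renamed-label robustness case could be constructed is to apply a one-to-one relabeling to a trace and check that anything depending only on structure is unchanged. The sketch below uses a first-occurrence signature as a stand-in for such a structure-only quantity. The helper names and the activity labels are hypothetical.

```python
def rename_labels(trace, mapping):
    """Apply a relabeling to every activity in a trace; unmapped labels are kept."""
    return [mapping.get(activity, activity) for activity in trace]

def label_free_signature(trace):
    """Replace each label by the order in which it first appeared in the trace."""
    first_seen = {}
    return [first_seen.setdefault(a, len(first_seen)) for a in trace]

# Hypothetical trace and a one-to-one renaming, as a schema migration might produce.
trace = ["create", "approve", "create", "pay"]
mapping = {"create": "X1", "approve": "X2", "pay": "X3"}
renamed = rename_labels(trace, mapping)

signature_before = label_free_signature(trace)
signature_after = label_free_signature(renamed)
```

A model that has overfitted a local vocabulary would change its predictions between `trace` and `renamed` even though the two signatures are identical.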
Challenge
The most consequential PFM use cases — should we escalate this case? add capacity here? change this policy? — are causal questions, not predictive ones. But most enterprise event logs are observational, confounded, and incomplete. A model that learns historical correlations and then proposes interventions on the basis of them is, in the worst case, dangerous.
Mitigation
RL is therefore deliberately placed downstream of DG-JEPA and Petri filtering: action proposals come from feasible-action sets defined by the Petri net, consequences are first evaluated in counterfactual simulation, and offline RL uses conservative estimators rather than aggressive exploration.
We separate predictive validity (does the latent state forecast accurately?) from causal validity (would an intervention move outcomes in the predicted direction?), and stress-test the latter with causal probes, policy-change simulations, and backtests against natural interventions where they exist in the data.
Our judgment criterion is whether DG-JEPA-conditioned simulators predict outcomes of held-out natural interventions, and whether causal probes survive distribution shifts that ordinary predictive probes do not. If causal validity does not arrive in step with predictive validity — which is the expected case — high-risk recommendations remain human-reviewed until enough real-world outcome feedback supports deployment.
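The Petri filtering step can be illustrated with the standard enabling rule: a transition is enabled only when every input place holds enough tokens, and only enabled transitions are admitted as action proposals. The net, the place names, and the proposal list below are hypothetical, and the sketch is not Petrinaut's API.

```python
def enabled_transitions(marking, transitions):
    """Names of transitions whose every input place holds enough tokens."""
    return sorted(
        name
        for name, (inputs, _outputs) in transitions.items()
        if all(marking.get(place, 0) >= need for place, need in inputs.items())
    )

def fire(marking, transitions, name):
    """Return the marking that results from firing an enabled transition."""
    if name not in enabled_transitions(marking, transitions):
        raise ValueError(f"transition {name!r} is not enabled")
    inputs, outputs = transitions[name]
    new_marking = dict(marking)
    for place, need in inputs.items():
        new_marking[place] = new_marking.get(place, 0) - need
    for place, gain in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + gain
    return new_marking

# Hypothetical net: each transition maps to (input places, output places).
transitions = {
    "approve": ({"submitted": 1, "reviewer_free": 1}, {"approved": 1, "reviewer_free": 1}),
    "escalate": ({"submitted": 1}, {"escalated": 1}),
    "pay": ({"approved": 1}, {"paid": 1}),
}
marking = {"submitted": 1, "reviewer_free": 0}

proposals = ["pay", "escalate", "approve"]  # e.g. ranked by a learned policy
allowed = set(enabled_transitions(marking, transitions))
feasible = [p for p in proposals if p in allowed]
next_marking = fire(marking, transitions, "escalate")
```

Here "approve" is rejected because no reviewer token is available and "pay" because nothing has been approved, so only "escalate" reaches simulation, whatever the learned ranking preferred.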

Research Outputs

brunch.ai →
Specification elicitation tools for faithful intent representation
A research prototype developed to explore how domain experts can better encode intent, minimizing risks of incompleteness and misunderstanding, in ways that lend themselves to formal verification. Learn more at brunch.ai
Learn more →
Formally-verifiable models of safety-critical supply chains
In collaboration with Prof. Nobuko Yoshida, Prof. David Parker, and Adrian Puerto Aubel from the University of Oxford’s Department of Computer Science, we developed a series of formally verifiable models of supply chains. See the original announcement for more.
petrinaut.org →
Petri Nets for trustworthy agentic coordination
Applying Petri nets, a mathematical formalism for modeling distributed, concurrent systems, to the coordination and auditing of agentic workflows. This work builds on our browser-based process simulator, Petrinaut. Try it out at demo.petrinaut.org
Learn more →
Read the announcement
Multi-temporal hypergraph query language
HashQL is a query language with first-class support for hypergraph relations and multiple, independent time dimensions — enabling precise reasoning over how data evolves and when it was known. Learn more at hgres.org/hashql
hgres.org →
Read the announcement
Backend-agnostic, strongly-typed graph data engine
A data engine for typed property graphs that decouples query and schema semantics from the underlying database, allowing the same graph workloads to run unchanged across multiple optimized storage backends. Learn more at hgres.org
semtype.org →
Read the announcement
Composable, interoperable, decentralized semantic types
A specification and shared, open registry of semantic types — definitions of the concepts, measures, and entities that structured data refers to. SemType provides a common vocabulary that both people and AI agents can reference to exchange data unambiguously. Read the spec at semtype.org/spec and browse the registry at semtype.org/types

You can learn more about several of our projects on our developer site at hash.dev.

Earlier Research

2019
hCore: Agent-based simulation engine developed in Rust, compiled to WASM, with support for user-created JS, Python and Rust sims.
2019
GPT-2 ABMs: Early LLM-integrated agent-based simulations, and grounding of agents in world models begins.
2020
Real-world applications: Optimization of real-world systems like vaccine distribution and supply chains.
2021
Graph-backed simulation: Multi-temporal, strongly-typed graph underpinnings for error-sensitive, safety-critical environments.

Join Us

We’re recruiting researchers to join our team to work on the projects above and more. Browse our open research roles below, or view all open roles.

AI Research Engineer

London

AI Research Engineer

Berlin
