workflow-miner/docs/research-pattern-mining.md

# Research: Sequential Pattern Mining for Shell History

## The Key Paper: ShRec (Huawei, 2024)

**[ShRec: A SRE Behaviour Knowledge Graph Model for Shell Command Recommendations](https://arxiv.org/abs/2408.05592)** — IEEE, August 2024

The most directly relevant academic work. Proves that sequential pattern mining on real shell history produces useful results.

### Approach

1. Models shell sessions as transactions — each command is an item, the temporal sequence of commands in a session is a transaction
2. Applies sequential pattern mining (frequent sub-sequence extraction) across all sessions
3. Clusters similar sequences using distance metrics and K-means to reduce redundancy
4. Builds a knowledge graph for recommendation

### Results

- Extracted 3,997 sequences ranging from 2 to 14 commands
- Support values from 2 to 101
- Sequences were executed by up to 43 different users
- SREs confirmed the mined sequences matched real operational workflows

## Sequential Pattern Mining Algorithms

All implemented in the [SPMF library](https://www.philippe-fournier-viger.com/spmf/index.php?link=algorithms.php) (Java, open source):

| Algorithm | Approach | Shell History Applicability |
|-----------|----------|-----------------------------|
| **PrefixSpan** | Pattern-growth via prefix-projected databases | **Best fit** — fast, handles variable-length sequences, no candidate generation |
| **GSP** | Apriori-like, level-wise candidate generation | Good for small histories; scales poorly |
| **SPADE** | Vertical ID-list intersection | Viable alternative to PrefixSpan |
| **SPAM** | Bitmap representation | Memory-efficient for dense data |
| **BIDE** | Closed sequential pattern mining | Eliminates redundant patterns — critical for usability |
| **CloSpan/ClaSP** | Closed/maximal patterns | Reduces output size |

**Recommendation:** PrefixSpan as the primary algorithm. ~200 lines of core logic. [Per the original paper](https://hanj.cs.illinois.edu/pdf/span01.pdf), it consistently outperforms GSP, FreeSpan, and SPADE.

### Reference implementation

[PrefixSpan-py](https://github.com/chuanconggao/PrefixSpan-py) — started as 15 lines of code. Includes BIDE (closed patterns) and FEAT (generator patterns). Straightforward to port to Rust.

## Data Model for Shell History

- **Item**: A single command (abstracted — e.g., `git commit -m *` rather than the literal message)
- **Transaction/Sequence**: All commands in a session (defined by session ID, or by time-gap heuristic)
- **Database**: All sessions across time

### Key considerations

- **Command abstraction is critical.** Literal commands have too much variance. Must normalize arguments while preserving command structure.
- **Session segmentation matters.** Atuin provides session IDs. Without them, use time-gap threshold (>30 min gap = new session).
- **Support threshold needs tuning.** ShRec used min_support=2. For personal use, even 2 surfaces useful patterns.
- **Gap constraints** — want "relaxed" sequential patterns allowing intervening commands (e.g., `git add -> [anything] -> git commit -> [anything] -> git push`).

## Simpler Alternatives

### N-gram approaches

- Bigrams/trigrams of commands: fast, easy, limited to fixed-length contiguous sequences
- Variable-length n-grams with recency decay
- Markov chains: good for "next command" prediction, not full workflow extraction

### Process Mining (unexplored opportunity)

[PM4Py](https://github.com/process-intelligence-solutions/pm4py) implements industrial-strength process discovery:

- **Alpha Miner** — discovers causal relations from directly-follows semantics
- **Heuristics Miner** — handles noise well
- **Inductive Miner** — guaranteed-sound block-structured models

Maps directly to shell history: Case ID = session, Activity = command, Timestamp = timestamp. Could produce visual workflow models. **Nobody has applied process mining to shell history.**

Reference: [Workflow Mining: Discovering Process Models from Event Logs](https://www.vdaalst.com/publications/p245.pdf) (van der Aalst et al., 2004)

## Rust Ecosystem for Pattern Mining

### No mature SPM crate exists

The [sequential-pattern-mining GitHub topic](https://github.com/topics/sequential-pattern-mining) lists 15 repos — zero in Rust.

| Crate | What it does | Verdict |
|-------|-------------|---------|
| `rust-rule-miner` | Association rule mining (Apriori) | Not sequential pattern mining |

### Recommended approach: Custom PrefixSpan implementation

Port from PrefixSpan-py. Estimated ~300-500 lines of Rust. The algorithm is recursive and maps cleanly to Rust's type system (`Vec<Vec<T>>`).

## Related Academic Work

| Paper | Venue | Relevance |
|-------|-------|-----------|
| [ShRec](https://arxiv.org/abs/2408.05592) | IEEE 2024 | Direct precedent — SPM on shell history |
| [PrefixSpan](https://hanj.cs.illinois.edu/pdf/span01.pdf) | ICDE 2001 | The algorithm we'll implement |
| [RACONTEUR](https://arxiv.org/abs/2409.02074) | NDSS 2025 | LLM-powered shell command explainer with RAG |
| [SASH](https://sigops.org/s/conferences/hotos/2025/papers/hotos25-364.pdf) | HotOS 2025 | Static analysis framework for shell, LLM-assisted spec generation |
| [AgentSpec](https://arxiv.org/abs/2503.18666) | ICSE 2026 | DSL for runtime safety constraints on LLM agents |
| Cybersecurity training dataset | [ScienceDirect](https://www.sciencedirect.com/science/article/pii/S2352340921006806) | SPM on shell commands from 175 participants |
| [Workflow Mining](https://www.vdaalst.com/publications/p245.pdf) | van der Aalst 2004 | Process mining foundations |

## Elastic Security Labs — Session Summarization

[Using LLMs to summarize user sessions](https://www.elastic.co/security-labs/using-llms-to-summarize-user-sessions) — used GPT-4 to summarize shell sessions for security analysts. Proved the approach works at scale. [Follow-up](https://www.elastic.co/security-labs/using-llms-and-esre-to-find-similar-user-sessions) used semantic retrieval to find similar sessions. Applied to security — nobody has done this for developer productivity.