Files
workflow-miner/docs/research-pattern-mining.md
vikingowl 48923450f8 docs: add architecture plan and research notes
Initial project documentation for workflow-miner — a Rust CLI + zsh
plugin that mines recurring command workflows from Atuin shell history.
2026-02-22 08:41:50 +01:00

104 lines
6.0 KiB
Markdown

# Research: Sequential Pattern Mining for Shell History
## The Key Paper: ShRec (Huawei, 2024)
**[ShRec: A SRE Behaviour Knowledge Graph Model for Shell Command Recommendations](https://arxiv.org/abs/2408.05592)** — IEEE, August 2024
The most directly relevant academic work. Proves that sequential pattern mining on real shell history produces useful results.
### Approach
1. Models shell sessions as transactions — each command is an item, the temporal sequence of commands in a session is a transaction
2. Applies sequential pattern mining (frequent sub-sequence extraction) across all sessions
3. Clusters similar sequences using distance metrics and K-means to reduce redundancy
4. Builds a knowledge graph for recommendation
### Results
- Extracted 3,997 sequences ranging from 2 to 14 commands
- Support values from 2 to 101
- Sequences were executed by up to 43 different users
- SREs confirmed the mined sequences matched real operational workflows
## Sequential Pattern Mining Algorithms
All implemented in the [SPMF library](https://www.philippe-fournier-viger.com/spmf/index.php?link=algorithms.php) (Java, open source):
| Algorithm | Approach | Shell History Applicability |
|-----------|----------|-----------------------------|
| **PrefixSpan** | Pattern-growth via prefix-projected databases | **Best fit** — fast, handles variable-length sequences, no candidate generation |
| **GSP** | Apriori-like, level-wise candidate generation | Good for small histories; scales poorly |
| **SPADE** | Vertical ID-list intersection | Viable alternative to PrefixSpan |
| **SPAM** | Bitmap representation | Memory-efficient for dense data |
| **BIDE** | Closed sequential pattern mining | Eliminates redundant patterns — critical for usability |
| **CloSpan/ClaSP** | Closed/maximal patterns | Reduces output size |
**Recommendation:** PrefixSpan as the primary algorithm. ~200 lines of core logic. [Per the original paper](https://hanj.cs.illinois.edu/pdf/span01.pdf), it consistently outperforms GSP, FreeSpan, and SPADE.
### Reference implementation
[PrefixSpan-py](https://github.com/chuanconggao/PrefixSpan-py) — started as 15 lines of code. Includes BIDE (closed patterns) and FEAT (generator patterns). Straightforward to port to Rust.
## Data Model for Shell History
- **Item**: A single command (abstracted — e.g., `git commit -m *` rather than the literal message)
- **Transaction/Sequence**: All commands in a session (defined by session ID, or by time-gap heuristic)
- **Database**: All sessions across time
### Key considerations
- **Command abstraction is critical.** Literal commands have too much variance. Must normalize arguments while preserving command structure.
- **Session segmentation matters.** Atuin provides session IDs. Without them, use time-gap threshold (>30 min gap = new session).
- **Support threshold needs tuning.** ShRec used min_support=2. For personal use, even 2 surfaces useful patterns.
- **Gap constraints** — want "relaxed" sequential patterns allowing intervening commands (e.g., `git add -> [anything] -> git commit -> [anything] -> git push`).
## Simpler Alternatives
### N-gram approaches
- Bigrams/trigrams of commands: fast, easy, limited to fixed-length contiguous sequences
- Variable-length n-grams with recency decay
- Markov chains: good for "next command" prediction, not full workflow extraction
### Process Mining (unexplored opportunity)
[PM4Py](https://github.com/process-intelligence-solutions/pm4py) implements industrial-strength process discovery:
- **Alpha Miner** — discovers causal relations from directly-follows semantics
- **Heuristics Miner** — handles noise well
- **Inductive Miner** — guaranteed-sound block-structured models
Maps directly to shell history: Case ID = session, Activity = command, Timestamp = timestamp. Could produce visual workflow models. **Nobody has applied process mining to shell history.**
Reference: [Workflow Mining: Discovering Process Models from Event Logs](https://www.vdaalst.com/publications/p245.pdf) (van der Aalst et al., 2004)
## Rust Ecosystem for Pattern Mining
### No mature SPM crate exists
The [sequential-pattern-mining GitHub topic](https://github.com/topics/sequential-pattern-mining) lists 15 repos — zero in Rust.
| Crate | What it does | Verdict |
|-------|-------------|---------|
| `rust-rule-miner` | Association rule mining (Apriori) | Not sequential pattern mining |
### Recommended approach: Custom PrefixSpan implementation
Port from PrefixSpan-py. Estimated ~300-500 lines of Rust. The algorithm is recursive and maps cleanly to Rust's type system (`Vec<Vec<T>>`).
## Related Academic Work
| Paper | Venue | Relevance |
|-------|-------|-----------|
| [ShRec](https://arxiv.org/abs/2408.05592) | IEEE 2024 | Direct precedent — SPM on shell history |
| [PrefixSpan](https://hanj.cs.illinois.edu/pdf/span01.pdf) | ICDE 2001 | The algorithm we'll implement |
| [RACONTEUR](https://arxiv.org/abs/2409.02074) | NDSS 2025 | LLM-powered shell command explainer with RAG |
| [SASH](https://sigops.org/s/conferences/hotos/2025/papers/hotos25-364.pdf) | HotOS 2025 | Static analysis framework for shell, LLM-assisted spec generation |
| [AgentSpec](https://arxiv.org/abs/2503.18666) | ICSE 2026 | DSL for runtime safety constraints on LLM agents |
| Cybersecurity training dataset | [ScienceDirect](https://www.sciencedirect.com/science/article/pii/S2352340921006806) | SPM on shell commands from 175 participants |
| [Workflow Mining](https://www.vdaalst.com/publications/p245.pdf) | van der Aalst 2004 | Process mining foundations |
## Elastic Security Labs — Session Summarization
[Using LLMs to summarize user sessions](https://www.elastic.co/security-labs/using-llms-to-summarize-user-sessions) — used GPT-4 to summarize shell sessions for security analysts. Proved the approach works at scale. [Follow-up](https://www.elastic.co/security-labs/using-llms-and-esre-to-find-similar-user-sessions) used semantic retrieval to find similar sessions. Applied to security — nobody has done this for developer productivity.