Initial project documentation for workflow-miner — a Rust CLI + zsh plugin that mines recurring command workflows from Atuin shell history.
6.0 KiB
Research: Sequential Pattern Mining for Shell History
The Key Paper: ShRec (Huawei, 2024)
ShRec: A SRE Behaviour Knowledge Graph Model for Shell Command Recommendations — IEEE, August 2024
The most directly relevant academic work. Proves that sequential pattern mining on real shell history produces useful results.
Approach
- Models shell sessions as transactions — each command is an item, the temporal sequence of commands in a session is a transaction
- Applies sequential pattern mining (frequent sub-sequence extraction) across all sessions
- Clusters similar sequences using distance metrics and K-means to reduce redundancy
- Builds a knowledge graph for recommendation
Results
- Extracted 3,997 sequences ranging from 2 to 14 commands
- Support values from 2 to 101
- Sequences were executed by up to 43 different users
- SREs confirmed the mined sequences matched real operational workflows
Sequential Pattern Mining Algorithms
All implemented in the SPMF library (Java, open source):
| Algorithm | Approach | Shell History Applicability |
|---|---|---|
| PrefixSpan | Pattern-growth via prefix-projected databases | Best fit — fast, handles variable-length sequences, no candidate generation |
| GSP | Apriori-like, level-wise candidate generation | Good for small histories; scales poorly |
| SPADE | Vertical ID-list intersection | Viable alternative to PrefixSpan |
| SPAM | Bitmap representation | Memory-efficient for dense data |
| BIDE | Closed sequential pattern mining | Eliminates redundant patterns — critical for usability |
| CloSpan/ClaSP | Closed/maximal patterns | Reduces output size |
Recommendation: PrefixSpan as the primary algorithm. ~200 lines of core logic. Per the original paper, it consistently outperforms GSP, FreeSpan, and SPADE.
Reference implementation
PrefixSpan-py — started as 15 lines of code. Includes BIDE (closed patterns) and FEAT (generator patterns). Straightforward to port to Rust.
Data Model for Shell History
- Item: A single command (abstracted — e.g.,
git commit -m *rather than the literal message) - Transaction/Sequence: All commands in a session (defined by session ID, or by time-gap heuristic)
- Database: All sessions across time
Key considerations
- Command abstraction is critical. Literal commands have too much variance. Must normalize arguments while preserving command structure.
- Session segmentation matters. Atuin provides session IDs. Without them, use time-gap threshold (>30 min gap = new session).
- Support threshold needs tuning. ShRec used min_support=2. For personal use, even 2 surfaces useful patterns.
- Gap constraints — want "relaxed" sequential patterns allowing intervening commands (e.g.,
git add -> [anything] -> git commit -> [anything] -> git push).
Simpler Alternatives
N-gram approaches
- Bigrams/trigrams of commands: fast, easy, limited to fixed-length contiguous sequences
- Variable-length n-grams with recency decay
- Markov chains: good for "next command" prediction, not full workflow extraction
Process Mining (unexplored opportunity)
PM4Py implements industrial-strength process discovery:
- Alpha Miner — discovers causal relations from directly-follows semantics
- Heuristics Miner — handles noise well
- Inductive Miner — guaranteed-sound block-structured models
Maps directly to shell history: Case ID = session, Activity = command, Timestamp = timestamp. Could produce visual workflow models. Nobody has applied process mining to shell history.
Reference: Workflow Mining: Discovering Process Models from Event Logs (van der Aalst et al., 2004)
Rust Ecosystem for Pattern Mining
No mature SPM crate exists
The sequential-pattern-mining GitHub topic lists 15 repos — zero in Rust.
| Crate | What it does | Verdict |
|---|---|---|
rust-rule-miner |
Association rule mining (Apriori) | Not sequential pattern mining |
Recommended approach: Custom PrefixSpan implementation
Port from PrefixSpan-py. Estimated ~300-500 lines of Rust. The algorithm is recursive and maps cleanly to Rust's type system (Vec<Vec<T>>).
Related Academic Work
| Paper | Venue | Relevance |
|---|---|---|
| ShRec | IEEE 2024 | Direct precedent — SPM on shell history |
| PrefixSpan | ICDE 2001 | The algorithm we'll implement |
| RACONTEUR | NDSS 2025 | LLM-powered shell command explainer with RAG |
| SASH | HotOS 2025 | Static analysis framework for shell, LLM-assisted spec generation |
| AgentSpec | ICSE 2026 | DSL for runtime safety constraints on LLM agents |
| Cybersecurity training dataset | ScienceDirect | SPM on shell commands from 175 participants |
| Workflow Mining | van der Aalst 2004 | Process mining foundations |
Elastic Security Labs — Session Summarization
Using LLMs to summarize user sessions — used GPT-4 to summarize shell sessions for security analysts. Proved the approach works at scale. Follow-up used semantic retrieval to find similar sessions. Applied to security — nobody has done this for developer productivity.