# Research: Sequential Pattern Mining for Shell History ## The Key Paper: ShRec (Huawei, 2024) **[ShRec: A SRE Behaviour Knowledge Graph Model for Shell Command Recommendations](https://arxiv.org/abs/2408.05592)** — IEEE, August 2024 The most directly relevant academic work. Proves that sequential pattern mining on real shell history produces useful results. ### Approach 1. Models shell sessions as transactions — each command is an item, the temporal sequence of commands in a session is a transaction 2. Applies sequential pattern mining (frequent sub-sequence extraction) across all sessions 3. Clusters similar sequences using distance metrics and K-means to reduce redundancy 4. Builds a knowledge graph for recommendation ### Results - Extracted 3,997 sequences ranging from 2 to 14 commands - Support values from 2 to 101 - Sequences were executed by up to 43 different users - SREs confirmed the mined sequences matched real operational workflows ## Sequential Pattern Mining Algorithms All implemented in the [SPMF library](https://www.philippe-fournier-viger.com/spmf/index.php?link=algorithms.php) (Java, open source): | Algorithm | Approach | Shell History Applicability | |-----------|----------|-----------------------------| | **PrefixSpan** | Pattern-growth via prefix-projected databases | **Best fit** — fast, handles variable-length sequences, no candidate generation | | **GSP** | Apriori-like, level-wise candidate generation | Good for small histories; scales poorly | | **SPADE** | Vertical ID-list intersection | Viable alternative to PrefixSpan | | **SPAM** | Bitmap representation | Memory-efficient for dense data | | **BIDE** | Closed sequential pattern mining | Eliminates redundant patterns — critical for usability | | **CloSpan/ClaSP** | Closed/maximal patterns | Reduces output size | **Recommendation:** PrefixSpan as the primary algorithm. ~200 lines of core logic. [Per the original paper](https://hanj.cs.illinois.edu/pdf/span01.pdf), it consistently outperforms GSP, FreeSpan, and SPADE. ### Reference implementation [PrefixSpan-py](https://github.com/chuanconggao/PrefixSpan-py) — started as 15 lines of code. Includes BIDE (closed patterns) and FEAT (generator patterns). Straightforward to port to Rust. ## Data Model for Shell History - **Item**: A single command (abstracted — e.g., `git commit -m *` rather than the literal message) - **Transaction/Sequence**: All commands in a session (defined by session ID, or by time-gap heuristic) - **Database**: All sessions across time ### Key considerations - **Command abstraction is critical.** Literal commands have too much variance. Must normalize arguments while preserving command structure. - **Session segmentation matters.** Atuin provides session IDs. Without them, use time-gap threshold (>30 min gap = new session). - **Support threshold needs tuning.** ShRec used min_support=2. For personal use, even 2 surfaces useful patterns. - **Gap constraints** — want "relaxed" sequential patterns allowing intervening commands (e.g., `git add -> [anything] -> git commit -> [anything] -> git push`). ## Simpler Alternatives ### N-gram approaches - Bigrams/trigrams of commands: fast, easy, limited to fixed-length contiguous sequences - Variable-length n-grams with recency decay - Markov chains: good for "next command" prediction, not full workflow extraction ### Process Mining (unexplored opportunity) [PM4Py](https://github.com/process-intelligence-solutions/pm4py) implements industrial-strength process discovery: - **Alpha Miner** — discovers causal relations from directly-follows semantics - **Heuristics Miner** — handles noise well - **Inductive Miner** — guaranteed-sound block-structured models Maps directly to shell history: Case ID = session, Activity = command, Timestamp = timestamp. Could produce visual workflow models. **Nobody has applied process mining to shell history.** Reference: [Workflow Mining: Discovering Process Models from Event Logs](https://www.vdaalst.com/publications/p245.pdf) (van der Aalst et al., 2004) ## Rust Ecosystem for Pattern Mining ### No mature SPM crate exists The [sequential-pattern-mining GitHub topic](https://github.com/topics/sequential-pattern-mining) lists 15 repos — zero in Rust. | Crate | What it does | Verdict | |-------|-------------|---------| | `rust-rule-miner` | Association rule mining (Apriori) | Not sequential pattern mining | ### Recommended approach: Custom PrefixSpan implementation Port from PrefixSpan-py. Estimated ~300-500 lines of Rust. The algorithm is recursive and maps cleanly to Rust's type system (`Vec>`). ## Related Academic Work | Paper | Venue | Relevance | |-------|-------|-----------| | [ShRec](https://arxiv.org/abs/2408.05592) | IEEE 2024 | Direct precedent — SPM on shell history | | [PrefixSpan](https://hanj.cs.illinois.edu/pdf/span01.pdf) | ICDE 2001 | The algorithm we'll implement | | [RACONTEUR](https://arxiv.org/abs/2409.02074) | NDSS 2025 | LLM-powered shell command explainer with RAG | | [SASH](https://sigops.org/s/conferences/hotos/2025/papers/hotos25-364.pdf) | HotOS 2025 | Static analysis framework for shell, LLM-assisted spec generation | | [AgentSpec](https://arxiv.org/abs/2503.18666) | ICSE 2026 | DSL for runtime safety constraints on LLM agents | | Cybersecurity training dataset | [ScienceDirect](https://www.sciencedirect.com/science/article/pii/S2352340921006806) | SPM on shell commands from 175 participants | | [Workflow Mining](https://www.vdaalst.com/publications/p245.pdf) | van der Aalst 2004 | Process mining foundations | ## Elastic Security Labs — Session Summarization [Using LLMs to summarize user sessions](https://www.elastic.co/security-labs/using-llms-to-summarize-user-sessions) — used GPT-4 to summarize shell sessions for security analysts. Proved the approach works at scale. [Follow-up](https://www.elastic.co/security-labs/using-llms-and-esre-to-find-similar-user-sessions) used semantic retrieval to find similar sessions. Applied to security — nobody has done this for developer productivity.