Files
workflow-miner/docs/research-pattern-mining.md
vikingowl 48923450f8 docs: add architecture plan and research notes
Initial project documentation for workflow-miner — a Rust CLI + zsh
plugin that mines recurring command workflows from Atuin shell history.
2026-02-22 08:41:50 +01:00

6.0 KiB

Research: Sequential Pattern Mining for Shell History

The Key Paper: ShRec (Huawei, 2024)

ShRec: A SRE Behaviour Knowledge Graph Model for Shell Command Recommendations — IEEE, August 2024

The most directly relevant academic work. Proves that sequential pattern mining on real shell history produces useful results.

Approach

  1. Models shell sessions as transactions — each command is an item, the temporal sequence of commands in a session is a transaction
  2. Applies sequential pattern mining (frequent sub-sequence extraction) across all sessions
  3. Clusters similar sequences using distance metrics and K-means to reduce redundancy
  4. Builds a knowledge graph for recommendation

Results

  • Extracted 3,997 sequences ranging from 2 to 14 commands
  • Support values from 2 to 101
  • Sequences were executed by up to 43 different users
  • SREs confirmed the mined sequences matched real operational workflows

Sequential Pattern Mining Algorithms

All implemented in the SPMF library (Java, open source):

Algorithm Approach Shell History Applicability
PrefixSpan Pattern-growth via prefix-projected databases Best fit — fast, handles variable-length sequences, no candidate generation
GSP Apriori-like, level-wise candidate generation Good for small histories; scales poorly
SPADE Vertical ID-list intersection Viable alternative to PrefixSpan
SPAM Bitmap representation Memory-efficient for dense data
BIDE Closed sequential pattern mining Eliminates redundant patterns — critical for usability
CloSpan/ClaSP Closed/maximal patterns Reduces output size

Recommendation: PrefixSpan as the primary algorithm. ~200 lines of core logic. Per the original paper, it consistently outperforms GSP, FreeSpan, and SPADE.

Reference implementation

PrefixSpan-py — started as 15 lines of code. Includes BIDE (closed patterns) and FEAT (generator patterns). Straightforward to port to Rust.

Data Model for Shell History

  • Item: A single command (abstracted — e.g., git commit -m * rather than the literal message)
  • Transaction/Sequence: All commands in a session (defined by session ID, or by time-gap heuristic)
  • Database: All sessions across time

Key considerations

  • Command abstraction is critical. Literal commands have too much variance. Must normalize arguments while preserving command structure.
  • Session segmentation matters. Atuin provides session IDs. Without them, use time-gap threshold (>30 min gap = new session).
  • Support threshold needs tuning. ShRec used min_support=2. For personal use, even 2 surfaces useful patterns.
  • Gap constraints — want "relaxed" sequential patterns allowing intervening commands (e.g., git add -> [anything] -> git commit -> [anything] -> git push).

Simpler Alternatives

N-gram approaches

  • Bigrams/trigrams of commands: fast, easy, limited to fixed-length contiguous sequences
  • Variable-length n-grams with recency decay
  • Markov chains: good for "next command" prediction, not full workflow extraction

Process Mining (unexplored opportunity)

PM4Py implements industrial-strength process discovery:

  • Alpha Miner — discovers causal relations from directly-follows semantics
  • Heuristics Miner — handles noise well
  • Inductive Miner — guaranteed-sound block-structured models

Maps directly to shell history: Case ID = session, Activity = command, Timestamp = timestamp. Could produce visual workflow models. Nobody has applied process mining to shell history.

Reference: Workflow Mining: Discovering Process Models from Event Logs (van der Aalst et al., 2004)

Rust Ecosystem for Pattern Mining

No mature SPM crate exists

The sequential-pattern-mining GitHub topic lists 15 repos — zero in Rust.

Crate What it does Verdict
rust-rule-miner Association rule mining (Apriori) Not sequential pattern mining

Port from PrefixSpan-py. Estimated ~300-500 lines of Rust. The algorithm is recursive and maps cleanly to Rust's type system (Vec<Vec<T>>).

Paper Venue Relevance
ShRec IEEE 2024 Direct precedent — SPM on shell history
PrefixSpan ICDE 2001 The algorithm we'll implement
RACONTEUR NDSS 2025 LLM-powered shell command explainer with RAG
SASH HotOS 2025 Static analysis framework for shell, LLM-assisted spec generation
AgentSpec ICSE 2026 DSL for runtime safety constraints on LLM agents
Cybersecurity training dataset ScienceDirect SPM on shell commands from 175 participants
Workflow Mining van der Aalst 2004 Process mining foundations

Elastic Security Labs — Session Summarization

Using LLMs to summarize user sessions — used GPT-4 to summarize shell sessions for security analysts. Proved the approach works at scale. Follow-up used semantic retrieval to find similar sessions. Applied to security — nobody has done this for developer productivity.