vikingowl ce32f76731 feat(discovery): per-row LLM enrichment via scrape-then-prompt
Completes the manual two-pass enrichment flow: the crawl-enrich-all
button (MR 3) fills deterministic fields across the queue; this MR
adds a per-row "AI" button that scrapes the row's quellen URLs and
asks Mistral to fill category, opening_hours, description.

Flow per click (sketched in Go after the list):
  1. Load row, compute CacheKey(name_normalized, stadt, year).
  2. Cache hit -> skip LLM, merge cached payload onto current
     crawl-enrich base, persist, return.
  3. Miss -> scrape up to 5 quellen URLs via pkg/scrape (goquery
     text extraction, 4000-char truncation), concatenate into labeled
     blocks, call ai.Client.Pass2 with JSON response format.
  4. Parse response into Enrichment{category, opening_hours,
     description}, stamp provenance=llm + model + token counts.
  5. Cache the raw LLM payload (not the merged one) under the tuple
     key with DefaultCacheTTL=30d, so later re-crawls can layer new
     crawl-enrich bases on the same cached answer.
  6. Merge(crawl, llm) -- crawl fields survive. Persist via
     SetEnrichment(status=done). Return merged to the operator.
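
In Go, roughly. The Cache/Store shapes, field names, and the
pass2/blocks/prompt hooks below are illustrative stand-ins for the real
wiring; only CacheKey, Merge, Pass2, SetEnrichment, and the Enrichment
fields come from this MR:

package enrich

import (
	"context"
	"fmt"
)

type Enrichment struct {
	Category     string `json:"category"`
	OpeningHours string `json:"opening_hours"`
	Description  string `json:"description"`
	Provenance   string `json:"-"`
}

type Row struct {
	ID, NameNormalized, Stadt string
	Year                      int
	Quellen                   []string   // source URLs
	Crawl                     Enrichment // deterministic crawl-enrich base
}

type Cache interface {
	Get(key string) (Enrichment, bool)
	Put(key string, e Enrichment) // stored with DefaultCacheTTL = 30d
}

type Store interface {
	SetEnrichment(ctx context.Context, id string, e Enrichment, status string) error
}

type Service struct {
	cache  Cache
	store  Store
	pass2  func(ctx context.Context, prompt string) (Enrichment, error) // ai.Client.Pass2
	blocks func(ctx context.Context, urls []string) (string, error)     // scrape-to-blocks
	prompt func(row Row, blocks string) string
}

func CacheKey(nameNormalized, stadt string, year int) string {
	return fmt.Sprintf("%s|%s|%d", nameNormalized, stadt, year)
}

// Merge keeps crawl-enrich values wherever they are set; LLM fills the rest.
func Merge(crawl, llm Enrichment) Enrichment {
	out := llm
	if crawl.Category != "" {
		out.Category = crawl.Category
	}
	if crawl.OpeningHours != "" {
		out.OpeningHours = crawl.OpeningHours
	}
	if crawl.Description != "" {
		out.Description = crawl.Description
	}
	return out
}

func (s *Service) RunLLMEnrichOne(ctx context.Context, row Row) (Enrichment, error) {
	key := CacheKey(row.NameNormalized, row.Stadt, row.Year)

	// Cache hit: skip the LLM, re-merge the cached payload onto the
	// current crawl-enrich base, persist.
	if cached, ok := s.cache.Get(key); ok {
		merged := Merge(row.Crawl, cached)
		return merged, s.store.SetEnrichment(ctx, row.ID, merged, "done")
	}

	// Miss: scrape up to 5 quellen URLs into labeled blocks.
	b, err := s.blocks(ctx, row.Quellen)
	if err != nil {
		return Enrichment{}, err // ErrNoScrapedContent when nothing usable
	}

	llm, err := s.pass2(ctx, s.prompt(row, b))
	if err != nil {
		return Enrichment{}, err
	}
	if llm != (Enrichment{}) {
		llm.Provenance = "llm" // plus model + token counts in the real code
	}

	// Cache the raw LLM payload, not the merged result, so a later
	// re-crawl can layer a fresh crawl base onto the same answer.
	s.cache.Put(key, llm)

	merged := Merge(row.Crawl, llm) // crawl fields survive
	return merged, s.store.SetEnrichment(ctx, row.ID, merged, "done")
}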

ErrNoScrapedContent fails fast when zero URLs return usable text;
LLMs without grounding hallucinate, and a 400-style operator error is
better than inventing details. Individual scrape failures don't halt
the flow as long as at least one source succeeds.
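
A sketch of that guard plus the URL cap, pairing with the flow sketch
above. The Scraper signature, the maxQuellenURLs name, and the
"=== URL ===" block label format are assumptions:

package enrich

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// ErrNoScrapedContent aborts the enrich call when no source yields text.
var ErrNoScrapedContent = errors.New("enrich: no scraped content to ground the LLM")

const maxQuellenURLs = 5

// Scraper mirrors pkg/scrape's Client.Fetch; the signature is assumed.
type Scraper interface {
	Fetch(ctx context.Context, url string) (string, error)
}

// scrapeBlocks fetches up to 5 URLs, tolerates individual failures, and
// labels each surviving block with its source URL for the prompt.
func scrapeBlocks(ctx context.Context, sc Scraper, urls []string) (string, error) {
	if len(urls) > maxQuellenURLs {
		urls = urls[:maxQuellenURLs]
	}
	var b strings.Builder
	got := 0
	for _, u := range urls {
		text, err := sc.Fetch(ctx, u)
		if err != nil || strings.TrimSpace(text) == "" {
			continue // one bad source doesn't halt the flow
		}
		fmt.Fprintf(&b, "=== %s ===\n%s\n\n", u, text)
		got++
	}
	if got == 0 {
		return "", ErrNoScrapedContent
	}
	return b.String(), nil
}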

pkg/scrape (new, reusable)
- Client.Fetch: HTTP GET, strip script/style/nav/footer/aside via
  goquery, gather body text, collapse whitespace, truncate.
  DefaultTimeout=10s, DefaultMaxChars=4000. User-Agent configurable.
- Tests cover noise stripping, whitespace collapsing, truncation,
  body-less fragments.
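
A sketch of extractText consistent with those tests (the real
implementation may differ in detail, but the tests at the bottom of this
page pin down this observable behavior):

package scrape

import (
	"bytes"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func extractText(raw []byte, maxChars int) (string, error) {
	doc, err := goquery.NewDocumentFromReader(bytes.NewReader(raw))
	if err != nil {
		return "", err
	}
	// Navigation and styling are never content: drop them wholesale.
	doc.Find("script, style, nav, footer, aside").Remove()

	body := doc.Find("body")
	text := body.Text()
	if body.Length() == 0 || strings.TrimSpace(text) == "" {
		// Fragment without a usable <body>: fall back to document text.
		text = doc.Text()
	}

	// Collapse every whitespace run (spaces, tabs, newlines) to one space.
	text = strings.Join(strings.Fields(text), " ")

	if len(text) > maxChars {
		text = text[:maxChars] // byte truncation, as TestExtractText_Truncates expects
	}
	return text, nil
}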

enrich.MistralLLMEnricher
- Takes ai.Client + Scraper (both injectable; tests use stubs).
- Prompt: English system instructions asking for JSON-only output
  with category/opening_hours/description in German. User prompt
  includes markt identifiers, already-filled fields (so the LLM
  doesn't waste tokens re-deriving them), and scraped blocks.
- Tests: happy path, all-scrapes-fail (-> ErrNoScrapedContent),
  partial-scrape-success, empty LLM fields yield no provenance,
  URL cap at 5.
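
Illustrative prompt assembly: the exact strings in the MR differ, but
the structure (identifiers, already-filled fields, labeled scrape
blocks, JSON-only German output) is as described. The function name and
parameters here are assumptions:

package enrich

import (
	"fmt"
	"strings"
)

// Wording is illustrative; the MR's actual prompt text differs.
const systemPrompt = "You fill in missing fields for a German market listing. " +
	"Answer with JSON only: {\"category\": ..., \"opening_hours\": ..., \"description\": ...}. " +
	"Write all field values in German. Leave a field empty if the sources do not support it."

func buildUserPrompt(name, stadt string, year int, filled map[string]string, blocks string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Markt: %s, %s (%d)\n", name, stadt, year)
	// Echo already-filled fields so the model doesn't burn tokens re-deriving them.
	for field, value := range filled {
		fmt.Fprintf(&b, "Bereits ausgefüllt: %s = %s\n", field, value)
	}
	b.WriteString("\nQuellen (gescrapte Textblöcke):\n")
	b.WriteString(blocks)
	return b.String()
}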

Service.RunLLMEnrichOne + handler POST /admin/discovery/queue/:id/enrich
(sync, 30s timeout). NewService gains an llm enrich.LLMEnricher param;
routes.go constructs a MistralLLMEnricher when ai.Client is enabled and
falls back to NoopLLMEnricher otherwise.
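
Handler sketch, assuming Go 1.22 net/http path-value routing; the real
router, Service signature, and package layout may differ, while the 30s
budget and the 400 mapping for ErrNoScrapedContent are as described:

package discovery

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"
	"time"
)

// Local stand-ins; the real Service and sentinel live in the enrich package.
var errNoScrapedContent = errors.New("enrich: no scraped content")

type enrichService interface {
	RunLLMEnrichOne(ctx context.Context, id string) (any, error)
}

type Handler struct{ svc enrichService }

// POST /admin/discovery/queue/:id/enrich, synchronous, 30s budget.
func (h *Handler) enrichOne(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
	defer cancel()

	merged, err := h.svc.RunLLMEnrichOne(ctx, r.PathValue("id"))
	switch {
	case errors.Is(err, errNoScrapedContent):
		// Operator error, not a server fault: the row has no usable sources.
		http.Error(w, "no usable text from quellen URLs", http.StatusBadRequest)
	case err != nil:
		http.Error(w, err.Error(), http.StatusInternalServerError)
	default:
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(merged)
	}
}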

UI: per-row AI button next to Similar. Per-row pending state is tracked
via a Set<string>, which disables the button while a request is in
flight and shows an "AI..." label. Success invalidates the page, and the
row's expanded view picks up the new category/opening_hours/description
fields with llm provenance tags. An inline error message appears on the
row if the enrich action fails.
2026-04-24 10:46:28 +02:00

marktvogt.de/backend/internal/pkg/scrape/scrape_test.go

package scrape

import (
	"strings"
	"testing"
)

func TestExtractText_StripsNoise(t *testing.T) {
	html := []byte(`<html>
<head><style>.foo{color:red}</style><script>var x = 1;</script></head>
<body>
<nav>HOME | ABOUT | CONTACT</nav>
<main>
<h1>Mittelaltermarkt Dresden</h1>
<p>Samstag und Sonntag von 10:00 bis 18:00 Uhr.</p>
</main>
<footer>Copyright 2026</footer>
</body>
</html>`)
	got, err := extractText(html, 1000)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	// The content we care about is present.
	if !strings.Contains(got, "Mittelaltermarkt Dresden") {
		t.Errorf("missing h1: %q", got)
	}
	if !strings.Contains(got, "10:00 bis 18:00") {
		t.Errorf("missing opening hours: %q", got)
	}
	// Noise is gone.
	if strings.Contains(got, "color:red") || strings.Contains(got, "var x = 1") {
		t.Errorf("style/script leaked: %q", got)
	}
	if strings.Contains(got, "Copyright") || strings.Contains(got, "HOME | ABOUT") {
		t.Errorf("nav/footer leaked: %q", got)
	}
}

func TestExtractText_CollapsesWhitespace(t *testing.T) {
	html := []byte(`<html><body><p>foo bar
baz</p></body></html>`)
	got, err := extractText(html, 1000)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	if got != "foo bar baz" {
		t.Errorf("whitespace not collapsed: %q", got)
	}
}

func TestExtractText_Truncates(t *testing.T) {
	// Build a long body.
	body := strings.Repeat("a b c ", 2000) // ~12000 chars after collapse
	html := []byte("<html><body><p>" + body + "</p></body></html>")
	got, err := extractText(html, 100)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	if len(got) != 100 {
		t.Errorf("len(got) = %d; want 100", len(got))
	}
}

func TestExtractText_FallsBackToDocumentWhenNoBody(t *testing.T) {
	// Document fragment without <html>/<body> tags. goquery still parses this
	// but .Find("body") returns nothing; we fall back to doc-level text.
	html := []byte(`<div><p>Direktes Fragment.</p></div>`)
	got, err := extractText(html, 1000)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	if !strings.Contains(got, "Direktes Fragment") {
		t.Errorf("fallback failed: %q", got)
	}
}