Completes the manual two-pass enrichment flow: the crawl-enrich-all
button (MR 3) fills deterministic fields across the queue; this MR
adds a per-row "AI" button that scrapes the row's quellen URLs and
asks Mistral to fill category, opening_hours, description.
Flow per click:
1. Load row, compute CacheKey(name_normalized, stadt, year).
2. Cache hit -> skip LLM, merge cached payload onto current
crawl-enrich base, persist, return.
3. Miss -> scrape up to 5 quellen URLs via pkg/scrape (goquery
text extraction, 4000-char truncation), concatenate into labeled
blocks, call ai.Client.Pass2 with JSON response format.
4. Parse response into Enrichment{category, opening_hours,
description}, stamp provenance=llm + model + token counts.
5. Cache the raw LLM payload (not the merged one) under the tuple
key with DefaultCacheTTL=30d, so later re-crawls can layer new
crawl-enrich bases on the same cached answer.
6. Merge(crawl, llm) -- crawl fields survive. Persist via
SetEnrichment(status=done). Return merged to the operator.
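Step 6's merge semantics ("crawl fields survive") can be sketched as follows; the Enrichment struct shape and the Merge signature are illustrative assumptions, not the MR's actual code:

```go
package main

import "fmt"

// Enrichment mirrors the three LLM-filled fields named above; the field
// set here is an assumption for illustration.
type Enrichment struct {
	Category     string
	OpeningHours string
	Description  string
}

// Merge keeps any non-empty crawl-enrich value and lets the LLM payload
// only fill the gaps, so deterministic data always wins over generated data.
func Merge(crawl, llm Enrichment) Enrichment {
	pick := func(crawlVal, llmVal string) string {
		if crawlVal != "" {
			return crawlVal
		}
		return llmVal
	}
	return Enrichment{
		Category:     pick(crawl.Category, llm.Category),
		OpeningHours: pick(crawl.OpeningHours, llm.OpeningHours),
		Description:  pick(crawl.Description, llm.Description),
	}
}

func main() {
	crawl := Enrichment{Category: "mittelaltermarkt"}
	llm := Enrichment{Category: "weihnachtsmarkt", Description: "Historischer Markt."}
	fmt.Printf("%+v\n", Merge(crawl, llm))
}
```

Because the cache stores the raw LLM payload rather than the merged result (step 5), this merge can be replayed against a fresher crawl-enrich base without a new LLM call.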
ErrNoScrapedContent fails fast when zero URLs return usable text;
LLMs without grounding hallucinate, and a 400-style operator error is
better than inventing details. Individual scrape failures don't halt
the flow as long as at least one source succeeds.
pkg/scrape (new, reusable)
- Client.Fetch: HTTP GET, strip script/style/nav/footer/aside via
goquery, gather body text, collapse whitespace, truncate.
DefaultTimeout=10s, DefaultMaxChars=4000. User-Agent configurable.
- Tests cover noise stripping, whitespace collapsing, truncation,
body-less fragments.
enrich.MistralLLMEnricher
- Takes ai.Client + Scraper (both injectable; tests use stubs).
- Prompt: English system instructions asking for JSON-only output
with category/opening_hours/description in German. User prompt
includes markt identifiers, already-filled fields (so the LLM
doesn't waste tokens re-deriving them), and scraped blocks.
- Tests: happy path, all-scrapes-fail (-> ErrNoScrapedContent),
partial-scrape-success, empty LLM fields yield no provenance,
URL cap at 5.
Service.RunLLMEnrichOne + handler POST /admin/discovery/queue/:id/enrich
(sync, 30s timeout). NewService gains an llm enrich.LLMEnricher param;
routes.go constructs a MistralLLMEnricher when ai.Client is enabled and
falls back to NoopLLMEnricher otherwise.
UI: a per-row AI button next to Similar. Per-row pending state is tracked
in a Set<string>; while a request is in flight the button is disabled and
shows an "AI..." label. On success the page is invalidated and the row's
expanded view picks up the new category/opening_hours/description fields
with llm provenance tags; on failure an inline error message is shown on
the row.
78 lines · 2.1 KiB · Go
package scrape

import (
	"strings"
	"testing"
)

func TestExtractText_StripsNoise(t *testing.T) {
	html := []byte(`<html>
<head><style>.foo{color:red}</style><script>var x = 1;</script></head>
<body>
<nav>HOME | ABOUT | CONTACT</nav>
<main>
<h1>Mittelaltermarkt Dresden</h1>
<p>Samstag und Sonntag von 10:00 bis 18:00 Uhr.</p>
</main>
<footer>Copyright 2026</footer>
</body>
</html>`)
	got, err := extractText(html, 1000)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	// The content we care about is present.
	if !strings.Contains(got, "Mittelaltermarkt Dresden") {
		t.Errorf("missing h1: %q", got)
	}
	if !strings.Contains(got, "10:00 bis 18:00") {
		t.Errorf("missing opening hours: %q", got)
	}
	// Noise is gone.
	if strings.Contains(got, "color:red") || strings.Contains(got, "var x = 1") {
		t.Errorf("style/script leaked: %q", got)
	}
	if strings.Contains(got, "Copyright") || strings.Contains(got, "HOME | ABOUT") {
		t.Errorf("nav/footer leaked: %q", got)
	}
}

func TestExtractText_CollapsesWhitespace(t *testing.T) {
	html := []byte(`<html><body><p>foo bar

	baz</p></body></html>`)
	got, err := extractText(html, 1000)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	if got != "foo bar baz" {
		t.Errorf("whitespace not collapsed: %q", got)
	}
}

func TestExtractText_Truncates(t *testing.T) {
	// Build a long body.
	body := strings.Repeat("a b c ", 2000) // ~12000 chars after collapse
	html := []byte("<html><body><p>" + body + "</p></body></html>")
	got, err := extractText(html, 100)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	if len(got) != 100 {
		t.Errorf("len(got) = %d; want 100", len(got))
	}
}

func TestExtractText_FallsBackToDocumentWhenNoBody(t *testing.T) {
	// Document fragment without <html>/<body> tags. goquery still parses this
	// but .Find("body") returns nothing; we fall back to doc-level text.
	html := []byte(`<div><p>Direktes Fragment.</p></div>`)
	got, err := extractText(html, 1000)
	if err != nil {
		t.Fatalf("extractText: %v", err)
	}
	if !strings.Contains(got, "Direktes Fragment") {
		t.Errorf("fallback failed: %q", got)
	}
}