improve lexical warmup and standardize stopword pipeline

2026-02-17 14:49:47 +08:00
parent 246eb7a7e2
commit 94eceaed96
14 changed files with 4840 additions and 330 deletions


@@ -0,0 +1,15 @@
# stopwords sources for story-summary
- Dataset: `stopwords-iso` (npm package, version 1.1.0)
- Repository: https://github.com/stopwords-iso/stopwords-iso
- License: MIT
- Snapshot date: 2026-02-16
- Languages used: `zh`, `ja`, `en`
- Local snapshot files:
  - `stopwords-iso.zh.txt`
  - `stopwords-iso.ja.txt`
  - `stopwords-iso.en.txt`
Generation note:
- `modules/story-summary/vector/utils/stopwords-base.js` is generated from these snapshot files.
- Keep `stopwords-patch.js` for small, domain-specific overrides only.
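
The generation step noted above could look roughly like the sketch below. It is not the actual generator from this commit: the function names (`parseSnapshot`, `buildStopwordsModule`), the exported name `STOPWORDS_BASE`, and the assumption that each snapshot file holds one stopword per line are all hypothetical. File I/O is left out so the sketch runs on its own; a real script would read the three `.txt` snapshots and write the result to `stopwords-base.js`.

```javascript
// Hypothetical generator sketch; names and output shape are assumptions,
// not the code shipped in this commit.

// Parse one snapshot, assuming one stopword per line.
// Blank lines are skipped and duplicates are dropped.
function parseSnapshot(text) {
  const seen = new Set();
  for (const line of text.split(/\r?\n/)) {
    const word = line.trim();
    if (word) seen.add(word);
  }
  // Sort so regenerated output is stable across runs.
  return [...seen].sort();
}

// Build the source text of the generated module from { lang: snapshotText }.
function buildStopwordsModule(snapshots) {
  const langs = Object.keys(snapshots).sort();
  const body = langs
    .map((lang) => {
      const words = JSON.stringify(parseSnapshot(snapshots[lang]));
      return `  ${JSON.stringify(lang)}: ${words},`;
    })
    .join("\n");
  return [
    "// GENERATED FILE: do not edit by hand.",
    "export const STOPWORDS_BASE = {",
    body,
    "};",
    "",
  ].join("\n");
}

// Inline sample data stands in for the real snapshot files.
const source = buildStopwordsModule({
  en: "the\na\nthe\n",
  zh: "的\n了\n",
});
console.log(source);
```

Keeping the parsing and rendering as pure functions means the output can be regenerated and diffed whenever the snapshot date changes, while hand-written overrides stay in `stopwords-patch.js`.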