Review Sequencing & Sentiment Pipeline
Local LLM-based sentiment, topic, complaints, and praise extraction for HKTVMall contact lens reviews.
Runs entirely on a DGX Spark GB10 (Blackwell GPU, 128GB unified memory). No external API calls.
What this project does
Two enrichment tracks:
| Track | Script | Method | Output prefix |
|---|---|---|---|
| LLM classification (primary) | classify_reviews.py | Ternary-Bonsai-4B via PrismML llama-server | {run_id}_cl_reviews_model_enriched.parquet |
| Rule-based + BERTopic (optional) | enrich_reviews.py | Tags, keywords, BERTopic clustering | {run_id}_cl_reviews_enriched.parquet |
Each run gets a 5-character UUID prefix (run_id) so outputs are never overwritten.
Reference run (2026-06-25): 6091e — 16,835 reviews in ~44 min at 6.4/s.
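A minimal sketch of how a 5-character run ID and its output path could be produced. The Configuration reference names uuid4().hex[:5] as the naming rule; the helper names and the existence check below are illustrative assumptions, not code taken from classify_reviews.py. Five hex characters give 16^5 = 1,048,576 possible IDs, so a collision is unlikely but not impossible, which is why the sketch re-rolls if the target file already exists.

```python
import uuid
from pathlib import Path

DATA_DIR = Path("~/ontologer/data/processed/hk_market/hktvmall").expanduser()


def new_run_id() -> str:
    """Return the first 5 hex characters of a random UUID, e.g. '6091e'."""
    return uuid.uuid4().hex[:5]


def output_path(data_dir: Path, run_id: str) -> Path:
    """Build the LLM-track output path for a given run ID."""
    return data_dir / f"{run_id}_cl_reviews_model_enriched.parquet"


def fresh_output_path(data_dir: Path) -> tuple[str, Path]:
    """Assumption: re-roll the ID in the rare case the file already exists."""
    while True:
        run_id = new_run_id()
        path = output_path(data_dir, run_id)
        if not path.exists():
            return run_id, path


run_id, path = fresh_output_path(DATA_DIR)
print("Run ID:", run_id)
print("Output:", path)
```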
Project layout
review_sequencing_sentiment/
├── README.md ← this manual
├── requirements.txt ← Python deps (torch installed separately)
├── classify_reviews.py ← LLM batch classifier (main pipeline)
├── enrich_reviews.py ← optional rule-based + BERTopic enrichment
├── classify_output.log ← latest classification run log
├── scripts/
│ ├── config.env ← paths + tuning (single source of truth)
│ ├── start_llm_server.sh ← start Bonsai server
│ ├── stop_llm_server.sh ← stop server
│ └── run_classification.sh ← run classifier (checks server first)
├── topic_model/ ← BERTopic model artifacts (from enrich_reviews.py)
└── venv/ ← Python 3.12 virtualenv
Data (outside this repo):
~/ontologer/data/
├── processed/hk_market/hktvmall/
│ ├── cl_reviews.parquet ← input (16,835 reviews)
│ ├── 6091e_cl_reviews_model_enriched.parquet ← example LLM output
│ └── {run_id}_cl_reviews_enriched.parquet ← rule-based output
└── raw/hk_market/hktvmall/ ← raw scrapes (for enrich_reviews.py)
~/models/llm/
├── llama.cpp-prismml-safe/build/bin/llama-server ← PrismML fork binary
└── ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf ← 1.1 GB model
Prerequisites
| Requirement | Details |
|---|---|
| Machine | DGX Spark GB10 (spark-5927), aarch64 |
| GPU | NVIDIA Blackwell GB10, 128GB unified memory |
| CUDA | 13.0, driver 580.126.09 |
| PrismML llama.cpp | Standard llama.cpp cannot load ternary Q2_0 (ggml type 42) |
| Model | Ternary-Bonsai-4B-Q2_0.gguf from prism-ml/Ternary-Bonsai-4B-gguf on HuggingFace |
| Input data | cl_reviews.parquet with 16,835 reviews |
One-time setup
1. Python environment
cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
2. Build PrismML llama.cpp
Ternary models require the PrismML fork (prism branch):
git clone --branch prism https://github.com/PrismML-Eng/llama.cpp \
/home/sidmishra/models/llm/llama.cpp-prismml-safe
cd /home/sidmishra/models/llm/llama.cpp-prismml-safe
unset CC CXX # required — stale CC/CXX breaks cmake
cmake -B build -DGGML_CUDA=ON
cmake --build build -j$(nproc)
ls build/bin/llama-server # should exist
Backup tarball (if needed): /home/sidmishra/models/llm/llama.cpp-prismml-20260625.tar.gz
3. Download model
Place the GGUF at:
/home/sidmishra/models/llm/ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf
4. Verify input data
python3 -c "
import pandas as pd
df = pd.read_parquet('~/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet')
print(len(df), 'reviews')
print(df.columns.tolist())
"
# Expected: 16835 reviews
Runbook — full LLM classification (copy-paste)
This is the verified end-to-end process used for run 6091e.
Step 1 — Start the LLM server
cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment
chmod +x scripts/*.sh
./scripts/start_llm_server.sh
Current production config (defined in scripts/config.env):
| Flag | Value | Meaning |
|---|---|---|
| -c | 4098 | Total KV context (rounded by llama.cpp; see note below) |
| -np / --parallel | 8 | Concurrent server slots |
| -ngl | 99 | All layers on GPU |
| -t | 4 | CPU threads |
| Per slot | 768 tokens | llama.cpp rounds 4098 → 6144 total (768 × 8) |
Context rounding: If -c is not divisible by -np, llama.cpp adjusts the total context. With -c 4098 -np 8, the log shows "n_ctx is not divisible by n_seq_max - rounding down to 6144", which works out to 768 tokens per slot. For exactly 512 tokens/slot, use -c 4096 -np 8 instead.
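The per-slot arithmetic in the note above can be checked before starting the server. This is only a sketch of the even-split rule (total context divided by slot count); it does not reproduce llama.cpp's internal rounding, and the function name is hypothetical.

```python
def per_slot_context(total_ctx: int, n_parallel: int) -> int:
    """Tokens per slot when total_ctx splits evenly across n_parallel slots."""
    if n_parallel <= 0:
        raise ValueError("n_parallel must be positive")
    if total_ctx % n_parallel != 0:
        raise ValueError(
            f"-c {total_ctx} is not divisible by -np {n_parallel}; "
            "the server will adjust the total"
        )
    return total_ctx // n_parallel


print(per_slot_context(4096, 8))  # 512
print(per_slot_context(6144, 8))  # 768

try:
    per_slot_context(4098, 8)
except ValueError as err:
    print("warning:", err)
```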
Manual equivalent:
/home/sidmishra/models/llm/llama.cpp-prismml-safe/build/bin/llama-server \
-m /home/sidmishra/models/llm/ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf \
--host 127.0.0.1 --port 8080 \
-ngl 99 -c 4098 -t 4 --parallel 8 \
> /tmp/bonsai-4b.log 2>&1 &
Step 2 — Verify server is healthy
curl -s http://127.0.0.1:8080/health
# {"status":"ok"}
grep -E "n_ctx|n_ctx_seq|n_slots|model loaded" /tmp/bonsai-4b.log | tail -5
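If the health check needs to run from Python (for example before a scripted classification), something along these lines may work. It is a sketch using only the standard library; the function names, attempt count, and delay are assumptions, and the fetch callable is injectable so the logic can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:8080/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 2.0) -> str:
    """GET the health endpoint and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def is_ok(body: str) -> bool:
    """True only for a JSON object whose "status" field is "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (json.JSONDecodeError, AttributeError):
        return False


def wait_until_healthy(
    fetch: Callable[[], str] = fetch_health,
    attempts: int = 30,
    delay: float = 2.0,
) -> bool:
    """Poll until the server reports ok, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            if is_ok(fetch()):
                return True
        except OSError:
            # Connection refused, timeout, or an HTTP error status
            # (urllib's URLError and HTTPError are OSError subclasses).
            pass
        time.sleep(delay)
    return False
```

Calling wait_until_healthy() with its defaults polls http://127.0.0.1:8080/health for up to about a minute and returns False if the server never reports ok.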
Quick smoke test:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"bonsai-4b","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5,"chat_template_kwargs":{"enable_thinking":false}}'
Step 3 — Run classification
./scripts/run_classification.sh
Or manually:
source venv/bin/activate
PYTHONUNBUFFERED=1 python3 classify_reviews.py 2>&1 | tee classify_output.log
The script prints a Run ID at startup:
Run ID: 6091e
Output: .../6091e_cl_reviews_model_enriched.parquet
Loaded 16835 reviews from cl_reviews.parquet
Step 4 — Monitor progress
tail -f classify_output.log
Progress lines look like:
7000/16835 (41.6%) — 6.4/s — ETA 25.8m
Snapshots are written every 1,000 reviews to the same {run_id}_… file (crash-safe).
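One common way to make a repeatedly rewritten snapshot file crash-safe is to write to a temporary file and rename it over the target. Whether classify_reviews.py does exactly this is not stated here, so treat the following as a sketch of the technique; atomic_write and the demo file are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write(path: Path, write_fn: Callable[[Path], object]) -> None:
    """Write through a temp file in the same directory, then rename it into place.

    A crash during write_fn leaves the previous snapshot untouched, because
    os.replace is atomic on POSIX when source and target share a filesystem.
    """
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        write_fn(tmp)
        os.replace(tmp, path)
    finally:
        if tmp.exists():  # only true if write_fn or the rename failed
            tmp.unlink()


# Demo with a text file; for a DataFrame the writer could be
# `lambda p: df.to_parquet(p)`.
demo_dir = Path(tempfile.mkdtemp())
demo = demo_dir / "snapshot.txt"
atomic_write(demo, lambda p: p.write_text("1000 rows"))
atomic_write(demo, lambda p: p.write_text("2000 rows"))
print(demo.read_text())  # 2000 rows
```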
Expected throughput: ~6.4 reviews/sec with 8 parallel slots (~44 min total).
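The ~44 min figure follows directly from the review count and the measured rate. A small sketch of that arithmetic (the function name is illustrative):

```python
def eta_minutes(done: int, total: int, rate_per_sec: float) -> float:
    """Minutes remaining at a constant rate of reviews per second."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    return (total - done) / rate_per_sec / 60


# Full run: 16,835 reviews at 6.4 reviews/sec.
print(f"{eta_minutes(0, 16835, 6.4):.1f} min")  # 43.8 min
```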
Step 5 — Verify output
RUN_ID=6091e # replace with your run ID from the log
python3 -c "
import pandas as pd
# the shell substitutes RUN_ID into the path before Python runs
df = pd.read_parquet('~/ontologer/data/processed/hk_market/hktvmall/${RUN_ID}_cl_reviews_model_enriched.parquet')
print('rows:', len(df))
print(df['m_sentiment'].value_counts())
"
Reference distribution (run 6091e):
| m_sentiment | count |
|---|---|
| positive | 12,895 |
| negative | 1,869 |
| mixed | 1,515 |
| neutral | 548 |
| unknown | 8 |
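A quick sanity check on these numbers: the counts should add up to the 16,835 input reviews. The sketch below hard-codes the reference counts from the table; for a new run, the same dict could come from df['m_sentiment'].value_counts().to_dict().

```python
REFERENCE_COUNTS = {
    "positive": 12895,
    "negative": 1869,
    "mixed": 1515,
    "neutral": 548,
    "unknown": 8,
}
EXPECTED_ROWS = 16835


def shares(counts: dict) -> dict:
    """Fraction of all rows per sentiment label."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total_rows = sum(REFERENCE_COUNTS.values())
print("rows:", total_rows, "(matches input)" if total_rows == EXPECTED_ROWS else "(MISMATCH)")
for label, share in shares(REFERENCE_COUNTS).items():
    print(f"{label:<9}{share:7.2%}")
```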
Step 6 — Stop server (when done)
./scripts/stop_llm_server.sh
Architecture
cl_reviews.parquet (16,835 reviews)
│
▼
classify_reviews.py ──► llama-server (Bonsai 4B, :8080)
│ ├── 8 parallel slots (-np 8)
│ ├── 768 tokens per slot (6144 total KV)
│ ├── 1 review per HTTP request
│ └── Q2_0 ternary quantization
▼
{run_id}_cl_reviews_model_enriched.parquet
├── review_id
├── m_sentiment positive | negative | neutral | mixed | unknown
├── m_score 1–5
├── m_topics [comfort, quality, price, ...]
├── m_complaints [original language]
├── m_praise [original language]
└── m_raw raw LLM response text
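To make the output schema concrete, here is a sketch of how one parsed LLM response could be mapped onto the m_* columns. The allowed values come from the schema above and the system prompt; how classify_reviews.py actually treats out-of-vocabulary values is not documented here, so the fallback rules (unknown sentiment, dropped topics, score set to None) are assumptions.

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}
ALLOWED_TOPICS = {
    "comfort", "quality", "price", "packaging", "delivery", "repurchase",
    "prescription", "brand", "appearance", "customer_service", "eye_health",
    "expiry", "fit_sizing", "other",
}


def _str_list(value) -> list:
    """Keep only string items; anything that is not a list becomes []."""
    if not isinstance(value, list):
        return []
    return [v for v in value if isinstance(v, str)]


def to_row(review_id, parsed: dict, raw: str) -> dict:
    """Build one output row; invalid fields fall back to safe defaults."""
    sentiment = parsed.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        sentiment = "unknown"

    score = parsed.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        score = None

    return {
        "review_id": review_id,
        "m_sentiment": sentiment,
        "m_score": score,
        "m_topics": [t for t in _str_list(parsed.get("topics")) if t in ALLOWED_TOPICS],
        "m_complaints": _str_list(parsed.get("complaints")),
        "m_praise": _str_list(parsed.get("praise")),
        "m_raw": raw,
    }


raw = '{"sentiment": "mixed", "score": 3, "topics": ["price", "bogus"], "complaints": ["貴"], "praise": []}'
row = to_row("r1", {"sentiment": "mixed", "score": 3, "topics": ["price", "bogus"],
                    "complaints": ["貴"], "praise": []}, raw)
print(row["m_sentiment"], row["m_score"], row["m_topics"])  # mixed 3 ['price']
```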
Configuration reference
All tunables live in scripts/config.env and classify_reviews.py.
| Setting | Location | Current value | Notes |
|---|---|---|---|
| Server context -c | config.env → LLM_CTX_SIZE | 4098 | Must divide evenly by -np for exact sizing |
| Server slots -np | config.env → LLM_PARALLEL | 8 | Match CONCURRENCY in classifier |
| Client concurrency | classify_reviews.py → CONCURRENCY | 8 | Must equal server --parallel |
| Snapshot interval | classify_reviews.py → SNAPSHOT_INTERVAL | 1000 | Parquet checkpoint frequency |
| Comment truncation | classify_reviews.py | 300 chars | Sent to LLM per review |
| Max tokens | classify_reviews.py | 500 | LLM response cap |
| Temperature | classify_reviews.py | 0.1 | Low for consistent classification |
| Output naming | classify_reviews.py | uuid4().hex[:5] | Never overwrites prior runs |
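The two constraints in the table (context divisible by slots, client concurrency equal to server slots) can be checked mechanically. The sketch below assumes config.env is a plain shell-style KEY=VALUE file; the sample contents, parse_env, and check_config are hypothetical, and no variable expansion is attempted.

```python
import shlex

SAMPLE_CONFIG = """\
# hypothetical excerpt of scripts/config.env
LLM_CTX_SIZE=4098
LLM_PARALLEL=8   # server slots
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.removeprefix("export ").strip()
        parts = shlex.split(value, comments=True)  # strips quotes and trailing comments
        env[key] = parts[0] if parts else ""
    return env


def check_config(env: dict[str, str], client_concurrency: int) -> list[str]:
    """Return human-readable warnings; an empty list means the config is consistent."""
    ctx = int(env["LLM_CTX_SIZE"])
    slots = int(env["LLM_PARALLEL"])
    warnings = []
    if ctx % slots != 0:
        warnings.append(f"LLM_CTX_SIZE={ctx} is not divisible by LLM_PARALLEL={slots}")
    if client_concurrency != slots:
        warnings.append(f"CONCURRENCY={client_concurrency} != LLM_PARALLEL={slots}")
    return warnings


env = parse_env(SAMPLE_CONFIG)
for warning in check_config(env, client_concurrency=8):
    print("warning:", warning)
```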
Classification details
API call
import httpx
resp = httpx.post("http://127.0.0.1:8080/v1/chat/completions", json={
    "model": "bonsai-4b",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"★{rating} | {comment[:300]}"},
    ],
    "temperature": 0.1,
    "max_tokens": 500,
    "chat_template_kwargs": {"enable_thinking": False},  # REQUIRED for Qwen/Bonsai
}, timeout=120)
System prompt (current)
Classify HK contact lens reviews. Return a JSON object (not array).
{"sentiment": "positive"|"negative"|"neutral"|"mixed", "score": 1-5, "topics": [from: comfort,quality,price,packaging,delivery,repurchase,prescription,brand,appearance,customer_service,eye_health,expiry,fit_sizing,other], "complaints": [in original language], "praise": [in original language]}
唔=not, 冇=don't have. 唔好用=negative. 回購=positive. ONLY output the JSON object.
Response parsing
The classifier sends one review per request and expects a single JSON object back.
It extracts JSON by finding the first { … last } in the response.
Failed parses after 3 retries → sentiment: unknown.
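The parsing rule described above (take everything from the first { to the last }, retry on failure, fall back to unknown) can be sketched as follows. The function names are hypothetical, ask stands in for whatever performs the HTTP call, and "3 retries" is read here as three attempts in total, which is an assumption.

```python
import json
from typing import Callable, Optional

MAX_ATTEMPTS = 3


def extract_json(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' as a JSON object."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def classify_with_retries(ask: Callable[[], str], attempts: int = MAX_ATTEMPTS) -> tuple[dict, str]:
    """Call `ask` until it yields parseable JSON; otherwise return sentiment unknown.

    Returns (parsed, raw). raw stays "" if every attempt raised.
    """
    raw = ""
    for _ in range(attempts):
        try:
            raw = ask()
        except Exception:  # network error, timeout, etc.; try again
            continue
        parsed = extract_json(raw)
        if parsed is not None:
            return parsed, raw
    return {"sentiment": "unknown"}, raw


print(extract_json('Sure! {"sentiment": "negative", "score": 1} Hope that helps.'))
# {'sentiment': 'negative', 'score': 1}
```

Because the slice runs to the last }, trailing chatter after the object is tolerated, but a response containing two separate objects will fail to parse and trigger a retry.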
Cantonese edge cases (verified on hard reviews)
| Review snippet | Expected | Bonsai result |
|---|---|---|
| 唔好用極唔推薦 | negative | ✓ |
| 包裝完整,但上眼睇唔出有粉紅色 | neutral/mixed | ✓ |
| 一直用開…夠自然…不過價錢貴咗 | mixed | ✓ |
| 敏感,唔舒服 | negative | ✓ |
Optional: rule-based enrichment (enrich_reviews.py)
Separate pipeline that reads raw review scrapes (not cl_reviews.parquet) and adds v1_* columns via tags, keyword rules, and BERTopic.
source venv/bin/activate
python enrich_reviews.py # full enrichment + BERTopic
python enrich_reviews.py --skip-topics # rules only, no GPU topic model
python enrich_reviews.py --dry-run # no output written
python enrich_reviews.py --sample 100 # test on 100 rows
Output: {run_id}_cl_reviews_enriched.parquet in the same data directory.
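For reference, the four invocations above correspond to a command-line interface like the one sketched here with argparse. Only the flag names come from this README; whether enrich_reviews.py uses argparse, and the help strings, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="enrich_reviews.py",
        description="Rule-based + BERTopic enrichment (illustrative CLI sketch).",
    )
    parser.add_argument("--skip-topics", action="store_true",
                        help="rules only, skip the BERTopic step")
    parser.add_argument("--dry-run", action="store_true",
                        help="run the pipeline but write no output")
    parser.add_argument("--sample", type=int, default=None, metavar="N",
                        help="process only N rows")
    return parser


args = build_parser().parse_args(["--skip-topics", "--sample", "100"])
print(args.skip_topics, args.dry_run, args.sample)  # True False 100
```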
Model: Ternary-Bonsai-4B
| Property | Value |
|---|---|
| Source | prism-ml/Ternary-Bonsai-4B-gguf (HuggingFace) |
| Quantization | Q2_0 ternary (1.58-bit) |
| Size | 1.1 GB GGUF |
| Base architecture | Qwen3-4B, ternary quantized |
| Why ternary | Fast inference, good classification quality on Cantonese/English mix |
Alternative model (tested, not primary)
Qwen3.5-4B with MTP speculative decoding — 9/10 on hard reviews vs Bonsai 10/10.
See scripts/config.env to swap MODEL_PATH if experimenting.
Performance history
| Config | Reviews/sec | Notes |
|---|---|---|
| Bonsai 4B, 1 slot | ~1.7 | Baseline |
| Bonsai 4B, 4 slots, -c 4096 | ~6.8 | Earlier README config |
| Bonsai 4B, 8 slots, -c 4098 | ~6.4 | Production run 6091e |
| Qwen3.5 4B, 1 slot | ~1.8 | Alternative |
Do not use -c 46800 --parallel 36. This was a mistaken config that hogs memory. Always keep CONCURRENCY in the classifier equal to --parallel on the server.
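Keeping the client at or below the server's slot count means no request waits in the server queue while its timeout runs down. A sketch of a bounded client using a thread pool follows; whether classify_reviews.py uses threads or asyncio is not stated here, and classify_all and fake_classify are hypothetical stand-ins.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

CONCURRENCY = 8  # keep equal to the server's --parallel


def classify_all(reviews: Iterable[str],
                 classify_one: Callable[[str], str],
                 concurrency: int = CONCURRENCY) -> list:
    """Run classify_one over all reviews with at most `concurrency` in flight.

    pool.map returns results in input order.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(classify_one, reviews))


# Demo: a fake classifier that records how many calls overlap.
state = {"now": 0, "peak": 0}
lock = threading.Lock()


def fake_classify(review: str) -> str:
    with lock:
        state["now"] += 1
        state["peak"] = max(state["peak"], state["now"])
    time.sleep(0.01)  # stands in for the HTTP round trip
    with lock:
        state["now"] -= 1
    return review.upper()


results = classify_all([f"review {i}" for i in range(40)], fake_classify)
print(len(results), "results, peak in-flight:", state["peak"])
```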
Troubleshooting
invalid ggml type 42
Cause: standard llama.cpp was used to load a ternary model. Use the PrismML fork instead.
Empty response / content in reasoning_content
Add "chat_template_kwargs": {"enable_thinking": False} to the API call.
Server won't start — "Could not find compiler"
unset CC CXX
cmake -B build -DGGML_CUDA=ON
Server already running on port 8080
./scripts/stop_llm_server.sh
./scripts/start_llm_server.sh
Classification hangs / no log output
Run with unbuffered Python:
PYTHONUNBUFFERED=1 python3 classify_reviews.py
Or use ./scripts/run_classification.sh which sets this automatically.
Output overwritten by a bad run
Fixed: every run writes to {run_id}_cl_reviews_model_enriched.parquet.
Check classify_output.log for the Run ID line at the top.
unknown sentiment rows
8 rows in run 6091e — LLM returned unparseable JSON after 3 retries.
Inspect via m_raw column (empty string = total failure).
Quick reference — three commands
./scripts/start_llm_server.sh # 1. start Bonsai on :8080
./scripts/run_classification.sh # 2. classify all reviews (~44 min)
./scripts/stop_llm_server.sh # 3. stop server
Log files: /tmp/bonsai-4b.log (server), classify_output.log (classifier).