Review Sequencing & Sentiment Pipeline

Local, LLM-based extraction of sentiment, topics, complaints, and praise from HKTVMall contact lens reviews.

Runs entirely on a DGX Spark GB10 (Blackwell GPU, 128GB unified memory), with no external API calls.

What this project does

Two enrichment tracks:

Track	Script	Method	Output file
LLM classification (primary)	`classify_reviews.py`	Ternary-Bonsai-4B via PrismML `llama-server`	`{run_id}_cl_reviews_model_enriched.parquet`
Rule-based + BERTopic (optional)	`enrich_reviews.py`	Tags, keywords, BERTopic clustering	`{run_id}_cl_reviews_enriched.parquet`

Each run gets a 5-character UUID prefix (run_id) so outputs are never overwritten.
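
A minimal sketch of this naming scheme, assuming the prefix comes from `uuid4().hex[:5]` as listed in the configuration reference. The helper name `new_output_path` and its `data_dir` argument are hypothetical and are not taken from `classify_reviews.py`.

```python
from pathlib import Path
from typing import Tuple
from uuid import uuid4


def new_output_path(
    data_dir: Path,
    suffix: str = "cl_reviews_model_enriched.parquet",
) -> Tuple[str, Path]:
    """Return (run_id, path), where run_id is a 5-character hex prefix."""
    run_id = uuid4().hex[:5]  # five lowercase hex characters, like "6091e"
    return run_id, data_dir / f"{run_id}_{suffix}"


# Hypothetical relative directory, used only for illustration.
run_id, out_path = new_output_path(Path("processed/hk_market/hktvmall"))
print(run_id, out_path.name)
```

Five hex characters give 16^5 = 1,048,576 possible prefixes, so a collision between runs is unlikely but not impossible; checking `out_path.exists()` before writing would close that gap.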

Reference run 6091e (2026-06-25): 16,835 reviews in ~44 min at ~6.4 reviews/s.

Project layout

review_sequencing_sentiment/
├── README.md                 ← this manual
├── requirements.txt          ← Python deps (torch installed separately)
├── classify_reviews.py       ← LLM batch classifier (main pipeline)
├── enrich_reviews.py         ← optional rule-based + BERTopic enrichment
├── classify_output.log       ← latest classification run log
├── scripts/
│   ├── config.env            ← paths + tuning (single source of truth)
│   ├── start_llm_server.sh   ← start Bonsai server
│   ├── stop_llm_server.sh    ← stop server
│   └── run_classification.sh ← run classifier (checks server first)
├── topic_model/              ← BERTopic model artifacts (from enrich_reviews.py)
└── venv/                     ← Python 3.12 virtualenv

Data (outside this repo):

~/ontologer/data/
├── processed/hk_market/hktvmall/
│   ├── cl_reviews.parquet                          ← input (16,835 reviews)
│   ├── 6091e_cl_reviews_model_enriched.parquet     ← example LLM output
│   └── {run_id}_cl_reviews_enriched.parquet        ← rule-based output
└── raw/hk_market/hktvmall/                         ← raw scrapes (for enrich_reviews.py)

~/models/llm/
├── llama.cpp-prismml-safe/build/bin/llama-server   ← PrismML fork binary
└── ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf   ← 1.1 GB model

Prerequisites

Requirement	Details
Machine	DGX Spark GB10 (`spark-5927`), aarch64
GPU	NVIDIA Blackwell GB10, 128GB unified memory
CUDA	13.0, driver 580.126.09
PrismML llama.cpp	Standard llama.cpp cannot load ternary Q2_0 (ggml type 42)
Model	`Ternary-Bonsai-4B-Q2_0.gguf` from `prism-ml/Ternary-Bonsai-4B-gguf` on HuggingFace
Input data	`cl_reviews.parquet` with 16,835 reviews

One-time setup

1. Python environment

cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment

python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

2. Build PrismML llama.cpp

Ternary models require the PrismML fork (prism branch):

git clone --branch prism https://github.com/PrismML-Eng/llama.cpp \
  /home/sidmishra/models/llm/llama.cpp-prismml-safe

cd /home/sidmishra/models/llm/llama.cpp-prismml-safe
unset CC CXX                    # required — stale CC/CXX breaks cmake
cmake -B build -DGGML_CUDA=ON
cmake --build build -j$(nproc)

ls build/bin/llama-server       # should exist

Backup tarball (if needed): /home/sidmishra/models/llm/llama.cpp-prismml-20260625.tar.gz

3. Download model

Download `Ternary-Bonsai-4B-Q2_0.gguf` from `prism-ml/Ternary-Bonsai-4B-gguf` on HuggingFace and place it at:

/home/sidmishra/models/llm/ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf

4. Verify input data

python3 -c "
import pandas as pd
df = pd.read_parquet('~/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet')
print(len(df), 'reviews')
print(df.columns.tolist())
"
# Expected: 16835 reviews

Runbook — full LLM classification (copy-paste)

This is the verified end-to-end process used for run 6091e.

Step 1 — Start the LLM server

cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment
chmod +x scripts/*.sh
./scripts/start_llm_server.sh

Current production config (defined in scripts/config.env):

Flag	Value	Meaning
`-c`	4098	Total KV context (rounded by llama.cpp — see note below)
`-np` / `--parallel`	8	Concurrent server slots
`-ngl`	99	All layers on GPU
`-t`	4	CPU threads
Per slot	768 tokens	llama.cpp rounds 4098 → 6144 total (768 × 8)

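Context rounding: if -c is not divisible by -np, llama.cpp adjusts the total context. With -c 4098 -np 8, the server log prints "n_ctx is not divisible by n_seq_max - rounding down to 6144". Despite the "rounding down" wording, the effective total of 6144 tokens (768 per slot × 8 slots) is larger than the 4098 requested. For exactly 512 tokens per slot, use -c 4096 -np 8 instead.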
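
The divisibility condition in this note can be checked before starting the server. The sketch below only tests whether the context splits evenly and reports the per-slot size when it does; it does not attempt to reproduce llama.cpp's own rounding. The function name `per_slot_context` is hypothetical.

```python
from typing import Optional


def per_slot_context(ctx_size: int, parallel: int) -> Optional[int]:
    """Return tokens per slot when ctx_size splits evenly, else None."""
    if parallel <= 0:
        raise ValueError("parallel must be positive")
    if ctx_size % parallel != 0:
        # The server will adjust n_ctx on its own; check its log for the result.
        return None
    return ctx_size // parallel


print(per_slot_context(4096, 8))  # 512
print(per_slot_context(4098, 8))  # None, because 4098 = 8 * 512 + 2
```

A `None` result is a cue to pick a `-c` value that is a multiple of `-np` so the per-slot size is known in advance.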

Manual equivalent:

/home/sidmishra/models/llm/llama.cpp-prismml-safe/build/bin/llama-server \
  -m /home/sidmishra/models/llm/ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 99 -c 4098 -t 4 --parallel 8 \
  > /tmp/bonsai-4b.log 2>&1 &

Step 2 — Verify server is healthy

curl -s http://127.0.0.1:8080/health
# {"status":"ok"}

grep -E "n_ctx|n_ctx_seq|n_slots|model loaded" /tmp/bonsai-4b.log | tail -5

Quick smoke test:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"bonsai-4b","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5,"chat_template_kwargs":{"enable_thinking":false}}'
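
The health check above can also be polled from Python, for example before launching a long run from a notebook. This is a sketch under assumptions: the function names, the attempt count, and the delay are illustrative, and the only server behaviour relied on is the `{"status":"ok"}` body shown for `/health`.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:8080/health"


def probe_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True when the endpoint answers with a JSON body whose status is "ok"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = json.load(resp)
    except (OSError, ValueError):  # connection refused, timeout, HTTP error, bad JSON
        return False
    return isinstance(body, dict) and body.get("status") == "ok"


def wait_until_healthy(
    probe: Callable[[], bool] = probe_health,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between failures."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False


# Usage (requires the server to be running):
#     if not wait_until_healthy():
#         raise SystemExit("llama-server did not become healthy")
```

The `probe` and `sleep` parameters are injectable so the retry loop can be exercised without a live server.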

Step 3 — Run classification

./scripts/run_classification.sh

Or manually:

source venv/bin/activate
PYTHONUNBUFFERED=1 python3 classify_reviews.py 2>&1 | tee classify_output.log

The script prints a Run ID at startup:

Run ID: 6091e
Output: .../6091e_cl_reviews_model_enriched.parquet
Loaded 16835 reviews from cl_reviews.parquet

Step 4 — Monitor progress

tail -f classify_output.log

Progress lines look like:

  7000/16835 (41.6%) — 6.4/s — ETA 25.8m

Snapshots are written every 1,000 reviews to the same {run_id}_… file (crash-safe).

Expected throughput: ~6.4 reviews/sec with 8 parallel slots (~44 min total).
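
The throughput and ETA figures follow from simple arithmetic, sketched here. The function name `eta_minutes` is hypothetical; the numbers are the ones quoted in this README.

```python
def eta_minutes(total: int, done: int, rate_per_sec: float) -> float:
    """Minutes remaining at a steady rate of `rate_per_sec` reviews per second."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    return (total - done) / rate_per_sec / 60


print(round(eta_minutes(16835, 0, 6.4), 1))     # 43.8 (full run)
print(round(eta_minutes(16835, 7000, 6.4), 1))  # 25.6 (after 7,000 reviews)
```

The second value is close to the 25.8m in the sample progress line; the small gap is consistent with the logged rate being rounded to one decimal place.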

Step 5 — Verify output

RUN_ID=6091e   # replace with your run ID from the log

# The shell expands ${RUN_ID} inside the double-quoted Python snippet.
python3 -c "
import pandas as pd
df = pd.read_parquet('~/ontologer/data/processed/hk_market/hktvmall/${RUN_ID}_cl_reviews_model_enriched.parquet')
print('rows:', len(df))
print(df['m_sentiment'].value_counts())
"

Reference distribution (run 6091e):

m_sentiment	count
positive	12,895
negative	1,869
mixed	1,515
neutral	548
unknown	8
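
As a quick consistency check, the reference counts can be summed and converted to percentages. This sketch uses only the figures from the table above; the helper name `share_by_label` is hypothetical.

```python
from typing import Dict

REFERENCE_6091E = {
    "positive": 12895,
    "negative": 1869,
    "mixed": 1515,
    "neutral": 548,
    "unknown": 8,
}
EXPECTED_ROWS = 16835


def share_by_label(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each label's share of the total, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Every input review should appear exactly once in the output.
assert sum(REFERENCE_6091E.values()) == EXPECTED_ROWS
print(share_by_label(REFERENCE_6091E))
```

For a new run, `df['m_sentiment'].value_counts().to_dict()` yields a dictionary in the same shape, which can be passed to `share_by_label` and compared against the reference.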

Step 6 — Stop server (when done)

./scripts/stop_llm_server.sh

Architecture

cl_reviews.parquet (16,835 reviews)
        │
        ▼
classify_reviews.py ──► llama-server (Bonsai 4B, :8080)
        │                  ├── 8 parallel slots (-np 8)
        │                  ├── 768 tokens per slot (6144 total KV)
        │                  ├── 1 review per HTTP request
        │                  └── Q2_0 ternary quantization
        ▼
{run_id}_cl_reviews_model_enriched.parquet
  ├── review_id
  ├── m_sentiment   positive | negative | neutral | mixed | unknown
  ├── m_score       1–5
  ├── m_topics      [comfort, quality, price, ...]
  ├── m_complaints  [original language]
  ├── m_praise      [original language]
  └── m_raw         raw LLM response text

Configuration reference

All tunables live in scripts/config.env and classify_reviews.py.

Setting	Location	Current value	Notes
Server context `-c`	`config.env` → `LLM_CTX_SIZE`	4098	Must divide evenly by `-np` for exact sizing
Server slots `-np`	`config.env` → `LLM_PARALLEL`	8	Match `CONCURRENCY` in classifier
Client concurrency	`classify_reviews.py` → `CONCURRENCY`	8	Must equal server `--parallel`
Snapshot interval	`classify_reviews.py` → `SNAPSHOT_INTERVAL`	1000	Parquet checkpoint frequency
Comment truncation	`classify_reviews.py`	300 chars	Sent to LLM per review
Max tokens	`classify_reviews.py`	500	LLM response cap
Temperature	`classify_reviews.py`	0.1	Low for consistent classification
Output naming	`classify_reviews.py`	`uuid4().hex[:5]`	Never overwrites prior runs
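
One way to keep client concurrency pinned to the server's slot count is a semaphore around each request. This is an illustrative sketch only: this README does not state whether `classify_reviews.py` uses asyncio or threads, and `run_bounded` and `worker` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List

CONCURRENCY = 8  # keep equal to the server's --parallel value


async def run_bounded(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = CONCURRENCY,
) -> List[Any]:
    """Run worker(item) for every item with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item: Any) -> Any:
        async with sem:  # blocks while `limit` workers are already running
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

With `limit` equal to `--parallel`, every in-flight request has a server slot available; a larger value would only queue requests on the server side.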

Classification details

API call

import httpx

resp = httpx.post("http://127.0.0.1:8080/v1/chat/completions", json={
    "model": "bonsai-4b",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"★{rating} | {comment[:300]}"},
    ],
    "temperature": 0.1,
    "max_tokens": 500,
    "chat_template_kwargs": {"enable_thinking": False},  # REQUIRED for Qwen/Bonsai
}, timeout=120)

System prompt (current)

Classify HK contact lens reviews. Return a JSON object (not array).
{"sentiment": "positive"|"negative"|"neutral"|"mixed", "score": 1-5, "topics": [from: comfort,quality,price,packaging,delivery,repurchase,prescription,brand,appearance,customer_service,eye_health,expiry,fit_sizing,other], "complaints": [in original language], "praise": [in original language]}
唔=not, 冇=don't have. 唔好用=negative. 回購=positive. ONLY output the JSON object.

Response parsing

The classifier sends one review per request and expects a single JSON object back. It extracts the JSON by taking the text from the first { to the last } in the response. If parsing still fails after 3 retries, the row is stored with sentiment: unknown.
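
The parsing rule described above can be sketched as follows. The function names are hypothetical, `call_llm` stands in for the HTTP request, and "3 retries" is interpreted here as three attempts in total, which is an assumption.

```python
import json
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumption: "3 retries" means three attempts in total


def extract_json_object(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' as a JSON object."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


def classify_with_retries(call_llm: Callable[[], str], attempts: int = MAX_ATTEMPTS) -> dict:
    """Call the model up to `attempts` times; fall back to sentiment 'unknown'."""
    raw = ""
    for _ in range(attempts):
        raw = call_llm()
        parsed = extract_json_object(raw)
        if parsed is not None:
            return {**parsed, "raw": raw}
    return {"sentiment": "unknown", "raw": raw}


print(extract_json_object('Sure: {"sentiment": "negative", "score": 1} done'))
```

Taking the outermost braces tolerates chatter before or after the object, but it will fail if the model emits two separate objects, since the combined span is not valid JSON.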

Cantonese edge cases (verified on hard reviews)

Review snippet	Expected	Bonsai result
唔好用極唔推薦	negative	✓
包裝完整，但上眼睇唔出有粉紅色	neutral/mixed	✓
一直用開…夠自然…不過價錢貴咗	mixed	✓
敏感，唔舒服	negative	✓
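
These cases can be kept as a small regression set to re-run after changing the prompt or the model. The sketch below is a hypothetical harness: `classify` is any callable that returns a sentiment label for a review text, and the snippets are the abbreviated forms from the table, not the full reviews.

```python
from typing import Callable, List, Set, Tuple

# (review snippet as shown in the table, acceptable sentiment labels)
HARD_CASES: List[Tuple[str, Set[str]]] = [
    ("唔好用極唔推薦", {"negative"}),
    ("包裝完整，但上眼睇唔出有粉紅色", {"neutral", "mixed"}),
    ("一直用開…夠自然…不過價錢貴咗", {"mixed"}),
    ("敏感，唔舒服", {"negative"}),
]


def score_hard_cases(classify: Callable[[str], str]) -> Tuple[int, List[str]]:
    """Return (number of passing cases, list of snippets that failed)."""
    failures = [text for text, allowed in HARD_CASES if classify(text) not in allowed]
    return len(HARD_CASES) - len(failures), failures


# A deliberately naive stand-in classifier, just to show the call shape.
passed, failed = score_hard_cases(lambda text: "negative")
print(f"{passed}/{len(HARD_CASES)} passed")
```

The naive stand-in passes only the two negative cases, which is the point: a classifier that ignores mixed and neutral reviews is caught immediately.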

Optional: rule-based enrichment (`enrich_reviews.py`)

Separate pipeline that reads raw review scrapes (not cl_reviews.parquet) and adds v1_* columns via tags, keyword rules, and BERTopic.

source venv/bin/activate
python enrich_reviews.py              # full enrichment + BERTopic
python enrich_reviews.py --skip-topics   # rules only, no GPU topic model
python enrich_reviews.py --dry-run       # no output written
python enrich_reviews.py --sample 100    # test on 100 rows

Output: {run_id}_cl_reviews_enriched.parquet in the same data directory.

Model: Ternary-Bonsai-4B

Property	Value
Source	`prism-ml/Ternary-Bonsai-4B-gguf` (HuggingFace)
Quantization	Q2_0 ternary (1.58-bit)
Size	1.1 GB GGUF
Base architecture	Qwen3-4B, ternary quantized
Why ternary	Fast inference, good classification quality on Cantonese/English mix

Alternative model (tested, not primary)

Qwen3.5-4B with MTP speculative decoding — 9/10 on hard reviews vs Bonsai 10/10. See scripts/config.env to swap MODEL_PATH if experimenting.

Performance history

Config	Reviews/sec	Notes
Bonsai 4B, 1 slot	~1.7	Baseline
Bonsai 4B, 4 slots, `-c 4096`	~6.8	Earlier README config
Bonsai 4B, 8 slots, `-c 4098`	~6.4	Production run `6091e`
Qwen3.5 4B, 1 slot	~1.8	Alternative

Do not use -c 46800 --parallel 36 — this was a mistaken config that hogs memory. Always keep CONCURRENCY in the classifier equal to --parallel on the server.

Troubleshooting

`invalid ggml type 42`

Cause: a standard llama.cpp build is being used to load the ternary model. Use the PrismML fork (prism branch) instead.

Empty response / content in `reasoning_content`

Add "chat_template_kwargs": {"enable_thinking": False} to the API call (written as false in raw JSON, as in the curl smoke test).

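Build fails: "Could not find compiler"

This is a cmake error from building the PrismML fork, not a runtime server error. Stale CC/CXX variables break cmake; clear them and reconfigure from the llama.cpp-prismml-safe directory:

unset CC CXX
cmake -B build -DGGML_CUDA=ON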

Server already running on port 8080

./scripts/stop_llm_server.sh
./scripts/start_llm_server.sh

Classification hangs / no log output

Run with unbuffered Python:

PYTHONUNBUFFERED=1 python3 classify_reviews.py

Or use ./scripts/run_classification.sh which sets this automatically.

Output overwritten by a bad run

Fixed: every run writes to {run_id}_cl_reviews_model_enriched.parquet. Check classify_output.log for the Run ID line at the top.

`unknown` sentiment rows

8 rows in run 6091e — LLM returned unparseable JSON after 3 retries. Inspect via m_raw column (empty string = total failure).
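
A sketch of that triage, separating total failures (empty `m_raw`) from responses that came back but could not be parsed. It works on plain dictionaries so it runs without pandas; the function name `split_unknown` and the sample rows are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_unknown(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Return (rows with empty m_raw, rows with unparseable m_raw) among unknowns."""
    unknown = [r for r in rows if r.get("m_sentiment") == "unknown"]
    empty = [r for r in unknown if not r.get("m_raw")]
    malformed = [r for r in unknown if r.get("m_raw")]
    return empty, malformed


# Made-up sample rows, only to show the expected shape.
sample = [
    {"review_id": 1, "m_sentiment": "positive", "m_raw": '{"sentiment": "positive"}'},
    {"review_id": 2, "m_sentiment": "unknown", "m_raw": ""},
    {"review_id": 3, "m_sentiment": "unknown", "m_raw": '{"sentiment": "neg'},
]
empty, malformed = split_unknown(sample)
print(len(empty), "empty;", len(malformed), "malformed")
```

With the real output, `df.to_dict("records")` produces rows in this shape. Malformed rows are candidates for a manual fix or a re-run with a higher token cap, while empty rows point to request-level failures.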

Quick reference — three commands

./scripts/start_llm_server.sh      # 1. start Bonsai on :8080
./scripts/run_classification.sh    # 2. classify all reviews (~44 min)
./scripts/stop_llm_server.sh       # 3. stop server

Log files: /tmp/bonsai-4b.log (server), classify_output.log (classifier).