Review Sequencing & Sentiment Pipeline

Local LLM-based extraction of sentiment, topics, complaints, and praise from HKTVMall contact lens reviews.

Runs entirely on a DGX Spark GB10 (Blackwell GPU, 128GB unified memory), with no external API calls.


What this project does

Two enrichment tracks:

| Track | Script | Method | Output prefix |
| --- | --- | --- | --- |
| LLM classification (primary) | classify_reviews.py | Ternary-Bonsai-4B via PrismML llama-server | {run_id}_cl_reviews_model_enriched.parquet |
| Rule-based + BERTopic (optional) | enrich_reviews.py | Tags, keywords, BERTopic clustering | {run_id}_cl_reviews_enriched.parquet |

Each run gets a 5-character run_id (the first five hex characters of a random UUID), so a new run writes to its own files instead of overwriting earlier outputs.
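The naming scheme can be sketched in a few lines. This is a minimal illustration, not the code in classify_reviews.py: the helper names new_run_id and output_path are hypothetical, and only the uuid4().hex[:5] expression and the file name pattern come from this README.

```python
import uuid
from pathlib import Path


def new_run_id() -> str:
    """Return the first five hex characters of a random UUID4."""
    return uuid.uuid4().hex[:5]


def output_path(data_dir: Path, run_id: str) -> Path:
    """Build the LLM output file name for one run (hypothetical helper)."""
    return data_dir / f"{run_id}_cl_reviews_model_enriched.parquet"


run_id = new_run_id()
print("Run ID:", run_id)
print("Output:", output_path(Path("data"), run_id))
```

Five hex characters give 16^5 = 1,048,576 possible prefixes, so a collision between runs is unlikely but not impossible. A cautious script could check whether the output path already exists before writing.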

Reference run (2026-06-25): 6091e — 16,835 reviews in ~44 min at 6.4/s.


Project layout

review_sequencing_sentiment/
├── README.md                 ← this manual
├── requirements.txt          ← Python deps (torch installed separately)
├── classify_reviews.py       ← LLM batch classifier (main pipeline)
├── enrich_reviews.py         ← optional rule-based + BERTopic enrichment
├── classify_output.log       ← latest classification run log
├── scripts/
│   ├── config.env            ← paths + tuning (single source of truth)
│   ├── start_llm_server.sh   ← start Bonsai server
│   ├── stop_llm_server.sh    ← stop server
│   └── run_classification.sh ← run classifier (checks server first)
├── topic_model/              ← BERTopic model artifacts (from enrich_reviews.py)
└── venv/                     ← Python 3.12 virtualenv

Data (outside this repo):

~/ontologer/data/
├── processed/hk_market/hktvmall/
│   ├── cl_reviews.parquet                          ← input (16,835 reviews)
│   ├── 6091e_cl_reviews_model_enriched.parquet     ← example LLM output
│   └── {run_id}_cl_reviews_enriched.parquet        ← rule-based output
└── raw/hk_market/hktvmall/                         ← raw scrapes (for enrich_reviews.py)

~/models/llm/
├── llama.cpp-prismml-safe/build/bin/llama-server   ← PrismML fork binary
└── ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf   ← 1.1 GB model

Prerequisites

| Requirement | Details |
| --- | --- |
| Machine | DGX Spark GB10 (spark-5927), aarch64 |
| GPU | NVIDIA Blackwell GB10, 128GB unified memory |
| CUDA | 13.0, driver 580.126.09 |
| PrismML llama.cpp | Standard llama.cpp cannot load ternary Q2_0 (ggml type 42) |
| Model | Ternary-Bonsai-4B-Q2_0.gguf from prism-ml/Ternary-Bonsai-4B-gguf on HuggingFace |
| Input data | cl_reviews.parquet with 16,835 reviews |

One-time setup

1. Python environment

cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment

python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130

2. Build PrismML llama.cpp

Ternary models require the PrismML fork (prism branch):

git clone --branch prism https://github.com/PrismML-Eng/llama.cpp \
  /home/sidmishra/models/llm/llama.cpp-prismml-safe

cd /home/sidmishra/models/llm/llama.cpp-prismml-safe
unset CC CXX                    # required — stale CC/CXX breaks cmake
cmake -B build -DGGML_CUDA=ON
cmake --build build -j$(nproc)

ls build/bin/llama-server       # should exist

Backup tarball (if needed): /home/sidmishra/models/llm/llama.cpp-prismml-20260625.tar.gz

3. Download model

Place the GGUF at:

/home/sidmishra/models/llm/ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf

4. Verify input data

python3 -c "
import pandas as pd
df = pd.read_parquet('~/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet')
print(len(df), 'reviews')
print(df.columns.tolist())
"
# Expected: 16835 reviews

Runbook — full LLM classification (copy-paste)

This is the verified end-to-end process used for run 6091e.

Step 1 — Start the LLM server

cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment
chmod +x scripts/*.sh
./scripts/start_llm_server.sh

Current production config (defined in scripts/config.env):

| Flag | Value | Meaning |
| --- | --- | --- |
| -c | 4098 | Total KV context (rounded by llama.cpp; see note below) |
| -np / --parallel | 8 | Concurrent server slots |
| -ngl | 99 | All layers on GPU |
| -t | 4 | CPU threads |
| Per slot | 768 tokens | llama.cpp rounds 4098 → 6144 total (768 × 8) |

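Context rounding: llama.cpp gives every slot an equal share of the context, so a -c value that does not divide evenly by -np is adjusted. With -c 4098 -np 8 the server ends up with 6144 tokens in total (768 per slot), which is more than requested, even though the log line reads "n_ctx is not divisible by n_seq_max - rounding down to 6144". For exactly 512 tokens per slot, use -c 4096 -np 8 instead.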
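The arithmetic behind these numbers can be reproduced with a short sketch. It assumes the total and the per-slot share are each padded up to a multiple of 256 tokens. That padding rule is an assumption chosen because it reproduces both cases quoted in this README, so treat the n_ctx values in the server log as authoritative.

```python
def pad_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def effective_context(n_ctx: int, n_slots: int, pad: int = 256) -> tuple[int, int]:
    """Return (tokens per slot, total tokens) under the assumed padding rule."""
    padded_total = pad_up(n_ctx, pad)                 # 4098 -> 4352
    per_slot = pad_up(padded_total // n_slots, pad)   # 544  -> 768
    return per_slot, per_slot * n_slots


print(effective_context(4098, 8))  # (768, 6144)
print(effective_context(4096, 8))  # (512, 4096)
```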

Manual equivalent:

/home/sidmishra/models/llm/llama.cpp-prismml-safe/build/bin/llama-server \
  -m /home/sidmishra/models/llm/ternary-bonsai/4b/Ternary-Bonsai-4B-Q2_0.gguf \
  --host 127.0.0.1 --port 8080 \
  -ngl 99 -c 4098 -t 4 --parallel 8 \
  > /tmp/bonsai-4b.log 2>&1 &

Step 2 — Verify server is healthy

curl -s http://127.0.0.1:8080/health
# {"status":"ok"}

grep -E "n_ctx|n_ctx_seq|n_slots|model loaded" /tmp/bonsai-4b.log | tail -5
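When the server is started from a script, the health endpoint can be polled until the model has finished loading. The sketch below uses only the standard library. The function names, the 30 attempts, and the 2-second delay are illustrative assumptions; only the URL and the {"status":"ok"} body come from this README.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://127.0.0.1:8080/health"


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> str:
    """Return the raw body of the health endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def wait_until_healthy(
    fetch: Callable[[], str] = fetch_health,
    attempts: int = 30,
    delay: float = 2.0,
) -> bool:
    """Poll until the body parses to {"status": "ok"} or attempts run out."""
    for _ in range(attempts):
        try:
            if json.loads(fetch()).get("status") == "ok":
                return True
        except (OSError, ValueError, AttributeError):
            pass  # connection refused, HTTP error, or an unexpected body
        time.sleep(delay)
    return False
```

Calling wait_until_healthy() with no arguments polls the local server. Passing a different fetch callable makes the loop easy to exercise without a running server.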

Quick smoke test:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"bonsai-4b","messages":[{"role":"user","content":"Say OK"}],"max_tokens":5,"chat_template_kwargs":{"enable_thinking":false}}'

Step 3 — Run classification

./scripts/run_classification.sh

Or manually:

source venv/bin/activate
PYTHONUNBUFFERED=1 python3 classify_reviews.py 2>&1 | tee classify_output.log

The script prints a Run ID at startup:

Run ID: 6091e
Output: .../6091e_cl_reviews_model_enriched.parquet
Loaded 16835 reviews from cl_reviews.parquet

Step 4 — Monitor progress

tail -f classify_output.log

Progress lines look like:

  7000/16835 (41.6%) — 6.4/s — ETA 25.8m

Snapshots are written every 1,000 reviews to the same {run_id}_… file (crash-safe).
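One common way to make such snapshots crash-safe is to write to a temporary file and rename it over the target. This README does not say how classify_reviews.py implements its snapshots, so the sketch below is an assumption about technique, and atomic_write and should_snapshot are hypothetical names.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def should_snapshot(done: int, interval: int = 1000) -> bool:
    """True each time another `interval` reviews have been completed."""
    return done > 0 and done % interval == 0


def atomic_write(target: Path, write: Callable[[Path], None]) -> None:
    """Write via a temp file in the same directory, then rename over target."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    os.close(fd)
    tmp = Path(tmp_name)
    try:
        write(tmp)
        os.replace(tmp, target)  # atomic on POSIX within one filesystem
    finally:
        tmp.unlink(missing_ok=True)  # removes a leftover temp file after a failure
```

With pandas, a snapshot would look like atomic_write(out_path, lambda p: df.to_parquet(p)). If the process dies during the write, the previous snapshot is still intact.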

Expected throughput: ~6.4 reviews/sec with 8 parallel slots (~44 min total).
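The runtime figures follow directly from the review count and the throughput. A small sketch of the arithmetic, with eta_minutes as a hypothetical helper name:

```python
def eta_minutes(done: int, total: int, rate_per_sec: float) -> float:
    """Minutes remaining at a constant rate of reviews per second."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    return (total - done) / rate_per_sec / 60


print(round(eta_minutes(0, 16835, 6.4), 1))     # 43.8
print(round(eta_minutes(7000, 16835, 6.4), 1))  # 25.6
```

The second value is slightly lower than the 25.8m in the sample progress line, presumably because the logged rate is rounded to one decimal place before printing.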

Step 5 — Verify output

RUN_ID=6091e   # replace with your run ID from the log

# The shell expands ${RUN_ID} inside the double-quoted script before Python runs.
python3 -c "
import pandas as pd
df = pd.read_parquet('~/ontologer/data/processed/hk_market/hktvmall/${RUN_ID}_cl_reviews_model_enriched.parquet')
print('rows:', len(df))
print(df['m_sentiment'].value_counts())
"

Reference distribution (run 6091e):

| m_sentiment | count |
| --- | --- |
| positive | 12,895 |
| negative | 1,869 |
| mixed | 1,515 |
| neutral | 548 |
| unknown | 8 |

Step 6 — Stop server (when done)

./scripts/stop_llm_server.sh

Architecture

cl_reviews.parquet (16,835 reviews)
        │
        ▼
classify_reviews.py ──► llama-server (Bonsai 4B, :8080)
        │                  ├── 8 parallel slots (-np 8)
        │                  ├── 768 tokens per slot (6144 total KV)
        │                  ├── 1 review per HTTP request
        │                  └── Q2_0 ternary quantization
        ▼
{run_id}_cl_reviews_model_enriched.parquet
  ├── review_id
  ├── m_sentiment   positive | negative | neutral | mixed | unknown
  ├── m_score       1–5
  ├── m_topics      [comfort, quality, price, ...]
  ├── m_complaints  [original language]
  ├── m_praise      [original language]
  └── m_raw         raw LLM response text

Configuration reference

All tunables live in scripts/config.env and classify_reviews.py.

| Setting | Location | Current value | Notes |
| --- | --- | --- | --- |
| Server context -c | config.env (LLM_CTX_SIZE) | 4098 | Must divide evenly by -np for exact sizing |
| Server slots -np | config.env (LLM_PARALLEL) | 8 | Match CONCURRENCY in classifier |
| Client concurrency | classify_reviews.py (CONCURRENCY) | 8 | Must equal server --parallel |
| Snapshot interval | classify_reviews.py (SNAPSHOT_INTERVAL) | 1000 | Parquet checkpoint frequency |
| Comment truncation | classify_reviews.py | 300 chars | Sent to LLM per review |
| Max tokens | classify_reviews.py | 500 | LLM response cap |
| Temperature | classify_reviews.py | 0.1 | Low for consistent classification |
| Output naming | classify_reviews.py | uuid4().hex[:5] | Never overwrites prior runs |

Classification details

API call

import httpx

resp = httpx.post("http://127.0.0.1:8080/v1/chat/completions", json={
    "model": "bonsai-4b",
    "messages": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"★{rating} | {comment[:300]}"},
    ],
    "temperature": 0.1,
    "max_tokens": 500,
    "chat_template_kwargs": {"enable_thinking": False},  # REQUIRED for Qwen/Bonsai
}, timeout=120)

System prompt (current)

Classify HK contact lens reviews. Return a JSON object (not array).
{"sentiment": "positive"|"negative"|"neutral"|"mixed", "score": 1-5, "topics": [from: comfort,quality,price,packaging,delivery,repurchase,prescription,brand,appearance,customer_service,eye_health,expiry,fit_sizing,other], "complaints": [in original language], "praise": [in original language]}
唔=not, 冇=don't have. 唔好用=negative. 回購=positive. ONLY output the JSON object.
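Because the model is only asked, not forced, to stay within these values, a client-side check on the parsed object is a reasonable safeguard. The sketch below maps a parsed response onto the m_* columns. How classify_reviews.py actually performs this mapping is not shown in this README, so normalize, the fallback to unknown, and the handling of bad scores are assumptions.

```python
from typing import Any

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}
ALLOWED_TOPICS = {
    "comfort", "quality", "price", "packaging", "delivery", "repurchase",
    "prescription", "brand", "appearance", "customer_service", "eye_health",
    "expiry", "fit_sizing", "other",
}


def _string_list(value: Any) -> list[str]:
    """Keep only string items; anything that is not a list becomes []."""
    if not isinstance(value, list):
        return []
    return [item for item in value if isinstance(item, str)]


def normalize(obj: dict[str, Any]) -> dict[str, Any]:
    """Map a parsed LLM response to output columns, dropping invalid values."""
    sentiment = obj.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        sentiment = "unknown"
    score = obj.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        score = None
    return {
        "m_sentiment": sentiment,
        "m_score": score,
        "m_topics": [t for t in _string_list(obj.get("topics")) if t in ALLOWED_TOPICS],
        "m_complaints": _string_list(obj.get("complaints")),
        "m_praise": _string_list(obj.get("praise")),
    }
```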

Response parsing

The classifier sends one review per request and expects a single JSON object back. It extracts the JSON by taking the text from the first { to the last } in the response. If parsing still fails after 3 retries, the review is recorded with sentiment unknown.
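A minimal sketch of this extraction and fallback logic follows. It treats "3 retries" as three attempts in total, which is an assumption, and the function names and the ask callable are hypothetical stand-ins for the real request code.

```python
import json
from typing import Callable, Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}', or return None."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def classify_with_retries(ask: Callable[[], str], attempts: int = 3) -> dict:
    """Call ask() until a JSON object parses; otherwise mark as unknown."""
    raw = ""
    for _ in range(attempts):
        raw = ask()
        parsed = extract_json_object(raw)
        if parsed is not None:
            return {**parsed, "raw": raw}
    return {"sentiment": "unknown", "raw": raw}


print(extract_json_object('Sure! {"sentiment": "negative", "score": 1} Done.'))
# {'sentiment': 'negative', 'score': 1}
```

Slicing from the first brace to the last tolerates chatter before or after the object, but it still fails when the model emits two separate objects or truncated JSON, which is where the unknown fallback applies.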

Cantonese edge cases (verified on hard reviews)

| Review snippet | Expected Bonsai result |
| --- | --- |
| 唔好用極唔推薦 | negative |
| 包裝完整,但上眼睇唔出有粉紅色 | neutral/mixed |
| 一直用開…夠自然…不過價錢貴咗 | mixed |
| 敏感,唔舒服 | negative |

Optional: rule-based enrichment (enrich_reviews.py)

Separate pipeline that reads raw review scrapes (not cl_reviews.parquet) and adds v1_* columns via tags, keyword rules, and BERTopic.

source venv/bin/activate
python enrich_reviews.py              # full enrichment + BERTopic
python enrich_reviews.py --skip-topics   # rules only, no GPU topic model
python enrich_reviews.py --dry-run       # no output written
python enrich_reviews.py --sample 100    # test on 100 rows

Output: {run_id}_cl_reviews_enriched.parquet in the same data directory.


Model: Ternary-Bonsai-4B

| Property | Value |
| --- | --- |
| Source | prism-ml/Ternary-Bonsai-4B-gguf (HuggingFace) |
| Quantization | Q2_0 ternary (1.58-bit) |
| Size | 1.1 GB GGUF |
| Base architecture | Qwen3-4B, ternary quantized |
| Why ternary | Fast inference, good classification quality on Cantonese/English mix |

Alternative model (tested, not primary)

Qwen3.5-4B with MTP speculative decoding — 9/10 on hard reviews vs Bonsai 10/10. See scripts/config.env to swap MODEL_PATH if experimenting.


Performance history

| Config | Reviews/sec | Notes |
| --- | --- | --- |
| Bonsai 4B, 1 slot | ~1.7 | Baseline |
| Bonsai 4B, 4 slots, -c 4096 | ~6.8 | Earlier README config |
| Bonsai 4B, 8 slots, -c 4098 | ~6.4 | Production run 6091e |
| Qwen3.5 4B, 1 slot | ~1.8 | Alternative |

Do not use -c 46800 --parallel 36 — this was a mistaken config that hogs memory. Always keep CONCURRENCY in the classifier equal to --parallel on the server.
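Both rules can be checked before a long run. The sketch below assumes scripts/config.env holds plain KEY=VALUE lines. The key names LLM_CTX_SIZE and LLM_PARALLEL come from the configuration table, while the file format and the helper names are assumptions.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.removeprefix("export ").strip()] = value.strip().strip("\"'")
    return values


def check_config(env: dict[str, str], client_concurrency: int) -> list[str]:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    ctx = int(env["LLM_CTX_SIZE"])
    slots = int(env["LLM_PARALLEL"])
    if ctx % slots != 0:
        problems.append(f"-c {ctx} is not divisible by -np {slots}")
    if client_concurrency != slots:
        problems.append(f"CONCURRENCY {client_concurrency} != --parallel {slots}")
    return problems


SAMPLE = "LLM_CTX_SIZE=4098  # production value\nLLM_PARALLEL=8\n"
for problem in check_config(parse_env(SAMPLE), client_concurrency=8):
    print("warning:", problem)
```

With the production values the check reports only the divisibility warning, which matches the context rounding seen in the server log.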


Troubleshooting

invalid ggml type 42

This error appears when standard llama.cpp tries to load a ternary model. Use the PrismML fork instead.

Empty response / content in reasoning_content

Add "chat_template_kwargs": {"enable_thinking": False} to the API call.

Server won't start — "Could not find compiler"

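This is a cmake error from the build step, usually caused by stale CC/CXX variables. From the llama.cpp-prismml-safe checkout, clear them and rebuild:

unset CC CXX
cmake -B build -DGGML_CUDA=ON
cmake --build build -j$(nproc)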

Server already running on port 8080

./scripts/stop_llm_server.sh
./scripts/start_llm_server.sh

Classification hangs / no log output

Run with unbuffered Python:

PYTHONUNBUFFERED=1 python3 classify_reviews.py

Or use ./scripts/run_classification.sh which sets this automatically.

Output overwritten by a bad run

Fixed: every run writes to {run_id}_cl_reviews_model_enriched.parquet. Check classify_output.log for the Run ID line at the top.

unknown sentiment rows

8 rows in run 6091e — LLM returned unparseable JSON after 3 retries. Inspect via m_raw column (empty string = total failure).


Quick reference — three commands

./scripts/start_llm_server.sh      # 1. start Bonsai on :8080
./scripts/run_classification.sh    # 2. classify all reviews (~44 min)
./scripts/stop_llm_server.sh       # 3. stop server

Log files: /tmp/bonsai-4b.log (server), classify_output.log (classifier).