Clustering & k-NN Analysis Results — HKTVMall Contact Lens Reviews
Author: Composer (Cursor AI Agent)
Date: 2026-06-26
Script: cluster_reviews.py
Outputs: cluster_output/review_clusters.parquet, cluster_output/cluster_report.json
Sources
| # | Source | Role |
|---|---|---|
| 1 | cl_reviews.parquet |
Base reviews (deduped on review_id, keep last → 16,835 rows) |
| 2 | 6091e_cl_reviews_model_enriched.parquet |
LLM dimensions: sentiment, score, topics, complaints, praise |
| 3 | initial_assessment_analytics.md |
Prior exploratory analytics informing feature design |
Methodology
Feature engineering
Each review is encoded as a fixed-length vector, StandardScaler-normalized:
| Block | Pass A (26 features) | Pass B (20 features) |
|---|---|---|
| Topics | 14 canonical multi-hot flags | 14 canonical multi-hot flags |
| Sentiment | 5 one-hot + ordinal score | Excluded |
| Behavior | ★ rating, LLM m_score, log(comment len), log(#complaints), log(#praise), #topics |
Same |
Canonical topics: comfort, quality, price, packaging, delivery, repurchase, prescription, brand, appearance, customer_service, eye_health, expiry, fit_sizing, other
Algorithms
| Method | Purpose |
|---|---|
| KMeans | Primary segmentation; k selected by silhouette score (k = 4…12) |
| k-NN coherence (k=15) | Validation — fraction of 15 nearest neighbors sharing same cluster label |
| HDBSCAN | Sanity check on Pass A vectors (density-based; reported separately) |
Two-pass design
| Pass | Name | Features | Answers |
|---|---|---|---|
| A | Full VoC | Topics + sentiment + behavior | How does the customer feel and express it? |
| B | Topic archetypes | Topics + behavior only | What experience path are they on, regardless of sentiment label? |
Pass A — Full VoC Segmentation
Best k: 5 (silhouette 0.469)
k-NN coherence (k=15): mean 0.997, median 1.000, 99.5% of reviews have ≥80% same-cluster neighbors
k-selection table
| k | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|
| 4 | 0.4658 | 8514 | 1.191 |
| 5 | 0.4691 | 7532 | 0.960 |
| 6 | 0.4126 | 6990 | 1.327 |
| 7 | 0.4163 | 6994 | 1.142 |
| 8 | 0.4255 | 6678 | 1.159 |
| 9 | 0.4036 | 6112 | 1.403 |
| 10 | 0.3646 | 5839 | 1.283 |
| 11 | 0.3560 | 5356 | 1.527 |
| 12 | 0.3967 | 5093 | 1.534 |
Cluster profiles
| Cluster | n | Share | Avg ★ | Avg LLM score | Avg comment len | Archetype |
|---|---|---|---|---|---|---|
| 0 | 6,759 | 40.1% | 4.84 | 4.82 | 5.3 | Silent promoters |
| 1 | 7,054 | 41.9% | 4.82 | 4.82 | 12.9 | Articulate advocates |
| 2 | 1,751 | 10.4% | 1.78 | 1.76 | 45.9 | Vocal detractors |
| 3 | 1,263 | 7.5% | 3.62 | 3.63 | 31.1 | Ambivalent critics |
| 4 | 8 | 0.05% | 3.75 | 0.0 | 60.2 | LLM parse failures (ignore) |
Sentiment mix by cluster
| Cluster | Sentiment breakdown |
|---|---|
| 0 | positive 87%, neutral 7%, mixed 4%, negative 2% |
| 1 | positive 100% |
| 2 | negative 99%, neutral 1% |
| 3 | mixed 98%, neutral 2% |
| 4 | unknown 100% |
Top topics by cluster
| Cluster | Top 5 topics |
|---|---|
| 0 | eye_health, fit_sizing, comfort, expiry, brand |
| 1 | comfort, fit_sizing, appearance, quality, packaging |
| 2 | fit_sizing, eye_health, quality, packaging, comfort |
| 3 | fit_sizing, comfort, quality, appearance, packaging |
Top brands by cluster
| Cluster | Top brands (review count) |
|---|---|
| 0 | ACUVUE (2,158), 博士倫 (1,357), Bausch + Lomb (828) |
| 1 | ACUVUE (2,126), 博士倫 (1,231), OLENS (949) |
| 2 | ACUVUE (541), 博士倫 (360), OLENS (236) |
| 3 | ACUVUE (349), 博士倫 (230), OLENS (220) |
Centroid signatures (top z-scored features)
| Cluster | Defining features (z-score) |
|---|---|
| 0 | log_n_praise (+1.19), topic_other (+1.17), topic_prescription (+1.12), topic_expiry (+1.11), topic_brand (+1.06) |
| 1 | topic_expiry (−0.82), topic_other (−0.81), log_n_praise (−0.81), topic_eye_health (−0.78), topic_prescription (−0.77) |
| 2 | sent_negative (+2.80), n_topics (−2.29), sentiment_ord (−2.28), rating (−2.27), sent_positive (−1.81) |
| 3 | sent_mixed (+3.12), sent_positive (−1.81), log_comment_len (+1.40), n_topics (−1.36), m_score (+0.92) |
| 4 | sent_unknown (+45.86) — artifact |
Prototype reviews (nearest to centroid)
Cluster 0 — Silent promoters
| Brand | ★ | Comment |
|---|---|---|
| 博士倫 | 5 | GOOD |
| 博士倫 | 5 | 回購。扺 |
Cluster 1 — Articulate advocates
| Brand | ★ | Comment |
|---|---|---|
| OLENS | 5 | 唔乾,超舒服 |
| Candymagic | 5 | 透氣度高,非常舒服 |
Cluster 2 — Vocal detractors
| Brand | ★ | Comment |
|---|---|---|
| ACUVUE | 2 | 一盒Con裏面有一半都係吉眼嘅 |
| Bausch + Lomb | 1 | 買左兩盒 有一盒戴左上去矇茶茶 已經換左幾隻戴都是一樣 |
Cluster 3 — Ambivalent critics
| Brand | ★ | Comment |
|---|---|---|
| 博士倫 | 4 | 送既隨機顏色隱形眼鏡係無色,唔係大眼仔 |
| OLENS | 4 | 幾靚,但未算好舒服戴得 |
Pass A interpretation
The market splits into ~82% promoters (clusters 0 + 1) with two behavioral modes:
- Silent (40%) — ultra-short 5★ reviews, corrective-lens topic bias
- Articulate (42%) — comfort/appearance-led praise, OLENS-heavy
~10% hard detractors (cluster 2) write long fit + eye_health complaints at ★1.8.
~8% ambivalent (cluster 3) are structurally mixed — trade-off language at ★3.6.
Clusters 0 and 1 share high stars but differ in verbosity and topic emphasis, not sentiment.
Pass B — Topic Archetypes (no sentiment features)
Best k: 4 (silhouette 0.414)
k-NN coherence (k=15): mean 0.973, median 1.000, 95.6% with ≥80% same-cluster neighbors
k-selection table
| k | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|
| 4 | 0.4139 | 11104 | 1.556 |
| 5 | 0.4059 | 9055 | 1.798 |
| 6 | 0.3190 | 7858 | 1.666 |
| 7 | 0.3056 | 6847 | 1.934 |
| 8 | 0.3014 | 6230 | 1.849 |
| 9 | 0.2944 | 5623 | 2.012 |
| 10 | 0.2880 | 5232 | 1.996 |
| 11 | 0.3179 | 5009 | 1.862 |
| 12 | 0.3206 | 4721 | 1.825 |
Cluster profiles
| Cluster | n | Share | Avg ★ | Avg LLM score | Avg comment len | Archetype |
|---|---|---|---|---|---|---|
| 1 | 6,785 | 40.3% | 4.82 | 4.81 | 5.5 | Corrective lens buyers |
| 0 | 4,502 | 26.7% | 4.81 | 4.81 | 14.4 | Value loyalists |
| 3 | 3,486 | 20.7% | 4.73 | 4.74 | 13.4 | Cosmetic enthusiasts |
| 2 | 2,062 | 12.2% | 1.70 | 1.69 | 47.7 | Failure bundle |
Sentiment mix by cluster (within topic archetypes)
| Cluster | Sentiment breakdown |
|---|---|
| 1 (corrective) | positive 86%, neutral 7%, mixed 4%, negative 2% |
| 0 (value) | positive 87%, mixed 9%, negative 3%, neutral 1% |
| 3 (cosmetic) | positive 89%, mixed 10%, negative 1% |
| 2 (failure) | negative 74%, mixed 23%, positive 2% |
Top topics by cluster
| Cluster | Top 5 topics |
|---|---|
| 1 (corrective) | fit_sizing, eye_health, expiry, delivery, packaging |
| 0 (value) | packaging, price, fit_sizing, delivery, repurchase |
| 3 (cosmetic) | comfort, appearance, quality, fit_sizing, customer_service |
| 2 (failure) | fit_sizing, packaging, quality, eye_health, comfort |
Top brands by cluster
| Cluster | Top brands (review count) |
|---|---|
| 1 (corrective) | ACUVUE (2,169), 博士倫 (1,361), Bausch + Lomb (828) |
| 0 (value) | ACUVUE (1,507), 博士倫 (895), Bausch + Lomb (523) |
| 3 (cosmetic) | ACUVUE (863), OLENS (625), 博士倫 (511) |
| 2 (failure) | ACUVUE (640), 博士倫 (412), OLENS (296) |
Centroid signatures
| Cluster | Defining features (z-score) |
|---|---|
| 0 (value) | topic_comfort (−1.14), topic_quality (−1.02), topic_appearance (−1.00), n_topics (−0.95) — defined by absence of cosmetic topics |
| 1 (corrective) | n_topics (+1.19), topic_other (+1.17), topic_prescription (+1.12), topic_expiry (+1.11), topic_brand (+1.06) |
| 2 (failure) | rating (−2.34), m_score (−2.33), log_n_complaints (+1.64), log_n_praise (−1.43), log_comment_len (+1.32) |
| 3 (cosmetic) | topic_packaging (−1.02), topic_price (−0.92), topic_delivery (−0.92), topic_repurchase (−0.84) — defined by absence of ops/value topics |
Prototype reviews
Cluster 0 — Value loyalists
| Brand | ★ | Comment |
|---|---|---|
| Bausch + Lomb | 5 | 訂購方便 |
| Johnson & Johnson | 5 | 買一盒都有贈品,抵! |
Cluster 1 — Corrective lens buyers
| Brand | ★ | Comment |
|---|---|---|
| 博士倫 | 5 | 抵,回購 |
| 博士倫 | 5 | 完好無缺 |
Cluster 2 — Failure bundle
| Brand | ★ | Comment |
|---|---|---|
| TICON | 1 | 買了兩盒,7.00度的是否也一起退回? |
| Delight | 2 | 為什麼二盒打開不一樣的?跟新包裝也不一樣 |
Cluster 3 — Cosmetic enthusiasts
| Brand | ★ | Comment |
|---|---|---|
| CooperVision | 5 | 無得彈,非常滿意 |
| Candymagic | 5 | 平靚正 日常使用 |
Pass B interpretation
Three distinct happy paths converge on ~4.8★ for different reasons:
- Corrective (40%) — fit, prescription, expiry, brand trust
- Value (27%) — price, repurchase, packaging, delivery
- Cosmetic (21%) — comfort, appearance, quality; OLENS over-indexed
Failure bundle (12%) is topic-structural: fit + packaging + delivery co-occur with low ratings and high complaint volume. It captures 74% negative AND 23% mixed — the mixed reviews here are ops/fit problems expressed ambivalently, not price trade-offs.
Cross-tab: Pass A (tone) × Pass B (experience)
Rows = kmeans_cluster (Pass A), columns = topic_cluster (Pass B):
| Pass A ↓ / Pass B → | B1 Corrective | B2 Failure | B3 Cosmetic | B0 Value |
|---|---|---|---|---|
| A0 Silent promo | 6,750 | 1 | 8 | 0 |
| A1 Articulate adv | 0 | 34 | 3,079 | 3,941 |
| A2 Detractors | 26 | 1,531 | 43 | 151 |
| A3 Ambivalent | 9 | 488 | 356 | 410 |
| A4 Parse fail | 0 | 8 | 0 | 0 |
Cross-tab reads
| Pattern | Finding |
|---|---|
| Silent promoters (A0) | 99.9% corrective archetype — terse repurchase/fit reviews |
| Articulate advocates (A1) | Split 56% value / 44% cosmetic — two praise vocabularies |
| Detractors (A2) | 87% failure bundle (1,531 / 1,751) |
| Ambivalent (A3) | Spread: 39% failure, 33% value, 28% cosmetic — true trade-off reviews |
k-NN validation summary
| Metric | Pass A (full VoC) | Pass B (topic archetypes) |
|---|---|---|
| Neighbors (k) | 15 | 15 |
| Mean same-cluster share | 0.997 | 0.973 |
| Median | 1.000 | 1.000 |
| Reviews with ≥80% coherent neighbors | 99.5% | 95.6% |
Pass A coherence is near-perfect because sentiment one-hot features dominate separation. Pass B's slightly lower coherence (97.3%) reflects softer boundaries between value and cosmetic happy paths — expected and healthy.
HDBSCAN sanity check (Pass A vectors)
| Metric | Value |
|---|---|
| Clusters found | 22 |
| Noise points | 3,088 (18.3%) |
| Silhouette (excl. noise) | 0.454 |
HDBSCAN over-segments for executive use. KMeans with k=4–5 yields interpretable archetypes; HDBSCAN useful for micro-SKU anomaly detection later.
Strategic actions by segment
| Segment | Pass A | Pass B | Recommended action |
|---|---|---|---|
| Silent promoters | A0 | B1 | Low-touch retention; repurchase nudges |
| Articulate advocates | A1 | B0 or B3 | Loyalty programs (value) or cosmetic upsell (OLENS) |
| Vocal detractors | A2 | B2 | Fit guides, QC on packaging/expiry, eye health response |
| Ambivalent critics | A3 | B2/B0/B3 | Price-match policy; expectation setting on color lenses |
| Failure bundle | any | B2 | Cross-functional ops — delivery, packaging, fit, returns |
Reproducing
cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment
source venv/bin/activate
python cluster_reviews.py
Outputs:
cluster_output/review_clusters.parquet— all reviews withkmeans_cluster,topic_cluster,hdbscan_clustercluster_output/cluster_report.json— full numeric results and prototypes
Join key: review_id → base cl_reviews.parquet or enrichment file.
End of clustering results.
Updated: 2026-06-26 13:55 via generate_clustering_results.py
Report block convention
Each analytical section uses up to three executable layers:
| Layer | Tag | Role |
|---|---|---|
| Results | [result:…]…[/result] |
Markdown table — findings |
| Data SQL | [babysql:polars]…[/babysql] |
Polars SQL on parquet — reproduce inputs/aggregates |
| ML / stats | [ml:…]…[/ml] |
Python estimator spec (sklearn, statsmodels, scipy) |
Prose How we found this (math) gives the formulas. Cached artifacts live in cluster_output/.
Pass A — Full VoC Segmentation (auto-generated blocks)
[result:review_cluster_k_selection] | k | Silhouette | | --- | --- | | 4 | 0.4658 | | 5 | 0.4691 | | 6 | 0.4126 | | 7 | 0.4163 | | 8 | 0.4255 | | 9 | 0.4036 | | 10 | 0.3646 | | 11 | 0.3560 | | 12 | 0.3967 |
| Cluster | N | % | Avg ★ | Top topics |
|---|---|---|---|---|
| 1 | 7054 | 41.9% | 4.82 | comfort, fit_sizing, appearance |
| 0 | 6759 | 40.1% | 4.84 | fit_sizing, eye_health, comfort |
| 2 | 1751 | 10.4% | 1.78 | fit_sizing, eye_health, quality |
| 3 | 1263 | 7.5% | 3.62 | fit_sizing, comfort, quality |
| 4 | 8 | 0.0% | 3.75 |
k-NN coherence (k=15): 0.997 [/result]
[babysql:polars] SELECT k, silhouette, calinski_harabasz, davies_bouldin FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/review_k_selection.csv') ORDER BY k [/babysql]
[babysql:polars] SELECT kmeans_cluster AS cluster, COUNT() AS n, ROUND(AVG(rating), 2) AS avg_rating, ROUND(100.0 * SUM(CASE WHEN m_sentiment = 'positive' THEN 1 ELSE 0 END) / COUNT(), 1) AS pos_pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/review_clusters.parquet' GROUP BY kmeans_cluster ORDER BY cluster [/babysql]
[ml:sklearn-kmeans] library: scikit-learn script: cluster_reviews.py :: run_clustering unit: review (one row per review_id) features: 15 topic flags + sentiment one-hot + rating/m_score/verbosity preprocess: StandardScaler model: KMeans(n_init=10, random_state=42) selection: k in 4..12, maximize silhouette validation: knn_coherence(k=15) artifact: cluster_output/review_clusters.parquet [/ml]
How we found this (math): KMeans on z-scored features; silhouette s=(b−a)/max(a,b).
Pass B — Topic archetypes (auto-generated blocks)
[result:topic_cluster_k_selection] | k | Silhouette | | --- | --- | | 4 | 0.4139 | | 5 | 0.4059 | | 6 | 0.3190 | | 7 | 0.3056 | | 8 | 0.3014 | | 9 | 0.2944 | | 10 | 0.2880 | | 11 | 0.3179 | | 12 | 0.3206 |
Best k=4; k-NN coherence 0.973 [/result]
[ml:sklearn-kmeans] library: scikit-learn script: cluster_reviews.py :: run_clustering(include_sentiment=False) features: topics + behavior only (no sentiment one-hot) [/ml]