Clustering & k-NN Analysis Results — HKTVMall Contact Lens Reviews

Author: Composer (Cursor AI Agent)
Date: 2026-06-26
Script: cluster_reviews.py
Outputs: cluster_output/review_clusters.parquet, cluster_output/cluster_report.json

Sources

#	Source	Role
1	`cl_reviews.parquet`	Base reviews (deduped on `review_id`, keep last → 16,835 rows)
2	`6091e_cl_reviews_model_enriched.parquet`	LLM dimensions: sentiment, score, topics, complaints, praise
3	`initial_assessment_analytics.md`	Prior exploratory analytics informing feature design

Methodology

Feature engineering

Each review is encoded as a fixed-length vector, StandardScaler-normalized:

Block	Pass A (26 features)	Pass B (20 features)
Topics	14 canonical multi-hot flags	14 canonical multi-hot flags
Sentiment	5 one-hot + ordinal score	Excluded
Behavior	★ rating, LLM `m_score`, log(comment len), log(#complaints), log(#praise), #topics	Same

Canonical topics: comfort, quality, price, packaging, delivery, repurchase, prescription, brand, appearance, customer_service, eye_health, expiry, fit_sizing, other

Algorithms

Method	Purpose
KMeans	Primary segmentation; k selected by silhouette score (k = 4…12)
k-NN coherence (k=15)	Validation — fraction of 15 nearest neighbors sharing same cluster label
HDBSCAN	Sanity check on Pass A vectors (density-based; reported separately)

Two-pass design

Pass	Name	Features	Answers
A	Full VoC	Topics + sentiment + behavior	How does the customer feel and express it?
B	Topic archetypes	Topics + behavior only	What experience path are they on, regardless of sentiment label?

Pass A — Full VoC Segmentation

Best k: 5 (silhouette 0.469)
k-NN coherence (k=15): mean 0.997, median 1.000, 99.5% of reviews have ≥80% same-cluster neighbors

k-selection table

k	Silhouette	Calinski-Harabasz	Davies-Bouldin
4	0.4658	8514	1.191
5	0.4691	7532	0.960
6	0.4126	6990	1.327
7	0.4163	6994	1.142
8	0.4255	6678	1.159
9	0.4036	6112	1.403
10	0.3646	5839	1.283
11	0.3560	5356	1.527
12	0.3967	5093	1.534

Cluster profiles

Cluster	n	Share	Avg ★	Avg LLM score	Avg comment len	Archetype
0	6,759	40.1%	4.84	4.82	5.3	Silent promoters
1	7,054	41.9%	4.82	4.82	12.9	Articulate advocates
2	1,751	10.4%	1.78	1.76	45.9	Vocal detractors
3	1,263	7.5%	3.62	3.63	31.1	Ambivalent critics
4	8	0.05%	3.75	0.0	60.2	LLM parse failures (ignore)

Sentiment mix by cluster

Cluster	Sentiment breakdown
0	positive 87%, neutral 7%, mixed 4%, negative 2%
1	positive 100%
2	negative 99%, neutral 1%
3	mixed 98%, neutral 2%
4	unknown 100%

Cluster	Top 5 topics
0	eye_health, fit_sizing, comfort, expiry, brand
1	comfort, fit_sizing, appearance, quality, packaging
2	fit_sizing, eye_health, quality, packaging, comfort
3	fit_sizing, comfort, quality, appearance, packaging

Top brands by cluster

Cluster	Top brands (review count)
0	ACUVUE (2,158), 博士倫 (1,357), Bausch + Lomb (828)
1	ACUVUE (2,126), 博士倫 (1,231), OLENS (949)
2	ACUVUE (541), 博士倫 (360), OLENS (236)
3	ACUVUE (349), 博士倫 (230), OLENS (220)

Centroid signatures (top z-scored features)

Cluster	Defining features (z-score)
0	log_n_praise (+1.19), topic_other (+1.17), topic_prescription (+1.12), topic_expiry (+1.11), topic_brand (+1.06)
1	topic_expiry (−0.82), topic_other (−0.81), log_n_praise (−0.81), topic_eye_health (−0.78), topic_prescription (−0.77)
2	sent_negative (+2.80), n_topics (−2.29), sentiment_ord (−2.28), rating (−2.27), sent_positive (−1.81)
3	sent_mixed (+3.12), sent_positive (−1.81), log_comment_len (+1.40), n_topics (−1.36), m_score (+0.92)
4	sent_unknown (+45.86) — artifact

Prototype reviews (nearest to centroid)

Cluster 0 — Silent promoters

Brand	★	Comment
博士倫	5	GOOD
博士倫	5	回購。扺

Cluster 1 — Articulate advocates

Brand	★	Comment
OLENS	5	唔乾，超舒服
Candymagic	5	透氣度高，非常舒服

Cluster 2 — Vocal detractors

Brand	★	Comment
ACUVUE	2	一盒Con裏面有一半都係吉眼嘅
Bausch + Lomb	1	買左兩盒有一盒戴左上去矇茶茶已經換左幾隻戴都是一樣

Cluster 3 — Ambivalent critics

Brand	★	Comment
博士倫	4	送既隨機顏色隱形眼鏡係無色，唔係大眼仔
OLENS	4	幾靚，但未算好舒服戴得

Pass A interpretation

The market splits into ~82% promoters (clusters 0 + 1) with two behavioral modes:

Silent (40%) — ultra-short 5★ reviews, corrective-lens topic bias
Articulate (42%) — comfort/appearance-led praise, OLENS-heavy

~10% hard detractors (cluster 2) write long fit + eye_health complaints at ★1.8.
~8% ambivalent (cluster 3) are structurally mixed — trade-off language at ★3.6.

Clusters 0 and 1 share high stars but differ in verbosity and topic emphasis, not sentiment.

Pass B — Topic Archetypes (no sentiment features)

Best k: 4 (silhouette 0.414)
k-NN coherence (k=15): mean 0.973, median 1.000, 95.6% with ≥80% same-cluster neighbors

k-selection table

k	Silhouette	Calinski-Harabasz	Davies-Bouldin
4	0.4139	11104	1.556
5	0.4059	9055	1.798
6	0.3190	7858	1.666
7	0.3056	6847	1.934
8	0.3014	6230	1.849
9	0.2944	5623	2.012
10	0.2880	5232	1.996
11	0.3179	5009	1.862
12	0.3206	4721	1.825

Cluster profiles

Cluster	n	Share	Avg ★	Avg LLM score	Avg comment len	Archetype
1	6,785	40.3%	4.82	4.81	5.5	Corrective lens buyers
0	4,502	26.7%	4.81	4.81	14.4	Value loyalists
3	3,486	20.7%	4.73	4.74	13.4	Cosmetic enthusiasts
2	2,062	12.2%	1.70	1.69	47.7	Failure bundle

Sentiment mix by cluster (within topic archetypes)

Cluster	Sentiment breakdown
1 (corrective)	positive 86%, neutral 7%, mixed 4%, negative 2%
0 (value)	positive 87%, mixed 9%, negative 3%, neutral 1%
3 (cosmetic)	positive 89%, mixed 10%, negative 1%
2 (failure)	negative 74%, mixed 23%, positive 2%

Cluster	Top 5 topics
1 (corrective)	fit_sizing, eye_health, expiry, delivery, packaging
0 (value)	packaging, price, fit_sizing, delivery, repurchase
3 (cosmetic)	comfort, appearance, quality, fit_sizing, customer_service
2 (failure)	fit_sizing, packaging, quality, eye_health, comfort

Top brands by cluster

Cluster	Top brands (review count)
1 (corrective)	ACUVUE (2,169), 博士倫 (1,361), Bausch + Lomb (828)
0 (value)	ACUVUE (1,507), 博士倫 (895), Bausch + Lomb (523)
3 (cosmetic)	ACUVUE (863), OLENS (625), 博士倫 (511)
2 (failure)	ACUVUE (640), 博士倫 (412), OLENS (296)

Centroid signatures

Cluster	Defining features (z-score)
0 (value)	topic_comfort (−1.14), topic_quality (−1.02), topic_appearance (−1.00), n_topics (−0.95) — defined by absence of cosmetic topics
1 (corrective)	n_topics (+1.19), topic_other (+1.17), topic_prescription (+1.12), topic_expiry (+1.11), topic_brand (+1.06)
2 (failure)	rating (−2.34), m_score (−2.33), log_n_complaints (+1.64), log_n_praise (−1.43), log_comment_len (+1.32)
3 (cosmetic)	topic_packaging (−1.02), topic_price (−0.92), topic_delivery (−0.92), topic_repurchase (−0.84) — defined by absence of ops/value topics

Prototype reviews

Cluster 0 — Value loyalists

Brand	★	Comment
Bausch + Lomb	5	訂購方便
Johnson & Johnson	5	買一盒都有贈品,抵!

Cluster 1 — Corrective lens buyers

Brand	★	Comment
博士倫	5	抵，回購
博士倫	5	完好無缺

Cluster 2 — Failure bundle

Brand	★	Comment
TICON	1	買了兩盒，7.00度的是否也一起退回？
Delight	2	為什麼二盒打開不一樣的？跟新包裝也不一樣

Cluster 3 — Cosmetic enthusiasts

Brand	★	Comment
CooperVision	5	無得彈，非常滿意
Candymagic	5	平靚正日常使用

Pass B interpretation

Three distinct happy paths converge on ~4.8★ for different reasons:

Corrective (40%) — fit, prescription, expiry, brand trust
Value (27%) — price, repurchase, packaging, delivery
Cosmetic (21%) — comfort, appearance, quality; OLENS over-indexed

Failure bundle (12%) is topic-structural: fit + packaging + delivery co-occur with low ratings and high complaint volume. It captures 74% negative AND 23% mixed — the mixed reviews here are ops/fit problems expressed ambivalently, not price trade-offs.

Cross-tab: Pass A (tone) × Pass B (experience)

Rows = kmeans_cluster (Pass A), columns = topic_cluster (Pass B):

Pass A ↓ / Pass B →	B1 Corrective	B2 Failure	B3 Cosmetic	B0 Value
A0 Silent promo	6,750	1	8	0
A1 Articulate adv	0	34	3,079	3,941
A2 Detractors	26	1,531	43	151
A3 Ambivalent	9	488	356	410
A4 Parse fail	0	8	0	0

Cross-tab reads

Pattern	Finding
Silent promoters (A0)	99.9% corrective archetype — terse repurchase/fit reviews
Articulate advocates (A1)	Split 56% value / 44% cosmetic — two praise vocabularies
Detractors (A2)	87% failure bundle (1,531 / 1,751)
Ambivalent (A3)	Spread: 39% failure, 33% value, 28% cosmetic — true trade-off reviews

k-NN validation summary

Metric	Pass A (full VoC)	Pass B (topic archetypes)
Neighbors (k)	15	15
Mean same-cluster share	0.997	0.973
Median	1.000	1.000
Reviews with ≥80% coherent neighbors	99.5%	95.6%

Pass A coherence is near-perfect because sentiment one-hot features dominate separation. Pass B's slightly lower coherence (97.3%) reflects softer boundaries between value and cosmetic happy paths — expected and healthy.

HDBSCAN sanity check (Pass A vectors)

Metric	Value
Clusters found	22
Noise points	3,088 (18.3%)
Silhouette (excl. noise)	0.454

HDBSCAN over-segments for executive use. KMeans with k=4–5 yields interpretable archetypes; HDBSCAN useful for micro-SKU anomaly detection later.

Strategic actions by segment

Segment	Pass A	Pass B	Recommended action
Silent promoters	A0	B1	Low-touch retention; repurchase nudges
Articulate advocates	A1	B0 or B3	Loyalty programs (value) or cosmetic upsell (OLENS)
Vocal detractors	A2	B2	Fit guides, QC on packaging/expiry, eye health response
Ambivalent critics	A3	B2/B0/B3	Price-match policy; expectation setting on color lenses
Failure bundle	any	B2	Cross-functional ops — delivery, packaging, fit, returns

Reproducing

cd /home/sidmishra/ontologer/subprojects/review_sequencing_sentiment
source venv/bin/activate
python cluster_reviews.py

Outputs:

cluster_output/review_clusters.parquet — all reviews with kmeans_cluster, topic_cluster, hdbscan_cluster
cluster_output/cluster_report.json — full numeric results and prototypes

Join key: review_id → base cl_reviews.parquet or enrichment file.

End of clustering results.

Updated: 2026-06-26 13:55 via generate_clustering_results.py

Report block convention

Each analytical section uses up to three executable layers:

Layer	Tag	Role
Results	`[result:…]…[/result]`	Markdown table — findings
Data SQL	`[babysql:polars]…[/babysql]`	Polars SQL on parquet — reproduce inputs/aggregates
ML / stats	`[ml:…]…[/ml]`	Python estimator spec (sklearn, statsmodels, scipy)

Prose How we found this (math) gives the formulas. Cached artifacts live in cluster_output/.

Pass A — Full VoC Segmentation (auto-generated blocks)

[result:review_cluster_k_selection] | k | Silhouette | | --- | --- | | 4 | 0.4658 | | 5 | 0.4691 | | 6 | 0.4126 | | 7 | 0.4163 | | 8 | 0.4255 | | 9 | 0.4036 | | 10 | 0.3646 | | 11 | 0.3560 | | 12 | 0.3967 |

Cluster	N	%	Avg ★	Top topics
1	7054	41.9%	4.82	comfort, fit_sizing, appearance
0	6759	40.1%	4.84	fit_sizing, eye_health, comfort
2	1751	10.4%	1.78	fit_sizing, eye_health, quality
3	1263	7.5%	3.62	fit_sizing, comfort, quality
4	8	0.0%	3.75

k-NN coherence (k=15): 0.997 [/result]

[babysql:polars] SELECT k, silhouette, calinski_harabasz, davies_bouldin FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/review_k_selection.csv') ORDER BY k [/babysql]

[babysql:polars] SELECT kmeans_cluster AS cluster, COUNT() AS n, ROUND(AVG(rating), 2) AS avg_rating, ROUND(100.0 * SUM(CASE WHEN m_sentiment = 'positive' THEN 1 ELSE 0 END) / COUNT(), 1) AS pos_pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/review_clusters.parquet' GROUP BY kmeans_cluster ORDER BY cluster [/babysql]

[ml:sklearn-kmeans] library: scikit-learn script: cluster_reviews.py :: run_clustering unit: review (one row per review_id) features: 15 topic flags + sentiment one-hot + rating/m_score/verbosity preprocess: StandardScaler model: KMeans(n_init=10, random_state=42) selection: k in 4..12, maximize silhouette validation: knn_coherence(k=15) artifact: cluster_output/review_clusters.parquet [/ml]

How we found this (math): KMeans on z-scored features; silhouette s=(b−a)/max(a,b).

Pass B — Topic archetypes (auto-generated blocks)

[result:topic_cluster_k_selection] | k | Silhouette | | --- | --- | | 4 | 0.4139 | | 5 | 0.4059 | | 6 | 0.3190 | | 7 | 0.3056 | | 8 | 0.3014 | | 9 | 0.2944 | | 10 | 0.2880 | | 11 | 0.3179 | | 12 | 0.3206 |

Best k=4; k-NN coherence 0.973 [/result]

[ml:sklearn-kmeans] library: scikit-learn script: cluster_reviews.py :: run_clustering(include_sentiment=False) features: topics + behavior only (no sentiment one-hot) [/ml]

Clustering & k-NN Analysis Results — HKTVMall Contact Lens Reviews

Sources

Methodology

Feature engineering

Algorithms

Two-pass design

Pass A — Full VoC Segmentation

k-selection table

Cluster profiles

Sentiment mix by cluster

Top topics by cluster

Top brands by cluster

Centroid signatures (top z-scored features)

Prototype reviews (nearest to centroid)

Pass A interpretation

Pass B — Topic Archetypes (no sentiment features)

k-selection table

Cluster profiles

Sentiment mix by cluster (within topic archetypes)

Top topics by cluster

Top brands by cluster

Centroid signatures

Prototype reviews

Pass B interpretation

Cross-tab: Pass A (tone) × Pass B (experience)

Cross-tab reads

k-NN validation summary

HDBSCAN sanity check (Pass A vectors)

Strategic actions by segment

Reproducing

Report block convention

Pass A — Full VoC Segmentation (auto-generated blocks)

Pass B — Topic archetypes (auto-generated blocks)