Consumer clustering — repeat reviewer archetypes
Generated: 2026-06-26 13:55
Script: analyze_consumer_clusters.py
Report block convention
Each analytical section uses up to three executable layers:
| Layer | Tag | Role |
|---|---|---|
| Results | [result:…]…[/result] |
Markdown table — findings |
| Data SQL | [babysql:polars]…[/babysql] |
Polars SQL on parquet — reproduce inputs/aggregates |
| ML / stats | [ml:…]…[/ml] |
Python estimator spec (sklearn, statsmodels, scipy) |
Prose How we found this (math) gives the formulas. Cached artifacts live in cluster_output/.
What this is based on
Each consumer = one user_pk with ≥3 contact-lens reviews on HKTVmall.
Per-user feature vector x ∈ ℝᵈ:
| Block | Features | Source |
|---|---|---|
| Product mix | Review-count-weighted shares of v2_modality, v2_lens_type, v2_pack_bucket, v2_brand_top |
product_id_to_attributes_v2.parquet |
| Sentiment behavior | pos_rate, neg_rate, avg_rating, avg_m_score, avg_comment_len, n_reviews, n_stores |
cl_reviews + LLM enrichment |
| Topic behavior | Mean flag rate for each of 15 LLM topics (comfort, price, repurchase, …) |
6091e_cl_reviews_model_enriched.parquet |
Method: z-score standardize → KMeans (k ∈ [3,8] by best silhouette) → k-NN coherence (k=15) validation.
- 1,425 users clustered · k = 5 · silhouette 0.1824
- k-NN label coherence: 0.848
- Neighbors share dominant modality 96.1%; brand 79.4%
Not the same as: review-level clusters (cluster_reviews.py) or wearer cohorts (wearer_cohort.py).
Cluster profiles
[result:consumer_cluster_profiles] | Cluster | Label | Users | Reviews | Pos | Neg | Modality | Brand | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | biweekly / niche modality | 73 | 350 | 79.0% | 7.0% | 2-week | ACUVUE | | 1 | detractor cohort | 114 | 464 | 24.6% | 50.7% | daily | BAUSCHLOMB | | 2 | mainstream satisfied | 556 | 3564 | 87.2% | 2.8% | daily | BAUSCHLOMB | | 3 | OLENS monthly loyalists | 62 | 354 | 82.6% | 9.3% | monthly | OLENS | | 4 | mainstream satisfied | 620 | 3547 | 85.9% | 3.1% | daily | BAUSCHLOMB | [/result]
[babysql:polars] WITH joined AS ( SELECT r.user_pk, CASE WHEN m.m_sentiment = 'positive' THEN 1.0 ELSE 0.0 END AS is_positive, CASE WHEN m.m_sentiment = 'negative' THEN 1.0 ELSE 0.0 END AS is_negative, cl.rating, m.m_score, v.v2_modality, v.v2_lens_type, v.v2_brand FROM '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON r.review_id = cl.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id WHERE r.user_pk IS NOT NULL ) SELECT user_pk, COUNT() AS n_reviews, ROUND(AVG(is_positive), 3) AS pos_rate, ROUND(AVG(is_negative), 3) AS neg_rate, ROUND(AVG(rating), 2) AS avg_rating, ROUND(AVG(m_score), 2) AS avg_m_score FROM joined GROUP BY user_pk HAVING COUNT() >= 3 ORDER BY n_reviews DESC LIMIT 15 [/babysql]
[babysql:polars] SELECT reviewer_cluster AS cluster, COUNT(*) AS n_users, SUM(n_reviews) AS n_reviews, ROUND(AVG(pos_rate), 3) AS avg_pos_rate, ROUND(AVG(neg_rate), 3) AS avg_neg_rate, dom_modality, dom_brand FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' GROUP BY reviewer_cluster, dom_modality, dom_brand ORDER BY reviewer_cluster, n_users DESC [/babysql]
[ml:sklearn-kmeans] library: scikit-learn script: analyze_consumer_clusters.py (via analyze_product_people_clusters.cluster_reviewers) unit: user_pk, n_reviews >= 3 features: product mix shares + sentiment rates + topic prevalence artifact: cluster_output/reviewer_clusters.parquet [/ml]
How we found this (math): Same KMeans pipeline as product_people section (a); z-scored features, silhouette k selection, k-NN validation.
Wearer cohort mix by cluster
[result:wearer_mix_by_cluster] | Cluster | Wearer cohort | Reviews | Share | | :right | :left | :right | :right | | 4 | platform_repeat_reviewer | 1890 | 53.3% | | 4 | declared_existing_wearer | 1158 | 32.6% | | 4 | platform_first_review | 491 | 13.8% | | 4 | declared_new_wearer | 8 | 0.2% | | 3 | platform_repeat_reviewer | 218 | 61.6% | | 3 | declared_existing_wearer | 82 | 23.2% | | 3 | platform_first_review | 54 | 15.3% | | 2 | platform_repeat_reviewer | 2236 | 62.7% | | 2 | declared_existing_wearer | 861 | 24.2% | | 2 | platform_first_review | 464 | 13.0% | | 2 | declared_new_wearer | 3 | 0.1% | | 1 | platform_repeat_reviewer | 293 | 63.1% | | 1 | platform_first_review | 92 | 19.8% | | 1 | declared_existing_wearer | 77 | 16.6% | | 1 | declared_new_wearer | 2 | 0.4% | | 0 | platform_repeat_reviewer | 189 | 54.0% | | 0 | declared_existing_wearer | 97 | 27.7% | | 0 | platform_first_review | 63 | 18.0% | | 0 | declared_new_wearer | 1 | 0.3% | [/result]
[babysql:polars] SELECT rc.reviewer_cluster, w.wearer_cohort, COUNT() AS n_reviews, ROUND(100.0 * COUNT() / SUM(COUNT(*)) OVER (PARTITION BY rc.reviewer_cluster), 1) AS pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_tags_v2.parquet' AS w JOIN '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r ON w.review_id = r.review_id JOIN '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' AS rc ON r.user_pk = rc.user_pk GROUP BY rc.reviewer_cluster, w.wearer_cohort ORDER BY rc.reviewer_cluster, n_reviews DESC [/babysql]
[ml:wearer_cohort-rules] library: wearer_cohort.classify_review script: regex scoring + ordinal gates (not ML) note: Cross-tab of rule-based wearer cohort × KMeans reviewer cluster [/ml]
Journey behavior by cluster
[result:cluster_transition_stats] | Cluster | Pairs | Brand switch | Modality switch | Lens switch | Median days | | :right | :right | :right | :right | :right | :right | | 0 | 172 | 20.9% | 31.5% | 14.5% | 167 | | 1 | 220 | 45.5% | 6.5% | 25.0% | 190 | | 2 | 1843 | 23.1% | 2.8% | 12.5% | 98 | | 3 | 150 | 32.0% | 33.3% | 16.0% | 118 | | 4 | 1799 | 26.1% | 3.0% | 15.0% | 118 | [/result]
[babysql:polars] SELECT rc.reviewer_cluster, COUNT(*) AS n_pairs, ROUND(AVG(CASE WHEN p.switched_brand THEN 1.0 ELSE 0.0 END), 3) AS brand_switch_rate, ROUND(AVG(CASE WHEN p.switched_modality THEN 1.0 ELSE 0.0 END), 3) AS modality_switch_rate, ROUND(AVG(CASE WHEN p.switched_lens THEN 1.0 ELSE 0.0 END), 3) AS lens_switch_rate, MEDIAN(p.days_gap) AS median_days_gap FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_transition_pairs.parquet' AS p JOIN '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' AS rc ON p.user_pk = rc.user_pk GROUP BY rc.reviewer_cluster ORDER BY rc.reviewer_cluster [/babysql]
How we found this (math): Switch flags on consecutive pairs; cluster-level means.
Key takeaways
- Cluster 1 (detractor cohort) — only 114 users but 51% negative rate; similar daily/spherical mix to satisfied clusters — segmentation is sentiment behavior, not product mix alone.
- Cluster 3 concentrates OLENS + monthly users (OLENS / monthly) with 83% positive.
- Brand-switch rate ranges 21% (cluster 0) to 45% (cluster 1); detractors revisit on median 190 days between reviews.
- Use platform_first_review proxy for scale when declared-new wearers (n≈37) is too small for theme work.
Output files
| File | Description |
|---|---|
cluster_output/reviewer_clusters.parquet |
user_pk × cluster + label |
cluster_output/consumer_cluster_topics.csv |
Topic rates per cluster |
cluster_output/consumer_cluster_wearer_mix.csv |
Wearer cohort shares |
cluster_output/consumer_cluster_transitions.csv |
Switch rates per cluster |
cluster_output/consumer_clustering_report.json |
Full payload |