Consumer clustering — repeat reviewer archetypes

Generated: 2026-06-26 13:55
Script: analyze_consumer_clusters.py


Report block convention

Each analytical section uses up to three executable layers:

Layer Tag Role
Results [result:…]…[/result] Markdown table — findings
Data SQL [babysql:polars]…[/babysql] Polars SQL on parquet — reproduce inputs/aggregates
ML / stats [ml:…]…[/ml] Python estimator spec (sklearn, statsmodels, scipy)

Prose How we found this (math) gives the formulas. Cached artifacts live in cluster_output/.


What this is based on

Each consumer = one user_pk with ≥3 contact-lens reviews on HKTVmall.

Per-user feature vector x ∈ ℝᵈ:

Block Features Source
Product mix Review-count-weighted shares of v2_modality, v2_lens_type, v2_pack_bucket, v2_brand_top product_id_to_attributes_v2.parquet
Sentiment behavior pos_rate, neg_rate, avg_rating, avg_m_score, avg_comment_len, n_reviews, n_stores cl_reviews + LLM enrichment
Topic behavior Mean flag rate for each of 15 LLM topics (comfort, price, repurchase, …) 6091e_cl_reviews_model_enriched.parquet

Method: z-score standardize → KMeans (k ∈ [3,8] by best silhouette) → k-NN coherence (k=15) validation.

  • 1,425 users clustered · k = 5 · silhouette 0.1824
  • k-NN label coherence: 0.848
  • Neighbors share dominant modality 96.1%; brand 79.4%

Not the same as: review-level clusters (cluster_reviews.py) or wearer cohorts (wearer_cohort.py).


Cluster profiles

[result:consumer_cluster_profiles] | Cluster | Label | Users | Reviews | Pos | Neg | Modality | Brand | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | biweekly / niche modality | 73 | 350 | 79.0% | 7.0% | 2-week | ACUVUE | | 1 | detractor cohort | 114 | 464 | 24.6% | 50.7% | daily | BAUSCHLOMB | | 2 | mainstream satisfied | 556 | 3564 | 87.2% | 2.8% | daily | BAUSCHLOMB | | 3 | OLENS monthly loyalists | 62 | 354 | 82.6% | 9.3% | monthly | OLENS | | 4 | mainstream satisfied | 620 | 3547 | 85.9% | 3.1% | daily | BAUSCHLOMB | [/result]

[babysql:polars] WITH joined AS ( SELECT r.user_pk, CASE WHEN m.m_sentiment = 'positive' THEN 1.0 ELSE 0.0 END AS is_positive, CASE WHEN m.m_sentiment = 'negative' THEN 1.0 ELSE 0.0 END AS is_negative, cl.rating, m.m_score, v.v2_modality, v.v2_lens_type, v.v2_brand FROM '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON r.review_id = cl.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id WHERE r.user_pk IS NOT NULL ) SELECT user_pk, COUNT() AS n_reviews, ROUND(AVG(is_positive), 3) AS pos_rate, ROUND(AVG(is_negative), 3) AS neg_rate, ROUND(AVG(rating), 2) AS avg_rating, ROUND(AVG(m_score), 2) AS avg_m_score FROM joined GROUP BY user_pk HAVING COUNT() >= 3 ORDER BY n_reviews DESC LIMIT 15 [/babysql]

[babysql:polars] SELECT reviewer_cluster AS cluster, COUNT(*) AS n_users, SUM(n_reviews) AS n_reviews, ROUND(AVG(pos_rate), 3) AS avg_pos_rate, ROUND(AVG(neg_rate), 3) AS avg_neg_rate, dom_modality, dom_brand FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' GROUP BY reviewer_cluster, dom_modality, dom_brand ORDER BY reviewer_cluster, n_users DESC [/babysql]

[ml:sklearn-kmeans] library: scikit-learn script: analyze_consumer_clusters.py (via analyze_product_people_clusters.cluster_reviewers) unit: user_pk, n_reviews >= 3 features: product mix shares + sentiment rates + topic prevalence artifact: cluster_output/reviewer_clusters.parquet [/ml]

How we found this (math): Same KMeans pipeline as product_people section (a); z-scored features, silhouette k selection, k-NN validation.


Wearer cohort mix by cluster

[result:wearer_mix_by_cluster] | Cluster | Wearer cohort | Reviews | Share | | :right | :left | :right | :right | | 4 | platform_repeat_reviewer | 1890 | 53.3% | | 4 | declared_existing_wearer | 1158 | 32.6% | | 4 | platform_first_review | 491 | 13.8% | | 4 | declared_new_wearer | 8 | 0.2% | | 3 | platform_repeat_reviewer | 218 | 61.6% | | 3 | declared_existing_wearer | 82 | 23.2% | | 3 | platform_first_review | 54 | 15.3% | | 2 | platform_repeat_reviewer | 2236 | 62.7% | | 2 | declared_existing_wearer | 861 | 24.2% | | 2 | platform_first_review | 464 | 13.0% | | 2 | declared_new_wearer | 3 | 0.1% | | 1 | platform_repeat_reviewer | 293 | 63.1% | | 1 | platform_first_review | 92 | 19.8% | | 1 | declared_existing_wearer | 77 | 16.6% | | 1 | declared_new_wearer | 2 | 0.4% | | 0 | platform_repeat_reviewer | 189 | 54.0% | | 0 | declared_existing_wearer | 97 | 27.7% | | 0 | platform_first_review | 63 | 18.0% | | 0 | declared_new_wearer | 1 | 0.3% | [/result]

[babysql:polars] SELECT rc.reviewer_cluster, w.wearer_cohort, COUNT() AS n_reviews, ROUND(100.0 * COUNT() / SUM(COUNT(*)) OVER (PARTITION BY rc.reviewer_cluster), 1) AS pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_tags_v2.parquet' AS w JOIN '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r ON w.review_id = r.review_id JOIN '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' AS rc ON r.user_pk = rc.user_pk GROUP BY rc.reviewer_cluster, w.wearer_cohort ORDER BY rc.reviewer_cluster, n_reviews DESC [/babysql]

[ml:wearer_cohort-rules] library: wearer_cohort.classify_review script: regex scoring + ordinal gates (not ML) note: Cross-tab of rule-based wearer cohort × KMeans reviewer cluster [/ml]


Journey behavior by cluster

[result:cluster_transition_stats] | Cluster | Pairs | Brand switch | Modality switch | Lens switch | Median days | | :right | :right | :right | :right | :right | :right | | 0 | 172 | 20.9% | 31.5% | 14.5% | 167 | | 1 | 220 | 45.5% | 6.5% | 25.0% | 190 | | 2 | 1843 | 23.1% | 2.8% | 12.5% | 98 | | 3 | 150 | 32.0% | 33.3% | 16.0% | 118 | | 4 | 1799 | 26.1% | 3.0% | 15.0% | 118 | [/result]

[babysql:polars] SELECT rc.reviewer_cluster, COUNT(*) AS n_pairs, ROUND(AVG(CASE WHEN p.switched_brand THEN 1.0 ELSE 0.0 END), 3) AS brand_switch_rate, ROUND(AVG(CASE WHEN p.switched_modality THEN 1.0 ELSE 0.0 END), 3) AS modality_switch_rate, ROUND(AVG(CASE WHEN p.switched_lens THEN 1.0 ELSE 0.0 END), 3) AS lens_switch_rate, MEDIAN(p.days_gap) AS median_days_gap FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_transition_pairs.parquet' AS p JOIN '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' AS rc ON p.user_pk = rc.user_pk GROUP BY rc.reviewer_cluster ORDER BY rc.reviewer_cluster [/babysql]

How we found this (math): Switch flags on consecutive pairs; cluster-level means.


Key takeaways

  • Cluster 1 (detractor cohort) — only 114 users but 51% negative rate; similar daily/spherical mix to satisfied clusters — segmentation is sentiment behavior, not product mix alone.
  • Cluster 3 concentrates OLENS + monthly users (OLENS / monthly) with 83% positive.
  • Brand-switch rate ranges 21% (cluster 0) to 45% (cluster 1); detractors revisit on median 190 days between reviews.
  • Use platform_first_review proxy for scale when declared-new wearers (n≈37) is too small for theme work.

Output files

File Description
cluster_output/reviewer_clusters.parquet user_pk × cluster + label
cluster_output/consumer_cluster_topics.csv Topic rates per cluster
cluster_output/consumer_cluster_wearer_mix.csv Wearer cohort shares
cluster_output/consumer_cluster_transitions.csv Switch rates per cluster
cluster_output/consumer_clustering_report.json Full payload