Consumer clustering — repeat reviewer archetypes

Generated: 2026-06-26 13:55
Script: analyze_consumer_clusters.py

Report block convention

Each analytical section uses up to three executable layers:

Layer	Tag	Role
Results	`[result:…]…[/result]`	Markdown table — findings
Data SQL	`[babysql:polars]…[/babysql]`	Polars SQL on parquet — reproduce inputs/aggregates
ML / stats	`[ml:…]…[/ml]`	Python estimator spec (sklearn, statsmodels, scipy)

Prose How we found this (math) gives the formulas. Cached artifacts live in cluster_output/.

What this is based on

Each consumer = one user_pk with ≥3 contact-lens reviews on HKTVmall.

Per-user feature vector x ∈ ℝᵈ:

Block	Features	Source
Product mix	Review-count-weighted shares of `v2_modality`, `v2_lens_type`, `v2_pack_bucket`, `v2_brand_top`	`product_id_to_attributes_v2.parquet`
Sentiment behavior	`pos_rate`, `neg_rate`, `avg_rating`, `avg_m_score`, `avg_comment_len`, `n_reviews`, `n_stores`	`cl_reviews` + LLM enrichment
Topic behavior	Mean flag rate for each of 15 LLM topics (`comfort`, `price`, `repurchase`, …)	`6091e_cl_reviews_model_enriched.parquet`

Method: z-score standardize → KMeans (k ∈ [3,8] by best silhouette) → k-NN coherence (k=15) validation.

1,425 users clustered · k = 5 · silhouette 0.1824
k-NN label coherence: 0.848
Neighbors share dominant modality 96.1%; brand 79.4%

Not the same as: review-level clusters (cluster_reviews.py) or wearer cohorts (wearer_cohort.py).

Cluster profiles

[result:consumer_cluster_profiles] | Cluster | Label | Users | Reviews | Pos | Neg | Modality | Brand | | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | biweekly / niche modality | 73 | 350 | 79.0% | 7.0% | 2-week | ACUVUE | | 1 | detractor cohort | 114 | 464 | 24.6% | 50.7% | daily | BAUSCHLOMB | | 2 | mainstream satisfied | 556 | 3564 | 87.2% | 2.8% | daily | BAUSCHLOMB | | 3 | OLENS monthly loyalists | 62 | 354 | 82.6% | 9.3% | monthly | OLENS | | 4 | mainstream satisfied | 620 | 3547 | 85.9% | 3.1% | daily | BAUSCHLOMB | [/result]

[babysql:polars] WITH joined AS ( SELECT r.user_pk, CASE WHEN m.m_sentiment = 'positive' THEN 1.0 ELSE 0.0 END AS is_positive, CASE WHEN m.m_sentiment = 'negative' THEN 1.0 ELSE 0.0 END AS is_negative, cl.rating, m.m_score, v.v2_modality, v.v2_lens_type, v.v2_brand FROM '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON r.review_id = cl.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id WHERE r.user_pk IS NOT NULL ) SELECT user_pk, COUNT() AS n_reviews, ROUND(AVG(is_positive), 3) AS pos_rate, ROUND(AVG(is_negative), 3) AS neg_rate, ROUND(AVG(rating), 2) AS avg_rating, ROUND(AVG(m_score), 2) AS avg_m_score FROM joined GROUP BY user_pk HAVING COUNT() >= 3 ORDER BY n_reviews DESC LIMIT 15 [/babysql]

[babysql:polars] SELECT reviewer_cluster AS cluster, COUNT(*) AS n_users, SUM(n_reviews) AS n_reviews, ROUND(AVG(pos_rate), 3) AS avg_pos_rate, ROUND(AVG(neg_rate), 3) AS avg_neg_rate, dom_modality, dom_brand FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' GROUP BY reviewer_cluster, dom_modality, dom_brand ORDER BY reviewer_cluster, n_users DESC [/babysql]

[ml:sklearn-kmeans] library: scikit-learn script: analyze_consumer_clusters.py (via analyze_product_people_clusters.cluster_reviewers) unit: user_pk, n_reviews >= 3 features: product mix shares + sentiment rates + topic prevalence artifact: cluster_output/reviewer_clusters.parquet [/ml]

How we found this (math): Same KMeans pipeline as product_people section (a); z-scored features, silhouette k selection, k-NN validation.

Wearer cohort mix by cluster

[result:wearer_mix_by_cluster] | Cluster | Wearer cohort | Reviews | Share | | :right | :left | :right | :right | | 4 | platform_repeat_reviewer | 1890 | 53.3% | | 4 | declared_existing_wearer | 1158 | 32.6% | | 4 | platform_first_review | 491 | 13.8% | | 4 | declared_new_wearer | 8 | 0.2% | | 3 | platform_repeat_reviewer | 218 | 61.6% | | 3 | declared_existing_wearer | 82 | 23.2% | | 3 | platform_first_review | 54 | 15.3% | | 2 | platform_repeat_reviewer | 2236 | 62.7% | | 2 | declared_existing_wearer | 861 | 24.2% | | 2 | platform_first_review | 464 | 13.0% | | 2 | declared_new_wearer | 3 | 0.1% | | 1 | platform_repeat_reviewer | 293 | 63.1% | | 1 | platform_first_review | 92 | 19.8% | | 1 | declared_existing_wearer | 77 | 16.6% | | 1 | declared_new_wearer | 2 | 0.4% | | 0 | platform_repeat_reviewer | 189 | 54.0% | | 0 | declared_existing_wearer | 97 | 27.7% | | 0 | platform_first_review | 63 | 18.0% | | 0 | declared_new_wearer | 1 | 0.3% | [/result]

[babysql:polars] SELECT rc.reviewer_cluster, w.wearer_cohort, COUNT() AS n_reviews, ROUND(100.0 * COUNT() / SUM(COUNT(*)) OVER (PARTITION BY rc.reviewer_cluster), 1) AS pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_tags_v2.parquet' AS w JOIN '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r ON w.review_id = r.review_id JOIN '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' AS rc ON r.user_pk = rc.user_pk GROUP BY rc.reviewer_cluster, w.wearer_cohort ORDER BY rc.reviewer_cluster, n_reviews DESC [/babysql]

[ml:wearer_cohort-rules] library: wearer_cohort.classify_review script: regex scoring + ordinal gates (not ML) note: Cross-tab of rule-based wearer cohort × KMeans reviewer cluster [/ml]

Journey behavior by cluster

[result:cluster_transition_stats] | Cluster | Pairs | Brand switch | Modality switch | Lens switch | Median days | | :right | :right | :right | :right | :right | :right | | 0 | 172 | 20.9% | 31.5% | 14.5% | 167 | | 1 | 220 | 45.5% | 6.5% | 25.0% | 190 | | 2 | 1843 | 23.1% | 2.8% | 12.5% | 98 | | 3 | 150 | 32.0% | 33.3% | 16.0% | 118 | | 4 | 1799 | 26.1% | 3.0% | 15.0% | 118 | [/result]

[babysql:polars] SELECT rc.reviewer_cluster, COUNT(*) AS n_pairs, ROUND(AVG(CASE WHEN p.switched_brand THEN 1.0 ELSE 0.0 END), 3) AS brand_switch_rate, ROUND(AVG(CASE WHEN p.switched_modality THEN 1.0 ELSE 0.0 END), 3) AS modality_switch_rate, ROUND(AVG(CASE WHEN p.switched_lens THEN 1.0 ELSE 0.0 END), 3) AS lens_switch_rate, MEDIAN(p.days_gap) AS median_days_gap FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_transition_pairs.parquet' AS p JOIN '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet' AS rc ON p.user_pk = rc.user_pk GROUP BY rc.reviewer_cluster ORDER BY rc.reviewer_cluster [/babysql]

How we found this (math): Switch flags on consecutive pairs; cluster-level means.

Key takeaways

Cluster 1 (detractor cohort) — only 114 users but 51% negative rate; similar daily/spherical mix to satisfied clusters — segmentation is sentiment behavior, not product mix alone.
Cluster 3 concentrates OLENS + monthly users (OLENS / monthly) with 83% positive.
Brand-switch rate ranges 21% (cluster 0) to 45% (cluster 1); detractors revisit on median 190 days between reviews.
Use platform_first_review proxy for scale when declared-new wearers (n≈37) is too small for theme work.

Output files

File	Description
`cluster_output/reviewer_clusters.parquet`	`user_pk` × cluster + label
`cluster_output/consumer_cluster_topics.csv`	Topic rates per cluster
`cluster_output/consumer_cluster_wearer_mix.csv`	Wearer cohort shares
`cluster_output/consumer_cluster_transitions.csv`	Switch rates per cluster
`cluster_output/consumer_clustering_report.json`	Full payload