Product-anchored people & segment analytics

Generated: 2026-06-26 13:56
Script: analyze_product_people_clusters.py
Recent window: 2026-03, 2026-04, 2026-05 vs baseline 2025-03–2026-02

Report block convention

Each analytical section uses up to three executable layers:

| Layer | Tag | Role |
| --- | --- | --- |
| Results | `[result:…]…[/result]` | Markdown table of findings |
| Data SQL | `[babysql:polars]…[/babysql]` | Polars SQL on parquet to reproduce inputs and aggregates |
| ML / stats | `[ml:…]…[/ml]` | Python estimator spec (sklearn, statsmodels, scipy) |

The prose paragraphs headed "How we found this (math)" give the formulas. Cached artifacts live in cluster_output/.

Executive summary

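Reviewer archetypes split by sentiment: Cluster 0 is the detractor cohort (21.3% positive, 54.0% negative), versus the two mainstream satisfied clusters (1 and 3, at 85.2% and 87.3% positive) with similar daily/spherical mixes.
Recent volume growth concentrates in ACUVUE (+17.00 reviews/mo), multifocal (+7.08), and 30-packs (+6.58), but positive rates fell on daily (-8.4 pp), spherical (-9.3 pp), 30-pack (-9.0 pp), and color lenses (-14.2 pp), each with p<0.05 (see the trend table in section (b)).
Store H7249002 is a volume outlier (+17.17 reviews/mo, z=3.16); no other store exceeds |z|=1.5, so platform review flow is store-skewed.
Positives and negatives have different drivers: with modality, lens type, pack bucket, and brand entered together, some brand terms still shift the odds, e.g. CANDYMAGIC (OR 0.773 for a positive review, 1.314 for a negative one) and COOPERVISION (OR 0.753 for a negative review); see the logit tables in section (e).
Topic rates differ by segment (χ² p<0.05); see the topic heterogeneity query in section (e).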

(a) Product-based reviewer clusters (product mix × sentiment behavior)

1,425 repeat reviewers (≥3 reviews) → 5 clusters; silhouette 0.1841; k-NN coherence 0.851.

[result:reviewer_cluster_profiles]
| Cluster | Label | Users | Reviews | Pos | Neg | Modality | Brand |
| :right | :left | :right | :right | :right | :right | :left | :left |
| 0 | detractor cohort | 103 | 417 | 21.3% | 54.0% | daily(98), monthly(5) | BAUSCHLOMB(28), ACUVUE(26), OLENS(23) |
| 1 | mainstream satisfied | 630 | 3591 | 85.2% | 3.6% | daily(623), ?(5), monthly(2) | BAUSCHLOMB(221), ACUVUE(193), OLENS(67) |
| 2 | OLENS-heavy advocates | 64 | 361 | 83.6% | 7.7% | monthly(45), daily(19) | OLENS(55), DELIGHT(5), FRESHKON(3) |
| 3 | mainstream satisfied | 555 | 3560 | 87.3% | 2.7% | daily(544), monthly(4), ?(4) | BAUSCHLOMB(198), ACUVUE(185), COOPERVISION(46) |
| 4 | mixed / niche | 73 | 350 | 79.0% | 7.0% | 2-week(43), monthly(14), daily(9) | ACUVUE(55), COOPERVISION(10), BAUSCHLOMB(4) |
[/result]

[babysql:polars]
WITH joined AS (
  SELECT
    r.user_pk,
    CASE WHEN m.m_sentiment = 'positive' THEN 1.0 ELSE 0.0 END AS is_positive,
    CASE WHEN m.m_sentiment = 'negative' THEN 1.0 ELSE 0.0 END AS is_negative,
    cl.rating,
    m.m_score,
    v.v2_modality,
    v.v2_lens_type,
    v.v2_brand
  FROM '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON r.review_id = cl.review_id
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id
  WHERE r.user_pk IS NOT NULL
)
SELECT
  user_pk,
  COUNT(*) AS n_reviews,
  ROUND(AVG(is_positive), 3) AS pos_rate,
  ROUND(AVG(is_negative), 3) AS neg_rate,
  ROUND(AVG(rating), 2) AS avg_rating,
  ROUND(AVG(m_score), 2) AS avg_m_score
FROM joined
GROUP BY user_pk
HAVING COUNT(*) >= 3
ORDER BY n_reviews DESC
LIMIT 15
[/babysql]

[babysql:polars]
SELECT
  reviewer_cluster AS cluster,
  COUNT(*) AS n_users,
  SUM(n_reviews) AS n_reviews,
  ROUND(AVG(pos_rate), 3) AS avg_pos_rate,
  ROUND(AVG(neg_rate), 3) AS avg_neg_rate,
  dom_modality,
  dom_brand
FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/reviewer_clusters.parquet'
GROUP BY reviewer_cluster, dom_modality, dom_brand
ORDER BY reviewer_cluster, n_users DESC
[/babysql]

[ml:sklearn-kmeans]
library: scikit-learn
script: analyze_product_people_clusters.py :: cluster_reviewers
unit: user_pk with n_reviews >= 3
features:
  - review-weighted shares: v2_modality, v2_lens_type, v2_pack_bucket, v2_brand_top
  - pos_rate, neg_rate, avg_rating, avg_m_score, avg_comment_len, n_stores
  - mean topic flags (15 LLM topics)
preprocess: StandardScaler
model: KMeans(n_init=10, random_state=42)
selection: k in 3..8, maximize silhouette_score
validation: knn_coherence(k=15, metric=euclidean)
artifact: cluster_output/reviewer_clusters.parquet
[/ml]

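How we found this (math): Build x ∈ ℝᵈ per user and standardize each feature, z = (x−μ)/σ. KMeans minimizes Σᵢ‖zᵢ−μ_c(i)‖², where μ_c(i) is the centroid of the cluster assigned to user i; pick k maximizing the mean silhouette s = (b−a)/max(a,b), with a the mean distance from a point to the other members of its own cluster and b the mean distance to the nearest other cluster. k-NN coherence = (1/N)Σᵢ |{j∈Nₖ(i): label_j=label_i}|/k.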
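The standardization and silhouette criterion can be checked by hand on toy data. The following is a minimal stdlib-only sketch, not the scikit-learn code the script calls: it standardizes features with the population standard deviation (as StandardScaler does) and computes the mean silhouette for a given labeling. The names `standardize` and `mean_silhouette` and the toy points are illustrative assumptions.

```python
import math
import statistics


def standardize(rows):
    """Column-wise z = (x - mu) / sigma using the population std."""
    cols = list(zip(*rows))
    mus = [statistics.fmean(c) for c in cols]
    # Guard against constant columns: divide by 1.0 instead of 0.0.
    sigmas = [statistics.pstdev(c) or 1.0 for c in cols]
    return [
        tuple((x - m) / s for x, m, s in zip(row, mus, sigmas))
        for row in rows
    ]


def mean_silhouette(points, labels):
    """Mean of s_i = (b_i - a_i) / max(a_i, b_i); needs at least 2 clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: silhouette taken as 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = statistics.fmean(math.dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            statistics.fmean(math.dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return statistics.fmean(scores)


# Toy data (hypothetical): two tight groups far apart.
toy = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 11), (10, 11)]
z = standardize(toy)
good = mean_silhouette(z, [0, 0, 0, 1, 1, 1])  # labels follow the groups
bad = mean_silhouette(z, [0, 1, 0, 1, 0, 1])   # labels cut across the groups
```

A labeling that follows the two groups scores much higher than one that cuts across them, which is the property the k in 3..8 search relies on.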

(b) Greatest volume & sentiment shifts by product dimension

[result:product_dimension_trends]
| Dimension | Value | Δ vol/mo | Growth | Δ pos (pp) | p (pos shift) |
| :left | :left | :right | :right | :right | :right |
| v2_brand_top | ACUVUE | 17.00 | 26.6% | -6.4 | 0.0402 |
| v2_brand_top | ALCON | 0.75 | 12.0% | +1.5 | 0.887 |
| v2_brand_top | CANDYMAGIC | -0.33 | -6.7% | -23.3 | 0.0894 |
| v2_brand_top | DELIGHT | -0.67 | -33.3% | +20.8 | nan |
| v2_brand_top | FRESHKON | -0.83 | -45.5% | -30.3 | nan |
| v2_modality | monthly | 0.33 | 4.8% | -8.5 | 0.464 |
| v2_modality | 2-week | -0.33 | -33.3% | +58.3 | nan |
| v2_modality | daily | -3.00 | -2.6% | -8.4 | 0.00251 |
| v2_lens_type | multifocal | 7.08 | 35.0% | +0.7 | 0.898 |
| v2_lens_type | spherical | 1.17 | 1.2% | -9.3 | 0.00136 |
| v2_lens_type | toric | 0.83 | 33.3% | -23.3 | 0.0533 |
| v2_lens_type | None | -0.08 | -100.0% | +0.0 | nan |
| v2_lens_type | color | -5.00 | -16.0% | -14.2 | 0.0207 |
| v2_pack_bucket | 30 | 6.58 | 6.7% | -9.0 | 0.00198 |
| v2_pack_bucket | 6 | 3.58 | 24.9% | +3.2 | 0.567 |
| v2_pack_bucket | 2 | 0.42 | 7.5% | -20.8 | 0.115 |
| v2_pack_bucket | 10 | 0.42 | 5.5% | -16.2 | 0.134 |
| v2_pack_bucket | other | -7.00 | -30.4% | -3.4 | 0.617 |
[/result]

[babysql:polars]
WITH base AS (
  SELECT
    v.v2_brand,
    LEFT(cl.date, 7) AS ym,
    CASE WHEN m.m_sentiment = 'positive' THEN 1 ELSE 0 END AS is_positive
  FROM '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id
  WHERE v.v2_brand IS NOT NULL
),
recent AS (
  SELECT v2_brand, COUNT(*) AS n_recent, AVG(is_positive) AS pos_recent
  FROM base
  WHERE ym IN ('2026-03', '2026-04', '2026-05')
  GROUP BY v2_brand
),
baseline AS (
  SELECT v2_brand, COUNT(*) AS n_base, AVG(is_positive) AS pos_base
  FROM base
  WHERE ym >= '2025-03' AND ym <= '2026-02'
  GROUP BY v2_brand
)
SELECT
  r.v2_brand,
  r.n_recent,
  ROUND(r.n_recent / 3.0, 2) AS rate_recent_mo,
  b.n_base,
  ROUND(b.n_base / 12.0, 2) AS rate_base_mo,
  ROUND(r.n_recent / 3.0 - b.n_base / 12.0, 2) AS delta_vol_mo,
  ROUND(100.0 * (r.pos_recent - b.pos_base), 1) AS delta_pos_pp
FROM recent r
JOIN baseline b ON r.v2_brand = b.v2_brand
ORDER BY delta_vol_mo DESC
LIMIT 10
[/babysql]

[babysql:polars]
WITH base AS (
  SELECT
    v.v2_modality,
    LEFT(cl.date, 7) AS ym,
    CASE WHEN m.m_sentiment = 'positive' THEN 1 ELSE 0 END AS is_positive
  FROM '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id
  WHERE v.v2_modality IS NOT NULL
),
recent AS (
  SELECT v2_modality, COUNT(*) AS n_recent, AVG(is_positive) AS pos_recent
  FROM base
  WHERE ym IN ('2026-03', '2026-04', '2026-05')
  GROUP BY v2_modality
),
baseline AS (
  SELECT v2_modality, COUNT(*) AS n_base, AVG(is_positive) AS pos_base
  FROM base
  WHERE ym >= '2025-03' AND ym <= '2026-02'
  GROUP BY v2_modality
)
SELECT
  r.v2_modality,
  ROUND(r.n_recent / 3.0 - b.n_base / 12.0, 2) AS delta_vol_mo,
  ROUND(100.0 * (r.pos_recent - b.pos_base), 1) AS delta_pos_pp
FROM recent r
JOIN baseline b ON r.v2_modality = b.v2_modality
ORDER BY delta_vol_mo DESC
[/babysql]

[ml:scipy-two-proportion]
library: scipy.stats + polars
script: analyze_product_people_clusters.py :: dim_trends, two_prop_z
windows: recent=2026-03,2026-04,2026-05 vs baseline 2025-03..2026-02
volume: delta_vol_mo = n_recent/3 - n_base/12
sentiment_test: two-proportion z on positive rate; p = 2(1-Φ(|z|))
artifact: cluster_output/trends_*.csv
[/ml]

How we found this (math): λ̂_r = n_recent/3; λ̂_b = n_base/12; Δvol = λ̂_r−λ̂_b. Positive-rate shift: z = (p̂_r−p̂_b)/√(p̂(1−p̂)(1/n_r+1/n_b)), where p̂ is the pooled positive rate over both windows, and p = 2(1−Φ(|z|)).
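The formulas above can be sketched in a few lines of stdlib Python. This is an illustration under stated assumptions, not the script's `two_prop_z`: the argument names are hypothetical, and returning NaN when the standard error is zero is a choice made here, not a documented explanation of the `nan` rows in the trend table.

```python
import math


def delta_vol_per_month(n_recent, n_base, recent_months=3, base_months=12):
    """Difference in monthly review rates between the two windows."""
    return n_recent / recent_months - n_base / base_months


def two_prop_z(x_recent, n_recent, x_base, n_base):
    """Pooled two-proportion z-test; returns (z, two-sided p)."""
    if n_recent == 0 or n_base == 0:
        return math.nan, math.nan
    p_r = x_recent / n_recent
    p_b = x_base / n_base
    pooled = (x_recent + x_base) / (n_recent + n_base)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_recent + 1 / n_base))
    if se == 0:
        # Assumption: no test is defined when the pooled rate is 0 or 1.
        return math.nan, math.nan
    z = (p_r - p_b) / se
    # erfc(|z| / sqrt(2)) is identical to 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical counts: 30 of 100 positive recently vs 50 of 100 at baseline.
z_stat, p_value = two_prop_z(30, 100, 50, 100)
```

With these made-up counts the recent positive rate is 20 pp lower and the test rejects equality at the 1% level; small cells would need more care than this sketch gives them.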

(c) Store effects & recent anomalies

Stores with ≥15 baseline reviews: 18.

[result:store_volume_anomalies]
| Store | Δ reviews/mo | z | Rel growth | Δ pos (pp) |
| :left | :right | :right | :right | :right |
| H7249002 | 17.17 | 3.16 | 49% | +3.7 |
| H1053001 | 4.75 | 0.87 | 112% | -20.7 |
| H8617001 | 3.25 | 0.59 | 57% | -19.2 |
| B1048002 | 3.08 | 0.56 | 73% | -23.2 |
| B1099001 | 1.33 | 0.24 | 100% | -12.0 |
| H9723001 | 1.17 | 0.20 | 14% | -10.6 |
| B1412001 | 0.83 | 0.14 | 33% | -23.8 |
| H6878002 | 0.75 | 0.13 | 33% | -21.7 |

Largest drops:

| Store | Δ reviews/mo | z |
| :left | :right | :right |
| H9835001 | -7.50 | -1.40 |
| H5711001 | -6.58 | -1.23 |
| H6878001 | -4.08 | -0.77 |
| S2830001 | -3.92 | -0.73 |
| H7385002 | -3.75 | -0.70 |
[/result]

[babysql:polars]
SELECT
  r.ref_store_code,
  LEFT(cl.date, 7) AS ym,
  COUNT(*) AS n,
  ROUND(AVG(CASE WHEN m.m_sentiment = 'positive' THEN 1.0 ELSE 0.0 END), 3) AS pos_rate
FROM '/home/sidmishra/ontologer/data/raw/hk_market/hktvmall/2026-06-19/reviews.parquet' AS r
JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON r.review_id = cl.review_id
JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id
WHERE r.ref_store_code IS NOT NULL
GROUP BY r.ref_store_code, ym
ORDER BY ref_store_code, ym
[/babysql]

[babysql:polars]
SELECT
  ref_store_code,
  delta_mo_vol,
  vol_z,
  ROUND(100.0 * rel_vol_growth, 0) AS rel_vol_growth_pct,
  ROUND(delta_pos_pp, 1) AS delta_pos_pp
FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/store_anomalies.parquet'
ORDER BY delta_mo_vol DESC
LIMIT 8
[/babysql]

[ml:python-zscore]
library: polars + numpy
script: analyze_product_people_clusters.py :: store_anomalies
steps: monthly n by store → recent/3 vs base/12 → z_vol = (Δ−μ_Δ)/σ_Δ across stores
artifact: cluster_output/store_anomalies.parquet
[/ml]

How we found this (math): z_vol = (λ̂_recent−λ̂_base−μ_Δ)/σ_Δ; sentiment shift = 100·(p̂_recent−p̂_base) pp.
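The volume z-score can be sketched as follows. This is a stdlib-only illustration, not the script's `store_anomalies`; the function name, the input shape, and the use of the population standard deviation (ddof=0) are assumptions, and the store codes are made up.

```python
import statistics


def store_volume_z(counts, recent_months=3, base_months=12):
    """counts maps store -> (n_recent, n_base); returns store -> (delta, z)."""
    deltas = {
        store: n_recent / recent_months - n_base / base_months
        for store, (n_recent, n_base) in counts.items()
    }
    mu = statistics.fmean(deltas.values())
    # Assumption: population std (ddof=0); the script may use ddof=1.
    sigma = statistics.pstdev(deltas.values())
    return {
        store: (d, (d - mu) / sigma if sigma else 0.0)
        for store, d in deltas.items()
    }


# Hypothetical stores: S1 doubles its monthly rate, S2 and S3 are flat.
toy_counts = {"S1": (30, 60), "S2": (15, 60), "S3": (15, 60)}
toy_z = store_volume_z(toy_counts)
```

Because z is computed across stores, a single store with a large Δ inflates σ_Δ and pulls every other store's z toward zero, which is worth keeping in mind when reading the table above.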

(d) Reviewer KNN vs review-level clusters

[result:knn_coherence]
| Layer | k-NN coherence (k=15) |
| --- | --- |
| Review-level (cluster_reviews.py) | 0.997 |
| Reviewer-level (product mix) | 0.851 |
[/result]

[babysql:polars]
SELECT k, silhouette, calinski_harabasz, davies_bouldin
FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/review_k_selection.csv')
ORDER BY k
[/babysql]

[ml:sklearn-knn]
library: sklearn.neighbors.NearestNeighbors
script: cluster_reviews.knn_coherence + analyze_product_people_clusters.cluster_reviewers
metric: euclidean
coherence: fraction of k nearest neighbors sharing same cluster label
[/ml]

How we found this (math): C = (1/N)Σᵢ (1/k)Σ_{j∈Nₖ(i)} 𝟙[label_j=label_i].
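The coherence score C can be illustrated with a brute-force stdlib sketch. It is not the `NearestNeighbors`-based code the scripts use; the function name mirrors the one in the spec, but the signature and the choice to exclude each point from its own neighbor list are assumptions.

```python
import math


def knn_coherence(points, labels, k):
    """Mean fraction of each point's k nearest neighbours sharing its label."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    total = 0.0
    for i in range(n):
        # Rank the other points by Euclidean distance, excluding i itself.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        same = sum(labels[j] == labels[i] for j in neighbours)
        total += same / k
    return total / n


# Toy data (hypothetical): two groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labs = [0, 0, 0, 1, 1, 1]
c_k2 = knn_coherence(pts, labs, k=2)
c_k3 = knn_coherence(pts, labs, k=3)
```

With k=2 every neighbor is in the same group; with k=3 each point is forced to take one neighbor from the other group, so the score drops to two thirds. The same effect caps coherence for any cluster smaller than k+1 members.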

(e) Positive / negative drivers — global vs segment heterogeneity

Global top topics

positive: comfort=72.1%, fit_sizing=67.9%, appearance=66.8%, quality=66.4%, packaging=57.5%, price=56.0%
negative: fit_sizing=60.2%, eye_health=45.8%, quality=41.0%, packaging=39.5%, comfort=32.6%, delivery=29.1%

Logistic regression — positive outcome

[result:logit_positive]
| Term | OR | Coef | p |
| :left | :right | :right | :right |
| C(v2_modality)[T.daily] | 1.933 | 0.6589 | 0.02509 |
| C(v2_modality)[T.monthly] | 1.004 | 0.0037 | 0.9862 |
| C(v2_lens_type)[T.color_toric] | 0.705 | -0.3489 | 0.6367 |
| C(v2_lens_type)[T.multifocal] | 1.119 | 0.1124 | 0.2509 |
| C(v2_lens_type)[T.spherical] | 1.531 | 0.4261 | 8.651e-14 |
| C(v2_lens_type)[T.toric] | 3.126 | 1.1396 | 0.001894 |
| C(v2_pack_bucket)[T.2] | 1.403 | 0.3387 | 0.1109 |
| C(v2_pack_bucket)[T.30] | 0.826 | -0.1909 | 0.06923 |
| C(v2_pack_bucket)[T.6] | 1.333 | 0.2878 | 0.3273 |
| C(v2_pack_bucket)[T.other] | 1.014 | 0.0144 | 0.8916 |
| C(v2_brand_top)[T.ALCON] | 1.072 | 0.0696 | 0.4235 |
| C(v2_brand_top)[T.BAUSCHLOMB] | 0.918 | -0.0857 | 0.09525 |
| C(v2_brand_top)[T.CANDYMAGIC] | 0.773 | -0.2571 | 0.01037 |
| C(v2_brand_top)[T.COOPERVISION] | 1.098 | 0.0933 | 0.2853 |

N=16107, pseudo-R²=0.0079
[/result]

[babysql:polars]
SELECT
  CASE WHEN m.m_sentiment = 'positive' THEN 1 ELSE 0 END AS is_positive,
  CASE WHEN m.m_sentiment = 'negative' THEN 1 ELSE 0 END AS is_negative,
  v.v2_modality,
  v.v2_lens_type,
  CASE WHEN v.v2_pack_size IN (2, 6, 10, 30) THEN CAST(v.v2_pack_size AS VARCHAR) ELSE 'other' END AS v2_pack_bucket,
  v.v2_brand
FROM '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl
JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id
JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id
WHERE v.v2_modality IS NOT NULL
  AND v.v2_lens_type IS NOT NULL
  AND v.v2_pack_size IS NOT NULL
  AND v.v2_brand IS NOT NULL
LIMIT 20
[/babysql]

[babysql:polars]
SELECT term, coef, odds_ratio, p
FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/logit_positive_terms.csv')
ORDER BY p ASC
LIMIT 15
[/babysql]

[ml:statsmodels-logit]
library: statsmodels.formula.api.logit
script: analyze_product_people_clusters.py :: fit_logit
formula: is_positive ~ C(v2_modality) + C(v2_lens_type) + C(v2_pack_bucket) + C(v2_brand_top)
outcome: is_positive
estimator: MLE Logit
report: OR = exp(β), Wald p-value
artifact: cluster_output/logit_positive_terms.csv
[/ml]

Logistic regression — negative outcome

[result:logit_negative]
| Term | OR | Coef | p |
| :left | :right | :right | :right |
| C(v2_modality)[T.daily] | 0.477 | -0.741 | 0.06871 |
| C(v2_modality)[T.monthly] | 1.142 | 0.1332 | 0.6473 |
| C(v2_lens_type)[T.color_toric] | 1.774 | 0.5735 | 0.4892 |
| C(v2_lens_type)[T.multifocal] | 0.863 | -0.1474 | 0.2274 |
| C(v2_lens_type)[T.spherical] | 0.49 | -0.7143 | 1.527e-23 |
| C(v2_lens_type)[T.toric] | 0.41 | -0.8911 | 0.04458 |
| C(v2_pack_bucket)[T.2] | 0.713 | -0.3381 | 0.264 |
| C(v2_pack_bucket)[T.30] | 1.408 | 0.3419 | 0.01807 |
| C(v2_pack_bucket)[T.6] | 0.808 | -0.2129 | 0.6008 |
| C(v2_pack_bucket)[T.other] | 1.152 | 0.1417 | 0.333 |
| C(v2_brand_top)[T.ALCON] | 1.032 | 0.0319 | 0.7815 |
| C(v2_brand_top)[T.BAUSCHLOMB] | 1.055 | 0.0534 | 0.4454 |
| C(v2_brand_top)[T.CANDYMAGIC] | 1.314 | 0.273 | 0.04486 |
| C(v2_brand_top)[T.COOPERVISION] | 0.753 | -0.2834 | 0.02668 |

N=16107, pseudo-R²=0.0152
[/result]

[babysql:polars]
SELECT term, coef, odds_ratio, p
FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/logit_negative_terms.csv')
ORDER BY p ASC
LIMIT 15
[/babysql]

[ml:statsmodels-logit]
library: statsmodels.formula.api.logit
script: analyze_product_people_clusters.py :: fit_logit
formula: is_negative ~ C(v2_modality) + C(v2_lens_type) + C(v2_pack_bucket) + C(v2_brand_top)
outcome: is_negative
artifact: cluster_output/logit_negative_terms.csv
[/ml]

Topics with statistically different rates across segments (χ², p<0.05)

[babysql:polars]
SELECT sentiment, topic, dim, chi2, p
FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/topic_heterogeneity.csv')
ORDER BY p ASC
LIMIT 15
[/babysql]

[ml:scipy-chi2]
library: scipy.stats.chi2_contingency
script: analyze_product_people_clusters.py :: segment_topic_enrichment
table: segment × {mention, no mention}
artifact: cluster_output/topic_heterogeneity.csv
[/ml]

How we found this (math): χ² = Σ(O−E)²/E on contingency tables; logit OR_j = exp(β_j).
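Both formulas are short enough to verify by hand. The sketch below is stdlib-only and is not the script's `segment_topic_enrichment`: it computes the Pearson statistic and degrees of freedom without a continuity correction, leaves the p-value to a chi-square survival function (for example `scipy.stats.chi2.sf(stat, dof)`), and uses made-up counts. The function names are illustrative assumptions.

```python
import math


def chi2_statistic(table):
    """Pearson chi-square for an r x c table of observed counts.

    Returns (statistic, degrees_of_freedom). Assumes every row and
    column total is positive; no continuity correction is applied.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof


def odds_ratio(beta):
    """Odds ratio implied by a logit coefficient: OR = exp(beta)."""
    return math.exp(beta)


# Hypothetical segment x {mention, no mention} counts for one topic.
toy_table = [[10, 20], [20, 10], [15, 15]]
toy_stat, toy_dof = chi2_statistic(toy_table)

# Coefficient taken from the logit_positive table (daily modality).
or_daily = odds_ratio(0.6589)
```

The odds-ratio call reproduces the reported 1.933 for `C(v2_modality)[T.daily]` up to rounding, which is a quick consistency check on the Coef and OR columns.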

Segment sentiment residuals (complete v2 segments)

[result:segment_sentiment_residuals]
| Segment | N | Pos rate | Residual (pp) |
| :left | :right | :right | :right |
| daily\|spherical\|1 | 8 | 25.0% | -51.6 |
| daily\|multifocal\|90 | 23 | 56.5% | -20.1 |
| monthly\|color\|2 | 137 | 59.1% | -17.5 |
| daily\|color_toric\|30 | 8 | 62.5% | -14.1 |
| daily\|color\|10 | 135 | 65.2% | -11.4 |
| monthly\|spherical\|3 | 2 | 100.0% | +23.4 |
| daily\|color\|1 | 2 | 100.0% | +23.4 |
| daily\|spherical\|5 | 2 | 100.0% | +23.4 |
| monthly\|toric\|3 | 16 | 93.8% | +17.2 |
| daily\|spherical\|32 | 17 | 88.2% | +11.6 |
[/result]

[babysql:polars]
WITH joined AS (
  SELECT
    CASE WHEN m.m_sentiment = 'positive' THEN 1 ELSE 0 END AS is_positive,
    v.v2_compare_segment
  FROM '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id
  JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id
  WHERE v.v2_compare_segment_complete = true
),
global AS (
  SELECT AVG(is_positive) AS pi FROM joined
)
SELECT
  j.v2_compare_segment,
  COUNT(*) AS n,
  ROUND(100.0 * AVG(j.is_positive), 1) AS pos_rate,
  ROUND(100.0 * (AVG(j.is_positive) - g.pi), 1) AS pos_residual_pp
FROM joined j
CROSS JOIN global g
GROUP BY j.v2_compare_segment, g.pi
ORDER BY pos_residual_pp ASC
LIMIT 8
[/babysql]

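How we found this (math): residual_s = 100·(π̂_pos,s−π̂_pos), where π̂_pos,s is the positive rate in segment s and π̂_pos is the global positive rate over reviews with a complete v2 segment (about 76.6%, as implied by the table: 25.0% + 51.6 pp).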
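The residual is a plain difference of rates, as the following stdlib sketch shows. The function name, the input shape, and the toy rows are assumptions made for illustration; the report's own numbers come from the SQL above.

```python
def segment_residuals(rows):
    """rows: iterable of (segment, is_positive) with is_positive in {0, 1}.

    Returns segment -> (n, pos_rate_pct, residual_pp), where the residual
    is measured against the global positive rate over all rows.
    """
    rows = list(rows)
    global_rate = sum(flag for _, flag in rows) / len(rows)
    by_segment = {}
    for segment, flag in rows:
        n, pos = by_segment.get(segment, (0, 0))
        by_segment[segment] = (n + 1, pos + flag)
    return {
        segment: (
            n,
            round(100 * pos / n, 1),
            round(100 * (pos / n - global_rate), 1),
        )
        for segment, (n, pos) in by_segment.items()
    }


# Hypothetical reviews: segment "a" is 3/4 positive, segment "b" is 1/4.
toy_rows = [("a", 1)] * 3 + [("a", 0)] + [("b", 1)] + [("b", 0)] * 3
toy_resid = segment_residuals(toy_rows)
```

Because the global rate weights segments by review count, very small segments (N of 2 or 8 in the table above) can show large residuals from only a handful of reviews.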

Output files

| File | Description |
| --- | --- |
| `cluster_output/product_people_report.json` | Full numeric payload |
| `cluster_output/reviewer_clusters.parquet` | user_pk × cluster assignment |
| `cluster_output/trends_*.csv` | Per-dimension trend tables |
| `cluster_output/store_anomalies.parquet` | Store volume/sentiment shifts |
| `cluster_output/logit_*_terms.csv` | Logistic regression coefficients |
| `cluster_output/topic_heterogeneity.csv` | χ² heterogeneity tests |