Wearer landscape — new vs existing themes

Generated: 2026-06-26 13:55
Script: analyze_wearer_landscape.py


Report block convention

Each analytical section uses up to three executable layers:

Layer Tag Role
Results [result:…]…[/result] Markdown table — findings
Data SQL [babysql:polars]…[/babysql] Polars SQL on parquet — reproduce inputs/aggregates
ML / stats [ml:…]…[/ml] Python estimator spec (sklearn, statsmodels, scipy)

Prose How we found this (math) gives the formulas. Cached artifacts live in cluster_output/.


What else we can do (total landscape roadmap)

This corpus supports a full VoC × product × people × time stack. Completed or in progress:

Layer Status Examples
Category health & stars vs LLM Done initial_assessment_analytics.md
v2 segment trends & GLM drivers Done product_people_analytics.md
Reviewer journeys (brand/modality switches) Done reviewer_transitions.md
New vs existing wearer themes This report Text-declared + platform-first proxies

Next extensions (not yet built):

  1. Causal mix decomposition — is category sentiment fall driven by OLENS/color mix shift or within-SKU deterioration? (Shapley or fixed-effects on segment×month.)
  2. Competitive switching attribution — link detractor Cluster 3 to prior brand/SKU (from transition pairs).
  3. Store × wearer cohort — are new wearers concentrated in outlier stores (H7249002)?
  4. Prescription / toric / multifocal onboarding — first-review topic paths for medical vs cosmetic entry.
  5. LLM re-prompt for wearer intent — add wearer_type: new|existing|unknown to classification schema (cleaner than regex).
  6. Survival / repeat hazard — P(second review within 90d | first-review themes).
  7. Price-regret vs product-failure split — separate mixed-sentiment price timing from fit/eye_health negatives.

Cohort definitions

[result:wearer_cohort_counts] | Cohort | Rule | N | Interpretation | | --- | --- | --- | --- | | declared_new_wearer | Scored novice / first-lens trial | 37 | Self-identified novice | | declared_existing_wearer | Scored repurchase / veteran | 3,885 | Self-identified repeat | | platform_first_review | User's 1st review (no declared signal) | 6,678 | New to reviewing on HKTV | | platform_repeat_reviewer | User's 2+ review (proxy) | 6,224 | Engaged repeat reviewer | [/result]

[ml:wearer_cohort-rules] library: wearer_cohort.py (regex scoring, not sklearn) script: classify_review(comment, m_topics, m_praise, review_ordinal) scores: S_n novice patterns, S_e veteran patterns, veto penalties gates: wearing context, ordinal caps for weak trial language audit: cluster_output/wearer_audit_sample.md [/ml]

How we found this (math): If S_e ≥ 3 and S_e ≥ S_n → existing; elif S_n clears thresholds → new; else proxy by ordinal o_u within user_pk.


(a) Top themes — new wearers (declared_new_wearer)

Sentiment: 54.1% positive, 27.0% negative, 16.2% mixed · avg ★3.78 · avg comment 37 chars

LLM topics (mention rate per review)

[result:new_wearer_topics] | Topic | Rate | vs existing (pp) | | --- | --- | --- | | repurchase | 0.0% | -59.5 | | customer_service | 0.0% | -48.8 | | delivery | 2.7% | -44.7 | | price | 5.4% | -42.0 | | expiry | 2.7% | -40.3 | | other | 0.0% | -39.6 | | packaging | 16.2% | -37.2 | | prescription | 10.8% | -33.8 | | brand | 16.2% | -33.6 | | eye_health | 21.6% | -29.5 | [/result]

[babysql:polars] SELECT cohort, topic, rate FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_topic_rates.csv') WHERE cohort = 'declared_new_wearer' ORDER BY rate DESC LIMIT 12 [/babysql]

[ml:scipy-chi2] library: scipy.stats.chi2_contingency script: analyze_wearer_landscape.py :: topic enrichment test: topic mention independent of cohort [/ml]

Top complaint phrases (LLM-extracted)

  • 唔好expect 上眼會有驚喜效果 (2)
  • 唔舒服 (2)
  • 盒裝濕 (2)
  • con packing破損 (2)
  • 遇到開左就崩左 (1)
  • 啲碎留係眼入邊係咩後果 (1)
  • 對Con變細咗第一次買到咁嘅Con 好難想像長期帶會點 (1)
  • 眼乾眼澀 (1)
  • un著un著感覺 (1)
  • 第一次試戴呢個牌子 (1)

Top praise phrases

  • 舒服 (3)
  • 舒適 (2)
  • 顏色自然嘅全深啡色 (2)
  • 唔會直徑太大嘅大眼仔 (2)
  • 第一次買,度數算準,配戴也比較舒服 (1)
  • 度數準確 (1)
  • 配戴舒服 (1)
  • 顏色唔错 (1)

New-wearer narrative: Themes cluster on fit_sizing, comfort, and eye_health — onboarding anxiety, not repurchase. Negative rate is 27% vs 8% for declared existing — failure modes are adaptation (异物感, dry eyes) not loyalty erosion.


(b) Top themes — existing wearers (declared_existing_wearer)

Sentiment: 86.2% positive, 7.8% negative · avg ★4.54

LLM topics (mention rate per review)

Topic Rate
fit_sizing 81.4%
comfort 68.9%
appearance 60.7%
repurchase 59.5%
quality 57.1%
packaging 53.5%
eye_health 51.1%
brand 49.8%
customer_service 48.8%
delivery 47.4%
price 47.4%
prescription 44.7%

Top complaint phrases

  • 冇有 (111)
  • 唔乾 (10)
  • 唔舒服 (7)
  • not available in all sizes (6)
  • 眼已覺攰及乾 (4)
  • 不能久帶 (4)
  • 冇有其他問題 (3)
  • 唔好用 (3)

Top praise phrases

  • 回購 (398)
  • 多次回購 (263)
  • 用開 (248)
  • 已多次回購 (160)
  • good quality (119)
  • comfortable fit (94)
  • good (82)
  • 一直用開 (66)
  • good fit (64)
  • 已回購多次 (61)

Existing-wearer narrative: repurchase and brand dominate — advocacy and habit. Comfort/quality are table stakes; negatives are rare and often operational (delivery, expiry).

[babysql:polars] SELECT cohort, topic, rate FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_topic_rates.csv') WHERE cohort = 'declared_existing_wearer' ORDER BY rate DESC LIMIT 12 [/babysql]


Platform proxy comparison (no declared signal in text)

Cohort N Pos Neg Top-3 topics
platform_first_review 6,678 69.5% 15.5% comfort(60%), quality(60%), fit_sizing(59%)
platform_repeat_reviewer 6,224 78.3% 8.4% comfort(71%), quality(69%), appearance(68%)

[babysql:polars] SELECT wearer_cohort, COUNT() AS n, ROUND(100.0 * SUM(CASE WHEN m.m_sentiment = 'positive' THEN 1 ELSE 0 END) / COUNT(), 1) AS pos_pct, ROUND(100.0 * SUM(CASE WHEN m.m_sentiment = 'negative' THEN 1 ELSE 0 END) / COUNT(*), 1) AS neg_pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_tags.parquet' AS w JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON w.review_id = cl.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id WHERE wearer_cohort IN ('platform_first_review', 'platform_repeat_reviewer') GROUP BY wearer_cohort [/babysql]

First-time reviewers on the platform show higher negative rate than repeat reviewers — even without saying "first time wearing lenses." Likely mix of trialists, disappointed switchers, and motivated complainers.


Product mix — what new vs existing wearers buy

Declared new — top modalities / lens types

modality: daily(100%) lens: spherical(68%), color(24%), multifocal(8%)

Declared existing — top modalities / lens types

modality: daily(93%), monthly(4%), 2-week(3%) lens: spherical(77%), color(14%), multifocal(8%), toric(1%), color_toric(0%)

[babysql:polars] SELECT w.wearer_cohort, v.v2_modality, v.v2_lens_type, COUNT() AS n, ROUND(100.0 * COUNT() / SUM(COUNT(*)) OVER (PARTITION BY w.wearer_cohort), 1) AS pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_tags.parquet' AS w JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON w.review_id = cl.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id WHERE w.wearer_cohort IN ('declared_new_wearer', 'declared_existing_wearer') GROUP BY w.wearer_cohort, v.v2_modality, v.v2_lens_type ORDER BY w.wearer_cohort, n DESC [/babysql]


Topic enrichment — new vs existing (χ²)

Topic New rate Existing rate Δ pp p
repurchase 0.0% 59.5% -59.5 8.31e-13*
customer_service 0.0% 48.8% -48.8 9.41e-09*
delivery 2.7% 47.4% -44.7 1.46e-07*
price 5.4% 47.4% -42.0 8.35e-07*
expiry 2.7% 43.0% -40.3 1.84e-06*
other 0.0% 39.6% -39.6 2.11e-06*
packaging 16.2% 53.5% -37.2 1.34e-05*
prescription 10.8% 44.7% -33.8 7.54e-05*
brand 16.2% 49.8% -33.6 9.66e-05*
eye_health 21.6% 51.1% -29.5 0.000657*
appearance 35.1% 60.7% -25.6 0.00274*
fit_sizing 62.2% 81.4% -19.3 0.00545*

How we found this (math): Per topic (t): (\hat\pi_t = #{\text{reviews mentioning }t}/N). Test (H_0): topic mention independent of cohort using χ² on 2×2 table (mention × {new, existing}).


Output files

File Role
cluster_output/wearer_tags.parquet Per-review cohort tags
cluster_output/wearer_topic_rates.csv Topic rates by cohort
cluster_output/wearer_landscape_report.json Full payload