Wearer landscape — new vs existing themes

Generated: 2026-06-26 13:55
Script: analyze_wearer_landscape.py

Report block convention

Each analytical section uses up to three executable layers:

Layer	Tag	Role
Results	`[result:…]…[/result]`	Markdown table — findings
Data SQL	`[babysql:polars]…[/babysql]`	Polars SQL on parquet — reproduce inputs/aggregates
ML / stats	`[ml:…]…[/ml]`	Python estimator spec (sklearn, statsmodels, scipy)

Prose How we found this (math) gives the formulas. Cached artifacts live in cluster_output/.

What else we can do (total landscape roadmap)

This corpus supports a full VoC × product × people × time stack. Completed or in progress:

Layer	Status	Examples
Category health & stars vs LLM	Done	`initial_assessment_analytics.md`
v2 segment trends & GLM drivers	Done	`product_people_analytics.md`
Reviewer journeys (brand/modality switches)	Done	`reviewer_transitions.md`
New vs existing wearer themes	This report	Text-declared + platform-first proxies

Next extensions (not yet built):

Causal mix decomposition — is category sentiment fall driven by OLENS/color mix shift or within-SKU deterioration? (Shapley or fixed-effects on segment×month.)
Competitive switching attribution — link detractor Cluster 3 to prior brand/SKU (from transition pairs).
Store × wearer cohort — are new wearers concentrated in outlier stores (H7249002)?
Prescription / toric / multifocal onboarding — first-review topic paths for medical vs cosmetic entry.
LLM re-prompt for wearer intent — add wearer_type: new|existing|unknown to classification schema (cleaner than regex).
Survival / repeat hazard — P(second review within 90d | first-review themes).
Price-regret vs product-failure split — separate mixed-sentiment price timing from fit/eye_health negatives.

Cohort definitions

[result:wearer_cohort_counts] | Cohort | Rule | N | Interpretation | | --- | --- | --- | --- | | declared_new_wearer | Scored novice / first-lens trial | 37 | Self-identified novice | | declared_existing_wearer | Scored repurchase / veteran | 3,885 | Self-identified repeat | | platform_first_review | User's 1st review (no declared signal) | 6,678 | New to reviewing on HKTV | | platform_repeat_reviewer | User's 2+ review (proxy) | 6,224 | Engaged repeat reviewer | [/result]

[ml:wearer_cohort-rules] library: wearer_cohort.py (regex scoring, not sklearn) script: classify_review(comment, m_topics, m_praise, review_ordinal) scores: S_n novice patterns, S_e veteran patterns, veto penalties gates: wearing context, ordinal caps for weak trial language audit: cluster_output/wearer_audit_sample.md [/ml]

How we found this (math): If S_e ≥ 3 and S_e ≥ S_n → existing; elif S_n clears thresholds → new; else proxy by ordinal o_u within user_pk.

(a) Top themes — new wearers (`declared_new_wearer`)

Sentiment: 54.1% positive, 27.0% negative, 16.2% mixed · avg ★3.78 · avg comment 37 chars

LLM topics (mention rate per review)

[result:new_wearer_topics] | Topic | Rate | vs existing (pp) | | --- | --- | --- | | repurchase | 0.0% | -59.5 | | customer_service | 0.0% | -48.8 | | delivery | 2.7% | -44.7 | | price | 5.4% | -42.0 | | expiry | 2.7% | -40.3 | | other | 0.0% | -39.6 | | packaging | 16.2% | -37.2 | | prescription | 10.8% | -33.8 | | brand | 16.2% | -33.6 | | eye_health | 21.6% | -29.5 | [/result]

[babysql:polars] SELECT cohort, topic, rate FROM read_csv('/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_topic_rates.csv') WHERE cohort = 'declared_new_wearer' ORDER BY rate DESC LIMIT 12 [/babysql]

[ml:scipy-chi2] library: scipy.stats.chi2_contingency script: analyze_wearer_landscape.py :: topic enrichment test: topic mention independent of cohort [/ml]

Top complaint phrases (LLM-extracted)

唔好expect 上眼會有驚喜效果 (2)
唔舒服 (2)
盒裝濕 (2)
con packing破損 (2)
遇到開左就崩左 (1)
啲碎留係眼入邊係咩後果 (1)
對Con變細咗第一次買到咁嘅Con 好難想像長期帶會點 (1)
眼乾眼澀 (1)
un著un著感覺 (1)
第一次試戴呢個牌子 (1)

Top praise phrases

舒服 (3)
舒適 (2)
顏色自然嘅全深啡色 (2)
唔會直徑太大嘅大眼仔 (2)
第一次買，度數算準，配戴也比較舒服 (1)
度數準確 (1)
配戴舒服 (1)
顏色唔错 (1)

New-wearer narrative: Themes cluster on fit_sizing, comfort, and eye_health — onboarding anxiety, not repurchase. Negative rate is 27% vs 8% for declared existing — failure modes are adaptation (异物感, dry eyes) not loyalty erosion.

(b) Top themes — existing wearers (`declared_existing_wearer`)

Sentiment: 86.2% positive, 7.8% negative · avg ★4.54

LLM topics (mention rate per review)

Topic	Rate
fit_sizing	81.4%
comfort	68.9%
appearance	60.7%
repurchase	59.5%
quality	57.1%
packaging	53.5%
eye_health	51.1%
brand	49.8%
customer_service	48.8%
delivery	47.4%
price	47.4%
prescription	44.7%

Top complaint phrases

冇有 (111)
唔乾 (10)
唔舒服 (7)
not available in all sizes (6)
眼已覺攰及乾 (4)
不能久帶 (4)
冇有其他問題 (3)
唔好用 (3)

Top praise phrases

回購 (398)
多次回購 (263)
用開 (248)
已多次回購 (160)
good quality (119)
comfortable fit (94)
good (82)
一直用開 (66)
good fit (64)
已回購多次 (61)

Existing-wearer narrative: repurchase and brand dominate — advocacy and habit. Comfort/quality are table stakes; negatives are rare and often operational (delivery, expiry).

Platform proxy comparison (no declared signal in text)

Cohort	N	Pos	Neg	Top-3 topics
platform_first_review	6,678	69.5%	15.5%	comfort(60%), quality(60%), fit_sizing(59%)
platform_repeat_reviewer	6,224	78.3%	8.4%	comfort(71%), quality(69%), appearance(68%)

[babysql:polars] SELECT wearer_cohort, COUNT() AS n, ROUND(100.0 * SUM(CASE WHEN m.m_sentiment = 'positive' THEN 1 ELSE 0 END) / COUNT(), 1) AS pos_pct, ROUND(100.0 * SUM(CASE WHEN m.m_sentiment = 'negative' THEN 1 ELSE 0 END) / COUNT(*), 1) AS neg_pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_tags.parquet' AS w JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON w.review_id = cl.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/6091e_cl_reviews_model_enriched.parquet' AS m ON cl.review_id = m.review_id WHERE wearer_cohort IN ('platform_first_review', 'platform_repeat_reviewer') GROUP BY wearer_cohort [/babysql]

First-time reviewers on the platform show higher negative rate than repeat reviewers — even without saying "first time wearing lenses." Likely mix of trialists, disappointed switchers, and motivated complainers.

Product mix — what new vs existing wearers buy

Declared new — top modalities / lens types

modality: daily(100%) lens: spherical(68%), color(24%), multifocal(8%)

Declared existing — top modalities / lens types

modality: daily(93%), monthly(4%), 2-week(3%) lens: spherical(77%), color(14%), multifocal(8%), toric(1%), color_toric(0%)

[babysql:polars] SELECT w.wearer_cohort, v.v2_modality, v.v2_lens_type, COUNT() AS n, ROUND(100.0 * COUNT() / SUM(COUNT(*)) OVER (PARTITION BY w.wearer_cohort), 1) AS pct FROM '/home/sidmishra/ontologer/subprojects/review_sequencing_sentiment/cluster_output/wearer_tags.parquet' AS w JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/cl_reviews.parquet' AS cl ON w.review_id = cl.review_id JOIN '/home/sidmishra/ontologer/data/processed/hk_market/hktvmall/product_id_to_attributes_v2.parquet' AS v ON cl.product_id = v.product_id WHERE w.wearer_cohort IN ('declared_new_wearer', 'declared_existing_wearer') GROUP BY w.wearer_cohort, v.v2_modality, v.v2_lens_type ORDER BY w.wearer_cohort, n DESC [/babysql]

Topic enrichment — new vs existing (χ²)

Topic	New rate	Existing rate	Δ pp	p
repurchase	0.0%	59.5%	-59.5	8.31e-13*
customer_service	0.0%	48.8%	-48.8	9.41e-09*
delivery	2.7%	47.4%	-44.7	1.46e-07*
price	5.4%	47.4%	-42.0	8.35e-07*
expiry	2.7%	43.0%	-40.3	1.84e-06*
other	0.0%	39.6%	-39.6	2.11e-06*
packaging	16.2%	53.5%	-37.2	1.34e-05*
prescription	10.8%	44.7%	-33.8	7.54e-05*
brand	16.2%	49.8%	-33.6	9.66e-05*
eye_health	21.6%	51.1%	-29.5	0.000657*
appearance	35.1%	60.7%	-25.6	0.00274*
fit_sizing	62.2%	81.4%	-19.3	0.00545*

How we found this (math): Per topic (t): (\hat\pi_t = #{\text{reviews mentioning }t}/N). Test (H_0): topic mention independent of cohort using χ² on 2×2 table (mention × {new, existing}).

Output files

File	Role
`cluster_output/wearer_tags.parquet`	Per-review cohort tags
`cluster_output/wearer_topic_rates.csv`	Topic rates by cohort
`cluster_output/wearer_landscape_report.json`	Full payload

Wearer landscape — new vs existing themes

Report block convention

What else we can do (total landscape roadmap)

Cohort definitions

(a) Top themes — new wearers (declared_new_wearer)

LLM topics (mention rate per review)

Top complaint phrases (LLM-extracted)

Top praise phrases

(b) Top themes — existing wearers (declared_existing_wearer)

LLM topics (mention rate per review)

Top complaint phrases

Top praise phrases

Platform proxy comparison (no declared signal in text)

Product mix — what new vs existing wearers buy

Declared new — top modalities / lens types

Declared existing — top modalities / lens types

Topic enrichment — new vs existing (χ²)

Output files

(a) Top themes — new wearers (`declared_new_wearer`)

(b) Top themes — existing wearers (`declared_existing_wearer`)