Review sequencing & sentiment — file index

Data root: ~/projects-new/data-analyses/hk_market/
Scripts: ~/ontologer/subprojects/review_sequencing_sentiment/

Last updated: 2026-06-26 (data relocated to projects-new)

Primary outputs (use these)

File	Rows	Role
`source_data/cl_reviews.parquet`	16,835	Review corpus (479 `product_id`s)
`source_data/product_id_to_attributes_v2.parquet`	2,914	Apples-to-apples product mapping (`v2_modality` × `v2_lens_type` × `v2_pack_size`)
`source_data/6091e_cl_reviews_model_enriched.parquet`	16,835	LLM-enriched reviews (run `6091e`)
`source_data/reviews.parquet`	—	Raw scrape (`user_pk`, `ref_store_code`)
`output_data/v2_segment_frequencies.csv`	71	Segment × SKU × review counts

Join key: cl_reviews.product_id = product_id_to_attributes_v2.product_id (bare id, no hktv- prefix).
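A minimal sketch of this join, assuming pandas is available. The toy frames stand in for the two parquet files, and the `review_id` column and the id values (`A100`, etc.) are hypothetical; only `product_id` and the `v2_*` column names come from this index.

```python
import pandas as pd

# Toy stand-ins; in practice these would be loaded with
# pd.read_parquet("source_data/cl_reviews.parquet") and
# pd.read_parquet("source_data/product_id_to_attributes_v2.parquet").
reviews = pd.DataFrame({
    "review_id": [1, 2, 3],                  # hypothetical column
    "product_id": ["A100", "A100", "B200"],  # bare ids, no prefix
})
attrs = pd.DataFrame({
    "product_id": ["A100", "B200", "C300"],
    "v2_modality": ["daily", "monthly", "daily"],
    "v2_lens_type": ["spherical", "color", "color"],
    "v2_pack_size": [30, 2, None],
})

# Many reviews map to one product row; validate= raises if the mapping
# side contains duplicate product_id values.
joined = reviews.merge(
    attrs, on="product_id", how="left", validate="many_to_one", indicator=True
)

# Reviews whose product_id has no row in the mapping.
unmatched = joined[joined["_merge"] == "left_only"]
print(len(joined), len(unmatched))
```

With full coverage (479 / 479 review `product_id`s in v2), `unmatched` would be expected to be empty on the real data.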

v2 mapping (`product_id_to_attributes_v2.parquet`)

Rebuild:

cd ~/ontologer/subprojects/review_sequencing_sentiment
./venv/bin/python3 build_product_id_to_attributes_v2.py

The build auto-loads the newest PDP pack file (see PDP pack enrichment) when one is present.

Pack size resolution order

1. orig_packingSpecEn / orig_packingSpecZh (index scrape)
2. description_pdp: parsed from the PDP HTML "Packing : N" field (see PDP pack enrichment below)
3. Title / name parse
4. orig_llm_pack_size
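The resolution order above amounts to a first-non-null lookup. The sketch below is illustrative only and is not the code in `product_attributes_v2.py`. It assumes each source has already been parsed to an integer (or None), that the English field is tried before the Chinese one, and it uses `title_parse` as a hypothetical label for the title / name step.

```python
from typing import Dict, Optional, Tuple

# Sources in priority order. "title_parse" is a hypothetical label;
# the other names are taken from the list above.
RESOLUTION_ORDER = [
    "orig_packingSpecEn",
    "orig_packingSpecZh",
    "description_pdp",
    "title_parse",
    "orig_llm_pack_size",
]

def resolve_pack_size(
    candidates: Dict[str, Optional[int]]
) -> Tuple[Optional[int], Optional[str]]:
    """Return (pack_size, source) from the first source that has a value."""
    for source in RESOLUTION_ORDER:
        value = candidates.get(source)
        if value is not None:
            return value, source
    return None, None

row = {"orig_packingSpecEn": None, "description_pdp": 30, "orig_llm_pack_size": 10}
print(resolve_pack_size(row))  # (30, 'description_pdp')
```

Returning the source alongside the value mirrors the role of `v2source_pack_size`, which records where each resolved value came from.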

Coverage (current)

Metric	Value
Review `product_id`s in v2	479 / 479
Contact-lens SKUs in v2	2,821
Reviews with complete segment (`modality|lens_type|pack`)	15,351 / 16,835 (91.2%)
Reviews with `v2source_pack_size = description_pdp`	8,939
SKUs with `description_pdp`	88

Key v2 columns

Prefix	Examples	Meaning
`orig_llm_*`	`orig_llm_modality`	Processed catalog LLM dims
`orig_*`	`orig_packingSpecEn`, `orig_subcat2_en`	Raw scrape JSON
`v2_*`	`v2_modality`, `v2_lens_type`, `v2_pack_size`, `v2_brand`	Corrected attributes
`v2source_*`	`v2source_pack_size`	Where each v2 field came from
`v2_compare_segment`	`daily|spherical|30`	Apples-to-apples key
`v2_compare_segment_complete`	bool	All three dims known
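For orientation, the last two columns can be thought of as a pipe-joined key plus a completeness flag. This is a sketch under assumptions: the function name is hypothetical, and rendering an unknown dimension as the string "null" inside the key is a guess, not confirmed behavior of the build.

```python
from typing import Optional, Tuple

def compare_segment(
    modality: Optional[str],
    lens_type: Optional[str],
    pack_size: Optional[int],
) -> Tuple[str, bool]:
    """Build a modality|lens_type|pack key and a completeness flag."""
    parts = [modality, lens_type, pack_size]
    complete = all(p is not None for p in parts)
    # Assumption: unknown dimensions are written as "null" in the key.
    key = "|".join("null" if p is None else str(p) for p in parts)
    return key, complete

print(compare_segment("daily", "spherical", 30))  # ('daily|spherical|30', True)
print(compare_segment("daily", "color", None))    # ('daily|color|null', False)
```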

PDP pack enrichment

Item	Path
Scraper	`~/ontologer/data/pipelines/hk_market.hktvmall/scripts/scrape_pdp_pack_sizes.py`
Output	`~/ontologer/data/raw/hk_market/hktvmall/pdp_pack/{date}/pdp_pack_missing_descr.parquet`
Latest run	`.../pdp_pack/2026-06-26/pdp_pack_missing_descr.parquet`
Checkpoint	`.../pdp_pack/2026-06-26/checkpoint.json`
Run log	`.../pdp_pack/2026-06-26/scrape.log`

2026-06-26 run: parsed 88 of the 110 review-corpus SKUs that had a null pack size.

N (pack size)	SKUs	Reviews
30	56	8,035
2	21	112
6	9	775
3	1	16
10	1	1
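The pack sizes in this table come from a "Packing : N" fragment in the PDP HTML. The snippet below is a rough sketch of that kind of extraction, not the logic in `scrape_pdp_pack_sizes.py`; the regex, the function name, and the sample markup are all assumptions, and real PDP markup may differ.

```python
import re
from typing import Optional

# Assumed pattern: the label "Packing", optional spaces, a colon, then digits.
PACKING_RE = re.compile(r"Packing\s*:\s*(\d+)", re.IGNORECASE)

def parse_pack_size(pdp_text: str) -> Optional[int]:
    """Return N from a 'Packing : N' fragment, or None if it is absent."""
    match = PACKING_RE.search(pdp_text)
    return int(match.group(1)) if match else None

print(parse_pack_size("<li>Packing : 30 lenses</li>"))  # 30
print(parse_pack_size("<li>Base curve : 8.5</li>"))     # None
```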

# Re-scrape (resumes; skips already-parsed SKUs)
python3 ~/ontologer/data/pipelines/hk_market.hktvmall/scripts/scrape_pdp_pack_sizes.py --date 2026-06-26

Segment frequency table

Regenerate:

./venv/bin/python3 analyze_v2_segments.py

Writes output_data/v2_segment_frequencies.csv and .parquet.

Contact lenses only. Each row is one unique (v2_modality, v2_lens_type, v2_pack_size) combination with its SKU count and review count.
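The shape of that table can be sketched with two counters, one over SKUs and one over reviews. This is a toy illustration with hypothetical ids and is not the code in `analyze_v2_segments.py`; it assumes SKUs without reviews still count toward the SKU column.

```python
from collections import Counter

# Toy product_id -> segment mapping and review list; real inputs would be
# the v2 mapping and cl_reviews joined on product_id.
sku_segment = {
    "A100": ("daily", "spherical", 30),
    "A101": ("daily", "spherical", 30),
    "B200": ("monthly", "color", 2),
    "C300": ("daily", "color", None),  # incomplete: pack size unknown
}
review_product_ids = ["A100", "A100", "A101", "B200", "C300"]

sku_counts = Counter(sku_segment.values())
review_counts = Counter(sku_segment[pid] for pid in review_product_ids)

# One row per unique segment: (modality, lens_type, pack, SKUs, reviews),
# sorted by review count, largest first.
rows = sorted(
    ((*seg, sku_counts[seg], review_counts[seg]) for seg in sku_counts),
    key=lambda r: r[-1],
    reverse=True,
)
for row in rows:
    print(row)
```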

Summary (2026-06-26, after PDP enrichment)

	Segment combos	Reviews
Complete	47	15,351 (91.2%)
Incomplete	24	1,484 (8.8%)
Total rows	71	16,835

Top complete segments (by reviews)

Modality	Lens type	Pack	SKUs	Reviews	% corpus
daily	spherical	30	215	9,471	56.3%
daily	color	30	369	1,750	10.4%
daily	multifocal	30	51	860	5.1%
daily	spherical	10	187	779	4.6%
daily	spherical	20	54	632	3.8%
monthly	color	2	223	137	0.8%
daily	color	10	686	118	0.7%

Top incomplete segments (by reviews)

Modality	Lens type	Pack	SKUs	Reviews	% corpus
daily	spherical	null	107	716	4.3%
daily	color	null	148	26	0.2%
monthly	spherical	null	15	13	0.1%

The remaining gap is mostly the 22 PDP failures (19 pack_not_found, 3 missing_url_en), plus catalog SKUs that have a null pack size but no reviews.

Raw & upstream inputs

File	Role
`~/ontologer/data/raw/hk_market/hktvmall/2026-06-12/products.parquet`	Weekly index scrape
`~/ontologer/data/raw/hk_market/hktvmall/2026-06-19/products.parquet`	Weekly index scrape (newest wins)
`~/ontologer/data/processed/hk_market/hktvmall/products.parquet`	1.1M SKUs + `llm_*` dims
`~/ontologer/subprojects/scrapes/hk_market/ssot/hktvmall_products_RAW.parquet`	Wide canonical (155 cols)
`~/plans/hktv_curl.txt`	API discovery curls (Algolia, KSS, reviews)

Pipeline scripts (`~/ontologer/data/pipelines/hk_market.hktvmall/scripts/`)

Script	Purpose
`hktv_api.py`	Algolia, Keyword Search Server, review API clients
`algolia_search_to_parquet.py`	Index product dump
`pull_algolia_cl_products.py`	KSS bulk CL pull
`scrape_pdp_pack_sizes.py`	PDP pack enrichment
`scrape_cl_reviews_async.py`	Async review scraper
`HKTVMALL_API.md`	API surface docs

This project — build & analysis scripts

Script	Purpose
`product_attributes_v2.py`	v2 transform logic (shared)
`build_product_id_to_attributes_v2.py`	Build v2 parquet
`analyze_v2_segments.py`	Segment frequency table
`classify_reviews.py`	LLM review classification
`analyze_reviews.py`	Review analytics
`cluster_reviews.py`	Clustering + KNN
`analyze_product_people_clusters.py`	Product-anchored reviewer clusters, trends, store anomalies, GLM
`analyze_reviewer_transitions.py`	Repeat-reviewer brand/modality/lens journeys & timing
`analyze_wearer_landscape.py`	New vs existing wearer themes & landscape roadmap
`analyze_consumer_clusters.py`	Consumer archetypes (product mix × sentiment × topics)

Consumer clustering (repeat reviewers)

Report: consumer_clustering.md

./venv/bin/python3 analyze_consumer_clusters.py

Cohort: 1,425 users with ≥3 reviews. Features: v2 product-mix shares + sentiment rates + LLM topic prevalence → KMeans (k=5).
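As an illustration of the product-mix share features, the sketch below computes per-user modality shares for users at or above the review threshold. It covers only one feature family on toy data; sentiment rates, topic prevalence, and the KMeans step itself are not shown, and none of this is taken from `analyze_consumer_clusters.py`.

```python
from collections import Counter, defaultdict

MIN_REVIEWS = 3  # cohort threshold stated above

# Toy reviews: (user_pk, v2_modality).
reviews = [
    ("u1", "daily"), ("u1", "daily"), ("u1", "monthly"), ("u1", "daily"),
    ("u2", "monthly"), ("u2", "monthly"), ("u2", "monthly"),
    ("u3", "daily"),  # below threshold, dropped
]

by_user = defaultdict(Counter)
for user_pk, modality in reviews:
    by_user[user_pk][modality] += 1

modalities = sorted({m for _, m in reviews})  # fixed column order

# One feature row per cohort user: share of that user's reviews per modality.
features = {
    user: [counts[m] / sum(counts.values()) for m in modalities]
    for user, counts in by_user.items()
    if sum(counts.values()) >= MIN_REVIEWS
}
print(modalities)  # ['daily', 'monthly']
print(features)    # {'u1': [0.75, 0.25], 'u2': [0.0, 1.0]}
```

Rows of this form, extended with the other feature families, are the kind of matrix a k=5 KMeans would be fitted on.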

Output	Role
`output_data/reviewer_clusters.parquet`	`user_pk` × cluster + label
`output_data/consumer_cluster_topics.csv`	Topic rates per cluster
`output_data/consumer_cluster_wearer_mix.csv`	Wearer cohort × cluster
`output_data/consumer_cluster_transitions.csv`	Brand/modality switch rates

Wearer landscape (new vs existing themes)

Report: wearer_landscape.md

./venv/bin/python3 analyze_wearer_landscape.py

Output	Role
`output_data/wearer_tags.parquet`	Per-review `wearer_cohort` tag
`output_data/wearer_topic_rates.csv`	Topic mention rates by cohort
`output_files/wearer_audit_sample.md`	Random-sample audit of cohort labels

Reviewer journey transitions (≥3 reviews)

Each report section has three layers: [result] markdown tables, [babysql:polars] SQL run on parquet, and [ml:…] Python estimator specs. See report_blocks.py.

Report	BabySQL	ML blocks	Result tables
`product_people_analytics.md`	Yes	KMeans, logit, χ², z-tests	Yes
`consumer_clustering.md`	Yes	KMeans, wearer rules	Yes
`reviewer_transitions.md`	Yes	Polars pair construction	Yes
`wearer_landscape.md`	Yes	Classifier rules, χ²	Yes
`clustering_results.md`	Yes (auto section)	KMeans review-level	Yes

Report: reviewer_transitions.md

./venv/bin/python3 analyze_reviewer_transitions.py

Cohort: users with ≥3 contact-lens reviews. Chronological pairs with days_gap > 0 (same-day SKU dupes collapsed).
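The pair construction can be sketched in plain Python as follows. The report itself lists Polars for this step, so this is only an illustration of the rule (collapse same-day duplicates of the same SKU, then keep consecutive pairs with a positive gap). The tuple layout and function name are hypothetical, and the ≥3-review cohort filter is omitted.

```python
from datetime import date
from itertools import groupby

# Toy reviews for one user: (user_pk, review_date, product_id).
reviews = [
    ("u1", date(2025, 1, 5), "A100"),
    ("u1", date(2025, 1, 5), "A100"),  # same-day duplicate of the same SKU
    ("u1", date(2025, 2, 4), "B200"),
    ("u1", date(2025, 4, 1), "A100"),
]

def transition_pairs(rows):
    """Yield (user, from_sku, to_sku, days_gap) for consecutive reviews."""
    # set() collapses exact same-day SKU duplicates; sorting orders the
    # rows by user, then date.
    rows = sorted(set(rows))
    for user, group in groupby(rows, key=lambda r: r[0]):
        steps = list(group)
        for prev, nxt in zip(steps, steps[1:]):
            days_gap = (nxt[1] - prev[1]).days
            if days_gap > 0:  # drop same-day pairs
                yield user, prev[2], nxt[2], days_gap

pairs = list(transition_pairs(reviews))
print(pairs)  # [('u1', 'A100', 'B200', 30), ('u1', 'B200', 'A100', 56)]
```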

Output	Role
`output_data/reviewer_transition_pairs.parquet`	Consecutive step pairs per user
`output_data/reviewer_brand_flows.csv`	Brand from → to counts + median days
`output_data/reviewer_modality_flows.csv`	Modality transitions
`output_data/reviewer_lens_flows.csv`	Lens-type transitions
`output_data/reviewer_pack_flows.csv`	Pack-size transitions
`output_data/reviewer_user_journey_summary.parquet`	Per-user span, # brands/modalities

Report: product_people_analytics.md