Review sequencing & sentiment — file index

Data root: ~/projects-new/data-analyses/hk_market/
Scripts: ~/ontologer/subprojects/review_sequencing_sentiment/

Last updated: 2026-06-26 (data relocated to projects-new)


Primary outputs (use these)

| File | Rows | Role |
|---|---|---|
| source_data/cl_reviews.parquet | 16,835 | Review corpus (479 product_ids) |
| source_data/product_id_to_attributes_v2.parquet | 2,914 | Apples-to-apples product mapping (v2_modality × v2_lens_type × v2_pack_size) |
| source_data/6091e_cl_reviews_model_enriched.parquet | 16,835 | LLM-enriched reviews (run 6091e) |
| source_data/reviews.parquet | | Raw scrape (user_pk, ref_store_code) |
| output_data/v2_segment_frequencies.csv | 71 | Segment × SKU × review counts |

Join key: cl_reviews.product_id = product_id_to_attributes_v2.product_id (bare id, no hktv- prefix).
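
A minimal sketch of the join, assuming some upstream source may still carry ids with an hktv- prefix. The bare_id and join_reviews helpers and the sample rows are hypothetical and are not taken from the project scripts, which work on the parquet files directly.

```python
def bare_id(product_id: str, prefix: str = "hktv-") -> str:
    """Strip the store prefix if present; bare ids pass through unchanged."""
    if product_id.startswith(prefix):
        return product_id[len(prefix):]
    return product_id


def join_reviews(reviews, attributes):
    """Left-join review dicts to attribute dicts on the bare product_id."""
    attrs_by_id = {bare_id(a["product_id"]): a for a in attributes}
    joined = []
    for review in reviews:
        key = bare_id(review["product_id"])
        # Unmatched reviews are kept with no attribute columns (left join).
        joined.append({**attrs_by_id.get(key, {}), **review, "product_id": key})
    return joined


# Hypothetical rows, for illustration only.
reviews = [
    {"product_id": "H0001", "rating": 5},
    {"product_id": "hktv-H0002", "rating": 3},
]
attributes = [
    {"product_id": "H0001", "v2_modality": "daily"},
    {"product_id": "H0002", "v2_modality": "monthly"},
]
joined = join_reviews(reviews, attributes)
```

Normalizing both sides before the lookup keeps the match symmetric, so it does not matter which input carries the prefix.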


v2 mapping (product_id_to_attributes_v2.parquet)

Rebuild:

cd ~/ontologer/subprojects/review_sequencing_sentiment
./venv/bin/python3 build_product_id_to_attributes_v2.py

The build auto-loads the newest PDP pack file when one is present.

Pack size resolution order

  1. orig_packingSpecEn / orig_packingSpecZh (index scrape)
  2. description_pdp: parsed from the "Packing : N" text in the PDP HTML (see PDP pack enrichment below)
  3. Title / name parse
  4. orig_llm_pack_size
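
The resolution order above can be sketched as a first-match-wins lookup. This is an illustration only: the source label orig_packingSpec, the title field name, the assumption that description_pdp is available as a text field on the row, and the naive first-integer parse are all assumptions, not the logic in product_attributes_v2.py.

```python
import re
from typing import Optional, Tuple

# Priority order from the list above: (source label, candidate fields).
# Labels and field names other than those in the list are assumptions.
RESOLUTION_ORDER = [
    ("orig_packingSpec", ["orig_packingSpecEn", "orig_packingSpecZh"]),
    ("description_pdp", ["description_pdp"]),
    ("title", ["title"]),
    ("orig_llm_pack_size", ["orig_llm_pack_size"]),
]


def first_int(value) -> Optional[int]:
    """Return the first run of digits in a value, or None if there is none."""
    if value is None:
        return None
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None


def resolve_pack_size(row: dict) -> Tuple[Optional[int], Optional[str]]:
    """Return (pack_size, source) from the first source that yields a number."""
    for source, fields in RESOLUTION_ORDER:
        for field in fields:
            size = first_int(row.get(field))
            if size is not None:
                return size, source
    return None, None


# Hypothetical row: the index scrape has no pack spec, the PDP text does.
example = resolve_pack_size(
    {"orig_packingSpecEn": None, "description_pdp": "Packing : 30", "orig_llm_pack_size": 10}
)
```

Returning the source next to the value mirrors the v2source_pack_size column. A real title parser would need to be stricter than first_int, since titles often contain other numbers.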

Coverage (current)

| Metric | Value |
|---|---|
| Review product_ids in v2 | 479 / 479 |
| Contact-lens SKUs in v2 | 2,821 |
| Reviews with complete segment (modality\|lens_type\|pack) | 15,351 / 16,835 (91.2%) |
| Reviews with v2source_pack_size = description_pdp | 8,939 |
| SKUs with description_pdp | 88 |

Key v2 columns

| Prefix | Examples | Meaning |
|---|---|---|
| orig_llm_* | orig_llm_modality | Processed catalog LLM dims |
| orig_* | orig_packingSpecEn, orig_subcat2_en | Raw scrape JSON |
| v2_* | v2_modality, v2_lens_type, v2_pack_size, v2_brand | Corrected attributes |
| v2source_* | v2source_pack_size | Where each v2 field came from |
| v2_compare_segment | daily\|spherical\|30 | Apples-to-apples key |
| v2_compare_segment_complete | bool | All three dims known |
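
A sketch of how v2_compare_segment and v2_compare_segment_complete relate to the three v2 dimensions. The compare_segment helper is hypothetical, and rendering a missing dimension as the literal text null is an assumption based on how incomplete segments are listed in this index.

```python
from typing import Optional, Tuple


def compare_segment(
    modality: Optional[str],
    lens_type: Optional[str],
    pack_size: Optional[int],
) -> Tuple[str, bool]:
    """Return (segment key, complete flag) for one product."""
    parts = [modality, lens_type, pack_size]
    # Complete only when all three dimensions are known.
    complete = all(part is not None for part in parts)
    # Assumption: unknown dimensions are written as "null" in the key.
    key = "|".join("null" if part is None else str(part) for part in parts)
    return key, complete


full = compare_segment("daily", "spherical", 30)
partial = compare_segment("daily", "spherical", None)
```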

PDP pack enrichment

| Item | Path |
|---|---|
| Scraper | ~/ontologer/data/pipelines/hk_market.hktvmall/scripts/scrape_pdp_pack_sizes.py |
| Output | ~/ontologer/data/raw/hk_market/hktvmall/pdp_pack/{date}/pdp_pack_missing_descr.parquet |
| Latest run | .../pdp_pack/2026-06-26/pdp_pack_missing_descr.parquet |
| Checkpoint | .../pdp_pack/2026-06-26/checkpoint.json |
| Run log | .../pdp_pack/2026-06-26/scrape.log |

2026-06-26 run: 88 / 110 review-corpus null-pack SKUs parsed.

| N (pack size) | SKUs | Reviews |
|---|---|---|
| 30 | 56 | 8,035 |
| 2 | 21 | 112 |
| 6 | 9 | 775 |
| 3 | 1 | 16 |
| 10 | 1 | 1 |

# Re-scrape (resumes; skips already-parsed SKUs)
python3 ~/ontologer/data/pipelines/hk_market.hktvmall/scripts/scrape_pdp_pack_sizes.py --date 2026-06-26
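
For orientation, extracting N from the "Packing : N" text in PDP HTML might look like the sketch below. The regular expression and the parse_pack_size helper are assumptions. The actual pattern in scrape_pdp_pack_sizes.py is not shown in this index and may handle more variants.

```python
import re
from typing import Optional

# Assumption: the label is followed by a colon and then the integer pack size.
PACKING_RE = re.compile(r"Packing\s*:\s*(\d+)", re.IGNORECASE)


def parse_pack_size(pdp_text: str) -> Optional[int]:
    """Return the pack size from PDP text, or None when no match is found."""
    match = PACKING_RE.search(pdp_text)
    return int(match.group(1)) if match else None


# Hypothetical HTML fragment, for illustration only.
found = parse_pack_size("<li>Packing : 30 lenses</li>")
missing = parse_pack_size("<li>Water content: 58%</li>")
```

A None result here would correspond to a pack_not_found outcome in the scrape.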

Segment frequency table

Regenerate:

./venv/bin/python3 analyze_v2_segments.py

Writes output_data/v2_segment_frequencies.csv and .parquet.

Contact lenses only. Each row = unique (v2_modality, v2_lens_type, v2_pack_size) with SKU count and review count.
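
The row definition above amounts to a group-by on the three v2 dimensions. A minimal in-memory sketch follows, with hypothetical sample rows and output column names (sku_count, review_count). analyze_v2_segments.py itself reads the parquet inputs and may name its columns differently.

```python
from collections import defaultdict


def segment_frequencies(products, reviews):
    """Count SKUs and reviews per (v2_modality, v2_lens_type, v2_pack_size)."""
    segment_of = {
        p["product_id"]: (p["v2_modality"], p["v2_lens_type"], p["v2_pack_size"])
        for p in products
    }
    skus = defaultdict(set)
    for product_id, segment in segment_of.items():
        skus[segment].add(product_id)
    review_counts = defaultdict(int)
    for review in reviews:
        segment = segment_of.get(review["product_id"])
        if segment is not None:  # ignore reviews with no mapped product
            review_counts[segment] += 1
    rows = [
        {
            "v2_modality": segment[0],
            "v2_lens_type": segment[1],
            "v2_pack_size": segment[2],
            "sku_count": len(ids),
            "review_count": review_counts.get(segment, 0),
        }
        for segment, ids in skus.items()
    ]
    rows.sort(key=lambda row: row["review_count"], reverse=True)
    return rows


# Hypothetical rows, for illustration only.
products = [
    {"product_id": "A", "v2_modality": "daily", "v2_lens_type": "spherical", "v2_pack_size": 30},
    {"product_id": "B", "v2_modality": "daily", "v2_lens_type": "spherical", "v2_pack_size": 30},
    {"product_id": "C", "v2_modality": "monthly", "v2_lens_type": "color", "v2_pack_size": None},
]
reviews = [{"product_id": "A"}, {"product_id": "A"}, {"product_id": "B"}]
table = segment_frequencies(products, reviews)
```

Segments with SKUs but no reviews still get a row with a review count of 0, which is one way a segment can appear in the table without contributing to the review totals.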

Summary (2026-06-26, after PDP)

| | Segment combos | Reviews |
|---|---|---|
| Complete | 47 | 15,351 (91.2%) |
| Incomplete | 24 | 1,484 (8.8%) |
| Total rows | 71 | 16,835 |

Top complete segments (by reviews)

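| Modality | Lens type | Pack | SKUs | Reviews | % corpus |
|---|---|---|---|---|---|
| daily | spherical | 30 | 215 | 9,471 | 56.3% |
| daily | color | 30 | 369 | 1,750 | 10.4% |
| daily | multifocal | 30 | 51 | 860 | 5.1% |
| daily | spherical | 10 | 187 | 779 | 4.6% |
| daily | spherical | 20 | 54 | 632 | 3.8% |
| monthly | color | 2 | 223 | 137 | 0.8% |
| daily | color | 10 | 686 | 118 | 0.7% |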

Top incomplete segments (by reviews)

| Modality | Lens type | Pack | SKUs | Reviews | % corpus |
|---|---|---|---|---|---|
| daily | spherical | null | 107 | 716 | 4.3% |
| daily | color | null | 148 | 26 | 0.2% |
| monthly | spherical | null | 15 | 13 | 0.1% |

The remaining gap is mostly the 22 PDP failures (19 pack_not_found, 3 missing_url_en), plus catalog SKUs that have a null pack size but no reviews.


Raw & upstream inputs

| File | Role |
|---|---|
| ~/ontologer/data/raw/hk_market/hktvmall/2026-06-12/products.parquet | Weekly index scrape |
| ~/ontologer/data/raw/hk_market/hktvmall/2026-06-19/products.parquet | Weekly index scrape (newest wins) |
| ~/ontologer/data/processed/hk_market/hktvmall/products.parquet | 1.1M SKUs + llm_* dims |
| ~/ontologer/subprojects/scrapes/hk_market/ssot/hktvmall_products_RAW.parquet | Wide canonical (155 cols) |
| ~/plans/hktv_curl.txt | API discovery curls (Algolia, KSS, reviews) |

Pipeline scripts (~/ontologer/data/pipelines/hk_market.hktvmall/scripts/)

| Script | Purpose |
|---|---|
| hktv_api.py | Algolia, Keyword Search Server (KSS), review API clients |
| algolia_search_to_parquet.py | Index product dump |
| pull_algolia_cl_products.py | KSS bulk CL pull |
| scrape_pdp_pack_sizes.py | PDP pack enrichment |
| scrape_cl_reviews_async.py | Async review scraper |
| HKTVMALL_API.md | API surface docs |

This project — build & analysis scripts

| Script | Purpose |
|---|---|
| product_attributes_v2.py | v2 transform logic (shared) |
| build_product_id_to_attributes_v2.py | Build v2 parquet |
| analyze_v2_segments.py | Segment frequency table |
| classify_reviews.py | LLM review classification |
| analyze_reviews.py | Review analytics |
| cluster_reviews.py | Clustering + KNN |
| analyze_product_people_clusters.py | Product-anchored reviewer clusters, trends, store anomalies, GLM |
| analyze_reviewer_transitions.py | Repeat-reviewer brand/modality/lens journeys & timing |
| analyze_wearer_landscape.py | New vs existing wearer themes & landscape roadmap |
| analyze_consumer_clusters.py | Consumer archetypes (product mix × sentiment × topics) |

Consumer clustering (repeat reviewers)

Report: consumer_clustering.md

./venv/bin/python3 analyze_consumer_clusters.py

Cohort: 1,425 users with ≥3 reviews. Features: v2 product-mix shares + sentiment rates + LLM topic prevalence → KMeans (k=5).
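
As an illustration of the product-mix share features, the sketch below builds per-user shares for one v2 dimension and applies the ≥3 review cohort filter. The product_mix_shares helper and the sample rows are hypothetical. The sentiment-rate and topic-prevalence features and the KMeans step itself are out of scope here.

```python
from collections import Counter, defaultdict

MIN_REVIEWS = 3  # cohort threshold stated above


def product_mix_shares(reviews, dim="v2_modality"):
    """Return {user_pk: {value: share}} for users with at least MIN_REVIEWS reviews."""
    values_by_user = defaultdict(list)
    for review in reviews:
        values_by_user[review["user_pk"]].append(review[dim])
    features = {}
    for user, values in values_by_user.items():
        if len(values) < MIN_REVIEWS:
            continue  # outside the repeat-reviewer cohort
        counts = Counter(values)
        features[user] = {value: n / len(values) for value, n in counts.items()}
    return features


# Hypothetical rows, for illustration only.
reviews = [
    {"user_pk": "u1", "v2_modality": "daily"},
    {"user_pk": "u1", "v2_modality": "daily"},
    {"user_pk": "u1", "v2_modality": "monthly"},
    {"user_pk": "u2", "v2_modality": "daily"},
]
shares = product_mix_shares(reviews)
```

Each user's shares sum to 1 per dimension, so vectors built this way are comparable across users with different review counts.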

| Output | Role |
|---|---|
| output_data/reviewer_clusters.parquet | user_pk × cluster + label |
| output_data/consumer_cluster_topics.csv | Topic rates per cluster |
| output_data/consumer_cluster_wearer_mix.csv | Wearer cohort × cluster |
| output_data/consumer_cluster_transitions.csv | Brand/modality switch rates |

Wearer landscape (new vs existing themes)

Report: wearer_landscape.md

./venv/bin/python3 analyze_wearer_landscape.py

| Output | Role |
|---|---|
| output_data/wearer_tags.parquet | Per-review wearer_cohort tag |
| output_data/wearer_topic_rates.csv | Topic mention rates by cohort |
| output_files/wearer_audit_sample.md | Random-sample audit of cohort labels |

Reviewer journey transitions (>2 reviews)

Reports use three layers per section: [result] markdown tables, [babysql:polars] SQL on parquet, [ml:…] Python estimator specs. See report_blocks.py.

| Report | BabySQL | ML blocks | Result tables |
|---|---|---|---|
| product_people_analytics.md | Yes | KMeans, logit, χ², z-tests | Yes |
| consumer_clustering.md | Yes | KMeans, wearer rules | Yes |
| reviewer_transitions.md | Yes | Polars pair construction | Yes |
| wearer_landscape.md | Yes | Classifier rules, χ² | Yes |
| clustering_results.md | Yes (auto section) | KMeans review-level | Yes |

Report: reviewer_transitions.md

./venv/bin/python3 analyze_reviewer_transitions.py

Cohort: users with ≥3 contact-lens reviews. Pairs are consecutive reviews in chronological order with days_gap > 0 (same-day SKU duplicates are collapsed).
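
A plain-Python sketch of the pair construction described above. The real script builds pairs with Polars, so this only illustrates the logic. The field names review_date, from_product_id and to_product_id are hypothetical, and applying the cohort filter before collapsing duplicates is an assumption.

```python
from datetime import date
from itertools import groupby


def transition_pairs(reviews, min_reviews=3):
    """Build consecutive review pairs per user with a positive day gap."""
    ordered = sorted(reviews, key=lambda r: (r["user_pk"], r["review_date"]))
    pairs = []
    for user, group in groupby(ordered, key=lambda r: r["user_pk"]):
        steps = list(group)
        if len(steps) < min_reviews:  # cohort filter before dedup (assumption)
            continue
        # Collapse repeat reviews of the same SKU on the same day.
        seen, deduped = set(), []
        for step in steps:
            key = (step["review_date"], step["product_id"])
            if key not in seen:
                seen.add(key)
                deduped.append(step)
        for prev, nxt in zip(deduped, deduped[1:]):
            days_gap = (nxt["review_date"] - prev["review_date"]).days
            if days_gap > 0:
                pairs.append({
                    "user_pk": user,
                    "from_product_id": prev["product_id"],
                    "to_product_id": nxt["product_id"],
                    "days_gap": days_gap,
                })
    return pairs


# Hypothetical rows, for illustration only.
reviews = [
    {"user_pk": "u1", "product_id": "A", "review_date": date(2026, 1, 1)},
    {"user_pk": "u1", "product_id": "A", "review_date": date(2026, 1, 1)},
    {"user_pk": "u1", "product_id": "B", "review_date": date(2026, 1, 15)},
    {"user_pk": "u1", "product_id": "C", "review_date": date(2026, 2, 1)},
    {"user_pk": "u2", "product_id": "A", "review_date": date(2026, 1, 1)},
    {"user_pk": "u2", "product_id": "B", "review_date": date(2026, 1, 5)},
]
pairs = transition_pairs(reviews)
```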

| Output | Role |
|---|---|
| output_data/reviewer_transition_pairs.parquet | Consecutive step pairs per user |
| output_data/reviewer_brand_flows.csv | Brand from → to counts + median days |
| output_data/reviewer_modality_flows.csv | Modality transitions |
| output_data/reviewer_lens_flows.csv | Lens-type transitions |
| output_data/reviewer_pack_flows.csv | Pack-size transitions |
| output_data/reviewer_user_journey_summary.parquet | Per-user span, # brands/modalities |

Report: product_people_analytics.md

Regenerate:

./venv/bin/python3 analyze_product_people_clusters.py

Joins cl_reviews + LLM enrichment + product_id_to_attributes_v2 + raw user_pk / ref_store_code.

| Output | Role |
|---|---|
| output_data/product_people_report.json | Full numeric payload |
| output_data/reviewer_clusters.parquet | user_pk × reviewer cluster |
| output_data/trends_*.csv | Brand / modality / lens / pack volume & sentiment shifts |
| output_data/store_anomalies.csv | Store-level Mar–May 2026 vs prior 12 mo |
| output_data/segment_sentiment_residuals.csv | Segment positive-rate vs category mean |

Windows: recent = 2026-03…2026-05; baseline = 2025-03…2026-02. Repeat reviewers = ≥3 reviews (user_pk).
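
The window and repeat-reviewer definitions above can be sketched as follows, assuming both windows cover whole calendar months inclusive. The window_of and repeat_reviewers helpers are hypothetical names, not functions from analyze_product_people_clusters.py.

```python
from collections import Counter
from datetime import date

# Assumption: windows span whole months, first to last day inclusive.
RECENT = (date(2026, 3, 1), date(2026, 5, 31))
BASELINE = (date(2025, 3, 1), date(2026, 2, 28))


def window_of(review_date: date):
    """Return 'recent', 'baseline', or None for dates outside both windows."""
    if RECENT[0] <= review_date <= RECENT[1]:
        return "recent"
    if BASELINE[0] <= review_date <= BASELINE[1]:
        return "baseline"
    return None


def repeat_reviewers(reviews, min_reviews=3):
    """Return the set of user_pk values with at least min_reviews reviews."""
    counts = Counter(review["user_pk"] for review in reviews)
    return {user for user, n in counts.items() if n >= min_reviews}


# Hypothetical rows, for illustration only.
sample = [{"user_pk": "u1"}, {"user_pk": "u1"}, {"user_pk": "u1"}, {"user_pk": "u2"}]
repeaters = repeat_reviewers(sample)
```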

Related: initial_assessment_analytics.md (VoC landscape); output_data/cluster_report.json (review-level KMeans).


Legacy (not primary; reference only)

| Path | Note |
|---|---|
| ~/ecosystems/data/_unsorted/triniti/.../hkdata/getters/ | Original getter connectors |
| ~/data_backup/hktvmall/from-hkdata-raw/2026-06-19/ | products_discovered.jsonl, product_index_searched.parquet |