Review sequencing & sentiment — file index
Data root: ~/projects-new/data-analyses/hk_market/
Scripts: ~/ontologer/subprojects/review_sequencing_sentiment/
Last updated: 2026-06-26 (data relocated to projects-new)
Primary outputs (use these)
| File | Rows | Role |
|---|---|---|
| source_data/cl_reviews.parquet | 16,835 | Review corpus (479 product_ids) |
| source_data/product_id_to_attributes_v2.parquet | 2,914 | Apples-to-apples product mapping (v2_modality × v2_lens_type × v2_pack_size) |
| source_data/6091e_cl_reviews_model_enriched.parquet | 16,835 | LLM-enriched reviews (run 6091e) |
| source_data/reviews.parquet | n/a | Raw scrape (user_pk, ref_store_code) |
| output_data/v2_segment_frequencies.csv | 71 | Segment × SKU × review counts |
Join key: cl_reviews.product_id = product_id_to_attributes_v2.product_id (bare id, no hktv- prefix).
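
A minimal sketch of the join-key handling, assuming that ids arriving from other sources may carry a literal `hktv-` prefix that has to be stripped before joining. The helper `bare_product_id` and the sample ids are hypothetical; the real join runs on the parquet files listed above.

```python
def bare_product_id(product_id: str, prefix: str = "hktv-") -> str:
    """Return the id without the store prefix; bare ids pass through unchanged."""
    if product_id.startswith(prefix):
        return product_id[len(prefix):]
    return product_id


# Hypothetical rows standing in for cl_reviews and product_id_to_attributes_v2.
reviews = [
    {"product_id": "hktv-A0001", "rating": 5},
    {"product_id": "A0002", "rating": 3},
]
attributes = {
    "A0001": {"v2_modality": "daily"},
    "A0002": {"v2_modality": "monthly"},
}

# Left join: every review keeps its row; unmatched ids simply gain no attributes.
joined = []
for row in reviews:
    key = bare_product_id(row["product_id"])
    joined.append({**row, "product_id": key, **attributes.get(key, {})})
```

Normalizing both sides to the bare id before joining avoids silent non-matches when one input still carries the prefix.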
v2 mapping (product_id_to_attributes_v2.parquet)
Rebuild:
```bash
cd ~/ontologer/subprojects/review_sequencing_sentiment
./venv/bin/python3 build_product_id_to_attributes_v2.py
```
Auto-loads newest PDP pack file when present.
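
One way the "newest file" lookup could work, sketched under the assumption that each run lives in a directory named with an ISO date (as in `pdp_pack/{date}/pdp_pack_missing_descr.parquet`). The function name `newest_pdp_pack_file` is hypothetical and the build script may select the file differently.

```python
import tempfile
from pathlib import Path
from typing import Optional


def newest_pdp_pack_file(
    pdp_pack_root: Path, filename: str = "pdp_pack_missing_descr.parquet"
) -> Optional[Path]:
    """Return the pack file from the latest dated directory, or None if absent."""
    candidates = [p for p in pdp_pack_root.glob(f"*/{filename}") if p.is_file()]
    # ISO dates (YYYY-MM-DD) sort lexicographically in chronological order.
    candidates.sort(key=lambda p: p.parent.name)
    return candidates[-1] if candidates else None


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for run_date in ("2026-06-12", "2026-06-26"):
        (root / run_date).mkdir()
        (root / run_date / "pdp_pack_missing_descr.parquet").touch()
    newest = newest_pdp_pack_file(root)
    newest_date = newest.parent.name if newest else None
    # A directory with no dated subdirectories yields None ("when present").
    missing = newest_pdp_pack_file(root / "2026-06-12")
```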
Pack size resolution order
1. orig_packingSpecEn / orig_packingSpecZh (index scrape)
2. description_pdp: from PDP HTML "Packing : N" (see below)
3. Title / name parse
4. orig_llm_pack_size
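
The resolution order above can be sketched as a first-match-wins lookup. This is an illustration only: the input keys `pdp_description` and `title`, the helper `parse_first_int`, the title regex, and every source label other than `description_pdp` are assumptions, not the names used in product_attributes_v2.py.

```python
import re
from typing import Optional, Tuple

# Matches the PDP pattern "Packing : N"; the whitespace tolerance is an assumption.
PACKING_RE = re.compile(r"Packing\s*:\s*(\d+)", re.IGNORECASE)
# Hypothetical title pattern such as "30 lenses" or "6 pcs".
TITLE_RE = re.compile(r"(\d+)\s*(?:pcs|pieces|lenses)", re.IGNORECASE)


def parse_first_int(text: Optional[str]) -> Optional[int]:
    """Hypothetical helper: first integer found in a free-text field, else None."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def resolve_pack_size(row: dict) -> Tuple[Optional[int], Optional[str]]:
    """Try each source in priority order; return (pack_size, source_label)."""
    pdp_match = PACKING_RE.search(row.get("pdp_description") or "")
    title_match = TITLE_RE.search(row.get("title") or "")
    candidates = [
        ("orig_packingSpecEn", parse_first_int(row.get("orig_packingSpecEn"))),
        ("orig_packingSpecZh", parse_first_int(row.get("orig_packingSpecZh"))),
        ("description_pdp", int(pdp_match.group(1)) if pdp_match else None),
        ("title", int(title_match.group(1)) if title_match else None),
        ("orig_llm_pack_size", row.get("orig_llm_pack_size")),
    ]
    for source, value in candidates:
        if value is not None:
            return value, source
    return None, None
```

Returning the source label alongside the value is what makes a provenance column such as v2source_pack_size possible.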
Coverage (current)
| Metric | Value |
|---|---|
| Review product_ids in v2 | 479 / 479 |
| Contact-lens SKUs in v2 | 2,821 |
| Reviews with complete segment (modality\|lens_type\|pack) | 15,351 / 16,835 (91.2%) |
| Reviews with v2source_pack_size = description_pdp | 8,939 |
| SKUs with description_pdp | 88 |
Key v2 columns
| Prefix | Examples | Meaning |
|---|---|---|
| orig_llm_* | orig_llm_modality | Processed catalog LLM dims |
| orig_* | orig_packingSpecEn, orig_subcat2_en | Raw scrape JSON |
| v2_* | v2_modality, v2_lens_type, v2_pack_size, v2_brand | Corrected attributes |
| v2source_* | v2source_pack_size | Where each v2 field came from |
| v2_compare_segment | daily\|spherical\|30 | Apples-to-apples key |
| v2_compare_segment_complete | bool | All three dims known |
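
A small sketch of how the segment key and its completeness flag relate to the three v2 dimensions. The function name `compare_segment` is hypothetical, and rendering a missing dimension as the string `null` is an assumption; the build script may encode gaps differently.

```python
from typing import Optional, Tuple


def compare_segment(
    modality: Optional[str], lens_type: Optional[str], pack_size: Optional[int]
) -> Tuple[str, bool]:
    """Build a pipe-delimited segment key plus a flag for 'all three dims known'."""
    dims = (modality, lens_type, pack_size)
    complete = all(d is not None for d in dims)
    # Assumption: unknown dimensions are written as the literal string "null".
    key = "|".join("null" if d is None else str(d) for d in dims)
    return key, complete
```

For example, `compare_segment("daily", "spherical", 30)` gives `("daily|spherical|30", True)`, while a missing pack size gives a key ending in `null` and a `False` flag.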
PDP pack enrichment
| Item | Path |
|---|---|
| Scraper | ~/ontologer/data/pipelines/hk_market.hktvmall/scripts/scrape_pdp_pack_sizes.py |
| Output | ~/ontologer/data/raw/hk_market/hktvmall/pdp_pack/{date}/pdp_pack_missing_descr.parquet |
| Latest run | .../pdp_pack/2026-06-26/pdp_pack_missing_descr.parquet |
| Checkpoint | .../pdp_pack/2026-06-26/checkpoint.json |
| Run log | .../pdp_pack/2026-06-26/scrape.log |
2026-06-26 run: 88 / 110 review-corpus null-pack SKUs parsed.
| N (pack size) | SKUs | Reviews |
|---|---|---|
| 30 | 56 | 8,035 |
| 2 | 21 | 112 |
| 6 | 9 | 775 |
| 3 | 1 | 16 |
| 10 | 1 | 1 |
```bash
# Re-scrape (resumes; skips already-parsed SKUs)
python3 ~/ontologer/data/pipelines/hk_market.hktvmall/scripts/scrape_pdp_pack_sizes.py --date 2026-06-26
```
Segment frequency table
Regenerate:
```bash
./venv/bin/python3 analyze_v2_segments.py
```
Writes output_data/v2_segment_frequencies.csv and .parquet.
Contact lenses only. Each row = unique (v2_modality, v2_lens_type, v2_pack_size) with SKU count and review count.
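
A rough sketch of the aggregation behind one row per segment. It assumes SKU counts are taken over the catalog mapping while review counts come from the review corpus; that is an inference from the tables in this file, where segment SKU counts exceed the 479 reviewed product_ids. The function name and sample ids are hypothetical.

```python
from collections import Counter

# Hypothetical mapping: product_id -> (v2_modality, v2_lens_type, v2_pack_size).
mapping = {
    "A0001": ("daily", "spherical", 30),
    "A0002": ("daily", "spherical", 30),
    "A0003": ("daily", "color", None),
}
# Hypothetical review corpus, one product_id per review.
review_product_ids = ["A0001", "A0001", "A0003"]


def segment_frequencies(mapping: dict, review_product_ids: list) -> list:
    """One row per unique segment with SKU count, review count, completeness."""
    sku_counts = Counter(mapping.values())
    review_counts = Counter(
        mapping[pid] for pid in review_product_ids if pid in mapping
    )
    rows = [
        {
            "segment": segment,
            "skus": sku_counts[segment],
            "reviews": review_counts.get(segment, 0),
            "complete": all(d is not None for d in segment),
        }
        for segment in sku_counts
    ]
    # Largest segments first, matching the "by reviews" ordering used in this file.
    return sorted(rows, key=lambda r: r["reviews"], reverse=True)


table = segment_frequencies(mapping, review_product_ids)
```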
Summary (2026-06-26, after PDP)
| | Segment combos | Reviews |
|---|---|---|
| Complete | 47 | 15,351 (91.2%) |
| Incomplete | 24 | 1,484 (8.8%) |
| Total rows | 71 | 16,835 |
Top complete segments (by reviews)
| Modality | Lens type | Pack | SKUs | Reviews | % corpus |
|---|---|---|---|---|---|
| daily | spherical | 30 | 215 | 9,471 | 56.3% |
| daily | color | 30 | 369 | 1,750 | 10.4% |
| daily | multifocal | 30 | 51 | 860 | 5.1% |
| daily | spherical | 10 | 187 | 779 | 4.6% |
| daily | spherical | 20 | 54 | 632 | 3.8% |
| monthly | color | 2 | 223 | 137 | 0.8% |
| daily | color | 10 | 686 | 118 | 0.7% |
Top incomplete segments (by reviews)
| Modality | Lens type | Pack | SKUs | Reviews | % corpus |
|---|---|---|---|---|---|
| daily | spherical | null | 107 | 716 | 4.3% |
| daily | color | null | 148 | 26 | 0.2% |
| monthly | spherical | null | 15 | 13 | 0.1% |
Remaining gap is mostly 22 PDP failures (19 pack_not_found, 3 missing_url_en) plus catalog SKUs with null pack but no reviews.
Raw & upstream inputs
| File | Role |
|---|---|
| ~/ontologer/data/raw/hk_market/hktvmall/2026-06-12/products.parquet | Weekly index scrape |
| ~/ontologer/data/raw/hk_market/hktvmall/2026-06-19/products.parquet | Weekly index scrape (newest wins) |
| ~/ontologer/data/processed/hk_market/hktvmall/products.parquet | 1.1M SKUs + llm_* dims |
| ~/ontologer/subprojects/scrapes/hk_market/ssot/hktvmall_products_RAW.parquet | Wide canonical (155 cols) |
| ~/plans/hktv_curl.txt | API discovery curls (Algolia, KSS, reviews) |
Pipeline scripts (~/ontologer/data/pipelines/hk_market.hktvmall/scripts/)
| Script | Purpose |
|---|---|
| hktv_api.py | Algolia, Keyword Search Server, review API clients |
| algolia_search_to_parquet.py | Index product dump |
| pull_algolia_cl_products.py | KSS bulk CL pull |
| scrape_pdp_pack_sizes.py | PDP pack enrichment |
| scrape_cl_reviews_async.py | Async review scraper |
| HKTVMALL_API.md | API surface docs |
This project — build & analysis scripts
| Script | Purpose |
|---|---|
| product_attributes_v2.py | v2 transform logic (shared) |
| build_product_id_to_attributes_v2.py | Build v2 parquet |
| analyze_v2_segments.py | Segment frequency table |
| classify_reviews.py | LLM review classification |
| analyze_reviews.py | Review analytics |
| cluster_reviews.py | Clustering + KNN |
| analyze_product_people_clusters.py | Product-anchored reviewer clusters, trends, store anomalies, GLM |
| analyze_reviewer_transitions.py | Repeat-reviewer brand/modality/lens journeys & timing |
| analyze_wearer_landscape.py | New vs existing wearer themes & landscape roadmap |
| analyze_consumer_clusters.py | Consumer archetypes (product mix × sentiment × topics) |
Consumer clustering (repeat reviewers)
Report: consumer_clustering.md
```bash
./venv/bin/python3 analyze_consumer_clusters.py
```
Cohort: 1,425 users with ≥3 reviews. Features: v2 product-mix shares + sentiment rates + LLM topic prevalence → KMeans (k=5).
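
To illustrate the product-mix part of the feature set, here is a sketch that computes per-user shares for one v2 dimension and applies the ≥3-review cohort filter. It covers only that feature family; the sentiment and topic features and the KMeans step are not shown. The function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

MIN_REVIEWS = 3  # cohort threshold stated above


def product_mix_shares(reviews: list, dimension: str = "v2_modality") -> dict:
    """Per-user share of reviews falling in each value of one v2 dimension."""
    values_by_user = defaultdict(list)
    for review in reviews:
        values_by_user[review["user_pk"]].append(review[dimension])

    features = {}
    for user_pk, values in values_by_user.items():
        if len(values) < MIN_REVIEWS:
            continue  # outside the repeat-reviewer cohort
        counts = Counter(values)
        features[user_pk] = {value: n / len(values) for value, n in counts.items()}
    return features


# Hypothetical rows: u1 qualifies with four reviews, u2 does not.
sample_reviews = [
    {"user_pk": "u1", "v2_modality": "daily"},
    {"user_pk": "u1", "v2_modality": "daily"},
    {"user_pk": "u1", "v2_modality": "monthly"},
    {"user_pk": "u1", "v2_modality": "daily"},
    {"user_pk": "u2", "v2_modality": "daily"},
    {"user_pk": "u2", "v2_modality": "daily"},
]
shares = product_mix_shares(sample_reviews)
```

Shares rather than raw counts keep heavy reviewers from dominating the distance metric used by a clustering step.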
| Output | Role |
|---|---|
| output_data/reviewer_clusters.parquet | user_pk × cluster + label |
| output_data/consumer_cluster_topics.csv | Topic rates per cluster |
| output_data/consumer_cluster_wearer_mix.csv | Wearer cohort × cluster |
| output_data/consumer_cluster_transitions.csv | Brand/modality switch rates |
Wearer landscape (new vs existing themes)
Report: wearer_landscape.md
```bash
./venv/bin/python3 analyze_wearer_landscape.py
```
| Output | Role |
|---|---|
| output_data/wearer_tags.parquet | Per-review wearer_cohort tag |
| output_data/wearer_topic_rates.csv | Topic mention rates by cohort |
| output_files/wearer_audit_sample.md | Random-sample audit of cohort labels |
Reviewer journey transitions (>2 reviews)
Reports use three layers per section: [result] markdown tables, [babysql:polars] SQL on parquet, [ml:…] Python estimator specs. See report_blocks.py.
| Report | BabySQL | ML blocks | Result tables |
|---|---|---|---|
| product_people_analytics.md | Yes | KMeans, logit, χ², z-tests | Yes |
| consumer_clustering.md | Yes | KMeans, wearer rules | Yes |
| reviewer_transitions.md | Yes | Polars pair construction | Yes |
| wearer_landscape.md | Yes | Classifier rules, χ² | Yes |
| clustering_results.md | Yes (auto section) | KMeans review-level | Yes |
Report: reviewer_transitions.md
```bash
./venv/bin/python3 analyze_reviewer_transitions.py
```
Cohort: users with ≥3 contact-lens reviews. Chronological pairs with days_gap > 0 (same-day SKU dupes collapsed).
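
The pair construction can be sketched as follows. Two readings are assumed here and may differ from the script: "same-day SKU dupes" means repeated reviews of the same product_id on the same date, and the ≥3-review threshold is applied before collapsing. The function name and sample rows are hypothetical.

```python
from datetime import date
from itertools import groupby


def transition_pairs(reviews: list, min_reviews: int = 3) -> list:
    """Consecutive (from, to) review steps per user with a positive day gap."""
    ordered = sorted(reviews, key=lambda r: (r["user_pk"], r["review_date"]))
    pairs = []
    for user_pk, group in groupby(ordered, key=lambda r: r["user_pk"]):
        rows = list(group)
        if len(rows) < min_reviews:
            continue
        # Collapse same-day duplicates of the same SKU, keeping the first.
        seen, steps = set(), []
        for row in rows:
            dedupe_key = (row["review_date"], row["product_id"])
            if dedupe_key not in seen:
                seen.add(dedupe_key)
                steps.append(row)
        for prev, nxt in zip(steps, steps[1:]):
            days_gap = (nxt["review_date"] - prev["review_date"]).days
            if days_gap > 0:
                pairs.append({
                    "user_pk": user_pk,
                    "from_product_id": prev["product_id"],
                    "to_product_id": nxt["product_id"],
                    "days_gap": days_gap,
                })
    return pairs


sample = [
    {"user_pk": "u1", "product_id": "A", "review_date": date(2026, 1, 1)},
    {"user_pk": "u1", "product_id": "A", "review_date": date(2026, 1, 1)},
    {"user_pk": "u1", "product_id": "B", "review_date": date(2026, 2, 1)},
    {"user_pk": "u1", "product_id": "A", "review_date": date(2026, 3, 15)},
    {"user_pk": "u2", "product_id": "A", "review_date": date(2026, 1, 5)},
    {"user_pk": "u2", "product_id": "B", "review_date": date(2026, 1, 9)},
]
pairs = transition_pairs(sample)
```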
| Output | Role |
|---|---|
| output_data/reviewer_transition_pairs.parquet | Consecutive step pairs per user |
| output_data/reviewer_brand_flows.csv | Brand from → to counts + median days |
| output_data/reviewer_modality_flows.csv | Modality transitions |
| output_data/reviewer_lens_flows.csv | Lens-type transitions |
| output_data/reviewer_pack_flows.csv | Pack-size transitions |
| output_data/reviewer_user_journey_summary.parquet | Per-user span, # brands/modalities |
Report: product_people_analytics.md
Regenerate:
```bash
./venv/bin/python3 analyze_product_people_clusters.py
```
Joins cl_reviews + LLM enrichment + product_id_to_attributes_v2 + raw user_pk / ref_store_code.
| Output | Role |
|---|---|
| output_data/product_people_report.json | Full numeric payload |
| output_data/reviewer_clusters.parquet | user_pk × reviewer cluster |
| output_data/trends_*.csv | Brand / modality / lens / pack volume & sentiment shifts |
| output_data/store_anomalies.csv | Store-level Mar–May 2026 vs prior 12 mo |
| output_data/segment_sentiment_residuals.csv | Segment positive-rate vs category mean |
Windows: recent = 2026-03…2026-05; baseline = 2025-03…2026-02. Repeat reviewers = ≥3 reviews (user_pk).
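
The two windows can be expressed as a small date classifier. This sketch assumes both windows cover whole calendar months with inclusive end dates; the constant and function names are hypothetical.

```python
from datetime import date
from typing import Optional

# Assumption: month ranges are inclusive of their first and last calendar day.
RECENT = (date(2026, 3, 1), date(2026, 5, 31))
BASELINE = (date(2025, 3, 1), date(2026, 2, 28))


def window_for(review_date: date) -> Optional[str]:
    """Label a review date as 'recent', 'baseline', or None when outside both."""
    if RECENT[0] <= review_date <= RECENT[1]:
        return "recent"
    if BASELINE[0] <= review_date <= BASELINE[1]:
        return "baseline"
    return None
```

The baseline spans 12 months against a 3-month recent window, so volume comparisons between the two need per-month normalization.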
Related: initial_assessment_analytics.md (VoC landscape); output_data/cluster_report.json (review-level KMeans).
Legacy (not primary; reference only)
| Path | Note |
|---|---|
| ~/ecosystems/data/_unsorted/triniti/.../hkdata/getters/ | Original getter connectors |
| ~/data_backup/hktvmall/from-hkdata-raw/2026-06-19/ | products_discovered.jsonl, product_index_searched.parquet |