marketplace-recsys-feature-engineering

Marketplace Engineering Recsys Feature Engineering Best Practices

Comprehensive first-principles guide for deriving usable recommender features from the raw assets of a two-sided trust marketplace — listing photos, owner-supplied listing metadata, and sitter wizard responses — for item-to-item, user-to-item, and user-to-user solutions. Contains 44 rules across 8 categories ordered by cascade impact on the feature-engineering lifecycle, plus one playbook that composes the rules into an end-to-end feature discovery workflow.

This skill is the upstream precursor to `marketplace-personalisation` (AWS Personalize) and `marketplace-search-recsys-planning` (OpenSearch retrieval). Those skills treat features as inputs they already have; this skill is about deciding what features to build from the raw assets, which decisions they serve, and how to prove each one is worth its maintenance cost.

When to Apply

Reference this skill when:

Planning what to extract from listing photos, descriptions, or amenity lists to power i2i similarity or u2i ranking
Designing or revising the sitter onboarding wizard with recsys features as the primary output
Deciding whether to build a vision embedding pipeline, a text encoder, or neither — and in what order
Composing existing base features into item-to-item, user-to-item, or user-to-user scoring
Auditing an existing feature store for coverage, drift, PII, duplication, or orphan features
Choosing a ship/kill criterion for a new recsys feature and designing the ablation A/B test
Answering the question: "we want to improve the similar-homes shelf — what feature should we build?"

Setup

This skill has no user-specific configuration — it is self-contained. References are live URLs to engineering blogs from Airbnb, Pinterest, DoorDash, Uber, Netflix, and Google, to open-source libraries (Feast, Sentence-Transformers, Hugging Face CLIP, H3), to foundational academic papers (Airbnb KDD 2018, Pinterest ItemSage, YouTube Semantic IDs, PinSage), and to Google's Rules of Machine Learning.

Rule Categories

Categories are ordered by cascade impact on the feature-engineering lifecycle: auditing mistakes build features on data that does not exist, first-principles mistakes produce features that do not map to real decisions, extraction mistakes poison everything downstream, and so on. Fix earlier-stage problems before later-stage problems.


| # | Category | Prefix | Impact |
|---|----------|--------|--------|
| 1 | Asset Audit and Inventory | `audit-` | CRITICAL |
| 2 | First-Principles Feature Decomposition | `firstp-` | CRITICAL |
| 3 | Image Feature Extraction | `vision-` | HIGH |
| 4 | Listing Text and Metadata Extraction | `listing-` | HIGH |
| 5 | Sitter Wizard and Profile Extraction | `wizard-` | HIGH |
| 6 | Derived Similarity and Affinity | `derive-` | MEDIUM-HIGH |
| 7 | Feature Quality and Governance | `quality-` | MEDIUM-HIGH |
| 8 | Incremental Rollout and Value Proof | `prove-` | MEDIUM |

Quick Reference

1. Asset Audit and Inventory (CRITICAL)

`audit-measure-coverage-before-modelling`: reject fields below 80% coverage from the feature plan
`audit-sample-every-asset-type-end-to-end`: pull 100 real instances through the real fetch path before planning
`audit-verify-rights-and-privacy-before-extraction`: ToS, GDPR, consent, face blur before encoding
`audit-quantify-freshness-per-asset`: age distribution + expiry + refresh bucket
`audit-separate-raw-assets-from-derived-features`: raw immutable in object store, derived versioned in feature store

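The coverage rule above can be illustrated with a minimal sketch. The 80% floor comes from the rule summary; everything else (the field names, the sample records, and the choice to treat `None` and empty values as missing) is a hypothetical assumption for illustration, not the skill's own implementation.

```python
from typing import Any

COVERAGE_FLOOR = 0.80  # floor taken from the rule summary above


def field_coverage(records: list[dict[str, Any]]) -> dict[str, float]:
    """Fraction of records in which each field is present and non-empty."""
    fields = {name for record in records for name in record}
    total = len(records)
    coverage = {}
    for name in sorted(fields):
        # Assumption: None, empty strings, and empty containers count as missing.
        filled = sum(1 for r in records if r.get(name) not in (None, "", [], {}))
        coverage[name] = filled / total if total else 0.0
    return coverage


def split_by_coverage(coverage: dict[str, float], floor: float = COVERAGE_FLOOR):
    """Partition field names into those that meet the floor and those that do not."""
    keep = [name for name, value in coverage.items() if value >= floor]
    reject = [name for name, value in coverage.items() if value < floor]
    return keep, reject


# Hypothetical sample of raw listing records.
listings = [
    {"listing_id": 1, "description": "Sunny flat", "garden_size_m2": None},
    {"listing_id": 2, "description": "Farmhouse", "garden_size_m2": 120},
    {"listing_id": 3, "description": "", "garden_size_m2": None},
    {"listing_id": 4, "description": "Cottage", "garden_size_m2": None},
    {"listing_id": 5, "description": "Loft", "garden_size_m2": None},
]

coverage = field_coverage(listings)
keep, reject = split_by_coverage(coverage)
print(coverage)
print("keep:", keep, "reject:", reject)
```

In this sample `garden_size_m2` is filled in only one of five records, so it falls below the floor and would be dropped from the feature plan before any modelling work starts.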

2. First-Principles Feature Decomposition (CRITICAL)

`firstp-start-from-the-decision-not-the-algorithm`: decision first, sub-judgments second, tools last
`firstp-ask-what-signal-a-human-uses`: interview 8-12 owners and sitters; features trace back to quotes
`firstp-tie-every-feature-to-a-specific-solution`: no feature without a named i2i/u2i/u2u consumer
`firstp-prefer-directly-observed-over-learned`: observed columns first, learned embeddings second
`firstp-reject-features-you-cannot-serve-at-inference`: training-serving parity starts at design time
`firstp-kill-features-a-popularity-baseline-already-captures`: correlation screen before registration

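One way to read the "correlation screen before registration" rule is sketched below. The rule summary does not name a statistic or a cut-off, so the use of Pearson correlation, the `MAX_ABS_CORRELATION` value of 0.9, and the sample data are all assumptions made for illustration.

```python
import math

MAX_ABS_CORRELATION = 0.9  # hypothetical cut-off, not taken from the source


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def redundant_with_popularity(feature, popularity, limit=MAX_ABS_CORRELATION) -> bool:
    """True when the candidate feature is nearly a linear copy of popularity."""
    return abs(pearson(feature, popularity)) >= limit


# Hypothetical per-listing values: a popularity count and two candidate features.
popularity = [10, 20, 30, 40, 50]
candidate_a = [1.0, 2.1, 2.9, 4.2, 5.0]  # tracks popularity closely
candidate_b = [3, 1, 4, 1, 5]            # mostly independent of popularity

a_redundant = redundant_with_popularity(candidate_a, popularity)
b_redundant = redundant_with_popularity(candidate_b, popularity)
print(a_redundant, b_redundant)
```

Pearson correlation only detects linear redundancy; a rank correlation or a small ablation against the popularity baseline may be a better screen for skewed features.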

3. Image Feature Extraction (HIGH)

`vision-use-clip-for-zero-shot-listing-embeddings`: zero-shot CLIP ships in a week
`vision-detect-room-types-before-detecting-amenities`: room prior conditions the amenity threshold
`vision-quantify-image-quality-separately-from-content`: blur, lighting, aesthetic as their own features
`vision-extract-per-object-counts-not-just-presence`: `n_bed = 4` beats `has_bed = true`
`vision-pool-embeddings-across-a-listings-photo-set`: pooled listing vector; per-photo stored alongside
`vision-fine-tune-on-your-domain-when-clip-underperforms`: contrastive fine-tune only after zero-shot plateaus

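The pooling rule above can be sketched without any model dependency. The rule summary does not say which pooling function to use, so mean pooling over L2-normalised photo vectors is an assumption here, and the two-dimensional vectors stand in for real image embeddings.

```python
import math


def l2_normalise(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def pool_listing_embedding(photo_embeddings: list[list[float]]) -> dict:
    """Mean-pool per-photo embeddings into one listing vector.

    Returns both the pooled vector and the normalised per-photo vectors,
    so the per-photo embeddings can be stored alongside the pooled one.
    """
    if not photo_embeddings:
        raise ValueError("listing has no photo embeddings")
    dim = len(photo_embeddings[0])
    if any(len(e) != dim for e in photo_embeddings):
        raise ValueError("all photo embeddings must share one dimension")
    unit = [l2_normalise(e) for e in photo_embeddings]
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return {"listing_vector": l2_normalise(mean), "photo_vectors": unit}


# Hypothetical 2-d embeddings for a listing with two photos.
pooled = pool_listing_embedding([[3.0, 4.0], [4.0, 3.0]])
print(pooled["listing_vector"])
```

Normalising before averaging keeps one unusually large photo vector from dominating the listing vector; whether that matters depends on the encoder in use.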

4. Listing Text and Metadata Extraction (HIGH)

`listing-declare-categorical-fields-for-bounded-vocabularies`: bounded vocab → categorical, validated on write
`listing-multi-hot-encode-amenity-lists`: fixed amenity vocabulary → multi-hot vector
`listing-hash-geo-to-hierarchies-not-raw-lat-lon`: H3 at multiple resolutions
`listing-embed-description-with-pretrained-sentence-encoder`: all-MiniLM-L6-v2 for cheap semantic text features
`listing-extract-stay-duration-shape-not-just-length`: bin + holiday overlap + flexibility, not raw day count
`listing-encode-pet-requirements-as-structured-triples`: `(species, count, special_needs)` triples plus free text alongside

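As an illustration of the multi-hot rule above, the sketch below maps an amenity list onto a fixed vocabulary. The vocabulary contents, the lower-casing step, and the decision to report out-of-vocabulary terms rather than drop them silently are all hypothetical choices, not something the source specifies.

```python
# Hypothetical fixed vocabulary; the order defines the vector positions.
AMENITY_VOCAB = ("wifi", "garden", "parking", "washing_machine")


def multi_hot(amenities: list[str], vocab: tuple[str, ...] = AMENITY_VOCAB):
    """Encode an amenity list as a 0/1 vector over a fixed vocabulary.

    Returns the vector and the sorted list of terms that are not in the
    vocabulary, so that unknown values can be reviewed instead of lost.
    """
    present = {a.strip().lower() for a in amenities}
    vector = [1 if term in present else 0 for term in vocab]
    unknown = sorted(present - set(vocab))
    return vector, unknown


vector, unknown = multi_hot(["WiFi", "garden", "hot tub"])
print(vector, unknown)
```

Because the vocabulary is fixed, the vector length and position meaning stay stable across listings, which is what lets the same encoding feed both training and serving.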

5. Sitter Wizard and Profile Extraction (HIGH)

`wizard-order-questions-by-information-gain`: discriminative questions first, narrative last
`wizard-prefer-multiple-choice-over-free-text`: categorical features by construction
`wizard-make-skips-genuine-and-log-them`: skip is signal; defaults destroy it
`wizard-capture-experience-as-counts-and-dates`: numbers, not adjectives; platform history overrides self-declaration
`wizard-separate-hard-constraints-from-soft-preferences`: filters vs ranking features

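Ordering questions by information gain, as the first rule in this category suggests, could look like the sketch below. The source does not say which outcome the gain is measured against, so the `booked` label, the question names, and the answers are hypothetical; a skipped question is kept as its own answer value, in line with the rule that a skip is signal.

```python
import math
from collections import Counter, defaultdict


def entropy(labels: list) -> float:
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(answers: list, outcomes: list) -> float:
    """Reduction in outcome entropy after splitting on the answer values."""
    groups = defaultdict(list)
    for answer, outcome in zip(answers, outcomes):
        groups[answer].append(outcome)
    conditional = sum(
        len(group) / len(outcomes) * entropy(group) for group in groups.values()
    )
    return entropy(outcomes) - conditional


# Hypothetical historical wizard answers for four sitters, plus an outcome label.
wizard_answers = {
    "has_own_car": ["yes", "yes", "no", "no"],
    "favourite_season": ["summer", "winter", "summer", "winter"],
    "large_dog_experience": ["yes", "SKIPPED", "yes", "no"],
}
booked = [1, 1, 0, 0]

gains = {q: information_gain(a, booked) for q, a in wizard_answers.items()}
question_order = sorted(gains, key=gains.get, reverse=True)
print(gains)
print(question_order)
```

With real data the gains would be estimated on far more than four sitters, and questions with many distinct answers would need a correction such as gain ratio to avoid being favoured by chance.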

6. Derived Similarity and Affinity (MEDIUM-HIGH)

`derive-precompute-i2i-nearest-neighbours-offline`: ANN shelf built nightly, served from KV in <5ms
`derive-fuse-modalities-before-item-similarity`: vision + text + structured, weighted and normalised
`derive-use-two-tower-for-user-item-affinity`: dual encoder trained on interactions; ANN-retrieval-ready
`derive-score-u2u-as-symmetric-mutual-fit`: `min(P(owner), P(sitter))`; one-sided scoring produces wasted requests
`derive-decompose-affinity-into-interpretable-subscores`: fit/safety/logistics/price subscores + blend
`derive-cache-user-embedding-with-short-ttl`: session-level cache, 60-300s TTL

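The mutual-fit formula `min(P(owner), P(sitter))` from the list above can be shown in a few lines. The two probabilities are assumed to come from separately trained one-sided models, which is not stated in the source, and the candidate names and values are invented for the example.

```python
def mutual_fit(p_owner_accepts: float, p_sitter_accepts: float) -> float:
    """Symmetric u2u score: the pair is only as good as its weaker side."""
    for p in (p_owner_accepts, p_sitter_accepts):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return min(p_owner_accepts, p_sitter_accepts)


# Hypothetical candidates: (P(owner accepts sitter), P(sitter accepts sit)).
candidates = {
    "sitter_a": (0.9, 0.2),  # owner would love them, sitter is unlikely to say yes
    "sitter_b": (0.6, 0.5),  # moderate on both sides
}

scores = {name: mutual_fit(p_o, p_s) for name, (p_o, p_s) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

A one-sided ranking on the owner probability would put `sitter_a` first and produce a request that is likely to be declined; the symmetric score ranks `sitter_b` first.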

7. Feature Quality and Governance (MEDIUM-HIGH)

`quality-version-feature-definitions-in-one-registry`: one name, one implementation, one owner
`quality-serve-training-and-inference-from-one-store`: feature store as the single source of truth
`quality-gate-features-on-coverage-and-drift`: coverage floor + PSI alarm
`quality-scrub-pii-before-features-leave-secure-zone`: face blur and regex scrubbing before encoding
`quality-freeze-feature-schemas-per-model-version`: schema hash pinned to model artifact

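The "PSI alarm" in the gating rule above refers to the Population Stability Index, which compares a feature's binned distribution at training time with its current distribution. The sketch below computes it from bin counts; the alarm threshold of 0.2, the epsilon used for empty bins, and the example counts are assumptions, since the source gives none of them.

```python
import math

PSI_ALARM_THRESHOLD = 0.2  # hypothetical threshold, not taken from the source


def population_stability_index(
    expected_counts: list[int], actual_counts: list[int], eps: float = 1e-6
) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected),
    where actual and expected are per-bin fractions."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Clamp empty bins to eps so the logarithm stays defined.
        e_frac = max(expected / expected_total, eps)
        a_frac = max(actual / actual_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score


# Hypothetical bin counts for one feature: training snapshot vs. this week.
training_bins = [50, 30, 20]
current_bins = [20, 30, 50]

psi = population_stability_index(training_bins, current_bins)
drift_alarm = psi >= PSI_ALARM_THRESHOLD
print(round(psi, 4), drift_alarm)
```

The bin edges must be fixed from the training snapshot and reused for every later comparison; recomputing them on current data would hide the drift the index is meant to expose.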

8. Incremental Rollout and Value Proof (MEDIUM)

`prove-ship-one-feature-at-a-time`: one feature, one experiment, one decision
`prove-measure-lift-against-feature-ablated-variant`: ablation isolates the feature from incidental changes
`prove-kill-features-that-dont-earn-maintenance`: quarterly kill review on attributed lift
`prove-dedicate-random-exploration-slice-to-new-features`: 3-5% slice catches offline-close-to-tied winners
`prove-retain-feature-free-baseline-permanently`: popularity baseline as drift anchor

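Measuring lift against a feature-ablated variant comes down to comparing two conversion rates. The sketch below uses a two-proportion z-test, which is a swapped-in technique chosen for illustration: the source names the ablation design but not the statistical test, and the traffic and conversion numbers are invented.

```python
import math


def ablation_lift(full_conv: int, full_n: int, ablated_conv: int, ablated_n: int):
    """Relative lift of the full model over the feature-ablated variant,
    with a two-sided two-proportion z-test on the conversion rates."""
    p_full = full_conv / full_n
    p_ablated = ablated_conv / ablated_n
    relative_lift = (p_full - p_ablated) / p_ablated
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (full_conv + ablated_conv) / (full_n + ablated_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / full_n + 1 / ablated_n))
    z = (p_full - p_ablated) / std_err
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return relative_lift, z, p_value


# Hypothetical experiment: 1,000 users per arm.
lift, z, p_value = ablation_lift(
    full_conv=130, full_n=1000, ablated_conv=100, ablated_n=1000
)
print(f"lift={lift:.2%} z={z:.2f} p={p_value:.3f}")
```

The ship/kill decision would then compare the attributed lift with the feature's maintenance cost, not only with a significance threshold.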

Discovering New Features

One playbook composes the rules into an end-to-end workflow:

`references/playbooks/discovering.md`: Discover new features from raw marketplace assets. A seven-step workflow that starts with an asset audit and a decision decomposition and ends with a shipped ablation A/B against a feature-ablated baseline. Use when the task is "what should we build next?" rather than "fix this specific feature."

Read the playbook first when the task is an open-ended "how do we extract more signal from X?" Read individual rules when a specific implementation question arises.

How to Use

Read `references/_sections.md` for category structure and cascade rationale
Read `gotchas.md` for accumulated diagnostic lessons before suggesting interventions
Read `references/playbooks/discovering.md` to plan a new feature discovery cycle
Read individual rule files under `references/` when a specific task matches the rule title
Use `assets/templates/_template.md` to author new rules as the skill grows

Reference Files

| File | Description |
|------|-------------|
| `references/_sections.md` | Category definitions, impact ordering, cascade rationale |
| `references/playbooks/discovering.md` | End-to-end feature discovery playbook |
| `gotchas.md` | Accumulated feature-engineering diagnostic lessons (living) |
| `assets/templates/_template.md` | Template for authoring new rules |
| `metadata.json` | Version, discipline, authoritative references |
