algo-hr-turnover
Employee Turnover Prediction
Overview
Turnover prediction uses classification models (logistic regression, random forest, XGBoost) to estimate the probability an employee will leave within a defined period (typically 6-12 months). Features include tenure, compensation, performance, promotion history, and engagement signals.
When to Use
Trigger conditions:
- Identifying employees at high risk of voluntary departure
- Quantifying which factors drive turnover for targeted interventions
- Prioritizing retention budgets toward highest-impact employees
When NOT to use:
- For involuntary termination planning (different process and ethics)
- When headcount is < 200 (insufficient data for reliable modeling)
Algorithm
IRON LAW: Turnover Models Predict RISK, Not Certainty
A predicted 80% turnover probability means "employees with similar
profiles historically left 80% of the time." It does NOT mean this
specific employee WILL leave. Never use model outputs as the sole basis
for employment decisions; doing so creates legal and ethical liability.
Phase 1: Input Validation
Collect: employee demographics, tenure, compensation (relative to market), last promotion date, performance ratings, manager change history, engagement survey scores, commute distance. Outcome: voluntary departure within N months.
Gate: At least 200 turnover events, and every feature must be available before the departure date (no look-ahead).
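The gate above can be sketched as a small pre-training check. This is a minimal illustration, not a prescribed implementation: the record layout (`employee_id`, `left`, `departure_date`, `feature_dates`) and the function name are assumptions made for the example.

```python
from datetime import date

MIN_TURNOVER_EVENTS = 200  # threshold taken from the gate in the text


def validate_training_data(records, min_events=MIN_TURNOVER_EVENTS):
    """Return (ok, problems) for a list of employee records.

    Each record is a dict with hypothetical keys:
      'employee_id', 'left' (bool), 'departure_date' (date or None),
      'feature_dates' (feature name -> date the value was measured).
    """
    problems = []

    # Gate 1: enough observed departures to learn from.
    events = sum(1 for r in records if r["left"])
    if events < min_events:
        problems.append(f"only {events} turnover events, need {min_events}")

    # Gate 2: every feature of a leaver must predate the departure.
    for r in records:
        if not r["left"]:
            continue
        for name, measured in r["feature_dates"].items():
            if measured >= r["departure_date"]:
                problems.append(
                    f"{r['employee_id']}: {name} not measured before departure"
                )

    return (not problems, problems)
```

Returning the list of problems rather than raising on the first one lets the data owner fix every issue in a single pass.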
Phase 2: Core Algorithm
- Feature engineering: tenure buckets, comp ratio (salary/market median), time since last promotion, manager tenure, engagement trend
- Handle class imbalance: turnover rate typically 10-20%. Use SMOTE or class weights.
- Train: logistic regression (interpretable, HR-preferred) or GBDT such as XGBoost (often higher accuracy, harder to explain)
- Output: probability of departure + top risk factors per employee
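The feature-engineering and class-weight steps above might look roughly like the sketch below. The bucket edges, dictionary keys, and function names are illustrative assumptions. The weight formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), which is one option alongside SMOTE.

```python
from collections import Counter
from datetime import date


def tenure_bucket(years):
    # Bucket edges are illustrative assumptions, not values from the text.
    if years < 1:
        return "<1y"
    if years < 3:
        return "1-3y"
    if years < 5:
        return "3-5y"
    return "5y+"


def engineer_features(emp, as_of):
    """Derive model features from one raw employee record.

    `emp` uses hypothetical keys: salary, market_median, hire_date,
    last_promotion_date (may be None), engagement_scores (oldest first).
    """
    tenure_years = (as_of - emp["hire_date"]).days / 365.25
    # Never promoted: measure from the hire date instead.
    last_promo = emp.get("last_promotion_date") or emp["hire_date"]
    months_since_promo = (as_of - last_promo).days / 30.44
    scores = emp["engagement_scores"]
    return {
        "tenure_bucket": tenure_bucket(tenure_years),
        "comp_ratio": emp["salary"] / emp["market_median"],
        "months_since_promotion": round(months_since_promo, 1),
        # Negative values mean engagement is declining.
        "engagement_trend": scores[-1] - scores[0],
    }


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}
```

With a 15% turnover rate, `balanced_class_weights` gives leavers a weight several times that of stayers, so the minority class is not ignored during training.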
Phase 3: Verification
Evaluate: AUC, precision-recall (at actionable thresholds). Backtest: did the model correctly flag employees who left in the past 6 months?
Gate: AUC > 0.70, precision > 50% at top decile.
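As a rough sketch of this gate, both metrics can be computed without any library. The function names are assumptions. AUC is computed by pairwise comparison, which is quadratic and only suitable for small backtests; a production pipeline would more likely use an established metrics library.

```python
def roc_auc(y_true, scores):
    """AUC as the chance that a random leaver outranks a random stayer.

    Ties count as half a win. Labels are 1 (left) and 0 (stayed).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one leaver and one stayer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_top_decile(y_true, scores):
    """Share of actual leavers among the 10% highest-scored employees."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    k = max(1, len(ranked) // 10)
    return sum(y for _, y in ranked[:k]) / k


def passes_gate(y_true, scores, min_auc=0.70, min_precision=0.50):
    """Apply the thresholds from the text: AUC > 0.70, precision > 50%."""
    return (
        roc_auc(y_true, scores) > min_auc
        and precision_at_top_decile(y_true, scores) > min_precision
    )
```

For the backtest, `y_true` would hold the observed departures over the past 6 months and `scores` the probabilities the model assigned before that window began.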
Phase 4: Output
Return risk scores with driver analysis.
Output Format
```json
{
  "risk_scores": [{"employee_id": "E123", "turnover_prob": 0.72, "risk_tier": "high", "top_drivers": ["low_comp_ratio", "no_promotion_3yr"]}],
  "metadata": {"model": "xgboost", "auc": 0.78, "prediction_window_months": 12}
}
```
Examples
Sample I/O
Input: Employee: 4yr tenure, comp ratio 0.85, no promotion in 3yr, engagement score declining
Expected: High risk (>0.6). Top drivers: below-market compensation, stalled career progression.
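A result like the one above could be packaged into the documented output format along these lines. This is a sketch under assumptions: the 0.6 cut-off for "high" comes from the sample, while the 0.3 cut-off for "medium", the two-driver limit, and the function names are hypothetical.

```python
import json


def risk_tier(prob, high=0.6, medium=0.3):
    # 'high' above 0.6 follows the sample; the 0.3 'medium' cut-off is assumed.
    if prob > high:
        return "high"
    if prob > medium:
        return "medium"
    return "low"


def build_output(predictions, model, auc, window_months):
    """Assemble the JSON-ready result.

    `predictions` is a list of (employee_id, probability, ranked_drivers).
    """
    return {
        "risk_scores": [
            {
                "employee_id": emp_id,
                "turnover_prob": round(prob, 2),
                "risk_tier": risk_tier(prob),
                "top_drivers": list(drivers[:2]),  # keep the two strongest
            }
            for emp_id, prob, drivers in predictions
        ],
        "metadata": {
            "model": model,
            "auc": auc,
            "prediction_window_months": window_months,
        },
    }


result = build_output(
    [("E123", 0.72, ["low_comp_ratio", "no_promotion_3yr"])],
    model="xgboost",
    auc=0.78,
    window_months=12,
)
print(json.dumps(result, indent=2))
```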
Edge Cases
| Input | Expected | Why |
|---|---|---|
| New hire (< 6 months) | Unreliable prediction | Insufficient behavioral data |
| Top performer, high comp | Still could leave | Non-financial factors (manager, culture) matter |
| Post-reorg period | Model drift likely | Unusual conditions distort patterns |
Gotchas
- Survivorship bias: Training data only includes people who were hired and stayed long enough to observe. Early-stage leavers may be underrepresented.
- Feature leakage: "Started job searching" or "updated LinkedIn" are strong predictors but ethically and legally problematic to use. Stick to internal HR data.
- Self-fulfilling prophecy: If managers treat "high risk" employees differently (less investment, fewer projects), the model prediction becomes self-fulfilling.
- Legal constraints: Using protected attributes (age, gender, ethnicity) directly or via proxies may violate employment law. Audit for disparate impact.
- Retention intervention timing: Identifying risk is only useful if HR acts. Build the model into a retention workflow with specific intervention triggers.
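One simple way to start the disparate-impact audit mentioned above is to compare how often each group is flagged as high risk. The sketch below computes the ratio of the lowest to the highest flag rate. A ratio below 0.8 is the commonly cited "four-fifths" screening heuristic; it is a prompt for closer review, not a legal determination. The function names and input layout are assumptions.

```python
def flag_rates(groups, flagged):
    """Share of employees flagged high risk within each group."""
    totals, hits = {}, {}
    for g, f in zip(groups, flagged):
        totals[g] = totals.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + (1 if f else 0)
    return {g: hits[g] / totals[g] for g in totals}


def adverse_impact_ratio(groups, flagged):
    """Lowest group flag rate divided by the highest (1.0 means parity)."""
    rates = flag_rates(groups, flagged)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody flagged in any group
    return min(rates.values()) / highest
```

The group labels are used only for the audit and should stay out of the model's feature set.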
References
- For feature engineering from HR data, see references/hr-features.md
- For ethical AI in HR applications, see references/ethical-hr-ai.md