elasticsearch

Elasticsearch Expert

A search and analytics specialist with deep expertise in Elasticsearch cluster architecture, query DSL, mapping design, and performance optimization. This skill provides production-grade guidance for building search experiences, log analytics pipelines, and time-series data platforms using the Elastic Stack.

Key Principles

  • Design mappings explicitly before indexing data; relying on dynamic mapping leads to field type conflicts and bloated indices
  • Understand the difference between keyword fields (exact match, aggregations, sorting) and text fields (full-text search with analyzers)
  • Use index aliases for zero-downtime reindexing, canary deployments, and time-based index rotation
  • Size shards between 10 and 50 GB each; many small shards add per-shard overhead, while a few oversized shards limit parallelism
  • Monitor cluster health (green/yellow/red) continuously and investigate yellow status immediately, as it indicates unassigned replica shards
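
The mapping and shard-sizing principles above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended schema: the field names (`@timestamp`, `service`, `message`), the 30 GB target, and both helper functions are assumptions made for the example. The code only builds the request body and does not contact a cluster.

```python
import json
import math


def build_index_body(primary_shards: int, replicas: int = 1) -> dict:
    """Return a create-index body with an explicit, strict mapping."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            # "strict" rejects documents that contain unmapped fields
            # instead of letting dynamic mapping guess a type for them.
            "dynamic": "strict",
            "properties": {
                "@timestamp": {"type": "date"},
                # keyword: exact match, aggregations, sorting
                "service": {"type": "keyword"},
                # text: analyzed for full-text search
                "message": {"type": "text"},
            },
        },
    }


def primary_shard_count(expected_gb: float, target_shard_gb: float = 30.0) -> int:
    """Estimate how many primaries keep each shard near target_shard_gb."""
    return max(1, math.ceil(expected_gb / target_shard_gb))


# Hypothetical sizing: about 300 GB of primary data for this index.
shards = primary_shard_count(expected_gb=300)
body = build_index_body(shards)
print(json.dumps(body, indent=2))
```

The resulting body would be sent with a create-index request such as `PUT /logs-v1`, where `logs-v1` is a placeholder name. The target shard size is a tuning knob inside the 10 to 50 GB range, not a fixed rule.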

Techniques

  • Construct bool queries with must (scored AND), filter (unscored AND), should (OR with minimum_should_match), and must_not (exclusion) clauses
  • Use match queries for full-text search with analyzer-aware tokenization, and term queries for exact keyword lookups without analysis
  • Build aggregations: terms for top-N values by document count, date_histogram for time bucketing, nested for sub-document analysis, and pipeline aggregations such as cumulative_sum
  • Apply Index Lifecycle Management (ILM) policies with hot/warm/cold/delete phases to automate rollover and data retention
  • Reindex with POST _reindex using source/dest, applying scripts for field transformations during migration
  • Check cluster allocation with GET _cluster/allocation/explain to diagnose why shards remain unassigned
  • Tune search performance with the Profile API ("profile": true in the search request), the shard request cache, and pre-warming for frequently used queries
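
As a rough sketch of how the bool clauses and aggregations described above fit into one search request, the Python below assembles a request body. The field names (`message`, `service`, `level`, `env`, `host`, `bytes`) and the `build_search_body` helper are hypothetical. `calendar_interval` assumes a reasonably recent Elasticsearch version (7.2 or later). Nothing here talks to a cluster.

```python
import json


def build_search_body(text: str, service: str, since: str) -> dict:
    """Combine a bool query with terms, date_histogram and cumulative_sum."""
    return {
        "size": 10,
        "query": {
            "bool": {
                # must: scored full-text match on an analyzed text field
                "must": [{"match": {"message": text}}],
                # filter: unscored, cacheable exact and range conditions
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
                # should: at least one of these must match
                "should": [
                    {"term": {"level": "error"}},
                    {"term": {"level": "warn"}},
                ],
                "minimum_should_match": 1,
                # must_not: exclusion, also unscored
                "must_not": [{"term": {"env": "test"}}],
            }
        },
        "aggs": {
            # top 5 hosts by document count
            "top_hosts": {"terms": {"field": "host", "size": 5}},
            "per_day": {
                "date_histogram": {
                    "field": "@timestamp",
                    "calendar_interval": "day",
                },
                "aggs": {
                    "bytes": {"sum": {"field": "bytes"}},
                    # pipeline aggregation: running total of the sibling sum
                    "running_bytes": {
                        "cumulative_sum": {"buckets_path": "bytes"}
                    },
                },
            },
        },
    }


search_body = build_search_body("timeout", "checkout", "now-7d/d")
print(json.dumps(search_body, indent=2))
```

The body would be posted to an index's `_search` endpoint. Conditions that do not need to influence relevance sit in `filter` so that they skip scoring and can be cached.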

Common Patterns

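  • Search-as-you-type: Use the search_as_you_type field type with a multi_match query of type bool_prefix, or an edge_ngram tokenizer at index time with a match query, for autocomplete experiences; match_phrase_prefix is a query-time alternative that needs no special mapping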
  • Parent-Child Relationships: Use join field types for one-to-many relationships where child documents update independently, avoiding costly nested reindexing
  • Cross-cluster Search: Configure remote clusters and use cluster:index syntax to query across multiple Elasticsearch deployments transparently
  • Snapshot and Restore: Register a snapshot repository (S3, GCS, or filesystem) and schedule regular snapshots for disaster recovery with SLM policies
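
To make two of these patterns concrete, here is a small sketch of a search-as-you-type mapping and query, plus the `cluster:index` naming used for cross-cluster search. It uses a `multi_match` query of type `bool_prefix` against the field and its `_2gram` and `_3gram` subfields, which is the pairing the Elasticsearch documentation describes for this field type. The `title` field, the `eu_cluster` alias, the `products` index, and the helper names are all assumptions for illustration.

```python
import json

# Mapping with a search_as_you_type field; Elasticsearch creates the
# ._2gram and ._3gram subfields for it automatically.
AUTOCOMPLETE_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "search_as_you_type"},
        }
    }
}


def autocomplete_query(prefix: str, field: str = "title") -> dict:
    """Build a bool_prefix query over the field and its shingle subfields."""
    return {
        "query": {
            "multi_match": {
                "query": prefix,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


def remote_index(cluster_alias: str, index: str) -> str:
    """Format an index expression for cross-cluster search."""
    return f"{cluster_alias}:{index}"


query = autocomplete_query("quick br")
target = remote_index("eu_cluster", "products")
print(target)
print(json.dumps(query, indent=2))
```

The formatted target would appear in the request path of a search against a remote cluster that has already been configured under that alias.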

Pitfalls to Avoid

  • Do not use wildcard queries with leading wildcards (for example *error) on text fields; the query cannot narrow its lookup by prefix and must examine every term in the field, which is slow on large indices
  • Do not index very large documents without splitting them; requests over 100 MB are rejected under the default http.max_content_length, and large documents cause memory pressure during indexing and merging
  • Do not set number_of_replicas to 0 in production; replicas provide both search throughput and data redundancy
  • Do not update mappings on existing indices for incompatible type changes; create a new index with the correct mapping and reindex the data
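
The last pitfall implies a workflow: create a new index with the corrected mapping, copy the data across with `_reindex`, then move the alias in one atomic step. The sketch below builds the two request bodies involved. The index names (`logs-v1`, `logs-v2`), the `logs` alias, the renamed field, and both helpers are hypothetical, and the code does not send anything to a cluster.

```python
import json
from typing import Optional


def reindex_body(source_index: str, dest_index: str,
                 script: Optional[str] = None) -> dict:
    """Body for POST _reindex, optionally transforming each document."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    if script:
        # Painless script that runs once per document during the copy
        body["script"] = {"lang": "painless", "source": script}
    return body


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST _aliases; both actions are applied atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical migration: rename the field "state" to "status" while copying.
copy = reindex_body(
    "logs-v1",
    "logs-v2",
    script="ctx._source.status = ctx._source.remove('state')",
)
swap = alias_swap_body("logs", "logs-v1", "logs-v2")
print(json.dumps(copy, indent=2))
print(json.dumps(swap, indent=2))
```

Clients that read and write through the `logs` alias never see an intermediate state, because the remove and add actions take effect together.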