elasticsearch

Elasticsearch Expert

A search and analytics specialist with deep expertise in Elasticsearch cluster architecture, Query DSL, mapping design, and performance optimization. This skill provides production-grade guidance for building search experiences, log analytics pipelines, and time-series data platforms on the Elastic Stack.

Key Principles

Design mappings explicitly before indexing data; relying on dynamic mapping leads to field type conflicts and bloated indices
Understand the difference between keyword fields (exact match, aggregations, sorting) and text fields (full-text search with analyzers)
Use index aliases for zero-downtime reindexing, canary deployments, and time-based index rotation
Size shards between 10 and 50 GB as a general guideline; many small shards add per-shard overhead, while a few very large shards limit parallelism
Monitor cluster health (green/yellow/red) continuously and investigate yellow status immediately, as it indicates unassigned replica shards
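
As an illustration of the first four principles, here is a minimal Python sketch that only builds the JSON body for an index-creation request; it does not contact a cluster. The field names (title, status, created_at), the 30 GB target, and both helper functions are hypothetical and chosen for the example, not taken from any particular deployment.

```python
import math


def build_index_body(primary_shards: int, replicas: int = 1) -> dict:
    """Body for an index-creation request (PUT /<index>) with an explicit mapping."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            # "strict" rejects documents that contain unmapped fields instead of
            # letting dynamic mapping guess a type for them.
            "dynamic": "strict",
            "properties": {
                # text: analyzed for full-text search.
                # The keyword sub-field (title.raw) supports exact match,
                # sorting, and aggregations on the same value.
                "title": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
                },
                "status": {"type": "keyword"},
                "created_at": {"type": "date"},
            },
        },
    }


def estimate_primary_shards(expected_gb: float, target_shard_gb: float = 30.0) -> int:
    """Rough primary shard count that keeps each shard near target_shard_gb."""
    return max(1, math.ceil(expected_gb / target_shard_gb))


# Example: an index expected to hold about 200 GB of primary data.
body = build_index_body(estimate_primary_shards(200))
print(body["settings"]["number_of_shards"])
```

The shard estimate is deliberately crude: it ignores growth, replicas, and rollover, so treat it as a starting point to validate against real index sizes.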

Techniques

Construct bool queries with must (scored AND), filter (unscored AND), should (OR with minimum_should_match), and must_not (exclusion) clauses
Use match queries for full-text search with analyzer-aware tokenization, and term queries for exact keyword lookups without analysis
Build aggregations: terms for the top-N values of a field, date_histogram for time bucketing, nested for aggregating over nested sub-documents, and pipeline aggregations such as cumulative_sum
Apply Index Lifecycle Management (ILM) policies with hot/warm/cold/delete phases to automate rollover and data retention
Reindex with POST _reindex using source/dest, applying scripts for field transformations during migration
Check cluster allocation with GET _cluster/allocation/explain to diagnose why shards remain unassigned
Tune search performance with the Profile API ("profile": true in the search request), the shard request cache, and pre-warming for frequently used queries
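
The bool query and aggregation techniques above can be sketched as a single search request body. This Python example only assembles the dictionary and prints it as JSON; the field names (title, status, created_at), the "deleted" status value, and the function itself are assumptions made for illustration.

```python
import json
from typing import List


def build_search_body(text: str, statuses: List[str], since: str) -> dict:
    """Search body combining a bool query with terms and date_histogram aggs.

    Assumes `statuses` is non-empty, since minimum_should_match is set to 1.
    """
    return {
        "query": {
            "bool": {
                # must: analyzed full-text match that contributes to the score.
                "must": [{"match": {"title": text}}],
                # filter: required, but does not affect scoring.
                "filter": [{"range": {"created_at": {"gte": since}}}],
                # should: OR across exact keyword values; at least one must match.
                "should": [{"term": {"status": s}} for s in statuses],
                "minimum_should_match": 1,
                # must_not: exclusion.
                "must_not": [{"term": {"status": "deleted"}}],
            }
        },
        "aggs": {
            # Top five status values by document count.
            "top_statuses": {"terms": {"field": "status", "size": 5}},
            # One bucket per day, with a running total of the bucket counts.
            "per_day": {
                "date_histogram": {"field": "created_at", "calendar_interval": "day"},
                "aggs": {
                    "running_total": {"cumulative_sum": {"buckets_path": "_count"}}
                },
            },
        },
    }


search_body = build_search_body("disk failure", ["open", "pending"], "now-7d/d")
print(json.dumps(search_body, indent=2))
```

The printed JSON could be sent as the body of a search request against an index whose mapping defines these fields.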

Common Patterns

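Search-as-you-type: For autocomplete experiences, use the search_as_you_type field type (typically queried with a multi_match query of type bool_prefix), an edge_ngram tokenizer with a regular match query, or a match_phrase_prefix query on an ordinary text field
Parent-Child Relationships: Use join field types for one-to-many relationships where child documents update independently, avoiding the cost of reindexing the whole parent document that nested objects require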
Cross-cluster Search: Configure remote clusters and use cluster:index syntax to query across multiple Elasticsearch deployments transparently
Snapshot and Restore: Register a snapshot repository (S3, GCS, or filesystem) and schedule regular snapshots for disaster recovery with SLM policies
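
The search-as-you-type and cross-cluster patterns can be sketched as request bodies in Python. This is a minimal example under stated assumptions: the title field, the cluster alias eu_west, and all four helper functions are hypothetical. The bool_prefix variant targets the _2gram and _3gram sub-fields that the search_as_you_type field type creates; the match_phrase_prefix variant is the query named in the list above.

```python
def build_autocomplete_mapping(field: str = "title") -> dict:
    """Mapping that indexes `field` as search_as_you_type."""
    return {"mappings": {"properties": {field: {"type": "search_as_you_type"}}}}


def build_bool_prefix_query(prefix: str, field: str = "title") -> dict:
    """multi_match of type bool_prefix over the field and its shingle sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": prefix,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


def build_phrase_prefix_query(prefix: str, field: str = "title") -> dict:
    """match_phrase_prefix: terms must appear in order, last term is a prefix."""
    return {"query": {"match_phrase_prefix": {field: {"query": prefix}}}}


def remote_index(cluster_alias: str, index: str) -> str:
    """Index expression in cluster:index form for cross-cluster search."""
    return f"{cluster_alias}:{index}"


query = build_bool_prefix_query("quick bro")
target = remote_index("eu_west", "products")
print(target)
print(query["query"]["multi_match"]["fields"])
```

The remote cluster alias must already be configured on the querying cluster before an expression like the one returned by remote_index can be searched.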

Pitfalls to Avoid

Do not use wildcard queries with leading wildcards on text fields; they cannot seek directly into the inverted index and instead have to scan every term in the field
Do not index large documents (over 100 MB) without splitting them; they cause memory pressure during indexing and merging
Do not set number_of_replicas to 0 in production; replicas provide both search throughput and data redundancy
Do not update mappings on existing indices for incompatible type changes; create a new index with the correct mapping and reindex the data
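
For the last point, a migration to a corrected mapping can be sketched as two request bodies: one for POST _reindex and one for an atomic alias swap via POST _aliases. This Python example only builds the dictionaries. The index names, the alias, and the renamed field (state to status) are hypothetical, and the Painless script is just one example of a field transformation.

```python
def build_reindex_body(source_index: str, dest_index: str) -> dict:
    """Body for POST _reindex that copies documents and renames one field."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "script": {
            "lang": "painless",
            # Hypothetical transformation: move the value of `state` to `status`.
            "source": "ctx._source.status = ctx._source.remove('state')",
        },
    }


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST _aliases; both actions are applied in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


reindex_body = build_reindex_body("orders_v1", "orders_v2")
swap_body = build_alias_swap("orders", "orders_v1", "orders_v2")
print(reindex_body["dest"]["index"])
print([next(iter(action)) for action in swap_body["actions"]])
```

The intended order is: create the new index with the correct mapping, run the reindex, then swap the alias so that clients reading through the alias never see a missing index.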
