greptimedb-pipeline
GreptimeDB Pipeline Guide
Create a GreptimeDB pipeline definition that transforms data into a specific structured
table, covering data extraction, processing, type parsing, datetime handling,
and more.
The workflow
To create a GreptimeDB pipeline, follow these phases:
Phase 1. Understanding GreptimeDB Pipeline
First, read how GreptimeDB pipelines are defined and how they work in
GreptimeDB's documentation.
Use WebFetch to load and understand these pages:
- High-level overview of how to use custom pipelines: https://docs.greptime.com/user-guide/logs/use-custom-pipelines/
- Details about pipeline elements, with docs for each processor, transform, and dispatcher: https://docs.greptime.com/reference/pipeline/pipeline-config/
Always create a version 2 pipeline.
Phase 2. Create an initial pipeline that works
Ask the user to provide sample input data. It can be one of:
- a text data line
- an ndjson data line
- an array of json data
Then try to understand what information the user wants to extract from
the sample data.
For a text data line, try to split it by any potential field separator,
such as space or tab. Find the datetime part and use the `date` processor to parse
it. Name each field by its meaning. If the text line cannot be understood,
use a single field called `message` for the whole line.
For ndjson and json, find a datetime field and use the `date` processor
on it to generate the time index. Use the json keys for all other fields.
Show the user a sample of what the initial pipeline definition will look like, as
well as what the parsed data will look like. A markdown table can show each
field name, its data type in GreptimeDB, and its values:
| Field name 1 (Data type) | Field name 2 (Data type) | ... |
|---|---|---|
| Value 1 | Value 2 | ... |
| Value 1 | Value 2 | ... |
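
As a rough illustration of this phase, the sketch below pairs a candidate version 2 pipeline for a text line with a local preview of the parsed result. Everything in it is an assumption made for illustration: the sample line, the field names (`ts`, `level`, `module`, `content`), the type labels in the table header, and the exact YAML keys, which should be checked against the pipeline-config reference. The Python code only emulates the extraction locally and does not call GreptimeDB.

```python
from datetime import datetime, timezone

# Hypothetical sample line; field names below are illustrative guesses.
SAMPLE_LINE = "2024-05-25T20:16:37Z INFO auth login succeeded for user alice"

# Assumed shape of a version 2 pipeline for this line. Verify every key
# against the pipeline-config reference before relying on it.
PIPELINE_YAML = """\
version: 2
processors:
  - dissect:
      fields:
        - message
      patterns:
        - "%{ts} %{level} %{module} %{content}"
  - date:
      fields:
        - ts
      formats:
        - "%Y-%m-%dT%H:%M:%SZ"
transform:
  - fields:
      - ts
    type: time
    index: timestamp
"""


def parse_line(line: str) -> dict:
    """Emulate the extraction locally: split on spaces, then parse the datetime."""
    ts, level, module, content = line.split(" ", 3)
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {"ts": parsed.isoformat(), "level": level, "module": module, "content": content}


def preview_table(rows: list, types: dict) -> str:
    """Render rows as the markdown preview table: 'name (type)' headers, then values."""
    header = "| " + " | ".join(f"{name} ({types[name]})" for name in types) + " |"
    separator = "|" + "---|" * len(types)
    body = ["| " + " | ".join(str(row[name]) for name in types) + " |" for row in rows]
    return "\n".join([header, separator, *body])


# Type labels are placeholders; the real column types come from GreptimeDB.
TYPES = {"ts": "timestamp", "level": "string", "module": "string", "content": "string"}

print(PIPELINE_YAML)
print(preview_table([parse_line(SAMPLE_LINE)], TYPES))
```

The preview is only a local approximation. The table GreptimeDB actually produces may contain additional columns (for example the raw input field), so a dry run against the server remains the authoritative check.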
Phase 3. Work on special requirements and verify
The user may have further requirements for particular fields; use processors to
address them.
If the user wants to dispatch data into multiple tables, or to process it with different
pipelines, `dispatch` is available to handle this. The user can
provide a table suffix for dispatched data.
If the user's requirements are too complex for declarative processors, there is
also an advanced VRL processor for remapping data. Check the Reference section for more
information.
If the greptimedb-mcp-server is available, it offers a `dryrun-pipeline` tool to
which we can provide a pipeline definition and sample data to test against
GreptimeDB's implementation. The output is a table encoded as json.
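
To make the dispatch idea above more concrete, here is a small sketch of how records might be routed by a field value. The YAML keys, the `type` field, the `http_pipeline` name, and the `<table>_<suffix>` naming rule are all assumptions for illustration and should be confirmed in the pipeline-config reference. The Python function only mimics the routing decision locally.

```python
# Assumed shape of a dispatcher section; verify the keys against the reference.
DISPATCHER_YAML = """\
dispatcher:
  field: type
  rules:
    - value: http
      table_suffix: http
      pipeline: http_pipeline
    - value: db
      table_suffix: db
"""

# Local mirror of the rules above: field value -> table suffix.
RULES = {"http": "http", "db": "db"}


def target_table(base_table: str, record: dict, field: str = "type") -> str:
    """Return the table a record would land in under the assumed naming rule."""
    suffix = RULES.get(str(record.get(field)))
    # Records that match no rule stay in the base table in this sketch.
    return f"{base_table}_{suffix}" if suffix else base_table


print(target_table("app_logs", {"type": "http", "path": "/"}))
print(target_table("app_logs", {"type": "other"}))
```

Walking the user's sample records through a helper like this can confirm which table suffixes they expect before the dispatcher is written into the pipeline.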
Phase 4. Check index and table options
The pipeline system also allows the user to specify various indexes on the result
table. Find out how the user will query the table and suggest
suitable indexes.
Advanced table options can be customized through `.greptime_` variables. Use them if
the user wants to customize TTL, append_mode, etc.
Reference
- GreptimeDB Index Options: https://docs.greptime.com/user-guide/manage-data/data-index/
- VRL, the advanced processing language from Vector: https://vector.dev/docs/reference/vrl/
- Using Table Options from Pipeline/VRL: https://docs.greptime.com/reference/pipeline/write-log-api/#set-table-options