novel-reader

Novel Reader - Intelligent Long-Text Novel Reader

Novel Reader uses Python to read long novels safely and reliably, working around the limited context window of LLMs. It handles UTF-8 encoded Chinese novels, automatically filters out irrelevant content, and extracts detailed character, item, and scene information as it reads.

Core Features

  • Multi-format support: If the input file is a PDF, DOC, or DOCX, first convert it to plain-text TXT with the doc-to-txt skill, then read the converted TXT file
  • Read by character position: Python reads by character position rather than byte offset, so multi-byte characters are never split into garbled text
  • Segmented reading: Each read returns at most 3000 characters (3000 is also the default) to stay within the tool output limit. Skipping is strictly prohibited; segments must be read consecutively
  • Flexible positioning: Reading can start from any character position
  • Encoding safety: Native UTF-8 support handles Chinese characters correctly
  • Smart content filtering: The large model automatically identifies and skips content unrelated to the story itself (the author's launch remarks, acknowledgements, interviews, advertisements, etc.)
  • Real-time asset extraction: After each segment is read, immediately identify and record detailed information about the characters, items, and scenes in the novel
  • Outline recording: Record a summary of each passage to build a complete outline
  • Progress tracking: Record progress information such as the current reading position and the number of characters read
  • Context management: Rely on the Agent's context compaction mechanism, which compresses automatically when the context is nearly full

Important Requirements

Skipping is strictly prohibited

  • Read consecutive segments: After reading 3000 characters, the next read must continue from where the previous one ended (i.e. start = previous start + 3000)
  • No skipping: Never skip intermediate content to jump to a later position
  • No for-loop reading: Never use a for loop to read the file in batches (e.g.
    for i in $(seq ...)
    ), because the Agent tool compresses output and a for loop causes content to be truncated. If you see (some characters truncated), the content has been truncated and you must re-read it manually, one segment at a time.
  • No multiple Python invocations in one command: Each tool call can output only 3000 characters and anything beyond that is truncated, so a single tool call may invoke the Python command at most once!
  • Efficiency: Because the text is extremely long, you may call the reading script repeatedly in succession (but not in a for loop), without reasoning or commenting between calls.
  • Update cycle: After every 10 segments (3000 characters each, 30000 characters in total), update the outline, progress, and asset files once, then continue with the next loop.
  • No preset tasks: Do not use TODO-list or Task-list tools!

Usage

Core Command

bash
python3 read_novel.py <novel file path> [--info] [--start <start position>]

Parameter Description

| Parameter | Description | Example |
|---|---|---|
| <novel file path> | Path to the novel file (required). Supports TXT, PDF, DOC, and DOCX. PDF/DOC/DOCX files must first be converted to TXT with the doc-to-txt skill | test-files/novel.txt or test-files/novel.pdf |
| --info | Print novel information (total characters, total lines, non-empty lines) | --info |
| --start | Start position (character index, 0-based; optional, default: 0) | --start 10000 |

Examples

Get novel information:
bash
python3 read_novel.py ./novel.txt --info
Read the first 3000 characters of the novel (0 is also the default start position):
bash
python3 read_novel.py ./novel.txt --start 0
Read 3000 characters starting at character index 3000:
bash
python3 read_novel.py ./novel.txt --start 3000
View help:
bash
python3 read_novel.py --help
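The source of read_novel.py is not shown here, so the following is only a minimal sketch of how a read by character position might work. The function name read_segment and the constant SEGMENT_SIZE are hypothetical; the 3000-character limit and the UTF-8 handling come from the description above.

```python
SEGMENT_SIZE = 3000  # characters per read, as described above


def read_segment(path, start=0, length=SEGMENT_SIZE):
    """Return up to `length` characters of `path`, starting at character index `start`.

    Hypothetical helper: the real read_novel.py may be implemented differently.
    """
    # Opening in text mode with an explicit encoding makes Python decode the
    # bytes into characters first, so slicing never cuts a multi-byte UTF-8
    # character in half. errors="replace" avoids a crash on stray bad bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    return text[start:start + length]
```

Because the slice is taken on a decoded string, `--start 3000` in this sketch means the 3001st character, regardless of how many bytes the preceding characters occupy.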

Workflow

The complete workflow consists of a pre-step followed by three stages: file format processing → initialization → loop reading → completion summary

Pre-step: File format processing

  • Check the input file format
  • If it is a PDF, DOC, or DOCX file, first convert it to plain-text TXT with the doc-to-txt skill
  • Once conversion is complete, use the converted TXT file for all subsequent steps

Stage 1: Initialization (only once)

Step 1: Get novel information
bash
python3 read_novel.py novel.txt --info
Output example:
Total characters: 5512508
Total lines: 158108
Non-empty lines: 79053
Step 2: Check reading progress
  • Check whether reading_progress.txt exists
  • If it exists: read the current position from it and continue from there (resume after interruption)
  • If it does not exist: start a fresh read from position 0
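As a rough illustration of Step 1, the three statistics printed by --info could be computed as follows. This is a sketch under assumptions: novel_info is a hypothetical name, and the real script may count lines differently (for example in how it treats whitespace-only lines).

```python
def novel_info(path):
    """Return character and line counts for a UTF-8 text file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    lines = text.splitlines()
    return {
        "total_chars": len(text),      # characters, not bytes
        "total_lines": len(lines),
        # Assumption: a line containing only whitespace counts as empty.
        "non_empty_lines": sum(1 for line in lines if line.strip()),
    }
```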

Stage 2: Loop reading (core stage)

Each loop reads 10 fragments (30000 characters in total) and is divided into three sub-stages:
Sub-stage A: Batch reading (run back to back, no analysis in between)
Call the read command 10 times in a row, reading 3000 characters each time:
  • 1st read:
    --start <current position>
  • 2nd read:
    --start <current position+3000>
  • ...
  • 10th read:
    --start <current position+27000>
Example (first loop, starting from 0):
Read fragment 1: --start 0
Read fragment 2: --start 3000
...
Read fragment 10: --start 27000
Example (third loop, starting from 60000):
Read fragment 21: --start 60000
Read fragment 22: --start 63000
...
Read fragment 30: --start 87000
⚠️ Key rule: The 10 reads must run back to back, with no analysis, no summarizing, and no file updates in between
Sub-stage B: Analysis and update (after all 10 fragments have been read)
  1. Analyze the content of these 10 fragments
  2. Update outline.txt: add the chapter summaries for these 10 fragments
  3. Update reading_progress.txt: record the current position (e.g. 30000)
  4. Create or update asset files: extract newly appearing characters, items, and scenes
Sub-stage C: Decide whether to continue
  • Calculate the reading progress (characters read / total characters)
  • If the user's requirement (e.g. "read 5%") has not been met: return to sub-stage A and read the next batch of 10 fragments
  • If the user's requirement has been met: proceed to Stage 3
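The offsets used in sub-stage A follow directly from the current position. The sketch below only computes them; each offset must still be passed to read_novel.py in its own tool call, since batching the reads in a loop is prohibited above. The names batch_starts and next_position are hypothetical, and the optional total_chars cut-off is an assumption for the final, shorter batch.

```python
SEGMENT_SIZE = 3000      # characters per read
SEGMENTS_PER_LOOP = 10   # reads per loop


def batch_starts(current_position, total_chars=None):
    """Return the --start values for one loop, beginning at current_position."""
    starts = [current_position + i * SEGMENT_SIZE for i in range(SEGMENTS_PER_LOOP)]
    if total_chars is not None:
        # Assumption: offsets at or past the end of the file are simply dropped.
        starts = [s for s in starts if s < total_chars]
    return starts


def next_position(current_position):
    """Position to record in the progress file after a full loop."""
    return current_position + SEGMENT_SIZE * SEGMENTS_PER_LOOP
```

For the third loop, `batch_starts(60000)` yields 60000, 63000, and so on up to 87000, and `next_position(60000)` gives 90000, matching the examples above.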

Stage 3: Completion summary

  • Output the final reading progress
  • Summarize the extracted assets (number of characters, items, and scenes)

Execution sequence example

Suppose the user asks to read the first 5% of the novel, and the novel file is novel.pdf:
Pre-step: File format processing
  • Check the file format: .pdf
  • Use the doc-to-txt skill to convert novel.pdf to novel.txt
  • Use novel.txt for all subsequent steps
Step 1: Initialization
  1. Get novel information → 5512508 characters in total
  2. Check reading_progress.txt → does not exist, so start from position 0
Step 2: Loop reading
Loop 1 (fragments 1-10, characters 0-30000):
  • Batch reading: run --start 0 through --start 27000 back to back
  • Analysis and update: update the outline, progress (30000), and assets
  • Check progress: 30000/5512508 = 0.54%, below 5%, continue
Loop 2 (fragments 11-20, characters 30000-60000):
  • Batch reading: run --start 30000 through --start 57000 back to back
  • Analysis and update: update the outline, progress (60000), and assets
  • Check progress: 60000/5512508 = 1.09%, below 5%, continue
Loop 3 (fragments 21-30, characters 60000-90000):
  • Batch reading: run --start 60000 through --start 87000 back to back
  • Analysis and update: update the outline, progress (90000), and assets
  • Check progress: 90000/5512508 = 1.63%, below 5%, continue
... continue looping ...
Loop 10 (fragments 91-100, characters 270000-300000):
  • Batch reading: run --start 270000 through --start 297000 back to back
  • Analysis and update: update the outline, progress (300000), and assets
  • Check progress: 300000/5512508 = 5.44%, 5% reached, stop
Step 3: Completion summary
  • Output the final progress: 5.44%
  • Count assets: X characters, Y items, Z scenes
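The percentages in this walkthrough can be reproduced with a few lines of arithmetic. The helper names below are hypothetical; the figures (30000 characters per loop, 5512508 characters in total, a 5% target) are taken from the example.

```python
import math

BATCH_CHARS = 30000  # 10 fragments x 3000 characters per loop


def progress_percent(chars_read, total_chars):
    """Reading progress as a percentage, rounded to two decimals."""
    return round(chars_read / total_chars * 100, 2)


def loops_needed(total_chars, target_percent):
    """Smallest number of full loops whose progress reaches target_percent."""
    target_chars = total_chars * target_percent / 100
    return math.ceil(target_chars / BATCH_CHARS)
```

Reading proceeds in whole loops, so the stopping point overshoots the target: 5% of 5512508 is about 275625 characters, loop 9 ends at 270000 (still short), and loop 10 ends at 300000, which is why the example stops at 5.44%.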

Key Rules

| Rule | Description |
|---|---|
| Never analyze while reading | Analyzing and updating after every single fragment is wrong |
| Always read in batches | Analyze and update only after all 10 fragments have been read |
| Resumable reading | reading_progress.txt allows reading to resume after an interruption |
| Progress calculation | Each loop reads 30000 characters (10 × 3000) |

Asset Extraction

Smart Content Filtering

After the Python script has read the novel content, the large model reads and understands it, automatically identifies and ignores anything unrelated to the story itself, and keeps only the chapter text for analysis and asset extraction. No rules or regular expressions are used for filtering; it relies entirely on the model's comprehension.

Real-time Asset Extraction

Each time a segment (no more than 3000 characters) is read, immediately identify and extract the newly appearing assets (characters, items, scenes) and update the corresponding files right away. Do not wait until everything has been read. The large model relies on the context compaction mechanism, which compresses automatically when the context is nearly full.

Extracted Elements

From each segment read, identify and extract the following three types of assets, collecting as much detail as possible:
  1. Characters - people appearing in the novel, including:
    • Basic information: name, age, gender, identity, role (protagonist/supporting character/villain/second male lead/second female lead, etc.)
    • Appearance: looks, height, build, clothing
    • Personality: catchphrases, habitual gestures, behavior patterns
    • Background and relationships: family, friends, enemies, mentors
    • Abilities and cultivation: power level, special abilities, weapons and equipment
    • Plot appearances: first appearance, important events
  2. Items - objects, magic treasures, weapons, etc. appearing in the novel, including:
    • Basic information: name, type, source
    • Appearance: shape, color, material, size
    • Function: special abilities, how it is used, effects
    • History: origin, previous owners, important events
    • Related characters: owners, users, contenders
  3. Scenes - locations, environments, and places appearing in the novel, including:
    • Basic information: name, type, geographical location
    • Environment: terrain, climate, architectural style, atmosphere
    • Purpose: residence, cultivation, trade, combat
    • Related factions: owning faction, administrators, resident characters
    • Important events: key plot points that took place there

Directory Structure

Create the following structure in the directory containing the novel file:
<novel name or project name>/
├── outline.txt
├── reading_progress.txt
├── characters/
│   ├── <character name 1>.txt
│   ├── <character name 2>.txt
│   └── ...
├── items/
│   ├── <item name 1>.txt
│   ├── <item name 2>.txt
│   └── ...
└── scenes/
    ├── <scene name 1>.txt
    ├── <scene name 2>.txt
    └── ...
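A minimal sketch of creating this layout with pathlib is shown below. The function name init_project is hypothetical, as is the choice to default the project name to the novel's file stem. Note that it deliberately creates only the directories: reading_progress.txt must not exist before the first loop finishes, because its presence is what signals a resumed read.

```python
from pathlib import Path

ASSET_DIRS = ("characters", "items", "scenes")


def init_project(novel_path, project_name=None):
    """Create the asset directories next to the novel file and return the project root."""
    novel = Path(novel_path)
    # Assumption: use the file name without extension when no name is given.
    root = novel.parent / (project_name or novel.stem)
    for sub in ASSET_DIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # outline.txt and reading_progress.txt are created later, on first write.
    return root
```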

File Content Format

outline.txt

1. <Summary of the first passage>

2. <Summary of the second passage>

...
Notes:
  • Add summaries according to the logical divisions of the story, not at fixed character counts
  • The large model decides, based on the content, when a new summary entry is warranted
  • Each summary is numbered, and entries appear in reading order
  • Summarize only the story itself; skip unrelated content such as the author's launch remarks, acknowledgements, interviews, and advertisements
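Appending a numbered entry to the outline could look roughly like this. It is a sketch, not the skill's actual mechanism (the agent may simply edit the file directly); append_summary is a hypothetical name, and the code assumes each summary occupies a single line so that existing entries can be counted by their leading number.

```python
from pathlib import Path


def append_summary(outline_path, summary):
    """Append `summary` as the next numbered entry and return its number."""
    path = Path(outline_path)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    # Count existing entries: lines that look like "<number>. <text>".
    count = sum(
        1
        for line in existing.splitlines()
        if "." in line and line.split(".", 1)[0].strip().isdigit()
    )
    with path.open("a", encoding="utf-8") as f:
        if existing and not existing.endswith("\n"):
            f.write("\n")
        if existing:
            f.write("\n")  # blank line between entries, as in the template
        f.write(f"{count + 1}. {summary.strip()}\n")
    return count + 1
```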

reading_progress.txt

Current position: <character index>
Characters read: <number>
Total characters: <number>
Progress: <percentage>%
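One possible way to write and read back this file is sketched below. The function names are hypothetical and the field labels are assumptions modeled on the template; the sketch also assumes reading began at position 0 with no skipping, so the number of characters read equals the current position.

```python
from pathlib import Path


def write_progress(path, position, total_chars):
    """Write the four progress fields to `path`."""
    percent = position / total_chars * 100 if total_chars else 0.0
    Path(path).write_text(
        f"Current position: {position}\n"
        f"Characters read: {position}\n"   # assumption: equals position
        f"Total characters: {total_chars}\n"
        f"Progress: {percent:.2f}%\n",
        encoding="utf-8",
    )


def read_position(path):
    """Return the saved position, or 0 when there is no progress file yet."""
    p = Path(path)
    if not p.exists():
        return 0  # fresh read
    for line in p.read_text(encoding="utf-8").splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Current position":
            return int(value.strip())
    return 0
```

Calling `read_position` at start-up and `write_progress` at the end of each loop is enough to support resuming after an interruption.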

Character file (characters/<character name>.txt)

Each character file contains a detailed description of the character. Reference format:
<Character name>

【Basic Information】
Name:
Age:
Gender:
Identity:
Role:

【Appearance】
Looks:
Height:
Build:
Clothing:

【Personality】
Personality:
Catchphrase:
Habitual gestures:
Behavior patterns:

【Background and Relationships】
Family:
Friends:
Enemies:
Mentors:

【Abilities and Cultivation】
Power level:
Special abilities:
Weapons and equipment:

【Plot Appearances】
First appearance:

Item file (items/<item name>.txt)

Each item file contains a detailed description of the item. Reference format:
<Item name>

【Basic Information】
Name:
Type:
Source:

【Appearance】
Shape:
Color:
Material:
Size:

【Function】
Special abilities:
How it is used:
Effects:

【History】
Origin:
Important events:

【Related Characters】
Owners:
Users:
Contenders:

Scene file (scenes/<scene name>.txt)

Each scene file contains a detailed description of the scene. Reference format:
<Scene name>

【Basic Information】
Name:
Type:
Geographical location:

【Environment】
Terrain:
Climate:
Architectural style:
Atmosphere:

【Purpose】
Residence:
Cultivation:
Trade:
Combat:

【Related Factions】
Owning faction:
Administrators:
Resident characters:

【Important Events】
Key plot points that took place here:

Extraction Rules

  1. Deduplication: If an asset file already exists, append the new information to it; do not create it again
  2. Accuracy: Make sure extracted asset names are exact, with no typos
  3. Completeness: Record as much relevant information about each asset as possible, using the detailed formats above
  4. Natural language: Describe assets in natural language so the large model understands what to extract
  5. Content filtering: Filter out unrelated content before extracting, and extract only from the story text
  6. Real-time update: Each time a segment (no more than 3000 characters) is read, immediately identify and extract the newly appearing assets; do not wait until everything has been read
  7. No rules or regex: Obtain the content solely by running the Python script and let the large model understand and analyze it; do not use rules or regular expressions for matching
  8. Importance screening: Record only important assets; unimportant characters, scenes, and items can be skipped. Criteria:
    • Characters: main characters, key supporting characters, and important villains who drive the plot; passers-by, extras, and minor characters who appear only once can be ignored
    • Items: magic treasures, weapons, and other objects that play a key role in the plot; ordinary or single-use items can be ignored
    • Scenes: the main locations and important places where the story unfolds; temporary scenes and places mentioned only in passing can be ignored
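Rule 1 (append rather than recreate) is the one part of these rules that is plain file bookkeeping, and it might be handled as in the sketch below. The name record_asset is hypothetical, and the exact-substring check is only a crude guard against writing the same note twice; deciding what counts as new information remains the model's job, in line with rule 7.

```python
from pathlib import Path


def record_asset(root, kind, name, new_info):
    """Create or extend `<root>/<kind>/<name>.txt`; return what happened."""
    folder = Path(root) / kind            # kind: "characters", "items" or "scenes"
    folder.mkdir(parents=True, exist_ok=True)
    safe_name = name.replace("/", "_")    # keep the name usable as a file name
    path = folder / f"{safe_name}.txt"
    info = new_info.strip()
    if not path.exists():
        path.write_text(f"{name}\n\n{info}\n", encoding="utf-8")
        return "created"
    existing = path.read_text(encoding="utf-8")
    if info in existing:
        return "unchanged"                # identical note already recorded
    with path.open("a", encoding="utf-8") as f:
        f.write(f"\n{info}\n")            # append; never overwrite earlier notes
    return "updated"
```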

Tips

  1. Get the total number of characters in the file:
bash
python3 -c "
with open('test-files/novel.txt', 'r', encoding='utf-8', errors='replace') as f:
    print(len(f.read()))
"
  2. Segmented reading: Each read returns at most 3000 characters (3000 is also the default) to stay within the tool output limit
  3. Context management: Large models often have a context window of around 128k tokens, and the Agent compresses the context automatically when it is nearly full, so plan reads with that in mind

Why use Python?

  • Encoding safety: Native UTF-8 support, so no garbled text
  • Character-based reading: Text is indexed by character, not by byte, so Chinese characters are handled correctly (each Chinese character counts as 1)
  • Cross-platform: Behaves the same on Windows, macOS, and Linux
  • Widely available: Python 3 ships with most Linux distributions and is straightforward to install on Windows and macOS
  • Simple syntax: One pattern covers every scenario
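The difference between counting characters and counting bytes is easy to see with a short string. The sample text below is made up for illustration; the behavior shown is standard Python str and bytes semantics.

```python
text = "小说reader"

char_count = len(text)                   # 8: two Chinese characters + six letters
byte_count = len(text.encode("utf-8"))   # 12: each Chinese character takes 3 bytes

# Cutting the raw bytes at an arbitrary offset can split a character in two,
# which decodes to the U+FFFD replacement character (garbled text).
raw = text.encode("utf-8")
broken = raw[:4].decode("utf-8", errors="replace")

# Slicing the decoded string always lands on a character boundary.
clean = text[:2]
```

Here `broken` is "小" followed by a replacement character, while `clean` is "小说", which is the reason the reader slices by character position.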