sqlite-db-truncate

SQLite Truncated Database Recovery


This skill provides systematic approaches for recovering data from SQLite database files that have been corrupted through binary truncation. It emphasizes understanding the SQLite file format before attempting recovery and avoiding common pitfalls that lead to multiple failed iterations.

When to Use This Skill


This skill applies when:
  • A SQLite database file cannot be opened with standard `sqlite3` commands
  • The database error indicates corruption or malformed data
  • File size is smaller than expected (suggesting truncation)
  • Standard recovery tools like the `.recover` command fail
  • Manual binary parsing of SQLite page structure is required

Initial Assessment Strategy


Before writing any recovery code, perform a thorough analysis of the corrupted file:

Step 1: Examine File Characteristics


```bash
# Check file size and basic properties
ls -lh database.db
file database.db

# Create hex dump for analysis
hexdump -C database.db | head -100
```

Key observations to make:
- **File size**: SQLite pages are typically 4096 bytes. Check if size aligns with page boundaries
- **Magic bytes**: Valid SQLite files start with "SQLite format 3\000" (16 bytes)
- **First byte after header**: Identifies page type (0x0d = table leaf page with actual data)

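These checks can also be scripted. A minimal sketch (the `classify_prefix` helper is ours, not a standard tool):

```python
# Identify what the first bytes of a possibly truncated SQLite file tell us.
MAGIC = b"SQLite format 3\x00"  # 16-byte magic header of a valid database

def classify_prefix(prefix: bytes) -> str:
    """Classify a file by its leading bytes."""
    if prefix.startswith(MAGIC):
        return "standard header"
    if prefix[:1] == b"\x0d":
        return "headerless table leaf page"
    return "unknown"

# Example: a file truncated down to a bare table leaf page.
print(classify_prefix(b"\x0d\x00\x00\x00\x02\x0f\xe0\x00"))  # headerless table leaf page
```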

Step 2: Identify Corruption Pattern


Common truncation scenarios:
  • Header-only file: Only the 100-byte header remains
  • Missing header: File starts with a data page (first byte is 0x0d, 0x05, 0x0a, or 0x02)
  • Partial page: File ends mid-page, truncating some cells
If the file lacks the standard "SQLite format 3" magic header but starts with 0x0d, this indicates the file contains only a table leaf page without the database header.

Step 3: Try Standard Tools First


Always attempt standard recovery before manual parsing:

```bash
# Check if sqlite3 can read the file
sqlite3 database.db ".schema" 2>&1
sqlite3 database.db "SELECT * FROM sqlite_master" 2>&1

# Try built-in recovery
sqlite3 database.db ".recover" > recovered.sql 2>&1

# Try integrity check
sqlite3 database.db "PRAGMA integrity_check;"
```

If these fail with "database disk image is malformed" or similar errors, proceed to manual binary parsing.

SQLite Page Structure Overview


Understanding the page structure is essential before writing recovery code.

Table Leaf Page Layout (Page Type 0x0d)


```
Offset  Size   Description
------  ----   -----------
0       1      Page type (0x0d for table leaf)
1       2      First freeblock offset (big-endian)
3       2      Number of cells on page (big-endian)
5       2      Cell content area start offset (big-endian)
7       1      Fragmented free bytes count
8+      varies Cell pointer array (2 bytes per cell, big-endian)
...            [Gap/free space]
End            Cell data (grows backward from page end)
```
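The header layout above can be read with a few lines of `struct`. A sketch; the sample page bytes are fabricated for illustration:

```python
import struct

def parse_leaf_header(page: bytes) -> dict:
    """Parse the 8-byte header of a table leaf page (type 0x0d)."""
    if page[0] != 0x0d:
        raise ValueError(f"not a table leaf page: 0x{page[0]:02x}")
    first_freeblock, n_cells, content_start = struct.unpack(">HHH", page[1:7])
    return {
        "first_freeblock": first_freeblock,
        "n_cells": n_cells,
        "content_start": content_start,
        "frag_bytes": page[7],
        # Cell pointer array: n_cells big-endian 2-byte offsets starting at byte 8.
        "cell_pointers": list(struct.unpack(f">{n_cells}H", page[8:8 + 2 * n_cells])),
    }

hdr = parse_leaf_header(b"\x0d\x00\x00\x00\x02\x0f\xe0\x00\x0f\xf0\x0f\xe0")
print(hdr["n_cells"], hdr["cell_pointers"])  # 2 [4080, 4064]
```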

Cell Structure


Each cell contains a database row:

```
[Payload size: varint]
[Row ID: varint]
[Header size: varint]
[Serial type 1: varint]
[Serial type 2: varint]
...
[Column 1 value]
[Column 2 value]
...
```

Varint Encoding


SQLite uses variable-length integers (varints):
  • Bytes 1-8: Use 7 bits for data, high bit (0x80) indicates continuation
  • Byte 9: Uses all 8 bits (no continuation)
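For example, the two-byte sequence `0x81 0x23` decodes to 163: the first byte has the high bit set, so its 7 data bits (`0x01`) are extended by the next byte. A minimal decoder:

```python
def read_varint(data: bytes, offset: int):
    """Decode a SQLite varint; returns (value, next_offset)."""
    value = 0
    for i in range(9):
        byte = data[offset + i]
        if i == 8:                       # 9th byte contributes all 8 bits
            return (value << 8) | byte, offset + i + 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:              # high bit clear: last byte
            return value, offset + i + 1

print(read_varint(b"\x81\x23", 0))  # (163, 2)
```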

Serial Types


Serial types indicate how to interpret column data:

| Type | Size | Meaning |
|------|------|---------|
| 0 | 0 | NULL |
| 1 | 1 | 8-bit signed integer |
| 2 | 2 | 16-bit big-endian signed integer |
| 3 | 3 | 24-bit big-endian signed integer |
| 4 | 4 | 32-bit big-endian signed integer |
| 7 | 8 | IEEE 754 64-bit float (big-endian) |
| 8 | 0 | Integer constant 0 |
| 9 | 0 | Integer constant 1 |
| N >= 12, even | (N-12)/2 | BLOB |
| N >= 13, odd | (N-13)/2 | Text string (UTF-8) |

Example: Serial type 0x21 (33) = text string of length (33-13)/2 = 10 bytes.
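The size rules fold into a small helper. This sketch also covers serial types 5 and 6 (the 48- and 64-bit integers, which the table above omits); types 10 and 11 are reserved:

```python
def serial_type_size(st: int) -> int:
    """Content size in bytes for a SQLite serial type."""
    fixed = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 6, 6: 8, 7: 8, 8: 0, 9: 0}
    if st in fixed:
        return fixed[st]
    if st >= 12:
        # Even -> BLOB, odd -> UTF-8 text.
        return (st - 12) // 2 if st % 2 == 0 else (st - 13) // 2
    raise ValueError(f"reserved serial type: {st}")

print(serial_type_size(0x21))  # 10 -> a 10-byte UTF-8 string
```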

Recovery Approach


Build a Single, Modular Script


Avoid creating multiple separate recovery scripts. Instead, build one script iteratively with clear debug output:

```python
import struct
import json

DEBUG = True

def read_varint(data, offset):
    """Read a SQLite variable-length integer; returns (value, next_offset)."""
    value = 0
    for i in range(9):
        if offset + i >= len(data):
            return None, offset          # truncated mid-varint
        byte = data[offset + i]
        if i == 8:                       # 9th byte uses all 8 bits
            return (value << 8) | byte, offset + i + 1
        value = (value << 7) | (byte & 0x7f)
        if (byte & 0x80) == 0:
            return value, offset + i + 1
    return value, offset

def decode_value(data, offset, serial_type):
    """Decode a column value based on its serial type."""
    if serial_type == 0:
        return None, offset
    elif serial_type == 1:
        return struct.unpack('>b', data[offset:offset+1])[0], offset + 1
    elif serial_type == 2:
        return struct.unpack('>h', data[offset:offset+2])[0], offset + 2
    elif serial_type == 3:
        # 24-bit big-endian signed integer (no struct format code exists)
        return int.from_bytes(data[offset:offset+3], 'big', signed=True), offset + 3
    elif serial_type == 4:
        return struct.unpack('>i', data[offset:offset+4])[0], offset + 4
    elif serial_type == 7:
        return struct.unpack('>d', data[offset:offset+8])[0], offset + 8
    elif serial_type == 8:
        return 0, offset
    elif serial_type == 9:
        return 1, offset
    elif serial_type >= 12:
        if serial_type % 2 == 0:
            length = (serial_type - 12) // 2
            return data[offset:offset+length], offset + length
        else:
            length = (serial_type - 13) // 2
            return data[offset:offset+length].decode('utf-8', errors='replace'), offset + length
    return None, offset
```

Parse Incrementally with Debug Output


Parse one cell completely and verify before processing all cells:
```python
def parse_cell(data, cell_offset, debug=DEBUG):
    """Parse a single cell with detailed debug output."""
    if debug:
        print(f"\nParsing cell at offset {cell_offset} (0x{cell_offset:04x})")

    # Read payload size
    payload_size, offset = read_varint(data, cell_offset)
    if debug:
        print(f"  Payload size: {payload_size}")

    # Read row ID
    row_id, offset = read_varint(data, offset)
    if debug:
        print(f"  Row ID: {row_id}")

    # Read header size (the header-size varint counts itself)
    header_size, header_start = read_varint(data, offset)
    if debug:
        print(f"  Header size: {header_size}")

    # Parse serial types
    serial_types = []
    current = header_start
    header_end = offset + header_size
    while current < header_end:
        st, current = read_varint(data, current)
        serial_types.append(st)

    if debug:
        print(f"  Serial types: {serial_types}")

    # Parse values
    values = []
    for st in serial_types:
        val, current = decode_value(data, current, st)
        values.append(val)

    if debug:
        print(f"  Values: {values}")

    return {'row_id': row_id, 'values': values}
```

Common Pitfalls and Prevention


Pitfall 1: Not Understanding the Corruption Pattern


Mistake: Assuming the file has a standard SQLite header when it may only contain a data page.
Prevention: Always examine the first few bytes with hexdump. If the file starts with 0x0d instead of "SQLite format 3", the header is missing. Adjust parsing offsets accordingly (no 100-byte header offset needed).

Pitfall 2: Multiple Script Iterations


Mistake: Creating many separate recovery scripts (recover1.py, recover2.py, etc.) based on trial and error.
Prevention:
  • Read the hex dump thoroughly first and annotate the structure manually
  • Build one script with debug flags
  • Reference the SQLite file format specification before coding

Pitfall 3: Reading Strings Beyond Their Boundaries


Mistake: Reading string data without checking the serial type length, resulting in incorrect strings (e.g., "testword052" instead of "testword05").
Prevention: Always calculate the string length from the serial type: `length = (serial_type - 13) // 2`. Read exactly that many bytes.

Pitfall 4: Syntax Errors in Generated Code


Mistake: Missing spaces in operators, such as `if48` instead of `if 48`, or `12and` instead of `12 and`.
Prevention: Validate syntax before running:

```bash
python3 -m py_compile recovery_script.py
```

Pitfall 5: Wrong Byte Order


Mistake: Reading multi-byte integers as little-endian instead of big-endian.
Prevention: SQLite stores all multi-byte integers big-endian. Always use `struct.unpack('>...', data)` with the `>` prefix.

Pitfall 6: Not Handling Truncation Gracefully


Mistake: Script crashes when encountering truncated data at the end of the file.
Prevention: Check bounds before every read operation:

```python
def safe_read(data, offset, length):
    """Return the requested slice, or None if it runs past end of data."""
    if offset + length > len(data):
        return None
    return data[offset:offset+length]
```

Verification Strategy


Step 1: Validate Cell Count


Compare the number of cells reported in the page header (offset 3-4) with actual cells found.
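A sketch of that comparison; on a truncated page the parser may legitimately recover fewer cells than the header claims, so treat a mismatch as informative rather than fatal (the page bytes here are illustrative):

```python
import struct

def expected_cell_count(page: bytes) -> int:
    """Cell count recorded in the leaf page header (bytes 3-4, big-endian)."""
    return struct.unpack(">H", page[3:5])[0]

page = b"\x0d\x00\x00\x00\x02\x0f\xe0\x00"
parsed = 1  # e.g. len(rows) returned by the cell parser
header_count = expected_cell_count(page)
if parsed != header_count:
    print(f"warning: header claims {header_count} cells, parsed {parsed}")
```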

Step 2: Validate Data Patterns


If expected patterns are known (e.g., words matching "testwordXY"), verify extracted strings match the pattern.

Step 3: Check Value Ranges


Verify extracted numeric values are within expected ranges. Watch for:
  • Unexpected negative numbers (sign bit interpretation)
  • Very large numbers (byte order issues)
  • NaN or infinity for floats
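One way to flag suspicious values; the magnitude threshold is an arbitrary heuristic of ours, not part of the SQLite format:

```python
import math

def sane_number(x) -> bool:
    """Flag values that usually indicate a parsing error."""
    if isinstance(x, float) and (math.isnan(x) or math.isinf(x)):
        return False          # bad float bytes or a wrong read offset
    if isinstance(x, int) and abs(x) > 1 << 40:
        return False          # often a byte-order or varint mistake
    return True

print(sane_number(42.5), sane_number(float("nan")))  # True False
```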

Step 4: Compare with Expected Output Format


Before finalizing output, ensure JSON structure matches requirements:
```python
# Validate output structure
for record in recovered_data:
    assert 'word' in record and 'value' in record
    assert isinstance(record['word'], str)
    assert isinstance(record['value'], (int, float))
```

Output Generation


Format recovered data according to the required output specification:
```python
def generate_output(recovered_rows, output_path):
    """Format and save recovered data."""
    results = []
    for row in recovered_rows:
        if len(row['values']) >= 2:
            results.append({
                'word': row['values'][0],
                'value': row['values'][1]
            })

    with open(output_path, 'w') as f:
        json.dump(results, f, indent=2)

    print(f"Recovered {len(results)} records to {output_path}")
    return results
```

Summary Checklist


Before writing recovery code:
  • Examined file with hexdump to understand corruption extent
  • Identified whether header is present or missing
  • Tried standard SQLite tools first
  • Reviewed SQLite file format specification
During implementation:
  • Using a single script with debug output (not multiple scripts)
  • Validated Python syntax before running
  • Using big-endian byte order for all multi-byte integers
  • Calculating string lengths from serial types
  • Handling truncation with bounds checking
After recovery:
  • Verified cell count matches expectation
  • Validated string patterns if known
  • Checked numeric value ranges
  • Confirmed output format matches requirements