
Document Inventory Skill


Overview


This skill scans document collections (PDFs, Word docs, text files) and creates a structured inventory with metadata, automatic categorization, and collection statistics. Essential first step before building knowledge bases.

Quick Start


python
from pathlib import Path
import sqlite3

# Scan directory
documents = []
for filepath in Path("/path/to/docs").rglob("*.pdf"):
    documents.append({
        'filename': filepath.name,
        'size': filepath.stat().st_size,
        'path': str(filepath)
    })

# Store in database
conn = sqlite3.connect("inventory.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT, size INTEGER, path TEXT)")
for doc in documents:
    cursor.execute("INSERT INTO docs VALUES (?, ?, ?)",
                   (doc['filename'], doc['size'], doc['path']))
conn.commit()
print(f"Inventoried {len(documents)} documents")
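Once stored, the inventory is queryable with plain SQL — for example, listing the largest files. A sketch against the Quick Start `docs` schema, using an in-memory database and made-up rows (swap in `"inventory.db"` for a real scan):

```python
import sqlite3

# in-memory stand-in for inventory.db, populated with illustrative rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT, size INTEGER, path TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", [
    ("a.pdf", 10, "/docs/a.pdf"),
    ("b.pdf", 300, "/docs/b.pdf"),
    ("c.pdf", 120, "/docs/c.pdf"),
])

# two largest documents by size
top = conn.execute(
    "SELECT name, size FROM docs ORDER BY size DESC LIMIT 2"
).fetchall()
print(top)  # [('b.pdf', 300), ('c.pdf', 120)]
```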

When to Use


  • Auditing large document libraries before processing
  • Understanding the scope of a document collection
  • Categorizing documents by type, source, or content
  • Preparing inventories for knowledge base creation
  • Generating reports on document collections
  • Identifying duplicates or organizing files
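The duplicate-spotting use case falls out of the inventory for free: group on filename and size. A sketch against the Quick Start `docs` schema, with made-up rows (use `"inventory.db"` against a real scan):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT, size INTEGER, path TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", [
    ("spec.pdf", 1024, "/a/spec.pdf"),
    ("spec.pdf", 1024, "/b/spec.pdf"),   # same name and size in another folder
    ("manual.pdf", 2048, "/a/manual.pdf"),
])

# likely duplicates: identical filename and size appearing more than once
dupes = conn.execute("""
    SELECT name, size, COUNT(*) AS copies
    FROM docs
    GROUP BY name, size
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # [('spec.pdf', 1024, 2)]
```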

Features


  • Recursive scanning - Process nested directories
  • Metadata extraction - Size, dates, page counts
  • Auto-categorization - Pattern-based classification
  • Statistics generation - Collection summaries
  • SQLite storage - Queryable inventory database
  • Multiple formats - PDF, DOCX, TXT, and more

Implementation


Core Inventory Builder


python
#!/usr/bin/env python3
"""Document inventory builder."""

import sqlite3
from pathlib import Path
from datetime import datetime
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DocumentInventory:
    """Build and manage document inventories."""

    SUPPORTED_EXTENSIONS = {
        '.pdf': 'PDF',
        '.docx': 'Word',
        '.doc': 'Word',
        '.txt': 'Text',
        '.md': 'Markdown',
        '.xlsx': 'Excel',
        '.xls': 'Excel',
        '.pptx': 'PowerPoint',
        '.ppt': 'PowerPoint',
    }

    def __init__(self, db_path):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path, timeout=30)
        self._setup_tables()

    def _setup_tables(self):
        cursor = self.conn.cursor()

        cursor.execute('''
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY,
                filename TEXT NOT NULL,
                filepath TEXT UNIQUE NOT NULL,
                extension TEXT,
                file_type TEXT,
                category TEXT,
                file_size INTEGER,
                created_date TEXT,
                modified_date TEXT,
                parent_dir TEXT,
                depth INTEGER,
                scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        ''')

        cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_category ON documents(category)
        ''')

        cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_extension ON documents(extension)
        ''')

        self.conn.commit()

    def scan_directory(self, root_path):
        """Scan directory and build inventory."""
        root = Path(root_path).resolve()
        logger.info(f"Scanning: {root}")

        count = 0
        for filepath in root.rglob('*'):
            if filepath.is_file():
                ext = filepath.suffix.lower()
                if ext in self.SUPPORTED_EXTENSIONS:
                    self._add_document(filepath, root)
                    count += 1

                    if count % 500 == 0:
                        logger.info(f"Scanned {count} documents...")
                        self.conn.commit()

        self.conn.commit()
        logger.info(f"Scan complete: {count} documents found")
        return count

    def _add_document(self, filepath, root):
        """Add document to inventory."""
        cursor = self.conn.cursor()

        try:
            stat = filepath.stat()
            ext = filepath.suffix.lower()

            cursor.execute('''
                INSERT OR REPLACE INTO documents
                (filename, filepath, extension, file_type, category,
                 file_size, created_date, modified_date, parent_dir, depth)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ''', (
                filepath.name,
                str(filepath),
                ext,
                self.SUPPORTED_EXTENSIONS.get(ext, 'Unknown'),
                self._categorize(filepath),
                stat.st_size,
                # note: st_ctime is metadata-change time on Unix, creation time on Windows
                datetime.fromtimestamp(stat.st_ctime).isoformat(),
                datetime.fromtimestamp(stat.st_mtime).isoformat(),
                str(filepath.parent),
                len(filepath.relative_to(root).parts) - 1
            ))

        except Exception as e:
            logger.warning(f"Error adding {filepath}: {e}")

    def _categorize(self, filepath):
        """Auto-categorize document based on patterns."""
        name = filepath.name.upper()
        path_str = str(filepath).upper()

        # Industry standard patterns
        patterns = {
            'API': 'API',
            'ISO': 'ISO',
            'ASME': 'ASME',
            'DNV': 'DNV',
            'NORSOK': 'NORSOK',
            'BSI': 'BSI',
            'ASTM': 'ASTM',
            'AWS': 'AWS',
            'ABS': 'ABS',
            'AISC': 'AISC',
            'IEEE': 'IEEE',
        }

        for pattern, category in patterns.items():
            if pattern in name or pattern in path_str:
                return category

        # Path-based categorization
        path_categories = {
            'STANDARD': 'Standards',
            'SPEC': 'Specifications',
            'MANUAL': 'Manuals',
            'GUIDE': 'Guides',
            'REPORT': 'Reports',
            'DRAWING': 'Drawings',
            'PROCEDURE': 'Procedures',
        }

        for pattern, category in path_categories.items():
            if pattern in path_str:
                return category

        return 'Unknown'

    def get_statistics(self):
        """Get inventory statistics."""
        cursor = self.conn.cursor()

        stats = {}

        # Total count
        cursor.execute('SELECT COUNT(*) FROM documents')
        stats['total_documents'] = cursor.fetchone()[0]

        # Total size
        cursor.execute('SELECT SUM(file_size) FROM documents')
        total_bytes = cursor.fetchone()[0] or 0
        stats['total_size_mb'] = round(total_bytes / (1024 * 1024), 2)

        # By file type
        cursor.execute('''
            SELECT file_type, COUNT(*), SUM(file_size)
            FROM documents
            GROUP BY file_type
            ORDER BY COUNT(*) DESC
        ''')
        stats['by_type'] = {
            row[0]: {'count': row[1], 'size_mb': round((row[2] or 0) / 1024 / 1024, 2)}
            for row in cursor.fetchall()
        }

        # By category
        cursor.execute('''
            SELECT category, COUNT(*)
            FROM documents
            GROUP BY category
            ORDER BY COUNT(*) DESC
        ''')
        stats['by_category'] = dict(cursor.fetchall())

        # By extension
        cursor.execute('''
            SELECT extension, COUNT(*)
            FROM documents
            GROUP BY extension
            ORDER BY COUNT(*) DESC
        ''')
        stats['by_extension'] = dict(cursor.fetchall())

        return stats

    def search(self, query, category=None, file_type=None, limit=50):
        """Search inventory."""
        cursor = self.conn.cursor()

        sql = 'SELECT filename, filepath, category, file_size FROM documents WHERE 1=1'
        params = []

        if query:
            sql += ' AND filename LIKE ?'
            params.append(f'%{query}%')

        if category:
            sql += ' AND category = ?'
            params.append(category)

        if file_type:
            sql += ' AND file_type = ?'
            params.append(file_type)

        sql += ' ORDER BY filename LIMIT ?'
        params.append(limit)

        cursor.execute(sql, params)
        return cursor.fetchall()

    def export_csv(self, output_path):
        """Export inventory to CSV."""
        import csv

        cursor = self.conn.cursor()
        cursor.execute('SELECT * FROM documents')

        columns = [desc[0] for desc in cursor.description]

        with open(output_path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(columns)
            writer.writerows(cursor.fetchall())

        logger.info(f"Exported to {output_path}")
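The `depth` column computed in `_add_document` counts directory levels below the scan root; for instance (paths are illustrative):

```python
from pathlib import Path

root = Path("/docs")
f = Path("/docs/standards/iso/9001.pdf")

# same expression as _add_document: parts of the relative path, minus the filename
depth = len(f.relative_to(root).parts) - 1
print(depth)  # 2 (two directories below the root: standards/iso)
```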

CLI Interface


python
#!/usr/bin/env python3
"""Document Inventory CLI."""

import argparse
import json

# assumes DocumentInventory (the class defined above) lives in the same file, inventory.py

def main():
    parser = argparse.ArgumentParser(description='Document Inventory Tool')
    subparsers = parser.add_subparsers(dest='command', help='Commands')

    # Scan command
    scan_parser = subparsers.add_parser('scan', help='Scan directory')
    scan_parser.add_argument('path', help='Directory to scan')
    scan_parser.add_argument('--db', default='inventory.db', help='Database path')

    # Stats command
    stats_parser = subparsers.add_parser('stats', help='Show statistics')
    stats_parser.add_argument('--db', default='inventory.db', help='Database path')
    stats_parser.add_argument('--json', action='store_true', help='Output as JSON')

    # Search command
    search_parser = subparsers.add_parser('search', help='Search inventory')
    search_parser.add_argument('query', help='Search query')
    search_parser.add_argument('--db', default='inventory.db', help='Database path')
    search_parser.add_argument('--category', help='Filter by category')
    search_parser.add_argument('--type', help='Filter by file type')

    # Export command
    export_parser = subparsers.add_parser('export', help='Export to CSV')
    export_parser.add_argument('output', help='Output CSV path')
    export_parser.add_argument('--db', default='inventory.db', help='Database path')

    args = parser.parse_args()

    if args.command == 'scan':
        inventory = DocumentInventory(args.db)
        count = inventory.scan_directory(args.path)
        print(f"\nScanned {count} documents")

        stats = inventory.get_statistics()
        print(f"Total size: {stats['total_size_mb']} MB")
        print(f"\nBy category:")
        for cat, count in list(stats['by_category'].items())[:10]:
            print(f"  {cat}: {count}")

    elif args.command == 'stats':
        inventory = DocumentInventory(args.db)
        stats = inventory.get_statistics()

        if args.json:
            print(json.dumps(stats, indent=2))
        else:
            print(f"Total Documents: {stats['total_documents']}")
            print(f"Total Size: {stats['total_size_mb']} MB")
            print(f"\nBy Type:")
            for t, data in stats['by_type'].items():
                print(f"  {t}: {data['count']} ({data['size_mb']} MB)")
            print(f"\nBy Category:")
            for cat, count in list(stats['by_category'].items())[:15]:
                print(f"  {cat}: {count}")

    elif args.command == 'search':
        inventory = DocumentInventory(args.db)
        results = inventory.search(
            args.query,
            category=args.category,
            file_type=args.type
        )

        print(f"Found {len(results)} results:\n")
        for filename, filepath, category, size in results:
            size_kb = size / 1024
            print(f"  [{category:10}] {filename} ({size_kb:.1f} KB)")

    elif args.command == 'export':
        inventory = DocumentInventory(args.db)
        inventory.export_csv(args.output)

    else:
        parser.print_help()


if __name__ == '__main__':
    main()

Report Generator


python
from datetime import datetime  # needed for the timestamp below, if not already in scope

def generate_report(db_path, output_path):
    """Generate HTML inventory report."""
    inventory = DocumentInventory(db_path)
    stats = inventory.get_statistics()

    html = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Document Inventory Report</title>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 40px; }}
            h1 {{ color: #333; }}
            .stat-box {{ background: #f5f5f5; padding: 20px; margin: 10px 0; border-radius: 8px; }}
            .stat-value {{ font-size: 2em; font-weight: bold; color: #2196F3; }}
            table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
            th, td {{ border: 1px solid #ddd; padding: 12px; text-align: left; }}
            th {{ background: #2196F3; color: white; }}
            tr:nth-child(even) {{ background: #f9f9f9; }}
        </style>
    </head>
    <body>
        <h1>Document Inventory Report</h1>
        <p>Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>

        <div class="stat-box">
            <div class="stat-value">{stats['total_documents']:,}</div>
            <div>Total Documents</div>
        </div>

        <div class="stat-box">
            <div class="stat-value">{stats['total_size_mb']:,.1f} MB</div>
            <div>Total Size</div>
        </div>

        <h2>By File Type</h2>
        <table>
            <tr><th>Type</th><th>Count</th><th>Size (MB)</th></tr>
            {''.join(f"<tr><td>{t}</td><td>{d['count']}</td><td>{d['size_mb']}</td></tr>"
                     for t, d in stats['by_type'].items())}
        </table>

        <h2>By Category</h2>
        <table>
            <tr><th>Category</th><th>Count</th></tr>
            {''.join(f"<tr><td>{c}</td><td>{n}</td></tr>"
                     for c, n in stats['by_category'].items())}
        </table>
    </body>
    </html>
    """

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html)

    print(f"Report generated: {output_path}")
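One detail worth noting in the template above: inside an f-string, literal CSS braces must be doubled. A minimal check:

```python
value = 42
# doubled braces emit a single literal brace; single braces interpolate
css = f"body {{ margin: {value}px; }}"
print(css)  # body { margin: 42px; }
```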

Custom Categorization


Extend with Your Patterns


python
# Add custom patterns for your domain
CUSTOM_PATTERNS = {
    'SPEC': 'Specifications',
    'DWG': 'Drawings',
    'REV': 'Revisions',
    'APPROVED': 'Approved',
    'DRAFT': 'Draft',
    'SUPERSEDED': 'Superseded',
}

def categorize_custom(filepath):
    name = filepath.name.upper()
    for pattern, category in CUSTOM_PATTERNS.items():
        if pattern in name:
            return category
    return 'Uncategorized'
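Because the loop returns on the first hit, the insertion order of the pattern dict decides ties (dicts preserve insertion order in Python 3.7+). A small check with a trimmed-down pattern set and a hypothetical filename:

```python
# first-match-wins: 'SPEC' is checked before 'REV'
patterns = {'SPEC': 'Specifications', 'REV': 'Revisions'}

def categorize(name, patterns):
    name = name.upper()
    for pattern, category in patterns.items():
        if pattern in name:
            return category
    return 'Uncategorized'

# matches both patterns, but 'SPEC' wins because it is listed first
print(categorize("pipeline-spec-rev3.pdf", patterns))  # Specifications
```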

Multi-Level Categories


python
def categorize_hierarchical(filepath):
    """Create hierarchical categories."""
    name = filepath.name.upper()

    # Primary category
    primary = 'General'
    if 'API' in name:
        primary = 'API Standards'
    elif 'ISO' in name:
        primary = 'ISO Standards'

    # Secondary category
    secondary = 'Other'
    if 'DESIGN' in name:
        secondary = 'Design'
    elif 'SAFETY' in name:
        secondary = 'Safety'
    elif 'QUALITY' in name:
        secondary = 'Quality'

    return f"{primary}/{secondary}"
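The `Primary/Secondary` strings split cleanly when rolling a report up by primary category; a sketch with made-up category values:

```python
from collections import Counter

# hypothetical outputs of categorize_hierarchical
cats = ["API Standards/Design", "API Standards/Safety",
        "ISO Standards/Design", "General/Other"]

# roll up by the primary (leftmost) level
primary_counts = Counter(c.split("/")[0] for c in cats)
print(primary_counts["API Standards"])  # 2
```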

Execution Checklist


  • Identify target directory for scanning
  • Create SQLite database for inventory
  • Run initial scan and review results
  • Customize categorization patterns if needed
  • Generate statistics report
  • Export to CSV for review
  • Generate HTML report for stakeholders
  • Plan next steps (knowledge base creation)

Error Handling


Common Errors


Error: PermissionError
  • Cause: Insufficient permissions to read files
  • Solution: Run with appropriate permissions, or skip protected files
Error: sqlite3.OperationalError (database is locked)
  • Cause: Concurrent access without a timeout
  • Solution: Pass timeout=30 when connecting
Error: UnicodeDecodeError in filenames
  • Cause: Non-UTF-8 characters in file paths
  • Solution: Decode paths with errors='replace'
Error: OSError (too many open files)
  • Cause: File handles not closed properly
  • Solution: Use context managers and batch commits
Error: Slow scanning on network drives
  • Cause: Network latency on every file access
  • Solution: Copy to a local drive, or use async scanning
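The `errors='replace'` fix for undecodable filenames looks like this (the byte string is a contrived example):

```python
# hypothetical filename bytes containing a non-UTF-8 byte
raw = b"report_\xff_final.pdf"

# strict decoding raises UnicodeDecodeError; 'replace' substitutes U+FFFD instead
name = raw.decode("utf-8", errors="replace")
print(name)  # 'report_\ufffd_final.pdf' (the bad byte becomes the replacement char)
```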

Metrics


| Metric | Typical Value |
| --- | --- |
| Scan speed (local) | ~1,000 files/second |
| Scan speed (network) | ~100 files/second |
| Database size | ~1 KB per 10 documents |
| Memory usage | ~50 MB for 100K documents |
| Report generation | <1 second |

Best Practices


  1. Scan before processing - Always inventory first
  2. Use a SQLite timeout - Pass timeout=30 for concurrent access
  3. Batch commits - Commit every 500 files
  4. Handle errors gracefully - Log and continue on failures
  5. Export for review - Generate CSV/HTML for stakeholders
  6. Update incrementally - Use INSERT OR REPLACE
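The batch-commit practice is the same pattern `scan_directory` uses; sketched standalone with an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for inventory.db
conn.execute("CREATE TABLE docs (path TEXT)")
cursor = conn.cursor()

for i in range(1200):
    cursor.execute("INSERT INTO docs VALUES (?)", (f"/docs/file_{i}.pdf",))
    if (i + 1) % 500 == 0:
        conn.commit()  # periodic commit keeps the open transaction small

conn.commit()  # flush the final partial batch
print(conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0])  # 1200
```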

Example Usage


bash
# Scan directory
python inventory.py scan /path/to/documents --db inventory.db

# View statistics
python inventory.py stats --db inventory.db

# Search
python inventory.py search "API" --category "Standards"

# Export to CSV
python inventory.py export inventory.csv --db inventory.db

Related Skills


  • knowledge-base-builder - Build searchable database after inventory
  • pdf-text-extractor - Extract text from inventoried PDFs
  • semantic-search-setup - Add AI search capabilities

Dependencies


No external dependencies - uses the Python standard library.

Optional: pandas for advanced data manipulation:

bash
pip install pandas

---

Version History


  • 1.1.0 (2026-01-02): Added Quick Start, Execution Checklist, Error Handling, Metrics sections; updated frontmatter with version, category, related_skills
  • 1.0.0 (2024-10-15): Initial release with SQLite storage, auto-categorization, CLI interface