office-to-md

Office to Markdown Skill

Overview

This skill converts various Office formats to Markdown using markitdown, Microsoft's open-source tool for converting documents to Markdown. It is well suited to making Office content searchable, version-controllable, and AI-friendly.

How to Use

  1. Provide the Office file (Word, Excel, PowerPoint, PDF, etc.)
  2. Optionally specify conversion options
  3. I'll convert it to clean Markdown
Example prompts:
  • "Convert this Word document to Markdown"
  • "Turn this PowerPoint into Markdown notes"
  • "Extract content from this PDF as Markdown"
  • "Convert this Excel file to Markdown tables"

Domain Knowledge

markitdown Fundamentals

python
from markitdown import MarkItDown

# Initialize converter
md = MarkItDown()

# Convert file
result = md.convert("document.docx")
print(result.text_content)

# Save to file
with open("output.md", "w") as f:
    f.write(result.text_content)

Supported Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| Word | .docx | Full text, tables, basic formatting |
| Excel | .xlsx | Converts to Markdown tables |
| PowerPoint | .pptx | Slides as sections |
| PDF | .pdf | Text extraction |
| HTML | .html | Clean Markdown |
| Images | .jpg, .png | OCR with vision model |
| Audio | .mp3, .wav | Transcription |
| ZIP | .zip | Processes contained files |
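
Before converting, a file can be checked against the table above. This is a minimal sketch: the `SUPPORTED` mapping simply restates the table and `format_for` is a hypothetical helper, not a markitdown function. The extensions markitdown actually accepts depend on the installed version and extras.

```python
from pathlib import Path

# Restates the table above; not read from markitdown itself.
SUPPORTED = {
    ".docx": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".pdf": "PDF",
    ".html": "HTML",
    ".jpg": "Images",
    ".png": "Images",
    ".mp3": "Audio",
    ".wav": "Audio",
    ".zip": "ZIP",
}


def format_for(path):
    """Return the format name for `path`, or None if its extension is not listed."""
    return SUPPORTED.get(Path(path).suffix.lower())


print(format_for("Report.DOCX"))  # Word
print(format_for("notes.txt"))    # None
```

The lookup lower-cases the suffix so that `Report.DOCX` and `report.docx` are treated alike.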

Basic Usage

Python API

python
from markitdown import MarkItDown

# Simple conversion
md = MarkItDown()
result = md.convert("document.docx")

# Access content
markdown_text = result.text_content

# With options
md = MarkItDown(
    llm_client=None,  # Optional LLM client for enhanced processing
    llm_model=None,   # Model name if using an LLM
)
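
The calls above can be wrapped in a small helper that fails early on a missing file and rejects empty output. This is a minimal sketch, not part of markitdown: `convert_to_markdown` and its `converter` parameter are hypothetical names, and the only library behaviour assumed is what this document already shows (`convert(path)` returning an object with `text_content`).

```python
from pathlib import Path


def convert_to_markdown(path, converter=None):
    """Return the Markdown text for `path`, raising on missing files or empty output.

    `converter` is any object with a convert(path) method whose result has a
    text_content attribute (hypothetical parameter; defaults to MarkItDown()).
    """
    source = Path(path)
    if not source.is_file():
        raise FileNotFoundError(f"No such file: {source}")

    if converter is None:
        # Imported lazily so the helper can be exercised with a stub converter.
        from markitdown import MarkItDown
        converter = MarkItDown()

    text = converter.convert(str(source)).text_content
    if not text or not text.strip():
        raise ValueError(f"Conversion produced no text for {source}")
    return text
```

Calling `convert_to_markdown("document.docx")` uses a fresh `MarkItDown()`. Passing an existing converter avoids re-creating it for every file.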

Command Line

bash
# Install
pip install markitdown

# Convert file
markitdown document.docx > output.md

# Or with output file
markitdown document.docx -o output.md

Word Document Conversion

python
from markitdown import MarkItDown

md = MarkItDown()

# Convert Word document
result = md.convert("report.docx")

# Output preserves:
# - Headings (as # headers)
# - Bold/italic formatting
# - Lists (bulleted and numbered)
# - Tables (as markdown tables)
# - Hyperlinks

print(result.text_content)

**Example Output:**
```markdown
# Annual Report 2024

## Executive Summary

This report summarizes the key achievements and challenges...

## Key Metrics

| Metric | 2023 | 2024 | Change |
|--------|------|------|--------|
| Revenue | $10M | $12M | +20% |
| Users | 50K | 75K | +50% |

## Detailed Analysis

The following sections provide...
```
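
Since headings come through as `#` lines, the structure of a converted document can be inspected without opening it. The sketch below is an illustration only (`heading_outline` and `skipped_levels` are hypothetical helpers): it lists headings and flags places where the level jumps by more than one, which often points at inconsistent heading styles in the source document. It assumes ATX-style headings and skips fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_outline(markdown_text):
    """Return (level, title) pairs for ATX headings outside fenced code blocks."""
    outline = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


def skipped_levels(outline):
    """Return titles whose level is more than one deeper than the previous heading."""
    flagged = []
    previous = 0
    for level, title in outline:
        if previous and level > previous + 1:
            flagged.append(title)
        previous = level
    return flagged


sample = "# Annual Report 2024\n\n## Executive Summary\n\n#### Key Metrics\n"
outline = heading_outline(sample)
print(outline)
print(skipped_levels(outline))  # ['Key Metrics']
```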

Excel Conversion

python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")

# Each sheet becomes a section
# Data becomes markdown tables

print(result.text_content)

**Example Output:**
```markdown
## Sheet1

| Name | Department | Salary |
|------|------------|--------|
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |

## Sheet2

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | 100 | 120 | 150 | 180 |
```
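
Because each sheet arrives as a Markdown table, the rows can be read back into Python structures when needed. This is a minimal sketch, assuming simple pipe tables with a header row and a `---` separator row and no escaped `|` inside cells. `parse_markdown_table` is a hypothetical helper, not part of markitdown.

```python
def parse_markdown_table(table_text):
    """Parse one pipe table into a list of dicts keyed by the header cells."""
    rows = []
    for line in table_text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)

    if len(rows) < 2:
        return []

    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, cells)) for cells in body]


table = """
| Name | Department | Salary |
| --- | --- | --- |
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |
"""
for row in parse_markdown_table(table):
    print(row["Name"], row["Salary"])
```

All values stay as strings. Converting `$80,000` to a number is left to the caller, since currency and thousands separators vary by workbook.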

PowerPoint Conversion

python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")

# Each slide becomes a section
# Speaker notes included if present

print(result.text_content)

**Example Output:**
```markdown
## Slide 1: Company Overview

Our mission is to...

### Key Points

- Innovation first
- Customer focused
- Global reach

## Slide 2: Market Analysis

The market opportunity is significant...

Notes: Mention the competitor analysis here
```

PDF Conversion

python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")

# Extracts text content
# Tables converted where detected

print(result.text_content)

Image Conversion (with Vision Model)

python
import os

from markitdown import MarkItDown
from openai import OpenAI

# Use Claude for image description.
# markitdown calls llm_client.chat.completions.create(...), an OpenAI-style
# interface that the native anthropic.Anthropic() client does not expose.
# Assumption: Claude is reached through Anthropic's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    base_url="https://api.anthropic.com/v1/",
)
md = MarkItDown(llm_client=client, llm_model="claude-sonnet-4-20250514")

result = md.convert("diagram.png")
print(result.text_content)

# Output: Description of the image content

Batch Conversion

python
from markitdown import MarkItDown
from pathlib import Path

def batch_convert(input_dir, output_dir):
    """Convert all Office files to Markdown."""
    md = MarkItDown()
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    extensions = ['.docx', '.xlsx', '.pptx', '.pdf']
    
    for ext in extensions:
        for file in input_path.glob(f'*{ext}'):
            try:
                result = md.convert(str(file))
                output_file = output_path / f"{file.stem}.md"
                
                with open(output_file, 'w') as f:
                    f.write(result.text_content)
                
                print(f"Converted: {file.name}")
            except Exception as e:
                print(f"Error converting {file.name}: {e}")

batch_convert('./documents', './markdown')
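
One thing to watch in the loop above: `report.docx` and `report.pdf` both map to `report.md`, so the later file silently overwrites the earlier one. A possible way around this is sketched below, under the assumption that keeping the source extension in the output name is acceptable (`output_name` is a hypothetical helper).

```python
from pathlib import Path


def output_name(source):
    """Map a source file to a Markdown file name that keeps the original extension.

    report.docx -> report.docx.md, report.pdf -> report.pdf.md
    """
    return Path(source).name + ".md"


names = [output_name(p) for p in ["docs/report.docx", "docs/report.pdf"]]
print(names)  # ['report.docx.md', 'report.pdf.md']
```

In `batch_convert`, this would replace `f"{file.stem}.md"` with `output_name(file)`.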

Best Practices

  1. Check Output Quality: Review converted Markdown for accuracy
  2. Handle Tables: Complex tables may need manual adjustment
  3. Preserve Structure: Use consistent heading levels in source docs
  4. Image Handling: Consider using vision models for important images
  5. Version Control: Store converted Markdown in Git for tracking
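
Points 1 and 2 can be partly automated. The sketch below is a heuristic, not a validator (`check_markdown` is a hypothetical helper): it flags empty output and table rows whose cell count differs from the table's header row, a common sign that a merged or nested cell did not survive conversion.

```python
def check_markdown(markdown_text):
    """Return a list of human-readable warnings for converted Markdown."""
    warnings = []
    if not markdown_text.strip():
        return ["output is empty"]

    expected = None  # cell count of the current table's header row
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # any non-table line ends the current table
            continue
        cells = len(stripped.strip("|").split("|"))
        if expected is None:
            expected = cells
        elif cells != expected:
            warnings.append(
                f"line {number}: expected {expected} cells, found {cells}"
            )
    return warnings


sample = "| A | B |\n| --- | --- |\n| 1 | 2 | 3 |\n"
print(check_markdown(sample))  # ['line 3: expected 2 cells, found 3']
```

An empty list means nothing was flagged. It does not mean the conversion is accurate, so a manual review is still needed.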

Common Patterns

Document Archive

python
import os
from datetime import datetime
from markitdown import MarkItDown

def archive_document(doc_path, archive_dir):
    """Convert and archive Office document to Markdown."""
    md = MarkItDown()
    result = md.convert(doc_path)
    
    # Create archive structure
    date_str = datetime.now().strftime('%Y-%m-%d')
    filename = os.path.basename(doc_path)
    base_name = os.path.splitext(filename)[0]
    
    # Save with metadata
    output_content = f"""---
source: {filename}
converted: {date_str}
---

{result.text_content}
"""
    
    output_path = os.path.join(archive_dir, f"{base_name}.md")
    with open(output_path, 'w') as f:
        f.write(output_content)
    
    return output_path
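
The front matter above interpolates the file name directly, so a name containing `:` or `#` can produce invalid YAML. One hedge is to emit each value as a JSON string, which YAML 1.2 parsers also accept as a double-quoted scalar. This is a minimal sketch (`front_matter` is a hypothetical helper).

```python
import json


def front_matter(fields):
    """Render a dict of string values as a YAML front matter block."""
    lines = ["---"]
    for key, value in fields.items():
        # json.dumps quotes and escapes the value; a JSON string is also a
        # valid YAML double-quoted scalar.
        lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


print(front_matter({"source": "Q3: plan #2.docx", "converted": "2024-01-31"}))
```

In `archive_document`, the result would be concatenated with `result.text_content` in place of the hand-written `---` block.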

AI-Ready Corpus

python
from markitdown import MarkItDown
from pathlib import Path
import json

def create_ai_corpus(doc_folder, output_file):
    """Convert documents to JSON corpus for AI training/RAG."""
    md = MarkItDown()
    corpus = []
    
    for doc in Path(doc_folder).glob('**/*'):
        if doc.suffix in ['.docx', '.pdf', '.pptx', '.xlsx']:
            try:
                result = md.convert(str(doc))
                corpus.append({
                    'source': str(doc),
                    'filename': doc.name,
                    'content': result.text_content,
                    'type': doc.suffix[1:]
                })
            except Exception as e:
                print(f"Skipped {doc.name}: {e}")
    
    with open(output_file, 'w') as f:
        json.dump(corpus, f, indent=2)
    
    print(f"Created corpus with {len(corpus)} documents")
    return corpus
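
For retrieval use, whole-document entries are often too coarse. A common follow-up step, sketched here as an assumption rather than something the corpus function does, is to split each document's `content` at headings so every chunk carries its own title. `split_by_headings` is a hypothetical helper and ignores the edge case of `#` lines inside fenced code blocks.

```python
import re

HEADING_LINE = re.compile(r"^#{1,6}\s+(.+)$")


def split_by_headings(markdown_text):
    """Split Markdown into chunks, one per heading, as {'title', 'text'} dicts.

    Text before the first heading is returned under an empty title.
    """
    sections = [("", [])]
    for line in markdown_text.splitlines():
        match = HEADING_LINE.match(line)
        if match:
            sections.append((match.group(1).strip(), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for title, lines in sections:
        text = "\n".join(lines).strip()
        if title or text:
            chunks.append({"title": title, "text": text})
    return chunks


sample = "Intro line\n\n# Annual Report\n\n## Key Metrics\n\nRevenue grew 20%.\n"
for chunk in split_by_headings(sample):
    print(repr(chunk["title"]), "->", repr(chunk["text"]))
```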

Examples

Example 1: Convert Documentation Suite

python
from markitdown import MarkItDown
from pathlib import Path

def convert_docs_to_wiki(docs_folder, wiki_folder):
    """Convert all Office docs to markdown wiki structure."""
    md = MarkItDown()
    docs_path = Path(docs_folder)
    wiki_path = Path(wiki_folder)
    
    # Create wiki structure
    wiki_path.mkdir(exist_ok=True)
    
    # Create index
    index_content = "# Documentation Index\n\n"
    
    for doc in sorted(docs_path.glob('**/*.docx')):
        try:
            result = md.convert(str(doc))
            
            # Create relative path in wiki
            rel_path = doc.relative_to(docs_path)
            output_file = wiki_path / rel_path.with_suffix('.md')
            output_file.parent.mkdir(parents=True, exist_ok=True)
            
            # Write markdown
            with open(output_file, 'w') as f:
                f.write(result.text_content)
            
            # Add to index
            link = str(rel_path.with_suffix('.md')).replace('\\', '/')
            index_content += f"- [{doc.stem}]({link})\n"
            
            print(f"Converted: {doc.name}")
            
        except Exception as e:
            print(f"Error: {doc.name} - {e}")
    
    # Write index
    with open(wiki_path / 'index.md', 'w') as f:
        f.write(index_content)

convert_docs_to_wiki('./company_docs', './wiki')

Example 2: Meeting Notes Processor

python
from markitdown import MarkItDown
from datetime import datetime

def process_meeting_notes(pptx_path):
    """Extract and structure meeting notes from PowerPoint."""
    md = MarkItDown()
    result = md.convert(pptx_path)

    # Parse the markdown
    content = result.text_content

    # Extract sections
    sections = {
        'attendees': [],
        'agenda': [],
        'decisions': [],
        'action_items': []
    }

    current_section = None

    for line in content.split('\n'):
        line_lower = line.lower()

        # A line containing a keyword switches the current section;
        # bullet lines are collected under whichever section is active.
        if 'attendee' in line_lower or 'participant' in line_lower:
            current_section = 'attendees'
        elif 'agenda' in line_lower:
            current_section = 'agenda'
        elif 'decision' in line_lower:
            current_section = 'decisions'
        elif 'action' in line_lower:
            current_section = 'action_items'
        elif line.strip().startswith(('-', '*', '•')) and current_section:
            sections[current_section].append(line.strip()[1:].strip())

    # Generate structured output
    output = f"""# Meeting Notes

**Date:** {datetime.now().strftime('%Y-%m-%d')}
**Source:** {pptx_path}

## Attendees
{chr(10).join('- ' + a for a in sections['attendees'])}

## Agenda
{chr(10).join('- ' + a for a in sections['agenda'])}

## Decisions Made
{chr(10).join('- ' + d for d in sections['decisions'])}

## Action Items
{chr(10).join('- [ ] ' + a for a in sections['action_items'])}
"""

    return output

notes = process_meeting_notes('team_meeting.pptx')
print(notes)
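
The keyword scan above can be tried without a PowerPoint file by pulling it into a function that takes Markdown text. This sketch repeats the same logic on a hand-written sample (`parse_sections` and the sample text are illustrative, not produced by markitdown).

```python
def parse_sections(markdown_text):
    """Group bullet lines under attendees/agenda/decisions/action_items by keyword."""
    keywords = [
        ("attendees", ("attendee", "participant")),
        ("agenda", ("agenda",)),
        ("decisions", ("decision",)),
        ("action_items", ("action",)),
    ]
    sections = {name: [] for name, _ in keywords}
    current = None

    for line in markdown_text.split("\n"):
        lowered = line.lower()
        # First keyword group that matches wins, mirroring the if/elif chain.
        matched = next(
            (name for name, words in keywords if any(w in lowered for w in words)),
            None,
        )
        if matched:
            current = matched
        elif line.strip().startswith(("-", "*", "•")) and current:
            sections[current].append(line.strip()[1:].strip())
    return sections


sample = """# Attendees
- Ana
- Ben

# Agenda
- Budget review

# Action Items
* Send minutes
"""
print(parse_sections(sample))
```

A bullet that itself contains a keyword (for example "- Review agenda for next week") switches the section instead of being recorded, which is a limit of this keyword approach.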

Example 3: Excel to Documentation

python
from datetime import datetime  # needed for the Change Log date below

from markitdown import MarkItDown

def excel_to_data_dictionary(xlsx_path):
    """Convert Excel data model to data dictionary documentation."""
    md = MarkItDown()
    result = md.convert(xlsx_path)

    # Add documentation structure
    doc = f"""# Data Dictionary

Generated from: `{xlsx_path}`

{result.text_content}

## Usage Notes

- All tables are derived from the source Excel file
- Review data types and constraints before use
- Contact data team for clarifications

## Change Log

| Date | Change | Author |
|------|--------|--------|
| {datetime.now().strftime('%Y-%m-%d')} | Initial generation | Auto |
"""

    return doc

documentation = excel_to_data_dictionary('data_model.xlsx')
with open('data_dictionary.md', 'w') as f:
    f.write(documentation)

Limitations

  • Complex formatting may be simplified
  • Images are not embedded (use vision model for descriptions)
  • Some table structures may not convert perfectly
  • Track changes in Word are not preserved
  • Comments may not be extracted

Installation

bash
pip install markitdown

# For image/audio processing (installs all optional dependencies)
pip install 'markitdown[all]'

# For specific features
# Extra names below are those published for markitdown 0.1.x; check the
# project README for the version you install.
pip install 'markitdown[audio-transcription]'  # Audio transcription
# Image descriptions rely on the llm_client passed to MarkItDown, not on a separate extra

Resources
