office-to-md

Office to Markdown Skill

Overview

This skill converts various Office formats to Markdown using markitdown, Microsoft's open-source tool for converting documents to Markdown. It is well suited to making Office content searchable, version-controllable, and AI-friendly.

How to Use

Provide the Office file (Word, Excel, PowerPoint, PDF, etc.)
Optionally specify conversion options
I'll convert it to clean Markdown

Example prompts:

"Convert this Word document to Markdown"
"Turn this PowerPoint into Markdown notes"
"Extract content from this PDF as Markdown"
"Convert this Excel file to Markdown tables"

Domain Knowledge

markitdown Fundamentals

```python
from markitdown import MarkItDown

# Initialize converter
md = MarkItDown()

# Convert file
result = md.convert("document.docx")
print(result.text_content)

# Save to file
with open("output.md", "w") as f:
    f.write(result.text_content)
```

Supported Formats

| Format | Extension | Notes |
|--------|-----------|-------|
| Word | .docx | Full text, tables, basic formatting |
| Excel | .xlsx | Converts to Markdown tables |
| PowerPoint | .pptx | Slides as sections |
| PDF | .pdf | Text extraction |
| HTML | .html | Clean Markdown |
| Images | .jpg, .png | OCR with vision model |
| Audio | .mp3, .wav | Transcription |
| ZIP | .zip | Processes contained files |
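
Before handing a file to the converter, it can help to check whether its extension appears in the table above. The following is a minimal sketch of such a lookup; `SUPPORTED_FORMATS` and `detect_format` are hypothetical names introduced here for illustration and are not part of markitdown.

```python
from pathlib import Path

# Extension-to-format mapping copied from the table above (illustrative only).
SUPPORTED_FORMATS = {
    ".docx": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".pdf": "PDF",
    ".html": "HTML",
    ".jpg": "Images",
    ".png": "Images",
    ".mp3": "Audio",
    ".wav": "Audio",
    ".zip": "ZIP",
}


def detect_format(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return SUPPORTED_FORMATS.get(Path(path).suffix.lower())


print(detect_format("Report.DOCX"))  # Word
print(detect_format("notes.txt"))    # None
```

The lookup lowercases the suffix so that `Report.DOCX` and `report.docx` are treated alike.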

Basic Usage

Python API

```python
from markitdown import MarkItDown

# Simple conversion
md = MarkItDown()
result = md.convert("document.docx")

# Access content
markdown_text = result.text_content

# With options
md = MarkItDown(
    llm_client=None,  # Optional LLM for enhanced processing
    llm_model=None    # Model name if using LLM
)
```

Command Line

```bash
# Install
pip install markitdown

# Convert file
markitdown document.docx > output.md

# Or with output file
markitdown document.docx -o output.md
```
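
When the command-line tool needs to be driven from a script, the same invocation can be wrapped with `subprocess`. This is a sketch under the assumption that the `markitdown` executable is on the `PATH`; `build_markitdown_command` and `run_markitdown` are hypothetical helper names.

```python
import subprocess


def build_markitdown_command(source, output):
    """Build the argument list for `markitdown <source> -o <output>`."""
    return ["markitdown", str(source), "-o", str(output)]


def run_markitdown(source, output):
    """Run the CLI and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_markitdown_command(source, output), check=True)


# Only builds the command here; call run_markitdown(...) to execute it.
print(build_markitdown_command("document.docx", "output.md"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.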

Word Document Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert Word document
result = md.convert("report.docx")

# Output preserves:
# - Headings (as # headers)
# - Bold/italic formatting
# - Lists (bulleted and numbered)
# - Tables (as markdown tables)
# - Hyperlinks

print(result.text_content)
```

**Example Output:**
```markdown
# Annual Report 2024

## Executive Summary

This report summarizes the key achievements and challenges...

## Key Metrics

| Metric | 2023 | 2024 | Change |
|--------|------|------|--------|
| Revenue | $10M | $12M | +20% |
| Users | 50K | 75K | +50% |

## Detailed Analysis

The following sections provide...
```

Excel Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")

# Each sheet becomes a section
# Data becomes markdown tables

print(result.text_content)
```

**Example Output:**
```markdown
## Sheet1

| Name | Department | Salary |
|------|------------|--------|
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |

## Sheet2

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget A | 100 | 120 | 150 | 180 |
```
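
Once a sheet has been converted to a Markdown table, downstream code often wants the rows back as structured data. The sketch below parses a single pipe-delimited table into dictionaries; `parse_markdown_table` is a hypothetical helper, and it assumes a header row, one separator row, and no escaped `|` characters inside cells.

```python
def parse_markdown_table(table_text):
    """Parse one pipe-delimited Markdown table into a list of row dictionaries."""
    lines = [line.strip() for line in table_text.strip().splitlines() if line.strip()]

    def split_row(line):
        # Drop the outer pipes, then split the remaining cells.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows


sample = """
| Name | Department | Salary |
| ---- | ---------- | ------ |
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |
"""

for row in parse_markdown_table(sample):
    print(row["Name"], row["Salary"])
```

All values come back as strings, so numeric columns such as `Salary` still need their own cleanup.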

PowerPoint Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")

# Each slide becomes a section
# Speaker notes included if present

print(result.text_content)
```

**Example Output:**
```markdown
## Slide 1: Company Overview

Our mission is to...

### Key Points

- Innovation first
- Customer focused
- Global reach

## Slide 2: Market Analysis

The market opportunity is significant...

Notes: Mention the competitor analysis here
```
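
To work with slides individually, the converted text can be split at each slide heading. This sketch assumes slide sections begin with a line such as `Slide 1: Title` (optionally prefixed with `#` markers), as in the example output above; the actual markers emitted by markitdown may differ between versions, and `split_slides` is a hypothetical helper.

```python
import re

# Assumption: each slide starts with "Slide <n>: <title>", optionally after "#" markers.
SLIDE_HEADING = re.compile(r"^#*\s*Slide\s+(\d+):\s*(.*)$")


def split_slides(markdown_text):
    """Split converted presentation text into one dictionary per slide."""
    slides = []
    current = None
    for line in markdown_text.splitlines():
        match = SLIDE_HEADING.match(line)
        if match:
            current = {"number": int(match.group(1)), "title": match.group(2).strip(), "body": []}
            slides.append(current)
        elif current is not None:
            current["body"].append(line)
    for slide in slides:
        slide["body"] = "\n".join(slide["body"]).strip()
    return slides


sample_deck = """## Slide 1: Company Overview

Our mission is to...

## Slide 2: Market Analysis

Notes: Mention the competitor analysis here
"""

for slide in split_slides(sample_deck):
    print(slide["number"], slide["title"])
```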

PDF Conversion

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")

# Extracts text content
# Tables converted where detected

print(result.text_content)
```
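
Because PDF conversion relies on text extraction, a scanned (image-only) PDF may produce little or no text. A rough heuristic for flagging such files is sketched below; `looks_like_scanned_pdf` is a hypothetical helper and the 50-character threshold is an arbitrary assumption to tune for your documents.

```python
def looks_like_scanned_pdf(text_content, min_chars=50):
    """Heuristic: very little extracted text suggests an image-only (scanned) PDF."""
    visible = "".join(text_content.split())  # ignore all whitespace
    return len(visible) < min_chars


print(looks_like_scanned_pdf("   \n\n  "))
print(looks_like_scanned_pdf("Quarterly results improved across all regions. " * 5))
```

Files flagged this way are candidates for an OCR or vision-model pass rather than plain text extraction.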

Image Conversion (with Vision Model)

```python
from markitdown import MarkItDown
import anthropic

# Use Claude for image description
# Note: markitdown's own documented example passes an OpenAI client, and the
# converter is expected to call an OpenAI-style chat-completions interface on
# llm_client. Confirm that the client passed here is compatible before relying on it.
client = anthropic.Anthropic()

md = MarkItDown(
    llm_client=client,
    llm_model="claude-sonnet-4-20250514"
)

result = md.convert("diagram.png")
print(result.text_content)

# Output: Description of the image content
```

Batch Conversion

```python
from markitdown import MarkItDown
from pathlib import Path

def batch_convert(input_dir, output_dir):
    """Convert all Office files to Markdown."""
    md = MarkItDown()
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    # Create the output directory, including any missing parents
    output_path.mkdir(parents=True, exist_ok=True)

    extensions = ['.docx', '.xlsx', '.pptx', '.pdf']

    for ext in extensions:
        for file in input_path.glob(f'*{ext}'):
            try:
                result = md.convert(str(file))
                output_file = output_path / f"{file.stem}.md"

                # Write as UTF-8 so non-ASCII text survives on every platform
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)

                print(f"Converted: {file.name}")
            except Exception as e:
                print(f"Error converting {file.name}: {e}")

batch_convert('./documents', './markdown')
```
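
One thing to watch in the batch function above: output names are derived from the file stem, so `report.docx` and `report.pdf` would both be written to `report.md`, and the later one overwrites the earlier. A minimal sketch of one way to avoid that follows; `output_name` is a hypothetical helper, and it only disambiguates by extension, so it does not cover every possible collision.

```python
from pathlib import Path


def output_name(source, used):
    """Return '<stem>.md', or '<stem>-<ext>.md' when that name is already taken."""
    source = Path(source)
    candidate = f"{source.stem}.md"
    if candidate in used:
        # Fall back to a name that includes the original extension.
        candidate = f"{source.stem}-{source.suffix.lstrip('.').lower()}.md"
    used.add(candidate)
    return candidate


used = set()
print(output_name("documents/report.docx", used))  # report.md
print(output_name("documents/report.pdf", used))   # report-pdf.md
```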

Best Practices

Check Output Quality: Review converted Markdown for accuracy
Handle Tables: Complex tables may need manual adjustment
Preserve Structure: Use consistent heading levels in source docs
Image Handling: Consider using vision models for important images
Version Control: Store converted Markdown in Git for tracking
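
A quick structural summary can support the "Check Output Quality" step by showing at a glance whether headings, tables, and links survived conversion. The following is a sketch only; `summarize_markdown` is a hypothetical helper and its regular expressions cover common Markdown syntax, not every edge case.

```python
import re


def summarize_markdown(text):
    """Count a few structural features of converted Markdown."""
    lines = text.splitlines()
    return {
        "characters": len(text.strip()),
        "headings": sum(1 for line in lines if re.match(r"^#{1,6}\s", line)),
        "table_rows": sum(1 for line in lines if line.strip().startswith("|")),
        "links": len(re.findall(r"\[[^\]]+\]\([^)]+\)", text)),
    }


converted = """# Title

See [docs](https://example.com).

| a | b |
| - | - |
| 1 | 2 |
"""

print(summarize_markdown(converted))
```

A result with zero headings or zero characters for a document you know is structured is a signal to review that file manually.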

Common Patterns

Document Archive

```python
import os
from datetime import datetime
from markitdown import MarkItDown

def archive_document(doc_path, archive_dir):
    """Convert and archive Office document to Markdown."""
    md = MarkItDown()
    result = md.convert(doc_path)

    # Create archive structure
    os.makedirs(archive_dir, exist_ok=True)
    date_str = datetime.now().strftime('%Y-%m-%d')
    filename = os.path.basename(doc_path)
    base_name = os.path.splitext(filename)[0]

    # Save with metadata
    output_content = f"""---
source: {filename}
converted: {date_str}
---

{result.text_content}
"""

    output_path = os.path.join(archive_dir, f"{base_name}.md")
    # Write as UTF-8 so non-ASCII text survives on every platform
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(output_content)

    return output_path
```
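
Archived files written this way can be read back by separating the metadata block from the body. The sketch below handles only the simple `key: value` header that `archive_document` writes; it is not a full YAML parser, and `read_front_matter` is a hypothetical helper.

```python
def read_front_matter(document_text):
    """Split a document into (metadata dict, body) using a leading '---' block."""
    lines = document_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document_text
    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the body.
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return metadata, body
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()
    return {}, document_text  # no closing delimiter found


archived = """---
source: report.docx
converted: 2024-01-15
---

# Report
"""

meta, body = read_front_matter(archived)
print(meta["source"], "|", body)
```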

AI-Ready Corpus

```python
from markitdown import MarkItDown
from pathlib import Path
import json

def create_ai_corpus(doc_folder, output_file):
    """Convert documents to JSON corpus for AI training/RAG."""
    md = MarkItDown()
    corpus = []

    for doc in Path(doc_folder).glob('**/*'):
        # Compare lowercased suffixes so files such as REPORT.DOCX are included
        if doc.suffix.lower() in ['.docx', '.pdf', '.pptx', '.xlsx']:
            try:
                result = md.convert(str(doc))
                corpus.append({
                    'source': str(doc),
                    'filename': doc.name,
                    'content': result.text_content,
                    'type': doc.suffix[1:].lower()
                })
            except Exception as e:
                print(f"Skipped {doc.name}: {e}")

    # Write as UTF-8 and keep non-ASCII characters readable in the JSON
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(corpus, f, indent=2, ensure_ascii=False)

    print(f"Created corpus with {len(corpus)} documents")
    return corpus
```
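
For retrieval use cases, whole-document entries are often too large, so a common follow-up step is to split each `content` field into smaller chunks. The sketch below groups paragraphs up to a character budget; `chunk_text` is a hypothetical helper, and the 1000-character default is an assumption rather than a recommendation from markitdown.

```python
def chunk_text(content, max_chars=1000):
    """Group paragraphs into chunks of at most max_chars characters where possible."""
    chunks = []
    current = ""
    for paragraph in content.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either it fits, or this is a single oversized paragraph kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("aaaa\n\nbbbb\n\ncccc", max_chars=10))
```

A paragraph longer than `max_chars` is kept intact as its own chunk rather than being cut mid-sentence.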

Examples

Example 1: Convert Documentation Suite

```python
from markitdown import MarkItDown
from pathlib import Path

def convert_docs_to_wiki(docs_folder, wiki_folder):
    """Convert all Office docs to markdown wiki structure."""
    md = MarkItDown()
    docs_path = Path(docs_folder)
    wiki_path = Path(wiki_folder)

    # Create wiki structure
    wiki_path.mkdir(parents=True, exist_ok=True)

    # Create index
    index_content = "# Documentation Index\n\n"

    for doc in sorted(docs_path.glob('**/*.docx')):
        try:
            result = md.convert(str(doc))

            # Create relative path in wiki
            rel_path = doc.relative_to(docs_path)
            output_file = wiki_path / rel_path.with_suffix('.md')
            output_file.parent.mkdir(parents=True, exist_ok=True)

            # Write markdown as UTF-8
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(result.text_content)

            # Add to index, using forward slashes in links on every platform
            link = rel_path.with_suffix('.md').as_posix()
            index_content += f"- [{doc.stem}]({link})\n"

            print(f"Converted: {doc.name}")

        except Exception as e:
            print(f"Error: {doc.name} - {e}")

    # Write index
    with open(wiki_path / 'index.md', 'w', encoding='utf-8') as f:
        f.write(index_content)

convert_docs_to_wiki('./company_docs', './wiki')
```

Example 2: Meeting Notes Processor

```python
from markitdown import MarkItDown
from datetime import datetime

def process_meeting_notes(pptx_path):
    """Extract and structure meeting notes from PowerPoint."""
    md = MarkItDown()
    result = md.convert(pptx_path)

    # Parse the markdown
    content = result.text_content

    # Extract sections
    sections = {
        'attendees': [],
        'agenda': [],
        'decisions': [],
        'action_items': []
    }

    current_section = None

    for line in content.split('\n'):
        stripped = line.strip()
        line_lower = stripped.lower()

        # Bullet lines belong to the current section, even when they
        # happen to contain a section keyword such as "action"
        if stripped.startswith(('- ', '* ', '• ')):
            if current_section:
                sections[current_section].append(stripped[2:].strip())
        elif 'attendee' in line_lower or 'participant' in line_lower:
            current_section = 'attendees'
        elif 'agenda' in line_lower:
            current_section = 'agenda'
        elif 'decision' in line_lower:
            current_section = 'decisions'
        elif 'action' in line_lower:
            current_section = 'action_items'

    # Generate structured output
    output = f"""# Meeting Notes

**Date:** {datetime.now().strftime('%Y-%m-%d')}
**Source:** {pptx_path}

## Attendees

{chr(10).join('- ' + a for a in sections['attendees'])}

## Agenda

{chr(10).join('- ' + a for a in sections['agenda'])}

## Decisions Made

{chr(10).join('- ' + d for d in sections['decisions'])}

## Action Items

{chr(10).join('- [ ] ' + a for a in sections['action_items'])}
"""

    return output

notes = process_meeting_notes('team_meeting.pptx')
print(notes)
```

Example 3: Excel to Documentation

```python
from datetime import datetime
from markitdown import MarkItDown

def excel_to_data_dictionary(xlsx_path):
    """Convert Excel data model to data dictionary documentation."""
    md = MarkItDown()
    result = md.convert(xlsx_path)

    # Add documentation structure
    doc = f"""# Data Dictionary

Generated from: `{xlsx_path}`

{result.text_content}

## Usage Notes

- All tables are derived from the source Excel file
- Review data types and constraints before use
- Contact data team for clarifications

## Change Log

| Date | Change | Author |
|------|--------|--------|
| {datetime.now().strftime('%Y-%m-%d')} | Initial generation | Auto |
"""

    return doc

documentation = excel_to_data_dictionary('data_model.xlsx')
with open('data_dictionary.md', 'w', encoding='utf-8') as f:
    f.write(documentation)
```

Limitations

Complex formatting may be simplified
Images are not embedded (use vision model for descriptions)
Some table structures may not convert perfectly
Track changes in Word are not preserved
Comments may not be extracted
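
Since some table structures may not convert perfectly, a simple consistency check can point to rows worth reviewing by hand. The sketch below reports table rows whose cell count differs from the first row of the same table; `find_ragged_table_rows` is a hypothetical helper and assumes every table row starts with `|`.

```python
def find_ragged_table_rows(markdown_text):
    """Return (line_number, cell_count, expected) for rows whose width differs from the header."""
    problems = []
    expected = None
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # a non-table line ends the current table
            continue
        cells = len(stripped.strip("|").split("|"))
        if expected is None:
            expected = cells  # first row of a table sets the expected width
        elif cells != expected:
            problems.append((number, cells, expected))
    return problems


table = """| a | b |
| - | - |
| 1 | 2 | 3 |
"""

print(find_ragged_table_rows(table))
```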

Installation

```bash
pip install markitdown

# For image/audio processing
# (quotes keep shells such as zsh from expanding the square brackets)
pip install 'markitdown[all]'

# For specific features
# Extra names differ between markitdown releases; confirm them against
# the README of the version you install
pip install 'markitdown[images]'  # Image OCR
pip install 'markitdown[audio]'   # Audio transcription
```
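
Scripts that depend on the package can check for it up front and print an install hint instead of failing with an import error later. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("markitdown"):
    print("markitdown is not installed; run: pip install markitdown")
```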