office-to-md
Office to Markdown Skill
Overview
This skill converts a range of Office formats to Markdown using markitdown, Microsoft's open-source document-to-Markdown tool. It is well suited to making Office content searchable, version-controllable, and AI-friendly.
How to Use
- Provide the Office file (Word, Excel, PowerPoint, PDF, etc.)
- Optionally specify conversion options
- I'll convert it to clean Markdown
Example prompts:
- "Convert this Word document to Markdown"
- "Turn this PowerPoint into Markdown notes"
- "Extract content from this PDF as Markdown"
- "Convert this Excel file to Markdown tables"
Domain Knowledge
markitdown Fundamentals
```python
from markitdown import MarkItDown

# Initialize converter
md = MarkItDown()

# Convert file
result = md.convert("document.docx")
print(result.text_content)

# Save to file (UTF-8 so non-ASCII text survives on every platform)
with open("output.md", "w", encoding="utf-8") as f:
    f.write(result.text_content)
```
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| Word | .docx | Full text, tables, basic formatting |
| Excel | .xlsx | Converts to Markdown tables |
| PowerPoint | .pptx | Slides as sections |
| PDF | .pdf | Text extraction |
| HTML | .html | Clean markdown |
| Images | .jpg, .png | OCR with vision model |
| Audio | .mp3, .wav | Transcription |
| ZIP | .zip | Processes contained files |
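The formats listed here can double as a pre-flight check before a file is handed to the converter. The sketch below is only an illustration: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, not part of markitdown, and the set is copied from the extensions this document mentions rather than queried from the library, which may accept more.

```python
from pathlib import Path

# Hypothetical lookup set: extensions copied from this document,
# not discovered from markitdown itself.
SUPPORTED_EXTENSIONS = {
    ".docx", ".xlsx", ".pptx", ".pdf", ".html",
    ".jpg", ".png", ".mp3", ".wav", ".zip",
}

def is_supported(path: str) -> bool:
    """Return True if the file extension is in the set (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("Report.DOCX"))
print(is_supported("notes.txt"))
```

Comparing on the lowercased suffix keeps files such as `Report.DOCX` from being skipped.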
Basic Usage
Python API
```python
from markitdown import MarkItDown

# Simple conversion
md = MarkItDown()
result = md.convert("document.docx")

# Access content
markdown_text = result.text_content

# With options
md = MarkItDown(
    llm_client=None,  # Optional LLM for enhanced processing
    llm_model=None    # Model name if using LLM
)
```
Command Line
```bash
# Install
pip install markitdown

# Convert file
markitdown document.docx > output.md

# Or with output file
markitdown document.docx -o output.md
```
Word Document Conversion
```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert Word document
result = md.convert("report.docx")

# Output preserves:
# - Headings (as # headers)
# - Bold/italic formatting
# - Lists (bulleted and numbered)
# - Tables (as markdown tables)
# - Hyperlinks
print(result.text_content)
```

**Example Output:**
```markdown
# Annual Report 2024

## Executive Summary
This report summarizes the key achievements and challenges...

## Key Metrics
| Metric | 2023 | 2024 | Change |
|---|---|---|---|
| Revenue | $10M | $12M | +20% |
| Users | 50K | 75K | +50% |

## Detailed Analysis
The following sections provide...
```
Excel Conversion
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.xlsx")

# Each sheet becomes a section
# Data becomes markdown tables
print(result.text_content)
```

**Example Output:**
```markdown
## Sheet1
| Name | Department | Salary |
|---|---|---|
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |

## Sheet2
| Product | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Widget A | 100 | 120 | 150 | 180 |
```
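Because each sheet arrives as a pipe-delimited table, the converted text can be read back into Python structures for checks or further processing. The following is a minimal sketch, assuming simple tables like the ones in the example output; `parse_markdown_table` is a hypothetical helper, not a markitdown function, and it does not handle escaped pipes or empty cells at the row edges.

```python
def parse_markdown_table(table_text: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in table_text.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows

sample = """
| Name | Department | Salary |
|---|---|---|
| John | Engineering | $80,000 |
| Jane | Marketing | $75,000 |
"""
rows = parse_markdown_table(sample)
print(rows[0]["Department"])  # Engineering
```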
PowerPoint Conversion
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("presentation.pptx")

# Each slide becomes a section
# Speaker notes included if present
print(result.text_content)
```

**Example Output:**
```markdown
## Slide 1: Company Overview
Our mission is to...

### Key Points
- Innovation first
- Customer focused
- Global reach

## Slide 2: Market Analysis
The market opportunity is significant...

Notes: Mention the competitor analysis here
```
PDF Conversion
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")

# Extracts text content
# Tables converted where detected
print(result.text_content)
```
Image Conversion (with Vision Model)
```python
from markitdown import MarkItDown
import anthropic

# Use Claude for image description
# Caution: markitdown calls llm_client through the OpenAI-style
# chat.completions interface, so confirm the client you pass exposes it.
client = anthropic.Anthropic()
md = MarkItDown(
    llm_client=client,
    llm_model="claude-sonnet-4-20250514"
)
result = md.convert("diagram.png")
print(result.text_content)
# Output: Description of the image content
```
Batch Conversion
```python
from markitdown import MarkItDown
from pathlib import Path

def batch_convert(input_dir, output_dir):
    """Convert all Office files to Markdown."""
    md = MarkItDown()
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    # parents=True so a nested output directory is created as well
    output_path.mkdir(parents=True, exist_ok=True)

    extensions = ['.docx', '.xlsx', '.pptx', '.pdf']
    for ext in extensions:
        for file in input_path.glob(f'*{ext}'):
            try:
                result = md.convert(str(file))
                output_file = output_path / f"{file.stem}.md"
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(result.text_content)
                print(f"Converted: {file.name}")
            except Exception as e:
                # Report the failure and continue with the remaining files
                print(f"Error converting {file.name}: {e}")

batch_convert('./documents', './markdown')
```
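One caveat when naming outputs by `file.stem`: `report.docx` and `report.pdf` in the same folder both map to `report.md`, so the later conversion overwrites the earlier one. A minimal sketch of one way around this, using a hypothetical `output_name` helper (not part of markitdown) that keeps the source extension in the output name:

```python
from pathlib import Path

def output_name(source: Path) -> str:
    """Return a Markdown filename that keeps the source extension.

    Hypothetical helper for illustration: report.docx -> report.docx.md
    """
    return f"{source.name}.md"

names = [output_name(Path(p)) for p in ("docs/report.docx", "docs/report.pdf")]
print(names)  # ['report.docx.md', 'report.pdf.md']
```

The trade-off is slightly longer filenames in exchange for a one-to-one mapping between sources and outputs.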
Best Practices
- Check Output Quality: Review converted Markdown for accuracy
- Handle Tables: Complex tables may need manual adjustment
- Preserve Structure: Use consistent heading levels in source docs
- Image Handling: Consider using vision models for important images
- Version Control: Store converted Markdown in Git for tracking
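The first two practices can be partly automated. The sketch below flags table rows whose column count differs from the first row of the same table, which is a common symptom of a complex table that did not convert cleanly. `find_ragged_tables` is a hypothetical helper, not part of markitdown, and it counts every `|` character, so escaped pipes inside cells would produce false positives.

```python
def find_ragged_tables(markdown_text: str) -> list[int]:
    """Return 1-based line numbers of table rows whose column count
    differs from the first row of the table they belong to."""
    problems = []
    expected = None  # column count of the current table, None outside a table
    for lineno, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            columns = stripped.count("|") - 1
            if expected is None:
                expected = columns  # first row of a new table sets the width
            elif columns != expected:
                problems.append(lineno)
        else:
            expected = None  # any non-table line ends the current table
    return problems

sample = "| A | B |\n|---|---|\n| 1 | 2 |\n| 3 |\n\nText"
print(find_ragged_tables(sample))  # [4]
```

Rows reported here are candidates for the manual adjustment mentioned above, not proof of an error.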
Common Patterns
Document Archive
```python
import os
from datetime import datetime
from markitdown import MarkItDown

def archive_document(doc_path, archive_dir):
    """Convert and archive Office document to Markdown."""
    md = MarkItDown()
    result = md.convert(doc_path)

    # Create archive structure
    os.makedirs(archive_dir, exist_ok=True)
    date_str = datetime.now().strftime('%Y-%m-%d')
    filename = os.path.basename(doc_path)
    base_name = os.path.splitext(filename)[0]

    # Save with metadata (YAML-style front matter)
    output_content = f"""---
source: {filename}
converted: {date_str}
---
{result.text_content}
"""
    output_path = os.path.join(archive_dir, f"{base_name}.md")
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(output_content)
    return output_path
```
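Archived files carry their `source` and `converted` fields in a front-matter header delimited by `---` lines, so they can be read back without extra dependencies. This is a minimal sketch assuming exactly that `key: value` layout; `read_front_matter` is a hypothetical helper, and a real YAML parser would be the safer choice for richer metadata.

```python
def read_front_matter(markdown_text: str) -> tuple[dict[str, str], str]:
    """Split an archived file into (metadata, body).

    Returns an empty dict and the original text if no front matter is found.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown_text

    metadata = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter reached
            return metadata, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()

    # No closing delimiter: treat the whole text as body.
    return {}, markdown_text

archived = "---\nsource: report.docx\nconverted: 2024-01-15\n---\n# Annual Report\n"
meta, body = read_front_matter(archived)
print(meta["source"])  # report.docx
```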
AI-Ready Corpus
```python
from markitdown import MarkItDown
from pathlib import Path
import json

def create_ai_corpus(doc_folder, output_file):
    """Convert documents to JSON corpus for AI training/RAG."""
    md = MarkItDown()
    corpus = []

    for doc in Path(doc_folder).glob('**/*'):
        # Lowercase the suffix so files like REPORT.DOCX are not skipped
        if doc.suffix.lower() in ['.docx', '.pdf', '.pptx', '.xlsx']:
            try:
                result = md.convert(str(doc))
                corpus.append({
                    'source': str(doc),
                    'filename': doc.name,
                    'content': result.text_content,
                    'type': doc.suffix[1:].lower()
                })
            except Exception as e:
                print(f"Skipped {doc.name}: {e}")

    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(corpus, f, indent=2)

    print(f"Created corpus with {len(corpus)} documents")
    return corpus
```
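Once the corpus exists, a quick breakdown by document type helps confirm that nothing was silently skipped. The sketch below assumes entries shaped like those built by `create_ai_corpus` (a `type` key holding the extension without the dot); `summarize_corpus` and the sample entries are hypothetical and only for illustration.

```python
from collections import Counter

def summarize_corpus(corpus: list[dict]) -> dict[str, int]:
    """Count corpus entries per document type."""
    return dict(Counter(entry["type"] for entry in corpus))

# Illustrative entries only; real content comes from the conversion step.
corpus = [
    {"source": "docs/report.docx", "filename": "report.docx", "content": "# Report", "type": "docx"},
    {"source": "docs/plan.docx", "filename": "plan.docx", "content": "# Plan", "type": "docx"},
    {"source": "docs/data.xlsx", "filename": "data.xlsx", "content": "## Sheet1", "type": "xlsx"},
]
print(summarize_corpus(corpus))  # {'docx': 2, 'xlsx': 1}
```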
Examples
Example 1: Convert Documentation Suite
```python
from markitdown import MarkItDown
from pathlib import Path

def convert_docs_to_wiki(docs_folder, wiki_folder):
    """Convert all Office docs to markdown wiki structure."""
    md = MarkItDown()
    docs_path = Path(docs_folder)
    wiki_path = Path(wiki_folder)

    # Create wiki structure
    wiki_path.mkdir(parents=True, exist_ok=True)

    # Create index
    index_content = "# Documentation Index\n\n"

    for doc in sorted(docs_path.glob('**/*.docx')):
        try:
            result = md.convert(str(doc))

            # Create relative path in wiki
            rel_path = doc.relative_to(docs_path)
            output_file = wiki_path / rel_path.with_suffix('.md')
            output_file.parent.mkdir(parents=True, exist_ok=True)

            # Write markdown
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(result.text_content)

            # Add to index (forward slashes so links work on every platform)
            link = rel_path.with_suffix('.md').as_posix()
            index_content += f"- [{doc.stem}]({link})\n"
            print(f"Converted: {doc.name}")
        except Exception as e:
            print(f"Error: {doc.name} - {e}")

    # Write index
    with open(wiki_path / 'index.md', 'w', encoding='utf-8') as f:
        f.write(index_content)

convert_docs_to_wiki('./company_docs', './wiki')
```
Example 2: Meeting Notes Processor
```python
from markitdown import MarkItDown
from datetime import datetime

def process_meeting_notes(pptx_path):
    """Extract and structure meeting notes from PowerPoint."""
    md = MarkItDown()
    result = md.convert(pptx_path)

    # Parse the markdown
    content = result.text_content

    # Extract sections
    sections = {
        'attendees': [],
        'agenda': [],
        'decisions': [],
        'action_items': []
    }

    current_section = None
    for line in content.split('\n'):
        line_lower = line.lower()
        # A line containing a keyword switches the active section;
        # bullet lines that follow are collected under it.
        if 'attendee' in line_lower or 'participant' in line_lower:
            current_section = 'attendees'
        elif 'agenda' in line_lower:
            current_section = 'agenda'
        elif 'decision' in line_lower:
            current_section = 'decisions'
        elif 'action' in line_lower:
            current_section = 'action_items'
        elif line.strip().startswith(('-', '*', '•')) and current_section:
            sections[current_section].append(line.strip()[1:].strip())

    # Generate structured output
    output = f"""# Meeting Notes

**Date:** {datetime.now().strftime('%Y-%m-%d')}
**Source:** {pptx_path}

## Attendees
{chr(10).join('- ' + a for a in sections['attendees'])}

## Agenda
{chr(10).join('- ' + a for a in sections['agenda'])}

## Decisions Made
{chr(10).join('- ' + d for d in sections['decisions'])}

## Action Items
{chr(10).join('- [ ] ' + a for a in sections['action_items'])}
"""
    return output

notes = process_meeting_notes('team_meeting.pptx')
print(notes)
```
Example 3: Excel to Documentation
```python
from datetime import datetime
from markitdown import MarkItDown

def excel_to_data_dictionary(xlsx_path):
    """Convert Excel data model to data dictionary documentation."""
    md = MarkItDown()
    result = md.convert(xlsx_path)

    # Add documentation structure
    doc = f"""# Data Dictionary

Generated from: `{xlsx_path}`

{result.text_content}

## Usage Notes
- All tables are derived from the source Excel file
- Review data types and constraints before use
- Contact data team for clarifications

## Change Log
| Date | Change | Author |
|---|---|---|
| {datetime.now().strftime('%Y-%m-%d')} | Initial generation | Auto |
"""
    return doc

documentation = excel_to_data_dictionary('data_model.xlsx')
with open('data_dictionary.md', 'w', encoding='utf-8') as f:
    f.write(documentation)
```
Limitations
- Complex formatting may be simplified
- Images are not embedded (use vision model for descriptions)
- Some table structures may not convert perfectly
- Track changes in Word are not preserved
- Comments may not be extracted
Installation
```bash
pip install markitdown

# For image/audio processing
# (quote the extras so shells such as zsh do not expand the brackets)
pip install 'markitdown[all]'

# For specific features
# (extra names can differ between markitdown releases; check the project README)
pip install 'markitdown[images]'  # Image OCR
pip install 'markitdown[audio]'   # Audio transcription
```