Web Reader Skill

This skill guides the implementation of web page reading and content extraction functionality using the z-ai-web-dev-sdk package, enabling applications to fetch and process web page content programmatically.

Skills Path

Skill Location: {project_path}/skills/web-reader
This skill is located at the above path in your project.
Reference Scripts: Example test scripts are available in the {Skill Location}/scripts/ directory for quick testing and reference. See {Skill Location}/scripts/web-reader.ts for a working example.

Overview

Web Reader allows you to build applications that can extract content from web pages, retrieve article metadata, and process HTML content. The API automatically handles content extraction, providing clean, structured data from any web URL.
IMPORTANT: z-ai-web-dev-sdk MUST be used in backend code only. Never use it in client-side code.

Prerequisites

The z-ai-web-dev-sdk package is already installed. Import it as shown in the examples below.

CLI Usage (For Simple Tasks)

For simple web page content extraction, you can use the z-ai CLI instead of writing code. This is ideal for quick content scraping, testing URLs, or simple automation tasks.

Basic Page Reading

bash
# Extract content from a web page
z-ai function --name "page_reader" --args '{"url": "https://example.com"}'

# Using short options
z-ai function -n page_reader -a '{"url": "https://www.example.com/article"}'

Save Page Content

bash
# Save extracted content to a JSON file
z-ai function \
  -n page_reader \
  -a '{"url": "https://news.example.com/article"}' \
  -o page_content.json

# Extract and save a blog post
z-ai function \
  -n page_reader \
  -a '{"url": "https://blog.example.com/post/123"}' \
  -o blog_post.json

Common Use Cases

bash
# Extract news article
z-ai function \
  -n page_reader \
  -a '{"url": "https://news.site.com/breaking-news"}' \
  -o news.json

# Read documentation page
z-ai function \
  -n page_reader \
  -a '{"url": "https://docs.example.com/getting-started"}' \
  -o docs.json

# Scrape blog content
z-ai function \
  -n page_reader \
  -a '{"url": "https://techblog.com/ai-trends-2024"}' \
  -o blog.json

# Extract research article
z-ai function \
  -n page_reader \
  -a '{"url": "https://research.org/papers/quantum-computing"}' \
  -o research.json

CLI Parameters

  • --name, -n: Required - Function name (use "page_reader")
  • --args, -a: Required - JSON arguments object with:
    • url (string, required): The URL of the web page to read
  • --output, -o <path>: Optional - Output file path (JSON format)

Response Structure

The CLI returns a JSON object containing:
  • title: Page title
  • html: Main content HTML
  • text: Plain text content
  • publish_time: Publication timestamp (if available)
  • url: Original URL
  • metadata: Additional page metadata

Example Response

json
{
  "title": "Introduction to Machine Learning",
  "html": "<article><h1>Introduction to Machine Learning</h1><p>Machine learning is...</p></article>",
  "text": "Introduction to Machine Learning\n\nMachine learning is...",
  "publish_time": "2024-01-15T10:30:00Z",
  "url": "https://example.com/ml-intro",
  "metadata": {
    "author": "John Doe",
    "description": "A comprehensive guide to ML"
  }
}
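
A file saved with -o can be post-processed with a few lines of JavaScript. The sketch below assumes the saved JSON uses the field names from the example response above (title, text, publish_time). summarizePage is a hypothetical helper name and is not part of the CLI or the SDK.

```javascript
// Hypothetical helper: summarize a page_reader JSON document.
// Assumes the field names shown in the example response (title, text, publish_time).
function summarizePage(jsonText) {
  const page = JSON.parse(jsonText);
  const text = typeof page.text === 'string' ? page.text : '';
  const words = text.split(/\s+/).filter((w) => w.length > 0);

  return {
    title: page.title ?? '(untitled)',
    wordCount: words.length,
    // publish_time is optional, so fall back to null when it is absent
    publishedAt: page.publish_time ?? null,
  };
}

// Usage with an inline sample; in practice the string could come from
// fs.readFileSync('page_content.json', 'utf8').
const sample = JSON.stringify({
  title: 'Introduction to Machine Learning',
  text: 'Introduction to Machine Learning\n\nMachine learning is...',
  publish_time: '2024-01-15T10:30:00Z',
});
console.log(summarizePage(sample));
```

Counting words from the text field avoids stripping tags from html, at the cost of depending on a field that may not be present in every response.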

Processing Multiple URLs

bash
# Create a simple script to process multiple URLs
for url in \
  "https://site1.com/article1" \
  "https://site2.com/article2" \
  "https://site3.com/article3"
do
  # Derive a stable file name from the URL
  filename=$(echo "$url" | md5sum | cut -d' ' -f1)
  z-ai function -n page_reader -a "{\"url\": \"$url\"}" -o "${filename}.json"
done
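
Interpolating a URL straight into a JSON string, as the loop above does, breaks when the URL contains a double quote or a backslash. One way around this is to generate the command from JavaScript and let JSON.stringify do the escaping. This is only a sketch: buildPageReaderCommand and shellQuote are hypothetical helpers, and the quoting assumes a POSIX shell.

```javascript
// Quote a string for a POSIX shell by wrapping it in single quotes and
// escaping any embedded single quote as '\''.
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// Hypothetical helper: build a z-ai command line using the -n, -a, and -o
// options described in the CLI Parameters section.
function buildPageReaderCommand(url, outputPath) {
  const args = JSON.stringify({ url });
  const parts = ['z-ai', 'function', '-n', 'page_reader', '-a', shellQuote(args)];
  if (outputPath) {
    parts.push('-o', shellQuote(outputPath));
  }
  return parts.join(' ');
}

console.log(buildPageReaderCommand('https://site1.com/article1', 'article1.json'));
```

The resulting string could be written to a script file or passed to a process runner; running it is left out here so the sketch stays free of side effects.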

When to Use CLI vs SDK

Use CLI for:
  • Quick content extraction
  • Testing URL accessibility
  • Simple web scraping tasks
  • One-off content retrieval
Use SDK for:
  • Batch URL processing with custom logic
  • Integration with web applications
  • Complex content processing pipelines
  • Production applications with error handling

How It Works

The Web Reader uses the page_reader function to:
  1. Fetch the web page content
  2. Extract main article content and metadata
  3. Parse and clean the HTML
  4. Return structured data including title, content, and publication time

Basic Web Reading Implementation

Simple Page Reading

javascript
import ZAI from 'z-ai-web-dev-sdk';

async function readWebPage(url) {
  try {
    const zai = await ZAI.create();

    const result = await zai.functions.invoke('page_reader', {
      url: url
    });

    console.log('Title:', result.data.title);
    console.log('URL:', result.data.url);
    console.log('Published:', result.data.publishedTime);
    console.log('HTML Content:', result.data.html);
    console.log('Tokens Used:', result.data.usage.tokens);

    return result.data;
  } catch (error) {
    console.error('Page reading failed:', error.message);
    throw error;
  }
}

// Usage
const pageData = await readWebPage('https://example.com/article');
console.log('Page title:', pageData.title);

Extract Article Text Only

javascript
import ZAI from 'z-ai-web-dev-sdk';

async function extractArticleText(url) {
  const zai = await ZAI.create();

  const result = await zai.functions.invoke('page_reader', {
    url: url
  });

  // Convert HTML to plain text (basic approach)
  const plainText = result.data.html
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  return {
    title: result.data.title,
    text: plainText,
    url: result.data.url,
    publishedTime: result.data.publishedTime
  };
}

// Usage
const article = await extractArticleText('https://news.example.com/story');
console.log(article.title);
console.log(article.text.substring(0, 200) + '...');
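
The basic tag-stripping approach above leaves HTML entities such as &amp; in the output and keeps the contents of script and style elements. A slightly more careful conversion might look like the sketch below. htmlToText and decodeEntities are hypothetical helpers, and the entity table is deliberately small; for production use a real HTML parser is likely a better fit.

```javascript
// Minimal entity table; real pages may use many more named entities.
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"], ['nbsp', ' '],
]);

// Decode numeric entities and the few named ones listed above.
// Unknown entities are left unchanged.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return Number.isFinite(code) && code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    const replacement = NAMED_ENTITIES.get(body.toLowerCase());
    return replacement === undefined ? match : replacement;
  });
}

// Drop script/style blocks, strip tags, decode entities, collapse whitespace.
function htmlToText(html) {
  const stripped = html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]*>/g, ' ');
  return decodeEntities(stripped).replace(/\s+/g, ' ').trim();
}

console.log(htmlToText('<p>Fish &amp; chips</p><script>var x = 1;</script><p>&#169; 2024</p>'));
```

Entities are decoded only after the tags are removed, so an escaped sequence such as &lt;b&gt; ends up as literal text instead of being mistaken for markup.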

Read Multiple Pages

javascript
import ZAI from 'z-ai-web-dev-sdk';

async function readMultiplePages(urls) {
  const zai = await ZAI.create();
  const results = [];

  for (const url of urls) {
    try {
      const result = await zai.functions.invoke('page_reader', {
        url: url
      });

      results.push({
        url: url,
        success: true,
        data: result.data
      });
    } catch (error) {
      results.push({
        url: url,
        success: false,
        error: error.message
      });
    }
  }

  return results;
}

// Usage
const urls = [
  'https://example.com/article1',
  'https://example.com/article2',
  'https://example.com/article3'
];

const pages = await readMultiplePages(urls);
pages.forEach(page => {
  if (page.success) {
    console.log(`${page.data.title}`);
  } else {
    console.log(`${page.url}: ${page.error}`);
  }
});
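
Because readMultiplePages records failures instead of throwing, callers usually need to separate the two outcomes afterwards. The sketch below shows one way to do that, assuming the result entries have the { url, success, data | error } shape built above. partitionResults is a hypothetical helper.

```javascript
// Split the array returned by readMultiplePages into successes and failures,
// and collect the URLs that could be retried.
function partitionResults(results) {
  const succeeded = [];
  const failed = [];
  for (const entry of results) {
    (entry.success ? succeeded : failed).push(entry);
  }
  return { succeeded, failed, retryUrls: failed.map((entry) => entry.url) };
}

// Usage with hand-written sample data in the same shape
const sampleResults = [
  { url: 'https://example.com/article1', success: true, data: { title: 'First' } },
  { url: 'https://example.com/article2', success: false, error: 'timeout' },
];
const { succeeded, failed, retryUrls } = partitionResults(sampleResults);
console.log(`${succeeded.length} ok, ${failed.length} failed; retry:`, retryUrls);
```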

Advanced Use Cases

Web Content Analyzer

javascript
import ZAI from 'z-ai-web-dev-sdk';

class WebContentAnalyzer {
  constructor() {
    this.cache = new Map();
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  async readPage(url, useCache = true) {
    // Check cache
    if (useCache && this.cache.has(url)) {
      console.log('Returning cached result for:', url);
      return this.cache.get(url);
    }

    // Fetch fresh content
    const result = await this.zai.functions.invoke('page_reader', {
      url: url
    });

    // Cache the result
    if (useCache) {
      this.cache.set(url, result.data);
    }

    return result.data;
  }

  async getPageMetadata(url) {
    const data = await this.readPage(url);

    return {
      title: data.title,
      url: data.url,
      publishedTime: data.publishedTime,
      contentLength: data.html.length,
      wordCount: this.estimateWordCount(data.html)
    };
  }

  estimateWordCount(html) {
    const text = html.replace(/<[^>]*>/g, ' ');
    const words = text.split(/\s+/).filter(word => word.length > 0);
    return words.length;
  }

  async comparePages(url1, url2) {
    const [page1, page2] = await Promise.all([
      this.readPage(url1),
      this.readPage(url2)
    ]);

    return {
      page1: {
        title: page1.title,
        wordCount: this.estimateWordCount(page1.html),
        published: page1.publishedTime
      },
      page2: {
        title: page2.title,
        wordCount: this.estimateWordCount(page2.html),
        published: page2.publishedTime
      }
    };
  }

  clearCache() {
    this.cache.clear();
  }
}

// Usage
const analyzer = new WebContentAnalyzer();
await analyzer.initialize();

const metadata = await analyzer.getPageMetadata('https://example.com/article');
console.log('Article Metadata:', metadata);

const comparison = await analyzer.comparePages(
  'https://example.com/article1',
  'https://example.com/article2'
);
console.log('Comparison:', comparison);

RSS Feed Reader

javascript
import ZAI from 'z-ai-web-dev-sdk';

class FeedReader {
  constructor() {
    this.articles = [];
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  async fetchArticlesFromUrls(urls) {
    const articles = [];

    for (const url of urls) {
      try {
        const result = await this.zai.functions.invoke('page_reader', {
          url: url
        });

        articles.push({
          title: result.data.title,
          url: result.data.url,
          publishedTime: result.data.publishedTime,
          content: result.data.html,
          fetchedAt: new Date().toISOString()
        });

        console.log(`Fetched: ${result.data.title}`);
      } catch (error) {
        console.error(`Failed to fetch ${url}:`, error.message);
      }
    }

    this.articles = articles;
    return articles;
  }

  getRecentArticles(limit = 10) {
    // Sort a copy so the stored article order is not mutated
    return [...this.articles]
      .sort((a, b) => {
        const dateA = new Date(a.publishedTime || a.fetchedAt);
        const dateB = new Date(b.publishedTime || b.fetchedAt);
        return dateB - dateA;
      })
      .slice(0, limit);
  }

  searchArticles(keyword) {
    return this.articles.filter(article => {
      const searchText = `${article.title} ${article.content}`.toLowerCase();
      return searchText.includes(keyword.toLowerCase());
    });
  }
}

// Usage
const reader = new FeedReader();
await reader.initialize();

const feedUrls = [
  'https://example.com/article1',
  'https://example.com/article2',
  'https://example.com/article3'
];

await reader.fetchArticlesFromUrls(feedUrls);
const recent = reader.getRecentArticles(5);
console.log('Recent articles:', recent.map(a => a.title));
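
Sorting by publishedTime relies on every value being a parseable date. If a page returns a malformed timestamp, new Date(...) yields an invalid date and the comparison result becomes NaN, which makes the order unpredictable. The sketch below is one defensive variant, assuming articles have the publishedTime and fetchedAt fields used above. toTimestamp and sortByRecency are hypothetical helpers.

```javascript
// Resolve an article to a numeric timestamp, preferring publishedTime,
// then fetchedAt, and finally 0 when neither parses.
function toTimestamp(article) {
  const published = Date.parse(article.publishedTime ?? '');
  if (!Number.isNaN(published)) return published;
  const fetched = Date.parse(article.fetchedAt ?? '');
  return Number.isNaN(fetched) ? 0 : fetched;
}

// Newest first; copies the array so the caller's data is left untouched.
function sortByRecency(articles) {
  return [...articles].sort((a, b) => toTimestamp(b) - toTimestamp(a));
}

const sampleArticles = [
  { title: 'A', publishedTime: '2024-01-01T00:00:00Z' },
  { title: 'B', publishedTime: 'not a date', fetchedAt: '2024-06-01T00:00:00Z' },
  { title: 'C', publishedTime: '2024-03-01T00:00:00Z' },
];
console.log(sortByRecency(sampleArticles).map((a) => a.title));
```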

Content Aggregator

javascript
import ZAI from 'z-ai-web-dev-sdk';

async function aggregateContent(urls, options = {}) {
  const zai = await ZAI.create();
  const aggregated = {
    sources: [],
    totalWords: 0,
    aggregatedAt: new Date().toISOString()
  };

  for (const url of urls) {
    try {
      const result = await zai.functions.invoke('page_reader', {
        url: url
      });

      const text = result.data.html.replace(/<[^>]*>/g, ' ');
      const wordCount = text.split(/\s+/).filter(w => w.length > 0).length;

      aggregated.sources.push({
        title: result.data.title,
        url: result.data.url,
        publishedTime: result.data.publishedTime,
        wordCount: wordCount,
        excerpt: text.substring(0, 200).trim() + '...'
      });

      aggregated.totalWords += wordCount;

      if (options.delay) {
        await new Promise(resolve => setTimeout(resolve, options.delay));
      }
    } catch (error) {
      console.error(`Failed to fetch ${url}:`, error.message);
    }
  }

  return aggregated;
}

// Usage
const sources = [
  'https://example.com/news1',
  'https://example.com/news2',
  'https://example.com/news3'
];

const aggregated = await aggregateContent(sources, { delay: 1000 });
console.log(`Aggregated ${aggregated.sources.length} sources`);
console.log(`Total words: ${aggregated.totalWords}`);

Web Scraping Pipeline

javascript
import ZAI from 'z-ai-web-dev-sdk';

class ScrapingPipeline {
  constructor() {
    this.processors = [];
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  addProcessor(name, processorFn) {
    this.processors.push({ name, fn: processorFn });
  }

  async scrape(url) {
    // Fetch the page
    const result = await this.zai.functions.invoke('page_reader', {
      url: url
    });

    let data = {
      raw: result.data,
      processed: {}
    };

    // Run through processors
    for (const processor of this.processors) {
      try {
        data.processed[processor.name] = await processor.fn(data.raw);
        console.log(`✓ Processed with ${processor.name}`);
      } catch (error) {
        console.error(`✗ Failed ${processor.name}:`, error.message);
        data.processed[processor.name] = null;
      }
    }

    return data;
  }
}

// Processor functions
function extractLinks(pageData) {
  const linkRegex = /href=["'](https?:\/\/[^"']+)["']/g;
  const links = [];
  let match;

  while ((match = linkRegex.exec(pageData.html)) !== null) {
    links.push(match[1]);
  }

  return [...new Set(links)]; // Remove duplicates
}

function extractImages(pageData) {
  const imgRegex = /src=["'](https?:\/\/[^"']+\.(jpg|jpeg|png|gif|webp))["']/gi;
  const images = [];
  let match;

  while ((match = imgRegex.exec(pageData.html)) !== null) {
    images.push(match[1]);
  }

  return [...new Set(images)];
}

function extractPlainText(pageData) {
  return pageData.html
    .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
    .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

// Usage
const pipeline = new ScrapingPipeline();
await pipeline.initialize();

pipeline.addProcessor('links', extractLinks);
pipeline.addProcessor('images', extractImages);
pipeline.addProcessor('plainText', extractPlainText);

const result = await pipeline.scrape('https://example.com/article');
console.log('Links found:', result.processed.links.length);
console.log('Images found:', result.processed.images.length);
console.log('Text length:', result.processed.plainText.length);
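
The extractLinks processor above only matches absolute http(s) URLs, so relative links such as /about are skipped. A processor that resolves them against the page URL could look like the sketch below. It assumes pageData.url holds the absolute URL of the page, as in the response format, and uses the standard URL constructor. extractAllLinks is a hypothetical name.

```javascript
// Collect every href, resolve it against the page URL, and keep only
// http(s) results. Duplicates are removed by the Set.
function extractAllLinks(pageData) {
  const hrefRegex = /href=["']([^"']+)["']/gi;
  const links = new Set();
  let match;

  while ((match = hrefRegex.exec(pageData.html)) !== null) {
    try {
      const resolved = new URL(match[1], pageData.url);
      if (resolved.protocol === 'http:' || resolved.protocol === 'https:') {
        links.add(resolved.href);
      }
    } catch {
      // Skip hrefs that cannot be parsed as URLs
    }
  }
  return [...links];
}

const samplePage = {
  url: 'https://example.com/blog/post',
  html: '<a href="/about">About</a> <a href="next">Next</a> <a href="mailto:a@example.com">Mail</a>',
};
console.log(extractAllLinks(samplePage));
```

It has the same (pageData) signature as the other processors, so it could be registered with pipeline.addProcessor('allLinks', extractAllLinks).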

Response Format

Successful Response

typescript
{
  code: 200,
  status: 200,
  data: {
    title: "Article Title",
    url: "https://example.com/article",
    html: "<div>Article content...</div>",
    publishedTime: "2025-01-15T10:30:00Z",
    usage: {
      tokens: 1500
    }
  },
  meta: {
    usage: {
      tokens: 1500
    }
  }
}

Response Fields

Field               Type    Description
code                number  Response status code
status              number  HTTP status code
data.title          string  Page title
data.url            string  Page URL
data.html           string  Extracted HTML content
data.publishedTime  string  Publication date (optional)
data.usage.tokens   number  Tokens used for processing
meta.usage.tokens   number  Total tokens used
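
Code that consumes the response can check these fields before relying on them. The sketch below validates only the fields listed above and treats publishedTime as optional. validatePageResponse is a hypothetical helper, and the checks reflect this document's field list, not a published schema.

```javascript
// Returns a list of problems found in a page_reader response; an empty list
// means the response matches the documented fields that were checked.
function validatePageResponse(response) {
  if (response === null || typeof response !== 'object') {
    return ['response is not an object'];
  }
  const problems = [];
  if (typeof response.code !== 'number') problems.push('code must be a number');

  const data = response.data;
  if (data === null || typeof data !== 'object') {
    problems.push('data must be an object');
    return problems;
  }
  for (const field of ['title', 'url', 'html']) {
    if (typeof data[field] !== 'string') problems.push(`data.${field} must be a string`);
  }
  // publishedTime is optional, so only check it when present
  if (data.publishedTime != null && typeof data.publishedTime !== 'string') {
    problems.push('data.publishedTime must be a string when present');
  }
  return problems;
}

const sampleResponse = {
  code: 200,
  status: 200,
  data: { title: 'Article Title', url: 'https://example.com/article', html: '<div>Article content...</div>' },
};
console.log(validatePageResponse(sampleResponse));
```

Returning a list of problems, instead of throwing on the first one, makes it easier to log everything that is wrong with a response in one pass.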

Best Practices

1. Error Handling

javascript
import ZAI from 'z-ai-web-dev-sdk';

async function safeReadPage(url) {
  try {
    // Validate the URL before creating a client
    if (!url || !/^https?:\/\//i.test(url)) {
      throw new Error('Invalid URL format');
    }

    const zai = await ZAI.create();

    const result = await zai.functions.invoke('page_reader', {
      url: url
    });

    // Check response status
    if (result.code !== 200) {
      throw new Error(`Failed to fetch page: ${result.code}`);
    }

    // Verify essential data
    if (!result.data.html || !result.data.title) {
      throw new Error('Incomplete page data received');
    }

    return {
      success: true,
      data: result.data
    };
  } catch (error) {
    console.error('Page reading error:', error);
    return {
      success: false,
      error: error.message
    };
  }
}
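
The prefix check in safeReadPage is deliberately cheap and does not confirm that the string actually parses as a URL. Where stricter validation is wanted, the standard URL constructor can do the parsing. The sketch below is one such check; isHttpUrl is a hypothetical helper name.

```javascript
// True only for strings that parse as absolute http or https URLs.
function isHttpUrl(value) {
  if (typeof value !== 'string') return false;
  try {
    const parsed = new URL(value);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch {
    // new URL throws a TypeError for strings it cannot parse
    return false;
  }
}

console.log(isHttpUrl('https://example.com/article'));
console.log(isHttpUrl('httpfoo'));
console.log(isHttpUrl('ftp://example.com/file'));
```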

2. Rate Limiting

javascript
import ZAI from 'z-ai-web-dev-sdk';

class RateLimitedReader {
  constructor(requestsPerMinute = 10) {
    this.requestsPerMinute = requestsPerMinute;
    this.requestTimes = [];
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  async readPage(url) {
    await this.waitForRateLimit();

    const result = await this.zai.functions.invoke('page_reader', {
      url: url
    });

    this.requestTimes.push(Date.now());
    return result.data;
  }

  async waitForRateLimit() {
    const now = Date.now();
    const oneMinuteAgo = now - 60000;

    // Remove old timestamps
    this.requestTimes = this.requestTimes.filter(time => time > oneMinuteAgo);

    // Check if we need to wait
    if (this.requestTimes.length >= this.requestsPerMinute) {
      const oldestRequest = this.requestTimes[0];
      const waitTime = 60000 - (now - oldestRequest);

      if (waitTime > 0) {
        console.log(`Rate limit reached. Waiting ${waitTime}ms...`);
        await new Promise(resolve => setTimeout(resolve, waitTime));
      }
    }
  }
}

// Usage
const reader = new RateLimitedReader(10); // 10 requests per minute
await reader.initialize();

const urls = ['https://example.com/1', 'https://example.com/2'];
for (const url of urls) {
  const data = await reader.readPage(url);
  console.log('Fetched:', data.title);
}
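
The sliding-window calculation inside waitForRateLimit can also be written as a pure function, which makes it easy to test without timers. The sketch below assumes timestamps are in milliseconds and stored in ascending order, as they are when pushed with Date.now(). computeWaitMs is a hypothetical helper.

```javascript
// Given earlier request timestamps (ms, ascending), return how long to wait
// before another request fits within `limit` requests per `windowMs`.
function computeWaitMs(requestTimes, now, limit, windowMs = 60000) {
  const recent = requestTimes.filter((time) => time > now - windowMs);
  if (recent.length < limit) return 0;
  // This request must leave the window before a slot opens up
  const blocking = recent[recent.length - limit];
  return Math.max(0, blocking + windowMs - now);
}

const times = [0, 1000, 2000];
console.log(computeWaitMs(times, 30000, 3)); // limit reached, must wait
console.log(computeWaitMs(times, 30000, 5)); // under the limit
console.log(computeWaitMs(times, 61000, 3)); // older requests have expired
```

Passing now as a parameter instead of calling Date.now() inside the function is what keeps it deterministic.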

3. Caching Strategy

javascript
import ZAI from 'z-ai-web-dev-sdk';

class CachedWebReader {
  constructor(cacheDuration = 3600000) { // 1 hour default
    this.cache = new Map();
    this.cacheDuration = cacheDuration;
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  async readPage(url, forceRefresh = false) {
    const cacheKey = url;
    const cached = this.cache.get(cacheKey);

    // Return cached if valid and not forcing refresh
    if (cached && !forceRefresh) {
      const age = Date.now() - cached.timestamp;
      if (age < this.cacheDuration) {
        console.log('Returning cached content for:', url);
        return cached.data;
      }
    }

    // Fetch fresh content
    const result = await this.zai.functions.invoke('page_reader', {
      url: url
    });

    // Update cache
    this.cache.set(cacheKey, {
      data: result.data,
      timestamp: Date.now()
    });

    return result.data;
  }

  clearCache() {
    this.cache.clear();
  }

  getCacheStats() {
    return {
      size: this.cache.size,
      entries: Array.from(this.cache.keys())
    };
  }
}

// Usage
const reader = new CachedWebReader(3600000); // 1 hour cache
await reader.initialize();

const data1 = await reader.readPage('https://example.com'); // Fresh fetch
const data2 = await reader.readPage('https://example.com'); // From cache
const data3 = await reader.readPage('https://example.com', true); // Force refresh
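
CachedWebReader above keeps an entry for every URL it has fetched, so a long-running process may accumulate entries indefinitely. The following is a minimal sketch of a size-capped store that could back the same readPage logic. The class name BoundedTtlCache, the maxEntries option, and the injectable now clock are hypothetical and are not part of z-ai-web-dev-sdk. Eviction here is by write order (oldest written first), not true least-recently-used.

```javascript
// Hypothetical helper (not part of z-ai-web-dev-sdk): a Map-backed cache
// with a time-to-live and a maximum number of entries.
class BoundedTtlCache {
  constructor({ ttlMs = 3600000, maxEntries = 100, now = () => Date.now() } = {}) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now; // injectable clock, which makes expiry easy to test
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.timestamp >= this.ttlMs) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.data;
  }

  set(key, data) {
    // Deleting first moves the key to the end of the Map's insertion order.
    this.entries.delete(key);
    this.entries.set(key, { data, timestamp: this.now() });
    // Evict the oldest-written entries once the cap is exceeded.
    while (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}
```

In CachedWebReader, this.cache.get(url) and this.cache.set(url, ...) could be swapped for the get and set methods of such a store, with the age check moved into the store itself.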

4. Parallel Processing

javascript
import ZAI from 'z-ai-web-dev-sdk';

async function readPagesInParallel(urls, concurrency = 3) {
  const zai = await ZAI.create();
  const results = [];
  
  // Process in batches
  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);
    
    const batchResults = await Promise.allSettled(
      batch.map(url =>
        zai.functions.invoke('page_reader', { url })
          .then(result => ({
            url: url,
            success: true,
            data: result.data
          }))
          .catch(error => ({
            url: url,
            success: false,
            error: error.message
          }))
      )
    );

    results.push(...batchResults.map(r => r.value));
    console.log(`Completed batch ${Math.floor(i / concurrency) + 1}`);
  }

  return results;
}

// Usage
const urls = [
  'https://example.com/1',
  'https://example.com/2',
  'https://example.com/3',
  'https://example.com/4',
  'https://example.com/5'
];

const results = await readPagesInParallel(urls, 2); // 2 concurrent requests
results.forEach(result => {
  if (result.success) {
    console.log(`${result.data.title}`);
  } else {
    console.log(`${result.url}: ${result.error}`);
  }
});
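
The batch loop above waits for the slowest request in each batch before starting the next one. As an alternative, a small worker pool keeps up to a fixed number of requests in flight at all times. The sketch below is one way to do that. The name mapWithConcurrency and its parameters are hypothetical, and the fn callback is assumed to wrap a call such as zai.functions.invoke('page_reader', { url }).

```javascript
// Hypothetical helper: run fn over items with at most `limit` calls in flight.
// Results keep the order of the input; failures are captured, not thrown.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let nextIndex = 0;

  async function worker() {
    // Each worker claims the next unprocessed index until none remain.
    while (nextIndex < items.length) {
      const index = nextIndex++;
      try {
        results[index] = { success: true, data: await fn(items[index], index) };
      } catch (error) {
        const message = error instanceof Error ? error.message : String(error);
        results[index] = { success: false, error: message };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With a zai instance created as above, a call might look like mapWithConcurrency(urls, 2, url => zai.functions.invoke('page_reader', { url }).then(r => r.data)).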

5. Content Processing

javascript
import ZAI from 'z-ai-web-dev-sdk';

class ContentProcessor {
  static extractMainContent(html) {
    // Remove scripts, styles, and comments
    let content = html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, '')
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, '')
      .replace(/<!--[\s\S]*?-->/g, '');

    return content;
  }

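  static htmlToPlainText(html) {
    return html
      // Collapse source whitespace first so only <br> and </p> create line breaks
      .replace(/\s+/g, ' ')
      .replace(/<br\s*\/?>/gi, '\n')
      .replace(/<\/p>/gi, '\n\n')
      .replace(/<[^>]*>/g, '')
      .replace(/&nbsp;/g, ' ')
      .replace(/&lt;/g, '<')
      .replace(/&gt;/g, '>')
      .replace(/&quot;/g, '"')
      // Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<"
      .replace(/&amp;/g, '&')
      // Tidy spacing around the inserted line breaks
      .replace(/ {2,}/g, ' ')
      .replace(/ *\n */g, '\n')
      .replace(/\n{3,}/g, '\n\n')
      .trim();
  }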

  static extractMetadata(html) {
    const metadata = {};

    // Extract meta description
    const descMatch = html.match(/<meta\s+name=["']description["']\s+content=["']([^"']+)["']/i);
    if (descMatch) metadata.description = descMatch[1];

    // Extract keywords
    const keywordsMatch = html.match(/<meta\s+name=["']keywords["']\s+content=["']([^"']+)["']/i);
    if (keywordsMatch) metadata.keywords = keywordsMatch[1].split(',').map(k => k.trim());

    // Extract author
    const authorMatch = html.match(/<meta\s+name=["']author["']\s+content=["']([^"']+)["']/i);
    if (authorMatch) metadata.author = authorMatch[1];

    return metadata;
  }
}

// Usage
async function processWebPage(url) {
  const zai = await ZAI.create();
  const result = await zai.functions.invoke('page_reader', { url });

  return {
    title: result.data.title,
    url: result.data.url,
    mainContent: ContentProcessor.extractMainContent(result.data.html),
    plainText: ContentProcessor.htmlToPlainText(result.data.html),
    metadata: ContentProcessor.extractMetadata(result.data.html),
    publishedTime: result.data.publishedTime
  };
}

const processed = await processWebPage('https://example.com/article');
console.log('Processed content:', processed.title);
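
The extractMetadata patterns above match only when the name attribute comes directly before content, and they stop at the first quote character of either kind inside the value. The sketch below reads the attributes of each meta tag in any order. The helper names parseAttributes and getMetaContent are hypothetical, and the approach is still a regex heuristic: a real HTML parser would be more robust.

```javascript
// Hypothetical helpers: regex-based and heuristic, not a full HTML parser.
function parseAttributes(tag) {
  const attributes = {};
  // Matches name="value" or name='value' pairs inside a single tag.
  const pattern = /([a-zA-Z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  let match;
  while ((match = pattern.exec(tag)) !== null) {
    const value = match[2] !== undefined ? match[2] : match[3];
    attributes[match[1].toLowerCase()] = value;
  }
  return attributes;
}

function getMetaContent(html, name) {
  const tags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of tags) {
    const attributes = parseAttributes(tag);
    // "property" covers Open Graph style tags such as og:title.
    const key = attributes.name || attributes.property;
    if (key && key.toLowerCase() === name.toLowerCase() && attributes.content !== undefined) {
      return attributes.content;
    }
  }
  return null; // no matching meta tag found
}
```

For example, getMetaContent(result.data.html, 'description') could stand in for the description match in extractMetadata, assuming result.data.html holds the page HTML as in the examples above.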

Common Use Cases

  1. News Aggregation: Collect and aggregate news articles from multiple sources
  2. Content Monitoring: Track changes on specific web pages
  3. Research Tools: Extract information from academic or reference websites
  4. Price Tracking: Monitor product pages for price changes
  5. SEO Analysis: Extract page metadata and content for SEO purposes
  6. Archive Creation: Create local copies of web content
  7. Content Curation: Collect and organize web content by topic
  8. Competitive Intelligence: Monitor competitor websites for updates

Integration Examples

Express.js API Endpoint

javascript
import express from 'express';
import ZAI from 'z-ai-web-dev-sdk';

const app = express();
app.use(express.json());

let zaiInstance;

async function initZAI() {
  zaiInstance = await ZAI.create();
}

app.post('/api/read-page', async (req, res) => {
  try {
    const { url } = req.body;

    if (!url) {
      return res.status(400).json({ 
        error: 'URL is required' 
      });
    }

    const result = await zaiInstance.functions.invoke('page_reader', {
      url: url
    });

    res.json({
      success: true,
      data: {
        title: result.data.title,
        url: result.data.url,
        content: result.data.html,
        publishedTime: result.data.publishedTime,
        tokensUsed: result.data.usage.tokens
      }
    });
  } catch (error) {
    res.status(500).json({
      success: false,
      error: error.message
    });
  }
});

app.post('/api/read-multiple', async (req, res) => {
  try {
    const { urls } = req.body;

    if (!urls || !Array.isArray(urls)) {
      return res.status(400).json({ 
        error: 'URLs array is required' 
      });
    }

    const results = await Promise.allSettled(
      urls.map(url =>
        zaiInstance.functions.invoke('page_reader', { url })
          .then(result => ({
            url: url,
            success: true,
            data: result.data
          }))
          .catch(error => ({
            url: url,
            success: false,
            error: error.message
          }))
      )
    );

    res.json({
      success: true,
      results: results.map(r => r.value)
    });
  } catch (error) {
    res.status(500).json({
      success: false,
      error: error.message
    });
  }
});

initZAI().then(() => {
  app.listen(3000, () => {
    console.log('Web reader API running on port 3000');
  });
});

Scheduled Content Fetcher

javascript
import ZAI from 'z-ai-web-dev-sdk';
import cron from 'node-cron';

class ScheduledFetcher {
  constructor() {
    this.urls = [];
    this.results = [];
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  addUrl(url, schedule) {
    this.urls.push({ url, schedule });
  }

  async fetchContent(url) {
    try {
      const result = await this.zai.functions.invoke('page_reader', {
        url: url
      });

      return {
        url: url,
        success: true,
        title: result.data.title,
        content: result.data.html,
        fetchedAt: new Date().toISOString()
      };
    } catch (error) {
      return {
        url: url,
        success: false,
        error: error.message,
        fetchedAt: new Date().toISOString()
      };
    }
  }

  startScheduledFetch(url, schedule) {
    cron.schedule(schedule, async () => {
      console.log(`Fetching ${url}...`);
      const result = await this.fetchContent(url);
      this.results.push(result);
      
      // Keep only last 100 results
      if (this.results.length > 100) {
        this.results = this.results.slice(-100);
      }
      
      console.log(`Fetched: ${result.success ? result.title : result.error}`);
    });
  }

  start() {
    for (const { url, schedule } of this.urls) {
      this.startScheduledFetch(url, schedule);
    }
  }

  getResults() {
    return this.results;
  }
}

// Usage
const fetcher = new ScheduledFetcher();
await fetcher.initialize();

// Fetch every hour
fetcher.addUrl('https://example.com/news', '0 * * * *');

// Fetch every day at midnight
fetcher.addUrl('https://example.com/daily', '0 0 * * *');

fetcher.start();
console.log('Scheduled fetching started');

Troubleshooting

Issue: "SDK must be used in backend"
  • Solution: Ensure z-ai-web-dev-sdk is only imported and used in server-side code
Issue: Failed to fetch page (404, 403, etc.)
  • Solution: Verify the URL is accessible and not behind authentication/paywall
Issue: Incomplete or missing content
  • Solution: The reader extracts static HTML, so content that a page renders client-side with JavaScript may be missing from the result.
Issue: High token usage
  • Solution: The token usage depends on page size. Consider caching frequently accessed pages.
Issue: Slow response times
  • Solution: Implement caching and fetch multiple URLs in parallel, keeping concurrency within your rate limits
Issue: Empty HTML content
  • Solution: Check if the page requires authentication or has anti-scraping measures. Verify the URL is correct.

Performance Tips

  1. Implement caching: Cache frequently accessed pages to reduce API calls
  2. Use parallel processing: Fetch multiple pages concurrently (with rate limiting)
  3. Process content efficiently: Extract only needed information from HTML
  4. Set timeouts: Implement reasonable timeouts for page fetching
  5. Monitor token usage: Track usage to optimize costs
  6. Batch operations: Group multiple URL fetches when possible
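
The "Set timeouts" tip can be sketched with Promise.race. The helper name withTimeout and its parameters are hypothetical. This document does not describe a built-in timeout option for page_reader, so the wrapper only stops waiting and does not cancel the underlying request.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
// This stops waiting for the result; it does not cancel the underlying request.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is cleared in either case.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A call might look like withTimeout(zai.functions.invoke('page_reader', { url }), 15000, url), where 15000 is an illustrative value rather than a recommended one.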

Security Considerations

  • Validate all URLs before processing
  • Sanitize extracted HTML content before displaying
  • Implement rate limiting to prevent abuse
  • Never expose SDK credentials in client-side code
  • Be respectful of robots.txt and website terms of service
  • Handle user data according to privacy regulations
  • Implement proper error handling for failed requests
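
As one possible reading of "Validate all URLs before processing", the sketch below accepts only absolute http(s) URLs and rejects hosts that obviously point at the local machine or a private IPv4 range. The function name isAllowedUrl and the specific block list are assumptions. The check does not resolve DNS, so a public hostname that resolves to a private address would still pass.

```javascript
// Hypothetical helper: accept only absolute http(s) URLs that do not
// obviously point at the local machine or a private IPv4 range.
function isAllowedUrl(input) {
  let parsed;
  try {
    parsed = new URL(input); // throws on malformed or relative input
  } catch {
    return false;
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return false;

  const host = parsed.hostname.toLowerCase();
  if (host === 'localhost' || host.endsWith('.localhost') || host === '[::1]') return false;

  const ipv4 = host.match(/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/);
  if (ipv4) {
    const a = Number(ipv4[1]);
    const b = Number(ipv4[2]);
    if (a === 0 || a === 10 || a === 127) return false; // unspecified, private, loopback
    if (a === 172 && b >= 16 && b <= 31) return false;  // private
    if (a === 192 && b === 168) return false;           // private
    if (a === 169 && b === 254) return false;           // link-local
  }
  return true;
}
```

In the Express endpoint above, such a check could run right after the missing-URL test and return a 400 response when it fails.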

Remember

  • Always use z-ai-web-dev-sdk in backend code only
  • The SDK is already installed - import as shown in examples
  • Implement proper error handling for robust applications
  • Use caching to improve performance and reduce costs
  • Respect website terms of service and rate limits
  • Process HTML content carefully to extract meaningful data
  • Monitor token usage for cost optimization