cnki-paper-detail

CNKI Paper Detail Extraction


Extract complete metadata from a CNKI paper detail page.


Arguments


`$ARGUMENTS` is optionally a CNKI paper detail URL (containing `kcms2/article/abstract`). If it is not provided, the current page is assumed to already be a paper detail page.
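The argument check described above can be sketched as a small predicate. This is only an illustration: the helper name `isPaperDetailUrl` is hypothetical, and testing the URL path (rather than the whole string) is an assumption about how strict the check should be.

```javascript
// Hypothetical helper: decides whether the argument looks like a CNKI paper detail URL.
// Returns false for an empty argument, which means "work on the current page".
function isPaperDetailUrl(args) {
  const value = (args ?? '').trim();
  if (value === '') return false;
  try {
    // Assumption: the marker is expected in the path, not in the query string.
    return new URL(value).pathname.includes('kcms2/article/abstract');
  } catch {
    return false; // not a parseable absolute URL
  }
}
```

When the predicate returns false for a non-empty argument, it is probably safer to stop and ask the user than to navigate to an unrelated page.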


Steps


1. Navigate to the paper page (if URL provided)


If `$ARGUMENTS` contains a URL:

- Use `mcp__chrome-devtools__navigate_page` with the URL.
- Use `mcp__chrome-devtools__wait_for` with text `["摘要"]` and timeout 15000.


2. Check for captcha


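Use `mcp__chrome-devtools__take_snapshot`. If the snapshot contains the text "拖动下方拼图完成验证" ("drag the puzzle piece below to complete verification"), notify the user with this message:

CNKI 正在显示滑块验证码。请在 Chrome 浏览器中手动完成拼图验证，完成后告诉我继续。

(In English: "CNKI is showing a slider captcha. Please complete the puzzle verification manually in Chrome, then tell me to continue.")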
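As a rough sketch, the captcha check amounts to a substring test on the snapshot. The function name `checkCaptcha` and the assumption that the snapshot is available as one plain-text string are illustrative, not part of the tool's documented output.

```javascript
const CAPTCHA_MARKER = '拖动下方拼图完成验证';
const CAPTCHA_NOTICE =
  'CNKI 正在显示滑块验证码。请在 Chrome 浏览器中手动完成拼图验证，完成后告诉我继续。';

// Hypothetical helper. snapshotText is assumed to be the snapshot serialized as text.
function checkCaptcha(snapshotText) {
  const blocked = String(snapshotText ?? '').includes(CAPTCHA_MARKER);
  // When blocked, the caller should show the notice and wait for the user.
  return { blocked, notice: blocked ? CAPTCHA_NOTICE : null };
}
```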


3. Extract paper metadata via JavaScript


Use `mcp__chrome-devtools__evaluate_script` with this function:

```javascript
() => {
  const brief = document.querySelector('.brief');
  if (!brief) return { error: 'Paper detail section (.brief) not found' };

  // Title: strip the "附视频" (video attached) and "网络首发" (online first) suffixes
  const title = (brief.querySelector('h1')?.innerText ?? '').trim()
    .replace(/\s*附视频\s*$/, '')
    .replace(/\s*网络首发\s*$/, '');

  // Authors: the first h3.author contains author links whose text ends with
  // a superscript affiliation number (e.g. "张三1")
  const authorH3s = brief.querySelectorAll('h3.author');
  const authorSection = authorH3s[0];
  const authors = [];
  if (authorSection) {
    const authorLinks = authorSection.querySelectorAll('a');
    authorLinks.forEach(a => {
      // Trim first so trailing whitespace cannot hide the superscript digits
      const text = (a.innerText ?? '').trim();
      const supMatch = text.match(/(\d+)$/);
      const name = text.replace(/\d+$/, '').trim();
      const affiliationNum = supMatch ? supMatch[1] : '';
      authors.push({ name, affiliationNum });
    });
  }

  // Affiliations: the second h3.author contains organization links
  const affiliations = [];
  if (authorH3s.length > 1) {
    const orgLinks = authorH3s[1].querySelectorAll('a');
    orgLinks.forEach(a => {
      affiliations.push((a.innerText ?? '').trim());
    });
  }

  // Abstract
  const abstractEl = document.querySelector('.abstract-text');
  const abstract = abstractEl?.innerText?.trim() || '';

  // Keywords: trim before removing the trailing semicolon separator
  const keywordsP = document.querySelector('p.keywords');
  const keywords = keywordsP
    ? Array.from(keywordsP.querySelectorAll('a'))
        .map(a => (a.innerText ?? '').trim().replace(/;$/, '').trim())
    : [];

  // Fund
  const fundsP = document.querySelector('p.funds');
  const fund = fundsP?.innerText?.trim() || '';

  // Classification code
  const clcCode = document.querySelector('.clc-code');
  const classification = clcCode?.innerText?.trim() || '';

  // Journal/source
  const docTop = document.querySelector('.doc-top');
  const journal = docTop?.querySelector('a')?.innerText?.trim() || '';

  // Online first / publication info
  const headTime = document.querySelector('.head-time');
  const pubInfo = headTime?.innerText?.trim() || '';

  // Is online first?
  const isOnlineFirst = !!brief.querySelector('.icon-shoufa');

  // Article outline/TOC
  const catalogList = document.querySelector('.catalog-list, .catalog-listDiv');
  const toc = catalogList?.innerText?.trim() || '';

  // Citation network counts, keyed by each tab's data-id
  const citationTabs = document.querySelectorAll('ul.module-tab.tpl_lieteratures li');
  const citationInfo = {};
  citationTabs.forEach(li => {
    const id = li.getAttribute('data-id');
    const text = li.innerText?.trim() || '';
    const countMatch = text.match(/(\d+)/);
    if (id) {
      citationInfo[id] = {
        label: text.replace(/\d+/, '').trim(),
        count: countMatch ? parseInt(countMatch[1], 10) : 0
      };
    }
  });

  return {
    title,
    authors,
    affiliations,
    abstract,
    keywords,
    fund,
    classification,
    journal,
    pubInfo,
    isOnlineFirst,
    toc,
    citationInfo
  };
}
```
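The extraction script returns authors with an `affiliationNum` and affiliations as raw strings such as "1.北京大学", so the two still need to be joined before display. The following is a minimal sketch of that join; `attachAffiliations` is a hypothetical helper, and the fallback for a single unnumbered affiliation is an assumption about how such pages look.

```javascript
// Hypothetical post-processing helper; it is not part of the extraction script.
function attachAffiliations(authors, affiliations) {
  // Split each affiliation into its leading number and its name,
  // e.g. "1.北京大学" -> { num: "1", name: "北京大学" }.
  const entries = affiliations.map((raw) => {
    const text = (raw ?? '').trim();
    const m = /^(\d+)\.\s*(.+)$/.exec(text);
    return m ? { num: m[1], name: m[2].trim() } : { num: '', name: text };
  });
  const byNumber = new Map(entries.filter((e) => e.num).map((e) => [e.num, e.name]));

  return authors.map(({ name, affiliationNum }) => ({
    name,
    affiliation:
      byNumber.get(affiliationNum) ??
      // Assumption: one affiliation without numbering applies to every author.
      (entries.length === 1 ? entries[0].name : ''),
  }));
}
```

Authors whose number cannot be matched get an empty `affiliation`, which lets the caller decide whether to print the name alone.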


4. Format and present the output


Present the extracted fields using this template:

{title} {isOnlineFirst ? "[网络首发]" : ""}

Authors: {For each author: "- {name} ({affiliation})"}

Affiliations: {For each affiliation: "- {affiliation}"}

Journal: {journal}

Publication Info: {pubInfo}

Abstract: {abstract}

Keywords: {keywords joined by ", "}

Fund: {fund}

Classification: {classification}

Citation Network: {For each citation type: "- {label}: {count}"}
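One way to turn the extraction result into the output described in this step is a plain string formatter. This is a sketch only: `formatPaper` is a hypothetical name, the one-field-per-line layout is an assumption, and each author is assumed to carry an optional `affiliation` string that has already been resolved from `affiliationNum`.

```javascript
// Hypothetical formatter for the object returned by the extraction script.
function formatPaper(paper) {
  const bullets = (items, render) => items.map((item) => `- ${render(item)}`);
  const lines = [
    `${paper.title}${paper.isOnlineFirst ? ' [网络首发]' : ''}`,
    'Authors:',
    // Assumption: author.affiliation was resolved beforehand; omit it when absent.
    ...bullets(paper.authors, (a) => (a.affiliation ? `${a.name} (${a.affiliation})` : a.name)),
    'Affiliations:',
    ...bullets(paper.affiliations, (x) => x),
    `Journal: ${paper.journal}`,
    `Publication Info: ${paper.pubInfo}`,
    `Abstract: ${paper.abstract}`,
    `Keywords: ${paper.keywords.join(', ')}`,
    `Fund: ${paper.fund}`,
    `Classification: ${paper.classification}`,
    'Citation Network:',
    ...bullets(Object.values(paper.citationInfo), (c) => `${c.label}: ${c.count}`),
  ];
  return lines.join('\n');
}
```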

5. Fallback: snapshot-based parsing


If JS extraction fails, use `mcp__chrome-devtools__take_snapshot` and parse the accessibility tree:

- Title: `heading` level 1 element
- Authors: `link` elements whose URLs contain `kcms2/author/detail`
- Affiliations: `link` elements whose URLs contain `kcms2/organ/detail`
- Abstract: `StaticText` following "摘要："
- Keywords: `link` elements whose URLs contain `kcms2/keyword/detail`
- Fund: `link` elements following "基金资助："
- Classification: `StaticText` following "分类号："
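The fallback rules above can be sketched against a flattened list of snapshot nodes. The node shape used here (`role`, `name`, `url`, `level`) is an assumption made for illustration and may not match the real snapshot format; `parseSnapshotNodes` is a hypothetical name, and the fund rule is left out to keep the sketch short.

```javascript
// Hypothetical parser over a flat, document-ordered array of snapshot nodes.
// Assumed node shape: { role: string, name: string, url?: string, level?: number }.
function parseSnapshotNodes(nodes) {
  const nameOf = (n) => (n.name ?? '').trim();

  // Names of all link nodes whose URL contains the given fragment.
  const linksTo = (fragment) =>
    nodes.filter((n) => n.role === 'link' && (n.url ?? '').includes(fragment)).map(nameOf);

  // Text of the first StaticText node after the StaticText node containing `label`.
  const textAfter = (label) => {
    const i = nodes.findIndex((n) => n.role === 'StaticText' && (n.name ?? '').includes(label));
    if (i === -1) return '';
    const next = nodes.slice(i + 1).find((n) => n.role === 'StaticText');
    return next ? nameOf(next) : '';
  };

  const heading = nodes.find((n) => n.role === 'heading' && n.level === 1);
  return {
    title: heading ? nameOf(heading) : '',
    authors: linksTo('kcms2/author/detail'),
    affiliations: linksTo('kcms2/organ/detail'),
    abstract: textAfter('摘要：'),
    keywords: linksTo('kcms2/keyword/detail').map((k) => k.replace(/;$/, '').trim()),
    classification: textAfter('分类号：'),
  };
}
```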


Verified DOM Selectors


| Data | Selector | Notes |
| --- | --- | --- |
| Paper section | `.brief` | Main paper info container |
| Title | `.brief h1` | May contain icons; text needs cleaning |
| Authors | `.brief h3.author:first-of-type a` | Text has superscript numbers (e.g., "张三1") |
| Affiliations | `.brief h3.author:nth-of-type(2) a` | Text starts with "N." (e.g., "1.北京大学") |
| Abstract | `.abstract-text` | Full abstract text |
| Keywords | `p.keywords a` | Semicolon-separated keyword links |
| Fund | `p.funds` | Fund information text |
| Classification | `.clc-code` | CLC classification codes |
| Journal | `.doc-top a` | Source journal link |
| Online first | `.brief .icon-shoufa` | Present if the paper is online first |
| Citation tabs | `ul.module-tab.tpl_lieteratures li` | `data-id` attribute identifies the type |
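Because these selectors depend on CNKI's current markup, a quick check before extraction can show which ones no longer match. The sketch below assumes only that the root object exposes `querySelector`, as `document` does; the `SELECTORS` map and `findMissingSelectors` are hypothetical names. A missing `fund` or `onlineFirst` entry is normal for papers without that information.

```javascript
// Selectors copied from the table above, keyed by illustrative field names.
const SELECTORS = {
  paperSection: '.brief',
  title: '.brief h1',
  authors: '.brief h3.author:first-of-type a',
  affiliations: '.brief h3.author:nth-of-type(2) a',
  abstract: '.abstract-text',
  keywords: 'p.keywords a',
  fund: 'p.funds',
  classification: '.clc-code',
  journal: '.doc-top a',
  onlineFirst: '.brief .icon-shoufa',
  citationTabs: 'ul.module-tab.tpl_lieteratures li',
};

// Returns the field names whose selector matches nothing under `root`.
function findMissingSelectors(root, selectors = SELECTORS) {
  return Object.entries(selectors)
    .filter(([, selector]) => root.querySelector(selector) === null)
    .map(([field]) => field);
}
```

In the browser this could be called as `findMissingSelectors(document)` through the same script-evaluation tool used for extraction.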
