cnki-paper-detail

CNKI Paper Detail Extraction

Extract complete metadata from a CNKI paper detail page.

Arguments

`$ARGUMENTS` is optionally a CNKI paper detail URL (one containing `kcms2/article/abstract`). If it is not provided, the current page is assumed to be a paper detail page already.
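The argument check described above could be sketched as follows. The helper name `resolveDetailUrl` is hypothetical, and the sketch assumes `$ARGUMENTS` arrives as a plain string; the only detail taken from the text is the `kcms2/article/abstract` path fragment.

```javascript
// Hypothetical helper: decide whether $ARGUMENTS carries a CNKI detail URL.
// Returns the URL to navigate to, or null to stay on the current page.
function resolveDetailUrl(args) {
  const candidate = String(args ?? '').trim();
  if (candidate === '') return null; // nothing supplied: use the current page

  let parsed;
  try {
    parsed = new URL(candidate);
  } catch {
    return null; // not parseable as a URL
  }
  // The only marker named in the text is this path fragment.
  return parsed.pathname.includes('kcms2/article/abstract') ? candidate : null;
}
```

A `null` result corresponds to the "assume the current page" branch, so the navigation step is skipped.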

Steps

1. Navigate to the paper page (if URL provided)

If `$ARGUMENTS` contains a URL:
  • Use `mcp__chrome-devtools__navigate_page` with the URL.
  • Use `mcp__chrome-devtools__wait_for` with text `["摘要"]` ("Abstract") and a timeout of 15000 ms.

2. Check for captcha

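Use `mcp__chrome-devtools__take_snapshot`. If the snapshot contains "拖动下方拼图完成验证" ("drag the puzzle piece below to complete verification"), notify the user:
CNKI 正在显示滑块验证码。请在 Chrome 浏览器中手动完成拼图验证,完成后告诉我继续。
(In English: "CNKI is showing a slider captcha. Please complete the puzzle verification manually in Chrome, then tell me to continue.")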
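The captcha check above amounts to a substring test on the snapshot. A minimal sketch, assuming the snapshot is available as a string; the helper name `checkCaptcha` and the return shape are hypothetical, while the marker and notice text are copied from this step.

```javascript
// Marker text and user-facing notice are taken verbatim from the step above.
const CAPTCHA_MARKER = '拖动下方拼图完成验证';
const CAPTCHA_NOTICE =
  'CNKI 正在显示滑块验证码。请在 Chrome 浏览器中手动完成拼图验证,完成后告诉我继续。';

// Hypothetical helper: snapshotText is assumed to be the snapshot serialized as text.
function checkCaptcha(snapshotText) {
  const blocked = String(snapshotText ?? '').includes(CAPTCHA_MARKER);
  // When blocked, the caller should show the notice and wait for the user.
  return { blocked, notice: blocked ? CAPTCHA_NOTICE : null };
}
```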

3. Extract paper metadata via JavaScript

Use `mcp__chrome-devtools__evaluate_script` with this function:

```javascript
() => {
  const brief = document.querySelector('.brief');
  if (!brief) return { error: 'Paper detail section (.brief) not found' };

  // Title: the h1 may carry badge text, so strip known suffixes
  const title = brief.querySelector('h1')?.innerText?.trim()
    ?.replace(/\s*附视频\s*$/, '')   // remove "附视频" (video attached) suffix
    ?.replace(/\s*网络首发\s*$/, ''); // remove "网络首发" (online first) suffix

  // Authors: the first h3.author holds author links with <sup> affiliation numbers
  const authorH3s = brief.querySelectorAll('h3.author');
  const authorSection = authorH3s[0];
  const authors = [];
  if (authorSection) {
    authorSection.querySelectorAll('a').forEach(a => {
      const text = a.innerText?.trim() ?? '';
      // Trailing digits are the superscript, e.g. "张三1" -> name "张三", number "1"
      const supMatch = text.match(/(\d+)$/);
      const name = text.replace(/\d+$/, '').trim();
      const affiliationNum = supMatch ? supMatch[1] : '';
      authors.push({ name, affiliationNum });
    });
  }

  // Affiliations: the second h3.author holds organization links
  const affiliations = [];
  if (authorH3s.length > 1) {
    authorH3s[1].querySelectorAll('a').forEach(a => {
      affiliations.push(a.innerText?.trim() ?? '');
    });
  }

  // Abstract
  const abstract = document.querySelector('.abstract-text')?.innerText?.trim() || '';

  // Keywords: each link's text may end with a separator semicolon
  const keywordsP = document.querySelector('p.keywords');
  const keywords = keywordsP
    ? Array.from(keywordsP.querySelectorAll('a'))
        .map(a => (a.innerText ?? '').trim().replace(/;$/, '').trim())
    : [];

  // Fund
  const fund = document.querySelector('p.funds')?.innerText?.trim() || '';

  // Classification code
  const classification = document.querySelector('.clc-code')?.innerText?.trim() || '';

  // Journal/source
  const journal = document.querySelector('.doc-top')?.querySelector('a')?.innerText?.trim() || '';

  // Online first / publication info
  const pubInfo = document.querySelector('.head-time')?.innerText?.trim() || '';

  // Is online first?
  const isOnlineFirst = !!brief.querySelector('.icon-shoufa');

  // Article outline/TOC
  const catalogList = document.querySelector('.catalog-list, .catalog-listDiv');
  const toc = catalogList?.innerText?.trim() || '';

  // Citation network counts, keyed by each tab's data-id
  const citationTabs = document.querySelectorAll('ul.module-tab.tpl_lieteratures li');
  const citationInfo = {};
  citationTabs.forEach(li => {
    const id = li.getAttribute('data-id');
    const text = li.innerText?.trim() ?? ''; // guard so match() never runs on undefined
    const countMatch = text.match(/(\d+)/);
    if (id) {
      citationInfo[id] = {
        label: text.replace(/\d+/, '').trim(),
        count: countMatch ? parseInt(countMatch[1], 10) : 0
      };
    }
  });

  return {
    title,
    authors,
    affiliations,
    abstract,
    keywords,
    fund,
    classification,
    journal,
    pubInfo,
    isOnlineFirst,
    toc,
    citationInfo
  };
}
```
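The script returns authors with an `affiliationNum` and affiliations as strings such as "1.北京大学", so pairing the two takes one more step before display. A possible sketch of that pairing follows; `attachAffiliations` is a hypothetical helper, and the fallbacks for unnumbered entries are assumptions rather than documented CNKI behavior.

```javascript
// Hypothetical post-processing: resolve each author's affiliationNum against
// affiliation entries such as "1.北京大学".
function attachAffiliations(authors, affiliations) {
  const byNumber = new Map();
  affiliations.forEach((entry, index) => {
    const text = String(entry ?? '').trim();
    const match = /^(\d+)\s*\.\s*(.*)$/.exec(text);
    // Assumption: unnumbered entries are keyed by their 1-based list position.
    const key = match ? match[1] : String(index + 1);
    byNumber.set(key, match ? match[2].trim() : text);
  });

  return authors.map(({ name, affiliationNum }) => {
    let affiliation = byNumber.get(affiliationNum) ?? '';
    // Assumption: no superscript plus a single affiliation means that one applies.
    if (affiliation === '' && affiliationNum === '' && byNumber.size === 1) {
      affiliation = [...byNumber.values()][0];
    }
    return { name, affiliation };
  });
}
```

Authors whose number has no matching entry get an empty `affiliation`, which keeps the gap visible instead of guessing.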

4. Format and present the output

{title} {isOnlineFirst ? "[网络首发]" : ""}

Authors: {For each author: "- {name} ({affiliation})"}
Affiliations: {For each affiliation: "- {affiliation}"}
Journal: {journal}
Publication Info: {pubInfo}
Abstract: {abstract}
Keywords: {keywords joined by ", "}
Fund: {fund}
Classification: {classification}
Citation Network: {For each citation type: "- {label}: {count}"}
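The template above could be rendered with a small function like the one below. This is only a sketch: `formatPaper` is a hypothetical name, the line layout is one reading of the template, and each author is assumed to already carry a resolved `affiliation` string (the parentheses are omitted when it is empty).

```javascript
// Hypothetical renderer for the object returned by the extraction script.
function formatPaper(paper) {
  const badge = paper.isOnlineFirst ? ' [网络首发]' : '';
  const authorLines = (paper.authors ?? []).map(a =>
    a.affiliation ? `- ${a.name} (${a.affiliation})` : `- ${a.name}`);
  const affiliationLines = (paper.affiliations ?? []).map(a => `- ${a}`);
  const citationLines = Object.values(paper.citationInfo ?? {})
    .map(c => `- ${c.label}: ${c.count}`);

  return [
    `${paper.title ?? ''}${badge}`,
    '',
    'Authors:', ...authorLines,
    'Affiliations:', ...affiliationLines,
    `Journal: ${paper.journal ?? ''}`,
    `Publication Info: ${paper.pubInfo ?? ''}`,
    `Abstract: ${paper.abstract ?? ''}`,
    `Keywords: ${(paper.keywords ?? []).join(', ')}`,
    `Fund: ${paper.fund ?? ''}`,
    `Classification: ${paper.classification ?? ''}`,
    'Citation Network:', ...citationLines,
  ].join('\n');
}
```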

5. Fallback: snapshot-based parsing

If JS extraction fails, use `mcp__chrome-devtools__take_snapshot` and parse the accessibility tree:
  • Title: the `heading` element at level 1
  • Authors: `link` elements whose URLs contain `kcms2/author/detail`
  • Affiliations: `link` elements whose URLs contain `kcms2/organ/detail`
  • Abstract: the `StaticText` following "摘要:"
  • Keywords: `link` elements whose URLs contain `kcms2/keyword/detail`
  • Fund: `link` elements following "基金资助:"
  • Classification: the `StaticText` following "分类号:"
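The mapping above can be sketched over a flattened list of snapshot nodes. The node shape `{ role, name, url, level }` is an assumption made for illustration, not the tool's documented output, and `parseSnapshotNodes` is a hypothetical helper. Labels are matched with either an ASCII or a full-width colon, since the page may use either.

```javascript
// Hypothetical fallback parser. `nodes` is assumed to be the accessibility tree
// flattened in document order, each node shaped like { role, name, url?, level? }.
function parseSnapshotNodes(nodes) {
  const nameOf = n => String(n?.name ?? '').trim();
  const labelIndex = stem =>
    nodes.findIndex(n => new RegExp(`^${stem}[:：]`).test(nameOf(n)));

  // All links whose URL contains the given path fragment.
  const linksTo = fragment => nodes
    .filter(n => n.role === 'link' && String(n.url ?? '').includes(fragment))
    .map(nameOf);

  // First StaticText node after the label node.
  const textAfter = stem => {
    const i = labelIndex(stem);
    if (i === -1) return '';
    return nameOf(nodes.slice(i + 1).find(n => n.role === 'StaticText'));
  };

  // Consecutive link nodes directly after the label node.
  const linksAfter = stem => {
    const i = labelIndex(stem);
    if (i === -1) return [];
    const out = [];
    for (const n of nodes.slice(i + 1)) {
      if (n.role !== 'link') break;
      out.push(nameOf(n));
    }
    return out;
  };

  return {
    title: nameOf(nodes.find(n => n.role === 'heading' && n.level === 1)),
    authors: linksTo('kcms2/author/detail'),
    affiliations: linksTo('kcms2/organ/detail'),
    abstract: textAfter('摘要'),
    keywords: linksTo('kcms2/keyword/detail'),
    fund: linksAfter('基金资助'),
    classification: textAfter('分类号'),
  };
}
```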

Verified DOM Selectors

| Data | Selector | Notes |
| --- | --- | --- |
| Paper section | `.brief` | Main paper info container |
| Title | `.brief h1` | May contain icons; the text needs cleaning |
| Authors | `.brief h3.author:first-of-type a` | Text has superscript numbers (e.g., "张三1") |
| Affiliations | `.brief h3.author:nth-of-type(2) a` | Text starts with "N." (e.g., "1.北京大学") |
| Abstract | `.abstract-text` | Full abstract text |
| Keywords | `p.keywords a` | Semicolon-separated keyword links |
| Fund | `p.funds` | Fund information text |
| Classification | `.clc-code` | CLC classification codes |
| Journal | `.doc-top a` | Source journal link |
| Online first | `.brief .icon-shoufa` | Present if the paper is online first |
| Citation tabs | `ul.module-tab.tpl_lieteratures li` | The `data-id` attribute identifies the type |