Hypogenic

Overview

Hypogenic provides automated hypothesis generation and testing using large language models to accelerate scientific discovery. The framework supports three approaches: HypoGeniC (data-driven hypothesis generation), HypoRefine (synergistic literature and data integration), and Union methods (mechanistic combination of literature and data-driven hypotheses).

Quick Start

Get started with Hypogenic in minutes:

```bash
# Install the package
uv pip install hypogenic

# Clone example datasets

# Run basic hypothesis generation
hypogenic_generation --config ./data/your_task/config.yaml --method hypogenic --num_hypotheses 20

# Run inference on generated hypotheses
hypogenic_inference --config ./data/your_task/config.yaml --hypotheses output/hypotheses.json
```

**Or use the Python API:**

```python
from hypogenic import BaseTask

# Create task with your configuration
task = BaseTask(config_path="./data/your_task/config.yaml")

# Generate hypotheses
task.generate_hypotheses(method="hypogenic", num_hypotheses=20)

# Run inference
results = task.inference(hypothesis_bank="./output/hypotheses.json")
```

When to Use This Skill

Use this skill when working on:
  • Generating scientific hypotheses from observational datasets
  • Testing multiple competing hypotheses systematically
  • Combining literature insights with empirical patterns
  • Accelerating research discovery through automated hypothesis ideation
  • Domains requiring hypothesis-driven analysis: deception detection, AI-generated content identification, mental health indicators, predictive modeling, or other empirical research

Key Features

Automated Hypothesis Generation
  • Generate 10-20+ testable hypotheses from data in minutes
  • Iterative refinement based on validation performance
  • Support for both API-based (OpenAI, Anthropic) and local LLMs
Literature Integration
  • Extract insights from research papers via PDF processing
  • Combine theoretical foundations with empirical patterns
  • Systematic literature-to-hypothesis pipeline with GROBID
Performance Optimization
  • Redis caching reduces API costs for repeated experiments
  • Parallel processing for large-scale hypothesis testing
  • Adaptive refinement focuses on challenging examples
Flexible Configuration
  • Template-based prompt engineering with variable injection
  • Custom label extraction for domain-specific tasks
  • Modular architecture for easy extension
Proven Results
  • 8.97% improvement over few-shot baselines
  • 15.75% improvement over literature-only approaches
  • 80-84% hypothesis diversity (non-redundant insights)
  • Human evaluators report significant decision-making improvements

Core Capabilities

1. HypoGeniC: Data-Driven Hypothesis Generation

Generate hypotheses solely from observational data through iterative refinement.
Process:
  1. Initialize with a small data subset to generate candidate hypotheses
  2. Iteratively refine hypotheses based on performance
  3. Replace poorly-performing hypotheses with new ones from challenging examples
Best for: Exploratory research without existing literature, pattern discovery in novel datasets
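The loop above can be sketched in a few lines. Everything below is a toy stand-in for Hypogenic's actual implementation: random scores replace validation accuracy and string templates replace LLM calls, purely to illustrate the generate-score-replace cycle.

```python
import random

def generate_from(examples, n):
    # Stand-in for an LLM proposing hypotheses from (challenging) examples
    return [f"hypothesis about {random.choice(examples)}" for _ in range(n)]

def score(hypothesis, val_data):
    # Stand-in for measuring a hypothesis's accuracy on validation data
    return random.random()

val_data = ["example A", "example B", "example C"]
bank = generate_from(val_data, n=5)  # initialize from a small data subset

for _ in range(3):  # iterative refinement rounds
    ranked = sorted(bank, key=lambda h: score(h, val_data), reverse=True)
    survivors = ranked[: len(bank) - 2]
    # Replace the two weakest hypotheses with fresh candidates
    bank = survivors + generate_from(val_data, n=2)

print(len(bank))  # bank size stays constant at 5
```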

2. HypoRefine: Literature and Data Integration

Synergistically combine existing literature with empirical data through an agentic framework.
Process:
  1. Extract insights from relevant research papers (typically 10 papers)
  2. Generate theory-grounded hypotheses from literature
  3. Generate data-driven hypotheses from observational patterns
  4. Refine both hypothesis banks through iterative improvement
Best for: Research with established theoretical foundations, validating or extending existing theories
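The four steps above can be sketched as two hypothesis banks refined in alternation. All functions here are hypothetical stand-ins for the framework's LLM-driven steps, not Hypogenic's API:

```python
def extract_insights(papers):
    # Stand-in for GROBID + LLM extraction of insights from papers
    return [f"insight from {p}" for p in papers]

def hypotheses_from(seeds):
    # Stand-in for LLM hypothesis generation from seeds
    return [f"hypothesis based on {s}" for s in seeds]

def refine(bank):
    # Stand-in for one round of LLM-driven refinement
    return [h + " (refined)" for h in bank]

papers = [f"paper_{i}" for i in range(1, 11)]  # typically ~10 papers
literature_bank = hypotheses_from(extract_insights(papers))   # theory-grounded
data_bank = hypotheses_from(["pattern in training data"])     # data-driven

for _ in range(2):  # iterative improvement of both banks
    literature_bank = refine(literature_bank)
    data_bank = refine(data_bank)

print(len(literature_bank))  # 10
```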

3. Union Methods

Mechanistically combine literature-only hypotheses with framework outputs.
Variants:
  • Literature ∪ HypoGeniC: Combines literature hypotheses with data-driven generation
  • Literature ∪ HypoRefine: Combines literature hypotheses with integrated approach
Best for: Comprehensive hypothesis coverage, eliminating redundancy while maintaining diverse perspectives
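The mechanistic union amounts to concatenating two banks and removing duplicates. A minimal sketch (with naive exact-match deduplication; the framework's redundancy check is more sophisticated):

```python
literature_hypotheses = [
    "Deceptive reviews use fewer first-person pronouns.",
    "Deceptive reviews contain less spatial detail.",
]
data_driven_hypotheses = [
    "Deceptive reviews contain less spatial detail.",  # overlaps with literature
    "Deceptive reviews use more superlatives.",
]

# dict.fromkeys preserves order while dropping exact duplicates
union_bank = list(dict.fromkeys(literature_hypotheses + data_driven_hypotheses))
print(len(union_bank))  # 3
```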

Installation

Install via pip:

```bash
uv pip install hypogenic
```

Optional dependencies:
  • Redis server (port 6832): Enables caching of LLM responses to significantly reduce API costs during iterative hypothesis generation
  • s2orc-doc2json: Required for processing literature PDFs in HypoRefine workflows
  • GROBID: Required for PDF preprocessing (see Literature Processing section)

Clone example datasets:

```bash
# For HypoGeniC examples

# For HypoRefine/Union examples
```

Dataset Format

Datasets must follow the HuggingFace datasets format with specific naming conventions:

Required files:
  • <TASK>_train.json: Training data
  • <TASK>_val.json: Validation data
  • <TASK>_test.json: Test data

Required keys in JSON:
  • text_features_1 through text_features_n: Lists of strings containing feature values
  • label: List of strings containing ground truth labels

Example (headline click prediction):

```json
{
  "headline_1": [
    "What Up, Comet? You Just Got *PROBED*",
    "Scientists Made a Breakthrough in Quantum Computing"
  ],
  "headline_2": [
    "Scientists Everywhere Were Holding Their Breath Today. Here's Why.",
    "New Quantum Computer Achieves Milestone"
  ],
  "label": [
    "Headline 2 has more clicks than Headline 1",
    "Headline 1 has more clicks than Headline 2"
  ]
}
```

Important notes:
  • All lists must have the same length
  • Label format must match your extract_label() function output format
  • Feature keys can be customized to match your domain (e.g., review_text, post_content, etc.)

Configuration

Each task requires a config.yaml file specifying:

Required elements:
  • Dataset paths (train/val/test)
  • Prompt templates for:
    • Observations generation
    • Batched hypothesis generation
    • Hypothesis inference
    • Relevance checking
    • Adaptive methods (for HypoRefine)

Template capabilities:
  • Dataset placeholders for dynamic variable injection (e.g., ${text_features_1}, ${num_hypotheses})
  • Custom label extraction functions for domain-specific parsing
  • Role-based prompt structure (system, user, assistant roles)

Configuration structure:

```yaml
task_name: your_task_name

train_data_path: ./your_task_train.json
val_data_path: ./your_task_val.json
test_data_path: ./your_task_test.json

prompt_templates:
  # Extra keys for reusable prompt components
  observations: |
    Feature 1: ${text_features_1}
    Feature 2: ${text_features_2}
    Observation: ${label}

  # Required templates
  batched_generation:
    system: "Your system prompt here"
    user: "Your user prompt with ${num_hypotheses} placeholder"

  inference:
    system: "Your inference system prompt"
    user: "Your inference user prompt"

  # Optional templates for advanced features
  few_shot_baseline: {...}
  is_relevant: {...}
  adaptive_inference: {...}
  adaptive_selection: {...}
```

Refer to references/config_template.yaml for a complete example configuration.

Literature Processing (HypoRefine/Union Methods)

To use literature-based hypothesis generation, you must preprocess PDF papers:

Step 1: Set up GROBID (first time only)

```bash
bash ./modules/setup_grobid.sh
```

Step 2: Add PDF files. Place research papers in literature/YOUR_TASK_NAME/raw/

Step 3: Process PDFs

```bash
# Start GROBID service
bash ./modules/run_grobid.sh

# Process PDFs for your task
cd examples
python pdf_preprocess.py --task_name YOUR_TASK_NAME
```

This converts PDFs to a structured format for hypothesis extraction. Automated literature search will be supported in future releases.

CLI Usage

Hypothesis Generation

```bash
hypogenic_generation --help
```

Key parameters:
  • Task configuration file path
  • Model selection (API-based or local)
  • Generation method (HypoGeniC, HypoRefine, or Union)
  • Number of hypotheses to generate
  • Output directory for hypothesis banks

Hypothesis Inference

```bash
hypogenic_inference --help
```

Key parameters:
  • Task configuration file path
  • Hypothesis bank file path
  • Test dataset path
  • Inference method (default or multi-hypothesis)
  • Output file for results

Python API Usage

For programmatic control and custom workflows, use Hypogenic directly in your Python code:

Basic HypoGeniC Generation

```python
from hypogenic import BaseTask

# Clone example datasets first

# Load your task with a custom extract_label function
task = BaseTask(
    config_path="./data/your_task/config.yaml",
    extract_label=lambda text: extract_your_label(text)
)

# Generate hypotheses
task.generate_hypotheses(
    method="hypogenic",
    num_hypotheses=20,
    output_path="./output/hypotheses.json"
)

# Run inference
results = task.inference(
    hypothesis_bank="./output/hypotheses.json",
    test_data="./data/your_task/your_task_test.json"
)
```

HypoRefine/Union Methods

```python
# For literature-integrated approaches

# Generate with HypoRefine
task.generate_hypotheses(
    method="hyporefine",
    num_hypotheses=15,
    literature_path="./literature/your_task/",
    output_path="./output/"
)

# This generates 3 hypothesis banks:
# - HypoRefine (integrated approach)
# - Literature-only hypotheses
# - Literature∪HypoRefine (union)
```

Multi-Hypothesis Inference

```python
from examples.multi_hyp_inference import run_multi_hypothesis_inference

# Test multiple hypotheses simultaneously
results = run_multi_hypothesis_inference(
    config_path="./data/your_task/config.yaml",
    hypothesis_bank="./output/hypotheses.json",
    test_data="./data/your_task/your_task_test.json"
)
```

Custom Label Extraction

The extract_label() function is critical for parsing LLM outputs. Implement it based on your task:

```python
import re

def extract_label(llm_output: str) -> str:
    """Extract the predicted label from LLM inference text.

    Default behavior: searches for the 'final answer:\s+(.*)' pattern.
    Customize for your domain-specific output format.
    """
    match = re.search(r'final answer:\s+(.*)', llm_output, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return llm_output.strip()
```

Important: Extracted labels must match the format of the label values in your dataset for correct accuracy calculation.
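To see why the format must match, here is the default pattern applied to two sample outputs, followed by the exact-match accuracy computation the warning above alludes to (the function is repeated so the sketch is self-contained; the sample outputs are invented):

```python
import re

def extract_label(llm_output: str) -> str:
    # Default pattern; falls back to the raw output when no marker is found
    match = re.search(r'final answer:\s+(.*)', llm_output, re.IGNORECASE)
    return match.group(1).strip() if match else llm_output.strip()

outputs = [
    "Curiosity gaps drive clicks.\nFinal answer: Headline 2 has more clicks than Headline 1",
    "Headline 1 has more clicks than Headline 2",  # no marker: raw text is used
]
gold = [
    "Headline 2 has more clicks than Headline 1",
    "Headline 1 has more clicks than Headline 2",
]

predictions = [extract_label(o) for o in outputs]
# Accuracy is exact string match, hence the format requirement
accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)
print(accuracy)  # 1.0
```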

Workflow Examples

Example 1: Data-Driven Hypothesis Generation (HypoGeniC)

Scenario: Detecting AI-generated content without a prior theoretical framework

Steps:
  1. Prepare a dataset with text samples and labels (human vs. AI-generated)
  2. Create config.yaml with appropriate prompt templates
  3. Run hypothesis generation:
     ```bash
     hypogenic_generation --config config.yaml --method hypogenic --num_hypotheses 20
     ```
  4. Run inference on the test set:
     ```bash
     hypogenic_inference --config config.yaml --hypotheses output/hypotheses.json --test_data data/test.json
     ```
  5. Analyze results for patterns like formality, grammatical precision, and tone differences

Example 2: Literature-Informed Hypothesis Testing (HypoRefine)

Scenario: Deception detection in hotel reviews, building on existing research

Steps:
  1. Collect 10 relevant papers on linguistic deception cues
  2. Prepare a dataset with genuine and fraudulent reviews
  3. Configure config.yaml with literature processing and data generation templates
  4. Run HypoRefine:
     ```bash
     hypogenic_generation --config config.yaml --method hyporefine --papers papers/ --num_hypotheses 15
     ```
  5. Test hypotheses examining pronoun frequency, detail specificity, and other linguistic patterns
  6. Compare the performance of literature-based and data-driven hypotheses

Example 3: Comprehensive Hypothesis Coverage (Union Method)

Scenario: Mental stress detection maximizing hypothesis diversity

Steps:
  1. Generate literature hypotheses from mental health research papers
  2. Generate data-driven hypotheses from social media posts
  3. Run the Union method to combine and deduplicate:
     ```bash
     hypogenic_generation --config config.yaml --method union --literature_hypotheses lit_hyp.json
     ```
  4. Inference captures both theoretical constructs (posting behavior changes) and data patterns (emotional language shifts)

Performance Optimization

Caching: Enable Redis caching to reduce API costs and computation time for repeated LLM calls
Parallel Processing: Leverage multiple workers for large-scale hypothesis generation and testing
Adaptive Refinement: Use challenging examples to iteratively improve hypothesis quality
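The caching idea is simple to sketch: key each LLM call by a hash of its prompt and reuse the stored response. Hypogenic uses Redis for this; an in-memory dict stands in below so the sketch runs anywhere, and the call counter shows the saved API call.

```python
import hashlib

cache = {}
calls = {"count": 0}

def call_llm(prompt: str) -> str:
    # Stand-in for a paid API call
    calls["count"] += 1
    return f"response to: {prompt}"

def cached_llm_call(prompt: str) -> str:
    # Key by a stable hash of the prompt so identical prompts hit the cache
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in cache:
        cache[key] = call_llm(prompt)
    return cache[key]

cached_llm_call("generate hypotheses")
cached_llm_call("generate hypotheses")  # served from cache, no second API call
print(calls["count"])  # 1
```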

Expected Outcomes

Research using Hypogenic has demonstrated:
  • 14.19% accuracy improvement in AI-content detection tasks
  • 7.44% accuracy improvement in deception detection tasks
  • 80-84% of hypothesis pairs offering distinct, non-redundant insights
  • High helpfulness ratings from human evaluators across multiple research domains

Troubleshooting

Issue: Generated hypotheses are too generic
Solution: Refine the prompt templates in config.yaml to request more specific, testable hypotheses

Issue: Poor inference performance
Solution: Ensure the dataset has sufficient training examples, adjust hypothesis generation parameters, or increase the number of hypotheses

Issue: Label extraction failures
Solution: Implement a custom extract_label() function for domain-specific output parsing

Issue: GROBID PDF processing fails
Solution: Ensure the GROBID service is running (bash ./modules/run_grobid.sh) and the PDFs are valid research papers

Creating Custom Tasks

To add a new task or dataset to Hypogenic:

Step 1: Prepare Your Dataset

Create three JSON files following the required format:
  • your_task_train.json
  • your_task_val.json
  • your_task_test.json

Each file must have keys for text features (text_features_1, etc.) and label.

Step 2: Create config.yaml

Define your task configuration with:
  • Task name and dataset paths
  • Prompt templates for observations, generation, and inference
  • Any extra keys for reusable prompt components
  • Placeholder variables (e.g., ${text_features_1}, ${num_hypotheses})
Step 3: Implement extract_label Function

Create a custom label extraction function that parses LLM outputs for your domain:

```python
import re

from hypogenic import BaseTask

def extract_my_label(llm_output: str) -> str:
    """Custom label extraction for your task.

    Must return labels in the same format as the dataset 'label' field.
    """
    # Example: extract from a specific format
    if "Final prediction:" in llm_output:
        return llm_output.split("Final prediction:")[-1].strip()

    # Fall back to the default pattern
    match = re.search(r'final answer:\s+(.*)', llm_output, re.IGNORECASE)
    return match.group(1).strip() if match else llm_output.strip()

# Use your custom task
task = BaseTask(
    config_path="./your_task/config.yaml",
    extract_label=extract_my_label
)
```

Step 4: (Optional) Process Literature

For HypoRefine/Union methods:
  1. Create the literature/your_task_name/raw/ directory
  2. Add relevant research paper PDFs
  3. Run GROBID preprocessing
  4. Process with pdf_preprocess.py

Step 5: Generate and Test

Run hypothesis generation and inference using the CLI or the Python API:

```bash
# CLI approach
hypogenic_generation --config your_task/config.yaml --method hypogenic --num_hypotheses 20
hypogenic_inference --config your_task/config.yaml --hypotheses output/hypotheses.json
```

Or use the Python API (see the Python API Usage section).

Repository Structure

Understanding the repository layout:

```
hypothesis-generation/
├── hypogenic/              # Core package code
├── hypogenic_cmd/          # CLI entry points
├── hypothesis_agent/       # HypoRefine agent framework
├── literature/             # Literature processing utilities
├── modules/                # GROBID and preprocessing modules
├── examples/               # Example scripts
│   ├── generation.py       # Basic HypoGeniC generation
│   ├── union_generation.py # HypoRefine/Union generation
│   ├── inference.py        # Single hypothesis inference
│   ├── multi_hyp_inference.py # Multiple hypothesis inference
│   └── pdf_preprocess.py   # Literature PDF processing
├── data/                   # Example datasets (clone separately)
├── tests/                  # Unit tests
└── IO_prompting/           # Prompt templates and experiments
```

Key directories:
  • hypogenic/: Main package with BaseTask and generation logic
  • examples/: Reference implementations for common workflows
  • literature/: Tools for PDF processing and literature extraction
  • modules/: External tool integrations (GROBID, etc.)

Related Publications

HypoBench (2025)

Liu, H., Huang, S., Hu, J., Zhou, Y., & Tan, C. (2025). HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation. arXiv preprint arXiv:2504.11524.

BibTeX:

```bibtex
@misc{liu2025hypobenchsystematicprincipledbenchmarking,
      title={HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation},
      author={Haokun Liu and Sicong Huang and Jingyu Hu and Yangqiaoyu Zhou and Chenhao Tan},
      year={2025},
      eprint={2504.11524},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2504.11524},
}
```

Literature Meets Data (2024)

Liu, H., Zhou, Y., Li, M., Yuan, C., & Tan, C. (2024). Literature Meets Data: A Synergistic Approach to Hypothesis Generation. arXiv preprint arXiv:2410.17309.

BibTeX:

```bibtex
@misc{liu2024literaturemeetsdatasynergistic,
      title={Literature Meets Data: A Synergistic Approach to Hypothesis Generation},
      author={Haokun Liu and Yangqiaoyu Zhou and Mingxuan Li and Chenfei Yuan and Chenhao Tan},
      year={2024},
      eprint={2410.17309},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.17309},
}
```

Hypothesis Generation with Large Language Models (2024)

Zhou, Y., Liu, H., Srivastava, T., Mei, H., & Tan, C. (2024). Hypothesis Generation with Large Language Models. In Proceedings of EMNLP Workshop of NLP for Science.

BibTeX:

```bibtex
@inproceedings{zhou2024hypothesisgenerationlargelanguage,
      title={Hypothesis Generation with Large Language Models},
      author={Yangqiaoyu Zhou and Haokun Liu and Tejes Srivastava and Hongyuan Mei and Chenhao Tan},
      booktitle={Proceedings of EMNLP Workshop of NLP for Science},
      year={2024},
      url={https://aclanthology.org/2024.nlp4science-1.10/},
}
```

Additional Resources

Official Links

Example Datasets

Clone these repositories for ready-to-use examples:

```bash
# HypoGeniC examples (data-driven only)

# HypoRefine/Union examples (literature + data)
```

Community & Contributions

  • Contributors: 7+ active contributors
  • Stars: 89+ on GitHub
  • Topics: research-tool, interpretability, hypothesis-generation, scientific-discovery, llm-application
For contributions or questions, visit the GitHub repository and check the issues page.

Local Resources

references/

config_template.yaml - Complete example configuration file with all required prompt templates and parameters. This includes:
  • Full YAML structure for task configuration
  • Example prompt templates for all methods
  • Placeholder variable documentation
  • Role-based prompt examples

scripts/

Scripts directory is available for:
  • Custom data preparation utilities
  • Format conversion tools
  • Analysis and evaluation scripts
  • Integration with external tools

assets/

Assets directory is available for:
  • Example datasets and templates
  • Sample hypothesis banks
  • Visualization outputs
  • Documentation supplements