filter-js-from-html
Filter JavaScript from HTML
Overview
This skill provides guidance for tasks that require removing JavaScript and XSS attack vectors from HTML content while preserving the original formatting exactly. The key challenge is balancing comprehensive security filtering with format preservation.
Critical Requirements Analysis
Before implementation, identify and prioritize these requirements:
- Security completeness: All XSS vectors must be removed
- Format preservation: Output must be functionally identical to input except for harmful content removal
- Clean content handling: Files without XSS content should remain completely unchanged
These requirements often conflict - comprehensive parsing may alter formatting, while simple string replacement may miss attack vectors.
Approach Selection
Option 1: Regex-Based Surgical Removal (Recommended for Format Preservation)
When the task explicitly requires preserving original formatting, prefer regex-based approaches that surgically remove only the dangerous content.
Advantages:
- Preserves whitespace, attribute ordering, quote styles exactly
- Does not reconstruct or reformat HTML
- Output matches input character-for-character except for removed content
Considerations:
- Requires careful pattern construction to avoid partial matches
- Must handle various encodings and obfuscation techniques
- Test patterns against comprehensive XSS vector lists
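A minimal sketch of the surgical approach, assuming Python's standard re module. The helper name strip_script_tags and both patterns are illustrative assumptions, not part of the original guidance. The sketch covers only script elements and assumes tags do not contain '>' inside attribute values.

```python
import re

# Paired <script ...>...</script> blocks, case-insensitive, spanning newlines.
# The closing-tag pattern tolerates trailing characters before '>', e.g. "</script >".
SCRIPT_BLOCK = re.compile(
    r"<script\b[^>]*>.*?</script\b[^>]*>", re.IGNORECASE | re.DOTALL
)
# Any leftover unpaired opening or closing script tag.
SCRIPT_STRAY = re.compile(r"</?script\b[^>]*>", re.IGNORECASE)


def strip_script_tags(html: str) -> str:
    """Hypothetical helper: delete script elements and leave all other text as-is."""
    previous = None
    # Repeat until stable so a tag reassembled by an earlier deletion is also removed.
    while previous != html:
        previous = html
        html = SCRIPT_BLOCK.sub("", html)
        html = SCRIPT_STRAY.sub("", html)
    return html
```

Because only matched spans are deleted, whitespace, quote styles, and attribute order elsewhere in the input pass through unchanged. When a script element is never closed, this sketch removes only the opening tag and leaves the remaining text in place as inert character data.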
Option 2: HTML Parser-Based Filtering
Use this approach when format preservation is less critical or when dealing with malformed HTML.
Considerations:
- HTML parsers inherently reconstruct output, changing formatting
- May normalize attribute quotes, whitespace, tag casing
- Better for malformed HTML that regex cannot reliably parse
- If using this approach, verify that clean HTML files remain unchanged
Comprehensive XSS Vector Checklist
Before implementing, research and account for ALL of these attack categories:
1. Script Execution Tags
- <script> tags (including variations with attributes)
- <noscript> abuse cases
2. Event Handlers (Comprehensive List Required)
Common handlers:
- onclick, onload, onerror, onmouseover, onfocus, onblur
Frequently missed handlers:
- onlayoutcomplete, ontimeerror, onselectionchange
- onrowsinserted, onrowsdelete, onrowexit, onrowenter
- oncellchange, ondataavailable, ondatasetchanged, ondatasetcomplete
- onbeforeupdate, onafterupdate, onerrorupdate
- onfilterchange, onpropertychange, onreadystatechange
- onbeforeprint, onafterprint, onbeforeunload
- oncontextmenu, ondrag, ondragend, ondragenter, ondragleave
- ondragover, ondragstart, ondrop
- onhashchange, oninput, oninvalid, onpageshow, onpagehide
- onpopstate, onresize, onstorage, onwheel
Action: Search for comprehensive event handler lists (e.g., MDN, OWASP) rather than relying on memory.
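One way to avoid an incomplete name list is to drop every attribute whose name begins with "on". The sketch below swaps the enumerated list for that prefix match, which is an assumption of this example rather than something the checklist prescribes. It uses Python's re module; strip_event_handlers, TAG, and ATTR are hypothetical names, and start tags with unbalanced quotes are left untouched by this sketch.

```python
import re

# A start tag: '<' + tag name, then attribute text in which quoted values may
# contain '>', then the closing '>'.
TAG = re.compile(r"""(<[a-zA-Z][^\s/>]*)((?:"[^"]*"|'[^']*'|[^>"'])*)(>)""")
# One attribute: separator (whitespace or '/'), name, optional value.
# Matching whole attributes means text inside quoted values is never rescanned.
ATTR = re.compile(r"""[\s/]+([^\s/>="']+)(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]*))?""")


def _drop_handlers(tag: re.Match) -> str:
    head, attrs, close = tag.group(1), tag.group(2), tag.group(3)
    kept = ATTR.sub(
        # Remove the attribute (with its leading separator) if its name starts with "on".
        lambda a: "" if a.group(1).lower().startswith("on") else a.group(0),
        attrs,
    )
    return head + kept + close


def strip_event_handlers(html: str) -> str:
    """Hypothetical helper: remove on* attributes, keep every other character."""
    return TAG.sub(_drop_handlers, html)
```

Attributes that are kept are emitted exactly as matched, so quote style, spacing, and ordering of the remaining attributes are preserved. The '/' separator is included because payloads such as <svg/onload=...> use it in place of whitespace.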
3. JavaScript URL Protocol
- javascript: in href, src, action, formaction, data, poster attributes
- Case variations: JavaScript:, JAVASCRIPT:, JaVaScRiPt:
- Encoded variations: javascript: with characters written as HTML character references (for example, &#106; for j)
4. Other Dangerous Protocols
- vbscript: (IE legacy)
- data: URIs with script content: data:text/html,<script>...</script>
- data:text/html;base64,... encoded payloads
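The protocol checks in sections 3 and 4 can be sketched as a normalization step followed by a prefix test. This is a rough approximation of how browsers read a URL attribute, not an exact model: it decodes HTML character references with the standard html.unescape, strips leading control characters and spaces, removes tabs and newlines, and compares case-insensitively. The names has_dangerous_scheme and DANGEROUS_PREFIXES are hypothetical, and limiting data: to data:text/html mirrors the examples above and is an assumption that may need widening.

```python
import html
import re

# Prefixes taken from the checklist; data: is limited to text/html here (assumption).
DANGEROUS_PREFIXES = ("javascript:", "vbscript:", "data:text/html")
# U+0000..U+0020: control characters and space, ignored at the start of a URL.
_LEADING_JUNK = "".join(chr(c) for c in range(0x21))
_TAB_OR_NEWLINE = re.compile(r"[\t\n\r]")


def has_dangerous_scheme(raw_value: str) -> bool:
    """Hypothetical helper: True if an attribute value resolves to a dangerous scheme."""
    decoded = html.unescape(raw_value)            # &#106;avascript: -> javascript:
    trimmed = decoded.lstrip(_LEADING_JUNK)       # drop leading spaces/control chars
    trimmed = _TAB_OR_NEWLINE.sub("", trimmed)    # java\tscript: -> javascript:
    return trimmed.lower().startswith(DANGEROUS_PREFIXES)
```

The function only detects; the caller still decides what to remove, so the raw attribute text is never rewritten by the normalization itself.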
5. CSS-Based Attacks
- <style> tags with dangerous properties
- -moz-binding (Firefox legacy)
- expression() (IE legacy)
- behavior: property
- @import with javascript or data URIs
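A minimal sketch of a detector for the CSS constructs listed above, assuming Python's re module. The name css_is_dangerous and the pattern are illustrative assumptions; CSS backslash escapes (for example \65 xpression) are not decoded here and would need separate handling.

```python
import re

DANGEROUS_CSS = re.compile(
    r"expression\s*\("                           # IE legacy expression()
    r"|-moz-binding\s*:"                         # Firefox legacy XBL binding
    r"|(?<![\w-])behavior\s*:"                   # IE behavior:, but not scroll-behavior:
    r"|@import\b[^;]*(?:javascript|data)\s*:",   # @import with a script or data URI
    re.IGNORECASE,
)
_CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def css_is_dangerous(css: str) -> bool:
    """Hypothetical helper: True if a style block or style attribute looks unsafe."""
    # Comments can split keywords (exp/**/ression), so remove them before matching.
    without_comments = _CSS_COMMENT.sub("", css)
    return DANGEROUS_CSS.search(without_comments) is not None
```

The lookbehind on behavior keeps modern properties such as scroll-behavior from being flagged, which matters for the requirement that clean files stay unchanged.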
6. Meta Tag Attacks
- <meta http-equiv="refresh" content="0;url=data:text/html,...">
- <meta http-equiv="refresh" content="0;url=javascript:...">
7. External Resource Loading
- <link> tags with dangerous href values
- <object> tags with data attributes
- <embed> tags with src attributes
- <applet> tags (legacy)
- <iframe> with src or srcdoc containing scripts
8. SVG-Based Attacks
- <svg onload="..."> and other SVG event handlers
- <svg><script>...</script></svg>
- SVG <use> with external references
9. Encoding and Obfuscation
- HTML entity encoding: &lt;script&gt;
- URL encoding: %3Cscript%3E
- UTF-7 encoding attacks
- Null byte injection: <scr\0ipt>
- Unicode variations
10. HTML Comment Exploits
- Conditional comments: <!--[if IE]><script>...<![endif]-->
- Nested comment breaking
Verification Strategy
Test Categories (All Required)
- XSS Attack Vectors
  - Use established XSS test suites (OWASP XSS Filter Evasion Cheat Sheet)
  - Test XSS polyglots that combine multiple techniques
  - Include lesser-known event handlers in tests
- Format Preservation
  - Provide clean HTML files with varied formatting
  - Verify byte-for-byte identical output for clean files
  - Test various whitespace patterns, quote styles, attribute ordering
- Edge Cases
  - Malformed HTML
  - Mixed case tags and attributes
  - Attributes without quotes
  - Multiple encodings in same document
Testing Process
- Research first: Before writing tests, search for:
  - OWASP XSS Prevention Cheat Sheet
  - XSS Filter Evasion Cheat Sheet
  - Known XSS polyglots
  - Browser-specific attack vectors
- Create adversarial tests: Do not rely solely on self-created test cases
  - Use external comprehensive test suites
  - Include vectors that have bypassed filters historically
- Test clean content preservation: Equal priority to security testing
  - Create diverse clean HTML samples
  - Verify no modifications occur
  - Check whitespace, comments, attribute order
Common Pitfalls
1. Incomplete Event Handler Lists
Mistake: Hardcoding only common event handlers like onclick, onload, onerror.
Solution: Research and include ALL valid HTML event handlers, including deprecated and browser-specific ones.
2. Ignoring CSS Attack Vectors
Mistake: Focusing only on JavaScript while ignoring CSS-based XSS.
Solution: Filter <style> tags, dangerous CSS properties, and style attributes with expressions.
3. Missing Protocol Handlers
Mistake: Only filtering the javascript: protocol.
Solution: Also filter vbscript:, data: URIs with dangerous content, and handle encoded protocol names.
4. Format Alteration with Parsers
Mistake: Using HTML parsers when format preservation is required.
Solution: If format preservation is critical, use regex-based surgical removal or verify parser output matches input formatting.
5. Self-Validating Tests
Mistake: Creating test cases that match implementation capabilities rather than real attack vectors.
Solution: Use external, adversarial test suites created by security researchers.
6. Quote and Encoding Handling
Mistake: Not handling HTML entities in attributes (&quot;, &#39;).
Solution: Consider how encoded characters in attributes might bypass filters.
7. Forgetting Meta Refresh
Mistake: Not filtering <meta http-equiv="refresh"> with dangerous URLs.
Solution: Include meta tags in the filtering scope, especially those with data: or javascript: URLs.
8. Ignoring External Resources
Mistake: Not filtering <link>, <object>, <embed> tags.
Solution: Evaluate whether these tags can load or execute dangerous content.
Implementation Checklist
Before considering the implementation complete:
- Researched comprehensive XSS attack vector lists
- Implemented filtering for ALL event handlers (not just common ones)
- Handled script tags and noscript abuse
- Filtered javascript:, vbscript:, and dangerous data: URIs
- Addressed CSS-based attacks (style tags, expressions, bindings)
- Handled meta refresh attacks
- Considered link, object, embed, applet tags
- Handled SVG-based attacks
- Accounted for encoding variations
- Tested with external XSS test suites
- Verified clean HTML files remain unchanged
- Tested format preservation (whitespace, quotes, ordering)
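The last checklist items can be automated with a small harness. The sketch below is an assumption about how such a check might look, not a prescribed tool: check_filter, its parameters, and the default marker list are hypothetical, and the marker check is a coarse smoke test that does not replace running an external XSS suite.

```python
from typing import Callable, Iterable, List


def check_filter(
    filter_fn: Callable[[str], str],
    clean_samples: Iterable[str],
    attack_samples: Iterable[str],
    forbidden: Iterable[str] = ("<script", "javascript:", "onerror", "onload"),
) -> List[str]:
    """Hypothetical harness: return a list of human-readable failures."""
    failures = []
    for sample in clean_samples:
        # Exact comparison catches any whitespace, quote, or ordering change.
        if filter_fn(sample) != sample:
            failures.append(f"clean input modified: {sample!r}")
    markers = [m.lower() for m in forbidden]
    for sample in attack_samples:
        output = filter_fn(sample).lower()
        for marker in markers:
            if marker in output:
                failures.append(f"{marker!r} survived in: {sample!r}")
    return failures
```

When the samples come from files, reading them in binary mode (or with newline="") avoids newline translation hiding a formatting change. An empty result means no check failed, not that the filter is complete.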