HTML Text Cleaner

Last updated: 2026-04-23

Introduction

The HTML Text Cleaner removes HTML tags from text and extracts only the plain-text content.

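Function Description

Title extraction: automatically recognizes <h1> through <h6> tags
Body extraction: identifies the main body content of an article
Redundancy filtering: removes non-text tags such as <script> and <style>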
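
The redundancy filtering described above can be illustrated with a minimal sketch built on Python's standard-library `html.parser`. This is not the operator's actual implementation, which this page does not show; the names `TextExtractor` and `extract_fragments` are hypothetical. It collects text nodes while skipping anything inside `<script>` and `<style>`, and it ignores comments and the doctype because the base parser's handlers for those do nothing by default.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []    # text nodes in document order
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self.fragments.append(data)


def extract_fragments(html):
    """Return the visible text nodes of an HTML string as a list."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


if __name__ == "__main__":
    page = "<h1>Title</h1><script>var x = 1;</script><p>Hello World!</p>"
    print(extract_fragments(page))
```

The sketch returns raw fragments and leaves whitespace handling and joining to a later step, which keeps the tag filtering separate from formatting choices.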

Operator Parameters

Input

Input	Meaning
texts	Array of texts containing HTML content; each element is a string

Output

Output	Meaning
cleaned_text	Array of cleaned texts; each element is a string

Parameters

Parameter	Type	Default	Meaning
separator	str	\n	Separator string
strip	bool	True	Whether to strip whitespace
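
This page does not spell out exactly how `separator` and `strip` interact. The sketch below shows one plausible reading, assuming semantics similar to BeautifulSoup's `get_text(separator, strip)`: with `strip` enabled, each text fragment is trimmed and empty fragments are dropped before the rest are joined with `separator`. The function `join_fragments` is hypothetical and is not part of the operator's API.

```python
def join_fragments(fragments, separator="\n", strip=True):
    """Join text fragments (assumed semantics, not the operator's confirmed behavior)."""
    if strip:
        # Trim each fragment, then drop the ones that became empty.
        fragments = [f.strip() for f in fragments]
        fragments = [f for f in fragments if f]
    return separator.join(fragments)


if __name__ == "__main__":
    parts = ["  Title ", "\n", "Hello World!"]
    print(repr(join_fragments(parts)))
    print(repr(join_fragments(parts, separator=" ", strip=False)))
```

Under this reading, `strip=False` preserves the original whitespace of each fragment, so the layout whitespace between tags survives into the output.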

Usage Example


from __future__ import annotations

import os
import daft
from daft import col

from daft.aihc.common.udf import aihc_udf
from daft.aihc.functions.text.clean_html_tag import CleanHtmlTag

if __name__ == "__main__":
    # Switch to the Ray runner only when DAFT_RUNNER=ray is set.
    if os.getenv("DAFT_RUNNER", "native") == "ray":
        import ray
        ray.init(dashboard_host="0.0.0.0", ignore_reinit_error=True)
        daft.set_runner_ray()
    daft.set_execution_config(actor_udf_ready_timeout=6000, min_cpu_per_task=0)

    # Sample input: one HTML document and one null row.
    samples = {
        "text": [
            """<!-- This is an HTML comment -->
    <!DOCTYPE html>
    <html>
    <head>
        <!-- CSS -->
        <link rel="stylesheet" href="styles.css">
    </head>
    <body>
        <!--
            Multi-line comment #1
            Multi-line comment #2
            Multi-line comment #3
        -->
        <div class="content" id="main-content">
            <p class="text">Hello World!</p>
        </div>
    </body>
    </html>""",
            None,
        ]
    }

    # Operator parameters (these match the documented defaults).
    separator = "\n"
    strip = True

    ds = daft.from_pydict(samples)
    # Apply the operator to the "text" column and store the result in "cleaned_text".
    ds = ds.with_column(
        "cleaned_text",
        aihc_udf(
            CleanHtmlTag,
            construct_args={"separator": separator, "strip": strip},
        )(col("text")),
    )
    ds.show()
