
Extracting Specific Content from Text with Python: Analyzing and Extracting Data from txt Files

2023-05-04 10:15:19 | Internet | Unknown | Development

Extracting specific content from text with Python

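# Print each line whose first [...] tag is exactly [Error].
with open(r"D:\test.txt") as file_object:
    for line in file_object:
        test = ""  # text collected since the most recent "["
        for i in line:
            if i == "]":
                if test == "Error":
                    print(line, end="")
                break  # only the first tag on each line is checked
            if i == "[":
                test = ""
            else:
                test = test + i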
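
The character-by-character scan above can also be written with a regular expression. The following is only a sketch of that alternative, not the original author's method: the helper name `lines_with_first_tag` and the sample lines are hypothetical, and it treats the first complete [...] pair on a line as the tag, so it may behave differently on lines where a stray "]" appears before any "[".

```python
import re

# First complete [...] pair on a line; the group captures the text inside.
FIRST_TAG = re.compile(r"\[([^\[\]]*)\]")


def lines_with_first_tag(lines, tag="Error"):
    """Return the lines whose first [...] tag equals `tag`."""
    matched = []
    for line in lines:
        found = FIRST_TAG.search(line)
        if found and found.group(1) == tag:
            matched.append(line)
    return matched


# Hypothetical sample data standing in for the contents of a log file.
sample = [
    "[Error] disk full\n",
    "[Info] started\n",
    "2023-05-04 [Error] timeout\n",
    "[Info] retry after [Error]\n",
]
for line in lines_with_first_tag(sample):
    print(line, end="")
```

The last sample line is skipped because its first tag is [Info], which mirrors the "stop at the first tag" behavior of the loop version.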

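Analyzing and extracting data from txt files with Python

If the text follows a regular pattern, for example "Name: xxx Card number 12356", it can be extracted with regular expressions using the re module.
If the text is somewhat messy, additional regex rules can usually still pull the values out.
If there is no pattern at all, names can only be recognized with a name dictionary or a knowledge base, while card numbers can be filtered out of the data by each bank's numbering rules and number length.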
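
As an illustration of the first and last cases, here is a minimal sketch. The record layout, the function names, and the "16 to 19 digits" rule are all assumptions made for the example; real card-number rules depend on the issuing bank and should be substituted in.

```python
import re

# Assumed record layout, modeled on the "Name: xxx Card number 12356" example.
RECORD = re.compile(r"Name[:：]\s*(.+?)\s+Card number[:：]?\s*(\d+)")

# Assumed rule for unstructured text: a standalone run of 16 to 19 digits.
CARD_LIKE = re.compile(r"(?<!\d)\d{16,19}(?!\d)")


def extract_records(text):
    """Return (name, card_number) pairs from regularly formatted text."""
    return RECORD.findall(text)


def find_card_like_numbers(text):
    """Return digit runs whose length fits the assumed card-number rule."""
    return CARD_LIKE.findall(text)


# Hypothetical sample inputs.
structured = (
    "Name: Zhang San Card number 6222020200112233445\n"
    "Name: Li Si Card number: 12356"
)
print(extract_records(structured))

messy = "call 13800138000, pay to 6222020200112233445 before 20230504"
print(find_card_like_numbers(messy))
```

The length filter keeps the 19-digit number and drops the shorter phone number and date, which is the kind of screening the paragraph above describes for text with no fixed layout.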

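Which Python libraries can extract text summaries

An article may be plain text, but now that the web is everywhere it is more often HTML. In either format, the summary is usually the opening part of the article and can be extracted by taking a specified number of characters.
II. Plain-text summaries
A plain-text document is just one long string, so extracting a summary from it is easy: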
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Get a summary of the TEXT-format document"""


def get_summary(text, count):
    """Get the first `count` characters from `text`

    >>> text = 'Welcome 这是一篇关于Python的文章'
    >>> get_summary(text, 12) == 'Welcome 这是一篇'
    True
    """
    assert isinstance(text, str)
    return text[0:count]


if __name__ == '__main__':
    import doctest
    doctest.testmod()

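III. HTML summaries
HTML documents contain a large number of markup tags (such as <p>). These tags are markup instructions and usually come in pairs, so naive text truncation breaks the document structure and causes the summary to render incorrectly in the browser.
To truncate the content while respecting the HTML structure, the document has to be parsed. In Python this can be done with the standard library's HTMLParser class (html.parser.HTMLParser in Python 3; the HTMLParser module in Python 2).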
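
To make the structural problem concrete, the sketch below cuts an HTML string at a fixed character count and reports which tags are left open. The class `OpenTagTracker`, the function `unclosed_tags`, and the sample markup are hypothetical names and data for illustration, and the tracker is deliberately simplistic (for instance, it would count a void element such as <br> as open).

```python
from html.parser import HTMLParser


class OpenTagTracker(HTMLParser):
    """Track which tags are still open after parsing a fragment."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the end tag closes the most recently opened tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()


def unclosed_tags(fragment):
    """Return the tags that `fragment` opens but never closes."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    return tracker.open_tags


html = "<div><p>Hello <b>world</b></p></div>"
print(unclosed_tags(html))       # the complete document
print(unclosed_tags(html[:20]))  # naive cut after 20 characters
```

The truncated fragment ends in the middle of the bold text, so every tag opened before the cut is left without its closing partner.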
The simplest summary extractor ignores the HTML tags and keeps only the raw text inside them. The following Python implementation does roughly that:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Get a raw summary of the HTML-format document"""
from html.parser import HTMLParser


class SummaryHTMLParser(HTMLParser):
    """Parse HTML text to get a summary

    >>> text = '<p>Hi guys:</p><p>This is a example using SummaryHTMLParser.</p>'
    >>> parser = SummaryHTMLParser(10)
    >>> parser.feed(text)
    >>> parser.get_summary('...')
    '<p>Higuys:Thi...</p>'
    """

    def __init__(self, count):
        HTMLParser.__init__(self)
        self.count = count
        self.summary = ''

    def feed(self, data):
        """Only accept `str` data"""
        assert isinstance(data, str)
        HTMLParser.feed(self, data)

    def handle_data(self, data):
        # Number of characters still needed to reach `count`
        more = self.count - len(self.summary)
        if more > 0:
            # Remove possible whitespaces in `data`
            data_without_whitespace = ''.join(data.split())
            self.summary += data_without_whitespace[0:more]

    def get_summary(self, suffix='', wrapper='p'):
        return '<{0}>{1}{2}</{0}>'.format(wrapper, self.summary, suffix)


if __name__ == '__main__':
    import doctest
    doctest.testmod()

HTMLParser (or BeautifulSoup and similar libraries) is better suited to complex HTML summary extraction. For the simple extractor above, there is a more concise implementation than SummaryHTMLParser:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Get a raw summary of the HTML-format document"""
import re


def get_summary(text, count, suffix='', wrapper='p'):
    """A simpler implementation (vs `SummaryHTMLParser`).

    >>> text = '<p>Hi guys:</p><p>This is a example using SummaryHTMLParser.</p>'
    >>> get_summary(text, 10, '...')
    '<p>Higuys:Thi...</p>'
    """
    assert isinstance(text, str)
    summary = re.sub(r'<.*?>', '', text)  # key difference: use regex
    summary = ''.join(summary.split())[0:count]
    return '<{0}>{1}{2}</{0}>'.format(wrapper, summary, suffix)


if __name__ == '__main__':
    import doctest
    doctest.testmod()

How to extract specific content from WPS documents with Python
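Python is a very widely used scripting language, and Google is among the companies that use it heavily. It is a capable tool in bioinformatics, statistics, web development, scientific computing, and many other fields. Like other scripting languages such as R and Perl, Python scripts can be run directly from the command line.
Tools/materials
Python; the CMD command line; a Windows operating system
Method/steps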
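1. First download and install Python. Python 3 is not backward compatible with Python 2, and Python 2 reached end of life on January 1, 2020, so install a current Python 3 release.
2. Open a text editor (EditPlus or Notepad++ are recommended, and both recognize Python syntax) and save the file with a .py extension.
On Unix-like systems the first line of the script should be a shebang such as #!/usr/bin/python, which marks the file as an executable Python script. If the Python interpreter is not in /usr/bin, replace the path with the directory that contains it. The line is not needed when the script is run as "python script.py" from CMD.
3. After writing the script, debug it; this can be done directly in EditPlus (search online for the details). Then open the CMD command line. This assumes python has been added to the PATH environment variable; if it has not, search online for how to add it.
4. In CMD, type "python" followed by a space, drag the script file onto the window so that its path is inserted at the cursor, and press Enter to run it.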
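
As a concrete illustration of the steps above, here is a minimal sketch of a script that could be saved as extract_lines.py (a hypothetical file name) and run from CMD as "python extract_lines.py notes.txt keyword". The function names, the two arguments, and the UTF-8 default encoding are assumptions chosen for the example, and it reads plain text files only.

```python
#!/usr/bin/env python3
"""Print the lines of a text file that contain a keyword."""
import sys


def matching_lines(path, keyword, encoding="utf-8"):
    """Return the lines of `path` that contain `keyword`."""
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\n") for line in handle if keyword in line]


def main(argv):
    """Run the extraction; return 0 on success and 1 on any error."""
    if len(argv) != 3:
        print("usage: python extract_lines.py FILE KEYWORD")
        return 1
    try:
        lines = matching_lines(argv[1], argv[2])
    except OSError as error:
        print("cannot read file:", error)
        return 1
    for line in lines:
        print(line)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the file handling in `matching_lines` and the argument handling in `main` means the extraction logic can also be imported and reused from another script.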
