Python Text Analysis and Mining (3): Word Frequency Statistics

云烟 • November 8, 2024, 10:00 AM • Uncategorized

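Functionality:

The previous two articles covered the first and second steps of text analysis and mining: building a corpus and segmenting Chinese text into words (see those articles for details). This article builds on them to compute word frequencies.

Code: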

import codecs
import os
from warnings import simplefilter

import jieba
import pandas

# Silence FutureWarning messages
simplefilter(action='ignore', category=FutureWarning)


# ========== Build the corpus ==========
def Create_corpus(file):
    filePaths = []
    fileContents = []
    # os.walk() visits every directory under `file`
    for root, dirs, files in os.walk(file):
        for name in files:
            # os.path.join() builds the full path of each file; all paths are stored in filePaths
            filePath = os.path.join(root, name)
            filePaths.append(filePath)
            # codecs.open() opens the file as UTF-8 and read() returns its full text;
            # the with block closes the file afterwards
            with codecs.open(filePath, 'r', 'utf-8') as f:
                fileContent = f.read()
            fileContents.append(fileContent)
    # Build the DataFrame corpos with filePaths and fileContents as its two columns
    corpos = pandas.DataFrame({'filePath': filePaths, 'fileContent': fileContents})
    return corpos


# ========== Chinese word segmentation ==========
def Word_segmentation(corpos):
    segments = []
    filePaths = []
    # Iterate over the corpus row by row; each row is a Series keyed by column name
    for index, row in corpos.iterrows():
        filePath = row['filePath']  # file path stored in this row
        fileContent = row['fileContent']  # text content stored in this row
        segs = jieba.cut(fileContent)  # segment the text into words
        for seg in segs:
            segments.append(seg)  # store each word in segments
            filePaths.append(filePath)  # store the matching file path in filePaths
    # Put the words and their file paths into a DataFrame, one row per word
    segmentDataFrame = pandas.DataFrame({'segment': segments, 'filePath': filePaths})
    print(segmentDataFrame)
    return segmentDataFrame


# ========== Word frequency statistics ==========
def Word_frequency(segmentDataFrame):
    # Group by word, count the rows in each group, reset the index, and sort
    # by the count column ("计数" means "count") in descending order.
    # groupby(...).size() gives the same counts as aggregating with np.size.
    segStat = (
        segmentDataFrame.groupby(by="segment")
        .size()
        .reset_index(name="计数")
        .sort_values(by=["计数"], ascending=False)
    )
    print(segStat)
    # Load the stop-word file. Its first line must be the header "stopword",
    # because the column is looked up by that name below.
    stopwords = pandas.read_csv(r"F:\医学大数据课题\AI_SLE\AI_SLE_TWO\userdict.txt", encoding='utf8', index_col=False)
    print(stopwords)
    # .isin() tests whether each word is a stop word; ~ negates the mask,
    # so only words that are not stop words are kept
    fSegStat = segStat[~segStat['segment'].isin(stopwords['stopword'])]
    print(fSegStat)
    return fSegStat


# Raw strings keep the backslashes in Windows paths from being read as escapes
corpos = Create_corpus(r"F:\医学大数据课题\AI_SLE\AI_SLE_TWO\TEST_DATA")
segmentDataFrame = Word_segmentation(corpos)
Word_frequency(segmentDataFrame)
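
To try the counting and filtering steps without the files on the F: drive, the following is a minimal sketch on made-up data. The token list, the file names a.txt and b.txt, the "count" column name, and the in-memory stop-word file are all assumptions for illustration; they stand in for the output of Word_segmentation() and for a real stop-word file with a "stopword" header line.

```python
import io

import pandas as pd

# Hypothetical tokens standing in for the output of Word_segmentation()
segment_df = pd.DataFrame({
    "segment": ["数据", "的", "分析", "数据", "的", "挖掘", "数据"],
    "filePath": ["a.txt"] * 4 + ["b.txt"] * 3,
})

# Count how often each token occurs, most frequent first
seg_stat = (
    segment_df.groupby("segment")
    .size()
    .reset_index(name="count")
    .sort_values(by="count", ascending=False)
)

# In-memory stand-in for a stop-word file whose first line is the header "stopword"
stopword_file = io.StringIO("stopword\n的\n了\n")
stopwords = pd.read_csv(stopword_file)

# Keep only the rows whose token is not a stop word
filtered = seg_stat[~seg_stat["segment"].isin(stopwords["stopword"])]
print(filtered)
```

The stop word 的 disappears from the result while the other tokens keep their counts, which is the same behavior the full script shows on real documents.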

Results:

Chinese word segmentation output

Counts grouped by word

Stop words

Counts grouped by word after stop-word filtering
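
For a quick cross-check of filtered counts like those above, the same idea can be sketched with only the standard library. This swaps pandas for collections.Counter and is not the method used in the article's code; the function name count_words, the sample tokens, and the extra skipping of whitespace-only tokens are assumptions made for illustration.

```python
from collections import Counter


def count_words(tokens, stopwords):
    """Count tokens, skipping stop words and whitespace-only tokens."""
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stopwords
    )
    # most_common() returns (token, count) pairs, highest count first
    return counts.most_common()


# Hypothetical tokens, as jieba.cut() might produce them
tokens = ["数据", "的", "分析", "数据", "的", "挖掘", "数据", " "]
stopwords = {"的", "了"}

result = count_words(tokens, stopwords)
print(result)
```

Keeping the stop words in a set makes each membership test cheap, which matters once the token list grows to whole documents.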

