Hand-Rolling Python Code to Make a Word Cloud

May 2, 2022

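In sociology, public administration, and social psychology research, we often need to analyze text. A routine step is to extract the keywords and their frequencies from the text and draw a word-frequency cloud, also called a word cloud. This post hand-rolls Python code to build a word cloud and to export the keywords together with their frequencies.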

✦

A Note First

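There are plenty of free word cloud tools online, such as 易词云, WordArt, the word cloud tool from 毛线球科技, and 图悦. These online tools work reasonably well, and for small amounts of text they can produce very attractive word clouds.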

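However, these tools all run online, so the amount of text you can feed them is limited. Research corpora tend to be large, and a tool that only handles small inputs may not meet research needs.

Hand-rolling a Python script that you control solves both problems: it builds the word cloud, counts the keywords and their frequencies, and keeps the data in your own hands with no risk of exposing it online. That is why a self-made word cloud is worth the effort.

The rest of this post shows how to hand-roll a word cloud in Python.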

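Install the required software and packages

Required software:

Python 3.6+ and PyCharm

Both were introduced in an earlier post.

Required Python packages:

wordcloud, numpy, jieba, pandas, and Pillow (imported as PIL), plus os and csv, which ship with Python and need no installation

Install each third-party package with:

pip install XXX (replace XXX with the package name)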
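Before running the scripts, it can help to check which of these packages are already importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is made up for this illustration, and the list holds import names (PIL is the import name of Pillow).

```python
import importlib.util

def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]

# Import names of the packages used in this post
required = ["wordcloud", "numpy", "jieba", "pandas", "PIL", "os", "csv"]
missing = missing_packages(required)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as missing can then be installed with pip as described above.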

Step 1

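Prepare the materials

Required material:

A stop-word list

Stop words are words that occur very frequently in text retrieval but contribute nothing to the analysis. Function words and conjunctions are usually treated as stop words, for example “和” (and), “或” (or), and “在” (in/at). Because the word cloud is meant to support an analysis of what the text means, these meaningless words need to be removed before the cloud is built.

Optional materials:

A background image you like (a PNG with its background removed)

A font file you like (TTF format, for example Microsoft YaHei)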
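To illustrate what the stop-word list is used for, here is a minimal sketch of loading stop words and filtering a list of tokens. The function names load_stop_words and remove_stop_words and the sample data are assumptions made for this illustration; they are not part of any library.

```python
def load_stop_words(lines):
    """Build a set of stop words from an iterable of text lines."""
    # Strip whitespace and a possible byte-order mark, then skip empty lines
    cleaned = (line.replace("\ufeff", "").strip() for line in lines)
    return {word for word in cleaned if word}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

# Hypothetical sample data: three stop words and an already segmented sentence
stop_words = load_stop_words(["\ufeff和\n", "或\n", "在\n", "\n"])
tokens = ["乡村", "和", "城市", "在", "发展"]
print(remove_stop_words(tokens, stop_words))
```

With a real stop-word file, an open file object (opened with encoding="utf-8") can be passed to load_stop_words in place of the sample list. A set is used because membership tests on a set stay fast even for long stop-word lists.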

Step 2

Step 3

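Prepare the data file

The hand-rolled Python code in this post reads a CSV data file. The header (column title) of the text column to be analyzed must be content. Alternatively, you can change the column name in the code to match the header in your own data.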
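As an illustration of the expected layout, the sketch below writes a tiny CSV with a content column and reads that column back using only the standard csv module. The rows and the extra id column are hypothetical sample data, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical sample rows; real data would come from your own corpus
rows = [
    {"id": "1", "content": "乡村振兴战略的实施"},
    {"id": "2", "content": "推进农业现代化"},
]

# Write a CSV whose text column is named "content"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "content"])
writer.writeheader()
writer.writerows(rows)

# Read it back and collect only the "content" column
buffer.seek(0)
texts = [row["content"] for row in csv.DictReader(buffer)]
print(texts)
```

Any other columns in the file are simply ignored, so an existing export only needs its text column renamed to content.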

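Step 4

Make a default word cloud with the wordcloud package

This step produces a default word cloud with no background image. A plain-text (.txt) data file is all it needs.

The code is as follows: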

✦

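import jieba
import wordcloud

# Read the raw text
with open("your_text_data.txt", encoding="utf-8") as f:
    text_data = f.read()
# Segment the Chinese text and join the tokens with spaces
cut_text = jieba.lcut(text_data)
text = ' '.join(cut_text)
# font_path must point to a font that contains Chinese glyphs
wc = wordcloud.WordCloud(font_path="msyh.ttf").generate(text)
wc.to_file("wordcloud_output.png")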

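Notes

This minimal version uses only the Jieba segmentation package (jieba) and the word cloud package (wordcloud), so the resulting image is fairly plain. Remember to pass the font path to the WordCloud class; without a font that contains Chinese glyphs, the Chinese words will typically be drawn as empty boxes.

The result is shown below (the layout differs on every run, so two runs almost never produce the same image).

In the word cloud, the font size of a keyword reflects its frequency: the larger the text, the more often the word occurs.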
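To make the link between frequency and size concrete, here is a minimal sketch that counts tokens with collections.Counter and expresses each word's frequency relative to the most frequent word. The token list is hypothetical sample data. The sketch only illustrates the counting; how wordcloud turns frequencies into font sizes also depends on parameters such as relative_scaling and max_font_size.

```python
from collections import Counter

# Hypothetical tokens, as a segmenter such as jieba might return them
tokens = ["乡村", "振兴", "乡村", "发展", "乡村", "振兴"]

# Count how often each word occurs and take the two most frequent
freq = Counter(tokens)
top = freq.most_common(2)
print(top)

# Frequency of each word relative to the most frequent one
max_count = top[0][1]
relative = {word: count / max_count for word, count in freq.items()}
print(relative)
```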

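Step 5

Adjust the look of the word cloud through the WordCloud class options

The WordCloud class accepts a number of parameters that can make the image look better; they are described below. They go in the same place where the font file was set in the previous step, that is, in the WordCloud(...) call.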

✦

The parameters are:

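font_path: path to the font used for the word cloud (OTF or TTF)

width: canvas width, 400 by default; when mask is set, the size of the mask image is used instead

height: canvas height, 200 by default; when mask is set, the size of the mask image is used instead

prefer_horizontal: 0.9 by default; when the value is below 1, the algorithm tries rotating a word if it does not fit horizontally

mask: None by default; when given, it defines the shape of the canvas, and width and height are ignored in favor of the size of the mask

contour_width: when mask is set and contour_width > 0, the outline of the mask is drawn; the larger the value, the thicker the line

contour_color: color of the mask outline

scale: scaling factor between computation and drawing, 1 by default; for a large image, raising scale is faster than enlarging the canvas, but the words may fit more coarsely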

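min_font_size: smallest font size used, 4 by default

max_font_size: largest font size used; if not set, the height of the image is used

max_words: maximum number of words shown, 200 by default

font_step: step size for the font, 1 by default

stopwords: words that are not shown; if not set, the built-in STOPWORDS list is used; ignored when generate_from_frequencies is used

background_color: background color, "black" by default

mode: "RGB" by default; with mode="RGBA" and background_color=None, the background is transparent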

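relative_scaling: how strongly the font size depends on the word frequency, "auto" by default

color_func: None by default; for example, color_func=lambda *args, **kwargs: (255, 0, 0) colors every word red

regexp: regular expression used to split the text into tokens; if not set, r"\w[\w']+" is used; has no effect when generate_from_frequencies is used

collocations: whether to include two-word collocations (bigrams), True by default; has no effect when generate_from_frequencies is used

colormap: the matplotlib colormap from which word colors are drawn, "viridis" by default; has no effect when color_func is set

normalize_plurals: whether to strip a trailing "s" from words so that singular and plural forms are counted together; has no effect when generate_from_frequencies is used

repeat: whether to repeat words and phrases until max_words is reached

include_numbers: whether to include numbers, False by default

min_word_length: minimum number of letters a word must have to be included, 0 by default

collocation_threshold: 30 by default; a bigram must have a collocation score above this value to be counted as a collocation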
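As an illustration of the color_func parameter, the sketch below defines a function that returns a random shade of red as an HSL color string. The name red_shades and the lightness range are assumptions chosen for this example; the argument names follow the keyword arguments that wordcloud passes to a color function, and extra ones are absorbed by **kwargs.

```python
import random

def red_shades(word, font_size, position, orientation,
               random_state=None, **kwargs):
    """Return a random shade of red as an HSL color string."""
    # Use the random generator passed in, or create one if none is given
    rng = random_state if random_state is not None else random.Random()
    lightness = rng.randint(30, 60)
    return f"hsl(0, 100%, {lightness}%)"

# Call the function directly to see what it produces
print(red_shades("示例", font_size=40, position=(0, 0),
                 orientation=None, random_state=random.Random(42)))
```

It would then be passed as WordCloud(color_func=red_shades, ...). When color_func is set, the colormap parameter is ignored.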

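Tips

Python tools accept a large number of named colors, but the names have to be written in English. A fairly comprehensive list of color names found online is reproduced below for reference. Note that colormap expects the name of a matplotlib colormap such as "viridis", whereas color names are what parameters like background_color and contour_color take. The names below appear to be Tk-style names, so check that a given name is accepted before relying on it.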

COLORS = ['snow', 'ghost white', 'white smoke', 'gainsboro', 'floral white', 'old lace', 'linen', 'antique white', 'papaya whip', 'blanched almond', 'bisque', 'peach puff', 'navajo white', 'lemon chiffon', 'mint cream', 'azure', 'alice blue', 'lavender', 'lavender blush', 'misty rose', 'dark slate gray', 'dim gray', 'slate gray', 'light slate gray', 'gray', 'light grey', 'midnight blue', 'navy', 'cornflower blue', 'dark slate blue', 'slate blue', 'medium slate blue', 'light slate blue', 'medium blue', 'royal blue', 'blue', 'dodger blue', 'deep sky blue', 'sky blue', 'light sky blue', 'steel blue', 'light steel blue', 'light blue', 'powder blue', 'pale turquoise', 'dark turquoise', 'medium turquoise', 'turquoise', 'cyan', 'light cyan', 'cadet blue', 'medium aquamarine', 'aquamarine', 'dark green', 'dark olive green', 'dark sea green', 'sea green', 'medium sea green', 'light sea green', 'pale green', 'spring green', 'lawn green', 'medium spring green', 'green yellow', 'lime green', 'yellow green', 'forest green', 'olive drab', 'dark khaki', 'khaki', 'pale goldenrod', 'light goldenrod yellow', 'light yellow', 'yellow', 'gold', 'light goldenrod', 'goldenrod', 'dark goldenrod', 'rosy brown', 'indian red', 'saddle brown', 'sandy brown', 'dark salmon', 'salmon', 'light salmon', 'orange', 'dark orange', 'coral', 'light coral', 'tomato', 'orange red', 'red', 'hot pink', 'deep pink', 'pink', 'light pink', 'pale violet red', 'maroon', 'medium violet red', 'violet red', 'medium orchid', 'dark orchid', 'dark violet', 'blue violet', 'purple', 'medium purple', 'thistle', 'snow2', 'snow3', 'snow4', 'seashell2', 'seashell3', 'seashell4', 'AntiqueWhite1', 'AntiqueWhite2', 'AntiqueWhite3', 'AntiqueWhite4', 'bisque2', 'bisque3', 'bisque4', 'PeachPuff2', 'PeachPuff3', 'PeachPuff4', 'NavajoWhite2', 'NavajoWhite3', 'NavajoWhite4', 'LemonChiffon2', 'LemonChiffon3', 'LemonChiffon4', 'cornsilk2', 'cornsilk3', 'cornsilk4', 'ivory2', 'ivory3', 'ivory4', 'honeydew2', 'honeydew3', 'honeydew4', 
'LavenderBlush2', 'LavenderBlush3', 'LavenderBlush4', 'MistyRose2', 'MistyRose3', 'MistyRose4', 'azure2', 'azure3', 'azure4', 'SlateBlue1', 'SlateBlue2', 'SlateBlue3', 'SlateBlue4', 'RoyalBlue1', 'RoyalBlue2', 'RoyalBlue3', 'RoyalBlue4', 'blue2', 'blue4', 'DodgerBlue2', 'DodgerBlue3', 'DodgerBlue4', 'SteelBlue1', 'SteelBlue2', 'SteelBlue3', 'SteelBlue4', 'DeepSkyBlue2', 'DeepSkyBlue3', 'DeepSkyBlue4', 'SkyBlue1', 'SkyBlue2', 'SkyBlue3', 'SkyBlue4', 'LightSkyBlue1', 'LightSkyBlue2', 'LightSkyBlue3', 'LightSkyBlue4', 'SlateGray1', 'SlateGray2', 'SlateGray3', 'SlateGray4', 'LightSteelBlue1', 'LightSteelBlue2', 'LightSteelBlue3', 'LightSteelBlue4', 'LightBlue1', 'LightBlue2', 'LightBlue3', 'LightBlue4', 'LightCyan2', 'LightCyan3', 'LightCyan4', 'PaleTurquoise1', 'PaleTurquoise2', 'PaleTurquoise3', 'PaleTurquoise4', 'CadetBlue1', 'CadetBlue2', 'CadetBlue3', 'CadetBlue4', 'turquoise1', 'turquoise2', 'turquoise3', 'turquoise4', 'cyan2', 'cyan3', 'cyan4', 'DarkSlateGray1', 'DarkSlateGray2', 'DarkSlateGray3', 'DarkSlateGray4', 'aquamarine2', 'aquamarine4', 'DarkSeaGreen1', 'DarkSeaGreen2', 'DarkSeaGreen3', 'DarkSeaGreen4', 'SeaGreen1', 'SeaGreen2', 'SeaGreen3', 'PaleGreen1', 'PaleGreen2', 'PaleGreen3', 'PaleGreen4', 'SpringGreen2', 'SpringGreen3', 'SpringGreen4', 'green2', 'green3', 'green4', 'chartreuse2', 'chartreuse3', 'chartreuse4', 'OliveDrab1', 'OliveDrab2', 'OliveDrab4', 'DarkOliveGreen1', 'DarkOliveGreen2', 'DarkOliveGreen3', 'DarkOliveGreen4', 'khaki1', 'khaki2', 'khaki3', 'khaki4', 'LightGoldenrod1', 'LightGoldenrod2', 'LightGoldenrod3', 'LightGoldenrod4', 'LightYellow2', 'LightYellow3', 'LightYellow4', 'yellow2', 'yellow3', 'yellow4', 'gold2', 'gold3', 'gold4', 'goldenrod1', 'goldenrod2', 'goldenrod3', 'goldenrod4', 'DarkGoldenrod1', 'DarkGoldenrod2', 'DarkGoldenrod3', 'DarkGoldenrod4', 'RosyBrown1', 'RosyBrown2', 'RosyBrown3', 'RosyBrown4', 'IndianRed1', 'IndianRed2', 'IndianRed3', 'IndianRed4', 'sienna1', 'sienna2', 'sienna3', 'sienna4', 'burlywood1', 
'burlywood2', 'burlywood3', 'burlywood4', 'wheat1', 'wheat2', 'wheat3', 'wheat4', 'tan1', 'tan2', 'tan4', 'chocolate1', 'chocolate2', 'chocolate3', 'firebrick1', 'firebrick2', 'firebrick3', 'firebrick4', 'brown1', 'brown2', 'brown3', 'brown4', 'salmon1', 'salmon2', 'salmon3', 'salmon4', 'LightSalmon2', 'LightSalmon3', 'LightSalmon4', 'orange2', 'orange3', 'orange4', 'DarkOrange1', 'DarkOrange2', 'DarkOrange3', 'DarkOrange4', 'coral1', 'coral2', 'coral3', 'coral4', 'tomato2', 'tomato3', 'tomato4', 'OrangeRed2', 'OrangeRed3', 'OrangeRed4', 'red2', 'red3', 'red4', 'DeepPink2', 'DeepPink3', 'DeepPink4', 'HotPink1', 'HotPink2', 'HotPink3', 'HotPink4', 'pink1', 'pink2', 'pink3', 'pink4', 'LightPink1', 'LightPink2', 'LightPink3', 'LightPink4', 'PaleVioletRed1', 'PaleVioletRed2', 'PaleVioletRed3', 'PaleVioletRed4', 'maroon1', 'maroon2', 'maroon3', 'maroon4', 'VioletRed1', 'VioletRed2', 'VioletRed3', 'VioletRed4', 'magenta2', 'magenta3', 'magenta4', 'orchid1', 'orchid2', 'orchid3', 'orchid4', 'plum1', 'plum2', 'plum3', 'plum4', 'MediumOrchid1', 'MediumOrchid2', 'MediumOrchid3', 'MediumOrchid4', 'DarkOrchid1', 'DarkOrchid2', 'DarkOrchid3', 'DarkOrchid4', 'purple1', 'purple2', 'purple3', 'purple4', 'MediumPurple1', 'MediumPurple2', 'MediumPurple3', 'MediumPurple4', 'thistle1', 'thistle2', 'thistle3', 'thistle4', 'gray1', 'gray2', 'gray3', 'gray4', 'gray5', 'gray6', 'gray7', 'gray8', 'gray9', 'gray10', 'gray11', 'gray12', 'gray13', 'gray14', 'gray15', 'gray16', 'gray17', 'gray18', 'gray19', 'gray20', 'gray21', 'gray22', 'gray23', 'gray24', 'gray25', 'gray26', 'gray27', 'gray28', 'gray29', 'gray30', 'gray31', 'gray32', 'gray33', 'gray34', 'gray35', 'gray36', 'gray37', 'gray38', 'gray39', 'gray40', 'gray42', 'gray43', 'gray44', 'gray45', 'gray46', 'gray47', 'gray48', 'gray49', 'gray50', 'gray51', 'gray52', 'gray53', 'gray54', 'gray55', 'gray56', 'gray57', 'gray58', 'gray59', 'gray60', 'gray61', 'gray62', 'gray63', 'gray64', 'gray65', 'gray66', 'gray67', 'gray68', 'gray69', 
'gray70', 'gray71', 'gray72', 'gray73', 'gray74', 'gray75', 'gray76', 'gray77', 'gray78', 'gray79', 'gray80', 'gray81', 'gray82', 'gray83', 'gray84', 'gray85', 'gray86', 'gray87', 'gray88', 'gray89', 'gray90', 'gray91', 'gray92', 'gray93', 'gray94', 'gray95', 'gray97', 'gray98', 'gray99']

Step 6

A ready-to-use hand-rolled word cloud script with a background image

Without further ado, here is the code:

from wordcloud import WordCloud
from PIL import Image
import numpy as np
import jieba
import pandas as pd
import os
import csv

# Keyword (also the name of the input CSV) and number of keywords to output (default 200)
keyword_set = "乡村振兴"  # "rural revitalization"
keynum = 200

# Input file / word cloud image / keyword and frequency table / segmented text
file = f"./data/{keyword_set}.csv"
ciyun_out = f"./out/ciyun_{keyword_set}.png"
filename_keywords = f"./out/keywords_freq_{keyword_set}.csv"
fenci_out = f"./out/segmented_text_{keyword_set}.txt"

# Stop-word list / background (mask) image / font file
stop_word_file = "./cailiao/stopwords.txt"
back_pic = "./cailiao/china.png"
font_file = "./cailiao/msyh.ttf"

os.makedirs("./out", exist_ok=True)  # make sure the output directory exists

# Dump the "content" column into a temporary text file, one row per line
file_out = file[:-4] + ".txt"
df = pd.read_csv(file, encoding="utf-8", usecols=["content"])
with open(file_out, "w", encoding="utf-8") as f:  # "w" so reruns do not append duplicates
    for line in df.values:
        f.write(str(line[0]) + "\n")

print("-------------- Segmenting text ----------------")
# Load the stop words into a set; strip a possible BOM (\ufeff) and whitespace
with open(stop_word_file, "r", encoding="utf-8") as fr:
    stop_words = {w.replace("\ufeff", "").strip() for w in fr}

with open(file_out, "r", encoding="utf-8") as fr_xyj:
    s = fr_xyj.read()

words = jieba.cut(s)  # pass cut_all=True for full mode
word_dict = {}
kept_words = []
for word in words:
    # Keep words longer than one character that are not stop words
    if len(word) > 1 and word not in stop_words:
        kept_words.append(word)
        word_dict[word] = word_dict.get(word, 0) + 1
word_list = " ".join(kept_words)
sort_words = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)

# Write the top keynum keywords and their frequencies
with open(filename_keywords, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["keyword", "frequency"])
    for row in sort_words[:keynum]:
        print(row)
        writer.writerow(row)

# Save the segmented text
with open(fenci_out, "w", encoding="utf-8") as f9:
    f9.write(word_list + "\n")

mask = np.array(Image.open(back_pic))  # background image used as the mask
wc = WordCloud(background_color="white",
               width=900,   # width and height are ignored when a mask is given
               height=600,
               max_words=200,
               max_font_size=80,
               mask=mask,
               scale=3,
               contour_width=4,
               contour_color="steelblue",
               font_path=font_file).generate(word_list)
wc.to_file(ciyun_out)
os.remove(file_out)  # delete the temporary text file

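The resulting word cloud looks like this:

After the script runs, the out directory also contains the Jieba-segmented text file and a file with the top 200 keywords by frequency.

I have packaged the full code; click below to download it.

Word cloud code and sample data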
