2020-11-26

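Distributed Computing 1126: Spark Text Analysis

Notes on Spark's text analysis features.

For the full code, see Professor Li Feng's course slides, L07.2-Text-Processing-with-Spark.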

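1. Why Big-Data Text Analysis Is Needed

The goal is to process text on a distributed system.

Text data is now everywhere. What makes it different from other data, and what tools does Spark provide for it?

We start with the basic concepts and the workflow.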

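Terminology:

Corpus (plural: corpora).

A corpus is a large collection of texts of interest, for example every editorial published in the People's Daily since its founding. With 2-3 articles per page, one day's content can be stored as a single row (possibly a very long dictionary or list) and written into the system.

Corpora were first used by linguists to study language itself, for example the grammar and usage habits of the 1950s. Corpora have since been built from many angles: economics, politics, statistics, and so on.

A few decades ago this kind of research belonged purely to the humanities. Today computers can do word-frequency counts and more, so people with a science background can join in too.

As for sources: a corpus may already exist at your workplace, you may collect the texts yourself, or you may even build a corpus from scratch.

An n-by-k corpus matrix can be converted into an n-by-m numeric matrix of semantic features. This step is hard and contested: both the interpretation and the setup are fairly subjective, and there is no unified standard. Extracting stable information from it is where statistical models come in.

Text data differs considerably from numeric data.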

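A native English-speaking university student has a vocabulary of over twenty thousand words. An article is an arrangement of those words, and the number of possible arrangements is far too large for a computer to enumerate, so treating every distinct word sequence as its own unit is out of the question.

So we simplify. In practice that means sentence splitting and tokenization (word segmentation).

Text between commas and within paragraphs is split into many small units.

"When we learned judou (punctuating classical texts), we were training ourselves to become the interpreter."

- Professor Li Feng

Splitting text into words loses sentence-level information.

Many of today's language models cannot model word order. "我爱北京天安门" (I love Beijing Tiananmen) and "北京天安门爱我" (Beijing Tiananmen loves me) may look identical to such a model. The same goes for "red, yellow, blue" versus "blue, yellow, red": if the order carries meaning, the model is in trouble. (氦核: in trouble, total trouble)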
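The order problem can be made concrete with a minimal pure-Python sketch. It assumes the two sentences have already been segmented by hand (the token lists below are an illustration, not the lecture's code) and compares them as bags of words:

```python
from collections import Counter

def bag_of_words(tokens):
    """Count tokens while ignoring the order in which they appear."""
    return Counter(tokens)

# Hand-segmented tokens (an assumption made for this illustration)
sentence_a = ["我", "爱", "北京", "天安门"]
sentence_b = ["北京", "天安门", "爱", "我"]

# The two bags are identical even though the sentences are not
same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
same_order = sentence_a == sentence_b
print(same_bag, same_order)
```

Any model that only sees the counts cannot tell these two sentences apart.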

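Chinese is a special case: words are not separated by spaces, so the text must first be segmented into words. Online translation is still poor, mainly because text is genuinely hard to process.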
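To give a feel for what segmentation involves, here is a rough sketch of forward maximum matching, one simple dictionary-based approach. It is not the method used in the lecture or by any particular library, and the tiny dictionary is hypothetical:

```python
def forward_max_match(text, vocabulary, max_len=3):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word matches
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

toy_vocabulary = {"我", "爱", "北京", "天安门"}  # hypothetical dictionary
segmented = forward_max_match("我爱北京天安门", toy_vocabulary)
print(segmented)
```

Real segmenters rely on much larger dictionaries and statistical models, which is why the quality of the underlying corpus matters so much.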

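News: speech recording tools capture and transcribe audio, the transcript is summarized, the summary is filled into a template, and a news flash is produced.

Court transcription: automatic transcription frees up court clerks, whose records can contain errors or bias. A semantic model sits behind it.

Sports: capture what players say and how excited they are. The NBA has already become a competition between statistical models.

"Good text processing tools mean we are no longer helpless in the face of language."

- Professor Li Feng

During stop-word handling, similar or near-equivalent items are also replaced with a common form. Language processing takes a lot of experience. Stanford's NLP group, the Chinese Academy of Sciences, and Harbin Institute of Technology each maintain their own corpora. It is interesting yet tedious work.

Many internet companies offer free APIs that segment text using their own corpora. This is what made today's input methods possible.

Take Haidilao (海底捞) as an example: if I register "Hedilao" (河底捞) as a trademark, is that infringement? (氦核: what kind of example is that, professor... lol)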

2. Spark's Solution

Now let's see how Spark performs the operations above.

For English text: split each row into words, then build a term-frequency matrix.
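As a small pure-Python sketch of the idea (not Spark code), the three sample sentences used later in these notes can be split on whitespace and turned into a term-frequency matrix. Lower-casing and the sorted column order are choices made here for illustration:

```python
from collections import Counter

docs = [
    "Hi I heard about Spark",
    "I wish Java could use case classes",
    "Logistic regression models are neat",
]

# Split each row on whitespace and lower-case it
tokenized = [doc.lower().split() for doc in docs]

# Fixed column order: the sorted vocabulary of the whole collection
vocabulary = sorted({word for words in tokenized for word in words})

# One row per document, one column per vocabulary word
tf_matrix = [[Counter(words)[word] for word in vocabulary] for words in tokenized]

for row in tf_matrix:
    print(row)
```

Each row is mostly zeros, which is why Spark stores such vectors in sparse form.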

HashingTF(inputCol='words', outputCol='rawFeatures')
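HashingTF avoids building a vocabulary by hashing each word straight to a column index. The sketch below shows the hashing trick in plain Python. It uses `zlib.crc32` as a stand-in hash (Spark's HashingTF uses MurmurHash3), so the indices will not match Spark's output:

```python
import zlib

def hashing_tf(tokens, num_features=20):
    """Map tokens to a fixed-length count vector via hashing."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in; Spark's HashingTF uses MurmurHash3
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

vec = hashing_tf(["the", "romance", "of", "wills", "the"])
print(vec)
```

With a small `num_features`, different words can land in the same column (a collision), so the vector length trades memory against accuracy.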

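Another tool is IDF (inverse document frequency). Large volumes of online content have to be filtered every day, and most of it is uninformative.

Suppose that among three thousand emails, two contain both "explosion" and "gun". That co-occurrence is a strong signal, so its weight should be amplified, and IDF is how this is done: terms that appear in nearly every document (the common words we do not care about) are down-weighted, while terms that appear in only a few documents are up-weighted.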
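The weighting can be checked with a few lines of Python. This sketch assumes the smoothed formula given in Spark's documentation for IDF, log((N + 1) / (df + 1)), where N is the number of documents and df is the number of documents containing the term. The email counts are taken from the example above:

```python
import math

def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

rare = idf(3000, 2)       # a term found in 2 of 3000 emails
common = idf(3000, 3000)  # a term found in every email
print(rare, common)
```

A term that appears in every document gets weight zero, while the rare term gets a weight close to 7, so multiplying term frequencies by IDF pushes the rare but telling words to the front.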

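As another example, political news can move crude oil prices. Suppose I want to run a regression with political news as a covariate. Google proposed a way to turn text into vectors, word2vec (Word2VecModel in Spark), which also carries a notion of signal strength. A shallow two-layer neural network maps the words to numeric vectors.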
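Once every word has a vector, a whole document still has to be reduced to a single row before it can enter a regression. Spark's Word2VecModel does this by averaging the word vectors of the document. The sketch below shows only that averaging step, with made-up three-dimensional vectors (training the vectors is out of scope here):

```python
def average_vector(tokens, word_vectors):
    """Average the word vectors of a document, component by component."""
    vectors = [word_vectors[token] for token in tokens]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

# Hypothetical word vectors, chosen only to make the arithmetic easy to follow
toy_vectors = {"oil": [1.0, 0.0, 2.0], "price": [3.0, 2.0, 0.0]}
doc_vector = average_vector(["oil", "price"], toy_vectors)
print(doc_vector)
```

The resulting fixed-length vector is what would be used as the covariate.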

StopWordsRemover removes stop words.
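Conceptually, stop-word removal is a filter against a list. The sketch below is plain Python with a tiny hypothetical stop-word list. Spark's StopWordsRemover ships with its own, much longer default lists:

```python
STOP_WORDS = {"the", "of", "and", "a", "is"}  # tiny hypothetical list

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form appears in the stop-word list."""
    return [token for token in tokens if token.lower() not in stop_words]

filtered = remove_stop_words(["The", "Romance", "of", "Wills", "and", "Testaments"])
print(filtered)
```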

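Some newer ideas:

Widening the context:

Take a word as the center and combine it with its left and right neighbors. Joining two adjacent words gives a bi-gram; extending by one more word gives a tri-gram, and so on (n-grams in general). The further you extend, the sparser the counts become and the less reliable the information, so n cannot grow without bound. Common choices are 2 or 3, which capture local dependence. The term-frequency matrix is then rebuilt on these units.

我爱北京天安门 (I love Beijing Tiananmen)

1 我  2 爱  3 北京  4 天安门

5 我爱  6 爱北京  7 我爱北京 (5, 6, and 7 are centered on 爱)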
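The example can be reproduced with a short sliding-window function. This is a plain-Python sketch; Spark offers an NGram transformer in `pyspark.ml.feature` for the same purpose, which is not used here:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ["我", "爱", "北京", "天安门"]
bigrams = ["".join(gram) for gram in ngrams(tokens, 2)]
trigrams = ["".join(gram) for gram in ngrams(tokens, 3)]
print(bigrams)
print(trigrams)
```

A sentence with four tokens yields three bi-grams but only two tri-grams, a small illustration of how counts thin out as n grows.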

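LDA (Latent Dirichlet Allocation)

Items are grouped together by similarity. The underlying idea is the Dirichlet process, often described with the Chinese restaurant process metaphor. Ha.

If every customer in the restaurant is a word, then statistical clustering tools can automatically sort documents into topics.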
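The restaurant metaphor is easy to simulate. The sketch below seats customers one at a time: each new customer starts a new table with probability proportional to `alpha`, or joins an existing table with probability proportional to its current size. It illustrates the metaphor only and is not how Spark's LDA is implemented; `alpha` and `seed` are parameters chosen for this example:

```python
import random

def chinese_restaurant_process(num_customers, alpha=1.0, seed=0):
    """Return the table sizes after seating customers one at a time."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers at table k
    for seated in range(num_customers):
        # Total weight: 'seated' for the existing tables plus alpha for a new one
        r = rng.uniform(0, seated + alpha)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if r < cumulative:
                tables[k] += 1  # join an existing table
                break
        else:
            tables.append(1)    # open a new table
    return tables

table_sizes = chinese_restaurant_process(100)
print(len(table_sizes), table_sizes)
```

Large tables attract more customers, so a few big clusters and many small ones emerge without the number of clusters being fixed in advance.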

Topic modeling is now maturing as a field.

3. Hands-on

(Skipping the step that creates sc.)

textFile = sc.textFile("test.txt")
textFile.first()
## Output:
## 'Title: The Romance of Wills and Testaments'


type(textFile)
## Output:
## pyspark.rdd.RDD


# Drop blank lines
print(textFile.count())
text0 = textFile.filter(lambda x: len(x) > 1)  # keep only lines longer than one character
print(text0.count())
## Output:
## 1129
## 492


from pyspark.ml.feature import HashingTF, IDF, Tokenizer
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenizer
## Output:
## Tokenizer_39744db6461e


# Note the expected data format
'''
sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])
'''
# Build the DataFrame with the SparkSession created earlier (ss)
lines = text0.collect()  # collect once and reuse the result
sentenceData = ss.createDataFrame(
    [(float(i), line) for i, line in enumerate(lines)],  # (label, sentence) tuples
    ['label', 'sentence'])
sentenceData
## DataFrame[label: double, sentence: string]


# Apply the tokenizer
wordsData = tokenizer.transform(sentenceData)
wordsData.show()
## Output:
## +-----+--------------------+--------------------+
## |label|            sentence|               words|
## +-----+--------------------+--------------------+
## |  0.0|Title: The Romanc...|[title:, the, rom...|
## |  1.0|Author: Edgar Vin...|[author:, edgar, ...|
## |  2.0|             PREFACE|           [preface]|
## |  3.0|By way of preface...|[by, way, of, pre...|
## |  4.0|As in death, so i...|[as, in, death,, ...|
## |  5.0|Different types a...|[different, types...|
## |  6.0|It is desired to ...|[it, is, desired,...|
## |  7.0|Again, there are ...|[again,, there, a...|
## |  8.0|Especial acknowle...|[especial, acknow...|
## |  9.0|The idea of this ...|[the, idea, of, t...|
## | 10.0|Since these essay...|[since, these, es...|
## | 11.0|Scattered about t...|[scattered, about...|
## | 12.0|Other references ...|[other, reference...|
## | 13.0|       E. VINE HALL.|   [e., vine, hall.]|
## | 14.0|          Wimbledon.|        [wimbledon.]|
## | 15.0|           CHAPTER I|        [chapter, i]|
## | 16.0|THE ROMANCE OF WILLS|[the, romance, of...|
## | 17.0|��The older I gro...|[��the, older, i,...|
## | 18.0|The words of the ...|[the, words, of, ...|
## | 19.0|Historically they...|[historically, th...|
## +-----+--------------------+--------------------+
## only showing top 20 rows


hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
featurizedData.show()
## Output:
## +-----+--------------------+--------------------+--------------------+
## |label|            sentence|               words|         rawFeatures|
## +-----+--------------------+--------------------+--------------------+
## |  0.0|Title: The Romanc...|[title:, the, rom...|(20,[6,10,11,12,1...|
## |  1.0|Author: Edgar Vin...|[author:, edgar, ...|(20,[1,2,7,8],[1....|
## |  2.0|             PREFACE|           [preface]|      (20,[2],[1.0])|
## |  3.0|By way of preface...|[by, way, of, pre...|(20,[1,2,3,4,6,7,...|
## |  4.0|As in death, so i...|[as, in, death,, ...|(20,[0,1,2,3,4,5,...|
## |  5.0|Different types a...|[different, types...|(20,[0,1,3,4,5,6,...|
## |  6.0|It is desired to ...|[it, is, desired,...|(20,[0,1,2,3,4,5,...|
## |  7.0|Again, there are ...|[again,, there, a...|(20,[0,1,2,3,4,5,...|
## |  8.0|Especial acknowle...|[especial, acknow...|(20,[0,1,2,3,4,5,...|
## |  9.0|The idea of this ...|[the, idea, of, t...|(20,[0,1,2,3,4,5,...|
## | 10.0|Since these essay...|[since, these, es...|(20,[0,1,2,3,5,6,...|
## | 11.0|Scattered about t...|[scattered, about...|(20,[1,2,3,4,5,6,...|
## | 12.0|Other references ...|[other, reference...|(20,[0,1,3,4,5,6,...|
## | 13.0|       E. VINE HALL.|   [e., vine, hall.]|(20,[0,2,18],[1.0...|
## | 14.0|          Wimbledon.|        [wimbledon.]|      (20,[8],[1.0])|
## | 15.0|           CHAPTER I|        [chapter, i]|(20,[9,16],[1.0,1...|
## | 16.0|THE ROMANCE OF WILLS|[the, romance, of...|(20,[6,11,15,17],...|
## | 17.0|��The older I gro...|[��the, older, i,...|(20,[0,1,2,3,5,6,...|
## | 18.0|The words of the ...|[the, words, of, ...|(20,[0,1,2,3,4,5,...|
## | 19.0|Historically they...|[historically, th...|(20,[0,1,3,4,5,7,...|
## +-----+--------------------+--------------------+--------------------+
## only showing top 20 rows



# alternatively, CountVectorizer can also be used to get term frequency vectors
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
## Output:
## +-----+--------------------+
## |label|            features|
## +-----+--------------------+
## |  0.0|(20,[6,10,11,12,1...|
## |  1.0|(20,[1,2,7,8],[0....|
## |  2.0|(20,[12],[0.07923...|
## |  3.0|(20,[2],[1.060380...|
## |  4.0|(20,[12],[0.07923...|
## |  5.0|(20,[1,2,3,4,6,7,...|
## |  6.0|(20,[12],[0.07923...|
## |  7.0|(20,[0,1,2,3,4,5,...|
## |  8.0|(20,[12],[0.07923...|
## |  9.0|(20,[0,1,3,4,5,6,...|
## | 10.0|(20,[12],[0.07923...|
## | 11.0|(20,[0,1,2,3,4,5,...|
## | 12.0|(20,[12],[0.07923...|
## | 13.0|(20,[0,1,2,3,4,5,...|
## | 14.0|(20,[12],[0.07923...|
## | 15.0|(20,[0,1,2,3,4,5,...|
## | 16.0|(20,[12],[0.07923...|
## | 17.0|(20,[0,1,2,3,4,5,...|
## | 18.0|(20,[12],[0.07923...|
## | 19.0|(20,[0,1,2,3,5,6,...|
## +-----+--------------------+
## only showing top 20 rows


# word2Vec
from pyspark.ml.feature import Word2Vec
# Word2Vec maps each word to a fixed-size vector; a document's vector is the average of its word vectors.
# Input data: Each row is a bag of words from a sentence or document.
documentDF = ss.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

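'''textsplit = []
## Very slow: collect() runs again on every iteration, so run with care
for i in range(len(text0.collect())-1):
    textsplit.append((text0.collect()[i].split(' '),))
textsplit'''
# Faster alternative to the loop above: split each line inside the RDD
textsplit = text0.map(lambda x:x.split(' ') )
textsplit
## Output (this listing matches the list built by the commented-out loop; the RDD itself displays only its description):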
## [(['Title:', 'The', 'Romance', 'of', 'Wills', 'and', 'Testaments'],),
##  (['Author:', 'Edgar', 'Vine', 'Hall'],),
##  (['PREFACE'],),
##  (['By',
##    'way',
##    'of',
##    'preface',
##    'it',
##    'is',
##    'necessary',
##    'to',
##    'explain',
##    'the',
##    'sources',
##    'from',
##    'which'...


# word2vec
from pyspark.ml.feature import Word2Vec
docDF = ss.createDataFrame(list(
    zip(textsplit.collect())
)).toDF("text")
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(docDF)

result = model.transform(docDF)
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (", ".join(text), str(vector)))
## Output:
## Text: [Title:, The, Romance, of, Wills, and, Testaments] => 
## Vector: [0.07510965237660067,0.10043827923280851,0.19377085047640968]
## 
## Text: [Author:, Edgar, Vine, Hall] => 
## Vector: [-0.07176287146285176,-0.10652108257636428,-0.012149352580308914]
## 
## Text: [PREFACE] => 
## Vector: [0.006923258304595947,-0.002445697784423828,0.12241993099451065]
## 
## Text: [By, way, of, preface, it, is, necessary, to, explain, the, sources, from, which, the, material, for, the, following, pages, is, taken., The, chief, feature, of, these, essays, consists,, I, think,, in, the, large, amount, of, original, matter, rescued, from, the, multitudinous, MS., volumes, of, wills,, &c.,, which, are, preserved, at, Somerset, House, and, elsewhere.] => 
## Vector: [0.030354710200939463,0.10534696077445038,0.05815259900582195]

(氦核: viewed locally, the code I pasted is actually neatly formatted, QAQ)

(To be continued)

Permalink: https://konelane.github.io/2020/11/26/201126hadoop/

-- EOF --

When reposting, please credit the source. Attribution-NonCommercial-NoDerivs 3.0 International (CC BY-NC-ND 3.0)
