Kata Zen: 007 Group Keyword and Counting for Top N

#!/usr/bin/env python

# http://katazen.blogspot.com/2014/04/006-python-cookbook-01-09-finding-common.html
# http://katazen.blogspot.com/2014/04/006-python-cookbook-01-12-most.html

# http://stackoverflow.com/questions/743806/split-string-into-a-list-in-python
# http://www.saltycrane.com/blog/2007/09/how-to-sort-python-dictionary-by-keys/

Requirements:

Given one text section, select its top 10 keywords, keeping only those that occur at least three times.
Even with two text sections, it should still be easy to merge their count statistics.
The result should be easy to redirect into formatted output, such as CSV for Weka.

Sample output (dictionary key order may differ between Python versions):

---------- run 1: select top 10, shows 3 times at least
{'google': 3, 'sources': 4, 'when': 4, 'to': 3, 'news': 6, 'the': 4}
---------- run 2: select top 10, shows 3 times at least
{'of': 4, 'the': 4}
---------- Sum
{'google': 3, 'to': 3, 'of': 4, 'when': 4, 'sources': 4, 'news': 6, 'the': 8}
---------- CSV
the,8
news,6
when,4
sources,4
of,4
to,3
google,3
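
The CSV listing above could also be produced with the standard `csv` module, which quotes any token that happens to contain a comma. This is only a minimal sketch for Python 3: the helper name `write_counts_csv` and its optional `header` parameter are hypothetical, not part of the original script, and whether your Weka import needs a header row is an assumption to check.

```python
import csv
import sys

def write_counts_csv(counts, stream, header=None):
    """Hypothetical helper: write token,count rows with the highest count first.

    Ties are broken in reverse alphabetical order, matching the listing above.
    """
    writer = csv.writer(stream, lineterminator="\n")
    if header is not None:
        writer.writerow(header)  # optional header row, e.g. ["keyword", "count"]
    rows = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    for token, count in rows:
        writer.writerow([token, count])

# Sample data copied from the "Sum" line of the output above.
sample = {'google': 3, 'to': 3, 'of': 4, 'when': 4,
          'sources': 4, 'news': 6, 'the': 8}
write_counts_csv(sample, sys.stdout)
```

Redirecting the script's standard output to a file then gives a CSV file directly.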

from collections import Counter

def parseText2TokenList(stringin):
    # Split on whitespace, strip surrounding punctuation, and lowercase each token.
    for i in stringin.split():
        i = i.strip(';,.\'').lower()
        if i:  # skip tokens that were punctuation only
            yield i

def dumpTokenBy_TopN_CountAtLeastM(listin, topN, countM):
    # Yield (token, count) for the topN most common tokens seen at least countM times.
    for (key, value) in Counter(listin).most_common(topN):
        if value >= countM:
            yield (key, value)

def sumValueByKeyOnDict(dicta, dictb):
    # Merge two {token: count} dicts, adding the counts of keys present in both.
    dictOut = dict(dicta)
    for (key, value) in dictb.items():
        dictOut[key] = dictOut.get(key, 0) + value
    return dictOut

if __name__ == "__main__":

    articalText1 = """As a news aggregator site, Google uses its own software to
determine which stories to show from the online news sources it watches.
Human editorial input does come into the system, however, in choosing
exactly which sources Google News will pick from. This is where some of
the controversy over Google News originates, when some news sources are
included when visitors feel they don't deserve it, and when other news
sources are excluded when visitors feel they ought to be included.
For examples, see the above mentions of Indymedia, or National Vanguard."""

    articalText2 = """The actual list of sources is not known outside of Google.
The stated information from Google is that it watches more than 4,500
English-language news sites. In the absence of a list, many independent
sites have come up with their own ways of determining Google's news sources
, as in the chart below."""

    # print(...) with a single argument behaves the same on Python 2 and 3
    print("---------- run 1: select top 10, shows 3 times at least")
    art1WordList = list(parseText2TokenList(articalText1))
    art1TopToken = dict(dumpTokenBy_TopN_CountAtLeastM(art1WordList, 10, 3))
    print(art1TopToken)

    print("---------- run 2: select top 10, shows 3 times at least")
    art2WordList = list(parseText2TokenList(articalText2))
    art2TopToken = dict(dumpTokenBy_TopN_CountAtLeastM(art2WordList, 10, 3))
    print(art2TopToken)

    print("---------- Sum")
    artSum = sumValueByKeyOnDict(art1TopToken, art2TopToken)
    print(artSum)

    # dump for post-process: highest count first, ties in reverse alphabetical order
    print("---------- CSV")
    for key, value in sorted(artSum.items(), key=lambda kv: (kv[1], kv[0]), reverse=True):
        print("%s,%s" % (key, value))
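
One thing to note about the Sum step: it adds only the tokens that survived each run's filter. In the sample, "google" occurs twice in the second text but falls below the threshold there, so the Sum line reports 3 for it. If the raw token lists are still available, an alternative is to add up the raw counts first and filter afterwards. The following is only a sketch of that idea for Python 3; the helper name `merged_top` and the two tiny sample lists are hypothetical, chosen so the difference is easy to see.

```python
from collections import Counter

def merged_top(token_lists, top_n, min_count):
    """Hypothetical helper: sum raw counts across lists, then filter once."""
    total = Counter()
    for tokens in token_lists:
        total.update(tokens)  # Counter.update adds counts instead of replacing them
    return [(tok, n) for tok, n in total.most_common(top_n) if n >= min_count]

# Hypothetical token lists, not taken from the article texts.
first = ["news", "news", "google", "google", "google"]
second = ["news", "google", "google", "chart"]

print(merged_top([first, second], 10, 3))
```

With these lists, filtering each list first and summing afterwards would keep only "google" with a count of 3, while merging first keeps "google" with 5 and "news" with 3. Which behavior is wanted depends on whether the threshold is meant per text or across all texts.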

Tuesday, April 8, 2014