# http://katazen.blogspot.com/2014/04/006-python-cookbook-01-09-finding-common.html
# http://katazen.blogspot.com/2014/04/006-python-cookbook-01-12-most.html
# http://stackoverflow.com/questions/743806/split-string-into-a-list-in-python
# http://www.saltycrane.com/blog/2007/09/how-to-sort-python-dictionary-by-keys/
Requirements:
- There is one text section and it needs to select Top 10 keywords, which occurs at least thrice.
- Even there is two text sections, it is still easy to merge the meta statistic.
- It should be friendly to redirect to formatted output, such as CSV for Weka.
| ---------- run 1: select top 10, shows 3 times at least {'google': 3, 'sources': 4, 'when': 4, 'to': 3, 'news': 6, 'the': 4} ---------- run 2: select top 10, shows 3 times at least {'of': 4, 'the': 4} ---------- Sum {'google': 3, 'to': 3, 'of': 4, 'when': 4, 'sources': 4, 'news': 6, 'the': 8} ---------- CSV the,8 news,6 when,4 sources,4 of,4 to,3 google,3 |
| from collections import Counter def parseText2TokenList(stringin): for i in stringin.split(): i = i.strip(';,.\'').lower() yield i def dumpTokenBy_TopN_CountAtLeastM(listin, topN, countM): for (key, value) in Counter(listin).most_common(topN): if value >= countM: yield (key, value) def sumValueByKeyOnDict(dicta, dictb): dictOut = dict() for i in dicta.keys(): if dictb.has_key(i): dictOut[i] = dicta[i] + dictb[i] else: dictOut[i] = dicta[i] for i in dictb.keys(): if not dicta.has_key(i): dictOut[i] = dictb[i] return dictOut if __name__ == "__main__": articalText1 = """As a news aggregator site, Google uses its own software to determine which stories to show from the online news sources it watches. Human editorial input does come into the system, however, in choosing exactly which sources Google News will pick from. This is where some of the controversy over Google News originates, when some news sources are included when visitors feel they don't deserve it, and when other news sources are excluded when visitors feel they ought to be included. For examples, see the above mentions of Indymedia, or National Vanguard.""" articalText2 = """The actual list of sources is not known outside of Google. The stated information from Google is that it watches more than 4,500 English-language news sites. In the absence of a list, many independent sites have come up with their own ways of determining Google's news sources , as in the chart below.""" print "---------- run 1: select top 10, shows 3 times at least" art1WordList = list(parseText2TokenList(articalText1)) art1TopToken = dict(dumpTokenBy_TopN_CountAtLeastM(art1WordList, 10, 3)) print art1TopToken print "---------- run 2: select top 10, shows 3 times at least" art2WordList = list(parseText2TokenList(articalText2)) art2TopToken = dict(dumpTokenBy_TopN_CountAtLeastM(art2WordList, 10, 3)) print art2TopToken print "---------- Sum" artSum = sumValueByKeyOnDict(art1TopToken, art2TopToken) print artSum # dump for post-process print "---------- CSV" for key, value in reversed(sorted(artSum.iteritems(), key=lambda (k,v): (v,k))): print "%s,%s" % (key, value) |
沒有留言:
張貼留言