First, a quick description of some popular algorithms and implementations for text summarization that exist today.

The summarization module in gensim implements TextRank, an unsupervised algorithm based on weighted graphs, from a paper by Mihalcea et al. It was added by another incubator student, Olavur Mortensen (see his previous post on this blog). TextRank is built on top of the popular PageRank algorithm that Google used for ranking webpages. It works as follows:

1. Pre-process the text: remove stop words and stem the remaining words.
2. Create a graph where vertices are sentences.
3. Connect every sentence to every other sentence by an edge. The weight of the edge is how similar the two sentences are.
4. Run the PageRank algorithm on the graph.
5. Pick the vertices (sentences) with the highest PageRank score.

In the original TextRank, the weight of an edge between two sentences is the percentage of words appearing in both of them. Gensim’s TextRank instead uses the Okapi BM25 function to measure how similar the sentences are, an improvement from a paper by Barrios et al.
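
The TextRank steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not gensim's implementation: the names `STOP_WORDS`, `tokenize`, `overlap_similarity`, `pagerank` and `summarize` are hypothetical, stemming is left out, the stop-word list is a toy one, and the edge weight is a log-normalized count of shared words (one way to write the word-overlap measure) rather than BM25.

```python
import math
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on", "it", "that", "for"}


def tokenize(sentence):
    """Lowercase, split into words, and drop stop words (stemming is omitted in this sketch)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def overlap_similarity(words_a, words_b):
    """Number of shared words, normalized by the log lengths of both sentences (assumed measure)."""
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:  # both sentences contain a single word
        return 0.0
    return len(set(words_a) & set(words_b)) / denominator


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over a square weight matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to the edge weight.
            incoming = sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, top_n=2):
    """Return the top_n highest-ranked sentences, kept in their original order."""
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Fully connected graph: every pair of distinct sentences gets a similarity weight.
    weights = [
        [0.0 if i == j else overlap_similarity(tokens[i], tokens[j]) for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


example = [
    "The cat sat on the mat near the door.",
    "A cat and a dog sat on the mat.",
    "The dog chased the cat near the door.",
    "Stock prices fell sharply yesterday.",
]
print(summarize(example, top_n=2))
```

Swapping `overlap_similarity` for a BM25-based score would be the only change needed to move this sketch toward the variant that gensim uses; the graph construction and ranking steps stay the same.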
