Due date: Wednesday, November 19, 2025.
In this exercise, we will create a knowledge graph that represents the content of a text document. The graph will have words (tokens, stems) as vertices, each with an assigned score that reflects its importance in the document. Edges connect words that appear in the same sentence, with a weight that represents how strong the connection between the words is. We will use the graph to find the most important words in the document based on their frequency, as a way to characterize the content of the document.
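To make the structure concrete, here is a minimal sketch, not the provided Word_Graph class, of one way such a graph could be stored in Python; all names here are illustrative.

# Illustrative structure only: vertex scores in one dictionary, edge weights
# in another keyed by word pairs.
class ToyWordGraph:
    def __init__(self):
        self.scores = {}    # word -> importance score
        self.weights = {}   # (word1, word2) -> edge weight

    def add_word(self, word):
        # Every occurrence of a word raises its score.
        self.scores[word] = self.scores.get(word, 0) + 1

    def add_pair(self, word1, word2):
        # Words from the same sentence get an edge; repeated co-occurrence
        # makes the connection stronger.
        key = tuple(sorted((word1, word2)))
        self.weights[key] = self.weights.get(key, 0) + 1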
Download the following files:
make_graph.py: contains a class that reads information from the files and uses the graph class to build the knowledge graph.
word_graph.py: contains the graph class specific to this problem.
porter_stemmer.py: contains functions to stem a word and reduce it to its root or stem.
stop_words.txt: contains a list of common words that are usually not indexed by search engines because they are too common.
frequent_words.txt: contains the 1000 most common words in the English language according to Project Gutenberg, listed in order of frequency.
three_piggies.txt: contains a short story that can be used to test the program. Source: https://reedsy.com/short-stories/bedtime/.
In the class Word_Graph, implement the function print_top that prints the information for the vertices with the highest scores in the graph. You will need to sort the vertices by score and then output as many of the top ones as requested.
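As a starting point, here is a hedged sketch of print_top, assuming the graph keeps its vertices in a dictionary self.vertices whose values carry word and score attributes; adapt the names to the actual Word_Graph class.

# Sketch of a method to add inside Word_Graph; attribute names are assumptions.
def print_top(self, count):
    # Sort the vertices by score, highest first.
    ranked = sorted(self.vertices.values(), key=lambda v: v.score, reverse=True)
    for vertex in ranked[:count]:
        print(vertex.word, vertex.score)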
In the same class, implement the function output_file that receives a file name and outputs the graph data to it. We will use the simplest pairwise file format. The file will contain a line for each edge in the graph showing the following data:
name1 name2 weight score
where the weight is the weight of the edge and the score is the score of the first vertex. The values are separated by a tab ('\t').
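A possible sketch of output_file, assuming the edges are available as (vertex1, vertex2, weight) triples in self.edges and each vertex carries word and score attributes; the real internals of Word_Graph may differ.

# Sketch of a method to add inside Word_Graph; storage details are assumptions.
def output_file(self, filename):
    with open(filename, "w", encoding="utf-8") as out:
        for vertex1, vertex2, weight in self.edges:
            # One line per edge: name1, name2, edge weight, score of the first vertex.
            out.write(f"{vertex1.word}\t{vertex2.word}\t{weight}\t{vertex1.score}\n")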
In the class Make_Graph, implement the function read_index_file that receives the name of a file containing a text document to be indexed. The function should read the document, split it into sentences, reduce each word to its stem while skipping the stop words, add the resulting words as vertices of the graph, and connect the words that appear together in the same sentence.
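Below is a rough sketch of the indexing step; it assumes a stem() helper from porter_stemmer, a set self.stop_words loaded from stop_words.txt, and Word_Graph methods add_vertex and add_edge, none of which are guaranteed to match the names in the provided files.

import re
from porter_stemmer import stem  # assumed name; use the real function from porter_stemmer.py

# Sketch of a method to add inside Make_Graph; all helper names are assumptions.
def read_index_file(self, filename):
    with open(filename, "r", encoding="utf-8") as infile:
        text = infile.read()
    # Naive sentence split on ., !, or ?; the provided code may be more careful.
    for sentence in re.split(r"[.!?]", text):
        # Keep only alphabetic tokens, lowercased.
        words = re.findall(r"[a-z]+", sentence.lower())
        # Drop stop words, then reduce the remaining words to their stems.
        stems = [stem(word) for word in words if word not in self.stop_words]
        for i, word1 in enumerate(stems):
            self.graph.add_vertex(word1)
            # Connect every pair of words that share this sentence.
            for word2 in stems[i + 1:]:
                self.graph.add_edge(word1, word2)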
Find a small to medium-sized document to test the program with. It can be anything you find interesting: a story, a poem, song lyrics, a news article, and so on. Use the program to build a graph from it and save the graph to a file.
We will use the visualization tool Graphia from graphia.app. If you are using a Windows computer, you can download and install it. Otherwise, you can use it directly from the web site.
To open the web application, go to web.graphia.app. You can go through the tutorial to get familiar with the tool or skip it. From the File menu, open one of the graph files you created with the program.
After playing around with the visualization and deciding whether it helps you understand the data better, take a screenshot of one of the visualizations and save it as a png file to upload with your submission.
Upload all the Python files you have modified, the text file you chose for the test, the resulting graph files, and the screenshot of the Graphia visualization on Canvas under Assignments - Homework 10.