Autocomplete

The questions below are due on Sunday July 21, 2024; 10:00:00 PM.
 
You are not logged in.

Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.

1) Preparation

The following zip contains code and other resources as a starting point for this exercise: u6_alternate_words.zip

All of your changes should be made to lab.py, which you will submit at the end of this exercise. Importantly, you should not add any imports to the file. As in the previous exercises, we provide you with a test.py script to help you verify the correctness of your code.

Note that passing all of the tests on the server will require that your code runs reasonably efficiently.

2) Introduction

Type "aren't you" into a search engine and you'll get a handful of search suggestions, ranging from "aren't you clever?" to "aren't you a little short for a stormtrooper?". If you've ever done a web search, you've probably seen an autocompletion—a handy list of words that pops up under your search, guessing at what you were about to type.

In this exercise, we are going to implement our own version of an autocomplete engine using a tree structure called a trie, as described in this document. We'll first to create a class to represent a generic trie data structure. We'll extensively leverage Python's "dunder" (double underscore) or "magic" methods: specially-named methods which are called when Pythonic syntax is used to interact with objects. We will then use the trie to write an autocompletion machine.

2.1) The Trie Data Structure

A trie1, also known as a prefix tree, is a type of search tree that stores an associative array (a mapping from keys to values). In a trie, the keys are always ordered sequences. The trie stores keys organized by their prefixes (their first characters), with longer prefixes given by successive levels of the trie. Each node optionally contains a value to be associated with that node's prefix.

As an example, consider a trie constructed as follows:

t = Trie()
t['bat'] = True
t['bar'] = True
t['bark'] = True

This trie would look like the following (Fig. 1).

Placeholder for Diagram 37972a72e649b35a62f926d6e4c9de96
Fig. 1

One important thing to notice is that the keys associated with each node are not actually stored in the nodes themselves. Rather, they are associated with the edges connecting the nodes.

We'll start by implementing a class called Trie to represent tries in Python. This class will include facilities for adding, deleting, modifying, and getting all the key/value pairs. For example, consider:

>>> t = Trie()
>>> t['bat'] = True
>>> t['bar'] = True
>>> t['bark'] = True
>>>
>>> t['bat']
True
>>> t['something']
Traceback (most recent call last):
    ...
KeyError
>>>
>>> t['bark'] = 20
>>> t['bark']
20
>>>
>>> t.items()
[('bat', True), ('bar', True), ('bark', 20)]

>>>
>>> del t['bar']
>>>
>>> t.items()
[('bat', True), ('bark', 20)]

Note that, in terms of interface and functionality, the Trie class will have a lot in common with a Python dictionary. However, the representation we're using "under the hood" has some nice features that make it well-suited for tasks like autocompletion that use prefix-based lookups.

3) Trie class and basic methods

In lab.py, you are responsible for implementing the Trie class, which should support the following methods.

__init__( self )

Initialize self to be an object with exactly two instance variables:

  • value, the value associated with the sequence ending at this node. Initial value is None (we will assume that a value of None means that a given key has no value associated with it, not that the value None is associated with it).

  • children, a dictionary mapping length-1 strings to another trie node, i.e., the next level of the trie hierarchy (tries are a recursive data structure). Initial value is an empty dictionary.

__setitem__( self, key, value )

Add the given key to the trie, associating it with the given value. For the trie node that marks the end of the key, set that node's value attribute to be given value argument. This method doesn't return a value. This is a special method name used by Python to implement subscript assignment. For example, x[k] = v is translated by Python into x.__setitem__(k, v).

Examples (using the trie structure from the picture above):

  • t = Trie() would create the root node of the example trie above.

  • t['bat'] = True adds three nodes (representing the 'b', 'ba', and 'bat' prefixes), and associates the value True with the node corresponding to 'bat'.

  • t['bark'] = True adds two new nodes for prefixes 'bar' and 'bark' shown on the bottom right of the trie, setting the value of the last node to True.

  • t['bar'] = True doesn't add any nodes and only sets the value of the first node added above when inserting "bark" to True.

__getitem__( self, key )

Return the value associated with the given key. This is a special method name used by Python to implement subscripting (indexing). For example, x[k] is translated by Python into x.__getitem__(k). It is expected that the trie node descended from the self trie will be located corresponding to key and the value associated with that node will be returned, or a KeyError will be raised if the key cannot be found in the trie.

Examples (using the example trie from above):

  • t['bar'] should return True.

  • t['apple'] should raise a KeyError since the given key does not exist in the trie.

  • t['ba'] should also raise a KeyError since, even though the key 'ba' is represented in the trie, it has no value associated with it.

__delitem__( self, key )

Disassociate the given key from its value. This is a special method name used by Python to implement index deletion. For example, del x[k] is translated by Python into x.__delitem__(k). Trying to delete a key that has no associated value should raise a KeyError.

Examples (using the example trie from above):

  • del t["bar"] should disassociate "bar" from its value in the trie, so that subsequent lookups of t["bar"] produce a KeyError.

  • del t["foo"] should raise a KeyError since "foo" does not exist as a key in the example trie.

For the purposes of this exercise, you only need to do the bare minimum so that the key is no longer associated with a value (don't worry about extra memory usage after deleting the key).

__contains__( self, key )

Return True if key occurs and has a value other than None in the trie. __contains__ is the special method name used by Python to implement the in operator. For example,

k in x

is translated to

x.__contains__(k)

Hint: At first glance, the code for this method might look very similar to some of the other methods above. Make good use of helper functions to avoid repetitious code!

Examples (using the example trie from above):

  • "ba" in t returns False since that interior node has no value associated with it.

  • "bar" in t returns True (not because the value associated with 'bar' is True, but because 'bar' has a value associated with it at all.

  • "barking" in t returns False since "barking" can't be found in trie.

items( self )

Return a list of (key, value) tuples for each key stored in the trie.2 The pairs can be listed in any order. Hint: You may want to generate this list recursively using the items from child nodes.

Example (using the example trie from above):

  • t.items() returns a list containing the tuples ('bat', True), ('bar', True), and ('bark', True) (in any order).

After implementing each of these methods, you should pass all the tests in the Test_1_Trie suite of test.py.

4) Autocomplete

Now, let's implement autocompletion as a function described below:

autocomplete( trie, prefix, max_count=None )

trie is an instance of Trie, prefix is a string, max_count is an integer or None. Return a list of the max_count most-frequently-occurring keys that start with prefix. In the case of a tie, you may output any of the most-frequently-occurring keys. If there are fewer than max_count valid keys available starting with prefix, return only as many as there are. The returned list may be in any order. If max_count is not specified, your list should contain all keys that start with prefix.

Return [] if prefix is not in the trie.

Examples (using the example trie in Fig. 2 below):

  • autocomplete(t, "ba", 1) returns ['bat'].

  • autocomplete(t, "ba", 2) might return either ['bat', 'bark'] or ['bat', 'bar'] since "bark" and "bar" occur with equal frequency.

  • autocomplete(t, "be", 1) returns [].

Placeholder for Diagram ed0229eac3b017728a9762c0e00bb502
Fig. 2

Note: To make full use of your Trie data structure, you should not use a "brute-force" method that involves generating and/or looping over all words in the trie. Instead, you should only loop through the words which begin with the given prefix. That is, you should descend your Trie along the prefix, then examine all the completions of that prefix stored at that (sub)trie. Doing otherwise will prove much slower.

After implementing this function, you should pass all the tests in the Test_2_AutoComplete suite of test.py. You may find it interesting to examine the test cases there—they create Tries from real texts.

5) Code Submission

Submit your lab.py below:

Please upload your code here. This question will be manually graded for correctness (passing all the provided test cases in under 10 seconds) and good code style. Remember to use succinct, descriptive variable names, docstrings and comments to explain confusing lines.
 No file selected


 
Footnotes

1Different people have different opinions about whether this data structure's name should be pronounced like "tree" or like "try". It originally comes from the middle syllable of the word "retrieval," which suggests one pronunciation, but some prefer to say it like "try" to avoid confusion with general "tree" structures in programming. Some even say "tree as in try," but that's kind of a mouthful... (click to return to text)

2An alternate implementation may instead implement this functionality as the iter dunder method, which returns a generator object, and allows you to directly iterate over the object. (click to return to text)