Autocomplete
Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.
1) Preparation
The following zip contains code and other resources as a starting point for this exercise: u6_alternate_words.zip
All of your changes should be made to lab.py
, which you will submit at the
end of this exercise. Importantly, you should not add any imports to the file. As in the previous exercises, we provide you with a test.py
script to help you
verify the correctness of your code.
Note that passing all of the tests on the server will require that your code runs reasonably efficiently.
2) Introduction
Type "aren't you" into a search engine and you'll get a handful of search suggestions, ranging from "aren't you clever?" to "aren't you a little short for a stormtrooper?". If you've ever done a web search, you've probably seen an autocompletion—a handy list of words that pops up under your search, guessing at what you were about to type.
In this exercise, we are going to implement our own version of an autocomplete engine using a tree structure called a trie, as described in this document. We'll first to create a class to represent a generic trie data structure. We'll extensively leverage Python's "dunder" (double underscore) or "magic" methods: specially-named methods which are called when Pythonic syntax is used to interact with objects. We will then use the trie to write an autocompletion machine.
2.1) The Trie Data Structure
A trie1, also known as a prefix tree, is a type of search tree that stores an associative array (a mapping from keys to values). In a trie, the keys are always ordered sequences. The trie stores keys organized by their prefixes (their first characters), with longer prefixes given by successive levels of the trie. Each node optionally contains a value to be associated with that node's prefix.
As an example, consider a trie constructed as follows:
t = Trie()
t['bat'] = True
t['bar'] = True
t['bark'] = True
This trie would look like the following (Fig. 1).
37972a72e649b35a62f926d6e4c9de96
One important thing to notice is that the keys associated with each node are not actually stored in the nodes themselves. Rather, they are associated with the edges connecting the nodes.
We'll start by implementing a class called Trie
to represent tries in Python.
This class will include facilities for adding, deleting, modifying, and
getting all the key/value pairs. For example, consider:
>>> t = Trie()
>>> t['bat'] = True
>>> t['bar'] = True
>>> t['bark'] = True
>>>
>>> t['bat']
True
>>> t['something']
Traceback (most recent call last):
...
KeyError
>>>
>>> t['bark'] = 20
>>> t['bark']
20
>>>
>>> t.items()
[('bat', True), ('bar', True), ('bark', 20)]
>>>
>>> del t['bar']
>>>
>>> t.items()
[('bat', True), ('bark', 20)]
Note that, in terms of interface and functionality, the Trie
class will have
a lot in common with a Python dictionary. However, the representation we're
using "under the hood" has some nice features that make it well-suited for
tasks like autocompletion that use prefix-based lookups.
3) Trie class and basic methods
In lab.py
, you are responsible for implementing the Trie
class, which
should support the following methods.
__init__( self )
Initialize self to be an object with exactly two instance variables:
-
value
, the value associated with the sequence ending at this node. Initial value isNone
(we will assume that a value ofNone
means that a given key has no value associated with it, not that the valueNone
is associated with it). -
children
, a dictionary mapping length-1 strings to another trie node, i.e., the next level of the trie hierarchy (tries are a recursive data structure). Initial value is an empty dictionary.
__setitem__( self, key, value )
Add the given key
to the trie, associating it with the given value
. For
the trie node that marks the end of the key, set that node's value
attribute
to be given value
argument. This method doesn't return a value. This is a
special method name used by Python to implement subscript assignment. For
example, x[k] = v
is translated by Python into x.__setitem__(k, v)
.
-
t = Trie()
would create the root node of the example trie above. -
t['bat'] = True
adds three nodes (representing the'b'
,'ba'
, and'bat'
prefixes), and associates the valueTrue
with the node corresponding to'bat'
. -
t['bark'] = True
adds two new nodes for prefixes'bar'
and'bark'
shown on the bottom right of the trie, setting the value of the last node toTrue
. -
t['bar'] = True
doesn't add any nodes and only sets the value of the first node added above when inserting "bark" toTrue
.
__getitem__( self, key )
Return the value associated with the given key. This is a special method name
used by Python to implement subscripting (indexing). For example, x[k]
is
translated by Python into x.__getitem__(k)
. It is expected that the trie node
descended from the self
trie will be located corresponding to key
and the value associated with
that node will be returned, or a KeyError
will be raised if the key cannot be found in the
trie.
Examples (using the example trie from above):
-
t['bar']
should returnTrue
. -
t['apple']
should raise aKeyError
since the given key does not exist in the trie. -
t['ba']
should also raise aKeyError
since, even though the key'ba'
is represented in the trie, it has no value associated with it.
__delitem__( self, key )
Disassociate the given key from its value. This is a special method name
used by Python to implement index deletion. For example, del x[k]
is
translated by Python into x.__delitem__(k)
. Trying to delete a key that has no
associated value should raise a KeyError
.
Examples (using the example trie from above):
-
del t["bar"]
should disassociate"bar"
from its value in the trie, so that subsequent lookups oft["bar"]
produce aKeyError
. -
del t["foo"]
should raise aKeyError
since"foo"
does not exist as a key in the example trie.
For the purposes of this exercise, you only need to do the bare minimum so that the key is no longer associated with a value (don't worry about extra memory usage after deleting the key).
__contains__( self, key )
Return True
if key
occurs and has a value other than None
in the
trie. __contains__
is the special method name used by Python to implement the
in
operator. For example,
k in x
is translated to
x.__contains__(k)
Hint: At first glance, the code for this method might look very similar to some of the other methods above. Make good use of helper functions to avoid repetitious code!
Examples (using the example trie from above):
-
"ba" in t
returnsFalse
since that interior node has no value associated with it. -
"bar" in t
returnsTrue
(not because the value associated with'bar'
isTrue
, but because'bar'
has a value associated with it at all. -
"barking" in t
returnsFalse
since"barking"
can't be found in trie.
items( self )
Return a list of (key, value)
tuples for each
key stored in the trie.2 The pairs can be listed in any order.
Hint: You may want to generate this list recursively using the items from child nodes.
Example (using the example trie from above):
t.items()
returns a list containing the tuples('bat', True)
,('bar', True)
, and('bark', True)
(in any order).
After implementing each of these methods, you should pass all the tests in the Test_1_Trie
suite of test.py
.
4) Autocomplete
Now, let's implement autocompletion as a function described below:
autocomplete( trie, prefix, max_count=None )
trie
is an instance of Trie
, prefix
is a string, max_count
is an
integer or None
. Return a list of the max_count
most-frequently-occurring keys that start with prefix
. In the case of a
tie, you may output any of the most-frequently-occurring keys. If there are
fewer than max_count
valid keys available starting with prefix
, return
only as many as there are. The returned list may be in any order. If
max_count
is not specified, your list should contain all keys that start
with prefix
.
Return []
if prefix
is not in the trie.
Examples (using the example trie in Fig. 2 below):
-
autocomplete(t, "ba", 1)
returns['bat']
. -
autocomplete(t, "ba", 2)
might return either['bat', 'bark']
or['bat', 'bar']
since "bark" and "bar" occur with equal frequency. -
autocomplete(t, "be", 1)
returns[]
.
ed0229eac3b017728a9762c0e00bb502
Note: To make full use of your Trie data structure, you should not use a "brute-force" method that involves generating and/or looping over all words in the trie. Instead, you should only loop through the words which begin with the given prefix. That is, you should descend your Trie along the prefix, then examine all the completions of that prefix stored at that (sub)trie. Doing otherwise will prove much slower.
After implementing this function, you should pass all the tests in the Test_2_AutoComplete
suite of test.py
. You may find it interesting to examine the test cases there—they create Trie
s from real texts.
5) Code Submission
Submit your lab.py
below:
Footnotes
1Different people have different opinions about whether this data structure's name should be pronounced like "tree" or like "try". It originally comes from the middle syllable of the word "retrieval," which suggests one pronunciation, but some prefer to say it like "try" to avoid confusion with general "tree" structures in programming. Some even say "tree as in try," but that's kind of a mouthful...
2An alternate implementation may instead implement this functionality as the iter dunder method, which returns a generator object, and allows you to directly iterate over the object.