Ratings
Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.
As an intern for Office Shack, a new chain of office supply stores, you are assigned to the market research team which will determine what products get placed on store shelves. You've been assigned the task of researching the quality of potential products based on user rating.
To help accomplish your task, your manager sent you this large ZIP FOLDER, which contains the files you will
need for this problem and the Ratings Analysis problem.
Inside the resources folder is a large CSV file called office_products.csv
which contains a real data set of Amazon office product ratings
(source.) Each row of this file
has four columns: item, user, rating, and timestamp.
Printing the first five rows of the file outputs:
['0140503528', 'A2WJLOXXIB7NF3', '3.0', '1162512000']
['0140503528', 'A1RKICUK0GG6VF', '5.0', '1147132800']
['0140503528', 'A1QA5E50M398VW', '5.0', '1142035200']
['0140503528', 'A3N0HBW8IP8CZQ', '5.0', '980294400']
['0140503528', 'A1K1JW1C5CUSUZ', '5.0', '964915200']
Note that the item and user columns are alphanumeric strings. The rating column represents a user review from 1 to 5 stars (5 being the highest rating.) Finally, the timestamp column is in Unix time, which represents time as the number of seconds since January 1st, 1970. In addition to the large office_products.csv
file, there are also smaller testing files in the resources filer. Notably,
tiny_office_products.csv
has only ten rows and is written in a more human
readable format. Printing out the rows of the tiny products CSV outputs:
['stapler', 'user1', '5', '980294400']
['desk', 'user1', '4', '1142035200']
['rubber bands', 'user2', '5', '1147132800']
['desk', 'user2', '5', '964915200']
['desk', 'user3', '4', '1147132800']
['desk', 'user4', '5', '1142035200']
['stapler', 'user2', '1', '964915200']
['stapler', 'user1', '1', '1162512000']
['rubber bands', 'user4', '5', '1142035200']
['stapler', 'user1', '5', '980294400']
1) Parse and Map
Before we can analyze our data, we want to organize it in a form that will make it
easier to analyze which products are the most (or least) popular. To accomplish
this, fill in the definitions of the parse_reviews
, make_user_ratings
, and
make_item_ratings
functions in the ratings.py
file.
After parsing the tiny office reviews, you should end up with a set that contains the following nine tuples (in any order):
{
('desk', 'user1', 4.0, 1142035200),
('stapler', 'user1', 5.0, 980294400),
('desk', 'user2', 5.0, 964915200),
('stapler', 'user1', 1.0, 1162512000),
('stapler', 'user2', 1.0, 964915200),
('rubber bands', 'user2', 5.0, 1147132800),
('desk', 'user4', 5.0, 1142035200),
('desk', 'user3', 4.0, 1147132800),
('rubber bands', 'user4', 5.0, 1142035200)
}
Note that the first and last reviews in the tiny office file are the same, so
the duplicate ['stapler', 'user1', '5', '980294400']
row was removed.
However, it is still hard to make sense of the data in this form, especially if we want to analyze user reviews or item ratings for a specific product.
Taking the set of parsed reviews as input to the make_user_ratings
function
should produce the following dictionary (again, the order of the keys and
ratings within the lists do not matter):
{'user2': [5.0, 1.0, 5.0],
'user1': [4.0, 1.0, 5.0],
'user4': [5.0, 5.0],
'user3': [4.0]}
Compared to the raw format, it is now much easier to tell that user1
has
made 3 reviews, and rated the products with 4, 1, and 5 stars. Similarly, calling make_item_ratings
on the set of tiny reviews should produce
the following dictionary:
{'rubber bands': [5.0, 5.0],
'desk': [4.0, 5.0, 4.0, 5.0],
'stapler': [1.0, 1.0, 5.0]}
Now we can see easily that the rubber bands are highly rated, while the stapler has mixed reviews.
Like the Read States problem in the
previous unit, we have provided a test.py
file which you may wish to
use to test your functions before you upload them here. After
correctly implementing make_item_ratings
and make_user_ratings
you
should pass the tests associated with those functions in the test.py
file.
When you are confident that these functions work, upload it below:
Note to get the test.py
file to work you may need to run pip install dill
first, which will install the package used to compress the expected results
make_item_ratings
and make_user_ratings
functions
2) Filter
Next, we'll practice using functions as first class objects. First implement filter5
,
which is an example of a filter function that can be provided as the filter
argument to filter_ratings
.
For example, given the item_ratings
from the tiny reviews, filter5(item_ratings, "desk")
should return False
because the desk only has four reviews. Applying this filter
to all the keys by calling filter_ratings(item_ratings, filter5)
on the tiny
item ratings returns an empty dictionary since no item has five or more reviews.
In the next assignment, we'll work with the larger datasets, where most items
have very few ratings and filter the ratings to help us narrow down our analysis
to the most popular items.
After correctly implementing these two functions, your code should pass the test_tiny_filter
and test_filter_ratings
test cases. Upload your code below
when you are ready for the server to test your filter_ratings
function:
filter_ratings
function
3) Find Best
Finally, we want to be able to select and analyze items according to some criteria, for example the item that has the most reviews might be one of the most popular items.
Implement larger_num_ratings
to compare two items and find the one with more
reviews. (Don't worry about ties; the function can return either key in such cases.) For example, calling larger_num_ratings(item_ratings, 'rubber bands', 'desk')
using the tiny item ratings dictionary should return 'desk'
,
since the desk has four reviews and the rubber bands only have two.
larger_num_ratings
is an example of a comparison function that can be passed
into find_best
. After implementing find_best
correctly, you should be able
to see that find_best(item_ratings, larger_num_ratings)
will find that the
desk has the most reviews.
After implementing these functions correctly you should pass all the tests in
the test.py
file (yay!) Upload your code below so the server can also check
your code.
Next Exercise: Ratings Analysis