Ratings

The questions below are due on Sunday July 14, 2024; 10:00:00 PM.
 

As an intern for Office Shack, a new chain of office supply stores, you are assigned to the market research team, which determines what products get placed on store shelves. Your task is to research the quality of potential products based on user ratings.

To help accomplish your task, your manager sent you a large ZIP folder, which contains the files you will need for this problem and the Ratings Analysis problem. Inside the resources folder is a large CSV file called office_products.csv, which contains a real data set of Amazon office product ratings (source). Each row of this file has four columns: item, user, rating, and timestamp.

Printing the first five rows of the file outputs:

['0140503528', 'A2WJLOXXIB7NF3', '3.0', '1162512000']
['0140503528', 'A1RKICUK0GG6VF', '5.0', '1147132800']
['0140503528', 'A1QA5E50M398VW', '5.0', '1142035200']
['0140503528', 'A3N0HBW8IP8CZQ', '5.0', '980294400']
['0140503528', 'A1K1JW1C5CUSUZ', '5.0', '964915200']

Note that the item and user columns are alphanumeric strings. The rating column represents a user review from 1 to 5 stars (5 being the highest rating). Finally, the timestamp column is in Unix time, which represents time as the number of seconds since midnight UTC on January 1st, 1970.

In addition to the large office_products.csv file, there are also smaller testing files in the resources folder. Notably, tiny_office_products.csv has only ten rows and is written in a more human-readable format. Printing the rows of the tiny products CSV outputs:

['stapler', 'user1', '5', '980294400']
['desk', 'user1', '4', '1142035200']
['rubber bands', 'user2', '5', '1147132800']
['desk', 'user2', '5', '964915200']
['desk', 'user3', '4', '1147132800']
['desk', 'user4', '5', '1142035200']
['stapler', 'user2', '1', '964915200']
['stapler', 'user1', '1', '1162512000']
['rubber bands', 'user4', '5', '1142035200']
['stapler', 'user1', '5', '980294400']
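
To make the row format and the Unix timestamps concrete, here is a minimal sketch. It assumes the CSV is comma-delimited with no header row, and it reads from an in-memory string (SAMPLE_CSV) rather than the real file; the helper to_utc_date is hypothetical and not part of ratings.py.

```python
import csv
import io
from datetime import datetime, timezone

# In-memory stand-in for the first two rows of tiny_office_products.csv.
# Assumption: the real file is comma-delimited and has no header row.
SAMPLE_CSV = "stapler,user1,5,980294400\ndesk,user1,4,1142035200\n"

# csv.reader yields each row as a list of strings, like the output above.
rows = list(csv.reader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(row)


def to_utc_date(timestamp):
    """Convert a Unix timestamp (string or int) to a UTC calendar date."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date()


# The timestamp is the fourth column (index 3) of each row.
first_date = to_utc_date(rows[0][3])
print(first_date)
```

Every value comes out of the CSV reader as a string, which is why the ratings and timestamps need converting before any analysis.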

1) Parse and Map

Before we can analyze our data, we want to organize it in a form that will make it easier to analyze which products are the most (or least) popular. To accomplish this, fill in the definitions of the parse_reviews, make_user_ratings, and make_item_ratings functions in the ratings.py file.

After parsing the tiny office reviews, you should end up with a set that contains the following nine tuples (in any order):

{
('desk', 'user1', 4.0, 1142035200),
('stapler', 'user1', 5.0, 980294400),
('desk', 'user2', 5.0, 964915200),
('stapler', 'user1', 1.0, 1162512000),
('stapler', 'user2', 1.0, 964915200),
('rubber bands', 'user2', 5.0, 1147132800),
('desk', 'user4', 5.0, 1142035200),
('desk', 'user3', 4.0, 1147132800),
('rubber bands', 'user4', 5.0, 1142035200)
}

Note that the first and last reviews in the tiny office file are the same, so the duplicate ['stapler', 'user1', '5', '980294400'] row was removed.
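
The conversion and de-duplication described above can be sketched as follows. This is one possible approach, not the reference solution: rows_to_reviews is a hypothetical helper that takes rows already read from the CSV, whereas the real parse_reviews in ratings.py may take different arguments.

```python
def rows_to_reviews(rows):
    """Turn rows of four strings into a set of typed review tuples.

    Each tuple is (item, user, rating as float, timestamp as int).
    Because the result is a set, identical rows collapse into one entry.
    """
    reviews = set()
    for item, user, rating, timestamp in rows:
        reviews.add((item, user, float(rating), int(timestamp)))
    return reviews


# Three sample rows; the first and last are identical.
sample_rows = [
    ['stapler', 'user1', '5', '980294400'],
    ['desk', 'user1', '4', '1142035200'],
    ['stapler', 'user1', '5', '980294400'],
]
sample_reviews = rows_to_reviews(sample_rows)
print(len(sample_reviews))
```

Tuples are used rather than lists because set elements must be hashable.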

However, it is still hard to make sense of the data in this form, especially if we want to analyze user reviews or item ratings for a specific product.

Passing the set of parsed reviews to the make_user_ratings function should produce the following dictionary (again, the order of the keys and of the ratings within each list does not matter):

{'user2': [5.0, 1.0, 5.0],
'user1': [4.0, 1.0, 5.0],
'user4': [5.0, 5.0],
'user3': [4.0]}

Compared to the raw format, it is now much easier to tell that user1 has written three reviews, rating products with 4, 1, and 5 stars. Similarly, calling make_item_ratings on the set of tiny reviews should produce the following dictionary:

{'rubber bands': [5.0, 5.0],
'desk': [4.0, 5.0, 4.0, 5.0],
'stapler': [1.0, 1.0, 5.0]}

Now we can see easily that the rubber bands are highly rated, while the stapler has mixed reviews.
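
Both mappings follow the same pattern: pick one field of each review as the key and collect the ratings under it. The sketch below illustrates that pattern with a hypothetical helper, group_ratings, which is not part of ratings.py; it assumes each review is a tuple of (item, user, rating, timestamp) as shown earlier.

```python
from collections import defaultdict

# The nine parsed reviews from the tiny file, as listed earlier.
TINY_REVIEWS = {
    ('desk', 'user1', 4.0, 1142035200),
    ('stapler', 'user1', 5.0, 980294400),
    ('desk', 'user2', 5.0, 964915200),
    ('stapler', 'user1', 1.0, 1162512000),
    ('stapler', 'user2', 1.0, 964915200),
    ('rubber bands', 'user2', 5.0, 1147132800),
    ('desk', 'user4', 5.0, 1142035200),
    ('desk', 'user3', 4.0, 1147132800),
    ('rubber bands', 'user4', 5.0, 1142035200),
}


def group_ratings(reviews, key_index):
    """Map the field at key_index of each review to a list of its ratings."""
    grouped = defaultdict(list)
    for review in reviews:
        # The rating is the third field (index 2) of every review tuple.
        grouped[review[key_index]].append(review[2])
    return dict(grouped)


user_ratings = group_ratings(TINY_REVIEWS, 1)  # keyed by user
item_ratings = group_ratings(TINY_REVIEWS, 0)  # keyed by item
print(user_ratings)
print(item_ratings)
```

Since sets are unordered, the order of the ratings inside each list may differ from run to run, which is why the expected output is described as order-independent.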

As in the Read States problem in the previous unit, we have provided a test.py file that you may wish to use to test your functions before you upload them here. After correctly implementing make_item_ratings and make_user_ratings, you should pass the tests associated with those functions in the test.py file. When you are confident that these functions work, upload your file below:

Note: to get the test.py file to work, you may need to run pip install dill first, which installs the package used to serialize the expected results.

Upload your code here to test your make_item_ratings and make_user_ratings functions

2) Filter

Next, we'll practice using functions as first-class objects. First, implement filter5, which is an example of a filter function that can be provided as the filter argument to filter_ratings.

For example, given the item_ratings from the tiny reviews, filter5(item_ratings, "desk") should return False because the desk has only four reviews. Applying this filter to all the keys by calling filter_ratings(item_ratings, filter5) on the tiny item ratings returns an empty dictionary, since no item has five or more reviews. In the next assignment, we'll work with the larger datasets, where most items have very few ratings, and we'll filter the ratings to narrow our analysis down to the most popular items.
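
The idea of passing a function as an argument can be sketched as follows. The call shapes match the examples above, but the parameter names and the exact behavior expected by test.py are assumptions, so treat this as an illustration rather than the reference solution.

```python
def filter5(ratings, key):
    """Return True if the given key has five or more ratings."""
    return len(ratings[key]) >= 5


def filter_ratings(ratings, filter_func):
    """Keep only the entries for which filter_func(ratings, key) is True."""
    return {
        key: values
        for key, values in ratings.items()
        if filter_func(ratings, key)
    }


# The tiny item ratings dictionary shown earlier.
item_ratings = {
    'rubber bands': [5.0, 5.0],
    'desk': [4.0, 5.0, 4.0, 5.0],
    'stapler': [1.0, 1.0, 5.0],
}

desk_passes = filter5(item_ratings, 'desk')
popular = filter_ratings(item_ratings, filter5)
print(desk_passes, popular)
```

Note that filter5 is passed by name, without parentheses; filter_ratings is the one that calls it, once per key.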

After correctly implementing these two functions, your code should pass the test_tiny_filter and test_filter_ratings test cases. Upload your code below when you are ready for the server to test your filter_ratings function:

Upload your code here to test your filter_ratings function

3) Find Best

Finally, we want to be able to select and analyze items according to some criterion. For example, the item that has the most reviews might be one of the most popular items.

Implement larger_num_ratings to compare two items and find the one with more reviews. (Don't worry about ties; the function can return either key in such cases.) For example, calling larger_num_ratings(item_ratings, 'rubber bands', 'desk') using the tiny item ratings dictionary should return 'desk', since the desk has four reviews and the rubber bands only have two.

larger_num_ratings is an example of a comparison function that can be passed into find_best. After implementing find_best correctly, calling find_best(item_ratings, larger_num_ratings) should identify the desk as the item with the most reviews.
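
One way to picture how a comparison function drives the search is the sketch below. The call shapes follow the examples above; the parameter names, the tie-breaking choice, and returning None for an empty dictionary are assumptions, not requirements stated by the assignment.

```python
def larger_num_ratings(ratings, key1, key2):
    """Return whichever key has more ratings (key2 on a tie)."""
    if len(ratings[key1]) > len(ratings[key2]):
        return key1
    return key2


def find_best(ratings, compare):
    """Return the key that wins every pairwise comparison made by compare."""
    best = None
    for key in ratings:
        # The first key starts as the best; each later key challenges it.
        best = key if best is None else compare(ratings, best, key)
    return best


# The tiny item ratings dictionary shown earlier.
item_ratings = {
    'rubber bands': [5.0, 5.0],
    'desk': [4.0, 5.0, 4.0, 5.0],
    'stapler': [1.0, 1.0, 5.0],
}

winner = larger_num_ratings(item_ratings, 'rubber bands', 'desk')
most_reviewed = find_best(item_ratings, larger_num_ratings)
print(winner, most_reviewed)
```

Because find_best only relies on the comparison function's return value, swapping in a different comparison (for example, one based on average rating) would change the criterion without touching the loop.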

After implementing these functions correctly, you should pass all the tests in the test.py file (yay!). Upload your code below so the server can also check your code.

Upload your code here to test that all your functions work as expected:
