Electricity (Part 2: Files)

The questions below are due on Sunday July 20, 2025; 10:00:00 PM.
 

Last week, we looked at some temperature and electricity usage data for Cambridge, MA and transformed some lists of lists of strings into some dictionaries.

However, it is a bit awkward to copy and paste large lists into .py files, so when working with large amounts of data, it is common to store the raw data separately in .csv, .txt, or even .pickle files, and load that data into your program.

In this assignment, we're going to first implement a function read_csv that converts a CSV file to the 2D arrays that we worked with in unit 4. Then, we'll implement make_rows to convert a dictionary into a 2D array that we can feed into write_csv to write that data to a new CSV file.
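
As a rough illustration of the reading step, Python's standard csv module can turn CSV text into a list of lists of strings. This is only a sketch: the name load_rows and the in-memory sample are hypothetical, and the description of read_csv in the starter file remains the authority on the required behavior.

```python
import csv
import io


def load_rows(file_obj):
    """Return every row of an open CSV file as a list of strings."""
    return [row for row in csv.reader(file_obj)]


# Hypothetical sample standing in for a file opened with open(path, newline="")
sample = io.StringIO("year,month,avg_kwh\n2007,1,554.01\n2007,2,606.33\n")
rows = load_rows(sample)
print(rows)
```

Every cell comes back as a string, so any numeric conversion has to happen in a later step.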

The goal of writing these functions is to use them to take the electricity and temperature data stored in different formats in separate CSV files and join them together into a new combined CSV file so that we can explore the relationship between these two quantities in later assignments.

1) Assignment Files

Download the following starter code file: u5_electricity_2.zip

This folder has several files:

  • electric.py is the starter file where you will write your code for this assignment.

  • graphs.py is the file for the next assignment.

  • The data folder contains a number of CSV files:

    • cambridge_temps.csv is a file that contains the daily temperature information. It has the same form as temp_array in the unit 4 problem. tiny_temps.csv contains a small subset of this CSV file.

    • cambridge_kwh.csv is a file that contains the electricity usage information. It has the same form as kwh_array in the unit 4 problem. tiny_kwh.csv contains a small subset of this CSV file.

    • tiny_combined.csv is a file that contains the monthly average electricity usage and temperature information that is the result of combining the data from both tiny CSV files.

Note that when you work with large amounts of data and code, it is a good idea to plan how you will organize your folders and Python scripts (and to document your functions); otherwise, as the project grows, it will be easy to lose track of things. It is also a good idea to make (and keep updating) a README.md markdown file that documents the file structure, as the description above does.

2) Implement Functions

Now that you are familiar with the file structure, start by copying your main functions (not including the if __name__ == "__main__" block) from Electricity Part 1 into electric.py in the place indicated.

Now implement read_csv, make_rows, and write_csv according to their descriptions. Note that each function has an associated test function at the bottom of the electric.py file. You can comment out the tests that you are not using, or add additional tests of your own.

Note that read_csv should remove any duplicate rows found in the file. Sometimes data entry errors happen, causing data to get replicated or go missing, so it is always a good idea to think carefully about how you "read in" data. Why is it a good idea to avoid duplicate rows (especially in the temperature files)? (Hint: think about what join_arrays calculates.)
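
One possible way to drop repeated rows while keeping the original order is sketched below. The helper name drop_duplicate_rows and the two-column date/temperature rows are assumptions made for illustration; they are not the actual layout of the course files. The last few lines show how a single repeated row shifts an average.

```python
def drop_duplicate_rows(rows):
    """Return rows with repeats removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical daily temperature rows; the second day was entered twice
temps = [["1/1/2007", "40"], ["1/2/2007", "50"], ["1/2/2007", "50"]]

values_with_dupes = [float(row[1]) for row in temps]
values_clean = [float(row[1]) for row in drop_duplicate_rows(temps)]

avg_with_dupes = sum(values_with_dupes) / len(values_with_dupes)  # 140 / 3
avg_clean = sum(values_clean) / len(values_clean)                 # 90 / 2
print(avg_with_dupes, avg_clean)
```

The duplicated day is counted twice in the first average, which pulls the result toward that day's value.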

When you are ready to check your make_rows function, upload your file below. Note that the inputs and outputs to some test cases are hidden because of the large input sizes.


3) Combined CSV

To avoid errors, limit complexity, and save time (especially when working with large input data), it is usually a good idea to separate data processing from data analysis. One way to accomplish this is to process the data first to get the relevant data into a convenient format, and then store that processed data in a new file, so that the analysis can start with the "clean" data without having to repeat the processing step each time.

To practice this, use Python to create a new CSV file, where each row contains four columns:

  • Column 1 should contain a year (a number 2007-2016)
  • Column 2 should contain a month (a number 1-12)
  • Column 3 should contain the electricity usage for that month (in kWh), or be blank if that data is missing.
  • Column 4 should contain the average of the daily average temperatures over that month (in Fahrenheit), or be blank if that data is missing.

The first row of the CSV should contain the header ['year', 'month', 'avg_kwh', 'avg_temp']. The remaining rows should be sorted in ascending order by year and then by month (so the oldest year / month comes first and the most recent year / month comes last).
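
The header and ordering requirements can be sketched as follows. The dictionary layout from Part 1 is not shown on this page, so this example assumes a hypothetical dictionary keyed by (year, month) tuples whose values are (avg_kwh, avg_temp) pairs; the name build_rows is also hypothetical, and your make_rows should follow the description in electric.py instead.

```python
def build_rows(combined):
    """Return a header row followed by data rows sorted by year, then month."""
    header = ["year", "month", "avg_kwh", "avg_temp"]
    rows = [header]
    # Tuples sort element by element, so (2007, 3) comes before (2008, 7)
    for year, month in sorted(combined):
        kwh, temp = combined[(year, month)]
        rows.append([year, month, kwh, temp])
    return rows


# Assumed input format (not taken from the starter code)
combined = {
    (2008, 7): (None, 76.0),
    (2007, 3): (658.42, 56.5),
    (2007, 1): (554.01, 43.75),
}
rows = build_rows(combined)
for row in rows:
    print(row)
```

Sorting the keys as tuples avoids a separate sort on year and month, and missing values simply stay as None.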

The tiny_combined.csv file shows the expected result of combining the data from tiny_temps.csv and tiny_kwh.csv. Note that Python's csv writer writes a None value as an empty string, so writing None to a cell automatically produces a blank cell.

year,month,avg_kwh,avg_temp
2007,1,554.01,43.75
2007,2,606.33,
2007,3,658.42,56.5
2007,4,526.59,
2007,5,760.0,
2007,6,1110.79,
2007,7,1348.45,
2007,8,1471.95,
2007,9,1518.79,
2007,10,655.5,
2007,11,574.19,
2007,12,648.72,
2008,7,,76.0

Note: Do not round the average temperature values.
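
The blank cells in the sample can be reproduced with the standard csv module, which writes None as an empty string. The sketch below writes to an in-memory buffer rather than a real file, and the two data rows are copied from the sample; it is an illustration of the behavior, not the required write_csv implementation.

```python
import csv
import io

rows = [
    ["year", "month", "avg_kwh", "avg_temp"],
    [2007, 2, 606.33, None],  # temperature missing
    [2008, 7, None, 76.0],    # electricity usage missing
]

buffer = io.StringIO()  # in-memory stand-in for open(path, "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)

text = buffer.getvalue()
print(text)
```

When writing to a real file, opening it with newline="" keeps the csv module from producing extra blank lines on some platforms.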

Now join the data from cambridge_temps.csv and cambridge_kwh.csv into a combined CSV file. You may also wish to open this file (temporarily) in a spreadsheet program to make sure it is formatted properly. When you are finished, upload your new CSV file below:

