Electric Bill

The questions below are due on Sunday August 04, 2024; 10:00:00 PM.
 
You are not logged in.

Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.

In this problem, we will get familiar with functionality which you've seen in R in 15.087. In particular, you will programmatically generate graphs in Python via the matplotlib package (an analogy to ggplot2), and you will perform linear regressions. It is useful to be familiar with these functionalities in Python in addition to R, in case you're using Python to analyze data.

In this problem we will also do a bit of data wrangling to get our data in the right form for analysis and visualization. We'll be using a set of real-world data gathered over 10 recent years.

1) The Data

Download and unzip the following .zip file: electricbill.zip

There are two files in that zip archive:

  • cambridge_avg_temps.tsv is a TSV file containing information about the average temperature in Cambridge, MA every day for 10 years. In each row:
    • The first column represents the year in which the measurement was taken.
    • The second column represents the month in which the measurement was taken.
    • The third column represents the day (of the month) in which the measurement was taken.
    • The fourth, fifth, and sixth columns contain the high temperature, low temperature, and average temperature on that day, respectively. These temperatures are all measured in degrees Fahrenheit.

  • cambridge_bills.tsv is a TSV file containing information about the electricity usage of a single two-person household in Cambridge for every month over the same period. In each row:
    • The first column represents the year in which the measurement was taken.
    • The second column represents the month in which the measurement was taken.
    • The third column represents the household's electricity usage during that month, in units of kilowatt-hours (kWh).

Throughout this exercise, we will explore the relationship between these two quantities (electricity consumption and temperature). To begin, you are welcome to look at these files in a spreadsheet program to get a feel for how they are put together.

2) Plotting Data

Link to relevant readings

Start by using Python to generate three plots to get a feel for the data:

  1. Make a plot showing each day's measured average temperature over time. (Axes: x-axis: each unique year/month/day in increasing order, y-axis: daily average temperature)
  2. Make a plot showing each month's electricity usage over time. (Axes: x-axis: each unique year/month in increasing order, y-axis: usage)
  3. On one set of axes, plot one line for each year's electricity usage (by month). This graph will give a sense of year-to-year changes in electricity usage. Use a loop to make this plot, rather than explicitly making a separate call to plt.plot for each year. (Axes: month, usage)

Each of these graphs should contain a relevant title, axis labels, and, where appropriate, a legend.

Save these plots (as PNG or SVG images), as well as the Python script you used to generate them. Combine all of these into a zip file, and upload that file below:

.zip file containing three plots and your Python script :
 No file selected

3) Relationship Between Temperature and Electricity Usage

We would like to be able to explore the relationship between temperature and electricity usage, but that is hard given the current form of the data. One particular source of difficulty stems from the fact that the average temperatures we have are reported daily, but the electricity usage numbers are reported monthly.

3.1) Combining Spreadsheets

In order to simplify all of our further analysis, let's begin by combining all of this information into a single spreadsheet that contains information about both temperature and electricity usage. Because the plots from before are not directly relevant here, you may wish to start with a new, fresh Python script for this portion of the assignment.

In order to do this, you will need to combine the information about temperature to find the average temperature in each month. For our analysis, it will suffice to say that the average temperature in a given month is the mean of its daily temperature averages.

Use Python to create a new TSV file, where each row contains four columns:

  • Column 1 should contain a year (a number 2007-2016)
  • Column 2 should contain a month (a number 1-12)
  • Column 3 should contain the average temperature over that month (in Fahrenheit).
  • Column 4 should contain the electricity usage for that month (in kWh)

The first row of the TSV should contain a header ['year', 'month', 'avg_temp', 'avg_kwh'] The order of the remaining rows does not matter.

We will use this file for the rest of our analysis. As a couple of checkpoints, three of the rows in your TSV should be:

2015    2   19.1    656.65
2015    3   33.3    486.49
2015    4   48.0    593.3

You may also wish to open this file (temporarily) in a spreadsheet program to make sure it is formatted properly. When you are finished, upload your new TSV file below:

Please upload the TSV you generated below:
 No file selected

4) An Aside: Regression

We take a brief aside to describe various packages in Python that provide regression capabilities.

4.1) Univariate Regression

NumPy performs linear regressions via the numpy.linalg.lstsq function. It uses an iterative method to produce an approximate solution.

For example, if we want to find the best-fit line for the following set of data:

x_iy_i
11
22
31.3
43.75
52.25

We could use NumPy to formulate this as a least-squares optimization problem:

A = numpy.array([[1, 1],
                 [2, 1],
                 [3, 1],
                 [4, 1],
                 [5, 1]])

b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T

print(numpy.linalg.lstsq(A, b))

The first element in the resulting tuple is a NumPy array object representing \left[\begin{array}{c}m\\ b\end{array}\right].

Another package you installed at our kickoff, SciPy, also does linear regressions via its scipy.stats.linregress function. Review the documentation to get a sense for it. As yet another alternative, one can of course manually code up linear regression in Python! We have done so for you in this file, which you may find interesting to examine. You may use all or part of this file in your solution, although you certainly do not need to.

In this exercise, you may use whichever of these alternatives feels most intuitive or familiar to you. Of the three, the manually-coded solution is least efficient, as is to be expected, but not to an extent that will affect this problem.

4.2) Multivariate Linear Regression

The form of the NumPy regression above generalizes to multivariate problems. Consider, for example, the 3-dimensional case, where we have one dependent variable z and two independent variables, x and y. In this case, we have several measured points (x_i, y_i, z_i), and we want to find A, B, and C so as to minimize the sum of squared errors between Ax_i + By_i + C and z_i. If we want to find the best-fit line for the following set of data:

x_iy_iz_i
101
202
3.71.3
4.23.75
512.25

We could use NumPy to formulate this as a least-squares optimization problem:

A = numpy.array([[1, 0, 1],
                 [2, 0, 1],
                 [3, 0.7, 1],
                 [4, .2, 1],
                 [5, 1, 1]])

b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T

print(numpy.linalg.lstsq(A, b)[0])

SciPy also supports multivariate regressions.

5) Modeling

For the rest of the exercise, we will focus on finding a simple predictive model relating temperature and electricity usage. In order to have a way to test how predictive each of our models is, we will use the first seven years of data to build our model; and we'll use the last three years only as a way to test how predictive our model is.

5.1) Univariate Regression

Let's start by performing a linear regression with average temperature as the independent variable and electricity usage as the dependent variable. Remember only to perform this regression on the first seven years of data. Enter your data to at least 3 decimal places

What is the slope of the 'best-fit' line?

What is the vertical intercept of the 'best-fit' line?

How much electricty (in kWh) does this model predict our sample household will use in a month where the average temperature is 30 degrees Fahrenheit?

For the most part, these results match what we would expect. The amount of electricity used increases with the average temperature, and the usage this month predicts for a very cold month (average of 30 degrees) is around the levels we see during the cold winter months in the real data.

Now let's test how predictive our model is. We can do this by calculating Pearson's r between:

  • the usages our model predicts for the last three years based on the actual measured temperatures, and
  • the actual measured usages during those months

The file linked above has one way to calculate this value; the NumPy and SciPy documentation gives alternatives.

What is the correlation coefficient between these two sets of data (the predicted usages, and the measured usages)?

5.2) Analysis

That is a reasonably strong correlation, but we could hope to do better. Let's try to figure out if there is something else going on in the data that we can leverage to do better.

To start, let's generate another plot. This plot should show two pieces of data on the same axes:

  • A scatter plot showing electricity usage vs temperature for the first seven years, and
  • In a different color, the line representing the result from the linear regression.

(Axes: monthly average temperature, usage)

You will want to choose the values over which you plot the line carefully so that you can see both the raw data and the fit.

Upload this graph below:

Graph showing training data and fit:
 No file selected

Try Now:

Do you notice anything interesting about this plot?

The line doesn't really represent the data all that well. In fact, it is a bad estimate in almost all cases!

It underestimates in hot or cold months, and it overestimates in other months. This might make sense, since both heating and cooling will use electricity. In the spring and fall months, people tend not to need either heating or cooling, so their overall usage is lower in those months. In the remaining sections, we'll attempt to take more information into account, so that we can make a model that does a better job of predicting electricity usage.

5.3) Other Information

Now let's try to account for the discrepancy above. Here, we'll define three types of days:

  • A "hot day" is any day whose low temperature is above 65 degrees
  • A "cold day" is any day whose high temperature is less than 40 degrees
  • Any other day is a "neutral day"

Augment your TSV from earlier so that it now contains three additional fields. In total, your spreadsheet should now have the following columns:

  • Column 1 should contain a year (a number 2007-2016)
  • Column 2 should contain a month (a number 1-12)
  • Column 3 should contain the average temperature over that month (in Fahrenheit), rounded to 1 decimal places using Python's round function.
  • Column 4 should contain the number of hot days in that month.
  • Column 5 should contain the number of cold days in that month.
  • Column 6 should contain the number of neutral days in that month.
  • Column 7 should contain the electricity usage for that month (in kWh)

The first row of the TSV should contain a header ['year', 'month', 'avg_temp', 'num_hot', 'num_cold', 'num_neutral', 'avg_kwh']. The order of the remaining rows does not matter.

As an example, the following rows should exist in your TSV:

2011    4   50.1    0   1   29  552.95
2011    5   59.2    1   0   30  760.95
2011    6   67.1    5   0   25  1493.31

Upload your new TSV below:

Please upload the TSV you generated below:
 No file selected

We will use this TSV moving forward.

5.4) Another Regression

Next, perform another regression on the first seven years' data, this time using all four independent variables (average temperature, hot days, cold days, and neutral days). What are the coefficients associated with each of these variables in the "best fit" line?

Coefficient associated with average temperature:

Coefficient associated with hot days:

Coefficient associated with cold days:

Coefficient associated with neutral days:

Constant term:

5.5) Analysis

Now let's test how predictive this new model is. We can do this by calculating Pearson's r between:

  • the usages our model predicts for the last three years based on the actual measured temperatures, and
  • the actual measured usages during those months

What is the correlation coefficient between these two sets of data (the predicted usages, and the actual measured usages)?

This suggests that our new model (which made use of the extra information) is much more predictive than our original model!

For final code style feedback, please upload your code again here. This question will be manually graded. Remember to use succinct, descriptive variable names and comments to explain confusing lines.
 No file selected

Next Exercise: Survey

Back to exercises