Electric Bill

The questions below are due on Thursday August 05, 2021; 11:00:00 AM.

You are not logged in.
Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.

In this problem, we will get familiar with functionality which you've seen in R in 15.087. In particular, you will programmatically generate graphs in Python via the matplotlib package (an analogy to ggplot2), and you will perform linear regressions. It is useful to be familiar with these functionalities in Python in addition to R, in case you're using Python to analyze data.

In this problem we will also do a bit of data wrangling to get our data in the right form for analysis and visualization. We'll be using a set of real-world data gathered over 10 recent years.

The Data

Download and unzip the following .zip file: electricbill.zip

There are two files in that zip archive:

cambridge_avg_temps.tsv is a TSV file containing information about the average temperature in Cambridge, MA every day for 10 years. In each row:
- The first column represents the year in which the measurement was taken.
- The second column represents the month in which the measurement was taken.
- The third column represents the day (of the month) in which the measurement was taken.
- The fourth, fifth, and sixth columns contain the high temperature, low temperature, and average temperature on that day, respectively. These temperatures are all measured in degrees Fahrenheit.

cambridge_bills.tsv is a TSV file containing information about the electricity usage of a single two-person household in Cambridge for every month over the same period. In each row:
- The first column represents the year in which the measurement was taken.
- The second column represents the month in which the measurement was taken.
- The third column represents the household's electricity usage during that month, in units of kilowatt-hours (kWh).

Throughout this exercise, we will explore the relationship between these two quantities (electricity consumption and temperature). To begin, you are welcome to look at these files in a spreadsheet program to get a feel for how they are put together.

An Aside: Generating Plots

The matplotlib.pyplot module provides a number of useful functions for creating plots from Python. We'll start here with a couple of short examples before moving on to actually working with the data.

To import the pyplot module, add the following to the top of your script:

import matplotlib.pyplot as plt

Once you have done so, you can make a new plot by calling plt.figure() with no arguments. After that, you can use various functions to add data to the figures. When you are ready, calling the plt.show() function with no arguments will cause matplotlib to open windows displaying the resulting graphs. You can also add a legend and/or a title to the plot, as well as labels to the axes, as shown in the example below.

The following code will cause four windows to be displayed. Try running this code on your own machine to see the results. Notice that the plt.show() function does not return until the plotting windows are closed.

import matplotlib.pyplot as plt

# here we plot a set of "y" values only; these are associated automatically
# with integer "x" values starting with 0.
plt.figure()
plt.plot([9, 4, 7, 6])

# if given two arguments, the first list/array will be used as the "x" values,
# and the second as the associated "y" values
plt.figure()
plt.plot([10, 9, 8, 7], [1, 2, 3, 4])
plt.grid()  # this adds a background grid to the plot

# we can also create scatter plots.  scatter plots require both "x" and "y"
# values.
plt.figure()
plt.scatter([10,25, 37, 42], [12, 28, 5, 37])

# multiple calls to plt.plot or plt.scatter will operate on the same axes
plt.figure()
plt.scatter([10, 25, 37, 42], [12, 28, 5, 37], label='scatter')
plt.plot([10, 40], [5, 20], 'r', label='a line') # the 'r' means 'red'
plt.plot([5, 9, 15, 30], [10, 20, 30, 35], 'k', label='more data')
plt.title('A more complete example')
plt.xlabel('A label for x')
plt.ylabel('The vertical axis')
plt.legend()

# finally, display the results
print('Showing Graphs')
plt.show()
print('Done')

Plotting Data

Start by using Python to generate three plots to get a feel for the data:

Make a plot showing each day's measured average temperature over time. (Axes: day, average temperature)
Make a plot showing each month's electricity usage over time. (Axes: month, usage)
On one set of axes, plot one line for each year's electricity usage (by month). This graph will give a sense of year-to-year changes in electricity usage. Use a loop to make this plot, rather than explicitly making a separate call to plt.plot for each year. (Axes: month, usage)

Each of these graphs should contain a relevant title, axis labels, and, where appropriate, a legend.

Save these plots (as PNG or SVG images), as well as the Python script you used to generate them. Combine all of these into a zip file, and upload that file below:

.zip file containing three plots and your Python script :

No file selected

Relationship Between Temperature and Electricity Usage

We would like to be able to explore the relationship between temperature and electricity usage, but that is hard given the current form of the data. One particular source of difficulty stems from the fact that the average temperatures we have are reported daily, but the electricity usage numbers are reported monthly.

Combining Spreadsheets

In order to simplify all of our further analysis, let's begin by combining all of this information into a single spreadsheet that contains information about both temperature and electricity usage. Because the plots from before are not directly relevant here, you may wish to start with a new, fresh Python script for this portion of the assignment.

In order to do this, you will need to combine the information about temperature to find the average temperature in each month. For our analysis, it will suffice to say that the average temperature in a given month is the mean of its daily temperature averages.

Use Python to create a new TSV file, where each row contains four columns:

Column 1 should contain a year (a number 2007-2016)
Column 2 should contain a month (a number 1-12)
Column 3 should contain the average temperature over that month (in Fahrenheit), rounded to 1 decimal places using Python's round function.
Column 4 should contain the electricity usage for that month (in kWh)

We will use this file for the rest of our analysis. As a couple of checkpoints, three of the rows in your TSV should be:

2015    2   19.1    656.65
2015    3   33.3    486.49
2015    4   48.0    593.30

You may also wish to open this file (temporarily) in a spreadsheet program to make sure it is formatted properly. When you are finished, upload your new TSV file below:

New (combined) TSV file:

No file selected

An Aside: Regression

We take a brief aside to describe various packages in Python that provide regression capabilities.

Univariate Regression

NumPy performs linear regressions via the numpy.linalg.lstsq function. It uses an iterative method to produce an approximate solution.

For example, if we want to find the best-fit line for the following set of data:

x_i	y_i
1	1
2	2
3	1.3
4	3.75
5	2.25

We could use NumPy to formulate this as a least-squares optimization problem:

A = numpy.array([[1, 1],
                 [2, 1],
                 [3, 1],
                 [4, 1],
                 [5, 1]])

b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T

print(numpy.linalg.lstsq(A, b))

The first element in the resulting tuple is a NumPy array object representing \left[\begin{array}{c}m\\ b\end{array}\right].

Another package you installed at our kickoff, SciPy, also does linear regressions via its scipy.stats.linregress function. Review the documentation to get a sense for it. As yet another alternative, one can of course manually code up linear regression in Python! We have done so for you in this file, which you may find interesting to examine. You may use all or part of this file in your solution, although you certainly do not need to.

In this exercise, you may use whichever of these alternatives feels most intuitive or familiar to you. Of the three, the manually-coded solution is least efficient, as is to be expected, but not to an extent that will affect this problem.

Multivariate Linear Regression

The form of the NumPy regression above generalizes to multivariate problems. Consider, for example, the 3-dimensional case, where we have one dependent variable z and two independent variables, x and y. In this case, we have several measured points (x_i, y_i, z_i), and we want to find A, B, and C so as to minimize the sum of squared errors between Ax_i + By_i + C and z_i. If we want to find the best-fit line for the following set of data:

x_i	y_i	z_i
1	0	1
2	0	2
3	.7	1.3
4	.2	3.75
5	1	2.25

We could use NumPy to formulate this as a least-squares optimization problem:

A = numpy.array([[1, 0, 1],
                 [2, 0, 1],
                 [3, 0.7, 1],
                 [4, .2, 1],
                 [5, 1, 1]])

b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T

print(numpy.linalg.lstsq(A, b)[0])

SciPy also supports multivariate regressions.

Modeling

For the rest of the exercise, we will focus on finding a simple predictive model relating temperature and electricity usage. In order to have a way to test how predictive each of our models is, we will use the first seven years of data to build our model; and we'll use the last three years only as a way to test how predictive our model is.

Univariate Regression

Let's start by performing a linear regression with average temperature as the independent variable and electricity usage as the dependent variable. Remember only to perform this regression on the first seven years of data. Enter your data to at least 3 decimal places

What is the slope of the 'best-fit' line?

What is the vertical intercept of the 'best-fit' line?

How much electricty (in kWh) does this model predict our sample household will use in a month where the average temperature is 30 degrees Fahrenheit?

For the most part, these results match what we would expect. The amount of electricity used increases with the average temperature, and the usage this month predicts for a very cold month (average of 30 degrees) is around the levels we see during the cold winter months in the real data.

Now let's test how predictive our model is. We can do this by calculating Pearson's r between:

the usages our model predicts for the last three years based on the actual measured temperatures, and
the actual measured usages during those months

The file linked above has one way to calculate this value; the NumPy and SciPy documentation gives alternatives.

What is the correlation coefficient between these two sets of data (the predicted usages, and the measured usages)?

Analysis

That is a reasonably strong correlation, but we could hope to do better. Let's try to figure out if there is something else going on in the data that we can leverage to do better.

To start, let's generate another plot. This plot should show two pieces of data on the same axes:

A scatter plot showing electricity usage vs temperature for the first seven years, and
In a different color, the line representing the result from the linear regression.

(Axes: monthly average temperature, usage)

You will want to choose the values over which you plot the line carefully so that you can see both the raw data and the fit.

Upload this graph below:

Graph showing training data and fit:

No file selected

Try Now:

Do you notice anything interesting about this plot?

The line doesn't really represent the data all that well. In fact, it is a bad estimate in almost all cases!

It underestimates in hot or cold months, and it overestimates in other months. This might make sense, since both heating and cooling will use electricity. In the spring and fall months, people tend not to need either heating or cooling, so their overall usage is lower in those months. In the remaining sections, we'll attempt to take more information into account, so that we can make a model that does a better job of predicting electricity usage.

Other Information

Now let's try to account for the discrepancy above. Here, we'll define three types of days:

A "hot day" is any day whose low temperature is above 65 degrees
A "cold day" is any day whose high temperature is less than 40 degrees
Any other day is a "neutral day"

Augment your TSV from earlier so that it now contains three additional fields. In total, your spreadsheet should now have the following columns:

Column 1 should contain a year (a number 2007-2016)
Column 2 should contain a month (a number 1-12)
Column 3 should contain the average temperature over that month (in Fahrenheit), rounded to 1 decimal places using Python's round function.
Column 4 should contain the number of hot days in that month.
Column 5 should contain the number of cold days in that month.
Column 6 should contain the number of neutral days in that month.
Column 7 should contain the electricity usage for that month (in kWh)

As an example, the following rows should exist in your TSV:

2011    4   50.1    0   1   29  552.95
2011    5   59.2    1   0   30  760.95
2011    6   67.1    5   0   25  1493.31

Upload your new TSV below:

New (combined) TSV file:

No file selected

We will use this TSV moving forward.

Another Regression

Next, perform another regression on the first seven years' data, this time using all four independent variables (average temperature, hot days, cold days, and neutral days). What are the coefficients associated with each of these variables in the "best fit" line?

Coefficient associated with average temperature:

Coefficient associated with hot days:

Coefficient associated with cold days:

Coefficient associated with neutral days:

Constant term:

Analysis

Now let's test how predictive this new model is. We can do this by calculating Pearson's r between:

the usages our model predicts for the last three years based on the actual measured temperatures, and
the actual measured usages during those months

What is the correlation coefficient between these two sets of data (the predicted usages, and the actual measured usages)?

This suggests that our new model (which made use of the extra information) is much more predictive than our original model!

For final code style feedback, please upload your code again here. This question will be manually graded. Remember to use succinct, descriptive variable names and comments to explain confusing lines.

No file selected

Next Exercise: Survey

Back to exercises