Electricity (Part 4: Linear Regression)
In our final installment of the electricity problem, we will find a simple predictive model relating temperature and electricity usage, similar to what you've seen in your statistics and optimization classes.
Before we dive in, let's review various packages in Python that provide regression capabilities.
0.1) Univariate Regression
NumPy performs linear regressions via the numpy.linalg.lstsq function, which computes the least-squares solution to a linear system (internally it uses an SVD-based LAPACK routine).
For example, if we want to find the best-fit line for the following set of data:
| x_i | y_i |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 1.3 |
| 4 | 3.75 |
| 5 | 2.25 |
We could use NumPy to formulate this as a least-squares optimization problem:
import numpy

# Each row of the design matrix is [x_i, 1], so the model being fit is y = m*x + b.
A = numpy.array([[1, 1],
                 [2, 1],
                 [3, 1],
                 [4, 1],
                 [5, 1]])
b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T
print(numpy.linalg.lstsq(A, b, rcond=None))
The first element in the resulting tuple is a NumPy array object representing \left[\begin{array}{c}m\\ b\end{array}\right], i.e. the slope m and intercept b of the best-fit line y = mx + b.
Another package you installed at our kickoff, SciPy, also does linear regressions via its scipy.stats.linregress function. Review the documentation to get a sense for it. As yet another alternative, one can of course manually code up linear regression in Python! We have done so for you in this file, which you may find interesting to examine. You may use all or part of this file in your solution, although you certainly do not need to.
In this exercise, you may use whichever of these alternatives feels most intuitive or familiar to you. Of the three, the manually-coded solution is least efficient, as is to be expected, but not to an extent that will affect this problem.
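For instance, here is a minimal sketch of the same univariate fit using scipy.stats.linregress (assuming SciPy is installed), with the data points taken from the table above:

import scipy.stats

x = [1, 2, 3, 4, 5]
y = [1, 2, 1.3, 3.75, 2.25]

# linregress returns the slope, intercept, correlation coefficient (r),
# two-sided p-value, and standard error of the fitted line
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
print(slope, intercept)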
0.2) Multivariate Linear Regression
The form of the NumPy regression above generalizes to multivariate problems. Consider, for example, the 3-dimensional case, where we have one dependent variable z and two independent variables, x and y. In this case, we have several measured points (x_i, y_i, z_i), and we want to find A, B, and C so as to minimize the sum of squared errors between Ax_i + By_i + C and z_i. If we want to find the best-fit line for the following set of data:
| x_i | y_i | z_i |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0 | 2 |
| 3 | .7 | 1.3 |
| 4 | .2 | 3.75 |
| 5 | 1 | 2.25 |
We could use NumPy to formulate this as a least-squares optimization problem:
import numpy

# Each row of the design matrix is [x_i, y_i, 1], so the model being fit is z = A*x + B*y + C.
A = numpy.array([[1, 0,   1],
                 [2, 0,   1],
                 [3, 0.7, 1],
                 [4, 0.2, 1],
                 [5, 1,   1]])
b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T
print(numpy.linalg.lstsq(A, b, rcond=None)[0])
SciPy also supports multivariate regressions, for example via its scipy.linalg.lstsq function.
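As a minimal sketch (assuming SciPy is installed), the same fit could be done as follows; scipy.linalg.lstsq takes the design matrix and observation vector just like the NumPy version:

import numpy
import scipy.linalg

# Same design matrix and observation vector as in the NumPy example above
A = numpy.array([[1, 0,   1],
                 [2, 0,   1],
                 [3, 0.7, 1],
                 [4, 0.2, 1],
                 [5, 1,   1]])
b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T

# scipy.linalg.lstsq returns (solution, residues, rank, singular_values)
coeffs, residues, rank, singular_values = scipy.linalg.lstsq(A, b)
print(coeffs)   # the best-fit [A, B, C] for the plane z = A*x + B*y + C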
1) Modeling
Now onto finding our predictive model! So that we have a way to test how predictive each of our models is, we will build the model using only the first seven years of data and reserve the last three years purely for testing its predictions. In machine learning, datasets are commonly split into similar "training" and "test" sets.
1.1) Univariate Regression
Let's start by performing a linear regression with average temperature as the independent variable and electricity usage as the dependent variable. Remember to perform this regression only on the first seven years of data. Enter your answers to at least 3 decimal places.
For the most part, these results match what we would expect. The amount of electricity used increases with the average temperature, and the usage this model predicts for a very cold month (an average of 30 degrees) is around the levels we see during the cold winter months in the real data.
Now let's test how predictive our model is. We can do this by calculating Pearson's r between:
- the usages our model predicts for the last three years based on the actual measured temperatures, and
- the actual measured usages during those months
The file linked above has one way to calculate this value; the NumPy and SciPy documentation gives alternatives.
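Putting these steps together, here is a minimal sketch, assuming your CSV from the earlier parts is named electricity.csv (a hypothetical name) and has columns year, month, avg_kwh, avg_temp in that order:

import numpy

# Load the monthly data; skip the header row
data = numpy.genfromtxt('electricity.csv', delimiter=',', skip_header=1)

# First seven years (2007-2013) for training, last three (2014-2016) for testing
train = data[data[:, 0] <= 2013]
test = data[data[:, 0] >= 2014]

# Fit usage = m * avg_temp + b on the training months only
A = numpy.column_stack([train[:, 3], numpy.ones(len(train))])
(m, b), *_ = numpy.linalg.lstsq(A, train[:, 2], rcond=None)

# Predict the test months from their measured temperatures, then compare the
# predictions with the measured usages via Pearson's r
predicted = m * test[:, 3] + b
r = numpy.corrcoef(predicted, test[:, 2])[0, 1]
print(m, b, r)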
1.2) Analysis
That is a reasonably strong correlation, but we could hope to do better. Let's try to figure out if there is something else going on in the data that we can leverage to do better.
To start, let's generate another plot. This plot should show two pieces of data on the same axes:
- A scatter plot showing electricity usage vs temperature for the first seven years (this is plot #3 from the data visualization exercise), and
- In a different color, the line representing the result from the linear regression.
(Axes: monthly average temperature on the horizontal axis, usage on the vertical axis.)
You will want to choose the values over which you plot the line carefully so that you can see both the raw data and the fit.
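As a rough sketch of the plotting step (reusing the train, m, and b names from the sketch in the previous section, and assuming Matplotlib is installed):

import numpy
import matplotlib.pyplot as plt

# Scatter plot of measured usage vs. monthly average temperature (training years)
plt.scatter(train[:, 3], train[:, 2], label='measured usage')

# Draw the fitted line only across the observed temperature range, so both
# the raw data and the fit are visible on the same axes
temps = numpy.linspace(train[:, 3].min(), train[:, 3].max(), 100)
plt.plot(temps, m * temps + b, color='red', label='linear fit')

plt.xlabel('monthly average temperature (F)')
plt.ylabel('usage (kWh)')
plt.legend()
plt.show()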
Upload this graph below:
Do you notice anything interesting about this plot?
The linear fit underestimates usage in hot and cold months, and it overestimates usage in the milder months. This might make sense: air conditioners are typically run only in the hot summer months and use lots of electricity. In the spring and fall, people tend not to need either heating or cooling, so their overall usage is lower in those months. In the winter, most homes in Cambridge use natural gas for heating (hence the lower electricity usage), but when it is really cold people sometimes also run electric space heaters (which increases electricity usage).
In the remaining sections, we'll attempt to take more information into account, so that we can make a model that does a better job of predicting electricity usage.
1.3) Other Information
Now let's try to account for the discrepancy above. Here, we'll define three types of days:
- A "hot day" is any day whose low temperature is above 65 degrees
- A "cold day" is any day whose high temperature is less than 40 degrees
- Any other day is a "neutral day"
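In code, that classification might look like the following rough sketch (the daily (low, high) pairs here are made up; adapt it to however your daily temperature data is structured):

def classify_day(low, high):
    """Label a day as 'hot', 'cold', or 'neutral' from its low and high temperatures (F)."""
    if low > 65:
        return 'hot'
    if high < 40:
        return 'cold'
    return 'neutral'

# Hypothetical usage: count each category for one month's worth of days
days = [(68, 80), (55, 63), (30, 38)]
counts = {'hot': 0, 'cold': 0, 'neutral': 0}
for low, high in days:
    counts[classify_day(low, high)] += 1
print(counts)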
Augment your CSV from earlier so that it now contains three additional fields. In total, your spreadsheet should now have the following columns:
- Column 1 should contain a year (a number 2007-2016)
- Column 2 should contain a month (a number 1-12)
- Column 3 should contain the electricity usage for that month (in kWh)
- Column 4 should contain the average temperature over that month (in Fahrenheit)
- Column 5 should contain the number of hot days in that month.
- Column 6 should contain the number of cold days in that month.
- Column 7 should contain the number of neutral days in that month.
The first row of the CSV should contain the header ['year', 'month', 'avg_kwh', 'avg_temp', 'num_hot', 'num_cold', 'num_neutral']. The remaining rows should be in chronological order (earliest year/month first, most recent year/month last).
As an example, the first ten rows of your CSV file should look like the following (but there may be slight differences in calculated avg_temp):
year,month,avg_kwh,avg_temp,num_hot,num_cold,num_neutral
2007,1,554.01,32.74193548387097,0,17,14
2007,2,606.33,26.146428571428572,0,24,4
2007,3,658.42,37.44516129032258,0,11,20
2007,4,526.59,45.166666666666664,0,2,28
2007,5,760.0,61.377419354838715,1,0,30
2007,6,1110.79,68.51666666666667,5,0,25
2007,7,1348.45,72.9,12,0,19
2007,8,1471.95,72.68064516129031,14,0,17
2007,9,1518.79,67.64,3,0,27
Upload your new CSV below:
We will use this CSV moving forward.
1.4) Another Regression
Next, perform another regression on the first seven years' data, this time using all four independent variables (average temperature, hot days, cold days, and neutral days) and a constant term. What are the coefficients associated with each of these variables in the "best fit" line?
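Here is a minimal sketch of this fit, assuming the augmented CSV described above is saved as electricity_full.csv (a hypothetical name) with the columns in the order given by the header:

import numpy

# Load the augmented monthly data; skip the header row
data = numpy.genfromtxt('electricity_full.csv', delimiter=',', skip_header=1)
train = data[data[:, 0] <= 2013]   # first seven years only

# Independent variables: avg_temp, num_hot, num_cold, num_neutral, plus a constant
A = numpy.column_stack([train[:, 3:7], numpy.ones(len(train))])
coeffs, *_ = numpy.linalg.lstsq(A, train[:, 2], rcond=None)
print(coeffs)   # coefficients for [avg_temp, num_hot, num_cold, num_neutral, constant]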
1.5) Analysis
Now let's test how predictive this new model is. We can do this by calculating Pearson's r between:
- the usages our model predicts for the last three years based on the actual measured temperatures, and
- the actual measured usages during those months
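Continuing the sketch from section 1.4 (reusing its data and coeffs variables), the comparison might look like:

import numpy

# Last three years (2014-2016) held out as the test set
test = data[data[:, 0] >= 2014]

# Predict usage for the test months from their measured temperatures and day counts
X_test = numpy.column_stack([test[:, 3:7], numpy.ones(len(test))])
predicted = X_test @ coeffs

# Pearson's r between predicted and measured usage for the test months
r = numpy.corrcoef(predicted, test[:, 2])[0, 1]
print(r)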
This suggests that our new model (which made use of the extra information) is much more predictive than our original model!
Looking at the plot of the multivariate predictions versus the univariate predictions and actual test data below, we can see that the multivariate prediction follows the curve pretty accurately.