Electric Bill
Please Log In for full access to the web site.
Note that this link will take you to an external site (https://shimmer.mit.edu) to authenticate, and then you will be redirected back to this page.
In this problem, we will get familiar with functionality which you've seen in R in 15.087. In particular, you will programmatically generate graphs in Python via the
matplotlib
package (an analogy to ggplot2
), and you will perform linear regressions. It is useful to be familiar with these functionalities in Python in addition to R, in case you're using Python to analyze data.
In this problem we will also do a bit of data wrangling to get our data in the right form for analysis and visualization. We'll be using a set of real-world data gathered over 10 recent years.
1) The Data
Download and unzip the following .zip file: electricbill.zip
There are two files in that zip archive:
cambridge_avg_temps.tsv
is a TSV file containing information about the average temperature in Cambridge, MA every day for 10 years. In each row:- The first column represents the year in which the measurement was taken.
- The second column represents the month in which the measurement was taken.
- The third column represents the day (of the month) in which the measurement was taken.
- The fourth, fifth, and sixth columns contain the high temperature, low temperature, and average temperature on that day, respectively. These temperatures are all measured in degrees Fahrenheit.
cambridge_bills.tsv
is a TSV file containing information about the electricity usage of a single two-person household in Cambridge for every month over the same period. In each row:- The first column represents the year in which the measurement was taken.
- The second column represents the month in which the measurement was taken.
- The third column represents the household's electricity usage during that month, in units of kilowatt-hours (kWh).
Throughout this exercise, we will explore the relationship between these two quantities (electricity consumption and temperature). To begin, you are welcome to look at these files in a spreadsheet program to get a feel for how they are put together.
2) Plotting Data
Start by using Python to generate three plots to get a feel for the data:
- Make a plot showing each day's measured average temperature over time. (Axes: x-axis: each unique year/month/day in increasing order, y-axis: daily average temperature)
- Make a plot showing each month's electricity usage over time. (Axes: x-axis: each unique year/month in increasing order, y-axis: usage)
- On one set of axes, plot one line for each year's electricity usage (by month). This graph will give a sense of year-to-year changes in electricity usage. Use a loop to make this plot, rather than explicitly making a separate call to
plt.plot
for each year. (Axes: month, usage)
Each of these graphs should contain a relevant title, axis labels, and, where appropriate, a legend.
Save these plots (as PNG or SVG images), as well as the Python script you used to generate them. Combine all of these into a zip file, and upload that file below:
3) Relationship Between Temperature and Electricity Usage
We would like to be able to explore the relationship between temperature and electricity usage, but that is hard given the current form of the data. One particular source of difficulty stems from the fact that the average temperatures we have are reported daily, but the electricity usage numbers are reported monthly.
3.1) Combining Spreadsheets
In order to simplify all of our further analysis, let's begin by combining all of this information into a single spreadsheet that contains information about both temperature and electricity usage. Because the plots from before are not directly relevant here, you may wish to start with a new, fresh Python script for this portion of the assignment.
In order to do this, you will need to combine the information about temperature to find the average temperature in each month. For our analysis, it will suffice to say that the average temperature in a given month is the mean of its daily temperature averages.
Use Python to create a new TSV file, where each row contains four columns:
- Column 1 should contain a year (a number 2007-2016)
- Column 2 should contain a month (a number 1-12)
- Column 3 should contain the average temperature over that month (in Fahrenheit).
- Column 4 should contain the electricity usage for that month (in kWh)
The first row of the TSV should contain a header ['year', 'month', 'avg_temp', 'avg_kwh'] The order of the remaining rows does not matter.
We will use this file for the rest of our analysis. As a couple of checkpoints, three of the rows in your TSV should be:
2015 2 19.1 656.65
2015 3 33.3 486.49
2015 4 48.0 593.3
You may also wish to open this file (temporarily) in a spreadsheet program to make sure it is formatted properly. When you are finished, upload your new TSV file below:
4) An Aside: Regression
We take a brief aside to describe various packages in Python that provide regression capabilities.
4.1) Univariate Regression
NumPy performs linear regressions via the numpy.linalg.lstsq
function. It uses an iterative method to produce an approximate solution.
For example, if we want to find the best-fit line for the following set of data:
x_i | y_i |
---|---|
1 | 1 |
2 | 2 |
3 | 1.3 |
4 | 3.75 |
5 | 2.25 |
We could use NumPy to formulate this as a least-squares optimization problem:
A = numpy.array([[1, 1],
[2, 1],
[3, 1],
[4, 1],
[5, 1]])
b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T
print(numpy.linalg.lstsq(A, b))
The first element in the resulting tuple is a NumPy array
object representing \left[\begin{array}{c}m\\
b\end{array}\right].
Another package you installed at our kickoff, SciPy, also does linear regressions via its scipy.stats.linregress
function. Review the documentation to get a sense for it. As yet another alternative, one can of course manually code up linear regression in Python! We have done so for you in this file, which you may find interesting to examine. You may use all or part of this file in your solution, although you certainly do not need to.
In this exercise, you may use whichever of these alternatives feels most intuitive or familiar to you. Of the three, the manually-coded solution is least efficient, as is to be expected, but not to an extent that will affect this problem.
4.2) Multivariate Linear Regression
The form of the NumPy regression above generalizes to multivariate problems. Consider, for example, the 3-dimensional case, where we have one dependent variable z and two independent variables, x and y. In this case, we have several measured points (x_i, y_i, z_i), and we want to find A, B, and C so as to minimize the sum of squared errors between Ax_i + By_i + C and z_i. If we want to find the best-fit line for the following set of data:
x_i | y_i | z_i |
---|---|---|
1 | 0 | 1 |
2 | 0 | 2 |
3 | .7 | 1.3 |
4 | .2 | 3.75 |
5 | 1 | 2.25 |
We could use NumPy to formulate this as a least-squares optimization problem:
A = numpy.array([[1, 0, 1],
[2, 0, 1],
[3, 0.7, 1],
[4, .2, 1],
[5, 1, 1]])
b = numpy.array([[1, 2, 1.3, 3.75, 2.25]]).T
print(numpy.linalg.lstsq(A, b)[0])
SciPy also supports multivariate regressions.
5) Modeling
For the rest of the exercise, we will focus on finding a simple predictive model relating temperature and electricity usage. In order to have a way to test how predictive each of our models is, we will use the first seven years of data to build our model; and we'll use the last three years only as a way to test how predictive our model is.
5.1) Univariate Regression
Let's start by performing a linear regression with average temperature as the independent variable and electricity usage as the dependent variable. Remember only to perform this regression on the first seven years of data. Enter your data to at least 3 decimal places
For the most part, these results match what we would expect. The amount of electricity used increases with the average temperature, and the usage this month predicts for a very cold month (average of 30 degrees) is around the levels we see during the cold winter months in the real data.
Now let's test how predictive our model is. We can do this by calculating Pearson's r between:
- the usages our model predicts for the last three years based on the actual measured temperatures, and
- the actual measured usages during those months
The file linked above has one way to calculate this value; the NumPy and SciPy documentation gives alternatives.
5.2) Analysis
That is a reasonably strong correlation, but we could hope to do better. Let's try to figure out if there is something else going on in the data that we can leverage to do better.
To start, let's generate another plot. This plot should show two pieces of data on the same axes:
- A scatter plot showing electricity usage vs temperature for the first seven years, and
- In a different color, the line representing the result from the linear regression.
(Axes: monthly average temperature, usage)
You will want to choose the values over which you plot the line carefully so that you can see both the raw data and the fit.
Upload this graph below:
Do you notice anything interesting about this plot?
It underestimates in hot or cold months, and it overestimates in other months. This might make sense, since both heating and cooling will use electricity. In the spring and fall months, people tend not to need either heating or cooling, so their overall usage is lower in those months. In the remaining sections, we'll attempt to take more information into account, so that we can make a model that does a better job of predicting electricity usage.
5.3) Other Information
Now let's try to account for the discrepancy above. Here, we'll define three types of days:
- A "hot day" is any day whose low temperature is above 65 degrees
- A "cold day" is any day whose high temperature is less than 40 degrees
- Any other day is a "neutral day"
Augment your TSV from earlier so that it now contains three additional fields. In total, your spreadsheet should now have the following columns:
- Column 1 should contain a year (a number 2007-2016)
- Column 2 should contain a month (a number 1-12)
- Column 3 should contain the average temperature over that month (in
Fahrenheit), rounded to 1 decimal places using Python's
round
function. - Column 4 should contain the number of hot days in that month.
- Column 5 should contain the number of cold days in that month.
- Column 6 should contain the number of neutral days in that month.
- Column 7 should contain the electricity usage for that month (in kWh)
The first row of the TSV should contain a header ['year', 'month', 'avg_temp', 'num_hot', 'num_cold', 'num_neutral', 'avg_kwh']. The order of the remaining rows does not matter.
As an example, the following rows should exist in your TSV:
2011 4 50.1 0 1 29 552.95
2011 5 59.2 1 0 30 760.95
2011 6 67.1 5 0 25 1493.31
Upload your new TSV below:
We will use this TSV moving forward.
5.4) Another Regression
Next, perform another regression on the first seven years' data, this time using all four independent variables (average temperature, hot days, cold days, and neutral days). What are the coefficients associated with each of these variables in the "best fit" line?
5.5) Analysis
Now let's test how predictive this new model is. We can do this by calculating Pearson's r between:
- the usages our model predicts for the last three years based on the actual measured temperatures, and
- the actual measured usages during those months
This suggests that our new model (which made use of the extra information) is much more predictive than our original model!
Next Exercise: Survey