Lesson Explainer: Least Squares Regression Line | Nagwa

Lesson Explainer: Least Squares Regression Line Mathematics • Third Year of Secondary School

In this explainer, we will learn how to find and use the least squares regression line equation.

The term “regression” was first used by Sir Francis Galton, an English Victorian era statistician, in reference to the heights of children and their parents. Tall parents tended to have children shorter than themselves and vice versa for short parents. He called this effect “regression towards mediocrity”; heights regressed to the mean. Since his findings, regression analysis has been used to identify and analyze relationships between variables. In particular, the method of least squares allows us to determine the line of best fit for a set of bivariate data.

Suppose we have collected $n$ measurements for two quantitative variables, $X$ and $Y$, to form a set of bivariate data. That is, we have $n$ pairs of data, $(x_i, y_i)$, $i = 1, \dots, n$. Suppose also that both a scatter plot and the correlation coefficient of our data indicate that the variables $X$ and $Y$ are linearly related. That is, as one variable increases, the other increases or decreases approximately linearly with it.

Our next step in analyzing such data is to try and model this relationship with a line of best fit. This means that we seek the equation of the line that maps the path of the data passing as closely as possible to each of the data points. We might try and construct this line by eye; however, there is a technique that can allow us to find its exact equation.

Recall that, in general, the equation of a straight line is $y = a + bx$, where $a$ is the $y$-intercept and $b$ is the slope of the line. It is unlikely that a set of bivariate data will lie exactly on a straight line, so to find the equation of the line that fits our data most closely, we find the line whose overall distance from our data points is minimized. For each point $(x, y)$, the vertical distance $y - \hat{y}$ is called the error or residual. It is the difference between the true value of $y$ for a data point and the predicted value $\hat{y}$, on the line, for the same $x$-value.

The least squares regression line, $\hat{y} = a + bx$, minimizes the sum of the squared differences of the points from the line; hence the phrase “least squares.” We will not cover the derivation of the formulae for the line of best fit here. However, we will demonstrate how to use the formulae to find the coefficients $a$ and $b$ of the line.
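To make the quantity being minimized concrete, here is a minimal Python sketch that computes the sum of squared residuals for a candidate line. The function name `sum_squared_residuals` and the three data points are hypothetical, invented purely for illustration. The second candidate line is the one that the least squares formulae give for these three points, so its sum of squares is the smaller of the two.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared residuals (y - y_hat)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical points, invented purely for illustration.
xs = [1, 2, 3]
ys = [2, 3, 5]

# Two candidate lines with the same slope but different intercepts.
sse_guess = sum_squared_residuals(0.0, 1.5, xs, ys)     # y_hat = 1.5x
sse_better = sum_squared_residuals(1 / 3, 1.5, xs, ys)  # y_hat = 1/3 + 1.5x

print(sse_guess, sse_better)
```

Comparing the two printed values shows how the sum of squares acts as a score for a line: the smaller it is, the closer the line lies to the data overall.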

Definition: The Least Squares Regression Line

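If $\hat{y} = a + bx$ is the line of least squares regression for a set of bivariate data with variables $X$ and $Y$, then the slope is $b = \frac{S_{xy}}{S_{xx}}$ and the $y$-intercept is $a = \bar{y} - b\bar{x}$, where $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$, $S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$, $\bar{x} = \frac{\sum x}{n}$ (the mean of $x$), and $\bar{y} = \frac{\sum y}{n}$ (the mean of $y$).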

We may also write the slope $b$ as $b = r\frac{s_y}{s_x}$, where $r$ is the correlation coefficient and $s_x$ and $s_y$ are the standard deviations of $x$ and $y$, respectively, or, alternatively, by substituting the expressions for $S_{xy}$ and $S_{xx}$ into the formula for the slope, as $b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$.

We note further that $S_{xy}$ and $S_{xx}$ are often written as $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$, which are equivalent to the expressions given above. In this form, we can see that $S_{xy}$ is the sum of the products of the differences between each $x$ and the mean of $x$ and each $y$ and the mean of $y$, and $S_{xx}$ is the sum of the squares of the differences between each $x$ and the mean of $x$.
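For readers who want to see why the two forms agree, the following short expansion is one way to check it. It uses only the definitions of the means, $\bar{x} = \frac{\sum x}{n}$ and $\bar{y} = \frac{\sum y}{n}$, and is offered as a sketch rather than as part of the original lesson.

```latex
\begin{align*}
S_{xy} &= \sum (x - \bar{x})(y - \bar{y}) \\
       &= \sum xy - \bar{x}\sum y - \bar{y}\sum x + n\bar{x}\bar{y} \\
       &= \sum xy - \frac{\sum x \sum y}{n} - \frac{\sum x \sum y}{n} + \frac{\sum x \sum y}{n} \\
       &= \sum xy - \frac{\sum x \sum y}{n}.
\end{align*}
```

Replacing $y$ with $x$ throughout gives the corresponding result for $S_{xx}$, namely $\sum (x - \bar{x})^2 = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$.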

In practice, we calculate the slope, 𝑏, first, since 𝑏 is needed to calculate 𝑎. Let’s look at an example of how to calculate the least squares regression line from a table of bivariate data.

Example 1: Calculating the Least Squares Regression Line given a Table of Data Summarizing the Sums of the Observed Values

Use the information in the table to find the equation of the least squares regression line of 𝑦 on 𝑥. Write the equation in the form 𝑦=𝑎𝑥+𝑏, where 𝑎 and 𝑏 are accurate to three decimal places.

      𝑥     𝑦     𝑥𝑦     𝑥²      𝑦²
1     22    18    396    484     324
2     22    19    418    484     361
3     23    20    460    529     400
4     26    18    468    676     324
5     31    23    713    961     529
6     32    24    768    1024    576
7     34    22    748    1156    484
8     37    25    925    1369    625
9     41    29    1189   1681    841
10    42    27    1134   1764    729
Sum   310   225   7219   10128   5193

Answer

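To find the least squares regression line $y = a + bx$, we must find the slope, $b$, and the $y$-intercept, $a$. To do this, we use the formulae $b = \frac{S_{xy}}{S_{xx}} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$ and $a = \bar{y} - b\bar{x}$, where $\bar{x} = \frac{\sum x}{n}$ is the mean of $x$ and $\bar{y} = \frac{\sum y}{n}$ is the mean of $y$.

The number of data pairs in our data set is $n = 10$, and in the final row of the table we are given the sums that we need. These are $\sum xy = 7219$, $\sum x = 310$, $\sum y = 225$, and $\sum x^2 = 10128$. Since we will need the slope, $b$, to calculate $a$, let’s first use the given values to find $b$: $b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10 \times 7219 - 310 \times 225}{10 \times 10128 - (310)^2} = \frac{72190 - 69750}{101280 - 96100} = \frac{2440}{5180} = \frac{122}{259} \approx 0.471$ (to 3 d.p.).

To calculate the value of the $y$-intercept, $a$, we need the means of the $x$- and the $y$-values. These are $\bar{x} = \frac{\sum x}{n} = \frac{310}{10} = 31$ and $\bar{y} = \frac{\sum y}{n} = \frac{225}{10} = 22.5$.

We can now use these values, in fraction form for accuracy, together with our slope $b = \frac{122}{259}$, to find $a$: $a = \bar{y} - b\bar{x} = \frac{225}{10} - \frac{122}{259} \times \frac{310}{10} \approx 7.898$ (to three decimal places).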

Hence, with the 𝑥-term first, the least squares regression line is 𝑦=0.471𝑥+7.898.
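The arithmetic in Example 1 can be reproduced with a short Python sketch. The helper name `line_from_sums` is hypothetical, and the use of `fractions.Fraction` is a choice made here so that the slope stays in exact form until the final rounding, mirroring the fraction-based working in the answer.

```python
from fractions import Fraction

def line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for y_hat = a + b*x from the summary sums, as exact fractions."""
    b = Fraction(n * sum_xy - sum_x * sum_y, n * sum_x2 - sum_x ** 2)
    a = Fraction(sum_y, n) - b * Fraction(sum_x, n)
    return a, b

# Sums from the final row of the table in Example 1.
a, b = line_from_sums(n=10, sum_x=310, sum_y=225, sum_xy=7219, sum_x2=10128)

print(b, round(float(b), 3))  # slope as an exact fraction and to 3 d.p.
print(round(float(a), 3))     # intercept to 3 d.p.
```

Note that the column of $y^2$ values in the table is not needed for the regression line of $y$ on $x$, which is why the helper does not take it as an argument.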

Within our calculations, we have used expressions such as $\sum x$, $\sum y$, and $\sum xy$, which are known as summary statistics.

Definition: Summary Statistics

Summary statistics are statistics that we calculate from the observations in a sample data set, which summarize the data in a way that allows us to communicate, and hence interpret, as much information as possible.

In our next example, we will find the least squares regression line directly from the summary statistics.

Example 2: Calculating a Regression Coefficient for a Least Squares Regression Model from Summary Statistics

For a given data set, $\sum x = 47$, $\sum y = 45.75$, $\sum x^2 = 329$, $\sum y^2 = 389.3125$, $\sum xy = 310.25$, and $n = 8$. Calculate the value of the regression coefficient $b$ in the least squares regression model $y = a + bx$. Give your answer correct to three decimal places.

Answer

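To calculate the regression coefficient $b$ from the summary statistics given, we can use the formula $b = \frac{S_{xy}}{S_{xx}} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$.

Substituting in our values $\sum x = 47$, $\sum y = 45.75$, $\sum x^2 = 329$, $\sum xy = 310.25$, and $n = 8$, we have $b = \frac{8 \times 310.25 - 47 \times 45.75}{8 \times 329 - (47)^2} = \frac{2482 - 2150.25}{2632 - 2209} = \frac{331.75}{423} \approx 0.78428$.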

Hence, to three decimal places, the regression coefficient 𝑏=0.784.
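As a quick check that the two forms of the slope formula agree, the following Python sketch computes $b$ for Example 2 both as $\frac{S_{xy}}{S_{xx}}$ and directly from the sums. The variable names are illustrative choices; the values are the summary statistics given in the question.

```python
# Summary statistics given in Example 2 (the sum of y squared is not needed for b).
n = 8
sum_x, sum_y = 47, 45.75
sum_x2, sum_xy = 329, 310.25

# Form 1: b = S_xy / S_xx
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
b_from_s = s_xy / s_xx

# Form 2: b = (n*sum_xy - sum_x*sum_y) / (n*sum_x2 - sum_x**2)
b_from_sums = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

print(round(b_from_s, 3), round(b_from_sums, 3))
```

The second form is simply the first with numerator and denominator multiplied by $n$, so either can be used, depending on which is more convenient for the data at hand.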

Our next example demonstrates how to find the equation of the least squares regression line for a given set of bivariate data.

Example 3: Calculating the Equation of the Least Squares Regression Line

The scatterplot shows a set of data for which a linear regression model appears appropriate.

The data used to produce this scatterplot is given in the table shown.

𝑥 0.5 1 1.5 2 2.5 3 3.5 4
𝑦 9.25 7.6 8.25 6.5 5.45 4.5 1.75 1.8

Calculate the equation of the least squares regression line of 𝑦 on 𝑥, rounding the regression coefficients to the nearest thousandth.

Answer

The equation of the line of least squares regression is $\hat{y} = a + bx$, where the slope or regression coefficient is $b = \frac{S_{xy}}{S_{xx}} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$.

The $y$-intercept is given by $a = \bar{y} - b\bar{x}$, with $\bar{x} = \frac{\sum x}{n}$, the mean of $x$, and $\bar{y} = \frac{\sum y}{n}$, the mean of $y$. We have eight paired data points $(x, y)$; hence, $n = 8$. To find the two coefficients, $a$ and $b$, we begin by putting our data into a table with columns for the product $xy$ and for $x^2$, since we will need their sums for our calculation. So, for example, in the third column, the first entry is $0.5 \times 9.25 = 4.625$, and so on for each pair $(x, y)$.

𝑥      𝑦      𝑥𝑦       𝑥²
0.5    9.25   4.625    0.25
1      7.6    7.600    1.00
1.5    8.25   12.375   2.25
2      6.5    13.000   4.00
2.5    5.45   13.625   6.25
3      4.5    13.500   9.00
3.5    1.75   6.125    12.25
4      1.8    7.200    16.00
Sum

Our next step is to sum each of the columns so that we have the sums in the final row.

𝑥      𝑦      𝑥𝑦       𝑥²
0.5    9.25   4.625    0.25
1      7.6    7.600    1.00
1.5    8.25   12.375   2.25
2      6.5    13.000   4.00
2.5    5.45   13.625   6.25
3      4.5    13.500   9.00
3.5    1.75   6.125    12.25
4      1.8    7.200    16.00
Sum    Σ𝑥=18.00   Σ𝑦=45.10   Σ𝑥𝑦=78.05   Σ𝑥²=51.00

We can now use these sums in the formula for $b$ to calculate the slope of the regression line: $b = \frac{8 \times 78.05 - 18.0 \times 45.1}{8 \times 51.0 - (18.0)^2} = \frac{624.4 - 811.8}{408 - 324} = \frac{-187.4}{84} = -\frac{937}{420} \approx -2.231$ (to the nearest thousandth).

Note that, from the scatter diagram, we see that as $x$ increases, the $y$-values, in general, decrease, and this is confirmed by the fact that the slope, $b$, is negative. To find the value of the constant $a = \bar{y} - b\bar{x}$, we must first calculate the means $\bar{x}$ and $\bar{y}$. Using the sums from our table, these are $\bar{x} = \frac{\sum x}{n} = \frac{18}{8} = 2.25$ and $\bar{y} = \frac{\sum y}{n} = \frac{45.1}{8} = \frac{451}{80} = 5.6375$.

Hence, keeping these values in exact fractional form for accuracy, together with our value for $b$, which in exact form is equal to $-\frac{937}{420}$, our $y$-intercept is $a = \frac{451}{80} - \left(-\frac{937}{420}\right) \times \frac{18}{8} \approx 10.657$ (to the nearest thousandth).

The equation of the least squares regression line of $y$ on $x$ for this data, to the nearest thousandth, is therefore $\hat{y} = 10.657 - 2.231x$.
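The result of Example 3 can also be checked directly from the raw data. The sketch below uses the deviation forms $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$ and $S_{xx} = \sum (x - \bar{x})^2$ rather than the table of sums. The function name `least_squares_line` is a hypothetical choice for this illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for y_hat = a + b*x using the deviation forms of S_xy and S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Data from the table in Example 3.
xs = [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]
ys = [9.25, 7.6, 8.25, 6.5, 5.45, 4.5, 1.75, 1.8]

a, b = least_squares_line(xs, ys)
print(round(a, 3), round(b, 3))
```

Because floating-point arithmetic is used here instead of exact fractions, the unrounded values may differ from the hand calculation in the last few digits, but they agree to the nearest thousandth.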

In our next example, we will apply our knowledge of the calculation of the least squares regression line to a real-life situation. However, when considering real-life variables in the context of regression, if possible, we first establish which of our variables is the dependent variable and which is the independent variable. These are defined as follows.

Definition: Dependent and Independent Variables

Independent variables are variables that we may control or change and that we believe have a direct effect on a dependent variable. Independent variables are also sometimes called explanatory variables and are often labeled $x$, or $x_i$, $i = 1, \dots, n$, for $n$ explanatory variables.

Dependent variables are variables that are being tested and are dependent on independent variables. Dependent variables are often called response variables as they respond to changes in explanatory variables and are often labeled 𝑦.

Example 4: Finding the Equation of a Regression Line in a Regression Model

Using the information in the table, find the regression line $\hat{y} = a + bx$. Round $a$ and $b$ to 3 decimal places.

Cultivated Land in Feddan 126 13 104 180 38 161 14 99 55 177
Production of a Summer Crop in Kilograms 160 40 80 340 260 200 280 280 140 100

Answer

We begin by determining which of our variables is the independent variable and which is the dependent variable. Since we would expect the amount of a summer crop produced to depend on the amount of land on which it is cultivated, it makes sense that the “production” variable is the dependent variable (𝑦) and the “land” variable is the independent variable (𝑥).

To find the equation of the line of least squares regression, $\hat{y} = a + bx$, we must find the slope or regression coefficient $b$ and the $y$-intercept $a$. We have ten pairs of data, that is, ten measurements of the independent variable “cultivated land in feddan” that are paired with ten measurements of the dependent variable “production of a summer crop in kilograms”, so $n = 10$. We can use the following formula to calculate $b$: $b = \frac{S_{xy}}{S_{xx}} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$.

We will therefore need to find the sums $\sum xy$, $\sum x$, $\sum y$, and $\sum x^2$. Let us put our data into a table with columns for the product $xy$ and for $x^2$ so that we may more easily calculate the required sums.

Cultivated Land (Feddan) 𝑥   Summer Crop (kg) 𝑦   𝑥𝑦        𝑥²
126                          160                  20160     15876
13                           40                   520       169
104                          80                   8320      10816
180                          340                  61200     32400
38                           260                  9880      1444
161                          200                  32200     25921
14                           280                  3920      196
99                           280                  27720     9801
55                           140                  7700      3025
177                          100                  17700     31329
Sum  Σ𝑥=967                  Σ𝑦=1880              Σ𝑥𝑦=189320   Σ𝑥²=130977

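The sums, which are in the final row, have been calculated for each column, and we may now use these in the formula to find $b$: $b = \frac{10 \times 189320 - 967 \times 1880}{10 \times 130977 - (967)^2} = \frac{1893200 - 1817960}{1309770 - 935089} = \frac{75240}{374681} \approx 0.20081 \approx 0.201$ (to 3 d.p.).

The $y$-intercept is given by $a = \bar{y} - b\bar{x}$, where $\bar{x}$ is the mean of the $x$-values and $\bar{y}$ is the mean of the $y$-values. These are $\bar{x} = \frac{\sum x}{n} = \frac{967}{10} = 96.7$ and $\bar{y} = \frac{\sum y}{n} = \frac{1880}{10} = 188$.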

To calculate $a$ accurately to three decimal places, we need to substitute a suitably accurate value for $b$. Here, we can substitute the exact fraction, or a decimal accurate to at least five decimal places. Therefore, calculating $a$, we have $a \approx 188 - 96.7 \times 0.20081 \approx 188 - 19.41840 \approx 168.5816 \approx 168.582$ (to 3 d.p.).
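The effect of rounding the slope too early can be checked numerically. The following Python sketch, with illustrative variable names, computes the intercept for Example 4 twice: once with the unrounded slope and once with the slope already rounded to three decimal places.

```python
# Sums from the table in Example 4.
n = 10
sum_x, sum_y = 967, 1880
sum_xy, sum_x2 = 189320, 130977

x_bar = sum_x / n
y_bar = sum_y / n

b_full = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b_rounded = round(b_full, 3)  # slope rounded to 3 d.p. before use

a_from_full = y_bar - b_full * x_bar
a_from_rounded = y_bar - b_rounded * x_bar

print(round(a_from_full, 3))     # intercept using the unrounded slope
print(round(a_from_rounded, 3))  # intercept using the slope rounded too early
```

The two intercepts differ in the second decimal place, because any rounding error in $b$ is multiplied by $\bar{x} = 96.7$ when $a$ is calculated.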

With the values of our regression coefficient and $y$-intercept to three decimal places, the line of least squares regression is $\hat{y} = 0.201x + 168.582$.

We may interpret this as follows: for every additional unit of cultivated land in feddan, we expect the production of the summer crop to increase by approximately 0.2 kg. We might also interpret the value of 𝑎, since this is the 𝑦-intercept. However, we need to be careful that our interpretation makes sense within the context of the data. In our case, with 𝑎=168.582, we might conclude that, with no cultivated land, that is, 𝑥=0, we could expect to produce 168.582 kg of the summer crop, which does not make physical sense. We might perhaps infer that we begin with 168.582 kg of the summer crop from other sources, but we do not know this from the data. This illustrates how care must be taken when considering how variables behave outside of the range of the given data.

Once we have a regression model, which in the case of linear data is the least squares line of regression, we may, with care, use our model to estimate values for the dependent variable. We see how this works in our next example.

Example 5: Calculating an Estimated Value for a Variable at a Given Point in a Regression Model

Using the information in the table, estimate the value of 𝑦 when 𝑥=13. Give your answer to the nearest integer.

𝑥 23 9 24 15 7 12
𝑦 22 24 25 13 21 9

Answer

We are given a set of bivariate data where we have six pairs of values for the two variables $x$ and $y$. To estimate a $y$-value for a given $x$-value, assuming the data is approximately linear, we must first find the equation of the regression line, $\hat{y} = a + bx$. To do this, we first calculate the slope $b$, using the formula below: $b = \frac{S_{xy}}{S_{xx}} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$.

This requires the sums $\sum xy$, $\sum x$, $\sum y$, and $\sum x^2$, and by putting our data into a table with columns for the product $xy$ and for $x^2$, we can easily calculate these as shown below.

𝑥     𝑦     𝑥𝑦     𝑥²
23    22    506    529
9     24    216    81
24    25    600    576
15    13    195    225
7     21    147    49
12    9     108    144
Sum   Σ𝑥=90   Σ𝑦=114   Σ𝑥𝑦=1772   Σ𝑥²=1604

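Substituting the necessary sums into our formula for $b$ gives $b = \frac{6 \times 1772 - 90 \times 114}{6 \times 1604 - (90)^2} = \frac{10632 - 10260}{9624 - 8100} = \frac{372}{1524} \approx 0.2440945 \approx 0.244$ (to three decimal places).

The $y$-intercept satisfies the equation $a = \bar{y} - b\bar{x}$, and we will use our value for $b$ to calculate it. However, first we must find the means of the $x$- and the $y$-values. These are $\bar{x} = \frac{\sum x}{n} = \frac{90}{6} = 15$ and $\bar{y} = \frac{\sum y}{n} = \frac{114}{6} = 19$.

With these values, we then have $a = 19 - 0.2440945 \times 15 \approx 15.3386$.

The regression line is, therefore, $\hat{y} = 15.3386 + 0.2440945x$. Now, if we substitute $x = 13$, we find $\hat{y} = 15.3386 + 0.2440945 \times 13 \approx 18.512$.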

Hence, to the nearest integer, when 𝑥=13, we estimate 𝑦=19.
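The whole of Example 5, from raw data to estimate, can be sketched in a few lines of Python. The helper name `predict` is hypothetical; the data are those in the table above.

```python
# Data from the table in Example 5.
xs = [23, 9, 24, 15, 7, 12]
ys = [22, 24, 25, 13, 21, 9]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope and intercept of the least squares regression line y_hat = a + b*x.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

def predict(x):
    """Estimated y on the fitted line y_hat = a + b*x."""
    return a + b * x

y_hat = predict(13)
print(round(y_hat, 3), round(y_hat))
```

Keeping $a$ and $b$ unrounded until the final step avoids the accumulation of rounding error in the estimate.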

In this example, we estimated a value of the dependent variable 𝑦 for a value of 𝑥 that was within our range of known values. This is called interpolation and the following definition clarifies this.

Definition: Interpolation and Extrapolation

Interpolation Estimating or predicting a value of the dependent variable from within the range of known values of the independent variable.

Extrapolation Estimating or predicting a value of the dependent variable from outside the range of known values of the independent variable.

Extrapolation should be used with the utmost caution, if at all. The behavior of the variables may change outside the known range of data leading to errors. Therefore, extrapolation should be avoided where possible.
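As a rough illustration of the distinction, the sketch below labels an estimate as interpolation or extrapolation by comparing the chosen $x$-value with the smallest and largest observed $x$-values. The helper `predict_with_range_check` is hypothetical, the coefficients are the rounded values from Example 5, and $x = 40$ is an arbitrary value chosen only because it lies outside the observed range.

```python
def predict_with_range_check(x, a, b, x_min, x_max):
    """Return (y_hat, kind), where kind says whether x lies inside the observed x-range."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# Example 5: coefficients as rounded in the worked answer, and the observed x-values.
a, b = 15.3386, 0.2440945
xs = [23, 9, 24, 15, 7, 12]

print(predict_with_range_check(13, a, b, min(xs), max(xs)))  # x within 7..24
print(predict_with_range_check(40, a, b, min(xs), max(xs)))  # x outside 7..24
```

A check like this does not make an extrapolated estimate reliable; it only flags that the estimate relies on the linear pattern continuing beyond the data.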

We complete this explainer by recalling some of the key points covered.

Key Points

  • The least squares regression line is a linear model for bivariate data sets consisting of 𝑛 data points (𝑥,𝑦), where 𝑥 is the independent, or explanatory, variable and 𝑦 is the dependent, or response, variable.
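  • The least squares regression line is the line whose sum of squares of distances of the data points from that line is a minimum. The equation of the line is $\hat{y} = a + bx$, where $b = \frac{S_{xy}}{S_{xx}} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$ and $a = \bar{y} - b\bar{x}$, with $\bar{x} = \frac{\sum x}{n}$, the mean of $x$, and $\bar{y} = \frac{\sum y}{n}$, the mean of $y$, and where $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$ and $S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}$.
  • The slope, $b$, may also be written as $b = r\frac{s_y}{s_x}$, where $r$ is the correlation coefficient and $s_x$ and $s_y$ are the standard deviations of $x$ and $y$.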
  • We may use the least squares regression line to estimate or predict values of the dependent variable using interpolation, that is, using 𝑥-values within the known range. Extrapolation, that is, using values outside this range to estimate or predict is not advisable as results may be erroneous.
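The alternative form of the slope, $b = r\frac{s_y}{s_x}$, can be checked numerically against $b = \frac{S_{xy}}{S_{xx}}$. The sketch below uses the data from Example 5 and assumes the usual formula $r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$ for the correlation coefficient, with $S_{yy} = \sum (y - \bar{y})^2$, which is not stated in this explainer. It uses sample standard deviations via `statistics.stdev`; population standard deviations would give the same ratio, provided both use the same divisor.

```python
from math import sqrt
from statistics import stdev

# Data from Example 5.
xs = [23, 9, 24, 15, 7, 12]
ys = [22, 24, 25, 13, 21, 9]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)          # correlation coefficient (assumed formula)
b_from_r = r * stdev(ys) / stdev(xs)  # b = r * s_y / s_x
b_direct = s_xy / s_xx                # b = S_xy / S_xx

print(round(b_from_r, 3), round(b_direct, 3))
```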
