100C note - Lecture notes 1-30
Complete Stats 100C notes
Course: Statistics Linear Model (100C), University of California, Los Angeles
Academic year: 2017/2018

Simplest Regression

Simplest regression. Suppose we observe $(x_i, y_i)$, $i = 1, \ldots, n$. The simplest linear regression model is $y_i = \beta x_i + \epsilon_i$.

Least squares principle. The least squares estimator $\hat\beta$ is defined by minimizing the residual sum of squares $R(\beta) = \sum_{i=1}^n (y_i - \beta x_i)^2$.

Least squares estimating equation. $R'(\beta) = 0$, i.e., $\sum_{i=1}^n (y_i - \beta x_i) x_i = 0$.

Least squares estimator. The solution is $\hat\beta = \sum_{i=1}^n x_i y_i / \sum_{i=1}^n x_i^2$. Since $R''(\beta) = 2\sum_{i=1}^n x_i^2 \geq 0$, $\hat\beta$ is a minimum of $R(\beta)$.
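A minimal numerical sketch of this estimator (NumPy assumed; the data and values below are simulated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta_true, sigma = 50, 2.0, 1.0
x = rng.uniform(1, 3, size=n)                # fixed design for this illustration
y = beta_true * x + rng.normal(0, sigma, n)  # y_i = beta * x_i + eps_i

# Least squares estimator for the no-intercept model: sum(x*y) / sum(x^2)
beta_hat = np.sum(x * y) / np.sum(x**2)

# Cross-check with the general least squares solver (single column, no intercept)
beta_lstsq, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
print(beta_hat, beta_lstsq[0])   # the two estimates agree
```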

Geometric interpretation. Let $x$ be the column vector $(x_1, \ldots, x_n)^T$ and $y$ the column vector $(y_1, \ldots, y_n)^T$. Then $R(\beta) = |y - \beta x|^2$, so $\hat\beta x$ is the projection of $y$ on $x$. Let $e = y - \hat\beta x$ be the residual vector. Then $e \perp x$, i.e., $\langle e, x\rangle = 0$, i.e., $\langle y - \hat\beta x, x\rangle = 0$, which is the least squares estimating equation in vector notation.

Figure 1: Least squares projection

The solution is $\hat\beta = \langle x, y\rangle/|x|^2$, where $|x|^2 = \langle x, x\rangle$.

Projection. Let $u_1 = x/|x|$ be the unit vector along $x$. The projection of $y$ on $x$ (or on $u_1$) is $\langle y, u_1\rangle u_1 = (\langle x, y\rangle/|x|^2)\, x = \hat\beta x$. So $\hat\beta = \langle x, y\rangle/|x|^2$.

Frequentist framework of hypothetical repeated sampling. If we repeatedly sample $(x_i, y_i)$ and compute $\hat\beta$, what will happen?

True model. Assume the data are generated by the true model $y_i = \beta_{\mathrm{true}} x_i + \epsilon_i$ for $i = 1, \ldots, n$, where the $x_i$ are fixed and $\epsilon_i \sim N(0, \sigma^2)$ independently.

Sampling property. Under the true model, the least squares estimator can be expressed as

$$\hat\beta = \frac{\sum_{i=1}^n x_i(\beta_{\mathrm{true}} x_i + \epsilon_i)}{\sum_{i=1}^n x_i^2} = \beta_{\mathrm{true}} + \frac{\sum_{i=1}^n x_i \epsilon_i}{\sum_{i=1}^n x_i^2}.$$

So $E(\hat\beta) = \beta_{\mathrm{true}}$, and

$$\mathrm{Var}(\hat\beta) = \frac{\sum_{i=1}^n x_i^2\, \sigma^2}{\big(\sum_{i=1}^n x_i^2\big)^2} = \frac{\sigma^2}{\sum_{i=1}^n x_i^2}.$$
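A short simulation sketch of this sampling property (NumPy assumed, illustrative values): over repeated samples with the same fixed $x_i$, the average of $\hat\beta$ should be close to $\beta_{\mathrm{true}}$ and its variance close to $\sigma^2/\sum_i x_i^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta_true, sigma, reps = 30, 1.5, 2.0, 20000
x = rng.uniform(0.5, 2.0, size=n)   # fixed design, held constant across repetitions

betas = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0, sigma, n)
    y = beta_true * x + eps
    betas[r] = np.sum(x * y) / np.sum(x**2)

print(betas.mean())                           # approximately beta_true = 1.5
print(betas.var(), sigma**2 / np.sum(x**2))   # the two variances should be close
```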

Combination of equations. We can combine the equations $y_i \approx \beta x_i$ into a single estimating equation $\sum_{i=1}^n w_i y_i = \sum_{i=1}^n w_i x_i \beta$, where the $w_i$ are constants. The resulting estimator is

$$\hat\beta = \frac{\sum_{i=1}^n w_i y_i}{\sum_{i=1}^n w_i x_i}.$$

The least squares estimator corresponds to $w_i \propto x_i$.

Optimality of least squares. For the above estimator based on a combination of equations, we have

$$\hat\beta = \frac{\sum_{i=1}^n w_i(\beta_{\mathrm{true}} x_i + \epsilon_i)}{\sum_{i=1}^n w_i x_i} = \beta_{\mathrm{true}} + \frac{\sum_{i=1}^n w_i \epsilon_i}{\sum_{i=1}^n w_i x_i}.$$

Thus $E(\hat\beta) = \beta_{\mathrm{true}}$, and

$$\mathrm{Var}(\hat\beta) = \frac{\sum_{i=1}^n w_i^2\, \sigma^2}{\big(\sum_{i=1}^n w_i x_i\big)^2} = \frac{|w|^2\sigma^2}{\langle w, x\rangle^2} = \frac{|w|^2\sigma^2}{(|w||x|\cos\theta)^2} = \frac{\sigma^2}{|x|^2\cos^2\theta} \geq \frac{\sigma^2}{|x|^2},$$

which is the variance of the least squares estimator. The minimum is achieved when $w \propto x$.
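A simulation sketch of this optimality (NumPy assumed, illustrative values), comparing the weighted estimator with $w = x$ against another unbiased choice, e.g. $w_i = 1$ for all $i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta_true, sigma, reps = 30, 1.0, 1.0, 20000
x = rng.uniform(0.5, 2.0, size=n)   # fixed design
w_ls  = x                           # least squares weights: w proportional to x
w_alt = np.ones(n)                  # an alternative unbiased choice: equal weights

est_ls, est_alt = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = beta_true * x + rng.normal(0, sigma, n)
    est_ls[r]  = np.sum(w_ls  * y) / np.sum(w_ls  * x)
    est_alt[r] = np.sum(w_alt * y) / np.sum(w_alt * x)

# Both estimators are unbiased, but w = x attains the smaller variance sigma^2 / |x|^2.
print(est_ls.mean(), est_alt.mean())
print(est_ls.var(), est_alt.var(), sigma**2 / np.sum(x**2))
```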

In vector notation. The true model is $y = \beta_{\mathrm{true}} x + \epsilon$, so

$$\hat\beta = \frac{\langle x, \beta_{\mathrm{true}} x + \epsilon\rangle}{|x|^2} = \beta_{\mathrm{true}} + \frac{\langle x, \epsilon\rangle}{|x|^2}.$$

Fitted values. The vector of fitted values, or the projected vector, is

$$\hat y = \hat\beta x = \beta_{\mathrm{true}} x + \frac{\langle x, \epsilon\rangle}{|x|^2}\, x = \beta_{\mathrm{true}} x + \langle \epsilon, u_1\rangle u_1.$$

Orthonormal vectors. Let $u_1, u_2, \ldots, u_n$ be a set of orthonormal vectors, i.e., $|u_i| = 1$ and $\langle u_i, u_j\rangle = 0$ for $i \neq j$, with $u_1 = x/|x|$ as above. Let $\epsilon = \sum_{i=1}^n \delta_i u_i$ be the decomposition of $\epsilon$ in the coordinate system $(u_i,\ i = 1, \ldots, n)$, where $\delta_i = \langle \epsilon, u_i\rangle$.

Residual vector.

$$e = y - \hat\beta x = \epsilon - \delta_1 u_1 = \sum_{i=2}^n \delta_i u_i.$$

Residual sum of squares.

$$|e|^2 = |y - \hat\beta x|^2 = \sum_{i=1}^n (y_i - \hat\beta x_i)^2 = \Big|\sum_{i=2}^n \delta_i u_i\Big|^2 = \sum_{i=2}^n \delta_i^2,$$

where $\delta_i \sim N(0, \sigma^2)$ independently, i.e., $\delta_i = \sigma Z_i$ with $Z_i \sim N(0, 1)$ independently for $i = 1, \ldots, n$. Thus

$$\sum_{i=1}^n (y_i - \hat\beta x_i)^2 = \sigma^2 \sum_{i=2}^n Z_i^2 = \sigma^2 \chi^2_{n-1}.$$

Residual standard error.

$$s^2 = \frac{\sum_{i=1}^n (y_i - \hat\beta x_i)^2}{n-1} = \frac{\sigma^2\sum_{i=2}^n Z_i^2}{n-1}.$$

So $E(s^2) = \sigma^2$.
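A simulation sketch (NumPy assumed, illustrative values) checking that the residual sum of squares behaves like $\sigma^2\chi^2_{n-1}$ and that $E(s^2) = \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta_true, sigma, reps = 25, 1.0, 1.5, 20000
x = rng.uniform(0.5, 2.0, size=n)   # fixed design

rss = np.empty(reps)
for r in range(reps):
    y = beta_true * x + rng.normal(0, sigma, n)
    beta_hat = np.sum(x * y) / np.sum(x**2)
    rss[r] = np.sum((y - beta_hat * x) ** 2)

# E(sigma^2 * chi^2_{n-1}) = (n-1) * sigma^2, and s^2 = RSS/(n-1) estimates sigma^2.
print(rss.mean(), (n - 1) * sigma**2)
print((rss / (n - 1)).mean(), sigma**2)
```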

Z statistic.

$$\hat\beta = \beta_{\mathrm{true}} + \frac{\langle x, \epsilon\rangle}{|x|^2} = \beta_{\mathrm{true}} + \frac{\delta_1}{|x|} \sim N\!\left(\beta_{\mathrm{true}},\ \frac{\sigma^2}{|x|^2}\right).$$

Let $\sigma(\hat\beta) = \sigma/|x|$ denote the standard deviation of $\hat\beta$.

Lasso and soft-thresholding. Consider the Lasso-penalized criterion $L(\beta) = \sum_{i=1}^n (y_i - \beta x_i)^2 + 2\lambda|\beta|$. If $-\lambda \leq \sum_{i=1}^n x_i y_i \leq \lambda$, then $\hat\beta = 0$ minimizes $L(\beta)$, because the left derivative of $L(\beta)$ satisfies $L'_-(0) \leq 0$ and the right derivative satisfies $L'_+(0) \geq 0$. For simplicity, assume that $|x| = 1$. Then we can write $\hat\beta = T(\langle x, y\rangle)$, where $T(r)$ is the soft-thresholding function, i.e., $T(r) = 0$ if $|r| \leq \lambda$, $T(r) = r - \lambda$ if $r > \lambda$, and $T(r) = r + \lambda$ if $r < -\lambda$. This is consistent with hypothesis testing.
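A small sketch of this soft-thresholding rule (NumPy assumed; the design is rescaled to match the $|x| = 1$ assumption, and the data are simulated):

```python
import numpy as np

def soft_threshold(r, lam):
    """Soft-thresholding: 0 inside [-lam, lam], otherwise shrink toward 0 by lam."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

rng = np.random.default_rng(4)
n, lam = 40, 1.0
x = rng.normal(size=n)
x = x / np.linalg.norm(x)              # rescale so that |x| = 1
y = 3.0 * x + rng.normal(0, 0.5, n)    # true beta = 3, so <x, y> is roughly 3

r = np.dot(x, y)                       # <x, y>, the least squares estimate when |x| = 1
beta_lasso = soft_threshold(r, lam)    # Lasso estimate: shrunk toward zero, possibly exactly 0
print(r, beta_lasso)
```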

Bias and variance tradeoff. The mean squared error is (letting $\mu = E(\hat\beta)$, so that $E(\hat\beta - \mu) = 0$)

$$E[(\hat\beta - \beta_{\mathrm{true}})^2] = E[((\hat\beta - \mu) - (\beta_{\mathrm{true}} - \mu))^2] = E[(\hat\beta - \mu)^2] + (\beta_{\mathrm{true}} - \mu)^2 = \mathrm{Var}(\hat\beta) + \mathrm{Bias}(\hat\beta)^2,$$

where $\mathrm{Bias}(\hat\beta) = E(\hat\beta) - \beta_{\mathrm{true}}$, and the cross term vanishes because $E(\hat\beta - \mu) = 0$. The ridge regression estimator and the Lasso estimator are biased, but they may have smaller variances than the unbiased least squares estimator, so in the end they can have smaller mean squared errors.

Simple Regression and Correlation

Simple regression. Suppose we observe $(x_i, y_i)$, $i = 1, \ldots, n$. The simple linear regression model is $y_i = \alpha + \beta x_i + \epsilon_i$.

Least squares principle. The least squares estimators $(\hat\alpha, \hat\beta)$ are defined by minimizing the residual sum of squares $R(\alpha, \beta) = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$.

Least squares estimating equations. (1) $\frac{\partial}{\partial\alpha}R(\alpha, \beta) = 0$, i.e., $\sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0$, so $\bar y - \alpha - \beta\bar x = 0$; that is, the regression line goes through the center of the data $(\bar x, \bar y)$, and at the minimum $\alpha = \bar y - \beta\bar x$. (2) $\frac{\partial}{\partial\beta}R(\alpha, \beta) = 0$, i.e., $\sum_{i=1}^n (y_i - \alpha - \beta x_i)x_i = 0$; substituting $\alpha$ by $\bar y - \beta\bar x$, we have $\sum_{i=1}^n ((y_i - \bar y) - \beta(x_i - \bar x))x_i = 0$. So

$$\hat\beta = \frac{\sum_{i=1}^n (y_i - \bar y)x_i}{\sum_{i=1}^n (x_i - \bar x)x_i},$$

and $\hat\alpha = \bar y - \hat\beta\bar x$.

Centralization. Let $\tilde x_i = x_i - \bar x$ and $\tilde y_i = y_i - \bar y$, so that $\sum_{i=1}^n \tilde x_i = 0$ and $\sum_{i=1}^n \tilde y_i = 0$. Then

$$\hat\beta = \frac{\sum_{i=1}^n \tilde x_i\tilde y_i}{\sum_{i=1}^n \tilde x_i^2}.$$
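A minimal sketch of these formulas (NumPy assumed, simulated data), cross-checked against np.polyfit:

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha_true, beta_true = 60, 1.0, 0.5
x = rng.uniform(0, 10, n)
y = alpha_true + beta_true * x + rng.normal(0, 1.0, n)

xt, yt = x - x.mean(), y - y.mean()          # centralized variables
beta_hat  = np.sum(xt * yt) / np.sum(xt**2)  # slope from the centered formula
alpha_hat = y.mean() - beta_hat * x.mean()   # intercept: line passes through (x_bar, y_bar)

slope, intercept = np.polyfit(x, y, 1)       # degree-1 polynomial fit for comparison
print(beta_hat, slope)
print(alpha_hat, intercept)
```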

Transforming to simplest regression. Let $y_i' = y_i - \beta x_i$. Then $R(\alpha, \beta) = \sum_{i=1}^n (y_i' - \alpha\cdot 1)^2$, which can be considered a simplest regression of $y_i'$ on the constant predictor $1$. So at the minimum, $\alpha = \sum_{i=1}^n y_i'/n = \bar y - \beta\bar x$. Substituting this $\alpha$ back, $R(\alpha, \beta) = \sum_{i=1}^n (\tilde y_i - \beta\tilde x_i)^2$, which can be considered a simplest regression of $\tilde y_i$ on $\tilde x_i$. So $\hat\beta = \sum_{i=1}^n \tilde x_i\tilde y_i / \sum_{i=1}^n \tilde x_i^2$.

Geometric interpretation. Let $x$ be the column vector $(x_1, \ldots, x_n)^T$, $y$ the column vector $(y_1, \ldots, y_n)^T$, and $1 = (1, 1, \ldots, 1)^T$ a vector of 1's of length $n$. Then $R(\alpha, \beta) = |y - \alpha 1 - \beta x|^2$. So $\hat\alpha 1 + \hat\beta x$ is the projection of $y$ on the subspace spanned by $1$ and $x$. Let $e = y - \hat\alpha 1 - \hat\beta x$ be the residual vector. Then $e \perp 1$ and $e \perp x$, i.e., $\langle e, 1\rangle = 0$ and $\langle e, x\rangle = 0$. This gives the estimating equations (1) $\langle y - \alpha 1 - \beta x, 1\rangle = 0$ and (2) $\langle y - \alpha 1 - \beta x, x\rangle = 0$, which are the least squares estimating equations in vector notation.

Sample covariance and correlation. Let $C_{xy} = \frac{1}{n}\sum_{i=1}^n \tilde x_i\tilde y_i$ be the sample covariance. Let $V_x = \frac{1}{n}\sum_{i=1}^n \tilde x_i^2$ be the sample variance of $(x_i)$ and $V_y = \frac{1}{n}\sum_{i=1}^n \tilde y_i^2$ the sample variance of $(y_i)$. The sample correlation is defined as

$$\rho = \frac{C_{xy}}{\sqrt{V_x V_y}}.$$

Figure 2: Least squares projection

Correlation and angle. Let $\tilde x = (\tilde x_1, \ldots, \tilde x_n)^T$ and $\tilde y = (\tilde y_1, \ldots, \tilde y_n)^T$. Then $C_{xy} = \frac{1}{n}\langle\tilde x, \tilde y\rangle$, $V_x = \frac{1}{n}|\tilde x|^2$, and $V_y = \frac{1}{n}|\tilde y|^2$. So $\rho = \langle\tilde x, \tilde y\rangle/(|\tilde x||\tilde y|) = \cos\theta$, where $\theta$ is the angle between $\tilde x$ and $\tilde y$.

Figure 3: Correlation as cosine of angle

Correlation and regression. The regression coefficient is

$$\hat\beta = \frac{\langle\tilde x, \tilde y\rangle}{|\tilde x|^2} = \frac{C_{xy}}{V_x} = \rho\,\frac{\sigma_y}{\sigma_x},$$

where $\sigma_x = \sqrt{V_x}$ and $\sigma_y = \sqrt{V_y}$, and the ratio $\sigma_y/\sigma_x$ serves to fix the units of $y$ and $x$. So the regression line is $y - \bar y = \hat\beta(x - \bar x)$, that is,

$$\frac{y - \bar y}{\sigma_y} = \rho\,\frac{x - \bar x}{\sigma_x}.$$
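A quick numerical check of this identity (NumPy assumed; np.corrcoef and np.std with its default ddof=0 match the $1/n$ definitions used here):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.normal(0, 2.0, n)
y = 1.0 + 0.7 * x + rng.normal(0, 1.0, n)

xt, yt = x - x.mean(), y - y.mean()
beta_hat = np.sum(xt * yt) / np.sum(xt**2)

rho = np.corrcoef(x, y)[0, 1]        # sample correlation
sx, sy = np.std(x), np.std(y)        # divide-by-n standard deviations, matching V_x and V_y
print(beta_hat, rho * sy / sx)       # the two should agree
```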

The regression line suggests that $y$ moves back toward the mean because $|\rho| \leq 1$. This is the original meaning of "regression".

Analysis of variance. Since $\hat\beta$ minimizes $\sum_{i=1}^n (\tilde y_i - \beta\tilde x_i)^2 = |\tilde y - \beta\tilde x|^2$, $\hat\beta\tilde x$ is the projection of $\tilde y$ on $\tilde x$. Let $e = \tilde y - \hat\beta\tilde x = y - \hat\alpha - \hat\beta x$ (because $\hat\alpha = \bar y - \hat\beta\bar x$) be the residual vector. Thus $|\tilde y|^2 = |\hat\beta\tilde x|^2 + |e|^2$, that is,

$$\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n \big(\hat\beta(x_i - \bar x)\big)^2 + \sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2.$$

That is, SS(total) = SS(regression) + SS(residual).

R-square. $R^2$ is the proportion of the total sum of squares explained by the regression, i.e., $R^2 = \mathrm{SS(regression)}/\mathrm{SS(total)}$:

$$R^2 = \frac{|\hat\beta\tilde x|^2}{|\tilde y|^2} = \cos^2\theta = \rho^2.$$

So the magnitude of the correlation measures the strength of the linear relationship.
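A short check of the variance decomposition and of $R^2 = \rho^2$ (NumPy assumed, simulated data in the same style as above):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(-3, 3, n)
y = 2.0 - 1.5 * x + rng.normal(0, 2.0, n)

xt, yt = x - x.mean(), y - y.mean()
beta_hat  = np.sum(xt * yt) / np.sum(xt**2)
alpha_hat = y.mean() - beta_hat * x.mean()

ss_total = np.sum((y - y.mean())**2)
ss_reg   = np.sum((beta_hat * xt)**2)
ss_res   = np.sum((y - alpha_hat - beta_hat * x)**2)

rho = np.corrcoef(x, y)[0, 1]
print(ss_total, ss_reg + ss_res)    # SS(total) = SS(regression) + SS(residual)
print(ss_reg / ss_total, rho**2)    # R^2 equals the squared correlation
```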

Figure 4: Geometry of centralization

Now consider the simple linear regression, where we project $y$ on the space spanned by $1$ and $x$. Let the end point of $y$ be $A$ and the projection be $B$. If we project both $A$ and $B$ onto $1$, they meet at the same point $C$. The reason is as follows. Project $B$ onto $1$ and let $C$ be the projection. Then $1$ is perpendicular to the line $CB$. Meanwhile, $1$ is perpendicular to the line $AB$, since $AB$ is perpendicular to the plane spanned by $1$ and $x$. Thus $1$ is perpendicular to the plane $ABC$, so the line $AC$ is also perpendicular to $1$.

Figure 5: Geometry of reducing simple regression to simplest regression

Since $C$ is the projection of $y$ onto $1$, $C$ is $\bar y 1$. Meanwhile, $C$ is also the projection of $B$ onto $1$, where $B$ is $\hat y = \hat\alpha 1 + \hat\beta x$, so $C$ is also $(\hat\alpha + \hat\beta\bar x)1$. Thus $\bar y = \hat\alpha + \hat\beta\bar x$, that is, $(\bar x, \bar y)$ is on the regression line. Looking at the triangle $ABC$: the vector from $C$ to $A$ is $\tilde y$, the vector from $B$ to $A$ is $e$, and the vector from $C$ to $B$ is $\hat y - (\hat\alpha + \hat\beta\bar x)1 = \hat\beta\tilde x$. Thus $|\tilde y|^2 = \hat\beta^2|\tilde x|^2 + |e|^2$, which is the analysis of variance result obtained before.

Multiple Regression in Matrices

A meta rule for matrices and vectors. Whenever possible, you can think of matrices or vectors as if they were scalars in your calculations; you only need to take care to match the dimensions. We shall define expectations, variances, and derivatives for matrices and vectors as a matter of conveniently packaging their elements, so that we can avoid subindices in our calculations.

Multiple regression. Let $x_i = (x_{i1}, \ldots, x_{ip})^T$ and $\beta = (\beta_1, \ldots, \beta_p)^T$. The multiple regression model is

$$y_i = \sum_{j=1}^p \beta_j x_{ij} + \epsilon_i = x_i^T\beta + \epsilon_i,$$

for $i = 1, \ldots, n$. Let $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$, let $X = (X_1, \ldots, X_p)$ be the $n\times p$ matrix, let $Y = (y_1, \ldots, y_n)^T$, and let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$. We can then write the model as

$$Y = \sum_{j=1}^p \beta_j X_j + \epsilon = X\beta + \epsilon.$$

Least squares principle. We estimate $\beta$ by minimizing the residual sum of squares

$$R(\beta) = \sum_{i=1}^n \Big(y_i - \sum_{j=1}^p \beta_j x_{ij}\Big)^2 = \sum_{i=1}^n (y_i - x_i^T\beta)^2 = \Big|Y - \sum_{j=1}^p \beta_j X_j\Big|^2 = |Y - X\beta|^2.$$

Least squares estimating equations. Taking the partial derivative with respect to each $\beta_j$, we have

$$\frac{\partial R(\beta)}{\partial\beta_j} = -2\sum_{i=1}^n \Big(y_i - \sum_{k=1}^p \beta_k x_{ik}\Big)x_{ij}.$$

Defining $\partial R(\beta)/\partial\beta = \big(\partial R(\beta)/\partial\beta_1, \ldots, \partial R(\beta)/\partial\beta_p\big)^T$, we have

$$\frac{\partial R(\beta)}{\partial\beta} = -2\sum_{i=1}^n \Big(y_i - \sum_{k=1}^p \beta_k x_{ik}\Big)(x_{i1}, \ldots, x_{ip})^T = -2\sum_{i=1}^n (y_i - x_i^T\beta)\,x_i.$$

So the least squares estimating equation is

$$\sum_{i=1}^n x_i(y_i - x_i^T\beta) = 0.$$

Thus the least squares estimator is $\hat\beta = \big(\sum_{i=1}^n x_ix_i^T\big)^{-1}\big(\sum_{i=1}^n x_iy_i\big)$.

The partial derivative $\partial R(\beta)/\partial\beta_j$ can also be written as

$$\frac{\partial R(\beta)}{\partial\beta_j} = -2\Big\langle Y - \sum_{k=1}^p \beta_k X_k,\ X_j\Big\rangle = -2X_j^T\Big(Y - \sum_{k=1}^p \beta_k X_k\Big) = -2X_j^T(Y - X\beta).$$

So

$$\frac{\partial R(\beta)}{\partial\beta} = -2(X_1, \ldots, X_p)^T(Y - X\beta) = -2X^T(Y - X\beta).$$

Setting this to zero gives the normal equations $X^TX\beta = X^TY$, so the least squares estimator in matrix form is $\hat\beta = (X^TX)^{-1}X^TY$.
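A minimal sketch of solving the normal equations numerically (NumPy assumed, simulated data; np.linalg.solve is used instead of forming an explicit inverse, and np.linalg.lstsq serves as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(0, 1.0, n)

# Normal equations: (X^T X) beta = X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with the built-in least squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)
print(beta_lstsq)
```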

Expectation of a random matrix. For a random matrix $X = (x_{ij})$, define $E(X) = (E(x_{ij}))$ elementwise. If $Y = AX$ for a constant matrix $A$, then $y_{ij} = \sum_k a_{ik}x_{kj}$, and $E(y_{ij}) = E\big(\sum_k a_{ik}x_{kj}\big) = \sum_k a_{ik}E(x_{kj})$. Thus $E(Y) = (E(y_{ij})) = \big(\sum_k a_{ik}E(x_{kj})\big) = A\,E(X)$.

Variance-covariance matrix of a random vector. Let $X$ be a random vector and $\mu_X = E(X)$. We define $\mathrm{Var}(X) = E[(X - \mu_X)(X - \mu_X)^T]$. Then the $(i,j)$-th element of $\mathrm{Var}(X)$ is $\mathrm{Cov}(x_i, x_j)$. Let $A$ be a constant matrix of appropriate dimension; then $\mathrm{Var}(AX) = A\,\mathrm{Var}(X)A^T$. This is because

$$\mathrm{Var}(AX) = E[(AX - E(AX))(AX - E(AX))^T] = E[(AX - A\mu_X)(AX - A\mu_X)^T] = E[A(X - \mu_X)(X - \mu_X)^TA^T] = A\,E[(X - \mu_X)(X - \mu_X)^T]A^T = A\,\mathrm{Var}(X)A^T.$$

Positive definite matrix. A matrix $V$ is said to be positive definite (or, more precisely, non-negative definite), denoted $V \geq 0$, if for any vector $a \neq 0$ (i.e., not all of its components are zero), $a^TVa \geq 0$. If $V = \mathrm{Var}(X)$, then $V \geq 0$, because $\mathrm{Var}(a^TX) = a^T\mathrm{Var}(X)a \geq 0$.

Comparing two positive definite matrices. For $V_1 \geq 0$ and $V_2 \geq 0$, we say that $V_1 \geq V_2$ if $V_1 - V_2 \geq 0$, i.e., for any $a \neq 0$, $a^T(V_1 - V_2)a \geq 0$, i.e., $a^TV_1a \geq a^TV_2a$.

Cauchy-Schwarz inequality. For any matrix $M$, $M^TM \geq M^THM$, where $H$ is a projection matrix. This is because for any $a \neq 0$, letting $Y = Ma$, we have $a^TM^TMa = Y^TY \geq Y^THY = a^TM^THMa$, since for a projection matrix $H$, $Y^THY = |HY|^2 \leq |Y|^2$.

Sampling property. Under the true model, the least squares estimator is

$$\hat\beta = (X^TX)^{-1}X^TY = (X^TX)^{-1}X^T(X\beta_{\mathrm{true}} + \epsilon) = \beta_{\mathrm{true}} + (X^TX)^{-1}X^T\epsilon.$$

The fitted vector is $\hat Y = X\hat\beta = X\beta_{\mathrm{true}} + X(X^TX)^{-1}X^T\epsilon = X\beta_{\mathrm{true}} + \hat\epsilon$, where $\hat\epsilon = H\epsilon$, with $H = X(X^TX)^{-1}X^T$, is the projection of $\epsilon$ onto the column space of $X$. Since $E[(X^TX)^{-1}X^T\epsilon] = (X^TX)^{-1}X^TE(\epsilon) = 0$ and

$$\mathrm{Var}\big((X^TX)^{-1}X^T\epsilon\big) = (X^TX)^{-1}X^T\,\mathrm{Var}(\epsilon)\,[(X^TX)^{-1}X^T]^T = \sigma^2(X^TX)^{-1},$$

we have $E(\hat\beta) = \beta_{\mathrm{true}}$, $\mathrm{Var}(\hat\beta) = \sigma^2(X^TX)^{-1}$, and $\hat\beta \sim N\big(\beta_{\mathrm{true}}, \sigma^2(X^TX)^{-1}\big)$.

Optimality of least squares (Gauss-Markov, BLUE). We shall show that the least squares estimator is the best linear unbiased estimator (BLUE); this is the Gauss-Markov theorem. Let $\hat\beta_{LS} = (X^TX)^{-1}X^TY$ be the least squares estimator, and let $\hat\beta = A^TY$ be any linear unbiased estimator. Then we must have $E(\hat\beta) = E(A^TY) = A^TE(Y) = A^TX\beta_{\mathrm{true}} = \beta_{\mathrm{true}}$ no matter what $\beta_{\mathrm{true}}$ is, so we must have $A^TX = I_p$, where $I_p$ is the $p\times p$ identity matrix. Then $\mathrm{Var}(\hat\beta) = \mathrm{Var}(A^TY) = A^T\mathrm{Var}(Y)A = \sigma^2A^TA$. According to the Cauchy-Schwarz inequality above, $A^TA \geq A^THA = A^TX(X^TX)^{-1}X^TA = (X^TX)^{-1}$. So $\mathrm{Var}(\hat\beta) \geq \mathrm{Var}(\hat\beta_{LS})$.

We can also be more specific. For any vector $x$,

$$\mathrm{Var}(x^T\hat\beta) = \sigma^2x^TA^TAx \geq \sigma^2x^TA^THAx = \sigma^2x^TA^TX(X^TX)^{-1}X^TAx = \sigma^2x^T(X^TX)^{-1}x = \mathrm{Var}(x^T\hat\beta_{LS}).$$

So the least squares estimator is the best linear unbiased estimator (BLUE). An example of a linear unbiased estimator is obtained by a combination of equations: from the $n$ equations $X\beta \approx Y$, we can combine them into $p$ equations $W^TX\beta = W^TY$, whose solution is $\hat\beta = (W^TX)^{-1}W^TY$, i.e., $A^T = (W^TX)^{-1}W^T$. Clearly $A^TX = I_p$.

Gram-Schmidt orthogonalization. For two vectors $X$ and $Y$, we can project $Y$ onto $X$; according to the simplest regression, the projection is

$$\mathrm{proj}(Y, X) = \hat\beta X = \frac{\langle Y, X\rangle}{|X|^2}X.$$

Consider the vectors $(X_1, \ldots, X_p)$. Let $u_1 = X_1/|X_1|$. Then let $e_2 = X_2 - \mathrm{proj}(X_2, u_1)$ and $u_2 = e_2/|e_2|$. Then let $e_3 = X_3 - \mathrm{proj}(X_3, u_1) - \mathrm{proj}(X_3, u_2)$ and $u_3 = e_3/|e_3|$. Continuing this process,

$$e_j = X_j - \sum_{k=1}^{j-1}\mathrm{proj}(X_j, u_k), \qquad u_j = e_j/|e_j|.$$

This changes $(X_1, \ldots, X_p)$ into orthonormal vectors $(u_1, \ldots, u_p)$. We can also add vectors $X_{p+1}, \ldots, X_n$ and continue the process to get $n$ orthonormal vectors $(u_1, \ldots, u_p, u_{p+1}, \ldots, u_n)$.

Error vector in the new basis. The vectors $(u_1, \ldots, u_n)$ form a new basis. For the random vector $\epsilon$, we can expand it in this basis as $\epsilon = \delta_1u_1 + \cdots + \delta_nu_n$, where the coordinates of $\epsilon$ are $(\delta_1, \ldots, \delta_n)$ with $\delta_j = \langle\epsilon, u_j\rangle$. Let $\delta = (\delta_1, \ldots, \delta_n)^T$ and let $U = (u_1, \ldots, u_n)$ be the $n\times n$ orthogonal matrix, so that $\epsilon = U\delta$, $\delta = U^T\epsilon$, and $U^TU = UU^T = I_n$. Since $\epsilon \sim N(0, \sigma^2I_n)$, with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2I_n$, we have $E(\delta) = E(U^T\epsilon) = U^TE(\epsilon) = 0$ and $\mathrm{Var}(\delta) = \mathrm{Var}(U^T\epsilon) = U^T\mathrm{Var}(\epsilon)U = \sigma^2U^TU = \sigma^2I_n$. Thus $\delta \sim N(0, \sigma^2I_n)$; that is, $\delta_j \sim N(0, \sigma^2)$ independently.
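A sketch of Gram-Schmidt orthogonalization and of the coordinates $\delta = U^T\epsilon$ (NumPy assumed; extra random vectors are appended to the columns of X only to complete the basis, as described above):

```python
import numpy as np

def gram_schmidt(cols):
    """Orthonormalize a list of linearly independent column vectors."""
    basis = []
    for v in cols:
        e = v - sum(np.dot(v, u) * u for u in basis)   # subtract projections on earlier u's
        basis.append(e / np.linalg.norm(e))
    return np.column_stack(basis)

rng = np.random.default_rng(9)
n, p = 8, 3
X = rng.normal(size=(n, p))
extra = rng.normal(size=(n, n - p))                    # complete to a full basis of R^n
U = gram_schmidt(list(np.column_stack([X, extra]).T))

print(np.allclose(U.T @ U, np.eye(n)))                 # U is orthogonal: U^T U = I_n

eps = rng.normal(0, 1.0, n)
delta = U.T @ eps                                      # coordinates of eps in the new basis
print(np.allclose(U @ delta, eps))                     # eps = U delta
```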

Residual sum of squares. Since $Y = X\beta_{\mathrm{true}} + \epsilon$ and $\hat Y = X\beta_{\mathrm{true}} + \hat\epsilon$, we have $e = Y - \hat Y = \epsilon - \hat\epsilon$. Here $\epsilon = \sum_{j=1}^n \delta_ju_j$ and $\hat\epsilon = \sum_{j=1}^p \delta_ju_j$, so $e = \sum_{j=p+1}^n \delta_ju_j$, and therefore

$$|e|^2 = \sum_{j=p+1}^n \delta_j^2 \sim \sigma^2\chi^2_{n-p}.$$

Residual standard error. We can estimate $\sigma^2$ by

$$s^2 = \frac{|e|^2}{n-p} = \frac{|Y - X\hat\beta|^2}{n-p}.$$

So $E(s^2) = \sigma^2$.

Z statistic. $\hat\beta \sim N(\beta_{\mathrm{true}}, V)$ with $V = \sigma^2(X^TX)^{-1}$. Then marginally $\hat\beta_j \sim N(\beta_{\mathrm{true},j}, V_{jj})$. Under the null hypothesis $H_0: \beta_{\mathrm{true},j} = 0$, we have $\hat\beta_j \sim N(0, V_{jj})$, so $Z = \hat\beta_j/\sqrt{V_{jj}} \sim N(0, 1)$ under $H_0$. So if $\hat\beta_j$ is small relative to $\sqrt{V_{jj}}$, we conclude that $\beta_{\mathrm{true},j} = 0$, i.e., the variable $X_j$ has nothing to do with $Y$; in that case $\hat\beta_j$ only captures the noise in $\epsilon$, since there is no signal. If $H_1$ is true, i.e., $\beta_{\mathrm{true},j} \neq 0$ and $Y$ depends on $X_j$, then $Z = \hat\beta_j/\sqrt{V_{jj}}$ does not follow $N(0, 1)$, since $E(Z) = \beta_{\mathrm{true},j}/\sqrt{V_{jj}}$.

F statistic. Consider testing $H_0: \beta_{\mathrm{true},j} = 0$ for $j = d+1, \ldots, p$ (the reduced model with the first $d$ predictors suffices) against $H_1$ that the full model with all $p$ predictors is needed. Let $e_0$ be the residual vector from fitting the reduced model and $e_1$ the residual vector from fitting the full model. Under $H_0$, the F-statistic

$$F = \frac{(|e_0|^2 - |e_1|^2)/(p-d)}{|e_1|^2/(n-p)} = \frac{\sum_{j=d+1}^p \delta_j^2/(p-d)}{\sum_{j=p+1}^n \delta_j^2/(n-p)} \sim F_{p-d,\,n-p}.$$

That is, if $H_0$ is true, fitting the $H_1$ model only overfits the noise: the improvement in the training error, $|e_0|^2 - |e_1|^2$, is due only to the noise terms $\sum_{j=d+1}^p \delta_j^2$, which is not much.

If $H_1$ is true, then (writing the calculation in terms of the orthonormalized predictors, i.e., taking $X_j = u_j$, so that fitting the reduced model gives $\hat\beta_j = \beta_{\mathrm{true},j} + \delta_j$ for $j \leq d$)

$$e_0 = Y - X\hat\beta = \sum_{j=1}^p \beta_{\mathrm{true},j}X_j + \epsilon - \sum_{j=1}^d (\beta_{\mathrm{true},j} + \delta_j)X_j = \sum_{j=d+1}^p \beta_{\mathrm{true},j}X_j + \sum_{j=d+1}^n \delta_ju_j.$$

So $|e_0|^2 = \sum_{j=d+1}^p \beta_{\mathrm{true},j}^2 + \sum_{j=d+1}^n \delta_j^2$, while $e_1 = \sum_{j=p+1}^n \delta_ju_j$ as before. So the improvement from fitting the $H_1$ model over the $H_0$ model is $|e_0|^2 - |e_1|^2 = \sum_{j=d+1}^p \beta_{\mathrm{true},j}^2 + \sum_{j=d+1}^p \delta_j^2$, which can be quite large since $\beta_{\mathrm{true},j} \neq 0$ for $j = d+1, \ldots, p$ under $H_1$. So under $H_1$, the F-statistic is

$$F = \frac{(|e_0|^2 - |e_1|^2)/(p-d)}{|e_1|^2/(n-p)} = \frac{\sum_{j=d+1}^p (\beta_{\mathrm{true},j}^2 + \delta_j^2)/(p-d)}{\sum_{j=p+1}^n \delta_j^2/(n-p)}.$$
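A simulation sketch of this F-test (NumPy and SciPy assumed, simulated data; the nested models are fit by least squares and the statistic is compared with the $F_{p-d,\,n-p}$ distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
n, p, d, sigma = 50, 5, 2, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # H_0 true: last p - d coefficients are zero
Y = X @ beta_true + rng.normal(0, sigma, n)

def rss(Xmat, y):
    """Residual sum of squares from a least squares fit."""
    beta_hat, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    return np.sum((y - Xmat @ beta_hat) ** 2)

rss0 = rss(X[:, :d], Y)    # reduced model with the first d predictors
rss1 = rss(X, Y)           # full model with all p predictors

F = ((rss0 - rss1) / (p - d)) / (rss1 / (n - p))
p_value = stats.f.sf(F, p - d, n - p)    # upper tail probability under F_{p-d, n-p}
print(F, p_value)
```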

Training error and testing error. Suppose the true model is $Y = \sum_{j=1}^p \beta_{\mathrm{true},j}X_j + \epsilon$, and suppose we fit the model $Y = \sum_{j=1}^d \beta_jX_j + \epsilon$ for $d \leq p$. As derived above, the training residual vector is

$$e = \sum_{j=d+1}^p \beta_{\mathrm{true},j}X_j + \sum_{j=d+1}^n \delta_ju_j.$$

The training error is $|e|^2 = \sum_{j=d+1}^p \beta_{\mathrm{true},j}^2 + \sum_{j=d+1}^n \delta_j^2$, so $E(|e|^2) = \sum_{j=d+1}^p \beta_{\mathrm{true},j}^2 + (n-d)\sigma^2$.

Now consider generating a new data set $\tilde Y = \tilde X\beta_{\mathrm{true}} + \tilde\epsilon$, where $\tilde\epsilon$ has the same distribution as $\epsilon$ but is independent of it. Suppose we predict $\tilde Y$ by $\tilde X\hat\beta$, where $\hat\beta$ is obtained from the training data $(X, Y)$. Then the testing residual vector is

$$\tilde e = \tilde Y - \sum_{j=1}^d \hat\beta_j\tilde X_j = \sum_{j=1}^p \beta_{\mathrm{true},j}\tilde X_j + \tilde\epsilon - \sum_{j=1}^d (\beta_{\mathrm{true},j} + \delta_j)\tilde X_j = \sum_{j=d+1}^p \beta_{\mathrm{true},j}\tilde X_j + \tilde\epsilon - \sum_{j=1}^d \delta_j\tilde X_j.$$

For simplicity, assume $\tilde X = X$. Then the testing residual vector is

$$\tilde e = \sum_{j=d+1}^p \beta_{\mathrm{true},j}X_j + \sum_{j=1}^d (\tilde\delta_j - \delta_j)X_j + \sum_{j=d+1}^n \tilde\delta_ju_j.$$

Thus the expected testing error is $E[|\tilde e|^2] = \sum_{j=d+1}^p \beta_{\mathrm{true},j}^2 + (n+d)\sigma^2$. The key observation is that $\tilde\delta_j$ from the testing noise $\tilde\epsilon$ is not the same as $\delta_j$ from the training noise $\epsilon$ for $j = 1, \ldots, d$.
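A simulation sketch of this train/test gap (NumPy assumed, simulated data; an orthonormal design is used so the setting matches the derivation, and the gap of roughly $2d\sigma^2$ shows up in the averages):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, sigma, reps = 40, 5, 1.0, 5000
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthonormal columns; the first p are the true predictors
X = Q[:, :p]
beta_true = np.array([2.0, -1.5, 1.0, 0.5, -0.5])

d = 5            # number of fitted predictors; try d < p (underfit) or d > p (spurious predictors)
Xd = Q[:, :d]

train_err, test_err = np.empty(reps), np.empty(reps)
for r in range(reps):
    Y     = X @ beta_true + rng.normal(0, sigma, n)   # training data
    Y_new = X @ beta_true + rng.normal(0, sigma, n)   # independent test data, same design
    beta_hat, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
    train_err[r] = np.sum((Y     - Xd @ beta_hat) ** 2)
    test_err[r]  = np.sum((Y_new - Xd @ beta_hat) ** 2)

# Expected training error is about (n - d) sigma^2 and testing error about (n + d) sigma^2
# (plus the squared missed coefficients when d < p); the gap is roughly 2 d sigma^2.
print(train_err.mean(), test_err.mean(), test_err.mean() - train_err.mean())
```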

If $d > p$, i.e., if we add spurious variables whose true coefficients are all zero, then $E(|e|^2) = (n-d)\sigma^2$ and $E(|\tilde e|^2) = (n+d)\sigma^2$. So if we increase the number of variables $d$, the training error $E(|e|^2)$ will keep decreasing, but the testing error $E(|\tilde e|^2)$ will decrease at first and then increase, especially after $d > p$. The difference between the expected testing error and the expected training error is $2d\sigma^2$. So if the model is too simple, it may miss important signals; but if the model is too complex, it may overfit the noise.

Regularization by penalized least squares. If there are many predictors, we can avoid overfitting by hypothesis testing, i.e., accept a larger model only when there is strong evidence, such as from the F-test. We can also use regularization to avoid overfitting. Let $R(\beta) = \sum_{i=1}^n (y_i - x_i^T\beta)^2$. Ridge regression minimizes $R(\beta) + \lambda\sum_{j=1}^p \beta_j^2$; Lasso regression minimizes $R(\beta) + \lambda\sum_{j=1}^p |\beta_j|$.

Supervised learning and classification. The linear model can also serve as a building block for classification, where $y_i$ is binary, i.e., $y_i \in \{0, 1\}$ or $y_i \in \{+, -\}$. The perceptron model is $\hat y_i = \mathrm{sign}(x_i^T\beta)$, where $\mathrm{sign}(s) = +$ (or 1) if $s \geq 0$ and $\mathrm{sign}(s) = -$ (or 0) if $s < 0$. We can estimate $\beta$ by minimizing a loss function $\sum_{i=1}^n L(y_i, x_i^T\beta)$, with possible regularization such as that in ridge or Lasso regression. One particular example is the support vector machine, which seeks to maximize the margin of the separating hyperplane. Since the $y_i$ are given in the training stage, we call this supervised learning.

Logistic regression. Logistic regression is for a binary response: $P(y_i = 1) = p_i = \exp(x_i^T\beta)/(1 + \exp(x_i^T\beta))$. Here $\beta$ is usually estimated by maximum likelihood, which is a generalization of least squares.

Unsupervised learning. In the Netflix recommendation example, we model the rating of user $u$ on item $i$ as $r_{ui} = \sum_{k=1}^d p_{uk}a_{ik} + \epsilon_{ui}$, where $p_{uk}$ is the preference of user $u$ in the $k$-th aspect and $a_{ik}$ is the appeal of item $i$ in the $k$-th aspect. The vectors $p_u = (p_{u1}, \ldots, p_{ud})^T$ and $a_i = (a_{i1}, \ldots, a_{id})^T$ can be estimated by iterated least squares, which seeks to minimize

$$\sum_{u,i}(r_{ui} - \langle p_u, a_i\rangle)^2 + \lambda_1\sum_u |p_u|^2 + \lambda_2\sum_i |a_i|^2.$$

We can then predict the unobserved $r_{ui}$, which can be used for making recommendations. This is called unsupervised learning because we do not need to know beforehand what the $d$ aspects are, nor are $p_u$ or $a_i$ given in training.
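A minimal sketch of this iterated (alternating) least squares idea (NumPy assumed; for simplicity every rating is treated as observed, and the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(12)
n_users, n_items, d = 30, 20, 3
lam1 = lam2 = 0.1

# Simulated "true" preferences and appeals, and the resulting rating matrix.
P_true = rng.normal(size=(n_users, d))
A_true = rng.normal(size=(n_items, d))
R = P_true @ A_true.T + rng.normal(0, 0.1, size=(n_users, n_items))

# Alternating ridge-penalized least squares: fix A and solve for each p_u, then fix P and solve for each a_i.
P = rng.normal(size=(n_users, d))
A = rng.normal(size=(n_items, d))
for it in range(50):
    for u in range(n_users):
        P[u] = np.linalg.solve(A.T @ A + lam1 * np.eye(d), A.T @ R[u])
    for i in range(n_items):
        A[i] = np.linalg.solve(P.T @ P + lam2 * np.eye(d), P.T @ R[:, i])

pred = P @ A.T
print(np.mean((R - pred) ** 2))   # small reconstruction error after the alternating updates
```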
