Category Archives: Interactive Visualization

R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical sides, however, I briefly mentioned the estimation procedure for the parameter $boldsymbol{beta}$. So to help us understand how software does the estimation procedure, we’ll look at the mathematics behind it. We will also perform the estimation manually in R and in Python, that means we’re not going to use any special packages, this will help us appreciate the theory.

Linear Least Squares

Consider the linear regression model, [ y_i=f_i(mathbf{x}|boldsymbol{beta})+varepsilon_i,quadmathbf{x}_i=left[ begin{array}{cccc} 1&x_{11}&cdots&x_{1p} end{array}right],quadboldsymbol{beta}=left[begin{array}{c}beta_0\beta_1\vdots\beta_pend{array}right], ] where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,cdots, N$. The $f_i(mathbf{x}|boldsymbol{beta})$ is the deterministic part of the model that depends on both the parameters $boldsymbol{beta}inmathbb{R}^{p+1}$ and the predictor variable $mathbf{x}_i$, which in matrix form, say $mathbf{X}$, is represented as follows [ mathbf{X}=left[ begin{array}{cccccc} 1&x_{11}&cdots&x_{1p}\ 1&x_{21}&cdots&x_{2p}\ vdots&vdots&ddots&vdots\ 1&x_{N1}&cdots&x_{Np}\ end{array} right]. ] $varepsilon_i$ is the error term at the $i$th case which we assumed to be Gaussian distributed with mean 0 and variance $sigma^2$. So that [ mathbb{E}y_i=f_i(mathbf{x}|boldsymbol{beta}), ] i.e. $f_i(mathbf{x}|boldsymbol{beta})$ is the expectation function. The uncertainty around the response variable is also modelled by Gaussian distribution. Specifically, if $Y=f(mathbf{x}|boldsymbol{beta})+varepsilon$ and $yin Y$ such that $y>0$, then begin{align*} mathbb{P}[Yleq y]&=mathbb{P}[f(x|beta)+varepsilonleq y]\ &=mathbb{P}[varepsilonleq y-f(mathbf{x}|boldsymbol{beta})]=mathbb{P}left[frac{varepsilon}{sigma}leq frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]\ &=Phileft[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right], end{align*} where $Phi$ denotes the Gaussian distribution with density denoted by $phi$ below. Hence $Ysimmathcal{N}(f(mathbf{x}|boldsymbol{beta}),sigma^2)$. That is, begin{align*} frac{operatorname{d}}{operatorname{d}y}Phileft[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]&=phileft[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]frac{1}{sigma}=mathbb{P}[y|f(mathbf{x}|boldsymbol{beta}),sigma^2]\ &=frac{1}{sqrt{2pi}sigma}expleft{-frac{1}{2}left[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]^2right}. end{align*} If the data are independent and identically distributed, then the log-likelihood function of $y$ is, begin{align*} mathcal{L}[boldsymbol{beta}|mathbf{y},mathbf{X},sigma]&=mathbb{P}[mathbf{y}|mathbf{X},boldsymbol{beta},sigma]=prod_{i=1}^Nfrac{1}{sqrt{2pi}sigma}expleft{-frac{1}{2}left[frac{y_i-f_i(mathbf{x}|boldsymbol{beta})}{sigma}right]^2right}\ &=frac{1}{(2pi)^{frac{n}{2}}sigma^n}expleft{-frac{1}{2}sum_{i=1}^Nleft[frac{y_i-f_i(mathbf{x}|boldsymbol{beta})}{sigma}right]^2right}\ logmathcal{L}[boldsymbol{beta}|mathbf{y},mathbf{X},sigma]&=-frac{n}{2}log2pi-nlogsigma-frac{1}{2sigma^2}sum_{i=1}^Nleft[y_i-f_i(mathbf{x}|boldsymbol{beta})right]^2. end{align*} And because the likelihood function tells us about the plausibility of the parameter $boldsymbol{beta}$ in explaining the sample data. We therefore want to find the best estimate of $boldsymbol{beta}$ that likely generated the sample. Thus our goal is to maximize the likelihood function which is equivalent to maximizing the log-likelihood with respect to $boldsymbol{beta}$. And that’s simply done by taking the partial derivative with respect to the parameter $boldsymbol{beta}$. Therefore, the first two terms in the right hand side of the equation above can be disregarded since it does not depend on $boldsymbol{beta}$. Also, the location of the maximum log-likelihood with respect to $boldsymbol{beta}$ is not affected by arbitrary positive scalar multiplication, so the factor $frac{1}{2sigma^2}$ can be omitted. And we are left with the following equation, begin{equation}label{eq:1} -sum_{i=1}^Nleft[y_i-f_i(mathbf{x}|boldsymbol{beta})right]^2. end{equation} One last thing is that, instead of maximizing the log-likelihood function we can do minimization on the negative log-likelihood. Hence we are interested on minimizing the negative of Equation (ref{eq:1}) which is begin{equation}label{eq:2} sum_{i=1}^Nleft[y_i-f_i(mathbf{x}|boldsymbol{beta})right]^2, end{equation} popularly known as the residual sum of squares (RSS). So RSS is a consequence of maximum log-likelihood under the Gaussian assumption of the uncertainty around the response variable $y$. For models with two parameters, say $beta_0$ and $beta_1$ the RSS can be visualized like the one in my previous article, that is

Error Surface

Performing differentiation under $(p+1)$-dimensional parameter $boldsymbol{beta}$ is manageable in the context of linear algebra, so Equation (ref{eq:2}) is equivalent to begin{align*} lVertmathbf{y}-mathbf{X}boldsymbol{beta}rVert^2&=langlemathbf{y}-mathbf{X}boldsymbol{beta},mathbf{y}-mathbf{X}boldsymbol{beta}rangle=mathbf{y}^{text{T}}mathbf{y}-mathbf{y}^{text{T}}mathbf{X}boldsymbol{beta}-(mathbf{X}boldsymbol{beta})^{text{T}}mathbf{y}+(mathbf{X}boldsymbol{beta})^{text{T}}mathbf{X}boldsymbol{beta}\ &=mathbf{y}^{text{T}}mathbf{y}-mathbf{y}^{text{T}}mathbf{X}boldsymbol{beta}-boldsymbol{beta}^{text{T}}mathbf{X}^{text{T}}mathbf{y}+boldsymbol{beta}^{text{T}}mathbf{X}^{text{T}}mathbf{X}boldsymbol{beta} end{align*} And the derivative with respect to the parameter is begin{align*} frac{operatorname{partial}}{operatorname{partial}boldsymbol{beta}}lVertmathbf{y}-mathbf{X}boldsymbol{beta}rVert^2&=-2mathbf{X}^{text{T}}mathbf{y}+2mathbf{X}^{text{T}}mathbf{X}boldsymbol{beta} end{align*} Taking the critical point by setting the above equation to zero vector, we have begin{align} frac{operatorname{partial}}{operatorname{partial}boldsymbol{beta}}lVertmathbf{y}-mathbf{X}hat{boldsymbol{beta}}rVert^2&overset{text{set}}{=}mathbf{0}nonumber\ -mathbf{X}^{text{T}}mathbf{y}+mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{0}nonumber\ mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{X}^{text{T}}mathbf{y}label{eq:norm} end{align} Equation (ref{eq:norm}) is called the normal equation. If $mathbf{X}$ is full rank, then we can compute the inverse of $mathbf{X}^{text{T}}mathbf{X}$, begin{align} mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{X}^{text{T}}mathbf{y}nonumber\ (mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=(mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{y}nonumber\ hat{boldsymbol{beta}}&=(mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{y}.label{eq:betahat} end{align} That’s it, since both $mathbf{X}$ and $mathbf{y}$ are known.


If $mathbf{X}$ is full rank and spans the subspace $Vsubseteqmathbb{R}^N$, where $mathbb{E}mathbf{y}=mathbf{X}boldsymbol{beta}in V$. Then the predicted values of $mathbf{y}$ is given by, begin{equation}label{eq:pred} hat{mathbf{y}}=mathbb{E}mathbf{y}=mathbf{P}_{V}mathbf{y}=mathbf{X}(mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{y}, end{equation} where $mathbf{P}$ is the projection matrix onto the space $V$. For proof of the projection matrix in Equation (ref{eq:pred}) please refer to reference (1) below. Notice that this is equivalent to begin{equation}label{eq:yhbh} hat{mathbf{y}}=mathbb{E}mathbf{y}=mathbf{X}hat{boldsymbol{beta}}. end{equation}


Let’s fire up R and Python and see how we can apply those equations we derived. For purpose of illustration, we’re going to simulate data from Gaussian distributed population. To do so, consider the following codes

R ScriptPython ScriptHere we have two predictors x1 and x2, and our response variable y is generated by the parameters $beta_1=3.5$ and $beta_2=2.8$, and it has Gaussian noise with variance 7. While we set the same random seeds for both R and Python, we should not expect the random values generated in both languages to be identical, instead both values are independent and identically distributed (iid). For visualization, I will use Python Plotly, you can also translate it to R Plotly.

Now let’s estimate the parameter $boldsymbol{beta}$ which by default we set to $beta_1=3.5$ and $beta_2=2.8$. We will use Equation (ref{eq:betahat}) for estimation. So that we have

R ScriptPython ScriptThat’s a good estimate, and again just a reminder, the estimate in R and in Python are different because we have different random samples, the important thing is that both are iid. To proceed, we’ll do prediction using Equations (ref{eq:pred}). That is,

R ScriptPython ScriptThe first column above is the data y and the second column is the prediction due to Equation (ref{eq:pred}). Thus if we are to expand the prediction into an expectation plane, then we have

You have to rotate the plot by the way to see the plane, I still can’t figure out how to change it in Plotly. Anyway, at this point we can proceed computing for other statistics like the variance of the error, and so on. But I will leave it for you to explore. Our aim here is just to give us an understanding on what is happening inside the internals of our software when we try to estimate the parameters of the linear regression models.


  1. Arnold, Steven F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley.
  2. OLS in Matrix Form

R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model, $$ y_i=f_i(boldsymbol{x}|boldsymbol{beta})+varepsilon_i, $$ where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,cdots, N$ and the predictor or the independent variable is the $boldsymbol{x}$ term defined in the mean function $f_i(boldsymbol{x}|boldsymbol{beta})$. For simplicity, consider the following simple linear regression (SLR) model, $$ y_i=beta_0+beta_1x_i+varepsilon_i. $$ To obtain the (best) estimate of $beta_0$ and $beta_1$, we solve for the least residual sum of squares (RSS) given by, $$ S=sum_{i=1}^{n}varepsilon_i^2=sum_{i=1}^{n}(y_i-beta_0-beta_1x_i)^2. $$ Now suppose we want to fit the model to the following data, Average Heights and Weights for American Women, where weight is the response and height is the predictor. The data is available in R by default.

The following is the plot of the residual sum of squares of the data base on the SLR model over $beta_0$ and $beta_1$, note that we standardized the variables first before plotting it,

Error Surface

If you are interested on the codes of the above figure, please click here. To minimize this elliptic paraboloid, differentiation has to be done with respect to the parameters, and then equate this to zero to obtain the stationary point, and finally solve for $beta_0$ and $beta_1$. For more on derivation of the estimates of the parameters see reference 1.

Simple Linear Regression in R

In R, we can fit the model using the lm function, which stands for linear models, i.e.

Formula, defined above as {response ~ predictor}, is a handy method for fitting model to the data in R. Mathematically, our model is $$ weight = beta_0 + beta_1 (height) + varepsilon. $$ The summary of it is obtain by running model %>% summary or for non-magrittr user summary(model), given the model object defined in the previous code,

The Coefficients section above returns the estimated coefficients of the model, and these are $beta_0 = -87.51667$ and $beta_1=3.45000$ (it should be clear that we used the unstandardized variables for obtaining these estimates). The estimates are both significant base on the p-value under .05 and even in .01 level of the test. Using the estimated coefficients along with the residual standard error we can now construct the fitted line and it’s confidence interval as shown below.

Fig 1. Plot of the Data and the Predicted Values in R.

Simple Linear Regression in Python

In Python, there are two modules that have implementation of linear regression modelling, one is in scikit-learn (sklearn) and the other is in Statsmodels (statsmodels). For example we can model the above data using sklearn as follows:

Above output is the estimate of the parameters, to obtain the predicted values and plot these along with the data points like what we did in R, we can wrapped the functions above into a class called linear_regression say, that requires Seaborn package for neat plotting, see the codes below:

Using this class and its methods, fitting the model to the data is coded as follows:

The predicted values of the data points is obtain using the predict method,

And Figure 2 below shows the plot of the predicted values along with its confidence interval and data points.

Fig 2. Plot of the Data and the Predicted Values in Python.

If one is only interested on the estimates of the model, then LinearRegression of scikit-learn is sufficient, but if the interest on other statistics such as that returned in R model summary is necessary, the said module can also do the job but might need to program other necessary routine. statsmodels, on the other hand, returns complete summary of the fitted model as compared to the R output above, which is useful for studies with particular interest on these information. So that modelling the data using simple linear regression is done as follows:

Clearly, we could spare time with statsmodels, especially in diagnostic checking involving test statistics such as Durbin-Watson and Jarque-Bera tests. We can of course add some plotting for diagnostic, but I prefer to discuss that on a separate entry.

Simple Linear Regression in SAS

Now let’s consider running the data in SAS, I am using SAS Studio and in order to import the data, I saved it as a CSV file first with columns height and weight. Uploaded it to SAS Studio, in which follows are the codes below to import the data.

Next we fit the model to the data using the REG procedure,

Number of Observations Read 15
Number of Observations Used 15
Analysis of Variance
Source DF Sum of
F Value Pr > F
Model 1 3332.70000 3332.70000 1433.02 <.0001
Error 13 30.23333 2.32564    
Corrected Total 14 3362.93333      
Root MSE 1.52501 R-Square 0.9910
Dependent Mean 136.73333 Adj R-Sq 0.9903
Coeff Var 1.11531    
Parameter Estimates
Variable DF Parameter
t Value Pr > |t|
Intercept 1 -87.51667 5.93694 -14.74 <.0001
height 1 3.45000 0.09114 37.86 <.0001

Now that’s a lot of output, probably the complete one. But like I said, I am not going to discuss each of these values and plots as some of it are used for diagnostic checking (you can read more on that in reference 1, and in other applied linear regression books). For now, let’s just confirm the coefficients obtained — both the estimates are the same with that in R and Python.

Multiple Linear Regression (MLR)

To extend SLR to MLR, we’ll demonstrate this by simulation. Using the formula-based lm function of R, assuming we have $x_1$ and $x_2$ as our predictors, then following is how we do MLR in R:

Although we did not use intercept in simulating the data, but the obtained estimates for $beta_1$ and $beta_2$ are close to the true parameters (.35 and .56). The intercept, however, will help us capture the noise term we added in simulation.

Next we’ll try MLR in Python using statsmodels, consider the following:

It should be noted that, the estimates in R and in Python should not (necessarily) be the same since these are simulated values from different software. Finally, we can perform MLR in SAS as follows:

Number of Observations Read 100
Number of Observations Used 100
Analysis of Variance
Source DF Sum of
F Value Pr > F
Model 2 610.86535 305.43268 303.88 <.0001
Error 97 97.49521 1.00511    
Corrected Total 99 708.36056      
Root MSE 1.00255 R-Square 0.8624
Dependent Mean 244.07327 Adj R-Sq 0.8595
Coeff Var 0.41076    
Parameter Estimates
Variable DF Parameter
t Value Pr > |t|
Intercept 1 18.01299 11.10116 1.62 0.1079
X1 1 0.31770 0.01818 17.47 <.0001
X2 1 0.58276 0.03358 17.35 <.0001


In conclusion, SAS saves a lot of work, since it returns complete summary of the model, no doubt why companies prefer to use this, besides from their active customer support. R and Python, on the other hand, despite the fact that it is open-source, it can well compete with the former, although it requires programming skills to achieved all of the SAS outputs, but I think that’s the exciting part of it — it makes you think, and manage time. The achievement in R and Python is of course fulfilling. Hope you’ve learned something, feel free to share your thoughts on the comment below.


  1. Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley & Sons, Inc. United States of America.
  2. Scikit-learn Documentation
  3. Statsmodels Documentation
  4. SAS Documentation
  5. Delwiche, Lora D., and Susan J. Slaughter. 2012. The Little SAS® Book: A Primer, Fifth Edition. Cary, NC: SAS Institute Inc.
  6. Regression with SAS. Institute for Digital Research and Education. UCLA. Retrieved August 13, 2015.
  7. Python Plotly Documentation

Parametric Inference: The Power Function of the Test

In Statistics, we model random phenomenon and make conclusions about its population. For example, in an experiment of determining the true heights of the students in the university. Suppose we take sample from the population of the students, and consider testing the null hypothesis that the average height is 5.4 ft against an alternative hypothesis that the average height is greater than 5.4 ft. Mathematically, we can represent this as $H_0:theta=theta_0$ vs $H_1:theta>theta_0$, where $theta$ is the true value of the parameter, and $theta_0=5.4$ is the testing value set by the experimenter. And because we only consider subset (the sample) of the population for testing the hypotheses, then we expect for errors we commit. To understand these errors, consider if the above test results into rejecting $H_0$ given that $thetainTheta_0$, where $Theta_0$ is the parameter space of the null hypothesis, in other words we mistakenly reject $H_0$, then in this case we committed a Type I error. Another is, if the above test results into accepting $H_0$ given that $thetainTheta_0^c$, where $Theta_0^c$ is the parameter space of the alternative hypothesis, then we committed a Type II error. To summarize this consider the following table,

Truth Decision
Table 1: Two Types of Errors in Hypothesis Testing.
Accept $H_0$ Reject $H_0$
$H_0$ Correct Decision Type I Error
$H_1$ Type II Error Correct Decision

Let’s formally define the power function, from Casella and Berger (2001), see reference 1.

Definition 1. The power function of a hypothesis test with rejection region $R$ is the function of $theta$ defined by $beta(theta)=mathrm{P}_{theta}(mathbf{X}in R)$.

To relate the definition to the above problem, if $R$ is the rejection region of $H_0$. Then we make mistake if the sample observed, $mathbf{x}$, $mathbf{x}in R$ given that $thetainTheta_0$. That is, $beta(theta)=mathrm{P}_{theta}(mathbf{X}in R)$ is the probability of Type I error. Let’s consider an example, one that is popularly used in testing the sample mean. The example below is the combined problem of Example 8.3.3 and Exercise 8.37 (a) of reference 1.

Example 1. Let $X_1,cdots, X_noverset{r.s.}{sim}N(mu,sigma^2)$ — normal population where $sigma^2$ is known. Consider testing $H_0:thetaleq theta_0$ vs $H_1:theta> theta_0$, obtain the likelihood ratio test (LRT) statistic and its power function.

Solution:The LRT statistic is given by $$ lambda(mathbf{x})=frac{displaystylesup_{thetaleqtheta_0}L(theta|mathbf{x})}{displaystylesup_{-inftyentry, $lambda(mathbf{x})$ is rejected if it is small, such that $lambda(mathbf{x})leq c$ for some $cin[0,1]$. Hence, $$ begin{aligned} lambda(mathbf{x})&=expleft[-frac{n(bar{x}-theta_0)^2}{2sigma^2}right]sqrt{-2log c}. end{aligned} $$ So that $H_0$ is rejected if $frac{bar{x}-theta_0}{sigma/sqrt{n}}> c’$ for some $c’=sqrt{-2log c}in[0,infty)$. Now the power function of the test, is the probability of rejecting the null hypothesis given that it is true, or the probability of the Type I error given by, $$ begin{aligned} beta(theta)&=mathrm{P}left[frac{bar{x}-theta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}-theta+theta-theta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}-theta}{sigma/sqrt{n}}+frac{theta-theta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}-theta}{sigma/sqrt{n}}> c’-frac{theta-theta_0}{sigma/sqrt{n}}right]\ &=1-mathrm{P}left[frac{bar{x}-theta}{sigma/sqrt{n}}leq c’+frac{theta_0-theta}{sigma/sqrt{n}}right]\ &=1-Phileft[c’+frac{theta_0-theta}{sigma/sqrt{n}}right]. end{aligned} $$ To illustrate this, consider $theta_0=5.4,sigma = 1,n=30$ and $c’=1.645$. Then the plot of the power function as a function of $theta$ is,

Power Function

Since $beta$ is an increasing function with unit range, then $$ alpha = sup_{thetaleqtheta_0}beta(theta)=beta(theta_0)=1-Phi(c’). $$ So that using values we set for the above graph, $alpha=0.049985approx 0.05$, $alpha$ here is called the size of the test since it is the supremum of the power function over $thetaleqtheta_0$, see reference 1 for level of the test. Now let’s investigate the power function above, the probability of committing Type I error, $beta(theta), forall thetaleq theta_0$, is acceptably small. However, the probability of committing Type II error, $1-beta(theta), forall theta > theta_0$, is too high as we can see in the following plot,

Type II Error

Therefore, it’s better to investigate the error structure when considering the power of the test. From Casella and Berger (2001), the ideal power function is 0 $forallthetainTheta_0$ and 1 $forallthetainTheta_0^c$. Except in trivial situations, this ideal cannot be attained. Qualitatively, a good test has power function near 1 for most $thetainTheta_0^c$ and $thetainTheta_0$. Implying, one that has steeper power curve.

Now an interesting fact about power function is that it depends on the sample size $n$. Suppose in our experiment above we want the Type I error to be 0.05 and the Type II error to be 0.1 if $thetageq theta_0+sigma/2$. Since the power function is increasing, then we have $$ beta(theta_0)=0.05Rightarrow c’=1.645quadtext{and}quad 1 – beta(theta_0+sigma/2)=0.1Rightarrowbeta(theta_0+sigma/2)=0.9. $$ Where $$ begin{aligned} beta(theta_0+sigma/2)&=1-Phileft[c’ +frac{theta_0-sigma/2-theta_0}{sigma/sqrt{n}}right]\ &=1-Phileft[c’ – frac{sqrt{n}}{2}right]\ 0.9&=1-Phileft[1.645 – frac{sqrt{n}}{2}right]\ 0.1&=Phileft[1.645 – frac{sqrt{n}}{2}right].\ end{aligned} $$ Hence, $n$ is chosen such that it solves the above equation. That is, $$ begin{aligned} 1.645 – frac{sqrt{n}}{2}&=-1.28155,quadtext{since }Phi(-1.28155)=0.1\ frac{3.29 – sqrt{n}}{2}&=-1.28155\ 3.29 – sqrt{n}&=-2.5631\ n&=(3.29+2.5631)^2=34.25878,;text{take }n=35. end{aligned} $$ For purpose of illustration, we’ll consider the non-rounded value of $n$. Below is the plot of this,

Power Function with Sample Size

And for different values of $n$, consider the following power functions

Effect of Sample Size on Power Function

From the above plot, the larger the sample size, $n$, the steeper the curve implying a better error structure. To see this, try hovering over the lines in the plot, and you’ll witness a fast departure for values of large $n$ on the unit range, this characteristics contribute to the sensitivity of the test.

Plot’s Python Codes

In case you want to reproduce the above plots, click here for the source code.


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.
  2. Plotly Python Library Documentation

Parametric Inference: Likelihood Ratio Test by Example

Hypothesis testing have been extensively used on different discipline of science. And in this post, I will attempt on discussing the basic theory behind this, the Likelihood Ratio Test (LRT) defined below from Casella and Berger (2001), see reference 1.

Definition. The likelihood ratio test statistic for testing $H_0:thetainTheta_0$ versus $H_1:thetainTheta_0^c$ is begin{equation} label{eq:lrt} lambda(mathbf{x})=frac{displaystylesup_{thetainTheta_0}L(theta|mathbf{x})}{displaystylesup_{thetainTheta}L(theta|mathbf{x})}. end{equation} A likelihood ratio test (LRT) is any test that has a rejection region of the form ${mathbf{x}:lambda(mathbf{x})leq c}$, where $c$ is any number satisfying $0leq c leq 1$.

The numerator of equation (ref{eq:lrt}) gives us the supremum probability of the parameter, $theta$, over the restricted domain (null hypothesis, $Theta_0$) of the parameter space $Theta$, that maximizes the joint probability of the sample, $mathbf{x}$. While the denominator of the LRT gives us the supremum probability of the parameter, $theta$, over the unrestricted domain, $Theta$, that maximizes the joint probability of the sample, $mathbf{x}$. Therefore, if the value of $lambda(mathbf{x})$ is small such that $lambda(mathbf{x})leq c$, for some $cin [0, 1]$, then the true value of the parameter that is plausible in explaining the sample is likely to be in the alternative hypothesis, $Theta_0^c$.

Example 1. Let $X_1,X_2,cdots,X_noverset{r.s.}{sim}f(x|theta)=frac{1}{theta}expleft[-frac{x}{theta}right],x>0,theta>0$. From this sample, consider testing $H_0:theta = theta_0$ vs $H_1:theta

The parameter space $Theta$ is the set $(0,Theta_0]$, where $Theta_0={theta_0}$. Hence, using the likelihood ratio test, we have $$ lambda(mathbf{x})=frac{displaystylesup_{theta=theta_0}L(theta|mathbf{x})}{displaystylesup_{thetaleqtheta_0}L(theta|mathbf{x})}, $$ where, $$ begin{aligned} sup_{theta=theta_0}L(theta|mathbf{x})&=sup_{theta=theta_0}prod_{i=1}^{n}frac{1}{theta}expleft[-frac{x_i}{theta}right]\ &=sup_{theta=theta_0}left(frac{1}{theta}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta}right]\ &=left(frac{1}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right], end{aligned} $$ and $$ begin{aligned} sup_{thetaleqtheta_0}L(theta|mathbf{x})&=sup_{thetaleqtheta_0}prod_{i=1}^{n}frac{1}{theta}expleft[-frac{x_i}{theta}right]\ &=sup_{thetaleqtheta_0}left(frac{1}{theta}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta}right]=sup_{thetaleqtheta_0}f(mathbf{x}|theta). end{aligned} $$ Now the supremum of $f(mathbf{x}|theta)$ over all values of $thetaleqtheta_0$ is the MLE (maximum likelihood estimator) of $f(x|theta)$, which is $bar{x}$, provided that $bar{x}leq theta_0$.

So that, $$ begin{aligned} lambda(mathbf{x})&=frac{left(frac{1}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]} {left(frac{1}{bar{x}}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{bar{x}}right]},quadtext{provided that};bar{x}leq theta_0\ &=left(frac{bar{x}}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]exp[n]. end{aligned} $$ And we say that, if $lambda(mathbf{x})leq c$, $H_0$ is rejected. That is, $$ begin{aligned} left(frac{bar{x}}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]exp[n]&leq c\ left(frac{bar{x}}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]&leq c’,quadtext{where};c’=frac{c}{exp[n]}\ nlogleft(frac{bar{x}}{theta_0}right)-frac{n}{theta_0}bar{x}&leq log c’\ logleft(frac{bar{x}}{theta_0}right)-frac{bar{x}}{theta_0}&leq frac{1}{n}log c’\ logleft(frac{bar{x}}{theta_0}right)-frac{bar{x}}{theta_0}&leq frac{1}{n}log c-1. end{aligned} $$ Now let $h(x)=log x – x$, then $h'(x)=frac{1}{x}-1$. So the critical point of $h'(x)$ is $x=1$. And to test if this is maximum or minimum, we apply second derivative test. That is, $$ h”(x)=-frac{1}{x^2} LRT and its Critical Value

Python and R: Basic Sampling Problem

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.


  1. Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.

    $y_1 = 98$ $y_2 = 102$ $y_3=154$
    $y_4 = 133$ $y_5 = 190$ $y_6=175$

    We are interested in $bar{y}_U$, the population mean. Consider eight possible samples chosen.

    Sample No. Sample, $mathcal{S}$ $P(mathcal{S})$
    1 ${1,3,5}$ $1/8$
    2 ${1,3,6}$ $1/8$
    3 ${1,4,5}$ $1/8$
    4 ${1,4,6}$ $1/8$
    5 ${2,3,5}$ $1/8$
    6 ${2,3,6}$ $1/8$
    7 ${2,4,5}$ $1/8$
    8 ${2,4,6}$ $1/8$

    1. What is the value of $bar{y}_U$?
    2. Let $bar{y}$ be the mean of the sample values. For each sampling plan, find
      1. $mathrm{E}bar{y}$;
      2. $mathrm{Var}bar{y}$;
      3. $mathrm{Bias}(bar{y})$;
      4. $mathrm{MSE}(bar{y})$;
  2. Mayr et al. (1994) took an SRS of 240 children who visisted their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:

    Age (months) 9 10 11 12 13 14 15 16 17 18 19 20
    Number of Children 13 35 44 69 36 24 7 3 2 5 1 1

    Find the mean and SE of the age for onset of free walking.
  3. Table 1 gives the cultivated area in acres in 1981 for 40 villages in a region. (Theory and Method of Survey) Using the arrangement (random) of data in the table, draw systematic sample of size 8. Use r ((random start) = 2,

    Village $Y_j$ Village $Y_j$ Village $Y_j$ Village $Y_j$
    1 105 11 319 21 70 31 16
    2 625 12 72 22 249 32 439
    3 47 13 109 23 384 33 123
    4 312 14 91 24 482 34 207
    5 327 15 152 25 378 35 145
    6 230 16 189 26 111 36 666
    7 240 17 365 27 534 37 338
    8 203 18 70 28 306 38 624
    9 535 9 249 29 655 39 501
    10 275 20 384 30 102 40 962


In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.

    1. The value of $bar{y}_U$ is coded as follows:

      Python Code R Code

    2. To obtain the sample using the sample index given in the table in the above question, we do a combination of population index of three elements, ${6choose 3}$, first. Where the first two combinations are the samples, ${1,2,3}$ and ${1,2,4}$, and so on. Then from this list of all possible combinations of three elements, we draw those that are listed in the above table as our samples, with first sample index ${1,3,5}$, having population units, ${98, 154, 190}$. So that the following is the code of this sampling design:

      Python Code R Code

      1. Now to obtain the expected value of the average of the sample data, we compute it using $mathrm{E}bar{y}=sum_{k}bar{y}_kmathrm{P}(bar{y}_k)=sum_{k}bar{y_k}mathrm{P}(mathcal{S}_k)$, $forall kin{1,cdots,8}$. So for $k = 1$, $$ begin{aligned} bar{y}_1mathrm{P}(mathcal{S}_1)&=frac{98+154+190}{3}mathrm{P}(mathcal{S}_1)\ &=frac{98+154+190}{3}left(frac{1}{8}right)=18.41667. end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $mathrm{E}bar{y}$. So that the following is the equivalent of this:

        Python Code R Code From the above code, the output tells us that $mathrm{E}bar{y}=140$.

      2. Next is to compute for the variance of $bar{y}$, which is $mathrm{Var}bar{y}=mathrm{E}bar{y}^{2}-(mathrm{E}bar{y})^2$. So we need a function for $mathrm{E}bar{y}^2$, where the first term of this, $k=1$, is $bar{y}_1^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2(frac{1}{8})=2713.3889$. Applying this to other terms and summing them up, we have following code:

        Python Code R Code So that using the above output, 20182.94, and subtracting $(mathrm{E}bar{y})^2$ to it, will give us the variance. And hence the succeeding code:

        Python Code: R Code: So the variance of the $bar{y}$ is $18.9444$.

    3. The $mathrm{Bias}$ is just the difference between the estimate and the true value. And since the estimate is unbiased ($mathrm{E}bar{y}=142$), so $mathrm{Bias}=142-142=0$.
    4. $mathrm{MSE}=mathrm{Var}bar{y}-(mathrm{Bias}bar{y})^2$, and since the $mathrm{Bias}bar{y}=0$. So $mathrm{MSE}=mathrm{Var}bar{y}$.
  1. First we need to obtain the probability of each Age, that is by dividing the Number of Children with the total sum of it. That is why, we have p_s function defined below. After obtaining the probabilities, we can then compute the expected value using the expectation function we defined earlier.

    Python Code R Code It should be clear in the data that the average age is about 12 months old, where the plot of it is shown below, For the code of the above plot please click here. Next is to compute the standard error which is just the square root of the variance of the sample,

    Python Code R Code So the standard variability of the Age is 1.920824.

  2. Let me give you a brief discussion on the systematic sampling to help you understand the code. The idea in systematic sampling is that, given the population units numbered from 1 to $N$, we compute for the sampling interval, given by $k = frac{N}{n}$, where $n$ is the number of units needed for the sample. After that, we choose for the random start, number between $1$ and $k$. This random start will be the first sample, and then the second unit in the sample is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling namely, Linear and Circular Systematic Samplings. Circular systematic sampling treats the population units numbered from $1$ to $N$ in circular form, so that if the increment step is more than the number of $N$ units, say $N+2$, the sample unit is the $2^{nd}$ element in the population, and so on. The code that I will be sharing can be used both for linear and circular, but for this particular problem only. Since there are rules in linear that are not satisfied in the function, one of which is if $k$ is not a whole number, despite that, however, you can always extend it to a more general function.

    Python Code R Code You may notice in the output above, that the index returned in Python is not the same with the index returned in R. This is because Python index starts with 0, while that in R starts with 1. So that’s why we have the same population units sampled between the two language despite the differences between the index returned.


  1. Lohr, Sharon (2009). Sampling: Design and Analysis. Cengage Learning.