Category Archives: Python

Photo op with python

For a fee, some visitors to the Philippine Eagle Center in Malagos, Davao City pose with a python being handled by locals outside the facility on Saturday, December 2, 2017. MindaNews photo by H. MARCOS C. MORDENO

R and Python: Theory of Linear Least Squares

In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical sides, however, I briefly mentioned the estimation procedure for the parameter $boldsymbol{beta}$. So to help us understand how software does the estimation procedure, we’ll look at the mathematics behind it. We will also perform the estimation manually in R and in Python, that means we’re not going to use any special packages, this will help us appreciate the theory.

Linear Least Squares

Consider the linear regression model, [ y_i=f_i(mathbf{x}|boldsymbol{beta})+varepsilon_i,quadmathbf{x}_i=left[ begin{array}{cccc} 1&x_{11}&cdots&x_{1p} end{array}right],quadboldsymbol{beta}=left[begin{array}{c}beta_0\beta_1\vdots\beta_pend{array}right], ] where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,cdots, N$. The $f_i(mathbf{x}|boldsymbol{beta})$ is the deterministic part of the model that depends on both the parameters $boldsymbol{beta}inmathbb{R}^{p+1}$ and the predictor variable $mathbf{x}_i$, which in matrix form, say $mathbf{X}$, is represented as follows [ mathbf{X}=left[ begin{array}{cccccc} 1&x_{11}&cdots&x_{1p}\ 1&x_{21}&cdots&x_{2p}\ vdots&vdots&ddots&vdots\ 1&x_{N1}&cdots&x_{Np}\ end{array} right]. ] $varepsilon_i$ is the error term at the $i$th case which we assumed to be Gaussian distributed with mean 0 and variance $sigma^2$. So that [ mathbb{E}y_i=f_i(mathbf{x}|boldsymbol{beta}), ] i.e. $f_i(mathbf{x}|boldsymbol{beta})$ is the expectation function. The uncertainty around the response variable is also modelled by Gaussian distribution. Specifically, if $Y=f(mathbf{x}|boldsymbol{beta})+varepsilon$ and $yin Y$ such that $y>0$, then begin{align*} mathbb{P}[Yleq y]&=mathbb{P}[f(x|beta)+varepsilonleq y]\ &=mathbb{P}[varepsilonleq y-f(mathbf{x}|boldsymbol{beta})]=mathbb{P}left[frac{varepsilon}{sigma}leq frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]\ &=Phileft[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right], end{align*} where $Phi$ denotes the Gaussian distribution with density denoted by $phi$ below. Hence $Ysimmathcal{N}(f(mathbf{x}|boldsymbol{beta}),sigma^2)$. That is, begin{align*} frac{operatorname{d}}{operatorname{d}y}Phileft[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]&=phileft[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]frac{1}{sigma}=mathbb{P}[y|f(mathbf{x}|boldsymbol{beta}),sigma^2]\ &=frac{1}{sqrt{2pi}sigma}expleft{-frac{1}{2}left[frac{y-f(mathbf{x}|boldsymbol{beta})}{sigma}right]^2right}. end{align*} If the data are independent and identically distributed, then the log-likelihood function of $y$ is, begin{align*} mathcal{L}[boldsymbol{beta}|mathbf{y},mathbf{X},sigma]&=mathbb{P}[mathbf{y}|mathbf{X},boldsymbol{beta},sigma]=prod_{i=1}^Nfrac{1}{sqrt{2pi}sigma}expleft{-frac{1}{2}left[frac{y_i-f_i(mathbf{x}|boldsymbol{beta})}{sigma}right]^2right}\ &=frac{1}{(2pi)^{frac{n}{2}}sigma^n}expleft{-frac{1}{2}sum_{i=1}^Nleft[frac{y_i-f_i(mathbf{x}|boldsymbol{beta})}{sigma}right]^2right}\ logmathcal{L}[boldsymbol{beta}|mathbf{y},mathbf{X},sigma]&=-frac{n}{2}log2pi-nlogsigma-frac{1}{2sigma^2}sum_{i=1}^Nleft[y_i-f_i(mathbf{x}|boldsymbol{beta})right]^2. end{align*} And because the likelihood function tells us about the plausibility of the parameter $boldsymbol{beta}$ in explaining the sample data. We therefore want to find the best estimate of $boldsymbol{beta}$ that likely generated the sample. Thus our goal is to maximize the likelihood function which is equivalent to maximizing the log-likelihood with respect to $boldsymbol{beta}$. And that’s simply done by taking the partial derivative with respect to the parameter $boldsymbol{beta}$. Therefore, the first two terms in the right hand side of the equation above can be disregarded since it does not depend on $boldsymbol{beta}$. Also, the location of the maximum log-likelihood with respect to $boldsymbol{beta}$ is not affected by arbitrary positive scalar multiplication, so the factor $frac{1}{2sigma^2}$ can be omitted. And we are left with the following equation, begin{equation}label{eq:1} -sum_{i=1}^Nleft[y_i-f_i(mathbf{x}|boldsymbol{beta})right]^2. end{equation} One last thing is that, instead of maximizing the log-likelihood function we can do minimization on the negative log-likelihood. Hence we are interested on minimizing the negative of Equation (ref{eq:1}) which is begin{equation}label{eq:2} sum_{i=1}^Nleft[y_i-f_i(mathbf{x}|boldsymbol{beta})right]^2, end{equation} popularly known as the residual sum of squares (RSS). So RSS is a consequence of maximum log-likelihood under the Gaussian assumption of the uncertainty around the response variable $y$. For models with two parameters, say $beta_0$ and $beta_1$ the RSS can be visualized like the one in my previous article, that is

Error Surface

Performing differentiation under $(p+1)$-dimensional parameter $boldsymbol{beta}$ is manageable in the context of linear algebra, so Equation (ref{eq:2}) is equivalent to begin{align*} lVertmathbf{y}-mathbf{X}boldsymbol{beta}rVert^2&=langlemathbf{y}-mathbf{X}boldsymbol{beta},mathbf{y}-mathbf{X}boldsymbol{beta}rangle=mathbf{y}^{text{T}}mathbf{y}-mathbf{y}^{text{T}}mathbf{X}boldsymbol{beta}-(mathbf{X}boldsymbol{beta})^{text{T}}mathbf{y}+(mathbf{X}boldsymbol{beta})^{text{T}}mathbf{X}boldsymbol{beta}\ &=mathbf{y}^{text{T}}mathbf{y}-mathbf{y}^{text{T}}mathbf{X}boldsymbol{beta}-boldsymbol{beta}^{text{T}}mathbf{X}^{text{T}}mathbf{y}+boldsymbol{beta}^{text{T}}mathbf{X}^{text{T}}mathbf{X}boldsymbol{beta} end{align*} And the derivative with respect to the parameter is begin{align*} frac{operatorname{partial}}{operatorname{partial}boldsymbol{beta}}lVertmathbf{y}-mathbf{X}boldsymbol{beta}rVert^2&=-2mathbf{X}^{text{T}}mathbf{y}+2mathbf{X}^{text{T}}mathbf{X}boldsymbol{beta} end{align*} Taking the critical point by setting the above equation to zero vector, we have begin{align} frac{operatorname{partial}}{operatorname{partial}boldsymbol{beta}}lVertmathbf{y}-mathbf{X}hat{boldsymbol{beta}}rVert^2&overset{text{set}}{=}mathbf{0}nonumber\ -mathbf{X}^{text{T}}mathbf{y}+mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{0}nonumber\ mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{X}^{text{T}}mathbf{y}label{eq:norm} end{align} Equation (ref{eq:norm}) is called the normal equation. If $mathbf{X}$ is full rank, then we can compute the inverse of $mathbf{X}^{text{T}}mathbf{X}$, begin{align} mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{X}^{text{T}}mathbf{y}nonumber\ (mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=(mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{y}nonumber\ hat{boldsymbol{beta}}&=(mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{y}.label{eq:betahat} end{align} That’s it, since both $mathbf{X}$ and $mathbf{y}$ are known.


If $mathbf{X}$ is full rank and spans the subspace $Vsubseteqmathbb{R}^N$, where $mathbb{E}mathbf{y}=mathbf{X}boldsymbol{beta}in V$. Then the predicted values of $mathbf{y}$ is given by, begin{equation}label{eq:pred} hat{mathbf{y}}=mathbb{E}mathbf{y}=mathbf{P}_{V}mathbf{y}=mathbf{X}(mathbf{X}^{text{T}}mathbf{X})^{-1}mathbf{X}^{text{T}}mathbf{y}, end{equation} where $mathbf{P}$ is the projection matrix onto the space $V$. For proof of the projection matrix in Equation (ref{eq:pred}) please refer to reference (1) below. Notice that this is equivalent to begin{equation}label{eq:yhbh} hat{mathbf{y}}=mathbb{E}mathbf{y}=mathbf{X}hat{boldsymbol{beta}}. end{equation}


Let’s fire up R and Python and see how we can apply those equations we derived. For purpose of illustration, we’re going to simulate data from Gaussian distributed population. To do so, consider the following codes

R ScriptPython ScriptHere we have two predictors x1 and x2, and our response variable y is generated by the parameters $beta_1=3.5$ and $beta_2=2.8$, and it has Gaussian noise with variance 7. While we set the same random seeds for both R and Python, we should not expect the random values generated in both languages to be identical, instead both values are independent and identically distributed (iid). For visualization, I will use Python Plotly, you can also translate it to R Plotly.

Now let’s estimate the parameter $boldsymbol{beta}$ which by default we set to $beta_1=3.5$ and $beta_2=2.8$. We will use Equation (ref{eq:betahat}) for estimation. So that we have

R ScriptPython ScriptThat’s a good estimate, and again just a reminder, the estimate in R and in Python are different because we have different random samples, the important thing is that both are iid. To proceed, we’ll do prediction using Equations (ref{eq:pred}). That is,

R ScriptPython ScriptThe first column above is the data y and the second column is the prediction due to Equation (ref{eq:pred}). Thus if we are to expand the prediction into an expectation plane, then we have

You have to rotate the plot by the way to see the plane, I still can’t figure out how to change it in Plotly. Anyway, at this point we can proceed computing for other statistics like the variance of the error, and so on. But I will leave it for you to explore. Our aim here is just to give us an understanding on what is happening inside the internals of our software when we try to estimate the parameters of the linear regression models.


  1. Arnold, Steven F. (1981). The Theory of Linear Models and Multivariate Analysis. Wiley.
  2. OLS in Matrix Form

R, Python, and SAS: Getting Started with Linear Regression

Consider the linear regression model, $$ y_i=f_i(boldsymbol{x}|boldsymbol{beta})+varepsilon_i, $$ where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,cdots, N$ and the predictor or the independent variable is the $boldsymbol{x}$ term defined in the mean function $f_i(boldsymbol{x}|boldsymbol{beta})$. For simplicity, consider the following simple linear regression (SLR) model, $$ y_i=beta_0+beta_1x_i+varepsilon_i. $$ To obtain the (best) estimate of $beta_0$ and $beta_1$, we solve for the least residual sum of squares (RSS) given by, $$ S=sum_{i=1}^{n}varepsilon_i^2=sum_{i=1}^{n}(y_i-beta_0-beta_1x_i)^2. $$ Now suppose we want to fit the model to the following data, Average Heights and Weights for American Women, where weight is the response and height is the predictor. The data is available in R by default.

The following is the plot of the residual sum of squares of the data base on the SLR model over $beta_0$ and $beta_1$, note that we standardized the variables first before plotting it,

Error Surface

If you are interested on the codes of the above figure, please click here. To minimize this elliptic paraboloid, differentiation has to be done with respect to the parameters, and then equate this to zero to obtain the stationary point, and finally solve for $beta_0$ and $beta_1$. For more on derivation of the estimates of the parameters see reference 1.

Simple Linear Regression in R

In R, we can fit the model using the lm function, which stands for linear models, i.e.

Formula, defined above as {response ~ predictor}, is a handy method for fitting model to the data in R. Mathematically, our model is $$ weight = beta_0 + beta_1 (height) + varepsilon. $$ The summary of it is obtain by running model %>% summary or for non-magrittr user summary(model), given the model object defined in the previous code,

The Coefficients section above returns the estimated coefficients of the model, and these are $beta_0 = -87.51667$ and $beta_1=3.45000$ (it should be clear that we used the unstandardized variables for obtaining these estimates). The estimates are both significant base on the p-value under .05 and even in .01 level of the test. Using the estimated coefficients along with the residual standard error we can now construct the fitted line and it’s confidence interval as shown below.

Fig 1. Plot of the Data and the Predicted Values in R.

Simple Linear Regression in Python

In Python, there are two modules that have implementation of linear regression modelling, one is in scikit-learn (sklearn) and the other is in Statsmodels (statsmodels). For example we can model the above data using sklearn as follows:

Above output is the estimate of the parameters, to obtain the predicted values and plot these along with the data points like what we did in R, we can wrapped the functions above into a class called linear_regression say, that requires Seaborn package for neat plotting, see the codes below:

Using this class and its methods, fitting the model to the data is coded as follows:

The predicted values of the data points is obtain using the predict method,

And Figure 2 below shows the plot of the predicted values along with its confidence interval and data points.

Fig 2. Plot of the Data and the Predicted Values in Python.

If one is only interested on the estimates of the model, then LinearRegression of scikit-learn is sufficient, but if the interest on other statistics such as that returned in R model summary is necessary, the said module can also do the job but might need to program other necessary routine. statsmodels, on the other hand, returns complete summary of the fitted model as compared to the R output above, which is useful for studies with particular interest on these information. So that modelling the data using simple linear regression is done as follows:

Clearly, we could spare time with statsmodels, especially in diagnostic checking involving test statistics such as Durbin-Watson and Jarque-Bera tests. We can of course add some plotting for diagnostic, but I prefer to discuss that on a separate entry.

Simple Linear Regression in SAS

Now let’s consider running the data in SAS, I am using SAS Studio and in order to import the data, I saved it as a CSV file first with columns height and weight. Uploaded it to SAS Studio, in which follows are the codes below to import the data.

Next we fit the model to the data using the REG procedure,

Number of Observations Read 15
Number of Observations Used 15
Analysis of Variance
Source DF Sum of
F Value Pr > F
Model 1 3332.70000 3332.70000 1433.02 <.0001
Error 13 30.23333 2.32564    
Corrected Total 14 3362.93333      
Root MSE 1.52501 R-Square 0.9910
Dependent Mean 136.73333 Adj R-Sq 0.9903
Coeff Var 1.11531    
Parameter Estimates
Variable DF Parameter
t Value Pr > |t|
Intercept 1 -87.51667 5.93694 -14.74 <.0001
height 1 3.45000 0.09114 37.86 <.0001

Now that’s a lot of output, probably the complete one. But like I said, I am not going to discuss each of these values and plots as some of it are used for diagnostic checking (you can read more on that in reference 1, and in other applied linear regression books). For now, let’s just confirm the coefficients obtained — both the estimates are the same with that in R and Python.

Multiple Linear Regression (MLR)

To extend SLR to MLR, we’ll demonstrate this by simulation. Using the formula-based lm function of R, assuming we have $x_1$ and $x_2$ as our predictors, then following is how we do MLR in R:

Although we did not use intercept in simulating the data, but the obtained estimates for $beta_1$ and $beta_2$ are close to the true parameters (.35 and .56). The intercept, however, will help us capture the noise term we added in simulation.

Next we’ll try MLR in Python using statsmodels, consider the following:

It should be noted that, the estimates in R and in Python should not (necessarily) be the same since these are simulated values from different software. Finally, we can perform MLR in SAS as follows:

Number of Observations Read 100
Number of Observations Used 100
Analysis of Variance
Source DF Sum of
F Value Pr > F
Model 2 610.86535 305.43268 303.88 <.0001
Error 97 97.49521 1.00511    
Corrected Total 99 708.36056      
Root MSE 1.00255 R-Square 0.8624
Dependent Mean 244.07327 Adj R-Sq 0.8595
Coeff Var 0.41076    
Parameter Estimates
Variable DF Parameter
t Value Pr > |t|
Intercept 1 18.01299 11.10116 1.62 0.1079
X1 1 0.31770 0.01818 17.47 <.0001
X2 1 0.58276 0.03358 17.35 <.0001


In conclusion, SAS saves a lot of work, since it returns complete summary of the model, no doubt why companies prefer to use this, besides from their active customer support. R and Python, on the other hand, despite the fact that it is open-source, it can well compete with the former, although it requires programming skills to achieved all of the SAS outputs, but I think that’s the exciting part of it — it makes you think, and manage time. The achievement in R and Python is of course fulfilling. Hope you’ve learned something, feel free to share your thoughts on the comment below.


  1. Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley & Sons, Inc. United States of America.
  2. Scikit-learn Documentation
  3. Statsmodels Documentation
  4. SAS Documentation
  5. Delwiche, Lora D., and Susan J. Slaughter. 2012. The Little SAS® Book: A Primer, Fifth Edition. Cary, NC: SAS Institute Inc.
  6. Regression with SAS. Institute for Digital Research and Education. UCLA. Retrieved August 13, 2015.
  7. Python Plotly Documentation

Parametric Inference: The Power Function of the Test

In Statistics, we model random phenomenon and make conclusions about its population. For example, in an experiment of determining the true heights of the students in the university. Suppose we take sample from the population of the students, and consider testing the null hypothesis that the average height is 5.4 ft against an alternative hypothesis that the average height is greater than 5.4 ft. Mathematically, we can represent this as $H_0:theta=theta_0$ vs $H_1:theta>theta_0$, where $theta$ is the true value of the parameter, and $theta_0=5.4$ is the testing value set by the experimenter. And because we only consider subset (the sample) of the population for testing the hypotheses, then we expect for errors we commit. To understand these errors, consider if the above test results into rejecting $H_0$ given that $thetainTheta_0$, where $Theta_0$ is the parameter space of the null hypothesis, in other words we mistakenly reject $H_0$, then in this case we committed a Type I error. Another is, if the above test results into accepting $H_0$ given that $thetainTheta_0^c$, where $Theta_0^c$ is the parameter space of the alternative hypothesis, then we committed a Type II error. To summarize this consider the following table,

Truth Decision
Table 1: Two Types of Errors in Hypothesis Testing.
Accept $H_0$ Reject $H_0$
$H_0$ Correct Decision Type I Error
$H_1$ Type II Error Correct Decision

Let’s formally define the power function, from Casella and Berger (2001), see reference 1.

Definition 1. The power function of a hypothesis test with rejection region $R$ is the function of $theta$ defined by $beta(theta)=mathrm{P}_{theta}(mathbf{X}in R)$.

To relate the definition to the above problem, if $R$ is the rejection region of $H_0$. Then we make mistake if the sample observed, $mathbf{x}$, $mathbf{x}in R$ given that $thetainTheta_0$. That is, $beta(theta)=mathrm{P}_{theta}(mathbf{X}in R)$ is the probability of Type I error. Let’s consider an example, one that is popularly used in testing the sample mean. The example below is the combined problem of Example 8.3.3 and Exercise 8.37 (a) of reference 1.

Example 1. Let $X_1,cdots, X_noverset{r.s.}{sim}N(mu,sigma^2)$ — normal population where $sigma^2$ is known. Consider testing $H_0:thetaleq theta_0$ vs $H_1:theta> theta_0$, obtain the likelihood ratio test (LRT) statistic and its power function.

Solution:The LRT statistic is given by $$ lambda(mathbf{x})=frac{displaystylesup_{thetaleqtheta_0}L(theta|mathbf{x})}{displaystylesup_{-inftyentry, $lambda(mathbf{x})$ is rejected if it is small, such that $lambda(mathbf{x})leq c$ for some $cin[0,1]$. Hence, $$ begin{aligned} lambda(mathbf{x})&=expleft[-frac{n(bar{x}-theta_0)^2}{2sigma^2}right]sqrt{-2log c}. end{aligned} $$ So that $H_0$ is rejected if $frac{bar{x}-theta_0}{sigma/sqrt{n}}> c’$ for some $c’=sqrt{-2log c}in[0,infty)$. Now the power function of the test, is the probability of rejecting the null hypothesis given that it is true, or the probability of the Type I error given by, $$ begin{aligned} beta(theta)&=mathrm{P}left[frac{bar{x}-theta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}-theta+theta-theta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}-theta}{sigma/sqrt{n}}+frac{theta-theta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}-theta}{sigma/sqrt{n}}> c’-frac{theta-theta_0}{sigma/sqrt{n}}right]\ &=1-mathrm{P}left[frac{bar{x}-theta}{sigma/sqrt{n}}leq c’+frac{theta_0-theta}{sigma/sqrt{n}}right]\ &=1-Phileft[c’+frac{theta_0-theta}{sigma/sqrt{n}}right]. end{aligned} $$ To illustrate this, consider $theta_0=5.4,sigma = 1,n=30$ and $c’=1.645$. Then the plot of the power function as a function of $theta$ is,

Power Function

Since $beta$ is an increasing function with unit range, then $$ alpha = sup_{thetaleqtheta_0}beta(theta)=beta(theta_0)=1-Phi(c’). $$ So that using values we set for the above graph, $alpha=0.049985approx 0.05$, $alpha$ here is called the size of the test since it is the supremum of the power function over $thetaleqtheta_0$, see reference 1 for level of the test. Now let’s investigate the power function above, the probability of committing Type I error, $beta(theta), forall thetaleq theta_0$, is acceptably small. However, the probability of committing Type II error, $1-beta(theta), forall theta > theta_0$, is too high as we can see in the following plot,

Type II Error

Therefore, it’s better to investigate the error structure when considering the power of the test. From Casella and Berger (2001), the ideal power function is 0 $forallthetainTheta_0$ and 1 $forallthetainTheta_0^c$. Except in trivial situations, this ideal cannot be attained. Qualitatively, a good test has power function near 1 for most $thetainTheta_0^c$ and $thetainTheta_0$. Implying, one that has steeper power curve.

Now an interesting fact about power function is that it depends on the sample size $n$. Suppose in our experiment above we want the Type I error to be 0.05 and the Type II error to be 0.1 if $thetageq theta_0+sigma/2$. Since the power function is increasing, then we have $$ beta(theta_0)=0.05Rightarrow c’=1.645quadtext{and}quad 1 – beta(theta_0+sigma/2)=0.1Rightarrowbeta(theta_0+sigma/2)=0.9. $$ Where $$ begin{aligned} beta(theta_0+sigma/2)&=1-Phileft[c’ +frac{theta_0-sigma/2-theta_0}{sigma/sqrt{n}}right]\ &=1-Phileft[c’ – frac{sqrt{n}}{2}right]\ 0.9&=1-Phileft[1.645 – frac{sqrt{n}}{2}right]\ 0.1&=Phileft[1.645 – frac{sqrt{n}}{2}right].\ end{aligned} $$ Hence, $n$ is chosen such that it solves the above equation. That is, $$ begin{aligned} 1.645 – frac{sqrt{n}}{2}&=-1.28155,quadtext{since }Phi(-1.28155)=0.1\ frac{3.29 – sqrt{n}}{2}&=-1.28155\ 3.29 – sqrt{n}&=-2.5631\ n&=(3.29+2.5631)^2=34.25878,;text{take }n=35. end{aligned} $$ For purpose of illustration, we’ll consider the non-rounded value of $n$. Below is the plot of this,

Power Function with Sample Size

And for different values of $n$, consider the following power functions

Effect of Sample Size on Power Function

From the above plot, the larger the sample size, $n$, the steeper the curve implying a better error structure. To see this, try hovering over the lines in the plot, and you’ll witness a fast departure for values of large $n$ on the unit range, this characteristics contribute to the sensitivity of the test.

Plot’s Python Codes

In case you want to reproduce the above plots, click here for the source code.


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.
  2. Plotly Python Library Documentation

Parametric Inference: Likelihood Ratio Test by Example

Hypothesis testing have been extensively used on different discipline of science. And in this post, I will attempt on discussing the basic theory behind this, the Likelihood Ratio Test (LRT) defined below from Casella and Berger (2001), see reference 1.

Definition. The likelihood ratio test statistic for testing $H_0:thetainTheta_0$ versus $H_1:thetainTheta_0^c$ is begin{equation} label{eq:lrt} lambda(mathbf{x})=frac{displaystylesup_{thetainTheta_0}L(theta|mathbf{x})}{displaystylesup_{thetainTheta}L(theta|mathbf{x})}. end{equation} A likelihood ratio test (LRT) is any test that has a rejection region of the form ${mathbf{x}:lambda(mathbf{x})leq c}$, where $c$ is any number satisfying $0leq c leq 1$.

The numerator of equation (ref{eq:lrt}) gives us the supremum probability of the parameter, $theta$, over the restricted domain (null hypothesis, $Theta_0$) of the parameter space $Theta$, that maximizes the joint probability of the sample, $mathbf{x}$. While the denominator of the LRT gives us the supremum probability of the parameter, $theta$, over the unrestricted domain, $Theta$, that maximizes the joint probability of the sample, $mathbf{x}$. Therefore, if the value of $lambda(mathbf{x})$ is small such that $lambda(mathbf{x})leq c$, for some $cin [0, 1]$, then the true value of the parameter that is plausible in explaining the sample is likely to be in the alternative hypothesis, $Theta_0^c$.

Example 1. Let $X_1,X_2,cdots,X_noverset{r.s.}{sim}f(x|theta)=frac{1}{theta}expleft[-frac{x}{theta}right],x>0,theta>0$. From this sample, consider testing $H_0:theta = theta_0$ vs $H_1:theta

The parameter space $Theta$ is the set $(0,Theta_0]$, where $Theta_0={theta_0}$. Hence, using the likelihood ratio test, we have $$ lambda(mathbf{x})=frac{displaystylesup_{theta=theta_0}L(theta|mathbf{x})}{displaystylesup_{thetaleqtheta_0}L(theta|mathbf{x})}, $$ where, $$ begin{aligned} sup_{theta=theta_0}L(theta|mathbf{x})&=sup_{theta=theta_0}prod_{i=1}^{n}frac{1}{theta}expleft[-frac{x_i}{theta}right]\ &=sup_{theta=theta_0}left(frac{1}{theta}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta}right]\ &=left(frac{1}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right], end{aligned} $$ and $$ begin{aligned} sup_{thetaleqtheta_0}L(theta|mathbf{x})&=sup_{thetaleqtheta_0}prod_{i=1}^{n}frac{1}{theta}expleft[-frac{x_i}{theta}right]\ &=sup_{thetaleqtheta_0}left(frac{1}{theta}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta}right]=sup_{thetaleqtheta_0}f(mathbf{x}|theta). end{aligned} $$ Now the supremum of $f(mathbf{x}|theta)$ over all values of $thetaleqtheta_0$ is the MLE (maximum likelihood estimator) of $f(x|theta)$, which is $bar{x}$, provided that $bar{x}leq theta_0$.

So that, $$ begin{aligned} lambda(mathbf{x})&=frac{left(frac{1}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]} {left(frac{1}{bar{x}}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{bar{x}}right]},quadtext{provided that};bar{x}leq theta_0\ &=left(frac{bar{x}}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]exp[n]. end{aligned} $$ And we say that, if $lambda(mathbf{x})leq c$, $H_0$ is rejected. That is, $$ begin{aligned} left(frac{bar{x}}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]exp[n]&leq c\ left(frac{bar{x}}{theta_0}right)^nexpleft[-displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]&leq c’,quadtext{where};c’=frac{c}{exp[n]}\ nlogleft(frac{bar{x}}{theta_0}right)-frac{n}{theta_0}bar{x}&leq log c’\ logleft(frac{bar{x}}{theta_0}right)-frac{bar{x}}{theta_0}&leq frac{1}{n}log c’\ logleft(frac{bar{x}}{theta_0}right)-frac{bar{x}}{theta_0}&leq frac{1}{n}log c-1. end{aligned} $$ Now let $h(x)=log x – x$, then $h'(x)=frac{1}{x}-1$. So the critical point of $h'(x)$ is $x=1$. And to test if this is maximum or minimum, we apply second derivative test. That is, $$ h”(x)=-frac{1}{x^2} LRT and its Critical Value

Python and R: Basic Sampling Problem

In this post, I would like to share a simple problem about sampling analysis. And I will demonstrate how to solve this using Python and R. The first two problems are originally from Sampling: Design and Analysis book by Sharon Lohr.


  1. Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.

    $y_1 = 98$ $y_2 = 102$ $y_3=154$
    $y_4 = 133$ $y_5 = 190$ $y_6=175$

    We are interested in $bar{y}_U$, the population mean. Consider eight possible samples chosen.

    Sample No. Sample, $mathcal{S}$ $P(mathcal{S})$
    1 ${1,3,5}$ $1/8$
    2 ${1,3,6}$ $1/8$
    3 ${1,4,5}$ $1/8$
    4 ${1,4,6}$ $1/8$
    5 ${2,3,5}$ $1/8$
    6 ${2,3,6}$ $1/8$
    7 ${2,4,5}$ $1/8$
    8 ${2,4,6}$ $1/8$

    1. What is the value of $bar{y}_U$?
    2. Let $bar{y}$ be the mean of the sample values. For each sampling plan, find
      1. $mathrm{E}bar{y}$;
      2. $mathrm{Var}bar{y}$;
      3. $mathrm{Bias}(bar{y})$;
      4. $mathrm{MSE}(bar{y})$;
  2. Mayr et al. (1994) took an SRS of 240 children who visisted their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:

    Age (months) 9 10 11 12 13 14 15 16 17 18 19 20
    Number of Children 13 35 44 69 36 24 7 3 2 5 1 1

    Find the mean and SE of the age for onset of free walking.
  3. Table 1 gives the cultivated area in acres in 1981 for 40 villages in a region. (Theory and Method of Survey) Using the arrangement (random) of data in the table, draw systematic sample of size 8. Use r ((random start) = 2,

    Village $Y_j$ Village $Y_j$ Village $Y_j$ Village $Y_j$
    1 105 11 319 21 70 31 16
    2 625 12 72 22 249 32 439
    3 47 13 109 23 384 33 123
    4 312 14 91 24 482 34 207
    5 327 15 152 25 378 35 145
    6 230 16 189 26 111 36 666
    7 240 17 365 27 534 37 338
    8 203 18 70 28 306 38 624
    9 535 9 249 29 655 39 501
    10 275 20 384 30 102 40 962


In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.

    1. The value of $bar{y}_U$ is coded as follows:

      Python Code R Code

    2. To obtain the sample using the sample index given in the table in the above question, we do a combination of population index of three elements, ${6choose 3}$, first. Where the first two combinations are the samples, ${1,2,3}$ and ${1,2,4}$, and so on. Then from this list of all possible combinations of three elements, we draw those that are listed in the above table as our samples, with first sample index ${1,3,5}$, having population units, ${98, 154, 190}$. So that the following is the code of this sampling design:

      Python Code R Code

      1. Now to obtain the expected value of the average of the sample data, we compute it using $mathrm{E}bar{y}=sum_{k}bar{y}_kmathrm{P}(bar{y}_k)=sum_{k}bar{y_k}mathrm{P}(mathcal{S}_k)$, $forall kin{1,cdots,8}$. So for $k = 1$, $$ begin{aligned} bar{y}_1mathrm{P}(mathcal{S}_1)&=frac{98+154+190}{3}mathrm{P}(mathcal{S}_1)\ &=frac{98+154+190}{3}left(frac{1}{8}right)=18.41667. end{aligned} $$ Applying this to the remaining $n-1$ $k$s, and summing up the terms gives us the answer to $mathrm{E}bar{y}$. So that the following is the equivalent of this:

        Python Code R Code From the above code, the output tells us that $mathrm{E}bar{y}=140$.

      2. Next is to compute for the variance of $bar{y}$, which is $mathrm{Var}bar{y}=mathrm{E}bar{y}^{2}-(mathrm{E}bar{y})^2$. So we need a function for $mathrm{E}bar{y}^2$, where the first term of this, $k=1$, is $bar{y}_1^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2mathrm{P}(mathcal{S}_1)=left(frac{98+154+190}{3}right)^2(frac{1}{8})=2713.3889$. Applying this to other terms and summing them up, we have following code:

        Python Code R Code So that using the above output, 20182.94, and subtracting $(mathrm{E}bar{y})^2$ to it, will give us the variance. And hence the succeeding code:

        Python Code: R Code: So the variance of the $bar{y}$ is $18.9444$.

    3. The $mathrm{Bias}$ is just the difference between the estimate and the true value. And since the estimate is unbiased ($mathrm{E}bar{y}=142$), so $mathrm{Bias}=142-142=0$.
    4. $mathrm{MSE}=mathrm{Var}bar{y}-(mathrm{Bias}bar{y})^2$, and since the $mathrm{Bias}bar{y}=0$. So $mathrm{MSE}=mathrm{Var}bar{y}$.
  1. First we need to obtain the probability of each Age, that is by dividing the Number of Children with the total sum of it. That is why, we have p_s function defined below. After obtaining the probabilities, we can then compute the expected value using the expectation function we defined earlier.

    Python Code R Code It should be clear in the data that the average age is about 12 months old, where the plot of it is shown below, For the code of the above plot please click here. Next is to compute the standard error which is just the square root of the variance of the sample,

    Python Code R Code So the standard variability of the Age is 1.920824.

  2. Let me give you a brief discussion on the systematic sampling to help you understand the code. The idea in systematic sampling is that, given the population units numbered from 1 to $N$, we compute for the sampling interval, given by $k = frac{N}{n}$, where $n$ is the number of units needed for the sample. After that, we choose for the random start, number between $1$ and $k$. This random start will be the first sample, and then the second unit in the sample is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling namely, Linear and Circular Systematic Samplings. Circular systematic sampling treats the population units numbered from $1$ to $N$ in circular form, so that if the increment step is more than the number of $N$ units, say $N+2$, the sample unit is the $2^{nd}$ element in the population, and so on. The code that I will be sharing can be used both for linear and circular, but for this particular problem only. Since there are rules in linear that are not satisfied in the function, one of which is if $k$ is not a whole number, despite that, however, you can always extend it to a more general function.

    Python Code R Code You may notice in the output above, that the index returned in Python is not the same with the index returned in R. This is because Python index starts with 0, while that in R starts with 1. So that’s why we have the same population units sampled between the two language despite the differences between the index returned.


  1. Lohr, Sharon (2009). Sampling: Design and Analysis. Cengage Learning.

Probability Theory: Convergence in Distribution Problem

Let’s solve some theoretical problem in probability, specifically on convergence. The problem below is originally from Exercise 5.42 of Casella and Berger (2001). And I just want to share my solution on this. If there is an incorrect argument below, I would be happy if you could point that to me.


Let $X_1, X_2,cdots$ be iid (independent and identically distributed) and $X_{(n)}=max_{1leq ileq n}x_i$.

  1. If $x_isim$ beta(1,$beta$), find a value of $nu$ so that $n^{nu}(1-X_{(n)})$ converges in distribution;
  2. If $x_isim$ exponential(1), find a sequence $a_n$ so that $X_{(n)}-a_n$ converges in distribution.


  1. Let $Y_n=n^{nu}(1-X_{(n)})$, we say that $Y_nrightarrow Y$ in distribution. If $$lim_{nrightarrow infty}F_{Y_n}(y)=F_Y(y).$$ Then, $$ begin{aligned} lim_{nrightarrowinfty}F_{Y_n}(y)&=lim_{nrightarrowinfty}P(Y_nleq y)=lim_{nrightarrowinfty}P(n^{nu}(1-X_{(n)})leq y)\ &=lim_{nrightarrowinfty}Pleft(1-X_{(n)}leq frac{y}{n^{nu}}right)\ &=lim_{nrightarrowinfty}Pleft(-X_{(n)}leq frac{y}{n^{nu}}-1right)=lim_{nrightarrowinfty}left[1-Pleft(-X_{(n)}> frac{y}{n^{nu}}-1right)right]\ &=lim_{nrightarrowinfty}left[1-Pleft(max{X_1,X_2,cdots,X_n}And because $x_isim$ beta(1,$beta$), the density is $$ f_{X_1}(x)=begin{cases} beta(1-x)^{beta – 1}&beta>0, 0leq xleq 1\ 0,&mathrm{Otherwise} end{cases} $$ Implies, $$ begin{aligned} lim_{nto infty}P(Y_nleq y)&=lim_{nto infty}left{1-left[int_0^{1-frac{y}{n^{nu}}}beta(1-t)^{beta-1},mathrm{d}tright]^nright}\ &=lim_{nto infty}left{1-left[-int_1^{frac{y}{n^{nu}}}beta u^{beta-1},mathrm{d}uright]^{n}right}\ &=lim_{nto infty}left{1-left[-betafrac{u^{beta}}{beta}bigg|_{u=1}^{u=frac{y}{n^{nu}}}right]^{n}right}\ &=1-lim_{nto infty}left[1-left(frac{y}{n^{nu}}right)^{beta}right]^{n} end{aligned} $$ We can simplify the limit if $nu=frac{1}{beta}$, that is $$ lim_{ntoinfty}P(Y_nleq y)=1-lim_{ntoinfty}left[1-frac{y^{beta}}{n}right]^{n}=1-e^{-y^{beta}} $$ To confirm this in Python, run the following code using the sympy module

    Therefore, if $1-e^{-y^{beta}}$ is a distribution function of $Y$, then $Y_n=n^{nu}(1-X_{(n)})$ converges in distribution to $Y$ for $nu=frac{1}{beta}$.

  2. $$ begin{aligned} P(X_{(n)}-a_{n}leq y) &= P(X_{(n)}leq y + a_n)=P(max{X_1,X_2,cdots,X_n}leq y+a_n)\ &=P(X_1leq y+a_n,X_2leq y+a_n,cdots,X_nleq y+a_n)\ &=P(X_1leq y+a_n)^n,;text{since};x_i’s;text{are iid}\ &=left[int_{-infty}^{y+a_n}f_{X_1}(t),mathrm{d}tright]^n end{aligned} $$ Since $X_isim$ exponential(1), then the density is $$ f_{X_1}=begin{cases} e^{-x},&0leq xleq infty\ 0,&mathrm{otherwise} end{cases} $$ So that, $$ begin{aligned} P(X_{(n)}-a_{n}leq y)&=left[int_{0}^{y+a_n}e^{-t},mathrm{d}tright]=left{-left[e^{-(y+a_n)}-1right]right}^n\ &=left[1-e^{-(y+a_n)}right]^n end{aligned} $$ If we let $Y_n=X_{(n)}-a_n$, then we say that $Y_nrightarrow Y$ in distribution if $$ lim_{ntoinfty}P(Y_nleq y)=P(Yleq y) $$ Therefore, $$ begin{aligned} lim_{ntoinfty}P(Y_nleq y) &= lim_{ntoinfty}P(X_{(n)}-a_nleq y)=lim_{nto infty}left[1-e^{-y-a_n}right]^n\ &=lim_{ntoinfty}left[1-frac{e^{-y}}{e^{a_n}}right]^n end{aligned} $$ We can simplify the limit if $a_n=log n$, that is $$ lim_{ntoinfty}left[1-frac{e^{-y}}{e^{log n}}right]^n=lim_{ntoinfty}left[1-frac{e^{-y}}{n}right]^n=e^{-e^{-y}} $$ Check this in Python by running the following code,

    In conclusion, if $e^{-e^{-y}}$ is a distribution function of Y, then $Y_n=X_{(n)}-a_n$ converges in distribution to $Y$ for sequence $a_n=log n$.


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.

Python: Getting Started with Data Analysis

Analysis with Programming has recently been syndicated to Planet Python. And as a first post being a contributing blog on the said site, I would like to share how to get started with data analysis on Python. Specifically, I would like to do the following:

  1. Importing the data
    • Importing CSV file both locally and from the web;
  2. Data transformation;
  3. Descriptive statistics of the data;
  4. Hypothesis testing
    • One-sample t test;
  5. Visualization; and
  6. Creating custom function.

Importing the data

This is the crucial step, we need to import the data in order to proceed with the succeeding analysis. And often times data are in CSV format, if not, at least can be converted to CSV format. In Python we can do this using the following codes:

To read CSV file locally, we need the pandas module which is a python data analysis library. The read_csv function can read data both locally and from the web.

Data transformation

Now that we have the data in the workspace, next is to do transformation. Statisticians and scientists often do this step to remove unnecessary data not included in the analysis. Let’s view the data first:

To R programmers, above is the equivalent of print(head(df)) which prints the first six rows of the data, and print(tail(df)) — the last six rows of the data, respectively. In Python, however, the number of rows for head of the data by default is 5 unlike in R, which is 6. So that the equivalent of the R code head(df, n = 10) in Python, is df.head(n = 10). Same goes for the tail of the data.

Column and row names of the data are extracted using the colnames and rownames functions in R, respectively. In Python, we extract it using the columns and index attributes. That is,

Transposing the data is obtain using the T method,

Other transformations such as sort can be done using sort attribute. Now let’s extract a specific column. In Python, we do it using either iloc or ix attributes, but ix is more robust and thus I prefer it. Assuming we want the head of the first column of the data, we have

By the way, the indexing in Python starts with 0 and not 1. To slice the index and first three columns of the 11th to 21st rows, run the following

Which is equivalent to print df.ix[10:20, ['Abra', 'Apayao', 'Benguet']]

To drop a column in the data, say columns 1 (Apayao) and 2 (Benguet), use the drop attribute. That is,

axis argument above tells the function to drop with respect to columns, if axis = 0, then the function drops with respect to rows.

Descriptive Statistics

Next step is to do descriptive statistics for preliminary analysis of our data using the describe attribute:

Hypothesis Testing

Python has a great package for statistical inference. And that’s the stats library of scipy. The one sample t-test is implemented in ttest_1samp function. So that, if we want to test the mean of the Abra’s volume of palay production against the null hypothesis with 15000 assumed population mean of the volume of palay production, we have

The values returned are tuple of the following values:

  • t : float or array
  • prob : float or array
        two-tailed p-value

From the above numerical output, we see that the p-value = 0.2627 is greater than $alpha=0.05$, hence there is no sufficient evidence to conclude that the average volume of palay production is not equal to 15000. Applying this test for all variables against the population mean 15000 volume of production, we have

The first array returned is the t-statistic of the data, and the second array is the corresponding p-values.


There are several module for visualization in Python, and the most popular one is the matplotlib library. To mention few, we have bokeh and seaborn modules as well to choose from. In my previous post, I’ve demonstrated the matplotlib package which has the following graphic for box-whisker plot,

Now plotting using pandas module can beautify the above plot into the theme of the popular R plotting package, the ggplot. To use the ggplot theme just add one more line to the above code,

And you’ll have the following,

Even neater than the default matplotlib.pyplot theme. But in this post, I would like to introduce the seaborn module which is a statistical data visualization library. So that, we have the following

Sexy boxplot, scroll down for more.

Creating custom function

To define a custom function in Python, we use the def function. For example, say we define a function that will add two numbers, we do it as follows,

By the way, in Python indentation is important. Use indentation for scope of the function, which in R we do it with braces {...}. Now here’s an algorithm from my previous post,

  1. Generate samples of size 10 from Normal distribution with $mu$ = 3 and $sigma^2$ = 5;
  2. Compute the $bar{x}$ and $bar{x}mp z_{alpha/2}displaystylefrac{sigma}{sqrt{n}}$ using the 95% confidence level;
  3. Repeat the process 100 times; then
  4. Compute the percentage of the confidence intervals containing the true mean.

Coding this in Python we have,

Above code might be easy to read, but it’s slow in replication. Below is the improvement of the above code, thanks to Python gurus, see comments on my previous post.


For those who are interested in the ipython notebook of this article, please click here. This article was converted to ipython notebook by of Nuttens Claude.

Data Source


  1. Pandas, Scipy, and Seaborn Documentations.
  2. Wes McKinney & PyData Development Team (2014). pandas: powerful Python data analysis toolkit.

Probability Theory Problems

Let’s have fun on probability theory, here is my first problem set in the said subject.


  1. It was noted that statisticians who follow the deFinetti school do not accept the Axiom of Countable Additivity, instead adhering to the Axiom of Finite Additivity.
    1. Show that the Axiom of Countable Additivity implies Finite Additivity.
    2. Although, by itself, the Axiom of Finite Additivity does not imply Countable Additivity, suppose we supplement it with the following. Let $A_1supset A_2supsetcdotssupset A_nsupset cdots$ be an infinite sequence of nested sets whose limit is the empty set, which we denote by $A_ndownarrowemptyset$. Consider the following:
      Axiom of Continuity: If $A_ndownarrowemptyset$, then $P(A_n)rightarrow 0$

      Prove that the Axiom of Continuity and the Axiom of Finite Additivity imply Countable Additivity.

  2. Prove each of the following statements. (Assume that any conditioning event has positive probability.)
    1. If $P(B)=1$, then $P(A|B)=P(A)$ for any $A$.
    2. If $Asubset B$, then $P(B|A)=1$ and $P(A|B)=P(A)/P(B)$.
    3. If $A$ and $B$ are mutually exclusive, then begin{equation}nonumber P(A|Acup B) = displaystylefrac{P(A)}{P(A)+P(B)}. end{equation}
    4. $P(Acap Bcap C)=P(A|Bcap C)P(B|C)P(C)$.
  3. Prove that the following functions are cdfs.
    1. $frac{1}{2}+frac{1}{pi}arctan(x), xin (-infty, infty)$
    2. $(1+e^{-x})^{-1},xin (-infty,infty)$
    3. $e^{-e^{-x}}, xin (-infty, infty)$
    4. $1-e^{-x}, xin (0,infty)$
    5. the function defined in (1.5.6), (Check in the reference below.)
  4. A cdf $F_X$ is stochastically greater than a cdf $F_{Y}$ if $F_{X}(t)leq F_{Y}(t)$ for all $t$ and $F_{X}(t) < F_{Y}(t)$ for some $t$. Prove that if $Xsim F_X$ and $Ysim F_Y$, then begin{equation}nonumber P(X>t) geq P(Y>t);text{for every};t end{equation} and begin{equation}nonumber P(X>t)>P(Y>t),;text{for some}; t end{equation} that is, $X$ tends to be bigger than $Y$.
  5. Let $X$ be a continuous random variable with pdf $f(x)$ and cdf $F(x)$. For a fixed number $x_0$, define the function begin{equation}nonumber g(x) = begin{cases} f(x) / [1-F(x_0)]& x geq x_0\ 0 & x < x_0. end{cases} end{equation} Prove that $g(x)$ is a pdf. (Assume that $F(x_0)
  6. For each of the following, determine the value of $c$ that makes $f(x)$ a pdf.
    1. $f(x)=mathrm{c}sin x, 0 < x < pi/2$
    2. $f(x)=mathrm{c}e^{-|x|},-infty < x < infty$


    1. Proof. Let $mathscr{B}$ be a $sigma$-algebra and consider $A_1,A_2,cdotsin mathscr{B}$ are pairwise disjoint, then by countable additivity begin{equation}nonumber Pleft(displaystylebigcup_{i=1}^{infty}A_iright)=displaystylesum_{i=1}^{infty}P(A_i). end{equation} Now, begin{equation} begin{aligned} Pleft(displaystylebigcup_{i=1}^{infty}A_iright)&= Pleft(displaystylebigcup_{i=1}^{n}A_icupdisplaystyle bigcup_{i=n+1}^{infty}A_iright)\ &= Pleft(displaystylebigcup_{i=1}^{n}A_iright)+Pleft(displaystyle bigcup_{i=n+1}^{infty}A_iright),;(text{since};A_i’s;text{are disjoints})\ &=P(A_1)+cdots+P(A_n)+Pleft(displaystyle bigcup_{i=n+1}^{infty}A_iright),\ &quad(text{by finite additivity})\ &=displaystylesum_{i=1}^{n}P(A_i)+Pleft(displaystyle bigcup_{i=n+1}^{infty}A_iright) end{aligned}nonumber end{equation} Notice that for any $n$, we can consider $P(A_i),;i>n$ to be empty. Implying begin{equation}nonumber Pleft(displaystylebigcup_{i=n+1}^{infty}A_iright)=displaystyle sum_{i=n+1}^{infty}P(A_i)=P(emptyset)+P(emptyset)+cdots, end{equation} that is, begin{equation}nonumber begin{aligned} Pleft(displaystylebigcup_{i=1}^{infty}A_iright)&= displaystylesum_{i=1}^{n}P(A_i)+sum_{i=n+1}^{infty}P(A_i)\ &=displaystylesum_{i=1}^{n}P(A_i)+P(emptyset)+P(emptyset)+cdots end{aligned} end{equation} $therefore$ countable additivity implies finite additivity.
    2. From (a), we have shown that countable additivity implies finite additivity, i.e., begin{equation} Pleft(displaystylebigcup_{i=1}^{infty}A_iright)=displaystylesum_{i=1}^{n}P(A_i)+Pleft(displaystyle bigcup_{i=n+1}^{infty}A_iright) nonumber end{equation} If we supplement this with the following condition, that $A_1supset A_2supset A_3supsetcdots$. By Axiom of Continuity, $displaystylelim_{nto infty}A_n=emptyset$, and by Monotone Sequential Continuity, $Pleft(displaystylelim_{ntoinfty}A_nright)= displaystylelim_{ntoinfty}P(A_n)=0$. Now we can write $A_1supset A_2supset A_3supsetcdots$ as begin{equation}nonumber B_k=bigcup_{i=k}^{infty}A_i,;text{such that};B_{k+1}subset B_k, text{implying}; lim_{ktoinfty}B_k=emptyset end{equation} Thus, finite additivity plus axiom of continuity, we have begin{equation}nonumber begin{aligned} Pleft(bigcup_{i=1}^{infty}A_iright)&=lim_{ntoinfty}left( sum_{i=1}^{n}P(A_i)+P(B_{n+1})right)\ &=lim_{ntoinfty}left(sum_{i=1}^{n}P(A_i)right)+lim_{ntoinfty} P(B_{n+1})\ &=sum_{i=1}^{infty}P(A_i)+0,;(text{by axiom of continuity}). end{aligned} end{equation} Implying countable additivity.
    1. Proof. If $P(B)=1$, then $P(S)=P(B)=1$. Because $Asubseteq S$, implies $Asubseteq B$. Thus, $Acap B = A$, and therefore begin{equation}nonumber P(A|B)=displaystylefrac{P(Acap B)}{P(B)}=displaystylefrac{P(A)}{P(B)}=P(A) end{equation} $hspace{12.5cm}blacksquare$
    2. Proof. If $Asubseteq B$ then begin{equation}nonumber P(B|A)=displaystylefrac{P(Acap B)}{P(A)}=displaystylefrac{P(A)}{P(A)}=1 end{equation} and, begin{equation}nonumber P(A|B)=displaystylefrac{P(Acap B)}{P(B)}=displaystylefrac{P(A)}{P(B)} end{equation} $hspace{12.5cm}blacksquare$
    3. Proof. If $A$ and $B$ are mutually exclusive, then begin{equation} nonumber begin{aligned} P(A|Acup B)&=displaystylefrac{P(Acap (Acup B))}{P(Acup B)}\ &=displaystylefrac{P(A)cup [P(Acap B)]}{P(A)+ P(B)}\ &=displaystylefrac{P(A)}{P(A)+ P(B)} end{aligned} end{equation}$hspace{12.5cm}blacksquare$
    4. Proof. Consider, begin{equation}nonumber P(A|Bcap C)=displaystylefrac{P(Acap Bcap C)}{P(Bcap C)} end{equation} Hence, begin{equation}nonumber P(Acap Bcap C) = P(A|Bcap C)P(Bcap C) end{equation} Now $P(Bcap C)=P(B|C)P(C)$, therefore begin{equation}nonumber P(Acap Bcap C) = P(A|Bcap C)P(B|C)P(C) end{equation}$hspace{12.5cm}blacksquare$
  1. $F(x)$ is a cdf if it satisfies the following conditions:
    1. $displaystylelim_{xto-infty}F(x)=0$ and $displaystylelim_{xtoinfty}F(x)=1$
    2. $F(x)$ is nondecreasing.
    3. $F(x)$ is right-continuous.
    1. Proof.
      1. $F(x)=frac{1}{2}+frac{1}{pi}arctan(x), xin (-infty, infty)$

        Above figure was generated by the following $mathrm{LaTeX}$ codes:

        begin{equation}nonumber begin{aligned} displaystylelim_{xto-infty}F(x)&=displaystylelim_{xto-infty} left(frac{1}{2}+frac{1}{pi}arctan(x)right)\ &=frac{1}{2}+frac{1}{pi}displaystylelim_{xto-infty}left(arctan(x)right)\ &=frac{1}{2}+frac{1}{pi} left(frac{-pi}{2}right),;text{since};displaystylelim_{xto-frac{pi}{2}}frac{sin(x)}{cos(x)}=-infty\ &=0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(frac{1}{2}+frac{1}{pi}arctan(x)right)\ &=frac{1}{2}+frac{1}{pi}displaystylelim_{xtoinfty}left(arctan(x)right)\ &=frac{1}{2}+frac{1}{pi} left(frac{pi}{2}right),;text{since};displaystylelim_{xtofrac{pi}{2}}frac{sin(x)}{cos(x)}=infty\ &=1 end{aligned} end{equation}

      2. To test if $F(x)$ is nondecreasing, recall in Calculus that, first differentiation of the function tells us if it is decreasing or increasing. In particular, $frac{dF(x)}{dx}>0$ tells us that the function is increasing in a given interval of $x$. Thus, begin{equation} nonumber frac{dF(x)}{dx}=frac{d}{dx}left(frac{1}{2}+frac{1}{pi}arctan(x)right)=frac{1}{pi(1+x^2)} end{equation} Confirm the above differentiation with Python using sympy module.

        Since $x^2$ is always positive for all $x$, thus $frac{dF(x)}{dx}>0$, implying $F(x)$ is increasing.

      3. $F(x)$ is continuous, implies that $F(x)$ is right-continuous.


    2. Proof.
      1. $ F(x)=displaystylefrac{1}{1+e^{-x}}, xin(-infty,infty) $

        begin{equation}nonumber begin{aligned} displaystylelim_{xto-infty}F(x)&=displaystylelim_{xto-infty} left(frac{1}{1+e^{-x}}right)\ &=0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(frac{1}{1+e^{-x}}right)\ &=displaystylelim_{xtoinfty} left(frac{1}{1+frac{1}{e^{x}}}right)\ &=1 end{aligned} end{equation} Confirm these in Python,

      2. Using the same method we did in (a), we have begin{equation} nonumber begin{aligned} frac{dF(x)}{dx}&=frac{d}{dx}left(displaystylefrac{1}{1+e^{-x}}right)\ &=frac{e^{-x}}{(1+e^{-x})^2} end{aligned} end{equation} $frac{dF(x)}{dx}=frac{e^{-x}}{(1+e^{-x})^2}>0,;forall;xin(-infty,infty)$. Thus the function is increasing in the interval of $x$.
      3. $F(x)$ is continuous, implies the function is right-continuous.


    3. Proof.
      1. $F(x)=e^{-e^{-x}}, xin (-infty, infty)$

        begin{equation}nonumber begin{aligned} displaystylelim_{xto-infty}F(x)&=displaystylelim_{xto-infty} left(e^{-e^{-x}}right)\ &=displaystylelim_{xto-infty} left(frac{1}{e^{frac{1}{e^{x}}}}right)\ &=0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(e^{-e^{-x}}right)\ &=displaystylelim_{xtoinfty} left(frac{1}{e^{frac{1}{e^{x}}}}right)\ &=1 end{aligned} end{equation}

      2. Like what we did above, $frac{dF(x)}{dx}$ is, begin{equation} nonumber frac{dF(x)}{dx}=frac{d}{dx}left(e^{-e^{-x}}right)=e^{-x}e^{-e^{-x}}>0 end{equation} Because $e^{-x}e^{-e^{-x}}>0,;forall; xin(-infty,infty)$. Then we say $F(x)$ is an increasing function in the interval of $x$.
      3. $F(x)$ is continuous, implies that $F(x)$ is right-continuous.


    4. Proof.
      1. $F(x)=1-displaystylefrac{1}{e^{x}}, xin(0,infty)$

        begin{equation}nonumber begin{aligned} displaystylelim_{xto-infty}F(x)&=displaystyle lim_{xto 0}F(x)=1-displaystylelim_{xto 0} left(frac{1}{e^{x}}right) =0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=1- displaystylelim_{xtoinfty} left(frac{1}{e^{x}}right)=1 end{aligned} end{equation}

      2. begin{equation}nonumber frac{dF(x)}{dx}=frac{d}{dx}left(1-frac{1}{e^{x}}right)=0-(-e^{-x})=frac{1}{e^{x}} end{equation} $F(x)$ is an increasing function since $frac{1}{e^{x}}>0,;forall;xin(0,infty)$.
      3. $F(x)$ is right-continuous, since it is continuous.


    5. Proof. The function in Equation (1.5.6) is given by, begin{equation} F_Y(y)=begin{cases} displaystylefrac{1-varepsilon}{1+e^{-y}}&text{if};yvarepsilon>0\ varepsilon+displaystylefrac{1-varepsilon}{1+e^{-y}}&text{if};ygeq 0,;text{for some};varepsilon, 1>varepsilon>0 end{cases}nonumber end{equation}
      1. begin{equation}nonumber begin{aligned} displaystylelim_{yto-infty}F_Y(y)&=displaystylelim_{yto-infty} left(displaystylefrac{1-varepsilon}{1+e^{-y}}right)=displaystylelim_{yto-infty} left(displaystylefrac{1-varepsilon}{1+frac{1}{e^{y}}}right)=0\[0.5cm] displaystylelim_{ytoinfty}F(y)&=displaystylelim_{ytoinfty} left(varepsilon+displaystylefrac{1-varepsilon}{1+e^{-y}}right)=varepsilon + displaystylelim_{ytoinfty} left(displaystylefrac{1-varepsilon}{1+frac{1}{e^{y}}}right)=1 end{aligned} end{equation}
      2. For $y0$ since $00$, implying that the function is increasing.

        For $ygeq 0$, begin{equation} begin{aligned} frac{d}{dy}left(varepsilon+frac{1-varepsilon}{1+e^{-y}}right)&=varepsilon+frac{(1-varepsilon)e^{-y}}{(1+e^{-y})^2} end{aligned}nonumber end{equation} The function is increasing since $varepsilon + frac{(1-varepsilon)e^{-y}}{(1+e^{-y})^2}>0$ for all $ygeq 0$.

      3. Since the function is continuous, then the function is right-continuous.


  2. Proof. We know that, begin{equation}nonumber P(X>t)=1-P(Xleq t)=1-F_X(t) end{equation} and begin{equation}nonumber P(Y>t)=1-P(Yleq t)=1-F_Y(t) end{equation} Hence we have, begin{equation}nonumber begin{aligned} P(X>t)=1-F_X(t);&overset{?}{geq};1-F_Y(t)=P(Y>t)\ end{aligned} end{equation} Since $F_X(t)leq F_Y(t)$, then the difference $1-F_X(t)$ tends to get bigger than $1-F_Y(t)$. Thus for all $t$, $P(X>t)geq P(X>t)$.

    Now if $F_X(t) < F_Y(t)$ for some $t$, then using the same argument above, $P(X>t) > P(X>t)$ for some $t$.

  3. Proof. For a function to be a pdf, it has to satisfy the following:
    1. $g(x)geq 0$ for all $x$; and,
    2. $displaystyleint_{-infty}^{infty}g(x),dx=1$.

    For any arbitrary $x_0$, $F(x_0)

  4. In order for $f(x)$ to be a pdf, it has to integrate to 1.
    1. begin{equation} begin{aligned} int_{-infty}^{infty}f(x)&=int_{0}^{frac{pi}{2}}mathrm{c}sin x=displaystyleleft.-(mathrm{c})cos xdisplaystylerightrvert_{0}^{frac{pi}{2}}\ &=-mathrm{c}left(cosleft(frac{pi}{2}right)-cos(0)right)\ &=-mathrm{c}(0-1)=1mathrm{c} end{aligned}nonumber end{equation} Hence, $mathrm{c}$ is 1. Confirm this with python,
    2. begin{equation} begin{aligned} int_{-infty}^{infty}f(x)&=int_{-infty}^{infty} mathrm{c},e^{-|x|}\ &=mathrm{c}left(int_{-infty}^{0} ,e^{x},dx+int_{0}^{infty} e^{-x},dxright)\ &=mathrm{c}left[(e^{0}-e^{-infty})-(e^{-infty}-e^{0})right]\ &=mathrm{c}(1+1) = 2mathrm{c} end{aligned}nonumber end{equation} Hence, c is $frac{1}{2}$. Confirm this with Python,


  1. Casella, G. and Berger, R.L. (2001). Statistical Inference. Thomson Learning, Inc.