Photo op with python
For a fee, some visitors to the Philippine Eagle Center in Malagos, Davao City pose with a python being handled by locals outside the facility on Saturday, December 2, 2017. MindaNews photo by H. MARCOS C. MORDENO
For a fee, some visitors to the Philippine Eagle Center in Malagos, Davao City pose with a python being handled by locals outside the facility on Saturday, December 2, 2017. MindaNews photo by H. MARCOS C. MORDENO
DAVAO CITY (MindaNews / 13 Apr) – A homegrown IT company here is holding a crash course on programming using the computer language “Python” which seeks to further develop the skills of local software programmers to… »
One of the problems often dealt in Statistics is minimization of the objective function. And contrary to the linear models, there is no analytical solution for models that are nonlinear on the parameters such as logistic regression, neural networks, an…
In my previous article, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical sides, however, I briefly mentioned the estimation procedure for the parameter $boldsymbol{beta}$. So to help us understand how software does the estimation procedure, we’ll look at the mathematics behind it. We will also perform the estimation manually in R and in Python, that means we’re not going to use any special packages, this will help us appreciate the theory.
Consider the linear regression model, [ y_i=f_i(mathbf{x}boldsymbol{beta})+varepsilon_i,quadmathbf{x}_i=left[ begin{array}{cccc} 1&x_{11}&cdots&x_{1p} end{array}right],quadboldsymbol{beta}=left[begin{array}{c}beta_0\beta_1\vdots\beta_pend{array}right], ] where $y_i$ is the response or the dependent variable at the $i$th case, $i=1,cdots, N$. The $f_i(mathbf{x}boldsymbol{beta})$ is the deterministic part of the model that depends on both the parameters $boldsymbol{beta}inmathbb{R}^{p+1}$ and the predictor variable $mathbf{x}_i$, which in matrix form, say $mathbf{X}$, is represented as follows [ mathbf{X}=left[ begin{array}{cccccc} 1&x_{11}&cdots&x_{1p}\ 1&x_{21}&cdots&x_{2p}\ vdots&vdots&ddots&vdots\ 1&x_{N1}&cdots&x_{Np}\ end{array} right]. ] $varepsilon_i$ is the error term at the $i$th case which we assumed to be Gaussian distributed with mean 0 and variance $sigma^2$. So that [ mathbb{E}y_i=f_i(mathbf{x}boldsymbol{beta}), ] i.e. $f_i(mathbf{x}boldsymbol{beta})$ is the expectation function. The uncertainty around the response variable is also modelled by Gaussian distribution. Specifically, if $Y=f(mathbf{x}boldsymbol{beta})+varepsilon$ and $yin Y$ such that $y>0$, then begin{align*} mathbb{P}[Yleq y]&=mathbb{P}[f(xbeta)+varepsilonleq y]\ &=mathbb{P}[varepsilonleq yf(mathbf{x}boldsymbol{beta})]=mathbb{P}left[frac{varepsilon}{sigma}leq frac{yf(mathbf{x}boldsymbol{beta})}{sigma}right]\ &=Phileft[frac{yf(mathbf{x}boldsymbol{beta})}{sigma}right], end{align*} where $Phi$ denotes the Gaussian distribution with density denoted by $phi$ below. Hence $Ysimmathcal{N}(f(mathbf{x}boldsymbol{beta}),sigma^2)$. That is, begin{align*} frac{operatorname{d}}{operatorname{d}y}Phileft[frac{yf(mathbf{x}boldsymbol{beta})}{sigma}right]&=phileft[frac{yf(mathbf{x}boldsymbol{beta})}{sigma}right]frac{1}{sigma}=mathbb{P}[yf(mathbf{x}boldsymbol{beta}),sigma^2]\ &=frac{1}{sqrt{2pi}sigma}expleft{frac{1}{2}left[frac{yf(mathbf{x}boldsymbol{beta})}{sigma}right]^2right}. end{align*} If the data are independent and identically distributed, then the loglikelihood function of $y$ is, begin{align*} mathcal{L}[boldsymbol{beta}mathbf{y},mathbf{X},sigma]&=mathbb{P}[mathbf{y}mathbf{X},boldsymbol{beta},sigma]=prod_{i=1}^Nfrac{1}{sqrt{2pi}sigma}expleft{frac{1}{2}left[frac{y_if_i(mathbf{x}boldsymbol{beta})}{sigma}right]^2right}\ &=frac{1}{(2pi)^{frac{n}{2}}sigma^n}expleft{frac{1}{2}sum_{i=1}^Nleft[frac{y_if_i(mathbf{x}boldsymbol{beta})}{sigma}right]^2right}\ logmathcal{L}[boldsymbol{beta}mathbf{y},mathbf{X},sigma]&=frac{n}{2}log2pinlogsigmafrac{1}{2sigma^2}sum_{i=1}^Nleft[y_if_i(mathbf{x}boldsymbol{beta})right]^2. end{align*} And because the likelihood function tells us about the plausibility of the parameter $boldsymbol{beta}$ in explaining the sample data. We therefore want to find the best estimate of $boldsymbol{beta}$ that likely generated the sample. Thus our goal is to maximize the likelihood function which is equivalent to maximizing the loglikelihood with respect to $boldsymbol{beta}$. And that’s simply done by taking the partial derivative with respect to the parameter $boldsymbol{beta}$. Therefore, the first two terms in the right hand side of the equation above can be disregarded since it does not depend on $boldsymbol{beta}$. Also, the location of the maximum loglikelihood with respect to $boldsymbol{beta}$ is not affected by arbitrary positive scalar multiplication, so the factor $frac{1}{2sigma^2}$ can be omitted. And we are left with the following equation, begin{equation}label{eq:1} sum_{i=1}^Nleft[y_if_i(mathbf{x}boldsymbol{beta})right]^2. end{equation} One last thing is that, instead of maximizing the loglikelihood function we can do minimization on the negative loglikelihood. Hence we are interested on minimizing the negative of Equation (ref{eq:1}) which is begin{equation}label{eq:2} sum_{i=1}^Nleft[y_if_i(mathbf{x}boldsymbol{beta})right]^2, end{equation} popularly known as the residual sum of squares (RSS). So RSS is a consequence of maximum loglikelihood under the Gaussian assumption of the uncertainty around the response variable $y$. For models with two parameters, say $beta_0$ and $beta_1$ the RSS can be visualized like the one in my previous article, that is
Performing differentiation under $(p+1)$dimensional parameter $boldsymbol{beta}$ is manageable in the context of linear algebra, so Equation (ref{eq:2}) is equivalent to begin{align*} lVertmathbf{y}mathbf{X}boldsymbol{beta}rVert^2&=langlemathbf{y}mathbf{X}boldsymbol{beta},mathbf{y}mathbf{X}boldsymbol{beta}rangle=mathbf{y}^{text{T}}mathbf{y}mathbf{y}^{text{T}}mathbf{X}boldsymbol{beta}(mathbf{X}boldsymbol{beta})^{text{T}}mathbf{y}+(mathbf{X}boldsymbol{beta})^{text{T}}mathbf{X}boldsymbol{beta}\ &=mathbf{y}^{text{T}}mathbf{y}mathbf{y}^{text{T}}mathbf{X}boldsymbol{beta}boldsymbol{beta}^{text{T}}mathbf{X}^{text{T}}mathbf{y}+boldsymbol{beta}^{text{T}}mathbf{X}^{text{T}}mathbf{X}boldsymbol{beta} end{align*} And the derivative with respect to the parameter is begin{align*} frac{operatorname{partial}}{operatorname{partial}boldsymbol{beta}}lVertmathbf{y}mathbf{X}boldsymbol{beta}rVert^2&=2mathbf{X}^{text{T}}mathbf{y}+2mathbf{X}^{text{T}}mathbf{X}boldsymbol{beta} end{align*} Taking the critical point by setting the above equation to zero vector, we have begin{align} frac{operatorname{partial}}{operatorname{partial}boldsymbol{beta}}lVertmathbf{y}mathbf{X}hat{boldsymbol{beta}}rVert^2&overset{text{set}}{=}mathbf{0}nonumber\ mathbf{X}^{text{T}}mathbf{y}+mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{0}nonumber\ mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{X}^{text{T}}mathbf{y}label{eq:norm} end{align} Equation (ref{eq:norm}) is called the normal equation. If $mathbf{X}$ is full rank, then we can compute the inverse of $mathbf{X}^{text{T}}mathbf{X}$, begin{align} mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=mathbf{X}^{text{T}}mathbf{y}nonumber\ (mathbf{X}^{text{T}}mathbf{X})^{1}mathbf{X}^{text{T}}mathbf{X}hat{boldsymbol{beta}}&=(mathbf{X}^{text{T}}mathbf{X})^{1}mathbf{X}^{text{T}}mathbf{y}nonumber\ hat{boldsymbol{beta}}&=(mathbf{X}^{text{T}}mathbf{X})^{1}mathbf{X}^{text{T}}mathbf{y}.label{eq:betahat} end{align} That’s it, since both $mathbf{X}$ and $mathbf{y}$ are known.
If $mathbf{X}$ is full rank and spans the subspace $Vsubseteqmathbb{R}^N$, where $mathbb{E}mathbf{y}=mathbf{X}boldsymbol{beta}in V$. Then the predicted values of $mathbf{y}$ is given by, begin{equation}label{eq:pred} hat{mathbf{y}}=mathbb{E}mathbf{y}=mathbf{P}_{V}mathbf{y}=mathbf{X}(mathbf{X}^{text{T}}mathbf{X})^{1}mathbf{X}^{text{T}}mathbf{y}, end{equation} where $mathbf{P}$ is the projection matrix onto the space $V$. For proof of the projection matrix in Equation (ref{eq:pred}) please refer to reference (1) below. Notice that this is equivalent to begin{equation}label{eq:yhbh} hat{mathbf{y}}=mathbb{E}mathbf{y}=mathbf{X}hat{boldsymbol{beta}}. end{equation}
Let’s fire up R and Python and see how we can apply those equations we derived. For purpose of illustration, we’re going to simulate data from Gaussian distributed population. To do so, consider the following codes
R ScriptPython ScriptHere we have two predictors x1
and x2
, and our response variable y
is generated by the parameters $beta_1=3.5$ and $beta_2=2.8$, and it has Gaussian noise with variance 7. While we set the same random seeds for both R and Python, we should not expect the random values generated in both languages to be identical, instead both values are independent and identically distributed (iid). For visualization, I will use Python Plotly, you can also translate it to R Plotly.
Now let’s estimate the parameter $boldsymbol{beta}$ which by default we set to $beta_1=3.5$ and $beta_2=2.8$. We will use Equation (ref{eq:betahat}) for estimation. So that we have
R ScriptPython ScriptThat’s a good estimate, and again just a reminder, the estimate in R and in Python are different because we have different random samples, the important thing is that both are iid. To proceed, we’ll do prediction using Equations (ref{eq:pred}). That is,
R ScriptPython ScriptThe first column above is the data y
and the second column is the prediction due to Equation (ref{eq:pred}). Thus if we are to expand the prediction into an expectation plane, then we have
You have to rotate the plot by the way to see the plane, I still can’t figure out how to change it in Plotly. Anyway, at this point we can proceed computing for other statistics like the variance of the error, and so on. But I will leave it for you to explore. Our aim here is just to give us an understanding on what is happening inside the internals of our software when we try to estimate the parameters of the linear regression models.
The following is the plot of the residual sum of squares of the data base on the SLR model over $beta_0$ and $beta_1$, note that we standardized the variables first before plotting it,
If you are interested on the codes of the above figure, please click here. To minimize this elliptic paraboloid, differentiation has to be done with respect to the parameters, and then equate this to zero to obtain the stationary point, and finally solve for $beta_0$ and $beta_1$. For more on derivation of the estimates of the parameters see reference 1.
In R, we can fit the model using the lm
function, which stands for linear models, i.e.
Formula, defined above as {response ~ predictor}
, is a handy method for fitting model to the data in R. Mathematically, our model is $$ weight = beta_0 + beta_1 (height) + varepsilon. $$ The summary of it is obtain by running model %>% summary
or for nonmagrittr user summary(model)
, given the model
object defined in the previous code,
The Coefficients section above returns the estimated coefficients of the model, and these are $beta_0 = 87.51667$ and $beta_1=3.45000$ (it should be clear that we used the unstandardized variables for obtaining these estimates). The estimates are both significant base on the pvalue under .05 and even in .01 level of the test. Using the estimated coefficients along with the residual standard error we can now construct the fitted line and it’s confidence interval as shown below.
Fig 1. Plot of the Data and the Predicted Values in R. 
In Python, there are two modules that have implementation of linear regression modelling, one is in scikitlearn (sklearn
) and the other is in Statsmodels (statsmodels
). For example we can model the above data using sklearn
as follows:
Above output is the estimate of the parameters, to obtain the predicted values and plot these along with the data points like what we did in R, we can wrapped the functions above into a class called linear_regression
say, that requires Seaborn package for neat plotting, see the codes below:
Using this class and its methods, fitting the model to the data is coded as follows:
The predicted values of the data points is obtain using the predict
method,
And Figure 2 below shows the plot of the predicted values along with its confidence interval and data points.
Fig 2. Plot of the Data and the Predicted Values in Python. 
If one is only interested on the estimates of the model, then LinearRegression
of scikitlearn is sufficient, but if the interest on other statistics such as that returned in R model summary is necessary, the said module can also do the job but might need to program other necessary routine. statsmodels
, on the other hand, returns complete summary of the fitted model as compared to the R output above, which is useful for studies with particular interest on these information. So that modelling the data using simple linear regression is done as follows:
Clearly, we could spare time with statsmodels, especially in diagnostic checking involving test statistics such as DurbinWatson and JarqueBera tests. We can of course add some plotting for diagnostic, but I prefer to discuss that on a separate entry.
Now let’s consider running the data in SAS, I am using SAS Studio and in order to import the data, I saved it as a CSV file first with columns height and weight. Uploaded it to SAS Studio, in which follows are the codes below to import the data.
Next we fit the model to the data using the REG
procedure,
Number of Observations Read  15 

Number of Observations Used  15 
Analysis of Variance  

Source  DF  Sum of Squares 
Mean Square 
F Value  Pr > F 
Model  1  3332.70000  3332.70000  1433.02  <.0001 
Error  13  30.23333  2.32564  
Corrected Total  14  3362.93333 
Root MSE  1.52501  RSquare  0.9910 

Dependent Mean  136.73333  Adj RSq  0.9903 
Coeff Var  1.11531 
Parameter Estimates  

Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t 
Intercept  1  87.51667  5.93694  14.74  <.0001 
height  1  3.45000  0.09114  37.86  <.0001 
To extend SLR to MLR, we’ll demonstrate this by simulation. Using the formulabased lm
function of R, assuming we have $x_1$ and $x_2$ as our predictors, then following is how we do MLR in R:
Although we did not use intercept in simulating the data, but the obtained estimates for $beta_1$ and $beta_2$ are close to the true parameters (.35 and .56). The intercept, however, will help us capture the noise term we added in simulation.
Next we’ll try MLR in Python using statsmodels, consider the following:
It should be noted that, the estimates in R and in Python should not (necessarily) be the same since these are simulated values from different software. Finally, we can perform MLR in SAS as follows:
Number of Observations Read  100 

Number of Observations Used  100 
Analysis of Variance  

Source  DF  Sum of Squares 
Mean Square 
F Value  Pr > F 
Model  2  610.86535  305.43268  303.88  <.0001 
Error  97  97.49521  1.00511  
Corrected Total  99  708.36056 
Root MSE  1.00255  RSquare  0.8624 

Dependent Mean  244.07327  Adj RSq  0.8595 
Coeff Var  0.41076 
Parameter Estimates  

Variable  DF  Parameter Estimate 
Standard Error 
t Value  Pr > t 
Intercept  1  18.01299  11.10116  1.62  0.1079 
X1  1  0.31770  0.01818  17.47  <.0001 
X2  1  0.58276  0.03358  17.35  <.0001 
In conclusion, SAS saves a lot of work, since it returns complete summary of the model, no doubt why companies prefer to use this, besides from their active customer support. R and Python, on the other hand, despite the fact that it is opensource, it can well compete with the former, although it requires programming skills to achieved all of the SAS outputs, but I think that’s the exciting part of it — it makes you think, and manage time. The achievement in R and Python is of course fulfilling. Hope you’ve learned something, feel free to share your thoughts on the comment below.
In Statistics, we model random phenomenon and make conclusions about its population. For example, in an experiment of determining the true heights of the students in the university. Suppose we take sample from the population of the students, and consider testing the null hypothesis that the average height is 5.4 ft against an alternative hypothesis that the average height is greater than 5.4 ft. Mathematically, we can represent this as $H_0:theta=theta_0$ vs $H_1:theta>theta_0$, where $theta$ is the true value of the parameter, and $theta_0=5.4$ is the testing value set by the experimenter. And because we only consider subset (the sample) of the population for testing the hypotheses, then we expect for errors we commit. To understand these errors, consider if the above test results into rejecting $H_0$ given that $thetainTheta_0$, where $Theta_0$ is the parameter space of the null hypothesis, in other words we mistakenly reject $H_0$, then in this case we committed a Type I error. Another is, if the above test results into accepting $H_0$ given that $thetainTheta_0^c$, where $Theta_0^c$ is the parameter space of the alternative hypothesis, then we committed a Type II error. To summarize this consider the following table,
Truth  Decision  

Table 1: Two Types of Errors in Hypothesis Testing.


Accept $H_0$  Reject $H_0$  
$H_0$  Correct Decision  Type I Error  
$H_1$  Type II Error  Correct Decision 
Let’s formally define the power function, from Casella and Berger (2001), see reference 1.
Definition 1. The power function of a hypothesis test with rejection region $R$ is the function of $theta$ defined by $beta(theta)=mathrm{P}_{theta}(mathbf{X}in R)$.
To relate the definition to the above problem, if $R$ is the rejection region of $H_0$. Then we make mistake if the sample observed, $mathbf{x}$, $mathbf{x}in R$ given that $thetainTheta_0$. That is, $beta(theta)=mathrm{P}_{theta}(mathbf{X}in R)$ is the probability of Type I error. Let’s consider an example, one that is popularly used in testing the sample mean. The example below is the combined problem of Example 8.3.3 and Exercise 8.37 (a) of reference 1.
Example 1. Let $X_1,cdots, X_noverset{r.s.}{sim}N(mu,sigma^2)$ — normal population where $sigma^2$ is known. Consider testing $H_0:thetaleq theta_0$ vs $H_1:theta> theta_0$, obtain the likelihood ratio test (LRT) statistic and its power function.
Solution:The LRT statistic is given by $$ lambda(mathbf{x})=frac{displaystylesup_{thetaleqtheta_0}L(thetamathbf{x})}{displaystylesup_{inftyentry, $lambda(mathbf{x})$ is rejected if it is small, such that $lambda(mathbf{x})leq c$ for some $cin[0,1]$. Hence, $$ begin{aligned} lambda(mathbf{x})&=expleft[frac{n(bar{x}theta_0)^2}{2sigma^2}right]sqrt{2log c}. end{aligned} $$ So that $H_0$ is rejected if $frac{bar{x}theta_0}{sigma/sqrt{n}}> c’$ for some $c’=sqrt{2log c}in[0,infty)$. Now the power function of the test, is the probability of rejecting the null hypothesis given that it is true, or the probability of the Type I error given by, $$ begin{aligned} beta(theta)&=mathrm{P}left[frac{bar{x}theta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}theta+thetatheta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}theta}{sigma/sqrt{n}}+frac{thetatheta_0}{sigma/sqrt{n}}> c’right]\ &=mathrm{P}left[frac{bar{x}theta}{sigma/sqrt{n}}> c’frac{thetatheta_0}{sigma/sqrt{n}}right]\ &=1mathrm{P}left[frac{bar{x}theta}{sigma/sqrt{n}}leq c’+frac{theta_0theta}{sigma/sqrt{n}}right]\ &=1Phileft[c’+frac{theta_0theta}{sigma/sqrt{n}}right]. end{aligned} $$ To illustrate this, consider $theta_0=5.4,sigma = 1,n=30$ and $c’=1.645$. Then the plot of the power function as a function of $theta$ is,
Since $beta$ is an increasing function with unit range, then $$ alpha = sup_{thetaleqtheta_0}beta(theta)=beta(theta_0)=1Phi(c’). $$ So that using values we set for the above graph, $alpha=0.049985approx 0.05$, $alpha$ here is called the size of the test since it is the supremum of the power function over $thetaleqtheta_0$, see reference 1 for level of the test. Now let’s investigate the power function above, the probability of committing Type I error, $beta(theta), forall thetaleq theta_0$, is acceptably small. However, the probability of committing Type II error, $1beta(theta), forall theta > theta_0$, is too high as we can see in the following plot,
Therefore, it’s better to investigate the error structure when considering the power of the test. From Casella and Berger (2001), the ideal power function is 0 $forallthetainTheta_0$ and 1 $forallthetainTheta_0^c$. Except in trivial situations, this ideal cannot be attained. Qualitatively, a good test has power function near 1 for most $thetainTheta_0^c$ and $thetainTheta_0$. Implying, one that has steeper power curve.
Now an interesting fact about power function is that it depends on the sample size $n$. Suppose in our experiment above we want the Type I error to be 0.05 and the Type II error to be 0.1 if $thetageq theta_0+sigma/2$. Since the power function is increasing, then we have $$ beta(theta_0)=0.05Rightarrow c’=1.645quadtext{and}quad 1 – beta(theta_0+sigma/2)=0.1Rightarrowbeta(theta_0+sigma/2)=0.9. $$ Where $$ begin{aligned} beta(theta_0+sigma/2)&=1Phileft[c’ +frac{theta_0sigma/2theta_0}{sigma/sqrt{n}}right]\ &=1Phileft[c’ – frac{sqrt{n}}{2}right]\ 0.9&=1Phileft[1.645 – frac{sqrt{n}}{2}right]\ 0.1&=Phileft[1.645 – frac{sqrt{n}}{2}right].\ end{aligned} $$ Hence, $n$ is chosen such that it solves the above equation. That is, $$ begin{aligned} 1.645 – frac{sqrt{n}}{2}&=1.28155,quadtext{since }Phi(1.28155)=0.1\ frac{3.29 – sqrt{n}}{2}&=1.28155\ 3.29 – sqrt{n}&=2.5631\ n&=(3.29+2.5631)^2=34.25878,;text{take }n=35. end{aligned} $$ For purpose of illustration, we’ll consider the nonrounded value of $n$. Below is the plot of this,
And for different values of $n$, consider the following power functions
From the above plot, the larger the sample size, $n$, the steeper the curve implying a better error structure. To see this, try hovering over the lines in the plot, and you’ll witness a fast departure for values of large $n$ on the unit range, this characteristics contribute to the sensitivity of the test.
In case you want to reproduce the above plots, click here for the source code.
Hypothesis testing have been extensively used on different discipline of science. And in this post, I will attempt on discussing the basic theory behind this, the Likelihood Ratio Test (LRT) defined below from Casella and Berger (2001), see reference 1.
Definition. The likelihood ratio test statistic for testing $H_0:thetainTheta_0$ versus $H_1:thetainTheta_0^c$ is begin{equation} label{eq:lrt} lambda(mathbf{x})=frac{displaystylesup_{thetainTheta_0}L(thetamathbf{x})}{displaystylesup_{thetainTheta}L(thetamathbf{x})}. end{equation} A likelihood ratio test (LRT) is any test that has a rejection region of the form ${mathbf{x}:lambda(mathbf{x})leq c}$, where $c$ is any number satisfying $0leq c leq 1$.
The numerator of equation (ref{eq:lrt}) gives us the supremum probability of the parameter, $theta$, over the restricted domain (null hypothesis, $Theta_0$) of the parameter space $Theta$, that maximizes the joint probability of the sample, $mathbf{x}$. While the denominator of the LRT gives us the supremum probability of the parameter, $theta$, over the unrestricted domain, $Theta$, that maximizes the joint probability of the sample, $mathbf{x}$. Therefore, if the value of $lambda(mathbf{x})$ is small such that $lambda(mathbf{x})leq c$, for some $cin [0, 1]$, then the true value of the parameter that is plausible in explaining the sample is likely to be in the alternative hypothesis, $Theta_0^c$.
Example 1. Let $X_1,X_2,cdots,X_noverset{r.s.}{sim}f(xtheta)=frac{1}{theta}expleft[frac{x}{theta}right],x>0,theta>0$. From this sample, consider testing $H_0:theta = theta_0$ vs $H_1:theta
Solution:
The parameter space $Theta$ is the set $(0,Theta_0]$, where $Theta_0={theta_0}$. Hence, using the likelihood ratio test, we have $$ lambda(mathbf{x})=frac{displaystylesup_{theta=theta_0}L(thetamathbf{x})}{displaystylesup_{thetaleqtheta_0}L(thetamathbf{x})}, $$ where, $$ begin{aligned} sup_{theta=theta_0}L(thetamathbf{x})&=sup_{theta=theta_0}prod_{i=1}^{n}frac{1}{theta}expleft[frac{x_i}{theta}right]\ &=sup_{theta=theta_0}left(frac{1}{theta}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{theta}right]\ &=left(frac{1}{theta_0}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right], end{aligned} $$ and $$ begin{aligned} sup_{thetaleqtheta_0}L(thetamathbf{x})&=sup_{thetaleqtheta_0}prod_{i=1}^{n}frac{1}{theta}expleft[frac{x_i}{theta}right]\ &=sup_{thetaleqtheta_0}left(frac{1}{theta}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{theta}right]=sup_{thetaleqtheta_0}f(mathbf{x}theta). end{aligned} $$ Now the supremum of $f(mathbf{x}theta)$ over all values of $thetaleqtheta_0$ is the MLE (maximum likelihood estimator) of $f(xtheta)$, which is $bar{x}$, provided that $bar{x}leq theta_0$.
So that, $$ begin{aligned} lambda(mathbf{x})&=frac{left(frac{1}{theta_0}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]} {left(frac{1}{bar{x}}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{bar{x}}right]},quadtext{provided that};bar{x}leq theta_0\ &=left(frac{bar{x}}{theta_0}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]exp[n]. end{aligned} $$ And we say that, if $lambda(mathbf{x})leq c$, $H_0$ is rejected. That is, $$ begin{aligned} left(frac{bar{x}}{theta_0}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]exp[n]&leq c\ left(frac{bar{x}}{theta_0}right)^nexpleft[displaystylefrac{sum_{i=1}^{n}x_i}{theta_0}right]&leq c’,quadtext{where};c’=frac{c}{exp[n]}\ nlogleft(frac{bar{x}}{theta_0}right)frac{n}{theta_0}bar{x}&leq log c’\ logleft(frac{bar{x}}{theta_0}right)frac{bar{x}}{theta_0}&leq frac{1}{n}log c’\ logleft(frac{bar{x}}{theta_0}right)frac{bar{x}}{theta_0}&leq frac{1}{n}log c1. end{aligned} $$ Now let $h(x)=log x – x$, then $h'(x)=frac{1}{x}1$. So the critical point of $h'(x)$ is $x=1$. And to test if this is maximum or minimum, we apply second derivative test. That is, $$ h”(x)=frac{1}{x^2}
$y_1 = 98$  $y_2 = 102$  $y_3=154$ 
$y_4 = 133$  $y_5 = 190$  $y_6=175$ 
Sample No.  Sample, $mathcal{S}$  $P(mathcal{S})$ 
1  ${1,3,5}$  $1/8$ 
2  ${1,3,6}$  $1/8$ 
3  ${1,4,5}$  $1/8$ 
4  ${1,4,6}$  $1/8$ 
5  ${2,3,5}$  $1/8$ 
6  ${2,3,6}$  $1/8$ 
7  ${2,4,5}$  $1/8$ 
8  ${2,4,6}$  $1/8$ 
Age (months)  9  10  11  12  13  14  15  16  17  18  19  20 
Number of Children  13  35  44  69  36  24  7  3  2  5  1  1 
Village  $Y_j$  Village  $Y_j$  Village  $Y_j$  Village  $Y_j$ 
1  105  11  319  21  70  31  16 
2  625  12  72  22  249  32  439 
3  47  13  109  23  384  33  123 
4  312  14  91  24  482  34  207 
5  327  15  152  25  378  35  145 
6  230  16  189  26  111  36  666 
7  240  17  365  27  534  37  338 
8  203  18  70  28  306  38  624 
9  535  9  249  29  655  39  501 
10  275  20  384  30  102  40  962 
In order to appreciate the codes, I will share some theoretical part of the solution. But our main focus here is to solve this problem computationally using Python and R.
Python Code R Code
Python Code R Code
Python Code R Code From the above code, the output tells us that $mathrm{E}bar{y}=140$.
Python Code R Code So that using the above output, 20182.94, and subtracting $(mathrm{E}bar{y})^2$ to it, will give us the variance. And hence the succeeding code:
Python Code: R Code: So the variance of the $bar{y}$ is $18.9444$.
p_s
function defined below. After obtaining the probabilities, we can then compute the expected value using the expectation
function we defined earlier.
Python Code R Code It should be clear in the data that the average age is about 12 months old, where the plot of it is shown below, For the code of the above plot please click here. Next is to compute the standard error which is just the square root of the variance of the sample,
Python Code R Code So the standard variability of the Age is 1.920824.
Python Code R Code You may notice in the output above, that the index returned in Python is not the same with the index returned in R. This is because Python index starts with 0, while that in R starts with 1. So that’s why we have the same population units sampled between the two language despite the differences between the index returned.
Let $X_1, X_2,cdots$ be iid (independent and identically distributed) and $X_{(n)}=max_{1leq ileq n}x_i$.
Therefore, if $1e^{y^{beta}}$ is a distribution function of $Y$, then $Y_n=n^{nu}(1X_{(n)})$ converges in distribution to $Y$ for $nu=frac{1}{beta}$.
$hspace{12.5cm}blacksquare$
In conclusion, if $e^{e^{y}}$ is a distribution function of Y, then $Y_n=X_{(n)}a_n$ converges in distribution to $Y$ for sequence $a_n=log n$.
$hspace{12.5cm}blacksquare$
This is the crucial step, we need to import the data in order to proceed with the succeeding analysis. And often times data are in CSV format, if not, at least can be converted to CSV format. In Python we can do this using the following codes:
To read CSV file locally, we need the pandas
module which is a python data analysis library. The read_csv
function can read data both locally and from the web.
Now that we have the data in the workspace, next is to do transformation. Statisticians and scientists often do this step to remove unnecessary data not included in the analysis. Let’s view the data first:
To R programmers, above is the equivalent of print(head(df))
which prints the first six rows of the data, and print(tail(df))
— the last six rows of the data, respectively. In Python, however, the number of rows for head of the data by default is 5 unlike in R, which is 6. So that the equivalent of the R code head(df, n = 10)
in Python, is df.head(n = 10)
. Same goes for the tail of the data.
Column and row names of the data are extracted using the colnames
and rownames
functions in R, respectively. In Python, we extract it using the columns
and index
attributes. That is,
Transposing the data is obtain using the T
method,
Other transformations such as sort can be done using sort
attribute. Now let’s extract a specific column. In Python, we do it using either iloc
or ix
attributes, but ix
is more robust and thus I prefer it. Assuming we want the head of the first column of the data, we have
By the way, the indexing in Python starts with 0 and not 1. To slice the index and first three columns of the 11th to 21st rows, run the following
Which is equivalent to print df.ix[10:20, ['Abra', 'Apayao', 'Benguet']]
To drop a column in the data, say columns 1 (Apayao) and 2 (Benguet), use the drop
attribute. That is,
axis
argument above tells the function to drop with respect to columns, if axis = 0
, then the function drops with respect to rows.
Next step is to do descriptive statistics for preliminary analysis of our data using the describe
attribute:
Python has a great package for statistical inference. And that’s the stats library of scipy. The one sample ttest is implemented in ttest_1samp
function. So that, if we want to test the mean of the Abra’s volume of palay production against the null hypothesis with 15000 assumed population mean of the volume of palay production, we have
The values returned are tuple of the following values:
From the above numerical output, we see that the pvalue = 0.2627 is greater than $alpha=0.05$, hence there is no sufficient evidence to conclude that the average volume of palay production is not equal to 15000. Applying this test for all variables against the population mean 15000 volume of production, we have
The first array returned is the tstatistic of the data, and the second array is the corresponding pvalues.
There are several module for visualization in Python, and the most popular one is the matplotlib library. To mention few, we have bokeh and seaborn modules as well to choose from. In my previous post, I’ve demonstrated the matplotlib package which has the following graphic for boxwhisker plot,
Now plotting using pandas module can beautify the above plot into the theme of the popular R plotting package, the ggplot. To use the ggplot theme just add one more line to the above code,
And you’ll have the following,
Even neater than the default matplotlib.pyplot theme. But in this post, I would like to introduce the seaborn module which is a statistical data visualization library. So that, we have the following
Sexy boxplot, scroll down for more.
To define a custom function in Python, we use the def
function. For example, say we define a function that will add two numbers, we do it as follows,
By the way, in Python indentation is important. Use indentation for scope of the function, which in R we do it with braces {...}
. Now here’s an algorithm from my previous post,
Coding this in Python we have,
Above code might be easy to read, but it’s slow in replication. Below is the improvement of the above code, thanks to Python gurus, see comments on my previous post.
For those who are interested in the ipython notebook of this article, please click here. This article was converted to ipython notebook by of Nuttens Claude.
Prove that the Axiom of Continuity and the Axiom of Finite Additivity imply Countable Additivity.
Above figure was generated by the following $mathrm{LaTeX}$ codes:
begin{equation}nonumber begin{aligned} displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(frac{1}{2}+frac{1}{pi}arctan(x)right)\ &=frac{1}{2}+frac{1}{pi}displaystylelim_{xtoinfty}left(arctan(x)right)\ &=frac{1}{2}+frac{1}{pi} left(frac{pi}{2}right),;text{since};displaystylelim_{xtofrac{pi}{2}}frac{sin(x)}{cos(x)}=infty\ &=0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(frac{1}{2}+frac{1}{pi}arctan(x)right)\ &=frac{1}{2}+frac{1}{pi}displaystylelim_{xtoinfty}left(arctan(x)right)\ &=frac{1}{2}+frac{1}{pi} left(frac{pi}{2}right),;text{since};displaystylelim_{xtofrac{pi}{2}}frac{sin(x)}{cos(x)}=infty\ &=1 end{aligned} end{equation}
Since $x^2$ is always positive for all $x$, thus $frac{dF(x)}{dx}>0$, implying $F(x)$ is increasing.
$hspace{12.5cm}blacksquare$
begin{equation}nonumber begin{aligned} displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(frac{1}{1+e^{x}}right)\ &=0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(frac{1}{1+e^{x}}right)\ &=displaystylelim_{xtoinfty} left(frac{1}{1+frac{1}{e^{x}}}right)\ &=1 end{aligned} end{equation} Confirm these in Python,
$hspace{12.5cm}blacksquare$
begin{equation}nonumber begin{aligned} displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(e^{e^{x}}right)\ &=displaystylelim_{xtoinfty} left(frac{1}{e^{frac{1}{e^{x}}}}right)\ &=0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=displaystylelim_{xtoinfty} left(e^{e^{x}}right)\ &=displaystylelim_{xtoinfty} left(frac{1}{e^{frac{1}{e^{x}}}}right)\ &=1 end{aligned} end{equation}
$hspace{12.5cm}blacksquare$
begin{equation}nonumber begin{aligned} displaystylelim_{xtoinfty}F(x)&=displaystyle lim_{xto 0}F(x)=1displaystylelim_{xto 0} left(frac{1}{e^{x}}right) =0\[0.5cm] displaystylelim_{xtoinfty}F(x)&=1 displaystylelim_{xtoinfty} left(frac{1}{e^{x}}right)=1 end{aligned} end{equation}
$hspace{12.5cm}blacksquare$
For $ygeq 0$, begin{equation} begin{aligned} frac{d}{dy}left(varepsilon+frac{1varepsilon}{1+e^{y}}right)&=varepsilon+frac{(1varepsilon)e^{y}}{(1+e^{y})^2} end{aligned}nonumber end{equation} The function is increasing since $varepsilon + frac{(1varepsilon)e^{y}}{(1+e^{y})^2}>0$ for all $ygeq 0$.
$hspace{12.5cm}blacksquare$
Now if $F_X(t) < F_Y(t)$ for some $t$, then using the same argument above, $P(X>t) > P(X>t)$ for some $t$.
$hspace{13.5cm}blacksquare$
For any arbitrary $x_0$, $F(x_0)