Chi-Square Test
Non-parametric Test
- The tests of significance studied earlier, such as the Z-test, t-test and F-test, are based on the assumption that the samples are drawn from a normal population. Under this assumption the sampling distributions of the various test statistics are known.
- Since these procedures require knowledge of the type of population, or of the parameters of the population, from which the random samples are drawn, they are known as parametric tests.
- In many practical situations, however, no assumption about the distribution of the population or its parameters can be made. Techniques that make no assumption about the distribution or the parameters of the population are known as non-parametric tests.
- Chi-square test is an example of the non-parametric test.
- The chi-square test is a distribution-free test.
- The chi-square distribution was first discovered by Helmert in 1876 and later independently by Karl Pearson in 1900.
- The range of the chi-square distribution is 0 to ∞.
- If every observed frequency is equal to the corresponding expected frequency, then the value of the χ² statistic is zero.
- Measurement data: data obtained by actual measurement, for example height, weight, age, income, area, etc.
- Enumeration data: data obtained by enumeration or counting, for example number of blue flowers, number of intelligent boys, number of curled leaves, etc.
- The χ² test is used for enumeration data, which generally relate to discrete variables, whereas the t-test and standard normal deviate tests are used for measurement data, which generally relate to continuous variables.
- The χ² test can be used to find out whether the given objects are segregating in a theoretical ratio, or whether two attributes are independent in a contingency table.
- The expression for the χ² test for goodness of fit is:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
- Where
- Oi = observed frequencies
- Ei = expected frequencies
- n = number of cells (or classes)
- This statistic follows a chi-square distribution with (n-1) degrees of freedom.
- The null hypothesis H0: the observed frequencies are in agreement with the expected frequencies.
- If the calculated value of χ² < table value of χ² with (n-1) d.f. at the specified level of significance (α), we accept H0; otherwise we reject H0.
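The goodness-of-fit calculation can be sketched in Python. This is only an illustration and not part of the original notes: the function name `chi_square_gof`, the 3:1 ratio and the counts 290 and 110 are hypothetical. The table value 3.84 is the standard 5% value of χ² for 1 d.f.

```python
def chi_square_gof(observed, expected):
    """Return (chi-square statistic, degrees of freedom) for a goodness-of-fit test."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of classes")
    # Sum of (O - E)^2 / E over all classes
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # n - 1 degrees of freedom
    return chi2, df


# Hypothetical data: 400 plants tested against a theoretical 3:1 ratio.
observed = [290, 110]
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]  # 300 and 100, so sum(O) == sum(E)

chi2, df = chi_square_gof(observed, expected)
TABLE_VALUE_5PCT_1DF = 3.84  # table value of chi-square at 5% for 1 d.f.
accept_h0 = chi2 < TABLE_VALUE_5PCT_1DF
print(round(chi2, 3), df, accept_h0)
```

For these made-up counts the statistic is 100/300 + 100/100, which is below 3.84, so H0 (segregation in the 3:1 ratio) would be accepted.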
Conditions for the validity of the χ² test
- For the χ² test of goodness of fit between theoretical and observed frequencies to be valid, the following conditions must be satisfied.
- The sample observations should be independent.
- Constraints on the cell frequencies, if any, should be linear, e.g. ΣOi = ΣEi.
- N, the total frequency, should be reasonably large, say greater than 50.
- If any theoretical (expected) cell frequency is < 5, it is pooled with the preceding or succeeding frequency so that the pooled frequency is 5 or more, and the d.f. are then reduced by the number of classes lost in pooling.
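The pooling rule can be sketched as follows. This is one possible way to apply it, under stated assumptions: the helper name `pool_small_classes`, the greedy left-to-right strategy and the sample frequencies are all hypothetical. A small class is merged with the classes that follow it, and any small remainder at the end is merged with the preceding group.

```python
def pool_small_classes(observed, expected, minimum=5):
    """Pool adjacent classes until every pooled expected frequency is >= minimum."""
    pooled_obs, pooled_exp = [], []
    acc_o, acc_e, pending = 0, 0.0, False
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        pending = True
        if acc_e >= minimum:
            # The accumulated group is large enough: close it.
            pooled_obs.append(acc_o)
            pooled_exp.append(acc_e)
            acc_o, acc_e, pending = 0, 0.0, False
    if pending:
        if pooled_exp:
            # Small remainder at the end: pool it with the preceding group.
            pooled_obs[-1] += acc_o
            pooled_exp[-1] += acc_e
        else:
            # Everything together is still small; return it as a single class.
            pooled_obs.append(acc_o)
            pooled_exp.append(acc_e)
    return pooled_obs, pooled_exp


# Hypothetical frequencies for 7 classes; the last two expected values are < 5.
observed = [12, 25, 30, 20, 9, 3, 1]
expected = [10.0, 26.0, 31.0, 19.5, 8.5, 3.5, 1.5]

pooled_obs, pooled_exp = pool_small_classes(observed, expected)
df_after_pooling = len(pooled_exp) - 1  # one d.f. lost for the class removed by pooling
print(pooled_obs, pooled_exp, df_after_pooling)
```

Here the last two classes are merged into one class with expected frequency 5.0, so the test is carried out on 6 classes with 5 d.f. instead of 7 classes with 6 d.f.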
Applications of Chi-square Test
- Testing the independence of attributes
- Testing the goodness of fit (whether the sample data represent what would be expected in the actual population)
- Testing linkage in genetic problems
- Comparing a sample variance with a population variance
- Testing the homogeneity of variances (UPPSC 2021)
- Testing the homogeneity of correlation coefficients
- Judging whether theory fits well in practice
Test for independence of two attributes in a (2x2) contingency table
- A characteristic which cannot be measured but can only be classified to one of the different levels of the character under consideration is called an attribute.
- 2x2 contingency table: When the individuals (objects) are classified into two categories with respect to each of the two attributes then the table showing frequencies distributed over 2x2 classes is called 2x2 contingency table.
- Suppose the individuals are classified according to two attributes, say intelligence (A) and colour (B). The distribution of frequencies over the cells is shown in the following table.

| | B1 | B2 | Total |
|---|---|---|---|
| A1 | a | b | R1 |
| A2 | c | d | R2 |
| Total | C1 | C2 | N |

- Where
- a, b, c and d are the observed cell frequencies
- R1 and R2 are the marginal totals of 1st row and 2nd row
- C1 and C2 are the marginal totals of 1st column and 2nd column
- N = grand total
- The null hypothesis H0: the two attributes are independent (i.e. colour does not depend on intelligence)
- Under H0, the expected frequency of each cell is calculated as (row total × column total) / grand total:
$$E_{ij} = \frac{R_i \, C_j}{N}$$
- The degrees of freedom for a 2 x 2 contingency table are (2 - 1)(2 - 1) = 1.
- This method of obtaining the expected frequencies applies to all r x c contingency tables.
- The degrees of freedom for an r x c contingency table are (r - 1) x (c - 1).
- If the calculated value of χ² < table value of χ² at the chosen level of significance, H0 is accepted; otherwise H0 is rejected.
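- The alternative formula for calculating χ² in a 2 x 2 contingency table, where a, b (first row) and c, d (second row) are the observed cell frequencies, is:
$$\chi^2 = \frac{N(ad - bc)^2}{R_1 R_2 C_1 C_2}$$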
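As a rough check, the general procedure (expected frequency = row total × column total / grand total, then Σ(O − E)²/E) and the 2 x 2 shortcut N(ad − bc)²/(R1 R2 C1 C2) can be compared in Python. The function names and the counts in `table` are hypothetical and serve only to illustrate the calculation.

```python
def expected_frequencies(table):
    """Expected frequencies under independence for an r x c table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square, d.f.) using sum of (O - E)^2 / E over all cells."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)  # (r - 1)(c - 1)
    return chi2, df


def chi_square_2x2_shortcut(a, b, c, d):
    """Shortcut for a 2 x 2 table: N(ad - bc)^2 / (R1 * R2 * C1 * C2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical 2 x 2 table of counts.
table = [[30, 20],
         [10, 40]]

chi2_general, df = chi_square_independence(table)
chi2_shortcut = chi_square_2x2_shortcut(30, 20, 10, 40)
print(round(chi2_general, 3), round(chi2_shortcut, 3), df)
```

Both routes give the same value for a 2 x 2 table; the general function also works for larger r x c tables.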
Example
- The following table shows the number of plants having certain characters. Test the hypothesis that flower colour is independent of the shape of leaf.
Solution:
- Null hypothesis H0: the attributes “flower colour” and “shape of leaf” are independent of each other.
- Under H0 the statistic is
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
- Expected frequencies are calculated as follows.
Direct Method:
- Since the calculated value of χ² < table value of χ² at the 5% level of significance (LOS) for 1 d.f., the null hypothesis is accepted, and we conclude that the two characters, flower colour and shape of leaf, are independent of each other.
Yates correction for continuity in a 2 x 2 contingency table
- In a 2 x 2 contingency table, the number of d.f. is (2 - 1) x (2 - 1) = 1. If any expected cell frequency is less than 5, the pooling method leaves the χ² test with 0 d.f. (since 1 d.f. is lost in pooling), which is meaningless. In this case we apply a correction due to Yates, usually known as Yates Correction for Continuity.
- Yates correction consists of the following steps:
- Add 0.5 to the cell frequency which is the least.
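- Adjust the remaining cell frequencies in such a way that the row and column totals are not changed. It can be shown that this correction results in the formula
$$\chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{R_1 R_2 C_1 C_2}$$
where a, b, c and d are the four observed cell frequencies, R1, R2, C1 and C2 are the row and column totals, and N is the grand total.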
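The two steps can be compared with the corrected formula χ² = N(|ad − bc| − N/2)² / (R1 R2 C1 C2) in a short Python sketch. The function names and the counts are hypothetical. The sketch assumes the least cell lies below its expected frequency, as it does here; only in that case does adding 0.5 to it move the table towards the expected values and reproduce the formula.

```python
def yates_by_formula(a, b, c, d):
    """Corrected chi-square: N(|ad - bc| - N/2)^2 / (R1 * R2 * C1 * C2)."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def yates_by_adjustment(a, b, c, d):
    """Add 0.5 to the least cell, keep all margins fixed, then use N(ad - bc)^2 / (R1 R2 C1 C2)."""
    cells = [[a, b], [c, d]]
    # Position (row, column) of the least cell frequency.
    i, j = min(((r, k) for r in range(2) for k in range(2)),
               key=lambda pos: cells[pos[0]][pos[1]])
    adjusted = [[0.0, 0.0], [0.0, 0.0]]
    for r in range(2):
        for k in range(2):
            # The least cell and its diagonal partner gain 0.5; the other two lose 0.5,
            # so every row total and column total stays the same.
            same_diagonal = (r == i) == (k == j)
            adjusted[r][k] = cells[r][k] + (0.5 if same_diagonal else -0.5)
    (a2, b2), (c2, d2) = adjusted
    n = a + b + c + d
    return n * (a2 * d2 - b2 * c2) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical 2 x 2 table; the least cell (2) has expected frequency 11 * 10 / 31, about 3.55.
chi2_formula = yates_by_formula(12, 8, 9, 2)
chi2_steps = yates_by_adjustment(12, 8, 9, 2)
print(round(chi2_formula, 4), round(chi2_steps, 4))
```

For this table both routes give the same corrected value, which is smaller than the uncorrected χ².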
Example
- The following data are observed for hybrids of Datura.
- Flowers violet, fruits prickly = 47
- Flowers violet, fruits smooth = 12
- Flowers white, fruits prickly = 21
- Flowers white, fruits smooth = 3
- Using chi-square test, find the association between colour of flowers and character of fruits.
Solution:
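- H0: The two attributes, colour of flowers and character of fruits, are independent.
- Whether to use Yates's correction for continuity cannot be decided from the observed values.
- Yates's correction is used only if an expected frequency is less than 5. Here the expected frequency for white flowers with smooth fruits is 24 × 15 / 83 = 4.34, so the correction applies.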
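- The test statistic is
$$\chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{R_1 R_2 C_1 C_2} = \frac{83\,\left(|47 \times 3 - 12 \times 21| - 41.5\right)^2}{59 \times 24 \times 68 \times 15} \approx 0.28$$

| | Fruits prickly | Fruits smooth | Total |
|---|---|---|---|
| Flowers violet | 47 (48.34) | 12 (10.66) | 59 |
| Flowers white | 21 (19.66) | 3 (4.34) | 24 |
| Total | 68 | 15 | 83 |

The figures in the brackets are the expected frequencies.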
- Calculated value of χ² = 0.28
- Table value of χ² for (2-1)(2-1) = 1 d.f. at the 5% level is 3.84
- Since the calculated value of χ² < table value of χ², H0 is accepted, and we conclude that colour of flowers and character of fruits are not associated.
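The Datura calculation can be re-checked with a few lines of Python. The counts 47, 12, 21, 3 and the table value 3.84 come from the example; the variable names are illustrative only, and the corrected formula N(|ad − bc| − N/2)² / (R1 R2 C1 C2) is applied only when some expected frequency is below 5.

```python
# Observed counts (rows: violet, white flowers; columns: prickly, smooth fruits).
a, b = 47, 12
c, d = 21, 3

r1, r2 = a + b, c + d      # row totals: 59, 24
c1, c2 = a + c, b + d      # column totals: 68, 15
n = r1 + r2                # grand total: 83

# Expected frequencies under independence: row total * column total / N
expected = [[r1 * c1 / n, r1 * c2 / n],
            [r2 * c1 / n, r2 * c2 / n]]
smallest_expected = min(min(row) for row in expected)
use_yates = smallest_expected < 5  # decided from expected, not observed, frequencies

if use_yates:
    chi2 = n * (abs(a * d - b * c) - n / 2) ** 2 / (r1 * r2 * c1 * c2)
else:
    chi2 = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

TABLE_VALUE = 3.84  # chi-square table value for 1 d.f. at the 5% level
accept_h0 = chi2 < TABLE_VALUE
print(round(smallest_expected, 2), use_yates, round(chi2, 2), accept_h0)
```

The smallest expected frequency is about 4.34, so the corrected statistic is used; it rounds to 0.28 and stays below 3.84, in line with the conclusion above.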