Chi-square Test

Non-parametric Test

  • The various tests of significance studied earlier, such as the Z-test, t-test and F-test, were based on the assumption that the samples were drawn from a normal population. Under this assumption the sampling distributions of the various test statistics could be derived.
  • Since the procedure of testing significance requires knowledge of the type of population, or of the parameters of the population, from which the random samples have been drawn, these tests are known as parametric tests.
  • But there are many practical situations in which no assumption can be made about the distribution of the population or its parameters. The alternative techniques, in which no assumption about the distribution or the parameters of the population is made, are known as non-parametric tests.
  • The chi-square test is an example of a non-parametric test.
  • The chi-square test is a distribution-free test.
  • Chi-square distribution was first discovered by Helmert in 1876 and later independently by Karl Pearson in 1900.
  • The range of the chi-square distribution is 0 to ∞.
  • If every observed frequency is equal to its expected frequency, then the value of the χ² statistic is zero.
  • Measurement data: data obtained by actual measurement are called measurement data, for example height, weight, age, income, area, etc.
  • Enumeration data: data obtained by enumeration or counting are called enumeration data, for example number of blue flowers, number of intelligent boys, number of curled leaves, etc.
  • The χ² test is used for enumeration data, which generally relate to discrete variables, whereas the t-test and standard normal deviate tests are used for measurement data, which generally relate to continuous variables.
  • The χ² test can be used to determine whether the given objects are segregating in a theoretical ratio or whether two attributes are independent in a contingency table.
  • The expression for the χ² test for goodness of fit is
      $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
  • Where
    • Oi = observed frequencies
    • Ei = expected frequencies
    • n = number of cells (or classes)
  • This statistic follows a chi-square distribution with (n-1) degrees of freedom.
  • The null hypothesis H0: the observed frequencies are in agreement with the expected frequencies.
  • If the calculated value of χ² < table value of χ² with (n-1) d.f. at the specified level of significance (α), we accept H0; otherwise we do not accept H0.
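The goodness-of-fit procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `chi_square_gof`, the 9:3:3:1 ratio and the counts are assumptions chosen for the example, and the table value 7.815 (3 d.f., 5% level) is taken from a standard chi-square table rather than from the text.

```python
def chi_square_gof(observed, ratio):
    """Return (chi2, expected, df) for observed counts against a theoretical ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    # Expected frequency of each class = total frequency x its share of the ratio.
    expected = [total * r / ratio_sum for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected, len(observed) - 1


# Illustrative counts (not from the text) tested against a 9:3:3:1 ratio.
observed = [315, 101, 108, 32]
chi2, expected, df = chi_square_gof(observed, [9, 3, 3, 1])

TABLE_VALUE_5PCT_3DF = 7.815  # standard table value for 3 d.f. at alpha = 0.05

print(f"expected frequencies = {expected}")
print(f"chi-square = {chi2:.3f} with {df} d.f.")
print("accept H0" if chi2 < TABLE_VALUE_5PCT_3DF else "do not accept H0")
```

With these counts the statistic is roughly 0.47, far below the table value, so the hypothesised ratio is not contradicted. Passing observed counts that equal the expected ones gives a statistic of exactly zero.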

Conditions for the validity of the χ² test

  • For the χ² test of goodness of fit between theoretical and observed frequencies to be valid, the following conditions must be satisfied.
    • The sample observations should be independent
    • Constraints on the cell frequencies, if any, should be linear, e.g. ∑Oi = ∑Ei
    • N, the total frequency, should be reasonably large, say greater than 50
    • If any theoretical (expected) cell frequency is < 5, then for the application of the chi-square test it is pooled with the preceding or succeeding frequency so that the pooled frequency is more than 5, and the d.f. are finally adjusted for those lost in pooling.
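The pooling rule in the last condition can be sketched as follows. The helper `pool_small_cells` and the frequencies are hypothetical, and the merging order (left to right, with a small tail merged into the preceding class) is one reasonable reading of "preceding or succeeding", not the only one. The sketch treats a pooled expected frequency of at least 5 as acceptable.

```python
def pool_small_cells(observed, expected, minimum=5.0):
    """Merge adjacent classes until every pooled expected frequency is >= minimum."""
    pooled_obs, pooled_exp = [], []
    acc_obs, acc_exp, pending = 0, 0.0, False
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        pending = True
        if acc_exp >= minimum:
            # The accumulated class is large enough: close it and start a new one.
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
            acc_obs, acc_exp, pending = 0, 0.0, False
    if pending:
        # A small tail is left over: pool it with the preceding class if there is one.
        if pooled_exp:
            pooled_obs[-1] += acc_obs
            pooled_exp[-1] += acc_exp
        else:
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
    return pooled_obs, pooled_exp


# Hypothetical frequencies for six classes; three expected values are below 5.
observed = [2, 8, 20, 15, 4, 1]
expected = [3.0, 9.0, 18.0, 14.0, 4.5, 1.5]

pooled_obs, pooled_exp = pool_small_cells(observed, expected)
chi2 = sum((o - e) ** 2 / e for o, e in zip(pooled_obs, pooled_exp))
df = len(pooled_obs) - 1  # d.f. are counted on the classes left after pooling

print(pooled_obs, pooled_exp)
print(f"chi-square = {chi2:.3f} with {df} d.f.")
```

Here six classes collapse to four, so the test uses 3 d.f. instead of 5.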

Applications of Chi-square Test

  • Testing the independence of attributes
  • To test the goodness of fit (it tells you if your sample data represents the data you would expect to find in the actual population)
  • Testing of linkage in genetic problems
  • Comparison of sample variance with population variance
  • Testing the homogeneity of variances (UPPSC 2021)
  • Testing the homogeneity of correlation coefficient
  • Whether a theory fits well in practice can be judged by the chi-square test

Test for independence of two Attributes of (2x2) Contingency Table

  • A characteristic which cannot be measured but can only be classified to one of the different levels of the character under consideration is called an attribute.
  • 2x2 contingency table: When the individuals (objects) are classified into two categories with respect to each of the two attributes then the table showing frequencies distributed over 2x2 classes is called 2x2 contingency table.
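  • Suppose the individuals are classified according to two attributes, say intelligence (A) and colour (B). The distribution of frequencies over the cells is shown in the following table, where a, b, c and d denote the observed cell frequencies.

                B1      B2      Total
        A1      a       b       R1
        A2      c       d       R2
        Total   C1      C2      N
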
  • Where
    • R1 and R2 are the marginal totals of 1st row and 2nd row
    • C1 and C2 are the marginal totals of 1st column and 2nd column
    • N = grand total
  • The null hypothesis H0: the two attributes are independent (i.e. colour does not depend on intelligence)
  • Based on the above H0, the expected frequency of the cell in the i-th row and j-th column is calculated from the marginal totals as
      $$E_{ij} = \frac{R_i \, C_j}{N}$$
  • The degrees of freedom for a 2 x 2 contingency table is (2 - 1)(2 - 1) = 1
  • This method is applied to all r x c contingency tables to get the expected frequencies.
  • The degrees of freedom for an r x c contingency table is (r - 1) x (c - 1)
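  • If the calculated value of χ² < table value of χ² at the chosen level of significance, then H0 is accepted; otherwise we do not accept H0.
  • The alternative formula for calculating χ² in a 2 x 2 contingency table, where a and b are the observed frequencies in the first row and c and d those in the second row, is:
      $$\chi^2 = \frac{N (ad - bc)^2}{R_1 R_2 C_1 C_2}$$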
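As a rough sketch of the test of independence, the code below computes expected frequencies as (row total x column total) / N, sums (O - E)²/E over all cells, and checks that the 2 x 2 shortcut N(ad - bc)²/(R1 R2 C1 C2) gives the same value. The function names and the counts are hypothetical, chosen only for illustration.

```python
def expected_frequencies(table):
    """Expected cell frequencies under independence: row total x column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Direct method for any r x c table: returns (chi2, degrees of freedom)."""
    expected = expected_frequencies(table)
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def chi_square_2x2_shortcut(a, b, c, d):
    """Alternative 2 x 2 formula: N(ad - bc)^2 / (R1 R2 C1 C2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical 2 x 2 counts, chosen only for illustration.
table = [[30, 20], [10, 40]]
direct, df = chi_square_independence(table)
shortcut = chi_square_2x2_shortcut(30, 20, 10, 40)

print(expected_frequencies(table))
print(f"direct = {direct:.4f}, shortcut = {shortcut:.4f}, d.f. = {df}")
```

The two methods agree for any 2 x 2 table; the direct method is the one that extends to larger r x c tables.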

Example

  • Examine the following table showing the number of plants having certain characters, and test the hypothesis that flower colour is independent of the shape of leaf.

Solution:

  • Null hypothesis H0: the attributes "flower colour" and "shape of leaf" are independent of each other.
  • Under H0 the statistic is
      $$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
  • Expected frequencies are calculated as follows.

Direct Method:

  • Since the calculated value of χ² < table value of χ² at the 5% level of significance (LOS) for 1 d.f., the null hypothesis is accepted, and hence we conclude that the two characters, flower colour and shape of leaf, are independent of each other.

Yates correction for continuity in a 2 x 2 contingency table

  • In a 2 x 2 contingency table, the number of d.f. is (2 - 1) x (2 - 1) = 1. If any expected cell frequency is less than 5, the pooling method leaves the χ² test with 0 d.f. (since 1 d.f. is lost in pooling), which is meaningless. In this case we apply a correction due to Yates, usually known as Yates' correction for continuity.
  • Yates' correction consists of the following steps:
    • Add 0.5 to the cell frequency which is the least.
    • Adjust the remaining cell frequencies in such a way that the row and column totals are not changed.
  • It can be shown that this correction results in the following formula, where a and b are the observed frequencies in the first row, c and d those in the second row, and R1, R2, C1, C2 and N are the marginal and grand totals:
      $$\chi^2 = \frac{N \left( |ad - bc| - \frac{N}{2} \right)^2}{R_1 R_2 C_1 C_2}$$
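The two steps of the correction can be checked numerically against the closed-form expression N(|ad - bc| - N/2)² / (R1 R2 C1 C2). The sketch below uses hypothetical counts and helper names, and it assumes the smallest cell lies below its expected frequency (so that adding 0.5 moves it towards expectation) and that the smallest cell is unique.

```python
def pearson_2x2(a, b, c, d):
    """Uncorrected 2 x 2 statistic: N(ad - bc)^2 / (R1 R2 C1 C2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def yates_adjust(a, b, c, d):
    """Add 0.5 to the smallest cell and shift the others so all margins are kept."""
    cells = [a, b, c, d]
    if cells.index(min(cells)) in (0, 3):
        # Smallest cell is on the a-d diagonal: raise a and d, lower b and c.
        return a + 0.5, b - 0.5, c - 0.5, d + 0.5
    # Smallest cell is on the b-c diagonal: raise b and c, lower a and d.
    return a - 0.5, b + 0.5, c + 0.5, d - 0.5


def yates_formula(a, b, c, d):
    """Closed form: N(|ad - bc| - N/2)^2 / (R1 R2 C1 C2)."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical counts; the smallest cell (d = 2) has expected frequency 9 * 8 / 25 = 2.88.
a, b, c, d = 10, 6, 7, 2
adjusted = yates_adjust(a, b, c, d)
stepwise = pearson_2x2(*adjusted)
closed_form = yates_formula(a, b, c, d)

print(adjusted)
print(f"stepwise = {stepwise:.5f}, closed form = {closed_form:.5f}")
```

Adding 0.5 to one cell and compensating in the other three changes ad - bc by exactly N/2, which is why the adjusted table and the closed form give the same statistic.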

Example

  • The following data are observed for hybrids of Datura.
    • Flowers violet, fruits prickly = 47
    • Flowers violet, fruits smooth = 12
    • Flowers white, fruits prickly = 21
    • Flowers white, fruits smooth = 3
  • Using chi-square test, find the association between colour of flowers and character of fruits.

Solution:

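  • H0: The two attributes, colour of flowers and character of fruits, are independent.
  • Yates' correction for continuity is not applied on the basis of the observed values; it is used only if an expected frequency is less than 5.
  • Here the expected frequency for white flowers with smooth fruits is (24 x 15)/83 = 4.34 < 5, so the correction is applied.
  • The test statistic, where a and b are the observed frequencies in the first row, c and d those in the second row, and R1, R2, C1, C2 and N are the marginal and grand totals, is
      $$\chi^2 = \frac{N \left( |ad - bc| - \frac{N}{2} \right)^2}{R_1 R_2 C_1 C_2}$$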

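                      Fruits prickly    Fruits smooth    Total
    Flowers violet    47 (48.34)        12 (10.66)       59
    Flowers white     21 (19.66)        3 (4.34)         24
    Total             68                15               83

The figures in the brackets are the expected frequencies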

  • Calculated value of χ² = 0.28
  • Table value of χ² for (2-1)(2-1) = 1 d.f. at the 5% level of significance is 3.84
  • Since the calculated value of χ² < table value of χ², H0 is accepted, and hence we conclude that colour of flowers and character of fruits are not associated.
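The Datura calculation can be reproduced with a short script. The variable names are illustrative; the counts and the table value 3.84 are those quoted in the example, and the corrected statistic is assumed to be N(|ad - bc| - N/2)² / (R1 R2 C1 C2).

```python
# Observed counts from the example:
# violet/prickly, violet/smooth, white/prickly, white/smooth
a, b, c, d = 47, 12, 21, 3

r1, r2 = a + b, c + d   # row totals (violet, white)
c1, c2 = a + c, b + d   # column totals (prickly, smooth)
n = r1 + r2             # grand total

# Expected frequencies under independence: row total x column total / N.
expected = [[r1 * c1 / n, r1 * c2 / n],
            [r2 * c1 / n, r2 * c2 / n]]

# Yates' correction is decided on the expected, not the observed, frequencies.
needs_yates = any(e < 5 for row in expected for e in row)

if needs_yates:
    chi2 = n * (abs(a * d - b * c) - n / 2) ** 2 / (r1 * r2 * c1 * c2)
else:
    chi2 = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

TABLE_VALUE = 3.84  # 1 d.f. at the 5% level, as quoted in the example

print([[round(e, 2) for e in row] for row in expected])
print(f"Yates' correction applied: {needs_yates}")
print(f"chi-square = {chi2:.2f}")
print("H0 accepted" if chi2 < TABLE_VALUE else "H0 not accepted")
```

The smallest expected frequency (white flowers, smooth fruits) is about 4.34, so the corrected formula is used and the statistic rounds to 0.28, matching the value in the solution.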
