What are the statistical tests for data validation, with examples: Chi-square, ANOVA, Z-statistic, T-statistic, F-statistic, Hypothesis Testing?

 


Before discussing the different statistical tests, we need a clear understanding of what a null hypothesis is. A null hypothesis proposes that no significant difference exists in a given set of observations.

 

Null: the two sample means are equal. Alternate: the two sample means are not equal.

To decide whether to reject the null hypothesis, a test statistic is calculated. The test statistic is then compared with a critical value, and if it is found to be greater than the critical value, the null hypothesis is rejected.

 

Critical Value:-

A critical value is the point beyond which we reject the null hypothesis. It tells us the probability that N samples belong to the same distribution: the higher the critical value, the lower the probability that the N samples belong to the same distribution.

 

Critical values can be used to do hypothesis testing in the following way.

 

1. Calculate the test statistic.

2. Calculate the critical value based on the significance level alpha.

3. Compare the test statistic with the critical value.

 

IMP- If the test statistic is lower than the critical value, accept (fail to reject) the null hypothesis; otherwise, reject it.
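As a rough illustration of these three steps, here is a minimal sketch (assuming a two-tailed z-test at alpha = 0.05 and a made-up, already computed test statistic) that uses scipy.stats to obtain the critical value and make the decision:

```python
# A minimal sketch of the critical-value workflow for a two-tailed z-test.
# The test statistic below is made up purely for illustration.
from scipy import stats

alpha = 0.05                             # significance level
z_stat = 2.31                            # step 1: test statistic (assumed already computed)
z_crit = stats.norm.ppf(1 - alpha / 2)   # step 2: two-tailed critical value (about 1.96)

# step 3: compare the test statistic with the critical value
if abs(z_stat) > z_crit:
    print(f"|z| = {abs(z_stat):.2f} > {z_crit:.2f}: reject the null hypothesis")
else:
    print(f"|z| = {abs(z_stat):.2f} <= {z_crit:.2f}: fail to reject the null hypothesis")
```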

 

1- Chi-Square Test:-

A chi-square test is used to check whether there is a relationship between two categorical variables.

The chi-square test determines whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories. Chi-square is also called a non-parametric test because it does not rely on population parameters.
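As a minimal sketch, assuming a made-up 2x2 contingency table (say, gender vs. product preference), the test can be run with scipy.stats.chi2_contingency, which returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies:

```python
# Chi-square test of independence on a made-up 2x2 contingency table
# (rows: gender, columns: preferred product); the counts are illustrative only.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}, dof = {dof}")
print("expected frequencies:\n", expected)

if p_value < 0.05:
    print("Reject the null hypothesis: the variables appear to be related")
else:
    print("Fail to reject the null hypothesis: no evidence of a relationship")
```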

 

2- ANOVA Test:-

ANOVA, or analysis of variance, is used to compare multiple (three or more) samples with a single test.

 

ANOVA is useful when there are three or more populations. It compares the variance within the groups to the variance between the groups. If the between-group variation is much larger than the within-group variation, the means of the different samples will not be equal. If the between-group and within-group variations are approximately the same size, there is no significant difference between the sample means.

Assumptions of ANOVA:

1. All populations involved follow a normal distribution.

2. All populations have the same variance (or standard deviation).

3. The samples are randomly selected and independent of one another.

 

ANOVA uses the means of the samples or the population to reject or support the null hypothesis; hence it is called a parametric test.
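As a minimal sketch, assuming three made-up groups of scores, a one-way ANOVA can be run with scipy.stats.f_oneway:

```python
# One-way ANOVA on three made-up sample groups using scipy.stats.f_oneway;
# the groups could be, e.g., test scores under three different teaching methods.
from scipy.stats import f_oneway

group_a = [85, 86, 88, 75, 78, 94, 98]
group_b = [91, 92, 93, 85, 87, 84, 82]
group_c = [79, 78, 88, 94, 92, 85, 83]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: at least one group mean differs")
else:
    print("Fail to reject the null hypothesis: no significant difference in means")
```

Note that a small p-value only tells us that at least one group mean differs; a post-hoc test would be needed to identify which one.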

 

 

3- Z Statistic:-

In a z-test, the samples are assumed to be normally distributed. A z-score is calculated using the population parameters, the population mean and the population standard deviation, and it is used to test the hypothesis that the sample drawn belongs to the same population.

 

The statistic used for this hypothesis test is called the z-statistic, and its score is calculated as z = (x − μ) / (σ / √n), where x = sample mean, μ = population mean, σ = population standard deviation, and n = sample size (so σ / √n is the standard error of the mean). If the test statistic is lower than the critical value, accept (fail to reject) the null hypothesis; otherwise, reject it.
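A minimal sketch of a one-sample z-test, assuming the population mean and standard deviation are known and using made-up sample values:

```python
# One-sample z-test sketch: the population mean and standard deviation are
# assumed known; the sample values are made up for illustration.
import numpy as np
from scipy.stats import norm

sample = np.array([102, 99, 105, 101, 98, 103, 100, 104, 97, 106])
mu = 100        # assumed population mean
sigma = 4       # assumed population standard deviation
n = len(sample)

z = (sample.mean() - mu) / (sigma / np.sqrt(n))   # z = (x - mu) / (sigma / sqrt(n))
z_crit = norm.ppf(1 - 0.05 / 2)                   # two-tailed critical value at alpha = 0.05

print(f"z = {z:.2f}, critical value = {z_crit:.2f}")
if abs(z) > z_crit:
    print("Reject the null hypothesis: the sample mean differs from the population mean")
else:
    print("Fail to reject the null hypothesis")
```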

 

4- T Statistic:-

A t-test is used to compare the means of the given samples. Like the z-test, the t-test also assumes a normal distribution of the samples. A t-test is used when the population parameters (mean and standard deviation) are unknown.

 

There are three versions of the t-test (a short code sketch covering all three follows below):

 

1. Independent samples t-test, which compares the means of two groups.

 

2. Paired sample t-test, which compares means from the same group at different times.

 

3. One-sample t-test, which tests the mean of a single group against a known mean.

The statistic used for hypothesis testing is called the t-statistic. For the independent two-sample case it is calculated as t = (x1 − x2) / √(s1^2/n1 + s2^2/n2), where

 

x1 = mean of sample 1, x2 = mean of sample 2, s1^2 and s2^2 = variances of samples 1 and 2,

 

n1 = size of sample 1, n2 = size of sample 2.
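A minimal sketch of all three versions on made-up data, using scipy.stats (ttest_ind, ttest_rel, and ttest_1samp):

```python
# The three versions of the t-test on made-up data, using scipy.stats.
import numpy as np
from scipy import stats

sample_1 = np.array([22.1, 21.5, 23.3, 24.0, 22.8, 21.9])
sample_2 = np.array([24.2, 25.1, 23.8, 26.0, 24.9, 25.3])

# 1. Independent samples t-test: compares the means of two separate groups.
t_ind, p_ind = stats.ttest_ind(sample_1, sample_2, equal_var=False)

# 2. Paired sample t-test: compares means from the same group at different times
#    (here sample_1 and sample_2 are treated as "before" and "after").
t_rel, p_rel = stats.ttest_rel(sample_1, sample_2)

# 3. One-sample t-test: compares the mean of a single group against a known mean.
t_one, p_one = stats.ttest_1samp(sample_1, popmean=23.0)

print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_rel:.2f}, p = {p_rel:.4f}")
print(f"one-sample:  t = {t_one:.2f}, p = {p_one:.4f}")
```

Here the reported p-values can be compared directly against the significance level alpha instead of looking up a critical value; the decision is the same.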

 

 

5- F Statistic:-

The F-test is designed to test whether two population variances are equal. It compares the ratio of the two sample variances; if the population variances are equal, this ratio will be close to 1.

 

The F-distribution is the distribution of the ratio of two independent chi-square variables, each divided by its respective degrees of freedom.

 

F = s1^2 / s2^2, where s1^2 > s2^2 (the larger sample variance is placed in the numerator).

 

If the null hypothesis is true, the F test statistic given above simplifies to the ratio of the sample variances, and that ratio is the test statistic used. If the ratio is far enough from 1 (that is, the test statistic exceeds the critical value), we reject the null hypothesis that the two population variances are equal.
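A minimal sketch of the F-test for equality of two variances, computed by hand from made-up samples and compared against the upper-tail critical value from scipy.stats.f:

```python
# F-test for equality of two variances, computed by hand with the F-distribution;
# the samples are made up for illustration.
import numpy as np
from scipy.stats import f

sample_1 = np.array([12.1, 11.8, 13.0, 12.5, 11.5, 12.9, 13.2])
sample_2 = np.array([11.9, 12.0, 12.2, 12.1, 11.8, 12.3, 12.0])

var_1 = sample_1.var(ddof=1)   # unbiased sample variances
var_2 = sample_2.var(ddof=1)

# place the larger sample variance in the numerator so that F >= 1
if var_1 >= var_2:
    F, df_num, df_den = var_1 / var_2, len(sample_1) - 1, len(sample_2) - 1
else:
    F, df_num, df_den = var_2 / var_1, len(sample_2) - 1, len(sample_1) - 1

f_crit = f.ppf(1 - 0.05, df_num, df_den)   # upper-tail critical value at alpha = 0.05
print(f"F = {F:.2f}, critical value = {f_crit:.2f}")
if F > f_crit:
    print("Reject the null hypothesis: the variances are not equal")
else:
    print("Fail to reject the null hypothesis: no evidence the variances differ")
```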

 

 

 
