Statistical Analysis of Data

By:  Siddiq ullah
Slide 1

What is statistics?
From the Latin “status” (political state): information useful to the state (size of population, armed forces, etc.)
A branch of mathematics concerned with understanding and summarizing collections of numbers
A collection of numerical facts systematically arranged

Slide 2

Descriptive Statistics
Statistics which describe attributes of a sample or population.
Includes measures of central tendency (e.g., mean, median, mode), frequencies and percentages, the minimum, maximum, and range of a data set, variance, etc.
Organize and summarize a set of data

Slide 3

Inferential Statistics
Used to make inferences or judgments about a larger population based on the data collected from a small sample drawn from the population.

A key component of inferential statistics is the calculation of statistical significance of a research finding.

1.         Involves
·       Estimation
·       Hypothesis  Testing
2.         Purpose
·       Make Decisions About Population Characteristics

Slide 5

Key Terms
1.         Population (Universe)
All Items of Interest
2.         Sample
Portion of Population
3.         Parameter
Summary Measure about Population
4.         Statistic
Summary Measure about Sample

Slide 6

Key Terms

Parameter: a characteristic of the population.  Denoted with Greek letters such as μ or σ.
Statistic: a characteristic of a sample.  Denoted with English letters such as X̄ or S.

Sampling Error:
Describes the amount of error that exists between a sample statistic and corresponding population parameter

Slide 7

Slide 8

Some Notations…

Population: all items under consideration by the researcher

μ = population mean
σ = population standard deviation
N = population size
π = population percentage

Sample: a portion of the population selected for study

x̄ = sample mean
s = sample standard deviation
n = sample size
p = sample percentage

Slide 9

Descriptive & Inferential Statistics (DS & IS)

DS gathers information about a population characteristic (e.g. income) and describes it with a parameter of interest (e.g. the mean)
IS uses the parameter to test a hypothesis pertaining to that characteristic, e.g.
    H0: mean income = USD 4,000
    H1: mean income < USD 4,000
The result of the hypothesis test is used to make an inference about the characteristic of interest (e.g. Americans → upper middle income)

Slide 10

Examples of Descriptive and Inferential Statistics

Descriptive Statistics
·       Graphical
-Arrange data in tables
-Bar graphs and pie charts
·       Numerical
-Percentages
-Averages
-Range
·       Relationships
-Correlation coefficient
-Regression analysis

Inferential Statistics
·       Confidence interval
·       Margin of error
·       Compare means of two samples
-Pre/post scores
-t Test
·       Compare means from three samples
-Pre/post and follow-up
-ANOVA = analysis of variance

Slide 11

Levels of Measurement

Another characteristic of data, which determines which statistical calculations are meaningful
Nominal: Qualitative data only; categories of names, labels, or qualities; cannot be ordered (e.g., best to worst). Ex: survey responses of Yes/No
Ordinal: Qualitative or quantitative; can be ordered, but subtractions are not meaningful. Ex: grades A, B, C, D, F
Interval: Quantitative only; meaningful subtractions but not ratios; zero is only a position (not “none”). Ex: temperatures in °C or °F
Ratio: Quantitative only; meaningful subtractions and ratios; zero represents “none”. Ex: weights of babies

Slide 12
Measures of Central Tendency
“Say you were standing with one foot in the oven and one foot in an ice bucket.  According to the average, you should be perfectly comfortable.”
The mode – applies to ratio, interval, ordinal or nominal scales.
The median – applies to ratio, interval and ordinal scales
The mean – applies to ratio and interval scales

Slide 13

Measuring Variability
Range:  lowest to highest score
Average Deviation:  average distance from the mean
Variance:  average squared distance from the mean
Used in later inferential statistics
Standard Deviation:  square root of variance
expressed on the same scale as the mean

Slide 13

Parametric statistics
Statistical analysis that attempts to explain the population parameter using a sample
E.g. of statistical parameters: mean, variance, std. dev., R², t-value, F-ratio, r_xy, etc.
It assumes that the distributions of the variables being assessed belong to known parameterized families of probability distributions

Slide 14

Frequencies and Distributions
Frequency: the number of times a value is observed in a distribution, or the number of times a particular event occurs.
Distribution: when the observed values are arranged in order, they are called a rank-order distribution or an array. Distributions show how the frequencies of observations are spread across a range of values.

The Mode
Defined as the most frequent value (the peak)
·       Applies to ratio, interval, ordinal and nominal scales
·       Sensitive to sampling error (noise)
·       Distributions may be referred to as unimodal, bimodal or multimodal, depending upon the number of peaks

The Median
  • Defined as the 50th percentile
  • Applies to ratio, interval and ordinal scales
  • Can be used for open-ended distributions

The Mean

Applies only to ratio or interval scales
Sensitive to outliers
How to find?
Mean – the average of a group of numbers.
2, 5, 2, 1, 5
Mean = 3
The mean is found by evening out the numbers: 2 + 5 + 2 + 1 + 5 = 15, and 15 ÷ 5 = 3

How to Find the Mean of a Group of Numbers

Step 1 – Add all the numbers.
8, 10, 12, 18, 22, 26
8 + 10 + 12 + 18 + 22 + 26 = 96

Step 2 – Divide the sum by the number of addends.
How many addends are there? There are 6, so 96 ÷ 6 = 16.
The mean or average of these numbers is 16.

What is the mean of these numbers?
7, 10, 16
26, 33, 41, 52

Median – the middle number in a set of ordered numbers.
1, 3, 7, 10, 13
Median = 7

How to Find the Median in a Group of Numbers
Step 1 – Arrange the numbers in order from least to greatest.
21, 18, 24, 19, 27
18, 19, 21, 24, 27

Step 2 – Find the middle number.
18, 19, 21, 24, 27
The middle number, 21, is your median.

Step 3 – If there are two middle numbers, find the mean of these two numbers.
18, 19, 21, 25, 27, 28
21 + 25 = 46
46 ÷ 2 = 23

When to use this measure?
With a non-normal distribution, the median is appropriate

What is the median of these numbers?
16, 10, 7
7, 10, 16

29, 8, 4, 11, 19
4, 8, 11, 19, 29

31, 7, 2, 12, 14, 19
2, 7, 12, 14, 19, 31
12 + 14 = 26
26 ÷ 2 = 13

Mode – the number that appears most frequently in a set of numbers.
1, 1, 3, 7, 10, 13
Mode = 1
How to Find the Mode in a Group of Numbers
Step 1 – Arrange the numbers in order from least to greatest.
21, 18, 24, 19, 18
18, 18, 19, 21, 24
Step 2 – Find the number that is repeated the most.
18, 18, 19, 21, 24
The mode is 18.
Which number is the mode?
1, 2, 2, 9, 9, 4, 9, 10
1, 2, 2, 4, 9, 9, 9, 10

When to use this measure?
If your data is nominal, the mode is the only measure of central tendency you may use
Using all three measures provides a more complete picture of the characteristics of your sample set.
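
The three measures can be checked with a short sketch. It assumes Python's standard `statistics` module and reuses the number sets from the worked steps above; the variable names are illustrative only.

```python
import statistics

# Number sets taken from the worked examples in these slides.
mean_data = [8, 10, 12, 18, 22, 26]
median_data = [18, 19, 21, 25, 27, 28]
mode_data = [1, 2, 2, 9, 9, 4, 9, 10]

mean_value = statistics.mean(mean_data)        # sum of 96 divided by 6 addends
median_value = statistics.median(median_data)  # mean of the two middle numbers
mode_value = statistics.mode(mode_data)        # most frequent value

print(mean_value, median_value, mode_value)
```

Note that `statistics.median` sorts the data itself, so Step 1 of the manual procedure is handled internally.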

Measures of Variability (Dispersion)
Range – applies to ratio, interval, ordinal scales
Semi-interquartile range – applies to ratio, interval, ordinal scales
Variance (standard deviation) – applies to ratio, interval scales
Understanding the variation

The more the data is spread out, the larger the range, variance, SD and SE (Low precision and accuracy)
The more concentrated the data (precise or homogenous), the smaller the range, variance, and standard deviation (high precision and accuracy)
If all the observations are the same, the range, variance, and standard deviation = 0
None of these measures can be negative
Two distant means with little variations are more likely to be significantly different and vice versa

Range: the interval between the lowest and highest values
Generally unreliable: changing one value (highest or lowest) can cause a large change in the range.

Range – the difference between the greatest and the least value in a set of numbers.
1, 1, 3, 7, 10, 13
Range = 12

How to Find the Range in a Group of Numbers
Step 1 – Arrange the numbers in order from least to greatest.
21, 18, 24, 19, 27
18, 19, 21, 24, 27
Step 2 – Find the lowest and highest numbers.
18, 19, 21, 24, 27
The lowest is 18 and the highest is 27.
Step 3 – Find the difference between these 2 numbers.
27 – 18 = 9
The range is 9

What is the range?
22, 21, 27, 31, 21, 32
21, 21, 22, 27, 31, 32
32 – 21 = 11

Mid-range: Average of the smallest and largest observations

Measure of relative position
Percentiles and Percentile Ranks
Percentile:  The score at or below which a given % of scores lie.
Percentile Rank:  The percentage of scores at or below a given score

Mid-hinge: The average of the first and third quartiles.

Quartiles
Observations that divide data into four equal parts, e.g. the first quartile (Q1).

Semi-Interquartile Range
The interquartile range is the interval between the first and third quartile, i.e. between the 25th and 75th percentile.
The semi-interquartile range is half the interquartile range.
Can be used with open-ended distributions
Unaffected by extreme scores

Example 1: the third quartile of the 36 students in the Biometry class is the ¾ × 36 = 27th item

Example 2: the 60th percentile of the class is the 60/100 × 36 = 21.6th item, i.e. the 22nd item (rounded)

Inter-quartile range/deviation
(Mid-spread): Difference between the Third and the First Quartiles, therefore, considers data of central half and ignores the extreme values

Inter-quartile Range = Q3 - Q1

Quartile deviation = (Q3 - Q1)/2

Quartile Deviation

Measures the dispersion of the middle 50% of the distribution

-- rank the data
-- calculate upper and lower quartiles (UQ & LQ)

            Number    Sample    Ranked Values
            1         25        14        LL
            2         27        16
            3         20        16
            4         23        18
            5         26        19
            6         24        20        LQ or Q1
            7         19        20
            8         16        21
            9         25        23
            10        18        24
            11        30        24        Md or Q2
            12        29        25
            13        32        25
            14        26        26
            15        24        26
            16        21        27        UQ or Q3
            17        28        27
            18        27        28
            19        20        29
            20        16        30
            21        14        32        UL
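
As a rough check of the table, the sketch below ranks the same 21 sample values and reads off the quartiles. It assumes Python 3.8+ and `statistics.quantiles` with `method="inclusive"`, which is one of several quartile conventions; it was chosen here because it lands on the 6th, 11th and 16th ranked values marked in the table.

```python
import statistics

# The 21 sample values from the table, in their original order.
sample = [25, 27, 20, 23, 26, 24, 19, 16, 25, 18, 30,
          29, 32, 26, 24, 21, 28, 27, 20, 16, 14]

ranked = sorted(sample)  # step "rank the data"

# Assumption: the "inclusive" method, which treats the sample as containing
# its own minimum and maximum. Other conventions can give different quartiles.
q1, q2, q3 = statistics.quantiles(ranked, n=4, method="inclusive")

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2

print(q1, q2, q3, interquartile_range, quartile_deviation)
```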


Variance is the average of the squared deviations
Closely related to the standard deviation
In order to eliminate negative signs, deviations are squared (giving squared units, e.g. m²)

v = s²

Variance (for a sample)
Compute each deviation
Square each deviation
Sum all the squares
Divide by the data size (sample size) minus one: n-1
Example of Variance
Variance = 54/9 = 6

It is a measure of “spread”.
Notice that the larger the deviations (positive or negative) the larger the variance
Population Variance and Standard Deviation
Sample Variance and Standard Deviation
The standard deviation

It is defined as the square root of the variance
Standard deviation (SD):
Positive square root of the variance
                        $SD = +\sqrt{\sum (y - \bar{y})^2 / n}$
Variance and standard deviation are useful for probability and hypothesis testing and are therefore widely used, unlike the mean deviation

Population parameters and sample statistics

If we are working with samples, dividing by n under-estimates the variance and SD, i.e. the estimate is biased
Therefore, instead of n, n-1 (the degrees of freedom) is used for a sample, e.g. $s^2 = \sum (y - \bar{y})^2 / (n - 1)$

Standard Deviation

Example:  {4, 7, 6, 3, 8, 6, 7, 4, 5, 3}
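
The steps for the variance (compute each deviation, square it, sum the squares, divide) can be sketched for this example set as follows. The code is a minimal illustration in plain Python, and it computes both the population form (divide by n) and the sample form (divide by n-1) so the two can be compared.

```python
import math

data = [4, 7, 6, 3, 8, 6, 7, 4, 5, 3]
n = len(data)

mean = sum(data) / n
deviations = [y - mean for y in data]            # compute each deviation
sum_of_squares = sum(d ** 2 for d in deviations)  # square and sum

population_variance = sum_of_squares / n          # divide by n
sample_variance = sum_of_squares / (n - 1)        # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(round(mean, 2), round(sum_of_squares, 2))
print(round(population_variance, 3), round(population_sd, 3))
print(round(sample_variance, 3), round(sample_sd, 3))
```

The sample values come out slightly larger than the population values, which is the effect of dividing by n-1.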
Measure of relationship
Correlation is a statistical technique that is used to measure a relationship between two variables.
Correlation requires two scores from each individual (one score from each of the two variables)
Correlation Coefficients

A correlation coefficient is a statistic that indicates the strength & direction of the relationship b/w 2 variables (or more) for 1 group of participants

Another definition – specifically for Spearman’s rho:
Spearman’s correlation coefficient is a standardized measure of the strength of relationship b/w 2 variables that does not rely on the assumptions of a parametric test (nonparametric data)

Uses Pearson’s correlation coefficient performed on data that have been converted into ranked scores

Distinguishing Characteristics of Correlation

Correlation procedures involve one sample containing all pairs of X and Y
Neither variable is called the independent or dependent variable
Use the individual pairs of scores to create a scatter plot
The scatter plot
Correlation and causality

The fact that there is a relationship between two variables does not mean that changes in one variable cause the changes in the other variable.
A statistical relationship can exist even though one variable does not cause or influence the other.
Correlation research cannot be used to infer causal relationships between two variables

In the following examples:
·       example 1 - correlation coefficient = 1
·       example 2 - correlation coefficient = -1
·       example 3 - correlation coefficient = 0
·       the correlation coefficient for the parametric case is called the Pearson product moment correlation coefficient (r)
example 1

paired values
A   3   6   9   12   15
B   1   2   3    4    5
·       variable A (income of family) (1000 pounds)
·       variable B (# of cars owned)
·       here is a perfect and positive correlation, as one variate increases in precisely the same proportion as the other variate increases

example 2

paired values
A   3   6   9   12   15
B   5   4   3    2    1
·       variable A (income of family) (100
·       variable B (# of children)
·       here is a perfect and negative correlation, as one variate decreases in precisely the same proportion as the other variate increases
example 3

paired values

A   3   6  9   12  15
B   4   1  3    5    2

·       variable A (income of family)
·       variable B (last number of postal code)
·       here there is no correlation because one variate does not systematically change with the other. Any association is caused by A and B being randomly distributed
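
The three examples can be verified numerically. The sketch below is a plain-Python illustration; the helper name `pearson_r` is hypothetical, and it uses the deviation form of Pearson's r (sum of products divided by the square root of the product of the two sums of squares).

```python
import math

def pearson_r(x, y):
    """Pearson r computed from deviations about the two means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # co-variability
    ss_x = sum((a - mean_x) ** 2 for a in x)                     # variability of X
    ss_y = sum((b - mean_y) ** 2 for b in y)                     # variability of Y
    return sp / math.sqrt(ss_x * ss_y)

variable_a = [3, 6, 9, 12, 15]

r_example_1 = pearson_r(variable_a, [1, 2, 3, 4, 5])  # cars owned
r_example_2 = pearson_r(variable_a, [5, 4, 3, 2, 1])  # children
r_example_3 = pearson_r(variable_a, [4, 1, 3, 5, 2])  # postal code digit

print(r_example_1, r_example_2, r_example_3)
```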

Correlation coefficients provide a single numerical value to represent the relationship b/w the 2 variables

Correlation coefficients range from -1 to +1

-1.00 (negative one) a perfect, inverse relationship

+1.00 (positive one) a perfect, direct relationship

  0.00 indicates no relationship                      
Graphic Representations of Correlation

The form of the relationship

In a linear relationship as the X scores increase the Y scores tend to change in one direction only and can be summarised by a straight line
In a non-linear or curvilinear relationship as the X scores change the Y scores do not tend to only increase or only decrease: the Y scores change their direction of change
Computing a correlation
Alternative Formula for the Correlation Coefficient
Computing a Correlation
2 Types of Correlation Coefficient Tests

1)Pearson r
Full name is “Pearson product-moment correlation coefficient”

r (lower case r & italicized) is the statistic (a fact or piece of data obtained from a study of a large quantity of numerical data) for this test
2)Spearman’s rho
Full name is “Spearman’s rank-order correlation coefficient”

rho (lower case rho & italicized) is the statistic for this test

Correlation Coefficients & Strength
Strength of relationship is one thing a correlation coefficient test can tell us

Rule of Thumb for strength size (generally)
A correlation coefficient (r or rho)
Value of 0.00 indicates “no relationship”
Values b/w .01 & .24 may be called “weak”
Values b/w .25 & .49 may be called “moderate”
Values b/w .50 & .74 may be called “moderately strong”
Values b/w .75 and .99 may be called “strong”
A value of 1.00 is called “perfect”

Describing strength of relationships with positive or negative values

What is true in the positive is true in the negative
Ex: values b/w .75 & .99 are “strong” & values b/w -.75 & -.99 are also “strong”, though the relationship is inverse
Correlation Coefficients & Scatterplots

Scatterplots used to visually show trend of data
Tells us
If relationship indicated
Kind of relationship
Outliers – cases differing from general trend
Graph may indicate direction, strength, and/or relationship of two variables

It is ESSENTIAL to plot a scatter plot before conducting correlation analysis
If no relationship found in scatter plot,
No need to conduct correlation
When to Use Pearson r
Use Pearson r when:

Looking at relationship b/w 2 scale variables
Interval or ratio measurements
Data not highly skewed
Distribution of scores is approximately symmetrical
Relationship b/w variables is linear

When to Use Spearman’s rho
Use Spearman’s rho when:
One or both variables are ordinal
Ex: college degree, weight, or height given ranking order (i.e. 1 = lightest, 2 = middle, 3 = heaviest)
One or both sets of data are highly skewed
Distributions are not symmetrical
Relationship is not curvilinear
As determined in examination of scatter plot

Spearman Rank Order Correlation
This correlation coefficient is simply the Pearson r calculated on the rankings of the X and Y variables.
Because ranks of N objects are the integers from 1 to N, the sums and sums of squares are known (provided there are no ties).

Since we know the sum of the scores and the sum of their squares, we automatically know the variance of the integers from 1 to N.
Spearman Rank Order Correlation
Suppose we compute it with N in the denominator instead of
Spearman Rank Order Correlation

Different Scales, Different Measures of Association

The Pearson correlation is used to describe the linear relationship between two variables that are both interval or ratio variables
The symbol for Pearson’s correlation coefficient is r
The underlying principle of r is that it compares how consistently each Y value is paired with each X value in a linear fashion
The Pearson Correlation formula

             degree to which X and Y vary together
r = ---------------------------------------------------
         degree to which X and Y vary separately

               Co-variability of X and Y
 = -----------------------------------------
           variability of X and Y separately
$$r = \frac{\sum XY - (\sum X)(\sum Y)/N}{\sqrt{\left(\sum X^2 - (\sum X)^2/N\right)\left(\sum Y^2 - (\sum Y)^2/N\right)}}$$

            Degrees of freedom = N - 2
Sum of Product Deviations

We have used the sum of squares or SS to measure the amount of variation or variability for a single variable
The sum of products or SP provides a parallel procedure for measuring the amount of covariation or covariability between two variables
Definitional Formula

$SS_X = \sum (X - \bar{X})(X - \bar{X})$
or $SS_Y = \sum (Y - \bar{Y})(Y - \bar{Y})$

Note:
$SP = \sum (X - \bar{X})(Y - \bar{Y})$

X    Y    XY
1    3    3
2    6    12
4    4    16
5    7    35
$\sum X = 12$, $\sum Y = 20$, $\sum XY = 66$

$SP = \sum XY - (\sum X)(\sum Y)/N$
= 66 - 12(20)/4
= 66 - 60
= 6
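
As a cross-check, the sum of products for these four pairs can be computed both ways: with the computational form used in the arithmetic above, and with the definitional form based on deviations from the means. This is a minimal plain-Python sketch with illustrative variable names.

```python
x = [1, 2, 4, 5]
y = [3, 6, 4, 7]
n = len(x)

# Computational form: sum of XY minus (sum of X)(sum of Y) / N.
sum_xy = sum(a * b for a, b in zip(x, y))
sp_computational = sum_xy - sum(x) * sum(y) / n

# Definitional form: sum of products of deviations from the means.
mean_x = sum(x) / n
mean_y = sum(y) / n
sp_definitional = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

print(sum_xy, sp_computational, sp_definitional)
```

Both forms give the same SP; the computational form avoids working with deviations when the means are not whole numbers.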

Calculation of Pearson’s
Correlation Coefficient

Pearson’s correlation coefficient is a ratio comparing the covariability of X and Y (the numerator) with the variability of X and Y separately (the denominator)
SP measures the covariability of X and Y. The variability of X and Y is measured by calculating the SS for the X and Y scores separately
Pearson correlation coefficient
$r = SP / \sqrt{SS_X \cdot SS_Y}$

X     Y     X-X̄    Y-Ȳ    (X-X̄)(Y-Ȳ)    (X-X̄)²    (Y-Ȳ)²
0     1     -6     -1     +6            36        1
10    3     +4     +1     +4            16        1
4     1     -2     -1     +2            4         1
8     2     +2      0      0            4         0
8     3     +2     +1     +2            4         1
SP = 6+4+2+0+2 = 14
SSX = 36+16+4+4+4 = 64
SSY = 1+1+1+0+1 = 4

$r = SP / \sqrt{SS_X \cdot SS_Y}$
$= 14 / \sqrt{64 \times 4}$
= 14 ÷ 16
= +0.875
Inferential statistics
Regression.  The best fit line of prediction.

Using a correlation (relationship between variables) to predict one variable from knowing the score on the other variable
Usually a linear regression (finding the best fitting straight line for the data)
Best illustrated in a scatter plot with the regression line also plotted
The scatter plot
In correlation data, it is sometimes useful to regard one variable as an independent variable and the other as a dependent variable.
In these circumstances, a linear relationship between two variables X and Y can be expressed by the equation Y = bX + a
where Y is the dependent variable, X the independent variable, and b and a are constants
In the general linear equation the value of b is called the slope
The slope determines how much the Y variable will change when X is increased by one point
The value of a in the general equation is called the Y-intercept (where the line cuts the Y axis)
It determines the value of Y when X = 0
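
To make the slope and intercept concrete, the sketch below fits Y = bX + a to the five X and Y scores from the Pearson worked example earlier in these slides. The slides do not state how b and a are obtained, so the code assumes the standard least-squares formulas, b = SP / SS_X and a = mean(Y) - b × mean(X); the function name `predict` is illustrative.

```python
# X and Y scores from the Pearson worked example (SP = 14, SS_X = 64).
x = [0, 10, 4, 8, 8]
y = [1, 3, 1, 2, 3]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_x = sum((a - mean_x) ** 2 for a in x)

# Assumed least-squares estimates (not given in the slides).
slope_b = sp / ss_x                       # change in Y per one-point change in X
intercept_a = mean_y - slope_b * mean_x   # value of Y when X = 0

def predict(x_value):
    """Predicted Y on the fitted line Y = bX + a."""
    return slope_b * x_value + intercept_a

print(slope_b, intercept_a, predict(6))
```

A least-squares line always passes through the point of means, so `predict(6)` returns the mean of Y for this data set.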
A regression is a statistical method for studying the relationship between a single dependent variable and one or more independent variables.

In its simplest form a regression specifies a linear relationship between the dependent and independent variables.
            $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + e_i$
for a given set of observations

In the social sciences, a regression is generally used to represent a causal process.
Y represents the dependent variable
b0 is the intercept (it represents the predicted value of Y if X1 and X2 equal zero)
X1 and X2 are the independent variables (also called predictors or regressors)
b1 and b2 are called the regression coefficients and provide a measure of the effect of the independent variables on Y (they measure the slope of the line)
e is the error term: the variation in Y not explained by the causal model.

Why use regression?

Regression is used as a way of testing hypotheses about causal relationships.
Specifically, we have hypotheses about whether the independent variables have a positive or a negative effect on the dependent variable.
Just like in our hypothesis tests about variable means, we also would like to be able to judge how confident we are in our inferences.
Standard Error of Estimate

A regression equation, by itself, allows you to make predictions, but it does not provide any information about the accuracy of the predictions
The standard error of estimate gives a measure of the standard distance between a regression line and the actual data points

To calculate the standard error of estimate
Find a sum of squared deviations (SS)
Each deviation will measure the distance between the actual Y value (data) and the predicted Ŷ value (regression line)
This sum of squares is commonly called
Definition of Standard Error
The standard deviation of the sampling distribution is the standard error.  For the mean, it indicates the average distance of the statistic from the parameter.

Example of Height
Raw Data vs. Sampling Distribution
Formula: Standard Error of Mean
To compute the SEM, use: $SEM = s / \sqrt{n}$, where s is the sample standard deviation and n is the sample size

For our Example:

Standard Error (SE)

Reporting the SE has become popular recently
Researchers often misunderstand and misuse SE
The variability of observations is the SD, while the variability of 2 or more sample means is the SE
SE is therefore often called the “standard error of the mean”, whereas SD describes a set of observations or a population

When two variables covary in opposite directions, as smoking and lung capacity do, values tend to be on opposite sides of the group mean.  That is, when smoking is above its group mean, lung capacity tends to be below its group mean.
Consequently, by averaging the product of deviation scores, we can obtain a measure of how the variables vary together.
The Sample Covariance
Instead of averaging by dividing by N, we divide by N - 1. The resulting formula is $s_{xy} = \sum (X - \bar{X})(Y - \bar{Y}) / (N - 1)$
Calculating Covariance
Calculating Covariance
So we obtain

What is Analysis of Variance?
ANOVA is an inferential test designed for use with 3 or more data sets

t-tests are just a form of ANOVA for 2 groups

ANOVA is only interested in establishing the existence of statistical differences, not their direction.

Based upon an F value (R. A. Fisher) which reflects the ratio between systematic and random/error variance…
Procedure for computing 1-way ANOVA for independent samples
Step 1: Complete the table
                                    -square each raw score

                                    -total the raw scores for each group

                                    -total the squared scores for each group.

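Step 2: Calculate the Grand Total correction factor
                        $GT = (\sum X)^2 / N$

Step 3: Compute total Sum of Squares
$SS_{total} = \sum X^2 - GT$
$= (\sum X_A^2 + \sum X_B^2 + \sum X_C^2) - GT$

Step 4: Compute between groups Sum of Squares
$SS_{bet} = \sum_{g} (\sum X_g)^2 / n_g - GT$
$= (\sum X_A)^2 / n_A + (\sum X_B)^2 / n_B + (\sum X_C)^2 / n_C - GT$

Step 5: Compute within groups Sum of Squares
$SS_{wit} = SS_{total} - SS_{bet}$

Step 6: Determine the d.f. for each sum of squares (N = total number of scores, k = number of groups)
$df_{total} = N - 1$
$df_{bet} = k - 1$
$df_{wit} = N - k$

Step 7/8: Estimate the Variances & Compute F
$MS_{bet} = SS_{bet} / df_{bet}$
$MS_{wit} = SS_{wit} / df_{wit}$
$F = MS_{bet} / MS_{wit}$

Step 9: Consult F distribution table
-d1 is your df for the numerator (i.e. systematic variance)
-d2 is your df for the denominator (i.e. error variance)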
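
The procedure can be sketched end to end in plain Python. The three groups below are hypothetical scores invented purely for illustration (they do not come from the slides), and the mean-square and F formulas are the standard one-way ANOVA definitions, assumed here because the slide formulas did not survive extraction.

```python
# Hypothetical raw scores for three independent groups (illustration only).
groups = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [5, 6, 7],
}

all_scores = [score for scores in groups.values() for score in scores]
n_total = len(all_scores)   # N
k = len(groups)             # number of groups

# Step 2: grand total correction factor.
gt = sum(all_scores) ** 2 / n_total

# Step 3: total sum of squares.
ss_total = sum(score ** 2 for score in all_scores) - gt

# Step 4: between-groups sum of squares.
ss_between = sum(sum(scores) ** 2 / len(scores) for scores in groups.values()) - gt

# Step 5: within-groups sum of squares.
ss_within = ss_total - ss_between

# Step 6: degrees of freedom.
df_total = n_total - 1
df_between = k - 1
df_within = n_total - k

# Steps 7/8: variance estimates (mean squares) and the F ratio.
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print(ss_total, ss_between, ss_within)
print(df_between, df_within, f_ratio)
```

The resulting F would then be compared with the tabled critical value for d1 = `df_between` and d2 = `df_within` (Step 9).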

Statistical Decision Process
Type I error – rejecting a true null hypothesis (concluding that the treatment has an effect when in fact the treatment has no effect)
The alpha level for a hypothesis test is the probability that the test will lead to a Type I error
Alpha and Probability Values

The level of significance that is selected prior to data collection for accepting or rejecting a null hypothesis is called alpha. The level of significance actually obtained after the data have been collected and analyzed is called the probability value, and is indicated by the symbol p.

Inferential Statistics
Level of significance. The second determinant of statistical power is the p value at which the null hypothesis is to be rejected. Statistical power can be increased by making the level of significance needed to reject the null hypothesis less stringent (e.g. using alpha = .10 instead of .05), at the cost of a higher risk of Type I error.

Error Types
Example - Efficacy Test for New drug
Type I error - Concluding that the new drug is better than the standard (HA) when in fact it is no better (H0). Ineffective drug is deemed better.

Type II error - Failing to conclude that the new drug is better (HA) when in fact it is. Effective drug is deemed to be no better.
Non-parametric statistics
Non-parametric methods
So far we assumed that our samples were drawn from normally distributed populations.
Techniques that do not make that assumption are called distribution-free or nonparametric tests.
In situations where the normal assumption is appropriate, nonparametric tests are less efficient than traditional parametric methods.
Nonparametric tests frequently make use only of the order of the observations and not the actual values.
Usually do not state hypotheses in terms of a specific parameter
They make very few assumptions about the population distribution (distribution-free tests).
Suited for data measured in ordinal and nominal scales
Not as sensitive as parametric tests; more likely to fail in detecting a real difference between two treatments

Statistical analysis that attempts to explain the population parameter using a sample without making assumption about the frequency distribution of the assessed variable
In other words, the variable being assessed is distribution-free

Types of nonparametric tests
Chi-square test for Goodness of Fit – tests how well the obtained sample proportions fit the population proportions specified by the null hypothesis
Chi-square test for independence – tests whether or not there is a relationship between two variables
Non-Parametric Methods

Spearman Rho Rank Order Correlation Coefficient
To calculate the Spearman rho:
Rank the observations on each variable from lowest to highest.
Tied observations are assigned the average of the ranks.
The difference between the ranks on the X and Y variables  are summed and squared:
rrho = 1 – [(6åD2)/ n (n2 – 1)
Is there a relationship between the number of cigarettes smoked and severity of illness?
The null and alternative hypotheses are:
HO: There is no relationship between the number of cigarettes smoked and severity of illness
HA: There is a relationship between the number of cigarettes smoked and severity of illness
α = .05

$r_{rho} = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$
$= 1 - \frac{6(24)}{8(64 - 1)}$
$= .71$

$t_{calc}$ = 2.49
$t_{crit}$ = 2.447, df = 6, p = .05
Since the calculated t exceeds the critical value of t, we reject the null hypothesis and conclude that there is a statistically significant positive relationship between the number of cigarettes smoked and severity of illness.
Use: The Spearman rho is a non-parametric procedure for assessing the relationship between two variables.
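The worked example can be checked with a short Python sketch. It takes only the two quantities given on the slide (sum of squared rank differences = 24, n = 8). The t formula used here, t = rho * sqrt((n - 2) / (1 - rho^2)) with df = n - 2, is the standard approximation and is an assumption about how the slide obtained its t value; the function names are illustrative.

```python
import math


def rho_from_sum_d2(sum_d2, n):
    """Spearman rho from the sum of squared rank differences and sample size."""
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


def t_from_rho(rho, n):
    """Standard t approximation for a correlation, df = n - 2 (assumed here)."""
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))


n = 8
rho = rho_from_sum_d2(24, n)      # values taken from the slide
t_calc = t_from_rho(rho, n)
t_crit = 2.447                    # tabled value for df = 6, alpha = .05
reject_h0 = t_calc > t_crit
print(round(rho, 2), round(t_calc, 2), reject_h0)
```

With unrounded rho this gives a t of about 2.50, close to the 2.49 reported above; the small gap is presumably rounding. Either way the statistic exceeds 2.447.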

Goodness of Fit
The chi-square test is a “goodness of fit” test.
It answers the question of how well experimental data fit expectations.

As an example, you count F2 offspring, and get  290 purple and 110 white flowers.  This is a total of 400 (290 + 110) offspring.
We expect a 3/4 : 1/4 ratio.  We need to calculate the expected numbers (you MUST use the numbers of offspring, NOT the proportions); this is done by multiplying the total offspring by the expected proportions.  Thus we expect 400 * 3/4 = 300 purple, and 400 * 1/4 = 100 white.
Thus, for purple, obs = 290 and exp = 300.  For white, obs = 110 and exp = 100.
Chi square formula
$\chi^2 = \sum \frac{(obs - exp)^2}{exp}$

Now it's just a matter of plugging into the formula: 
         2 = (290 - 300)2 / 300 + (110 - 100)2 / 100
              =  (-10)2 / 300 + (10)2 / 100
              =  100 / 300 + 100 / 100
              = 0.333 + 1.000
              = 1.333.  
This is our chi-square value
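The same arithmetic can be written as a small Python sketch. The function name `chi_square_gof` is illustrative; the counts and the 3/4 : 1/4 ratio are the ones from the flower example.

```python
def chi_square_gof(observed, expected_proportions):
    """Goodness-of-fit chi-square from observed counts and expected proportions."""
    total = sum(observed)
    # Expected values must be counts, not proportions.
    expected = [total * p for p in expected_proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected


# 290 purple and 110 white offspring, expected ratio 3/4 : 1/4.
chi2, expected = chi_square_gof([290, 110], [3 / 4, 1 / 4])
print(expected, round(chi2, 3))
```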
State H0: H0: μ = 120
State H1: H1: μ ≠ 120
Choose α: α = 0.05
Choose n: n = 100
Choose test: Z, t, or $\chi^2$ test (or p-value)
Compute test statistic (or compute p-value)
Search for critical value
Make statistical decision rule
Express decision

Steps in Test of Hypothesis
Determine the appropriate test
Establish the level of significance: α
Formulate the statistical hypothesis
Calculate the test statistic
Determine the degree of freedom
Compare computed test statistic against a tabled/critical value
1.  Determine Appropriate Test
Chi Square is used when both variables are measured on a nominal scale.
It can be applied to interval or ratio data that have been categorized into a small number of groups.
It assumes that the observations are randomly sampled from the population.
All observations are independent (an individual can appear only once in a table and there are no overlapping categories).
It does not make any assumptions about the shape of the distribution nor about the homogeneity of variances.
2. Establish Level of Significance
α is a predetermined value
The convention
α = .05
α = .01
α = .001
3. Determine The Hypothesis:
Whether There is an Association or Not
Ho : The two variables are independent
Ha : The two variables are associated

4. Calculating Test Statistics
5. Determine Degrees of Freedom
df = (R-1)(C-1)
6. Compare computed test statistic against a tabled/critical value
The computed value of the Pearson chi-square statistic is compared with the critical value to determine if the computed value is improbable.
The critical tabled values are based on sampling distributions of the Pearson chi-square statistic
If the calculated $\chi^2$ is greater than the $\chi^2$ table value, reject Ho
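Steps 4 and 5 can be sketched in Python for a contingency table. This is an illustration under assumptions: the expected count for each cell is taken as (row total × column total) / N, the usual Pearson formula, which the slides do not spell out, and the 2 × 2 table below is hypothetical, not data from this presentation.

```python
def chi_square_independence(table):
    """Pearson chi-square and degrees of freedom for an R x C table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables are independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)   # df = (R-1)(C-1)
    return chi2, df


# Hypothetical counts, for illustration only.
hypothetical_table = [[10, 20],
                      [20, 10]]
chi2, df = chi_square_independence(hypothetical_table)
print(round(chi2, 3), df)
```

Every expected count in this hypothetical table is 15, so each of the four cells contributes 25/15 to the statistic.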
Suppose a researcher is interested in voting preferences on gun control issues.
A questionnaire was developed and sent to a random sample of 90 voters.
The researcher also collects information about the political party membership of the sample of 90 respondents.
Bivariate Frequency Table or Contingency Table

1.  Determine Appropriate Test
Party membership (2 levels), nominal
Voting preference (3 levels), nominal
2. Establish Level of Significance
Alpha of .05
3. Determine The Hypothesis
Ho : There is no difference between Democrats and Republicans in their opinions on the gun control issue.

Ha : There is an association between responses to the gun control survey and the party membership in the population.
4. Calculating Test Statistics

5. Determine Degrees of Freedom

df = (R-1)(C-1) = (2-1)(3-1) = 2
Critical Chi-Square
Critical values for chi-square are found on tables, sorted by degrees of freedom and probability levels.  Be sure to use p = 0.05.
If your calculated chi-square value is greater than the critical value from the table, you “reject the null hypothesis”.
If your chi-square value is less than the critical value, you “fail to reject” the null hypothesis (that is, your data are consistent with your genetic theory about the expected ratio, although this does not prove the theory correct).
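The decision rule can be sketched in Python for the flower example, where the chi-square value was 1.333. Two things here are assumptions rather than content of the slides: the degrees of freedom for a goodness-of-fit test are taken as the number of categories minus 1 (so df = 1 for two flower colours), and the small dictionary holds standard tabled critical values at p = 0.05.

```python
# Standard chi-square critical values at p = 0.05, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def decide(chi2, df, table=CRITICAL_05):
    """Compare a calculated chi-square with the tabled critical value."""
    if chi2 > table[df]:
        return "reject H0"
    return "fail to reject H0"


# Flower example: two categories, so df = 2 - 1 = 1 (assumed rule).
decision = decide(1.333, df=1)
print(decision)
```

Since 1.333 is below 3.841, the observed counts do not differ significantly from the 3/4 : 1/4 expectation.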
Chi-Square Table
6. Compare computed test statistic against a tabled/critical value
α = 0.05
df = 2
Critical tabled value = 5.991
Test statistic, 11.03, exceeds critical value
Null hypothesis is rejected
Democrats & Republicans differ significantly in their opinions on gun control issues
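For df = 2 the tabled value can be reproduced without a table, because the chi-square distribution with 2 degrees of freedom has the simple tail probability P(X > x) = exp(-x / 2). The Python sketch below relies on that fact, which is not stated in the slides, and works for df = 2 only; the function names are illustrative.

```python
import math


def chi2_critical_df2(alpha):
    """Critical value for chi-square with df = 2: solve exp(-x / 2) = alpha."""
    return -2 * math.log(alpha)


def chi2_p_value_df2(chi2):
    """Upper-tail probability for chi-square with df = 2."""
    return math.exp(-chi2 / 2)


critical = chi2_critical_df2(0.05)   # should agree with the tabled 5.991
p_value = chi2_p_value_df2(11.03)    # test statistic from the example
reject_h0 = 11.03 > critical
print(round(critical, 3), reject_h0)
```

The p-value for 11.03 falls below 0.05, which is the same conclusion as comparing the statistic with the critical value.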
