Tea is a domain-specific programming language that automates statistical test selection and execution. It's currently written as an open-source Python library for integration into data analysis workflows using Python (3.6 or higher).
Tea is designed for non-statistical experts who want to perform statistical analyses within their specific domain of expertise. If this is you, you're in the right place :)
Tea is available on pip! You can install Tea by typing the following terminal command into your console:
pip install tealang
Then, open whatever IDE you use for Python, and create a new Python file. Make sure to add an import statement so that you'll have Tea ready to use in your file:
import tea
You're all set! The next section will walk you through the steps of writing your own program in Tea.
You'll want to provide Tea with 5 pieces of information:
- the dataset of interest,
- the variables you want to analyze,
- the study design (e.g., specifying independent, dependent variables),
- any assumptions you may have about the data based on domain knowledge, and
- a hypothesis.
Let's unpack each of these in detail...
Tea accepts data either as a CSV or a Pandas DataFrame. Tea assumes data is in "long format."
data_path = "./athlete_events.csv"
tea.data(data_path)
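"Long format" means one row per observation, with one column per variable. If your data is in wide format (one column per condition), you'll need to reshape it first. Here is a minimal standard-library sketch of that reshaping; the column names (`ID`, `AR`, `TV`, `Condition`, `Score`) and the helper `wide_to_long` are hypothetical and not part of Tea.

```python
# Hypothetical wide-format rows: one row per participant, one column per condition.
wide_rows = [
    {"ID": 1, "AR": 4, "TV": 2},
    {"ID": 2, "AR": 5, "TV": 3},
]

def wide_to_long(rows, id_column, value_columns, var_name, value_name):
    """Return one row per (participant, condition) pair."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            long_rows.append({
                id_column: row[id_column],
                var_name: column,
                value_name: row[column],
            })
    return long_rows

long_rows = wide_to_long(wide_rows, "ID", ["AR", "TV"], "Condition", "Score")
for row in long_rows:
    print(row)
```

If you already work with Pandas, `DataFrame.melt` performs the same kind of reshaping.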
Optionally, you may want a column in the dataset to serve as a relational (a.k.a. primary) key. This key should be an attribute that acts as a row identifier.
If no key is included, then all rows are treated as individual observations, i.e. all variables are between-subjects.
Including a key is helpful when participants have multiple entries in the dataset, i.e. as for a within-subjects study.
To specify a key, simply add an argument when passing the dataset to Tea:
tea.data(data_path, key='ID')
The value included as the key name ('ID' in the example above) should be the name of a variable that you define from the dataset.
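If you're not sure whether your dataset needs a key, one quick check is whether any participant shows up in more than one row. The sketch below does this with the standard library; the rows and the helper `has_repeated_ids` are made up for illustration and are not part of Tea.

```python
from collections import Counter

# Hypothetical long-format rows; 'ID' is the candidate key column.
rows = [
    {"ID": 1, "Condition": "AR", "Score": 4},
    {"ID": 1, "Condition": "TV", "Score": 2},
    {"ID": 2, "Condition": "AR", "Score": 5},
    {"ID": 2, "Condition": "TV", "Score": 3},
]

def has_repeated_ids(rows, key):
    """True if any key value appears in more than one row."""
    counts = Counter(row[key] for row in rows)
    return any(n > 1 for n in counts.values())

repeated = has_repeated_ids(rows, "ID")
print(repeated)
```

Repeated IDs suggest a within-subjects layout, in which case passing the key to Tea is likely what you want.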
Variables correspond to columns given in the dataset. You only need to define variables for the attributes of interest; it's fine to have columns in the dataset that are not actually used in your Tea program.
You can describe a variable by including its column name and its data type. Depending on the data type, you'll also need to include additional information (detailed below).
Tea recognizes three data types: nominal, ordinal, and numeric.
A variable is nominal if its values represent discrete categories, and ordinal if these categories are ordered. These categories should be included in the variable description.
{
'name' : 'Condition',
'data type' : 'nominal',
'categories' : ['AR', 'TV']
},
{
'name' : 'Score',
'data type' : 'ordinal',
'categories' : [1,2,3,4,5]
}
A variable is numeric if it represents continuous values. An optional range can be included.
{
'name' : 'Prob',
'data type' : 'numeric',
'range' : [0,1] # optional
}
You'll want to define your variables as a list, which you can then pass to Tea:
variables = [
{
'name' : 'Condition',
'data type' : 'nominal',
'categories' : ['AR', 'TV']
},
{
'name' : 'Score',
'data type' : 'ordinal',
'categories' : [1,2,3,4,5]
}
]
tea.define_variables(variables)
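Before handing your variable list to Tea, it can help to sanity-check it against the dataset's columns. The following is a rough sketch of such a check; `find_problems` is a hypothetical helper (not a Tea function), and the column list stands in for your CSV header.

```python
# Variable definitions in the dictionary format described above.
variables = [
    {"name": "Condition", "data type": "nominal", "categories": ["AR", "TV"]},
    {"name": "Score", "data type": "ordinal", "categories": [1, 2, 3, 4, 5]},
    {"name": "Prob", "data type": "numeric", "range": [0, 1]},
]

def find_problems(variables, columns):
    """Return a list of human-readable problems (hypothetical helper, not a Tea API)."""
    problems = []
    for var in variables:
        name = var.get("name")
        dtype = var.get("data type")
        if name not in columns:
            problems.append(f"{name}: no such column in the dataset")
        if dtype not in ("nominal", "ordinal", "numeric"):
            problems.append(f"{name}: unknown data type {dtype!r}")
        elif dtype in ("nominal", "ordinal") and not var.get("categories"):
            problems.append(f"{name}: categories are expected for {dtype} variables")
    return problems

columns = ["ID", "Condition", "Score"]  # hypothetical CSV header
problems = find_problems(variables, columns)
print(problems)
```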
Next, you'll specify your study design by including the type of the study as well as the relevant variables of interest.
Tea supports 2 kinds of studies: observational study and experiment.
For an observational study, you should describe contributor variables and outcome variables.
study_design = {
'study type': 'observational study',
'contributor variables': 'Sport',
'outcome variables': 'Weight',
}
tea.define_study_design(study_design)
For an experiment, you should describe independent variables and dependent variables.
study_design = {
'study type': 'experiment',
'independent variables': 'Condition',
'dependent variables': 'Score'
}
tea.define_study_design(study_design)
If you'd like, multiple variables can be specified as a list. For example:
study_design = {
'study type': 'observational study',
'contributor variables': ['Sport', 'Sex'],
'outcome variables': 'Weight',
}
tea.define_study_design(study_design)
This can be useful if you're planning to test several hypotheses using different attributes.
If you have domain knowledge that would be relevant to testing your hypothesis, you can provide this information to Tea explicitly.
Note: This step is optional! You don't have to include any assumptions in order for Tea to run.
A numeric variable could be normally distributed across all entries:
assumptions = {
'normal distribution': [['Prob']]
}
You could also assume a variable's normal distribution within the categories of another variable, as shown below.
assumptions = {
'groups normally distributed': [['So', 'Prob']]
}
Here, So is a nominal variable indicating whether a state is Southern or non-Southern. Prob is assumed to be normally distributed within the group of Southern states, as well as within the non-Southern group.
Additionally, a numeric variable can be assumed to have equal variance across the categories of another variable. Below, we assume that the variance of Prob within the group of Southern states is equal to its variance within the non-Southern states.
assumptions = {
'equal variance': [['So', 'Prob']]
}
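To make "equal variance across categories" concrete, the sketch below computes the sample variance of a numeric column within each group. The values are made up, and this is only a descriptive look at the quantity involved, not the statistical check that Tea itself runs.

```python
from statistics import variance

# Made-up values for illustration only; not taken from any real dataset.
rows = [
    {"So": 0, "Prob": 0.030}, {"So": 0, "Prob": 0.040}, {"So": 0, "Prob": 0.050},
    {"So": 1, "Prob": 0.050}, {"So": 1, "Prob": 0.060}, {"So": 1, "Prob": 0.070},
]

def variance_by_group(rows, group_column, value_column):
    """Group the values by category and return the sample variance of each group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_column], []).append(row[value_column])
    return {group: variance(values) for group, values in groups.items()}

group_variances = variance_by_group(rows, "So", "Prob")
print(group_variances)
```

In this toy data both groups have the same spread, which is the situation the 'equal variance' assumption describes.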
Multiple assumptions of the same type can be entered as a list:
assumptions = {
'normal distribution': [['Prob'], ['Ineq']],
'equal variance': [['So', 'Prob'], ['So', 'Ineq']]
}
Assumptions could also include any known data transformations that can be applied to a variable. Tea currently supports log transformations on a numeric variable.
assumptions = {
'log normal distribution': [['Ineq']]
}
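As a quick illustration of what a log transformation does, the sketch below applies the natural logarithm to a strongly right-skewed set of made-up values. Which logarithm base Tea uses internally is not stated here, so treat `math.log` as an assumption; note also that a log transform only makes sense for positive values.

```python
import math

ineq = [1.0, 10.0, 100.0, 1000.0]   # made-up, strongly right-skewed values
log_ineq = [math.log(v) for v in ineq]

# After the transform, the gaps between neighboring values are (almost) equal.
gaps = [b - a for a, b in zip(log_ineq, log_ineq[1:])]
print(gaps)
```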
As part of the assumptions, you can also specify the Type I (False Positive) Error Rate (a.k.a. "significance threshold", or alpha): the null hypothesis is rejected only when the p-value falls below this threshold. The default alpha value is 0.05.
assumptions = {
'Type I (False Positive) Error Rate': 0.01
}
In the above example, the resulting p-value must be less than 0.01 in order to reject the null hypothesis.
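The decision rule itself is a single comparison. Here is a tiny sketch of it; `reject_null` is a hypothetical helper for illustration, not a Tea function.

```python
def reject_null(p_value, alpha=0.05):
    """Reject the null hypothesis only when p is strictly below alpha."""
    return p_value < alpha

print(reject_null(0.03))              # judged against the default alpha of 0.05
print(reject_null(0.03, alpha=0.01))  # judged against the stricter alpha of 0.01
```

The same p-value can lead to different conclusions depending on the alpha you choose, which is why it is worth setting alpha before looking at the results.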
You can declare all assumptions together before passing them to Tea, as shown below:
assumptions = {
'groups normally distributed': [['So', 'Prob']],
'Type I (False Positive) Error Rate': 0.05,
}
tea.assume(assumptions) # strict mode by default
Note: Tea checks your assumptions about variables' properties and will issue warnings for those that are not verified by statistical testing. To specify how Tea should handle these cases, you can select one of two modes: strict (default) or relaxed. In strict mode, Tea will override user assumptions if they are determined unverifiable. In relaxed mode, Tea will proceed with all user assumptions regardless of whether or not they are verified.
This mode can be set as shown:
tea.assume(assumptions, 'relaxed') # select relaxed mode
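One way to picture the difference between the two modes is as a small decision function. The sketch below is a toy model of the behavior described above, not Tea's actual implementation; the function name and arguments are hypothetical.

```python
def effective_assumption(user_asserts, data_supports, mode="strict"):
    """Decide whether a property is treated as holding (toy model, not Tea's code)."""
    if mode not in ("strict", "relaxed"):
        raise ValueError(f"unknown mode: {mode!r}")
    if not user_asserts:
        # No user assumption: rely on what the data shows.
        return data_supports
    if data_supports:
        # User assumption confirmed by the data.
        return True
    # The user asserted the property but the data does not support it:
    # strict mode overrides the user, relaxed mode keeps the assumption.
    return mode == "relaxed"

print(effective_assumption(True, False, "strict"))
print(effective_assumption(True, False, "relaxed"))
```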
Hypotheses in Tea include the variables of interest, as well as the hypothesized relationship between them.
Tea currently allows you to express the following range of hypotheses:
- One-sided comparison between groups
tea.hypothesize(['Region', 'Imprisonment'], ['Region: Southern > Northern'])
Here, Region is the name of a nominal or ordinal variable that has Southern and Northern as categories. This hypothesis describes a higher rate of Imprisonment (an ordinal or numeric variable) in the Southern category compared to the Northern category.
tea.hypothesize(['Region', 'Imprisonment'], ['Region: Southern > Southwest', 'Region: Northeast > Midwest'])
Tea also allows your hypothesis to include multiple comparisons, which can be useful if you are analyzing multiple independent variables (e.g., Region has several distinct categories) and one dependent variable (e.g., Imprisonment).
When you provide these comparisons as a single hypothesis, Tea will take them into consideration for performing corrections, such as adjusting the p-value, in order to control the false discovery rate.
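To give a feel for what such a correction looks like, here is a sketch of the Benjamini-Hochberg procedure, one common way to control the false discovery rate. This page does not say which correction Tea applies, so this is an illustration of the general idea only; the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03]  # made-up p-values from three comparisons
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

Each adjusted p-value is at least as large as the raw one, so a comparison that looked significant on its own may no longer be significant once the others are taken into account.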
tea.hypothesize(['Region', 'Imprisonment'], ['Region: Southern != Northern'])
A two-sided comparison allows for directionality in two ways: in order for this hypothesis to hold, Imprisonment could be such that either Southern > Northern or Southern < Northern.
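One-sided and two-sided comparisons are closely related. For a test statistic with a symmetric distribution (such as t), a two-sided p-value can be converted to a one-sided one as sketched below. The helper `one_sided_p` is hypothetical; the halving it performs is consistent with the p_value and adjusted_p_value shown in the sample Student's T Test output on this page, but that Tea computes it exactly this way is an assumption.

```python
def one_sided_p(two_sided_p, statistic, expect_positive=True):
    """Convert a two-sided p-value from a symmetric test statistic to a one-sided one."""
    in_expected_direction = (statistic > 0) == expect_positive
    if in_expected_direction:
        # The observed effect points the way the hypothesis predicted.
        return two_sided_p / 2
    # The observed effect points the other way.
    return 1 - two_sided_p / 2

p = one_sided_p(0.00012, 4.20213)
print(p)
```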
tea.hypothesize(['Region', 'Imprisonment'], ['Imprisonment ~ Region']) # positive by default
tea.hypothesize(['Region', 'Imprisonment'], ['Imprisonment ~ +Region'])
Here, both Region and Imprisonment are ordinal or numeric variables. This hypothesis describes a positive correlation between the two; i.e., Imprisonment increases as Region increases.
tea.hypothesize(['Region', 'Imprisonment'], ['Imprisonment ~ -Region'])
Similarly, you could describe a negative relationship between two ordinal or numeric variables.
Note: for linear relationships, it does not matter which variable is declared as independent vs. dependent when defining the study design.
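The reason the order does not matter is that a correlation coefficient is symmetric in its two arguments. The sketch below shows this with a hand-written Pearson correlation on made-up numbers; Tea may well select a different correlation test (for example, a rank-based one), so this only illustrates the symmetry.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

region = [1, 2, 3, 4, 5]        # made-up ordinal codes
imprisonment = [2, 4, 5, 4, 6]  # made-up values

r_xy = pearson_r(region, imprisonment)
r_yx = pearson_r(imprisonment, region)
print(r_xy, r_yx)
```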
As long as all variables of interest are defined and included in the study design, you can specify as many hypotheses as you'd like:
results_1 = tea.hypothesize(['Sport', 'Weight'], ['Sport:Wrestling < Swimming'])
results_2 = tea.hypothesize(['Sex', 'Weight'], ['Sex:F < M'])
Tea currently supports Null Hypothesis Significance Testing (NHST), which tests your hypothesis against the null hypothesis that the entries in your dataset occurred by random chance.
The statistical tests selected by Tea will determine whether the null hypothesis should be rejected, i.e., whether you can conclude that there is indeed a relationship between the variables specified in your hypothesis.
At runtime, Tea "compiles" the above 5 pieces of information into logical constraints, which it uses to select and execute a set of valid statistical tests for testing your hypothesis.
A test is considered valid if and only if all the assumptions it makes about the data (e.g., normal distribution, equal variance between groups, etc.) hold.
Tea outputs its decision process for every statistical test it considers.
Here's an example of what the output could look like for a statistical test that Tea finds to be invalid:
Currently considering kendalltau_corr
Testing assumption: is_bivariate.
Property holds.
Testing assumption: is_continuous_or_ordinal.
Property FAILS
Once Tea checks a certain assumption, it remembers whether or not that property holds. This allows Tea to more quickly evaluate other tests that require the same assumption(s). For instance, the following test was directly deemed unsatisfiable because Tea had already found that its properties failed:
Currently considering spearman_corr
Test is unsat.
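The "check once, remember the answer" idea can be sketched with a small cache. This is a simplified model for illustration only: Tea works with logical constraints, whereas the code below just uses a dictionary, and the property outcomes and requirement lists are hypothetical stand-ins that mirror the sample output above.

```python
def make_checker(property_checks):
    """Wrap property checks so each one runs at most once."""
    cache = {}
    calls = []  # records which checks actually ran

    def holds(name):
        if name not in cache:
            calls.append(name)
            cache[name] = property_checks[name]()
        return cache[name]

    return holds, calls

# Hypothetical outcomes, mirroring the sample output above.
property_checks = {
    "is_bivariate": lambda: True,
    "is_continuous_or_ordinal": lambda: False,
}
# Hypothetical requirement lists; the real tests may require more properties.
tests = {
    "kendalltau_corr": ["is_bivariate", "is_continuous_or_ordinal"],
    "spearman_corr": ["is_bivariate", "is_continuous_or_ordinal"],
}

holds, calls = make_checker(property_checks)
# A test is valid only if every property it requires holds.
valid = {name: all(holds(p) for p in required) for name, required in tests.items()}
print(valid)
print(calls)
```

Both tests come out invalid, but each property is evaluated only once: the second test is ruled out from the cached results alone.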
A valid statistical test would have output showing that all required assumptions are satisfied:
Currently considering mannwhitney_u
Testing assumption: has_one_x.
Property holds.
Testing assumption: has_one_y.
Property holds.
Testing assumption: has_independent_observations.
Property holds.
Testing assumption: is_categorical.
Property holds.
Testing assumption: has_two_categories.
Property holds.
Testing assumption: is_continuous_or_ordinal.
Property holds.
The moment you've been waiting for... Tea will execute each valid statistical test to test your hypothesis!
These results are displayed as output and also returned by Tea's hypothesize() function. For each test, the results include the name of the test, its assumptions about the variables, and the calculated statistics.
For example, the results of executing the Student's T Test would look something like this:
Test: students_t
***Test assumptions:
Exactly two variables involved in analysis: So, Prob
Exactly one explanatory variable: So
Exactly one explained variable: Prob
Independent (not paired) observations: So
Variable is categorical: So
Variable has two categories: So
Continuous (not categorical) data: Prob
Equal variance: So, Prob
Groups are normally distributed: So, Prob
***Test results:
name = Student's T Test
test_statistic = 4.20213
p_value = 0.00012
adjusted_p_value = 0.00006
alpha = 0.05
dof = 45
Effect size:
Cohen's d = 1.24262
A12 = 0.83669
Null hypothesis = There is no difference in means between So = 0 and So = 1 on Prob.
Interpretation = t(45) = 4.20213, p = 0.00006. Reject the null hypothesis at alpha = 0.05. The mean of Prob for So = 1 (M=0.06371, SD=0.02251) is significantly greater than the mean for So = 0 (M=0.03851, SD=0.01778). The effect size is Cohen's d = 1.24262, A12 = 0.83669. The effect size is the magnitude of the difference, which gives a holistic view of the results [1].
[1] Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the P value is not enough. Journal of graduate medical education, 4(3), 279-282.
Tea then states whether or not the null hypothesis should be rejected based on the results for that particular test execution.
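If the effect sizes in the output are unfamiliar, the sketch below shows the standard textbook definitions of Cohen's d (using the pooled standard deviation) and the Vargha-Delaney A12 statistic. The data is made up, and Tea's exact formulas may differ from these; the helpers `cohens_d` and `a12` are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def a12(group_a, group_b):
    """Probability that a value from A exceeds a value from B (ties count half)."""
    wins = sum((a > b) + 0.5 * (a == b) for a in group_a for b in group_b)
    return wins / (len(group_a) * len(group_b))

southern = [0.05, 0.06, 0.07]      # made-up values
non_southern = [0.03, 0.04, 0.05]  # made-up values

d = cohens_d(southern, non_southern)
a = a12(southern, non_southern)
print(d, a)
```

Cohen's d expresses the difference in means in units of standard deviation, while A12 is 0.5 when neither group tends to have larger values and approaches 1 as the first group dominates.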
To help you become more familiar with Tea, we've provided some sample Tea programs with their respective datasets:
- ar_tv_tea.py (dataset: ar_tv_long.csv)
- crime_tea.py (dataset: USCrime.csv)
- olympics_tea.py (dataset: athlete_events.csv)
- drug_small_tea.py (Pandas DataFrame using a subset of this dataset: drug.csv)
- co2_tea.py (dataset: co2.csv)