Conditional Probability with Pandas Indexing
Alex | Last updated: December 18, 2020
I’ve been brushing up on statistics using Python lately, and my brain was dizzy with Think Bayes’s walkthrough of computing conditional probability with pandas.
They define
def prob(A):
    """
    Computes the probability of a proposition, A.
    Note A is a pandas Series of either 1 and 0 or True and False
    """
    return A.mean()

def conditional(proposition, given):
    """Probability of A conditioned on given."""
    return prob(proposition[given])
So to compute a conditional probability, we are effectively computing a fraction of a given sample:
- Using the bracket operator to select the givens
- Using pandas' mean method to compute the fraction of givens that fulfill the proposition
proposition[given].mean()
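To make the two steps concrete, here is a minimal sketch on made-up data. The rainy and umbrella series are hypothetical (they are not from Think Bayes); the two functions are copied from the definitions above.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Fraction of the rows where `given` is True that also satisfy `proposition`."""
    return prob(proposition[given])

# Hypothetical toy sample of five days (illustrative only).
rainy = pd.Series([True, True, False, True, False])
umbrella = pd.Series([True, False, False, True, True])

# 3 of the 5 days have an umbrella.
p_umbrella = prob(umbrella)

# Only days 0, 1, and 3 are rainy; 2 of those 3 have an umbrella.
p_umbrella_given_rainy = conditional(umbrella, rainy)

print(p_umbrella, p_umbrella_given_rainy)
```

The bracket step shrinks the sample to the rainy days, and the mean is then taken over that smaller sample only.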
Observe that selecting the proposition from a given sample is:
proposition[given]
and not
given[proposition]
Why is this?
Bracket Operator in pandas
Consider these three series, a, b, and c:
a = pd.Series(['a', 'b', 'c'])
b = pd.Series([2, 0, 1])
c = pd.Series([True, True, False])
Calling a, b, and c below results in the following displays:
>>> a
0 a
1 b
2 c
dtype: object
>>> b
0 2
1 0
2 1
dtype: int64
>>> c
0 True
1 True
2 False
dtype: bool
What happens if we call a[b]?
>>> a[b]
2 c
0 a
1 b
dtype: object
This pandas-fashion indexing retrieves the contents of a using b as indices. So a[b] retrieves a[2], a[0], and a[1] in that order. a[b] in this instance is then equivalent to a[[2, 0, 1]].
An analogous Python list comprehension would be:
a = ['a', 'b', 'c']
b = [2, 0, 1]
out = [a[i] for i in b] # out == a[b]
Okay, now what happens if we call c[b]?
>>> c[b]
2 False
0 True
1 True
dtype: bool
The exact same behavior occurs. pandas retrieves the contents of c using b as indices; equivalently, c[[2, 0, 1]].
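As a quick check of the integer-indexing behavior, the sketch below compares a[b] with an explicit list of labels and with the list-comprehension analogy. It assumes the default 0, 1, 2 index, where labels and positions happen to coincide.

```python
import pandas as pd

a = pd.Series(['a', 'b', 'c'])
b = pd.Series([2, 0, 1])

via_series = a[b]          # integer Series used as index labels
via_list = a[[2, 0, 1]]    # the same labels written out by hand

# Plain-Python analogue: look up each index in turn.
via_comprehension = [a[i] for i in b]

print(via_series.tolist())   # values in the order requested by b
print(list(via_series.index))
```

Both the values and the index labels come back in the order given by b, which is why the output above starts at label 2.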
So this is cool. But what about calling something like a[c]?
>>> a[c]
0 a
1 b
dtype: object
Interesting. When we use booleans as indices, we filter out the values that are False. For c, indices 0 and 1 were True, and these are the index values that remain when we index the series a using c. It looks like this: a[[True, True, False]] or a[[0, 1]].
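The three spellings mentioned above can be compared side by side. This is just a sketch using the same a and c as before, again assuming the default integer index.

```python
import pandas as pd

a = pd.Series(['a', 'b', 'c'])
c = pd.Series([True, True, False])

masked = a[c]                         # boolean Series used as a mask
from_bools = a[[True, True, False]]   # plain list of booleans
from_labels = a[[0, 1]]               # the labels where the mask is True

print(masked.tolist(), list(masked.index))
```

All three keep the rows labeled 0 and 1 and drop the row labeled 2.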
Similarly, indexing into b using c results in:
>>> b[c]
0 2
1 0
dtype: int64
Again, we only index using True values and filter out False values.
Okay, so what happens if we call b[a] or c[a]?
>>> b[a]
Traceback (most recent call last):
...
KeyError: "None of [Index(['a', 'b', 'c'], dtype='object')] are in the [index]"
>>> c[a]
Traceback (most recent call last):
...
KeyError: "None of [Index(['a', 'b', 'c'], dtype='object')] are in the [index]"
Alright got it. Turns out, we can’t use strings as indices here, because neither b nor c has 'a', 'b', or 'c' among its index labels. That takes a bit more finessing. Great.
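One form the "finessing" could take is giving a series string labels in the first place. The scores series below is hypothetical, made up only to show that string lookups succeed once the index actually contains those strings.

```python
import pandas as pd

a = pd.Series(['a', 'b', 'c'])
b = pd.Series([2, 0, 1])

# Hypothetical Series whose index holds the labels 'a', 'b', 'c'.
scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
picked = scores[a]   # each string in a is found in scores' index

# b only has the labels 0, 1, 2, so the same lookup fails there.
raised = False
try:
    b[a]
except KeyError:
    raised = True

print(picked.tolist(), raised)
```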
So how does this relate to conditional probabilities?
Well, Think Bayes uses an example:
What is the probability that a respondent is a Democrat, given that they are liberal? […] which we can interpret like this: “Of all the respondents who are liberal, what fraction are Democrats?” We can compute this probability in two steps:
- Select all respondents who are liberal.
- Compute the fraction of the selected respondents who are Democrats.
To select liberal respondents, we can use the bracket operator, [], like this:
selected = democrat[liberal]
selected contains the values of democrat for liberal respondents, so prob(selected) is the fraction of liberals who are Democrats.
The question is, why do we set selected like so:
selected = democrat[liberal]
This uses the liberal series as indices and the democrat series as values. In other words, we take all instances where liberal is True and use those instances as indices for the democrat Series.
Then, prob(democrat[liberal]) means: given the entire sample of liberals, calculate the probability that someone is a democrat.
Ahh okay. Got it, understood.
And that’s why it’s democrat[liberal] instead of liberal[democrat] for democrat given liberal. Okay!
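To see that the order really matters, here is a small sketch with six made-up respondents. The values are hypothetical and are not the survey data used in Think Bayes.

```python
import pandas as pd

# Hypothetical toy sample of six respondents (illustrative only).
democrat = pd.Series([True, True, False, True, True, False])
liberal = pd.Series([True, True, True, False, False, False])

# Democrat given liberal: keep the 3 liberal rows (0, 1, 2),
# then take the fraction of them who are Democrats (2 of 3).
p_dem_given_lib = democrat[liberal].mean()

# Liberal given Democrat: keep the 4 Democrat rows (0, 1, 3, 4),
# then take the fraction of them who are liberal (2 of 4).
p_lib_given_dem = liberal[democrat].mean()

print(p_dem_given_lib, p_lib_given_dem)
```

The series inside the brackets decides which respondents are kept; the series outside supplies the values that get averaged. Swapping them answers a different question.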