Conditional Probability with Pandas Indexing

Python

pandas

Alex | Last updated: December 18, 2020

I’ve been brushing up on statistics using Python lately, and Think Bayes’s walkthrough of computing conditional probability with pandas left my head spinning.

The book defines two helper functions:

def prob(A):
    """
    Computes the probability of a proposition, A.
    
    Note A is a pandas Series of either 1 and 0 or True and False
    """    
    return A.mean()

def conditional(proposition, given):
    """Probability of A conditioned on given."""
    return prob(proposition[given])

So to compute a conditional probability, we are effectively computing a fraction of a given sample:

Using the bracket operator to select the givens
Using pandas’ mean method to compute the fraction of givens that fulfill the proposition

proposition[given].mean()
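As a quick sanity check, here is a minimal sketch on made-up data (the two Series below are hypothetical, not from the book). It suggests that the bracket-then-mean recipe agrees with the textbook definition P(A | B) = P(A and B) / P(B):

```python
import math
import pandas as pd

def prob(A):
    """Fraction of True values in a boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Fraction of the rows selected by `given` where `proposition` holds."""
    return prob(proposition[given])

# Hypothetical toy data: five respondents, two boolean propositions.
proposition = pd.Series([True, False, True, True, False])
given = pd.Series([True, True, False, True, False])

via_brackets = conditional(proposition, given)
via_definition = prob(proposition & given) / prob(given)

# Of the three rows where `given` is True, two also satisfy `proposition`,
# so both routes should come out at 2/3.
assert math.isclose(via_brackets, via_definition)
assert math.isclose(via_brackets, 2 / 3)
```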

Observe that selecting the proposition from a given sample is:

proposition[given]

and not

given[proposition]

Why is this?

Bracket Operator in pandas

Consider these three series, a, b, and c:

import pandas as pd

a = pd.Series(['a', 'b', 'c'])
b = pd.Series([2, 0, 1])
c = pd.Series([True, True, False])

Evaluating a, b, and c in the interpreter displays the following:

>>> a
0    a
1    b
2    c
dtype: object

>>> b
0    2
1    0
2    1
dtype: int64

>>> c
0     True
1     True
2    False
dtype: bool

What happens if we call a[b]?

>>> a[b]
2    c
0    a
1    b
dtype: object

Indexing a Series with another Series of integers looks up the values of a at the index labels listed in b. So a[b] retrieves a[2], a[0], and a[1], in that order, which makes a[b] in this instance equivalent to a[[2, 0, 1]].

An analogous Python list comprehension would be:

a_list = ['a', 'b', 'c']
b_list = [2, 0, 1]

out = [a_list[i] for i in b_list]  # ['c', 'a', 'b'], the same values as the Series a[b]
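One caveat worth a sketch: the list analogy is positional, while the lookup on a Series with an integer index goes by label. With the default 0, 1, 2 index the two coincide, so the difference is invisible here. The example below uses the explicit .loc (label) and .iloc (position) accessors to make that visible; a_shuffled is a hypothetical variant introduced only for illustration.

```python
import pandas as pd

a = pd.Series(['a', 'b', 'c'])
b = pd.Series([2, 0, 1])

# Default index 0, 1, 2: label lookup and positional lookup agree.
assert a.loc[b].tolist() == ['c', 'a', 'b']
assert a.iloc[b.to_numpy()].tolist() == ['c', 'a', 'b']

# Hypothetical variant: same values, but the index labels are shuffled.
a_shuffled = pd.Series(['a', 'b', 'c'], index=[2, 0, 1])

by_label = a_shuffled.loc[b].tolist()                  # labels 2, 0, 1
by_position = a_shuffled.iloc[b.to_numpy()].tolist()   # positions 2, 0, 1

assert by_label == ['a', 'b', 'c']
assert by_position == ['c', 'a', 'b']
```

For the conditional probability use case this rarely matters, since the Series involved usually come from the same DataFrame and share one index.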

Okay, now what happens if we call c[b]?

>>> c[b]
2    False
0     True
1     True
dtype: bool

The exact same behavior occurs: pandas retrieves the contents of c using b as indices, which is equivalent to c[[2, 0, 1]].

So this is cool. But what about something like a[c]?

>>> a[c]
0    a
1    b
dtype: object

Interesting. When we index with booleans, pandas keeps the entries where the mask is True and filters out the entries where it is False. In c, the entries at indices 0 and 1 are True, so those are the entries of a that remain. It looks like this: a[[True, True, False]], which gives the same result as a[[0, 1]].

Similarly, indexing into b using c results in:

>>> b[c]
0    2
1    0
dtype: int64

Again, only the entries where c is True are kept; the entries where c is False are filtered out.
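The filtering behavior can be sketched in plain Python as a comprehension that keeps a value only when its paired flag is True. This is just an analogy for the simple case where both Series share the same default index, not a description of how pandas implements it:

```python
import pandas as pd

a = pd.Series(['a', 'b', 'c'])
b = pd.Series([2, 0, 1])
c = pd.Series([True, True, False])

# Plain-Python analogue of a[c]: pair each value with its flag, keep the True ones.
kept = [value for value, keep in zip(a, c) if keep]
assert kept == a[c].tolist() == ['a', 'b']

# The same mask applied to b keeps the values at indices 0 and 1.
assert b[c].tolist() == [2, 0]

# Taking the mean of the filtered values is the step prob() performs later on.
assert b[c].mean() == 1.0
```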

Okay, so what happens if we call b[a] or c[a]?

>>> b[a]
Traceback (most recent call last):
  ...
KeyError: "None of [Index(['a', 'b', 'c'], dtype='object')] are in the [index]"

>>> c[a]
Traceback (most recent call last):
  ...
KeyError: "None of [Index(['a', 'b', 'c'], dtype='object')] are in the [index]"

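Alright, got it. The KeyError says that none of the labels 'a', 'b', and 'c' exist in the index of b or c, which only holds 0, 1, and 2. So we can’t use strings as indices without a bit more finessing. Great.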
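One form that finessing can take, as a rough sketch: give the Series being indexed an index whose labels are those strings. The name b_labeled below is hypothetical and only here for illustration.

```python
import pandas as pd

a = pd.Series(['a', 'b', 'c'])

# Hypothetical variant of b whose index labels are the strings found in a.
b_labeled = pd.Series([2, 0, 1], index=['a', 'b', 'c'])

# Now the strings in a are valid labels, so the lookup succeeds.
looked_up = b_labeled[a]
assert looked_up.tolist() == [2, 0, 1]
assert looked_up.index.tolist() == ['a', 'b', 'c']

# Any subset or reordering of the labels works the same way.
assert b_labeled[pd.Series(['c', 'a'])].tolist() == [1, 2]
```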

So how does this relate to conditional probabilities?

Well, Think Bayes uses an example:

What is the probability that a respondent is a Democrat, given that they are liberal? […] which we can interpret like this: “Of all the respondents who are liberal, what fraction are Democrats?” We can compute this probability in two steps:

Select all respondents who are liberal.

Compute the fraction of the selected respondents who are Democrats.

To select liberal respondents, we can use the bracket operator, [], like this:

selected = democrat[liberal]

selected contains the values of democrat for liberal respondents, so prob(selected) is the fraction of liberals who are Democrats.

The question is, why do we set selected like so:

selected = democrat[liberal]

This uses the liberal Series as a boolean mask and the democrat Series as the values. In other words, we take every position where liberal is True and keep the corresponding entries of the democrat Series.

Then prob(democrat[liberal]) means: given the sample of liberals, calculate the fraction who are Democrats.

Ahh okay. Got it, understood.

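And that’s why it’s democrat[liberal] instead of liberal[democrat] for the probability of Democrat given liberal. The reverse, liberal[democrat], would select the Democrats and report the fraction of them who are liberal. Okay!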
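To see that the order really matters, here is a small sketch with invented data (eight hypothetical respondents, not the survey used in Think Bayes). The two bracket orders select different groups and so give different fractions:

```python
import math
import pandas as pd

# Hypothetical survey of eight respondents (made-up values).
democrat = pd.Series([True, True, False, False, True, True, False, False])
liberal = pd.Series([True, True, True, False, False, False, False, False])

# Select the liberals, then ask what fraction of them are Democrats.
p_democrat_given_liberal = democrat[liberal].mean()

# Select the Democrats, then ask what fraction of them are liberal.
p_liberal_given_democrat = liberal[democrat].mean()

# Three liberals, two of whom are Democrats.
assert math.isclose(p_democrat_given_liberal, 2 / 3)
# Four Democrats, two of whom are liberal.
assert math.isclose(p_liberal_given_democrat, 0.5)
```

The Series inside the brackets decides who gets counted; the Series outside decides what is being counted about them.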