Python is among the most popular programming languages today. In my opinion, there is one main reason for that: readability.
A clear example of how Python was designed to be readable is the following:
if 'melon' not in ('apple', 'coconut'): print('it is missing!')
Compare this with, for example, JavaScript:
var fruits = new Array("apple", "coconut");
if (!(fruits.indexOf("melon") >= 0)) { console.log("it is missing!"); }
I think it's clear how Python tries to make humans' lives easy, even at the cost of extra complexity in the interpreter.
Example
This point about readability does not only apply to code that you write and need to read later. It also applies when you write libraries for other people to use. Python is designed to let you write libraries in such a way that your users can write readable code too.
For example, let's think of a toy library that implements colors.
class ColorStep1:
    def __init__(self, red=0, green=0, blue=0):
        self.red = red
        self.green = green
        self.blue = blue

    def __str__(self):
        """Convert the color from the 3 integer values, to a string like #ffffff."""
        return f'#{self.red:02x}{self.green:02x}{self.blue:02x}'

    def _repr_html_(self):
        """Display the color as a box of its color in Jupyter."""
        return f'<span style="color: {self}">▅</span>'


blue = ColorStep1(blue=255)
blue
Note: In practice it would make sense to have a single Color class with all the methods. I'll be writing it in separate ColorStepN classes that inherit from the previous one, to show the development step by step.
A common way to mix colors could be to simply implement a mix method.
class ColorStep2(ColorStep1):
    @staticmethod
    def _mix_one(color1, color2):
        """There are many ways to mix colors, here we just take the sum of the components."""
        return min(color1 + color2, 255)

    def mix(self, other):
        # type(self) keeps the result in the same class when a subclass inherits mix
        return type(self)(red=self._mix_one(self.red, other.red),
                          green=self._mix_one(self.green, other.green),
                          blue=self._mix_one(self.blue, other.blue))


red = ColorStep2(red=255)
green = ColorStep2(green=255)
# Mixing red and green to generate yellow
red.mix(green)
This works well, but could we make that last line more readable? I think so. I think it would be really cool for the users of our colors library if they could simply write red + green.
As mentioned before, Python is designed not only to let us write readable code, but also to let us write libraries that make our users' code readable.
Operators
A first version of our class with operators could look like:
class ColorStep3(ColorStep2):
    def __add__(self, other):
        return self.mix(other)


red = ColorStep3(red=255)
green = ColorStep3(green=255)
red + green
You can disagree, but to me, and I bet to most Python programmers, red + green is easier to read than red.mix(green).
So, we managed to let users use this syntax with just the addition of the special __add__ method.
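To make the mechanism a bit more explicit, here is a minimal sketch of what happens behind the + sign. The Gray class is a made-up one-channel color used only for illustration (it is not part of the colors library above). The point is just that a + b ends up calling the __add__ method defined on the class of a.

```python
import operator


class Gray:
    """Hypothetical one-channel color, used only to illustrate operator dispatch."""

    def __init__(self, level):
        self.level = level

    def __add__(self, other):
        # Same capped sum used for mixing in the examples above
        return Gray(min(self.level + other.level, 255))


a = Gray(100)
b = Gray(50)

# Three equivalent ways of triggering the same method
via_operator = a + b
via_function = operator.add(a, b)
via_method = type(a).__add__(a, b)

print(via_operator.level, via_function.level, via_method.level)
```

All three calls go through the same __add__ method, which is why implementing that single method is enough to enable the + syntax.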
Interacting with other types
An extra feature that I would like to have is the ability to mix my color class with colors given as strings in the form #ffffff.
Let's give it a try first, and see why it fails:
red = ColorStep3(red=255)
green = '#00ff00'
red + green
Our implementation of the mix method assumes that it will receive an instance of the color class, since it expects to find the attributes red, green and blue.
What we will do is create a method to convert the string representation to our class, and then automatically convert the other parameter if it is a string.
import re


class ColorStep4(ColorStep3):
    @staticmethod
    def _parse_rgb_string(value):
        """Convert a string like #ffffff (the leading # is optional) to a color."""
        match = re.search(r'^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$',
                          value.lower())
        if match is None:
            raise ValueError(f'not a valid RGB color string: {value!r}')
        return ColorStep4(red=int(match.group(1), 16),
                          green=int(match.group(2), 16),
                          blue=int(match.group(3), 16))

    def __add__(self, other):
        # Strings are converted to a color before mixing
        if isinstance(other, str):
            other = self._parse_rgb_string(other)
        return self.mix(other)


red = ColorStep4(red=255)
green = '#00ff00'
red + green
This wasn't difficult. Now I can add (mix) a string to my color class. But can I add my color class to a string? Python strings are generic, and they don't know anything about the color class I just implemented.
The answer is no:
red = '#ff0000'
green = ColorStep4(green=255)
red + green
I don't think it makes sense to modify the str type in Python to let it know about our new class (and it wouldn't be simple).
Luckily, Python provides a way to make this work easily. When the left operand does not support the operation with the given right operand, Python does not report a TypeError to the user right away. First it tries something else: it looks for a __radd__ method in the class at the right of the operation. In this case it didn't exist, but let's see what happens if we implement it:
class ColorStep5(ColorStep4):
    def __radd__(self, other):
        return self + other


red = '#ff0000'
green = ColorStep5(green=255)
red + green
And voilà, it worked. :)
What happened here is the following:
- We tried the operation add with str + color
- str does not know how to add a color instance, so the left operand could not handle the operation
- Instead of reporting a TypeError, Python called the __radd__ method of color, with the str instance as the other parameter
- That worked, and Python reported the result to the user
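The steps above can be traced with a small sketch. Since str is a built-in type and we can't look inside it from Python, I'm using two hypothetical classes, Left and Right, that record which special methods get called. Returning NotImplemented is the documented way for a special method to say that it can't handle the other operand.

```python
calls = []


class Left:
    """Hypothetical class that does not know how to add anything."""

    def __add__(self, other):
        calls.append('Left.__add__')
        # NotImplemented tells Python "I can't handle this operand"
        return NotImplemented


class Right:
    """Hypothetical class that only implements the reflected addition."""

    def __radd__(self, other):
        calls.append('Right.__radd__')
        return 'handled by Right'


result = Left() + Right()
print(calls)    # the order in which Python tried the methods
print(result)

# If neither side can handle the operation, Python raises TypeError
try:
    Left() + Left()
    type_error_raised = False
except TypeError:
    type_error_raised = True
print(type_error_raised)
```

The TypeError only reaches the user once both the left and the right operand have declined the operation.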
Limitations
This is great: we can not only support additions between our class and any other class, but there are many other operations we can implement too. Just some random examples:
- color + color
- color + whatever
- whatever + color
- color - color
- color * whatever
- whatever == color
- color > color
- ...
The Python documentation has the full list of Python operators.
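As a rough illustration of a few of those operators, here is a sketch with a made-up one-channel Gray class. What subtraction, multiplication or "greater than" should mean for a color is entirely my assumption here, the point is only which special method backs each operator.

```python
class Gray:
    """Hypothetical one-channel color, just to show a few more special methods."""

    def __init__(self, level):
        self.level = level

    def __eq__(self, other):        # gray == other
        return isinstance(other, Gray) and self.level == other.level

    def __sub__(self, other):       # gray - other
        return Gray(max(self.level - other.level, 0))

    def __gt__(self, other):        # gray > other (assumed to mean "brighter than")
        return self.level > other.level

    def __mul__(self, factor):      # gray * number
        return Gray(min(int(self.level * factor), 255))


light = Gray(200)
dark = Gray(50)

print(light - dark == Gray(150))
print(light > dark)
print((dark * 2).level)
```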
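This is great, but there are some operators that are not in this list:
- color and color
- color or color
- not color
- color in (color1, color2)
The and, or and not operators cannot be overloaded: they only look at the truth value of their operands. The in operator can be customized with __contains__, but only by the container on the right, and its result is always converted to a boolean. There was a proposal to make and, or and not overloadable (PEP 335), which was rejected by Guido van Rossum.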
While I don't know what the implications of accepting the proposal would have been for the interpreter (in terms of performance, complexity...), I do know what the implications are for library authors, and specifically for pandas.
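To see why a library can't hook into these operators, here is a small sketch. Flag is a hypothetical class that counts how many times Python asks for its truth value through __bool__, which is the only thing and and not ever consult.

```python
class Flag:
    """Hypothetical class that records every time Python asks for its truth value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value
        self.bool_calls = 0

    def __bool__(self):
        self.bool_calls += 1
        return self.value


yes = Flag('yes', True)
no = Flag('no', False)

# `and` returns one of its operands, chosen by the truth value of the first
picked = yes and no
print(picked.name)

# `not` always returns a plain bool
negated = not no
print(negated)

# Only __bool__ was called, once on each object
print(yes.bool_calls, no.bool_calls)
```

A class can decide whether it counts as true or false, but it has no say in what the and, or and not expressions themselves return.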
Operators in pandas
pandas makes heavy use of operator overloading. See these examples:
df['distance_in_miles'] * 1.609344
df['base_price'] + df['base_price'] * df['vat_rate']
df['age'] >= 18
Now consider this other example:
df['airline'] == 'DL' and not df['first_class']
While this looks very readable, there is a problem with it. The and and not operators are not overloaded by pandas, since this is not allowed. So, they are the original operators from the Python interpreter.
The original and and not operators convert their operands to a single boolean value, and then evaluate the condition based on that. So, in this case, the one-value-per-row result of df['airline'] == 'DL' would have to be collapsed into a single True or False. pandas refuses to guess how to do that, and raises a ValueError saying that the truth value of a Series is ambiguous.
This is not what a pandas user would expect, and it's inconsistent with the other operators, so this is not the syntax used by pandas.
If pandas maintainers can't offer the above syntax, what are the alternatives? There are in my opinion two reasonable approaches.
The first solution is to go back to using named functions or methods, like we started with mix. This would look like:
pandas.and(df['airline'] == 'DL',
pandas.not(df['first_class']))
This is not valid Python syntax, since and and not are reserved keywords in Python, and it will result in a syntax error.
The solution recommended by PEP 8 is to add a single trailing underscore, so the final syntax (hypothetical, since pandas does not provide these functions) would be:
pandas.and_(df['airline'] == 'DL',
pandas.not_(df['first_class']))
I think we will all agree that this is less readable than using the and and not operators.
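The keyword clash and the trailing underscore convention can be checked with the standard library alone. In the sketch below, lib is just a placeholder name inside a string that gets compiled, never executed. As a side note, the standard operator module uses the same convention for its own and_ and not_ functions.

```python
import keyword
import operator

# `and` and `not` are reserved words, so they can't be used as names
print(keyword.iskeyword('and'), keyword.iskeyword('not'))

# With the trailing underscore they become ordinary identifiers
print(keyword.iskeyword('and_'), 'and_'.isidentifier())

# Using the keyword as an attribute name fails at compile time
try:
    compile('lib.and(a, b)', '<example>', 'eval')
    syntax_error_raised = False
except SyntaxError:
    syntax_error_raised = True
print(syntax_error_raised)

# operator.and_ is the bitwise and, operator.not_ is the logical negation
print(operator.and_(0b0010, 0b1010), operator.not_(True))
```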
A second solution is to use other operators that we can overload. There are a few that don't have an immediate use for dataframes and that can be considered, in particular the bitwise operators.
Let's have a look at the original bitwise operators:
binary_value_1 = 0b0010
binary_value_2 = 0b1010
result_and = binary_value_1 & binary_value_2
result_or = binary_value_1 | binary_value_2
result_not = ~ binary_value_1
print(f'binary and: {binary_value_1:04b} & {binary_value_2:04b} = {result_and:04b}')
print(f'binary or: {binary_value_1:04b} | {binary_value_2:04b} = {result_or:04b}')
print(f'binary inverse: ~ {binary_value_1:04b} = {result_not & 0b1111:04b}')
Python provides these operators to operate at the bit level. The & operator is like an and, but it doesn't operate on the whole value, but on each individual bit of it. We can see in the result that there is a 1 in the positions where there is a 1 in both the first value and the second. The | operator (bitwise or) is analogous: there is a 1 where there is a 1 in the first value or there is a 1 in the second.
Finally, the inverse just turns every 0 into a 1, and the other way round.
In pandas, initially, there was not much use for those in their original meaning. So, they could be borrowed as the and, or and not operators for dataframes (or series).
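To give an idea of how a library can borrow these operators, here is a toy sketch. MiniSeries is a hypothetical, heavily simplified stand-in for a column of values. It is not how pandas is implemented, it only mimics the behavior described here: element-wise results for ==, &, | and ~, and an error when a single truth value is requested.

```python
class MiniSeries:
    """Hypothetical, heavily simplified stand-in for a column of values."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Element-wise comparison against a single value
        return MiniSeries(v == other for v in self.values)

    def __and__(self, other):       # series & series
        return MiniSeries(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):        # series | series
        return MiniSeries(a or b for a, b in zip(self.values, other.values))

    def __invert__(self):           # ~series
        return MiniSeries(not v for v in self.values)

    def __bool__(self):
        # A whole column has no single obvious truth value
        raise ValueError('the truth value of a MiniSeries is ambiguous')


airline = MiniSeries(['DL', 'AA', 'DL'])
first_class = MiniSeries([True, False, False])

# The borrowed bitwise operators work row by row
selected = (airline == 'DL') & (~first_class)
print(selected.values)

# The keyword operators need a single True/False, so they fail
try:
    (airline == 'DL') and (~first_class)
    keyword_and_failed = False
except ValueError:
    keyword_and_failed = True
print(keyword_and_failed)
```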
The result with the previous example would look like:
df['airline'] == 'DL' & ~ df['first_class']
This looks correct, and & and ~ are indeed the operators supported by pandas, but this expression is not equivalent to:
df['airline'] == 'DL' and not df['first_class']
The reason is Python operator precedence, which is the order in which operators are evaluated.
See this example:
1 + 2 * 3
If operators were evaluated from left to right, the previous result would be 1 + 2 = 3 and then 3 * 3 = 9.
But the 2 * 3 is actually happening first, so the result is 7.
Something similar is happening with the previous example.
We would expect that the first to evaluate is:
df['airline'] == 'DL'
And once this is computed, the and is performed with the second part of the expression (the condition on not being first class).
This is what would actually happen when using the Python and operator. But the bitwise & has a different precedence: it binds more tightly than ==.
So how things are actually being executed is:
df['airline'] == ('DL' & (~ df['first_class']))
So, the and is not performed between the two conditions, but between DL and the second condition. This makes pandas conditions very tricky, and it's easy to get unexpected results.
The solution for pandas is to be explicit about the order by using parentheses:
(df['airline'] == 'DL') & (~ df['first_class'])
This will ensure that the operators are evaluated in the expected order.
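If you want to check how Python groups an expression without running it, the standard ast module can parse it and report which operation sits at the top, meaning the one applied last. In this sketch top_level_node is a helper name I made up, and a, b and c are placeholders for the pandas expressions above.

```python
import ast


def top_level_node(expression):
    """Return the name of the outermost operation Python sees in an expression."""
    return type(ast.parse(expression, mode='eval').body).__name__


# `a`, `b` and `c` are placeholder names; the code is parsed, never executed
print(top_level_node('a == b and not c'))   # BoolOp: `and` is applied last
print(top_level_node('a == b & ~c'))        # Compare: `==` is applied last
print(top_level_node('(a == b) & (~c)'))    # BinOp: `&` is applied last
```

Only the version with parentheses has & as the outermost operation, which is the grouping we wanted.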
I understand why pandas was designed this way, and I see value in having a more compact representation of conditions. But this feels quite hacky and counter-intuitive, and I would personally prefer the syntax presented before:
pandas.and_(df['airline'] == 'DL',
pandas.not_(df['first_class']))