Unexpected Integer behaviour in RLang
Understanding why “100000” and 100000 are not equal… and a deep dive into the
factor
logic of RLang.
Introduction
Why was I even working with RLang
? That’s a good question! The data scientists that I worked with were all quite familiar with the language and were quite happy with the performance of a particular ML algorithm implementation in RLang
.
Backstory
In most model training environments you will have data which will need to be transformed into factors. What does this mean? For categorical data types to help the model capture the differences in features correctly they are assigned a factor - all of them being on the same order of magnitude, and having a few other characteristics, such as improving data storage required.
In our case we had a big data frame with a few columns that contain categorical data, such as day-of-week, month and a zone_id column, which we want to transform into factors.
df <- data.table(
dow = c("Mon", "Tue", "Mon", "Fri"),
zone = c(100, 100, 101, 200),
x = 1:4
)
Transforming dow
into a factor we would have to do the following:
df$dow_factor <- factor(df$dow, levels=unique(df$dow))
This works fine for character
or string
data types, however, when you want to transform zone_id
into a factor, well…
Let the fun begin!
In our particular case, we wanted to include a fallback value of 0
in our levels. So our transformation code would be:
df$zone_factor <- factor(df$zone, levels=c(0, unique(df$zone)))
This would seem like perfectly fine code that should work - to the untrained eye. From here on I will go very deep into the inner workings of how factor
and data types work in this particular case - if you just want to see the TL;DR - browse to the end.
In R when you type 0
in your code you don’t get a value of type integer
rather you get
a numerical
. Why is this problematic? Because, under the hood, deep in the C code that transforms numerical
into string
there’s something which does not correctly transforms them back to string
. This means that match("100000", 100000L) == TRUE
, because the 2nd value is now an integer
which gets correctly transformed back into a string.
How we stumbled into this
However, little did we know thatclass(0) == numerical
, class(unique(df$zone)) == integer
, however calling c
will cast all of the following values to the type of the first value in the list; meaning that class(c(0, unique(x))) == numerical
. Under the hood factor
calls match
to find the value in the levels that matches the current array element. factor
has some code that eventually ends up transforming the current array element into a character
data type (by calling as.character).
This means we were now in a position where we were calling match(<character>, <numerical>)
, and this is how we found out that is.na(match("100000", 100000)) == TRUE
, meaning no match.
Lesson learnt
Do not use factor(<type>, levels=<other_type>)
or match(<type>, <other_type>)
and strive to use factor(<character>, levels=<character>)
or R somehow find a way to change your data types into character, but not do this consistently and you’ll end up in a similar debugging nightmare.
Code to exemplify the issue with match
:
> is.na(match("100000", 100000))
[1] TRUE
> is.na(match("100001", 100001))
[1] FALSE
> is.na(match("100000", 100000L))
[1] FALSE
First two comparisons is between character
and numeric
- and no match is found, whilst the 3rd comparison is between character
and integer
and a match is found. I have not tested this extensively but I found that all multiples of 100k are facing this problem…
I’m sure someone braver than me will dive deeper and understand this problem!