How to handle factors in R

Many people seem to have a problem with factors. This might be because factors do not always behave like you expect them to.

Factors are vector-like objects whose items can only take values from a fixed set, the levels.

This post gives a very basic introduction to factors.

Introduction and usage

Factors can be created from vectors.

chr.v<-c(rep("chr1", 10), rep("chr2", 20), rep("chr3", 8), rep("chr4", 19), rep("chr5", 15), rep("chr6", 19))
chr.f<-factor(chr.v)

Now we have a vector chr.v and a factor chr.f containing the same values. In the vector they are strings and because of that the content looks like this:

> chr.v
 [1] "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr1" "chr2" "chr2" ...

and the factor looks like this:

> chr.f
 [1] chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr1 chr2 chr2 ...
Levels: chr1 chr2 chr3 chr4 chr5 chr6

Notice the additional line at the bottom? These are the levels of the factor. You can also get them by calling levels():

> levels(chr.f)
[1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
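To see why the levels matter, it can help to peek at how a factor is stored. The following is a small sketch on a shortened version of the vector from above (only three chromosomes, chosen here to keep the output short):

```r
# Shortened example data (an assumption for brevity, not the full vector above)
chr.v <- c(rep("chr1", 10), rep("chr2", 20), rep("chr3", 8))
chr.f <- factor(chr.v)

class(chr.f)     # "factor"
typeof(chr.f)    # "integer": each item is stored as an integer code
nlevels(chr.f)   # 3

# The integer code of item 11 points into the levels
code <- as.integer(chr.f)[11]
code                 # 2
levels(chr.f)[code]  # "chr2"
```

So a factor is essentially a vector of integer codes plus a lookup table of allowed values.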

To get a summary you can call table(chr.f) and you get this information:

chr.f
chr1 chr2 chr3 chr4 chr5 chr6 
  10   20    8   19   15   19 

This also works for plain vectors, so why are factors useful? One reason is that you can define your own order. With a normal (unordered) factor, you cannot use comparison operators other than == and !=.

chr.f[1] < chr.f[2]

Output:

[1] NA
Warning message:
In Ops.factor(chr.f[1], chr.f[2]) : < not meaningful for factors

If you create a factor with the parameter ordered=TRUE, then you can compare its values. If you don't set the parameter levels, R uses the sorted unique values of the input, sort(unique(chr.v)), as the order of the items, which results in alphanumeric sorting in our case.

chr.f.order<-factor(chr.v, ordered=TRUE)
chr.f.order[2] < chr.f.order[15]

returns TRUE.
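Ordered factors become more interesting once the order is not alphabetical. Here is a minimal sketch with made-up size labels (the vector sizes and its values are purely illustrative):

```r
# Hypothetical ordinal data; the levels argument fixes the order explicitly
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

sizes[1] < sizes[2]          # TRUE: "small" comes before "large"
as.character(max(sizes))     # "large"
as.character(sort(sizes))    # "small" "small" "medium" "large"
```

Without the explicit levels, "large" would come first because it sorts first alphabetically.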

Levels

There is a lot more that can be done with levels. For example, they prevent you from adding unwanted values to your factor:

chr.f[1]="chrZ"
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = "chrZ") :
  invalid factor level, NA generated

Take care: the value is replaced by NA, and the old value is gone even though the assignment only produced a warning!
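One way to avoid losing data like this is to check the levels before assigning. This is only a sketch of the idea; the variable new.value and the message text are my own choices:

```r
chr.f <- factor(c("chr1", "chr2", "chr3"))
new.value <- "chrZ"   # hypothetical value we would like to store

# Only assign if the value is an existing level; otherwise leave the factor alone
if (new.value %in% levels(chr.f)) {
  chr.f[1] <- new.value
} else {
  message("'", new.value, "' is not a level; factor left unchanged")
}

chr.f   # still chr1 chr2 chr3, no NA
```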

When you initialise your factor, you can add a level that does not occur in the vector:

chr.f.levels<-factor(chr.v, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7"))
table(chr.f.levels)

returns

chr.f.levels
chr1 chr2 chr3 chr4 chr5 chr6 chr7 
  10   20    8   19   15   19    0 

You can also define fewer levels than there are in the data. All other values are set to NA:

chr.f.levels2<-factor(chr.v, levels=c("chr1", "chr2", "chr3", "chr4"))
table(chr.f.levels2)

returns

chr.f.levels2
chr1 chr2 chr3 chr4 
  10   20    8   19 
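Note that table() silently drops the NA values by default, so it is easy to miss how many items were lost. A short sketch of how to count them, rebuilding the example data so that it runs on its own:

```r
chr.v <- c(rep("chr1", 10), rep("chr2", 20), rep("chr3", 8),
           rep("chr4", 19), rep("chr5", 15), rep("chr6", 19))
chr.f.levels2 <- factor(chr.v, levels = c("chr1", "chr2", "chr3", "chr4"))

# chr5 (15 items) and chr6 (19 items) are not among the levels
sum(is.na(chr.f.levels2))               # 34

# useNA = "ifany" adds an NA column to the table when NAs are present
table(chr.f.levels2, useNA = "ifany")
```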

As I mentioned above, the order is created by the sort() method. This might not always be what I want, e.g. in bioinformatics, chromosomes are often called chr1, chr2, chr3, ..., chr10, chr11, ... chr21, chr22, chrX, chrY and the order given by sort would be:

"chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr2"  "chr20" "chr21" "chr22" "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chrX"  "chrY"

This might be correct in an alphanumerical sense, but it might not be what you want. Consider the next two factors, both built from a character vector chromosomes that holds these names:

chr.f.wrong.order<-factor(chromosomes, ordered=TRUE)

chr.f.order2<-factor(chromosomes, ordered=TRUE, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrY"))

The first one will look like this:

 [1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12 chr13 chr14 chr15 chr16 ...
24 Levels: chr1 < chr10 < chr11 < chr12 < chr13 < chr14 < chr15 < chr16 < chr17 < chr18 < chr19 < chr2 < ... < chrY

The second one like this:

 [1] chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chr10 chr11 chr12 chr13 chr14 chr15 chr16 ...
24 Levels: chr1 < chr2 < chr3 < chr4 < chr5 < chr6 < chr7 < chr8 < chr9 < chr10 < chr11 < chr12 < chr13 < ... < chrY

Here, with chromosomes, don't think about lower or greater than, but before and after. This feature can be useful for any kind of ordinal data like grades.
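Typing all 24 levels by hand is tedious, so here is a sketch that builds them with paste0(). I am assuming that chromosomes simply holds each name once, in natural order; the helper name chr.levels is my own:

```r
# Build "chr1" ... "chr22", "chrX", "chrY" without typing them out
chr.levels <- paste0("chr", c(1:22, "X", "Y"))
chromosomes <- chr.levels   # assumed example data

chr.f.wrong.order <- factor(chromosomes, ordered = TRUE)
chr.f.order2 <- factor(chromosomes, ordered = TRUE, levels = chr.levels)

# Items 2 and 10 are "chr2" and "chr10"
chr.f.order2[2] < chr.f.order2[10]             # TRUE: chr2 comes before chr10
chr.f.wrong.order[2] < chr.f.wrong.order[10]   # FALSE: "chr10" sorts before "chr2"
```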

In case you want to add a new level, you can do this:

levels(chr.f)<-c(levels(chr.f), "chrZ")
chr.f[1]<-"chrZ"

Now we can add the mysterious chromosome Z to our factor.
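The opposite direction works too: levels that are no longer used by any item can be removed with droplevels(). A small sketch on a three-item factor (shortened example data of my own):

```r
chr.f <- factor(c("chr1", "chr2", "chr3"))

# Add the new level, then overwrite the only "chr1" item
levels(chr.f) <- c(levels(chr.f), "chrZ")
chr.f[1] <- "chrZ"
levels(chr.f)   # "chr1" "chr2" "chr3" "chrZ" ("chr1" is now unused)

# droplevels() removes every level that no item uses
chr.f <- droplevels(chr.f)
levels(chr.f)   # "chr2" "chr3" "chrZ"
```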

Above we saw that it is possible to define fewer levels than there are in the data. If I want to remove only a few levels from many, it might be more practical to exclude them explicitly, using the exclude parameter.

chr.f.exclude<-factor(chr.v, exclude=c("chr4"))

Now there is no "chr4" in the factor and all these values are replaced by NA.

What if I want to have NA as a level? I do not see any reason for it but even that is possible!

chr.f.na<-factor(c(1, 2, NA, 3, 4), exclude=NULL)
table(chr.f.na)

returns this:

chr.f.na
   1    2    3    4 <NA> 
   1    1    1    1    1 

(otherwise NA would not show up here as you can see in some of the examples above!)

There is one more parameter, called labels. You can use it like this:

chr.f.labels<-factor(chr.v, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6"), labels=paste("Chromosome", 1:6))

length(which(chr.f.labels=="chr1")) # 0
length(which(chr.f.labels=="Chromosome 1")) # 10

The factor looks similar to the above chr.f but instead of "chr1", "chr2", ... it contains the longer name "Chromosome 1", ...
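The levels and labels are matched by position, which is easy to get wrong. Here is a tiny sketch with made-up data (the vector sex.raw and its codes are purely illustrative):

```r
# Hypothetical short codes that we want to replace with readable names
sex.raw <- c("m", "f", "m", "f", "f")

# The first label belongs to the first level, the second to the second
sex.f <- factor(sex.raw, levels = c("f", "m"), labels = c("female", "male"))

levels(sex.f)        # "female" "male"
as.character(sex.f)  # "male" "female" "male" "female" "female"
```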

Conversions

Consider a factor that contains numbers (might be because the input was read that way or some other reason).

number.factor<-factor(c(1, 3, 2, 4, 8, 10))
table(number.factor)

looks like this:

number.factor
 1  2  3  4  8 10 
 1  1  1  1  1  1 

Now you might want to convert it to its numeric values.

as.numeric(number.factor)

gives this output:

[1] 1 3 2 4 5 6

But why? Where are 8 and 10? Where did 5 and 6 come from? If you look at the order of 2 and 3, you might guess what happened here: as.numeric() applied to a factor returns each value's position in the factor's levels (this can be useful, but for now it is not helpful!).

To achieve what we actually want, there are two possible solutions:

  • as.numeric(levels(number.factor))[number.factor]
  • as.numeric(as.character(number.factor))

The first is more efficient; the second seems more intuitive (at least to me). Both give the same result:

[1]  1  3  2  4  8 10
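Put together as a runnable sketch (the names via.levels and via.chars are my own):

```r
number.factor <- factor(c(1, 3, 2, 4, 8, 10))

# Positions in the levels, not the original numbers
as.numeric(number.factor)   # 1 3 2 4 5 6

# Convert the levels once, then index them with the factor's integer codes
via.levels <- as.numeric(levels(number.factor))[number.factor]

# Convert every item to a string first, then to a number
via.chars <- as.numeric(as.character(number.factor))

via.levels                       # 1 3 2 4 8 10
identical(via.levels, via.chars) # TRUE
```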

Summary

  • Factors are similar to vectors, but more useful for ordinal data
  • They prevent the programmer from adding values that should not be there
  • Levels can be removed or added at any time
  • A factor can be converted to integers from 1 to the number of levels, which is very useful for plotting!
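As an illustration of the last point, the integer codes can be used to pick one colour per level. This is just a sketch; the data, the colour names and the variable names are all my own choices:

```r
groups <- factor(c("a", "b", "a", "c"))       # hypothetical group labels
palette.v <- c("red", "blue", "darkgreen")    # one colour per level

# The integer code of each item selects its colour
point.colors <- palette.v[as.integer(groups)]
point.colors   # "red" "blue" "red" "darkgreen"

# The result can then be passed to a plotting function, for example:
# plot(1:4, col = point.colors, pch = 19)
```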
