r - Create a new column that counts the number of a sub-string in a string column?


New to R here. I have a problem to solve: I need to create new columns that hold 1 if a sub-string appears one or more times in a string column, and 0 otherwise. Like this:

existing column           new col (true if apple)    new col (true if banana)
apple, apple, orange      1                          0
banana, banana, orange    0                          1
apple, banana, orange     1                          1

Can anyone help me with this? Thanks in advance.

So I thought you wanted columns of counts (not whether the strings were contained) the first time I read the question (the previous edit), but it's sort of useful code anyway, so I left it. Here are options for both base R and the stringr package:

First, let's make a sample data.frame similar to your data:

# stringsAsFactors = FALSE would be smart here, but let's not assume...
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))

which looks like

> df
           x
1 a, b, c, a
2    b, b, c
3       d, a

base R

Use strsplit to make a list of vectors of the separated strings, using as.character to coerce factors to a useful form:

list <- strsplit(as.character(df$x), ', ') 

Then collect the unique strings into a vector:

lvls <- unique(unlist(list)) 

making contains columns

Loop over the rows of the data.frame/list with sapply. (All the sapply calls in this answer could be replaced with for loops, but that's generally considered poor style in R for speed reasons.) Test whether each of the unique strings is in each row, and change the result to integer format. Set the result (transposed) as new columns of df, one for each unique string.

df[, lvls] <- t(sapply(1:nrow(df), function(z){as.integer(lvls %in% list[[z]])}))

> df
           x a b c d
1 a, b, c, a 1 1 1 0
2    b, b, c 0 1 1 0
3       d, a 1 0 0 1

To keep the values as Boolean TRUE/FALSE instead of integers, just remove as.integer.
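As a minimal sketch of that variation, assuming the same sample data as above (the names `parts` and `df_lgl` are hypothetical, chosen here to avoid reusing `list` and overwriting `df`):

```r
# Sample data copied from the answer; stringsAsFactors = FALSE is an assumption
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$x, ', ')       # one character vector per row
lvls  <- unique(unlist(parts))      # "a" "b" "c" "d"

# Same test as before, but without as.integer, so the columns stay logical
df_lgl <- df
df_lgl[, lvls] <- t(sapply(parts, function(p) lvls %in% p))
```

Logical columns can be handy if the next step is filtering, e.g. `df_lgl[df_lgl$a, ]` keeps only the rows that contain "a".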

making count columns

Loop over the rows of the data.frame/list with the outer sapply, while the inner one loops over the unique strings and counts the occurrences of each by summing the TRUE values. Set the result (transposed) as new columns of df, one for each unique string.

df[, lvls] <- t(sapply(1:nrow(df), function(z){
    sapply(seq_along(lvls), function(y){sum(lvls[y] == list[[z]])})
}))

> df
           x a b c d
1 a, b, c, a 2 1 1 0
2    b, b, c 0 2 1 0
3       d, a 1 0 0 1
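If the nested sapply is hard to follow, here is a rough sketch of the same counting logic written as explicit for loops. It is only meant to show what each level of the loop does; the names `parts`, `counts` and `df_counts` are hypothetical, and the sample data is repeated so the snippet runs on its own.

```r
# Sample data copied from the answer; stringsAsFactors = FALSE is an assumption
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$x, ', ')       # one character vector per row
lvls  <- unique(unlist(parts))      # "a" "b" "c" "d"

# Pre-allocate an integer matrix: one row per row of df, one column per string
counts <- matrix(0L, nrow = nrow(df), ncol = length(lvls),
                 dimnames = list(NULL, lvls))

for (z in seq_len(nrow(df))) {      # outer loop: rows
  for (y in seq_along(lvls)) {      # inner loop: unique strings
    # Comparing gives TRUE/FALSE per element; summing counts the TRUEs
    counts[z, y] <- sum(lvls[y] == parts[[z]])
  }
}

df_counts <- cbind(df, counts)
```

The sapply version is more compact, but the two should produce the same count columns.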

stringr

stringr can make these tasks more straightforward.

First, find the unique strings in df$x. Split the strings with str_split (which can take a factor), flatten them into a vector with unlist, and find the unique ones:

library(stringr)
lvls <- unique(unlist(str_split(df$x, ', ')))

making contains columns

str_detect allows us to loop over the unique strings only, not the rows:

df[, lvls] <- sapply(lvls, function(y){as.integer(str_detect(df$x, y))}) 
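For the data in the question, the same idea might look like the sketch below. The data frame `fruit` and the vector `targets` are hypothetical names, and the column values are copied from the question's example. Note that str_detect treats its pattern as a regular expression by default and matches anywhere in the string, so wrapping the pattern in fixed() (a stringr helper for literal matching) is one way to guard against special characters; it still matches sub-strings rather than whole comma-separated items.

```r
library(stringr)

# Data copied from the question; stringsAsFactors = FALSE is an assumption
fruit <- data.frame(x = c('apple, apple, orange',
                          'banana, banana, orange',
                          'apple, banana, orange'),
                    stringsAsFactors = FALSE)

# Only the sub-strings the question asks about
targets <- c('apple', 'banana')

# One integer column per target: 1 if it appears at least once, else 0
fruit[, targets] <- sapply(targets, function(y) {
  as.integer(str_detect(fruit$x, fixed(y)))
})
```

Because sapply keeps the names of a character vector, the new columns are named after the targets themselves.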

making count columns

str_count simplifies our syntax dramatically, again looping only over lvls:

df[,lvls] <- sapply(lvls, function(y){str_count(df$x, y)}) 

The results for both are identical to those from base R above.

