New to R here. I have a problem to solve: I need to create new columns that contain 1 if a substring appears one or more times in a string column. Like this:

existing column          new col (true if apple)   new col (true if banana)
apple, apple, orange     1                         0
banana, banana, orange   0                         1
apple, banana, orange    1                         1

Can anyone help me with this? Thanks in advance.
So I thought you wanted columns of counts (not whether the strings are contained) the first time I read the question (the previous edit), but it's sort of useful code anyway, so I left it. Here are options for both base R and the stringr package:
First, let's make a sample data.frame with similar data:

    # stringsAsFactors = FALSE would be smart here, but let's not assume...
    df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))

which looks like

    > df
               x
    1 a, b, c, a
    2    b, b, c
    3       d, a
Base R

Use strsplit to make a list of vectors of separated strings, using as.character to coerce factors to a useful form,

    list <- strsplit(as.character(df$x), ', ')
then make a vector of the unique strings:

    lvls <- unique(unlist(list))
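To make the two steps above concrete, here is a minimal sketch that inspects the intermediate objects. The name `tokens` is an illustrative stand-in for `list` (chosen only to avoid masking the base function of the same name); it is not part of the original answer.

```r
# Same sample data as above
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))

# One character vector of pieces per row (`tokens` is a hypothetical name)
tokens <- strsplit(as.character(df$x), ', ')
str(tokens)

# Flatten every piece into one vector, then keep each distinct value once
lvls <- unique(unlist(tokens))
print(lvls)
# "a" "b" "c" "d"
```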
Making contains columns

Loop over the rows of the data.frame/list with sapply. (All the sapply functions in this answer could be replaced with for loops, but that's considered poor style in R for speed reasons.) Test whether the unique strings are in each row, and change the result to integer format. Set the (transposed) result to new columns of df, one for each unique string.

    df[, lvls] <- t(sapply(1:nrow(df), function(z){as.integer(lvls %in% list[[z]])}))

    > df
               x a b c d
    1 a, b, c, a 1 1 1 0
    2    b, b, c 0 1 1 0
    3       d, a 1 0 0 1
To keep the values as Boolean TRUE/FALSE instead of integers, just remove as.integer.
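As a rough illustration, the same idea can be applied to the data from the question, testing only for the two fruits asked about instead of every unique string. The names `fruit`, `existing`, `parts`, and `targets` are assumptions made for this sketch, not anything from the original post.

```r
# Data from the question (column name `existing` is hypothetical)
fruit <- data.frame(
  existing = c('apple, apple, orange',
               'banana, banana, orange',
               'apple, banana, orange'),
  stringsAsFactors = FALSE
)

# Split each row into its comma-separated pieces
parts <- strsplit(fruit$existing, ', ', fixed = TRUE)

# Only the strings we want indicator columns for
targets <- c('apple', 'banana')

# One integer column per target: 1 if the target is among the row's pieces
fruit[targets] <- lapply(targets, function(target) {
  vapply(parts, function(p) as.integer(target %in% p), integer(1))
})

print(fruit)
```

Using vapply rather than sapply here is a design choice for the sketch: it guarantees one integer per row, so an unexpected input raises an error instead of silently changing the shape of the result.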
Making count columns

Loop over the rows of the data.frame/list with the outside sapply, while the inner one loops over the unique strings in each and counts the occurrences by summing TRUE values. Set the (transposed) result to new columns of df, one for each unique string.

    df[, lvls] <- t(sapply(1:nrow(df), function(z){
        sapply(seq_along(lvls), function(y){sum(lvls[y] == list[[z]])})
    }))

    > df
               x a b c d
    1 a, b, c, a 2 1 1 0
    2    b, b, c 0 2 1 0
    3       d, a 1 0 0 1
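The nested sapply above could also be written with a single loop. This is an alternative sketch, not the original answer's method: it swaps the inner loop for base R's match and tabulate, and uses the hypothetical name `tokens` in place of `list`.

```r
df <- data.frame(x = c('a, b, c, a', 'b, b, c', 'd, a'))
tokens <- strsplit(as.character(df$x), ', ')
lvls <- unique(unlist(tokens))

# match() turns each piece into its position in lvls;
# tabulate() counts how often each position occurs, padding with zeros
counts <- t(sapply(tokens, function(v) {
  tabulate(match(v, lvls), nbins = length(lvls))
}))

# One new column per unique string, as in the approach above
df[, lvls] <- counts
print(df)
```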
stringr

stringr can make these tasks more straightforward.

First, find the unique strings in df$x. Split the strings with str_split (which can take a factor), flatten them into a vector with unlist, and find the unique ones:

    library(stringr)
    lvls <- unique(unlist(str_split(df$x, ', ')))
Making contains columns

str_detect allows us to loop over only the unique strings, not the rows:

    df[, lvls] <- sapply(lvls, function(y){as.integer(str_detect(df$x, y))})
Making count columns

str_count simplifies our syntax dramatically, again looping only over lvls:

    df[, lvls] <- sapply(lvls, function(y){str_count(df$x, y)})
The results of both are identical to those in base R above.
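One thing worth keeping in mind with the stringr versions: str_detect and str_count treat the pattern as a regular expression and match it anywhere in the string, so a label that is part of another label (or that contains regex metacharacters) may be matched more often than intended. A minimal sketch on the question's data, wrapping each pattern in stringr's fixed() so it is matched literally; the names `fruit`, `existing`, and `targets` are assumptions for illustration.

```r
library(stringr)

# Data from the question (column name `existing` is hypothetical)
fruit <- data.frame(
  existing = c('apple, apple, orange',
               'banana, banana, orange',
               'apple, banana, orange'),
  stringsAsFactors = FALSE
)
targets <- c('apple', 'banana')

# fixed() matches the text literally instead of as a regular expression
contains <- sapply(targets, function(y) {
  as.integer(str_detect(fruit$existing, fixed(y)))
})
counts <- sapply(targets, function(y) {
  str_count(fruit$existing, fixed(y))
})

print(contains)
print(counts)
```

Literal matching still finds substrings, so if labels can overlap (for example 'apple' and 'pineapple'), the split-based base R approach is the safer choice.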