I read a text of a few hundred words into R (using read_file on a .txt file). Some lines of the text contain only short segments (e.g. 'Figure 1') before the \n. I'd like to replace these with a blank \n. So, in the example below, I'd like to gsub out the last 3 lines. I think they'll be under ~10 words, and none will have a period, except possibly at the end. They start and end with \n.
Some lines are long. They might contain short segments (like the preceding sentence), but they'll be long overall and have at least 2 sentence closings (abnormally long sentences aside). Others are short, like these:
Figure 1: Description
Materials and Methods
Introduction
I've tried:
gsub("\\n(.{90,}[\\.\\?\\:].*){2,}\\n$", "\n", string1, perl = TRUE)
The regex is meant to work like this: after a newline, I want some characters (at least 50) to appear before punctuation (. ? :), and I want that pattern to repeat at least twice before the next newline. I want to add the (?gmi) modifiers (at least, it works in regex101 with them), but I can't find how to add them in R. I think that with the modifiers the code above would work, but other options (e.g. gsub on \n (text) \n with fewer than 90 characters and at most 1 of ':.?', or similar) might be interesting.
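On adding the modifiers: a minimal sketch, assuming base R's gsub with perl = TRUE, where PCRE inline flags can be written at the start of the pattern. The g flag has no inline form and should not be needed, since gsub already replaces every match, so (?gmi) would become (?mi), or just (?m) when case does not matter. The sample text and the pattern are illustrative assumptions that follow the "fewer than 90 characters and at most one of .?:" idea rather than the original pattern.

```r
# Sample text (made up for illustration): two headings, one caption, one long line
text <- paste(
  "Introduction",
  "This is a long line with several closings. It has a second sentence: and a third?",
  "Figure 1: Description",
  "Materials and Methods",
  sep = "\n"
)

# (?m) switches on multiline mode, so ^ and $ match at every line boundary.
# The lookahead limits the match to lines of at most 89 characters; the rest
# allows at most one of . ? : on the line, then consumes the trailing newline.
short_line <- "(?m)^(?=.{0,89}$)[^\\.\\?\\:\\n]*[\\.\\?\\:]?[^\\.\\?\\:\\n]*$\\n?"

# gsub is already global, so every short line is removed in one call
cleaned <- gsub(short_line, "", text, perl = TRUE)
```

Note that this variant deletes the short lines (and blank lines) outright instead of leaving an empty line behind; using "\n" as the replacement would keep a blank line in their place.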
Update: I think I can use something like:
str_replace_all(test, regex("^\\n(.{50,}[\\.\\?\\:].*){2,}\\n$", multiline = TRUE), "\n")
or stri_opts_regex from stringi to add options... but I'm not clear on how (or if it'll work).
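A hedged sketch of the stringr route, assuming the stringr package is installed: regex() takes the flags as arguments (multiline, ignore_case), so no inline modifier is needed. The sample text and the pattern are assumptions made for illustration; the pattern targets short lines with at most one of .?: rather than reproducing the pattern from the update.

```r
library(stringr)

# Made-up sample: a heading, a long line with several closings, a caption
text <- paste(
  "Introduction",
  "The first sentence of a long line ends here. A second one follows: is that enough?",
  "Figure 1: Description",
  sep = "\n"
)

# multiline = TRUE makes ^ and $ match at each line boundary
short_line <- regex(
  "^(?=.{0,89}$)[^\\.\\?\\:\\n]*[\\.\\?\\:]?[^\\.\\?\\:\\n]*$\\n?",
  multiline = TRUE
)

# str_replace_all replaces every match, so no global flag is required
cleaned <- str_replace_all(text, short_line, "")
```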
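Thanks to Carlos in the comments, I gave up on the regex and have used strsplit:
library(stringr)  # provides str_count()

# Split the text into one element per line
holding <- unlist(strsplit(y, "\n"))
# Blank out lines under 75 characters, and lines under 150 characters
# that contain fewer than 3 sentence closings (. : ?)
holding <- lapply(holding, function(bits)
  ifelse(nchar(bits) < 75, "",
         ifelse(nchar(bits) < 150,
                ifelse(sum(str_count(bits, "\\."), str_count(bits, "\\:"), str_count(bits, "\\?")) < 3, "", bits),
                bits)))
# Keep only the non-empty elements
holding <- holding[holding != ""]
# Recombine into y
y <- paste(holding, collapse = "\n")
Not terribly elegant, but it does what I want without the need for regex.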
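The same split-and-filter idea can probably be written in a vectorised way with base R only. This is a sketch, not a drop-in replacement: drop_short_lines is a hypothetical helper name, and its default thresholds are copied from the strsplit code above.

```r
# Hypothetical helper; thresholds mirror the strsplit code above
drop_short_lines <- function(text, min_chars = 75, long_chars = 150, min_closings = 3) {
  # One element per line
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  n_chars <- nchar(lines)
  # Count '.', ':' and '?' per line by deleting every other character
  n_closings <- nchar(gsub("[^?.:]", "", lines))
  # Keep long lines, and mid-length lines with enough sentence closings
  keep <- n_chars >= long_chars |
    (n_chars >= min_chars & n_closings >= min_closings)
  paste(lines[keep], collapse = "\n")
}
```

Usage would be along the lines of y <- drop_short_lines(y). Working on the whole vector at once avoids the nested ifelse calls and the list returned by lapply.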