regex - I want to find short strings on their own lines in an R character vector


I read a text of a few hundred words into R (using read_file on a .txt file). Some lines of the text contain short segments (e.g. 'Figure 1') before the \n. I'd like to replace these with a blank \n. So, in the example below, I'd gsub out the last 3 lines. I think they'll all be under ~10 words, and none will have a period except possibly at the end. They start and end with \n.

Some lines are long. They might contain short segments (like the preceding sentence), but they'll be longer overall and have at least 2 sentence closings (abnormally long sentences aside). Others are short, like these:

Figure 1: description
Materials and methods
Introduction.

I've tried:

gsub("\\n(.{90,}[\\.\\?\\:].*){2,}\\n$", "\n", string1, perl = TRUE)

and the regex works, i.e. after a newline, I want any characters (at least 50) to appear before punctuation (.?:), and I want this pattern to repeat at least twice before the next new line. I want to add the (?gmi) modifiers (at least, it works in regex101 with them), but I can't find how to add them in R. I think that with the modifiers the code above works, but other options (e.g. gsub on \n (text) \n with fewer than 90 characters and only one ':.?' or similar) might be interesting.
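On the modifier question, a minimal sketch in base R: with perl = TRUE, PCRE inline flags such as (?m) and (?i) can be written at the start of the pattern (combined as (?mi) if needed), and gsub() already replaces every match, so no separate "g" flag is required. The sample text and the 30-character cutoff below are made up for illustration and are not the thresholds from the question.

```r
# Hypothetical sample text; the real input would come from read_file().
txt <- paste0(
  "Introduction\n",
  "This is a long line. It has two sentences that close properly here.\n",
  "Figure 1: description\n"
)

# (?m) makes ^ match at the start of every line, not only at the start
# of the string. The pattern matches a whole line of at most 29
# characters plus its trailing newline (or the end of the string).
# The cutoff of 30 characters is an illustrative assumption.
cleaned <- gsub("(?m)^.{0,29}(\\n|\\z)", "", txt, perl = TRUE)

cat(cleaned)
```

Anchoring on ^ rather than on a leading \n means the first line of the text is handled too, and each removed line takes only its own newline with it.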

Update: I think I can use something like: str_replace_all(test, regex("^\\n(.{50,}[\\.\\?\\:].*){2,}\\n$", multiline = TRUE), "\n") or stri_opts_regex from stringi to add the options... but I'm not clear on how (or if it'll work).
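For the stringr and stringi route, a sketch assuming both packages are installed: stringr's regex() takes multiline = TRUE, and stringi accepts the same option through the opts_regex argument via stri_opts_regex(). The sample text and the 30-character cutoff are illustrative assumptions, not values from the question.

```r
library(stringr)
library(stringi)

# Hypothetical sample text for illustration only.
txt <- paste0(
  "Introduction\n",
  "This is a long line. It has two sentences that close properly here.\n",
  "Figure 1: description\n"
)

# A whole line of at most 29 characters, plus its newline or end of input.
pattern <- "^.{0,29}(\\n|\\z)"

# stringr: wrap the pattern in regex() to switch on multiline mode.
via_stringr <- str_replace_all(txt, regex(pattern, multiline = TRUE), "")

# stringi: pass the same option through opts_regex.
via_stringi <- stri_replace_all_regex(
  txt, pattern, "",
  opts_regex = stri_opts_regex(multiline = TRUE)
)
```

Both calls use the ICU regex engine, so the inline form "(?m)" at the start of the pattern should also work there in place of the option argument.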

Thanks to Carlos in the comments, I gave up on the regex and have used strsplit:

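library(stringr)  # for str_count()

# split the text into one element per line
holding <- unlist(strsplit(y, "\n"))
# blank out lines under 75 characters, and lines under 150 characters
# that have fewer than 3 of the characters . : ?
holding <- lapply(holding, function(bits) ifelse(nchar(bits) < 75, "", ifelse(nchar(bits) < 150, ifelse(sum(str_count(bits, "\\."), str_count(bits, "\\:"), str_count(bits, "\\?")) < 3, "", bits), bits)))
holding <- holding[holding != ""]  # drop the elements that are empty
# recombine into y
y <- paste(holding, collapse = "\n")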

Not terribly elegant, but it does what I want without the need for a regex.
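The same filter can also be written as one vectorised condition in base R, which may be easier to read than the nested ifelse() calls. This is only a sketch: drop_short_lines() is a hypothetical helper name, and its defaults simply mirror the 75, 150 and 3 used above.

```r
# Hypothetical helper: keep a line if it is long (>= long_min characters),
# or medium length (>= short_max characters) with at least min_closers
# occurrences of '.', ':' or '?'.
drop_short_lines <- function(text, short_max = 75, long_min = 150,
                             min_closers = 3) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  n_chars <- nchar(lines)
  # Count sentence-closing characters per line (0 when there is no match).
  n_closers <- lengths(regmatches(lines, gregexpr("[.:?]", lines)))
  keep <- n_chars >= long_min |
    (n_chars >= short_max & n_closers >= min_closers)
  paste(lines[keep], collapse = "\n")
}

# Made-up example lines, built with strrep() so their lengths are obvious.
short_line <- "Figure 1: description"               # under 75 characters
long_line  <- strrep("word ", 40)                   # 200 characters
mid_keep   <- paste0(strrep("a", 80), ". b: c?")    # 87 characters, 3 closers
mid_drop   <- strrep("a", 80)                       # 80 characters, 0 closers

y <- paste(c(short_line, long_line, mid_keep, mid_drop), collapse = "\n")
y_clean <- drop_short_lines(y)
```

Here y_clean keeps only long_line and mid_keep, joined by a newline.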

