Remove duplicate row from CSV file based on a string

I have scraped TripAdvisor review data and have a dataset with the following structure.

organization,address,reviewer,review title,review,review count,help count,attraction count,restaurant count,hotel count,location,rating date,rating
temple of tooth (sri dalada maligawa),address: sri dalada veediya kandy 20000 sri lanka,wowlao,temple tour,visits places of worship bring home me power of superstition. temple of tooth no exception. couldn't marvel @ fervor devotees praying. 1 tip though: shrine houses tooth  open twice day , it's best check these timings ...   more,89,48,7,0,0,vientiane,2 days ago,3
temple of tooth (sri dalada maligawa),address: sri dalada veediya kandy 20000 sri lanka,wowlao,temple tour,visits places of worship bring home me power of superstition. temple of tooth no exception. couldn't marvel @ fervor devotees praying. 1 tip though: shrine houses tooth  open twice day , it's best check these timings  though imagine crowds @ peak.,89,48,7,0,0,vientiane,2 days ago,3

As you can see, the first data row has a partial review and the second row has the full review.

What I want to achieve is to check for duplicates like this, remove the row that has the partial review, and keep the row that has the full review.

I see that every partial review ends with 'more'. Can this somehow be used to filter out the partial reviews?

How can I go about this using OpenCSV?

Note: it is not okay to commercially use the data of a web service without explicit permission.

Having said that: basically, OpenCSV gives you an enumeration of arrays. The arrays are your lines.

You need to copy your lines into another, more semantic data structure. Judging by your header row, I would create a bean like this.

public class TravelRow {
    String organization;
    String address;
    String reviewer;
    String reviewTitle;
    String review;
    // ...

    public TravelRow(String[] row) {
        // assign each row index to its property
        this.organization = row[0];
        // ...
    }
}

You may want to generate getXxx and setXxx methods for it.

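Now you need to find the primary key of a row; I suggest the organization. Iterate over the rows, create a bean for each one, and add it to a HashMap with the organization as the key.

If the organization is already in the HashMap, compare the current review with the stored review. If the new review is longer, or the stored one ends with "... more", replace the object in the map.

After iterating over all lines, you have a map with the reviews you want.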

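import com.opencsv.CSVReader; // older OpenCSV releases use au.com.bytecode.opencsv.CSVReader
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

Map<String, TravelRow> result = new HashMap<String, TravelRow>();
CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
reader.readNext(); // skip the header row
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    // nextLine[] is an array of the values from the line
    if (result.containsKey(nextLine[0])) {
        // compare the reviews
        if (reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4])) {
            // update the review (or store a new object instead)
            result.get(nextLine[0]).setReview(nextLine[4]);
        }
    } else {
        // create a TravelRow from the array using the constructor that takes the line
        result.put(nextLine[0], new TravelRow(nextLine));
    }
}
reader.close();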
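One caveat about the key: with the organization alone, only one review per organization survives, even when the reviews come from different reviewers. If your file holds several reviews per organization, a composite key may fit better. The following is only a sketch under that assumption. The class name `CompositeKeyDedup`, the helper `keyOf`, and the choice of organization, reviewer and review title as the key are illustrative and not part of OpenCSV. It works on rows that were already parsed (header excluded), for example collected from `readNext()`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CompositeKeyDedup {
    // Column positions, taken from the header row in the question.
    static final int ORGANIZATION = 0;
    static final int REVIEWER = 2;
    static final int REVIEW_TITLE = 3;
    static final int REVIEW = 4;

    // Hypothetical composite key: the same reviewer writing under the same
    // title about the same organization is treated as the same review.
    static List<String> keyOf(String[] row) {
        return Arrays.asList(row[ORGANIZATION], row[REVIEWER], row[REVIEW_TITLE]);
    }

    // rows: parsed lines without the header row.
    static Map<List<String>, String[]> deduplicate(List<String[]> rows) {
        // LinkedHashMap keeps the order in which keys were first seen.
        Map<List<String>, String[]> result = new LinkedHashMap<>();
        for (String[] row : rows) {
            List<String> key = keyOf(row);
            String[] stored = result.get(key);
            // Same rule as in the answer: a stored partial review
            // is replaced by a complete one.
            boolean replace = stored != null
                    && stored[REVIEW].endsWith("more")
                    && !row[REVIEW].endsWith("more");
            if (stored == null || replace) {
                result.put(key, row);
            }
        }
        return result;
    }
}
```

A `List<String>` is used as the key because it already implements `equals` and `hashCode` element by element, so no separator character has to be chosen.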

reviewNeedsUpdate(TravelRow row, String review) compares the new review with row.review and returns true if the new review is better. You can extend this function until it matches your needs.

private boolean reviewNeedsUpdate(TravelRow row, String review) {
    return row.review.endsWith("more") && !review.endsWith("more");
}
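As one possible extension, the check could also cover the "longer review wins" rule mentioned earlier and be stricter about what counts as a partial review. This is a rough sketch, not a definitive implementation. The class name `ReviewComparison`, the helper `isTruncated`, and the exact marker pattern (three dots, optional whitespace, then "more" at the end) are assumptions based on the sample row. It takes two plain strings instead of a TravelRow so that it stands on its own.

```java
import java.util.regex.Pattern;

class ReviewComparison {
    // Assumed shape of the truncation marker, based on the sample row:
    // three dots, optional whitespace, then "more" at the very end.
    static final Pattern TRUNCATED = Pattern.compile("\\.\\.\\.\\s*more\\s*$");

    static boolean isTruncated(String review) {
        return TRUNCATED.matcher(review).find();
    }

    // Returns true if the candidate review should replace the stored one.
    static boolean reviewNeedsUpdate(String stored, String candidate) {
        boolean storedTruncated = isTruncated(stored);
        boolean candidateTruncated = isTruncated(candidate);
        if (storedTruncated != candidateTruncated) {
            // Exactly one of them is truncated: keep the complete one.
            return storedTruncated;
        }
        // Both complete or both truncated: prefer the longer text.
        return candidate.trim().length() > stored.trim().length();
    }
}
```

Matching on the dots as well as on "more" avoids treating a complete review that merely ends with the word "more" as partial. If your data uses a different marker, only the pattern needs to change.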

Shah
