I have scraped TripAdvisor review data and have a dataset with the following structure.
organization,address,reviewer,review title,review,review count,help count,attraction count,restaurant count,hotel count,location,rating date,rating
temple of tooth (sri dalada maligawa),address: sri dalada veediya kandy 20000 sri lanka,wowlao,temple tour,visits places of worship bring home me power of superstition. temple of tooth no exception. couldn't marvel @ fervor devotees praying. 1 tip though: shrine houses tooth open twice day , it's best check these timings ... more,89,48,7,0,0,vientiane,2 days ago,3
temple of tooth (sri dalada maligawa),address: sri dalada veediya kandy 20000 sri lanka,wowlao,temple tour,visits places of worship bring home me power of superstition. temple of tooth no exception. couldn't marvel @ fervor devotees praying. 1 tip though: shrine houses tooth open twice day , it's best check these timings though imagine crowds @ peak.,89,48,7,0,0,vientiane,2 days ago,3

As you can see, the first row has the partial review and the second row has the full review.
What I want to achieve is to check for duplicates like this, remove the row that has the partial review, and keep the row that has the full review.
I see that every partial review ends with 'more'. Can this somehow be used to filter out the partial reviews?
How can I go about this using OpenCSV?
Note: it is not okay to commercially use the data of a web service without explicit permission.
Having said that: basically, OpenCSV gives you an enumeration of arrays. The arrays are your lines.
You need to copy your lines into another, more semantic data structure. Judging by your header row, I would create a bean like this.
    public class TravelRow {
        String organization;
        String address;
        String reviewer;
        String reviewTitle;
        String review;
        // ...

        public TravelRow(String[] row) {
            // assign each row index to a property
            this.organization = row[0];
            // ...
        }
    }

You may want to generate getXxx and setXxx functions for it.
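Now you need to find a primary key for a row; I suggest the organization. Iterate over the rows, create a bean for each one, and add it to a HashMap with the organization as the key.
If the organization is already in the HashMap, compare the current review with the stored review. If the new review is longer, or the stored one ends with "... more", replace the object in the map.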
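The comparison rule described above could be sketched roughly as follows. This is only an illustration: the class name `ReviewComparison`, the helper names, and the exact marker string `"... more"` (taken from the sample rows in the question) are assumptions, not part of OpenCSV.

```java
final class ReviewComparison {
    // Assumption: truncated reviews end with this marker, as in the sample data.
    static final String TRUNCATION_MARKER = "... more";

    static boolean isTruncated(String review) {
        return review.trim().endsWith(TRUNCATION_MARKER);
    }

    /** Returns true when the candidate review should replace the stored one. */
    static boolean isBetterReview(String stored, String candidate) {
        boolean storedTruncated = isTruncated(stored);
        boolean candidateTruncated = isTruncated(candidate);
        if (storedTruncated != candidateTruncated) {
            // Exactly one of the two is truncated: prefer the complete one.
            return storedTruncated;
        }
        // Both complete or both truncated: prefer the longer text.
        return candidate.trim().length() > stored.trim().length();
    }
}
```

Checking for the whole marker rather than just the word "more" avoids discarding a complete review that happens to end with that word.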
After iterating over all lines, you have a map with the reviews you want.
    Map<String, TravelRow> result = new HashMap<String, TravelRow>();
    CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
    String[] nextLine;
    while ((nextLine = reader.readNext()) != null) {
        // nextLine[] is an array of the values from the line
        if (result.containsKey(nextLine[0])) {
            // compare the review
            if (reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4])) {
                // update the review, or create a new object if you prefer
                result.get(nextLine[0]).setReview(nextLine[4]);
            }
        } else {
            // create the TravelRow from the array using the constructor that takes a line
            result.put(nextLine[0], new TravelRow(nextLine));
        }
    }

reviewNeedsUpdate(TravelRow row, String review) compares the review with row.review and returns true if the new review is better. You can extend this function until it matches your needs.
    private boolean reviewNeedsUpdate(TravelRow row, String review) {
        return row.review.endsWith("more") && !review.endsWith("more");
    }
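One caveat with keying on the organization alone is that reviews of the same place by different reviewers would overwrite each other. A minimal sketch of a variant that keys on organization, reviewer, and review title is shown below. It works on rows already read into `String[]` arrays (the shape that `CSVReader.readNext()` returns) with the header row already skipped. The class name `ReviewDeduplicator`, the composite key, and the column indexes are assumptions based on the header in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReviewDeduplicator {
    // Column positions taken from the header row in the question.
    static final int ORGANIZATION = 0;
    static final int REVIEWER = 2;
    static final int TITLE = 3;
    static final int REVIEW = 4;

    // Hypothetical composite key: the same reviewer, place, and title
    // is treated as the same review.
    static String keyOf(String[] row) {
        return row[ORGANIZATION] + "|" + row[REVIEWER] + "|" + row[TITLE];
    }

    // Assumption: truncated reviews end with "... more", as in the sample data.
    static boolean isTruncated(String review) {
        return review.trim().endsWith("... more");
    }

    /** Keeps one row per key, preferring a complete review over a truncated one. */
    static List<String[]> deduplicate(List<String[]> rows) {
        // LinkedHashMap preserves the order in which keys were first seen.
        Map<String, String[]> best = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = keyOf(row);
            String[] stored = best.get(key);
            boolean replace = stored == null
                    || (isTruncated(stored[REVIEW]) && !isTruncated(row[REVIEW]));
            if (replace) {
                best.put(key, row);
            }
        }
        return new ArrayList<>(best.values());
    }
}
```

The resulting list can then be written back out, for example with OpenCSV's `CSVWriter`.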