I have scraped TripAdvisor review data and have a dataset with the following structure.
    organization,address,reviewer,review title,review,review count,help count,attraction count,restaurant count,hotel count,location,rating date,rating
    temple of tooth (sri dalada maligawa),address: sri dalada veediya kandy 20000 sri lanka,wowlao,temple tour,visits places of worship bring home me power of superstition. temple of tooth no exception. couldn't marvel @ fervor devotees praying. 1 tip though: shrine houses tooth open twice day , it's best check these timings ... more,89,48,7,0,0,vientiane,2 days ago,3
    temple of tooth (sri dalada maligawa),address: sri dalada veediya kandy 20000 sri lanka,wowlao,temple tour,visits places of worship bring home me power of superstition. temple of tooth no exception. couldn't marvel @ fervor devotees praying. 1 tip though: shrine houses tooth open twice day , it's best check these timings though imagine crowds @ peak.,89,48,7,0,0,vientiane,2 days ago,3
As you can see, the first row has a partial review and the second row has the full review.
What I want to achieve is to check for duplicates like this, remove the row that has the partial review, and keep the row that has the full review.
I see that every partial review ends with 'more'. Can that somehow be used to filter out the partial reviews?
How can I go about this using opencsv?
Note: it is not okay to commercially use the data of a web service without explicit permission.
Having said that: basically, opencsv gives you an enumeration of arrays. The arrays are your lines.
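To make "the arrays are your lines" concrete, here is a minimal sketch of how the column names from the question's header map to array positions. It uses only the JDK; the class name `ColumnIndex` and the helper `indexByName` are made up for illustration and are not part of opencsv.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ColumnIndex {
    // Header line copied from the sample data in the question.
    static final String HEADER =
        "organization,address,reviewer,review title,review,review count,help count,"
        + "attraction count,restaurant count,hotel count,location,rating date,rating";

    // Maps each column name to its position in the String[] of one CSV line.
    static Map<String, Integer> indexByName(String headerLine) {
        Map<String, Integer> index = new LinkedHashMap<>();
        // A plain split is safe here only because the header has no quoted commas.
        String[] names = headerLine.split(",");
        for (int i = 0; i < names.length; i++) {
            index.put(names[i].trim(), i);
        }
        return index;
    }

    public static void main(String[] args) {
        indexByName(HEADER).forEach((name, i) -> System.out.println(i + " -> " + name));
    }
}
```

With this header, the organization sits at index 0 and the review text at index 4, which are the two positions the rest of the answer relies on.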
You need to copy your lines into another, more semantic data structure. Judging from your header row, I would create a bean like this.
    public class TravelRow {
        String organization;
        String address;
        String reviewer;
        String reviewTitle;
        String review;
        // ...

        public TravelRow(String[] row) {
            // assign each row index to its property
            this.organization = row[0];
            // ...
        }
    }
You may want to generate getXXX and setXXX functions for it.
Now you need to find a primary key for a row; I suggest the organisation. Iterate over the rows, create a bean for each one, and add it to a HashMap with the organisation as the key.
If the organisation is already in the HashMap, compare the current review with the stored review. If the new review is longer, or the stored one ends with "... more", replace the object in the map.
After iterating over all lines, you have a Map with the reviews you want.
    // Needs java.io.FileReader, java.util.HashMap, java.util.Map and CSVReader
    // (com.opencsv.CSVReader in opencsv 3.x and later, au.com.bytecode.opencsv.CSVReader in 2.x).
    Map<String, TravelRow> result = new HashMap<String, TravelRow>();
    CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
    String[] nextLine;
    while ((nextLine = reader.readNext()) != null) {
        // nextLine[] is an array of the values from one line
        if (result.containsKey(nextLine[0])) {
            // compare the review
            if (reviewNeedsUpdate(result.get(nextLine[0]), nextLine[4])) {
                // update the review, or create a new object if you prefer
                result.get(nextLine[0]).setReview(nextLine[4]);
            }
        } else {
            // create a TravelRow from the array, using the constructor that consumes the line
            result.put(nextLine[0], new TravelRow(nextLine));
        }
    }
    reader.close();
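One caveat on the key: with the organisation alone, the map holds a single review per attraction, so reviews written by different people for the same place would be merged into one entry. If your file contains several reviewers per attraction, a composite key may fit better. The sketch below is an assumption on my side, not something opencsv provides: `ReviewKey` and `keyOf` are hypothetical names, and the column positions are taken from the header order in the question.

```java
class ReviewKey {
    // Assumed column positions, following the header order in the question.
    static final int ORGANIZATION = 0;
    static final int REVIEWER = 2;
    static final int REVIEW_TITLE = 3;

    // Hypothetical helper: builds one key from the columns that identify a review.
    // The "|" separator is an assumption; pick one that cannot occur in your data.
    static String keyOf(String[] row) {
        return String.join("|",
            row[ORGANIZATION].trim(),
            row[REVIEWER].trim(),
            row[REVIEW_TITLE].trim());
    }
}
```

The partial and the full copy of a review share organisation, reviewer and title, so they still map to the same key, while a review by someone else gets its own entry. To use it, pass `ReviewKey.keyOf(nextLine)` wherever the loop uses `nextLine[0]` as the map key.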
reviewNeedsUpdate(TravelRow row, String review) compares the new review with row.review and returns true if the new review is better. You can extend this function until it matches your needs.
    private boolean reviewNeedsUpdate(TravelRow row, String review) {
        return (row.review.endsWith("more") && !review.endsWith("more"));
    }
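As one possible extension, the comparison can also cover the "longer review wins" rule mentioned earlier. The following is a rough, self-contained sketch that runs on in-memory String[] rows standing in for what readNext() returns, so it needs no opencsv on the classpath. The names `ReviewDeduper`, `isTruncated` and `dedupe` are illustrative assumptions, and the key is the organisation column as suggested above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReviewDeduper {
    static final int ORGANIZATION = 0; // key column suggested in the answer
    static final int REVIEW = 4;       // review text column

    // Assumption: a scraped review is truncated when it ends with the "more" link text.
    static boolean isTruncated(String review) {
        return review.trim().endsWith("more");
    }

    // True when the candidate should replace the stored review:
    // a full review beats a truncated one; otherwise the longer text wins.
    static boolean reviewNeedsUpdate(String stored, String candidate) {
        boolean storedCut = isTruncated(stored);
        boolean candidateCut = isTruncated(candidate);
        if (storedCut != candidateCut) {
            return storedCut;
        }
        return candidate.length() > stored.length();
    }

    // Keeps one row per key, preferring the row with the better review.
    static List<String[]> dedupe(List<String[]> rows) {
        Map<String, String[]> best = new LinkedHashMap<>();
        for (String[] row : rows) {
            String key = row[ORGANIZATION];
            String[] stored = best.get(key);
            if (stored == null || reviewNeedsUpdate(stored[REVIEW], row[REVIEW])) {
                best.put(key, row);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"temple of tooth", "kandy", "wowlao", "temple tour",
            "best check these timings ... more"});
        rows.add(new String[] {"temple of tooth", "kandy", "wowlao", "temple tour",
            "best check these timings though imagine crowds @ peak."});
        for (String[] row : dedupe(rows)) {
            System.out.println(row[REVIEW]);
        }
    }
}
```

Note that a complete review which genuinely ends with the word "more" would be treated as truncated by this check. Testing for the full "... more" suffix seen in the sample data is a stricter alternative.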