C++ - Efficient way to group double vectors following a certain criterion


I have a list of objects saved in a CSV-like file using the following scheme:

[value11],...,[value1n],[label1]

[value21],...,[value2n],[label2]

...

[valuen1],...,[valuenn],[labeln]

(Each line is a single object, i.e. a vector of doubles, plus its respective label.) I have to collect them in groups according to a custom criterion (i.e. same values at the n-th and (n+1)-th positions across the objects of a group), and I need to do it in an efficient way, since the text file contains hundreds of thousands of objects. I'm using the C++ programming language.

To do so, I firstly load the CSV lines into a simple custom container (with getObject, getLabel and import methods). Then I use the following code to read them and make the groups. The "verifyGroupRequirements" function returns true if the group conditions are satisfied, false otherwise.

for (size_t i = 0; i < objectsList.getSize(); ++i) {
  MyObject currentObj;
  currentObj.attributes = objectsList.getObject(i);
  currentObj.label = objectsList.getLabel(i);

  if (i == 0) {
    // Sequence initialization with the first object
    ObjectsGroup currentGroup = ObjectsGroup();
    currentGroup.objectsList.push_back(currentObj);
    tmpGroupList.push_back(currentGroup);
  } else {
    // If it is not the first pattern, check the sequence conditions
    list<ObjectsGroup>::iterator it5;
    for (it5 = tmpGroupList.begin(); it5 != tmpGroupList.end(); ++it5) {
      bool addObjectToGroupRequirements =
        verifyGroupRequirements(it5->objectsList.back(), currentObj) &&
        ( (it5->objectsList.size() < maxNumberOfObjectsPerGroup) ||
          (maxNumberOfObjectsPerGroup == 0) );

      if (addObjectToGroupRequirements) {
        // Object added to the group
        it5->objectsList.push_back(currentObj);
        break;
      } else {
        // If we can't find a group satisfying the conditions and
        // we have arrived at the end of the list of groups, create
        // a new group for the object.
        size_t gg = std::distance(it5, tmpGroupList.end());
        if (gg == 1) {
          ObjectsGroup tmp1 = ObjectsGroup();
          tmp1.objectsList.push_back(currentObj);
          tmpGroupList.push_back(tmp1);
          break;
        }
      }
    }
  }

  if (maxNumberOfObjectsPerGroup > 0) {
    // This loop takes the elements of tmpGroupList that
    // have reached the maximum size
    list<ObjectsGroup>::iterator it2;
    for (it2 = tmpGroupList.begin(); it2 != tmpGroupList.end(); ++it2) {
      if (it2->objectsList.size() == maxNumberOfObjectsPerGroup)
        finalGroupList.push_back(*it2);
    }

    // Since tmpGroupList is a list, we can use remove_if to remove them
    tmpGroupList.remove_if(rmCondition);
  }
}

if (maxNumberOfObjectsPerGroup == 0)
  finalGroupList = vector<ObjectsGroup>(tmpGroupList.begin(), tmpGroupList.end());
else {
  list<ObjectsGroup>::iterator it6;
  for (it6 = tmpGroupList.begin(); it6 != tmpGroupList.end(); ++it6)
    finalGroupList.push_back(*it6);
}
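For reference: if the grouping criterion really boils down to "same values at the n-th and (n+1)-th positions", the linear scan over tmpGroupList for every object can be avoided altogether by keying the groups on those two values. A minimal sketch of that idea (not the code above: Obj and groupByPositions are hypothetical names, and it assumes the criterion is exact equality of the two doubles):

#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Obj {
    std::vector<double> attributes;
    std::string label;
};

// Group objects by the pair (attributes[n], attributes[n + 1]).
// std::map is used instead of std::unordered_map only to avoid
// writing a custom hash for std::pair<double, double>.
std::map<std::pair<double, double>, std::list<Obj>>
groupByPositions(const std::vector<Obj> &objects, std::size_t n)
{
    std::map<std::pair<double, double>, std::list<Obj>> groups;

    for (const Obj &o : objects) {
        if (o.attributes.size() < n + 2)
            continue; // skip malformed rows with too few values

        // Caveat: this relies on exact floating-point equality of
        // the two key values (and assumes no NaNs among them).
        groups[{o.attributes[n], o.attributes[n + 1]}].push_back(o);
    }
    return groups;
}

With N objects and G groups, the loop above performs on the order of N*G criterion checks, while the map version does N lookups of cost O(log G) each.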

where tmpGroupList is a list<ObjectsGroup>, finalGroupList is a vector<ObjectsGroup> and rmCondition is a boolean function that returns true if the size of an ObjectsGroup is bigger than a fixed value. MyObject and ObjectsGroup are two simple data structures, written in the following way:

// Data structure of a single object
class MyObject {
  public:
    MyObject() {} // default constructor, needed by "MyObject currentObj;" above
    MyObject(
        unsigned short int &spaceToReserve,
        double &defaultContent,
        string &lab) {
      attributes = vector<double>(spaceToReserve, defaultContent);
      label = lab;
    }
    vector<double> attributes;
    string label;
};

// Data structure of a group of objects
class ObjectsGroup {
  public:
    list<MyObject> objectsList;
    double health;
};
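One cheap improvement that is independent of the algorithm: MyObject carries a vector<double> and a string, so every objectsList.push_back(currentObj) in the loop above copies both heap buffers. Since each object ends up in exactly one group (the loop breaks right after the push), it could be moved instead. A self-contained illustration (Item here is just a stand-in for MyObject):

#include <list>
#include <string>
#include <utility>
#include <vector>

struct Item {
    std::vector<double> attributes;
    std::string label;
};

int main()
{
    std::list<Item> group;

    Item current{{1.0, 2.0, 3.0}, "label1"};

    // A copy would duplicate both the vector's and the string's heap
    // buffers; a move merely transfers ownership of those buffers.
    group.push_back(std::move(current));

    // "current" is left in a valid but unspecified state; the original
    // loop never touches it again after the push, so this is safe.
}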

This code seems to work, but it is really slow. Since, as I said before, I have to apply it to a large set of objects, is there a way to improve it and make it faster? Thanks.

[EDIT] What I'm trying to achieve is to make groups of objects where each object is a vector<double> (read from a CSV file). What I'm asking here is: is there a more efficient way to collect these kinds of objects in groups than the one exposed in the code example above?

[EDIT2] I need to make the groups using all of the vectors.

So, I'm reading the question...

... I have to collect them in groups according to a custom criterion (i.e. same values at the n-th and (n+1)-th positions across the objects of a group) ...

Ok, I read this part, and kept on reading...

... and I need to do it in an efficient way, since the text file contains hundreds of thousands of objects ...

I'm still with you. This makes perfect sense.

... To do so, I firstly load the CSV lines ...

{thud} {crash} {loud explosive noises}

Ok, I stopped reading right there, and didn't really pay attention to the rest of the question, including the large code sample, because you have a basic problem right at the start:

1) Your intention is, typically, to read only a small portion of this huge CSV file, and...

2) ... yet you load the entire CSV file into a sophisticated data structure.

These two statements are at odds with each other. You're reading a huge number of values from the file, and creating an object for each value. Based on the premise of the question, you're going to have a large number of these objects. Then, when all is said and done, you're going to look at only a small number of them, and throw the rest away?

You are doing a lot of work, presumably using a lot of memory and CPU cycles, loading this huge data set, only to then ignore most of it. And you are wondering why you're having performance issues? Seems pretty cut and dry to me.

What would be an alternative way of doing this? Well, let's turn the whole problem inside out, and approach it piecemeal. Let's read the CSV file one line at a time, parse the values in each CSV-formatted line, and pass the resulting strings to a lambda.

Something like this:

#include <fstream>
#include <string>
#include <vector>

template<typename Callback> void parse_csv_lines(std::ifstream &i,
                                                 Callback &&callback)
{
    std::string line;

    while (1)
    {
        line.clear();
        std::getline(i, line);

        // Deal with a missing newline on the last line...

        if (i.eof() && line.empty())
            break;

        std::vector<std::string> words;

        // At this point, you'll take "line" and split it apart, at
        // the commas, into individual words. Parsing a CSV-formatted
        // file. Not very exciting; you're doing this already, and
        // although the algorithm is boring to implement, you know how
        // to do it, so let's just replace this entire comment with
        // your boiler-plate CSV parsing logic from your existing code.

        callback(words);
    }
}
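If you don't have that boiler-plate handy: a minimal way to do the splitting is std::getline on a std::istringstream, as sketched below. Note that this deliberately ignores quoted fields and embedded commas.

#include <sstream>
#include <string>
#include <vector>

// Split one CSV line at the commas. Quoted fields, escaped commas
// and surrounding whitespace are deliberately not handled here.
std::vector<std::string> split_csv_line(const std::string &line)
{
    std::vector<std::string> words;
    std::istringstream s{line};
    std::string word;

    while (std::getline(s, word, ','))
        words.push_back(word);

    return words;
}

With that, the entire comment block in the loop collapses to words = split_csv_line(line);.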

Ok, we've got the task of parsing the CSV file done. Now, let's say we want to do the task you've set out at the beginning of the question: grab every nth and n+1th position. So...

void do_something_with_n_and_nplus1_words(size_t n)
{
    std::ifstream input_file("input_file.csv");

    // Insert code to check if input_file.is_open(), and if not,
    // do whatever is appropriate.

    parse_csv_lines(input_file,
                    [n]
                    (const auto &words)
                    {
                        // Now, grab words[n] and words[n+1]
                        // (after checking, of course, for a malformed
                        // CSV file with fewer than "n+2" values)
                        // and do whatever you want with them.
                    });
}

That's it. Now, you end up reading the CSV file, and doing the absolute minimum amount of work required to extract the nth and n+1th values from each line of the CSV file. It's going to be difficult to come up with an approach that does less work (except, of course, for micro-optimizations related to CSV parsing and word buffers; or perhaps foregoing the overhead of std::ifstream, and rather mmap-ing the entire file and then parsing it out by scanning the mmap-ed contents, something like that), I'd think.
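For what it's worth, the mmap route could look roughly like this on a POSIX system. This is a sketch only: for_each_mapped_line is a made-up name, the error handling is minimal, and it needs C++17 for std::string_view.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstring>
#include <string_view>

// Map the whole file into memory and invoke callback(line) once per
// newline-delimited line, without copying the file's contents.
template<typename Callback>
bool for_each_mapped_line(const char *filename, Callback &&callback)
{
    int fd = open(filename, O_RDONLY);
    if (fd < 0)
        return false;

    struct stat st;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return false;
    }
    if (st.st_size == 0)
    {
        close(fd);
        return true;      // empty file: nothing to do
    }

    const char *base = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (base == MAP_FAILED)
    {
        close(fd);
        return false;
    }

    const char *p = base;
    const char *end = base + st.st_size;
    while (p != end)
    {
        const char *nl = static_cast<const char *>(
            std::memchr(p, '\n', end - p));
        if (!nl)
            nl = end;     // last line has no trailing newline
        callback(std::string_view(p, nl - p));
        p = (nl == end) ? end : nl + 1;
    }

    munmap(const_cast<char *>(base), st.st_size);
    close(fd);
    return true;
}

Each callback invocation receives a view directly into the mapped file, so there is no per-line copy at all; splitting at the commas could then also work on views.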

For other similar one-off tasks, requiring only a small number of values from the CSV file, just write an appropriate lambda to fetch them out.

Perhaps you need to retrieve two or more subsets of values from the large CSV file, and you want to read the CSV file only once, maybe? Well, it's hard to give the best general approach here. Each one of these situations requires individual analysis, in order to pick the best approach.
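As one illustration: if you happened to know up front that you need two column pairs, at hypothetical indices n and m, a single lambda can feed two collections in one pass over the file, reusing parse_csv_lines from above:

#include <cstddef>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Collect two independent column pairs in a single pass over the
// file. n and m are placeholder column indices chosen by the caller.
void collect_two_subsets(std::size_t n, std::size_t m)
{
    std::vector<std::pair<std::string, std::string>> first_subset;
    std::vector<std::pair<std::string, std::string>> second_subset;

    std::ifstream input_file("input_file.csv");

    parse_csv_lines(input_file,
                    [&](const auto &words)
                    {
                        if (words.size() > n + 1)
                            first_subset.emplace_back(words[n], words[n + 1]);
                        if (words.size() > m + 1)
                            second_subset.emplace_back(words[m], words[m + 1]);
                    });

    // ... do whatever you want with first_subset and second_subset ...
}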

