I have a list of objects saved in a CSV-like file using the following scheme:
[value11],...,[value1n],[label1]
[value21],...,[value2n],[label2]
...
[valuen1],...,[valuenn],[labeln]
(each line is a single object, i.e. a vector of doubles plus its respective label). I want to collect them in groups according to custom criteria (i.e. same values at the n-th and (n+1)-th position of the objects of a group), and I need to do it in an efficient way, since the text file contains hundreds of thousands of objects. I'm using the C++ programming language.
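For example (made-up numbers), if the criterion were "same values at the 3rd and 4th position", the first two of the following objects would end up in the same group, while the third would not:

0.5,1.2,2.0,3.5,labelA
0.7,0.9,2.0,3.5,labelB
0.5,1.2,2.1,3.5,labelC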
To do so, firstly I load the CSV lines into a simple custom container (with getobject, getlabel and import methods). Then I use the following code to read them and make groups. The "verifygrouprequirements" function returns true if the group conditions are satisfied, false otherwise.
for (size_t i = 0; i < objectslist.getsize(); ++i) {
    myobject currentobj;
    currentobj.attributes = objectslist.getobject(i);
    currentobj.label = objectslist.getlabel(i);

    if (i == 0) {
        // Sequence initialization with the first object
        objectsgroup currentgroup = objectsgroup();
        currentgroup.objectslist.push_back(currentobj);
        tmpgrouplist.push_back(currentgroup);
    } else {
        // If it is not the first pattern, check the sequence conditions
        list<objectsgroup>::iterator it5;
        for (it5 = tmpgrouplist.begin(); it5 != tmpgrouplist.end(); ++it5) {
            bool addobjecttogrouprequirements =
                verifygrouprequirements(it5->objectslist.back(), currentobj) &&
                ((it5->objectslist.size() < maxnumberofobjectspergroup) ||
                 (maxnumberofobjectspergroup == 0));

            if (addobjecttogrouprequirements) {
                // The object is added to the group
                it5->objectslist.push_back(currentobj);
                break;
            } else {
                // If no group satisfies the conditions and we have
                // arrived at the end of the list of groups, create a
                // new group object.
                size_t gg = std::distance(it5, tmpgrouplist.end());
                if (gg == 1) {
                    objectsgroup tmp1 = objectsgroup();
                    tmp1.objectslist.push_back(currentobj);
                    tmpgrouplist.push_back(tmp1);
                    break;
                }
            }
        }
    }

    if (maxnumberofobjectspergroup > 0) {
        // This loop takes the elements of tmpgrouplist that have
        // reached the maximum size
        list<objectsgroup>::iterator it2;
        for (it2 = tmpgrouplist.begin(); it2 != tmpgrouplist.end(); ++it2) {
            if (it2->objectslist.size() == maxnumberofobjectspergroup)
                finalgrouplist.push_back(*it2);
        }
        // Since tmpgrouplist is a list, we can use remove_if to remove them
        tmpgrouplist.remove_if(rmcondition);
    }
}

if (maxnumberofobjectspergroup == 0)
    finalgrouplist = vector<objectsgroup>(tmpgrouplist.begin(), tmpgrouplist.end());
else {
    list<objectsgroup>::iterator it6;
    for (it6 = tmpgrouplist.begin(); it6 != tmpgrouplist.end(); ++it6)
        finalgrouplist.push_back(*it6);
}
where tmpgrouplist is a list<objectsgroup>,
finalgrouplist is a vector<objectsgroup>,
and rmcondition is a boolean function that returns true if the size of an objectsgroup is bigger than a fixed value (a sketch of such a predicate is shown after the data structures below). myobject and objectsgroup are two simple data structures, written in the following way:
// Data structure of a single object
class myobject {
public:
    myobject() {} // default constructor, needed by the loop above

    myobject(const unsigned short int &spacetoreserve,
             const double &defaultcontent,
             const string &lab) {
        attributes = vector<double>(spacetoreserve, defaultcontent);
        label = lab;
    }

    vector<double> attributes;
    string label;
};

// Data structure of a group of objects
class objectsgroup {
public:
    list<myobject> objectslist;
    double health;
};
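For reference, rmcondition is just a predicate usable with list::remove_if; a minimal sketch, assuming maxnumberofobjectspergroup is accessible at that scope, would be:

bool rmcondition(const objectsgroup &group)
{
    // True for groups that have reached the maximum size, so that
    // remove_if drops them from tmpgrouplist (the loop above has
    // already copied them into finalgrouplist).
    return group.objectslist.size() >= maxnumberofobjectspergroup;
}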
This code seems to work, but it is slow. Since, as said before, I have to apply it to a large set of objects, is there a way to improve it and make it faster? Thanks.
[EDIT] What I'm trying to achieve is to make groups of objects, where each object is a vector<double> (read from the CSV file). What I'm asking here is: is there a more efficient way to collect this kind of objects in groups than the one exposed in the code example above?
[EDIT2] I need to make groups using all of the vectors.
So, I'm reading the question...
... collect them in groups according to custom criteria (i.e. same values at the n-th and (n+1)-th position of the objects of a group) ...
Ok, I read that part, and kept on reading...
... and I need to do it in an efficient way, since the text file contains hundreds of thousands of objects...
I'm still with you; this makes perfect sense.
... So, firstly I load the CSV lines ...
{thud} {crash} {loud explosive noises}
Ok, I stopped reading right there, and didn't pay much attention to the rest of the question, including the large code sample. Because you have a basic problem right at the start:

1) Your intention is, typically, to read only a small portion of this huge CSV file, and...

2) ... to do that, you load the entire CSV file into a sophisticated data structure.

These two statements are at odds with each other. You're reading a huge number of values from a file. You're creating an object for each value. Based on the premise of the question, you're going to have a large number of these objects. And then, when all is said and done, you're going to look at only a small number of them, and throw the rest away?

You are doing a lot of work, presumably using a lot of memory and CPU cycles, loading a huge data set, only to then ignore most of it. And you are wondering why you're having performance issues? Seems pretty cut and dried to me.
What would be an alternative way of doing this? Well, let's turn the whole problem inside out, and approach it piecemeal. Let's read the CSV file one line at a time, parse the values in each CSV-formatted line, and pass the resulting strings to a lambda.

Something like this:
template<typename Callback>
void parse_csv_lines(std::ifstream &i, Callback &&callback)
{
    std::string line;

    while (1)
    {
        line.clear();

        std::getline(i, line);

        // Deal with a missing newline on the last line...
        if (i.eof() && line.empty())
            break;

        std::vector<std::string> words;

        // At this point you'll take "line" and split it apart, at the
        // commas, into individual words. Parsing a CSV-formatted file.
        // Not very exciting, you're doing it already, the algorithm is
        // boring to implement, you know how to do it, so let's just
        // replace this entire comment with your boiler-plate CSV
        // parsing logic from your existing code (a bare-bones version
        // is also sketched just below).

        callback(words);
    }
}
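If you don't have that boiler-plate handy, a bare-bones split looks like the following. Note the assumption baked in here: no quoted or escaped commas, so this is not a fully general CSV parser, just the minimum that matches the file format in the question (split_csv_line is simply a name I picked):

std::vector<std::string> split_csv_line(const std::string &line)
{
    std::vector<std::string> words;
    std::istringstream s(line);   // needs <sstream>
    std::string word;

    // std::getline with ',' as the delimiter peels off one field at
    // a time; an empty trailing field after a final comma is dropped.
    while (std::getline(s, word, ','))
        words.push_back(word);

    return words;
}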
Ok, we've now done the generic task of parsing the CSV file. Now, let's say we want to do the task you set out at the beginning of your question: grab every nth and n+1th value. So...
void do_something_with_n_and_nplus1_words(size_t n)
{
    std::ifstream input_file("input_file.csv");

    // Insert code to check whether input_file.is_open(), and if not,
    // do whatever is appropriate.

    parse_csv_lines(input_file,
                    [n]
                    (const auto &words)
                    {
                        // Now grab words[n] and words[n+1]
                        // (after checking, of course, for a malformed
                        // CSV file with fewer than "n+2" values)
                        // and do whatever you want with them.
                    });
}
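And if "whatever you want" is precisely the grouping from the question, i.e. collecting together all objects whose values at the nth and n+1th positions match, you can build the groups right inside the lambda, in a single pass. A sketch, under the assumption that exact textual equality of the two fields is an acceptable grouping key (group_by_n_and_nplus1 and the map layout are mine, not anything from the question's code):

void group_by_n_and_nplus1(size_t n)
{
    std::ifstream input_file("input_file.csv");

    // Each group is keyed by the pair of field values at n and n+1;
    // needs <map>, <string>, <utility> and <vector>.
    std::map<std::pair<std::string, std::string>,
             std::vector<std::vector<std::string>>> groups;

    parse_csv_lines(input_file,
                    [n, &groups]
                    (const auto &words)
                    {
                        if (words.size() < n + 2)
                            return;   // malformed line, skip it

                        groups[{words[n], words[n + 1]}].push_back(words);
                    });

    // groups now holds every object, bucketed by its key, after a
    // single pass over the file.
}

The point of the map: finding the right group becomes one logarithmic lookup per line (or amortized constant time with std::unordered_map plus a hash for the pair), instead of the linear scan over all existing groups that the question's loop performs, which is the main reason that loop slows down as the number of groups grows.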
That's it. Now you end up reading the CSV file and doing the absolute minimum amount of work required to extract the nth and n+1th values from each CSV line. It's going to be fairly difficult to come up with an approach that does less work (except, of course, for micro-optimizations related to CSV parsing and word buffers; or perhaps foregoing the overhead of std::ifstream, and rather mmap-ing the entire file and then parsing it by scanning the mmap-ed contents, something like that), I'd think.
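For reference, the mmap route would look roughly like this on a POSIX system; this is a sketch of the idea only, with error handling abbreviated and the actual scanning left as a comment:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void scan_mapped_file(const char *filename)
{
    int fd = open(filename, O_RDONLY);
    if (fd < 0)
        return;   // handle the error however you prefer

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0)
    {
        // Map the whole file read-only; the kernel pages it in on demand.
        void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

        if (p != MAP_FAILED)
        {
            const char *begin = static_cast<const char *>(p);
            const char *end   = begin + st.st_size;

            // Scan [begin, end) for '\n' and ',' directly, instead of
            // going through std::getline, avoiding stream overhead and
            // per-line string copies.
            // ... parsing logic goes here ...

            munmap(p, st.st_size);
        }
    }

    close(fd);
}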
For all other similar one-off tasks, requiring a small number of values from the CSV file, you just write an appropriate lambda to fetch them out.

Perhaps you need to retrieve two or more subsets of values from the large CSV file, and you want to read the CSV file only once, maybe? Well, it's hard to give the best general approach. Each one of these situations requires individual analysis, to pick the best approach.