R - Secondary key ("index" attribute) in data.table is lost when the table is "copied" by selecting columns


I have a data.table `mydt`, and I'm making "copies" of the table in 3 different ways:

    library(data.table)

    mydt <- data.table(cola = 1:3)
    mydt[cola == 3]

    copy1 <- copy(mydt)
    copy2 <- mydt             # yes, I know it's a reference, not a real copy
    copy3 <- mydt[, .(cola)]  # list the columns of the original table

Then I'm comparing the copies with the original table:

    identical(mydt, copy1)  # TRUE
    identical(mydt, copy2)  # TRUE
    identical(mydt, copy3)  # FALSE

I'm trying to figure out the difference between `mydt` and `copy3`:

    identical(names(mydt), names(copy3))
    # TRUE
    all.equal(mydt, copy3, check.attributes = FALSE)
    # TRUE
    all.equal(mydt, copy3, check.attributes = FALSE, trim.levels = FALSE, check.names = TRUE)
    # TRUE
    attr.all.equal(mydt, copy3, check.attributes = FALSE, trim.levels = FALSE, check.names = TRUE)
    # NULL
    all.equal(mydt, copy3)
    # [1] "Attributes: < Length mismatch: comparison on first 1 components >"
    attr.all.equal(mydt, copy3)
    # [1] "Attributes: < Names: 1 string mismatch >"
    # [2] "Attributes: < Length mismatch: comparison on first 3 components >"
    # [3] "Attributes: < Component 3: Attributes: < Modes: list, NULL > >"
    # [4] "Attributes: < Component 3: Attributes: < names for target but not for current > >"
    # [5] "Attributes: < Component 3: Attributes: < current is not list-like > >"
    # [6] "Attributes: < Component 3: Numeric: lengths (0, 3) differ >"

My original question was how to understand the last output. In the end I got there by using the `attributes()` function:

    attr0 <- attributes(mydt)
    attr3 <- attributes(copy3)
    str(attr0)
    str(attr3)

It showed that the original data.table had an `index` attribute that was not copied when I created `copy3`.
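A quicker way to spot this kind of difference is to compare attribute names directly. The sketch below is only illustrative: `attr_only_in` is a hypothetical helper (not part of data.table), and the index is created explicitly with `setindex()`, which assumes a data.table release that provides that function.

```r
library(data.table)

# Hypothetical helper: names of the attributes that `x` has and `y` lacks.
attr_only_in <- function(x, y) {
  setdiff(names(attributes(x)), names(attributes(y)))
}

dt <- data.table(cola = 1:3)
setindex(dt, cola)       # create the index explicitly instead of relying on auto-indexing
sel <- dt[, .(cola)]     # "copy" made by selecting columns

attr_only_in(dt, sel)    # attributes that did not survive the column selection
attr_only_in(dt, dt)     # character(0): an object lacks none of its own attributes
```

Which names the first call reports may depend on the data.table version in use.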

In order to make the question a bit clearer (and maybe useful for future readers): what happened here is that either you set a secondary key by explicitly calling `set2key` (probably not), or data.table set a secondary key on its own while performing an ordinary operation such as filtering. This is a (not so) new feature added in v1.9.4:

`DT[column==value]` and `DT[column %in% values]` are now optimized to use DT's key when `key(DT)[1]=="column"`, otherwise a secondary key (a.k.a. index) is automatically added so the next `DT[column==value]` is much faster. No code changes are needed; existing code should automatically benefit. Secondary keys can be added manually using `set2key()` and existence checked using `key2()`. These optimizations and function names/arguments are experimental and may be turned off with `options(datatable.auto.index=FALSE)`.
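As a rough illustration of the option and the manual route mentioned in that note, here is a minimal sketch. It assumes a later data.table release in which `set2key()` and `key2()` were superseded by `setindex()` and `indices()`; on v1.9.4 itself the older names from the note would be used instead.

```r
library(data.table)

# Switch automatic index creation off, so only the explicit call below adds an index.
old_opts <- options(datatable.auto.index = FALSE)

dt <- data.table(a = 1:3)
dt[a == 3]                 # with auto-indexing off, this filter should not leave an index behind
before <- indices(dt)      # NULL when the table has no index

setindex(dt, a)            # add a secondary key (index) on column `a` manually
after <- indices(dt)       # names of the indices now present

options(old_opts)          # restore the previous setting
```

Restoring the option at the end keeps the sketch from changing behaviour in the rest of the session.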


Let's reproduce this:

    mydt <- data.table(a = 1:3)
    options(datatable.verbose = TRUE)
    mydt[a == 3]
    # Creating new index 'a' <~~~~ Here
    # forder took 0 sec
    # Coercing double column i.'V1' to integer to match type of x.'a'. Please avoid coercion for efficiency.
    # Starting bmerge ...done in 0 secs
    #    a
    # 1: 3

    attr(mydt, "index") # or using `key2(mydt)`
    # integer(0)
    # attr(,"__a")
    # integer(0)

So, unlike what you were assuming, you actually did create a copy, and the secondary key wasn't transferred with it. Compare:

    copy1 <- mydt
    attr(copy1, "index")
    # integer(0)
    # attr(,"__a")
    # integer(0)

    copy2 <- mydt[, .(a)]
    # Detected that j uses these columns: a <~~~ This is where the copy occurs
    attr(copy2, "index")
    # NULL

    identical(mydt, copy1)
    # [1] TRUE
    identical(mydt, copy2)
    # [1] FALSE

And for some further validation:

    tracemem(mydt)
    # [1] "<00000000159cbbb0>"
    tracemem(copy1)
    # [1] "<00000000159cbbb0>"
    tracemem(copy2)
    # [1] "<000000001a5a46d8>"

The interesting conclusion here, one could claim, is that `[.data.table` does create a copy, even if the contents of the object remain unchanged.
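The same point can be checked without `tracemem()` by comparing memory addresses. This is a minimal sketch using `address()` from data.table; the variable names are made up for illustration.

```r
library(data.table)

dt <- data.table(a = 1:3)

by_ref  <- dt            # plain assignment: a second name for the same object
by_copy <- copy(dt)      # explicit copy
by_j    <- dt[, .(a)]    # column selection through `[.data.table`

same_ref  <- address(dt) == address(by_ref)    # TRUE: both names point at one object
same_copy <- address(dt) == address(by_copy)   # FALSE: copy() allocates a new object
same_j    <- address(dt) == address(by_j)      # FALSE: the selection is a new object too
```

So selecting every column behaves like `copy()` in terms of allocation, not like a plain assignment.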

