i have pretty small, simple mysql table holding precalculated financial data. table looks like: refdate | instrtument | rate|startdate |maturitydate|carry1|carry2|carry3
with 3 indices defined as:
unique unique_id(refdate,instrument)
refdate (refdate)
instrument (instrument)
rows right 10 million, though each refdate, there 5000 distinct instruments right now
i have query self joins on table generate output like: refdate|rate instrument=x | rate instrument = y| rate instrument=z|....
basically returning time series data can own analytics in.
here problem: original query looked like:
select distinct audspot1yfq.refdate,audspot1yfq.rate 'audspot1yfq', aud1y1yfq.rate aud1y1yfq audratedb audspot1yfq inner join audratedb aud1y1yfq on audspot1yfq.refdate=aud1y1yfq.refdate audspot1yfq.instrument = 'audspot1yfq' , aud1y1yfq.instrument = 'aud1y1yfq' order audspot1yfq.refdate
note, in particular query timing below, getting 10 different instruments, means query longer followed same pattern of naming, inner joins, , statements.
this slow, in workbench time 7-8 second duration (but near 0 fetch time, have workbench on machine running server). when stripped distinct, duration drops 0.25-0.5 seconds (far more manageable) , when stripped "order by" got faster (<0.1 seconds, @ point don't care). fetchtime exploded ~7 seconds. in total, gain nothing has become fetch time issue. when run query python scripts doing lifting , work, same timing whether include distinct or not.
when run explain on cut down query (which has horrid fetch time) get:
1 simple audspot1yfq ref unique_id,refdate,instrument instrument 39 const 1432 100.00 using 1 simple aud1y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud2y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud3y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud4y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud5y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud6y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud7y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud8y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using 1 simple aud9y1yfq ref unique_id,refdate,instrument unique_id 42 historicalratesdb.audspot1yfq.refdate,const 1 100.00 using
i realize distinct not required, , order can throw out , sort in pandas when output dataframe. great. don't know how fetch time down. i'm not going win competency competitions on website, have searched as can , can't find solution issue. appreciated.
~cocoa
(i had simplify table aliases in order read it:)
select distinct s.refdate, s.rate audspot1yfq, y.rate aud1y1yfq audratedb s join audratedb y on s.refdate = y.refdate s.instrument = 'audspot1yfq' , y.instrument = 'aud1y1yfq' order s.refdate
index needed:
index(instrument, refdate) -- filter , sort, or index(instrument, refdate, rate) -- "cover" query.
that assumes query not more complex said. see explain
has many more tables. please provide show create table audratedb
, entire select
.
back questions...
distinct
done 1 of 2 ways: (1) sort table, dedup, or (2) dedup in hash in memory. keep in mind dedupping 3 columns (refdate, s.rate, y.rate).
order by
sort after gathering data. however, suggested index (not indexes had), sort not needed, since index rows in desired order.
but... having both distinct
, order by
may confuse optimizer point 'dumb'.
you (refdate,instrument)
unique
, not mention primary key
, nor have mentioned engine using. if using innodb, primary key(instrument, refdate)
, in order, further speed things up, , avoid need new index.
furthermore, redundant have (a,b)
, (a)
. is, current schema not need index(refdate)
, changing pk, not need index(instrument)
, instead.
bottom line: only
primary key(instrument, refdate), index(refdate)
and no other indexes (unless can show query needs it).
more on explain
. notice how rows
column says 1432, 1, 1, ... means scanned estimated 1432 rows of first table. far more necessary because of lack of proper index. needed @ 1 row in each of other tables. (can't better that.)
how many rows in select
without distinct
or order by
? tells how work needed after doing fetching , joining
. suspect few. "few" cheap distinct
, order by
; hence think barking wrong tree. 1432 rows fast process.
as buffer_pool... how big table? show table status
. suspect table more 1gb, hence cannot fit in buffer_pool. hence raising cache size let query run in ram, not hitting disk (at least after gets cached). keep in mind running query on cold cache have lots of i/o. cache warms up, queries run faster. if cache small, continue need i/o. i/o slowest part of processing.
i hope have @ least 6gb of ram; otherwise, 2g dangerously large. swapping bad performance.
Comments
Post a Comment