i have pretty small, simple mysql table holding precalculated financial data. table looks like: refdate | instrtument | rate|startdate |maturitydate|carry1|carry2|carry3
with 3 indices defined as:
unique unique_id(refdate,instrument)
refdate (refdate)
instrument (instrument)
rows right 10 million, though each refdate, there 5000 distinct instruments right now
i have query self joins on table generate output like: refdate|rate instrument=x | rate instrument = y| rate instrument=z|....
basically returning time series data can own analytics in.
here problem: original query looked like:
select distinct audspot1yfq.refdate,audspot1yfq.rate 'audspot1yfq', aud1y1yfq.rate aud1y1yfq audratedb audspot1yfq inner join audratedb aud1y1yfq on audspot1yfq.refdate=aud1y1yfq.refdate  audspot1yfq.instrument = 'audspot1yfq' ,  aud1y1yfq.instrument = 'aud1y1yfq'  order audspot1yfq.refdate note, in particular query timing below, getting 10 different instruments, means query longer followed same pattern of naming, inner joins, , statements.
this slow, in workbench time 7-8 second duration (but near 0 fetch time, have workbench on machine running server). when stripped distinct, duration drops 0.25-0.5 seconds (far more manageable) , when stripped "order by" got faster (<0.1 seconds, @ point don't care). fetchtime exploded ~7 seconds. in total, gain nothing has become fetch time issue. when run query python scripts doing lifting , work, same timing whether include distinct or not.
when run explain on cut down query (which has horrid fetch time) get:
1   simple  audspot1yfq     ref unique_id,refdate,instrument    instrument  39  const   1432    100.00  using 1   simple  aud1y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud2y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud3y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud4y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud5y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud6y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud7y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud8y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using 1   simple  aud9y1yfq       ref unique_id,refdate,instrument    unique_id   42  historicalratesdb.audspot1yfq.refdate,const 1   100.00  using i realize distinct not required, , order can throw out , sort in pandas when output dataframe. great. don't know how fetch time down. i'm not going win competency competitions on website, have searched as can , can't find solution issue. appreciated.
~cocoa
(i had simplify table aliases in order read it:)
select  distinct            s.refdate,            s.rate audspot1yfq,            y.rate aud1y1yfq      audratedb s     join  audratedb y  on s.refdate = y.refdate      s.instrument = 'audspot1yfq'       ,  y.instrument = 'aud1y1yfq'     order  s.refdate  index needed:
index(instrument, refdate)  -- filter , sort, or index(instrument, refdate, rate)  -- "cover" query. that assumes query not more complex said.  see explain has many more tables.  please provide show create table audratedb , entire select.
back questions...
distinct done 1 of 2 ways:  (1) sort table, dedup, or (2) dedup in hash in memory.  keep in mind dedupping 3 columns (refdate, s.rate, y.rate).
order by sort after gathering data.  however, suggested index (not indexes had), sort not needed, since index rows in desired order.
but...  having both distinct , order by may confuse optimizer point 'dumb'.
you (refdate,instrument) unique, not mention primary key, nor have mentioned engine using.  if using innodb, primary key(instrument, refdate), in order, further speed things up, , avoid need new index.
furthermore, redundant have (a,b) , (a).  is, current schema not need index(refdate), changing pk, not need index(instrument), instead.
bottom line: only
primary key(instrument, refdate), index(refdate) and no other indexes (unless can show query needs it).
more on explain.  notice how rows column says 1432, 1, 1, ...  means scanned estimated 1432 rows of first table.  far more necessary because of lack of proper index.  needed @ 1 row in each of other tables.  (can't better that.)
how many rows in select without distinct or order by?  tells how work needed after doing fetching , joining.  suspect few.  "few" cheap distinct , order by; hence think barking wrong tree.  1432 rows fast process.
as buffer_pool...  how big table?  show table status.  suspect table more 1gb, hence cannot fit in buffer_pool.  hence raising cache size let query run in ram, not hitting disk (at least after gets cached).  keep in mind running query on cold cache have lots of i/o.  cache warms up, queries run faster.  if cache small, continue need i/o.  i/o slowest part of processing.
i hope have @ least 6gb of ram; otherwise, 2g dangerously large. swapping bad performance.
Comments
Post a Comment