database - How to efficiently compare two subsets of rows of a large table?
I use PostgreSQL 9.6, and the table schema is as follows: department, key, value1, value2, value3, ...
Each department has hundreds of millions of unique keys, and the set of keys is more or less the same for all departments. It's possible that some keys don't exist for some departments, but such situations are rare.
I would like to prepare a report that, for 2 departments, points out the differences in values for each key (the comparison involves some logic based on the values for the key).
My first approach was to write an external tool in Python that:
- creates a server-side cursor for the query:
select * from my_table where department = 'abc' order by key;
- creates a server-side cursor for the query:
select * from my_table where department = 'xyz' order by key;
- iterates over both cursors and compares the values.
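The two-cursor comparison described above amounts to a merge over two key-sorted streams. Here is a minimal sketch of that idea, assuming each row is reduced to a hypothetical (key, values) pair and that plain inequality stands in for the real comparison logic, which the question does not show. The name compare_sorted is illustrative only.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, tuple]  # hypothetical row shape: (key, (value1, value2, ...))
Diff = Tuple[str, Optional[tuple], Optional[tuple]]


def compare_sorted(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Diff]:
    """Yield (key, left_values, right_values) for every key that differs.

    Both inputs must be sorted by key in the same order that Python uses
    to compare keys. A missing side is reported as None.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            yield (l[0], l[1], None)      # key exists only on the left
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            yield (r[0], None, r[1])      # key exists only on the right
            r = next(right_it, None)
        else:
            if l[1] != r[1]:              # placeholder for the real logic
                yield (l[0], l[1], r[1])
            l = next(left_it, None)
            r = next(right_it, None)


# Small in-memory stand-ins for the two server-side cursors.
abc = [("k1", (1, 2)), ("k2", (3, 4)), ("k4", (5, 6))]
xyz = [("k1", (1, 2)), ("k2", (3, 9)), ("k3", (7, 8))]
result = list(compare_sorted(abc, xyz))
```

One caveat with this kind of merge: the order produced by "order by key" depends on the database collation, so it may not match Python's string ordering unless the keys are plain ASCII or the column uses the C collation.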
It worked fine, but I thought it would be more efficient to perform the comparison inside a stored procedure in PostgreSQL. I wrote a stored procedure that takes 2 cursors as arguments, iterates over them, and compares the values. Any differences are written to a temporary table. At the end, the external tool just iterates over the temporary table - there shouldn't be many rows there.
I thought the latter approach would be more efficient, because it doesn't require transferring lots of data outside the database. To my surprise, it turned out to be slower by 40%.
To isolate the problem, I compared the performance of iterating over a cursor inside a stored procedure and in Python:
fetch cur_1 into row_1;
while (row_1 is not null) loop
    rows = rows + 1;
    fetch cur_1 into row_1;
end loop;
vs.
import psycopg2

conn = psycopg2.connect(pg_uri)
cur = conn.cursor('test')  # a named cursor is a server-side cursor
cur.execute(query)
cnt = 0
for row in cur:
    cnt += 1
The query is the same in both cases. Again, the external tool was faster. My hypothesis is that this is because the stored procedure fetches rows one by one (fetch cur_1 into row_1) while the application fetches rows in batches of 2000. But I couldn't find a way to fetch a batch of rows from a cursor inside a PL/pgSQL procedure. Thus, I can't test the hypothesis.
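One way to probe the batching hypothesis from the client side would be to vary the batch size there and see whether row-at-a-time fetching closes the gap. Here is a minimal sketch of batched iteration using the standard DB-API fetchmany call. It uses an in-memory sqlite3 table as a stand-in so that it runs without a server, and the helper names iter_in_batches and count_rows are hypothetical. Against PostgreSQL the same helper should work on a psycopg2 named cursor, where the itersize attribute (default 2000) plays the equivalent role for plain iteration. Any timing would have to be measured against the real database.

```python
import sqlite3
from typing import Iterator


def iter_in_batches(cursor, batch_size: int = 2000) -> Iterator[tuple]:
    """Yield rows one at a time while fetching batch_size rows per call."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


def count_rows(cursor, batch_size: int) -> int:
    """Count the rows left on the cursor, mirroring the loops above."""
    return sum(1 for _ in iter_in_batches(cursor, batch_size))


# Stand-in data: sqlite3 instead of PostgreSQL, 5000 rows instead of millions.
conn = sqlite3.connect(":memory:")
conn.execute("create table my_table (department text, key text, value1 integer)")
conn.executemany(
    "insert into my_table values (?, ?, ?)",
    [("abc", f"k{i:04d}", i) for i in range(5000)],
)

cur = conn.cursor()
cur.execute("select * from my_table where department = 'abc' order by key")
total = count_rows(cur, batch_size=2000)
```

Running the same count with batch_size=1 and with batch_size=2000 against the real table would show how much of the difference comes from per-row fetch overhead.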
So my question is: is it possible to speed up my stored procedure?
What is the best approach for problems like this?
Why can you not just do a self-join rather than using cursors? Something like:
select t1.key, t1.value1 - t2.value1 as diff1, t1.value2 - t2.value2 as diff2, ...
from my_table t1
inner join my_table t2 on t1.key = t2.key
where t1.department = 'xyz' and t2.department = 'abc'
union
select t1.key, t1.value1 as diff1, t1.value2 as diff2, ...
from my_table t1
where not exists (select 1 from my_table t2 where t1.key = t2.key and t2.department = 'abc')
and t1.department = 'xyz'
union
select t1.key, t1.value1 as diff1, t1.value2 as diff2, ...
from my_table t1
where not exists (select 1 from my_table t2 where t1.key = t2.key and t2.department = 'xyz')
and t1.department = 'abc';
The first part deals with the common cases, and the 2 unions pick up the missing values. I would have thought this would be faster than the cursor approach.
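To see what the self-join approach returns, here is a small sketch that runs the same query shape on a tiny in-memory table. It uses sqlite3 as a stand-in for PostgreSQL, only two value columns, and made-up sample rows, so it illustrates the result shape and says nothing about performance at hundreds of millions of keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table my_table "
    "(department text, key text, value1 integer, value2 integer)"
)
# Made-up sample data: k3 exists only in 'xyz', k4 only in 'abc'.
conn.executemany(
    "insert into my_table values (?, ?, ?, ?)",
    [
        ("abc", "k1", 1, 2), ("abc", "k2", 3, 4), ("abc", "k4", 5, 6),
        ("xyz", "k1", 1, 2), ("xyz", "k2", 3, 9), ("xyz", "k3", 7, 8),
    ],
)

QUERY = """
select t1.key, t1.value1 - t2.value1 as diff1, t1.value2 - t2.value2 as diff2
from my_table t1
inner join my_table t2 on t1.key = t2.key
where t1.department = 'xyz' and t2.department = 'abc'
union
select t1.key, t1.value1, t1.value2
from my_table t1
where not exists (select 1 from my_table t2
                  where t1.key = t2.key and t2.department = 'abc')
  and t1.department = 'xyz'
union
select t1.key, t1.value1, t1.value2
from my_table t1
where not exists (select 1 from my_table t2
                  where t1.key = t2.key and t2.department = 'xyz')
  and t1.department = 'abc'
order by 1
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Note that keys with identical values still appear, with all diffs equal to zero, and that a row from the two "missing" branches looks the same as a row of real differences. A report that lists only changed keys would probably need an extra filter and a column marking which side is missing.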