view counter

Group by Groups

Thanks to Kim Berg Hansen for this story

For some time now I have struggled to efficiently "group by groups" on data containing references between our suppliers item numbers and the original equipment manufacturer (OEM) number. I can group those data by supplier and their item number and get a "group of OEM numbers." That group I will call a unique item and I will group the data once more by that group to find which suppliers agree on an item having the same set (group) of OEM numbers.It took me some time to figure out how to do this efficiently (the data was several million records and the "unique groups" to be found was about 3/4 million groups.) But after several tries, some good help from the OTN forums and some reading in my Tom Kyte books, I finally got a solution that seems to work well ;-)Let me show what I tried with an example using good old scott.emp:SQL> create table orig_data as 2 select distinct job, deptno 3 from scott.emp e 4 / Table created. SQL> select job 2 , deptno 3 from orig_data 4 order by 5 job 6 , deptno 7 / JOB DEPTNO--------- ----------ANALYST 20CLERK 10CLERK 20CLERK 30MANAGER 10MANAGER 20MANAGER 30PRESIDENT 10SALESMAN 30 9 rows selected.First I group by job (I use xmlagg here because I am on version 11.1 and therefore no listagg  ):SQL> select od.job 2 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos 3 from orig_data od 4 group by od.job 5 / JOB DEPTNOS--------- ------------------------------ANALYST 20CLERK 10,20,30MANAGER 10,20,30PRESIDENT 10SALESMAN 30I notice here that both job CLERK and MANAGER has the same set of deptnos.So if I group by deptnos I can get this result:SQL> select s2.deptnos 2 , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs 3 from ( 4 select od.job 5 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos 6 from orig_data od 7 group by od.job 8 ) s2 9 group by s2.deptnos 10 / DEPTNOS JOBS------------------------------ ------------------------------10 PRESIDENT10,20,30 CLERK,MANAGER20 ANALYST30 SALESMAN(If you substitute OEM number for deptnos and supplier items for jobs, this represents my real world case - except the real world finds near 3/4 million such "group by groups.")So my requirement is to identify all such unique groups of deptnos in my orig_data table, give each such group a surrogate key in a parent table, and then populate two child tables with the deptnos of each group and the jobs that have that group of deptnos:SQL> create table groups ( 2 groupkey number primary key 3 ) 4 / Table created. SQL> create table groups_depts ( 2 groupkey number references groups (groupkey) 3 , deptno number(2) 4 ) 5 / Table created. SQL> create table groups_jobs ( 2 groupkey number references groups (groupkey) 3 , job varchar2(9) 4 ) 5 / Table created.For the surrogate groupkey I can just use a rownumber on my group by deptnos query:SQL> select row_number() over (order by s2.deptnos) groupkey 2 , s2.deptnos 3 , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs 4 from ( 5 select od.job 6 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos 7 from orig_data od 8 group by od.job 9 ) s2 10 group by s2.deptnos 11 / GROUPKEY DEPTNOS JOBS---------- ------------------------------ ------------------------------ 1 10 PRESIDENT 2 10,20,30 CLERK,MANAGER 3 20 ANALYST 4 30 SALESMANThat query I can use for a (slow) insert into my three tables in this simple manner:SQL> begin 2 for g in ( 3 select row_number() over (order by s2.deptnos) groupkey 4 , s2.deptnos 5 , rtrim(xmlagg(xmlelement(j,s2.job,',').extract('//text()') order by s2.job),',') jobs 6 from ( 7 select od.job 8 , rtrim(xmlagg(xmlelement(d,od.deptno,',').extract('//text()') order by od.deptno),',') deptnos 9 from orig_data od 10 group by od.job 11 ) s2 12 group by s2.deptnos 13 ) loop 14 insert into groups values (g.groupkey); 15 16 insert into groups_depts 17 select g.groupkey 18 , to_number(regexp_substr(str, '[^,]+', 1, level)) deptno 19 from ( 20 select rownum id 21 , g.deptnos str 22 from dual 23 ) 24 connect by instr(str, ',', 1, level-1) > 0 25 and id = prior id 26 and prior dbms_random.value is not null; 27 28 insert into groups_jobs 29 select g.groupkey 30 , regexp_substr(str, '[^,]+', 1, level) job 31 from ( 32 select rownum id 33 , g.jobs str 34 from dual 35 ) 36 connect by instr(str, ',', 1, level-1) > 0 37 and id = prior id 38 and prior dbms_random.value is not null; 39 40 end loop; 41 end; 42 / PL/SQL procedure successfully completed.The tables now contain this data:SQL> select * 2 from groups 3 order by groupkey 4 / GROUPKEY---------- 1 2 3 4 SQL> select * 2 from groups_depts 3 order by groupkey, deptno 4 / GROUPKEY DEPTNO---------- ---------- 1 10 2 10 2 20 2 30 3 20 4 30 6 rows selected. SQL> select * 2 from groups_jobs 3 order by groupkey, job 4 / GROUPKEY JOB---------- --------- 1 PRESIDENT 2 CLERK 2 MANAGER 3 ANALYST 4 SALESMANI can now from these data get the same result as before (just to test I have created the desired data):SQL> select g.groupkey 2 , d.deptnos 3 , j.jobs 4 from groups g 5 join ( 6 select groupkey 7 , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos 8 from groups_depts 9 group by groupkey 10 ) d 11 on d.groupkey = g.groupkey 12 join ( 13 select groupkey 14 , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs 15 from groups_jobs 16 group by groupkey 17 ) j 18 on j.groupkey = g.groupkey 19 / GROUPKEY DEPTNOS JOBS---------- ------------------------------ ------------------------------ 1 10 PRESIDENT 2 10,20,30 CLERK,MANAGER 3 20 ANALYST 4 30 SALESMANSo far so good. This all works pretty much as desired - except for a couple of things:The very simple loop insert code will be slow. OK, it is a one-time conversion job (in theory, but very few times at least) so that could probably be acceptable (except for my professional pride. ;-)But worse is, that I have groups where the string aggregation won't work - the string would have to be about varchar2(10000) for my real world case (some groups would contain something like 1600 OEM numbers) which won't work in SQL in the group by. :-(So I have tried an attempt using collections. First a collection of deptnos:SQL> create type deptno_tab_type as table of number(2) 2 / Type created. SQL> select od.job 2 , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos 3 from orig_data od 4 group by od.job 5 / JOB DEPTNOS--------- ------------------------------ANALYST DEPTNO_TAB_TYPE(20)CLERK DEPTNO_TAB_TYPE(10, 20, 30)MANAGER DEPTNO_TAB_TYPE(10, 20, 30)PRESIDENT DEPTNO_TAB_TYPE(10)SALESMAN DEPTNO_TAB_TYPE(30)All very good - no problems here. But then a collection of jobs:SQL> create type job_tab_type as table of varchar2(9) 2 / Type created. SQL> select s2.deptnos 2 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs 3 from ( 4 select od.job 5 , cast(collect(od.deptno order by od.deptno) as deptno_tab_type) deptnos 6 from orig_data od 7 group by od.job 8 ) s2 9 group by s2.deptnos 10 / group by s2.deptnos *ERROR at line 9:ORA-00932: inkonsistente datatyper: forventede -, fik XAL_SUPERVISOR.DEPTNO_TAB_TYPENow it fails with an "inconsistent datatype" error - I cannot group by a collection datatype (without doing something more...)This is the time when I asked for help on the OTN forums, and odie_63 kindly reminded me that an object datatype can be grouped if it has a MAP or an ORDER member method. A MAP function would probably be the most efficient, but it requires being able to return a numerical value that will be unique - not realistic in this case. Odie suggested a simple ORDER method instead:SQL> create type deptno_container as object ( 2 nt deptno_tab_type 3 , order member function match (o deptno_container) return integer 4 ); 5 /Type created.SQL>SQL> create or replace type body deptno_container as 2 order member function match (o deptno_container) return integer is 3 begin 4 return case when nt = o.nt then 0 else 1 end; 5 end; 6 end; 7 /Type body created.It relies on the fact that when the nested tables are collections of simple datatypes, they can be compared for equality by default.That allows me to "group by groups":SQL> select row_number() over(order by null) groupkey 2 , s2.deptnos 3 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs 4 from ( 5 select od.job 6 , deptno_container( 7 cast(collect(od.deptno order by od.deptno) as deptno_tab_type) 8 ) deptnos 9 from orig_data od 10 group by od.job 11 ) s2 12 group by s2.deptnos 13 / GROUPKEY DEPTNOS(NT) JOBS---------- ------------------------ -------------------------------- 1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN') TAB_TYPE(30)) 2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT') TAB_TYPE(10)) 3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER') TAB_TYPE(10, 20, 30)) 4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST') TAB_TYPE(20))And Odie then also suggested a method to use that query to insert into the three tables:SQL> insert all 2 when rn0 = 1 then 3 into groups (groupkey) 4 values (groupkey) 5 when rn1 = 1 then 6 into groups_jobs (groupkey, job) 7 values (groupkey, job) 8 when rn2 = 1 then 9 into groups_depts (groupkey, deptno) 10 values(groupkey, deptno) 11 with all_groups as ( 12 select row_number() over(order by null) groupkey 13 , s2.deptnos 14 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs 15 from ( 16 select od.job 17 , deptno_container( 18 cast(collect(od.deptno order by od.deptno) as deptno_tab_type) 19 ) deptnos 20 from orig_data od 21 group by od.job 22 ) s2 23 group by s2.deptnos 24 ) 25 select groupkey 26 , value(j) job 27 , value(d) deptno 28 , row_number() over ( 29 partition by groupkey 30 order by null 31 ) rn0 32 , row_number() over ( 33 partition by groupkey, value(j) 34 order by null 35 ) rn1 36 , row_number() over ( 37 partition by groupkey, value(d) 38 order by null 39 ) rn2 40 from all_groups t 41 , table(t.jobs) j 42 , table(t.deptnos.nt) d 43 ;15 rows created.Let's just check the result again:SQL> select g.groupkey 2 , d.deptnos 3 , j.jobs 4 from groups g 5 join ( 6 select groupkey 7 , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos 8 from groups_depts 9 group by groupkey 10 ) d 11 on d.groupkey = g.groupkey 12 join ( 13 select groupkey 14 , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs 15 from groups_jobs 16 group by groupkey 17 ) j 18 on j.groupkey = g.groupkey 19 / GROUPKEY DEPTNOS JOBS---------- ------------------------ -------------------------------- 1 30 SALESMAN 2 10 PRESIDENT 3 10,20,30 CLERK,MANAGER 4 20 ANALYSTYes! This creates the same groupings as I did with the string aggregation (OK, different key, but that is just a pseudo key so no matter ;-). Now I have a method that can "group by groups" using collections rather than string aggregation and thus work on larger groups than can be contained in a VARCHAR2.But then I used that method on the real large set of data. Bummer :-( It created about 1½ million groups rather than 3/4 millions - there were duplicate groups!So I guessed it might have something to do with the ORDER member method always returning "1" for un-equal collections, rather than "-1" for "smaller than" and "1" for "greater than". So I made a new body for the container object type:SQL> create or replace type body deptno_container as 2 order member function match (o deptno_container) return integer is 3 begin 4 for i in 1..nt.count loop 5 if i > o.nt.count then 6 return 1; 7 elsif nt(i) > o.nt(i) then 8 return 1; 9 elsif nt(i) < o.nt(i) then 10 return -1; 11 end if; 12 end loop; 13 if nt.count < o.nt.count then 14 return -1; 15 else 16 return 0; 17 end if; 18 end; 19 end; 20 /Type body created.This one compares the elements of the collections and defines when a collection is "greater than" or "smaller than" another.Let's test the new type body:SQL> select row_number() over(order by null) groupkey 2 , s2.deptnos 3 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs 4 from ( 5 select od.job 6 , deptno_container( 7 cast(collect(od.deptno order by od.deptno) as deptno_tab_type) 8 ) deptnos 9 from orig_data od 10 group by od.job 11 ) s2 12 group by s2.deptnos 13 / GROUPKEY DEPTNOS(NT) JOBS---------- ------------------------ -------------------------------- 1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT') TAB_TYPE(10)) 2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER') TAB_TYPE(10, 20, 30)) 3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST') TAB_TYPE(20)) 4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN') TAB_TYPE(30))Hooray - we get the same groups! (OK, different group keys - presumably due to the new order method - but that's OK.)But I was still troubled that it did not really work as it should. If I just did the above query, it would perfectly correct create about 3/4 million groups. But when I put the same query into the INSERT ALL construct, somewhat over 800.000 groups were created - duplicates still existed :-(So I went back to my good old trusted copy of Tom Kyte's "Expert One-On-One Oracle" book in the chapter on object relational coding. I decided to try to do this using an O-R view and an instead-of trigger:First yet another object type:SQL> create type groups_container as object ( 2 groupkey number 3 , deptnos deptno_container 4 , jobs job_tab_type 5 ) 6 /Type created.Then an object type view:SQL> create view groups_view 2 of groups_container 3 with object identifier (groupkey) 4 as 5 select g.groupkey 6 , deptno_container( 7 cast( 8 multiset( 9 select gd.deptno 10 from groups_depts gd 11 where gd.groupkey = g.groupkey 12 order by gd.deptno 13 ) 14 as deptno_tab_type 15 ) 16 ) deptnos 17 , cast( 18 multiset( 19 select gj.job 20 from groups_jobs gj 21 where gj.groupkey = g.groupkey 22 order by gj.job 23 ) 24 as job_tab_type 25 ) jobs 26 from groups g 27 /View created.And finally the instead-of trigger on the view:SQL> create trigger groups_view_ins 2 instead of insert 3 on groups_view 4 begin 5 insert into groups (groupkey) 6 values (:new.groupkey); 7 8 forall i in 1..:new.deptnos.nt.count 9 insert into groups_depts (groupkey, deptno) 10 values (:new.groupkey, :new.deptnos.nt(i)); 11 12 forall i in 1..:new.jobs.count 13 insert into groups_jobs (groupkey, job) 14 values (:new.groupkey, :new.jobs(i)); 15 end; 16 /Trigger created.Now I can insert into my three tables by inserting into the O-R view:SQL> insert into groups_view ( 2 groupkey, deptnos, jobs 3 ) 4 select rownum groupkey 5 , s3.deptnos 6 , s3.jobs 7 from ( 8 select s2.deptnos 9 , cast(collect(s2.job order by s2.job) as job_tab_type) jobs 10 from ( 11 select od.job 12 , deptno_container( 13 cast(collect(od.deptno order by od.deptno) as deptno_tab_type) 14 ) deptnos 15 from orig_data od 16 group by od.job 17 ) s2 18 group by s2.deptnos 19 ) s3 20 /4 rows created.Note how it says it created 4 "rows" - which is 4 "instances of the object", but translates into this much data in the three tables:SQL> select * 2 from groups 3 order by groupkey 4 / GROUPKEY---------- 1 2 3 4SQL> select * 2 from groups_depts 3 order by groupkey, deptno 4 / GROUPKEY DEPTNO---------- ---------- 1 10 2 10 2 20 2 30 3 20 4 306 rows selected.SQL> select * 2 from groups_jobs 3 order by groupkey, job 4 / GROUPKEY JOB---------- --------- 1 PRESIDENT 2 CLERK 2 MANAGER 3 ANALYST 4 SALESMANOr if we take the same control statement I've used before:SQL> select g.groupkey 2 , d.deptnos 3 , j.jobs 4 from groups g 5 join ( 6 select groupkey 7 , rtrim(xmlagg(xmlelement(d,deptno,',').extract('//text()') order by deptno),',') deptnos 8 from groups_depts 9 group by groupkey 10 ) d 11 on d.groupkey = g.groupkey 12 join ( 13 select groupkey 14 , rtrim(xmlagg(xmlelement(j,job,',').extract('//text()') order by job),',') jobs 15 from groups_jobs 16 group by groupkey 17 ) j 18 on j.groupkey = g.groupkey 19 / GROUPKEY DEPTNOS JOBS---------- ------------------------ -------------------------------- 1 10 PRESIDENT 2 10,20,30 CLERK,MANAGER 3 20 ANALYST 4 30 SALESMANYes - we got the same groups of groups again. And we now even have a short and sweet method of querying out the same group objects:SQL> select * 2 from groups_view 3 order by groupkey 4 / GROUPKEY DEPTNOS(NT) JOBS---------- ------------------------ -------------------------------- 1 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('PRESIDENT') TAB_TYPE(10)) 2 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('CLERK', 'MANAGER') TAB_TYPE(10, 20, 30)) 3 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('ANALYST') TAB_TYPE(20)) 4 DEPTNO_CONTAINER(DEPTNO_ JOB_TAB_TYPE('SALESMAN') TAB_TYPE(30))Same result as we've also seen before.And finally I am happy with the result. When I did this to my real world data, I got exactly the 3/4 million unique groups. Yay :-DI had actually expected the instead-of trigger insertion to slow down the process, as I introduced a kind of "row-by-row" processing here. But in reality it turned out to be slightly faster (about 50 minutes compared to an hour) than the method using TABLE to "break out" the collections and then INSERT ALL. There may have been differences in load on the server at the time, so I won't conclude that one method or the other is faster - I'll just note that the instead-of trigger method for this particular use case at least didn't slow things down ;-)And that's a wrap - now I can start using the generated groups of data :-D

view counter

Read the entire article at its source

view counter