Home » RDBMS Server » Performance Tuning » Best practice for querying weighted sparse data
Best practice for querying weighted sparse data [message #168691] Fri, 21 April 2006 11:22 Go to next message
kfritsche
Messages: 3
Registered: April 2006
Junior Member
I am trying to figure out the best way to efficently query a subset of rows based on a sparsely populated column.
For instance consider a table with 10 million rows and a status column with the values and respective counts as follows:
Pending 100
Active 10,000
Completed 99,989,900

For reporting purposes I don't want to archive the "completed", I want them in the same table, however I don't want to search through the "completed" rows when looking for a particular "active" record.

I have tried using a partition for the completed records, however this seems too intensive for OLTP and seems to require an analyze to make it work effectively.

In the past I have created shadow tables populated via triggers which only store "Pending" and "Active" records and when they are changed to "Completed", they are deleted from the table.

I tried compressed indexes to no effect.

This seems a common problem, based on how many times I have seen it.
So what does everyone else do?

Thanks

Re: Best practice for querying weighted sparse data [message #168694 is a reply to message #168691] Fri, 21 April 2006 12:10 Go to previous messageGo to next message
markmal
Messages: 113
Registered: April 2006
Location: Toronto, Canada
Senior Member
Honestly, running heavy reports on OLTP database is not desirable.

You may consider to get your data from OLTP DB into a separate reporting DB - data warehouse. I said "data warehouse" because you have mentioned "historical" aspect of your reporting.

You can transform data while you are loading data into data warehouse, to be more easy for reporting and analysis. You can pre-join tables, denormalize lookup tables (dimensions), cleanse dirty data, etc..
After historical data has been loaded into DW you can purge some oldest portion from operational OLTP DB (depending on your OLTP data retention period).

In DW you can create a bitmap indexes on columns that have low cardinality. they help a lot for reporting, but not as good in OLTP.

About partitioning. Most of reports have date ranges, constrained by a reporting period.
I would not partition table just by Status. I would better partition by a row creation date, and *probably* sub-partition by Status. This can give optimizer a chance to effectively prune not ineresting partitions based on your searching conditions.

Re: Best practice for querying weighted sparse data [message #168710 is a reply to message #168694] Fri, 21 April 2006 13:54 Go to previous messageGo to next message
kfritsche
Messages: 3
Registered: April 2006
Junior Member
There are no heavy reports running against these types of tables. I was just making a point that access to all records should be available, if a full table scan is needed for a query run once per month, then I don't care if that one is slow.
The problem is in the oltp side where new records need to be processed, but it takes a great deal of time to find them since an index on the status is useless. Date searching helps, but doesn't always work, like a record may be added late and it still needs processing, or maybe it went from pending to active after a week, where the average time is a minute.
I'm thinking in more general terms for quickly finding one of a small (enumerable) number of data values where the ocurrence of that value is in less than 1% of the records.
Bitmap indexes seem a logical choice, but as you mentioned they work better with a DW.
Re: Best practice for querying weighted sparse data [message #168725 is a reply to message #168691] Fri, 21 April 2006 16:19 Go to previous messageGo to next message
markmal
Messages: 113
Registered: April 2006
Location: Toronto, Canada
Senior Member
I see.

You have to gather stats with histograms. After that indexes will be used for Active and Pending, full scan for Completed.

example

MARK_MALAKANOV>begin
2 dbms_stats.gather_table_stats(user, 'SKEWED',
3 METHOD_OPT=>'FOR ALL INDEXED COLUMNS SIZE 250',
4 cascade=>true);
5 end;
6 /

[Updated on: Fri, 21 April 2006 16:21]

Report message to a moderator

Re: Best practice for querying weighted sparse data [message #168869 is a reply to message #168725] Sun, 23 April 2006 22:19 Go to previous messageGo to next message
rleishman
Messages: 3728
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Is partitioning too slow because rows that change status have to be migrated to a new partition? Fair enough, it could double your IO for UPDATEs(or worse if you have a lot of indexes). Otherwise, I know of no other reason why partitioning would cause a performance problem.

One technique I have used before (and I don't necessarily recommend it) is to create a Function Based Index that translates the "common" values to NULL.
eg. CREATE INDEX myindex on mytab(nullif(status, 'Completed'))

The up-side is that the index ONLY stores non-null values. In your case, that's around 10100 rows. Also, it is impossible to use the index wrongly to read the skewed column.

The down-side is that your SQL is non-intuitive. eg:
SELECT * FROM my_tab WHERE nullif(status, 'Completed') = 'Active'

On balance, it's usually better to just make sure you have a histogram (see Mark's reply above) although this can cause problems when your status is supplied as a bind-variable.

Ross Leishman
Re: Best practice for querying weighted sparse data [message #169614 is a reply to message #168869] Thu, 27 April 2006 14:38 Go to previous message
kfritsche
Messages: 3
Registered: April 2006
Junior Member
Excellent help.
Thanks
Previous Topic: SQL Tuning
Next Topic: Oracle tool on performance tuning
Goto Forum:
  


Current Time: Wed May 22 19:53:13 CDT 2024