Finally, got some time to write the third post of this series. The Optimizer – Part I and Optimizer – Part II are the best reference before reading this post. From the Part II, we inferred that :
- TIME_ID – Assumptions v/s Actuals √
- AMOUNT_SOLD – Assumptions v/s Actuals Χ
- PROMO_ID – Assumptions v/s Actuals Χ
However, with a minor change, which was on a copy of TEST_SELECTIVITY table, the equation changed to:
- TIME_ID – Assumptions v/s Actuals √
- TIME_ID – Assumptions v/s Actuals (minor change) Χ
- AMOUNT_SOLD – Assumptions v/s Actuals Χ
- PROMO_ID – Assumptions v/s Actuals Χ
A small change triggered a mismatch in the cardinality calculations of TIME_ID column, which was otherwise nearly accurate. For a Query Performance, optimal execution plan is very critical and for an optimal execution plan, it is very important that the Optimizer comes out with an accurate cardinality. As we have seen, in our previous blogs, SELECTIVITY is another significant factor and is the starting point for the Optimizer. While Cardinality is calculated by the Optimizer, Selectivity is (in most of the cases) stored in the data dictionary, by way of Statistics gathered using dbms_stats (or any other method provided by some Application Vendors).
Optimizer is a piece of code. The default behaviour (at least for a newly created table) of the optimizer is that it considers the data distribution as UNIFORM. For example, in our case (before the minor change), the data in TIME_ID column was Uniform and therefore, the optimizer calculation was nearly accurate. However, the other two columns (AMOUNT_SOLD & PROMO_ID), the data was non-uniform and therefore, Optimizer assumption v/s the actual data distribution were way out. After the table creation, the statistics were gathered automatically (as a part new feature of 12c). In 11g or earlier versions, you will have to gather the statistics manually. You should see the same results. The initial statistics were fed to the optimizer as a Uniform data. See below :
COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY
------------------ ------------ ---------- ----------
AMOUNT_SOLD 636 0 .001572327
CUST_ID 7056 0 .000141723
PROD_ID 72 0 .013888889
PROMO_ID 4 0 .25
QUANTITY_SOLD 1 0 1
TIME_ID 1096 0 .000912409
How do we fix the problem of Mis-Estimates? In this case, the DENSITY column was used as a SELECTIVITY and for each of the columns, it is calculated as if the data is Uniform. This mis-calculation resulted in errorneous optimizer calculation. How do we fix it? As mentioned, optimizer is a piece of code and it has to come out with it’s calculation based on the input provided. In the absence of additional statistics or accurate statistics, Optimizer will assume UNIFORM distribution and will mis-calculate the SELECTIVITY and the CARDINALITY, as we have seen with our test cases. We have to provide accurate inputs for the optimizer to come up with nearly accurate statistics and one approach to provide these additional and accurate statistics are Histograms.
Let us regather statistics on the table again and check the change in the DENSITY value for each of the columns.
exec dbms_stats.gather_table_stats(user,'TEST_SELECTIVITY', method_opt=>'for all columns size auto', estimate_percent=>100);
The resultant output is as below:
select column_id, column_name, num_distinct, num_nulls,
density, histogram
from dba_tab_columns
where owner='SCOTT'
and table_name='TEST_SELECTIVITY'
order by 1;
COLUMN_ID COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM
---------- ----------------- ------------ ---------- ---------- --------------------
1 PROD_ID 72 0 .013888889 NONE
2 CUST_ID 7056 0 .000141723 NONE
3 TIME_ID 1096 0 .000912409 NONE
4 PROMO_ID 4 0 .000000625 FREQUENCY
5 QUANTITY_SOLD 1 0 1 NONE
6 AMOUNT_SOLD 636 0 .0018217 HEIGHT BALANCED
The Density for the two out of the three columns is changed and the HISTOGRAM column gives an additional information that we have some additional statistics on the two columns.
There are 2 questions here
- Why the subsequent gathering of statistics gathered additional statistics (HISTOGRAM)?
- Why there are no Additional Statistics (HISTOGRAMS) on other Columns?
The answer to the first question is that the queries on each of the tables and each of the columns are tracked in SYS.COL_USAGE$. The subsequent stats gathering job will refer to this table to get the column details on which the additional statistics are required. See below :
exec dbms_stats.flush_database_monitoring_info();
select intcol#, column_name, equality_preds, RANGE_PREDS
from sys.col_usage$ cu, dba_tab_columns tc
where obj# = (select data_object_id from dba_objects
where object_name='TEST_SELECTIVITY')
and cu.intcol# = tc.column_id
and tc.table_name='TEST_SELECTIVITY';
INTCOL# COLUMN_NAME EQUALITY_PREDS RANGE_PREDS
---------- ----------------- -------------- -----------
6 AMOUNT_SOLD 1 1
4 PROMO_ID 1 0
3 TIME_ID 1 1
The answer to the second question is for the other columns (except TIME_ID), there were no queries executed, thus there were no information collected in COL_USAGE$. For the TIME_ID, there are no HISTOGRAMS even though we executed few queries (and COL_USAGE$ has an entry). The data in this column is UNIFORM and this is the additional check, that is internally made at the time of gathering statistics. During statistics generation, sample data for each of the column is computed and data is validated. If the data is found to be UNIFORM, no histograms are generated as it is a resource intensive process and generating histogram will not make any sense (at least not worth the resources required to generate histograms).
If you recollect from our Part II, the minor changes on the TIME_ID column was on another table TEST_SELECTIVITY_M, which was an exact replica of TEST_SELECTIVITY. If we gather statistics on TEST_SELECTIVITY_M, let’s see the results.
select column_id, column_name, num_distinct, num_nulls,
density, histogram
from dba_tab_columns
where owner='SCOTT'
and table_name='TEST_SELECTIVITY_M'
order by 1;
COLUMN_ID COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM
---------- -------------------- ------------ ---------- ---------- ---------------
1 PROD_ID 72 0 .013888889 NONE
2 CUST_ID 7056 0 .000141723 NONE
3 TIME_ID 1097 0 .000914025 HEIGHT BALANCED
4 PROMO_ID 4 0 .25 NONE
5 QUANTITY_SOLD 1 0 1 NONE
6 AMOUNT_SOLD 636 0 .001572327 NONE
On this table, the query executed was only on TIME_ID column and therefore, the additional statistics were on TIME_ID column.
Coming back to TEST_SELECTIVITY. Now, we have a Frequency Histograms on PROMO_ID Column and Height Balanced Histogram on AMOUNT_SOLD column. Until 11g, we had these 2 types of Histograms. 12c introduced TopN Frequency and Hybrid Histograms, which I will cover in the last part of this series. I am on 12c and therefore, to generate Frequency and Height Balanced Histograms, I had to use estimate_percent as 100 (more on this in the next blog).
Frequency Histograms are generated if the number of distinct values are less than the number of Buckets. These Buckets, if not specified during statistics gathering, defaults to 254. PROMO_ID column has 4 distinct values, whereas, AMOUNT_SOLD has 636, which is more than 254 and hence Height Balanced Histograms. Lets execute our queries on these 2 columns and check the CARDINALITY estimates.
select column_id, column_name, num_distinct, num_nulls,
density, histogram
from dba_tab_columns
where owner='SCOTT'
and table_name='TEST_SELECTIVITY'
order by 1;
COLUMN_ID COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM
---------- ----------------- ------------ ---------- ---------- --------------------
1 PROD_ID 72 0 .013888889 NONE
2 CUST_ID 7056 0 .000141723 NONE
3 TIME_ID 1096 0 .000912409 NONE
4 PROMO_ID 4 0 .000000625 FREQUENCY
5 QUANTITY_SOLD 1 0 1 NONE
6 AMOUNT_SOLD 636 0 .0018217 HEIGHT BALANCED
Since we have additional statistics, lets check the details from DBA_TAB_HISTOGRAMS for this column.
SQL> select ENDPOINT_NUMBER, ENDPOINT_VALUE
2 from dba_tab_histograms
3 where table_name='TEST_SELECTIVITY'
4 and column_name='PROMO_ID'
5 order by 1;
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
2074 33
20052 350
22297 351
800000 999
For the Frequency Histogram, the data is stored in a cumulative manner. The Endpoint_number stores the cumulative number of rows and the Endpoint_value stores the actual column value. For example, for PROMO_ID=33, we expect 2074 rows, for PROMO_ID=350, we expect 20052-2074=17981 rows, for PROMO_ID=351, we expect 22297-20052=2245 rows and so on.. Lets run the queries for each of these PROMO_ID’s.
SQL> set autot trace
SQL> select cust_id, amount_sold, promo_id from test_selectivity where promo_id=999;
777703 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 777K| 9873K| 960 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 777K| 9873K| 960 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROMO_ID"=999)
The Optimizer Calculation for cardinality matches the actual number of rows fetched. For other values too, these were perfectly matching (see below).
SQL> select cust_id, amount_sold, promo_id from test_selectivity where promo_id=350;
17978 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 17978 | 228K| 958 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 17978 | 228K| 958 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROMO_ID"=350)
SQL> select cust_id, amount_sold, promo_id from test_selectivity where promo_id=33;
2074 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2074 | 26962 | 958 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 2074 | 26962 | 958 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROMO_ID"=33)
SQL> select cust_id, amount_sold, promo_id from test_selectivity where promo_id=351;
2245 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2245 | 29185 | 958 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 2245 | 29185 | 958 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROMO_ID"=351)
Perfect. The calculation in this case is very simple. Take the values from DBA_TAB_HISTOGRAMS and get the accurate CARDINALITY. However, this stands good for the values that exists and are part of the histograms. What if we run a query against a value that doesn’t exists in the table or had no rows when the stats were gathered, but have few or more rows when the queries are executed against this value ? This value will have no cumulative data into DBA_TAB_HISTOGRAMS. In such cases, will Optimizer fall back to CARDINALITY = SELECTIVITY x NUM_ROWS, where SELECTIVITY is DENSITY ? Lets check.
select column_id, column_name, num_distinct, num_nulls,
density, histogram
from dba_tab_columns
where owner='SCOTT'
and table_name='TEST_SELECTIVITY'
and column_name='PROMO_ID';
COLUMN_ID COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM
---------- ----------------- ------------ ---------- ---------- --------------------
4 PROMO_ID 4 0 .000000625 FREQUENCY
SQL> select &&optdensity * 800000 Cardinality from dual;
old 1: select &&optdensity * 800000 Cardinality from dual
new 1: select .000000625 * 800000 Cardinality from dual
CARDINALITY
-----------
.5
If the Density is considered as a SELECTIVITY, the expected CARDINALITY will be 1 (ceil of 0.5). I will run a query with PROMO_ID=500, which doesn’t exists.
SQL> set autot trace
SQL> select * from test_selectivity where promo_id=500;
no rows selected
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1037 | 25925 | 958 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 1037 | 25925 | 958 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROMO_ID"=500)
Actual Number of Rows are ZERO, Optimizer Estimated as 1037 and SELECTIVITY (density) based expected was 1. ZERO v/s 1037, a huge mis-estimate. Also, we can see that with histograms, optimizer does not consider DENSITY column. How do we get the calculation ? Here, 10053 trace file comes handy. Lets generate a 10053 trace for a non-existent value and see the relevant portion that contains the calculation.
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for TEST_SELECTIVITY[A]
SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
Column (#4):
NewDensity:0.001296, OldDensity:0.000001 BktCnt:800000.000000, PopBktCnt:800000.000000, PopValCnt:4, NDV:4
Column (#4): PROMO_ID(NUMBER)
AvgLen: 4 NDV: 4 Nulls: 0 <b?Density: 0.001296 Min: 33.000000 Max: 999.000000
Histogram: Freq #Bkts: 4 UncompBkts: 800000 EndPtVals: 4 ActualVal: yes
Table: TEST_SELECTIVITY Alias: A
Card: Original: 800000.000000 Rounded: 1037 Computed: 1037.000000 Non Adjusted: 1037.000000
As per 10053, the Rounded and Computed Cardinality is 1037. The Density is 0.001296. However, the Density from DBA_TAB_COLUMNS is .000000625. There are two additional statistics : NewDensity and OldDensity. OldDensity is 0.000001, which is the rounded off value for the actual Density stored in DBA_TAB_COLUMNS i.e .000000625. What is NewDensity ? The value against this is used as a final Density to calculate the Cardinality i.e.0.001296*800000 = 1037. It seems, for a non-existent value, Optimizer computes this NewDensity and uses this as a SELECTIVITY to come out with the Expected Cardinality.
The calculation for NewDensity, in case of Frequency Histogram is 50% of the lowest number of rows in DBA_TAB_HISTOGRAMS, which is 0.5 x 2074/NUM_ROWS = 0.00129625. So, NewDensity becomes the SELECTIVITY and CARDINALITY is SELECTIVITY x NUM_ROWS, 0.00129625 x 800000 = 1037(see below).
select promo_Id, count(*) from test_selectivity group by promo_id order by 2;
PROMO_ID COUNT(*)
---------- ----------
33 2074 select 0.5*2074/800000 NewDensity from dual;
NEWDENSITY
----------
.00129625
SQL> select round(&&new_density*800000,0) from dual;
old 1: select round(&&new_density*800000,0) from dual
new 1: select round( .00129625*800000,0) from dual
ROUND(.00129625*800000,0)
-------------------------
1037
Before we get into more details, let us check the Height Balanced Histograms. We have a Height Balanced Histogram on Amount_Sold Column.
SQL> select column_id, column_name, num_distinct, num_nulls,
2 density, histogram
3 from dba_tab_columns
4 where owner='SCOTT'
5 and table_name='TEST_SELECTIVITY'
6 and column_name='AMOUNT_SOLD'
7 order by 1;
COLUMN_ID COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM
---------- ----------------- ------------ ---------- ---------- --------------------
6 AMOUNT_SOLD 636 0 .0018217 HEIGHT BALANCED
We have 636 Distinct Values for this column and the maximum number of Buckets are 254. The way these histograms are generated is that the number of rows in the table is equally divided into 254 buckets. The Maximum value for each of the bucket is calculated and then the buckets are compressed, if a value spans across more than 1 Bucket. I executed a query, which is similar to the query executed by the Optimizer during the statistics gathering (see below).
SQL> select bucket, count(*), min(amount_sold) min_amt, max(amount_sold) max_amt from (
2 select amount_sold, ntile(254) over (order by amount_sold) bucket
3 from test_selectivity
4 order by amount_sold)
5 group by bucket
6 order by 1;
BUCKET COUNT(*) MIN_AMT MAX_AMT
---------- ---------- ---------- ----------
1 3150 6 7
2 3150 7 7
3 3150 7 7 <-- Popular Value (3 Buckets)
4 3150 7 8
5 3150 8 8
6 3150 8 8
7 3150 8 8
8 3150 8 8
9 3150 8 8
10 3150 8 8
11 3150 8 8 <-- Popular Value (8 Buckets)
12 3150 8 9
13 3150 9 9
14 3150 9 9
15 3150 9 9
16 3150 9 9
17 3150 9 9
18 3150 9 9
19 3150 9 9
20 3150 9 9
21 3150 9 9 <-- Popular Value (10 Buckets)
22 3150 9 10
23 3150 10 10
24 3150 10 10
25 3150 10 10
26 3150 10 10
27 3150 10 10
28 3150 10 10
29 3150 10 10
30 3150 10 10
31 3150 10 10
32 3150 10 10
33 3150 10 11
34 3150 11 11
35 3150 11 11
36 3150 11 11
37 3150 11 11
38 3150 11 11
39 3150 11 11
40 3150 11 11
41 3150 11 11
42 3150 11 12
43 3150 12 12
44 3150 12 12
45 3150 12 12
46 3150 12 13
47 3150 13 13
48 3150 13 13
49 3150 13 13
50 3150 13 13
51 3150 13 13
52 3150 13 13
53 3150 13 13
54 3150 13 14
55 3150 14 14
56 3150 14 14
57 3150 14 14
58 3150 14 15 <-- Non-Popular (Only 1 Bucket)
59 3150 15 16
60 3150 16 16
61 3150 16 17
62 3150 17 17
63 3150 17 17
64 3150 17 17
65 3150 17 18 <-- Non-Popular (1 Bucket)
66 3150 18 19
67 3150 19 19
68 3150 19 19
69 3150 19 20
70 3150 20 20
71 3150 20 21
72 3150 21 21
73 3150 21 21
74 3150 21 21
75 3150 21 21
76 3150 21 21
77 3150 21 22
78 3150 22 22
79 3150 22 22
80 3150 22 22
81 3150 22 22
82 3150 22 23
83 3150 23 23
84 3150 23 23
85 3150 23 23
86 3150 23 23
87 3150 23 24
88 3150 24 24
89 3150 24 24
90 3150 24 24
91 3150 24 24
92 3150 24 25
93 3150 25 25
94 3150 25 25
95 3150 25 25
96 3150 25 25
97 3150 25 26
98 3150 26 26
99 3150 26 26
100 3150 26 26
101 3150 26 27
102 3150 27 27
103 3150 27 28
104 3150 28 28
105 3150 28 28
106 3150 28 28
107 3150 28 29
108 3150 29 29
109 3150 29 29
110 3150 29 30
111 3150 30 30
112 3150 30 30
113 3150 30 30
114 3150 30 30
115 3150 30 31
116 3150 31 31
117 3150 31 31
118 3150 31 32
119 3150 32 32
120 3150 32 33
121 3150 33 33
122 3150 33 33
123 3150 33 34
124 3150 34 34
125 3150 34 34
126 3150 34 35
127 3150 35 36
128 3150 36 36
129 3150 36 38
130 3150 38 38
131 3150 38 38
132 3150 38 39
133 3150 39 39
134 3150 39 39
135 3150 39 40
136 3150 40 40
137 3150 40 41
138 3150 41 41
139 3150 41 42
140 3150 42 42
141 3150 42 43
142 3150 43 43
143 3150 43 45
144 3150 45 45
145 3150 45 46
146 3150 46 46
147 3150 46 46
148 3150 46 46
149 3150 46 46
150 3150 46 46
151 3150 46 47
152 3150 47 47
153 3150 47 47
154 3150 47 47
155 3149 47 47
156 3149 47 47
157 3149 47 47
158 3149 47 47
159 3149 47 48
160 3149 48 48
161 3149 48 48
162 3149 48 48
163 3149 48 48
164 3149 48 48
165 3149 48 48
166 3149 48 48
167 3149 48 49
168 3149 49 49
169 3149 49 49
170 3149 49 49
171 3149 49 49
172 3149 49 49
173 3149 49 49
174 3149 49 49
175 3149 49 50
176 3149 50 50
177 3149 50 51
178 3149 51 51
179 3149 51 51
180 3149 51 51
181 3149 51 51
182 3149 51 52
183 3149 52 52
184 3149 52 52
185 3149 52 53
186 3149 53 53
187 3149 53 53
188 3149 53 54
189 3149 54 54
190 3149 54 54
191 3149 54 55
192 3149 55 56
193 3149 56 56
194 3149 56 57
195 3149 57 57
196 3149 57 58
197 3149 58 58
198 3149 58 59
199 3149 59 60
200 3149 60 60
201 3149 60 62
202 3149 62 62
203 3149 62 63
204 3149 63 63
205 3149 63 64
206 3149 64 64
207 3149 64 65
208 3149 65 66
209 3149 66 70
210 3149 70 72
211 3149 72 74
212 3149 74 79
213 3149 79 90
214 3149 90 94
215 3149 94 97
216 3149 97 101
217 3149 101 113
218 3149 113 115
219 3149 115 117
220 3149 117 123
221 3149 123 125
222 3149 125 127
223 3149 127 131
224 3149 131 136
225 3149 136 151
226 3149 151 158
227 3149 158 163
228 3149 163 170
229 3149 170 180
230 3149 180 199
231 3149 199 203
232 3149 203 208
233 3149 208 211
234 3149 211 214
235 3149 214 225
236 3149 225 302
237 3149 302 307
238 3149 307 531
239 3149 531 552
240 3149 552 594
241 3149 594 602
242 3149 602 629
243 3149 629 900
244 3149 900 973
245 3149 973 1016
246 3149 1016 1054
247 3149 1054 1093
248 3149 1093 1192
249 3149 1192 1237
250 3149 1237 1301
251 3149 1301 1463
252 3149 1463 1546
253 3149 1546 1639
254 3149 1639 1783
254 rows selected.
Total Number of rows in this table is 800000 divided by 254 Buckets is 3149 Rows. From the output above, it can be seen that each bucket has 3149 rows and there are some popular and non-popular values. For example : 7.8.9 are Popular (there are other popular values as well) and 15,18 are Non-Popular (there are other non-popular values as well). Popular values are values spanning across 2 or more Buckets. Non-Popular Values are values with 1 or less bucket. Finally, when the histogram is generated, the popular buckets are compressed to save dictionary space and the resultant output from DBA_TAB_HISTOGRAM is as under.
SQL> select ENDPOINT_NUMBER, ENDPOINT_VALUE
2 from dba_tab_histograms
3 where table_name='TEST_SELECTIVITY'
4 and column_name='AMOUNT_SOLD'
5 order by 1;
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
0 6 <-- Popular Value
3 7 <-- Popular Value (3-0=3 Buckets)
11 8 <-- Popular Value (11-3=8 Buckets)
21 9 <-- Popular Value (21-11=10 Buckets)
32 10
41 11
45 12
53 13
57 14
58 15 <-- Non-Popular Value (58-57=1 Bucket)
60 16
64 17
65 18
68 19
70 20
76 21
81 22
86 23
91 24
96 25
100 26
102 27
106 28
109 29
114 30
117 31
119 32
122 33
125 34
126 35
128 36
131 38
134 39
136 40
138 41
140 42
142 43
144 45
150 46
158 47
166 48
174 49
176 50
181 51
184 52
187 53
190 54
191 55
193 56
195 57
197 58
198 59
200 60
202 62
204 63
206 64
207 65
208 66
209 70
210 72
211 74
212 79
213 90
214 94
215 97
216 101
217 113
218 115
219 117
220 123
221 125
222 127
223 131
224 136
225 151
226 158
227 163
228 170
229 180
230 199
231 203
232 208
233 211
234 214
235 225
236 302
237 307
238 531
239 552
240 594
241 602
242 629
243 895
244 973
245 1016
246 1054
247 1093
248 1192
249 1237
250 1301
251 1463
252 1546
253 1639
254 1783
104 rows selected.
254 Buckets are compressed into 104 Buckets. The CARDINALITY calculations, in these cases are very simple. For Popular Value, it is 3149 (number of rows in each bucket) multiplied by number of Buckets. Let us run the queries and see the results.
## For 2 Buckets
SQL> select * from test_selectivity where amount_sold=56;
6204 rows selected.
Elapsed: 00:00:00.16
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6299 | 153K| 960 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 6299 | 153K| 960 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("AMOUNT_SOLD"=56)
## For 10 Buckets
SQL> select * from test_selectivity where amount_sold=9;
31964 rows selected.
Elapsed: 00:00:00.64
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 31496 | 768K| 960 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 31496 | 768K| 960 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("AMOUNT_SOLD"=9)
For Non-Popular or Non-Existent values.
Will it be DENSITY x NUM_ROWS ? i.e. 0.0018217 x 800000 = 1457. Lets run the query to check this.
SQL> select column_id, column_name, num_distinct, num_nulls,
2 density, histogram
3 from dba_tab_columns
4 where owner='SCOTT'
5 and table_name='TEST_SELECTIVITY'
6 and column_name='AMOUNT_SOLD'
7 order by 1;
COLUMN_ID COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM
---------- ----------------- ------------ ---------- ---------- --------------------
6 AMOUNT_SOLD 636 0 .0018217 HEIGHT BALANCED
SQL> select &&densit*800000 from dual;
old 1: select &&densit*800000 from dual
new 1: select .0018217*800000 from dual
.0018217*800000
---------------
1457.36
The Cardinality for non-popular values, as can be seen after executing the queries is as under.
SQL> select * from test_selectivity where amount_sold=55;
3372 rows selected.
Elapsed: 00:00:00.11
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 285 | 7125 | 960 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 285 | 7125 | 960 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("AMOUNT_SOLD"=55)
Value 55 is a Non-Popular Value. We expected the expected cardinality as 1457, but it is 285. Let us generate a 10053 trace for this and check the trace.
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for TEST_SELECTIVITY[A]
SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
Column (#6):
NewDensity:0.000356, OldDensity:0.001822 BktCnt:254.000000, PopBktCnt:201.000000, PopValCnt:50, NDV:636
Column (#6): AMOUNT_SOLD(NUMBER)
AvgLen: 4 NDV: 636 Nulls: 0 Density: 0.000356 Min: 6.000000 Max: 1783.000000
Histogram: HtBal #Bkts: 254 UncompBkts: 254 EndPtVals: 104 ActualVal: yes
Table: TEST_SELECTIVITY Alias: A
Card: Original: 800000.000000 Rounded: 285 Computed: 284.862003 Non Adjusted: 284.862003
We see a similar pattern here. NewDensity is used as a SELECTIVITY to compute the CARDINALITY (0.000356×800000=285). How is this NewDensity calculated for Height Balanced Histograms ? It is computed as :
[(NPBKTCNT)/(BKTCNT * (NDV – POPVALCNT))]
From the 10053 trace, we can get the values of each of these. BKTCNT (Bucket Count) is 254, POPBKCNT (Popular Bucket Count) are 201. This makes NPBKCNT as 254-201=53. NDV (Number of Distinct Values is 636 and POPVALCNT (Popular Value Counts) are 50. Applying these values, we get [53/(254 *(636-50))] = .000356078
SQL> select (53/(254*(636-50))) newdensity from dual;
NEWDENSITY
----------
.000356078
SQL> select ceil(&&ndensit * 800000) from dual;
old 1: select ceil(&&ndensit * 800000) from dual
new 1: select ceil(.000356078 * 800000) from dual
CEIL(.000356078*800000)
-----------------------
285
NewDensity, I assume, was introduced in 11g, but is backported in 10204 as well. This was introduced as a Bug Fix. However, in our case, this is actually causing a mis-estimation. How do we disable this fix? The solution is disabling the fix_control 5483301 and setting _optimizer_enable_density_improvements to FALSE. Both these needs to be set together. We will set this at the session level and see the results for a Non-Existent value in a Frequency Histogram and a Non-Popular value in a Height Balanced Histogram.
SQL> alter session set "_fix_control"='5483301:off';
SQL> alter session set "_optimizer_enable_density_improvements"=false;
SQL> set autot trace
SQL> select * from test_selectivity where promo_id=500;
no rows selected
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 25 | 958 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 1 | 25 | 958 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROMO_ID"=500)
SQL> select * from test_selectivity where amount_sold=55;
3372 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 4083831454
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1457 | 36425 | 960 (2)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_SELECTIVITY | 1457 | 36425 | 960 (2)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("AMOUNT_SOLD"=55)
With these 2 settings, the Optimizer falls back to its Original Calculation of DENSITY x NUM_ROWS for Cardinality calculation.
It had been a long posting, however, I felt this to be necessary because many people still don’t know about this NewDensity. I was myself surprised when I was working on a real life issue and came across this mis-estimation. A 10053 trace revealed NewDensity, which was new for me as well. For the real life example, see below.
select count(*) from nca.s_p_attributes a1
WHERE a1.value='olwassenen';
COUNT(*)
------------
591168
SQL> select plan_table_output from table(dbms_xplan.display_cursor);
PLAN_TABLE_OUTPUT
----------------------------------------------------------
SQL_ID 79dfpvydpk710, child number 0
-------------------------------------
select count(*) from nca.s_p_attributes a1 WHERE
a1.value='olwassenen’
----------------------------------------------------------
| Id | Operation | Name | Rows |
----------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | SORT AGGREGATE | | 1|
|* 2 | INDEX SKIP SCAN | SP_P_IND3 | 8|
----------------------------------------------------------
The estimated and actual is way out. 8 Rows v/s 591168 Rows. At this point, I requested a 10053 trace, which pointed me to NewDensity value. The issue was resolved by way of disabling the fix_control and setting _optimizer_enable_density_improvements to FALSE.