The initial situation
One day, not so far before I went to my summer holidays, we got an ask to help one of our customers running Autocad (Autodesk) for a geo information system. The database, autocad is running on, is an 11g Enterprise Edition with spatial and tuning and diagnostics pack.The runtime for some queries, the customer told me, was over ten minutes - and they would be happy to get results within the application in less than a minute.
As I didn't had any points of contact with Autocad on Oracle database before, I made the decision to visit our customer and to get an impression of the database performance. There I saw 2 different kind of schemas. One user/schema type is normal, but the other one is using virtual private database (VPD). The application running on the vpd user schema was horrible in performance, the same application running on the non vpd user schema was still incredible slow with some statements, but still better. All normal performance and memory parameters have been set to reasonable values. All performance critical things like read-writes, cache hit rates, etc. showed me a very good performing database - all in all, this database seemed to be in a very good condition - which I normaly don't see if we where ask for performance reviews.
I searched on the internet for a possible answer and I found an autocad knowledge base entry named:
AutoCAD Map 3D and Oracle Versioning causes performance degradation
And what I have found there?
Causes:There will always be a performance degradation once the database is versioned. The same behavior happens with the Industry Model when the database is job-enabled. Both implementations use Virtual Private Databases. There is no conflict in Oracle versioning.
Solution:
Oracle tuning is necessary and requires participation with an Oracle DBA.
Well, not that helpful, also the other entries in the knowledge system of autodesk were not helpful, too (if autocad is slow, you should index - isn't that a job also for an ISV to sell a FAST and STABLE application?)
I first recognized, that most of the slow queries I have seen, were bound to selects from views - views with up to 20 different tables and a lot of joins. The customer then told me, that he have heard from other Autocad users that they stepped back to an old Oracle database version (9i!), as the selects where running faster. The newer the database, the worser the performance? Rule based optimizer beats Cost based optimizer? Sounds like an unsolved riddle, maybe it's also an unsolvable riddle? I didn't hoped so...
So I've asked the customer to get a dump of both schemas and I made a trace file with a small number of the worst performing selects. With this and with my curiosity I hoped to shed some light in the dark at home in my lab...
Tracing and Testing
On my MS Windows 10 (customer is running oracle on MS Windows server) laptop I imported both schemas in my 12.1.0.2 database. After that, I've started with my tests. First I made some tests of the sqls and traced them with a normal sql trace to get also the execution plans. The behavior of my 12.1 database seemed to be totally the same as the 11.2 of our customer, not in total seconds, but in factors of the runtime between non-VPD and VPD schema selects.
Again, I've found a nearly good execution plan for one of the worst statements - even the roundabout 20 joins seemed to be a good choice. But the total runtime was still bad.
Parts of the trace file showed again, that the database executed and fetched the values incredible fast. Again, all parameters and cache hit rates, etc. looked very promising. But there was a very high elapsed and cpu time for parsing the statements:
select count(*) from el_v_wp_sleeve
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 171.39 173.50 4 276 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 1 0.07 0.25 924 13744 0 1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 3 171.46 173.75 928 14020 0 1
--------------------------------------------------------------------------------
Together with the customer statement, that autocad is running faster on older oracle versions, I realized that my first thought I had before is right:
It seems to be likely to have some problems within the cost based optimizer as the older versions of oracle used other internal functions in CBO or the RBO. I decided to turn on the cost based optimizer tracing:
alter session set events '10053 trace name context forever, level 1';
Some test executions and 2 GB of trace files later, I started my investigation. What was happening in the parse phase that was so exhausting for the optimizer to blow up the runtime to nearly 3 minutes, when executing and fetch only do need a quarter of a second? It didn't took me that long to find out, that the optimizer was trying a lot of different execution and join plans to find the best one. And - at the end - the plan found and used is really good - this was crystal clear by looking at the execution and fetch times.
The system statistics, the optimizer uses, are fine:
-----------------------------
SYSTEM STATISTICS INFORMATION
-----------------------------
Using dictionary system stats.
Using NOWORKLOAD Stats
CPUSPEEDNW: 3305 millions instructions/sec (default is 100)
IOTFRSPEED: 4096 bytes per millisecond (default is 4096)
IOSEEKTIM: 10 milliseconds (default is 10)
MBRC: NO VALUE blocks (default is 8)
And the Table and Index Stats in the Base Statistical Information didn't show any strange values. But in the optimizer optimation part of the trace file, I've found a couple of different trys: Considering view merge, considering outer-join elimination, considering predicate, find best directive, ... and again considering, find directive ... and again considering, ...Obvious to say - most of them are
--- snip ---
rejected
OJE: outer-join not eliminated
OJE: End: finding best directive for query block SEL$2 (#0)
CVM: Checking validity of merging in query block SEL$3 (#0)
CVM: Considering view merge in query block SEL$3 (#0)
OJE: Begin: find best directive for query block SEL$3 (#0)
OJE: End: finding best directive for query block SEL$3 (#0)
SVM: SVM bypassed: Query NO MERGE hint.
CVM: Considering view merge in query block SEL$4 (#0)
OJE: Begin: find best directive for query block SEL$4 (#0)
OJE: Considering outer-join elimination on query block SEL$4 (#0)
OJE: considering predicate"SP"."FID_ATTR"="IP"."FID"
--- snip end ---
Considering so many different possible execution plans made it at the end to a fast execution plan, but, the next time this statement executes, it starts parsing from the beginning, considering again a lot of possibilities. I never have seen that behavior before, but as I have found the problem, I knew, there will be some kind of workaround.
Speeding up performance from minutes to parts of seconds
Finding the problem is normally 80 percent of the solution - so I was happy to find some kind of root cause, even if I didn't know if this could be normal behavior or if this is a bug in the optimizer.
And the optimizer said: Help me - Hint me
The first thing I have tested was to tune the queries in the non-VPD schema. This is the most relevant case for our customer. After searching in google and in Oracle support a lot, trying to change session (or system parameters) to other values, I decided to create some more indexes (more than 10 new ones) on the tables that are used in the view. This speeded up the parse phase - for a couple of seconds, but it was still a parse phase nearly 3 minutes in time. So I knew, the new indexes are used in the new execution plan and the new execution plan executed and fetched faster as the old one. Speeding up the execution and fetch phase from 0.25 seconds to 0.17 - great work!
But as the customer problem was not the execution and fetch phase, it still was not a solution. As I knew the query and the final execution plan, I started to put some hints in the view - I wanted to tell the optimizer more to let him less decisions to consider. Trying different Index and Use_-/NO_USE hints also speeded up the parse time for seconds, but it was still some boring 2 minutes in time. One of the last shots I wanted to try fitted: Creating the view with a NO_QUERY_TRANSFORMATION hint. Great!
SELECT /*+ NO_QUERY_TRANSFORMATION */ G.FID, G.SLEEVE_SPECIFIC_TYPE,
G.FID_ATTR, G.ORIENTATION, G.DISPLAY_COLOR, G.ID_SLEEVE_TYPE, G.SLEEVE_TYPE,
G.ID_VOLTAGE, G.VOLTAGE, G.ID_DISPOSITION_STATE, G.DISPOSITION_STATE,
G.ID_ACCURACY, G.ACCURACY, G.DATE_INSTALLATION, G.ID_SLEEVE_CONSTRUCTION,
G.SLEEVE_CONSTRUCTION, G.ID_SLEEVE_SPECIFIC_TYPE, G.GEOM.SDO_POINT.X,
G.GEOM.SDO_POINT.Y, G.GEOM.SDO_POINT.Z
FROM
EL_V_WP_SLEEVE G
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 1.01 1.02 0 210 0 0
Execute 1 0.00 0.00 0 0 0 0
Fetch 6 0.01 0.02 0 2951 0 569
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 8 1.03 1.05 0 3161 0 569
Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 104 (recursive depth: 1)
I have reduced the parse time from 3 minutes to 1 second. Good enough for our customer, so I stopped with more investigations on the non-VPD schema.
The next step: Will my new hint also fit to the VPD schema? What's your guess? Oh yes, you are totally right.
NO!!
Using the hint reduced the parsing time for selects on the view in the VPD schema by some seconds, but the parsing time was still some minutes also here. Creating and testing the VPD policy is part of the parsing phase. That means, it is a must have to check the PL/SQL function which is executed for the VPD policy - in this case it was running roundabout 2 or 3 seconds with a high number of queries and some PL/SQL stuff. Not a huge time killer...
And the optimizer said: Help me - with database/session parameters
Back to start - Tracing 10053 with VPD. And yes, it was the same problem, the optimizer still is considering hundreds of paths/changes/joins/predictions and nearly never comes to an end. But the execution and fetch phase - still incredible fast. I knew why - the VPD workflow is killing my hint somewhere.
It was more or less just a little try to change some of the system parameters. But one parameter (and ONLY this one) worked:
alter session set "_optimizer_cost_based_transformation"=off;
Later on, I have found another blog with some cost based optimizer related problems. There the guy was setting _btree_bitmap_plans and optimizer_index_cost_adj in addition, but changing both parameters did not made any difference on the total statement execution time. I've gotten, for the VPD schema, a really good two to five second runtime (spending nearly 5 minutes before).
Ok, I have two different approaches to push the Autocad performance: For VPD schemas I can set the system or session parameter _optimizer_cost_based_transformation. For non vpd schemas creating the views (or using SQL Patch) with a NO_QUERY_TRANSFORMATION hint also speeds up. But vice versa? No, I have to use both approaches in parallel.
And the optimizer said: Help me - give me a goal
As we do let the customer decide, which solution they would implement, I checked for another possibility that is used not often (enough) in my opinion: DBMS_SPM allows to create sql plan baselines. This feature is part of the Enterprise Edition without licensing Diagnostics and Tuning Pack (https://blogs.oracle.com/optimizer/does-the-use-of-sql-plan-management-and-the-dbmsspm-database-package-require-a-tuning-or-diagnostic-pack-license).
Why is using an SQL Plan Baseline a good choice here? Because we know, that after parsing, the execution plans are running fast as hell (More information of managing baselines can be found here: https://docs.oracle.com/database/121/TGSQL/tgsql_spm.htm#TGSQL94621). I run the statement (after flushing the sga) again without any hint and without the _optimizer_cost_based_transformation=off session/system parameter to get my plan.
This time, the execution was as slow as I knew from the beginning. Then I captured the plan from the cursor cache and created a SQL Baseline:
SQL> variable cnt number;
SQL> execute :cnt := dbms_spm.load_plans_from_cursor_cache(sql_id=>'1pmbtxzgktxfj')
That's it, folks! After that, I have checked by querying dba_sql_plan_baselines that the baseline was accepted and enabled.
Then I have re-run the statement again. As the cost based optimizer now skips any optimization after hard parse (creating sql-id, checking statement syntax and rights and permissions), the statement just "executes" and "fetches". So all the statements running minutes before are now running in parts of seconds.
No comments:
Post a Comment
Your comment will be published soon...