Thursday, August 27, 2009

Where do you stop as a DBA? Thoughts about problem solving....

This is one of my few posts that are not about Informix features. I will write about a real life situation and use it for developing some thoughts about problem solving ability (or lack of it).

I will not use real names from either customers or suppliers except of course Informix and my employer. The scenario is this:

Several IBM Informix Dynamic Servers will be upgraded from version 10 on platform P1 to version 11.50 on platform P2. P2 is from a third party supplier, let's call it S1.
Currently the customer uses ontape utility for backup, but wants to take advantage of OnBar. So tests were made using OnBar, IDS 11.50 and a Storage Manager from another supplier S2 on platform P2. After some issues the tests run fine and some development instances are migrated. Suddently we start having issues. OnBar reports a BAR_TIMEOUT in BAR_ACT_LOG file... This message informs us that OnBar executed a BSA (XOpen standard for backup API) call which failed to return (successfuly or unsuccessfuly) within a specific ammount of time. OnBar will not give up, and will keep logging messages in BAR_ACT_LOG file. The Informix error message description is this:

"-43296 Warning: BAR_TIMEOUT Storage Manager Progress may be stalled. The current command has exceeded the BAR_TIMEOUT value. This might indicate a normal transfer of a large data set or it can mean the storage manager needs attention. Check to make sure the storage manager is still running and that data is still being transferred to/from the storage manager. Refer to the storage manager message log (or equivalent) for more information. If the cause of the error is still unclear, contact your storage manager Technical Support. If you are using IBM Informix Storage Manager, contact Technical Support at tsmail@us.ibm.com"

The message looks explicit and clear, but the Storage Manager operators decided that it should be a problem with OnBar. Let's make a small pause here... this is the first point I'd like to raise... What would you do? Trust the Informix error message that explicitly says to contact the Storage Manager technical support or do your own investigation?
Well, as an onsite consultant I decided to do some investigation... First things I did were:
  1. Running truss against onbar_d processes
  2. Running dbx/gdb/adb against the onbar_d processes
  3. Running lsof against the onbar_d processes
These simple steps revealed that:

  1. truss showed OnBar was "stuck" on send(). This is an OS function used to send data through a socket
  2. gdb showed that onbar_d had called the BSASendData() function which in turn had called send(). BSASendData() is part of the dynamic library we have to configure in $ONCONFIG (BSALIBPATH)
  3. lsof showed that the file descriptor (fd) used in the send() call was a socket opened to the storage manager "media server". This is a host which has direct access to the storage manager allocated tape drives
So, it was clear (only for me as we shall see...) that there was a problem in the storage manager while it tried to send data from the database server to the media server. The next logical step would be to analyse the status of the process receiving the data in the media server. Since I had no access to that server I asked for it, but without success... In the following days I received all kinds of evasive replies and useless requests both from the storage manager team and from the storage manager supplier (S2) technical support. Some of these were:

  • You cannot run a whole system backup with more than one concurrent process
    True for versions earlier than 11.10, but incorrect for version 11.50 (and this suggestion ignored all the debug info above)
  • We were trying to use more onbar_d processes than what the storage manager was configured to support, leading to waits on the OnBar side
    This could make sense, but when this happens, the onbar_d processes wait for a free "slot" at the begining of a dbspace backup.... they don't get "stuck" in the middle (usually it happened around 90% of a dbspace backup)
  • They suggested that we should change the BAR_TIMEOUT parameter (it exists only in XPS, not IDS....)
  • They even suggest me that I should consult the OnBar manual ;)
Finnally the tech support guy from S2 was on site, and I managed to have him collect the info (truss, gdb, lsof) from the media server processes. At that moment I understood why they ignored the info above. He didn't knew how to use or what were these utilities...
But I got a big surprise when I saw the outputs.... The process receiving data was "stuck" on recv(). This is an OS function used to get data from a socket. So, in short, we had an ESTABLISHED socket between two servers, one of them was on send(), and the other was on recv().... But no data was being transferred.

This made me think about network or TCP layer issues. But usually this are the last places to look for issues. Note that I don't consider network problems a rarity, but these usually translate into broken connections, not exactly hangs...
This observation was a turning point. I immediately assumed that the platform provider (S1) would have to be involved. The situation was even more complex because the media server was a Linux machine. So the problem could be in either platform...

During the next two months (!), I've been through a real nightmare, because the customer insisted that there could be something wrong with OnBar, I spent several hours in useless conference calls (customer and S2) and meetings. They insisted in making all kind of weird tests like:
  • Running onbar with few and more processes
  • Reconfigure their whole backup infrastructure
  • Reconfigure their storage manager parameterization
At some point they asked me the following:
  • Do you have customers running with this setup?
    They seemed desesperate to prove OnBar was doing something wrong...
    They don't understand we don't keep any kind of tracking on customer environments. We do keep an incredible useful internal knowledge base where customer problems are recorded, but as expected I was tired of making exhaustive searches on that KB without finding any similar cases...
  • They proposed to use another storage manager (interestingly from vendor S1)
    The point was more or less this: If it doesn't work with that, surely the problem is not with the storage manager (and that again would imply the problem was with IDS...)
  • They insisted in having a tech support case opened on IBM
    This seems reasonably, but the behavior of OnBar was completely explained in detail by me previously
At this point I was rather tired of the whole issue. Interestingly the tests with the second storage manager revealed two things:
  1. The new environment hit another problem
    This problem was already reported in IBM knowledge base and it was a problem in that storage manager BSA library. I had to wait one week in order to know if the version they were using included the necessary fix... it turned out it didn't ;)
  2. After solving that first issue the backups aborted with some sort of storage manager timeout... The real difference was in the storage manager behavior. First one waited indefinitely and this one establishes a timeout (much like OnBar does, but since it's the one calling the low level OS function it aborts the process instead of just warning the administrator like OnBar does)
After about two months of outstanding time waste I understood the only way to solve this in a timely manner was to reproduce the issue without any Informix or storage manager layer. Let me point that some people in the customer team had made exhaustive tests sending data across servers using scp for example. All these tests were successful, and they concluded it couldn't be anything in the TCP layer. I warn them (and showed them using the tools above that scp didn't use the same functions, block size etc.). if the tests had failed it would have beeen nice, but a working scp session didn't proove us that there was nothing wrong with the TCP stack.

After a few more days and a lot of GB sent, I was able to create a test case with just C programming. A simple client/server program that kept sending data through a socket. I made sure I established the socket options and block size similar to what the storage manager's BSA library used.

After this, they finally accepted that something was wrong with the vendor OS functions (specifically send() ). In the next day we had a conference call with all the parties involved. And in less than a week we had a fix (newly coded) for the OS. After installing we never had another similar hang on send() function.

The purpose of this post is to show how important it is to know how to debug a problem. Tools like truss, dbx/gdb/adb and lsof are invaluable for this. I personally think anybody working on IT problem solving independentely of their role (DBA, programer, system administrator, network administrator etc.) should have a minimal knowlege about how they work and what kind of information you can collect with them. There are obviously other tools equally useful in some scenarios. Tools like tcpdump, netstat etc. This whole case revealed that many people involved in this area don't have a clue about how these tools work and the info they can collect.
Everybody involved had a theory about the problem. But none of those theories were based on facts or real observations. They were merely speculations of what could be happening and all these tend to pass the problem to a different team...
Some facts about the situation:
  • lots of emails including polished accusations were exchanged
  • during two months I spent a considerable amout of time paying attention to this, trying to lead people into the righ direction (mostly without success)
  • Vendor S2 had a very large team involved. They even sent a consultant from another country into this customer site (when he arrived we were about to receive the fix, so apart from politics this guy was not able to do much about this problem)
  • The problem could have been solved in around two weeks (one for debugging and another for the vendor investigation and code fixing)
  • No one admitted that didn't understand the output of the tools above and no one (even after the conclusion) took the opportunity to ask me to explain how to use this kind of tools
    I don't think I know everything (and in fact my knowledge about these tools is somewhat superficial), but I only learn how to use this kind of stuff because in some situation in my past I came across some references to them and I took the time to experiment and read about them. In other words, we should use the problems as opportunities to gather more knowledge and to learn new things.
I keep receiving reports about problems without any useful information. My favorite ones are:

  • I tried it and it doesn't work!
    Surely I believe it doesn't work... But usually when something "doesn't work" it raises an error. Only rarely people include the error code/description. More often they include possible causes (usually not related to the error) than they include the error code or error message
  • The process was hang! It was probably in some kind of loop! So I killed it
    Well... processes don't "hang". They wait on something or they effectively stay in some loop. And the way to see that is by using the tools... And the tools don't work on a process that doesn't exist anymore...
  • I've found a bug!
    Sure... A bug is a malfunction. Something that works differently from what's documented or what is expected. Most of the times people find a problem. After some analisys (many times involving supplier's technical support) it may be mapped to a bug. People tend to expect something. If it doesn't happen they "found a bug". Usually they don't bother to read the documentation and try to understand the reasons for the unexpected behavior.
    Obviously people do hit bugs. Most of the cases that I open within IBM end up as bugs. But this is just a very small portion of the problems that customers report.
In short, I feel that in general, people's ability to study a problem in the IT world is vey limited. Usually people spend more time trying alternatives than collecting and understand problem data. Error messages and codes are ignored many times. And all these translate into a big waste of time, and obviously money... And of course, this directly impacts the quality and availability of the IT systems.


Monday, August 17, 2009

Current time? Really?!

"Let S be an <SQL procedure statement> that is not generally contained in a <triggered action>. All <datetime
value function>>s that are contained in <value expression>s that are generally contained, without an intervening
<routine invocation> whose subject routines do not include an SQL function, either in S without
an intervening <SQL procedure statement> or in an <SQL procedure statement> contained in the <triggered
action> of a trigger activated as a consequence of executing S, are effectively evaluated simultaneously.
The time of evaluation of a <datetime value function> during the execution of S and its activated triggers
is implementation-dependent."



The above is a citation of SQL Standard 2008/2003/1999 (part 2 - SQL/Foundation). It's a little nightmare to read, follow and understand, but it's the cause of one behavior generally considered an annoyance in Informix: Inside a stored procedure, all CURRENT references will show the same value.

Let's check the Informix SQL Syntax guide to see how the above ANSI rule is explained in the fine manual:

"SQL is not a procedural language, and CURRENT might not execute in the lexical order of its position in a statement.
You should not use CURRENT to mark the start, the end, nor a specific point in the execution of an SQL statement.

If you use the CURRENT operator in more than once in a single statement,
identical values might be returned by each instance of CURRENT. You cannot rely
on CURRENT to return distinct values each time it executes.

The returned value is based on the system clock and is fixed when the SQL
statement that specifies CURRENT starts execution. For example, any call to
CURRENT from inside the SPL function that an EXECUTE FUNCTION (or
EXECUTE PROCEDURE) statement invokes returns the value of the system clock
when the SPL function starts."

This looks like a more friendly explanation. But it doesn't point out the reason why this is implemented as is, and that reason is mainly for ANSI compliance. In practice, what I usually hear from customers is that this is inconvenient. Many times they are trying to use CURRENT YEAR TO SECOND/FRACTION to find out the time spent on a procedure or parts of it. And it simply doesn't work, because all the values they get are the same. Is there a solution? Yes, if you just need precision up to second:


SELECT
DBINFO('utc_to_datetime', sh_curtime)
INTO
current_time
FROM
sysmaster:sysshmvals;


The sh_curtime field of sysmaster:sysshmvals contains the current unix time (number of seconds since January 1 1970). The DBINFO function with the 'utc_to_datetime' converts it into a DATETIME YEAR TO SECOND value.

Sunday, August 16, 2009

IDS 11.50.xC5

It's been too long and I have a lot of catch up to do... Let's start with the latest version of IBM Informix Dynamic Server. It's 11.50.xC5, the latest fixpack in the 11.50 family, available since the end of July 2009.

Keeping up with what can now be considered as usual, this fixpack brings important features. Let's list them:

  • High availability
    • Configuring RS Secondary Server Latency for Disaster Recovery
      This allows the system administrator to configure a delay of time between the primary server and a RSS (remote secondary server). This means that the secondary server will lag behind the primary server for the specified amount of time.
      Associated to this we have the ability to stop and restart the roll forward of the logical logs on the secondary server.
      Note that for integrity purposes the primary will still send the logical logs as soon as possible. It's up to the secondary to hold the logs without applying them. This means that if your primary server disappears you're still able to recover up to the most current image of the primary server.
      The configuration of this feature involves three new parameters: DELAY_APPLY, STOP_APPLY and LOG_STAGING_DIR.
      DELAY_APPLY can be used to configure a delay between primary and secondary.
      STOP_APPLY can be used to specify a specific time to stop applying logs
      LOG_STAGING_DIR is a secondary locally writable directory where the server will keep the logs before applying them.
      All this can be dinamically changed with onmode -wf/-wm command
  • Administration
    • Forcing the Database Server to Shut Down
      Two new utilities were made available to handle situations where things went wrong... If for example your IDS server is hang, or when it was not shutdwon properly, you may have trouble to stop it and/or restart it (it can leave shared memory segments behind for example).
      The new utilities are onclean and onshutdown. Onclean can be used to force the IDS server down (it can kill the engine processes) and it will try to clean all the shared memory segments that were left behind. Without options it's use is to remove the sahred memory segments of a server that was already stopped. With "-k" it will kill the server and then make sure the shared memory is freed.
      Onshutdown is used to attempt to kill the server using the normal "onmode -ky", but if it fails it calls onclean.
      This is not a big feature, but it can be handy for scripting and for new users.
    • Enhanced Support for Multibyte Character Strings
      By using a new environment variable, SQL_LOGICAL_CHAR, the server will interpret the size of character fields in terms of characters instead of bytes as usual
    • New default for the INFORMIXTERM environment variable for terminal set up on UNIX
      Simply a new default for INFORMIXTERM variable
    • onconfig Portal: Configuration Parameters Listed by Functional Categories
    • onstat Portal: onstat Commands Listed by Functional Categories
      These two are important documentation enhancements available at IDS Infocenter
    • Enhancements to the Enterprise Replication plug-in for the OpenAdmin Tool for IDS
      Open Admin Tool keeps being expanded and improved. And free...
  • Enterprise Replication
    • Enterprise Replication Stops if Memory Allocation Fails
      Better handling of low memory conditions. Replication will stop and raise an alarm. After solving the memory issues just run cdr start
    • Notification of an Incorrect Log Position When Enterprise Replication Restarts
      A new alarm is fired if the replay position is invalid (too old or later than current log position)
    • Workaround: Reclaiming Space from Enterprise Replication Paging Smart Large Objects
      Later versions could allocate space for replication of SLOBs that was not necessary. This space can now be reclaim.
    • Improving the Performance of Consistency Checking with an Index
      A new shadow column can be added to replicated tables. This column (ifx_replcheck) can be used in a composite index with the primary key in order to speed up the checking of replicates.
    • Specifying the Range of Data Sync Threads to Apply Replicated Transactions
      An ONCONFIG parameter can control the number of replication threads
  • Warehousing
    These features are presented in the Warehousing category, but I personally think they can be used in any kind of environment
    • Loading Data into a Warehouse with the MERGE Statement
      This IDS version introduces support for the MERGE sql statement. We specify a destination table, a source table/query and a match condition. If the match condition is true than the row in the destination table is UPDATEd with the data from the source table/query. If the match condition is false a new row is inserted in the destination table
    • Retrieving Data by Using Hierarchical Queries
      Introduces support for the CONNECT BY sql construct. This allows the use of the so called hierarchical queries. Please note that the most efficient way to handle this would be using the node datablade. But for compatibility reasons IBM decided to introduce this feature. It's a nice feature for application portability
So, that's all for xC5. Let's use the new features and keep an eye on xC6 that should appear before the end of the year.