DCD (Dead Connection Detection), called Terminated Connection Detection in 12c

------------------------------------------------------------------------------------------------------------

Update 2010-12:
* ASM complication, or rather, lack thereof

Tests in 10gR2 and 11gR2 show that you should always set DCD, and likewise SQL*Net server-side 
tracing, in the sqlnet.ora under the DB ORACLE_HOME/network/admin, regardless of whether you have ASM, 
RAC, 11g or 10g. 
(Ref: Doc 1136945.1)

11gR2 RAC has SCAN listeners. You don't need to restart or reload them; restarting (or reloading) the 
regular listener is enough.

------------------------------------------------------------------------------------------------------------

* How to check if DCD is set up right

On UNIX, the easiest way to check for DCD is to trace the shadow process for your SQL*Net connection. For 
example, on a Solaris Oracle server where sqlnet.expire_time is set to 1 minute:

$ truss -tsetitimer -vsetitimer -p 9266
    Received signal #14, SIGALRM, in read() [caught]
setitimer(ITIMER_REAL, 0x08044E30, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   60.000000 sec
setitimer(ITIMER_REAL, 0x08044F00, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   60.000000 sec
setitimer(ITIMER_REAL, 0x08045054, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   60.000000 sec

After waiting at most one DCD expire time, you should see output like the above, with the timer value 
matching the expire time. If not, DCD is not enabled, at least for your connection. 

On Linux the output is similar, and the 60-second timer is seen even without the -v option:

$ strace -e trace=setitimer -p 13193
Process 13193 attached - interrupt to quit
--- SIGALRM (Alarm clock) @ 0 (0) ---
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={60, 0}}, NULL) = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={60, 0}}, NULL) = 0
setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={60, 0}}, NULL) = 0
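
To avoid eyeballing the strace output, the timer values can be pulled out with a small shell filter. This 
is only a sketch: the function names dcd_timer_values and dcd_timer_matches are made up for this note, and 
the pattern assumes the setitimer line format shown above (it also allows for strace versions that print 
the field as tv_sec=60).

```sh
# Print the it_value seconds of every setitimer(ITIMER_REAL, ...) line on stdin.
dcd_timer_values() {
    sed -nE '/^setitimer\(ITIMER_REAL/ s/.*it_value=\{(tv_sec=)?([0-9]+),.*/\2/p'
}

# Succeed if some timer on stdin equals the given expire_time (minutes) in seconds.
dcd_timer_matches() {
    dcd_timer_values | grep -qx "$(( $1 * 60 ))"
}

# Example (strace writes to stderr, hence 2>&1; interrupt with Ctrl-C):
#   strace -e trace=setitimer -p 13193 2>&1 | dcd_timer_values
```

With sqlnet.expire_time = 1, each matching line should print 60.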

The documented way to check if DCD is set up right involves SQL*Net trace. Suppose server side 
sqlnet.ora has

sqlnet.expire_time = 1
trace_level_server=16
trace_file_server=svr
trace_directory_server=/some/path/on/server
trace_unique_server=true # Add '_pid' to trace filename
trace_timestamp_server=ON # Only in Oracle8i onwards

After `lsnrctl reload' (stop and restart also works), a client connecting to the server produces a 
server-side trace file containing:

[04-AUG-2007 22:44:03:441] niotns: Enabling CTO, value=60000 (milliseconds)
[04-AUG-2007 22:44:03:441] niotns: Enabling dead connection detection (1 min)

Note that the CTO (probably connection timeout) of 60000 milliseconds, i.e. 60 seconds, matches our 
expire_time setting of 1 minute.
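
The comparison between the two niotns lines can be scripted. The following is a sketch that assumes the 
trace line wording shown above; the function name dcd_check_server_trace and its output messages are 
invented for this note and are not Oracle's.

```sh
# Read a server-side SQL*Net trace on stdin and compare the CTO value
# (milliseconds) with the DCD interval (minutes) from the niotns lines.
dcd_check_server_trace() {
    awk '
        /niotns: Enabling CTO, value=/ {
            split($0, a, "value="); cto_ms = a[2] + 0
        }
        /niotns: Enabling dead connection detection \(/ {
            split($0, b, "detection \\("); dcd_min = b[2] + 0
        }
        END {
            if (dcd_min == "") { print "DCD not enabled in this trace"; exit 1 }
            if (cto_ms == dcd_min * 60000)
                print "DCD enabled: " dcd_min " min, CTO matches"
            else
                print "DCD enabled: " dcd_min " min, CTO=" cto_ms " ms"
        }'
}

# Example (the trace file name is a placeholder):
#   dcd_check_server_trace < /some/path/on/server/svr_<pid>.trc
```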

Note:395505.1 says that to check DCD, you enable client trace, stay idle for more than twice the DCD 
timeout, and then run any query on the client side. My client-side trace has

[04-AUG-2007 23:07:00:942] nsprecv: tlen=20, plen=10, type=6
[04-AUG-2007 23:07:00:942] nsprecv: 10 bytes to leftover
[04-AUG-2007 23:07:00:942] nsprecv: packet dump
[04-AUG-2007 23:07:00:942] nsprecv: 00 0A 00 00 06 00 00 00  |........|
[04-AUG-2007 23:07:00:942] nsprecv: 00 00                    |..      |
[04-AUG-2007 23:07:00:942] nsprecv: normal exit
[04-AUG-2007 23:07:00:942] nsrdr: got NSPTDA packet
[04-AUG-2007 23:07:00:942] nsrdr: NSPTDA flags: 0x0
[04-AUG-2007 23:07:00:942] nsrdr: normal exit
[04-AUG-2007 23:07:00:942] nsdo: got "null" packet

That 'got "null" packet' is the tell-tale sign DCD is working. (Note: Client side trace has the 
line "niotns: Not trying to enable dead connection detection." Ignore that. It probably means the 
client machine is not a DB server which has DCD enabled.)
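
To find the probes in a long client trace, a filter along these lines can list when each one arrived. It 
is a sketch that assumes the timestamped line format shown above; dcd_probe_times is a name made up for 
this note.

```sh
# Print the timestamp of every 'got "null" packet' line in a client trace on stdin.
dcd_probe_times() {
    sed -n 's/^\[\([^]]*\)] nsdo: got "null" packet.*/\1/p'
}

# Example (the trace file name is a placeholder):
#   dcd_probe_times < cli_<pid>.trc
```

No output means no DCD probe was received during the traced period.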

Note that enabling or disabling DCD will not affect existing connections. For instance, if you remove 
"sqlnet.expire_time" from sqlnet.ora on the server and then reload (or restart) the listener, existing 
connections are still watched by DCD.

Ref: Note:438923.1 "How To Track Dead Connection Detection(DCD) Mechanism Without Enabling Any Client/Server Network Tracing".

------------------------------------------------------------------------------------------------------------

* Shared server complication (test with 9.2.0.1 client connecting to 10.2.0.2 server)

Note:191209.1 says on VMS, "For a shared server (MTS) connection, a dispatcher will need one timer 
for each client process connection." In fact, this is true for a dispatcher on Solaris too (and 
possibly generic to any UNIX and Linux). In the following, 9355 is our only dispatcher D000 (all 
others were shut down by alter system, or prevented from starting by the dispatchers parameter). 
When the first client comes in, we see

$ truss -tsetitimer -vsetitimer -p 9355
setitimer(ITIMER_REAL, 0x08044A60, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   60.000000 sec
setitimer(ITIMER_REAL, 0x08046730, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:    0.000000 sec
setitimer(ITIMER_REAL, 0x08045370, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   60.000000 sec

About 18 seconds later, a second client connects to the same dispatcher (verify by select spid, 
program from v$process where addr in (select dispatcher from v$circuit)) and we see

    Received signal #14, SIGALRM, in pollsys() [caught]
setitimer(ITIMER_REAL, 0x08045EE0, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   42.380000 sec
setitimer(ITIMER_REAL, 0x08046034, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   42.380000 sec

From that point on, we see

    Received signal #14, SIGALRM, in pollsys() [caught]
setitimer(ITIMER_REAL, 0x08045EE0, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   17.610000 sec
setitimer(ITIMER_REAL, 0x08046034, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   17.610000 sec
    Received signal #14, SIGALRM, in pollsys() [caught]
setitimer(ITIMER_REAL, 0x08045EE0, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   42.380000 sec
setitimer(ITIMER_REAL, 0x08046034, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   42.380000 sec
    Received signal #14, SIGALRM, in pollsys() [caught]
setitimer(ITIMER_REAL, 0x08045EE0, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   17.610000 sec
setitimer(ITIMER_REAL, 0x08046034, 0x00000000)  = 0
         value:  interval:    0.000000 sec  value:   17.610000 sec
    Received signal #14, SIGALRM, in pollsys() [caught]
setitimer(ITIMER_REAL, 0x08045EE0, 0x00000000)  = 0
...<more lines omitted>...

This indicates that the dispatcher process sets up two alarms, one for each client. Initially, it 
sets a 60-second alarm to remind itself to do a DCD check in 1 minute (our expire_time=1). But when 
a second client comes in about 17.6 seconds later, it resets the alarm to 42.4 seconds to keep its 
original schedule for the first client. It does not forget the new client, of course. When the 
original 1 minute deadline is reached, it sets a 17.6 second alarm for the second client. When that 
17.6 second alarm expires, it immediately sets a 42.4 second alarm for the next DCD check of the 
first client, and so on.
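
The alternating timer values follow from simple arithmetic on the two per-client schedules. The awk sketch 
below is a model of the behavior observed in the truss output, not Oracle's actual implementation; the 
function name dcd_alarm_intervals is made up for this note. It uses an offset of 17.62 seconds so that 
the two values sum to exactly 60 (the trace itself shows 42.38 and 17.61).

```sh
# Usage: dcd_alarm_intervals PERIOD OFFSET COUNT
#   PERIOD  expire_time in seconds
#   OFFSET  seconds after the first client that the second client connected
#   COUNT   number of successive timer values to print
dcd_alarm_intervals() {
    awk -v period="$1" -v offset="$2" -v count="$3" 'BEGIN {
        now  = offset            # the second client has just connected
        due1 = period            # next probe time for the first client
        due2 = offset + period   # next probe time for the second client
        for (i = 0; i < count; i++) {
            fire = (due1 <= due2) ? due1 : due2
            printf "%.2f\n", fire - now   # value the timer is armed with
            now = fire
            if (fire == due1) due1 += period; else due2 += period
        }
    }'
}

# Example: dcd_alarm_intervals 60 17.62 4
```

For the example call, the model prints 42.38, 17.62, 42.38, 17.62, the same alternation as in the trace.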

Another shared server complication is in Bug 4018031 "CLIENT SESSION WHICH CONNECT THROUGH MTS 
SERVER REMAIN EVEN THOUGH CLIENT DEATH". The workaround on UNIX is to set disable_oob=on in sqlnet.ora.

------------------------------------------------------------------------------------------------------------

* Minimum expire_time is 1 minute. If you set it to, say, 0.1, in sqlnet.ora, DCD will *not* be enabled.
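
A quick sanity check of the setting can catch a value below the minimum before you go looking for timers 
that never appear. This is a sketch: check_expire_time and its messages are invented for this note, and 
it assumes the parameter is written on one line as name = value.

```sh
# Read sqlnet.ora on stdin and report whether sqlnet.expire_time is at least 1.
check_expire_time() {
    awk -F= '
        tolower($1) ~ /^[ \t]*sqlnet\.expire_time[ \t]*$/ {
            val = $2; sub(/#.*/, "", val); gsub(/[ \t]/, "", val); found = 1
        }
        END {
            if (!found)      { print "sqlnet.expire_time not set"; exit 1 }
            if (val + 0 < 1) { print "expire_time=" val " is below the 1 minute minimum"; exit 1 }
            print "expire_time=" val " minute(s)"
        }'
}

# Example:
#   check_expire_time < $ORACLE_HOME/network/admin/sqlnet.ora
```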


------------------------------------------------------------------------------------------------------------

* TCPView (Windows utility from sysinternals.com) can close a specific connection (Close Connection). DCD 
plays no role here, and the trace file is the same as in the absence of DCD.

[04-AUG-2007 21:15:03:533] ntt2err: entry
[04-AUG-2007 21:15:03:533] ntt2err: soc 15 error - operation=5, ntresnt[0]=517, ntresnt[1]=131, ntresnt[2]=0
[04-AUG-2007 21:15:03:533] ntt2err: exit
[04-AUG-2007 21:15:03:533] nttrd: exit
[04-AUG-2007 21:15:03:533] nsprecv: error exit
[04-AUG-2007 21:15:03:533] nserror: entry
[04-AUG-2007 21:15:03:533] nserror: nsres: id=0, op=68, ns=12547, ns2=12560; nt[0]=517, nt[1]=131, nt[2]=0; ora[0]=0, ora[1]=0, ora[2]=0
[04-AUG-2007 21:15:03:533] nsrdr: error exit
[04-AUG-2007 21:15:03:533] nsdo: nsctxrnk=0
[04-AUG-2007 21:15:03:533] nsdo: error exit
[04-AUG-2007 21:15:03:533] nioqrc:  wanted 1 got 0, type 0
[04-AUG-2007 21:15:03:533] nioqper:  error from nioqrc
[04-AUG-2007 21:15:03:533] nioqper:    ns main err code: 12547
[04-AUG-2007 21:15:03:533] nioqper:    ns (2)  err code: 12560
[04-AUG-2007 21:15:03:533] nioqper:    nt main err code: 517
[04-AUG-2007 21:15:03:533] nioqper:    nt (2)  err code: 131
[04-AUG-2007 21:15:03:533] nioqper:    nt OS   err code: 0
[04-AUG-2007 21:15:03:533] nioqer: entry
[04-AUG-2007 21:15:03:533] nioqer:  incoming err = 12151
[04-AUG-2007 21:15:03:533] nioqce: entry
[04-AUG-2007 21:15:03:533] nioqce: exit
[04-AUG-2007 21:15:03:533] nioqer:  returning err = 3135
[04-AUG-2007 21:15:03:533] nioqer: exit
[04-AUG-2007 21:15:03:533] nioqrc: exit
[04-AUG-2007 21:15:03:534] nioqds: entry
[04-AUG-2007 21:15:03:534] nioqds:  disconnecting...
...<more lines omitted>...

The words "ntt2err: soc 15 error - operation=5" definitively tell us the connection was already 
severed. ntt2err is a function called when a TNS transport layer error occurs. What Oracle calls a 
socket (e.g. "soc 15") is not the client-side port number (as in `netstat -an') but the client 
process file descriptor (file handle on Windows), viewable with lsof, pfiles, /proc/<pid>/fd, 
Process Explorer, etc. Once this error is identified as the root cause, the other errors, such as 
12547 (TNS:lost contact) and 12560 (TNS:protocol adapter error), can be ignored.
(Ref: http://www.itpub.net/thread-1358266-1-1.html)
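
To go from the trace to the descriptor, the number after "soc" can be extracted and then looked up for 
the traced process. This is a sketch that assumes the ntt2err line format shown above; ntt2err_fds is a 
name made up for this note.

```sh
# Print the descriptor number from every "ntt2err: soc N error" line on stdin.
ntt2err_fds() {
    sed -n 's/.*ntt2err: soc \([0-9][0-9]*\) error.*/\1/p'
}

# Example on Linux, for a process that is still running (file name and
# <pid> are placeholders):
#   ntt2err_fds < cli_<pid>.trc
#   ls -l /proc/<pid>/fd/15
```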