Tracing a Process in Software Troubleshooting: A Linux/UNIX Tradition

Suppose you ask, “Why do I get this error?”, “What’s the program doing now?”, or “Can I see what’s going on behind the scenes?” You check the log files, read the documentation, search the web, ask AI, or search your company’s internal knowledge base. But if you’re on Linux (or UNIX), you have one more generic troubleshooting method, which is particularly powerful before you become an expert in the software: tracing the running process.

Anyone who has used Linux for a few years has probably heard of process tracing, or has personally done it, when troubleshooting a problem. The most basic tool is the strace program (which you may have to install). This short article is not a guide on how to use it. Instead, I’ll describe a real case and show how strace helped us find the cause of a problem.

As a newbie to Neo4j, a graph database, I’m tasked with solving a mystery: a Neo4j instance suddenly throws a “403 Forbidden” error on its web interface after running fine for some days.

HTTP ERROR 403 Forbidden

URL: /browser/
STATUS: 403
MESSAGE: Forbidden
SERVLET: default

Browser screen text at URL https://myserver:7474/browser/

The messages in the log file are not helpful. A quick search on the web returns nothing. My process tracing skills come to the rescue. Neo4j runs as a Java process, so I fire up

strace -f -p <pid of the java process>

As expected, it quickly spews out lots of lines showing various system calls, especially futex() (for shared-memory synchronization), which are useless for our purposes. Let’s filter out that one and a few others:

strace -f -p <pid> 2>&1 | grep -Ev 'futex|restart_syscall|epoll_pwait2|getrusage'

It’s much cleaner. But why do we get “403 Forbidden”?
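One caveat with a plain pattern filter like the one above: it drops any line that contains one of those words anywhere, including inside a file path. A minimal sketch of a stricter filter, which matches only the position where strace prints the syscall name, could look like this. The function name filter_noise is made up for illustration, and the regular expression assumes strace's default line format, including the `[pid N]` prefix printed with -f and the `<... name resumed>` form of resumed calls.

```shell
# Drop noisy syscalls from strace output read on stdin.
# Only the syscall-name position is matched, so a path such as
# "/tmp/futex-notes" in another call's arguments does not hide that call.
filter_noise() {
    grep -Ev '^(\[pid +[0-9]+\] )?(<\.\.\. )?(futex|restart_syscall|epoll_pwait2|getrusage)[( ]'
}

# Typical use (strace writes its trace to stderr, hence 2>&1):
#   strace -f -p "$pid" 2>&1 | filter_noise
```

The list of syscall names is the same as in the command above; adjust it to whatever turns out to be noise in your own trace.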

While strace is tracing the process, I go to the web browser and refresh the page to get the 403 error again, and strace immediately shows system calls accessing files in a directory we have never seen before, /tmp/decompressed-browser15402555885507884685/browser/assets/. Under that directory, there are numerous files that look like they belong to a website. Obviously Neo4j reads these files when a user accesses the web interface. But why do we get “403 Forbidden”?

After some digging, I find that /var/lib/neo4j/web/neo4j-browser-2026.04.18+0.zip contains very similar webpage files (as listed by unzip -l neo4j-browser-2026.04.18+0.zip). It’s reasonable to assume that the files under the cryptic /tmp/decompressed-browser directory are the result of unzipping this zip file, and that they are used by the web server, which is one of the functions of the Java process. But why do we get “403 Forbidden” now, after the web server has run fine for some time?

On many systemd-based Linux distributions, by default, there is a job, or more accurately a timer (“systemd-tmpfiles-clean.timer”), which you can inspect with systemctl, that deletes files under /tmp not accessed (or modified) for 10 days. Our Neo4j deployment is still in its experimental phase, so it’s rarely used, and its files under /tmp become subject to the rule of the timer. Once a file the web server needs is deleted (index.html, as the trace later in this article shows), you get 403 when you access the site. That explains it!
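To see which files are at risk before the timer gets to them, a rough check with find may help. The sketch below is only an approximation of the cleaner's rule: the function name stale_files is hypothetical, the 10-day default is the figure quoted above, and the real age limit comes from the tmpfiles.d configuration on your system, which also takes the change time (ctime) of a file into account.

```shell
# List regular files under DIR ($1) whose access time and modification time
# are both at least DAYS ($2, default 10) days old.
# Approximation only: systemd-tmpfiles also considers ctime.
stale_files() {
    # find's "+N" means "more than N whole days", so N = DAYS - 1 selects
    # files that are at least DAYS days old. DAYS must be 1 or more.
    n=$(( ${2:-10} - 1 ))
    find "$1" -type f -atime +"$n" -mtime +"$n"
}

# Related commands to look at the real configuration (not run here):
#   systemctl list-timers systemd-tmpfiles-clean.timer
#   systemd-tmpfiles --cat-config
```

For example, `stale_files /tmp 10` prints the files that have been neither read nor written for ten days or more.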

Now the solution. We don’t see a way to change this directory path in the Neo4j documentation. One simple workaround would be to run a cron job that touches the files every 9 days or more frequently.
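The cron workaround could be sketched as follows. This is an illustration, not something we deployed: the function name refresh_atime, the schedule, and the glob in the crontab line are all assumptions.

```shell
# Refresh the access time of everything under DIR ($1) so that an age-based
# cleaner no longer considers the files stale.
# -a changes only the access time; -c never creates missing files.
refresh_atime() {
    find "$1" -exec touch -a -c {} +
}

# Hypothetical crontab entry for the user that owns the files, running every
# Monday at 03:00 (weekly is comfortably within the 10-day limit):
#   0 3 * * 1 find /tmp/decompressed-browser* -exec touch -a -c {} +
```

The glob matters because the directory name carries a random-looking numeric suffix, so a fixed path in the cron job could silently stop matching.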

But my coworker, who knows the JVM better, has a better idea. By default, the JVM on Linux puts temporary files in /tmp. We just need to change this directory to somewhere outside of /tmp with a JVM setting. Specifically, we add this line to neo4j.conf:

server.jvm.additional=-Djava.io.tmpdir=/u01/app/neo4j/temp

which is a setting for the JVM, not for Neo4j. And sure enough, the Neo4j Java process is now shown accessing the files in the new directory we designated. Problem solved!
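Besides watching strace, one quick way to confirm that the running JVM picked up the setting is to look at its command line in /proc. The sketch below assumes that Neo4j passes the server.jvm.additional values to java as ordinary command-line arguments; the function name jvm_tmpdir is made up for illustration.

```shell
# Print the value of -Djava.io.tmpdir from a NUL-separated argument list on
# stdin, which is the format of /proc/<pid>/cmdline.
# Prints nothing if the option is not present.
jvm_tmpdir() {
    tr '\0' '\n' | sed -n 's/^-Djava\.io\.tmpdir=//p'
}

# Typical use, with $pid set to the PID of the Java process:
#   jvm_tmpdir < "/proc/$pid/cmdline"
```

An empty result would suggest the process is still running with the default temporary directory.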

Earlier I said process tracing is particularly powerful when you’re still new to the software you’re troubleshooting. Let me explain. If you are an expert in the software, you can solve almost all its problems with your knowledge and experience. Rarely do you need to check its behavior at the OS level to find a solution. But IT professionals are constantly challenged by new software. In these cases, learning the software and debugging at the low level may become equally productive. That’s when process tracing is a fruitful supplement to your software-specific troubleshooting.

By the way, to use strace effectively, you may want to have basic familiarity with OS system calls, ideally with some experience in systems programming. The output of strace may look overwhelming. But you can limit it with one of these options

strace -f -e trace=file -p <pid>

strace -f -e trace=network -p <pid>

[neo4j@myserver ~]$ strace -f -p 618020 -e trace=file
strace: Process 618020 attached with 70 threads
[pid 652213] access("/tmp/decompressed-browser6166723773063777664/browser", F_OK) = 0
[pid 652213] access("/tmp/decompressed-browser6166723773063777664/browser", F_OK) = 0
[pid 652213] access("/tmp/decompressed-browser6166723773063777664/browser", F_OK) = 0
[pid 652213] access("/tmp/decompressed-browser6166723773063777664/browser", F_OK) = 0
[pid 652213] statx(AT_FDCWD, "/tmp/decompressed-browser6166723773063777664/browser", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=36, ...}) = 0
[pid 652213] statx(AT_FDCWD, "/tmp/decompressed-browser6166723773063777664/browser", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=36, ...}) = 0
[pid 652213] statx(AT_FDCWD, "/tmp/decompressed-browser6166723773063777664/browser", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=36, ...}) = 0
[pid 652213] access("/tmp/decompressed-browser6166723773063777664/browser", F_OK) = 0
[pid 652213] access("/tmp/decompressed-browser6166723773063777664/browser", R_OK) = 0
[pid 652213] statx(AT_FDCWD, "/tmp/decompressed-browser6166723773063777664/browser", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFDIR|0755, stx_size=36, ...}) = 0
[pid 652213] statx(AT_FDCWD, "/tmp/decompressed-browser6166723773063777664/browser/index.html", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7f1cc6ef4030) = -1 ENOENT (No such file or directory)

Tracing file system calls after index.html and other files were deleted, while the web page shows 403; note the final statx() call on index.html, which fails with ENOENT
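In a long trace, the failing calls are usually the interesting ones, and they are easy to miss among hundreds of successful lines. A minimal sketch of a filter that keeps only calls returning an error, assuming strace's usual `= -1 ENAME (description)` result format, is shown below; the function name failed_calls is hypothetical. Newer strace releases also have a -Z option that prints only failed calls, so check your man page before reaching for grep.

```shell
# Keep only the syscalls that returned an error, from strace output on stdin.
# Matches results such as ") = -1 ENOENT (No such file or directory)".
failed_calls() {
    grep -E '\) = -1 E[A-Z0-9]+ \('
}

# Typical use on a trace saved with "strace -o trace.out ...":
#   failed_calls < trace.out
```

Applied to the trace above, this would leave just the statx() call on index.html.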

With the -e trace=file and -e trace=network options shown above, only file-related or network-related system calls, respectively, are traced. You can of course save the output to a file first with -o. Other tools combined with strace are also helpful, such as lsof, which lists open files so you can see which files the file descriptors reported by strace point to. (You could of course check /proc/<pid>/fd, but lsof does it better.) If the process is busy doing work on the CPU and not making system calls, you can run pstack (from the GDB package, which you may need to install). The top few functions you see can be used as keywords for a Google search, even if the software is closed source. Tricks are plenty, and your skills grow with every troubleshooting episode.
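For a single descriptor, the /proc route mentioned above is a one-liner. The sketch below assumes a Linux /proc filesystem and enough permission to read the target process's fd directory; the function name fd_target is made up for illustration.

```shell
# Print what file descriptor FD ($2) of process PID ($1) points to,
# by reading the symlink that Linux exposes under /proc.
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Typical use: strace shows read(57, ...) in process 618020, so ask
#   fd_target 618020 57
```

For sockets and pipes, the result is a label such as a socket or pipe identifier rather than a path, which is one reason lsof is often the more convenient tool.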

Over the years, I’ve used process tracing to solve numerous problems, or at least to find out where to look first. Of course, many Linux/UNIX users around the world do the same. It is a tradition, or a culture.

But this tradition or culture does not exist in the Windows world. Many years ago, I made a great effort to do something similar on Windows, with limited success. The work was hampered by my lack of Windows systems programming experience and by complicated system calls (or native Windows APIs, which are undocumented although mostly guessable), among other things. When I asked some Windows experts about process tracing, at most they mentioned such tools as Process Explorer and Process Monitor. If a phenomenal Windows expert doggedly insists on troubleshooting a Windows process without source code, he or she will attach a debugger to it, and that calls for skills that a regular but experienced Windows IT professional, the counterpart of a seasoned Linux user, has not mastered. These experts are too small a minority to turn what they do into a common practice, let alone a tradition or a culture.

June 2026
(originally posted on Medium)