On debugging

This article explores how to debug faster by building a mental model of debugging borrowed from graph theory.

A mental model

One mental model of debugging is:

There’s a knowledge graph in your head of software engineering, tooling, and the systems you work with. As you gain more knowledge, you expand this graph.

This includes a collective knowledge graph of the documentation, wikis, and messages of your organization. As the organization produces more docs, wikis, and chats, this graph expands.

There’s a problem graph of the stack, code, components, API calls, services, etc.

Debugging effectively means being able to traverse both graphs rapidly: identifying where in the problem graph the problem lies, what knowledge to apply, and what to change to fix it.

We get stuck when our knowledge graphs are insufficiently developed, or we’re not traversing the problem graph enough to find the problematic node or the knowledge graph enough to apply the correct knowledge.
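To make the "problem graph" idea concrete, here is a minimal sketch. The component names, the `PROBLEM_GRAPH` dictionary, and the `find_faulty` helper are all hypothetical, and the `is_faulty` check stands in for whatever inspection you would actually do at each node (reading logs, stepping through code, and so on).

```python
from collections import deque

# Hypothetical problem graph: each node is a component,
# and its edges point to the things it calls or depends on.
PROBLEM_GRAPH = {
    "page": ["component", "api_client"],
    "component": ["core_button"],
    "api_client": ["gateway"],
    "gateway": ["orders_service"],
    "core_button": [],
    "orders_service": [],
}


def find_faulty(graph, start, is_faulty):
    """Breadth-first search from `start`.

    Returns the path to the first faulty node found, or None if
    every reachable node passes the check.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if is_faulty(node):
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Pretend the bug lives in the gateway.
path = find_faulty(PROBLEM_GRAPH, "page", lambda node: node == "gateway")
print(path)  # ['page', 'api_client', 'gateway']
```

If the node check is too narrow (an underdeveloped knowledge graph) or the search never reaches the right node (not enough traversal), the function returns `None`, which is roughly what being stuck feels like.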

Traversing the graph

Whenever we debug, we are implicitly applying graph traversal strategies.

DFS on the problem graph: Step through the whole stack trace with a debugger and trace through the code. Navigate up and down code trees.

BFS on the knowledge graph: Apply a bunch of different ideas and tools that you know (React Component Tools, Debugger, Datadog, etc.), and ask your neighbors for ideas.

BFS on the problem graph: Look at all the recent commit messages and edited files and see if anything stands out.

Binary search on the problem graph: Do a git bisect to narrow down the exact offending commit.
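As a rough illustration of why bisecting is a binary search, here is a small sketch. It is not how git implements `git bisect`; the `first_bad_commit` function, the commit names, and the `is_bad` check are hypothetical, and the sketch assumes the oldest commit is good, the newest is bad, and every commit after the first bad one is also bad.

```python
def first_bad_commit(commits, is_bad):
    """Binary search over `commits` (ordered oldest to newest).

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad, all later commits are bad too.
    Returns (index of the first bad commit, number of checks made).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    checks = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        checks += 1
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return hi, checks


# 100 hypothetical commits; the regression landed in commit c63.
commits = [f"c{i}" for i in range(100)]
index, checks = first_bad_commit(commits, lambda c: int(c[1:]) >= 63)
print(commits[index], checks)  # c63 7
```

Each check halves the remaining range, so 100 commits need about 7 checks rather than up to 98 with a linear scan.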

Additionally, if you have multiple people tackling the same problem, it helps for each person to traverse a different area of the graph, which avoids redundant work.

There’s an incident! The SEV manager helps delegate. One person may try to git bisect, another might look at regression code profiles, another might look at user and employee reports. They all contribute to the collective knowledge graph by reporting their findings.

Improving the graph traversal

How can you more quickly and thoroughly explore nodes in these graphs?

I think getting better at debugging comes down to:

Improving the knowledge graph:

Getting better at understanding programming languages, frameworks, architecture.

Getting better at understanding tools and applying them.

Debugger, React Component Tools, Datadog, Sentry, Git, etc.

Getting better at searching internal tools / Google / LLMs / Stack Overflow / Slack / etc. for similar issues.

Improving the collective knowledge graph:

Creating better organizational processes and communication norms for managing problems and incidents.

Problems: Asking good questions. Adding engineering investigation notes to tasks. Communicating problems proactively (e.g. dev infra is broken).
Incidents: Solid incident management practices, blameless postmortems.

Creating better product, technical, and tooling documentation.

General docs. FAQs. Collections of past incidents and solved problems. Debugging and tooling docs.

Organizing channels of collective knowledge.

Categorizing docs, auditing and improving wiki organization, archiving old docs, organizing Slack channels in a more efficient way.

Improving the problem graph traversal:

Navigating up and down code stacks, files, and versions very rapidly.

Becoming more familiar with the codebase, services, APIs.

Narrowing down the problem very fast, e.g. by ruling out problem areas (frontend or backend? which component? which versions?).

Why this matters:

Just as better data structures and algorithms can dramatically speed up graph traversals, the difference between a great debugging strategy and an okay one could mean root-causing a regression in 5 minutes instead of 4+ engineer-hours.

Let’s say you debug about 3 times a week and your average time to solve these bugs is ~2 hours. If you can bring this down to ~20 minutes, you’ve saved about 5 hours each week (3 bugs × 100 minutes), and you can help other people debug faster too, saving them time as well.
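Here is that back-of-the-envelope estimate as a tiny sketch, so you can plug in your own numbers. The function name and parameters are hypothetical, chosen only for illustration.

```python
def weekly_hours_saved(bugs_per_week, minutes_before, minutes_after):
    """Hours saved per week if each bug takes `minutes_after` instead of `minutes_before`."""
    return bugs_per_week * (minutes_before - minutes_after) / 60


# 3 bugs a week, ~2 hours each before, ~20 minutes each after.
print(weekly_hours_saved(3, 120, 20))  # 5.0
```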

Improving the collective knowledge graph helps everybody solve the same classes of problems much faster, and these collective time savings can be massive.

“Fixer” archetypes:

One common archetype of staff+ engineers at companies is a “Fixer” engineer. They help their org operate faster by helping other engineers solve problems much faster than they would otherwise. One Fixer might help 5 people a day solve problems via Slack messages or pairing, and help resolve multiple major incidents a month. They might also spend their time identifying and solving critical, complex problems or meta problems (e.g. root infrastructural problems that cause many downstream problems)!

Practical tips

Solve problems outside of your direct domain, because this builds significant expertise and adds breadth & depth to your knowledge graph. Many bugs originate at a layer outside of your ownership (e.g. you work on product components but the problem is actually in core web components), and ping-ponging the task between teams during triage or fixing the bug at the wrong level costs significant time.

Pair often. Pay attention to how other people apply strategies and ask them questions to fill in your knowledge graph!

Document the systems (product and technical), the strategies (tooling), and past and present investigations! This expands the collective knowledge graph everybody has access to.

Strategize. When debugging, take a step back and consider the two graphs and your debugging strategy. How can you navigate knowledge, tooling, or the code faster to solve the problem as quickly as possible?

Maybe this means (1) asking for help or rubber duck debugging sooner because you are not making forward progress, (2) timeboxing how long you try a particular approach, or (3) documenting your investigation, assumptions, and learnings more clearly.

Check your assumptions. Often, we get stuck because a logical error led us to mark the problematic node in our problem graph as not problematic. Writing down all our assumptions and then checking them helps us find out whether each one is actually true!

Perhaps we thought the offending code MUST be in a particular set of versions or MUST be in a particular part of the codebase, but this actually isn’t true!

Example assumptions:

The problematic code is in our frontend components, because the issue occurs on the frontend but not when querying via the API call directly.

^ This assumption could be wrong if the direct API call doesn’t exactly match the way the frontend calls it, or if there’s some middleware or server-side rendering shenanigans.

This issue began on July 23 2024, so only new code written after that is problematic.

^ This assumption could be wrong if July 23 is only when the first few reports came in and we have not yet done a proper bisect to verify when the bug was actually introduced.

The problem is in product code, because it only affects my product.

^ This assumption could be wrong, because we may not realize that the issue affects other parts of the app, so the problematic area might actually be a core component and not a product component!

Sometimes it’s a random issue from an external vendor (…like CrowdStrike).
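One way to keep yourself honest is to write assumptions down next to the parts of the problem graph they caused you to rule out. A minimal sketch follows; the `Assumption` class, the `nodes_to_reopen` helper, and the example entries (including the "mobile" one) are hypothetical and only loosely based on the examples above.

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    claim: str
    rules_out: set  # problem-graph areas we stopped looking at because of this claim
    verified: bool = False


def nodes_to_reopen(assumptions):
    """Areas that were ruled out only by assumptions nobody has verified yet."""
    unverified, verified = set(), set()
    for assumption in assumptions:
        target = verified if assumption.verified else unverified
        target |= assumption.rules_out
    return unverified - verified


assumptions = [
    Assumption("Direct API call works, so the bug is in the frontend",
               {"backend", "middleware"}),
    Assumption("Issue began on July 23, so older commits are fine",
               {"commits before July 23"}),
    Assumption("Mobile app is unaffected (checked by hand)",
               {"mobile"}, verified=True),
]

print(sorted(nodes_to_reopen(assumptions)))
# ['backend', 'commits before July 23', 'middleware']
```

Anything this returns is a part of the graph you have stopped searching without actually proving it is clean.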

Try other ideas before bisecting. I typically try to solve the problem directly (by inspecting the product, reading the code, traversing git blames, or reading version diffs) instead of bisecting. I think if you bisect, you rob yourself of the chance to debug the problem directly and of all the knowledge you would gain from it!

I hope this was helpful! Let me know if you have other tips & ideas to expand this model. 🙂

endnote: I think that if you replace “Debugging” with “Problem Solving”, most of this generalized mental model holds. How do we traverse and apply information faster to solve problems? How do we best contribute to the collective knowledge graphs of an organization? How do we strategize to solve the same patterns of problems more quickly and thoroughly?

recommended reading:

What Science Can Learn from Car Mechanics by Trevor Klee