Ghost in the Machine

That's why documentation is so important

Mar 12, 2023 · 4 min read

Why am I writing this? Because sometimes the best way to deal with your problems is to force everyone to read about them. The facts have been altered to make them more awesome.

After a month-long collaborative effort by the entire backend team, Raju 🎭 (the backend lead) had a breakthrough thought: changing the deployment type to WSGI might fix the issue. But right before he could try it, Prof. PP 🎤 called him to his office. He said that he had had enough of Raju's shenanigans and that Raju's services were no longer required. Raju had spent the past 2 years tirelessly fulfilling one requirement after another. But this sudden dismissal did not disappoint Raju; in fact, he was relieved that he no longer had to work under a management that did not recognise his talent or his contribution.

Prof. PP pushed everyone out of the bug hunt and pulled out his big gun, his prodigy Lil Krishna, straight outta SCAMS, who restarted the investigation with their own "way of debugging". Needless to say, they wasted 2 weeks chasing 10-year-old Stack Overflow answers. Spoiler alert: they ended up doing what Raju had been doing in the first place. All this, and there wasn't a drop less in Prof. PP's vast sea of arrogance.

Moving on from people who have no idea what they're doing, Raju packed his stuff into a cardboard box, wondering what he'd do next: freelancing, a startup or another soul-killing job. While Raju was lost in his thoughts, Altomush, one of the few employees who was happy with the company, saw him. Alto asked Raju why he was packing his stuff. Raju replied with a smile, "Well, it looks like the company is downsizing and unfortunately, my stapler didn't make the cut. It would have if John had simply implemented it how PP wanted. At least the responsibility for this catastrophe wouldn't have been on our heads".

One month ago...

John's module didn't seem to be stable. His Django application kept crashing with out-of-memory (OOM) errors, seemingly at random.

To make debugging and deploying easier, John separated the paginated sync request from the rest of the API endpoints and containerised the two separately.

It turned out that the paginated request was indeed the one crashing. But this API wasn't doing any analysis or anything complex at all. It fetched a chunk of a partitioned Parquet file, converted it to JSON and sent it back. John and his team had tested this function on their local systems countless times without ever facing an issue. The issue only occurred when the function was run in the containerised Django application.

The confusing thing in all this was that the container and Django configuration were exactly the same in every other module too. But no one else had reported any such issue. It was as if there was a ghost in John's module, bent on delaying his release and ruining his sleep.

The suspects were many and the days left were few. Something wrong in the middleware, or a bug in a library? What was it? John conferred with Raju and they split up their team to investigate every loose end. They checked every open issue in every library they were using, even if it was only remotely related to their problem. In the end there were 2 prime suspects: the Prometheus middleware or the pyarrow library.

At this point, Raju left on a solo mission to implement a common WSGI deployment for all modules. John and his team kept on grinding, trying every method they could to find the memory leak. The graphs in Grafana kept showing that the application was stockpiling memory.
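For readers who want to try this kind of hunt themselves, here is a minimal sketch of one common approach: comparing two `tracemalloc` snapshots around repeated calls to the suspect function. This is not necessarily what John's team did. The helper `top_memory_growth` and the stand-in `leaky_handler` are made up for illustration. One caveat: `tracemalloc` only sees memory allocated through Python's allocators, so a leak inside a native library's own memory pool may not show up here at all.

```python
import tracemalloc


def top_memory_growth(workload, repeat=50, limit=5):
    """Call `workload` repeatedly and return the source lines whose
    traced allocations grew the most between two snapshots."""
    tracemalloc.start()
    try:
        workload()  # warm-up, so one-time setup is not counted as growth
        before = tracemalloc.take_snapshot()
        for _ in range(repeat):
            workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to() returns StatisticDiff entries, biggest size change first
    return after.compare_to(before, "lineno")[:limit]


# Hypothetical stand-in for a request handler that never releases its data
_cache = []


def leaky_handler():
    _cache.append(bytearray(100_000))


if __name__ == "__main__":
    for stat in top_memory_growth(leaky_handler):
        print(stat)
```

If the top entries keep growing with `repeat`, the line they point at is a good place to start looking.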

Raju returned from his solo adventure. He was about to test what he had found. His discovery would lead the team either towards much-needed relief or towards absolute despair.

Well Well Well ...

But this wasn't the incident that Raju was telling Alto about. Then why did I tell you this story, you ask? Because it's #DebuggingFeb and because misleading the audience is awesome. We'll get to the story that Raju tells Alto later.

At the end of the story, Alto asked Raju for a couple of pieces of advice to help him succeed in this cruel world of software development. Raju advised, "Always read the documentation and never use runserver in production".

Thanks for reading, I hope you enjoyed it.

Support Azanul Haque by becoming a sponsor. Any amount is appreciated!