Quite a few users reached out to me last year and requested this feature in Bookmark Ninja. If you have many old bookmarks in your bookmark manager, it can be a pretty useful tool to clean up your stuff. At first glance it seemed to be a very simple task: you just iterate through all the bookmarks and check whether each link returns 404 (page not found). If so, you mark the bookmark by adding the “dead-link” tag. But as I started to dig deeper into it, it turned out it wasn’t that easy.
How to check if a link is dead? — The easy part
One of the reasons I love Java is that you can find tons of resources about any topic on the net. I could quickly find all the info I needed to write the below method:
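A minimal sketch of that method, reconstructed from the description in this post and assuming a plain HttpURLConnection (the class name is my own; the original code may differ in details):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.UnknownHostException;

public class DeadLinkChecker {

    // Returns true only when we are confident the link is dead:
    // HTTP 404, or a host that cannot be resolved. Anything else
    // (timeouts, server errors, etc.) is treated as "not dead".
    public static boolean isLinkDead(String link) {
        HttpURLConnection connection = null;
        try {
            URL url = new URL(link);
            connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");           // #1: HEAD would be faster but proved unreliable
            connection.setInstanceFollowRedirects(true);  // #2: check the "last" page in the redirect chain
            connection.setConnectTimeout(10_000);         // #3: 10-second timeout
            connection.setReadTimeout(10_000);
            int responseCode = connection.getResponseCode();
            return responseCode == HttpURLConnection.HTTP_NOT_FOUND;
        } catch (UnknownHostException e) {
            return true;   // invalid domain: dead link
        } catch (IOException e) {
            return false;  // timeout or other error: can't decide, so assume the link is alive
        } finally {
            if (connection != null) {
                connection.disconnect(); // always release the connection
            }
        }
    }
}
```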
It’s pretty clear and straightforward. It returns true if the response code is 404 (page not found) or the link contains an invalid domain (UnknownHostException). In all other cases it returns false (the link is not considered dead). Before I call getResponseCode I set these 3 parameters:
The Request Method (#1) can be either GET or HEAD. If it’s set to HEAD, only the header is read, so it’s faster; with GET the whole page is read. I ran some tests comparing HEAD and GET on 4K bookmarks and got the following results: with HEAD it took 34 minutes to process the links, while with GET it took 47 minutes. That’s a big difference, and it would be even bigger for users who run this tool on 30–40K bookmarks (yes, there are some users who have this many bookmarks). So HEAD would be the obvious choice, but it turned out that with HEAD a few bookmarks were wrongly marked as dead links. For whatever reason HEAD was not reliable, so I had to go with GET.
Follow Redirects (#2) unquestionably has to be turned on, because redirects are executed when you open the link in the browser, and I’m interested in the state of the “last” page in the chain.
For the Timeout (#3) I had to pick a value, and I finally settled on 10 seconds. If no response arrives within 10 seconds, I can’t decide whether the link is dead or not, so I’d rather consider it not dead. It’s better to miss a dead link than to mark a bookmark as dead when it actually isn’t broken.
Then things got complicated
Coding the isLinkDead method (see above) was quite easy, but integrating it into Ninja at production quality was another story. I ran one more test on 40K bookmarks and it took 27(!) hours to complete. This is unquestionably a really long operation, so I had to take care of the following:
- The process must be run in the background (in a thread) on the server.
- Once a user has started the process, I have to make sure that they can’t restart it while it’s running.
- When the process completes, the user must get a notification email that includes the following: how many bookmarks were checked, how many dead links were found, what tag was used to mark the bookmarks, and a link to the bookmarks with dead links.
- I had to check how much load the server takes (memory, CPU) when several concurrent users run FDL (Find Dead Links), and how these extra background processes impact the everyday usage of Ninja. Fortunately it turned out that the load is so minimal that I literally couldn’t even see it in Amazon CloudWatch.
- What if the server has to be restarted, or a new build is deployed, while FDL processes are running? Handling this scenario caused the most complications. After a server restart or a new deployment I have to restart the interrupted FDL processes, and this mechanism had to be coded, too. It also turned out that after a new build is deployed, the threads started by the previous deployment do not terminate; they continue to run even though exceptions occur (the old EJB objects don’t exist anymore). It’s weird, because after an unhandled exception a thread should terminate. So this had to be handled as well: the old threads are now terminated manually.
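The “can’t restart while running” guard from the list above can be sketched roughly like this, assuming a per-user worker thread (FdlRunner, tryStart, and the user-ID key are hypothetical names, not Ninja’s actual code; the restart-after-redeploy logic is not shown):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FdlRunner {

    // User IDs that currently have a running FDL process.
    private static final Set<Long> RUNNING = ConcurrentHashMap.newKeySet();

    // Starts the FDL job in a background thread.
    // Returns false if the user already has a running process.
    public static boolean tryStart(long userId, Runnable fdlJob) {
        if (!RUNNING.add(userId)) {
            return false; // already running: refuse to start a second one
        }
        Thread worker = new Thread(() -> {
            try {
                fdlJob.run();
            } finally {
                RUNNING.remove(userId); // always release the guard
            }
        }, "fdl-user-" + userId);
        worker.setDaemon(true); // don't let a worker block JVM shutdown
        worker.start();
        return true;
    }
}
```

Set.add on a ConcurrentHashMap-backed set is atomic, so two simultaneous start requests for the same user can’t both succeed, and the finally block guarantees the guard is released even if the job throws.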
The whole story is interesting because coding the core functionality of the “Finding Dead Links” feature was roughly 5% or even less of the whole effort. The majority of the effort went into delivering a good user experience at production quality. And I didn’t even mention the mistakes (maybe in another post?) I made when I screwed up the “thread safe concept”, or when I forgot to close the connection (connection.disconnect() in the above code) while iterating through thousands of links. All these mistakes caused several extra hours of debugging. The lesson learned here is that it’s not always easy to estimate the effort required to develop a new feature. It’s better to leave more time for the effort estimation, even if the feature seems easy to implement at first glance.