Research Workflows
Research is fun, but without the right tools a lot of annoyingly small issues can add up and become a source of friction over time. In this post, I’m going to try to document some of my existing strategies as well as plan out workflows for the remaining bottlenecks. Overall, there are five broad areas with high coefficients of friction (yes, this is the entire research process, but bear with me):
- Archiving and organizing literature
- Revisiting literature
- Running and managing experiments
- Analyzing and communicating results
- Writing the paper
Archiving and Organizing Literature
Who needs Twitter’s Firehose API when you can get your own personal firehose of papers for the low, low price of being a terminally online doomscroller. I used to bookmark everything interesting I came across, but since bookmarks are inherently unsearchable (why isn’t this a thing yet?? nvm, while writing this I found a Chrome product announcement from Dec 2022, so it turns out you can indeed search through your bookmarks now), this quickly became a mountain of links that I never ended up revisiting. Regardless, the bottleneck is neither finding nor saving papers, but organizing them in a way that makes them easy to recall and consume. I’ve tried a number of different approaches over the years, but I think I’ve finally settled on a workflow that works for me.
Saving Papers
tldr; Save to Readwise Reader inbox and immediately tag based on title and abstract.
I’ve been using Readwise Reader for a few months now, and it’s one of the few subscriptions I don’t regret paying for. The biggest benefit is that I can easily save links, papers, Twitter threads, or even YouTube videos to the mobile app (two taps) or with the Chrome extension on desktop (one click), and everything automatically syncs across devices. Since Reader is designed for reading, annotations are a first-class feature (even for PDFs), which means I don’t have to wait until I’m back at my desktop to read something I’ve saved: I can just read it on my phone and jot down annotations as I go. The best part is that everything I annotate gets automatically pushed to my Obsidian knowledge base, so if I need to pull up a paper again in the future, I can just search for it in Obsidian and all my annotations will be there. I also try to add some tags as soon as I save something, and these get exported to the Obsidian note as well, which makes it easier to find papers later on.
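To make this concrete, here’s roughly the shape of the note that ends up in my vault (a hypothetical sketch; the paper, tags, and layout all depend on your Readwise export template):

```markdown
# Some Neat Paper (Author et al., 2024)

Tags: #paper #rl

## Highlights
- "The key trick is X rather than Y." (Page 3)
    - **Note:** worth comparing against the baseline in [[Other Paper]]
```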
Organizing Papers
tldr; Use Zotero to index the correct metadata for a paper and then export it to Obsidian where you can synthesize a short note based on your previous Readwise annotations.
If it seems like I’m doing double (triple?) the work by saving the same paper to Readwise, Obsidian, AND Zotero, yessir, you’re absolutely correct. Unfortunately, c’est la vie; until someone builds a (free) app that does everything I want, I’m going to stick with this workflow. But fret not, because it isn’t as bad as it sounds, and it effectively addresses a number of different pain points.
Consider the situation where, let’s say, CVPR acceptances have just come out and your firehose is full of “Thrilled to share” posts… that still link to an arXiv preprint. Or maybe someone posts a thread about a neat finding they made, and Reviewer 2 jumps into the replies to point out that this was already done by Schmidhuber in 1996… with a link to a PDF hosted on HAL. I love open dissemination of science as much as the next guy, but until Zotero figures out a way to intelligently merge duplicate entries, I’m going to try to save the “official” version of the paper, so that I don’t have to manually update my BibLaTeX file a couple of hours before whichever conference deadline I’m rushing to meet. That being said, I still want to save the link and quickly skim the paper I came across, which I can do with Reader. At the end of the day, after I’ve finished annotating the saved copy, or tagged it and moved it from my Inbox into the Later tab on Reader, I use the Google Scholar extension to find the published version of the paper and save that final version to Zotero (along with the correct metadata).
Once it’s saved to Zotero, I use ZotFile to automatically rename the PDF and reformat the metadata to create a citation key that’s short, informative, and easy to remember. A few seconds after a paper is formatted correctly, a background service uses Zotero’s Export Library feature to add the newly indexed paper to a BibLaTeX-based mirror of my Zotero database. This file lives inside my Obsidian vault and is automatically backed up to my GitHub repo. I can then switch to my Obsidian vault and use a keyboard shortcut to trigger a refresh of the citation plugin, which queries the updated database and creates a new templated note for the newly added paper, along with an #unread tag signifying that I haven’t properly synthesized a note for that paper yet. The nice thing about Obsidian is that the Dataview plugin lets me create a dashboard in my Daily Note showing all the papers I’ve saved to Zotero but haven’t read yet (see the query sketch below). This means that I can quickly skim through the list and decide which papers I want to read next. Once I’ve read a paper and synthesized a note, I can just remove the #unread tag and the paper will disappear from the dashboard.
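For the curious, the query behind that dashboard can be as simple as the following (a minimal sketch; the "Papers" folder is a stand-in for wherever your citation plugin writes its notes):

```dataview
LIST
FROM #unread AND "Papers"
SORT file.ctime DESC
```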
Revisiting Literature
tldr; Unsure, check back later to see what I’ve converged on.
I ended the last section by saying that the paper disappears from the dashboard once you remove the tag. Unfortunately, the paper usually disappears from my memory as well, especially if it’s not directly related to my current research.
Local RAG-based Search?
I don’t have a great solution for this yet, but over New Year’s I wanted to dig into all the new LLM/RAG methods that have come out, so I got a simple app working that does RAG-style question answering over the notes/papers in my Obsidian vault. I’ll have to write another post about it soon, but I was pretty happy with the results, especially considering that it was running locally (albeit still on a GPU). The next step is to get it running fully on-device on my M1 MacBook, which I think should be doable based on some preliminary tests of Apple’s new MLX framework.
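Until that post is up, here’s a minimal sketch of the retrieval half of the idea, assuming a vault of markdown notes and using sentence-transformers for the embeddings (chunking, caching, and the actual answer-generation step are all omitted, and the path and model choice are illustrative):

```python
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

VAULT = Path("~/Documents/ObsidianVault").expanduser()  # hypothetical path

# A small model that runs comfortably on CPU; swap in whatever you like.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed every note once (a real version would cache these and chunk long notes).
notes = sorted(VAULT.rglob("*.md"))
texts = [n.read_text(encoding="utf-8", errors="ignore") for n in notes]
embeddings = model.encode(texts, normalize_embeddings=True)

def retrieve(query: str, k: int = 5):
    """Return the k most similar notes; these get fed to an LLM for the 'G' in RAG."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q  # cosine similarity, since embeddings are normalized
    top = np.argsort(-scores)[:k]
    return [(notes[i].name, float(scores[i])) for i in top]

print(retrieve("approaches to merging duplicate bibliography entries"))
```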
Improved Spaced-Repetition?
There are a few AI assistant plugins for Obsidian already, but none of them seem to tick all the boxes, so at some point I’d like to build my own. Ideally, it would work with me to generate Anki cards for papers I’ve recently read and then provide additional context after I review each card, reminding me of neat tidbits I might’ve forgotten. I’d also like to be able to ask it questions about papers I’ve read in the past and have it pull up the relevant notes. I think this is doable with the current state of the art, but I’ll have to do some more research to figure out the best way to go about it.
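The card-creation side at least seems mechanically straightforward: for instance, genanki can package Q/A pairs into an importable Anki deck (a sketch with hardcoded cards standing in for LLM-generated ones; the numeric IDs are arbitrary, as genanki requires):

```python
import genanki

# Model/deck IDs just need to be unique; genanki suggests random 32-bit ints.
qa_model = genanki.Model(
    1607392319,
    "Paper QA",
    fields=[{"name": "Question"}, {"name": "Answer"}],
    templates=[{
        "name": "Card 1",
        "qfmt": "{{Question}}",
        "afmt": "{{FrontSide}}<hr id='answer'>{{Answer}}",
    }],
)

deck = genanki.Deck(2059400110, "Papers")
# In the imagined plugin, these pairs would come from an LLM over my paper notes.
for q, a in [("What does paperX claim?", "That the baseline was under-tuned.")]:
    deck.add_note(genanki.Note(model=qa_model, fields=[q, a]))

genanki.Package(deck).write_to_file("papers.apkg")  # import this file into Anki
```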
Serendipity-Based Research Recommender System?
One thing I’ve gotten hooked on recently is the idea of using serendipity in recommender systems. I had some really exciting conversations with folks after the ALOE workshop at NeurIPS (completely unrelated to recommender systems), and over the holidays I serendipitously came across Ken Stanley’s new startup, Maven, which is building a new social network based on serendipity. There’s a lot of unexplored potential in this space, and one area where it could be really useful is research. Imagine having an assistant that serendipitously reminds you of a paper or an idea that isn’t directly related to your current research but makes connections between ideas in different fields, giving you a fresh perspective on something you’re stuck on! Plus, since it would be an Obsidian plugin, the chances of it hallucinating something completely random would be pretty low: it’s grounded in your notes and your TODO list, and is actually aware of what you’re working on day-to-day.
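I don’t yet know what the right notion of “serendipity” is here, but as a toy heuristic: surface notes that are related to what I’m currently working on, yet not too related. With normalized note embeddings like the ones above, that could be as crude as sampling from a mid-similarity band (the thresholds are completely made up and would need tuning):

```python
import numpy as np

def serendipitous_pick(current_vec, note_vecs, low=0.3, high=0.6, rng=None):
    """Pick the index of a note whose cosine similarity to today's work sits in
    a middle band: familiar enough to connect, distant enough to surprise."""
    rng = rng or np.random.default_rng()
    sims = note_vecs @ current_vec  # assumes unit-normalized embeddings
    candidates = np.flatnonzero((sims > low) & (sims < high))
    return int(rng.choice(candidates)) if candidates.size else None
```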
Running and Managing Experiments
I often work on a number of different projects simultaneously, and because they’re all research-y/open-ended, I end up running a lot of experiments and making iterative improvements. But with the number of things I try out, I haven’t yet found a clean, straightforward approach for keeping track of everything. So far, I’ve been parallelizing experiments across heterogeneous compute nodes using WandB sweeps, logging everything to their backend, and using WandB reports to share key findings with collaborators. This is a much better workflow than what I had when I originally started doing ML and had to SSH into remote servers to inspect TensorBoard logs saved on each filesystem.
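For reference, the sweep setup amounts to something like this (a sketch; the project name and parameter grid are placeholders, and the same agent call gets launched on each compute node):

```python
import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}

def train():
    run = wandb.init()
    cfg = wandb.config  # hyperparameters chosen for this trial
    # ... actual training loop using cfg.lr, cfg.batch_size ...
    wandb.log({"val_loss": 0.0})  # placeholder metric
    run.finish()

sweep_id = wandb.sweep(sweep_config, project="my-project")  # hypothetical project
wandb.agent(sweep_id, function=train)  # run this on every node to parallelize
```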
While this setup had its frustrations and bottlenecks, it’s recently become unusable. It turns out that my WandB academic-tier plan only supports 100GB of storage (a limit I surpassed months ago), and according to an update they recently sent me, I’m currently sitting at 3.1TB of logs/data. I love WandB, but there’s no way I’m paying them $90/month, which means it’s time to be scrappy and build something ourselves. Honestly, as long as I can keep per-experiment data usage below the threshold (which should definitely be possible), it should be OK? Or even better, if I can get an on-premise, locally hosted version of the WandB server up and running, I can use as much storage as I want. The main problem is reliability: I don’t know if the servers we have on campus will be able to keep up, and I don’t particularly fancy being a sysadmin. It’s honestly kind of wild when you think about the challenges involved and look at how people solve this at scale; you either have universities with their own massive IT departments or startups burning their runway on other startups that provide niche monitoring solutions. If you have neither, you build it yourself. Onwards!
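For what it’s worth, W&B does publish a self-hostable server image; going off their docs at the time of writing (so double-check before relying on this), the quickstart is roughly:

```bash
# Run the local W&B server; experiment data persists in the "wandb" volume
docker run --rm -d -v wandb:/vol -p 8080:8080 --name wandb-local wandb/local

# Point the client at the local server instead of api.wandb.ai
wandb login --host=http://localhost:8080
```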
As I migrate to my own locally hosted experiment management system, I’ll progressively update this section with my findings.
Analyzing and Communicating Results
tldr; Use Quarto to create interactive reports with rich multimedia and LaTeX.
Especially now that I’m moving away from storing my data on WandB, I need a way to quickly pull the data I want, process it, and publish the results/findings. I’ve also decided to start blogging about intermediate results and creating polished figures as I go, so that the process of writing the paper at the end is hopefully less painful: I can just copy-paste writing I’ve already published on my blog into the paper. No, this isn’t plagiarism, and if some handbook says it is, we ought to rewrite the handbook (but in the meantime I’ll just cite myself if you reeeeally want). Science (and publishing by extension) should be about sharing knowledge and ideas in a manner that is clear and reproducible, not about jumping through hoops to appease some arbitrary standard.
I think Quarto is the way to go here. It’s a relatively new framework, but it already has a lot of the features I want, and I think it’s only going to get better over time. The community seems to love sharing reproducible demos and tutorials in Google Colab, and that’s fine. But I personally don’t think Colab (or regular Jupyter notebooks) is a good interface for sharing results. It’s clunky (scrolling through a long notebook is an awful experience), aesthetically unpleasing (it doesn’t make good use of screen space), difficult to style, and not responsive on mobile. While tools like nbdev solve some of the problems around using notebooks with version control, nbdev hasn’t become mainstream yet (despite being around for a few years now), and most Colab notebooks today are still just that: notebooks. Lastly, notebooks can’t be easily exported to other formats or embedded into the wider web ecosystem. I’m really excited that Quarto solves all of these problems, and part of my goal in starting this blog is to figure out how to use it better and communicate my research more effectively with the wider community.
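To give a flavor of what this looks like: a Quarto document is just markdown with a YAML header and executable cells, which renders to a responsive HTML page (a minimal sketch assuming the Python engine; the title and data are made up):

````markdown
---
title: "LR ablation, interim results"
format: html
jupyter: python3
---

Validation loss drops sharply after the warmup phase:

```{python}
#| echo: false
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [0.9, 0.5, 0.42, 0.40])
plt.xlabel("epoch"); plt.ylabel("val loss")
plt.show()
```
````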
Writing the Paper
tldr; (for now) Use Overleaf to collaborate with co-authors and write the paper.
While most researchers use Overleaf today, many seem to use it on the free personal plan without realizing that if you’re at a university, you (probably) have access to the Overleaf Premium tier for free. The only real difference between the two is that the Premium tier gives you more features for collaborating with co-authors, change tracking, and a longer revision history, but I think that’s well worth having.
While Overleaf is great, I often find myself annoyed at how long it takes to recompile the document after making a change. I’ve noticed this is more of a problem when the document has a lot of images in it (gg computer vision), and I’m not sure if this is a limitation of how I use Overleaf or a problem others have with the platform as well. Compiling is faster if I do it locally, but then I lose the benefits of Overleaf’s collaboration features. I’m not sure if there’s a way to get the best of both worlds, but I’ll update this section if I find a solution.
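One partial workaround I’m aware of: Overleaf projects expose a git remote (availability depends on your plan), so you can pull the project down, compile locally, and push edits back where collaborators still see them. Roughly (the project ID and main.tex are placeholders):

```bash
# Clone the project via Overleaf's git integration (URL is in the project menu)
git clone https://git.overleaf.com/<project-id> paper && cd paper

# Compile locally; latexmk only rebuilds what changed
latexmk -pdf main.tex

# Push local edits back up to Overleaf
git add -A && git commit -m "local edits" && git push
```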