Towards minimal bibliographic management software
As a command-line junkie, I find most bibliographic management software (such as Mendeley) too bloated. All I want such software to be capable of is
- to add a new entry to my bibliography (bibtex), and
- to search through the articles.
Here’s my minimal approach to implementing these two features:
1. Bibtex
Crossref provides a simple way to obtain the bibtex entry for a given DOI. To get the bibtex entry for the article with, say, DOI 10.1901/jaba.1974.7-497a, simply issue the command
>> curl -LH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/10.1901/jaba.1974.7-497a"
@article{Upper_1974, title={The unsuccessful self-treatment of a case of “writer’s block”1}, volume={7}, url={http://dx.doi.org/10.1901/jaba.1974.7-497a}, DOI={10.1901/jaba.1974.7-497a}, number={3}, journal={Journal of Applied Behavior Analysis}, publisher={Society for the Experimental Analysis of Behavior}, author={Upper, Dennis}, year={1974}, pages={497-497}}
To make the output look nice, we apply some sed magic:
>> curl -LH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/10.1901/jaba.1974.7-497a" | sed "s/, /,\n/;s/},/},\n/g;s/\(.*\)}}/\1}\n}\n/"
@article{Upper_1974,
title={The unsuccessful self-treatment of a case of “writer’s block”1},
volume={7},
url={http://dx.doi.org/10.1901/jaba.1974.7-497a},
DOI={10.1901/jaba.1974.7-497a},
number={3},
journal={Journal of Applied Behavior Analysis},
publisher={Society for the Experimental Analysis of Behavior},
author={Upper, Dennis},
year={1974},
pages={497-497}
}
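In short: the first substitution breaks the line after the citation key, the second puts every field on a line of its own, and the last splits the two closing braces at the end so that the entry's closing brace gets its own line, followed by a blank line.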
The above output can be piped to a separate file. I created a separate directory in my articles directory, called .txtfiles. For each file article.pdf, there is a corresponding .txtfiles/article.pdf.txt in this directory. These text files contain the bibtex information generated above, plus the output produced in the next section.
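As a rough sketch of that step (the DOI below is just the example from above; in practice you would look it up for the article at hand), the formatted entry can be written straight into the corresponding text file:
mkdir -p .txtfiles
FILE=article.pdf
DOI=10.1901/jaba.1974.7-497a   # example DOI; replace with the DOI of $FILE
curl -LH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/$DOI" \
| sed "s/, /,\n/;s/},/},\n/g;s/\(.*\)}}/\1}\n}\n/" \
> .txtfiles/$FILE.txt
The pdftotext output from the next section is then appended to the same file.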
2. Making PDF files searchable
My implementation of PDF full-text search applies pdftotext to every article in the directory. This does produce some overhead, roughly 50 KB per 10 pages, but I accept that. The following sed sequence removes a lot of the single-letter, special-character and empty-line junk that pdftotext produces on an article with lots of figures and equations:
FILE=article.pdf
pdftotext $FILE - \
| sed "s/[^a-zA-Z ]//g" \
| sed "s/ /  /g" \
| sed "s/ . //g" \
| sed "s/  / /g" \
| sed "s/^. //" \
| sed "s/ .$//" \
| sed "/^\s*.\{0,1\}\s*$/d" \
>> .txtfiles/$FILE.txt
Line-by-line translation:
- transform the file article.pdf to text and write it to standard output
- remove all non-letters
- replace every single space by a double space
- remove all single-character words
- replace double spaces by single spaces
- remove single characters at the beginning of a line
- remove single characters at the end of a line
- remove all lines that consist of at most a single character and spaces
- and append the output to .txtfiles/article.pdf.txt
That’s it for now. The file .txtfiles/article.pdf.txt now contains all the information I need about this article.
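For an existing collection, the same pipeline can be wrapped in a loop. Here is a rough sketch, assuming all PDFs live in the current directory and the bibtex entries from section 1 are already in place:
mkdir -p .txtfiles
for FILE in *.pdf; do
  # append the cleaned-up text to the file that already holds the bibtex entry;
  # run this only once per article, since >> appends on every run
  pdftotext "$FILE" - \
  | sed "s/[^a-zA-Z ]//g" \
  | sed "s/ /  /g" \
  | sed "s/ . //g" \
  | sed "s/  / /g" \
  | sed "s/^. //" \
  | sed "s/ .$//" \
  | sed "/^\s*.\{0,1\}\s*$/d" \
  >> ".txtfiles/$FILE.txt"
done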
Outlook
Using the two functions from 1. and 2. you can hack together your own minimal bib manager. Once the directory .txtfiles has been filled, you can run full-text searches using grep, for example
grep -H 'author=.*Upper' .txtfiles/*
or have all the bibtex entries returned using sed, like this
sed -s '/^$/,$d' .txtfiles/*
which assumes that the bibtex entry is at the top of the txt-file and that the first blank line of the file appears right after the bibtex entry.
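The two can also be combined. For example, the following sketch (assuming GNU grep and sed) prints only the bibtex entries of the articles that match a search term:
grep -l 'author=.*Upper' .txtfiles/* | xargs sed -s '/^$/,$d'
Here grep -l lists the files that contain a match and sed -s then cuts each of them down to its bibtex entry. If your PDF names contain spaces, use grep -lZ together with xargs -0 instead.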
Of course there is a lot of room for improvement in the pdftotext conversion. If you are only interested in keywords, this link might be of interest to you. You might also want to write your own search script if you don’t like invoking sed and grep manually. I am going to post a shell script soon which does some of this in a more automated way.