Towards minimal bibliographic management software

May 17, 2012

As a command-line junkie, I find most bibliographic management software (such as Mendeley) too bloated. All I want such software to do is

  1. add a new entry to my bibliography (bibtex), and
  2. search through the articles.

Here’s my minimal approach to implementing these two features:

1. Bibtex

Crossref provides a simple way to obtain the bibtex entry for an article from its DOI. To get the bibtex entry for the article with, say, DOI 10.1901/jaba.1974.7-497a, simply issue the command

>> curl -LH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/10.1901/jaba.1974.7-497a"

@article{Upper_1974, title={The unsuccessful self-treatment of a case of “writer’s block”1}, volume={7}, url={http://dx.doi.org/10.1901/jaba.1974.7-497a}, DOI={10.1901/jaba.1974.7-497a}, number={3}, journal={Journal of Applied Behavior Analysis}, publisher={Society for the Experimental Analysis of Behavior}, author={Upper, Dennis}, year={1974}, pages={497-497}}

To make the output look nice, we apply some sed magic:

>> curl -LH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/10.1901/jaba.1974.7-497a" | sed "s/, /,\n/;s/},/},\n/g;s/\(.*\)}}/\1}\n}\n/"

@article{Upper_1974,
title={The unsuccessful self-treatment of a case of “writer’s block”1},
volume={7},
url={http://dx.doi.org/10.1901/jaba.1974.7-497a},
DOI={10.1901/jaba.1974.7-497a},
number={3},
journal={Journal of Applied Behavior Analysis},
publisher={Society for the Experimental Analysis of Behavior},
author={Upper, Dennis},
year={1974},
pages={497-497}
}

The above output can be redirected to a file. I created a separate directory inside my articles directory, called .txtfiles. For each file article.pdf, there is a corresponding .txtfiles/article.pdf.txt in this directory. These txt-files contain the bibtex information generated above, followed by the output produced in the next section.
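The steps of this section can be wrapped into a few small shell functions. This is only a sketch under some assumptions: the function names format_bibtex, doi2bib and add_entry are made up for illustration, GNU sed is assumed (for \n in the replacement), and the server is assumed to still return the entry on a single line as in the example above.

```shell
# Sketch only. Assumes curl and GNU sed; all function names are hypothetical.

# format_bibtex: read a one-line bibtex entry on stdin and break it into
# lines with the sed program shown above. The final \n of the last
# expression leaves an empty line after the closing brace.
format_bibtex() {
    sed 's/, /,\n/;s/},/},\n/g;s/\(.*\)}}/\1}\n}\n/'
}

# doi2bib: look up the DOI given as $1 and print the formatted entry.
doi2bib() {
    curl -sLH "Accept: text/bibliography; style=bibtex" "http://dx.doi.org/$1" \
        | format_bibtex
}

# add_entry: start .txtfiles/<pdf>.txt with the entry for DOI $1,
# where $2 is the name of the PDF file the entry belongs to.
add_entry() {
    mkdir -p .txtfiles
    doi2bib "$1" > ".txtfiles/$2.txt"
}
```

With these definitions, add_entry 10.1901/jaba.1974.7-497a article.pdf would create .txtfiles/article.pdf.txt with the bibtex entry at the top. Note that add_entry overwrites an existing txt-file, so it should run before any text is appended.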

2. Making PDF files searchable

My implementation of PDF full-text search applies pdftotext to all the articles in the directory. This definitely produces some storage overhead, about 50 kB per 10 pages, but I accept this. The following sed sequence removes a lot of the single-letter, special-character and empty-line junk that pdftotext produces when an article contains lots of figures and equations:

FILE=article.pdf
pdftotext "$FILE" - \
| sed "s/[^a-zA-Z ]//g"\
| sed "s/ /  /g"\
| sed "s/ . //g"\
| sed "s/  / /g"\
| sed "s/^. //"\
| sed "s/ .$//"\
| sed "/^\s*.\{0,1\}\s*$/d"\
>> ".txtfiles/$FILE.txt"

Line-by-line translation:

  1. convert the file article.pdf to text and write it to standard output
  2. remove all non-letters (spaces are kept)
  3. replace each single space by a double space
  4. remove all single-character words
  5. replace each double space by a single space
  6. remove single characters at the beginning of a line
  7. remove single characters at the end of a line
  8. remove all lines that consist of at most a single character and spaces
  9. append the output to .txtfiles/article.pdf.txt
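To apply this cleanup to every article in the directory, the pipeline can be turned into a filter and called in a loop. The following is a rough sketch, not a finished tool: the names clean_text and index_pdfs are made up for illustration, and GNU sed is assumed because \s is a GNU extension.

```shell
# Sketch only. Assumes pdftotext and GNU sed; function names are hypothetical.

# clean_text: the cleanup steps as a stdin-to-stdout filter. The last
# expression uses .\{0,1\} so that completely empty lines are deleted too.
clean_text() {
    sed 's/[^a-zA-Z ]//g' \
    | sed 's/ /  /g' \
    | sed 's/ . //g' \
    | sed 's/  / /g' \
    | sed 's/^. //' \
    | sed 's/ .$//' \
    | sed '/^\s*.\{0,1\}\s*$/d'
}

# index_pdfs: append the cleaned text of every PDF in the current
# directory to its companion file .txtfiles/<name>.pdf.txt.
index_pdfs() {
    mkdir -p .txtfiles
    for pdf in *.pdf; do
        [ -e "$pdf" ] || continue   # no PDF present: the glob stayed literal
        pdftotext "$pdf" - | clean_text >> ".txtfiles/$pdf.txt"
    done
}
```

Because the text is appended, calling index_pdfs twice stores the text of each article twice, so it is meant to be run once after the bibtex entries have been written.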

That’s it for now. The file .txtfiles/article.pdf.txt now contains all the information I need about this article.

Outlook

Using the two functions 1.) and 2.) you can hack your own minimal bib manager. After the directory .txtfiles has been filled, you can invoke full text searches using grep, for example

grep -H 'author=.*Upper' .txtfiles/*

or have all the bibtex entries returned using sed, like this

sed -s '/^$/,$d' .txtfiles/*

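which assumes that the bibtex entry is at the top of the txt-file and that the first blank line of the file appears right after the bibtex entry. The -s option of GNU sed treats every file separately, so the deletion range starts over for each file.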
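The grep and sed calls above can be combined into one small search function. This is a minimal sketch under the same assumption about the file layout (bibtex entry first, then a blank line); the name bibsearch and its optional second argument are made up for illustration.

```shell
# Sketch only. bibsearch is a hypothetical name.
# Print the bibtex entry of every article whose txt-file matches the
# extended regular expression $1 (case-insensitive).
# $2 is the directory to search and defaults to .txtfiles.
bibsearch() {
    dir=${2:-.txtfiles}
    # grep -l prints only the names of the matching files
    grep -l -i -E -e "$1" "$dir"/*.txt | while IFS= read -r f; do
        sed '/^$/,$d' "$f"   # keep everything before the first blank line
        echo                 # blank line between entries
    done
}
```

For example, bibsearch 'writer.s block' would print the entries of all articles that mention the phrase anywhere in their bibtex data or full text.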

Of course there is a lot of room for improvement in the pdftotext conversion. If you are interested only in keywords, this link might be of interest to you. You might also want to implement your own search script if you don’t want to invoke sed and grep manually. I am going to post a shell script soon that does some of these things in a more automated way.
