This is still mostly proof-of-concept stage, but it seems to work on my Mac. It requires the pdftotext utility from the Xpdf project, which parses PDFs into plain text files. The Zotero fulltext indexer calls pdftotext on the PDF file and saves the plaintext version as .zotero-ft-cache in the attachment item's storage directory. It runs the fulltext word indexer on the plaintext file and also scans the plaintext file when doing a phrase search.
To try it out, install a copy of Xpdf (or just pdftotext) and either place pdftotext into the Zotero data directory or create a symlink. Either way, the file must be named pdftotext-{platform}[.exe], where {platform} is navigator.platform, with spaces replaced by hyphens (e.g. "Win32", "Linux-i686", "MacPPC", "MacIntel", etc.). On my Mac, with Xpdf installed via Darwin Ports, I create a symlink to /opt/local/bin/pdftotext named pdftotext-MacIntel. This setup will allow users to sync their Firefox profiles and still have Zotero use the appropriate platform-specific binary.
Assuming we go this pdftotext route, I think we'll instruct users to download and install Xpdf/pdftotext, possibly even providing binaries ourselves. The binaries are too big to include in the XPI. I'm going to look into creating a GUI to make linking Zotero to pdftotext easier. I also need to finish some of the other tickets related to indexer feedback and control.
There are also two new hidden prefs, fulltext.pdfMaxPages and fulltext.textMaxLength, currently set to 100 and 500K, respectively. The first determines how many pages of each PDF pdftotext processes, and the second determines how many characters and/or bytes of text files (the PDF cache files included) Zotero indexes and scans. These defaults may want to be adjusted higher or lower.
Closes#315, Hidden pref to set maximum file size to index/scan