This is still mostly proof-of-concept stage, but it seems to work on my Mac. It requires the pdftotext utility from the Xpdf project, which parses PDFs into plain text files. The Zotero fulltext indexer calls pdftotext on the PDF file and saves the plaintext version as .zotero-ft-cache in the attachment item's storage directory. It runs the fulltext word indexer on the plaintext file and also scans the plaintext file when doing a phrase search.
To try it out, install a copy of Xpdf (or just pdftotext) and either place pdftotext into the Zotero data directory or create a symlink. Either way, the file must be named pdftotext-{platform}[.exe], where {platform} is navigator.platform, with spaces replaced by hyphens (e.g. "Win32", "Linux-i686", "MacPPC", "MacIntel", etc.). On my Mac, with Xpdf installed via Darwin Ports, I create a symlink to /opt/local/bin/pdftotext named pdftotext-MacIntel. This setup will allow users to sync their Firefox profiles and still have Zotero use the appropriate platform-specific binary.
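The naming rule above can be sketched roughly as follows. pdftotextFileName() is a hypothetical helper, not the code Zotero actually uses, and the check for a "Win" prefix to decide on the .exe suffix is an assumption.

```javascript
// Hypothetical helper (not Zotero's actual code): builds the expected
// binary name from a platform string such as navigator.platform.
function pdftotextFileName(platform) {
  // Spaces in the platform string become hyphens: "Linux i686" -> "Linux-i686"
  var name = "pdftotext-" + platform.replace(/ /g, "-");
  // Assumption: Windows platform strings start with "Win" and need ".exe"
  if (platform.indexOf("Win") === 0) {
    name += ".exe";
  }
  return name;
}

console.log(pdftotextFileName("MacIntel"));   // pdftotext-MacIntel
console.log(pdftotextFileName("Linux i686")); // pdftotext-Linux-i686
console.log(pdftotextFileName("Win32"));      // pdftotext-Win32.exe
```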
Assuming we go this pdftotext route, I think we'll instruct users to download and install Xpdf/pdftotext, possibly even providing binaries ourselves. The binaries are too big to include in the XPI. I'm going to look into creating a GUI to make linking Zotero to pdftotext easier. I also need to finish some of the other tickets related to indexer feedback and control.
There are also two new hidden prefs, fulltext.pdfMaxPages and fulltext.textMaxLength, currently set to 100 and 500K, respectively. The first determines how many pages of each PDF pdftotext processes, and the second determines how many characters and/or bytes of text files (the PDF cache files included) Zotero indexes and scans. These defaults may need to be raised or lowered.
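As a rough illustration of how the two limits might be applied: both function names below are hypothetical, and the use of pdftotext's -l (last page) option to cap the page count is an assumption about the implementation, not something stated above.

```javascript
// Hypothetical sketch: build the pdftotext argument list with a page cap.
// Assumes the cap is passed via pdftotext's "-l <last page>" option.
function buildPdftotextArgs(pdfPath, cachePath, maxPages) {
  return ["-l", String(maxPages), pdfPath, cachePath];
}

// Hypothetical sketch: cap the amount of text handed to the indexer.
function truncateForIndexing(text, maxLength) {
  return text.length > maxLength ? text.substring(0, maxLength) : text;
}

console.log(buildPdftotextArgs("paper.pdf", ".zotero-ft-cache", 100));
console.log(truncateForIndexing("abcdefghij", 4)); // abcd
```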
Closes #315, Hidden pref to set maximum file size to index/scan
Works for regular items and notes, not attachments (and doesn't clone child items when duplicating parent)
New method Item.clone()
Unrelated changes:
- Fix note/attachment dragging, broken by notifier changes (r1131) a while back
- Item.save() now triggers Notifier even if a transaction is in progress; hopefully there is no longer any reason to suppress it
Closes #227, Indent nested collections in search drop-down
Addresses #528, Make search condition drop-down menu less unwieldy
- Created new distinct fields for differently labeled fields
- Mapped lots of fields to base fields
- Made base field search conditions search type-specific fields as well
- Type-specific fields that are based on base fields no longer show up in the search conditions drop-down
- Added a tooltip when hovering over a condition in the search conditions drop-down that shows the fields it searches (when there's more than one)
- Moved search dialog CSS to separate file
Abstract displays in metadata pane as a cropped one-line field by default; clicking the 'Abstract' label toggles between the cropped field and an expanded view
Some problems with import/export: https://www.zotero.org/trac/ticket/537
Refs #537
Addresses #352, Make sure data layer doesn't allow bad data via the API
Access date field is now human-friendly. Also enforcing SQL date form for the field in the DB and discarding bad data passed via setField().
Closes #453, Check if any fields will actually be discarded on item type change before giving warning
Refs #530, Add base field conversion to translation level
Added mechanism for linking item type fields via base fields, e.g. publisher => label in audioRecording
New methods:
Item.getFieldsNotInType(itemTypeID, allowBaseConversion)
ItemFields.getLocalizedString(itemTypeID, field)
ItemFields.isBaseField(fieldID)
ItemFields.getFieldIDFromTypeAndBase(itemType, baseField)
ItemFields.getBaseIDFromTypeAndField(itemType, typeField)
ItemFields.getTypeFieldsFromBase(baseField)
Currently only the publisher fields are mapped -- I need more feedback on #346 before I implement the others (specifically on whether or not all these sorts of fields should be done as distinct fields or whether some should just be localized strings (in which case they'll autocomplete but not show up separately as search conditions))
Also added 'university' as distinct publisher field for thesis
Values of equivalent fields are now preserved when switching between item types (e.g. the 'studio' value becomes the 'label' value when switching between videoRecording and audioRecording), and the pop-up is much smarter--it will only prompt you if fields will in fact be lost, and it will list the fields that would be deleted.
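A minimal sketch of the idea, assuming a name-based lookup table. The real ItemFields methods work with IDs; every name below (BASE_FIELD_MAP, typeFieldFromBase, baseFromTypeField, convertFields) is hypothetical, and the table holds only the publisher mappings mentioned above.

```javascript
// Hypothetical table: item type -> { base field -> type-specific field }
var BASE_FIELD_MAP = {
  audioRecording: { publisher: "label" },
  videoRecording: { publisher: "studio" },
  thesis: { publisher: "university" }
};

// Type-specific field for a base field, or the base field itself if unmapped
function typeFieldFromBase(itemType, baseField) {
  var map = BASE_FIELD_MAP[itemType];
  return (map && map[baseField]) || baseField;
}

// Reverse lookup: base field for a type-specific field
function baseFromTypeField(itemType, typeField) {
  var map = BASE_FIELD_MAP[itemType] || {};
  for (var base in map) {
    if (map[base] === typeField) return base;
  }
  return typeField;
}

// Carry values across an item type change by going through the base field.
// This sketch does not drop fields that are invalid in the target type.
function convertFields(fields, fromType, toType) {
  var out = {};
  for (var name in fields) {
    var base = baseFromTypeField(fromType, name);
    out[typeFieldFromBase(toType, base)] = fields[name];
  }
  return out;
}

console.log(convertFields({ studio: "Acme", title: "Demo" },
                          "videoRecording", "audioRecording"));
// { label: 'Acme', title: 'Demo' }
```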
Not finished:
- Searching for base fields doesn't yet search the type-specific fields, as Elena requested
- import/export/bib should be updated to use the ItemFields base conversion methods where appropriate -- data coming from the 'publisher' field from translators, for example, should be put into the appropriate type-specific field.
Sent as a fourth parameter to notify() -- parameter is an array of objects (in the same order as the ids) that currently contain a single property, 'old', which holds the toArray() object
Copies are not sent with 'modify' when it's only meant to refresh the UI and there's another trigger that covers the data change (e.g. removing a tag from an item sends both an item modify and an item-tag add, but the modify doesn't get a pre-change copy of the item since any consumers that care should just monitor item-tag)
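An observer consuming the new parameter might look roughly like this. The parameter names and the shape of the 'old' object are assumptions made for illustration; only the "array of objects with an 'old' property, in the same order as the ids" part comes from the description above.

```javascript
// Hypothetical observer: records whether each notified id came with a
// pre-change copy. Parameter names are illustrative only.
var observer = {
  changes: [],
  notify: function (event, type, ids, extraData) {
    for (var i = 0; i < ids.length; i++) {
      // extraData may be missing entirely, e.g. for UI-refresh-only modifies
      var old = extraData && extraData[i] ? extraData[i].old : undefined;
      this.changes.push({
        id: ids[i],
        event: event,
        hadCopy: old !== undefined,
        oldTitle: old ? old.title : null
      });
    }
  }
};

// A modify with pre-change copies, then one without
observer.notify("modify", "item", [12, 15],
  [{ old: { title: "Draft" } }, { old: { title: "Notes" } }]);
observer.notify("modify", "item", [12]);
console.log(observer.changes);
```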
Also:
- Removed Notifier.enable()/disable()
- Notifier no longer sends modify() if item already deleted
- New methods: Collection.toArray(), Zotero.Tags.toArray(tagID)
- Removed a few extraneous triggers
Simon, reopen if this doesn't fix the problem you were referring to.
(Also removes Notifier.enable()/disable() from its use in Item.erase() while we're at it.)
- Automatic tags now appear in orange; tooltip says either "User-added tag" or "Automatically added tag"
- New menu in tag selector to toggle automatic tags
- User and automatic tags are combined in tag selector, so renaming/deleting a tag will affect both user and automatic, regardless of view mode
- Editing a tag makes it a user tag, as does adding an identical user tag to an item (rather than creating a second one)
- ingester/export will need to be adjusted to add automatic tags
Changed:
Item.addTag(tag) => addTag(tag, type)
Item.getTags() - now returns 'id', 'tag', 'type'
Item.toArray() - tags now include 'type' property (from Item.getTags())
Tags.getID(tag) => getID(tag, type)
Tags.getAll() => getAll([types]) - types is an optional array of tagTypes to fetch; now returns objects with 'tag' and 'type' properties
Tags.getAllWithinSearch(search) => Tags.getAllWithinSearch(search, [types]) - now returns object with 'tag'/'type'
Added:
Tags.get(tagID) - returns object with 'tag' and 'type' properties
Tags.getIDs(tag) - returns all tagIDs for this tag (of all types)
Tags.getType(tag) - returns array of tag types matching given tag
For type property, 0 == user, 1 == automatic
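For illustration, filtering by the type property could be sketched as below. The constant names and filterTagsByType() are hypothetical; only the 0/1 values and the 'tag'/'type' object shape come from the notes above.

```javascript
// Values from the note above; the constant names are made up for this sketch
var TAG_TYPE_USER = 0;
var TAG_TYPE_AUTOMATIC = 1;

// Hypothetical filter mirroring the optional [types] argument:
// no types given means "return everything"
function filterTagsByType(tags, types) {
  if (!types) return tags.slice();
  return tags.filter(function (t) {
    return types.indexOf(t.type) !== -1;
  });
}

var tags = [
  { tag: "history", type: TAG_TYPE_USER },
  { tag: "Europe", type: TAG_TYPE_AUTOMATIC },
  { tag: "to read", type: TAG_TYPE_USER }
];

console.log(filterTagsByType(tags, [TAG_TYPE_AUTOMATIC])); // only "Europe"
console.log(filterTagsByType(tags).length);                // 3
```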
It ain't pretty, but (I think) it works.
Also:
- Fulltext content search should handle ANY/ALL modes better, but that needs some more testing.
- Tag selector now properly takes fulltext content search conditions into account when filtering to scope.
- Added Zotero.Search.hasPostSearchFilter(), since getSQL() isn't sufficient with post-search filters.
Also:
- Clicking OK on rename dialog with "Rename associated file" checked but without changing the filename would delete the original file.
- Add "Show File" button for snapshots
New method: Item.renameAttachmentFile(newName, force) -- _force_ forces overwrite of an existing file
For the moment, implemented in the UI via a checkbox in the attachment title rename dialog (accessible by clicking on the title in the right pane) to rename the associated file as well -- this might be replaced by the upcoming keep-filenames-in-sync-with-attachment-titles feature, but it's probably fine for Beta 3.
Also new:
- Zotero.Attachments.getPath(file, linkMode) to get a relative or persistent path as appropriate given the link mode
New methods:
Item.removeAllRelated()
Item.removeAllTags()
Also:
- Tag selector didn't initialize properly if it was closed when Firefox was started
- Items pane would lose open state of items and current scroll position when an item was edited while the tag selector was open -- added save/rememberOpenState() and save/rememberFirstRow() to fix this, and these could also fairly easily be used to remember the open state while switching between collections
There might be some regressions from this, but it seems to work fine.
Also:
- Fixed JS strict warnings in popup note window
- Use Zotero.Notes.add() when using toolbar button instead of a two-stage save with ZP.newItem('note')
Closes #471, Tag selector should update when tags are added/removed
Tag Selector overhaul:
- Right-click to rename/delete tags globally
- Filter tags to only those associated with currently visible items, with a Display All checkbox to show others in gray -- scope list set via new callback mechanism in the items tree
- Drag and drop items onto tags to batch assign
- Tag Notifier events, currently unused (tag selector currently just refreshes on all item events, since doing granular tag updates is considerably more complicated)
- Performance improvements, offset by the new features that make it slower
There should probably be an option to use either an ANY or an ALL search in the tag selector... (It's ALL by default now.)
New methods:
- Zotero.hasValues(obj) -- return true if an object (/associative array) has at least one value, false if not
- Zotero.Item.addTagByID()
- Zotero.Item.hasTag()
- Zotero.Tags.getAllWithinSearch(search)
- Zotero.Tags.rename(tagID, tag)
- Zotero.Tags.remove(tagID)
- ItemTreeView.addCallback()
- ItemTreeView.setFilter('search'|'tags', data) -- replaces searchText()
- CollectionTreeView.getSearchObject() -- search object used to generate the items list
- CollectionTreeView.getChildTags()
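The behavior described for Zotero.hasValues() can be sketched in a few lines. This is a guess at an equivalent implementation, not the actual source.

```javascript
// Sketch: true if the object has at least one enumerable property
function hasValues(obj) {
  for (var key in obj) {
    return true;
  }
  return false;
}

console.log(hasValues({}));          // false
console.log(hasValues({ a: 1 }));    // true
```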
I'm doing this manually with Notifier.begin(true)/commit(unlock) instead of putting them all in a DB transaction since Item.erase() erases files too, so rolling back previous deletes would be bad
Notice the finally {...} block -- this ensures that the event queue is unlocked even if there's an error deleting (or else notifications would break until Firefox was restarted)
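The pattern looks roughly like this. fakeNotifier is a stand-in that only tracks lock state, and eraseAll() is a hypothetical wrapper; the begin(true)/commit(true) calls mirror the ones named above.

```javascript
// Stand-in for the real Notifier: only tracks whether the queue is locked
var fakeNotifier = {
  locked: false,
  begin: function (lock) { if (lock) this.locked = true; },
  commit: function (unlock) { if (unlock) this.locked = false; }
};

// Hypothetical wrapper: erase items one by one without a DB transaction,
// always unlocking the event queue afterwards
function eraseAll(items, notifier) {
  var erased = [];
  notifier.begin(true);
  try {
    for (var i = 0; i < items.length; i++) {
      items[i].erase();
      erased.push(items[i].id);
    }
  } finally {
    // Runs even if erase() throws, so notifications keep working
    notifier.commit(true);
  }
  return erased;
}

var items = [
  { id: 1, erase: function () {} },
  { id: 2, erase: function () { throw new Error("file locked"); } }
];

try {
  eraseAll(items, fakeNotifier);
} catch (e) {
  console.log("erase failed: " + e.message); // erase failed: file locked
}
console.log(fakeNotifier.locked); // false
```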
Other changes:
- Zotero.Items.erase() now supports multiple items and the recursive flag to erase children
- Fixed a few JS strict warnings in the items view
N.B.: Some changes from plan on ticket
New methods:
Item.setAbstract(true|false) -- make a note an abstract (and clear existing abstract if there is one for source item) or clear abstract status
Item.isAbstract() -- returns true if note is an abstract, false if not
Item.getAbstract() - get itemID of child abstract note or false if none
ZoteroPane.toggleAbstractForSelectedItem()
Changed methods:
Item.updateNoteCache(text, isAbstract)
Notes.add(note, sourceItemID, isAbstract)
Item.setSource() -- moving abstract note to another source with an existing abstract or setting as an independent note will make note not abstract
Other changes:
- Context menu options in items pane: "Set note as abstract" and "Unset note as abstract"
- Child notes are now displayed before child attachments so that abstract will be first
- Don't try to get MIME type from extension if extension is blank
- Add text/css to native text types, even if snapshots add some html tags (why is that?)
- Get rid of extraneous "this." prefixes
A work in progress:
- Implemented zotero:// custom protocol handler, which will likely be useful for other things too
- First version of XHTML/CSS detail view -- definitely needs feedback, work, and refinement but is more or less functional
- Added XUL-side interface and context menu options for loading report URLs
Going forward:
- Other formats (RTF, CSV)
- Other views (list view, annotated bibliography, etc.)
- Report options window (let the user choose which fields to include (with saved templates?))
- Ability to specify custom CSS files?
- Extension of Zotero protocol handler to trigger Zotero events? This would allow more interactive reports with the ability to click to select items in the Z pane, run searches by clicking on tags, etc., but would have to be limited to idempotent actions.
Other changes:
- ZoteroPane.getSortField() and ZoteroPane.getSortDirection()
- Zotero.Utilities.htmlSpecialChars(str)
- Fixed sort direction in items pane (triangle icon now goes the right direction, though the default direction on clicking a new column is incorrect)
- firstCreator now included in toArray(), though it's not particularly correct (#287, more or less)
- ZoteroPane.getSelectedCollection/SavedSearch/Items now take asIDs parameter to return ids instead of objects
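For reference, an escaping helper in the spirit of Zotero.Utilities.htmlSpecialChars(str) might look like the sketch below. The exact set of characters the real method escapes is not stated here, so the four replacements are an assumption, and escapeHtml is a hypothetical name.

```javascript
// Hypothetical equivalent; assumes &, <, > and " are the escaped characters
function escapeHtml(str) {
  return String(str)
    .replace(/&/g, "&amp;")   // must come first to avoid double-escaping
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

console.log(escapeHtml('<a href="x">T & J</a>'));
// &lt;a href=&quot;x&quot;&gt;T &amp; J&lt;/a&gt;
```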
Fixes #226, Insert new collections and saved searches in the proper order
Also:
- Only display "New Collection..." and "New Saved Search..." in Library drop-down
- Sort collections and saved searches case-insensitively
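A case-insensitive sort of names can be sketched as follows; sortCaseInsensitive() is a hypothetical helper, not the code in the collection tree.

```javascript
// Hypothetical helper: returns a sorted copy, comparing lowercased names
function sortCaseInsensitive(names) {
  return names.slice().sort(function (a, b) {
    var x = a.toLowerCase();
    var y = b.toLowerCase();
    if (x < y) return -1;
    if (x > y) return 1;
    return 0;
  });
}

console.log(sortCaseInsensitive(["zebra", "Apple", "mango", "Banana"]));
// [ 'Apple', 'Banana', 'mango', 'zebra' ]
```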
Will be getting a lot more functionality (e.g. renaming, deleting, maybe assigning of tags from the pane), some UI changes, and possibly some modified behavior (e.g. should it only show the available tags for the items that you're viewing, show all and let you use the interface to assign tags (say, by batch drag and drop), or have a checkbox to toggle between the two modes)
Other notes:
- Implemented as XBL binding, so should be reusable elsewhere if necessary
- Needs a better icon and possibly different icon placement
- Doesn't handle live updating of modified tags yet -- will need a Notifier target
- New methods Tags.getAll() and Tags.search()
- I really wish I'd created a ticket for this so I could check it off
- Item.setField() stores dates in a multipart format beginning with an SQL date followed by the user's entry, so "November 3, 2006" becomes "2006-11-03 November 3, 2006" -- date field entries are parsed with Zotero.Date.strToDate() if not already in multipart format
- Item.getField() returns just the user part unless passed the new second parameter, _unformatted_, which returns the field directly from DB without processing (e.g. the full multipart string)
- Added SQLite triggers on the itemData table to enforce multipart format even if the table is modified outside the API
- Migration step to update existing dates
- Indicator next to date field to show what we've parsed and a tooltip over the date field to show the SQL date -- though I'm not sure how well the abbreviation part will localize (i.e. can you abbreviate 'month' in Chinese?)
One obvious problem is how to handle date ranges when sorting or searching, which may end up rendering this whole method fairly useless (though I guess the multipart format could begin with two SQL dates instead of just one, at the cost of some storage space...).
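To make the multipart format concrete, here is a minimal sketch. These are simplified stand-ins, not the Zotero.Date implementations: they assume the SQL part is always a full YYYY-MM-DD string, and they skip the date parsing that strToDate() performs.

```javascript
// Assumed layout: "<YYYY-MM-DD> <user's original entry>"
var MULTIPART_RE = /^(\d{4}-\d{2}-\d{2}) (.*)$/;

// Hypothetical helper: join an already-parsed SQL date with the user's text
function makeMultipart(sqlDate, userText) {
  return sqlDate + " " + userText;
}

function isMultipart(str) {
  return MULTIPART_RE.test(str);
}

// SQL part, useful for sorting and searching
function multipartToSQL(str) {
  var m = MULTIPART_RE.exec(str);
  return m ? m[1] : null;
}

// User part, which is what getField() would normally display
function multipartToStr(str) {
  var m = MULTIPART_RE.exec(str);
  return m ? m[2] : str;
}

var stored = makeMultipart("2006-11-03", "November 3, 2006");
console.log(stored);                 // 2006-11-03 November 3, 2006
console.log(multipartToSQL(stored)); // 2006-11-03
console.log(multipartToStr(stored)); // November 3, 2006
```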
Other changes:
- Utilities.lpad() handling for undefined value parameter
- New Zotero.Date methods: strToMultipart(), isMultipart(), multipartToSQL(), multipartToStr(), isSQLDate(), sqlHasYear(), sqlHasMonth(), sqlHasDay(), getLocaleDateOrder() (the last one unused for now)
- try/catch around manual itemData INSERT execute() statements in Item.save()
- Catch errors trying to display missing files and display message to user
- Switch to persistent descriptors rather than relative paths for attachment paths -- this will fix attachments on networked drives (which, at least on Windows, were not working and apparently in some cases breaking entire Zotero installs), but since WebBrowserPersist.saveDocument() is asynchronous and file.persistentDescriptor can't be set on Macs before the file exists, Attachments.importFromDocument() no longer returns the id of the new attachment, so translate.js had to be changed accordingly
- Try to convert relative descriptors to persistent ones with migration step (and later on access, if persistent fails)
- Added Zotero.WebProgressFinishListener(onDone)
Next step would be to throw up a file dialog to let the user locate the missing file
Feel free to remove redundant calls to ItemTypes.getID()
(I actually see a whole bunch of calls to the constructor using type names in scrapers, but presumably those are converted to itemTypeIDs in translate.js, since they seem to have been working just fine...)