I'm designing a new web application that has to deal with a lot of MS Office documents.
One of the requirements is the ability to search not just in the columns of SQL Server database tables, but also in the documents uploaded to the web application, on the tune of 50-200 documents per day. The solution should be able to search both the document content and the metadata of the Office documents (creator etc.).
I was wondering if anyone has had practical experience with such a solution and could help me design it.
My first idea was to use SQL Server 2012's FileTable approach: define a common directory for the documents, surface it as a FileTable in a SQL Server table, and put a SQL Server full-text catalog on top of that. I'm pretty confident this will allow me to search for files by name and by content (using full-text search), but what about metadata? I can't seem to find anything on that.
Also: does anyone have hands-on, practical experience with the performance of such a solution? I have a hard time judging how using Win32 I/O to store new documents into the FileTable folder will affect performance. And how does full-text search perform on a FileTable based on a set of MS Office documents? Any experiences there?
A second idea would be to use some kind of dedicated full-text search system, such as Elasticsearch. Any comments on that? Does ES support indexing and searching MS Office documents, including their metadata? Or does it index the contents of the documents only?
Any ideas and pointers, and especially hands-on, real-life experiences, are most welcome!
Regarding your second idea: Elasticsearch supports indexing MS Office documents through the Mapper Attachments plugin, which is powered by Apache Tika and supports all kinds of MS Office document formats. The plugin indexes not only the file content but also the metadata you require, i.e. date, title, author, content type, etc.
So the idea is to create an index and a mapping type with a field of type attachment, plus the metadata fields you want to index and search as well.
PUT /test_index
{
  "mappings": {
    "test_type": {
      "properties": {
        "my_attachment": {
          "type": "attachment",
          "fields": {
            "content": { "type": "string", "index": "no" },
            "title": { "type": "string", "store": "yes" },
            "date": { "type": "date", "store": "yes" },
            "author": { "type": "string", "analyzer": "myanalyzer" },
            "keywords": { "type": "string", "store": "yes" },
            "content_type": { "type": "string", "store": "yes" },
            "content_length": { "type": "integer", "store": "yes" },
            "language": { "type": "string", "store": "yes" }
          }
        }
      }
    }
  }
}
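Then you can search on any of those fields, namely the file content and the metadata fields. One caveat: a field mapped with "index": "no", as content is in the mapping above, is not indexed and so cannot be queried directly; drop that setting if you want to search the document body.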
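To make the indexing and searching steps a bit more concrete, here is a minimal Python sketch that only builds the two JSON bodies involved. It assumes the attachment field is called my_attachment, as in the mapping above, and that the plugin receives the file as a base64-encoded string in that field. The helper names build_attachment_doc and build_metadata_query are made up for illustration and are not part of Elasticsearch or the plugin.

```python
import base64
import json


def build_attachment_doc(file_bytes, field="my_attachment"):
    """Build the JSON body for indexing one file.

    Assumption: the attachment field takes the raw file as a
    base64-encoded string.
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {field: encoded}


def build_metadata_query(subfield, text, field="my_attachment"):
    """Build a match query against one sub-field of the attachment,
    e.g. my_attachment.author or my_attachment.title."""
    return {"query": {"match": {field + "." + subfield: text}}}


if __name__ == "__main__":
    # Placeholder bytes; in practice read the uploaded .docx/.xlsx file.
    doc = build_attachment_doc(b"placeholder bytes for an Office file")
    print(json.dumps(doc))

    # "Jane Doe" is an illustrative value, not taken from the question.
    print(json.dumps(build_metadata_query("author", "Jane Doe")))
```

The first body would be sent as a document to the index (for example to /test_index/test_type/1) and the second to the index's _search endpoint, using whatever HTTP client you prefer.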
If you want to make a dry run of the plugin, it offers a standalone tool that you can run to see what gets extracted from your documents, and hence what can be searched in them.