Custom replacers per namespace if we need it (currently needed for 2)
Shazwazza committed Jan 8, 2020
1 parent d2949ba commit 3db3847
Showing 23 changed files with 362 additions and 83 deletions.
7 changes: 5 additions & 2 deletions src/Lucene.Net.Analysis.SmartCn/package.md
Original file line number Diff line number Diff line change
@@ -22,7 +22,7 @@ summary: *content

Analyzer for Simplified Chinese, which indexes words.
@lucene.experimental

<div>
Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

* StandardAnalyzer: Index unigrams (individual Chinese characters) as a token.
@@ -31,10 +31,13 @@ Three analyzers are provided for Chinese, each of which treats Chinese text in a

* SmartChineseAnalyzer (in this package): Index words (attempt to segment Chinese text into words) as tokens.


Example phrase: "我是中国人"

1. StandardAnalyzer: 我-是-中-国-人

2. CJKAnalyzer: 我是-是中-中国-国人

3. SmartChineseAnalyzer: 我-是-中国-人
3. SmartChineseAnalyzer: 我-是-中国-人

</div>
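
The first two tokenizations of the example phrase can be reproduced with a few lines of Java. This is a minimal sketch for illustration, not the analyzers' actual implementation: the class and method names (`ChineseTokenSketch`, `unigrams`, `bigrams`) are hypothetical, and the word segmentation performed by SmartChineseAnalyzer is deliberately left out because it depends on a dictionary and a statistical model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or Lucene.NET.
public class ChineseTokenSketch {

    // One token per character (code point), as in the StandardAnalyzer line.
    public static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Overlapping pairs of adjacent characters, as in the CJKAnalyzer line.
    public static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < chars.size(); i++) {
            tokens.add(chars.get(i) + chars.get(i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String phrase = "我是中国人";
        System.out.println(String.join("-", unigrams(phrase))); // 我-是-中-国-人
        System.out.println(String.join("-", bigrams(phrase)));  // 我是-是中-中国-国人
    }
}
```

The bigram form indexes every adjacent pair, so it produces tokens such as 是中 that are not words; segmenting into words avoids those at the cost of needing language data.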
12 changes: 9 additions & 3 deletions src/Lucene.Net.Benchmark/ByTask/package.md
@@ -21,10 +21,12 @@ summary: *content
-->

Benchmarking Lucene By Tasks.
<div>

This package provides "task based" performance benchmarking of Lucene. One can use the predefined benchmarks, or create new ones.
This package provides "task based" performance benchmarking of Lucene. One can use the predefined benchmarks, or create new ones.

Contained packages:

Contained packages:

<table border="1" cellpadding="4">
<tr>
@@ -492,4 +494,8 @@ Example: max.buffered=buf:10:10:100:100 -

The traverse and retrieve tasks "count" more: a traverse task would add 1 for each traversed result (hit), and a retrieve task would additionally add 1 for each retrieved doc. So, regular Search would count 1, SearchTrav that traverses 10 hits would count 11, and a SearchTravRet task that retrieves (and traverses) 10, would count 21.
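
The counting rule in the paragraph above can be written as a one-line formula. The sketch below is only an illustration of that arithmetic; `TaskCountSketch` and `count` are hypothetical names and do not correspond to classes in the benchmark package.

```java
// Hypothetical illustration of the counting rule; not benchmark code.
public class TaskCountSketch {

    // 1 for the search itself, plus 1 per traversed hit,
    // plus 1 per retrieved document.
    public static int count(int traversedHits, int retrievedDocs) {
        return 1 + traversedHits + retrievedDocs;
    }

    public static void main(String[] args) {
        System.out.println(count(0, 0));   // plain Search: 1
        System.out.println(count(10, 0));  // SearchTrav over 10 hits: 11
        System.out.println(count(10, 10)); // SearchTravRet over 10 hits: 21
    }
}
```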

Confusing? This might help: always examine the `elapsedSec` column, and always compare "apples to apples", i.e., it is interesting to check how the `rec/s` changed for the same task (or sequence) between two different runs, but it is not very useful to know how the `rec/s` differs between `Search` and `SearchTrav` tasks. For the latter, `elapsedSec` would bring more insight.
Confusing? This might help: always examine the `elapsedSec` column, and always compare "apples to apples", i.e., it is interesting to check how the `rec/s` changed for the same task (or sequence) between two different runs, but it is not very useful to know how the `rec/s` differs between `Search` and `SearchTrav` tasks. For the latter, `elapsedSec` would bring more insight.


</div>
<div> </div>
27 changes: 15 additions & 12 deletions src/Lucene.Net.Benchmark/package.md
@@ -21,21 +21,24 @@ summary: *content
-->

The benchmark contribution contains tools for benchmarking Lucene using standard, freely available corpora.
<div>

ANT will
download the corpus automatically, place it in a temp directory and then unpack it to the working.dir directory specified in the build.
The temp directory
and working directory can be safely removed after a run. However, the next time the task is run, it will need to download the files again.
ANT will
download the corpus automatically, place it in a temp directory and then unpack it to the working.dir directory specified in the build.
The temp directory
and working directory can be safely removed after a run. However, the next time the task is run, it will need to download the files again.

Classes implementing the Benchmarker interface should have a no-argument constructor if they are to be used with the Driver class. The Driver
class is provided for convenience only. Feel free to implement your own main class for your benchmarker.
Classes implementing the Benchmarker interface should have a no-argument constructor if they are to be used with the Driver class. The Driver
class is provided for convenience only. Feel free to implement your own main class for your benchmarker.

The StandardBenchmarker is meant to be just that, a standard that runs out of the box with no configuration or changes needed.
Other benchmarking classes may derive from it to provide alternate views or to take in command line options. When reporting benchmarking runs
you should state any alterations you have made.
The StandardBenchmarker is meant to be just that, a standard that runs out of the box with no configuration or changes needed.
Other benchmarking classes may derive from it to provide alternate views or to take in command line options. When reporting benchmarking runs
you should state any alterations you have made.

To run the short version of the StandardBenchmarker, call "ant run-micro-standard". This should take a minute or so to complete and give you a preliminary idea of how your change affects the code
To run the short version of the StandardBenchmarker, call "ant run-micro-standard". This should take a minute or so to complete and give you a preliminary idea of how your change affects the code

To run the long version of the StandardBenchmarker, call "ant run-standard". This takes considerably longer.
To run the long version of the StandardBenchmarker, call "ant run-standard". This takes considerably longer.

The original code for these classes was donated by Andrzej Bialecki at http://issues.apache.org/jira/browse/LUCENE-675 and has been updated by Grant Ingersoll to make some parts of the code reusable in other benchmarkers
The original code for these classes was donated by Andrzej Bialecki at http://issues.apache.org/jira/browse/LUCENE-675 and has been updated by Grant Ingersoll to make some parts of the code reusable in other benchmarkers
</div>
<div> </div>
3 changes: 2 additions & 1 deletion src/Lucene.Net.Memory/overview.md
@@ -1,5 +1,6 @@
---
uid: Lucene.Net.Index.Memory
uid: Lucene.Net.Memory
title: Lucene.Net.Memory
summary: *content
---

3 changes: 2 additions & 1 deletion src/Lucene.Net.Memory/package.md
@@ -1,5 +1,6 @@
---
uid: Lucene.Net.Index.Memory
uid: Lucene.Net.Memory
title: Lucene.Net.Memory
summary: *content
---

3 changes: 2 additions & 1 deletion src/Lucene.Net.QueryParser/overview.md
@@ -1,5 +1,6 @@
---
uid: Lucene.Net.Queryparser
uid: Lucene.Net.QueryParser
title: Lucene.Net.QueryParser
summary: *content
---

2 changes: 1 addition & 1 deletion src/Lucene.Net.Replicator/overview.md
@@ -20,4 +20,4 @@ summary: *content
limitations under the License.
-->

Provides index files replication capabilities.
Provides index files replication capabilities.
76 changes: 37 additions & 39 deletions src/Lucene.Net.Replicator/package.md
@@ -3,7 +3,6 @@ uid: Lucene.Net.Replicator
summary: *content
---


<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
@@ -23,49 +22,48 @@ summary: *content

# Files replication framework

The
[Replicator](Replicator.html) allows replicating files between a server and client(s). Producers publish
[revisions](Revision.html) and consumers update to the latest revision available.
[ReplicationClient](ReplicationClient.html) is a helper utility for performing the update operation. It can
be invoked either
[manually](ReplicationClient.html#updateNow()) or periodically by
[starting an update thread](ReplicationClient.html#startUpdateThread(long, java.lang.String)).
[HttpReplicator](http/HttpReplicator.html) can be used to replicate revisions by consumers that reside on
a different node than the producer.

The replication framework supports replicating any type of files, with built-in support for a single search index as
well as an index and taxonomy pair. For a single index, the application should publish an
[IndexRevision](IndexRevision.html) and set
[IndexReplicationHandler](IndexReplicationHandler.html) on the client. For an index and taxonomy pair, the
application should publish an [IndexAndTaxonomyRevision](IndexAndTaxonomyRevision.html) and set
[IndexAndTaxonomyReplicationHandler](IndexAndTaxonomyReplicationHandler.html) on the client.

The
[Replicator](Replicator.html) allows replicating files between a server and client(s). Producers publish
[revisions](Revision.html) and consumers update to the latest revision available.
[ReplicationClient](ReplicationClient.html) is a helper utility for performing the update operation. It can
be invoked either
[manually](ReplicationClient.html#updateNow()) or periodically by
[starting an update thread](ReplicationClient.html#startUpdateThread(long, java.lang.String)).
[HttpReplicator](http/HttpReplicator.html) can be used to replicate revisions by consumers that reside on
a different node than the producer.
When the replication client detects that there is a newer revision available, it copies the files of the revision and
then invokes the handler to complete the operation (e.g. copy the files to the index directory, fsync them, reopen an
index reader etc.). By default, only files that do not exist in the handler's
[current revision files](ReplicationClient.ReplicationHandler.html#currentRevisionFiles()) are copied,
however this can be overridden by extending the client.

The replication framework supports replicating any type of files, with built-in support for a single search index as
well as an index and taxonomy pair. For a single index, the application should publish an
[IndexRevision](IndexRevision.html) and set
[IndexReplicationHandler](IndexReplicationHandler.html) on the client. For an index and taxonomy pair, the
application should publish an [IndexAndTaxonomyRevision](IndexAndTaxonomyRevision.html) and set
[IndexAndTaxonomyReplicationHandler](IndexAndTaxonomyReplicationHandler.html) on the client.
An example usage of the Replicator:

When the replication client detects that there is a newer revision available, it copies the files of the revision and
then invokes the handler to complete the operation (e.g. copy the files to the index directory, fsync them, reopen an
index reader etc.). By default, only files that do not exist in the handler's
[current revision files](ReplicationClient.ReplicationHandler.html#currentRevisionFiles()) are copied,
however this can be overridden by extending the client.
// ++++++++++++++ SERVER SIDE ++++++++++++++ //
IndexWriter publishWriter; // the writer used for indexing
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(publishWriter));

An example usage of the Replicator:
// ++++++++++++++ CLIENT SIDE ++++++++++++++ //
// either LocalReplicator, or HttpReplicator if client and server are on different nodes
Replicator replicator;

// ++++++++++++++ SERVER SIDE ++++++++++++++ //
IndexWriter publishWriter; // the writer used for indexing
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(publishWriter));

// ++++++++++++++ CLIENT SIDE ++++++++++++++ //
// either LocalReplicator, or HttpReplicator if client and server are on different nodes
Replicator replicator;

// callback invoked after handler finished handling the revision and e.g. can reopen the reader.
Callable<Boolean> callback = null; // can also be null if no callback is needed
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);
Callable<Boolean> callback = null; // can also be null if no callback is needed
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);
ReplicationClient client = new ReplicationClient(replicator, handler, factory);

// invoke client manually
client.updateNow();
client.updateNow();

// or, periodically
client.startUpdateThread(100); // check for update every 100 milliseconds
client.startUpdateThread(100); // check for update every 100 milliseconds
1 change: 1 addition & 0 deletions src/Lucene.Net.Sandbox/overview.md
@@ -1,5 +1,6 @@
---
uid: Lucene.Net.Sandbox
title: Lucene.Net.Sandbox
summary: *content
---

1 change: 1 addition & 0 deletions src/Lucene.Net.Suggest/overview.md
@@ -1,5 +1,6 @@
---
uid: Lucene.Net.Suggest
title: Lucene.Net.Suggest
summary: *content
---

3 changes: 2 additions & 1 deletion src/Lucene.Net.TestFramework/overview.md
@@ -1,5 +1,6 @@
---
uid: Lucene.Net.Testframework
uid: Lucene.Net.TestFramework
title: Lucene.Net.TestFramework
summary: *content
---

29 changes: 28 additions & 1 deletion src/Lucene.Net/Codecs/Lucene40/package.md
@@ -24,6 +24,8 @@ Lucene 4.0 file format.

# Apache Lucene - Index File Formats

<div>

* [Introduction](#introduction)

* [Definitions](#definitions)
@@ -48,16 +50,24 @@ Lucene 4.0 file format.

* [Limitations](#limitations)

</div>

## Introduction

<div>

This document defines the index file formats used in this version of Lucene. If you are using a different version of Lucene, please consult the copy of `docs/` that was distributed with the version you are using.

Apache Lucene is written in Java, but several efforts are underway to write [versions of Lucene in other programming languages](http://wiki.apache.org/lucene-java/LuceneImplementations). If these versions are to remain compatible with Apache Lucene, then a language-independent definition of the Lucene index format is required. This document thus attempts to provide a complete and independent definition of the Apache Lucene file formats.

As Lucene evolves, this document should evolve. Versions of Lucene in different programming languages should endeavor to agree on file formats, and generate new versions of this document.

</div>

## Definitions

<div>

The fundamental concepts in Lucene are index, document, field and term.

An index contains a sequence of documents.
@@ -106,8 +116,12 @@ The numbers stored in each segment are unique only within the segment, and must

When documents are deleted, gaps are created in the numbering. These are eventually removed as the index evolves through merging. Deleted documents are dropped when segments are merged. A freshly-merged segment thus has no gaps in its numbering.

</div>

## Index Structure Overview

<div>

Each segment index maintains the following:

* [Segment info](xref:Lucene.Net.Codecs.Lucene40.Lucene40SegmentInfoFormat).
@@ -160,16 +174,24 @@ An optional file indicating which documents are deleted.

Details on each of these are provided in their linked pages.

</div>

## File Naming

<div>

All files belonging to a segment have the same name with varying extensions. The extensions correspond to the different file formats described below. When using the Compound File format (default in 1.4 and greater) these files (except for the Segment info file, the Lock file, and Deleted documents file) are collapsed into a single .cfs file (see below for details)

Typically, all segments in an index are stored in a single directory, although this is not required.

As of version 2.1 (lock-less commits), file names are never re-used (there is one exception, "segments.gen", see below). That is, when any file is saved to the Directory it is given a never before used filename. This is achieved using a simple generations approach. For example, the first segments file is segments_1, then segments_2, etc. The generation is a sequential long integer represented in alpha-numeric (base 36) form.

</div>
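
The generation-based naming described above can be sketched in Java as follows. The helper name `segmentsFileName` and its class are hypothetical and shown only to illustrate the base-36 suffix; the sketch assumes lowercase digits, which is what `Long.toString` with radix 36 produces.

```java
// Hypothetical illustration of generation-based file names; not Lucene code.
public class SegmentsFileNameSketch {

    // Appends the generation in base 36 (Character.MAX_RADIX) to the prefix.
    public static String segmentsFileName(long generation) {
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(segmentsFileName(1));  // segments_1
        System.out.println(segmentsFileName(2));  // segments_2
        System.out.println(segmentsFileName(35)); // segments_z
        System.out.println(segmentsFileName(36)); // segments_10
    }
}
```

Because the generation only ever increases, each commit gets a file name that has not been used before.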

## Summary of File Extensions

<div>

The following table summarizes the names and extensions of the files in Lucene:

<table cellspacing="1" cellpadding="4">
@@ -266,6 +288,7 @@ systems that frequently run out of file handles.</td>
<td>Info about what files are deleted</td>
</tr>
</table>
</div>

## Lock File

@@ -331,4 +354,8 @@ term vectors.

## Limitations

Lucene uses a Java `int` to refer to document numbers, and the index file format uses an `Int32` on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either `UInt64` values, or better yet, [VInt](xref:Lucene.Net.Store.DataOutput#methods) values which have no limit.
<div>

Lucene uses a Java `int` to refer to document numbers, and the index file format uses an `Int32` on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either `UInt64` values, or better yet, [VInt](xref:Lucene.Net.Store.DataOutput#methods) values which have no limit.

</div>
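
To illustrate why a variable-length encoding lifts the fixed-width limit, here is a minimal sketch of a VInt-style codec: seven payload bits per byte, low-order group first, with the high bit set on every byte except the last. The class `VIntSketch` is hypothetical and is not the `DataOutput` implementation; it uses a Java `long` for convenience, so this particular sketch stops at 63 bits even though the byte format itself has no fixed width.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of a VInt-style encoding; not Lucene code.
public class VIntSketch {

    // Encodes a non-negative value, 7 bits per byte, low-order bits first.
    public static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // more bytes follow
            value >>>= 7;
        }
        out.write((int) value); // final byte, high bit clear
        return out.toByteArray();
    }

    // Decodes bytes produced by encode.
    public static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // high bit clear marks the last byte
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(127).length);               // 1
        System.out.println(encode(128).length);               // 2
        System.out.println(encode(Integer.MAX_VALUE).length); // 5
        System.out.println(decode(encode(5_000_000_000L)));   // 5000000000
    }
}
```

Small document numbers cost a single byte, while values beyond the `Int32` range, such as 5,000,000,000 in the last line, still round-trip.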