Bootstrapping
Bootstrapping is the process of feeding data from your database into the Forage search index. This happens both at startup and periodically to keep the index synchronized.
Correctness
Ensure that your bootstrapper ALWAYS emits every record from your primary data store, on every bootstrap cycle.
Why?
Internally, Forage builds a new Lucene index from scratch during each bootstrap cycle and then hot-swaps it into place. An IndexSearcher is immutable once created, which keeps search queries fast. Every bootstrap therefore creates a new IndexSearcher, and the older one is discarded and garbage-collected.
If your bootstrapper returns only a subset of data, the index will be incomplete and search results will be incorrect.
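To make the consequence concrete, here is a minimal sketch of the rebuild-and-swap pattern described above. It is not Forage's implementation: `SnapshotIndex` is a hypothetical stand-in that uses an immutable `Set` of ids in place of a Lucene index, and it assumes Java 10 or later for `Set.copyOf`.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical stand-in for a search index; a Set of ids replaces the Lucene index.
class SnapshotIndex {
    // Readers always see one complete, immutable snapshot.
    private final AtomicReference<Set<String>> current = new AtomicReference<>(Set.of());

    // Rebuilds from scratch: only ids emitted in THIS cycle end up in the new snapshot.
    // The argument plays the role of a bootstrapper: it is handed a consumer to feed.
    void rebuild(Consumer<Consumer<String>> bootstrapper) {
        Set<String> fresh = new HashSet<>();
        bootstrapper.accept(fresh::add);
        current.set(Set.copyOf(fresh)); // atomic swap; the old snapshot becomes garbage
    }

    boolean contains(String id) {
        return current.get().contains(id);
    }
}
```

If one cycle emits ids "1" and "2" and the next cycle emits only "2", then "1" is no longer searchable after the swap, because nothing carries over from the previous snapshot.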
The Bootstrapper Interface
public interface Bootstrapper<T> {
    void bootstrap(Consumer<T> consumer) throws Exception;
}
The bootstrap method receives a Consumer that accepts indexable documents. Your implementation iterates through your data and feeds each document to the consumer.
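Because the contract is just "push documents into a consumer", a bootstrapper can be exercised without a search engine by passing a consumer that collects into a list. The sketch below is an illustration only: the interface is re-declared locally so the snippet compiles on its own, and `BootstrapperProbe` is a hypothetical helper, not part of Forage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Local copy of the interface shown above, so the sketch compiles on its own.
interface Bootstrapper<T> {
    void bootstrap(Consumer<T> consumer) throws Exception;
}

// Hypothetical helper: drains a bootstrapper into a list for inspection in tests.
final class BootstrapperProbe {
    private BootstrapperProbe() {
    }

    static <T> List<T> collect(Bootstrapper<T> bootstrapper) throws Exception {
        List<T> collected = new ArrayList<>();
        bootstrapper.bootstrap(collected::add); // the list's add method acts as the consumer
        return collected;
    }
}
```

A unit test can then compare the collected documents against the contents of the data store to check that nothing is skipped.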
Basic Implementation
public class BookStore implements Bootstrapper<IndexableDocument> {
    private final BookRepository repository;

    // The final field must be initialized, so the repository is injected here
    public BookStore(BookRepository repository) {
        this.repository = repository;
    }

    @Override
    public void bootstrap(Consumer<IndexableDocument> consumer) {
        // Iterate through all data, not just recent changes
        for (Book book : repository.findAll()) {
            // Create and submit an indexable document
            consumer.accept(createDocument(book));
        }
    }

    // Maps a Book to the fields that should be searchable
    private ForageDocument createDocument(Book book) {
        return new ForageDocument(book.getId(), Arrays.asList(
                new TextField("title", book.getTitle()),
                new TextField("author", book.getAuthor()),
                new FloatField("rating", new float[]{book.getRating()})
        ));
    }
}
Streaming Large Datasets
For large datasets, use streaming to avoid loading everything into memory:
JPA/Hibernate Streaming
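@Override
public void bootstrap(Consumer<IndexableDocument> consumer) {
    // try-with-resources closes the stream and releases the underlying cursor.
    // With Spring Data JPA, the stream must be consumed inside a (read-only) transaction.
    try (Stream<Book> bookStream = repository.streamAll()) {
        bookStream.forEach(book -> consumer.accept(createDocument(book)));
    }
}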
JDBC Cursor
@Override
public void bootstrap(Consumer<IndexableDocument> consumer) throws SQLException {
    try (Connection conn = dataSource.getConnection();
         PreparedStatement stmt = conn.prepareStatement(
                 "SELECT * FROM books",
                 ResultSet.TYPE_FORWARD_ONLY,
                 ResultSet.CONCUR_READ_ONLY)) {
        // Hint to fetch rows in batches of 1000. Whether this actually streams is
        // driver-specific: the PostgreSQL driver, for example, only uses a cursor
        // when auto-commit is off.
        stmt.setFetchSize(1000);
        try (ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                Book book = mapResultSet(rs);
                consumer.accept(createDocument(book));
            }
        }
    }
}
Batched Processing
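@Override
public void bootstrap(Consumer<IndexableDocument> consumer) {
    int pageSize = 1000;
    int page = 0;
    while (true) {
        // Assuming a Spring Data repository: findAll(Pageable) returns a Page, not a List.
        // Sorting by a unique property (assumed here to be "id") keeps pages stable
        // between queries, so no row is skipped or read twice.
        Page<Book> batch = repository.findAll(PageRequest.of(page, pageSize, Sort.by("id")));
        if (!batch.hasContent()) {
            break;
        }
        batch.forEach(book -> consumer.accept(createDocument(book)));
        page++;
    }
}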
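The paging loop can also be factored out so that the termination logic is testable without a database. This is a minimal sketch under stated assumptions: `PageFetcher` and `PagedBootstrap` are hypothetical names, and the loop stops on the first short page rather than issuing one extra query for an empty page.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical abstraction over "give me page N of size S"; not part of Forage.
interface PageFetcher<T> {
    List<T> fetch(int page, int pageSize);
}

final class PagedBootstrap {
    private PagedBootstrap() {
    }

    // Feeds every item of every page to the consumer and returns the item count.
    static <T> int drain(PageFetcher<T> fetcher, int pageSize, Consumer<T> consumer) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int total = 0;
        for (int page = 0; ; page++) {
            List<T> batch = fetcher.fetch(page, pageSize);
            batch.forEach(consumer);
            total += batch.size();
            if (batch.size() < pageSize) {
                return total; // a short (or empty) page means the data is exhausted
            }
        }
    }
}
```

Only one page is held in memory at a time, so peak memory use is bounded by the page size rather than by the size of the table.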
Periodic Updates
The PeriodicUpdateEngine manages automatic re-bootstrapping:
// Create the update engine
PeriodicUpdateEngine<IndexableDocument> updateEngine = new PeriodicUpdateEngine<>(
bootstrapper, // Your Bootstrapper implementation
new AsyncQueuedConsumer<>(engine), // Wraps the search engine
60, // Interval
TimeUnit.SECONDS // Time unit
);
// Initial bootstrap
updateEngine.bootstrap();
// Start periodic updates
updateEngine.start();
// Later, when shutting down
updateEngine.stop();
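For intuition, the start/stop behavior above resembles running a task on a fixed delay. The sketch below uses only `java.util.concurrent`; `PeriodicRunner` is a hypothetical class written for illustration and makes no claim about how `PeriodicUpdateEngine` is implemented.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a fixed-delay refresh loop; not Forage's PeriodicUpdateEngine.
final class PeriodicRunner {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Runnable task;
    private final long interval;
    private final TimeUnit unit;

    PeriodicRunner(Runnable task, long interval, TimeUnit unit) {
        this.task = task;
        this.interval = interval;
        this.unit = unit;
    }

    void start() {
        // Fixed delay: the next run is scheduled only after the previous one finishes,
        // so slow cycles never overlap.
        scheduler.scheduleWithFixedDelay(this::runSafely, interval, interval, unit);
    }

    void stop() {
        scheduler.shutdownNow();
    }

    private void runSafely() {
        try {
            task.run();
        } catch (RuntimeException e) {
            // An uncaught exception would cancel all future runs, so report it and carry on.
            System.err.println("Refresh failed: " + e);
        }
    }
}
```

With a 60-second interval and a bootstrap that takes 10 seconds, a fixed-delay schedule like this one starts a new cycle every 70 seconds.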
Update Behavior
sequenceDiagram
participant UE as UpdateEngine
participant BS as Bootstrapper
participant AC as AsyncConsumer
participant LI as Lucene Index
Note over UE: Every 60 seconds
UE->>BS: bootstrap(consumer)
loop For each document
BS->>AC: accept(document)
AC->>LI: Queue for indexing
end
AC->>LI: Flush & swap index
Note over LI: Old index replaced atomically
Error Handling
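Catch per-document failures so that one bad record does not abort the whole bootstrap cycle. Keep in mind that any document skipped this way is missing from the index until a later cycle indexes it successfully: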
@Override
public void bootstrap(Consumer<IndexableDocument> consumer) {
    int successCount = 0;
    int errorCount = 0;
    for (Book book : repository.findAll()) {
        try {
            consumer.accept(createDocument(book));
            successCount++;
        } catch (Exception e) {
            // Pass the exception as the last argument so the stack trace is logged
            log.error("Failed to index book {}", book.getId(), e);
            errorCount++;
        }
    }
    log.info("Bootstrap complete: {} indexed, {} errors", successCount, errorCount);
}
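Since the index is rebuilt from whatever the bootstrapper emits, a cycle with many failures produces a visibly incomplete index. One option is to tolerate a bounded number of failures and abort beyond that. The following is a sketch only: `TolerantFeeder` and the `maxErrors` threshold are hypothetical, and how Forage reacts to an exception thrown from `bootstrap` is not covered here.

```java
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper: converts and submits items, tolerating a bounded number of failures.
final class TolerantFeeder {
    private TolerantFeeder() {
    }

    // Returns the number of items that failed; throws once failures exceed maxErrors.
    static <S, D> int feed(Iterable<S> source, Function<S, D> toDocument,
                           Consumer<D> consumer, int maxErrors) {
        int errors = 0;
        for (S item : source) {
            try {
                consumer.accept(toDocument.apply(item));
            } catch (RuntimeException e) {
                errors++;
                if (errors > maxErrors) {
                    throw new IllegalStateException(
                            "Aborting bootstrap after " + errors + " failures", e);
                }
            }
        }
        return errors;
    }
}
```

Inside a bootstrapper this could be called as `TolerantFeeder.feed(repository.findAll(), this::createDocument, consumer, 100)`, where the limit of 100 is an arbitrary example value.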
Multi-Source Bootstrapping
Combine data from multiple sources:
@Override
public void bootstrap(Consumer<IndexableDocument> consumer) {
    // Index books
    bookRepository.findAll().forEach(book ->
            consumer.accept(createBookDocument(book))
    );

    // Index magazines
    magazineRepository.findAll().forEach(magazine ->
            consumer.accept(createMagazineDocument(magazine))
    );
}

// Ids are prefixed per source so that a book and a magazine never collide
private ForageDocument createBookDocument(Book book) {
    return new ForageDocument("book-" + book.getId(), Arrays.asList(
            new TextField("title", book.getTitle()),
            new StringField("type", "BOOK")
    ));
}

private ForageDocument createMagazineDocument(Magazine magazine) {
    return new ForageDocument("magazine-" + magazine.getId(), magazine, Arrays.asList(
            new TextField("title", magazine.getTitle()),
            new StringField("type", "MAGAZINE")
    ));
}
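When each source already has its own bootstrapper, the same effect can be had by chaining them. This is a minimal sketch, not a Forage class: `CompositeBootstrapper` is a hypothetical name, the interface is re-declared locally so the snippet compiles on its own, and `List.copyOf` assumes Java 10 or later.

```java
import java.util.List;
import java.util.function.Consumer;

// Local copy of the interface shown earlier, so the sketch compiles on its own.
interface Bootstrapper<T> {
    void bootstrap(Consumer<T> consumer) throws Exception;
}

// Hypothetical helper: runs several bootstrappers in order against one consumer.
final class CompositeBootstrapper<T> implements Bootstrapper<T> {
    private final List<Bootstrapper<T>> delegates;

    CompositeBootstrapper(List<Bootstrapper<T>> delegates) {
        this.delegates = List.copyOf(delegates); // defensive, immutable copy
    }

    @Override
    public void bootstrap(Consumer<T> consumer) throws Exception {
        // Every delegate must run in every cycle, otherwise its documents
        // drop out of the rebuilt index.
        for (Bootstrapper<T> delegate : delegates) {
            delegate.bootstrap(consumer);
        }
    }
}
```

Document ids still need to be unique across all delegates, which is what the `book-` and `magazine-` prefixes above provide.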
Performance Optimization
1. Parallel Processing
The consumer is thread-safe, so you can parallelize:
@Override
public void bootstrap(Consumer<IndexableDocument> consumer) {
    List<Book> allBooks = repository.findAll();
    allBooks.parallelStream().forEach(book ->
            consumer.accept(createDocument(book))
    );
}
2. Minimize Field Processing
// Expensive: compute during bootstrap
consumer.accept(new ForageDocument(book.getId(), Arrays.asList(
        new TextField("summary", generateSummary(book)) // Slow!
)));

// Better: pre-compute and store
consumer.accept(new ForageDocument(book.getId(), Arrays.asList(
        new TextField("summary", book.getCachedSummary()) // Fast
)));
3. Monitor Bootstrap Duration
@Override
public void bootstrap(Consumer<IndexableDocument> consumer) {
    long startTime = System.currentTimeMillis();
    AtomicInteger count = new AtomicInteger();
    repository.findAll().forEach(book -> {
        consumer.accept(createDocument(book));
        count.incrementAndGet();
    });
    long duration = System.currentTimeMillis() - startTime;
    log.info("Bootstrapped {} documents in {} ms", count.get(), duration);
}
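The same measurement can be kept out of the bootstrapper itself by wrapping it in a decorator. The sketch below is an assumption-laden illustration: `TimedBootstrapper` is a hypothetical class, the interface is re-declared locally so the snippet compiles on its own, and `System.nanoTime()` is used because it is monotonic and therefore better suited to measuring elapsed time than the wall clock.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Local copy of the interface shown earlier, so the sketch compiles on its own.
interface Bootstrapper<T> {
    void bootstrap(Consumer<T> consumer) throws Exception;
}

// Hypothetical decorator: counts documents and measures elapsed time for any bootstrapper.
final class TimedBootstrapper<T> implements Bootstrapper<T> {
    private final Bootstrapper<T> delegate;
    private volatile long lastCount;
    private volatile long lastDurationMillis;

    TimedBootstrapper(Bootstrapper<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public void bootstrap(Consumer<T> consumer) throws Exception {
        AtomicLong count = new AtomicLong(); // atomic in case the delegate feeds from several threads
        long start = System.nanoTime();
        delegate.bootstrap(doc -> {
            consumer.accept(doc);
            count.incrementAndGet();
        });
        lastDurationMillis = (System.nanoTime() - start) / 1_000_000;
        lastCount = count.get();
    }

    long getLastCount() {
        return lastCount;
    }

    long getLastDurationMillis() {
        return lastDurationMillis;
    }
}
```

Comparing the last duration against the refresh interval gives an early warning when bootstrap cycles start taking longer than the time between them.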
Lifecycle
stateDiagram-v2
[*] --> Idle
Idle --> Bootstrapping: bootstrap() called
Bootstrapping --> Indexing: Documents submitted
Indexing --> Flushing: All documents processed
Flushing --> Swapping: Index flushed
Swapping --> Idle: Reference swapped
Idle --> Bootstrapping: Timer triggers
Next Steps
- Query Types - Search your indexed data
- Scoring & Ranking - Customize result ordering