I recently explored how to leverage large language models for legal document creation while preserving the precision and relevance of the source contracts. A widely used solution is RAG (Retrieval-Augmented Generation), which lets the model access and draw on a specific document repository when generating responses.
Think of RAG as a smart assistant with access to a carefully curated library. When you input a contract type and key phrases, it suggests relevant completions and follow-up clauses, similar to how GitHub Copilot works for coding.
While this approach seems straightforward for simple documents, it becomes tricky with legal texts where identical phrases (like “Any amendments to this Agreement must be made in writing to be valid”) appear across dozens of documents in slightly different contexts.
I tested various approaches on a sample of mostly real estate contracts, comparing data augmentation, hybrid search, re-ranking, and newer methods like Anthropic’s Contextual Retrieval and Jina AI’s late chunking. Each approach showed different strengths in improving retrieval accuracy.
Most interestingly, I found that standard, one-size-fits-all solutions often fall short. The key was in the details: analysing specific use cases and tailoring the approach to match the particular challenges of legal document generation.
While comparing retrieval accuracy scores, I noticed how even minor adjustments in document chunking/augmentation strategy could significantly impact the quality of suggested completions – something I had not quite anticipated when starting this experiment.
Data Preparation #
I started by converting contract points into structured documents using Gemini for structured output extraction. Each document included:
- The original contract text
- Document title and paragraph number
- Automatically extracted keywords
Here’s an example structure for a final provisions clause (Postanowienia końcowe); the point states that matters not regulated by the agreement are governed by the Polish Civil Code:
```json
{
  "number": "7",
  "title": "Postanowienia końcowe",
  "points": [
    {
      "content": "W sprawach nie uregulowanych niniejszą umową mają zastosowanie przepisy kodeksu cywilnego.",
      "key_words": [
        "sprawy nieuregulowane",
        "umowa",
        "przepisy",
        "kodeks cywilny"
      ],
      "query": "Stosuje się kodeks cywilny."
    }
  ]
}
```
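For illustration, here is a minimal sketch of how such a structured extraction could look, assuming the google-genai Python SDK with a Pydantic response schema; the model name, prompt wording, and helper function are my own and not the exact pipeline used here:

```python
# Sketch: extracting one structured contract paragraph with Gemini structured output.
# Assumes the google-genai SDK; model name and prompt wording are illustrative.
import json

from google import genai
from pydantic import BaseModel


class ContractPoint(BaseModel):
    content: str            # original clause text
    key_words: list[str]    # automatically extracted keywords
    query: str              # simplified query the clause should answer


class ContractParagraph(BaseModel):
    number: str
    title: str
    points: list[ContractPoint]


client = genai.Client()  # picks up the API key from the environment


def extract_paragraph(raw_text: str) -> ContractParagraph:
    response = client.models.generate_content(
        model="gemini-1.5-flash",  # any Gemini model with structured-output support
        contents=(
            "Extract the paragraph number, title, and for each point its content, "
            "keywords, and a short simplified query from this contract fragment:\n\n"
            + raw_text
        ),
        config={
            "response_mime_type": "application/json",
            "response_schema": ContractParagraph,
        },
    )
    return ContractParagraph(**json.loads(response.text))
```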
For vector representation, I used the multilingual jina-embeddings-v3 model and stored the results in a Milvus database. To simulate real usage, I generated test queries using the llama-3.3-70b-versatile model on Groq. While this model isn’t optimized for Polish, it worked well enough for generating simplified query patterns. Of course, these synthetic queries should eventually be replaced with real user data.
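As a rough sketch of this step, the snippet below embeds clause texts through the Jina embeddings API and writes them into a local Milvus collection. It covers only the dense side of the hybrid setup discussed later, and the collection name, fields, and Milvus Lite file are assumptions rather than the exact configuration:

```python
# Sketch: embedding contract points with jina-embeddings-v3 and storing them in Milvus.
# Assumes the Jina embeddings HTTP API and pymilvus; names and fields are illustrative.
import os

import requests
from pymilvus import MilvusClient

JINA_URL = "https://api.jina.ai/v1/embeddings"
HEADERS = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}


def embed(texts: list[str], task: str = "retrieval.passage") -> list[list[float]]:
    resp = requests.post(
        JINA_URL,
        headers=HEADERS,
        json={"model": "jina-embeddings-v3", "task": task, "input": texts},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]


client = MilvusClient("contracts.db")  # local Milvus Lite file for experimentation
client.create_collection("contract_points", dimension=1024)  # v3 default vector size

points = [
    {
        "title": "Umowa najmu stanowiska garażowego",
        "content": "W sprawach nie uregulowanych niniejszą umową mają "
                   "zastosowanie przepisy kodeksu cywilnego.",
    },
]
vectors = embed([p["content"] for p in points])
client.insert(
    collection_name="contract_points",
    data=[
        {"id": i, "vector": vec, **point}
        for i, (point, vec) in enumerate(zip(points, vectors))
    ],
)
```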
Evaluation Approach #
My use case had a specific requirement: unlike typical RAG applications that might use multiple search results, I needed exact matches only. Each query (combining contract title and synthetic question) needed to return the precise corresponding contract clause.
For example, the query “Umowa najmu stanowiska garażowego: Stosuje się kodeks cywilny” (roughly: “Garage parking space lease agreement: The Civil Code applies”) should return “W sprawach nie uregulowanych niniejszą umową mają zastosowanie przepisy kodeksu cywilnego.” (“Matters not regulated by this agreement are governed by the provisions of the Civil Code.”)
I tested various configurations, focusing on:
- Hybrid search ratio (dense/sparse)
- Reranking with jina-reranker-v2-base-multilingual
- Jina AI’s late chunking options
Success was measured by whether the relevant fragment appeared in the top 5 results. This approach formed the basis for a copilot-style system that suggests related contract clauses to users.
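To make the tested configuration concrete, here is a hedged sketch of the retrieval step: a weighted dense/sparse hybrid search in Milvus followed by reranking through the Jina reranker API. The field names, collection layout, and response handling are assumptions for illustration, not the exact code behind the numbers below:

```python
# Sketch: weighted dense/sparse hybrid search in Milvus followed by Jina reranking.
# Assumes a collection with "dense" and "sparse" vector fields and a "content" field.
import os

import requests
from pymilvus import AnnSearchRequest, MilvusClient, WeightedRanker

client = MilvusClient("contracts.db")
SPARSE_WEIGHT = 0.3  # best-performing setting in these experiments


def retrieve_candidates(dense_vec, sparse_vec, top_k: int = 10):
    dense_req = AnnSearchRequest(
        data=[dense_vec], anns_field="dense",
        param={"metric_type": "COSINE"}, limit=top_k,
    )
    sparse_req = AnnSearchRequest(
        data=[sparse_vec], anns_field="sparse",
        param={"metric_type": "IP"}, limit=top_k,
    )
    # Weights follow the order of the requests: dense first, sparse second.
    return client.hybrid_search(
        collection_name="contract_points",
        reqs=[dense_req, sparse_req],
        ranker=WeightedRanker(1 - SPARSE_WEIGHT, SPARSE_WEIGHT),
        limit=top_k,
        output_fields=["content"],
    )[0]


def rerank(query: str, candidates, top_n: int = 5) -> list[str]:
    docs = [hit["entity"]["content"] for hit in candidates]
    resp = requests.post(
        "https://api.jina.ai/v1/rerank",
        headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
        json={
            "model": "jina-reranker-v2-base-multilingual",
            "query": query,
            "documents": docs,
            "top_n": top_n,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return [docs[r["index"]] for r in resp.json()["results"]]
```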
While I ran numerous additional tests, I’ll focus on sharing just the most significant results. The optimal configuration was as follows:
- reranker: top-k 10
- sparse-weight: 0.3
- late chunking: no
| Match Type     | Accuracy | MRR   | P@1    | P@3    | P@5    |
|----------------|----------|-------|--------|--------|--------|
| All elements   | 95.97%   | 0.914 | 88.18% | 33.43% | 20.75% |
| Title Only     | 100.00%  | 0.972 | 95.10% | 69.74% | 63.75% |
| Paragraph Only | 99.14%   | 0.943 | 91.35% | 48.80% | 37.29% |
| Content Only   | 95.97%   | 0.914 | 88.18% | 36.41% | 23.00% |
- Total Queries: 347
- Perfect Matches (any rank): 333
- Perfect Match Rate (any rank): 95.97%
- Perfect Matches (rank 1): 306
- Perfect Match Rate (rank 1): 88.18%
Metrics #
- Accuracy: Share of queries with a correct match anywhere in the first 5 results
- MRR (Mean Reciprocal Rank): Measures how highly the first correct answer appears in results. Scale 0-1, higher is better
- Precision@N (P@1, P@3, P@5): Fraction of relevant results among the top N positions
- P@1: Accuracy of the first result
- P@3: Fraction of relevant results in the top 3
- P@5: Fraction of relevant results in the top 5

Note that with a single exactly matching clause per query, P@3 tops out near 1/3 and P@5 near 1/5, which is why those columns look low for full matches even when retrieval is nearly perfect; partial criteria such as Title Only can be satisfied by several results, so their values run higher.
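For completeness, a minimal sketch of how these metrics can be computed from ranked results, assuming one known correct clause per query:

```python
# Sketch: accuracy, MRR, and P@N over top-5 results, one correct clause per query.
def evaluate(results: dict[str, list[str]], relevant: dict[str, str], k: int = 5):
    total = len(results)
    hits = 0          # queries with the correct clause anywhere in the top k
    rr_sum = 0.0      # sum of reciprocal ranks of the first correct hit
    precision_at = {1: 0.0, 3: 0.0, 5: 0.0}
    for query, ranked in results.items():
        target = relevant[query]
        topk = ranked[:k]
        if target in topk:
            hits += 1
            rr_sum += 1.0 / (topk.index(target) + 1)
        for n in precision_at:
            precision_at[n] += topk[:n].count(target) / n
    return {
        "accuracy": hits / total,
        "mrr": rr_sum / total,
        **{f"P@{n}": p / total for n, p in precision_at.items()},
    }
```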
Match Type Metrics #
- All elements: Percentage of queries where all elements matched correctly
- Title Only: Accuracy when matching just document titles
- Paragraph Only: Accuracy when matching just paragraphs
- Content Only: Accuracy when matching full document content
Summary #
- Better Without Late Chunking: Late chunking significantly underperforms compared to processing full documents upfront. When tested with otherwise identical parameters, dropping late chunking improved first-hit accuracy (P@1) by nearly 11 percentage points, from 73.78% to 84.73%, showing that maintaining document context during initial processing is crucial (a sketch of toggling this option follows this list).
- Sweet Spot in Sparse Weight: A lower sparse weight of 0.3 (versus 0.5) creates a better balance in the retrieval mechanism. This configuration achieved perfect title matching and improved overall accuracy to 93.66%, indicating that giving more weight to dense embeddings helps capture semantic relationships better.
- Reranker as a Game Changer: The introduction of a reranker proved transformative, especially with a top-k value of 10. This addition pushed the system to its peak performance of 95.97% accuracy and 88.18% P@1, a significant improvement over non-reranked results.
- Precision vs Reranking Depth: While both reranking configurations (top-k=5 and top-k=10) showed improvements, the deeper reranking depth of 10 achieved marginally better results. The small differences suggest that even a shallower reranking depth can provide substantial benefits while being more computationally efficient.
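For reference, late chunking can be toggled with a single flag on a Jina embeddings request. The snippet below is a sketch assuming the API's documented late_chunking option; the chunks are illustrative and are expected to come from the same document:

```python
# Sketch: requesting late-chunked embeddings from the Jina API for comparison.
# Assumes the API's late_chunking flag; all chunks should belong to one document.
import os

import requests

chunks = [
    "7. Postanowienia końcowe",
    "W sprawach nie uregulowanych niniejszą umową mają zastosowanie "
    "przepisy kodeksu cywilnego.",
]
resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-embeddings-v3",
        "task": "retrieval.passage",
        "late_chunking": True,  # embed chunks with full-document context
        "input": chunks,
    },
    timeout=60,
)
resp.raise_for_status()
vectors = [d["embedding"] for d in resp.json()["data"]]
```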
The evaluation points to an optimal configuration: no late chunking, a sparse weight of 0.3, and reranking with top-k=10. This combination provides the best balance between accuracy and comprehensive matching across different content types. When considering computational resources, users might opt for a top-k=5 configuration, which offers nearly equivalent performance (87.90% vs 88.18% P@1) while requiring approximately half the tokens to be processed by the resource-intensive reranker model.