RMSearch aligns retrieval with downstream rewards
RMSearch replaces traditional embedding-only similarity with an optimized reward model so the search stack can reason about which candidate best serves the downstream task. Each query/key pair is scored by a vLLM-backed reward head, letting the engine pick agents that unlock chain-of-thought reasoning paths (as highlighted in the project README and SEIMEI examples).
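The core pattern can be sketched in a few lines: instead of ranking keys by cosine similarity to a query embedding, every (query, key) pair is scored jointly and the best-scoring key wins. The `toy_reward` function below is a stand-in for the real vLLM-backed reward head, which is not reproduced here.

```python
# Illustrative sketch of pairwise reward scoring. `toy_reward` is a
# hypothetical stand-in: it counts word overlap, whereas the real model
# scores the (query, key) pair jointly with a trained reward head.

def toy_reward(query: str, key: str) -> float:
    q, k = set(query.lower().split()), set(key.lower().split())
    return len(q & k) / max(len(q | k), 1)

def pick_best_key(query: str, keys: list[str]) -> tuple[int, float]:
    """Score every (query, key) pair and return the best key's index and score."""
    scores = [toy_reward(query, key) for key in keys]
    best = max(range(len(keys)), key=scores.__getitem__)
    return best, scores[best]

agents = [
    "math agent solves equations step by step",
    "code agent writes python programs",
    "search agent retrieves documents",
]
idx, score = pick_best_key("solve this equation step by step", agents)
```

Because the scorer sees both texts at once, it can prefer an agent whose description shares little surface vocabulary with the query — the situation where embedding-only similarity tends to fail.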
Key traits, stack, and use cases
Key Traits
- State-of-the-art retrieval on BigCodeBench / DeepSeek-R1 benchmarks after reward-model training.
- Outperforms semantic-embedding baselines (e.g., e5-mistral-7b) when selecting agents whose text is dissimilar to the query.
- Supports step-by-step CoT planning, aligning with SEIMEI's agent search demos.
Stack
- vLLM
- Hugging Face
- TRL + PEFT LoRA
- FastAPI / Uvicorn
- PyTorch + CUDA
Use Cases
- Search for the best agent persona to answer a query.
- Reward-aligned reranking for RAG pipelines.
- Tag tree routing with representative tags per document cluster.
Environment checklist
Hardware & OS
- Linux environment with CUDA-visible GPUs (the README recommends >=12 GB VRAM).
- Python 3.9+ with PyTorch + CUDA installed; vLLM handles inference.
- Plenty of disk space for Hugging Face datasets and checkpoints.
- Collaborators mirror the RunPod convention: work under /workspace/<name>/.
Install Options
# pip release (placeholder while packaging finishes)
pip install rmsearch
# editable install from the repo
git clone https://github.com/kyotoai/RMSearch.git
cd RMSearch
pip install -e .
Optional Model Downloads
All training README files provide hf_transfer commands for local checkpoints:
- Reward: Ray2333/GRM-Llama3.2-3B-rewardmodel-ft
- Instruction/Q&A: Qwen/Qwen3-4B-Instruct-2507
- Reranker: Qwen/Qwen3-Reranker-4B
- Embedding baseline: intfloat/e5-mistral-7b-instruct
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
Ray2333/GRM-Llama3.2-3B-rewardmodel-ft \
--local-dir ./llama3b-rm/
Search programmatically or via FastAPI
The Search runtime exposes helpers such as search_by_df and get_relevance for batched
scoring. All public methods are coroutine-friendly; wrap calls with asyncio.run inside scripts.
FastAPI responses mirror the async client: you receive query_id, key_id, the matched text, and
optional relevance scores.
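The coroutine-friendly calling convention can be illustrated with a minimal sketch. The real `search_by_df` / `get_relevance` signatures are not reproduced here; `fake_get_relevance` is a hypothetical stand-in, and the returned records simply mirror the response shape described above (query_id, key_id, matched text, relevance).

```python
import asyncio

# Hypothetical sketch of the async calling pattern: score keys
# concurrently, then return records shaped like the FastAPI response.
# `fake_get_relevance` stands in for the engine's reward scoring.

async def fake_get_relevance(query: str, key: str) -> float:
    return float(len(set(query.split()) & set(key.split())))

async def search(query_id: int, query: str, keys: list[str]) -> list[dict]:
    scores = await asyncio.gather(*(fake_get_relevance(query, k) for k in keys))
    ranked = sorted(zip(range(len(keys)), keys, scores), key=lambda t: -t[2])
    return [
        {"query_id": query_id, "key_id": kid, "key": key, "relevance": s}
        for kid, key, s in ranked
    ]

# Inside a plain script, wrap the coroutine with asyncio.run:
results = asyncio.run(search(0, "reward model search", ["reward model", "embedding index"]))
```

Inside an already-running event loop (e.g., a FastAPI handler or a notebook), `await search(...)` directly instead of calling `asyncio.run`.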
Runtime modules
- rmsearch.py: async top-k search, checkpoint-aware streaming, save_results.
- utils/vllm_*.py: builders for generation, embedding, reward, and HTTP-backed inference.
- multi_service.py: multi-model utility combining search, generation, and embeddings.
Tag tree workflow
- tree/generate_tag.py creates JSON tag inventories.
- tree/embed_tags.py pools embeddings per tag.
- tree/assign_key.py walks the tag tree to map queries to optimal paths.
Dataset creation → LoRA fine-tune
The training README files break the process into reproducible CLI stages that mirror examples/train_en.ipynb.
Execute them sequentially inside a GPU workspace.
- Collect corpora with rmsearch.train.process_data. Supports Hugging Face streaming, sampling, and deterministic shuffles. Produces dataset_dict.json, df.csv, and df_small.csv.
- Generate queries using make_query_recs.py followed by filter_query_recs.py to focus on titles/questions/keywords.
- Retrieve candidates by embedding queries and keys (get_top_relevant_keys_embed.py), or rely on an RM run.
- Sample preference pairs with sample_dpo_batch.py, then judge them via judge_dataset.py to produce dataset_list_train/test.json.
- Train using lora_example.py, logging to Weights & Biases if desired.
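The preference-pair sampling step can be sketched as follows. This is a simplified stand-in, not the actual sample_dpo_batch.py logic: each query keeps its retrieved positive as "chosen" and draws a "rejected" key from unrelated corpus rows, with a seeded RNG for the deterministic shuffles mentioned above.

```python
import random

# Hedged sketch of preference-pair sampling: pair each query's retrieved
# positive ("chosen") with a randomly drawn negative ("rejected").
# Field names and structure are illustrative assumptions.

def sample_pairs(positives: dict, corpus: list, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded for reproducible sampling
    pairs = []
    for query, chosen in positives.items():
        negatives = [k for k in corpus if k != chosen]
        pairs.append({"query": query, "chosen": chosen, "rejected": rng.choice(negatives)})
    return pairs

positives = {"what is a reward model?": "a reward model scores outputs"}
corpus = ["a reward model scores outputs", "sql joins two tables", "gpu memory tips"]
pairs = sample_pairs(positives, corpus)
```

In the real pipeline the chosen/rejected judgment is then refined by judge_dataset.py rather than taken on faith from retrieval alone.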
LoRA training command
export WANDB_API_KEY=<key>
wandb login
python -m rmsearch.train.lora_example \
--dataset-list-train ./exp1/dataset_list_train.json \
--dataset-list-test ./exp1/dataset_list_test.json \
--model-name /workspace/llama3b-rm \
--output-dir ./exp1/model1 \
--wandb-project rmsearch \
--wandb-run-name example-lora
Default LoRA configuration targets projection layers (e.g., k_proj, q_proj, gate_proj)
with r=16 and lora_alpha=16. Adjust inside rmsearch/train/lora_example.py.
Script highlights
- process_data.py: streaming HF download, subset sampling, CSV export.
- get_top_relevant_keys_embed.py: uses utils/vllm_embed.py with configurable batch sizes and GPU similarity.
- sample_dpo_batch.py: mixes retrieved positives with df-based negatives.
- judge_dataset.py: prompts an LLM (via vLLM) to pick chosen/rejected keys per query.
For test-set construction the team processes the mteb/arguana split, merges corpus.jsonl +
queries.jsonl, and generates query records manually before running the same sampling steps.
Advanced DPO batching
Beyond the standard DPO pipeline, the README_adpo_*.md guides outline qualitative datasets where every
query/key pair is accompanied by systematically degraded alternatives.
Less relevant key generation
- make_query_and_less_relevant_keys_recs.py uses Qwen or GPT-OSS to produce a query plus n_key_generation variants, ensuring key[0] > key[1] > ... in relevance.
- Prompts cycle through instructions (titles, paragraphs, etc.) so the data is not biased toward a single pattern.
- The exported JSON stores correspond_key and less_relevant_keys per df row.
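The ordering invariant the generated variants must satisfy — key[0] > key[1] > ... in relevance — can be checked mechanically. The sketch below uses a toy word-overlap scorer in place of the model-based judgment the real pipeline relies on.

```python
# Sketch of the monotone-relevance invariant check: each successive
# variant must score strictly lower against the query than the one
# before it. `toy_relevance` is an illustrative stand-in scorer.

def toy_relevance(query: str, key: str) -> int:
    return len(set(query.lower().split()) & set(key.lower().split()))

def is_monotonically_less_relevant(query: str, keys: list[str]) -> bool:
    scores = [toy_relevance(query, k) for k in keys]
    return all(a > b for a, b in zip(scores, scores[1:]))

query = "reward models for agent search"
keys = [
    "reward models rank agent search results",  # most relevant
    "reward models need training data",         # less relevant
    "weather is nice today",                    # least relevant
]
ok = is_monotonically_less_relevant(query, keys)
```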
Query difficulty ladders
- make_query_dpo_pairs.py / v2 / v3 generate queries spanning highly relevant to partially off-target.
- Configurations support large runs (n_query_generation up to 5, multi-instance vLLM workers).
- Pairs can be fed directly into adpo_lora_example.py for training with explicit difficulty order.
Use sample_advanced_dpo_batch.py to mix the structured positives with randomly sampled negatives
(n-sampled-keys) before exporting adpo_sampled_query_key_set.json.
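The mixing step can be sketched as appending n randomly sampled corpus negatives to the structured, difficulty-ordered keys for each query. This is an assumption about the shape of the operation, not a reproduction of sample_advanced_dpo_batch.py.

```python
import random

# Hedged sketch of the positive/negative mixing step: keep the
# structured keys in their explicit difficulty order, then append
# n randomly sampled negatives from the wider corpus.

def mix_keys(ordered_keys: list, corpus: list, n_sampled: int, seed: int = 0) -> list:
    rng = random.Random(seed)  # seeded for reproducible batches
    pool = [k for k in corpus if k not in ordered_keys]
    negatives = rng.sample(pool, min(n_sampled, len(pool)))
    return ordered_keys + negatives

ordered = ["best match", "weaker match"]
corpus = ["best match", "noise a", "noise b", "noise c"]
batch = mix_keys(ordered, corpus, n_sampled=2)
```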
End-to-end evaluation on BEIR datasets
The evaluation README focuses on BEIR tasks that expose binary relevance scores. Each stage consumes the previous output, so clean up the dataset folder if you need a fresh run.
Embedding stage
python -m rmsearch.evaluation.embed \
--dataset-path /workspace/beir_out/scifact \
--split test \
--output .../relevant_emb.json \
--output-eval .../relevant_emb_eval.json \
--model-name /workspace/e5-mistral7b \
--tensor-parallel-size 1 \
--num-instances 1 \
--top-k 100 \
--similarity-device auto
- Automatically downloads the BEIR dataset when missing.
- Writes two JSON files: one for reranking, one for nDCG scoring.
- Example entry stores query_id, key_ids, positive_key_ids, and embedding scores.
Rerank + nDCG
python -m rmsearch.evaluation.rerank \
--dataset-path /workspace/beir_out/scifact \
--embed-output .../relevant_emb.json \
--output-eval .../relevant_rerank_eval.json \
--model-name /workspace/qwen4b-reward-step560 \
--tensor-parallel-size 1 \
--num-instances 1 \
--request-batch-size 128
Edit rmsearch/evaluation/ndcg.py to point to your dataset, path, and rerank output, then run
python rmsearch/evaluation/ndcg.py to compute nDCG scores.
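The metric itself is standard: with the binary relevance labels these BEIR tasks expose, each positive hit in the ranked list contributes 1/log2(rank+1) to DCG, normalized by the ideal ordering. A minimal sketch (not the ndcg.py implementation) over the key_ids / positive_key_ids fields described above:

```python
import math

# Minimal nDCG@k sketch for binary relevance: DCG discounts each hit
# by log2(rank + 1) and is normalized by the ideal (all-positives-first)
# ordering. Field names follow the embedding-stage output format.

def ndcg_at_k(ranked_ids: list, positive_ids: set, k: int) -> float:
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, kid in enumerate(ranked_ids[:k])
        if kid in positive_ids
    )
    ideal_hits = min(len(positive_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# One positive at rank 1, one at rank 3:
score = ndcg_at_k(["d3", "d1", "d9"], {"d3", "d9"}, k=3)
```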
Remember to convert reward checkpoints (if necessary) with utils.py before evaluation.
Roadmap excerpts
Search
- Async vLLM integration enhancements.
- Automatic compatibility solver for distributed inference.
Training
- Reward trainer polish + expanded examples.
- MCTS-style example notebooks for reasoning search.
Track open issues on GitHub for the latest tasks across search, training, and evaluation.
Key links
- Repository: github.com/kyotoai/RMSearch
- Demo gallery: demo/
- Roadmap & planning: see ROADMAP.md and pinned GitHub issues.
- Contact KyotoAI Inc.: office@kyotoai.org · kyotoai.org
- License: Apache-2.0 (see LICENSE).