Reward-model native retrieval

RMSearch aligns retrieval with downstream rewards

RMSearch replaces embedding-only similarity with a trained reward model so the search stack can reason like an agent. Each query/key pair is scored by a vLLM-backed reward head, letting the engine select the agents most likely to open useful chain-of-thought reasoning paths (as highlighted in the project README and SEIMEI examples).

Key traits, stack, and use cases

Key Traits

  • State-of-the-art retrieval on BigCodeBench / DeepSeek-R1 benchmarks after reward-model training.
  • Outperforms semantic-embedding baselines (e.g., e5-mistral-7b) when the best agent's text shares little surface similarity with the query.
  • Supports step-by-step CoT planning, aligning with SEIMEI's agent search demos.

Stack

  • vLLM
  • Hugging Face
  • TRL + PEFT LoRA
  • FastAPI / Uvicorn
  • PyTorch + CUDA

Use Cases

  • Search for the best agent persona to answer a query.
  • Reward-aligned reranking for RAG pipelines.
  • Tag tree routing with representative tags per document cluster.
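
The reranking use case can be sketched with a stand-in scorer. Here `score_pair` is a toy placeholder for the reward model, not the RMSearch API; in the real stack the score would come from the vLLM-backed reward head.

```python
# Sketch of reward-aligned reranking. score_pair is a stand-in for the
# reward model's query/key score; RMSearch computes this with a reward
# head instead of the toy lexical overlap used here.
def score_pair(query: str, key: str) -> float:
    # Toy overlap score, used only to make the sketch runnable.
    q_tokens, k_tokens = set(query.lower().split()), set(key.lower().split())
    return len(q_tokens & k_tokens) / max(len(q_tokens), 1)

def rerank(query: str, candidates: list[str], k: int = 3) -> list[str]:
    # Sort retrieved candidates by score, highest first, and keep top-k.
    return sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)[:k]

docs = [
    "pasta recipes for weeknight dinners",
    "retrieval augmented generation combines search with LLMs",
]
best = rerank("what is retrieval augmented generation", docs, k=1)
```

Swapping the scorer is the only change needed to move from an embedding baseline to a reward-aligned reranker; the surrounding top-k logic stays the same.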

Environment checklist

Hardware & OS

  • Linux environment with CUDA-visible GPUs (the README recommends >=12 GB VRAM).
  • Python 3.9+ with PyTorch + CUDA installed; vLLM handles inference.
  • Plenty of disk space for Hugging Face datasets and checkpoints.
  • Collaborators mirror the RunPod convention: work under /workspace/<name>/.

Install Options

# pip release (placeholder while packaging finishes)
pip install rmsearch

# editable install from the repo
git clone https://github.com/kyotoai/RMSearch.git
cd RMSearch
pip install -e .

Optional Model Downloads

All training README files provide hf_transfer commands for local checkpoints:

  • Reward: Ray2333/GRM-Llama3.2-3B-rewardmodel-ft
  • Instruction/Q&A: Qwen/Qwen3-4B-Instruct-2507
  • Reranker: Qwen/Qwen3-Reranker-4B
  • Embedding baseline: intfloat/e5-mistral-7b-instruct

HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
  Ray2333/GRM-Llama3.2-3B-rewardmodel-ft \
  --local-dir ./llama3b-rm/

Search programmatically or via FastAPI

import asyncio
from rmsearch import Search

async def main():
    search = Search(
        model_name="/workspace/llama3b-rm",
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
    )
    queries = ["Summarise retrieval augmented generation."]
    keys = [
        "Retrieval augmented generation (RAG) combines external documents with LLMs.",
        "An unrelated sentence about cooking pasta.",
    ]
    results = await search(queries, keys, k=1)
    print(results[0]["keys"][0])
    search.close()

asyncio.run(main())

The Search runtime exposes helpers such as search_by_df and get_relevance for batched scoring. All public methods are coroutine-friendly; wrap calls with asyncio.run inside scripts.

FastAPI responses mirror the async client: you receive query_id, key_id, the matched text, and optional relevance scores.
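
Those response fields are easy to post-process client-side. The sketch below assumes the payload is a JSON list of objects with the documented fields (`query_id`, `key_id`, the matched text, optional `relevance`); the exact shape may differ from the live API.

```python
# Group search responses by query and sort each group by relevance.
# Field names follow the documented response; the list-of-dicts payload
# shape is an assumption for this sketch.
from collections import defaultdict

def group_results(items: list[dict]) -> dict[int, list[dict]]:
    grouped = defaultdict(list)
    for item in items:
        grouped[item["query_id"]].append(item)
    for hits in grouped.values():
        # Entries without a relevance score sort last.
        hits.sort(key=lambda h: h.get("relevance", float("-inf")), reverse=True)
    return dict(grouped)

response = [
    {"query_id": 0, "key_id": 5, "key": "Pasta tips", "relevance": 0.1},
    {"query_id": 0, "key_id": 2, "key": "RAG overview", "relevance": 0.9},
]
grouped = group_results(response)
```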

Runtime modules

  • rmsearch.py: async top-k search, checkpoint-aware streaming, save_results.
  • utils/vllm_*.py: builders for generation, embedding, reward, and HTTP-backed inference.
  • multi_service.py: multi-model utility combining search, generation, and embeddings.

Tag tree workflow

  • tree/generate_tag.py creates JSON tag inventories.
  • tree/embed_tags.py pools embeddings per tag.
  • tree/assign_key.py walks the tag tree to map queries to optimal paths.

Dataset creation → LoRA fine-tune

The training README files break the process into reproducible CLI stages that mirror examples/train_en.ipynb. Execute them sequentially inside a GPU workspace.

  1. Collect corpora with rmsearch.train.process_data. Supports Hugging Face streaming, sampling, and deterministic shuffles. Produces dataset_dict.json, df.csv, and df_small.csv.
  2. Generate queries using make_query_recs.py followed by filter_query_recs.py to focus on titles/questions/keywords.
  3. Retrieve candidates by embedding queries and keys (get_top_relevant_keys_embed.py), or rely on an RM run.
  4. Sample preference pairs with sample_dpo_batch.py, then judge them via judge_dataset.py to produce dataset_list_train/test.json.
  5. Train using lora_example.py, logging to Weights & Biases if desired.

LoRA training command

export WANDB_API_KEY=<key>
wandb login

python -m rmsearch.train.lora_example \
  --dataset-list-train ./exp1/dataset_list_train.json \
  --dataset-list-test ./exp1/dataset_list_test.json \
  --model-name /workspace/llama3b-rm \
  --output-dir ./exp1/model1 \
  --wandb-project rmsearch \
  --wandb-run-name example-lora

Default LoRA configuration targets projection layers (e.g., k_proj, q_proj, gate_proj) with r=16 and lora_alpha=16. Adjust inside rmsearch/train/lora_example.py.
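
The defaults described above can be expressed as a PEFT `LoraConfig`; this is a sketch of that configuration, with rmsearch/train/lora_example.py remaining the source of truth for the exact values.

```python
# Sketch of the default LoRA setup using PEFT's LoraConfig. Target modules
# and hyperparameters mirror the documented defaults; check
# rmsearch/train/lora_example.py for the authoritative configuration.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["k_proj", "q_proj", "gate_proj"],
    task_type="SEQ_CLS",  # reward models score sequences, i.e. classification heads
)
```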

Script highlights

  • process_data.py: streaming HF download, subset sampling, CSV export.
  • get_top_relevant_keys_embed.py: uses utils/vllm_embed.py with configurable batch sizes and GPU similarity.
  • sample_dpo_batch.py: mixes retrieved positives with df-based negatives.
  • judge_dataset.py: prompts an LLM (via vLLM) to pick chosen/rejected keys per query.

For test-set construction the team processes the mteb/arguana split, merges corpus.jsonl + queries.jsonl, and generates query records manually before running the same sampling steps.
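
Merging the two JSONL files can be sketched as building one id-to-text lookup per file. The `_id`/`text` field names follow the usual BEIR layout and are an assumption about mteb/arguana's files.

```python
# Sketch of merging BEIR-style corpus.jsonl and queries.jsonl into one
# structure. The _id/text field names follow the common BEIR layout and
# are an assumption here.
import json

def load_jsonl(lines: list[str]) -> dict[str, str]:
    # One JSON object per line; map document id to its text.
    return {rec["_id"]: rec["text"] for rec in map(json.loads, lines)}

corpus = load_jsonl(['{"_id": "d1", "text": "Argument about policy."}'])
queries = load_jsonl(['{"_id": "q1", "text": "Is the policy justified?"}'])
merged = {"corpus": corpus, "queries": queries}
```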

Advanced DPO batching

Beyond the standard DPO pipeline, the README_adpo_*.md guides outline qualitative datasets where every query/key pair is accompanied by systematically degraded alternatives.

Less relevant key generation

  • make_query_and_less_relevant_keys_recs.py uses Qwen or GPT-OSS to produce a query plus n_key_generation variants, ensuring key[0] > key[1] > ... in relevance.
  • Prompts cycle through instructions (titles, paragraphs, etc.) so the data is not biased toward a single pattern.
  • The exported JSON stores correspond_key and less_relevant_keys per df row.

Query difficulty ladders

  • make_query_dpo_pairs.py/v2/v3 generate queries ranging from highly relevant to partially off-target.
  • Configurations support large runs (n_query_generation up to 5, multi-instance vLLM workers).
  • Pairs can be fed directly into adpo_lora_example.py for training with explicit difficulty order.

Use sample_advanced_dpo_batch.py to mix the structured positives with randomly sampled negatives (n-sampled-keys) before exporting adpo_sampled_query_key_set.json.
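
Conceptually, that mixing step pairs one structured positive with randomly sampled negatives. The sketch below illustrates the idea; the query/chosen/rejected record layout is illustrative, not the script's actual schema.

```python
# Sketch of mixing a structured positive with n randomly sampled negatives,
# as sample_advanced_dpo_batch.py does conceptually. The record layout is
# illustrative, not the script's schema.
import random

def build_batch(query: str, positive: str, pool: list[str],
                n_sampled_keys: int, seed: int = 0) -> dict:
    rng = random.Random(seed)  # seeded for reproducible sampling
    negatives = rng.sample([k for k in pool if k != positive], n_sampled_keys)
    return {"query": query, "chosen": positive, "rejected": negatives}

pool = ["good key", "noise a", "noise b", "noise c"]
batch = build_batch("example query", "good key", pool, n_sampled_keys=2)
```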

End-to-end evaluation on BEIR datasets

The evaluation README focuses on BEIR tasks that expose binary relevance scores. Each stage consumes the previous output, so clean up the dataset folder if you need a fresh run.

Embedding stage

python -m rmsearch.evaluation.embed \
  --dataset-path /workspace/beir_out/scifact \
  --split test \
  --output .../relevant_emb.json \
  --output-eval .../relevant_emb_eval.json \
  --model-name /workspace/e5-mistral7b \
  --tensor-parallel-size 1 \
  --num-instances 1 \
  --top-k 100 \
  --similarity-device auto

  • Automatically downloads the BEIR dataset when missing.
  • Writes two JSON files: one for reranking, one for nDCG scoring.
  • Example entry stores query_id, key_ids, positive_key_ids, and embedding scores.
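
An entry with `key_ids` and `positive_key_ids` is already enough to compute recall@k before any reranking; a minimal sketch over the documented fields:

```python
# Compute recall@k from one embedding-stage entry. Field names match the
# documented output (query_id, key_ids, positive_key_ids).
def recall_at_k(entry: dict, k: int) -> float:
    retrieved = set(entry["key_ids"][:k])
    positives = set(entry["positive_key_ids"])
    return len(retrieved & positives) / len(positives)

entry = {"query_id": 0, "key_ids": [7, 3, 9, 1], "positive_key_ids": [3, 4]}
r = recall_at_k(entry, k=2)  # 3 is retrieved in the top 2, 4 is not
```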

Rerank + nDCG

python -m rmsearch.evaluation.rerank \
  --dataset-path /workspace/beir_out/scifact \
  --embed-output .../relevant_emb.json \
  --output-eval .../relevant_rerank_eval.json \
  --model-name /workspace/qwen4b-reward-step560 \
  --tensor-parallel-size 1 \
  --num-instances 1 \
  --request-batch-size 128

Edit rmsearch/evaluation/ndcg.py to point at your dataset path and rerank output, then run python rmsearch/evaluation/ndcg.py to compute nDCG scores.
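
The metric itself is independent of the script: for binary relevance, each hit's gain is discounted by log2(rank + 1) and normalized by the ideal ordering's DCG. A self-contained sketch:

```python
# nDCG@k for binary relevance: discount each hit by log2(rank + 1) and
# normalize by the DCG of the ideal ordering (all positives first).
import math

def ndcg_at_k(ranked_ids: list, positive_ids: set, k: int) -> float:
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in positive_ids)
    ideal_hits = min(len(positive_ids), k)  # best case: positives fill the top ranks
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

score = ndcg_at_k([3, 9, 4], {3, 4}, k=3)
```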

Remember to convert reward checkpoints (if necessary) with utils.py before evaluation.

Roadmap excerpts

Search

  • Async vLLM integration enhancements.
  • Automatic compatibility solver for distributed inference.

Training

  • Reward trainer polish + expanded examples.
  • MCTS-style example notebooks for reasoning search.

Track open issues on GitHub for the latest tasks across search, training, and evaluation.