An LLM-powered knowledge curation system that researches a topic and generates a full-length report with citations.


STORM: Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking

| Research preview | Paper | Website |

Latest News 🔥

  • [2024/07] You can now install our package with pip install knowledge-storm!
  • [2024/07] We add VectorRM to support grounding on user-provided documents, complementing the existing support for search engines (YouRM, BingSearch). (check out #58)
  • [2024/07] We release demo light for developers: a minimal user interface built with the Streamlit framework in Python, handy for local development and demo hosting (check out #54).
  • [2024/06] We will present STORM at NAACL 2024! Find us at Poster Session 2 on June 17 or check out our presentation material.
  • [2024/05] We add Bing Search support in rm.py. Test STORM with GPT-4o: we now configure the article generation part of our demo using the GPT-4o model.
  • [2024/04] We release a refactored version of the STORM codebase! We define an interface for the STORM pipeline and reimplement STORM-wiki (check out src/storm_wiki) to demonstrate how to instantiate the pipeline. We provide an API to support customization of different language models and retrieval/search integration.

STORM is an LLM system that writes Wikipedia-like articles from scratch based on Internet search.

While the system cannot produce publication-ready articles, which often require many further edits, experienced Wikipedia editors have found it helpful in their pre-writing stage.

Try out our live research preview to see how STORM can help your knowledge exploration journey, and please provide feedback to help us improve the system 🙏!

How STORM works

STORM breaks down generating long articles with citations into two stages:

  1. Pre-writing stage: The system conducts Internet-based research to collect references and generates an outline.
  2. Writing stage: The system uses the outline and references to generate the full-length article with citations.

STORM identifies the core challenge of automating the research process as automatically coming up with good questions to ask. Directly prompting the language model to ask questions does not work well. To improve the depth and breadth of the questions, STORM adopts two strategies:

  1. Perspective-Guided Question Asking: Given the input topic, STORM discovers different perspectives by surveying existing articles on similar topics and uses them to control the question-asking process.
  2. Simulated Conversation: STORM simulates a conversation between a Wikipedia writer and a topic expert grounded in Internet sources, enabling the language model to update its understanding of the topic and ask follow-up questions.

Based on the separation of the two stages, STORM is implemented in a highly modular way using dspy.
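
As an illustration of that style (a minimal sketch, not the actual STORM module; the signature and field names below are hypothetical), a dspy signature for perspective-guided question asking and a predictor built from it might look like:

import dspy

# Hypothetical signature: the real STORM signatures live inside the package.
class AskQuestion(dspy.Signature):
    """Ask the next question about the topic from the given perspective."""
    topic = dspy.InputField(desc="the topic being researched")
    perspective = dspy.InputField(desc="the perspective guiding the questioning")
    dialogue_history = dspy.InputField(desc="the conversation so far")
    question = dspy.OutputField(desc="the next question to ask the topic expert")

# dspy compiles the signature into a prompt. A language model must be configured
# first, e.g. dspy.settings.configure(lm=...), before calling the predictor.
ask_question = dspy.Predict(AskQuestion)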

Installation

To install the knowledge storm library, use pip install knowledge-storm.

You can also install from the source code, which allows you to modify the behavior of the STORM engine directly.

  1. Clone the git repository.

    git clone https://github.com/stanford-oval/storm.git
    cd storm
  2. Install the required packages.

    conda create -n storm python=3.11
    conda activate storm
    pip install -r requirements.txt

API

The STORM knowledge curation engine is defined as a simple Python STORMWikiRunner class.

As STORM works in the information curation layer, you need to set up the information retrieval module and the language model module to create a STORMWikiRunner instance. Here is an example using the You.com search engine and OpenAI models.

import os
from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.lm import OpenAIModel
from knowledge_storm.rm import YouRM

lm_configs = STORMWikiLMConfigs()
openai_kwargs = {
    'api_key': os.getenv("OPENAI_API_KEY"),
    'temperature': 1.0,
    'top_p': 0.9,
}
# STORM is an LM system, so different components can be powered by different models
# to reach a good balance between cost and quality.
# As good practice, choose a cheaper/faster model for `conv_simulator_lm`, which is
# used to split queries and synthesize answers in the conversation.
# Choose a more powerful model for `article_gen_lm` to generate verifiable text with citations.
gpt_35 = OpenAIModel(model='gpt-3.5-turbo', max_tokens=500, **openai_kwargs)
gpt_4 = OpenAIModel(model='gpt-4o', max_tokens=3000, **openai_kwargs)
lm_configs.set_conv_simulator_lm(gpt_35)
lm_configs.set_question_asker_lm(gpt_35)
lm_configs.set_outline_gen_lm(gpt_4)
lm_configs.set_article_gen_lm(gpt_4)
lm_configs.set_article_polish_lm(gpt_4)
# Check out the STORMWikiRunnerArguments class for more configurations.
engine_args = STORMWikiRunnerArguments(...)
rm = YouRM(ydc_api_key=os.getenv('YDC_API_KEY'), k=engine_args.search_top_k)
runner = STORMWikiRunner(engine_args, lm_configs, rm)

Currently, our package supports:

  • OpenAIModel, AzureOpenAIModel, ClaudeModel, VLLMClient, TGIClient, TogetherClient, OllamaClient as language model components
  • YouRM, BingSearch, VectorRM as retrieval module components

🌟 PRs for integrating more language models into knowledge_storm/lm.py and search engines/retrievers into knowledge_storm/rm.py are highly appreciated!

The STORMWikiRunner instance can be invoked with the simple run method:

topic = input('Topic: ')
runner.run(
    topic=topic,
    do_research=True,
    do_generate_outline=True,
    do_generate_article=True,
    do_polish_article=True,
)
runner.post_run()
runner.summary()
  • do_research: if True, simulate conversations with different perspectives to collect information about the topic; otherwise, load the results.
  • do_generate_outline: if True, generate an outline for the topic; otherwise, load the results.
  • do_generate_article: if True, generate an article for the topic based on the outline and the collected information; otherwise, load the results.
  • do_polish_article: if True, polish the article by adding a summarization section and (optionally) removing duplicate content; otherwise, load the results.
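
Because each stage can load previously saved results, later stages can be re-run without repeating the research. For example, this sketch (assuming the run above completed with the same output directory) regenerates and polishes the article from the cached research and outline:

runner.run(
    topic=topic,
    do_research=False,           # load the saved conversation/search results
    do_generate_outline=False,   # load the saved outline
    do_generate_article=True,    # regenerate the article
    do_polish_article=True,      # polish the regenerated article
)
runner.post_run()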

Quick Start with Example Scripts

We provide scripts in our examples folder as a quick start to run STORM with different configurations.

To run STORM with gpt family models with default configurations:

  1. We suggest using secrets.toml to set up the API keys. Create a file secrets.toml under the root directory and add the following content (a sketch for loading it in your own script appears after these steps):
    # Set up OpenAI API key.
    OPENAI_API_KEY="your_openai_api_key"
    # If you are using the API service provided by OpenAI, include the following line:
    OPENAI_API_TYPE="openai"
    # If you are using the API service provided by Microsoft Azure, include the following lines instead:
    OPENAI_API_TYPE="azure"
    AZURE_API_BASE="your_azure_api_base_url"
    AZURE_API_VERSION="your_azure_api_version"
    # Set up You.com search API key.
    YDC_API_KEY="your_youcom_api_key"
  2. Run the following command.
    python examples/run_storm_wiki_gpt.py \
    --output-dir $OUTPUT_DIR \
    --retriever you \
    --do-research \
    --do-generate-outline \
    --do-generate-article \
    --do-polish-article
    

To run STORM using your favorite language models or grounding on your own corpus: check out examples/README.md.
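
If you write your own entry script instead of using the provided examples, here is a minimal sketch of loading secrets.toml into environment variables (an assumption about how you might wire this up yourself; the example scripts already handle it, and the third-party toml package is assumed to be installed):

import os
import toml

# Load every key from secrets.toml into the process environment so that
# os.getenv("OPENAI_API_KEY") etc. work as in the API example above.
for key, value in toml.load("secrets.toml").items():
    os.environ[key] = str(value)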

Customization of the Pipeline

If you have installed the source code, you can customize STORM for your own use case. The STORM engine consists of four modules:

  1. Knowledge Curation Module: Collects a broad coverage of information about the given topic.
  2. Outline Generation Module: Organizes the collected information by generating a hierarchical outline for the curated knowledge.
  3. Article Generation Module: Populates the generated outline with the collected information.
  4. Article Polishing Module: Refines and enhances the written article for better presentation.

The interface for each module is defined in knowledge_storm/interface.py, while their implementations are instantiated in knowledge_storm/storm_wiki/modules/*. These modules can be customized according to your specific requirements (e.g., generating sections in bullet-point format instead of full paragraphs).
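
As a rough, self-contained illustration of the bullet-point example (all names below are hypothetical; consult knowledge_storm/interface.py for the actual base classes and method signatures), a customized generation step might render collected facts as bullet points instead of full paragraphs:

# Hypothetical sketch of a customized section writer; not the real STORM interface.
class BulletPointSectionWriter:
    def write_section(self, heading: str, facts: list[str]) -> str:
        # Emit the collected facts as a bulleted list under the section heading.
        bullets = "\n".join(f"- {fact}" for fact in facts)
        return f"# {heading}\n{bullets}"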

Replicate NAACL 2024 Results

Please switch to the branch NAACL-2024-code-backup.

Paper Experiments

The FreshWiki dataset used in our experiments can be found in ./FreshWiki.

Run the following commands under ./src.

Pre-writing Stage

For batch experiment on FreshWiki dataset:

python -m scripts.run_prewriting --input-source file --input-path ../FreshWiki/topic_list.csv --engine gpt-4 --do-research --max-conv-turn 5 --max-perspective 5
  • --engine (choices=[gpt-4, gpt-35-turbo]): the LLM engine used for generating the outline
  • --do-research: if True, simulate conversation to research the topic; otherwise, load the results.
  • --max-conv-turn: the maximum number of questions for each information-seeking conversation
  • --max-perspective: the maximum number of perspectives to be considered; each perspective corresponds to an information-seeking conversation.
    • STORM also uses a general conversation to collect basic information about the topic, so the maximum number of QA pairs is max_turn * (max_perspective + 1). 💡 Reducing max_turn or max_perspective can speed up the process and reduce the cost, but may result in a less comprehensive outline.
    • This parameter has no effect if --disable-perspective is set (perspective-driven question asking is disabled).
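
For example, with --max-conv-turn 5 and --max-perspective 5 as in the command above, a single topic can involve up to 5 * (5 + 1) = 30 question-answer pairs.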

To run the experiment on a single topic:

python -m scripts.run_prewriting --input-source console --engine gpt-4 --max-conv-turn 5 --max-perspective 5 --do-research
  • The script will ask you to enter the Topic and the Ground truth url that will be excluded. If you do not have any url to exclude, leave that field empty.

The generated outline will be saved in {output_dir}/{topic}/storm_gen_outline.txt and the collected references will be saved in {output_dir}/{topic}/raw_search_results.json.

Writing Stage

For batch experiment on FreshWiki dataset:

python -m scripts.run_writing --input-source file --input-path ../FreshWiki/topic_list.csv --engine gpt-4 --do-polish-article --remove-duplicate
  • --do-polish-article: if True, polish the article by adding a summarization section and removing duplicate content if --remove-duplicate is also set.

To run the experiment on a single topic:

python -m scripts.run_writing --input-source console --engine gpt-4 --do-polish-article --remove-duplicate
  • The script will ask you to enter the Topic. Please enter the same topic as the one used in the pre-writing stage.

The generated article will be saved in {output_dir}/{topic}/storm_gen_article.txt and the references corresponding to the citation indices will be saved in {output_dir}/{topic}/url_to_info.json. If --do-polish-article is set, the polished article will be saved in {output_dir}/{topic}/storm_gen_article_polished.txt.

Customize the STORM Configurations

We set up the default LLM configuration in LLMConfigs in src/modules/utils.py. You can use set_conv_simulator_lm(), set_question_asker_lm(), set_outline_gen_lm(), set_article_gen_lm(), set_article_polish_lm() to override the default configuration. These functions take in an instance of dspy.dsp.LM or dspy.dsp.HFModel.
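
For example, here is a sketch of overriding the article-generation model only (the no-argument LLMConfigs constructor and the dspy.OpenAI client below are assumptions; adjust to your installed dspy version and provider):

import dspy
from modules.utils import LLMConfigs  # run from ./src, as described above

llm_configs = LLMConfigs()
# Swap in a stronger model for article generation; other components keep their defaults.
llm_configs.set_article_gen_lm(dspy.OpenAI(model='gpt-4', max_tokens=3000))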

Automatic Evaluation

In our paper, we break down the evaluation into two parts: outline quality and full-length article quality.

Outline Quality

We introduce heading soft recall and heading entity recall to evaluate outline quality. This makes it easier to prototype methods for pre-writing.
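
For intuition, here is a simplified embedding-based soft recall sketch (the exact formulation used in the paper lives in eval/metrics.py and may differ; the sentence-transformers package and model choice are assumptions):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def heading_soft_recall(gt_headings, pred_headings):
    # For each ground-truth heading, take the best cosine similarity against the
    # predicted headings, then average: a soft analogue of recall over headings.
    gt_emb = model.encode(gt_headings, convert_to_tensor=True)
    pred_emb = model.encode(pred_headings, convert_to_tensor=True)
    sims = util.cos_sim(gt_emb, pred_emb)  # shape: [num_gt, num_pred]
    return sims.max(dim=1).values.mean().item()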

Run the following command under ./eval to compute the metrics on the FreshWiki dataset:

python eval_outline_quality.py --input-path ../FreshWiki/topic_list.csv --gt-dir ../FreshWiki --pred-dir ../results --pred-file-name storm_gen_outline.txt --result-output-path ../results/storm_outline_quality.csv

Full-length Article Quality

eval/eval_article_quality.py provides the entry point for evaluating full-length article quality using ROUGE, entity recall, and rubric grading. Run the following command under eval to compute the metrics:

python eval_article_quality.py --input-path ../FreshWiki/topic_list.csv --gt-dir ../FreshWiki --pred-dir ../results --output-dir ../results/storm_article_eval_results --pred-file-name storm_gen_article_polished.txt

Use the Metric Yourself

The similarity-based metrics (i.e., ROUGE, entity recall, and heading entity recall) are implemented in eval/metrics.py.

For rubric grading, we use the prometheus-13b-v1.0 introduced in this paper. eval/evaluation_prometheus.py provides the entry point for using the metric.

Roadmap & Contributions

Our team is actively working on:

  1. Human-in-the-Loop Functionalities: Supporting user participation in the knowledge curation process.
  2. Information Abstraction: Developing abstractions for curated information to support presentation formats beyond the Wikipedia-style report.

If you have any questions or suggestions, please feel free to open an issue or pull request. We welcome contributions to improve the system and the codebase!

Contact person: Yijia Shao and Yucheng Jiang

Acknowledgement

We would like to thank Wikipedia for their excellent open-source content. The FreshWiki dataset is sourced from Wikipedia, licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.

We are very grateful to Michelle Lam for designing the logo for this project and Dekun Ma for leading the UI development.

Citation

Please cite our paper if you use this code or part of it in your work:

@inproceedings{shao2024assisting,
  title={{Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models}},
  author={Yijia Shao and Yucheng Jiang and Theodore A. Kanell and Peter Xu and Omar Khattab and Monica S. Lam},
  year={2024},
  booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)}
}