
Creating an AI-Powered Researcher: A Step-by-Step Guide
How to build a system that searches the internet, then summarizes, analyzes, and explains what it finds
One of the most successful "AI wrappers" is the AI knowledge engine Perplexity. It enables users to research and find relevant information, which a Large Language Model then summarizes and explains as an answer to the user's query.
This post will explain how you can create your own AI researcher that finds contextually relevant information from the internet and provides an appropriate answer.
Tech Stack used
There are lots of tools out there, but here’s the tech stack I went with for this blog:
DSPy – An AI framework with built-in prompt optimization. It’s simple, modular, and easy to work with. You can learn more about it here.
Exa – Offers an easy-to-use API to search the web, find relevant info, and pass it along to an LLM.
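If you want to follow along, both libraries are available on PyPI (package names as of writing; `exa-py` installs the `exa_py` module used below):

```
pip install dspy exa-py
```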
What do search APIs like Exa do?

Search APIs work by indexing the internet. An easy way to think about them is like retrievers that pull information from web pages and links. What sets them apart is how well they do that indexing. That’s their main “differentiator.” For example, Perplexity API and Exa organize and access web content in different ways.
Below is a simplified version of what Exa’s underlying indexing and search technology does.
Exa is a search engine that combines neural and keyword-based search techniques. Instead of relying solely on keyword matching, it uses embeddings, which are vector representations of text, to perform semantic search over its own indexed web content. This allows it to return results that are relevant in meaning even if they don’t contain the exact query terms.
At the core of this is a method called next link prediction, where the model predicts which documents or links are most relevant based on the semantic content of both the query and the indexed documents. This is particularly useful for ambiguous or exploratory queries where users may not know the exact terms to use.
However, for cases where exact matches matter, like proper nouns, technical jargon, or known keywords, Exa also supports traditional keyword search. Both search modes are integrated into a system called Auto Search, which uses a lightweight classifier to decide whether to route the query to the neural search engine, the keyword search engine, or both.
This hybrid setup helps Exa handle a wide range of query types, balancing semantic relevance with precision.
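To make the idea concrete, here is a minimal sketch of how a hybrid neural-plus-keyword setup can route and score queries. This is an illustration of the general technique only, not Exa's actual implementation: the toy `embed` function and the routing heuristic stand in for a trained embedding model and Auto Search's real classifier.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def neural_score(query: str, doc: str) -> float:
    # Cosine similarity between query and document embeddings.
    a, b = embed(query), embed(doc)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query terms that appear verbatim in the document.
    terms = query.lower().split()
    return sum(term in doc.lower() for term in terms) / len(terms)

def route(query: str) -> str:
    # Toy router: quoted phrases or mid-sentence capitals (proper nouns,
    # jargon) lean keyword; open-ended queries lean neural.
    if '"' in query or any(w[:1].isupper() for w in query.split()[1:]):
        return "keyword"
    return "neural"

docs = [
    "Exa is a search engine built on embeddings",
    "Keyword search matches exact terms in documents",
]
query = "semantic approaches to web search"
scorer = keyword_score if route(query) == "keyword" else neural_score
print(sorted(docs, key=lambda d: scorer(query, d), reverse=True)[0])
```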
For a more detailed explanation, you can visit their docs and blog.
How to use Exa’s API
Now for the fun part: using the API to pull contextually relevant information from the web. Exa finds the exact content you’re looking for with three core functionalities:
Search: Find webpages using Exa’s embeddings-based or Google-style keyword search.
from exa_py import Exa

exa = Exa('YOUR_EXA_API_KEY')

results = exa.search_and_contents(
    "Find blogs by FireBirdTech",
    type="auto",      # can be "neural" or "keyword", based on your preference
    num_results=30,   # how many results to return
)
print(results)
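The returned object holds a list of results; each one exposes fields such as the page title, URL, and (when requested) the page text. A quick way to inspect what came back, assuming the `results` object from the call above:

```python
# Iterate over the returned results. Which fields are populated depends
# on what you requested (text, highlights, summaries, etc.).
for r in results.results:
    print(r.title, "->", r.url)
    print((r.text or "")[:200])  # first 200 characters of the page text
```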
Contents: Obtain clean, up-to-date, parsed HTML from Exa search results.
from exa_py import Exa

exa = Exa('YOUR_EXA_API_KEY')

# Crawls the web for the contents of the specified webpages
results = exa.get_contents(
    urls=["https://firebird-technologies.com"],
    text=True
)
print(results)
Find similar links: Based on a link, find and return pages that are similar in meaning.
from exa_py import Exa

exa = Exa('YOUR_EXA_API_KEY')

# Find links similar to the one provided and optionally return
# the contents of those pages
results = exa.find_similar_and_contents(
    url="www.firebird-technologies.com",
    text=True
)
print(results)
The search feature is simple and easy to use. For a full overview of every feature of the API, you can look at Exa’s API reference here.
Design of the AI researcher
I designed a three-step system so the researcher can use more than one API at a time. Here's how it works:
1. API Router: This part decides which of the three APIs to use for the query. Most queries will only need the search API, but it's helpful to also have the option to fetch content or find similar links.
2. API Agent: There are three agents, and each one is set up with a special prompt that works best with one of the APIs.
3. Answer Layer: This is where everything comes together. The system summarizes and analyzes the information to give a final answer to the user.
API Router
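Before any of the DSPy modules below will run, DSPy needs a language model configured. Here is a minimal setup sketch; the model identifier is just an example, use whichever LM you prefer:

```python
import dspy

# Configure the LM that every DSPy module in this post will call.
# "openai/gpt-4o-mini" is an example identifier, not a requirement.
lm = dspy.LM("openai/gpt-4o-mini", api_key="YOUR_OPENAI_API_KEY")
dspy.configure(lm=lm)
```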
Below is the prompt for the API router:
class api_router(dspy.Signature):
    """
    You are an intelligent router that decides which Exa API(s) to call to answer a user query. Choose the fewest number of APIs needed to produce a relevant and useful result.

    There are three available APIs:
    search: Use this to discover new documents based on a topic, keyword, or research question.
    contents: Use this when the user provides one or more known URLs and wants full text, summary, or highlights.
    findSimilar: Use this when the user gives a URL and wants to find related links.

    You can also combine APIs in sequence if needed, but prefer the minimal number of calls.
    Return your decision as a list of APIs with a short explanation for each.

    Examples:
    Query: Summarize this paper: https://arxiv.org/abs/2307.06435
    api_selection: contents
    selection_reason: Get the summary and highlights of the provided paper.

    Query: What are the latest trends in scalable LLM training?
    api_selection: search

    Query: Find related blog posts to this article: [URL]
    api_selection: findSimilar
    selection_reason: Find related content based on the URL.

    Query: Give me summaries of new papers on few-shot learning
    api_selection: search,contents
    selection_reason: The user wants summaries of new papers on few-shot learning.

    Only use findSimilar if a URL is provided and the user wants to find related links.
    """
    query = dspy.InputField(description="The search query the user wants to search for")
    api_selection = dspy.OutputField(description="The API(s) to use for the query - search, contents or findSimilar")
    selection_reason = dspy.OutputField(description="The reason for the API selection")

# Using chain of thought for better prompting
api_router_agent = dspy.ChainOfThought(api_router)
response = api_router_agent(query="Find blog posts by FireBirdTech")
Response from the program
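The returned Prediction exposes the signature's output fields directly, so you can also inspect the routing decision in code:

```python
# ChainOfThought returns a Prediction carrying the output fields
# defined in the signature (plus its intermediate reasoning).
print(response.api_selection)     # e.g. "search"
print(response.selection_reason)  # short justification for the choice
```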

API Agent
All three API agents share a similar structure: each is fed the relevant part of Exa’s Python documentation, and each has one input (the query) and one output (the API settings).
class search_api_agent(dspy.Signature):
    """
    You are an intelligent API assistant that generates structured settings for calling Exa's search endpoint.
    Your role is to interpret a user's query and intent, and return a complete Python dictionary of keyword arguments for the search call.

    The search parameters are:

    **Basic Parameters**

    | **Input Parameter** | **Type** | **Description** | **Default** |
    |---|---|---|---|
    | `query` | `str` | The input query string. | Required |
    | `text` | `Union[TextContentsOptions, Literal[True]]` | If provided, includes the full text of the content in the results. | None |
    | `highlights` | `Union[HighlightsContentsOptions, Literal[True]]` | If provided, includes highlights of the content in the results. | None |
    | `num_results` | `Optional[int]` | Number of search results to return. | 10 |
    | `include_domains` | `Optional[List[str]]` | List of domains to include in the search. | None |
    | `exclude_domains` | `Optional[List[str]]` | List of domains to exclude from the search. | None |
    | `start_crawl_date` | `Optional[str]` | Results will only include links crawled after this date. | None |
    | `end_crawl_date` | `Optional[str]` | Results will only include links crawled before this date. | None |
    | `start_published_date` | `Optional[str]` | Results will only include links published after this date. | None |
    | `end_published_date` | `Optional[str]` | Results will only include links published before this date. | None |
    | `type` | `Optional[str]` | The type of search: "keyword", "neural", or "auto". | "auto" |
    | `category` | `Optional[str]` | A data category to focus on when searching (e.g., company, research paper, news, etc.). | None |

    ---
    **Your Task:**
    Given a user query and any known preferences (e.g. prefer research papers, want summaries only, exclude news, etc.), generate a JSON-compatible Python dictionary of keyword arguments for the search call.

    💡 Example Input:
    > Query: "Latest developments in LLM capabilities"

    ✅ Example Output:
    ```python
    {
        "query": "Latest developments in LLM capabilities",
        "type": "auto",
        "category": "research paper",
        "num_results": 25,
        "text": True,
        "summary": True,
        "highlights": False
    }
    ```

    Unless specified otherwise, always keep summary True.
    """
    query = dspy.InputField(description="The search query the user wants to search for")
    api_settings = dspy.OutputField(description="The settings for the search API in python dictionary format")

# ChainOfThought prompting for better responses
search_api_AI = dspy.ChainOfThought(search_api_agent)
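Calling the agent returns the settings as a string, which needs to be parsed into a real dict before it can be passed to Exa. A quick sketch, using the same normalization the combined module below applies:

```python
import json

settings = search_api_AI(query="Latest developments in LLM capabilities")

# The output field is a string; convert Python literals to JSON
# before parsing.
payload = json.loads(
    settings.api_settings
    .replace("'", '"')
    .replace("True", "true")
    .replace("False", "false")
)
print(payload["query"], payload.get("num_results"))
```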

The other two are constructed the same way, with the only difference being the API documentation used.
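Since `content_api_AI` and `find_similar_api_AI` are referenced by the combined module later but not shown, here is a minimal stand-in skeleton following the same pattern. The docstrings are abridged; in practice you would paste in the contents and findSimilar parameter docs just as the search agent embeds the search docs:

```python
class contents_api_agent(dspy.Signature):
    """Generate a Python dictionary of keyword arguments for Exa's
    get_contents call (abridged: the full prompt embeds the contents
    parameter documentation)."""
    query = dspy.InputField(description="The user query containing the URL(s) to fetch")
    api_settings = dspy.OutputField(description="The settings for the contents API in python dictionary format")

class find_similar_api_agent(dspy.Signature):
    """Generate a Python dictionary of keyword arguments for Exa's
    find_similar call (abridged: the full prompt embeds the findSimilar
    parameter documentation)."""
    query = dspy.InputField(description="The user query containing the reference URL")
    api_settings = dspy.OutputField(description="The settings for the findSimilar API in python dictionary format")

content_api_AI = dspy.ChainOfThought(contents_api_agent)
find_similar_api_AI = dspy.ChainOfThought(find_similar_api_agent)
```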
Answer Agent
class answer_summarize_analyze(dspy.Signature):
    """
    ### 🔧 **Prompt for an AI Research Agent**
    You are a research assistant AI. Your task is to read through a list of website content and generate a helpful summary based on a user query.

    ### User Query:
    {query}

    ### Source Data:
    {web_data}

    ---
    Instructions:
    - Analyze all the web_data provided.
    - Focus only on content relevant to the query.
    - Provide a concise and informative summary of the findings.
    - Group or list items clearly if multiple blog posts or sources are referenced.
    - Do not include unrelated metadata (like ratings, loading GIFs, etc.).
    - Provide links to the original posts when helpful.
    - Be accurate, neutral, and professional in tone.

    Respond only with the analysis, summary, and answer to the user query!

    ---
    ### 👇 Example with your provided input:
    **Query:**
    > Find blog posts around FireBirdTech and tell me what they contain.

    **Resulting AI Response:**
    Here are some recent blog posts related to FireBirdTech:
    1. **[Jaybird 5.0.5 Released](https://www.firebirdnews.org/jaybird-5-0-5-released/)**
       A new version of the Jaybird JDBC driver introduces bug fixes and minor feature enhancements:
       - Fixed precision issues for `NUMERIC` and `DECIMAL` columns.
       - Improved support for auto-increment columns and metadata queries.
       - Added support for Firebird 5.0 features and updated Java compatibility.
    2. **[Firebird Monitor Tool – Version 2](https://www.firebirdnews.org/real-time-firebird-monitor-for-firebird-server-2-5-3-0-4-0-5-0-2/)**
       A real-time database monitoring tool with trace and audit capabilities. A 10% discount is offered until June 2024.
    3. **[Database Workbench 6.5.0 Released](https://www.firebirdnews.org/database-workbench-6-5-0-released/)**
       New features include Firebird 5, MySQL 8.3, and PostgreSQL 16 support. A new SQLite module has been introduced for enhanced compatibility.
    4. **[DBD::Firebird Perl Extension v1.38](https://www.firebirdnews.org/perl-extension-dbdfirebird-version-1-38-is-released/)**
       Updates include support for Firebird's BOOLEAN data type and improvements to compatibility with older Perl and Firebird versions.
    5. **[Kotlin Multiplatform Firebird SQL Client Library](https://www.firebirdnews.org/firebird-sql-client-library-for-kotlin-multiplatform/)**
       A client library for accessing Firebird databases from Kotlin Multiplatform environments, targeting JVM, Android, and native.
    6. **[IBProvider v5.37 Released](https://www.firebirdnews.org/release-of-ibprovider-v5-37/)**
       A Firebird database provider for C++ with updates including improved error message handling, ICU changes, and thread pool improvements.
    7. **[Firebird Export Tool](https://www.firebirdnews.org/firebird-export/)**
       Open-source tool to export Firebird databases to formats like CSV and JSON, supporting selective exports and blob data.
    8. **[RedExpert 2024.04 Released](https://www.firebirdnews.org/redexpert-2024-04-has-been-released/)**
       A new version of the RedExpert tool for managing and developing Firebird databases.
    9. **[libpthread Compatibility Warning for Firebird 2.5](https://www.firebirdnews.org/libpthread-compatibility-problem-with-firebird-2-5/)**
       Compatibility issues have been found between newer libpthread versions and Firebird 2.5 on Linux. Patching or upgrading is recommended.
    10. **[Hopper Debugger v2.3 Released](https://www.firebirdnews.org/stored-routine-debugger-hopper-v2-3-released/)**
        A debugging tool for stored routines with new Firebird 5 support and general bug fixes.
    """
    query = dspy.InputField(description="The query that the user wants answered")
    web_data = dspy.InputField(description="The web data to analyze for the answer")
    answer = dspy.OutputField(description="The answer to the user query")

# Name the instance distinctly so it doesn't shadow the signature class
answer_summarize_analyze_AI = dspy.ChainOfThought(answer_summarize_analyze)
response = answer_summarize_analyze_AI(
    query="find blog posts around FireBirdTech and tell me what they contain",
    web_data=results.results[1].text,  # `results` from one of the Exa calls above
)
Combining all components
Now we just need to combine everything into one system to answer user queries.
import json

class research_agent(dspy.Module):
    """A research agent that routes queries to appropriate APIs and returns analyzed results"""

    def __init__(self):
        super().__init__()
        # Initialize API components
        self.api_router = api_router_agent           # Routes queries to appropriate APIs
        self.search_api = search_api_AI              # Handles search API calls
        self.contents_api = content_api_AI           # Handles content API calls
        self.find_similar_api = find_similar_api_AI  # Handles similarity search
        self.answer_summarize_analyze = answer_summarize_analyze_AI  # Analyzes results

    def _parse_settings(self, settings_str):
        """Convert the agent's stringified Python dict into a real dict."""
        return json.loads(
            settings_str.replace("'", '"').replace("True", "true").replace("False", "false")
        )

    def forward(self, query):
        """
        Process a query through appropriate APIs and return analyzed results

        Args:
            query (str): The user's research query

        Returns:
            str: Analyzed and summarized answer based on API results
        """
        # Route query to determine which APIs to use
        route = self.api_router(query=query)
        api_result = []

        # Process query through each selected API
        for api_selection in route.api_selection.split(','):
            print(api_selection)
            if "search" in api_selection:
                # Handle search API
                api_settings = self._parse_settings(self.search_api(query=query).api_settings)
                print(api_settings)
                api_result.append(exa.search_and_contents(**api_settings))
            elif "contents" in api_selection:
                # Handle contents API
                api_settings = self._parse_settings(self.contents_api(query=query).api_settings)
                print(api_settings)
                api_result.append(exa.get_contents(**api_settings))
            elif "findSimilar" in api_selection:
                # Handle similarity search API
                api_settings = self._parse_settings(self.find_similar_api(query=query).api_settings)
                print(api_settings)
                api_result.append(exa.find_similar(**api_settings))

        # Generate final analyzed answer from the collected API results
        answer = self.answer_summarize_analyze(query=query, web_data=str(api_result))
        return answer.answer
researcher = research_agent()
result = researcher(query="How is Tesla's stock doing since 2020, get me new articles and press release links")
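One fragile spot worth flagging: the string replacements used to coerce the agent's output into JSON will mangle any value that legitimately contains a single quote or the words "True"/"False". A safer alternative is to parse the output as a Python literal; this is a suggested hardening sketch, not part of the original pipeline:

```python
import ast

def parse_settings_safely(settings_str: str) -> dict:
    # ast.literal_eval understands Python dict syntax directly
    # (single quotes, True/False/None), so no string surgery is needed.
    parsed = ast.literal_eval(settings_str.strip())
    if not isinstance(parsed, dict):
        raise ValueError("expected a dict of API settings")
    return parsed
```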
Here is an example of the system using the find-similar-links API:
Thank you for reading! As a follow-on to this project, I will be adding a 'deep research' feature to this system.
Please follow us on LinkedIn, Medium, and Substack.