
Web Scraping

zyx provides the scrape() function, a quick way to scrape the web for information using agentic reasoning through LLMs.

Node-based scraping, for more complex tasks, is coming soon.


Simple Scraping

from zyx import scrape

result = scrape(
    "The latest & hottest AI hardware",
    model = "openai/gpt-4o",
    workers = 5,
    max_results = 3
)

print(result)
Output
...
'summary': "The AI hardware market has seen rapid advancements and fierce competition, with several key players releasing
innovative products to meet the growing demand for AI capabilities. Here are the most notable companies and their contributions to AI hardware
as of 2024:\n\n1. **Nvidia**: A leader in the AI hardware space, Nvidia's chips like the A100 and H100 are critical for data centers. The
recent introduction of the H200 and B200 chips, along with the Grace Hopper superchip, emphasizes Nvidia's focus on performance and
scalability in AI applications.\n\n2. **AMD**: AMD continues to compete with Nvidia, having launched its MI300 series of AI chips, which rival
Nvidia's offerings in terms of memory capacity and bandwidth. The new Zen 5 CPU microarchitecture enhances AMD's capabilities in AI
workloads.\n\n3. **Intel**: Intel has introduced its Xeon 6 processors and the Gaudi 3 AI accelerator, which aims to improve processing
efficiency. Intel's longstanding presence in the CPU market is now complemented by its focus on AI-specific hardware.\n\n4. **Alphabet
(Google)**: With its Cloud TPU v5p and the recently announced Trillium TPU, Alphabet is committed to developing powerful AI chips tailored for
large-scale machine learning tasks.\n\n5. **Amazon Web Services (AWS)**: AWS has shifted towards chip production with its Trainium and
Inferentia chips, designed for training and deploying machine learning models, respectively. Their latest instance types offer significant
improvements in memory and processing power.\n\n6. **Cerebras Systems**: Known for its wafer-scale engine, the WSE-3, Cerebras has achieved
remarkable performance with its massive core count and memory bandwidth, making it a strong contender in the AI hardware market.\n\n7.
**IBM**: IBM's AI Unit and the upcoming NorthPole chip focus on energy efficiency and performance improvements, aiming to compete with
existing AI processors.\n\n8. **Qualcomm**: Although newer to the AI hardware scene, Qualcomm's Cloud AI 100 chip has shown competitive
performance against Nvidia, particularly in data center applications.\n\n9. **Tenstorrent**: Founded by a former AMD architect, Tenstorrent
focuses on scalable AI hardware solutions, including its Wormhole processors.\n\n10. **Emerging Startups**: Companies like Groq, SambaNova
Systems, and Mythic are also making strides in the AI hardware space, offering specialized solutions for AI workloads.\n\nIn summary, the
competitive landscape for AI hardware is characterized by rapid innovation, with established tech giants and emerging startups alike vying to
create the most powerful and efficient AI chips. This ongoing evolution is driven by the increasing demands of AI applications, particularly
in data centers and for large-scale machine learning models.",
  'evaluation': {
      'is_successful': True,
      'explanation': 'The summary effectively captures the current landscape of AI hardware as of 2024, highlighting key players and
their contributions. It provides relevant details about the advancements made by major companies like Nvidia, AMD, Intel, and others, which
directly relates to the query about the latest and hottest AI hardware. The structure is clear, listing companies and their innovations,
making it easy for readers to understand the competitive dynamics in the AI hardware market. Overall, the summary is comprehensive, relevant,
and well-organized, making it a successful response to the query.',
      'content': None
  }
}
},
messages=[]
)
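
Continuing from the example above: scrape() returns a Document object. Going by how the Document is constructed in the source listing further down, the summary text is available on content and the crawl details (query, fetched URLs, model, and the workflow state shown above) live under metadata. These key names are taken from that listing, so treat this as an indicative sketch rather than a stable contract.

print(result.content)                # the generated summary text
print(result.metadata["urls"])       # every URL that was fetched
print(result.metadata["workflow"])   # per-step state, including the evaluation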

API Reference

Scrapes the web for topics & content about multiple queries, generates a well-written summary, and returns a Document object.

Parameters:

query (str)
    The initial search query. Required.

num_queries (int)
    Number of queries to generate based on the initial query. Default: 5

max_results (Optional[int])
    Maximum number of search results to process. Default: 5

workers (int)
    Number of worker threads to use. Default: 5

model (str)
    The model to use for completion. Default: 'gpt-4o-mini'

client (Literal['openai', 'litellm'])
    The client to use for completion. Default: 'openai'

api_key (Optional[str])
    The API key to use for completion. Default: None

base_url (Optional[str])
    The base URL to use for completion. Default: None

mode (InstructorMode)
    The mode to use for completion. Default: 'tool_call'

max_retries (int)
    The maximum number of retries to use for completion. Default: 3

temperature (float)
    The temperature to use for completion. Default: 0.5

run_tools (Optional[bool])
    Whether to run tools for completion. Default: False

tools (Optional[List[ToolType]])
    The tools to use for completion. Default: None

Returns:

Document
    A Document object containing the summary and metadata.
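
As a rough illustration of how the parameters above fit together (the query string and values below are arbitrary examples, not recommendations):

from zyx import scrape

result = scrape(
    "Open-source vector databases",  # initial search query
    num_queries = 3,        # generate 3 related queries from it
    max_results = 5,        # process up to 5 search results per query
    workers = 8,            # fetch pages with 8 threads
    model = "gpt-4o-mini",  # default completion model
    temperature = 0.2       # lower temperature for a more focused summary
)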

Source code in zyx/resources/completions/agents/scrape.py
def scrape(
    query: str,  # Single query input
    num_queries: int = 5,  # Number of queries to generate
    max_results: Optional[int] = 5,
    workers: int = 5,
    model: str = "gpt-4o-mini",
    client: Literal["openai", "litellm"] = "openai",
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    mode: InstructorMode = "tool_call",
    response_model: Optional[BaseModel] = None,
    max_retries: int = 3,
    temperature: float = 0.5,
    run_tools: Optional[bool] = False,
    tools: Optional[List[ToolType]] = None,
    parallel_tool_calls: Optional[bool] = False,
    tool_choice: Optional[Literal["none", "auto", "required"]] = "auto",
    verbose: Optional[bool] = False,
    **kwargs,
) -> Document:
    """
    Scrapes the web for topics & content about multiple queries, generates a well-written summary, and returns a Document object.

    Parameters:
        query: The initial search query.
        num_queries: Number of queries to generate based on the initial query.
        max_results: Maximum number of search results to process.
        workers: Number of worker threads to use.
        model: The model to use for completion.
        client: The client to use for completion.
        api_key: The API key to use for completion.
        base_url: The base URL to use for completion.
        mode: The mode to use for completion.
        max_retries: The maximum number of retries to use for completion.
        temperature: The temperature to use for completion.
        run_tools: Whether to run tools for completion.
        tools: The tools to use for completion.

    Returns:
        A Document object containing the summary and metadata.
    """

    warnings.warn(
        "The scrape function will no longer be updated."
        "Go to https://github.com/unclecode/crawl4ai for an incredibly robust & feature-rich web-scraping tool.",
        DeprecationWarning,
    )

    import threading
    from bs4 import BeautifulSoup

    completion_client = Client(
        api_key=api_key, base_url=base_url, provider=client, verbose=verbose
    )

    # Generate multiple queries based on the initial query
    query_list = generate(
        target=QueryList,
        instructions=f"Generate {num_queries} related search queries based on the initial query: '{query}'",
        n=1,
        model=model,
        api_key=api_key,
        base_url=base_url,
        client=client,
        verbose=verbose,
    ).queries

    workflow = ScrapeWorkflow(query=query)  # Use the initial query for workflow

    if verbose:
        print(f"Starting scrape for queries: {query_list}")

    all_search_results = []
    all_urls = []

    for query in query_list:
        # Step 1: Use web_search() to get search results
        workflow.current_step = ScrapingStep.SEARCH
        search_results = web_search(query, max_results=max_results)
        all_search_results.extend(search_results)
        urls = [result["href"] for result in search_results if "href" in result]
        all_urls.extend(urls)

        if verbose:
            print(f"Found {len(urls)} URLs for query: {query}")

    workflow.search_results = all_search_results

    # Step 2: Define a function to fetch and parse content from a URL
    def fetch_content(url: str) -> str:
        try:
            headers = {"User-Agent": "Mozilla/5.0"}
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, "html.parser")
            texts = soup.find_all(text=True)
            visible_texts = filter(tag_visible, texts)
            content = " ".join(t.strip() for t in visible_texts)
            return content
        except Exception as e:
            if verbose:
                print(f"Error fetching {url}: {e}")
            return ""

    # Helper function to filter visible text
    from bs4.element import Comment

    def tag_visible(element):
        if element.parent.name in [
            "style",
            "script",
            "head",
            "title",
            "meta",
            "[document]",
        ]:
            return False
        if isinstance(element, Comment):
            return False
        return True

    # Step 3: Use ThreadPoolExecutor to fetch content in parallel
    workflow.current_step = ScrapingStep.FETCH
    contents = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        future_to_url = {executor.submit(fetch_content, url): url for url in all_urls}
        for future in future_to_url:
            content = future.result()
            if content:
                contents.append(content)

    workflow.fetched_contents = contents

    if verbose:
        print(f"Collected content from {len(contents)} pages")

    # Step 4: Combine the content
    workflow.current_step = ScrapingStep.SUMMARIZE
    combined_content = "\n\n".join(contents)

    # Step 4.5: If Response Model is provided, return straight away
    if response_model:
        return completion_client.completion(
            messages=[
                {"role": "user", "content": "What is our current scraped content?"},
                {"role": "assistant", "content": combined_content},
                {
                    "role": "user",
                    "content": "Only extract the proper content from the response & append into the response model.",
                },
            ],
            model=model,
            response_model=response_model,
            mode=mode,
            max_retries=max_retries,
            temperature=temperature,
            run_tools=run_tools,
            tools=tools,
            parallel_tool_calls=parallel_tool_calls,
            tool_choice=tool_choice,
            verbose=verbose,
            **kwargs,
        )

    # Step 5: Use the completion function to generate a summary
    # Prepare the prompt
    system_prompt = (
        "You are an AI assistant that summarizes information gathered from multiple web pages. "
        "Ensure that all links parsed are from reputable sources and do not infringe any issues. "
        "Provide a comprehensive, well-written summary of the key points related to the following query."
    )
    user_prompt = f"Query: {query}\n\nContent:\n{combined_content}\n\nSummary:"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    # Call the completion function
    response = completion_client.completion(
        messages=messages,
        model=model,
        mode=mode,
        max_retries=max_retries,
        temperature=temperature,
        run_tools=run_tools,
        tools=tools,
        parallel_tool_calls=parallel_tool_calls,
        tool_choice=tool_choice,
        verbose=verbose,
        **kwargs,
    )

    # Extract the summary
    summary = response.choices[0].message.content
    workflow.summary = summary

    # Step 6: Evaluate
    workflow.current_step = ScrapingStep.EVALUATE
    evaluation_prompt = (
        f"Evaluate the quality and relevance of the following summary for the query: '{query}'\n\n"
        f"Summary:\n{summary}\n\n"
        "Ensure that all links parsed are from reputable sources and do not infringe any issues. "
        "Provide an explanation of your evaluation and determine if the summary is successful or needs refinement."
    )

    evaluation_response = completion_client.completion(
        messages=[
            {"role": "system", "content": "You are an expert evaluator of summaries."},
            {"role": "user", "content": evaluation_prompt},
        ],
        model=model,
        response_model=StepResult,
        mode=mode,
        max_retries=max_retries,
        temperature=temperature,
    )

    workflow.evaluation = evaluation_response

    # Step 7: Refine if necessary
    if not evaluation_response.is_successful:
        workflow.current_step = ScrapingStep.REFINE
        refine_prompt = (
            f"The previous summary for the query '{query}' needs improvement.\n\n"
            f"Original summary:\n{summary}\n\n"
            f"Evaluation feedback:\n{evaluation_response.explanation}\n\n"
            "Ensure that all links parsed are from reputable sources and do not infringe any issues. "
            "Please provide an improved and refined summary addressing the feedback."
        )

        refined_response = completion_client.completion(
            messages=[
                {
                    "role": "system",
                    "content": "You are an expert at refining and improving summaries.",
                },
                {"role": "user", "content": refine_prompt},
            ],
            model=model,
            mode=mode,
            max_retries=max_retries,
            temperature=temperature,
        )

        summary = refined_response.choices[0].message.content

    if verbose:
        print("Generated summary:")
        print(summary)

    # Create a Document object
    document = Document(
        content=summary,
        metadata={
            "query": query,
            "urls": all_urls,
            "model": model,
            "client": client,
            "workflow": workflow.model_dump(),
        },
    )

    return document
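
The signature above also accepts a response_model argument (a Pydantic BaseModel) that is not covered in the parameter list. As Step 4.5 in the source shows, when it is provided the scraped content is extracted directly into that model and returned instead of a Document. A minimal sketch, assuming a hypothetical HardwareReport schema:

from typing import List
from pydantic import BaseModel
from zyx import scrape

class HardwareReport(BaseModel):
    # hypothetical schema, purely for illustration
    vendors: List[str]
    highlights: List[str]

report = scrape(
    "The latest & hottest AI hardware",
    model = "gpt-4o-mini",
    response_model = HardwareReport  # triggers the early-return path in Step 4.5
)
print(report.vendors)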