Build an AI Agent Workflow to Create Custom Financial Datasets — No Scraping, No APIs

In this tutorial, I’ll show you how to build an AI agent workflow that creates custom, company-level financial datasets — all without using stock data APIs or traditional scraping.

We’ll use agentic AI to gather unique, internet-based data points for public companies, using only a stock ticker as input. These signals can include:

  • Executive departures
  • Board composition
  • Employee sentiment
  • Number of open job postings

And you can define your own variables — whatever you believe moves stock prices.


Why This Matters

Web-search-enabled agents can generate custom, timely data that traditional datasets may not yet include. You can:

  • Track CEO tenure trends across your watchlist
  • Measure employee sentiment using Glassdoor
  • Get real-time job openings as a proxy for growth
  • Compare institutional ownership over time

Customize Based on Your Market Theory

You decide what data matters:

  • Leadership changes?
  • Hiring patterns?
  • Board composition?

Use your hypothesis to define the fields you want the AI to extract from the web.
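
For example, if your thesis centers on leadership stability, you could define a schema with your own fields. A minimal sketch (these fields are hypothetical, not part of the tutorial's dataset):

from pydantic import BaseModel

class LeadershipSignals(BaseModel):
    ticker: str
    cfo_tenure_years: float         # hypothetical: current CFO's tenure in years
    insider_buys_last_quarter: int  # hypothetical: reported insider purchases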


Tech Stack Used

  • Python 3.11 (via Conda)
  • OpenAI Agents SDK
  • Perplexity Sonar (via OpenAI client wrapper)
  • Pandas for tabular output
  • WebSearchTool (included in openai-agents)

Environment Setup

Create a Conda environment and install the dependencies from your terminal:

# Create Conda environment
conda create -n ai-agent-finance python=3.11 -y

# Activate
conda activate ai-agent-finance

# Install dependencies
pip install openai-agents==0.0.11 pydantic==2.11.3 pandas==2.2.3

Create a Jupyter notebook in VS Code.

Add your API keys via environment variables:

# For the OpenAI agent
export OPENAI_API_KEY=your_openai_key

# For the Perplexity Sonar agent
export PPLX_API_KEY=your_perplexity_key
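
If you're on the Windows Command Prompt rather than a Unix shell, use set instead (this applies to the current session only):

rem Windows Command Prompt equivalent (current session only)
set OPENAI_API_KEY=your_openai_key
set PPLX_API_KEY=your_perplexity_key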

1. OpenAI Models

OpenAI Agent: Create a dataset with GPT-4o-mini + web search

Step 1: Import necessary libraries and set up OpenAI API key

import os
# Uncomment to set the key here instead of via environment variables:
# os.environ["OPENAI_API_KEY"] = "OPENAI API KEY GOES HERE"
from agents import Agent, Runner, WebSearchTool
from pydantic import BaseModel
import pandas as pd

Step 2: Define CompanyInfo schema

class CompanyInfo(BaseModel):
    company_name: str
    ticker: str
    sector: str
    founding_year: int
    number_of_employees: int
    ceo_tenure_years: float             # years in current role
    ceo_count_since_2010: int           # distinct CEOs since Jan 1, 2010
    average_glassdoor_rating: float     # 1-5 scale
    institutional_ownership_pct: float  # 0-100
    board_member_count: int
    job_positions_open: int             # global open postings

Step 3: Instantiate WebSearchTool

# 1) Instantiate the search tool
web_search = WebSearchTool()

Step 4: Create the OpenAI Agent

# 2) Create the Agent
agent = Agent(
    name="CompanyInfoAgent",
    instructions="""
    For a given U.S.-listed company ticker, use the WebSearchTool to find:
    - Full company name
    - Ticker symbol
    - Sector/industry
    - Year the company was founded
    - Current total number of employees
    - Current CEO’s tenure in years
    - Number of different CEOs the company has had since January 1, 2010
    - Average employee rating on Glassdoor
    - Percentage of shares held by institutional investors
    - Total number of board members
    - Current number of open job positions (globally)
    Then return exactly the JSON matching the CompanyInfo schema.
    """,
    tools=[web_search],
    output_type=CompanyInfo,
    model='gpt-4o-mini',
)
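
Before looping over a full watchlist, it's worth a quick one-ticker sanity check in a notebook cell (the ticker below is just an example):

# Single-ticker test; top-level await works inside Jupyter
result = await Runner.run(agent, "NVDA")
result.final_output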

Step 5: Loop over a list of company tickers

# 3) Loop over a list of tickers
tickers = [
    "AAPL", 
    "MSFT",
    "GOOGL",
    "AMZN", 
    "TSLA" 
]

all_company_data = []
for ticker in tickers:
    info = await Runner.run(agent, ticker)
    print(info.final_output)
    all_company_data.append(info.final_output.model_dump())
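
Note that top-level await only works in a notebook. If you move this into a plain .py script, here is a sketch using the SDK's synchronous runner:

# Script-friendly variant of the same loop
all_company_data = []
for ticker in tickers:
    info = Runner.run_sync(agent, ticker)
    all_company_data.append(info.final_output.model_dump())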

Step 6: Create a Pandas DataFrame from the collected data

# 4) Create a Pandas DataFrame from the collected data
df = pd.DataFrame(all_company_data)
df
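
From here you can persist the dataset; re-running the notebook on a schedule turns these one-off snapshots into a time series (the filename is arbitrary):

# Save the dataset to disk
df.to_csv("company_signals_openai.csv", index=False)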

2. Perplexity Sonar Models

Perplexity Agent: Repeat the process using Sonar Pro

Step 1: Import necessary libraries and set up the Perplexity Sonar API key

import os
# Read the key from the environment (or paste it in directly)
PPLX_API_KEY = os.environ.get("PPLX_API_KEY", "PERPLEXITY SONAR API KEY GOES HERE")
from agents import Agent, Runner, AsyncOpenAI, OpenAIChatCompletionsModel
from pydantic import BaseModel
import pandas as pd

Step 2: Define the CompanyInfo schema (the same schema as in the OpenAI section; skip this cell if it's already defined in your notebook)

class CompanyInfo(BaseModel):
    company_name: str
    ticker: str
    sector: str
    founding_year: int
    number_of_employees: int
    ceo_tenure_years: float
    ceo_count_since_2010: int
    average_glassdoor_rating: float
    institutional_ownership_pct: float
    board_member_count: int
    job_positions_open: int

Step 3: Set up the Perplexity client

# 1) Set up the Perplexity client (OpenAI-compatible endpoint)
perplexity_client = AsyncOpenAI(base_url="https://api.perplexity.ai", api_key=PPLX_API_KEY)

Step 4: Create the Perplexity Sonar Agent

# 2) Create the Agent (Sonar has built-in web search, so no WebSearchTool is needed)
perplexity_agent = Agent(
    name="CompanyInfoAgent_pplx",
    instructions="""
    For a given U.S.-listed company ticker, search the web to find:
    - Full company name
    - Ticker symbol
    - Sector/industry
    - Year the company was founded
    - Current total number of employees
    - Current CEO’s tenure in years
    - Number of different CEOs the company has had since January 1, 2010
    - Average employee rating on Glassdoor
    - Percentage of shares held by institutional investors
    - Total number of board members
    - Current number of open job positions (globally)
    Then return exactly the JSON matching the CompanyInfo schema.
    """,
    output_type=CompanyInfo,
    model=OpenAIChatCompletionsModel(
        model="sonar-pro",
        openai_client=perplexity_client),  # route requests through Perplexity
)

Step 5: Loop over a list of company tickers

# 3) Loop over a list of tickers
tickers = [
    "AAPL",  
    "MSFT", 
    "GOOGL", 
    "AMZN", 
    "TSLA"
]
all_company_data = []
for ticker in tickers:
    info = await Runner.run(perplexity_agent, ticker)
    print(info.final_output)
    all_company_data.append(info.final_output.model_dump())

Step 6: Create a Pandas DataFrame from the collected data

# 4) Create a Pandas DataFrame from the collected data
df_pplx = pd.DataFrame(all_company_data)

print("Perplexity Sonar Pro")
df_pplx

Comparison: OpenAI vs. Perplexity

Which agent produced more accurate or complete data?

  • OpenAI GPT-4o-mini had better company summaries
  • Perplexity Sonar found fresher job opening stats
  • Glassdoor ratings varied slightly; worth cross-checking (see the diff sketch below)
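
To spot disagreements systematically, you can align both DataFrames on ticker and diff them (this assumes both runs returned the same five tickers):

# Show only the cells where the two models disagree
a = df.set_index("ticker").sort_index()
b = df_pplx.set_index("ticker").sort_index()
a.compare(b, result_names=("openai", "perplexity"))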

How to Improve These Results

  • Tune your agent instructions; clearer field definitions lead to better parsing
  • Add fallback prompts for when data is missing
  • Add retry logic for failed or partial results (see the sketch below)
  • Merge with structured APIs (like Yahoo Finance)
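
A minimal retry sketch, assuming you simply want to re-run the agent a few times before giving up (max_attempts is arbitrary):

# Retry wrapper: re-run the agent on transient failures
async def run_with_retries(agent, ticker, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            result = await Runner.run(agent, ticker)
            return result.final_output
        except Exception as exc:
            print(f"{ticker}: attempt {attempt} failed ({exc})")
    return None  # caller decides how to handle a missing row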

Final Thoughts

The market hasn’t priced in these kinds of custom AI-built datasets yet.

If you’re early, you can build smarter signals from public data using cheap inference—before everyone else is doing the same.

This workflow is reproducible and extensible, and once it's set up, re-running it on a new watchlist takes a single pass through the notebook.
