Detect Website Tech Stacks with Python: Automate Sales Prospecting
If you sell developer tools, SaaS integrations, or anything where the buyer's technology matters, you already know the pain: you have a list of 500 company domains and zero idea which ones are actually worth reaching out to. Cold outreach without technographic context is spam. Outreach with technographic context is relevant.
This article shows you how to build a Python pipeline that scans your prospect list, detects each site's technology stack, loads the results into a pandas DataFrame, and filters leads by technology. By the end you will have a script that turns a flat CSV of domains into a qualified, technology-filtered lead list — automatically, for $9/month instead of Wappalyzer's $250/month.
The Problem with Manual Tech Stack Research
The typical sales workflow looks like this: open a prospect's website, right-click → View Source, squint at script tags, maybe install the Wappalyzer browser extension, tab over to your CRM, manually enter "React, Stripe, Cloudflare." Repeat 499 times.
That is not research, that is data entry. It is also error-prone — the browser extension misses server-side signals, HTTP headers, and meta tags that the HTML source doesn't expose directly. And it doesn't scale. A sales engineer spending two hours on technographic research per 50 leads is burning significant time that could go into actual selling.
The fix is to make tech stack detection a programmatic step in your prospecting pipeline, not a manual step in your sales workflow. You already have a Python environment. You already have a prospect list in a spreadsheet. The only missing piece is a reliable detection API.
The StackPeek API in 30 Seconds
The StackPeek API takes a URL and returns a JSON array of technologies. No authentication required for the free tier. Here is the simplest possible Python call:
import requests

resp = requests.get(
    "https://us-central1-todd-agent-prod.cloudfunctions.net/stackpeekApi/api/v1/detect",
    params={"url": "https://shopify.com"},
    timeout=15,
)
data = resp.json()
for tech in data["technologies"]:
    print(f"{tech['name']} ({tech['category']}) — confidence: {tech['confidence']:.0%}")
Output:
React (framework) — confidence: 97%
Next.js (framework) — confidence: 94%
Cloudflare (cdn) — confidence: 99%
Google Analytics (analytics) — confidence: 88%
Stripe (payments) — confidence: 91%
Ruby on Rails (framework) — confidence: 76%
The response includes the technology name, category (framework, analytics, cms, payments, cdn, hosting), and a confidence score between 0 and 1. Detection runs against HTTP headers, HTML source, script URLs, and meta tags — not a browser render, so it is fast and headless-friendly.
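To make the static-signal idea concrete, here is a toy fingerprint check of the same general kind. This is a simplified illustration of the technique, not StackPeek's actual rule set; the patterns below are examples I chose, and real detection engines maintain far larger, regularly updated databases:

```python
import re

# Toy fingerprints: technology -> regex matched against raw HTML or headers.
# These three patterns are illustrative examples only.
FINGERPRINTS = {
    "Next.js": re.compile(r"/_next/static/"),
    "WordPress": re.compile(r"/wp-content/"),
    "Google Analytics": re.compile(r"googletagmanager\.com/gtag"),
}

def match_fingerprints(html: str) -> list[str]:
    """Return the names of technologies whose fingerprint appears in the text."""
    return [name for name, pattern in FINGERPRINTS.items() if pattern.search(html)]

sample = '<script src="/_next/static/chunks/main.js"></script>'
print(match_fingerprints(sample))  # ['Next.js']
```

Because the signals are static strings in the fetched response, no browser render is needed, which is why this style of detection stays fast.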
Free tier: 100 scans per day, no API key, no account. For a prospect list of 100 companies, you can scan the entire list every day at zero cost. For larger lists, the Starter plan at $9/month gives you 5,000 scans.
Step 1: Scan a Prospect List with requests
Assume your prospect list is a CSV with at minimum a domain column. Here is a script that reads it, scans each domain, and writes a new CSV with technology data appended:
# scan_prospects.py
import csv
import json
import time

import requests

API_URL = "https://us-central1-todd-agent-prod.cloudfunctions.net/stackpeekApi/api/v1/detect"
INPUT_CSV = "prospects.csv"
OUTPUT_CSV = "prospects_with_stacks.csv"

def detect_stack(domain: str) -> list:
    """Return list of technology dicts for a domain. Returns [] on any error."""
    url = domain if domain.startswith("http") else f"https://{domain}"
    try:
        resp = requests.get(API_URL, params={"url": url}, timeout=15)
        resp.raise_for_status()
        return resp.json().get("technologies", [])
    except Exception as e:
        print(f"  [warn] {domain}: {e}")
        return []

with open(INPUT_CSV, newline="") as f:
    rows = list(csv.DictReader(f))
print(f"Scanning {len(rows)} prospects...")

with open(OUTPUT_CSV, "w", newline="") as f:
    # Keep the original column order and append the two new columns
    fieldnames = list(rows[0].keys()) + ["technologies_json", "tech_names"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for i, row in enumerate(rows, 1):
        print(f"  [{i}/{len(rows)}] {row['domain']}", end=" ... ", flush=True)
        techs = detect_stack(row["domain"])
        names = [t["name"] for t in techs]
        print(", ".join(names) if names else "(none detected)")
        row["technologies_json"] = json.dumps(techs)
        row["tech_names"] = "|".join(names)
        writer.writerow(row)
        time.sleep(0.3)  # be polite to the API

print(f"\nDone. Results saved to {OUTPUT_CSV}")
The output CSV has every original column plus two new ones: technologies_json (the full API response, serialized) and tech_names (a pipe-delimited list of technology names for easy filtering in Excel or Google Sheets). Run it on your prospect list the night before a sales push and the results are ready by morning.
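If you just need a quick filter and do not want to reach for pandas yet, the pipe-delimited tech_names column is enough on its own. A minimal sketch, assuming the enriched CSV produced by the script above:

```python
import csv
import io

# Filter enriched rows for prospects whose stack includes a given technology,
# using only the pipe-delimited tech_names column from scan_prospects.py.
def filter_by_tech(rows, tech: str) -> list[dict]:
    wanted = tech.lower()
    return [
        row for row in rows
        # Exact-match each delimited name so "React" does not also match "React Native"
        if wanted in (name.lower() for name in row["tech_names"].split("|"))
    ]

# Works the same whether rows come from csv.DictReader over a file or anywhere else
sample = io.StringIO(
    "domain,tech_names\n"
    "a.com,Shopify|Klaviyo\n"
    "b.com,WordPress\n"
)
matches = filter_by_tech(csv.DictReader(sample), "shopify")
print([row["domain"] for row in matches])  # ['a.com']
```

Splitting on the delimiter before comparing avoids the substring false positives you would get from a naive "Shopify" in row["tech_names"] check.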
Step 2: Analyze Results with pandas
Once you have the enriched CSV, pandas makes the filtering effortless. Load it back and start slicing:
# analyze_stacks.py
import json

import pandas as pd

df = pd.read_csv("prospects_with_stacks.csv")

# Deserialize the JSON column back to Python objects
df["technologies"] = df["technologies_json"].apply(
    lambda x: json.loads(x) if pd.notna(x) else []
)

# Helper: does this prospect use a specific technology?
def uses(df: pd.DataFrame, tech_name: str) -> pd.Series:
    return df["technologies"].apply(
        lambda techs: any(t["name"].lower() == tech_name.lower() for t in techs)
    )

# --- Use case 1: Find all Shopify stores (sell them your Shopify app)
shopify_leads = df[uses(df, "Shopify")]
print(f"Shopify sites: {len(shopify_leads)}")

# --- Use case 2: React sites without Sentry (sell them error monitoring)
needs_monitoring = df[uses(df, "React") & ~uses(df, "Sentry")]
print(f"React sites missing error monitoring: {len(needs_monitoring)}")

# --- Use case 3: WordPress sites (sell them a WP plugin or migration service)
wp_sites = df[uses(df, "WordPress")]
print(f"WordPress sites: {len(wp_sites)}")

# --- Use case 4: Sites with no CDN (sell them performance services)
cdn_techs = {"Cloudflare", "Fastly", "Akamai", "Amazon CloudFront"}

def has_cdn(techs):
    return any(t["name"] in cdn_techs for t in techs)

no_cdn = df[~df["technologies"].apply(has_cdn)]
print(f"Sites with no CDN: {len(no_cdn)}")

# --- Technology frequency across entire prospect list
all_techs = [t["name"] for techs in df["technologies"] for t in techs]
freq = pd.Series(all_techs).value_counts()
print("\nTop 10 technologies in your prospect list:")
print(freq.head(10).to_string())
The frequency table at the bottom is particularly useful before you start building integrations or writing outreach copy. If 70% of your prospects use Shopify, lead with Shopify. If only 8% use Magento, don't waste a sequence on it.
Exporting Filtered Leads to CSV
Once you have your filtered DataFrame, export it for your CRM or outreach tool:
# Export Shopify leads with clean columns for outreach
export_cols = ["company", "domain", "contact_email", "tech_names"]
shopify_leads[export_cols].to_csv("shopify_leads.csv", index=False)
# Or push straight to a dict list for your CRM API
leads_for_crm = shopify_leads[export_cols].to_dict(orient="records")
print(f"Ready to import {len(leads_for_crm)} leads")
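If your CRM has a bulk-import API, the records can go straight there. A hypothetical sketch of shaping the payload; the endpoint, field names, and auth scheme below are placeholders for whatever your CRM actually expects, not a real API:

```python
# Hypothetical CRM bulk-import payload. KEEP_FIELDS and the payload shape
# are placeholders; adapt both to your CRM's actual import API.
KEEP_FIELDS = ("company", "domain", "contact_email", "tech_names")

def build_payload(leads: list[dict]) -> dict:
    """Strip each lead down to the fields the (hypothetical) CRM accepts."""
    return {"contacts": [{k: lead.get(k, "") for k in KEEP_FIELDS} for lead in leads]}

# Then POST it with the requests library used throughout this article, e.g.:
#   requests.post("https://api.example-crm.com/v1/contacts/bulk",
#                 json=build_payload(leads_for_crm),
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 timeout=30)
```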
Step 3: Batch Processing at Scale with asyncio and aiohttp
The sequential requests approach works fine for lists up to a few hundred domains. Beyond that, the wall-clock time gets painful. Each API call takes roughly 1.5–3 seconds; at 1,000 prospects, sequential scanning takes 25–50 minutes. With async concurrency set to 10, the same list finishes in 3–5 minutes.
# async_scan.py
import asyncio
import csv
import json

import aiohttp

API_URL = "https://us-central1-todd-agent-prod.cloudfunctions.net/stackpeekApi/api/v1/detect"
CONCURRENCY = 10  # stay well under rate limits

async def detect_one(session: aiohttp.ClientSession, domain: str) -> dict:
    url = domain if domain.startswith("http") else f"https://{domain}"
    try:
        async with session.get(
            API_URL, params={"url": url}, timeout=aiohttp.ClientTimeout(total=20)
        ) as resp:
            data = await resp.json()
            return {"domain": domain, "technologies": data.get("technologies", []), "error": None}
    except Exception as e:
        return {"domain": domain, "technologies": [], "error": str(e)}

async def scan_all(domains: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(CONCURRENCY)

    async def bounded(session, domain):
        async with sem:
            result = await detect_one(session, domain)
            print(f"  {domain}: {len(result['technologies'])} technologies")
            return result

    connector = aiohttp.TCPConnector(limit=CONCURRENCY)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [bounded(session, d) for d in domains]
        return await asyncio.gather(*tasks)

# Load domains from CSV
domains = [row["domain"] for row in csv.DictReader(open("prospects.csv"))]
print(f"Scanning {len(domains)} domains with concurrency={CONCURRENCY}...")

results = asyncio.run(scan_all(domains))

# Write output
with open("prospects_async.json", "w") as f:
    json.dump(results, f, indent=2)
print(f"Done. {len(results)} results written to prospects_async.json")
Install the dependency with pip install aiohttp. The semaphore prevents you from overwhelming the API with simultaneous requests. At CONCURRENCY=10, you are making at most 10 requests at once, which is well within the Starter plan's rate limits.
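One refinement worth adding before you scale up: transient failures (timeouts, flaky DNS, occasional 5xx responses) are common at this volume, and a single retry with backoff recovers most of them. A generic sketch you could wrap around the raw request; the attempt count and delays are arbitrary choices on my part, not API requirements:

```python
import asyncio

async def with_retries(coro_fn, *args, attempts: int = 3, base_delay: float = 1.0):
    """Call an async function, retrying on exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await coro_fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller handle it
            # Wait 1s, 2s, 4s, ... between attempts
            await asyncio.sleep(base_delay * (2 ** attempt))
```

Note that detect_one as written above swallows exceptions and returns an error dict, so to use this wrapper you would either move the try/except out of detect_one or retry whenever result["error"] is set.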
Save money on re-runs: Cache results to disk by domain. Before scanning, check if a cache/{domain}.json file exists. If it does and it is less than 7 days old, skip the API call and load from disk. This is especially useful during development when you are iterating on your filtering logic.
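A minimal sketch of that cache; the cache/{domain}.json layout and the 7-day TTL are just the conventions described above, and detect_fn stands in for the detect_stack function from Step 1:

```python
import json
import time
from pathlib import Path

CACHE_DIR = Path("cache")
CACHE_TTL = 7 * 24 * 3600  # seconds: re-scan anything older than 7 days

def cached_detect(domain: str, detect_fn) -> list:
    """Return cached technologies for a domain, calling detect_fn on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{domain}.json"
    # Cache hit: file exists and is fresh enough
    if path.exists() and time.time() - path.stat().st_mtime < CACHE_TTL:
        return json.loads(path.read_text())
    # Cache miss: hit the API (or whatever detect_fn does) and store the result
    techs = detect_fn(domain)
    path.write_text(json.dumps(techs))
    return techs
```

During development you can then re-run your filtering logic as often as you like; only the first scan of each domain costs an API call.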
Practical Filtering Recipes
Here are filters for common sales scenarios. All assume you have loaded the results into a pandas DataFrame as shown above.
Find prospects using a competitor's product
# You sell a Segment alternative — find sites using Segment
segment_users = df[uses(df, "Segment")]
# You sell a Mixpanel alternative — find Mixpanel + no Amplitude
mixpanel_only = df[uses(df, "Mixpanel") & ~uses(df, "Amplitude")]
Find high-intent leads by stack combination
# Shopify stores with no email marketing tool
# (warm leads if you sell email or SMS marketing)
email_tools = {"Klaviyo", "Mailchimp", "Drip", "ConvertKit"}

def has_email_tool(techs):
    return any(t["name"] in email_tools for t in techs)

shopify_no_email = df[uses(df, "Shopify") & ~df["technologies"].apply(has_email_tool)]
print(f"High-intent leads (Shopify, no email tool): {len(shopify_no_email)}")
Score prospects by stack sophistication
# Assign a "sophistication score" — more modern tools = better fit
modern_indicators = {"React", "Next.js", "Vue", "Svelte", "Stripe",
                     "Segment", "Cloudflare", "Vercel", "Netlify"}

def sophistication_score(techs):
    names = {t["name"] for t in techs}
    return len(names & modern_indicators)

df["score"] = df["technologies"].apply(sophistication_score)
top_prospects = df.nlargest(20, "score")
print("Top 20 prospects by stack sophistication:")
print(top_prospects[["company", "domain", "score", "tech_names"]].to_string(index=False))
The score-based approach is useful when your ICP (ideal customer profile) correlates with technical sophistication. If you sell a developer tool, a company already using React + Stripe + Segment is a better fit than a company still on jQuery + PayPal.
Wappalyzer vs. StackPeek for Python Workflows
If you have searched for "python website technology detection api" before, you have probably landed on Wappalyzer. Let's compare them honestly:
| Feature | Wappalyzer API | StackPeek API |
|---|---|---|
| Monthly price | $250/mo (Business) | $9/mo (Starter) |
| Free tier | 50 lookups/mo | 100 scans/day |
| Monthly lookups | 5,000 (Business) | 5,000 (Starter) / 25,000 (Pro) |
| Technologies detected | 1,200+ | 120+ (core categories) |
| API key required | Yes (all tiers) | No (free tier) |
| JSON response format | Yes | Yes |
| Python-friendly REST API | Yes | Yes |
| Annual cost | $3,000/yr | $108/yr |
Wappalyzer detects more technologies — if you need coverage of obscure or niche tools, that breadth matters. But for the 80% use case in sales prospecting — frameworks, CMS platforms, analytics, payments, CDNs, hosting — StackPeek's 120+ technology coverage is sufficient, and the $2,892/year difference is hard to argue with for a scrappy sales team or a solo founder.
There is also an unmaintained open-source Python package, python-Wappalyzer, that wraps a local copy of Wappalyzer's fingerprint database. It still works for basic cases, but the fingerprint data goes stale fast and it requires Playwright or Puppeteer to handle JavaScript-rendered sites. For production use, a maintained API is more reliable.
Putting It Into Your Sales Workflow
The full pipeline looks like this:
- Export domains from your CRM as a CSV. Most CRMs (HubSpot, Salesforce, Pipedrive) support this in two clicks.
- Run the async scanner overnight or as a weekly cron job. For 1,000 domains at concurrency 10, it takes under 5 minutes.
- Load into pandas, apply filters based on your ICP criteria. Export filtered leads back to CSV.
- Import back into your CRM or outreach tool (Apollo, Outreach, Lemlist) with technographic tags attached.
- Personalize your sequences based on the detected stack. "I noticed you're on Shopify but not using Klaviyo yet..." converts significantly better than generic copy.
If you want to automate the scan step on a schedule, see our article on building a competitive intelligence dashboard — the same scheduling patterns apply to a prospecting pipeline. A weekly cron job that re-scans your prospect list catches companies that migrate platforms, add new tools, or launch payment infrastructure, all of which are strong buying signals.
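Spotting those changes is just a set difference between two scan runs. A sketch, assuming you save each weekly run as a mapping of domain to detected technology names:

```python
def new_technologies(previous: dict, current: dict) -> dict:
    """Return technologies that appeared since the last scan, keyed by domain."""
    changes = {}
    for domain, techs in current.items():
        added = set(techs) - set(previous.get(domain, []))
        if added:
            changes[domain] = sorted(added)
    return changes

# Example: acme.com added Stripe between scans -- a payments-launch buying signal
last_week = {"acme.com": ["WordPress"], "globex.com": ["React"]}
this_week = {"acme.com": ["WordPress", "Stripe"], "globex.com": ["React"]}
print(new_technologies(last_week, this_week))  # {'acme.com': ['Stripe']}
```

Feed the diff into your outreach tool each week and you are reaching out precisely when something on the prospect's site changed.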
Start Scanning Your Prospect List Today
100 free scans per day. No API key. No signup. Copy the Python code above and run it against your prospect list right now.
Get Started →

Frequently Asked Questions
Do I need an API key?
No. The free tier gives you 100 scans per day with zero authentication. Just make a GET request to the endpoint with a url query parameter. For higher volumes, the Starter plan at $9/month adds an API key and raises the limit to 5,000 scans per month.
What if a site blocks scraping?
The StackPeek API does its own fetching server-side; it is not running in a browser on your machine. Sites that block scrapers based on user-agent or IP rate-limiting may still have reduced detection accuracy, but this affects all tech detection services equally, including Wappalyzer. For protected sites, confidence scores will be lower and some technologies may not be detected.
Can I detect technologies on sites that require JavaScript rendering?
StackPeek analyzes HTTP headers, static HTML, and script tag sources — it does not execute JavaScript. Technologies that only appear after JS execution (some single-page apps, lazy-loaded analytics) may not be detected. However, most technologies leave fingerprints in static signals: script source URLs, HTTP headers (X-Powered-By, Server), and meta tags. In practice, this covers the majority of commercially relevant tools.