
Why object count isn’t scale—and how we (+AI) built data generators based on tens of thousands of real-world backups
If you build anything that touches Jira or Confluence at scale, such as backups, migrations, analytics, reporting, or marketplace apps, you’ve probably done this:
You spin up a test instance.
You generate 1,000 issues.
Everything works.
And then production happens.
The problem isn’t the number of issues. It’s everything that comes with them. Real Jira and Confluence instances aren’t just big—they’re dense, interconnected, and shaped by years of human behavior.
So I built and open sourced two tools to make realistic large-scale testing easier: one data generator for Jira and one for Confluence.
They generate large, structurally realistic datasets that actually help you test how your system behaves when the data has history, relationships, and uneven growth.
An issue is not a unit of scale.
In real environments, an issue implies:
So when someone says “10,000 issues,” what they usually mean is:
10,000 issues plus years of accumulated complexity attached to them.
If your test data doesn’t model that, you’re not testing reality—you’re testing a skeleton.
The same story applies to Confluence. A few thousand pages sounds simple until you remember what production looks like:
Scale isn’t just object count.
Scale is structure + history + density.
Rather than make up ratios, I analyzed structural patterns across tens of thousands of Jira and Confluence backups over an extended period to understand how real environments grow.
To be clear: this wasn’t about inspecting customer content. It was an analysis of aggregated, anonymized operational metadata: object counts, attachment distributions, relationship density, growth patterns. The shape of the data, not the data itself.
From that analysis, I built multiplier tables that model how related data types scale relative to a core unit, such as an issue in Jira or a page in Confluence.
That’s the missing piece in most data generators. They can create objects, but they usually don’t create the implied objects in realistic ratios that make the dataset behave like production.
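To make the multiplier idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `EXAMPLE_MULTIPLIERS` values are placeholders, not the ratios derived from the backup analysis, and `scale_counts` is a hypothetical helper, not a function from either generator.

```python
# Hypothetical multipliers: average number of related objects per issue.
# Placeholder values for illustration only, NOT the ratios from the analysis.
EXAMPLE_MULTIPLIERS = {
    "comments": 4.0,
    "attachments": 1.5,
    "issue_links": 0.25,
}


def scale_counts(core_count, multipliers):
    """Return how many of each related object a given number of core units implies."""
    if core_count < 0:
        raise ValueError("core_count must be non-negative")
    # Round to whole objects; a real table might also vary ratios by instance size.
    return {name: round(core_count * ratio) for name, ratio in multipliers.items()}


plan = scale_counts(10_000, EXAMPLE_MULTIPLIERS)
for name, count in plan.items():
    print(f"{name}: {count}")
```

The point of the sketch is the shape of the calculation: you ask for a number of core units, and the table decides how much surrounding data comes with them.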
If you ask the Jira generator for 10,000 issues, you don’t just get 10,000 empty shells. You get issues, plus a realistically scaled mix of:
The goal isn’t to clone any specific organization.
The goal is to recreate the shape and density of real-world Jira and Confluence data, so systems experience realistic pressure:
That’s where integrations break. That’s where performance cliffs show up. That’s where you find out whether your tooling is actually ready.
Large datasets take time to generate. And long-running jobs fail.
These tools were built for jobs that run for hours (or longer), not just quick demo scripts:
If you’re generating a genuinely large dataset, you shouldn’t have to restart from zero just because a network hiccup happened at hour four.
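One common way to get that kind of resumability is a checkpoint file. The sketch below is a generic pattern under stated assumptions, not a description of how these tools do it: `run_job`, `load_checkpoint`, `save_checkpoint`, and the `next_index` field are all hypothetical names, and `create_item` stands in for whatever call actually creates an object.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to create, or 0 if no checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    # os.replace swaps the temp file into place in a single step.
    os.replace(tmp_path, path)


def run_job(total, create_item, checkpoint_path, checkpoint_every=100):
    """Create items up to index total - 1, resuming from the last saved checkpoint."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, total):
        create_item(i)  # may raise on a network failure
        if (i + 1) % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, total)
```

If `create_item` raises partway through, rerunning `run_job` with the same checkpoint path picks up from the last saved index rather than from zero. Items created after the last checkpoint but before the failure are created again, so this simple version assumes duplicates are acceptable or that creation is idempotent.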
This isn’t a “make my demo prettier” tool.
It’s for engineers who need to answer questions like:
Small synthetic datasets are comforting because they tend to be clean.
Production isn’t clean.
If you want confidence, you need test data that reflects reality.
Both tools are available on GitHub:
Each repository includes configuration options so you can:
My advice: start with a dataset slightly larger than you think you need. You’ll learn faster.
I open sourced these tools because generating realistic large-scale test environments is harder than it looks, and most teams eventually run into that problem.
If you build on Jira or Confluence long enough, you will hit scale. And when you do, you basically have two options:
I prefer option one.
If these tools help you surface edge cases earlier, pressure-test your systems more effectively, or avoid a few unpleasant surprises, then they’ve done their job.
If you want to extend the modeling, improve the scaling logic, or support additional data shapes, open an issue or send a PR.
Small datasets hide problems.
Real Jira and Confluence instances are shaped by time: collaboration patterns, attachments, relationships, versions, and growth that’s anything but uniform.
If your system needs to work under real conditions, your test data should reflect that reality.
That’s what these generators are for.