One of the most common mistakes in GenAI application development is the lack of a solid evaluation dataset—without one, you’re left guessing and relying on intuition. Yet in many cases, we have no access to any data, let alone real user feedback. So, how do we move forward? Below, you will learn how to quickly generate high-quality data that facilitates application prototyping.
Recently, I worked on a tool that allows non-technical users to query an SQL database containing survey results using natural language. The challenge was the scale of the survey: nearly 400 variables from dozens of questions, many with multiple-choice options. Here are a few examples of the questions the tool needs to handle, along with the SQL it should produce:
| Question | SQL Query |
|---|---|
Do people aged 50-59 in GZM listen to podcasts more often than watch cultural programs on TV? | SELECT 'Podcasts (yearly)' AS activity, ROUND(SUM(CASE WHEN culture_home_podcasts = TRUE THEN sample_weight ELSE 0 END) * 100.0 / SUM(sample_weight), 2) AS percentage FROM participation_survey WHERE age_group = '50-59' UNION ALL SELECT 'Cultural TV programs (yearly)' AS activity, ROUND(SUM(CASE WHEN frequency_cultural_programs = 'At least once in the last 12 months' THEN sample_weight ELSE 0 END) * 100.0 / SUM(sample_weight), 2) AS percentage FROM participation_survey WHERE age_group = '50-59'; |
What is the average monthly spending on culture among people with higher education? | SELECT ROUND(SUM(culture_spending_amount * sample_weight) / SUM(sample_weight), 2) AS average_culture_spending FROM participation_survey WHERE education = 'higher'; |
What percentage of people using public transport, but NOT regional transport (rail), visit cultural places in other GZM cities? | SELECT ROUND(SUM(CASE WHEN participation_other_place_gzm = TRUE THEN sample_weight ELSE 0 END) * 100.0 / SUM(sample_weight), 2) AS percentage_culture_other_cities FROM participation_survey WHERE transport_public_transport = TRUE AND transport_regional_transport = FALSE; |
I faced a dilemma: normalize the data by breaking it into many related tables, or keep everything in its original form, considering this was an analytical database whose records wouldn’t be modified. Either solution required generating fairly complex queries, so I opted for a single wide table.
While LLMs translate natural-language questions to SQL reasonably well, to do it correctly they must first understand your data structure: the database schema, data types, and, in our case, the lists of possible answers. Boolean values are self-explanatory (the only options are TRUE and FALSE), but the model needs to know all possible answers to scale questions, for example:
"1 very rarely or never",
"2",
"3",
"4",
"5",
"6",
"7 very frequently"
Below is an example of the metadata I prepared for each variable, partly based on the survey metadata and partly reverse-engineered from the existing survey data: by examining all the values a variable takes in the table, I could infer its data type and the full list of possible answers.
```json
{
  "semantic_name": "waga_proby",
  "data_type": "numeric",
  "description": "Statistical weight assigned to a participant for representativeness purposes.",
  "possible_values": null,
  "hints": [
    "Always use this unless the user specifically asks for record counts"
  ]
},
{
  "semantic_name": "subregion_metropolii_gzm",
  "data_type": "text",
  "description": "GZM Subregion where the survey was conducted.",
  "possible_values": [
    "KATOWICE",
    "BYTOMSKI",
    "GLIWICKI",
    "KATOWICKI",
    "SOSNOWIECKI",
    "TYSKI",
    "Inny"
  ],
  "hints": [
    "The GZM Metropolis is divided into subregions: use this value for location only when specifically requested regarding GZM/Metropolis subregions",
    "Answers for Katowice stored separately, to be merged with 'KATOWICKI' if necessary"
  ]
},
```
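Much of this reverse-engineering can be scripted. Below is a minimal sketch, assuming the survey lives in a pandas DataFrame loaded from the analytical table; the threshold for treating a column as a closed question is an illustrative assumption.

```python
import json
import pandas as pd

def infer_variable_metadata(df: pd.DataFrame, max_categories: int = 20) -> list[dict]:
    """Build draft metadata for each column by inspecting its values.

    Data types and possible values are inferred from the data itself;
    descriptions and hints still have to be filled in by hand (or with an LLM)
    using the original survey documentation.
    """
    metadata = []
    for column in df.columns:
        series = df[column].dropna()
        if pd.api.types.is_bool_dtype(series):
            data_type, possible_values = "boolean", None
        elif pd.api.types.is_numeric_dtype(series):
            data_type, possible_values = "numeric", None
        else:
            data_type = "text"
            unique_values = sorted(series.astype(str).unique().tolist())
            # Only list answers explicitly for closed questions / scales;
            # open-ended text fields would produce unmanageably long lists.
            possible_values = unique_values if len(unique_values) <= max_categories else None
        metadata.append(
            {
                "semantic_name": column,
                "data_type": data_type,
                "description": "",  # filled in manually from the survey documentation
                "possible_values": possible_values,
                "hints": [],
            }
        )
    return metadata

# Example usage (table name is illustrative):
# df = pd.read_sql("SELECT * FROM participation_survey", connection)
# print(json.dumps(infer_variable_metadata(df), indent=2, ensure_ascii=False))
```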
Context #
I already had the original data and database schema, but I still needed context: Why was the survey conducted? Which group did it target, and what research problems did its authors aim to address? Finally: Who would be my application users, and what were their needs? Equally important was detailed information about the survey data.
For example, the survey covered the GZM Metropolis in southern Poland, home to over 2 million people. The area is divided into subregions, counties/cities, municipalities (urban/rural), and for the largest city, Katowice, residents were surveyed at the district level.
Even if top-tier foundation models have some knowledge of Polish administrative divisions, our agent needs to know exactly which column to pull data from when a user asks about cities (city/county) or towns (municipality), and likewise which districts to select when a user requests a comparison between Katowice's downtown and its southern districts.
Small data with superior quality #
I fed all this information into the language model’s context:
- A small sample of data (several records)
- DB schema with data types
- Possible answers to survey questions
- The survey itself
- Additional information about the study and the specifics of the research problem
Then I asked it to generate 50 example queries with SQL code of varying complexity, from simplest to more sophisticated.
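For illustration, here is a minimal sketch of how such a generation request can be assembled and sent. The client, model name, prompt wording, and output format are assumptions for the example, not the exact setup used in the project.

```python
from openai import OpenAI  # any chat-completion client with a long-context model would do

client = OpenAI()

GENERATION_PROMPT = """You are an expert data analyst.
Using the database schema, variable metadata, survey questionnaire and study
context below, generate 50 example natural-language questions together with
the SQL queries that answer them. Order them from the simplest to the most
sophisticated. Return the result as a JSON list of objects with the keys
"question" and "sql".

### Schema
{schema}

### Variable metadata (possible answers included)
{metadata}

### Survey and study context
{context}

### Sample records
{sample_rows}
"""

def generate_examples(schema: str, metadata: str, context: str, sample_rows: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any strong long-context model works here
        messages=[
            {
                "role": "user",
                "content": GENERATION_PROMPT.format(
                    schema=schema,
                    metadata=metadata,
                    context=context,
                    sample_rows=sample_rows,
                ),
            }
        ],
    )
    return response.choices[0].message.content
```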
To ensure the baseline data would be of the highest quality, I ran all the queries, recorded the results, and manually reviewed them step by step to verify they did what I expected. Additionally, I asked another model, Claude 3.7 Sonnet Thinking, to critically analyze everything and catch potential shortcomings or systematic errors I might have missed; current models have context windows large enough (around 1M tokens for Gemini models) to comfortably work with even hundreds of records at once.
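The mechanical part of that review can be scripted. A rough sketch, assuming the survey data sits in a local SQLite file and the generated examples are stored as a JSON list of objects with "question" and "sql" keys (both assumptions, not the project's actual layout):

```python
import json
import sqlite3

def execute_examples(db_path: str, examples_path: str, report_path: str) -> None:
    """Run every generated SQL query and record its result (or error)
    so each example can be reviewed by hand or by an LLM judge."""
    connection = sqlite3.connect(db_path)
    with open(examples_path, encoding="utf-8") as f:
        examples = json.load(f)

    report = []
    for example in examples:
        try:
            rows = connection.execute(example["sql"]).fetchall()
            report.append({**example, "result": rows, "error": None})
        except sqlite3.Error as exc:
            report.append({**example, "result": None, "error": str(exc)})

    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, ensure_ascii=False, default=str)

# execute_examples("survey.db", "examples.json", "examples_with_results.json")
```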
With 50 high-quality diverse examples, I could use them as a pattern for subsequent iterations, again employing the most robust models by:
- Generating 50 more similar but slightly modified examples (low risk of error)
- Then requesting another 50, this time covering different cases that would explore previously untouched survey questions
The key is to systematically avoid previously identified errors in each iteration, adding new correct examples and modifying the prompt. For instance, in the survey from which my data came, response weights were used. When the first automatically generated examples failed to account for them, I added the following rule to my prompt:
Only generate unweighted results (using simple COUNT(*) or AVG([column]) without waga_proby) if the user explicitly asks for "raw counts", "unweighted results", or similar explicit terms indicating a desire to ignore representativeness. If unsure, default to weighted.
With each subsequent step, I received better-quality queries that required less work on my part; the language model in LLM-as-a-judge mode was able to catch potential errors.
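A minimal sketch of such a judging pass; the rubric, client, and model name are again placeholders rather than the exact setup used here:

```python
import json

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are reviewing examples for a natural-language-to-SQL
evaluation dataset. For each item you get the question, the SQL query and the
result it returned. Check whether the SQL actually answers the question,
whether sample weights (waga_proby) are applied where they should be, and
whether the answer values used match the variable metadata. Reply with a JSON
object with the keys "verdict" ("ok" or "suspicious") and "explanation".

### Variable metadata
{metadata}

### Example
{example}
"""

def judge_example(example: dict, metadata: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for whichever judge model is used
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    metadata=metadata,
                    example=json.dumps(example, ensure_ascii=False, default=str),
                ),
            }
        ],
    )
    return response.choices[0].message.content
```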
System prompt #
The data and instructions collected this way also served to create a system prompt for the final solution. It should include:
- General instructions for translating queries to SQL
- Database schema
- Important contextual information (e.g., the spatial relationships described above)
- Example SQL queries
The approach that worked well for me was to make the first, unoptimized prompt version comprehensive: it contained the full database schema but only a few specific examples. With a baseline prompt and a verified evaluation dataset, we can move to the iterative prompt engineering phase: trimming or modifying the prompt so it uses fewer tokens, which lowers usage costs and reduces latency, without sacrificing output quality.
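One way to keep that iteration honest is to re-run the evaluation set after every prompt change and compare the results of the newly generated SQL with the verified baseline results. A rough sketch, assuming the SQLite database and JSON evaluation file from the earlier snippets:

```python
import json
import sqlite3

def evaluate_prompt(db_path: str, eval_set_path: str, generate_sql) -> float:
    """Fraction of evaluation questions for which the SQL produced with the
    current prompt returns the same result as the verified baseline query.

    `generate_sql` is whatever function wraps the LLM call with the system
    prompt under test (question -> SQL string)."""
    connection = sqlite3.connect(db_path)
    with open(eval_set_path, encoding="utf-8") as f:
        eval_set = json.load(f)  # items with "question" and a verified "sql"

    hits = 0
    for item in eval_set:
        expected = connection.execute(item["sql"]).fetchall()
        try:
            produced = connection.execute(generate_sql(item["question"])).fetchall()
        except sqlite3.Error:
            continue  # a query that fails to run counts as a miss
        if produced == expected:
            hits += 1
    return hits / len(eval_set)
```

Exact result comparison is deliberately naive; in practice you may want to sort rows and round numeric values before comparing.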
Dynamic example generation #
One trick that yielded the best results was implementing dynamic example generation. For this, I used semantic similarity and a reranker. In short, the method involves searching for examples semantically similar to the specific user query using a separate model and adding only those to the context (a similar approach can be used to fetch relevant column names). While this requires additional calls, they are far less resource-intensive than the main LLM call that generates the SQL. The investment clearly paid off, improving results by a remarkable 20-30%.
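Here is a sketch of that retrieval step built on the sentence-transformers library; the bi-encoder and cross-encoder model names are common defaults used for illustration, not necessarily the ones used in this project.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer
from sentence_transformers.util import cos_sim

encoder = SentenceTransformer("all-MiniLM-L6-v2")                 # bi-encoder for the first pass
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # reranker for the shortlist

def select_examples(user_query: str, examples: list[dict],
                    top_k: int = 20, final_k: int = 5) -> list[dict]:
    """Pick the few most relevant (question, SQL) examples for the prompt."""
    questions = [example["question"] for example in examples]
    # 1) cheap semantic search over all example questions
    scores = cos_sim(encoder.encode(user_query), encoder.encode(questions))[0]
    candidates = sorted(range(len(examples)),
                        key=lambda i: float(scores[i]), reverse=True)[:top_k]
    # 2) more accurate reranking of the shortlist
    rerank_scores = reranker.predict([(user_query, questions[i]) for i in candidates])
    ranked = sorted(zip(candidates, rerank_scores), key=lambda pair: pair[1], reverse=True)
    return [examples[i] for i, _ in ranked[:final_k]]
```

The selected examples are then injected into the system prompt in place of a long static list, which keeps the prompt short while staying relevant to the user's question.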
***
As you can see, data creation is an integral part of AI application development, especially when we don’t have real-life cases from the get-go. Generating and testing data is closely linked to the art of prompt engineering, where it is worth combining our AI engineering knowledge with domain expertise and with support from various models in LLM-as-a-judge mode. Dedicating a few hours to creating a baseline dataset gives us a good start and a prototype we can then hand to real users for testing.
This article provides a high-level overview of how to create a dataset from scratch—I’m not sharing code snippets here, but you can see the complete repository for this project on GitHub.