Performance
July 2, 2024
12 min read

Techniques for Generating Large Volumes of Test Data

Master strategies for efficiently generating millions of records while keeping performance, memory usage, and data quality under control at scale.

large scale data
performance
optimization
bulk generation
memory management


Generating large volumes of test data efficiently is a complex challenge that requires careful consideration of performance, memory usage, and data quality. Whether you need millions of user records for load testing or massive datasets for data processing validation, the right techniques can make the difference between success and failure.

Understanding Scale Requirements

Before diving into implementation, it's crucial to understand your scale requirements and constraints:

Scale Categories

Small Scale (1K - 10K records)

• Memory: Load entire dataset into memory
• Performance: Single-threaded generation acceptable
• Tools: Standard faker libraries sufficient
• Time: Seconds to minutes

Medium Scale (10K - 1M records)

• Memory: Batch processing recommended
• Performance: Multi-threading beneficial
• Tools: Optimized generation algorithms
• Time: Minutes to hours

Large Scale (1M - 100M records)

• Memory: Streaming and chunking required
• Performance: Parallel processing essential
• Tools: Specialized high-performance libraries
• Time: Hours to days

Massive Scale (100M+ records)

• Memory: Distributed processing needed
• Performance: Cluster computing required
• Tools: Big data frameworks
• Time: Days to weeks
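If you want to encode this guidance directly, a rough rule of thumb might look like the sketch below; the thresholds simply mirror the categories above and should be tuned to your hardware and data complexity.

function recommendStrategy(recordCount) {
  // Thresholds mirror the scale categories above; tune for your environment.
  if (recordCount <= 10_000) return 'in-memory, single-threaded generation';
  if (recordCount <= 1_000_000) return 'batched generation with multi-threading';
  if (recordCount <= 100_000_000) return 'streaming generation with parallel workers';
  return 'distributed generation across a cluster';
}

console.log(recommendStrategy(5_000_000)); // streaming generation with parallel workers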
Memory-Efficient Generation Strategies

    1. Streaming Data Generation

    Generate data on-demand without storing everything in memory:

const { Readable } = require('stream');
const { faker } = require('@faker-js/faker');

class LargeDatasetGenerator extends Readable {
  constructor(options) {
    super({ objectMode: true });
    this.recordCount = 0;
    this.maxRecords = options.maxRecords;
    this.batchSize = options.batchSize || 1000;
  }

  _read() {
    if (this.recordCount >= this.maxRecords) {
      this.push(null); // End of stream
      return;
    }

    // Generate a batch of records per read call
    const remainingRecords = this.maxRecords - this.recordCount;
    const batchSize = Math.min(this.batchSize, remainingRecords);

    for (let i = 0; i < batchSize; i++) {
      const record = this.generateSingleRecord();
      this.push(record);
      this.recordCount++;
    }
  }

  generateSingleRecord() {
    return {
      id: this.recordCount + 1,
      name: faker.person.fullName(),
      email: faker.internet.email(),
      createdAt: faker.date.recent({ days: 365 }),
      // Minimal memory footprint per record
    };
  }
}

// Usage: Generate 10 million records with streaming
const generator = new LargeDatasetGenerator({
  maxRecords: 10_000_000,
  batchSize: 5000
});

generator.on('data', (record) => {
  // Process individual record (write to DB, file, etc.)
  processRecord(record);
});

generator.on('end', () => {
  console.log('Generation complete');
});
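Attaching a 'data' listener works, but it does not apply backpressure: records are pushed as fast as they are generated. A more robust pattern, sketched below assuming Node 15+ (for require('stream/promises')) and the same processRecord placeholder as above, pipes the generator into an object-mode Writable so the stream pauses automatically whenever the consumer falls behind.

const { Writable } = require('stream');
const { pipeline } = require('stream/promises');

async function runStreamingGeneration() {
  const source = new LargeDatasetGenerator({ maxRecords: 10_000_000, batchSize: 5000 });

  const sink = new Writable({
    objectMode: true,
    write(record, _encoding, callback) {
      // processRecord is the placeholder used above; backpressure is
      // handled automatically when this sink's internal buffer fills up.
      Promise.resolve(processRecord(record))
        .then(() => callback())
        .catch(callback);
    }
  });

  await pipeline(source, sink);
  console.log('Streaming generation complete');
}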

    2. Batch Processing with Memory Management

    Process data in manageable chunks:

    async function generateLargeDatasetInBatches(totalRecords, batchSize = 10000) {
      const startTime = Date.now();
      let processedRecords = 0;
      
      while (processedRecords < totalRecords) {
        const currentBatchSize = Math.min(batchSize, totalRecords - processedRecords);
        
        // Generate batch
        const batch = Array.from({ length: currentBatchSize }, (_, index) => ({
          id: processedRecords + index + 1,
          userId: faker.string.uuid(),
          firstName: faker.person.firstName(),
          lastName: faker.person.lastName(),
          email: faker.internet.email(),
          phone: faker.phone.number(),
          address: {
            street: faker.location.streetAddress(),
            city: faker.location.city(),
            zipCode: faker.location.zipCode(),
            country: faker.location.countryCode()
          },
          createdAt: faker.date.past({ years: 2 }),
          metadata: {
            source: 'bulk_generation',
            batchId: Math.floor(processedRecords / batchSize) + 1
          }
        }));
        
        // Process batch (write to database, file, etc.)
        await processBatch(batch);
        
        processedRecords += currentBatchSize;
        
        // Progress reporting
        const progress = (processedRecords / totalRecords * 100).toFixed(2);
        const elapsed = Date.now() - startTime;
        const rate = Math.round(processedRecords / elapsed * 1000);
        
        console.log(`Progress: ${progress}% (${processedRecords}/${totalRecords}) - ${rate} records/sec`);
        
        // Memory cleanup and throttling
        if (processedRecords % (batchSize * 10) === 0) {
          global.gc && global.gc(); // Force garbage collection if available
          await new Promise(resolve => setTimeout(resolve, 100)); // Brief pause
        }
      }
      
      return processedRecords;
    }

async function processBatch(batch) {
  // Example: Write to database
  await database.bulkInsert('users', batch);

  // Example: Write to file
  // await fs.appendFile('large_dataset.jsonl',
  //   batch.map(record => JSON.stringify(record)).join('\n') + '\n'
  // );
}

    Generate large-scale datasets efficiently with our bulk data generator.

    Performance Optimization Techniques

    1. Multi-Threading with Worker Threads

    Utilize multiple CPU cores for parallel generation:

    const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
    const os = require('os');

if (isMainThread) {
  // Main thread - coordinate workers
  async function generateWithWorkers(totalRecords, numWorkers = os.cpus().length) {
    const recordsPerWorker = Math.ceil(totalRecords / numWorkers);
    const workers = [];
    const results = [];

    console.log(`Starting ${numWorkers} workers to generate ${totalRecords} records`);

    for (let i = 0; i < numWorkers; i++) {
      const startId = i * recordsPerWorker + 1;
      const endId = Math.min((i + 1) * recordsPerWorker, totalRecords);

      if (startId <= totalRecords) {
        const worker = new Worker(__filename, {
          workerData: { startId, endId, workerId: i }
        });

        workers.push(new Promise((resolve, reject) => {
          worker.on('message', (data) => {
            results.push(data);
            resolve();
          });
          worker.on('error', reject);
        }));
      }
    }

    await Promise.all(workers);

    // Combine results from all workers
    const allRecords = results.flatMap(result => result.records);
    console.log(`Generated ${allRecords.length} records using ${numWorkers} workers`);

    return allRecords;
  }

  // Export for use
  module.exports = { generateWithWorkers };
} else {
  // Worker thread - generate data
  const { startId, endId, workerId } = workerData;
  const { faker } = require('@faker-js/faker');

  const records = [];

  for (let id = startId; id <= endId; id++) {
    records.push({
      id: id,
      workerId: workerId,
      name: faker.person.fullName(),
      email: faker.internet.email(),
      phone: faker.phone.number(),
      address: faker.location.streetAddress(),
      createdAt: faker.date.recent({ days: 365 }).toISOString()
    });

    // Progress reporting from worker
    if (id % 10000 === 0) {
      console.log(`Worker ${workerId}: Generated ${id - startId + 1}/${endId - startId + 1} records`);
    }
  }

  parentPort.postMessage({ workerId, records });
}
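Assuming the snippet above is saved in its own module (the filename generate-workers.js below is only illustrative), the exported coordinator could be used like this:

// Hypothetical usage of the exported coordinator.
const { generateWithWorkers } = require('./generate-workers');

generateWithWorkers(1_000_000)
  .then(records => console.log(`Received ${records.length} records from workers`))
  .catch(err => console.error('Worker generation failed:', err));

Note that this coordinator collects all worker results in memory, so for truly large runs you would combine it with the streaming or file-based writers described later.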

    2. Optimized Data Structure Generation

    Use efficient algorithms for complex data structures:

    class OptimizedDataGenerator {
      constructor() {
        // Pre-compute common values to avoid repeated generation
        this.commonFirstNames = Array.from({ length: 1000 }, () => faker.person.firstName());
        this.commonLastNames = Array.from({ length: 1000 }, () => faker.person.lastName());
        this.commonDomains = ['gmail.com', 'yahoo.com', 'hotmail.com', 'example.com'];
        this.commonCities = Array.from({ length: 500 }, () => faker.location.city());
        this.commonCountries = Array.from({ length: 50 }, () => faker.location.country());
        
        // Pre-generate UUID pool for better performance
        this.uuidPool = Array.from({ length: 10000 }, () => faker.string.uuid());
        this.uuidIndex = 0;
      }
      
      getUUID() {
        // Cycle through pre-generated UUIDs for better performance
        const uuid = this.uuidPool[this.uuidIndex];
        this.uuidIndex = (this.uuidIndex + 1) % this.uuidPool.length;
        return uuid;
      }
      
      generateOptimizedUser() {
        const firstName = faker.helpers.arrayElement(this.commonFirstNames);
        const lastName = faker.helpers.arrayElement(this.commonLastNames);
        const domain = faker.helpers.arrayElement(this.commonDomains);
        
        return {
          id: this.getUUID(),
          firstName: firstName,
          lastName: lastName,
          email: `${firstName.toLowerCase()}.${lastName.toLowerCase()}@${domain}`,
          city: faker.helpers.arrayElement(this.commonCities),
          country: faker.helpers.arrayElement(this.commonCountries),
          createdAt: faker.date.recent({ days: 365 }),
          
          // Generate complex nested data efficiently
          preferences: this.generatePreferences(),
          activity: this.generateActivityData()
        };
      }
      
      generatePreferences() {
        // Use bitwise operations for efficient boolean generation
        const flags = faker.number.int({ min: 0, max: 255 });
        
        return {
          emailNotifications: Boolean(flags & 1),
          smsNotifications: Boolean(flags & 2),
          pushNotifications: Boolean(flags & 4),
          marketingEmails: Boolean(flags & 8),
          newsletter: Boolean(flags & 16),
          productUpdates: Boolean(flags & 32),
          darkMode: Boolean(flags & 64),
          twoFactorAuth: Boolean(flags & 128)
        };
      }
      
      generateActivityData() {
        const loginCount = faker.number.int({ min: 1, max: 1000 });
        
        return {
          loginCount: loginCount,
          lastLoginDate: faker.date.recent({ days: 30 }),
          averageSessionDuration: faker.number.int({ min: 300, max: 7200 }),
          pageViews: loginCount * faker.number.int({ min: 5, max: 50 }),
          
          // Generate activity timestamps efficiently
          recentActivity: this.generateRecentActivity(Math.min(loginCount, 10))
        };
      }
      
      generateRecentActivity(count) {
        const baseDate = new Date();
        const activities = [];
        
        for (let i = 0; i < count; i++) {
          activities.push({
            timestamp: new Date(baseDate.getTime() - i * 24 * 60 * 60 * 1000),
            action: faker.helpers.arrayElement(['login', 'view_page', 'update_profile', 'logout'])
          });
        }
        
        return activities;
      }
    }

// Usage
async function generateLargeOptimizedDataset(count) {
  const generator = new OptimizedDataGenerator();
  const batchSize = 10000;
  const totalBatches = Math.ceil(count / batchSize);

  console.log(`Generating ${count} optimized records in ${totalBatches} batches`);

  for (let batch = 0; batch < totalBatches; batch++) {
    const batchStart = batch * batchSize;
    const batchEnd = Math.min((batch + 1) * batchSize, count);
    const batchCount = batchEnd - batchStart;

    const records = Array.from({ length: batchCount }, () =>
      generator.generateOptimizedUser()
    );

    // Process batch
    await processBatch(records);

    console.log(`Completed batch ${batch + 1}/${totalBatches}`);
  }
}

    Database Integration Strategies

    1. Bulk Insert Optimization

    Optimize database writes for large datasets:

    class DatabaseBulkInserter {
      constructor(connection, tableName) {
        this.connection = connection;
        this.tableName = tableName;
        this.batchSize = 5000; // Optimal batch size for most databases
        this.pendingRecords = [];
      }
      
      async addRecord(record) {
        this.pendingRecords.push(record);
        
        if (this.pendingRecords.length >= this.batchSize) {
          await this.flush();
        }
      }
      
      async flush() {
        if (this.pendingRecords.length === 0) return;
        
        const startTime = Date.now();
        
        try {
          // Use database-specific bulk insert
          await this.performBulkInsert(this.pendingRecords);
          
          const duration = Date.now() - startTime;
          const rate = Math.round(this.pendingRecords.length / duration * 1000);
          
          console.log(`Inserted ${this.pendingRecords.length} records in ${duration}ms (${rate} records/sec)`);
          
          this.pendingRecords = [];
        } catch (error) {
          console.error('Bulk insert failed:', error);
          throw error;
        }
      }
      
      async performBulkInsert(records) {
        // PostgreSQL example
        const values = records.map(record => `(
          '${record.id}',
          '${record.firstName.replace(/'/g, "''")}',
          '${record.lastName.replace(/'/g, "''")}',
          '${record.email}',
          '${record.createdAt.toISOString()}'
        )`).join(',');
        
        const query = `
          INSERT INTO ${this.tableName} (id, first_name, last_name, email, created_at)
          VALUES ${values}
        `;
        
        await this.connection.query(query);
      }
      
      async close() {
        await this.flush(); // Insert any remaining records
      }
    }

// Usage with data generation
async function generateAndInsertLargeDataset(totalRecords) {
  const inserter = new DatabaseBulkInserter(dbConnection, 'users');
  const generator = new OptimizedDataGenerator();

  try {
    for (let i = 0; i < totalRecords; i++) {
      const record = generator.generateOptimizedUser();
      await inserter.addRecord(record);

      if (i % 100000 === 0) {
        console.log(`Generated ${i}/${totalRecords} records`);
      }
    }
  } finally {
    await inserter.close();
  }
}

    2. Connection Pooling and Transaction Management

    Manage database connections efficiently:

    const { Pool } = require('pg'); // PostgreSQL example

class HighPerformanceInserter {
  constructor(poolConfig) {
    this.pool = new Pool({
      ...poolConfig,
      max: 20, // Maximum number of connections
      idleTimeoutMillis: 30000,
      connectionTimeoutMillis: 2000,
    });
  }

  async generateAndInsertWithTransactions(totalRecords, batchSize = 10000) {
    const client = await this.pool.connect();

    try {
      let processedRecords = 0;

      while (processedRecords < totalRecords) {
        const remainingRecords = totalRecords - processedRecords;
        const currentBatchSize = Math.min(batchSize, remainingRecords);

        await client.query('BEGIN');

        try {
          // Generate and insert batch within transaction
          const batch = this.generateBatch(currentBatchSize, processedRecords);
          await this.insertBatch(client, batch);

          await client.query('COMMIT');
          processedRecords += currentBatchSize;

          console.log(`Committed batch: ${processedRecords}/${totalRecords} records`);
        } catch (error) {
          await client.query('ROLLBACK');
          console.error('Batch failed, rolled back:', error);
          throw error;
        }
      }
    } finally {
      client.release();
    }
  }

  generateBatch(size, startId) {
    return Array.from({ length: size }, (_, index) => ({
      id: startId + index + 1,
      name: faker.person.fullName(),
      email: faker.internet.email(),
      createdAt: faker.date.recent({ days: 365 })
    }));
  }

  async insertBatch(client, batch) {
    const values = batch.map((record, index) =>
      `($${index * 4 + 1}, $${index * 4 + 2}, $${index * 4 + 3}, $${index * 4 + 4})`
    ).join(',');

    const query = `
      INSERT INTO users (id, name, email, created_at)
      VALUES ${values}
    `;

    const params = batch.flatMap(record => [
      record.id,
      record.name,
      record.email,
      record.createdAt
    ]);

    await client.query(query, params);
  }
}
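A minimal usage sketch might look like the following; the connection settings are placeholders, and pool.end() is called at the end so the process can exit cleanly.

// Illustrative usage; connection settings are placeholders.
const inserter = new HighPerformanceInserter({
  host: 'localhost',
  database: 'testdata',
  user: 'loadtest',
  password: process.env.PGPASSWORD
});

inserter.generateAndInsertWithTransactions(5_000_000, 10000)
  .then(() => console.log('Transactional bulk load complete'))
  .catch(err => console.error('Bulk load failed:', err))
  .finally(() => inserter.pool.end());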

    Optimize your database insertions with our database seeding tools.

    File Output Optimization

    1. Streaming File Writers

    Write large datasets to files efficiently:

    const fs = require('fs');
    const { Transform } = require('stream');

class HighPerformanceFileWriter {
  constructor(filename, format = 'json') {
    this.filename = filename;
    this.format = format;
    this.writeStream = fs.createWriteStream(filename);
    this.recordCount = 0;

    // Initialize file format
    if (format === 'json') {
      this.writeStream.write('[');
    }
  }

  async writeRecord(record) {
    let output;

    switch (this.format) {
      case 'json':
        output = (this.recordCount > 0 ? ',' : '') + JSON.stringify(record);
        break;
      case 'jsonl':
        output = JSON.stringify(record) + '\n';
        break;
      case 'csv':
        output = this.recordCount === 0
          ? this.generateCSVHeader(record) + '\n' + this.recordToCSV(record) + '\n'
          : this.recordToCSV(record) + '\n';
        break;
      default:
        throw new Error(`Unsupported format: ${this.format}`);
    }

    return new Promise((resolve, reject) => {
      this.writeStream.write(output, (error) => {
        if (error) reject(error);
        else {
          this.recordCount++;
          resolve();
        }
      });
    });
  }

  recordToCSV(record) {
    const values = Object.values(record).map(value => {
      if (typeof value === 'string') {
        return `"${value.replace(/"/g, '""')}"`; // Escape quotes
      }
      return value;
    });
    return values.join(',');
  }

  generateCSVHeader(record) {
    return Object.keys(record).join(',');
  }

  async close() {
    return new Promise((resolve) => {
      if (this.format === 'json') {
        this.writeStream.write(']');
      }
      this.writeStream.end(() => {
        console.log(`File ${this.filename} written with ${this.recordCount} records`);
        resolve();
      });
    });
  }
}

// Usage for large file generation
async function generateLargeFile(filename, recordCount, format = 'jsonl') {
  const writer = new HighPerformanceFileWriter(filename, format);
  const generator = new OptimizedDataGenerator();

  try {
    console.log(`Generating ${recordCount} records to ${filename}`);

    for (let i = 0; i < recordCount; i++) {
      const record = generator.generateOptimizedUser();
      await writer.writeRecord(record);

      if (i % 50000 === 0) {
        console.log(`Written ${i}/${recordCount} records`);
      }
    }
  } finally {
    await writer.close();
  }
}

    2. Compressed Output

    Reduce file size with compression:

    const zlib = require('zlib');

class CompressedFileWriter {
  constructor(filename, compressionLevel = 6) {
    this.filename = filename;
    this.writeStream = fs.createWriteStream(filename);
    this.gzipStream = zlib.createGzip({ level: compressionLevel });

    // Pipe gzip stream to file
    this.gzipStream.pipe(this.writeStream);
    this.recordCount = 0;
  }

  async writeRecord(record) {
    const jsonLine = JSON.stringify(record) + '\n';

    return new Promise((resolve, reject) => {
      this.gzipStream.write(jsonLine, (error) => {
        if (error) reject(error);
        else {
          this.recordCount++;
          resolve();
        }
      });
    });
  }

  async close() {
    return new Promise((resolve) => {
      this.gzipStream.end(() => {
        this.writeStream.end(() => {
          console.log(`Compressed file ${this.filename} written with ${this.recordCount} records`);
          resolve();
        });
      });
    });
  }
}
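A short usage sketch, reusing OptimizedDataGenerator from earlier, writes gzip-compressed JSONL that most data tools can read directly:

// Writes newline-delimited JSON records into a gzip-compressed file.
async function generateCompressedFile(filename, recordCount) {
  const writer = new CompressedFileWriter(filename);
  const generator = new OptimizedDataGenerator();

  try {
    for (let i = 0; i < recordCount; i++) {
      await writer.writeRecord(generator.generateOptimizedUser());
    }
  } finally {
    await writer.close();
  }
}

generateCompressedFile('users.jsonl.gz', 1_000_000);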

    Distributed Generation

    1. Cluster-Based Generation

    Scale across multiple machines:

    const cluster = require('cluster');
    const numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // Master process - coordinate workers
  async function distributeGeneration(totalRecords, numWorkers = numCPUs) {
    const recordsPerWorker = Math.ceil(totalRecords / numWorkers);
    const workers = [];

    console.log(`Master: Starting ${numWorkers} workers for ${totalRecords} records`);

    for (let i = 0; i < numWorkers; i++) {
      const worker = cluster.fork({
        WORKER_ID: i,
        START_RECORD: i * recordsPerWorker + 1,
        END_RECORD: Math.min((i + 1) * recordsPerWorker, totalRecords)
      });

      workers.push(new Promise((resolve) => {
        worker.on('message', (message) => {
          if (message.type === 'complete') {
            console.log(`Worker ${i} completed: ${message.recordsGenerated} records`);
            resolve();
          }
        });
      }));
    }

    await Promise.all(workers);
    console.log('All workers completed');

    // Cleanup
    for (const id in cluster.workers) {
      cluster.workers[id].kill();
    }
  }

  // Start distributed generation
  distributeGeneration(10_000_000, numCPUs);
} else {
  // Worker process - generate assigned records
  const workerId = process.env.WORKER_ID;
  const startRecord = parseInt(process.env.START_RECORD);
  const endRecord = parseInt(process.env.END_RECORD);

  async function workerGeneration() {
    const generator = new OptimizedDataGenerator();
    const writer = new HighPerformanceFileWriter(
      `output_worker_${workerId}.jsonl`,
      'jsonl'
    );

    let recordsGenerated = 0;

    try {
      for (let i = startRecord; i <= endRecord; i++) {
        const record = generator.generateOptimizedUser();
        record.id = i; // Ensure unique sequential IDs

        await writer.writeRecord(record);
        recordsGenerated++;

        if (recordsGenerated % 10000 === 0) {
          console.log(`Worker ${workerId}: ${recordsGenerated}/${endRecord - startRecord + 1} records`);
        }
      }
    } finally {
      await writer.close();
    }

    // Notify master
    process.send({
      type: 'complete',
      workerId: workerId,
      recordsGenerated: recordsGenerated
    });
  }

  workerGeneration();
}

    Monitoring and Quality Assurance

    1. Performance Monitoring

    Track generation performance in real-time:

    class PerformanceMonitor {
      constructor() {
        this.startTime = Date.now();
        this.recordsGenerated = 0;
        this.memoryPeaks = [];
        this.throughputHistory = [];
      }
      
      recordGenerated() {
        this.recordsGenerated++;
        
        if (this.recordsGenerated % 10000 === 0) {
          this.logPerformanceMetrics();
        }
      }
      
      logPerformanceMetrics() {
        const now = Date.now();
        const elapsed = now - this.startTime;
        const currentThroughput = Math.round(this.recordsGenerated / elapsed * 1000);
        
        // Memory usage
        const memUsage = process.memoryUsage();
        const memoryMB = Math.round(memUsage.heapUsed / 1024 / 1024);
        
        this.memoryPeaks.push(memoryMB);
        this.throughputHistory.push(currentThroughput);
        
        console.log(`Performance: ${this.recordsGenerated} records, ${currentThroughput} rec/sec, ${memoryMB}MB memory`);
        
        // Alert on memory spikes
        if (memoryMB > 1000) {
          console.warn(`High memory usage: ${memoryMB}MB`);
        }
      }
      
      getFinalReport() {
        const totalTime = Date.now() - this.startTime;
        const avgThroughput = Math.round(this.recordsGenerated / totalTime * 1000);
        const peakMemory = Math.max(...this.memoryPeaks);
        
        return {
          recordsGenerated: this.recordsGenerated,
          totalTimeMs: totalTime,
          averageThroughput: avgThroughput,
          peakMemoryMB: peakMemory,
          throughputHistory: this.throughputHistory
        };
      }
    }
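A brief sketch of how the monitor might be wired into a batch loop; processBatch is the placeholder introduced in the batch-processing section above.

async function generateWithMonitoring(totalRecords, batchSize = 10000) {
  const monitor = new PerformanceMonitor();
  const generator = new OptimizedDataGenerator();

  for (let done = 0; done < totalRecords; done += batchSize) {
    const count = Math.min(batchSize, totalRecords - done);
    const batch = Array.from({ length: count }, () => generator.generateOptimizedUser());

    // Persist the batch, then count it toward throughput and memory metrics
    await processBatch(batch);
    batch.forEach(() => monitor.recordGenerated());
  }

  console.log(monitor.getFinalReport());
}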

    2. Data Quality Validation

    Ensure data quality at scale:

    class ScaleDataValidator {
      constructor() {
        this.validationErrors = [];
        this.sampledRecords = [];
        this.sampleRate = 0.001; // Validate 0.1% of records
      }
      
      validateRecord(record, recordIndex) {
        // Always validate critical fields
        if (!record.id || !record.email) {
          this.validationErrors.push({
            recordIndex,
            error: 'Missing required fields'
          });
        }
        
        // Sample records for detailed validation
        if (Math.random() < this.sampleRate) {
          this.sampledRecords.push(record);
          this.performDetailedValidation(record, recordIndex);
        }
      }
      
      performDetailedValidation(record, recordIndex) {
        // Email validation
        if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(record.email)) {
          this.validationErrors.push({
            recordIndex,
            error: `Invalid email format: ${record.email}`
          });
        }
        
        // Date validation
        if (record.createdAt && isNaN(new Date(record.createdAt))) {
          this.validationErrors.push({
            recordIndex,
            error: 'Invalid date format'
          });
        }
      }
      
      getValidationReport() {
        const uniqueEmails = new Set(this.sampledRecords.map(r => r.email));
        const uniqueNames = new Set(this.sampledRecords.map(r => r.name));
        
        return {
          totalErrors: this.validationErrors.length,
          sampleSize: this.sampledRecords.length,
          uniqueEmailsInSample: uniqueEmails.size,
          uniqueNamesInSample: uniqueNames.size,
          diversityScore: uniqueNames.size / this.sampledRecords.length,
          errors: this.validationErrors
        };
      }
    }
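To validate without slowing generation down, the validator can be attached to a streaming source; the sketch below reuses LargeDatasetGenerator from the first section.

const validator = new ScaleDataValidator();
const source = new LargeDatasetGenerator({ maxRecords: 1_000_000 });

let index = 0;
source.on('data', (record) => {
  // Cheap checks run on every record; detailed checks only on the sample
  validator.validateRecord(record, index++);
});

source.on('end', () => {
  console.log(validator.getValidationReport());
});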

    Conclusion

    Generating large volumes of test data efficiently requires careful planning and optimization across multiple dimensions:

Key Strategies:

• Use streaming and batch processing for memory efficiency
• Implement parallel processing for better performance
• Optimize database operations with bulk inserts and transactions
• Monitor performance and quality throughout generation
• Plan for distributed generation when scaling beyond single machines

Performance Considerations:

• Memory usage grows with data complexity, not just volume
• Database write performance often becomes the bottleneck
• Network I/O can limit distributed generation approaches
• Quality validation overhead should be balanced with thoroughness

Best Practices:

• Start with performance profiling to identify bottlenecks
• Use appropriate data structures for your use case
• Implement comprehensive monitoring and alerting
• Plan for graceful error handling and recovery
• Document your generation process for reproducibility

Ready to generate large-scale test datasets? Start with our high-performance bulk generator designed for enterprise-scale data generation.

    Related Articles:

  • The Ultimate Guide to Test Data Generation
  • Generating Realistic User Data for Web Applications
  • Customizing Fake Data with Regular Expressions

Need help optimizing large-scale data generation for your specific use case? Contact our performance experts for specialized guidance.

    Ready to Generate Test Data?

    Put these best practices into action with our comprehensive data generation tools.