Effective Identifier Generation in Distributed Systems

Generating unique identifiers in distributed systems presents significant challenges. While single-server environments can easily rely on auto-increment values or sequences, distributed environments where multiple servers must simultaneously generate IDs require careful consideration to ensure efficiency and avoid duplication.

In this article, we'll explore an effective implementation of a unique identifier generator designed for distributed environments: the DistributedIdGenerator.

Requirements for Distributed Identifiers

Uniqueness: Each identifier must be unique across the entire system.
Scalability: The system should be capable of generating thousands or millions of identifiers per second.
Time-sortability: Being able to sort identifiers chronologically is often valuable.
Compactness: Identifiers should be appropriately sized, considering storage space and transmission bandwidth.
Unpredictability: In security-sensitive contexts, identifiers should not be easily predictable.

Implementation Example: DistributedIdGenerator

Here's an example implementation of an identifier generator for distributed environments:

          
import kr.co.example.global.extension.now
import java.time.ZoneOffset
import java.util.concurrent.atomic.AtomicLong

class DistributedIdGenerator(
    private val serverId: Int
) {
    companion object {
        const val MAX_SERVER_ID = 1000
    }

    private val maxSequence = 999999
    private val counter = AtomicLong()

    init {
        require(serverId < MAX_SERVER_ID) {
            throw IllegalStateException("exceed max server id $serverId")
        }
    }

    fun generate(): String {
        val timestamp = now().toEpochSecond(ZoneOffset.UTC)
        val uniqueId = counter.updateAndGet { current ->
            if (current >= maxSequence) 0 else current + 1
        }
        return String.format("%s%03d%06d", timestamp, serverId, uniqueId)
    }
}

Implementation Analysis

This implementation works as follows:

Timestamp Utilization: Uses the current time in seconds (UTC) as a timestamp.
Server ID: Each server has a unique ID (0-999), which is included in the identifier.
Sequence Number: Uses an AtomicLong-managed sequence number (0-999999) to distinguish identifiers generated within the same second.
Reset Logic: When the sequence number reaches its maximum value, it resets to 0.

The final identifier format is: "Timestamp (seconds) + Server ID (3 digits) + Sequence Number (6 digits)".

The diagram above illustrates how multiple servers can generate unique IDs simultaneously without coordination. Each server:

Uses the same timestamp component (when generating IDs in the same second)
Incorporates its unique server ID in the middle section
Maintains its own sequence counter that increments independently

This ensures that even if multiple servers generate IDs at the exact same millisecond, the resulting IDs will still be globally unique.

Key Benefits

Distributed by Design: By incorporating a server ID, the generator ensures uniqueness across multiple servers.
High Throughput: Can generate up to 1 million unique IDs per second per server.
Time-ordered: IDs are naturally sorted by time due to the timestamp prefix.
Bounded Length: Maximum length is predictable (usually around 19-20 characters for current timestamps).
No Coordination Needed: Servers don't need to communicate to generate unique IDs.

Verification with Tests

The following test code verifies that the identifier generator meets the requirements:

import org.assertj.core.api.Assertions.assertThat
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Assertions.assertTrue
import org.junit.jupiter.api.DisplayName
import org.junit.jupiter.api.RepeatedTest
import org.junit.jupiter.api.Test
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

class DistributedIdGeneratorTest {
    private val DistributedIdGenerator = DistributedIdGenerator(1)

    @Test
    @DisplayName("serverId_validation")
    fun test_constructor() {
        // given
        val invalidServerId = 1000
        val maxServerId = 999
        // when
        val failedResult = runCatching { DistributedIdGenerator(invalidServerId) }
            .onFailure { it.printStackTrace() }
        val successResult = runCatching { DistributedIdGenerator(maxServerId) }
            .onFailure { it.printStackTrace() }
        // then
        assertThat(failedResult.isFailure).isTrue
        assertThat(failedResult.exceptionOrNull()).isInstanceOf(IllegalStateException::class.java)
        assertThat(successResult.isSuccess).isTrue
    }

    @RepeatedTest(10)
    @DisplayName("should_generate_1M_ids_per_second_with_max_length_20")
    fun test_generate_should_not_over_20_length_should_not_duplicated_value() {
        val totalKeys = 1_000_000 // Total number of keys to generate
        val threadCount = 10 // Number of threads to use
        val keysPerThread = totalKeys / threadCount // Keys to generate per thread
        val generatedKeys = ConcurrentHashMap.newKeySet<String>() // ConcurrentHashSet to check for duplicates
        val executor = Executors.newFixedThreadPool(threadCount) // Thread pool creation

        (1..threadCount).forEach { i ->
            executor.submit {
                (1..keysPerThread).forEach { j ->
                    val key = DistributedIdGenerator.generate()
                    assertTrue(key.length <= 20, "Key length exceeded 20 characters: $key")
                    generatedKeys.add(key) // Add the generated key to the Set
                }
            }
        }

        // Wait for all threads to complete their work
        executor.shutdown()
        executor.awaitTermination(1, TimeUnit.HOURS)

        // Verify that the total number of keys generated matches the expected count
        assertEquals(totalKeys, generatedKeys.size, "Duplicate keys detected!")
    }
}

Potential Improvements

Millisecond Precision: Using milliseconds instead of seconds would allow for more identifiers per time unit.
Snowflake Variation: Twitter's Snowflake ID format could be adopted for improved bit utilization.
UUID Alternative: For cases where guaranteed uniqueness is more important than sequentiality.
Persistence: For recovering sequence numbers after restarts to prevent possible collisions.

Spring Boot Integration Example

Here's an example of how to integrate the DistributedIdGenerator into a Spring Boot application:

import kr.co.example.global.ActiveProfileManager
import org.slf4j.LoggerFactory
import org.springframework.boot.SpringApplication
import org.springframework.context.ApplicationContext
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.ComponentScan
import org.springframework.context.annotation.Configuration
import kotlin.math.absoluteValue

@Configuration
@ComponentScan(basePackages = ["kr.co.example.notification"])
class BeanConfig(
    private val detector: ActiveProfileManager,
    private val applicationContext: ApplicationContext
) {
    private val logger = LoggerFactory.getLogger(javaClass)

    @Bean
    fun distributedIdGenerator(): DistributedIdGenerator {
        val serverId: Int = if (detector.isLocal()) {
            1
        } else {
            val podName = System.getenv("HOSTNAME")
            if (podName.isNullOrEmpty()) {
                logger.info("HOSTNAME is not set. Shutting down the application.")
                SpringApplication.exit(applicationContext) // Gracefully shutdown Spring application
            }
            val uniquePart = podName.substringAfterLast("-")
            uniquePart.hashCode().absoluteValue.rem(DistributedIdGenerator.MAX_SERVER_ID)
        }
        logger.info("server id is : {}", serverId)
        return DistributedIdGenerator(serverId)
    }
  }

This configuration class demonstrates intelligent server ID assignment:

In local development environments, it uses a hardcoded server ID (1).
In Kubernetes environments, it extracts the pod name from the HOSTNAME environment variable.
It generates a consistent server ID by hashing the unique part of the pod name and taking the modulo of MAX_SERVER_ID.
This ensures each pod gets a unique ID within the valid range, maintaining distributed ID generation across deployments.

Conclusion

Generating unique identifiers in distributed environments requires careful consideration of uniqueness, performance, and scalability. The approach presented here offers a simple yet effective solution that balances these requirements without needing external coordination between servers.

By combining a timestamp, server ID, and sequence number, this implementation can generate millions of unique identifiers per second across a distributed system, with each identifier being no longer than 20 characters.

When implementing your own distributed ID generation system, consider your specific requirements around sortability, predictability, and collision resistance to choose the most appropriate approach.