Unique
I’ve already looked at the unique
package in my Go 1.23 post. I’m revisiting it, as it seems to be misappreciated, even overlooked.
Some gophers have dismissed it as pointless, thinking that maps or pointers can do the same thing. Some are not really worried about saving a bit of memory. But there’s a bit more to it.
It’s true that in many ways a unique.Handle
is just a pointer, but it has other features such as avoiding memory leaks as I already discussed in unique package and further elaborate below.
But primarily unique
is about making your code faster by speeding up comparisons. Memory savings are a secondary benefit.
I will give a realistic example of how it might be used. Then I’ll demonstrate the performance and other advantages. Then finally explain a few gotchas.
Background
General
Handles
To understand the point of interning let’s first consider handles. Handles are designed to provide a convenient or efficient way to deal with objects. Efficiency is often a result of the handle being much smaller than the object to which it refers.
Here are some examples of what I consider handles in code.
- array index - an integer allowing access to an object
- pointer - a memory address that points to the object
- key into map - the key provides access to the object
- file name (path) - a string giving access to a file
- checksum/CRC/hash - a value derived from the object
- cryptographic hash - a hash where it’s “impossible” to derive the object from the value
There are several properties of handles that may be useful:
- comparable - comparing handles == comparing values
- retrievable - can you get back the original value?
- uniqueness - values always have different handles
- derivable - you can derive the handle from the object (as opposed to searching for it)
Generally handles are comparable (support ==, !=) but not all types of handles have all the above properties. For example:
- a cryptographic hash is not retrievable (which is what makes it useful)
- an array index is not derivable - to get the index given the value you must search the array
- a hash value is not unique - e.g. collisions in hash table lookups have to be catered for
- but a hash is derivable - you can obtain the handle just from the object value
Go
History of Go Interning
The Go standard library has used “interning” for various specific data types from its inception.
However, generics (go 1.18) made it possible to create an interning package that can be used with any comparable type.
The Go standard library has had a such a package (internal/intern
), for internal use, since Go 1.20. This was in turn based on https://github.com/go4org/intern.
This internal package was enhanced and made public as the unique
package in Go 1.23.
Constant String “Interning”
Since strings in Go are immutable the compiler can “intern” literal strings. (This can be important to understand when testing unique
with strings.)
If you have multiple literals of the same string, they will all use the same memory for the characters. The following calls to unsafe.StringData()
will all return the same address.
const sConst = "abc"
println(unsafe.StringData(sConst))
println(unsafe.StringData(sConst[:1]))
println(unsafe.StringData("abc"))
println(unsafe.StringData("ab" + "c"))
unsafe.StringData()
returns the address of the string’s underlying byte
array.
Only string literals are interned. Strings generated at run-time are not interned even if they are the same as an existing string. (Note that above "ab" + "c"
is a string constant evaluated at compile-time.)
Why do we need it?
If we have a lot of identical objects (same type and value) we can use them in two ways:
- by value - use a copy wherever necessary
- by reference - pass around a pointer to each object
Using objects by value is great most of the time but can be inefficient when you have multiple instances of large objects.
Then, why not just use simple pointers? Ie, each identical instance is represented by a placeholder - a pointer to the archetypal instance.
This is just what the unique
package does, but it also has the advantage of relieving you of the burden of tracking who owns the unique copy of the object.
In other words, it’s useful when there is no clear “owner” of (large) identical objects, or you don’t want to attach ownership to anything in particular.
❝ ... useful when there is no clear owner ❞The unique
package provides unique.Make[T]()
to identify duplicates and retain each unique value exactly once. It returns a unique.Handle[T]
.
The handle allows you to later retrieve the value, using the unique.Handle[T].Value()
method. More importantly, for large objects, you can compare handles more efficiently than comparing the objects themselves.
Example
Maybe an example will clarify this. Consider a site that sells products where customers are given discount vouchers which can be applied to certain products.
Using strings
First, let’s look at storing the vouchers as strings. A customer
has a list of vouchers ([]string
) that they own. A product
has a “set” of available vouchers (map[string]struct{}
), which can be looked up quickly.
type (
product struct {
ID productID
Price Currency
Available map[string]struct{}
}
customer struct {
Name string
Vouchers []string
}
)
// hasDiscount - do you get a discount on a product?
func hasDiscount(c customer, p product) bool {
for _, v := range c.Vouchers {
if _, found := p.Available[v]; found {
return true
}
}
return false
}
This code works fine, but consider if…
For security, or other reasons, voucher codes are long - maybe all the same length, say 26 characters.
There are thousands of different products many of which have the same available discounts.
Furthermore, the sales department have been profligate in their distribution of discounts. Customers have lots of vouchers.
The end result is that at any time there are many duplicate vouchers of 34 bytes. But, when stored on the heap, they consume 48 bytes - the next largest memory class as I explained in Go 1.23 HostLayout
We make the reasonable assumption here that voucher strings come from “external” sources such as a database, user entry, etc, and are not subject to string interning as explained in the Background->Go section above.
Using unique
Now consider storing vouchers as unique
handles instead of string
s. (The code for hasDiscount()
is the same as above.)
type (
product struct {
ID productID
Price Currency
Available map[unique.Handle[string]]struct{}
}
customer struct {
Name string
Vouchers []unique.Handle[string]
}
)
Now, every duplicate copy of a voucher only requires 8 bytes (on 64-bit systems where a pointer is 8 bytes) instead of 48. Furthermore, it also reduces GC load.
Better yet, hasDiscount()
runs at least 3 times faster, due to faster map lookups.
Note: Just doing simple comparisons (rather than map lookups) is more than 10 times faster - see the benchmark code below.
Advantages
Saves Memory
The obvious advantage of interning is that it saves memory. Personally, I don’t think this is particularly important, since memory is cheap nowadays.
If you are coding in a memory-constrained environment and/or you have a lot of large duplicated objects it may be useful.
Saves Time
The main advantage is the speed of comparing handles. Internally a unique.Handle
is just a pointer (ie, 8 bytes) - it takes a single CPU instruction to compare them, whereas comparing something like 2 long strings (of the same length) takes longer. Viktor Pakhuchyi’s benchmarks show that it is many times faster even for simple strings.
Of course, this is only useful if you do lots of comparisons. If you need to use the values (using unique.Handle[T].Value()
) more than you compare them then it may be detrimental to performance.
Here’s my quick benchmark which shows that comparing handles is more than 10 times faster than comparing 26 character strings.
import (
"crypto/rand"
"testing"
"unique"
)
func BenchmarkCompareUniqueHandles(b *testing.B) {
h1 := unique.Make(rand.Text())
h2 := unique.Make(rand.Text())
for b.Loop() {
_ = h1 == h2
}
}
On my machine this shows that comparing handles takes less than 0.3 ns/op, whereas comparing the strings takes about 3 ns/op.
Avoiding Memory Leaks
One thing that failed to occur to me when I first read about unique
is that there is no way to delete objects once they are interned.
This brings me to one of the coolest things about the package. When the last handle to an interned object is garbage-collected, the interned object is automatically deleted.
Why is this useful? I’ll give a real-life example…
Some server software I created had a large map of objects. Theses were referenced in many places in the software using the map key (string) rather than the objects themselves (struct). The problem was that these objects would over time not be required anymore, ie, after all instances of an object’s key had been deleted.
There was no simple way to determine when nothing was using them. Moreover, the code was evolving – the keys were being retained in more and more places.
This is a classic memory leak in Go. It meant that the map would get bigger and bigger and after days or weeks the system memory would be exhausted.
We tried various things like wrapping access to the map and using wait-groups and atomic counters. Eventually we settled on running a regular cleanup is a separate go-routine to scan all the data structures to identify those no longer being used and delete them from the map(s).
Using unique
to intern the objects (instead of storing them in a map), then comparing and accessing their handles (instead of the map key strings) means that the memory leak problem just goes away. As soon as the last handle to an object is no longer referenced and garbage collected the object is automatically deleted.
Generic (Comparable Types)
Because unique.Make()
is a generic function you can use it with any comparable type.
Remember, it only makes sense to use it with large types - ie, where comparing objects of the type is slower than comparing the handles. It makes no sense to intern int
s – but struct
s or long string
s could benefit.
In the future, it might be useful if unique
is enhanced to work with any type not just comparable types. This would require a new function such as unique.MakeFunc()
, that takes a comparison function.
func MakeFunc[T any](T, func(T, T) bool) Handle[T]
Avoids Data Races
A useful features is that unique.Make()
and unique.Handle[T].Value()
are safe for concurrent use.
If you used another solution, such as pointers, to deduplicate values then you would need to protect their concurrent use with something like sync.Mutex
or sync.RWMutex
.
String Cloning
When writing Go, a lot of people have discovered that using a substring of a larger string prevents the larger string from being garbage-collected until the substring is no longer in use.
This is a known gotcha of Go that can result in a lot of wasted memory.
The unique
package avoids this problem for strings. unique.Make[string]()
will “clone” strings so that when a sub-string of a larger string is interned this does not prevent the larger string from being garbage collected.
Warning: To guarantee the correct behaviour please use Go 1.23.2+ or Go 1.24 - see Issue 69370
The problem is avoided for strings but still exists for other types - see below.
Gotchas
GC Issues
Although the problem of substrings was addressed (unique.Make()
will “clone” strings as mentioned above), there is still a problem when an interned object refers to part of a larger object on the heap. This is demonstrated by the following:
a := new([8192]int) // create large array on the heap
h := unique.Make(&a[42])
runtime.AddCleanup(a, func(int) { println("GC of a") }, 0)
a = nil // not nec. but indicates that the array isn't referenced
runtime.GC()
time.Sleep(time.Second) // wait for GC to finish
println("after GC")
runtime.KeepAlive(h)
The above code only prints “after GC”. However, if you comment out the call to unique.Make()
, it first prints “GC of a”. Try this on the Go Playground.
This problem applies to pointers (and interfaces) that point to something on the heap, as well as anything that contains them like a struct or array.
Is it important. Interning pointers is pointless in itself. I can imagine a scenario where you want to intern structs which have a pointer field. TBD: add a realistic example.
Serialization
When I first read about interning, I considered using it on a large server application. This stores millions of objects (structs) in memory and interning would greatly speed up comparisons of these objects.
Moreover, the code changes were relatively simple, involving simply converting something like this:
type customer struct {
Name string
Vouchers []string
}
to something like this:
type customer struct {
Name string
Vouchers []unique.Handle[string]
}
But I soon discovered a problem. These structs on occasion need to be serialised to disk, requiring them to be GOB-encoded.
Unfortunately, unique
does not support any sort of encoding (GOB, JSON, etc.). For example, in JSON handles are encoded as {}
. So the encoding/decoding round trip loses the value.
It would be great if the standard Go library encoding packages were enhanced to support unique.Handle
. This would be very simply achieved by adding custom marshaling/unmarshaling like this (for JSON) to the unique
package.
func (h Handle[T]) MarshalJSON() (r []byte, err error) {
return json.Marshal(h.Value())
}
func (h *Handle[T]) UnmarshalJSON(data []byte) error {
var v T
if err := json.Unmarshal(data, &v); err != nil {
return err
}
*h = Make[T](v)
return nil
}
Deletion Overload
I also discovered that if you intern objects too quickly deletions can’t keep up. Remember, an interned object is deleted once all it’s handles are garbage-collected. This is effectively a memory leak and can cause memory exhaustion if left unchecked.
Note that this was just a test and not a good use for interning. Generally you would keep handles around and compare them, not continuously create and discard them.
I believe this problem is being addressed by having cleanup handled by more than one goroutine. See issue 71772.
Shared “Sets” of Values
On first use, I was surprised that the unique
package uses “global” data structures.
Globals smell of bad design. I expected that each user of unique
would create a container for their own use.
As it is, if different packages in the same program intern values of the same type then they will be sharing the same “set” of values. So if you intern strings you will be sharing strings with any packages of your import tree that also intern strings.
On further consideration, I concluded this is not really a problem. Though, I guess it could give misleading results if you’re testing the behaviour of the unique
package or analysing its performance.
If this bothers you, you can obtain your own private “set” of values by creating a new type. For example, use type myString string
, in order to get a different “container” whenever you use is like unique.Make(myString("abc"))
.
Conclusion
The unique
package has several cool features. It can be used to save memory and even avoid memory leaks, and is especially useful for concurrent code being thread-safe.
However, the main advantage is that you can get large increases in the speed of comparisons, when the objects to be compared are large.
As mentioned above you should be aware of limitations such interning parts of large objects preventing them from being garbage collected.
Hopefully, these limitations are addressed and marshaling/unmarshaling (JSON, GOB, etc.) added in a future Go release.
Comments