
Quickstart

Installation

$ nimble install -y intops

Add intops to your .nimble file:

requires "intops"

Usage

The most straightforward way to use intops is by importing intops and calling the operations from it:

import intops

echo carryingAdd(12'u64, 34'u64, false)

Output:

(res: 46, carryOut: false)

intops.carryingAdd is a dispatcher. When invoked, it calls the best of the available implementations of this operation for the given environment.

Notice that we call carryingAdd with the uint64 type set explicitly. This is important because the size of int depends on the platform, so intops doesn’t allow this kind of ambiguity.

If you try to call carryingAdd with an int, your code simply won’t compile:

echo not compiles carryingAdd(12, 34, false)

Output:

true

All available dispatchers are listed in the Imports section of the API docs for the intops module.

The operations are grouped into families, each one living in a separate submodule. For example, the carryingAdd operation mentioned above is imported from the intops/ops/add submodule, so that’s where you’ll find its documentation: /apidocs/intops/ops/add.html.

Calling Specific Implementations

You may want to override the dispatcher’s choice. To do that, import a particular implementation from intops/impl directly and call the function from it:

import intops/impl/intrinsics

echo intrinsics.gcc.carryingAdd(12'u64, 34'u64, false)

Output:

(46, false)

Implementations are also grouped into families: pure Nim, C intrinsics, inline C, and inline Assembly. Each family can be further split into subgroups.

For example, the carryingAdd implementation above is based on C intrinsics that are specific to GCC/Clang. All such implementations live in intops/impl/intrinsics/gcc. There’s also a subgroup that provides an implementation based on Intel/AMD specific C intrinsics—intops/impl/intrinsics/x86.

To see all available implementations for a particular operation, find it in the API index.

Caveat

When you use a dispatcher, you can be sure that your code will compile in any environment. It is the dispatcher’s job to provide a code path for any case, so you don’t have to worry about it.

But if you choose to call an implementation manually, it is your job to validate that this implementation is available in the environment you’re using it in. If you attempt to use an implementation that is unavailable, you’ll get a compile-time error.

For example, trying to use an Inline Assembly implementation with the intopsNoInlineAsm flag (more on flags in the section below) will cause a compilation error:

import intops/impl/inlineasm

echo not compiles inlineasm.x86.carryingAdd(12'u64, 34'u64, false)

If compiled with -d:intopsNoInlineAsm, this outputs:

true
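If you want a graceful fallback instead of a hard failure, one possible pattern (a sketch, not an official intops idiom) is to guard the manual call with compiles and fall back to the dispatcher:

```nim
import intops
import intops/impl/inlineasm

# Hypothetical guard: call the inline-asm implementation when it compiles
# in the current environment, otherwise fall back to the dispatcher.
when compiles(inlineasm.x86.carryingAdd(12'u64, 34'u64, false)):
  echo inlineasm.x86.carryingAdd(12'u64, 34'u64, false)
else:
  echo carryingAdd(12'u64, 34'u64, false)
```

This way the code still compiles when built with -d:intopsNoInlineAsm, at the cost of possibly not running the implementation you preferred.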

Compilation Flags

You can forbid dispatchers from picking particular implementation families using these compilation flags:

  • intopsNoIntrinsics
  • intopsNoInlineAsm
  • intopsNoInlineC

For example, to avoid using Inline Assembly implementations, compile your code with -d:intopsNoInlineAsm:

$ nim c -d:intopsNoInlineAsm mycode.nim

Of course, you can combine those flags. For example, if you want to use only pure Nim implementations, pass all three forbidding flags:

$ nim c -d:intopsNoIntrinsics -d:intopsNoInlineAsm -d:intopsNoInlineC mycode.nim

Contributor’s Guide

Optimizing arithmetic operations for a variety of environments and consumers is a never-ending chase. There will hardly ever be a moment when intops is 100% ready: there’s always an older Nim version, a newer GCC version, or an obscure CPU vendor to target.

Because of that, contributors’ help is essential for intops’ development. You can help improve intops by improving the code, improving the docs, and reporting issues.

Library Structure

src
│   intops.nim              <- entrypoint for the public API
│
└───intops
    │   consts.nim          <- global constants for environment detection
    │
    ├───impl                <- implementations of primitives
    │   │   inlineasm.nim   <- entrypoint for the Inline ASM family of implementations
    │   │   inlinec.nim
    │   │   intrinsics.nim
    │   │   pure.nim
    │   │   ...
    │   │
    │   ├───inlineasm
    │   │       arm64.nim   <- implementation in ARM64 Assembly 
    │   │       x86.nim
    │   │       ...
    │   │
    │   └───...
    │
    └───ops                 <- operation families
            add.nim         <- addition flavors: carrying, saturating, etc.
            mul.nim
            sub.nim
            ...

The entrypoint is the root intops module. It’s the library’s public API and exposes all the available primitives.

Each operation family has its own submodule in intops/ops. E.g. intops/ops/add is the submodule that contains various addition flavors.

These submodules contain the dispatchers that pick the best implementation of a given operation for the given CPU, OS, and C compiler. In other words, each operation “knows” its implementations and “decides” which one is best to run.

The actual implementations are stored in submodules in intops/impl. For example, intops/impl/intrinsics contains all primitives implemented with C intrinsics.

API Conventions

  1. intops follows the common Nim convention of naming things in camelCase.
  2. intops prefers pure functions that return values over ones that modify mutable arguments.
  3. Docstrings are mandatory for the dispatchers in intops/ops modules because they are the library’s public API.
  4. Operations that return a wider type than the input type are called widening. For example, wideningMul is a multiplication that takes two 64-bit integers and returns a single 128-bit integer (although represented as a pair of 64-bit integers).
  5. Operations that return a carry or borrow flag are called carrying and borrowing, e.g. carryingAdd.
  6. Operations that return an overflow flag are called overflowing, e.g. overflowingSub.
  7. Operations that return the maximal or minimal type value when a type boundary is hit are called saturating, e.g. saturatingAdd.
  8. Carry, borrow, and overflow flags are booleans.
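To make conventions 5, 7, and 8 concrete, here are simplified pure-Nim sketches of the carrying and saturating semantics. These are illustrations only, not intops’ actual implementations (hence the Sketch suffix):

```nim
# Simplified sketches of the semantics behind the naming conventions.
# NOT intops' actual code; names with the "Sketch" suffix are hypothetical.

# "carrying": return the wrapped sum plus a boolean carry-out flag.
func carryingAddSketch(a, b: uint64, carryIn: bool): tuple[res: uint64, carryOut: bool] =
  let c = if carryIn: 1'u64 else: 0'u64
  let s = a + b + c  # unsigned arithmetic wraps in Nim
  # a carry happened iff the true sum didn't fit into 64 bits
  (res: s, carryOut: s < a or (s == a and carryIn))

# "saturating": clamp to the type's maximum instead of wrapping.
func saturatingAddSketch(a, b: uint64): uint64 =
  if a > high(uint64) - b: high(uint64) else: a + b

echo carryingAddSketch(12'u64, 34'u64, false)   # (res: 46, carryOut: false)
echo saturatingAddSketch(high(uint64), 1'u64)   # 18446744073709551615
```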

Tests

The tests for intops are located in a single file tests/tintops.nim.

These are integration tests that emulate real-life usage of the library and check two things:

  1. every dispatcher picks the proper implementation in every environment
  2. the results are correct no matter the implementation

To run the tests locally, use nimble test command.

With this command, the tests are run:

  • without compilation flags in runtime mode
  • without compilation flags in compile-time mode
  • with each compilation flag separately
  • with all compilation flags

When executed on the CI, the tests are run against multiple OSes, C compilers, and architectures:

  • amd64 + Linux + gcc 13
  • amd64 + Linux + gcc 14
  • amd64 + Windows + clang 19
  • i386 + Linux + gcc 13
  • amd64 + macOS + clang 17
  • arm64 + macOS + clang 17

This is hardly reproducible locally, but you can cover at least some of the cases by passing flags to nimble test. For example, to emulate running the tests on a 32-bit CPU, run nimble --cpu:i386 test. On Windows, you also must pass the path to your GCC installation, e.g. nimble --gcc.path:D:\mingw32\bin\ --cpu:i386 test. To run your tests continuously during development, use monit:

  1. Install monit with nimble install monit.
  2. Create a file called .monit.yml in the working directory:
%YAML 1.2
%TAG !n! tag:nimyaml.org,2016:
---
!n!custom:MonitorConfig
sleep: 1
targets:
  - name: Run tests
    paths: [src, tests]
    commands:
      - nimble test
      - nimble --gcc.path:D:\mingw32\bin\ --cpu:i386 test
    extensions: [.nim]
    files: []
    exclude_extensions: []
    exclude_files: []
    once: true
  3. Run monit run

Benchmarks

Benchmarking is crucial for a library like intops: you can’t make reasonable dispatching improvements unless you can back the changes with numbers.

There are two kinds of benchmarks: latency and throughput.

Latency benchmarks measure how long a particular operation takes to complete. For a latency benchmark, we run the same operation against random input many times, making sure the next iteration doesn’t start before the previous one completes. The result is measured in nanoseconds per operation.

Throughput benchmarks measure how many operations of a particular kind can be executed per unit of time. For a throughput benchmark, we spawn the same operation against random input many times back to back, so that multiple instances of the same operation are executed in parallel. The results are measured in millions of operations per second.
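As a rough sketch of the latency approach (not the actual harness under benchmarks/, which may differ), a dependent-chain timing loop in plain Nim might look like this:

```nim
import std/[monotimes, times]

# Hypothetical latency sketch: each iteration feeds its result into the next,
# so the operations form a dependency chain and can't overlap.
proc latencyNsPerOp(iterations: int): float =
  var acc = 12345'u64
  let start = getMonoTime()
  for i in 0 ..< iterations:
    acc = acc + uint64(i)  # depends on the previous acc
  let elapsedNs = (getMonoTime() - start).inNanoseconds
  doAssert acc != 0  # keep acc live so the loop isn't optimized away
  elapsedNs.float / iterations.float

echo latencyNsPerOp(1_000_000), " ns/op"
```

A throughput benchmark would instead use several independent accumulators so the CPU can execute iterations in parallel.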

Benchmarks are grouped by kind and operation family, e.g. benchmarks/latency/add.nim contains latency benchmarks for the add operations (overflowingAdd, saturatingAdd, etc.).

To run the benchmarks locally, use nimble bench command:

  • run latency and throughput benchmarks for all operations:
$ nimble bench
  • run latency and throughput benchmarks for particular operations:
$ nimble bench add
$ nimble bench sub mul
  • run latency or throughput benchmarks for all operations:
$ nimble bench --kind:latency
$ nimble bench --kind:throughput
  • run particular kind of benchmarks on a particular kind of operations:
$ nimble bench --kind:latency add
$ nimble bench --kind:throughput sub mul

Docs

The docs consist of two parts:

  • the book (this is what you’re reading right now)
  • the API docs

The book is created using mdBook.

The API docs are generated from the source code docstrings.

To build the docs locally, run:

  • nimble book to build the book
  • nimble apidocs to build the API docs
  • nimble docs to build both

Operations

Improving Existing Operations

intops’ public API exposes dispatchers for each available operation.

A dispatcher is a Nim template that contains the logic for selecting the best available implementation of the given operation for the given environment.

For example, carryingAdd mentioned in Quickstart is a dispatcher.

Let’s examine the code of this dispatcher:

template carryingAdd*(a, b: uint64, carryIn: bool): tuple[res: uint64, carryOut: bool] =
  when nimvm:
    pure.carryingAdd(a, b, carryIn)
  else:
    when cpuX86 and compilerMsvc and canUseIntrinsics:
      intrinsics.x86.carryingAdd(a, b, carryIn)
    elif compilerGccCompatible and canUseIntrinsics:
      intrinsics.gcc.carryingAdd(a, b, carryIn)
    elif cpu64Bit and compilerGccCompatible and canUseInlineC:
      inlinec.carryingAdd(a, b, carryIn)
    else:
      pure.carryingAdd(a, b, carryIn)

As you can see, a dispatcher is just a nested when-condition that checks if:

  1. the operation is being called at compile time (when nimvm)
  2. the code is run on a particular CPU (when cpuX86) and with a particular C compiler (and compilerMsvc)
  3. particular compilation flags were passed (and canUseIntrinsics)

Depending on these conditions, a particular implementation is called.

If you want to improve how intops chooses an implementation, find the corresponding dispatcher and modify the branching in it. All dispatchers are defined in intops/ops modules and are grouped by operation. For example, the dispatchers for addition flavors are defined in intops/ops/add.nim.

In the dispatchers, you can use the global constants defined in intops/consts.nim to check for the CPU architecture, C compiler, etc. If necessary, feel free to define new constants.
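Such constants are typically compile-time checks built from the compiler’s defines. A hypothetical sketch of what entries in consts.nim might look like (the names match those used in the dispatcher above, but the exact conditions are assumptions; check the real consts.nim):

```nim
# Hypothetical sketches of environment-detection constants.
# The real definitions live in intops/consts.nim and may differ.
const
  cpuX86* = defined(i386) or defined(amd64)
  cpu64Bit* = sizeof(int) == 8
  compilerMsvc* = defined(vcc)
  compilerGccCompatible* = defined(gcc) or defined(clang)
  canUseInlineAsm* = not defined(intopsNoInlineAsm)
```

Because these are consts, the when-branches in dispatchers are resolved entirely at compile time, and only the selected implementation ends up in the generated code.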

For example, if you want to prioritize inline C implementation over intrinsics, you could modify the dispatcher like so:

template carryingAdd*(a, b: uint64, carryIn: bool): tuple[res: uint64, carryOut: bool] =
  when nimvm:
    pure.carryingAdd(a, b, carryIn)
  else:
-   when cpuX86 and compilerMsvc and canUseIntrinsics:
+   when cpu64Bit and compilerGccCompatible and canUseInlineC:
+     inlinec.carryingAdd(a, b, carryIn)
+   elif cpuX86 and compilerMsvc and canUseIntrinsics:
      intrinsics.x86.carryingAdd(a, b, carryIn)
    elif compilerGccCompatible and canUseIntrinsics:
      intrinsics.gcc.carryingAdd(a, b, carryIn)
-   elif cpu64Bit and compilerGccCompatible and canUseInlineC:
-     inlinec.carryingAdd(a, b, carryIn)
    else:
      pure.carryingAdd(a, b, carryIn)

Adding New Operations

Adding an operation means doing two things:

  1. Adding a pure Nim implementation for the new operation. Pure Nim implementations are the universal fallback for all operations because they are guaranteed to compile everywhere Nim code compiles, regardless of the environment. Pure Nim implementations are defined in intops/impl/pure.nim.
  2. Adding a dispatcher that exposes this implementation. Find the corresponding module in intops/ops (or create a new one) and add the dispatcher there.

For example, let’s define a new addition flavor called magic addition which adds two uint64 integers and adds the number 42 to the sum (this is our magic component).

  1. In intops/impl/pure.nim:
func magicAdd*(a, b: uint64): uint64 =
  a + b + 42
  2. In intops/ops/add.nim:
template magicAdd*(a, b: uint64): uint64 =
  ## Docstring is mandatory for dispatchers.

  pure.magicAdd(a, b)

Adding New Operation Families

If you’re not just adding a new operation to an existing module but adding a new module to intops/ops, you must also expose it in intops.nim so that it becomes part of the public API.

For example, if you’ve added a new module intops/ops/magicadd.nim, do this in intops.nim:

- import intops/ops/[add, sub, mul, muladd, division, composite]
+ import intops/ops/[add, sub, mul, muladd, division, composite, magicadd]

- export add, sub, mul, muladd, division, composite
+ export add, sub, mul, muladd, division, composite, magicadd

Implementations

Improving Existing Implementations

In a perfect world, a pure Nim implementation would be enough to cover every operation: the Nim compiler would generate the optimal C code and the C compiler would generate the optimal Assembly code for every environment.

In reality, this isn’t always so. With so many combinations of Nim version, OS, CPU, and C compiler, there are inevitable performance gaps that need to be filled manually.

This is why most operations in intops have multiple implementations, and why dispatchers exist.

To improve an existing implementation, find its module in intops/impl and modify the code there. Some implementation families are represented as a single module (e.g. intops/impl/inlinec), some are split into submodules (e.g. intops/impl/intrinsics/x86.nim and intops/impl/intrinsics/gcc.nim).

Adding New Implementations

If you want to provide a new implementation for an existing operation:

  1. Add a new function to the corresponding intops/impl submodule (or create a new one).
  2. Update the corresponding dispatcher so it can pick the new implementation.

For example, let’s implement magic addition from the previous chapter in C.

  1. In intops/impl/inlinec.nim:
# This is a guard that prevents compilation in unsupported environments.
# In this example, we explicitly say that this implementation works only with 64-bit CPUs.
# Guards are necessary for the case where a user calls this function directly,
# circumventing the logic in `ops/add.nim`.
when cpu64Bit:
  func magicAdd*(a, b: uint64): uint64 {.inline.} =
    var res: uint64

    {. emit: "`res` = `a` + `b` + ((unsigned __int64)42);" .}

    res
  2. In intops/ops/add.nim:
template magicAdd*(a, b: uint64): uint64 =
  ## Docstring is mandatory for dispatchers.

- pure.magicAdd(a, b)

+ when nimvm:
+   pure.magicAdd(a, b)
+ else:
+   when cpu64Bit and canUseInlineC:
+     inlinec.magicAdd(a, b)
+   else:
+     pure.magicAdd(a, b)