Auto-indexing inference configuration reference
This document details how a site administrator can supply a Lua script to customize the way Sourcegraph detects precise code intelligence indexing jobs from repository contents.
By default, Sourcegraph will attempt to infer index jobs for the following languages:
- Go
- Java/Scala/Kotlin
- Python
- Ruby
- Rust
- TypeScript/JavaScript
Inference logic can be disabled or altered when the target repositories do not conform to a pattern that Sourcegraph's default inference logic recognizes. Inference logic is controlled by a Lua override script that can be supplied in the UI under **Admin > Code graph > Inference**.
Example
The Lua override script must ultimately return an auto-indexing config object. A configuration that neither disables nor adds recognizers does not change the default inference behavior.
```lua
return require("sg.autoindex.config").new({
  -- Empty configuration (see below for usage)
})
```
To disable default behaviors, re-assign a recognizer value to `false`. Each of the built-in recognizers is prefixed with `sg.` (and they are the only recognizers allowed to use that prefix).
```lua
return require("sg.autoindex.config").new({
  -- Disable default Python inference
  ["sg.python"] = false
})
```
To add additional behaviors, you can create and register a new recognizer. A recognizer is an interface that requests some set of files from a repository, and returns a set of auto-indexing job configurations that could produce a precise code intelligence index.
A path recognizer is a concrete recognizer that advertises a set of path globs it is interested in, then invokes its `generate` function with the matching paths from a repository. In the following example, all files matching `Snek.module` (`Snek.module`, `proj/Snek.module`, `proj/sub/Snek.module`, etc.) are passed to a call to `generate` (if the set of matches is non-empty). The `generate` function then returns a list of indexing job descriptions. The guide for auto-indexing jobs configuration gives detailed descriptions of the fields of this object.
The ordering of paths and the applicable limits are described in the Ordering guarantees and limits section below.
```lua
local path = require("path")
local pattern = require("sg.autoindex.patterns")
local recognizer = require("sg.autoindex.recognizer")

local snek_recognizer = recognizer.new_path_recognizer {
  patterns = {
    -- Look for Snek.module files
    -- (would match Snek.module, proj/Snek.module, proj/sub/Snek.module, etc)
    pattern.new_path_basename("Snek.module"),

    -- Ignore any files in test or vendor directories
    pattern.new_path_exclude(
      pattern.new_path_segment("test"),
      pattern.new_path_segment("vendor")
    ),
  },

  -- Called with list of matching Snek.module files
  generate = function(_, paths)
    local jobs = {}
    for i = 1, #paths do
      -- Create indexing job description for each matching file
      table.insert(jobs, {
        indexer = "acme/snek:latest",      -- Run this indexer...
        root = path.dirname(paths[i]),     -- ...in this directory
        local_steps = { "snekpm install" }, -- Install dependencies
        indexer_args = { "snek", "index", ".", "--output", "index.scip" },
        outfile = "index.scip",
      })
    end

    return jobs
  end
}

return require("sg.autoindex.config").new({
  -- Register new recognizer
  ["acme.snek"] = snek_recognizer,
})
```
Available libraries
A number of auto-indexing-specific and general-purpose Lua libraries are made accessible via the built-in `require`.
The type signatures for the functions below use the following syntax:
- `(A1, ..., An) -> R`: Function type with arguments of type `A1, ..., An` and return type `R`.
- `array[A]`: Table with indexes 1 to N of elements of type `A`.
- `table[K, V]`: Table with keys of type `K` and values of type `V`.
- `A | B`: Union type (includes values of type `A` and type `B`).
- `A...`: Variadic (0 or more values of `A`, without being wrapped in a table).
- `"mystring"`: Literal string type with only `"mystring"` as the allowed value.
- `{K1: V1, K2: V2, ...}`: Heterogeneous table (object) with a key of type `K1` mapping to a value of type `V1`, etc.
- `void`: No value returned from the function.
sg.autoindex.recognizer
This auto-indexing-specific library defines the following two functions.
- `new_path_recognizer` creates a `recognizer` from a config object containing `patterns` and `generate` fields. See the example above for basic usage.
  - Type:

    ```
    ({
        -- List of patterns to match against paths in the repository
        "patterns": array[pattern],

        -- List of patterns to match against paths in the repository
        -- for getting contents (see contents_by_path below)
        "patterns_for_content": array[pattern],

        -- Callback function invoked with paths requested by the patterns above
        -- for creating index jobs
        "generate": (
            registration_api,

            -- List of paths obtained from 'patterns' and
            -- 'patterns_for_content' combined.
            paths: array[string],

            -- Table mapping paths to contents for paths matched by
            -- 'patterns_for_content'
            contents_by_path: table[string, string]
        ) -> array[index_job],
    }) -> recognizer
    ```

  where `index_job` is an object with the following shape:

    ```
    index_job = {
        -- Docker image for the indexer
        "indexer": string,

        -- Working directory for invoking the indexer
        "root": string,

        -- Preparatory steps to run before invoking the indexer,
        -- such as installing dependencies
        "steps": array[{
            -- Working directory for this step
            "root": string,

            -- Docker image to use for this step
            "image": string,

            -- List of commands to run inside the Docker image
            "commands": array[string]
        }],

        -- List of commands to run inside the indexer image at "root"
        -- before invoking the indexer, such as installing dependencies.
        "local_steps": array[string],

        -- Command-line invocation for the indexer
        "indexer_args": array[string],

        -- Path to the index generated by the indexer
        "outfile": string,

        -- Names of necessary environment variables. These are
        -- made accessible to steps, local_steps, and the
        -- indexer_args command.
        --
        -- These are generally used for passing secrets.
        "requested_envvars": array[string],
    }
    ```

  For installing dependencies, if the indexer image contains the relevant package manager(s), then it is simpler to install dependencies using `local_steps`. Otherwise, the `steps` field allows more customizability.
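  As an illustration of `patterns_for_content`, the following sketch requests the contents of a hypothetical `snekdeps.json` manifest and emits a job only when the manifest mentions a `"snek"` dependency. The manifest name, the `acme/snek` indexer image, and the omission of `patterns` are all assumptions for illustration.

  ```lua
  local path = require("path")
  local pattern = require("sg.autoindex.patterns")
  local recognizer = require("sg.autoindex.recognizer")

  local manifest_recognizer = recognizer.new_path_recognizer {
    -- Request the contents of every snekdeps.json file (hypothetical manifest)
    patterns_for_content = { pattern.new_path_basename("snekdeps.json") },

    -- contents_by_path maps each matched path to its file contents
    generate = function(_, paths, contents_by_path)
      local jobs = {}
      for i = 1, #paths do
        local contents = contents_by_path[paths[i]]
        -- Only index projects whose manifest declares a "snek" dependency
        if contents ~= nil and string.find(contents, '"snek"', 1, true) then
          table.insert(jobs, {
            indexer = "acme/snek:latest",
            root = path.dirname(paths[i]),
            indexer_args = { "snek", "index", ".", "--output", "index.scip" },
            outfile = "index.scip",
          })
        end
      end
      return jobs
    end
  }

  return require("sg.autoindex.config").new({
    ["acme.snek-manifest"] = manifest_recognizer,
  })
  ```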
- `new_fallback_recognizer` creates a `recognizer` from an ordered list of `recognizers`. Each `recognizer` is called sequentially until one of them emits non-empty results.
  - Type: `(array[recognizer]) -> recognizer`
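A self-contained sketch of a fallback chain, continuing the hypothetical Snek example (the `snek.lock` and `Snek.module` file names and the `acme/snek` image are illustrative): the module-based recognizer only runs when the lockfile-based one emits no jobs.

```lua
local path = require("path")
local pattern = require("sg.autoindex.patterns")
local recognizer = require("sg.autoindex.recognizer")

-- Build a Snek indexing job rooted at the directory of a matched file.
local function snek_job(manifest_path)
  return {
    indexer = "acme/snek:latest",
    root = path.dirname(manifest_path),
    indexer_args = { "snek", "index", ".", "--output", "index.scip" },
    outfile = "index.scip",
  }
end

-- Turn every matched path into a job.
local function jobs_for_paths(_, paths)
  local jobs = {}
  for i = 1, #paths do
    table.insert(jobs, snek_job(paths[i]))
  end
  return jobs
end

-- Preferred: infer jobs from lockfiles.
local lockfile_recognizer = recognizer.new_path_recognizer {
  patterns = { pattern.new_path_basename("snek.lock") },
  generate = jobs_for_paths,
}

-- Fallback: infer jobs from module files.
local module_recognizer = recognizer.new_path_recognizer {
  patterns = { pattern.new_path_basename("Snek.module") },
  generate = jobs_for_paths,
}

return require("sg.autoindex.config").new({
  -- Recognizers are tried in order; the second only runs if the first
  -- emits no jobs.
  ["acme.snek"] = recognizer.new_fallback_recognizer {
    lockfile_recognizer,
    module_recognizer,
  },
})
```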
The `registration_api` object has the following API:

- `register`, which queues a `recognizer` to be run at a later stage. This makes it possible to add more recognizers dynamically, for example based on whether specific configuration files were found (see the sketch below).
  - Type: `(recognizer) -> void`
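A sketch of dynamic registration, continuing the hypothetical Snek example. The first argument of `generate` is the `registration_api` (discarded as `_` in earlier examples); the colon-call `api:register(...)` and the `snek-workspace.toml` manifest name are assumptions for illustration.

```lua
local path = require("path")
local pattern = require("sg.autoindex.patterns")
local recognizer = require("sg.autoindex.recognizer")

-- Recognizer that indexes individual Snek.module files, as in the earlier example.
local snek_module_recognizer = recognizer.new_path_recognizer {
  patterns = { pattern.new_path_basename("Snek.module") },
  generate = function(_, paths)
    local jobs = {}
    for i = 1, #paths do
      table.insert(jobs, {
        indexer = "acme/snek:latest",
        root = path.dirname(paths[i]),
        indexer_args = { "snek", "index", ".", "--output", "index.scip" },
        outfile = "index.scip",
      })
    end
    return jobs
  end
}

-- Recognizer that only queues the one above when a workspace manifest is present.
local workspace_detector = recognizer.new_path_recognizer {
  patterns = { pattern.new_path_basename("snek-workspace.toml") },
  generate = function(api, paths)
    if #paths > 0 then
      -- Assumed colon-call syntax; queues the recognizer for a later stage.
      api:register(snek_module_recognizer)
    end
    return {}
  end
}

return require("sg.autoindex.config").new({
  ["acme.snek-workspaces"] = workspace_detector,
})
```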
sg.autoindex.patterns
This auto-indexing-specific library defines the following four path pattern constructors.
- `new_path_literal(fullpath)` creates a `pattern` that matches an exact filepath.
  - Type: `(string) -> pattern`

- `new_path_segment(segment)` creates a `pattern` that matches a directory name.
  - Type: `(string) -> pattern`

- `new_path_basename(basename)` creates a `pattern` that matches a basename exactly.
  - Type: `(string) -> pattern`

- `new_path_extension(ext_no_leading_dot)` creates a `pattern` that matches files with a given extension.
  - Type: `(string) -> pattern`
This library also defines the following two pattern collection constructors.
- `new_path_combine(patterns)` creates a pattern collection object (to be used with recognizers) from the given set of path `patterns`.
  - Type: `((pattern | array[pattern])...) -> pattern`

- `new_path_exclude(patterns)` creates a new inverted pattern collection object. Paths matching these `patterns` are filtered out from the set of matching filepaths given to a recognizer's `generate` function.
  - Type: `((pattern | array[pattern])...) -> pattern`
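For illustration, here is a sketch of a pattern set that a hypothetical recognizer might use: a combined collection matching a root-level lockfile and any `.snek` source file, with vendored and test code excluded. Whether to group patterns with `new_path_combine` or list them individually is a stylistic choice here.

```lua
local pattern = require("sg.autoindex.patterns")

-- Match snek.lock at the repository root or any *.snek source file...
local source_patterns = pattern.new_path_combine(
  pattern.new_path_literal("snek.lock"),
  pattern.new_path_extension("snek")
)

-- ...but ignore anything under a vendor/ or test/ directory.
local patterns = {
  source_patterns,
  pattern.new_path_exclude(
    pattern.new_path_segment("vendor"),
    pattern.new_path_segment("test")
  ),
}

-- `patterns` can then be passed to recognizer.new_path_recognizer
-- exactly as in the earlier Snek.module example.
```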
path
This library defines the following utility functions:
- `ancestors(path)` returns a list `{dirname(path), dirname(dirname(path)), ...}`. The last element in the list will be an empty string.
  - Type: `(string) -> array[string]`

- `basename(path)` returns the basename of the given path as defined by Go's `filepath.Base`.
  - Type: `(string) -> string`

- `dirname(path)` returns the dirname of the given path as defined by Go's `filepath.Dir`, except that it (1) returns an empty path instead of `"."` if the path is empty and (2) removes a leading `/` if present.
  - Type: `(string) -> string`

- `join(path1, path2)` returns a filepath created by joining the given path segments with the filepath separator.
  - Type: `(string, string) -> string`

- `split(path)` is a convenience function that returns `dirname(path), basename(path)`.
  - Type: `(string) -> string, string`
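A short sketch of these helpers; the values in the comments are what the descriptions above imply for a path like `proj/sub/Snek.module`.

```lua
local path = require("path")

local p = "proj/sub/Snek.module"

local dir = path.dirname(p)               -- "proj/sub"
local file = path.basename(p)             -- "Snek.module"
local d, b = path.split(p)                -- "proj/sub", "Snek.module"
local parents = path.ancestors(p)         -- { "proj/sub", "proj", "" }
local out = path.join(dir, "index.scip")  -- "proj/sub/index.scip"
```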
json
This library defines the following two JSON utility functions:
- `encode(val)` returns a JSON-ified version of the given Lua object.
- `decode(json)` returns a Lua table representation of the given JSON text.
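A small sketch of both functions; the manifest fields are illustrative. Decoded JSON objects become ordinary Lua tables, so their fields can be read with normal indexing.

```lua
local json = require("json")

-- Parse JSON text (for example, a value from contents_by_path) into a Lua table
local manifest = json.decode('{"name": "snek-app", "engine": "snek@2"}')
local name = manifest["name"]  -- "snek-app"

-- Serialize a Lua table back into JSON text
local text = json.encode({ indexer = "acme/snek:latest" })
```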
fun
Lua Functional is a high-performance functional programming library accessible via `local fun = require("fun")`. This library provides a number of functional utilities that can make recognizer code a bit more expressive.
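For example, the job-building loop from the earlier `generate` function could be expressed with `fun.map` and `fun.totable` (a sketch assuming the standard Lua Functional API; job fields mirror the hypothetical Snek example):

```lua
local fun = require("fun")
local path = require("path")

-- Build one index job per matched path, functional style.
local generate = function(_, paths)
  return fun.totable(fun.map(function(p)
    return {
      indexer = "acme/snek:latest",
      root = path.dirname(p),
      indexer_args = { "snek", "index", ".", "--output", "index.scip" },
      outfile = "index.scip",
    }
  end, paths))
end
```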
Ordering guarantees and limits
Sourcegraph enforces several limits to avoid inference timeouts and ever-growing auto-indexing queues. These limits apply to a single round of inference for a single repository, combined across all recognizers, including any implicitly included Sourcegraph recognizers.
| Limit | Default value |
|---|---|
| The number of auto-indexing jobs inferred | 100 |
| The number of total paths passed to the inference script's `generate` functions as the second argument (`paths`) | 500 |
| The number of total paths with contents passed to the inference script's `generate` functions as the third argument (`contents_by_path`) | 100 |
| Maximum size of file contents | 1 MiB |
Auto-indexing jobs and paths are first ranked based on the criteria described below. If the number of jobs and/or paths exceeds the limits above, lower ranked items are discarded.
- For auto-indexing jobs, ranking is done based on the following criteria, applied in order:
  - Descending order of indexer frequency (total number of inferred jobs with the same `indexer` field).
  - Ascending lexicographic ordering of `indexer`.
  - Number of path components in `root`: shallower roots are preferred over deeper ones, as they are more likely to cover more code.
  - Ascending lexicographic ordering of `root` paths.
- For paths, ranking happens in the following order:
  - Paths for which the contents are requested are ranked higher.
  - Paths with fewer components are ranked higher.
  - Otherwise, lexicographic ordering of paths is used.