🍀 Recommended Best Practices
Canonical Log Lines
Users should strive to implement canonical log lines (when it makes sense).
Note: the below is a quick primer on the topic; see the original Stripe blog post for more details.
Canonical log line: the authoritative log line for a particular request, in the same vein that the IETF’s canonical link relation specifies an authoritative URL.
In addition to their normal log traces, requests (or some other unit of work being executed) also emit ONE LONG LOG LINE at the end that pulls all their key telemetry into one place, e.g.:
[2019-03-18 22:48:32.999] "canonical-log-line" alloc_count=9123 auth_type=api_key database_queries=34 duration=0.009 http_method=POST http_path=/v1/charges http_status=200 key_id=mk_123 permissions_used=account_write rate_allowed=true rate_quota=100 rate_remaining=99 request_id=req_123 team=acquiring user_id=usr_123†
- † A UUID (aka GUID) may be useful here, e.g.:

```python
import uuid

user_id = uuid.uuid4()
```
See also: When are you truly forced to use UUID as part of the design?
The above sample shows the kind of information that a canonical line might contain:
- HTTP request:
  - verb
  - path
  - response status
- Authenticated user and related information, e.g.,
  - authentication method (API key, password)
  - ID of the API key they used
- Rate limiters which allowed the request and related statistics, e.g.,
  - allotted quota
  - what portion remains
- Timing information, e.g.,
  - total request duration
  - time spent in database queries
- Number of:
  - database queries issued
  - objects allocated by the VM
  - etc.
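Pulling these fields together, a canonical-line emitter can be sketched in a few lines of Python. This is a minimal illustration using the stdlib `logging` module; the field names and values below are hypothetical, and a real implementation would accumulate the fields over the lifetime of the request:

```python
import logging

logging.basicConfig(format="[%(asctime)s] %(message)s", level=logging.INFO)
logger = logging.getLogger("canonical")

def render_canonical_line(fields: dict) -> str:
    # Flatten all key telemetry into one space-separated key=value line
    return " ".join(f"{k}={v}" for k, v in sorted(fields.items()))

# Emit the one long log line at the end of the request
logger.info('"canonical-log-line" %s', render_canonical_line({
    "http_method": "POST",
    "http_path": "/v1/charges",
    "http_status": 200,
    "duration": 0.009,
    "database_queries": 34,
    "rate_allowed": True,
}))
```

Sorting the keys keeps the line stable across requests, which makes eyeballing diffs between two canonical lines easier.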
Canonical lines are an ergonomic feature.
By colocating everything that’s important to us, we make it accessible through queries that are easy for people to write, even under the duress of a production incident.
Because the underlying logging system doesn’t need to piece together multiple log lines at query time, they’re also cheap for computers to retrieve and aggregate, which makes them fast to use.
The wide variety of information being logged provides almost limitless flexibility in what can be queried. This is especially valuable during the discovery phase of an incident where it’s understood that something’s wrong, but it’s still a mystery as to what.
Getting insight into, say, a rate-limiting problem becomes as simple as (e.g., using Splunk and its built-in query language):
```
canonical-log-line rate_allowed=false | stats count by user_id
```
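The same count-by-user aggregation can be sketched in plain Python, which shows why canonical lines are cheap to work with: each line parses into a flat dict of key=value pairs. The log lines below are hypothetical; a real deployment would run this query in the logging backend instead:

```python
from collections import Counter

# Hypothetical canonical log lines, one per request
lines = [
    "canonical-log-line rate_allowed=false user_id=usr_123",
    "canonical-log-line rate_allowed=true user_id=usr_456",
    "canonical-log-line rate_allowed=false user_id=usr_123",
]

def parse(line: str) -> dict:
    # Skip the "canonical-log-line" marker, then split key=value pairs
    return dict(kv.split("=", 1) for kv in line.split()[1:])

counts = Counter(
    p["user_id"] for p in map(parse, lines) if p["rate_allowed"] == "false"
)
print(counts)  # Counter({'usr_123': 2})
```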
A Quick Primer on Log Levels
Excerpt from 1:
You must use the appropriate criticality for each log message for log levels to be useful. While log levels are not completely standard, the following levels are common:
TRACE This is an extremely fine level of detail that only gets turned on for specific packages or classes. This is rarely used outside of development. If you need line-by-line logs or data structure dumps, this level is for you. If you find yourself using TRACE frequently, you should consider using a debugger to step through code instead.
DEBUG This is used when the message will be useful during a production issue but not during normal operations. Don’t use debug-level logging so much that the output is unusable when debugging; save that for TRACE.
INFO This is nice-to-have information about the state of the application but not indicative of any problems. Application state messages like “Service started” and “Listening on port 5050” go here. INFO is the default log level. Don’t emit frivolous logs with INFO—“just in case” logging goes into TRACE or DEBUG. INFO logging should tell us something useful during normal operations.
WARN These are messages about potentially problematic situations. A resource nearing its capacity merits a warning. Whenever you log a warning, there should be a concrete action you want the person seeing the message to take. If the warning is not actionable, log it to INFO.
ERROR These messages indicate that an error that needs attention is occurring. An unwritable database usually merits an ERROR log. ERROR logs should be detailed enough to diagnose problems. Log explicit details, including relevant stack traces and the resulting actions the software is performing.
FATAL These are the “last gasp” log messages. If the program encounters a condition so severe that it must exit immediately, a message about the cause of the problem can be logged at the FATAL level. Include relevant context about the program’s state; locations of recovery or diagnostic-related data should be logged.
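As the excerpt notes, log levels are not completely standard. For reference, Python’s stdlib `logging` module covers most of the levels above natively; `TRACE` has no built-in equivalent (libraries sometimes register a custom level below `DEBUG`), and `FATAL` is simply an alias for `CRITICAL`:

```python
import logging

# Numeric severities: DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50;
# higher numbers mean more severe
for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"):
    print(name, getattr(logging, name))

# FATAL is an alias for CRITICAL, not a distinct level
print(logging.FATAL == logging.CRITICAL)  # True

# With the threshold set to INFO, DEBUG messages are suppressed
logging.basicConfig(level=logging.INFO)
logging.getLogger(__name__).debug("not shown")
```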
Here’s an INFO-level log emitted in Rust *:
```rust
info!("Failed request: {}, retrying", e);
```
The log line includes the error that caused the request to fail. The INFO level is used because the application is automatically retrying; no operator action is needed.
The same log in Python:
```python
import structlog_sentry_logger

logger = structlog_sentry_logger.getLogger()
...
logger.info(f"Failed request: {e}, retrying")
```
With structured logging:
```python
logger.info("Failed request", error=e, retrying=True)
```
And for even more granularity (e.g., if using the HTTP client library httpx):
```python
import httpx

transport = httpx.HTTPTransport(retries=2)
client = httpx.Client(transport=transport)

# Get response
try:
    response = client.get(url)
except httpx.RequestError as exc:
    logger.fatal(
        "Failed request",
        error=exc,
        request=vars(exc.request),
        exc_info=True,
    )
    raise

# Log response details
try:
    response.raise_for_status()  # raise HTTPStatusError on 4xx and 5xx responses
except httpx.HTTPStatusError as exc:
    # Log more details for responses with error status codes
    logger.error(
        "Error response",
        error=exc,
        response=vars(exc.response),
        request=vars(exc.request),
        exc_info=True,
    )
else:
    logger.debug("Success", response=vars(response))
```