Network Observability with New Relic

Application teams have had rich observability for years — traces, metrics, logs, dashboards. Network teams, meanwhile, are often still staring at per-device web UIs and CLI output. New Relic Network Performance Monitoring (NPM) closes that gap: it pulls SNMP metrics, flow records, and syslog from your network gear into the same platform where your application telemetry already lives, so you can answer "is it the network or the app?" with data instead of finger-pointing.

This guide covers how NPM works, how to deploy the ktranslate collector, and the queries and alerts I actually use day to day monitoring campus switches and SD-WAN appliances.

Architecture: How ktranslate Works

New Relic's network monitoring is built on ktranslate, an open-source collector originally developed by Kentik. It runs as a Docker container (or Podman, or a Linux service) inside your network and does three jobs:

SNMP polling — walks your devices on an interval (default 60s for metrics, 30m for metadata like interface names) and ships the results to New Relic as dimensional metrics.
Flow collection — listens for NetFlow v5/v9, IPFIX, sFlow, or jFlow exported by your routers, switches, and firewalls.
Syslog forwarding — receives syslog from devices and forwards it into New Relic Logs.

One container can do all three, but in production you should run separate containers per job — a busy flow exporter can starve the SNMP poller if they share a process.

[ Switches / Routers / Firewalls ]
        |  SNMP (161/udp)      -> ktranslate (snmp)   \
        |  NetFlow (2055/udp)  -> ktranslate (flow)    ->  New Relic (HTTPS out)
        |  Syslog (514/udp)    -> ktranslate (syslog) /

The container only needs outbound HTTPS to New Relic — nothing inbound from the internet — which keeps security review simple.

Deploying the SNMP Collector

Generate a config by running discovery against a subnet:

docker run -ti --name ktranslate-discovery --rm \
  -v "$(pwd)/snmp-base.yaml:/snmp-base.yaml" \
  kentik/ktranslate:v2 \
  -snmp /snmp-base.yaml \
  -log_level info \
  -snmp_discovery=true

A minimal snmp-base.yaml for discovery:

devices: {}
trap:
  listen: 0.0.0.0:1620
discovery:
  cidrs:
    - 10.10.0.0/24
  default_communities:
    - "your-community"
  default_v3: null
  add_devices: true
  add_mibs: true
  threads: 4
  check_all_ips: true
global:
  poll_time_sec: 60
  mib_profile_dir: /etc/ktranslate/profiles

Discovery writes every device it can reach into the devices: section, matching each one against the public kentik/snmp-profiles repository on GitHub (Cisco, Juniper, Arista, Fortinet, Palo Alto, and hundreds more). If a device matches a profile, you get vendor-specific metrics such as PoE draw, CPU per routing engine, and VPN tunnel state, not just generic ifTable counters.
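To get a feel for how large a discovery sweep is before pointing it at a bigger range, it helps to expand the CIDRs yourself. This is only a sizing sketch using Python's standard ipaddress module; the function name discovery_targets is mine, and it does not reproduce how ktranslate actually probes devices.

```python
import ipaddress

def discovery_targets(cidrs):
    """Return every usable host address in the given CIDR blocks, in order."""
    targets = []
    for cidr in cidrs:
        # strict=True rejects typos such as 10.10.0.1/24 (host bits set)
        network = ipaddress.ip_network(cidr, strict=True)
        targets.extend(str(host) for host in network.hosts())
    return targets

# The same range used in the snmp-base.yaml above
targets = discovery_targets(["10.10.0.0/24"])
print(len(targets), targets[0], targets[-1])
```

A /24 is quick to sweep; a /16 expands to tens of thousands of addresses, which is worth knowing before you add it to cidrs.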

Then run the poller for real:

docker run -d --name ktranslate-snmp --restart unless-stopped \
  -v "$(pwd)/snmp-base.yaml:/snmp-base.yaml" \
  -e NEW_RELIC_API_KEY="$NR_LICENSE_KEY" \
  kentik/ktranslate:v2 \
  -snmp /snmp-base.yaml \
  -nr_account_id="$NR_ACCOUNT_ID" \
  -metrics=jchf \
  -tee_logs=true \
  nr1.snmp

Prefer SNMPv3 over v2c anywhere the hardware supports it — v2c community strings cross the wire in cleartext. In the device config:

    snmp_v3:
      user_name: nrmonitor
      authentication_protocol: SHA
      authentication_passphrase: "..."
      privacy_protocol: AES256
      privacy_passphrase: "..."

Flow Data

SNMP tells you an interface is at 90% utilization. Flow tells you why: which hosts, which applications, which conversations. Enable flow export on the device (NetFlow on Cisco/Meraki MX, sFlow or IPFIX on Juniper EX, jFlow on SRX), point it at the ktranslate container, and run the container with -nf.source set to one of netflow5, netflow9, ipfix, or sflow, matching what the device exports.

Flow records land in New Relic as KFlow events, queryable with NRQL:

-- Top talkers through the WAN edge in the last hour
FROM KFlow SELECT sum(in_bytes)/1e9 AS 'GB'
FACET src_addr, dst_addr
WHERE device_name = 'branch-mx85'
SINCE 1 hour ago LIMIT 20

-- What application classes are eating the uplink?
FROM KFlow SELECT rate(sum(in_bytes), 1 second) * 8 AS 'bps'
FACET application
WHERE device_name = 'core-ex4000' TIMESERIES SINCE 3 hours ago
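If the top-talkers query feels opaque, here is roughly what it computes, written as a small Python sketch over hypothetical flow records. The field names mirror the KFlow attributes used in the queries above; the sample addresses and byte counts are made up for illustration.

```python
from collections import defaultdict

def top_talkers(flows, device_name, limit=20):
    """Sum in_bytes per (src_addr, dst_addr) pair for one device, largest first."""
    totals = defaultdict(int)
    for flow in flows:
        if flow["device_name"] != device_name:
            continue
        totals[(flow["src_addr"], flow["dst_addr"])] += flow["in_bytes"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # Report gigabytes, like the sum(in_bytes)/1e9 in the NRQL
    return [(src, dst, total / 1e9) for (src, dst), total in ranked[:limit]]

# Hypothetical sample records, not real traffic
flows = [
    {"device_name": "branch-mx85", "src_addr": "10.10.0.5", "dst_addr": "10.20.0.9", "in_bytes": 3_000_000_000},
    {"device_name": "branch-mx85", "src_addr": "10.10.0.5", "dst_addr": "10.20.0.9", "in_bytes": 1_000_000_000},
    {"device_name": "branch-mx85", "src_addr": "10.10.0.7", "dst_addr": "10.20.0.9", "in_bytes": 500_000_000},
    {"device_name": "core-ex4000", "src_addr": "10.10.0.8", "dst_addr": "10.20.0.9", "in_bytes": 9_000_000_000},
]
result = top_talkers(flows, "branch-mx85")
for src, dst, gigabytes in result:
    print(f"{src} -> {dst}: {gigabytes:.1f} GB")
```

The FACET clause plays the role of the dictionary key, and WHERE is the device filter at the top of the loop.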

The NRQL You Actually Need

SNMP metrics arrive as dimensional metrics with the kentik.snmp prefix. The workhorses:

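-- Interface throughput, in and out, per interface
-- Assumes the octet counters arrive as per-poll deltas, so rate() is what turns
-- them into a per-second figure; averaging the raw deltas would give megabits per poll
FROM Metric SELECT
  rate(sum(kentik.snmp.ifHCInOctets), 1 second) * 8 / 1e6 AS 'In Mbps',
  rate(sum(kentik.snmp.ifHCOutOctets), 1 second) * 8 / 1e6 AS 'Out Mbps'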
FACET device_name, if_interface_name
TIMESERIES SINCE 6 hours ago

-- Interface errors and discards — the early-warning signal
FROM Metric SELECT
  sum(kentik.snmp.ifInErrors), sum(kentik.snmp.ifOutErrors),
  sum(kentik.snmp.ifInDiscards), sum(kentik.snmp.ifOutDiscards)
FACET device_name, if_interface_name
WHERE if_OperStatus = 'up' SINCE 1 day ago

-- Device health rollup
FROM Metric SELECT latest(kentik.snmp.CPU), latest(kentik.snmp.MemoryUtilization)
FACET device_name SINCE 30 minutes ago
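The unit conversions in the interface query are easy to get wrong, so here is the arithmetic worked through in Python. It assumes the octet metric is a delta over one polling interval (60 seconds, per poll_time_sec above) and that interface speed is expressed in Mbps; both are assumptions you should verify against your own data, and the function names are mine.

```python
def octets_to_mbps(delta_octets, interval_sec):
    """Convert an octet delta over one polling interval to megabits per second."""
    return delta_octets * 8 / interval_sec / 1e6

def utilization_pct(mbps, if_speed_mbps):
    """Express throughput as a percentage of the interface speed."""
    return 100.0 * mbps / if_speed_mbps

# 6,000,000,000 octets moved during one 60-second poll on a 1 Gbps port
mbps = octets_to_mbps(6_000_000_000, 60)
pct = utilization_pct(mbps, 1000)
print(mbps, pct)
```

Forgetting the division by the interval is the usual mistake: the number still looks plausible, it is just 60 times too large.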

Alerts Worth Having

Skip the "alert on everything" phase and start with these five:

Alert	Signal	Why
Device unreachable	SNMP polling stops reporting	Your basic up/down
Interface down	`if_OperStatus` changes on uplinks/trunks only	Alerting on every access port is noise
Utilization > 80% for 15m	`ifHCInOctets`/`ifHCOutOctets` vs `if_Speed`	Capacity headroom warning
Error rate > 0.1% of packets	`ifInErrors` / `ifHCInUcastPkts`	Catches bad optics and duplex issues before users notice
CPU > 85% for 10m	`kentik.snmp.CPU`	Control-plane stress (BGP churn, broadcast storm)
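The "for 15m" and "0.1% of packets" parts of the table are where alert definitions usually go sideways, so here is a minimal sketch of the logic in Python. It assumes one sample per minute and is only meant to illustrate the evaluation; it is not how New Relic alert conditions are implemented, and the function names are hypothetical.

```python
def sustained_breach(samples, threshold, window):
    """True only if the most recent `window` samples all exceed the threshold."""
    if len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])

def error_rate_pct(errors, packets):
    """Errors as a percentage of packets; an idle interface counts as 0%."""
    if packets == 0:
        return 0.0
    return 100.0 * errors / packets

# Utilization above 80% for 15 consecutive one-minute samples fires
steady = [82.0] * 15
# A single dip inside the window resets the condition
dipped = [82.0] * 10 + [70.0] + [82.0] * 4

print(sustained_breach(steady, 80.0, 15))
print(sustained_breach(dipped, 80.0, 15))
print(error_rate_pct(25, 10_000) > 0.1)
```

Requiring every sample in the window to breach is what keeps a short burst from paging anyone.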

Scope interface alerts with tags — ktranslate lets you attach user_tags per device (site, role, environment), and the SNMP config supports interface match_attributes so you can alert only on ports whose description matches UPLINK|TRUNK|WAN.
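Before committing a pattern like UPLINK|TRUNK|WAN to config, I find it worth testing against real interface descriptions. The sketch below does that with Python's re module; the interface list is made up, and the matching here only approximates what the collector does with its own regex engine.

```python
import re

# Same alternation as in the text; case-sensitive on purpose
UPLINK_PATTERN = re.compile(r"UPLINK|TRUNK|WAN")

def alertable(interfaces):
    """Return names of interfaces whose description matches the pattern."""
    return [
        iface["name"]
        for iface in interfaces
        if UPLINK_PATTERN.search(iface.get("description", ""))
    ]

# Hypothetical interface inventory
interfaces = [
    {"name": "ge-0/0/0", "description": "UPLINK to core-ex4000"},
    {"name": "ge-0/0/1", "description": "Conference room AP"},
    {"name": "ge-0/0/2", "description": "WAN circuit ISP-A"},
    {"name": "ge-0/0/3", "description": "uplink to closet switch"},
    {"name": "ge-0/0/4"},
]
matched = alertable(interfaces)
print(matched)
```

Note that the lowercase "uplink" on ge-0/0/3 does not match. Either standardize descriptions on the devices or make the pattern case-insensitive.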

Tying It Back to Applications

The payoff of doing this in New Relic instead of a standalone NMS: correlation. When checkout latency spikes, one workspace shows you the APM trace, the host metrics, and the uplink on the branch firewall sitting at 98% with a backup job as the top flow talker. Entity relationships link network devices to the hosts and apps that traverse them, and a single alert policy can group the application symptom with the network cause instead of paging two teams separately.

Operational Notes

Container health: monitor ktranslate itself — it ships its own health metrics, and a dead collector looks identical to a dead network if you don't.
Version pinning: kentik/ktranslate:v2 is a moving tag, so a fresh pull can change behavior under you. Pin to a specific release tag or image digest and upgrade deliberately; profiles update frequently.
Scale: one SNMP container comfortably handles a few hundred devices; shard by site or role beyond that.
Meraki and Mist: cloud-managed gear that hides SNMP behind the cloud can still be ingested — ktranslate has a Meraki Dashboard API integration, and Mist exposes a rich REST/webhook API you can wire into New Relic via Flex or the Events API. See my Juniper Mist and Meraki MX85 articles for the vendor-side setup.
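For the sharding advice above, a rough planning sketch in Python: group devices by site and split any site that exceeds a per-collector ceiling. The ceiling of 300 is my stand-in for "a few hundred", not a documented limit, and the device records are hypothetical.

```python
from collections import defaultdict

def shard_by_site(devices, max_per_collector=300):
    """Group devices by site, splitting any site larger than max_per_collector."""
    by_site = defaultdict(list)
    for device in devices:
        by_site[device["site"]].append(device["name"])
    shards = {}
    for site, names in sorted(by_site.items()):
        for start in range(0, len(names), max_per_collector):
            shard_name = f"{site}-{start // max_per_collector + 1}"
            shards[shard_name] = names[start:start + max_per_collector]
    return shards

# Hypothetical inventory; a tiny ceiling makes the split visible
devices = [{"name": f"hq-sw{n}", "site": "hq"} for n in range(1, 6)]
devices.append({"name": "branch-mx85", "site": "branch"})
shards = shard_by_site(devices, max_per_collector=2)
for shard_name, members in shards.items():
    print(shard_name, members)
```

Each resulting shard would map to one SNMP container with its own devices: section.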

Network data next to application data isn't a luxury — it's the difference between "the network is fine, closing ticket" and actually seeing the retransmits.

Related Reading

Cisco Meraki MX85: SD-WAN, AutoVPN, and Branch Observability

AWS Hybrid Connectivity: Direct Connect, Site-to-Site VPN, and TGW Integration

AWS Transit Gateway: Multi-VPC and Multi-Account Networking