Network Observability with New Relic
Application teams have had rich observability for years — traces, metrics, logs, dashboards. Network teams, meanwhile, are often still staring at per-device web UIs and CLI output. New Relic Network Performance Monitoring (NPM) closes that gap: it pulls SNMP metrics, flow records, and syslog from your network gear into the same platform where your application telemetry already lives, so you can answer "is it the network or the app?" with data instead of finger-pointing.
This guide covers how NPM works, how to deploy the ktranslate collector, and the queries and alerts I actually use day to day monitoring campus switches and SD-WAN appliances.
Architecture: How ktranslate Works
New Relic's network monitoring is built on ktranslate, an open-source collector originally developed by Kentik. It runs as a Docker container (or Podman, or a Linux service) inside your network and does three jobs:
- SNMP polling — walks your devices on an interval (default 60s for metrics, 30m for metadata like interface names) and ships the results to New Relic as dimensional metrics.
- Flow collection — listens for NetFlow v5/v9, IPFIX, sFlow, or jFlow exported by your routers, switches, and firewalls.
- Syslog forwarding — receives syslog from devices and forwards it into New Relic Logs.
One container can do all three, but in production you should run separate containers per job — a busy flow exporter can starve the SNMP poller if they share a process.
[ Switches / Routers / Firewalls ]
   | SNMP (161/udp)     -> ktranslate (snmp)   \
   | NetFlow (2055/udp) -> ktranslate (flow)    -> New Relic (OTLP/HTTPS out)
   | Syslog (514/udp)   -> ktranslate (syslog) /
The container only needs outbound HTTPS to New Relic — nothing inbound from the internet — which keeps security review simple.
Deploying the SNMP Collector
Generate a config by running discovery against a subnet:
docker run -ti --name ktranslate-discovery --rm \
-v "$(pwd)/snmp-base.yaml:/snmp-base.yaml" \
kentik/ktranslate:v2 \
-snmp /snmp-base.yaml \
-log_level info \
-snmp_discovery=true
A minimal snmp-base.yaml for discovery:
devices: {}
trap:
  listen: 0.0.0.0:1620
discovery:
  cidrs:
    - 10.10.0.0/24
  default_communities:
    - "your-community"
  default_v3: null
  add_devices: true
  add_mibs: true
  threads: 4
  check_all_ips: true
global:
  poll_time_sec: 60
  mib_profile_dir: /etc/ktranslate/profiles
Discovery writes every device it can reach into the devices: section, matching each one against New Relic's public snmp-profiles repository (Cisco, Juniper, Arista, Fortinet, Palo Alto, and hundreds more). If a device matches a profile, you get vendor-specific metrics — PoE draw, CPU per routing engine, VPN tunnel state — not just generic ifTable counters.
Then run the poller for real:
docker run -d --name ktranslate-snmp --restart unless-stopped \
-v "$(pwd)/snmp-base.yaml:/snmp-base.yaml" \
-e NEW_RELIC_API_KEY=$NR_LICENSE_KEY \
kentik/ktranslate:v2 \
-snmp /snmp-base.yaml \
-nr_account_id=$NR_ACCOUNT_ID \
-metrics=jchf \
-tee_logs=true \
nr1.snmp
Prefer SNMPv3 over v2c anywhere the hardware supports it — v2c community strings cross the wire in cleartext. In the device config:
snmp_v3:
  user_name: nrmonitor
  authentication_protocol: SHA
  authentication_passphrase: "..."
  privacy_protocol: AES256
  privacy_passphrase: "..."
Flow Data
SNMP tells you an interface is at 90% utilization. Flow tells you why: which hosts, which applications, which conversations. Enable flow export on the device (NetFlow on Cisco/Meraki MX, sFlow or IPFIX on Juniper EX, jFlow on SRX), point it at the ktranslate container, and run the container with -nf.source set to the format the device exports: netflow5, netflow9, ipfix, or sflow.
Flow records land in New Relic as KFlow events, queryable with NRQL:
-- Top talkers through the WAN edge in the last hour
FROM KFlow SELECT sum(in_bytes)/1e9 AS 'GB'
FACET src_addr, dst_addr
WHERE device_name = 'branch-mx85'
SINCE 1 hour ago LIMIT 20
-- What application classes are eating the uplink?
FROM KFlow SELECT rate(sum(in_bytes), 1 second) * 8 AS 'bps'
FACET application
WHERE device_name = 'core-ex4000' TIMESERIES SINCE 3 hours ago
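If you export flow records out of New Relic for offline analysis, the top-talkers query reduces to a group-by and a sort. Here is a minimal Python sketch of that aggregation. The sample records are hypothetical and only borrow the attribute names used in the queries above (src_addr, dst_addr, in_bytes); the function name is mine, not part of any New Relic tooling.

```python
from collections import defaultdict


def top_talkers(flows, limit=20):
    """Sum in_bytes per (src_addr, dst_addr) pair and return the largest pairs in GB."""
    totals = defaultdict(int)
    for flow in flows:
        totals[(flow["src_addr"], flow["dst_addr"])] += flow["in_bytes"]
    # Largest conversations first, mirroring the FACET ... LIMIT behaviour.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(pair, total / 1e9) for pair, total in ranked[:limit]]


# Hypothetical records shaped like the KFlow attributes queried above.
sample_flows = [
    {"src_addr": "10.10.0.15", "dst_addr": "203.0.113.7", "in_bytes": 4_000_000_000},
    {"src_addr": "10.10.0.22", "dst_addr": "198.51.100.9", "in_bytes": 1_500_000_000},
    {"src_addr": "10.10.0.15", "dst_addr": "203.0.113.7", "in_bytes": 2_000_000_000},
]

for (src, dst), gigabytes in top_talkers(sample_flows, limit=2):
    print(f"{src} -> {dst}: {gigabytes:.1f} GB")
```

Note that this sums raw in_bytes; if your exporters sample flows, you would need to scale by the sampling rate before trusting the totals.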
The NRQL You Actually Need
SNMP metrics arrive as dimensional metrics with the kentik.snmp prefix. The workhorses:
-- Interface throughput, in and out, per interface
-- (assumes the octet counters arrive as per-poll deltas, so rate() normalizes them to per-second)
FROM Metric SELECT
rate(sum(kentik.snmp.ifHCInOctets), 1 second) * 8 / 1e6 AS 'In Mbps',
rate(sum(kentik.snmp.ifHCOutOctets), 1 second) * 8 / 1e6 AS 'Out Mbps'
FACET device_name, if_interface_name
TIMESERIES SINCE 6 hours ago
-- Interface errors and discards — the early-warning signal
FROM Metric SELECT
sum(kentik.snmp.ifInErrors), sum(kentik.snmp.ifOutErrors),
sum(kentik.snmp.ifInDiscards), sum(kentik.snmp.ifOutDiscards)
FACET device_name, if_interface_name
WHERE if_OperStatus = 'up' SINCE 1 day ago
-- Device health rollup
FROM Metric SELECT latest(kentik.snmp.CPU), latest(kentik.snmp.MemoryUtilization)
FACET device_name SINCE 30 minutes ago
Alerts Worth Having
Skip the "alert on everything" phase and start with these five:
| Alert | Signal | Why |
|---|---|---|
| Device unreachable | SNMP polling stops reporting | Your basic up/down |
| Interface down | if_OperStatus changes on uplinks/trunks only | Alerting on every access port is noise |
| Utilization > 80% for 15m | ifHCInOctets/ifHCOutOctets vs if_Speed | Capacity headroom warning |
| Error rate > 0.1% of packets | ifInErrors / ifHCInUcastPkts | Catches bad optics and duplex issues before users notice |
| CPU > 85% for 10m | kentik.snmp.CPU | Control-plane stress (BGP churn, broadcast storm) |
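The arithmetic behind the utilization and error-rate rows is easy to get subtly wrong, so here is a small Python sketch of it. It is an illustration, not how New Relic evaluates alert conditions. The function names are hypothetical, and I am assuming octet and packet counts are deltas over one polling interval and that interface speed is expressed in bits per second (convert first if your if_Speed value is in Mbps).

```python
def utilization_pct(delta_octets, interval_sec, if_speed_bps):
    """Percent of link capacity used, from an octet delta over one poll interval."""
    if interval_sec <= 0 or if_speed_bps <= 0:
        raise ValueError("interval_sec and if_speed_bps must be positive")
    bits_per_sec = delta_octets * 8 / interval_sec
    return 100.0 * bits_per_sec / if_speed_bps


def error_rate_pct(delta_errors, delta_packets):
    """Errors as a percentage of packets; an idle interface counts as 0%."""
    if delta_packets == 0:
        return 0.0
    return 100.0 * delta_errors / delta_packets


def sustained_breach(samples, threshold, min_consecutive):
    """True when the most recent min_consecutive samples all exceed threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


# 600,000,000 octets in a 60 s poll on a 100 Mbps link.
print(utilization_pct(600_000_000, 60, 100_000_000))
# 12 errored packets out of 10,000 received.
print(error_rate_pct(12, 10_000))
# One sample per minute: the last three samples are all above 80%.
print(sustained_breach([70, 82, 85, 91], threshold=80, min_consecutive=3))
```

With 60-second polling, "for 15m" corresponds to min_consecutive=15. Requiring every sample in the window to breach is what keeps a single burst from paging anyone.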
Scope interface alerts with tags — ktranslate lets you attach user_tags per device (site, role, environment), and the SNMP config supports interface match_attributes so you can alert only on ports whose description matches UPLINK|TRUNK|WAN.
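Before committing a description pattern to config, it helps to check which ports it would actually select. The Python sketch below does that against a hypothetical interface inventory. The case-insensitive, unanchored match is my assumption for the illustration; confirm how ktranslate itself interprets match_attributes before relying on the same behaviour there.

```python
import re

# Same alternation as in the text; IGNORECASE is an assumption of this sketch.
UPLINK_PATTERN = re.compile(r"UPLINK|TRUNK|WAN", re.IGNORECASE)


def alertable_interfaces(interfaces):
    """Return names of interfaces whose description matches the uplink pattern."""
    return [
        iface["name"]
        for iface in interfaces
        if UPLINK_PATTERN.search(iface.get("description", ""))
    ]


# Hypothetical inventory; names and descriptions are made up for the example.
inventory = [
    {"name": "ge-0/0/1", "description": "UPLINK to core-ex4000"},
    {"name": "ge-0/0/12", "description": "printer floor 2"},
    {"name": "ge-0/0/47", "description": "trunk to idf-2"},
    {"name": "ge-0/0/20"},
]

print(alertable_interfaces(inventory))
```

Because the pattern is unanchored, a description that merely contains "WAN" inside a longer word would match too. Tighten it with word boundaries if your port descriptions are free-form.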
Tying It Back to Applications
The payoff of doing this in New Relic instead of a standalone NMS: correlation. When checkout latency spikes, one workspace shows you the APM trace, the host metrics, and the uplink on the branch firewall sitting at 98% with a backup job as the top flow talker. Entity relationships link network devices to the hosts and apps that traverse them, and a single alert policy can group the application symptom with the network cause instead of paging two teams separately.
Operational Notes
- Container health: monitor ktranslate itself — it ships its own health metrics, and a dead collector looks identical to a dead network if you don't.
- Version pinning: run the kentik/ktranslate:v2 tag and upgrade deliberately; profiles update frequently.
- Scale: one SNMP container comfortably handles a few hundred devices; shard by site or role beyond that.
- Meraki and Mist: cloud-managed gear that hides SNMP behind the cloud can still be ingested — ktranslate has a Meraki Dashboard API integration, and Mist exposes a rich REST/webhook API you can wire into New Relic via Flex or the Events API. See my Juniper Mist and Meraki MX85 articles for the vendor-side setup.
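For the sharding advice above, this is roughly how I would plan the split before writing one config file per container. It is a sketch under stated assumptions: the device records and the site field are hypothetical, and the 300-device ceiling is my stand-in for "a few hundred", not a documented ktranslate limit.

```python
from collections import defaultdict


def shard_devices(devices, max_per_shard=300):
    """Group devices by site, then split any oversized site into numbered shards."""
    by_site = defaultdict(list)
    for device in devices:
        by_site[device["site"]].append(device["name"])

    shards = {}
    for site, names in sorted(by_site.items()):
        names.sort()
        # Slice each site's device list into chunks of at most max_per_shard.
        for start in range(0, len(names), max_per_shard):
            shard_id = f"{site}-{start // max_per_shard + 1}"
            shards[shard_id] = names[start:start + max_per_shard]
    return shards


# Hypothetical inventory: five devices at hq, two at a branch.
devices = [{"name": f"hq-sw{i}", "site": "hq"} for i in range(1, 6)]
devices += [{"name": f"br-sw{i}", "site": "branch"} for i in range(1, 3)]

for shard_id, names in shard_devices(devices, max_per_shard=3).items():
    print(shard_id, names)
```

Each resulting shard would map to its own SNMP container and config file, which also keeps a misbehaving site from affecting polling elsewhere.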
Network data next to application data isn't a luxury — it's the difference between "the network is fine, closing ticket" and actually seeing the retransmits.