3. • Systematic way to bring new data sources into Splunk
• Make sure that new data is instantly usable
& has maximum value for users
• Goes hand-in-hand with the User Onboarding process
(sold separately)
What is the Data Onboarding Process?
8. • Input Processors: Monitor, FIFO, UDP, TCP, Scripted
• No events yet – just a stream of bytes
• Break data stream into 64KB blocks
• Annotate stream with metadata keys (host, source,
sourcetype, index, etc.)
• Can happen on UF, HF or indexer
Inputs – Where it all starts
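The input stage above is where host, source, sourcetype and index get attached. A minimal inputs.conf sketch for a file monitor input (the path, sourcetype and index names here are placeholders, not from the deck):

```ini
# inputs.conf -- illustrative monitor stanza; names are hypothetical
[monitor:///var/log/fubar.log]
sourcetype = fubar:log
index = app_fubar
disabled = false
```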
9. • Check character set
• Break lines
• Process headers
• Can happen on HF or indexer
Parsing Queue
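The parsing-queue steps map to props.conf settings keyed by sourcetype. A hedged sketch for a hypothetical sourcetype:

```ini
# props.conf -- assumed sourcetype; values are illustrative
[fubar:log]
CHARSET = UTF-8
# The first capture group in LINE_BREAKER marks the boundary between raw lines
LINE_BREAKER = ([\r\n]+)
```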
10. • Merge lines for multi-line events
• Identify events (finally!)
• Extract timestamps
• Exclude events based on timestamp (MAX_DAYS_AGO, ..)
• Can happen on HF or indexer
Aggregation/Merging Queue
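These merging and timestamping steps are also driven by props.conf. A sketch assuming events begin with an ISO-style timestamp (the sourcetype and formats are hypothetical):

```ini
# props.conf -- assumed ISO-prefixed timestamp layout
[fubar:log]
SHOULD_LINEMERGE = true
# Start a new event whenever a line opens with a date
BREAK_ONLY_BEFORE = ^\d{4}-\d{2}-\d{2}
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
# Discard events whose timestamp is more than 30 days in the past
MAX_DAYS_AGO = 30
```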
11. • Do regex replacement (field extraction, punctuation
extraction, event routing, host/source/sourcetype
overrides)
• Annotate events with metadata keys
(host, source, sourcetype, ..)
• Can happen on HF or indexer
Typing Queue
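A typical typing-queue regex replacement is a metadata override. A sketch of a host override driven from the event body (stanza and regex are assumptions for illustration):

```ini
# props.conf
[fubar:log]
TRANSFORMS-set_host = fubar_set_host

# transforms.conf -- rewrite the host metadata key from a field in the raw event
[fubar_set_host]
REGEX = hostname=(\S+)
DEST_KEY = MetaData:Host
FORMAT = host::$1
```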
12. • Output processors: TCP, syslog, HTTP
• Index and forward
• Sign blocks
• Calculate license volume and throughput metrics
• Index
• Write to disk
• Can happen on HF or indexer
Indexing Queue
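"Index and forward" on a heavy forwarder is an outputs.conf setting; a sketch with hypothetical indexer names:

```ini
# outputs.conf on a HF -- index locally AND forward to the indexer tier
[tcpout]
defaultGroup = primary_indexers
indexAndForward = true

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
```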
18. • Pre-board
• Build the index-time configs
• Build the search-time configs
• Create data models
• Document
• Test
• Get ready to deploy
• Bring it!
• Test & Validate
Process Overview
19. • Identify the specific sourcetype(s) - onboard each separately
• Check for pre-existing app/TA on splunk.com – don't reinvent the wheel!
• Gather info
• Where does this data originate/reside? How will Splunk collect it?
• Which users/groups will need access to this data? Access controls?
• Determine the indexing volume and data retention requirements
• Will this data need to drive existing dashboards (ES, PCI, etc.)?
• Who is the SME for this data?
• Map it out
• Get a "big enough" sample of the event data
• Identify and map out fields
• Assign sourcetype and TA names according to CIM conventions
Pre-Board
20. • The Common Information Model (CIM) defines
relationships in the underlying data, while leaving the raw
machine data intact
• A naming convention for fields, eventtypes & tags
• More advanced reporting and correlation requires that the
data be normalized, categorized, and parsed
• CIM-compliant data sources can drive CIM-based
dashboards (ES, PCI, others)
Tangent: What is the CIM and why should I care?
21. • Identify necessary configs (inputs, props and transforms)
to properly handle:
• timestamp extraction, timezone, event breaking,
sourcetype/host/source assignments
• Do events contain sensitive data (e.g., PII, PAN)?
Create masking transforms if necessary
• Package all index-time configs into the TA
Build the Index-time configs
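Masking sensitive data at index time is commonly done with a SEDCMD in props.conf. A sketch that masks all but the last four digits of a 16-digit PAN (sourcetype and pattern are assumptions):

```ini
# props.conf -- mask card numbers before they are written to the index
[fubar:log]
SEDCMD-mask_pan = s/\d{12}(\d{4})/XXXXXXXXXXXX\1/g
```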
22. • Assign sourcetype according to event format; events with
similar format should have the same sourcetype
• When do I need a separate index?
• When the data volume will be very large, or when it will
frequently be searched on its own
• When access to the data needs to be controlled
• When the data requires a specific data retention policy
• Resist the temptation to create lots of indexes
Tangent: Best & Worst Practices
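When a separate index is justified, it is defined in indexes.conf. A sketch with an assumed index name and a 90-day retention policy:

```ini
# indexes.conf -- hypothetical index with explicit retention
[app_fubar]
homePath   = $SPLUNK_DB/app_fubar/db
coldPath   = $SPLUNK_DB/app_fubar/colddb
thawedPath = $SPLUNK_DB/app_fubar/thaweddb
# 90 days in seconds; older buckets are frozen (deleted by default)
frozenTimePeriodInSecs = 7776000
```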
23. • Always specify a sourcetype and index
• Be as specific as possible: use /var/log/fubar.log,
not /var/log/
• Arrange your monitored filesystems to minimize
unnecessary monitored logfiles
• Use a scratch index while testing new inputs
Best & Worst Practices – [monitor]
• Look out for inadvertent, runaway monitor clauses
• Don’t monitor thousands of files unnecessarily –
that’s the NSA’s job
• From the CLI: splunk show monitor
• From your browser:
https://your_splunkd:8089/services/admin/inputstatus/
TailingProcessor:FileStatus
Best & Worst Practices – [monitor]
25. • Find & fix index-time problems BEFORE polluting your index
• A try-it-before-you-fry-it interface for figuring out
• Event breaking
• Timestamp recognition
• Timezone assignment
• Provides the necessary props.conf parameter settings
Another Tangent! Your friend, the Data Previewer
27. • Identify "interesting" events which should be tagged with an existing CIM tag
(http://docs.splunk.com/Documentation/CIM/latest/User/Alerts)
• Get a list of all current tags:
| rest splunk_server=local /services/admin/tags
| rename tag_name AS tag, field_name_value AS definition, eai:acl.app AS app
| eval definition_and_app=definition . " (" . app . ")"
| stats values(definition_and_app) AS "definitions (app)" BY tag
| sort +tag
• Get a list of all eventtypes (with associated tags):
| rest splunk_server=local /services/admin/eventtypes
| rename title AS eventtype, search AS definition, eai:acl.app AS app
| table eventtype definition app tags
| sort +eventtype
• Examine the current list of CIM tags: for each "interesting" event, identify which tags should be applied to
each. A particular event may have multiple tags
• Are there new tags which should be created, beyond those in the current CIM tag library? If so, add them
to the CIM library
Build the Search-time Configs:
eventtypes & tags
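Tagging an "interesting" event means pairing an eventtype with CIM tags. A sketch for a hypothetical authentication-failure event (names and search string are illustrative):

```ini
# eventtypes.conf -- assumed eventtype definition
[fubar_auth_failure]
search = sourcetype=fubar:log "login failed"

# tags.conf -- apply CIM Authentication tags to that eventtype
[eventtype=fubar_auth_failure]
authentication = enabled
failure = enabled
```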
28. • Extract "interesting" fields
• If already in your CIM library, name or alias appropriately
• If not already in your CIM library, name according to CIM conventions
• Add lookups for missing/desirable fields
• Lookups may be required to supply CIM-compliant fields/field values (for example,
to convert 'sev=42' to 'severity=medium')
• Make the values more readable for humans
• Put everything into the TA package
Build the Search-time Configs:
extractions & lookups
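Extractions, aliases, and lookups are all search-time props.conf settings. A sketch for an assumed sourcetype, including the sev-to-severity lookup mentioned above (the lookup table and field names are hypothetical):

```ini
# props.conf -- search-time field extraction, alias, and automatic lookup
[fubar:log]
EXTRACT-user = user=(?<user>\S+)
FIELDALIAS-src = src_ip AS src
LOOKUP-severity = fubar_severity sev OUTPUT severity

# transforms.conf -- CSV mapping numeric 'sev' codes to CIM 'severity' values
[fubar_severity]
filename = fubar_severity.csv
```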
29. • Create data models. What will be interesting for end users?
• Document! (Especially the fields, eventtypes & tags)
• Test
• Does this data drive relevant existing dashboards correctly?
• Do the data models work properly / produce correct results?
• Is the TA packaged properly?
• Check with originating user/group; is it OK?
Keep Going
30. • Determine additional Splunk infrastructure required; can
existing infrastructure & license support this?
• Will new forwarders be required? If so, initiate CR process(es)
• Will firewall changes be required? If so, initiate CR process(es)
• Will new Splunk roles be required? Create & map to AD roles
• Will new app contexts be required? Create app(s) as necessary
• Will new users be added? Create the accounts
Get Ready to Deploy
31. • Deploy new search heads & indexers as needed
• Install new forwarders as needed
• Deploy new app & TA to search heads & indexers
• Deploy new TA to relevant forwarders
Bring it!
32. • All sources reporting?
• Event breaking, timestamp, timezone, host, source,
sourcetype?
• Field extractions, aliases, lookups?
• Eventtypes, tags?
• Data model(s)?
• User access?
• Confirm with original requesting user/group: looks OK?
Test & Validate
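The "all sources reporting" and timestamp checks above can be sketched as SPL searches (the index name is an assumption):

```
Are all expected hosts and sourcetypes arriving?
| tstats count where index=app_fubar by host, sourcetype

Sanity-check timestamps/timezones: a large, consistent lag
often indicates a wrong TZ or TIME_FORMAT
index=app_fubar
| eval lag_secs = _indextime - _time
| stats avg(lag_secs) max(lag_secs) by host, sourcetype
```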
34. • Bring new data sources in correctly the first time
• Reduce the amount of “bad” data in your indexes – and the
time spent dealing with it
• Make the new data immediately useful to ALL users– not
just the ones who originally requested it
• Allow the data to drive all sorts of dashboards without
extra modifications
Gee, This Seems Like a Lot of Work…