Posted by u/redsec 8 years ago
Ask HN: How do you document and keep tabs on your infrastructure as a sysadmin?
I am wondering how experienced sysadmins document and manage their infra.
vinceguidry · 8 years ago
When I was working as a sysadmin, I kept a spreadsheet. I was told later of a repository of information that supposedly did what my spreadsheet did, but it didn't add anything new and was much harder to keep up to date.

I built it up using nmap and then shelling into each individual machine and poking around to see what it did. This was back in the days before everything became virtualized, so each machine on the network was likely physical.

I added information by walking the aisles and copying down the rack location of every machine into another page on the spreadsheet. I eventually hooked up a terminal to them all and matched network addresses to physical machines.

Only took a few weeks and when I was done, I knew things about the network that guys who worked at the business for years didn't know.

There's no substitute for the good old-fashioned way.

I liked that job, it was fun.

majewsky · 8 years ago
If you're by yourself, using spreadsheets and nmap is usually fine. If you're working in a team of 5 or 10 or 50 sysadmins, spreadsheets turn into a huge mess. Either you distribute them via mail etc. after every change, but then you get concurrent edits that need to be merged manually; or you put the spreadsheets on a network share with file locking, but then the file is always locked when you want to edit it, because someone is working on an entirely unrelated part of the infrastructure.

So you have exactly those sorts of problems that RDBMS are designed to solve. Therefore it makes sense to move to a DCIM system using an RDBMS under the hood, that allows for concurrent edits, and also can be accessed by automation (cronjobs, CI, etc.) via some sort of API (or direct DB read access).
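A minimal sketch of the point, using SQLite as a stand-in for whatever RDBMS backs the DCIM (the table name and columns here are invented for illustration; a real DCIM would expose this through a REST API rather than direct SQL):

```python
import sqlite3

# Stand-in for the DCIM's backing database.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hosts (
    name TEXT PRIMARY KEY,
    rack TEXT,
    ip   TEXT,
    role TEXT
)""")
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?, ?)",
    [("web1", "A3", "10.0.0.11", "web"),
     ("db1",  "B1", "10.0.0.21", "database")],
)
conn.commit()

# A cron job or CI step can query the same data concurrently,
# instead of fighting over a locked spreadsheet on a share.
web_hosts = [row[0] for row in conn.execute(
    "SELECT name FROM hosts WHERE role = 'web'")]
print(web_hosts)  # ['web1']
```

The concurrent-edit and automation-access problems both disappear once the database, not a file, is the unit of sharing.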

sedachv · 8 years ago
There is an even better alternative. You can put infrastructure information into the same version control repository where your infrastructure code lives, and you can even keep all the benefits of spreadsheets by using plain text format spreadsheets like Org-mode tables.

This means you do not have two sources of truth to maintain (what is in the RDBMS, and how that relates to what is in the infrastructure code repository), the RDBMS system does not have to reinvent versioning, you can see exactly how your infrastructure evolves, you can do atomic changes to both the infrastructure code and the infrastructure information that the code relies on (obviously you need a modern version control system for this), and the infrastructure code can access the infrastructure information in a much more straightforward (and much easier to test) way.
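As a sketch of how straightforward it is for infrastructure code to read such a plain-text spreadsheet, here is a hypothetical parser for an Org-mode-style table kept in the same git repo (the helper name and the inventory columns are made up for illustration):

```python
# Hypothetical helper: parse an Org-mode table into a list of dicts,
# so infrastructure code can consume the same file humans edit.
def parse_org_table(text):
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        # Skip non-table lines and the |---+---| separator rows.
        if not line.startswith("|") or line.startswith("|-"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(cells)
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

inventory = """
| name | ip        | role |
|------+-----------+------|
| web1 | 10.0.0.11 | web  |
| db1  | 10.0.0.21 | db   |
"""

hosts = parse_org_table(inventory)
print(hosts[0]["ip"])  # 10.0.0.11
```

Because the table lives in version control next to the code that reads it, an inventory change and the code change that depends on it can land in one atomic commit.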

vinceguidry · 8 years ago
Just use Google Docs.
cik · 8 years ago
We use Collins (https://tumblr.github.io/collins/) as a Configuration Management Database, Ansible (https://www.ansible.com/) for automation, Terraform (https://www.terraform.io/) + a bunch of homebrew for orchestration, and Packer (https://www.packer.io/) for multi-cloud (and hypervisor) image creation and maintenance, powered by Ansible. Every single thing is committed to a series of Bitbucket (https://www.bitbucket.org) repositories.

We connect Ansible and Collins through ansible-cmdb (https://github.com/fboender/ansible-cmdb), then tie the entire thing to our ticketing systems ServiceNOW (https://www.servicenow.com/) and Jira Service Desk (https://www.atlassian.com/software/jira/service-desk), and finally, ensure we have history tracking with Slack (https://www.slack.com).

As a given, we yank test the entire world. If it doesn't pass a yank, it straight up doesn't exist.

Whether it's bare-metal, virtualized, para-virtualized, dockerized, mixed-mode, or cloud - we 100% do this all the time. There is not a single change across any environment, that isn't fully tracked, fully reproducible, fully auditable, and fully automated.

woodrowbarlow · 8 years ago
what do you mean by "passing a yank test"? i assume "yank test" refers to unplugging the network cable abruptly from the server under test, but what exactly are you looking for when you do that?
cik · 8 years ago
A yank test on process and infrastructure is more than a 'did it come up'. It's a 'what if we totally nuke the thing' - say, were we to rip the hard drives out of a server, fry it, and recreate it - does it come back up identical(ish)?

That way we know our CMDB is accurate, as are our workflows, credentials, Ansible, Terraform, images, etc. Right down to tickets.
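One way to read "comes back up identical(ish)" is as a diff between what the CMDB says the host should look like and the facts gathered from the rebuilt machine. A toy sketch of that check (the record fields and the `ignore` list are invented for illustration):

```python
def yank_diff(cmdb_record, gathered_facts, ignore=("uptime", "boot_id")):
    """Compare the CMDB's expected state against facts gathered from a
    rebuilt host, ignoring fields that legitimately change on rebuild."""
    diffs = {}
    for key, expected in cmdb_record.items():
        if key in ignore:
            continue
        actual = gathered_facts.get(key)
        if actual != expected:
            diffs[key] = (expected, actual)
    return diffs

expected = {"hostname": "web1", "os": "debian-9", "role": "web", "uptime": 991231}
rebuilt  = {"hostname": "web1", "os": "debian-9", "role": "web", "uptime": 42}

print(yank_diff(expected, rebuilt))  # {} -> the yank test passes
```

An empty diff means the CMDB, the images, and the automation all agreed on what the machine was supposed to be.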

It's how we manage all of our cloud customers.

majewsky · 8 years ago
- keep inventory in a DCIM (we use Netbox)

- configure everything as code (we use Ansible for the infrastructure up to OS level, Kubernetes w/ Helm for applications), and have it read the values from the DCIM so that the DCIM remains the single source of truth (we still need to get better at this part....)

Links: https://github.com/digitalocean/netbox https://www.ansible.com https://www.kubernetes.io
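One common way to make Ansible read straight from the DCIM is a dynamic inventory script. A stripped-down sketch (here the DCIM lookup is faked with a static list instead of a real NetBox API call):

```python
#!/usr/bin/env python3
"""Minimal Ansible dynamic inventory sketch. A real script would fetch
these records from the DCIM's API (e.g. NetBox) instead of hardcoding."""
import json
import sys

def fetch_from_dcim():
    # Stand-in for an HTTP call to the DCIM.
    return [
        {"name": "web1", "ip": "10.0.0.11", "group": "webservers"},
        {"name": "db1",  "ip": "10.0.0.21", "group": "databases"},
    ]

def build_inventory():
    # Ansible's dynamic inventory JSON: group -> hosts, plus _meta.hostvars.
    inv = {"_meta": {"hostvars": {}}}
    for host in fetch_from_dcim():
        inv.setdefault(host["group"], {"hosts": []})["hosts"].append(host["name"])
        inv["_meta"]["hostvars"][host["name"]] = {"ansible_host": host["ip"]}
    return inv

if __name__ == "__main__":
    # Ansible invokes the script with --list and reads JSON from stdout.
    if len(sys.argv) > 1 and sys.argv[1] == "--list":
        print(json.dumps(build_inventory()))
```

With something like this in place, adding a host to the DCIM is all it takes for playbooks to pick it up, which is what keeps the DCIM the single source of truth.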

That's at work. At home, I do much of the same, except that maintaining a DCIM is excessive for 2 VPS and a home network of 3 boxes.

rubenbe · 8 years ago
I cannot comment on the DCIM side, but I agree on the "everything as code" mantra.

For a relatively small setup I chose a combination of Ansible, Kubernetes and Dockerfiles, but probably any combination will do. All these files are stored in a git repo.

Even after months (or years) of neglect, I can easily see what I configured (and why!) and update it where needed with minor effort.

zie · 8 years ago
I'm going to mostly disagree with everyone here, much to my karma's detriment ;P

I agree the end-goal should be infrastructure as code, and everyone here has covered those tools well. You also want monitoring across your infrastructure. Prometheus is the new poster-boy here, but the Nagios family, and many other decent OSS solutions exist as well.

But you still need documentation. Your documentation should exist wherever you spend most of your time. Some examples:

* If you spend most of your time on a Windows desktop, doing Windows admin-type things, then OneNote or some other GUI note-taking/document program makes sense.

* If you spend most of your time in Unix land (Linux, BSD, etc.), then plain text files on some shared disk somewhere, where everyone can get to them, make WAY more sense. Bonus if you put these files in a VCS and treat them like code, and super bonus if your documentation is just part of your infra-as-code repositories.

* If you spend your time in a web browser, then use a Wiki, like MediaWiki, wikiwiki, etc.

In other words, put your documentation tools right alongside your normal workflow, so you have a decent chance of actually using it, keeping it up to date, and having others on your team(s) also use it.

We put our docs in the repos right alongside the code that manages the infrastructure, in plain text. It's versioned. We don't publish it anywhere, it's just in the repo, but then we spend most of our time in editors messing around in that repo anyway.

antoncohen · 8 years ago
I totally agree, but having "infrastructure as code" means less documentation.

Instead of documenting all the commands involved in configuring a machine as service X (ssh, run apt-get, paste this, etc.), I have documentation on how to work with the configuration management system (roles in the roles/ directory, each node gets one role, commit to git, open PR, etc.). That documentation is in .md files in the config management source repo.

Instead of documenting how to rack a server (print and attach label to front and back, plug power into separate PDUs, enter PDU ports into management database, etc.), I document Terraform conventions (use module foo, name it xxx-yyy, tag with zzz, etc.).

It ends up being less documentation, as the "code" serves to document the steps taken, so the documentation can be higher level. Or if it isn't less documentation, it is documentation that needs to be updated less often, so hopefully there will be less drift between docs and what actually exists.

zie · 8 years ago
Yes, I didn't cover what goes into the documentation, as that is mostly site-specific, but I mostly agree with you... mostly. Instead of documenting run apt-get, ssh, etc. to start up service X, now you have to document how your tools are set up - Ansible, Terraform, etc. Plus your code needs documentation about why it's set up the way it is.

You still need the high-level stuff - policies, security guides, etc. None of that has changed.

You also have to document your snowflakes, how you handle the wacky snowflakes, why they exist, etc.

Ideally your documentation should be such that it would pass the hit-by-a-bus test. I.e. if you or your entire team got hit by a bus, someone with a clue could come in, read your documentation and continue.

My docs are not at that stage, but every time I mess about with something I try to read through the docs attached, and verify and add to them, so that hopefully someday we will get there.

toomuchtodo · 8 years ago
Sit down with another sysadmin and have them go through your Terraform repo; if they have to ask more than 3 times why something is done a certain way, your "infrastructure as code" as documentation is insufficient.

Source: 16 years in various ops roles

hobofan · 8 years ago
Ah the good old "self-explanatory code that needs no documentation".
jftuga · 8 years ago
Windows sys admin here. OneNote is fantastic for IT documentation. I like that you can drag and drop a screenshot (no uploading to a wiki), store spreadsheets, word docs, PDFs, etc. and easily search for information via the built-in functionality.

We have used it for years and it has worked great for us.

antoncohen · 8 years ago
It might be helpful if you described your infrastructure. There is a pretty big difference between managing physical Windows servers in a data center and managing Linux servers all in AWS.

If you are all or mostly cloud, Terraform + config management with a CI pipeline takes care of a lot. Then a wiki that covers "Getting Started" and a few how-to articles.

For physical infra you need the setup for DHCP, updating DNS based on DHCP, PXE boot imaging, IPMI access and configuration, switch and router configuration, what servers are connected to which switch ports, PDU management and monitoring, and on and on and on.

You end up with something like NetBox (https://github.com/digitalocean/netbox) or Collins (https://tumblr.github.io/collins/), plus a bunch of other stuff gluing things together.

evangineer · 8 years ago
For future work, I would definitely consider NetBox and Collins as alternative options to GLPI.
beh9540 · 8 years ago
I think it depends a lot on the size of your infrastructure. I've used Excel docs on a shared drive pretty successfully where there's not much to keep up with and changes are few.

In larger infrastructure setups (small service provider) we used a combination of netboot, SNMP for monitoring with Observium and Nagios for alerting. We were also a big VMware environment, so naturally we had a lot of inventory tracking available through vCenter as well. I found a lot of opposition to Configuration Management, given the lack of comfort with programming of some sysadmins (Windows admins), so that's something to keep in mind as well. I think mixed environments also can be challenging w/infrastructure as code, but I'd be interested to see how others get through that.

seorphates · 8 years ago
The past decade has been interesting and I'm still processing it.

My current thoughts are that an appropriate approach is for your systems to document themselves via the applications that they run - inside out.

Though I must abide by it, I cannot fully subscribe to "infrastructure as code" anymore. It has proven to be just another shift, primarily in toolsets and in who (or what) gets say and sway over the capacity, capabilities and efficiencies of the thing you actually care about - the app stack and all of its assembled functionality.

In other words, most approaches are still "outside in" - one defines 'x' for deploy fitments, typically over and over again, and typically with a rigidity that can too easily override and overrule, effectively caging your application in scale and scope. With my current tack, I am trying to let 'y' "self-identify" (via some/any form of config mgmt), from which point you can begin to effectively "deploy to any" by hooking the "application config as code" that, in turn, defines its infrastructure and deploys "outward". The "infrastructure as code" then becomes the servant, with its objects and platform definitions etc., and the "app config as code" becomes the master, defining its own scope and scale.

Infrastructures have a funny way of mutating into inefficient "definitions" of something that once made sense, on the first day, and forevermore complicating progress with capacity, rules and opinions.

But, generically, SNMP is still pretty cool for telling me what I need to know. Strap that into any engine and, boom, ask any question, request any inventory.

So.. I track apps, not systems. Systems are expendable, applications are not.