Overview

Modern information systems are composed of many applications, services and processes, all running on heterogeneous infrastructure, with different parts of the system under the control of different groups. Managing change in such systems is notoriously tricky and error-prone. The advent of dynamic, software-defined infrastructure puts even more demands on already overstretched inventory and configuration management processes. At the same time, the desire to release software as fast as possible is at an all-time high. Satisfying both agility and reliability goals at scale requires a ground-up rethinking of infrastructure, application architectures and ITSM processes.

Tonomi Platform takes a novel approach to this problem. By acknowledging the distributed nature of modern applications as something that should be embraced rather than abstracted away, we are laying out a new framework of distributed configuration and change management.

Tonomi Platform has proven itself well suited for deploying and managing multi-server, multi-cloud configurations of distributed systems such as Apache Hadoop and Oracle ATG, as well as popular web stacks.

At its core, Tonomi Platform is built on three pillars.

These principles, working together, create the foundation for modern automation platforms that enable DevOps in the enterprise.

Note

Throughout the guide, we will use inlays marked with “Dev” and “Ops” to provide specific examples from the respective domain.

Core tenets

At its heart, Tonomi Platform is a configuration management platform. As such, it is deployed side-by-side with the applications and services under management and not in the execution path. This is a crucial difference between Tonomi Platform and platform-as-a-service offerings, which actually host the applications and mediate network flows.

Tonomi Platform does:

  • Store the application and service configuration, including IP addresses, DNS names, access credentials and any other information that is required for a given application or service to function (a schematic example follows these lists);
  • Store shell scripts, directory structures, configuration file templates, API endpoints;
  • Allow the customer to define which parts of the above information are and are not available to Tonomi Platform;
  • Execute calls to service APIs, change configuration files, perform application deployments, run scripts, spawn and kill processes, modify OS data on managed servers and other infrastructure resources;
  • Integrate configuration of different services and resources through orchestration workflows;
  • Execute changes in response to a user request, or dynamically as a result of a policy defined by the user.

Tonomi Platform does NOT:

  • Host application binaries;
  • Host application sources, test data, or other build artifacts;
  • Interfere with the network setup on the hosts;
  • Dictate a specific network topology, OS flavors or virtualization technology;
  • Provide application-level services such as identity, log aggregation or monitoring.
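To make the first list above concrete, here is a sketch, in Python, of the kind of configuration record a controller might store for a managed service. Every name and value in it is invented for illustration; it does not reflect Tonomi Platform's actual data model.

    # Hypothetical configuration record for a managed service; all field
    # names and values are illustrative, not Tonomi Platform's actual schema.
    service_config = {
        "name": "billing-api",
        "addresses": {"internal_ip": "10.0.12.7", "dns": "billing.internal.example.com"},
        "credentials_ref": "vault://billing-api/prod",  # stored by reference, per customer policy
        "artifacts": {
            "install_script": "scripts/install.sh",
            "config_template": "templates/app.conf.tmpl",
            "health_endpoint": "/healthz",
        },
    }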

Architecture

Tonomi Platform is delivered as Software-as-a-Service, available through a subscription. The software itself consists of two distinct parts:

  • Portal, which is a globally available, web-based application, hosted and managed by Tonomi, Inc.;
  • Control Fabric, which is a hybrid network of software appliances (Fabric Controllers), hosted and managed partly by Tonomi, Inc. and partly by the customer.

Portal and Fabric Controllers are designed to work together and are not licensed for individual use. One can, however, add more controllers to an established Control Fabric.

Physical topology

Portal is hosted by Tonomi, Inc. as a highly available application in multiple datacenters and is accessed over the web.

See Information processing for details on availability and security of the portal.

Fabric Controllers can be deployed behind the firewall, either on-premises or in the cloud. Multiple controllers may be deployed for security, latency or scalability reasons. Each controller is a self-contained software appliance that includes a database and a message bus. Each controller requires an outbound SSL connection to the endpoints provided by the Portal or to an intermediary controller. Controllers can be deployed in minimal or highly available mode.
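As a quick smoke test of the outbound connectivity requirement, the following Python sketch opens an outbound TLS connection from a prospective controller host. The host name below is a placeholder, not a real Tonomi endpoint; substitute the Portal endpoint provided with your subscription.

    # Verify outbound SSL connectivity from a controller host.
    import socket
    import ssl

    PORTAL_HOST = "portal.example.com"  # placeholder, not a real Tonomi endpoint
    PORTAL_PORT = 443

    context = ssl.create_default_context()
    with socket.create_connection((PORTAL_HOST, PORTAL_PORT), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=PORTAL_HOST) as tls:
            print("TLS established:", tls.version())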

See Control Fabric operations guide for details on setting up and operating controllers.

Logical architecture and information flow

End users of Tonomi Platform access the features of the platform through the Portal. The portal provides four core functions to the service users:

  • Service inventory and dashboard: see what applications and services are managed by Tonomi Platform, including their versions, state, resources and monitoring information, if available;
  • Service catalog: publish and discover applications and services throughout the enterprise, define composite applications based on a combination of existing applications and services, and launch a new instance of an application or a service;
  • Command and control: change configuration of an application or a service, roll out updates, schedule retirement of an application or a service;
  • Policy definition and environment management: configure environments and define policies that will affect the applications and services running in these environments.

In addition, the portal provides administrative functions such as user management and role assignment.

Every time there is a configuration change, the portal uses the environment policies to compile a new control plan, which is then sent to the Control Fabric. This plan may require the fabric to provision or deprovision resources, generate new configuration files, redeploy applications, and so on. It may also define new reactions to events happening to the resources and services under the fabric's control.
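The plan format itself is internal to the platform, but a hypothetical plan for the web-farm example used later in this section might carry information along these lines (the structure below is invented for illustration):

    # A hypothetical control plan; the structure is invented for illustration
    # and is not the platform's actual wire format.
    control_plan = {
        "instance": "web-farm",
        "desired_state": {
            "provider": "aws-ec2",
            "region": "us-east",
            "size": "small",
            "node_count": 5,
        },
        "reactions": [
            # re-run reconciliation whenever a node stops responding
            {"on": "node_unhealthy", "do": "reconcile"},
        ],
    }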

To support the core functions of the portal, Control Fabric provides the following functionality:

  • Configuration registry: each controller maintains a representation of the configuration and state of managed resources, as well as the target configuration as defined in the control plan by the portal (see the sketch after this list);
  • Reactive control: each controller polls infrastructure services such as virtual machines, load balancers and cloud APIs, and reacts to the resulting events in accordance with the control plan;
  • Visibility: each controller sends a subset of events and state known to it to the portal to provide the operators with a “single pane of glass” view on system operation.
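At the heart of the configuration registry is a comparison between the target configuration and the observed state. A minimal sketch of that comparison, assuming node count is the only managed parameter, could look like this (all names are illustrative):

    # Derive actions from the gap between the control plan and the registry.
    # Node count is the only parameter considered in this simplified sketch.
    def plan_actions(desired_count, observed_nodes):
        actions = []
        shortfall = desired_count - len(observed_nodes)
        if shortfall > 0:
            actions += [("provision", "new-node")] * shortfall
        elif shortfall < 0:
            # drain first so the load balancer stops routing to the node
            for node in observed_nodes[desired_count:]:
                actions += [("drain", node), ("deprovision", node)]
        return actions

    print(plan_actions(3, ["vm-1", "vm-2", "vm-3", "vm-4", "vm-5"]))
    # -> [('drain', 'vm-4'), ('deprovision', 'vm-4'), ('drain', 'vm-5'), ('deprovision', 'vm-5')]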

Let’s use an example to understand the information flow between the Portal, the Control Fabric and underlying resources.

We will assume a single controller connected to the AWS EC2 API.

  1. The user uses the portal to request a web farm which consists of five virtual machines.
  2. The portal modifies the user’s request in accordance with the environment policies and creates the control plan.
  3. The controller receives the plan: “New instance: small virtual machines on AWS in us-east, image id <web>, keep at five”.
  4. The controller takes the plan and compares it to configuration registry.
  5. The controller requests AWS API to provision five more virtual machines according to spec.
  6. The controller connects to each virtual machine and ensures it works as advertised.
  7. The controller modifies the configuration on each virtual machine to assemble them in a five-node farm.
  8. The controller changes the status of the new instance to “Active”.
  9. The controller continues polling virtual machines to ensure they are working as advertised.

At this point in time, if a virtual machine goes down, the controller will notice and, after a short timeout, provision a new virtual machine and join it to the farm.
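Steps 8-9 and this self-healing behavior amount to a polling loop. The sketch below shows its general shape, assuming a hypothetical provider client with list, health-check, provision and join operations; this is not a Tonomi API.

    import time

    # `client` is a hypothetical provider integration (for example, a wrapper
    # around the AWS EC2 API) exposing list_vms, is_healthy, provision and
    # join_farm; all of these names are assumptions made for illustration.
    def keep_farm_at(client, image_id, target, interval=30.0):
        while True:
            healthy = [vm for vm in client.list_vms(image_id) if client.is_healthy(vm)]
            for _ in range(target - len(healthy)):
                vm = client.provision(image_id)  # replace failed or missing nodes
                client.join_farm(vm)             # reassemble the farm configuration
            time.sleep(interval)                 # keep polling (step 9)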

  1. The user uses the portal to scale the number of nodes from five to three.
  2. The portal modifies the user’s request and generates a new plan as above.
  3. The controller gets the plan: “Update instance: keep the farm at three”.
  4. The controller takes the plan and compares it to the configuration registry.
  5. The controller modifies the configuration of the load balancer so it only sends traffic to three machines.
  6. The controller deprovisions two extra machines that are not required by the configuration.
  7. The controller changes the status of the instance to “Active”.
  8. The controller continues polling virtual machines to ensure they are working as advertised.

While this example uses the number of nodes as the key parameter that changes between reconfigurations, any other configuration parameter can be used.
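For instance, the same desired-versus-observed comparison applies to an application version just as well as to a node count. A one-function sketch with illustrative names:

    # Report every managed parameter whose observed value drifted from the plan.
    def drifted(desired, observed):
        return {k: v for k, v in desired.items() if observed.get(k) != v}

    print(drifted({"version": "2.1", "node_count": 3},
                  {"version": "2.0", "node_count": 3}))
    # -> {'version': '2.1'}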

Roles

Throughout the documentation, we are going to describe the system from the standpoint of one or more of the following roles.

User
Launches, destroys and reconfigures instances, either directly or through a third party.
Service developer
Designs and develops the component manifests, automation scripts, widgets and other custom logic.
Support technician
Answers questions and resolves issues with the instances and resources.
Policy administrator
Sets up environments and services, defines valid configurations and compliance rules.
Fabric operator
Monitors and maintains the well-being of the Control Fabric.