Keystone Queens PTG Summary
Policy in Code
The keystone team dedicated two sessions to helping projects achieve the policy-in-code goal for Queens. The crash-course included justification for the goal and code examples, which seemed useful for projects just digging into the work. We also parsed repositories and projects that aren't affected by the goal and removed them from the burndown chart. We now have a very accurate representation of which projects are on track to complete the goal and which ones need some help. A couple of developers from the keystone community are going to be available to help other projects achieve the goal for Queens. If your project needs help getting bootstrapped, please swing by #openstack-keystone on Freenode.
As one of the drivers of the recent RBAC work, I felt a little guilty for how much time we spent on this topic throughout the week. We started with a Baremetal/VM session dedicated to improving admin-ness. The discussion focused on the two approaches we have that help address the admin problem we have in OpenStack today. The context and required reading can be found in the previously linked session etherpad. The first main take-away from the cross-project session was that we should really document the impact both approaches have on operators and users. Both are summarized below:
Admin Project Process Log:
- Move all default policies into code
- Identify which users need global access
- Give users who need global access a role assignment on the `admin` project
- Set `admin` project configuration options and redeploy
Global Roles Process Log:
- Move all default policies into code
- Identify which users need global access
- Give users who need global access a role assignment globally
Both processes require end users to learn how to consume elevated scope. The admin project approach requires them to understand that elevated scope comes from some project configured by the operator. The global roles approach replaces that special project with a new scope type of "global".
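To make the difference between the two approaches concrete, here is a minimal sketch of how a service might decide whether a request carries elevated scope under each one. This is not keystone or oslo code: `TokenContext`, both helper functions, and the `"global"` scope label are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenContext:
    """Hypothetical, simplified view of a validated token (assumption)."""
    roles: frozenset
    project_id: Optional[str] = None  # set only for project-scoped tokens
    scope_type: str = "project"       # "project" or "global" (assumed labels)


def elevated_via_admin_project(ctx, admin_project_id):
    """Admin project approach: elevation comes from holding a role on
    whichever project the operator configured as the admin project."""
    return ctx.project_id == admin_project_id and "admin" in ctx.roles


def elevated_via_global_role(ctx):
    """Global roles approach: elevation comes from a distinct scope type,
    so no special project is involved."""
    return ctx.scope_type == "global" and "admin" in ctx.roles


# Same user, two different ways of asking for elevated access.
on_admin_project = TokenContext(roles=frozenset({"admin"}), project_id="admin-proj")
globally_scoped = TokenContext(roles=frozenset({"admin"}), scope_type="global")

print(elevated_via_admin_project(on_admin_project, "admin-proj"))
print(elevated_via_global_role(globally_scoped))
```

In the first case the user has to know which project the operator picked. In the second, the user only has to know to ask for a different kind of scope.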
The keystone team essentially spent the rest of the week working through the side-effects and issues of both approaches. In the end, we found that both require more cross-project plumbing to be done before any sort of elevated scoping concept can be of any use to OpenStack as a platform. The following summarizes this and I'll elaborate on each point in more detail:
- Global is a sub-optimal name for what we really need
- The admin project approach is already in the wild, so we have to provide some sort of migration from it if we come up with something else
- We need at least two more specifications to oslo libraries in order for any sort of elevated context/policy enforcement to make any sense across OpenStack
- We need at least one more community goal to get project policies in a more consumable place
Now for the explanations. I defaulted to using the term "global roles" in the PoC implementation prior to the PTG. My reasoning was that OpenStack has some operations that make sense within a project scope and some that don't. I put the ones that don't in the "global" category. For example, creating an instance in nova or carving out a volume in cinder requires some sort of project ownership. You can operate on those resources so long as you provide a project for them to be contained in. This isn't true for other things you can do with OpenStack APIs like creating endpoints, domains, or live migrating instances. In the ideal world, what project should you scope to if you want to migrate an instance from one host to another for maintenance (in a way that doesn't confuse users or violate multi-tenancy)?
As we discussed more throughout the week, we seemed to gravitate towards the term "system" instead of "global". This too is best highlighted with an example. If your entire deployment bubbles up to a single domain in a hierarchy of projects, and you have a role on that domain, shouldn't you be able to do things to all instances in the deployment? Is that not in some way also "global" in nature? After thinking about that case, it seemed awkward to reserve "global" scope for system-like operations.
The `is_admin_project` approach is already exposed in keystone APIs and being consumed by certain projects. Some were opposed to the idea because it can be confusing to explain to users and it continues to inflate the already overloaded term "project". Fortunately, if we implement a system-scope mechanism in keystone, we should be able to provide a migration path from admin project that results in it being just another project (e.g. consumers shouldn't have to defensively protect an admin project in scripts, etc). While it is more work and it will be a long migration, it does seem like the best route for operators and end users.
In order to get other services in a better place for solving elevated context problems, we agreed that we need a couple more tools from oslo. The first is the ability to mark a specific policy for deprecation. The whole idea of moving policy into code is to treat it more like configuration and less like a file operators have to manually maintain. Providing deprecation is the next logical step in that progression. It will give developers a way to rename and fix existing policies in addition to changing their default rules. Operators can consume these changes just like they consume deprecated configuration options. If they want the old behavior, they can copy/paste the old value into configuration and override the new default. The second really powerful tool would be an immutable `scope` attribute in the existing oslo.policy objects (see an in-code example here). This part is important because it would allow developers to set the scope of an operation when it is defined, which would get rendered in documentation for operators and eventually be available in policy evaluation. For example, scope for endpoint operations would be set to "system" and scope for instance operations would be set to "project". We could only think of two scopes, "system" and "project", since domains are essentially projects if the project hierarchy is consumed/implemented properly.
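As a rough illustration of the two proposed tools, the sketch below models a policy definition that carries an immutable scope and an optional pointer to the policy it deprecates. It deliberately uses only the standard library and is not the oslo.policy API: `PolicyRule`, `DeprecatedPolicy`, `effective_check`, and the check strings are all hypothetical. Honoring an operator override written against the old name is my assumption, by analogy with deprecated configuration options.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional

VALID_SCOPES = ("system", "project")  # the only two scopes discussed


@dataclass(frozen=True)
class DeprecatedPolicy:
    """Hypothetical record of the old name and default being replaced."""
    name: str
    check_str: str
    reason: str


@dataclass(frozen=True)  # frozen: scope cannot change after definition
class PolicyRule:
    name: str
    check_str: str
    scope: str
    deprecated: Optional[DeprecatedPolicy] = None

    def __post_init__(self):
        if self.scope not in VALID_SCOPES:
            raise ValueError(f"unknown scope: {self.scope!r}")


def effective_check(rule, overrides):
    """Pick the check string to enforce, given operator overrides.

    An override under the new name wins. An override under the deprecated
    name is still honored (assumption), otherwise the new default applies.
    """
    if rule.name in overrides:
        return overrides[rule.name]
    if rule.deprecated and rule.deprecated.name in overrides:
        return overrides[rule.deprecated.name]
    return rule.check_str


create_endpoint = PolicyRule(
    name="identity:create_endpoint",
    check_str="role:admin",
    scope="system",
    deprecated=DeprecatedPolicy(
        name="identity:endpoint_create",       # hypothetical old name
        check_str="rule:admin_required",       # hypothetical old default
        reason="renamed for consistency",
    ),
)

print(effective_check(create_endpoint, {}))
print(effective_check(create_endpoint, {"identity:endpoint_create": "rule:admin_required"}))
```

An operator who wants the old behavior keeps the old value in their configuration. Everyone else picks up the new default without touching a policy file.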
Helping projects consume these new additions would make for another community goal, which I plan to propose for Rocky. Project developers would have the tools they need to signal which operations function on a project basis and which function on a system basis, something we don't have today. They can also gradually move towards role definitions that are more useful and granular than "admin" and "everyone else". A big thanks to the folks who took notes for this and published them in etherpad, which denotes our new roadmap for policy.
Without getting too far ahead of myself, we wrapped up Friday with a super interesting discussion about future usage of "system" scope. What if you could break "system" into multiple things, for example by taking regions or services into account with a system hierarchy? This might be a good way to distribute delegation across operators of a deployment while maintaining separation of concerns.
Being able to grant administrator or audit privileges on a per-service or per-region basis might be useful. We're not entirely sure yet, and there are probably more holes to poke here, but this suggests that we should at least build the initial "system" mechanism to handle a hierarchy if we want to move in that direction in the future.
Application credentials were one of the first things we discussed during the Baremetal/VM sessions. We'd made progress on defining the problem during the Pike PTG, but uncovered some issues after merging the specification.
To summarize, OpenStack currently doesn't have a good authentication/authorization solution for users who want to delegate the consumption of OpenStack services to an application. Today, this typically requires an administrator to create a new service user, or have the user put their OpenStack credentials in a configuration file, which might be accessible to other members of the team. This case gets even more complicated when considering OpenStack deployments that are integrated with LDAP and corporate policies where LDAP credentials must never be written to disk.
Application credentials sought to help on this front by allowing users to create application-specific passwords to access OpenStack services. One of the original requirements was that an application credential should continue to function normally even after the user who created it leaves the organization or is disabled. This essentially tied the life-cycle of the application credential to the project. This introduces a bunch of interesting problems, especially since application credentials would need to be rotated regardless of any team member changes. For example, Alice sets up an application credential for some infrastructure that manages instances. Bob is on the same project as Alice, and has access to the configuration file for the instance management service they work on, so he can see the application credential. If Alice leaves, she could still access OpenStack APIs with the application credential even though her user might be deleted. Bob could do the same thing if he leaves the team, even though he had nothing to do with the creation of the application credential. This argument shows the importance of responsible credential rotation and highlights the fact that it should be done anytime there are changes to the team. This effectively ties application credentials to user life-cycles.
While we agreed that project-lived application credentials pose security concerns, application credentials tied to user accounts are still a massive improvement over what we have today. Monty and Colleen are going to be updating the specification with the outcomes from the PTG. Colleen also posted a great summary to the OpenStack-Dev mailing list right after the session.
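The Alice and Bob scenario can be sketched with a toy credential store in which an application credential stops working as soon as its creator is removed, and rotation invalidates the old secret. This is only an illustration of the user-tied life-cycle discussed above, not keystone's implementation. `CredentialStore` and all of its methods are hypothetical.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class CredentialStore:
    """Toy model: every credential is tied to the user who created it."""
    active_users: set = field(default_factory=set)
    creds: dict = field(default_factory=dict)  # secret -> (owner, project)

    def create(self, user, project):
        if user not in self.active_users:
            raise PermissionError(f"{user} is not an active user")
        secret = secrets.token_urlsafe(32)
        self.creds[secret] = (user, project)
        return secret

    def authenticate(self, secret):
        """Return the project the secret grants access to, or None."""
        entry = self.creds.get(secret)
        if entry is None:
            return None
        owner, project = entry
        # User-tied life-cycle: the credential dies with its owner.
        return project if owner in self.active_users else None

    def rotate(self, secret):
        """Replace a secret; anyone holding the old one loses access."""
        owner, project = self.creds.pop(secret)
        return self.create(owner, project)

    def remove_user(self, user):
        self.active_users.discard(user)


store = CredentialStore(active_users={"alice", "bob"})
old = store.create("alice", "infra")
new = store.rotate(old)           # e.g. after Bob leaves the team
print(store.authenticate(old))    # the copy Bob saw no longer works
store.remove_user("alice")
print(store.authenticate(new))    # Alice's credential goes with her
```

Under the project-lived model, the last call would still succeed, which is exactly the concern raised in the session.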
As a team, we've discussed microversions multiple times. In the past we've always filed it as something that doesn't really affect keystone and we should wait until the community comes calling for it. Well, most projects in OpenStack now use microversions and we're one of the few projects that don't.
Where this conversation differed from previous discussions is that we had a couple folks available in the room for questions. Sean Dague was able to describe how nova uses microversions, which, at the very least, helped me understand what they are designed for and how they're used. I think one of the big misconceptions we had as a group was that microversions let us remove functionality or attributes exposed by the API from N.X to N.X+1. They don't necessarily. We still have to keep them, since a client can now request that specific version from the API. We can only remove those specific attributes once the minimum required version is bumped. From a process perspective, this is similar to how keystone removed the v2.0 API. Minimum version bumps take a long time and require tons of external communication to users (more details on microversions can be found in review).
The conversation ended with most of the team accepting the approach, and a few folks still in the "soft no" category (thought process and votes are documented in the etherpad). All in all, microversions won't solve all the problems we have in our API, and they still require us to maintain translation code between microversions. But the main advantage of implementing them is that we'd be consistent with the growing number of OpenStack projects that are adopting them. The next step for the team is to start building a list of things we can do with microversions, outside of the obvious. This will help us justify the work to implement it.
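A small sketch may help show why an attribute can't simply disappear between N.X and N.X+1. The version numbers, the `legacy_flag` attribute, and both functions below are hypothetical. The point is only that the translation code for older microversions has to stay until the minimum version is raised past them.

```python
# Hypothetical supported range for this sketch (not keystone's real versions).
MIN_VERSION = (3, 0)
MAX_VERSION = (3, 2)


def parse_version(header_value):
    """Turn a requested version string such as "3.1" into a tuple."""
    major, minor = header_value.split(".")
    return int(major), int(minor)


def render_user(user, requested):
    """Render a user for the microversion the client asked for."""
    if not MIN_VERSION <= requested <= MAX_VERSION:
        raise ValueError(f"unsupported version: {requested}")
    body = {"id": user["id"], "name": user["name"]}
    # `legacy_flag` was dropped from the representation in 3.2, but clients
    # may still request 3.0 or 3.1, so this branch must stay in the code
    # until MIN_VERSION itself is bumped to 3.2.
    if requested < (3, 2):
        body["legacy_flag"] = user["legacy_flag"]
    return body


user = {"id": "abc123", "name": "alice", "legacy_flag": True}
print(render_user(user, parse_version("3.1")))
print(render_user(user, parse_version("3.2")))
```

Only after `MIN_VERSION` moves to `(3, 2)` can the `legacy_flag` branch, and the stored attribute behind it, actually be deleted.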
Based on the current workload, it's unlikely keystone will be able to commit to microversions this release. Once we've burned some of our other items down or if someone decides to own it, we can commit it to a release. Overall, the session seemed educational and informative and we know more about where we stand and why.
Deprecations & Removals
The list of things for keystone to deprecate and remove is short, but the contents are significant.
The v2.0 API
We finally did it. We removed 90% of the v2.0 API in the Queens release. We've been working towards this as a team for a long time. It was exciting to review and approve the patches as a group (we also impatiently watched them trickle through the gate during happy hour and at supper). This is a big milestone for the team and the rest of the week was filled with discussions of things we can improve and clean up as a result:
- The policy/RBAC enforcement framework becomes a lot easier to refactor
- Policy/RBAC business logic will become easier to understand
- Testing between v2.0 and v3 API versions becomes much simpler, which gives us opportunities to improve many parts of our existing framework and functional tests
All of these things will make it easier for keystone to do things that have been on our cleanup list for a while (e.g. Flask support).
UUID Token Provider
We've been running with Fernet as the default token provider for some time now. The UUID provider only has two advantages over the Fernet provider. First, it is very easy to deploy and doesn't require any setup (Fernet requires some encryption key setup and distribution before proper authentication can happen). Second, it provides keystone with a fall-back token provider in the unlikely event there is a security issue with the Fernet specification or the implementation keystone uses. Providing a backup token format is important, and luckily there has been a bunch of work recently around the JSON Web Token (JWT) standard (RFC 7519). Unfortunately, Fernet was developed just months before RFC 7519. Fortunately, implementing a JWT token provider would be relatively easy and it would reuse much of the Fernet plumbing.
The consensus here was that we are going to remove the UUID token format in Rocky and propose a specification for JWT. There might be a brief period where keystone only supports a single token provider, but the trade off is that a lot of the token provider API can be simplified as a result of removing UUID token provider cruft. Regardless of keystone's token provider support, it's imperative for it to remain pluggable for out-of-tree implementations.
SQL Token Driver
This goes hand-in-hand with the deprecation of the UUID token provider. Both Fernet and JWT (not implemented yet) are non-persistent, making the SQL driver unused by anything in keystone if UUID is removed in Rocky. The important part is that if out-of-tree implementations are relying on the in-tree SQL token driver implementation, then they need to roll the persistence logic into their token provider implementation. This should be clearly documented and advertised for deployments using a custom token provider that requires persistence.
SQL Catalog Driver
With the removal of v2.0, a lot of business logic in the catalog API gets simpler. There is also a proposal to add a new file-based catalog backend that uses YAML. This fits the use cases of nearly all deployments, and a file-based solution makes sense since catalog information is unlikely to change often. This led us to discuss the deprecation of the SQL catalog driver. The templated catalog driver would also be removed in favor of the YAML-based approach. We're unsure of a removal date, but we're certainly looking to discuss this with operators who weren't able to attend the PTG.
Mission Statement & Long-Term Vision
A couple contributors were curious about keystone's long-term vision and given recent contributor turn-over, it made sense to have a session on this topic. Both seasoned and new contributors were able to share their vision for the project. The majority of it boiled down to the following:
- We should strive to follow industry standards instead of rolling our own purpose built things
- Being consumed by most OpenStack projects puts us in a position to be advocates for helping other projects use keystone in efficient ways
- Be less of an authentication service and more of an authorization service (pushing the federation model)
- A good identity/authorization system should be completely transparent but still secure
Harry Rybacki took an action item to consolidate the discussion and propose a formal mission statement that we can iterate on. The result should embody what we've worked towards in keystone and the direction we want to take.
This time we decided to change up the format of our retrospective. I won't get into the direct results/outcomes of our retrospective; you can find those here. I want to describe what we did, how it differs from what we did in the past, and how we liked it.
Instead of having it as a session during the day, we held our Pike retrospective at a good local brewery (which are plentiful in Denver). We also used a different tool (Postfacto) instead of etherpad. We made our retrospective open for others to view. It took us at least a couple of hours to get through it, but it was two hours of very productive discussion. First, everyone got a beverage. Then we spent 20 minutes filling out each of the columns and voting on cards we thought were important (votes were unlimited, which is why "v2.0 removal" had so many upvotes - there weren't 200+ people in our retrospective). We addressed the most popular topics first and limited each topic to 5-minute increments. Every topic wrapped up with possible action items, which are listed at the bottom.
I found the retrospective extremely useful, and I wasn't alone. One of my action items for this release is to schedule retrospectives at the end of every milestone. The tooling we used supports video conferencing, so it will be interesting to see how productive it is compared to an in-person retrospective.
If you have comments or questions about our retrospective, feel free to ping me in #openstack-keystone. I plan to do a separate post on the per-milestone retrospectives this release.