OpenStack Stein PTG - Keystone Summary
One year after meeting in Denver for the Queens PTG, we returned to Stapleton, Colorado to plan the Stein release. Although the trains sound the same, much has changed in keystone since then.
A year ago, we focused on rebuilding a roadmap to deliver more APIs to end-users and make deployments more secure for operators with granular role-based access control (RBAC). We had a plan to better support application developers consuming OpenStack. We also worked on the foundation of a consistent hierarchical quota experience.
Today, we have tools that help services protect APIs with more granularity and provide default roles out-of-the-box, allowing OpenStack developers to expose more functionality to end-users. We have application credentials that developers can use to give authorization to software consuming OpenStack in a more user-friendly and secure way. We also have a unified limits API that we've incrementally improved over the year while we prepare for services to start using it.
It was exciting to watch all those initiatives take shape over the last 12 months. They still popped up throughout the week, but from the perspective of services looking to consume them, as opposed to the design discussions we were having precisely one year ago.
In the weeks leading up to the Stein PTG, contributors took time to think about the next set of challenges that face keystone, and how to address them. Their thoughts helped generate some new, refreshing discussions about what we expect to do over the next year or two.
The following report is dense, but I've structured it into sections so that it's easier to pick out the parts you care about most.
Federation
The most recent results from the OpenStack user survey list federated identity as the top contender for keystone-specific improvements. This feedback isn't news, but it has taken a back seat to other initiatives. With a few of our other items well underway, like granular policy and unified limits, this is an excellent time to revisit that work.
We want federation to be a first-class citizen. To make that a reality, we identified a set of bugs whose fixes will improve the user and operator experience:
We found a way to relay group information in SAML assertions
We're going to improve validation of groups during WebSSO flows
We're going to make the remote ID attribute configurable via the API, as opposed to using configuration files
We're going to improve delegation usability for federated users with refreshable application credentials
A couple of people who recently parsed the federated identity documentation gave us some pointers on how we can make federation easier to understand, covering everything from why someone might want to use federated identity to debugging issues in service providers.
Discussions about future enhancements revolved around the concept of keystone as an identity provider proxy. Ultimately, this means keystone would talk to multiple identity providers that might use different protocols, like OpenID Connect or SAML, which requires implementing better native support for pluggable protocols. One benefit is the possibility of simplifying the attribute mappings used today by mapping from SAML attributes directly to OpenStack entities, as opposed to mapping from SAML to environment variables and finally to OpenStack entities. The mapping should be generalized for other protocols, too. With keystone's adoption of Flask in Rocky, it'll be easier to improve the single sign-on experience by implementing additional content-type headers. In the case of OIDC, we can use the OAuth 2.0 protocol to experiment with scopes for creating tokens tied to specific operations.
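For reference, here is roughly what a mapping rule looks like today, expressed as a Python dict. The attribute and group names are illustrative; the point is that the "remote" matchers operate on environment variables set by the web server's auth module, which is the indirection the proxy idea could remove.

```python
# A simplified keystone federation mapping, shown as a Python dict.
# Names here are illustrative. Today the "remote" matchers operate on
# environment variables (e.g. REMOTE_USER) populated by the web
# server's auth module; a protocol-aware proxy could let them reference
# SAML or OIDC attributes directly instead.
mapping = {
    "rules": [
        {
            "remote": [
                {"type": "REMOTE_USER"},
            ],
            "local": [
                # "{0}" substitutes the first matched remote attribute.
                {"user": {"name": "{0}"}},
                {"group": {"name": "federated_users",
                           "domain": {"name": "Default"}}},
            ],
        }
    ]
}

assert mapping["rules"][0]["remote"][0]["type"] == "REMOTE_USER"
```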
If we make federation a first-class citizen, we also need to think about how to handle multiple user accounts. For example, a user authenticating via LDAP and using SAML assertions from ADFS results in two separate accounts. Both users may have the same role assignments on various projects, but support for linking those accounts doesn't exist. The concept of shadow users started decoupling the way in which a user authenticates from the user reference itself. The next step in that work is to allow users to link accounts or associate multiple ways of authentication to a single user.
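A rough sketch of the account-linking idea follows: authentication identities are decoupled from the user record, so multiple identities resolve to one canonical user. The table-like structures and names below are illustrative, not keystone's actual schema.

```python
# Illustrative sketch of account linking: each authentication identity
# (LDAP, SAML from ADFS, ...) maps to one canonical user record, so
# role assignments attach to the user rather than to each identity.
users = {"u-123": {"name": "alice"}}

# Each entry links one authentication identity to a canonical user.
linked_identities = [
    {"idp": "corporate-ldap", "unique_id": "alice", "user_id": "u-123"},
    {"idp": "adfs-saml", "unique_id": "alice@example.com", "user_id": "u-123"},
]

def resolve_user(idp, unique_id):
    """Return the canonical user for an authentication identity."""
    for row in linked_identities:
        if row["idp"] == idp and row["unique_id"] == unique_id:
            return users[row["user_id"]]
    return None

# Both identities resolve to the same user, so they share assignments.
assert resolve_user("corporate-ldap", "alice") is \
    resolve_user("adfs-saml", "alice@example.com")
```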
We started the work for shadow users in Mitaka and continued to work on it through Newton and Ocata. There hasn't been any significant movement since then. We are going to refamiliarize ourselves with the current implementation and see if we can pick up the pieces for account linking.
We made a note to create four specifications detailing the work summarized here.
Operator Feedback
We planned an impromptu feedback session with operators on Wednesday, since the Foundation collocated the Ops Midcycle with the PTG. We took a pulse on system scope and unified limits, and I was reassured to hear responses along the lines of "are they ready yet?" concerning those initiatives. It means we're still on the right path, and we were able to give operators a better idea of the expected timeline for consumption. We also shared the two different enforcement models supported by unified limits. Operators in the room weren't opposed to the strict two-level project hierarchy, primarily because most of their deployments still rely on a flat project structure. Once services start adopting unified limits, operators should be able to use them without worrying about disruptive changes to project structure, which is encouraging.
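To make the strict two-level model concrete, here is a minimal sketch of the enforcement check, assuming (hypothetically) that a parent project's limit caps the combined usage of itself and its immediate children. This is not keystone's implementation, just an illustration of the model's behavior.

```python
# Illustrative sketch of strict two-level enforcement: a parent
# project's registered limit caps the total usage of the parent plus
# all of its children. Function and parameter names are hypothetical.
def would_exceed(parent_limit, parent_usage, child_usages, requested):
    """Return True if claiming `requested` more of a resource anywhere
    in the two-level tree would push the subtree over the parent's
    registered limit."""
    total = parent_usage + sum(child_usages) + requested
    return total > parent_limit

# A parent limited to 10 cores, using 4 itself, with children using
# 3 and 2, can only accommodate 1 more core anywhere in the tree.
assert would_exceed(10, 4, [3, 2], 2) is True
assert would_exceed(10, 4, [3, 2], 1) is False
```

In a flat project structure there are no children, so the check degenerates to a simple per-project comparison, which is why the operators in the room weren't concerned.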
Edge
We spent Tuesday with the Edge working group to discuss architectural issues across tens or hundreds of regions. There were two solutions up for debate.
The first was to write a layer in between the application and the database that attempts to be smarter about data replication in edge-specific deployments. Several people in the room were hesitant to pursue this approach, especially since it requires domain-specific knowledge about SQL and low-latency replication in general.
The second approach was to use federation to provide the appearance that each independent region is part of the same deployment, without replicating data. James Penick drew out the specific approach they use at Oath. Using federation also fits nicely with the work we'll be doing in that area moving forward. Federated improvements for edge deployments will be reusable for public, private, and hybrid deployments, giving us more bang for our buck across use cases.
JSON Web Tokens
We revisited a specification detailing this work, which we've carried forward for the last couple of cycles. In Dublin, we needed more clarity on why exactly we want to use JWTs, especially since we already support a non-persistent token format. There are two specific use cases for implementing JWT support. First, it offers a backup solution to Fernet. Second, asymmetric encryption or signing allows for better support of validation-only keystone servers.
Since public key encryption and signing keeps the private key on the initial host, operators can sync public keys to keystone servers used only to validate tokens. A compromised read-only keystone server wouldn't let bad actors craft tokens to use in other regions. We walked through the key setup and rotation strategy, mainly noting the differences between asymmetric and symmetric keys. The other significant detail from the discussion is that the libraries supporting JWE tokens (the JWT equivalent of Fernet) are not compatible with OpenStack licensing. Libraries compatible with OpenStack licensing only support JWS tokens, which didn't seem to be a concern since keystone already supports Fernet. Deployments concerned about sensitive data in token payloads should continue to use Fernet tokens. We also plan on using big red stickers to warn users that payloads are an internal implementation detail; anyone relying on that information accepts the risk of being broken if payloads change for any reason.
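The payload-visibility point is easy to demonstrate. A JWS token is three base64url-encoded segments (header.payload.signature): the signature guarantees integrity, but nothing is encrypted, so anyone holding a token can read its claims. The sketch below builds a fake token with hypothetical claim names just to show the decoding; it is not keystone's token format.

```python
import base64
import json

# Sketch of JWS structure: header.payload.signature, each base64url
# encoded. The signature (faked here) protects integrity only; the
# payload is readable by anyone, which is why sensitive data belongs
# in Fernet tokens, not JWS payloads. Claim names are hypothetical.

def b64url(data: dict) -> str:
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def decode_segment(segment: str) -> dict:
    padded = segment + "=" * (-len(segment) % 4)  # restore padding
    return json.loads(base64.urlsafe_b64decode(padded))

header = b64url({"alg": "ES256", "typ": "JWT"})
payload = b64url({"sub": "some-user-id", "project": "some-project-id"})
token = f"{header}.{payload}.fake-signature"

# No keys are needed to read the claims back out of the token.
claims = decode_segment(token.split(".")[1])
assert claims["sub"] == "some-user-id"
```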
I've updated the specification with details from the session.
Unified Limits
I noted earlier that nearly all the discussion here was specific to services adopting the new model. We went through things that changed since the Vancouver summit and walked through usage examples. I took five main things away from these sessions throughout the week.
First, we found a way to make it easier for services to consume oslo.limit by reducing the number of callbacks required by the library. Initially, we expected services to provide a single callback responsible for returning the usage of a particular resource. The problem arises in the event of a race condition, where two clients claim a resource at the same time and exceed the project's limit. You would expect the service to clean up the resources created last, which would require an additional callback for oslo.limit to reap the exact resources specific to the failed request. This approach starts to muddy the water and requires specific hand-offs between the service and oslo.limit. Instead, Jay Pipes proposed doing all of this in a single callback and said he would work with Melanie on an example using nova. We plan to address any gaps flushed out by the example before releasing a version nova can use.
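The single-callback flow can be sketched as follows. All names and signatures here are hypothetical, since oslo.limit's interface wasn't final at the time; the point is only the shape of the hand-off: the service provisionally creates resources, the enforcer checks usage against limits via one callback, and on failure the service rolls back its own request.

```python
# Hypothetical sketch of the single-callback enforcement flow discussed
# for oslo.limit; these are not the library's real names or signatures.
class ProjectOverLimit(Exception):
    pass

def enforce(project_id, resources, usage_callback, registered_limits):
    """Raise if the project's current usage (which already includes the
    resources just created for this request) exceeds any registered
    limit. On failure the service cleans up its own request, so no
    second "reaper" callback is needed."""
    for resource in resources:
        usage = usage_callback(project_id, resource)
        if usage > registered_limits.get(resource, 0):
            raise ProjectOverLimit(resource)

# Two racing requests each claimed 2 cores against a limit of 8; the
# loser sees usage of 9 after creation and must roll back.
current_usage = {"cores": 9}
try:
    enforce("project-x", ["cores"],
            lambda project, resource: current_usage[resource],
            {"cores": 8})
    over = False
except ProjectOverLimit:
    over = True
assert over
```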
Second, we discussed support for domain-level limits. Today, unified limit support is specific to projects, but it's not difficult to come up with cases where you'd like to have limits for a domain. As a group, we didn't have strong opposition to the idea, and there is a specification up for review.
Third, we worked with the nova team to discuss the existing limits API in their service and what to do with it. The API in nova returns limits and usage in the same response, so how do we aggregate that information together now that it's coming from two different services? We iterated over three proposals:
Proxy usage from services into keystone
Aggregate usage and limits in clients
Query keystone directly from the service
The issue with the first proposal is that it requires another API in keystone and requires keystone to iterate over services. The second proposal relies heavily on clients to aggregate things correctly, with multiple implementations across SDKs. The third is technically supported by keystone today, and we can use oslo.limit to do it. John Garbutt is working on a specification for integrating unified limits into nova and plans to include this as a detail of that specification.
Fourth, we realized that user limits aren't a valid use case for unified limits, which are specific to projects. Most user-specific limits don't constrain resource consumption so much as database bloat. Everyone in the room agreed that limiting these types of things might be important, but it falls outside the scope of using unified limits to rate-limit users.
Finally, we walked through a migration path for both developers and operators. Developers must support pre-existing quota systems in their services while providing a way to consume unified limits. Despite reluctance to add yet another configuration option, it might make sense to introduce one for this migration: deployers can opt into unified limits with the option for a cycle or two, after which we plan to deprecate the toggle in favor of defaulting to limits defined in keystone.
System Scope & Granular RBAC
Similar to the previous section, much of the discussion here was specific to service adoption. We took the time to meet with teams to answer questions about system scope and how to use it. Keystone itself is going to be making changes to consume some of this work, too. A couple of people asked if this was a sign of an impending community goal to use system scope. That's the intent, but before we do, we want to make sure we've built out examples for other projects to use, and we can do that while making keystone's API more self-serviceable. If everything goes as scheduled, we should be able to consider this for a community goal proposal in the Train* cycle.
I dropped into a cinder session where the team was investigating ways to prevent regressions when changing policy values, which happened late last release. Based on my interpretation, the root of the problem is that many services override policy evaluation in unit tests. The practice is common because modeling the authorization relayed in tokens before invoking the API through the oslo libraries takes effort; bypassing authorization altogether is easier. Ideally, we'd like to unit test this across projects, or even use tempest and patrole, but that's going to be a significant amount of non-trivial code. Instead, I took a look at some of cinder's API tests and proposed a patch that exercises the policy engine and the default cinder policies. I hope the review process shares knowledge and promotes a convention that makes it easy for other developers to improve policy test coverage.
Consistent Policy Names
On Monday we discussed standardizing on a consistent set of policy names since there are several different formats in use. Operators also expressed interest in this initiative on Wednesday when we asked them for feedback.
I spent some time last week going through most projects that use oslo.policy, attempting to distill a general naming convention for each service. Afterward, I sent an email to the development and operator mailing lists, ultimately kick-starting a discussion to reach a uniform convention. I'm going to propose a couple of formats based on that feedback by the end of the week. The oslo.policy library supports renaming and deprecating policies, which should make implementing the new convention easier than in the past. Note that only deployments that override policies with custom values are affected by this change.
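The renaming story for operators can be sketched simply: resolve a policy override under the new name first, and fall back to the old name with a deprecation warning so existing overrides keep working during the transition. oslo.policy provides this behavior natively through its deprecation support; the stand-alone function and policy names below are illustrative.

```python
import warnings

# Hypothetical sketch of override fallback during a policy rename,
# mirroring what oslo.policy's deprecation support does natively.
# Policy names here are illustrative, not a proposed convention.
def resolve_override(overrides, new_name, old_name):
    """Prefer an override registered under the new policy name; fall
    back to the old name with a deprecation warning so deployments
    with custom values keep working during the transition."""
    if new_name in overrides:
        return overrides[new_name]
    if old_name in overrides:
        warnings.warn(f"policy '{old_name}' was renamed to '{new_name}'",
                      DeprecationWarning)
        return overrides[old_name]
    return None

# An operator still overriding the old name gets the same behavior.
overrides = {"identity:list_users": "role:reader"}
assert resolve_override(overrides, "identity:user:list",
                        "identity:list_users") == "role:reader"
```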
Horizon Feature Gaps
We sat down to discuss missing identity features in horizon, and ended up building a laundry list of things that would be useful to add:
Pagination of shadow users
Management of unified limits
System role assignments
Implied role assignment
Setting user options
If you're interested in working on any of these items, please don't hesitate to reach out to e0ne or me directly.
Photo: I snapped the photo for this blog post nearly a year ago, to the day, as we worked out system-scope details.
* Train is obviously the unofficial name for the next release but it does seem fitting given the relationship we've built with the A-line trains in Stapleton.