Wednesday, October 20, 2010

Memory Tuning - How OS Page Sizes Can Impact Oracle Performance

Theory

Modern operating systems provide each process its own virtual address space that is potentially larger than physical memory. They do this by mapping the virtual addresses onto actual physical memory pages. The following diagram shows how large virtual address spaces are mapped onto smaller physical memory.



The virtual-to-physical relationship is maintained in mapping tables. Linux maintains a separate map for each process. The map contains one entry for each page, where the size of the page is set by the operating system.  The CPU attempts to speed up the mapping of virtual addresses to physical addresses by caching a subset of the mapping entries in its internal cache (the translation lookaside buffer, or TLB).  Modern large-memory systems and the resulting large processes have grown the memory mapping tables faster than CPU cache sizes have grown. The processor cache is very small, often on the order of 100 entries in many processors. This mismatch causes faults that can have significant performance implications.

System Analysis

Linux uses physical memory to maintain the process memory mapping tables and swaps those mapping tables in and out of the processor cache on context switches and on page table faults. This physical memory is unavailable for other uses, essentially reducing the amount of memory available to the rest of the system.  The amount of memory used for page tables is the sum of all the virtual process sizes divided by the page size, multiplied by the size of a page table entry.  So a system with a 20GB Oracle SGA uses 50MB of physical memory for page tables just for the SGA:
  • 20GB process size / 4KB page size * 10 bytes per entry = 50MB (5 million entries)
Every connection to an Oracle database runs in its own process that has direct access to the Oracle SGA through shared memory.  This doesn't seem too bad until you consider that the page tables for the entire shared memory space are duplicated in each process.  So a 20GB SGA with 200 database connections would use 10GB for the page tables just for the SGA portion of each process:
  • 20GB process size / 4KB page size * 10 bytes per entry * 200 connections = 10GB (1 billion page table entries)
This means a 32GB Oracle server with a 20GB SGA and 200 connections uses roughly 30% of its physical memory just for page tables.  It also means that the Oracle processes, the OS and other processes no longer fit in memory, resulting in possibly significant paging.  You cannot rely on ps or other process monitors when investigating this because Linux always uses all available memory, allocating any "extra" to file system buffers and other needs.  You can see your actual page table usage by dumping the contents of /proc/meminfo with cat /proc/meminfo and looking at the PageTables line.
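If you want to check this on your own server, here is a minimal Java sketch (assuming a Linux host) that prints the relevant /proc/meminfo counters; the class name is just an example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Prints the page table and huge page counters from /proc/meminfo (Linux only). */
public class PageTableUsage {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("/proc/meminfo"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // PageTables is the memory currently spent on page tables; the
                // HugePages_* and Hugepagesize lines describe the huge page pool.
                if (line.startsWith("PageTables") || line.startsWith("HugePages")
                        || line.startsWith("Hugepagesize")) {
                    System.out.println(line);
                }
            }
        } finally {
            in.close();
        }
    }
}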

Linux provides the ability to change the page size from 4KB to 2MB, potentially returning large amounts of memory to the swappable pool.  The calculations show that converting just the Oracle SGA in the previous example would save almost 10GB.
  • 20GB process size / 2MB page size * 10 bytes per entry * 200 connections = 20MB (2 million entries)
Why Aren't Huge Pages the Default?

Huge (2MB) pages are different from the normal 4KB pages in that they are locked into memory and do not swap.  This means that extreme care must be taken when selecting the number and size of the huge page space.  The Huge Page space should be large enough to hold the Oracle SGA while leaving enough other memory to run the non-SGA client code, the operating system and other processes. Teams must understand their memory needs so that they can set the correct number of huge pages.
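As a rough illustration of that sizing exercise, here is a back-of-the-napkin sketch in Java. The 20GB SGA and the cushion are example values, not recommendations; the result corresponds to the vm.nr_hugepages kernel setting.

/** Back-of-the-napkin sizing for the huge page pool; all inputs are examples. */
public class HugePageSizing {
    public static void main(String[] args) {
        long sgaBytes      = 20L * 1024 * 1024 * 1024; // 20GB SGA (example)
        long hugePageBytes = 2L * 1024 * 1024;         // 2MB huge page size

        // Round up so the whole SGA fits, then add a small cushion for any
        // other shared memory segments that should also live in huge pages.
        long pagesForSga = (sgaBytes + hugePageBytes - 1) / hugePageBytes;
        long cushion     = 64;                          // example headroom only

        System.out.println("Suggested vm.nr_hugepages = " + (pagesForSga + cushion));
    }
}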

Production Example
I was talking with a colleague about their system and we did some back-of-the-napkin calculations.  They had an Oracle RAC with two 72GB servers, each with 500 database connections. This ran a little slower than expected when both machines were up, but their real problem was that they were unable to fail over to a single node even though they had plenty of CPU capacity.  The following calculations show why.

Before
Page table entries (SGA) = 35GB / 4KB ≈ 8.75 million
Page table entries (1,000 connections) = 1,000 * 8.75 million = 8.75 billion
Memory (1,000 connections) = 10 bytes * 8.75 billion = 87.5GB

After
Page table entries (SGA) = 35GB / 2MB ≈ 17,500
Page table entries (1,000 connections) = 1,000 * 17,500 = 17.5 million
Memory (1,000 connections) = 10 bytes * 17.5 million = 175MB

The team converted 35GB of memory to Huge Pages so that the entire SGA fit into the huge pages.  This returned 40GB of free memory to each machine and made it possible to fall back to a single machine during planned or unplanned outages on a single server.

A smaller system at another site recovered 7GB of physical memory on their 32GB servers by converting their 20GB Oracle SGA to Huge Pages.  This ended all their system paging.

Recap

Intelligent conversion from small memory pages to large memory pages on modern large-RAM systems can have significant performance benefits:
  • Larger page sizes mean far fewer page table entries, so more of the mapping fits in the CPU cache.  This results in fewer cache misses and less CPU stalling.
  • The memory savings from the roughly 1/500 reduction in page table entries is multiplied by the number of processes sharing the same shared memory.
Every team with large Oracle Databases should investigate the use of Huge Pages no matter what their operating system.

Monday, October 18, 2010

Improving Software Quality with Continuous Instrumentation

Better Software is No Accident
The most reliable way to build better software is to integrate quality into the process with as much automation as possible.  Automated measurements and metrics provide continuous, public visibility into the current state and trends. It is easiest to apply these techniques to new projects, but they can also be applied to existing processes in an incremental fashion:
  • Balance the value with the ease of automation. You may sometimes pick less than ideal metrics if they are easier to add
  • Success breeds success.  Do simple stuff first and get points on the board
  • Incrementally improve and change.  Add changes at the rate the team can absorb, often at a slightly faster rate than they would like to go.
This presentation covers what we were able to implement in about 15 months.  We didn't take on design concepts like cohesion, fan-in, fan-out and file tangles until the second year. It focuses more on actual software production than on the requirements, design or business testing phases. This document does not cover important topics like coding standards even though they are essential because they create a common set of expectations and offer a framework for new team measures.

Spending a Little More Up Front Saves Money
Everyone knows that defects cost more to fix the later they are found.  More people have to get involved as the code moves to upper environments, and customer satisfaction or legal issues can appear if a defect makes it all the way to production.
Continuous integration and measurement add cost at the beginning but produce a significant reduction in defects, which actually shortens the total to-customer cycle time.  (Numbers from the original presentation deleted at the company's request.)

Continuous Improvement
Software development has a lot in common with other processes.  Shorter cycles combine with honest measurement to provide opportunity for improvement and higher quality.  Every cycle is an opportunity to do something better. 

Many software projects only measure defect rates after it's too late to improve.  Modern tools make it easy to track some quality measurements early in the process where changes can have a bigger downstream impact.

Automation Assisted Improvement
Quality processes involve many steps that must run over and over, often for years.  Team members' interest in the process can fade over time as other "more pressing" problems must be handled.  Automation is the best way to sustain a process over time.

This diagram shows the everyday cycle of software development as viewed purely from the development point of view.  Peer/code reviews are a core part of the quality life cycle.  Manual code inspection finds defects better than any other tool and acts as an excellent vehicle for staff training.  I've worked on two large (1M LOC) projects where 100% peer review was required, with a significant reduction in bad code and defects.  The problem with peer reviews is that their quality varies with the reviewer and they are subject to management pressure when the team is behind.

Automated measurements work the same no matter what the schedule or the desires of individuals. The preceding diagram shows a system where automated static analysis and automated unit test analysis feed into a report that is then used for developer coaching.  Agile-style continuous integration generates reports on a regular basis, often daily.  This continuous visibility makes it easy for everyone to see the current state.  People tend to pay more attention to quality when metrics reports are generated and made visible automatically.

Obvious Benefits
This sample project made the following changes over 15 months. The project already had continuous integration (via Hudson) and a code review process in place before introducing automated metrics (Java/Sonar):
  1. The code base increased 22% from 450,000 LOC to 540,000 LOC
  2. Unit test coverage increased from 27% to 67% coverage.
  3. The total amount of untested code decreased by 33% meaning that the team added tests for new code and any modified existing code.
The changes didn't happen overnight. Peer/code reviews and continuous integration builds ran for over a year before we implemented an automated metrics dashboard.

Automated Auditing
A fair amount of work has already been done on code quality and how it can be measured or audited.  We went after four main areas because of the good tooling support.
  • Complexity:  For our purposes, this is essentially the number of code paths through a function or module. There is a lot of empirical data showing that complex code cannot be completely tested and that modules beyond a certain complexity are almost guaranteed to have defects.
  • Compliance:  This is a measure of the number of violations of programming rules as defined by the development teams.  Java compliance tools like Checkstyle, PMD and Findbugs have hundreds of rules that range from standards-oriented to bad practice to guaranteed defect. The example project had 134 compliance rules.
  • Coverage:  Unit test code coverage represents the number of lines of code and the number of code branches exercised by the unit tests. Coverage can't measure the quality or validity of the tests, but it does show the code executed during the unit test phase of continuous integration. The example project had an 80% code coverage target and increased the number of unit tests from 2,800 to 14,000 in 15 months.
  • Comments:  There are a lot of opinions on the usefulness of comments.  Some teams argue that out-of-date documentation is worse than none at all. Others argue that their code is self-documenting and that comments are redundant. Our standard was that every method/function and class/file had to have parameter and API documentation, and that any code block that requires more than a few sentences to explain in a code review requires those same comments in the code.
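As a concrete (and purely hypothetical) example of that comment standard, this is roughly the level of per-method documentation we expected; the class and method names are made up.

import java.util.Date;

/** Hypothetical rating service used only to illustrate the documentation standard. */
public class RatingService {

    /**
     * Calculates the adjusted premium for a policy on a given effective date.
     *
     * @param basePremiumCents base premium before adjustments, in cents
     * @param effectiveDate    date the rate should be effective for; must not be null
     * @return the adjusted premium in cents
     * @throws IllegalArgumentException if effectiveDate is null
     */
    public long calculatePremium(long basePremiumCents, Date effectiveDate) {
        if (effectiveDate == null) {
            throw new IllegalArgumentException("effectiveDate is required");
        }
        // Real date-effective adjustment rules omitted; a flat example rate is applied.
        return Math.round(basePremiumCents * 1.05);
    }
}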
Complexity Shares a Room with Defects
  • Modules with the highest complexity tend to have the most defects
  • Complex modules can be impossible to fully test
  • Unit testing drives down complexity
    • This is a double win because you increase coverage while providing simpler code 
Teams are often surprised by what they find when they first analyze their code.
  1. We had a single method with over 1 million possible paths through all its looping and conditional blocks.  There was no way we could ever test all those paths.
  2. We also found that our 20 most problematic modules in production were 12X more complex, contained 5X more rule violations and had 25% less code coverage than our average modules.  This makes sense: simpler modules are easier to test and leave fewer places for defects to hide.  Restructuring code for testing often reduced the complexity enough that developers could understand what was going on (see the sketch below).
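Here is a contrived sketch of what that restructuring looks like. The nested version and the extracted version compute the same fee, but each extracted helper holds only one decision and can be unit tested directly; all names and numbers are made up for illustration.

/** Contrived fee rule written two ways; both public methods return the same values. */
public class ShippingRules {

    // Before: nested conditionals multiply the number of paths and are hard to test.
    public double feeBefore(boolean rush, boolean oversize, boolean international) {
        double fee = 5.0;
        if (rush) {
            fee += 10.0;
            if (oversize) {
                fee += 15.0;
                if (international) {
                    fee += 10.0;
                }
            } else if (international) {
                fee += 10.0;
            }
        } else if (oversize) {
            fee += 15.0;
            if (international) {
                fee += 10.0;
            }
        } else if (international) {
            fee += 10.0;
        }
        return fee;
    }

    // After: each helper has one decision and can be tested on its own.
    public double feeAfter(boolean rush, boolean oversize, boolean international) {
        return 5.0 + rushSurcharge(rush) + sizeSurcharge(oversize) + borderSurcharge(international);
    }

    private double rushSurcharge(boolean rush) {
        if (rush) {
            return 10.0;
        }
        return 0.0;
    }

    private double sizeSurcharge(boolean oversize) {
        if (oversize) {
            return 15.0;
        }
        return 0.0;
    }

    private double borderSurcharge(boolean international) {
        if (international) {
            return 10.0;
        }
        return 0.0;
    }
}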
Quality Baked In - The Process
Automation is the simple part.  We implemented a fairly straightforward process that built on standard Agile continuous integration processes.


  1. Manual code/peer review for every change.  Small changes took 5 minutes or less but large changes could take over an hour.
  2. Continuous integration builds ran all unit tests on every change.  Committers were notified if their changes broke any unit tests.
  3. Continuous integration builds ran a deployment check after all unit tests passed.
  4. Integration tests were scheduled twice a day.
  5. Scheduled builds invoked Sonar twice a day to generate metrics.
    1. Developers were immediately notified of compliance, complexity, coverage or comment problems.
  6. The entire team received daily reports describing quality changes for the day along with a technical debt report. These reports were part of the scheduled build process.
  7. Defect reports were opened against any quality issues that stayed unfixed beyond a certain amount of time.
Reporting
Sonar and other reporting systems can be augmented to tie changes back to their source.  VCS integration makes it possible to track changes by:
  • Developer
  • Team
  • Project
  • Branch
  • Change ticket
It is also possible to add meta-data to files to aid reporting on functional area or tier.

Here is an example report that shows code coverage changes for Java files based on changes in Sonar data.


Training and Expectations
Bringing the team and management up to speed is a lot harder.  You'll probably start in a pretty sad place and have to ramp up.
  1. Continuous integration is initially used just to make sure the build works. There aren't many tests.
  2. There is no standard set of expectations.  Create a code checklist.  It will contain both coding standards and project standards based on the system architecture or tools set.
  3. Train everyone on continuous integration and standards.  Some folks will have never seen any of the concepts before.
  4. Start mandatory peer / code reviews.  All production code should be looked at by someone other than the developer.  Be prepared, because all code will fail one or more code reviews.  Also be prepared to hear that there is no time to make the changes; somehow there is always time to fix the inevitable defects later.
  5. Update the standards and review guidelines document based on the review results. There will be lots of updates in the early days.
  6. Train everyone on continuous integration and standards.  Some folks will appear to have never seen any of the concepts before.
  7. Start your continuous measurement builds for compliance and test coverage.
  8. Train everyone on Test Driven Development, but expect to do Test Assisted Development in the beginning if it is an existing project.
  9. Facts are friendly. Generate reports and tell folks how they and the project are doing.
  10. Plan on rewards to help balance the frustration of forced change.  It feels like more work in the beginning but eventually becomes second nature.

Conclusion
Automated tooling makes it possible for organizations to build better software with fewer manual steps. This is especially helpful for organizations in denial about the fact that they are software development shops, and for companies without a natural engineering mindset.  You can point out that liability and downtime are reduced by following "industry standard practices" or "standard metrics and methods".  It's never too late to start improving.

The availability of static and dynamic analysis tooling should carry weight when choosing a language or platform. These tools can make a very significant difference in the quality of the code and the long-term maintainability of applications.  Basic automated tooling is not a panacea because it does not fully cover the problem space.  Bad design and bogus tests written only for coverage still require peer review to catch, and requirements verification, security audits and end-to-end testing need their own tooling and metrics.

Saturday, October 16, 2010

Eclipse in the Enterprise

Eclipse in a Large Company's IT Shop

Many say that the optimal development team is 5-7 people, but that isn't how larger enterprise software projects work. Those projects have dozens or even hundreds of developers using multiple technologies and languages. Eclipse's plug-in architecture makes Eclipse more than an IDE: it is a developer desktop environment that many enterprises cannot take full advantage of because they don't standardize their configurations and toolsets and because they require human intervention for setup.

Supersize Me
  • Complex problem that often seems simple to the outsider or new team member
  • No one person knows the whole business problem
  • No one person knows the whole technical implementation
  • No one person understands how every technical layer works
  • A lot more effort is put into staffing the initial build than the ongoing maintenance
  • Some work will be defensive in nature protecting from future mistakes
  • All applications are legacy applications as soon as they ship. You can't just tear them apart to fix stuff.
  • Someone calculated a big ROI so there is real money on the line.
We're not a Software Company
Big projects are often taken on to support some non-IT or non-computer-oriented line of business. This leads to internal conflict and problems.
  • The software exists to support the business.
  • The group is often geared towards production support and infrastructure.
  • The company doesn't view the software as a product even though it has all the characteristics of a software product
  • The project is staffed with standard large corporation skills mix and motivation
  • Contract resources are often released after the first major release even though the maintenance phase includes lots of development.
Large companies often contain fiefdoms that support similar businesses or problems of similar complexity. They usually have the same HR culture but operate very differently. The management teams may have completely different levels of (software) engineering background. Differences include:
  • Different developer workstation management
  • Different CM processes
  • Wildly varying IT management experience.
  • Different software development processes and requirements
  • Different branching, release and code control strategies
We're these Guys
A typical mid to large IT shop might be
  • 150 developers
  • 3 business units
  • 10 main apps plus production support
  • 15 managers
  • 2 Million+ lines of source code
  • 250 Production Application Server VMs
  • 100 Network Cache VMs
  • Single VCS with completely different branch and control strategies.
The Great Mismatch
Statisticians and sociologists say that people fall into some type of standard distribution on:
  • Desire
  • Skills
  • Instincts
  • Experience





Software companies and their really smart engineers create tools for really smart, motivated engineering types.

  • "We create tools we want to use"
  • "We create tools for our friends"






Both those groups are wrong. The average large company has a different mix that changes over the life of the project.
  • Some folks like hard problems
  • Some folks are just there for the job
  • Some folks can't be fired

This will sometimes change because of some massive culture shift or some financial change. But, even projects that start off differently often fade into this model as more mainstream employees fill in after the initial push.

Cogs in Our Machine
The smart thing to do is to take this staffing and management reality into account when making choices. We always want to push people and to provide opportunity, but actual (and possible) capabilities should drive choices, even when training and mentoring.

  • High End developers all think they know how it should be done. "Everyone is an artist"
  • Mid tier developers just want to finish their tasks
  • Others

Tooling and Skills Affect Decisions
Modern development tools increase the quality and scope of people's work. Eclipse is so good that it can drive language and toolbox decisions. Cool technologies without great tool support and integration are not worth whatever other benefits they provide. A fancy "super language" with no reference search, refactoring support, system integration or code analysis tools causes problems over the life of a project.


Eclipse Delivers More with Less Investment
Our Team
  • 50 Developers
  • 8000 Java Classes
  • 580 XHTML Files
  • 95 Core Model Entities
  • BPM: 126 Flows, 209 subflows, 137 UI touchpoints
  • 60 Rule Services
  • 200 Lookup Services
  • 30 Eclipse Projects
  • 28 Minute build with 15,000 unit tests
  • 6 week release cycle
  • Super complex business rules including state, customer type and date effectivity


Technologies? Yeah, we got those
  • Java
  • XHTML
  • Javascript
  • JSF
  • Rich Faces
  • XML
  • Spring
  • AspectJ
  • Oracle SQL
  • Coherence Web
  • Rule Engine
  • Workflow BPM
  • Spring Web Flow
  • D-Rules
  • Axis
  • MQueue

  • Maven
  • JUnit 4
  • EasyMock
  • Hudson
  • Sonar
    • Checkstyle
    • Findbugs
    • PMD
    • LCOM4
  • Eclipse Validation
  • Eclipse Formatter
  • Groovy for Tools
  • Perforce
  • Jira
  • Eclipse
  • Crazy Monitoring
  • Specialized Logging

Goals
  • Increase Quality
  • Reduce Defects
  • Reduce configuration costs and overhead
  • Remove need for manual configuration
  • Standardize plug-in configurations
    • Trade off performance vs sophistication
  • Standardize static analysis configuration
  • Reduce cost of creating new workspaces
  • Use best of breed plug-ins where available
We Do....
Attempt to automate as much as possible. The current configuration saves over 1,000 minutes per month when you take into account the number of Eclipse workspaces created due to the number of source branches and the number of new staff or machines.
  • Rely only on dev team resources.
  • Support mass migration to new IDE versions
  • Store all configurations under version control
    • Include fully configured eclipse
    • include all 3rd party plug-ins
    • include SCM plugin
    • include all metrics settings
    • include all server and infrastructure configuration
  • Provide samples for any files that must be loaded and desktop shortcut options
  • Force folks to move off of old versions, after some migration period, by deleting old versions of Eclipse from SCM.

We Shy Away From
  • Monolithic Eclipse bundles as a baseline
    • reduces plug-in flexibility
  • Network file-systems for IDE/executables
    • We want everything on the local disk
  • Manual Configuration
    • Manual setup document for whole environment is currently 80 pages
  • Plug-in configurations that run against the whole codebase
    • Some analysis tools don't support "on save" or "on change" scanning.
  • Some advanced Eclipse plugins
    • Some Eclipse projects are cool but we can't spin up a high enough percentage of developers on them.

Use Eclipse to Improve Your Product
We also use Eclipse's advanced plug-ins and features to drive our code quality and structure. We've found that people will fix things earlier if the system points out issues early on. People focus on what's measured.

Eclipse has a set of plug-ins that monitor and analyze code. It also integrates with Sonar to show the results of Sonar's code coverage and static analysis tools. This can drive all developers in the right direction. Our project made the following changes in a single year.

Eclipse is more than an IDE. It provides a framework for enterprise development for all sizes of projects.

Enhancing Scalability with Distributed Object Caching

Teams take various approaches when scaling applications, usually increasing the number of tiers in the application(s) and adding more hardware to each tier. Both of these approaches tend to create operational issues as the system size increases. Some systems hold large amounts of user data, which creates its own problems when trying to provide fault tolerance and when the number of users combined with the user data is too large to fit in server memory. Persisting that data to a relational store often has too much overhead in these situations. Teams can scale up their applications and provide additional stability through the use of commercially available distributed object caches. (Originally 7/2010)
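As a rough sketch of the idea (not any particular vendor's API), application code works against a small keyed-access interface and the distributed cache product supplies partitioning, replication and overflow behind it. The names below are hypothetical, and the local map implementation is only a stand-in for development and unit tests.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical cache abstraction; a real distributed cache partitions and
 *  replicates entries across the grid instead of using one local map. */
public interface SessionCache<K, V> {
    void put(K key, V value);
    V get(K key);
    void remove(K key);
}

/** Local stand-in implementation used for development and unit tests. */
class LocalSessionCache<K, V> implements SessionCache<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<K, V>();

    public void put(K key, V value) { map.put(key, value); }
    public V get(K key)             { return map.get(key); }
    public void remove(K key)       { map.remove(key); }
}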



Lightweight Monitors for Metrics and Troubleshooting

Lightweight monitoring tools can be integrated into applications to provide quantitative information about how an application is being used and how it is performing. Development teams can integrate lightweight monitoring that augments, or in some cases replaces, the server-level monitoring provided by system or operations tools. Internal monitoring can provide tick counts and performance information at the method level or on a business transaction basis. Monitors can easily be enhanced to let teams track traffic or performance across time, release and program instance. (7/2010)
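A minimal sketch of the kind of in-process monitor described here might look like the following; the class and metric names are hypothetical, and a production version would add the roll-ups by time, release and instance mentioned above. Callers bracket a business transaction with start() and stop() and dump report() on a schedule or through an admin page.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-process monitor: counts hits and accumulates elapsed time per name. */
public class LightweightMonitor {
    private static final ConcurrentMap<String, AtomicLong> HITS   = new ConcurrentHashMap<String, AtomicLong>();
    private static final ConcurrentMap<String, AtomicLong> MILLIS = new ConcurrentHashMap<String, AtomicLong>();

    /** Marks the start of a timed section and returns the start timestamp. */
    public static long start() {
        return System.nanoTime();
    }

    /** Records one hit and the elapsed time for the named transaction or method. */
    public static void stop(String name, long startNanos) {
        long elapsedMillis = (System.nanoTime() - startNanos) / 1000000L;
        counter(HITS, name).incrementAndGet();
        counter(MILLIS, name).addAndGet(elapsedMillis);
    }

    /** Prints one line per monitored name: total hits and total elapsed time. */
    public static void report() {
        for (String name : HITS.keySet()) {
            System.out.println(name + " hits=" + HITS.get(name) + " totalMs=" + MILLIS.get(name));
        }
    }

    private static AtomicLong counter(ConcurrentMap<String, AtomicLong> map, String name) {
        AtomicLong value = map.get(name);
        if (value == null) {
            map.putIfAbsent(name, new AtomicLong());
            value = map.get(name);
        }
        return value;
    }
}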


Coding Standards are Part of Continuous Improvement

Coding standards extend beyond just code formatters. Coding standards give team members the same understanding of what's expected of them.  Developers spend less time trying to understand code when its style and operating model look similar across modules and applications. Standards must take into account the range of developer skill sets while allowing for ongoing improvement through training. You may allow or disallow some feature or approach to reduce the number of defects even if some developers are capable of doing "more". Developers like to be free to develop "in their own fashion", but management wants code to have the same shape and look from developer to developer because they care about developer portability across development efforts. Standards can be created that still let developers be creative on the interesting problems.

Automated Standards Support
Tooling can help enforce coding standards by providing constant reminders that there are minimum standards. Teams may wish to have coding standards that are not supported by automated tools, but the core standard should be audited on a regular basis. Be sure the standards are adhered to before the code is checked in, because you aren't going to have time to fix it later. Java developers have the advantage that their IDEs provide good automated analysis and review. These automated tools find more than just formatting issues.
They find formatting issues that lead to defects down the road, and they find actual defects. Tooling should be included in the standard if you can get away with it. Most of my projects have a standard IDE configuration that everyone loads for their Java code, with the checks automatically turned on. You may only use some of the tools on a continuous basis while other tools are run "on demand". The Eclipse formatter/clean-up module is a good continuous tool. The Checkstyle plugin can be configured to run only on save so it does not impact your build times.
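For illustration, these are the kinds of real defects (not just style nits) that tools like Findbugs and PMD typically flag; the class below is a contrived example.

import java.io.File;

/** Contrived examples of defects that static analysis typically flags. */
public class AnalysisFindings {

    // Reference comparison instead of equals(): works for interned literals,
    // fails for strings built at runtime. Flagged as a likely defect.
    boolean badStatusCheck(String status) {
        return status == "APPROVED";   // should be "APPROVED".equals(status)
    }

    // Ignored return value: File.delete() reports failure through its result,
    // so dropping it silently hides errors.
    void cleanup(File tempFile) {
        tempFile.delete();             // result should be checked or logged
    }
}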

Code Format
This is the simplest one but causes a surprising amount of conflict on a team.  (See the "artist" comment above.) Code formatters help you with this. We basically use the Eclipse style with some changes. The great thing about the Eclipse formatter is that it can also do some code clean-up in addition to managing indentation and line wraps. Conditional blocks without curly braces ({}) are defects waiting to happen, and you can set up Eclipse so that it automatically adds the curly braces. We require all developers to use Eclipse and we provide a standard set of preferences that everyone loads. Eclipse has a setting that will reformat the source on every save, so folks can just stream their code into the IDE and hit Save.
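A small, contrived example of the brace rule: in the first method the indentation lies about what is guarded, which is exactly the kind of defect the automatic clean-up prevents.

import java.util.List;

public class BraceExample {

    void before(boolean audit, List<String> log) {
        if (audit)
            log.add("audited");
            log.add("done");     // indentation suggests it is guarded, but it always runs
    }

    void after(boolean audit, List<String> log) {
        if (audit) {
            log.add("audited");
            log.add("done");     // the guard now covers both statements, as intended
        }
    }
}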

Code Constructs
Internally developed coding standards should focus on avoiding defects. It is best if you pick standards that can be flagged by your automated tools, IDE and plugins. As a first step, go down the list of style warnings flagged by Eclipse and figure out which ones you want to turn on. There may be some conflict here because teams include both experienced and inexperienced developers. Example: we don't allow ternary operators because of legibility and the way folks overuse them. Include the documentation settings in this review because you are delivering code for future generations. Some folks claim unit tests are documentation and then don't document their complicated unit tests. Set a minimum bar that makes sense. Then go through the better tools/plugins (Checkstyle, PMD and/or Findbugs), take their suggestions and add them to your coding conventions. Standards that are supported by tools are the easiest to enforce.
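As a contrived illustration of the ternary rule, both methods below return the same result, but only the second reads cleanly in a review.

public class TernaryExample {

    // Disallowed under the example standard: nested ternaries read poorly in review.
    String levelTerse(int score) {
        return score > 90 ? "GOLD" : score > 70 ? "SILVER" : score > 50 ? "BRONZE" : "NONE";
    }

    // Preferred form: the same decision spelled out with if/else.
    String levelClear(int score) {
        if (score > 90) {
            return "GOLD";
        } else if (score > 70) {
            return "SILVER";
        } else if (score > 50) {
            return "BRONZE";
        }
        return "NONE";
    }
}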

Application Specific Standards
Applications and architectures develop coding patterns over time. Those coding patterns should be added to the coding standards as they are developed and accepted. A project might use Spring only with singleton beans, so you might add that singleton pattern to the coding standard. It may be that an application caches the results of some operation based on a hash key, so every class that can be a result of that operation needs to implement a custom hashCode() method; that rule would be added to the coding standard. Basically, every defect that shows up due to a coding error is a candidate for the application-specific coding standard. Every tier in the application develops these conventions.
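A hypothetical example of that kind of application-specific rule: any class used as a hash-based cache key or result must override equals() and hashCode() together, something like the sketch below.

/** Hypothetical cache key; the convention requires both methods to be overridden together. */
public class QuoteKey {
    private final String customerId;
    private final String productCode;

    public QuoteKey(String customerId, String productCode) {
        this.customerId = customerId;
        this.productCode = productCode;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof QuoteKey)) {
            return false;
        }
        QuoteKey that = (QuoteKey) other;
        return customerId.equals(that.customerId) && productCode.equals(that.productCode);
    }

    @Override
    public int hashCode() {
        return 31 * customerId.hashCode() + productCode.hashCode();
    }
}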

Coding for Testing
Coding standards are about more than pretty code. They should drive developers to create software structured in a way that meets other requirements, like testability. Code takes on a different shape when it is written in a unit-testable fashion. Static classes tend to disappear or be made pluggable. Methods tend to get smaller and accept more inputs as parameters instead of doing an environmental rendezvous for their data. Void methods in controller classes tend to vanish, replaced by methods that return values to be handled by the callers. Coding standards that include minimum unit testing requirements drive developers to create more modular code.
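A contrived before/after sketch of that shape change: the commented-out version reaches into its environment for data (the Order and ConfigRegistry classes are made up) and returns nothing, while the testable version takes its inputs as parameters and returns the answer so a unit test can simply pass values in and assert on the result.

/** Illustrative only: the same calculation in a hard-to-test and an easy-to-test shape. */
public class LateFeeCalculator {

    // Before (hypothetical): a void method that fetches its own data and hides its result,
    // so a unit test needs the whole environment in place.
    //
    // void applyLateFee(Order order) {
    //     double rate = ConfigRegistry.getInstance().getLateFeeRate();
    //     order.setTotal(order.getTotal() + order.getTotal() * rate * daysLate(order));
    // }

    // After: inputs arrive as parameters and the result is returned.
    public double lateFee(double total, int daysLate, double dailyRate) {
        if (daysLate <= 0) {
            return 0.0;
        }
        return total * dailyRate * daysLate;
    }
}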

Non-Java Languages in Java Projects
Coding standards should include non-Java languages such as XHTML, XML, Javascript, etc. Some of these standards may be supported by automated tools, but that support is often limited to source formatters. Application-specific coding conventions should be included here as well. Example non-format, non-Java standards might limit the types of functions or data allowed in Javascript code or in hidden HTML fields. Many of these standards will have to be manually enforced, especially for the dynamically typed languages. The lack of good tooling support is one of the biggest reasons that teams often keep scripting or dynamic language usage to a minimum in large projects. Small teams full of really smart people will tell you that they are more efficient with these languages, but it is very hard to create standard, readable code that can be handed off from one generation of developers to the next on the project.

Standards Verification
The team motto should be "Trust but Verify". Run tools in the IDE. Cover coding standards in code reviews and training. Then measure your progress with a tool like Sonar. It will run all of the automated tools described elsewhere in this document and track the progress over time. It has a web interface that lets you drill down by package and class and see where you are having compliance issues, coding problems or poor unit test coverage. Metrics can help a team
improve their game without a lot of manual intervention.

Training
Training is required whenever a team changes a policy or method of operation.  Coding standards will result in changes for many team members, so training is mandatory. Some teams send out the coding standards and then assume everyone is OK with them when no one complains. Zero complaints means they didn't read it or they don't think the company is serious. Teams must do at least one walk-through of the different sections of any standard. Training should almost always result in standards updates, clarifications or additions.

Final
Coding standards can really help hold a code base together and set expectations for new and existing team members. Teams should expand the concept of a coding standard beyond naming and formatting to include all the standardized behavior and style that they want in their code base. Standards should be reevaluated whenever a new type of subsystem is written or a new 3rd-party library is included. Organization is the key to repeatable success and lower defect counts.

Originally Written 3/2010