Congresso Brasileiro de Software: Teoria e Prática
29 de setembro a 04 de outubro de 2013
Brasília-DF, Brasil

ANAIS / PROCEEDINGS
SBES 2013 – XXVII Simpósio Brasileiro de Engenharia de Software / XXVII Brazilian Symposium on Software Engineering
Volume 01
ISSN: 2175-9677

COORDENADOR DO COMITÊ DE PROGRAMA / PROGRAM CHAIR
Auri M. R. Vincenzi, Universidade Federal de Goiás, Brasil

COORDENAÇÃO DO CBSOFT 2013 / CBSOFT 2013 GENERAL CHAIRS
Genaína Rodrigues – UnB
Rodrigo Bonifácio – UnB
Edna Dias Canedo – UnB

REALIZAÇÃO / ORGANIZATION
Universidade de Brasília (UnB)
Departamento de Ciência da Computação (DIMAp/UFRN)

PROMOÇÃO / PROMOTION
Sociedade Brasileira de Computação (SBC) / Brazilian Computing Society (SBC)

PATROCÍNIO / SPONSORS
CAPES, CNPq, Google, INES, Ministério da Ciência, Tecnologia e Inovação, Ministério do Planejamento, Orçamento e Gestão e RNP

APOIO / SUPPORT
Instituto Federal Brasília, Instituto Federal Goiás, Loop Engenharia de Computação, Secretaria de Turismo do GDF, Secretaria de Ciência Tecnologia e Inovação do GDF e Secretaria da Mulher do GDF

Autorizo a reprodução parcial ou total desta obra, para fins acadêmicos, desde que citada a fonte.

Apresentação

Bem-vindo à XXVII edição do Simpósio Brasileiro de Engenharia de Software (SBES) que, este ano, é sediado na capital do Brasil, Brasília. Como tem acontecido desde 2010, o SBES 2013 faz parte do Congresso Brasileiro de Software: Teoria e Prática (CBSoft), que reúne o Simpósio Brasileiro de Linguagens de Programação (SBLP), o Simpósio Brasileiro de Métodos Formais (SBMF), o Simpósio Brasileiro de Componentes, Arquiteturas e Reutilização de Software (SBCARS) e a Miniconferência Latino-Americana de Linguagens de Padrões para Programação (MiniPLoP).

Dentro do SBES o participante encontra as seções técnicas, o Fórum de Educação em Engenharia de Software e três palestrantes convidados: dois internacionais e um nacional. Complementando este programa, o CBSoft oferece uma gama de atividades, incluindo cursos de curta duração, workshops, tutoriais, uma sessão de ferramentas, a Trilha Industrial e o Workshop de Teses e Dissertações.

Nas seções técnicas do SBES, trabalhos de pesquisa inéditos são apresentados, cobrindo uma variedade de temas sobre engenharia de software, mencionados na chamada de trabalhos, amplamente divulgada nas comunidades brasileira e internacional. Um processo de revisão rigoroso permitiu a seleção criteriosa de artigos com a mais alta qualidade. O Comitê de Programa incluiu 76 membros da comunidade nacional e internacional de Engenharia de Software. Ao todo, 113 pesquisadores participaram da revisão dos 70 trabalhos submetidos. Desses, 17 artigos foram aceitos para apresentação e publicação nos anais do SBES.
Pode-se observar que o processo de seleção foi competitivo, o que resultou numa taxa de aceitação de 24% dos artigos submetidos. Além da publicação dos artigos nos anais, disponíveis na Biblioteca Digital do IEEE, os oito melhores artigos – escolhidos por um comitê selecionado a partir do Comitê de Programa – são convidados a submeter uma versão estendida para o Journal of Software Engineering Research and Development (JSERD).

Para o SBES 2013 os palestrantes convidados são: Jeff Offutt (George Mason University) - “How the Web Brought Evolution Back Into Design”; Sam Malek (George Mason University) - “Toward the Making of Software that Learns to Manage Itself”; e Thais Vasconcelos Batista (DIMAP-UFRN) - “Arquitetura de Software: uma Disciplina Fundamental para Construção de Software”.

Finalmente, gostaríamos de agradecer a todos aqueles que contribuíram com esta edição do SBES. Agradecemos aos membros do Comitê Gestor do SBES e do CBSoft, aos membros do comitê de programa, aos avaliadores dos trabalhos, às comissões organizadoras e a todos aqueles que de alguma forma tornaram possível a realização de mais um evento com o padrão de qualidade dos melhores eventos internacionais. Mais uma vez, bem-vindo ao SBES 2013.

Brasília, DF, setembro/outubro de 2013.
Auri Marcelo Rizzo Vincenzi (INF/UFG)
Coordenador do Comitê de Programa da Trilha Principal

Foreword

Welcome to the XXVII edition of the Brazilian Symposium on Software Engineering (SBES), which this year takes place in the capital of Brazil, Brasilia. As has happened since 2010, SBES 2013 is part of the Brazilian Conference on Software: Theory and Practice (CBSoft), which gathers the Brazilian Symposium on Programming Languages (SBLP), the Brazilian Symposium on Formal Methods (SBMF), the Brazilian Symposium on Software Components, Architectures and Reuse (SBCARS) and the Latin American Miniconference on Pattern Languages of Programming (MiniPLoP).

Within SBES the participant finds the technical sessions, the Forum on Software Engineering Education and three invited speakers: two international and one national. Complementing this program, CBSoft provides a range of activities including short courses, workshops, tutorials, a tools session, the Industrial Track and the Workshop of Theses and Dissertations.

In the main technical track of SBES, unpublished research papers are presented, covering a range of topics on Software Engineering mentioned in the call for papers, which was widely advertised in the Brazilian and international community. A rigorous peer review process enabled the careful selection of articles of the highest quality. The Program Committee included 76 members of the Brazilian and international Software Engineering community. In all, 113 researchers participated in the review of the 70 papers submitted. From those, 17 articles were accepted for presentation and publication in the SBES proceedings. It can be seen from these figures that we had a very competitive process, which resulted in an acceptance rate of 24% of submitted articles. Besides the publication of articles in the proceedings, available in the IEEE Digital Library, the top eight articles – chosen by a committee selected from members of the Program Committee – are invited to submit an extended version to the Journal of Software Engineering Research and Development (JSERD).
For SBES 2013 the invited speakers are: “How the Web Brought Evolution Back Into Design” - Jeff Offut (George Mason University) “Toward the Making of Software that Learns to Manage Itself” - Sam Malek (George Mason University) “Software Architecture: a Core Discipline to Engineer Software” - Thais Vasconcelos Batista (DIMAP-UFRN) Finally, we would like to thank those who contributed to making this edition of SBES. We thank the members of the Steering Committee of the SBES and CBSoft, the program committee members, the reviewers of papers, the organizing committees and all those who somehow made possible the realization of yet another event with a quality standard of the best international events. Once again, welcome to SBES 2013. Brasilia, DF, September/October 2013. Auri Marcelo Rizzo Vincenzi (INF/UFG) Coordinator of the Program Committee of the Main Track 6 SBES 2013 Comitês Técnicos / Technical Committees SBES Steering Committee Alessandro Garcia, PUC-Rio Auri Marcelo Rizzo Vincenzi, UFG Marcio Delamaro, USP Sérgio Soares, UFPE Thais Batista, UFRN CBSoft General Committee Genaína Nunes Rodrigues, UnB Rodrigo Bonifácio, UnB Edna Dias Canedo, UnB CBSoft Local Committee Diego Aranha, UnB Edna Dias Canedo, UnB Fernanda Lima, UnB Guilherme Novaes Ramos, UnB Marcus Vinícius Lamar, UnB George Marsicano, UnB Giovanni Santos Almeida, UnB Hilmer Neri, UnB Luís Miyadaira, UnB Maria Helena Ximenis, UnB Comitê do programa / Program Committee Adenilso da Silva Simão, ICMC - Universidade de São Paulo, Brasil Alessandro Garcia, PUC-Rio, Brasil Alfredo Goldman, IME - Universidade de São Paulo, Brasil Antônio Tadeu Azevedo Gomes, LNCC, Brasil Antônio Francisco Prado, Universidade Federal de São Carlos, Brasil Arndt von Staa, PUC-Rio, Brasil Augusto Sampaio, Universidade Federal de Pernambuco, Brasil Carlos Lucena, PUC-Rio, Brasil Carolyn Seaman, Universidade de Maryland, EUA Cecilia Rubira, Unicamp, Brasil Christina Chavez, Universidade Federal da Bahia, Brasil Claudia Werner, COPPE /UFRJ, Brasil Claudio Sant’Anna, Universidade Federal da Bahia, Brasil Daltro Nunes, UFRGS, Brasil, Daniel Berry, Universidade de Waterloo, Canadá Daniela Cruzes, Universidade Norueguesa de Ciência e Tecnologia, Noruega Eduardo Almeida, Universidade Federal da Bahia, Brasil Eduardo Aranha, Universidade Federal do Rio Grande do Norte, Brasil 7 SBES 2013 Eduardo Figueiredo, Universidade Federal de Minas Gerais, Brasil Ellen Francine Barbosa, ICMC - Universidade de São Paulo, Brasil Fabiano Ferrari, Universidade Federal de São Carlos, Brasil Fabio Queda Bueno da Silva, Universidade Federal de Pernambuco, Brasil Fernanda Alencar, Universidade Federal de Pernambuco, Brasil Fernando Castor, Universidade Federal de Pernambuco, Brasil Flavia Delicato, Universidade Federal do Rio Grande do Norte, Brasil Flavio Oquendo, Universidade Européia de Brittany - UBS/VALORIA, França Glauco Carneiro, Universidade de Salvador, Brasil Gledson Elias, Universidade Federal da Paraíba, Brasil Guilherme Travassos, COPPE/UFRJ, Brasil Gustavo Rossi, Universidade Nacional de La Plata, Argentina Itana Maria de Souza Gimenes, Universidade Estadual de Maringá, Brasil Jaelson Freire Brelaz de Castro, Universidade Federal de Pernambuco, Brasil Jair Leite, Universidade Federal do Rio Grande do Norte, Brasil João Araújo, Universidade Nova de Lisboa, Portugal José Carlos Maldonado, ICMC - Universidade de São Paulo, Brasil José Conejero, Universidade de Extremadura, Espanha Leila Silva, Universidade Federal de Sergipe, Brasil Leonardo Murta, UFF, Brasil Leonor 
Barroca, Open Un./UK, Great Britain Luciano Baresi, Politecnico di Milano, Itália Marcelo Fantinato, Universidade de São Paulo, Brasil Marcelo de Almeida Maia, Universidade Federal de Uberlândia, Brasil Marco Aurélio Gerosa, IME-USP, Brasil Marco Túlio Valente, Universidade Federal de Minas Gerais, Brasil Marcos Chaim, Universidade de São Paulo, Brasil Márcio Barros, Universidade Federal do Estado do Rio de Janeiro, Brasil Mehmet Aksit, Universidade de Twente, Holanda Nabor Mendonça, Universidade de Fortaleza, Brasil Nelio Cacho, Universidade Federal do Rio Grande do Norte, Brasil Nelson Rosa, Universidade Federal de Pernambuco, Brasil Oscar Pastor, Universidade Politécnica de Valência, Espanha Otávio Lemos, Universidade Federal de São Paulo, Brasil Patricia Machado, Universidade Federal de Campina Grande, Brasil Paulo Borba, Universidade Federal de Pernambuco, Brasil Paulo Masiero, ICMC - Universidade de São Paulo, Brasil Paulo Merson, Software Engineering Institute, EUA Paulo Pires, Universidade Federal do Rio de Janeiro, Brasil Rafael Bordini, PUCRS, Brasil Rafael Prikladnicki, PUCRS, Brasil Regina Braga, Universidade Federal de Juiz de Fora, Brasil Ricardo Choren, IME-Rio, Brasil Ricardo Falbo, Universidade Federal de Espírito Santo, Brasil Roberta Coelho, Universidade Federal do Rio Grande do Norte, Brasil Rogerio de Lemos, Universidade de Kent, Reino Unido Rosana Braga, ICMC - Universidade de São Paulo, Brasil Rosângela Penteado, Universidade Federal de São Carlos, Brasil Sandra Fabbri, Universidade Federal de São Carlos, Brasil Sérgio Soares, Universidade Federal de Pernambuco, Brasil 8 SBES 2013 Silvia Abrahão, Universidade Politécnica de Valencia, Espanha Silvia Vergilio, Universidade Federal do Paraná, Brasil Simone Souza, ICMC - Universidade de São Paulo, Brasil Thais Vasconcelos Batista, Universidade Federal do Rio Grande do Norte, Brasil Tiago Massoni, Universidade Federal de Campina Grande, Brasil Uirá Kulesza, Universidade Federal do Rio Grande do Norte, Brasil Valter Camargo, Universidade Federal de São Carlos, Brasil Vander Alves, Universidade de Brasília, Brasil revisores externos / External Reviewers A. César França, Federal University of Pernambuco, Brazil Americo Sampaio, Universidade de Fortaleza, Brazil Anderson Belgamo, Universidade Metodista de Piracicaba, Brazil Andre Endo, ICMC/USP, Brazil Breno França, UFRJ, Brazil Bruno Cafeo, Pontifícia Universidade Católica do Rio de Janeiro, Brazil Bruno Carreiro da Silva, Universidade Federal da Bahia, Brazil Célio Santana, Universidade Federal Rural de Pernambuco, Brazil César Couto, CEFET-MG, Brazil Cristiano Maffort, CEFET-MG, Brazil Draylson Souza, ICMC-USP, Brazil Edson Oliveira Junior, Universidade Estadual de Maringá, Brazil Fernando H. I. 
Borba Ferreira, Universidade Presbiteriana Mackenzie, Brasil Frank Affonso, UNESP - Universidade Estadual Paulista, Brazil Gustavo Henrique Lima Pinto, Federal University of Pernambuco, Brazil Heitor Costa, Federal University of Lavras, Brazil Higor Souza, University of São Paulo, Brazil Igor Steinmacher, Universidade Tecnológica Federal do Paraná, Brazil Igor Wiese, UTFPR -Universidade Tecnológica Federal do Parana, Brazil Ingrid Nunes, UFRGS, Brazil Juliana Saraiva, Federal University of Pernambuco, Brazil Lucas Bueno, University of São Paulo, Brazil Luiz Carlos Ribeiro Junior, Universidade de Brasilia – UnB, Brazil Marcelo Eler, Universidade de São Paulo, Brazil Marcelo Gonçalves, Universidade de São Paulo, Brazil Marcelo Morandini, Universidade de São Paulo, Brazil Mauricio Arimoto, Universidade de São Paulo, Brazil Milena Guessi, Universidade de São Paulo, Brazil Paulo Afonso Parreira Júnior, Universidade Federal de São Carlos, Brazil Paulo Meirelles, IME – USP, Brazil Pedro Santos Neto, Universidade Federal do Piauí, Brazil Ricardo Terra, UFMG, Brazil Roberto Araujo, EACH/USP, Brazil Sidney NogueiraFederal University of Pernambuco, Brazil Vanessa Braganholo, UFF, Brazil Viviane Santos Universidade de São Paulo, Brazil Yijun YuOpen University, Great Britain 9 SBES 2013 Comitê organizador / Organizing Committee COORDENAÇÃO GERAL Genaína Nunes Rodrigues, CIC, UnB Rodrigo Bonifácio, CIC, UnB Edna Dias Canedo, CIC, UnB COMITÊ LOCAL Diego Aranha, CIC, UnB Edna Dias Canedo, FGA, UnB Fernanda Lima, CIC, UnB Guilherme Novaes Ramos, CIC, UnB Marcus Vinícius Lamar, CIC, UnB George Marsicano, FGA, UnB Giovanni Santos Almeida, FGA, UnB Hilmer Neri, FGA, UnB Luís Miyadaira, FGA, UnB Maria Helena Ximenis, CIC, UnB COORDENADOR DO COMITÊ DE PROGRAMA SBES 2013 Auri M. R. Vincenzi, Universidade Federal de Goiás, Brasil 10 SBES 2013 palestras convidadas / invited keynotes TOWARD THE MAKING OF SOFTWARE THAT LEARNS TO MANAGE ITSELF SAM MALEK A self-managing software system is capable of adjusting its behavior at runtime in response to changes in the system, its requirements, or the environment in which it executes. Self-management capabilities are sought-after to automate the management of complex software in many computing domains, including service-oriented, mobile, cyber-physical and ubiquitous settings. While the benefits of such software are plenty, its development has shown to be much more challenging than the conventional software. At the state of the art, it is not an impervious engineering problem – in principle – to develop a selfadaptation solution tailored to a given system, which can respond to a bounded set of conditions that are expected to require automated adaptation. However, any sufficiently complex software system – once deployed in the field – is subject to a broad range of conditions and many diverse stimuli. That may lead to the occurrence of behavioral patterns that have not been foreseen previously: in fact, those may be the ones that cause the most critical problems, since, by definition, they have not manifested themselves, and have not been accounted for during the previous phases of the engineering process. A truly self-managing system should be able to cope with such unexpected behaviors, by modifying or enriching its adaptation logic and provisions accordingly. In this talk, I will first provide an introduction to some of the challenges of making software systems self-managing. 
Afterwards, I will provide an overview of two research projects in my group that have tackled these challenges through the applications of automated inference techniques (e.g., machine learning, data mining). The results have been promising, allowing the software engineers to empower a software system with advanced self-management capabilities with minimal effort. I will conclude the talk with an outline of future research agenda for the community. HOW THE WEB BROUGHT EVOLUTION BACK INTO DESIGN JEFF OFFUTT To truly understand the effect the Web is having on software engineering, we need to look to the past. Evolutionary design was near universal in the days before the industrial revolution. The production costs were very high, but craftsmen were able to implement continuous improvement–every new object could be better than the last. Software is different; it has a near-zero production cost, allowing millions of identical copies to be made. Unfortunately, near-zero production cost means software must be near-perfect “out of the box.” This fact has driven our research agenda for 50 years. But it is no longer true! This talk will discuss how near-zero production cost for near-perfect software has driven our research agenda. Then it will point out how the web has eliminated the need for near-perfect software out of the box. The talk will finish by describing how this shift is changing software development and research, and speculate on how this change our future research agenda. 11 SBES 2013 SOFTWARE ARCHITECTURE: A CORE DISCIPLINE TO ENGINEER SOFTWARE THAIS BATISTA Software architecture has emerged in the last decades as an important discipline of software engineering, dealing with the design decisions to define the organization of the system that have a long-lasting impact on its quality attributes. The architectural description documents the decisions and it is used as a blueprint to other activities in the software engineering process, such as implementation, testing, and evaluation. In this talk we will discuss the role of software architecture as a core activity to engineer software, its influence on other activities of software development, and the new trends and challenges in this area. 12 SBES 2013 PALESTRANTES / keynotes Sam Malek (George Mason University) Sam Malek is an Associate Professor in the Department of Computer Science at George Mason University. He is also the director of Software Design and Analysis Laboratory at GMU, a faculty associate of the C4I Center, and a member of the DARPA’s Computer Science Study Panel. Malek’s general research interests are in the field of software engineering, and to date his focus has spanned the areas of software architecture, autonomic software, and software dependability. Malek received his PhD and MS degrees in Computer Science from the University of Southern California, and his BS degree in Information and Computer Science from the University of California, Irvine. He has received numerous awards for his research contributions, including the National Science Foundation CAREER award (2013) and the GMU Computer Science Department Outstanding Faculty Research Award (2011). He has managed research projects totaling more than three million dollars in funding received from NSF, DARPA, IARPA, ARO, FBI, and SAIC. He is a member of the ACM, ACM SIGSOFT, and IEEE. Jeff Offutt (George Mason University) Dr. 
Jeff Offutt is Professor of Software Engineering at George Mason University and holds part-time visiting faculty positions at the University of Skovde, Sweden, and at Linkoping University, Linkoping Sweden. Offutt has invented numerous test strategies, has published over 150 refereed research papers (h-index of 51 on Google Scholar), and is co-author of Introduction to Software Testing. He is editor-in-chief of Wiley’s journal of Software Testing, Verification and Reliability; co-founded the IEEE International Conference on Software Testing, Verification, and Validation; and was its founding steering committee chair. He was awarded the George Mason University Teaching Excellence Award, Teaching With Technology, in 2013, and was named a GMU Outstanding Faculty member in 2008 and 2009. For the last ten years he has led the 25-year old MS program in Software Engineering, and led the efforts to create PhD and BS programs in Software Engineering. His current research interests include software testing, analysis and testing of web applications, secure software engineering, objectoriented program analysis, usable software security, and software evolution. Offutt received the PhD in computer science in 1988 from the Georgia Institute of Technology and is on the web at http://www. cs.gmu.edu/~offutt/. Thais Batista (UFRN) Thais Batista is an Associate Professor at the Federal University of Rio Grande do Norte (UFRN) since 1996. She holds a Ph.D in Computer Science from the Catholic University of Rio de Janeiro (PUC-Rio), Brazil, 2000. In 2004-2005 she was a post-doctoral researcher at the Lancaster University, UK. Her main research area is software architecture, distributed systems, middleware, cloud computing. Índice 13 SBES 2013 Índice de Artigos / Table of Contents Criteria for Comparison of Aspect-Oriented Requirements Engineering Approaches : Critérios para Comparação de Abordagens para Engenharia de Requisitos Orientada a Aspectos 16 Paulo Afonso Parreira Júnior, Rosângela Aparecida Dellosso Penteado Using Tranformation Rules to Align Requirements and Archictectural Models 26 Monique Soares, Carla Silva, Gabriela Guedes, Jaelson Castro, Cleice Souza, Tarcisio Pereira An automatic approach to detect traceability links using fuzzy logic 36 Andre Di Thommazo, Thiago Ribeiro, Guilherme Olivatto, Vera Werneck, Sandra Fabbri Determining Integration and Test Orders in the Presence of Modularization Restrictions 46 Wesley Klewerton Guez Assunção, Thelma Elita Colanzi, Silvia Regina Vergilio, Aurora Pozo Functional Validation Driven by Automated Tests / Validação Funcional Dirigida por Testes Automatizados 56 Thiago Delgado Pinto, Arndt von Staa Visualization, Analysis, and Testing of Java and AspectJ Programs with Multi-Level System Graphs 64 Otavio Augusto Lazzarini Lemos, Felipe Capodifoglio Zanichelli, Robson Rigatto, Fabiano Ferrariy, Sudipto Ghosh A Method for Model Checking Context-Aware Exception Handling 74 Lincoln S. Rocha, Rossana M. C. Andrade, Alessandro F. Garcia Prioritization of Code Anomalies based on Architecture Sensitiveness Roberta Arcoverde, Everton Guimarães, Isela Macía, Alessandro Garcia, Yuanfang Cai 14 84 SBES 2013 Are domain-specific detection strategies for code anomalies reusable? 
An industry multi-project study : Reuso de Estratégias Sensíveis a Domínio para Detecção de Anomalias de Código: Um Estudo de Múltiplos Casos 94 Alexandre Leite Silva, Alessandro Garcia, Elder José Reioli, Carlos José Pereira de Lucena F3T: From Features to Frameworks Tool 104 Matheus Viana, Rosangela Penteado, Antônio do Prado, Rafael Durelli A Metric of Software Size as a Tool for IT Governance 114 Marcus Vinícius Borela de Castro, Carlos Alberto Mamede Hernandes 124 An Approach to Business Processes Decomposition for Cloud Deployment: Uma Abordagem para Decomposição de Processos de Negócio para Execução em Nuvens Computacionais Lucas Venezian Povoa, Wanderley Lopes de Souza, Antonio Francisco do Prado, Luís Ferreira Pires, Evert F. Duipmans On the InFLuence of Model Structure and Test CaseProFIle on the Prioritization of Test Cases in theContext of Model-based Testing 134 Joao Felipe S. Ouriques, Emanuela G. Cartaxo, Patrícia D. L. Machado 144 The Impact of Scrum on Customer Satisfaction: An Empirical Study Bruno Cartaxo, Allan Araujo, Antonio Sa Barreto, Sergio Soares Identifying a Subset of TMMi Practices to Establish a Streamlined Software Testing Process 152 Kamilla Gomes Camargo, Fabiano Cutigi Ferrari, Sandra Camargo Pinto Ferraz Fabbri On the Relationship between Features Granularity and Non-conformities in Software Product Lines: An Exploratory Study 162 Iuri Santos Souza, Rosemeire Fiaccone, Raphael Pereira de Oliveira, Eduardo Santana de Almeida 172 An Extended Assessment of Data-driven Bayesian Networks in Software Effort Prediction Ivan A. P. Tierno, Daltro J. Nunes 15 Criteria for Comparison of Aspect-Oriented Requirements Engineering Approaches Critérios para Comparação de Abordagens para Engenharia de Requisitos Orientada a Aspectos Paulo Afonso Parreira Júnior 1, 2, Rosângela Aparecida Dellosso Penteado 2 1 Bacharelado em Ciência da Computação – UFG (Câmpus Jataí) - Jataí – Goiás, Brasil 2 Departamento de Computação - UFSCar - São Carlos - São Paulo, Brasil {paulo_junior, rosangela}@dc.ufscar.br Resumo— Early-aspects referem-se a requisitos de software que se encontram espalhados ou entrelaçados com outros requisitos e são tratados pela Engenharia de Requisitos Orientada a Aspectos (EROA). Várias abordagens para EROA têm sido propostas nos últimos anos e possuem diferentes características, limitações e pontos fortes. Sendo assim, torna-se difícil a tomada de decisão por parte de: i) engenheiros de software, quanto à escolha da abordagem mais apropriada as suas necessidades; e ii) pesquisadores em EROA, quando o intuito for entenderem as diferenças existentes entre suas abordagens e as existentes na literatura. Este trabalho tem o objetivo de apresentar um conjunto de critérios para comparação de abordagens para EROA, criado com base nas variabilidades e características comuns dessas abordagens. Além disso, tais critérios são aplicados a seis abordagens e os resultados obtidos podem servir como um guia para que usuários escolham a abordagem que melhor atenda as suas necessidades, bem como facilite a realização de pesquisas na área de EROA. Palavras-chave — Engenharia de Software Orientada a Aspectos, Critérios para Comparação, Avaliação Qualitativa, Early Aspects. Abstract— Early-aspects consist of software requirements that are spread or tangled with other requirements and can be treated by Aspect-Oriented Requirements Engineering (AORE). Many AORE approaches have been proposed in recent years and have different features, strengths and limitations. 
Thus, it becomes difficult the decision making by: i) software engineers, regards to the choice of the most appropriate approach to your needs, and ii) AORE researchers, when the intent is to understand the differences between their own approaches and other ones in the literature. This paper aims to present a set of comparison criteria for AORE approaches, based on common features and variability of these approaches. Such criteria are applied on six of the main AORE approaches and the results can serve as a guide so that users can choose the approach that best meets their needs, and to facilitate the conduct of research in AORE. Keywords — Aspect-Oriented Requirements Engineering, Comparison Criteria, Qualitative Evaluation, Early Aspects. I. INTRODUÇÃO O aumento da complexidade do software e a sua aplicabilidade nas mais diversas áreas requerem que a Engenharia de Requisitos (ER) seja realizada de modo abrangente e completo, a fim de: i) contemplar todas as necessidades dos stakeholders [1]; e ii) possibilitar que os engenheiros de software tenham o completo entendimento da funcionalidade do software, dos serviços e restrições existentes e do ambiente sobre o qual ele deve operar [2]. Um requisito de software define uma propriedade ou capacidade que atende às regras de negócio de um software [1]. Um conjunto de requisitos relacionados com um mesmo objetivo, durante o desenvolvimento do software, define o conceito de “interesse” (concern). Por exemplo, um interesse de segurança pode contemplar diversos requisitos relacionados a esse objetivo, que é garantir que o software seja seguro. Idealmente, cada interesse do software deveria estar alocado em um módulo específico do software, que satisfizesse aos seus requisitos. Quando isso ocorre, diz-se que o software é bem modularizado, pois todos os seus interesses estão claramente separados [2]. Entretanto, há alguns tipos de interesses (por exemplo, desempenho, segurança, persistência, entre outros) para os quais essa alocação não é possível apenas utilizando as abstrações usuais da engenharia de software, como casos de uso, classes e objetos, entre outros. Tais interesses são denominados “interesses transversais” ou “early aspect” e referem-se aos requisitos de software que se encontram espalhados ou entrelaçados com outros requisitos. A falta de modularização ocasionada pelos requisitos espalhados e entrelaçados tende a dificultar a manutenção e a evolução do software, pois prejudica a avaliação do engenheiro de software quanto aos efeitos provocados pela inclusão, remoção ou alteração de algum requisito sobre os demais [1]. A Engenharia de Requisitos Orientada a Aspectos (EROA) é uma área de pesquisa que objetiva promover melhorias com relação à Separação de Interesses (Separation of Concerns) [3] durante as fases iniciais do desenvolvimento do software, oferecendo estratégias mais adequadas para identificação, modularização e composição de interesses transversais. Várias abordagens para EROA têm sido desenvolvidas nos últimos anos [4][5][7][8][9][10][11][12][13][14], cada uma com diferentes características, limitações e pontos fortes. Além disso, avaliações qualitativas ou quantitativas dessas abordagens foram realizadas [1][2][15][16][17][19][20]. Mesmo com a grande variedade de estudos avaliativos, apenas alguns aspectos das abordagens para EROA são considerados. Assim, para se ter uma visão mais abrangente sobre uma determinada abordagem há necessidade de se recorrer a outros estudos. 
Por exemplo, as informações sobre as atividades da EROA contempladas em uma abordagem são obtidas na publicação na qual ela foi proposta ou em alguns estudos comparativos que a envolvem. Porém, nem sempre essas publicações apresentam informações precisas sobre a escalabilidade e/ou cobertura e a precisão dessa abordagem, sendo necessário recorrer a estudos de avaliação quantitativa. Na literatura há escassez de estudos que realizam a comparação de abordagens para EROA por meio de um conjunto bem definido de critérios. Também é difícil encontrar, em um mesmo trabalho, a comparação de características qualitativas e quantitativas das abordagens. Esses fatos dificultam a tomada de decisão por parte de: i) engenheiros de software, quanto à escolha da abordagem mais apropriada as suas necessidades; e ii) pesquisadores em EROA, para entenderem as diferenças existentes entre suas abordagens e as demais existentes na literatura. Este trabalho apresenta um conjunto de oito critérios para facilitar a comparação de abordagens para EROA. Esses critérios foram desenvolvidos com base nas variabilidades e características comuns de diversas abordagens, bem como nos principais trabalhos relacionados à avaliação qualitativa e quantitativa dessas abordagens. Os critérios elaborados contemplam: (1) o tipo de simetria de cada abordagem; (2) as atividades da EROA e (3) interesses contemplados por ela; (4) as técnicas utilizadas para realização de suas atividades; (5) o nível de envolvimento necessário para sua aplicação, por parte do usuário; (6) sua escalabilidade; (7) nível de apoio computacional disponível; e (8) as avaliações já realizadas sobre tal abordagem. A fim de verificar a aplicabilidade dos critérios propostos, seis das principais abordagens para EROA disponíveis na literatura são comparadas: Separação Multidimensional de Interesses [8]; Theme [9][10]; EA-Miner [4][5]; Processo baseado em XML para Especificação e Composição de Interesses Transversais [7]; EROA baseada em Pontos de Vista [13][14]; e Aspect-Oriented Component Requirements Engineering (AOCRE) [11]. O resultado obtido com essa comparação pode servir como um guia para que usuários possam compreender de forma mais clara e abrangente as principais características, qualidades e limitações dessas abordagens para EROA, escolhendo assim, aquela que melhor atenda as suas necessidades. O restante deste artigo está organizado da seguinte forma. Na Seção 2 é apresentada uma breve descrição sobre EROA, com enfoque sobre suas principais atividades. Na Seção 3 é apresentada uma visão geral sobre as abordagens para EROA comparadas neste trabalho. O conjunto de critérios para comparação de abordagens para EROA está na Seção 4. A aplicação dos critérios sobre as abordagens apresentadas é exibida e uma discussão dessa aplicação é mostrada na Seção 5. Os trabalhos relacionados estão na Seção 6 e, por fim, as conclusões e trabalhos futuros são apresentados na Seção 7. II. ENGENHARIA DE REQUISITOS ORIENTADA A ASPECTOS O princípio da Separação de Interesses tem por premissa a identificação e modularização de partes do software relevantes a um determinado conceito, objetivo ou propósito [3]. Abordagens tradicionais para desenvolvimento de software, como a Orientação a Objetos (OO), foram criadas com base nesse princípio, porém, certos interesses de escopo amplo (por exemplo, segurança, sincronização e logging) não são fáceis de serem modularizados e mantidos separadamente durante o desenvolvimento do software. 
O software gerado pode conter representações entrelaçadas, que dificultam o seu entendimento e a sua evolução [7]. Uma abordagem efetiva para ER deve conciliar a separação de interesses com a necessidade de atender aos interesses de escopo amplo [8]. A EROA surge como uma tentativa de se contemplar esse objetivo por meio da utilização de estratégias específicas para modularização de interesses que são difíceis de serem isolados em módulos individuais. Um “interesse” encapsula um ou mais requisitos especificados pelos stakeholders e um “interesse transversal” ou “early aspect” é um interesse que se intercepta com outros interesses do software. A explícita modularização de interesses transversais em nível de requisitos permite que engenheiros de software raciocinem sobre tais interesses de forma isolada desde o início do ciclo de vida do software, o que pode facilitar a criação de estratégias para sua modularização.

Na Figura 1 está ilustrado o esquema de um processo genérico para EROA, proposto por Chitchyan et al. [4], que foi desenvolvido com base em outros processos existentes na literatura [8][9][12][14] (os retângulos de bordas arredondadas representam as atividades do processo).

Figura 1. Processo genérico para EROA (adaptado de Chitchyan et al. [4]).

A partir de um conjunto inicial de requisitos disponível, a atividade Identificação de Interesses identifica e classifica interesses do software como base ou transversais. Em seguida, a atividade Identificação de Relacionamento entre Interesses permite que o engenheiro de software conheça as influências e as restrições impostas pelos interesses transversais sobre os outros interesses do software. A atividade Triagem auxilia na decisão sobre quais desses interesses são pertinentes ao software e se há repetições na lista de interesses identificados. A atividade Refinamento de Interesses ocorre quando houver necessidade de se alterar o conjunto de interesses e relacionamentos já identificados. Os interesses classificados como pertinentes são então representados durante a atividade Representação de Interesses em um determinado formato (template), de acordo com a abordagem para EROA utilizada. Esse formato pode ser um texto, um modelo de casos de uso, pontos de vista, entre outros. Por exemplo, no trabalho de Rashid et al. [13][14], interesses são representados por meio de pontos de vista; no de Baniassad e Clarke [9][10] são utilizados temas. Durante a representação dos interesses, o engenheiro de software pode identificar a necessidade de refinamento, ou seja, de incluir/remover interesses e/ou relacionamentos. Isso ocorrendo, ele pode retornar para as atividades anteriores do processo da Figura 1. Finalmente, os interesses representados em um determinado template precisam ser compostos e analisados para a detecção dos conflitos entre interesses do software. Essas análises são feitas durante as atividades de Composição de Interesses e de Análise, Identificação e Resolução de Conflitos. Em seguida, os conflitos identificados são resolvidos com o auxílio dos stakeholders.

Em geral, as atividades descritas no processo da Figura 1 são agregadas em quatro atividades maiores, a saber: “Identificação”, “Representação” e “Composição” de interesses e “Análise e Resolução de Conflitos”. Essas atividades são utilizadas como base para apresentação das características das abordagens para EROA na Seção 3 deste trabalho.
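A título de ilustração apenas, o esboço em Python a seguir mostra uma forma possível, e puramente hipotética (não faz parte do processo de Chitchyan et al. [4] nem de nenhuma das abordagens analisadas), de representar interesses e relacionamentos e de apoiar as atividades de Triagem e de Análise de Conflitos descritas acima; os nomes de interesses citados nos comentários são fictícios.

from dataclasses import dataclass, field

@dataclass
class Interesse:
    nome: str
    requisitos: list = field(default_factory=list)   # ids dos requisitos agrupados pelo interesse
    transversal: bool = False                         # classificação feita na Identificação de Interesses

@dataclass
class Relacionamento:
    origem: str                            # interesse que afeta (ex.: "Seguranca")
    alvo: str                              # interesse afetado (ex.: "Navegacao")
    requisitos: frozenset = frozenset()    # requisitos do alvo afetados pela origem

def triagem(interesses):
    """Atividade de Triagem: descarta repetições na lista de interesses identificados."""
    vistos, pertinentes = set(), []
    for interesse in interesses:
        if interesse.nome not in vistos:
            vistos.add(interesse.nome)
            pertinentes.append(interesse)
    return pertinentes

def conflitos_candidatos(relacionamentos):
    """Análise de conflitos: pares de interesses que afetam os mesmos requisitos de um mesmo alvo."""
    candidatos = set()
    for a in relacionamentos:
        for b in relacionamentos:
            if a.origem < b.origem and a.alvo == b.alvo and a.requisitos & b.requisitos:
                candidatos.add((a.origem, b.origem, a.alvo))
    return candidatos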
III. ABORDAGENS PARA EROA

A escolha das seis abordagens para EROA analisadas neste trabalho foi realizada por meio de um processo de Revisão Sistemática (RS), cujo protocolo foi omitido neste trabalho, devido a restrições de espaço. Tais abordagens têm sido consideradas maduras por outros autores em seus estudos comparativos [2][15][17], bem como foram divulgadas em veículos e locais de publicação de qualidade e avaliadas de forma quantitativa com sistemas reais. Apenas os principais conceitos dessas abordagens são apresentados; mais detalhes podem ser encontrados nas referências aqui apresentadas.

A. Separação Multidimensional de Interesses

Esta abordagem propõe que requisitos devem ser decompostos de forma uniforme com relação a sua natureza funcional, não funcional ou transversal [8]. Tratando todos os interesses da mesma forma, pode-se então escolher qualquer conjunto de interesses como base para analisar a influência dos outros interesses sobre essa base.

i) Identificação e Representação de Interesses. Tem por base a observação de que certos interesses, como por exemplo, mobilidade, recuperação de informação, persistência, entre outros, aparecem frequentemente durante o desenvolvimento de software. Assim, os autores dividiram o espaço de interesses em dois: i) o dos metainteresses, que consiste em um conjunto abstrato de interesses típicos, como os que foram mencionados acima; e ii) o do sistema, que contempla os interesses específicos do sistema do usuário. Para se utilizar esta abordagem, os requisitos do sistema devem ser analisados pelo engenheiro de requisitos e categorizados com base nos interesses existentes no espaço de metainteresses, gerando assim os interesses concretos. Para representação dos interesses, tanto os abstratos (metainteresses) quanto os concretos, são utilizados templates XML.

ii) Composição de Interesses. Após a representação dos interesses, regras de composição são definidas para se especificar como um determinado interesse influencia outros requisitos ou interesses do sistema. As regras de composição também são especificadas por meio de templates XML. Na Figura 2 é apresentado um exemplo de regra de composição na qual o interesse “Recuperação de Informações” afeta todos os requisitos do interesse de “Customização” (especificado pelo atributo id = “all”), o requisito 1 do interesse “Navegação” e o requisito 1 do interesse “Mobilidade” (especificados pelo atributo id = “1”), incluindo seus subrequisitos (especificado pelo atributo children = “include”).

<?xml version="1.0" ?>
<Composition>
  <Requirement concern="InformationRetrieval" id="all">
    <Constraint action="provide" operator="during">
      <Requirement concern="Customisability" id="all" />
      <Requirement concern="Navigation" id="1" />
      <Requirement concern="Mobility" id="1" children="include" />
    </Constraint>
    <Outcome action="fulfilled" />
  </Requirement>
</Composition>

Figura 2. Regras de composição para o interesse “Recuperação da Informação” (adaptado de Moreira et al. [8]).

iii) Análise e Resolução de Conflitos. É realizada a partir da observação das interações de um interesse com os outros do sistema. Seja C1, C2, C3, ..., Cn os interesses concretos de um determinado sistema e SC1, SC2, SC3, ..., SCn os conjuntos de interesses que eles entrecortam, respectivamente. Para se identificar os conflitos entre C1 e C2 deve-se analisar a Interseção de Composição SC1 ∩ SC2. Uma Interseção de Composição é definida por: seja o interesse Ca membro de SC1 e SC2.
Ca aparece na interseção de composição SC1 ∩ SC2 se, e somente se, C1 e C2 afetarem o mesmo conjunto de requisitos presentes em Ca. Por exemplo, na Figura 2, nota-se que o interesse “Recuperação de Informações” afeta o requisito 1 do interesse “Navegação”. Supondo que o interesse “Mobilidade” também afete esse requisito, então SCRecuperação de Informações ∩ SCMobilidade = {“Navegação”}. Os conflitos são analisados com base no tipo de contribuição que um interesse pode exercer sobre outro com relação a uma base de interesses. Essas contribuições podem ser negativas (-), positivas (+) ou neutras. Uma matriz de contribuição é construída, de forma que cada célula apresenta o tipo da contribuição (+ ou -) dos interesses em questão com relação aos interesses do conjunto de interseções de composição localizado dentro da célula. Uma célula vazia denota a não existência de relacionamento entre os interesses. Se a contribuição é neutra, então apenas o conjunto de interseções de composição é apresentado.

B. Theme

A abordagem Theme [9][10] apoia EROA em dois níveis: a) de requisitos, por meio da Theme/Doc, que fornece visualizações para requisitos textuais que permitem expor o relacionamento entre comportamentos em um sistema; b) de projeto, por meio da Theme/UML, que permite ao desenvolvedor modelar os interesses base e transversais de um sistema e especificar como eles podem ser combinados.

i) Identificação de Interesses. Para esta atividade o engenheiro de software dispõe da visualização de ações, um tipo de visualização dos requisitos do sistema proposto pelos autores. Duas entradas são obrigatórias para se gerar uma visualização de ações: i) uma lista de ações-chaves, isto é, verbos identificados pelo engenheiro de software ao analisar o documento de requisitos; e ii) o conjunto de requisitos do sistema. Na Figura 3 é apresentada a visualização de ações criada a partir de um conjunto de requisitos e de uma lista de ações-chaves de um pequeno sistema de gerenciamento de cursos [9]. As ações-chaves são representadas por losangos e os requisitos do texto, por caixas com bordas arredondadas. Se um requisito contém uma ação-chave em sua descrição, então ele é associado a essa ação-chave por meio de uma seta da caixa com borda arredondada para o losango correspondente à ação.

Figura 3. Exemplo de uma visualização de ações [9].

A ideia é utilizar essa visualização para separar e isolar ações e requisitos em dois grupos: 1) o grupo “base”, que é autocontido, ou seja, não possui requisitos que se referem a ações do outro grupo; e 2) o grupo “transversal”, que possui requisitos que se referem a ações do grupo base. Para atingir essa separação em grupos, o engenheiro de software deve examinar os requisitos para classificá-los em um dos grupos. Caso o engenheiro de software decida que uma ação principal entrecorta as demais ações do requisito em questão, então uma seta de cor cinza com um ponto em uma de suas extremidades é traçada da ação que entrecorta para a ação que é entrecortada. Na Figura 3, denota-se que a ação logged entrecorta as ações unregister, give e register.

ii) Representação e Composição de Interesses. Para essas atividades utiliza-se Theme/UML, que trabalha com o conceito de temas - elementos utilizados para representar interesses e que podem ser do tipo base ou transversal. Os temas base encapsulam as funcionalidades do domínio do problema, enquanto que os transversais encapsulam os interesses que afetam os temas base.
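Retomando a separação em grupos base e transversal obtida com a Theme/Doc, o esboço a seguir é uma simplificação meramente ilustrativa, com requisitos e ações fictícios (não se trata da ferramenta Theme/Doc): requisitos associados a ações que o engenheiro de software marcou como entrecortantes formam o grupo transversal e os demais formam o grupo base.

# Dados fictícios: requisitos e as ações-chave (verbos) que cada um menciona.
requisitos = {
    "R1": {"register"},
    "R2": {"unregister"},
    "R3": {"give"},
    "R4": {"logged", "register"},   # menciona a ação entrecortante e uma ação do grupo base
}
# Ações que o engenheiro de software decidiu que entrecortam as demais (decisão manual na Theme/Doc).
acoes_transversais = {"logged"}

grupo_transversal = {r for r, acoes in requisitos.items() if acoes & acoes_transversais}
grupo_base = set(requisitos) - grupo_transversal

print(sorted(grupo_base))         # ['R1', 'R2', 'R3']
print(sorted(grupo_transversal))  # ['R4']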
A representação gráfica de um tema é um pacote da UML denotado com o estereótipo <<theme>>. Os temas transversais são representados por meio de gabaritos da UML, que permitem encapsular o comportamento transversal independentemente do tema base, ou seja, sem considerar os pontos reais do sistema que serão afetados. Um gabarito é representado graficamente por um pacote da UML com um parâmetro no canto superior direito, um template. Após a especificação do sistema em temas base e transversais, é necessário realizar a composição deles. Para isso utiliza-se o relacionamento de ligação (bind), que descreve para quais eventos ocorridos nos temas base o comportamento do tema transversal deve ser disparado. Para auxiliar o engenheiro de software a descobrir e representar os temas e seus relacionamentos, visualizações de temas (theme view) são utilizadas. Elas diferem das visualizações de ações, pois não apresentam apenas requisitos e ações, mas também entidades do sistema (informadas pelo engenheiro de software) que serão utilizadas na modelagem dos temas. iii) Análise e Resolução de Conflitos. Os trabalhos analisados sobre a abordagem Theme não apresentaram detalhes sobre a realização dessa atividade. C. EA-Miner A abordagem EA-Miner segue o processo genérico apresentado na Figura 1, o qual foi definido pelos mesmos autores dessa abordagem. Além disso, os autores propuseram uma suíte de ferramentas que apoiam as atividades desse processo [4][5]. Essas ferramentas exercem dois tipos de papéis: i) gerador de informações: que analisa os documentos de entrada e os complementa com informações linguísticas, semânticas, estatísticas e com anotações; e ii) consumidor de informações: que utiliza as anotações e informações adicionais atribuídas ao conjunto de entrada para múltiplos tipos de análise. A principal geradora de informações da abordagem EAMiner é a ferramenta WMATRIX [6], uma aplicação web para Processamento de Linguagem Natural (PLN), que é utilizada por essa abordagem para identificação de conceitos do domínio do sistema. i) Identificação de Interesses. É realizada pela ferramenta EA-Miner (Early Aspect Mining), que recebe o mesmo nome da abordagem. Para identificação de interesses transversais não funcionais, EA-Miner constrói uma árvore de requisitos não funcionais com base no catálogo de Chung e Leite [18]. Os interesses transversais são identificados pela equivalência semântica entre as palavras do documento de requisitos e as categorias desse catálogo. Para identificação de interesses transversais funcionais, EA-Miner utiliza uma estratégia semelhante à da abordagem Theme, detectando a ocorrência de verbos repetidos no documento de requisitos, o que pode sugerir a presença de interesses transversais funcionais. ii) Representação e Composição de Interesses. Para esta atividade, utiliza-se a ferramenta ARCADE (Aspectual Requirements Composition and Decision). Com ela, o engenheiro de software pode selecionar quais requisitos são afetados pelos interesses do sistema, escolher os relacionamentos existentes entre eles e, posteriormente, gerar regras de composição. ARCADE utiliza a mesma ideia de regra de composição da abordagem “Separação Multidimensional de Interesses” [8]. iii) Análise e Resolução de Conflitos. ARCADE possui também um componente analisador de conflitos, o qual identifica sobreposição entre aspectos com relação aos requisitos que eles afetam. O engenheiro de requisitos é alertado sobre essa sobreposição e decide se os aspectos sobrepostos prejudicam ou favorecem um ao outro. 
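Apenas para ilustrar a ideia geral de identificação empregada pela EA-Miner, o esboço abaixo faz um casamento simples entre os termos de um catálogo hipotético de requisitos não funcionais (inspirado na ideia do catálogo de Chung e Leite [18]) e o texto dos requisitos, além de contar verbos repetidos; a ferramenta real usa PLN por meio da WMATRIX, o que não é reproduzido aqui, e tanto o catálogo quanto as funções são fictícios.

import re
from collections import Counter

# Catálogo hipotético de requisitos não funcionais: categorias e termos indicativos.
CATALOGO_RNF = {
    "Seguranca": {"autenticar", "criptografar", "senha"},
    "Desempenho": {"tempo de resposta", "latencia", "rapido"},
    "Persistencia": {"armazenar", "salvar", "banco de dados"},
}

def identificar_interesses_nao_funcionais(requisitos):
    """Associa categorias do catálogo aos requisitos em que seus termos aparecem."""
    achados = {categoria: set() for categoria in CATALOGO_RNF}
    for i, texto in enumerate(requisitos):
        texto_min = texto.lower()
        for categoria, termos in CATALOGO_RNF.items():
            if any(termo in texto_min for termo in termos):
                achados[categoria].add(i)
    return {c: reqs for c, reqs in achados.items() if reqs}

def verbos_repetidos(requisitos, minimo=2):
    """Aproximação grosseira da detecção de interesses transversais funcionais:
    palavras repetidas terminadas em 'ar', 'er' ou 'ir' (sem PLN de verdade)."""
    palavras = Counter(
        p for r in requisitos for p in re.findall(r"\w+", r.lower())
        if p.endswith(("ar", "er", "ir"))
    )
    return {p for p, n in palavras.items() if n >= minimo}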
D. Processo baseado em XML para Especificação e Composição de Interesses Transversais O processo de Soeiro et al. [7] é composto das seguintes atividades: identificar, especificar e compor interesses. i) Identificação de Interesses. Ocorre por meio da análise da descrição do sistema feita por parte do engenheiro de software. Os autores indicam que a identificação dos interesses pode ser auxiliada pelo uso de catálogos de requisitos não funcionais, como o proposto por Chung e Leite [18]. Para cada entrada do catálogo, deve-se decidir se o interesse em questão existe ou não no sistema em análise. ii) Representação e Composição de Interesses. Para essas atividades foram criados templates XML com o intuito de coletar e organizar todas as informações a respeito de um interesse. A composição dos interesses do sistema ocorre por regras de composição, que consistem dos seguintes elementos: Term: pode ser um interesse ou outra regra de composição. Operator: define o tipo de operação >>, [> ou ||. C1 >> C2 refere-se a uma composição sequencial e significa que o comportamento de C2 inicia-se se e somente se C1 tiver terminado com sucesso. C1 [> C2 significa que C2 interrompe o comportamento de C1 quando começa a executar. C1 || C2 significa que o comportamento de C1 está sincronizado com o de C2. Outcome: expressa o resultado das restrições impostas pelos operadores comentados anteriormente. iii) Análise e Resolução de Conflitos. Os trabalhos analisados sobre essa abordagem não apresentaram detalhes sobre a realização desta atividade. E. EROA baseada em Pontos de Vista Rashid et al. [13][14] propuseram uma abordagem para EROA baseada em pontos de vista (viewpoints). São utilizados templates XML para especificação dos pontos de vista, dos interesses transversais e das regras de composição entre pontos de vista e interesses transversais do sistema. Além disso, a ferramenta ARCADE automatiza a tarefa de representação dos conceitos mencionados anteriormente com base nos templates XML pré-definidos na abordagem. A primeira atividade dessa abordagem consiste na Identificação e Especificação dos Requisitos do Sistema e, para isso, pontos de vista são utilizados. i) Identificação e Representação de Interesses. É realizada por meio da análise dos requisitos iniciais do sistema pelo engenheiro de software. De modo análogo ao que é feito com os pontos de vista, interesses também são especificados em arquivos XML. Após a identificação dos pontos de vista e dos interesses, é necessário detectar quais desses interesses são candidatos a interesses transversais. Para isso cria-se uma matriz de relacionamento, na qual os interesses do sistema são colocados em suas linhas e os pontos de vista, nas colunas. Cada célula dessa matriz, quando marcada, representa que um determinado interesse exerce influência sobre os requisitos do ponto de vista da coluna correspondente daquela célula. Sendo assim, é possível observar quais pontos de vista são entrecortados pelos interesses do sistema. Segundo os autores, quando um interesse entrecorta os requisitos de vários pontos de vista do sistema, isso pode indicar que se trata de um interesse transversal. ii) Composição de Interesses e Análise e Resolução de Conflitos. Após a identificação dos candidatos a interesses transversais e dos pontos de vista do sistema, os mesmos devem ser compostos por meio de regras de composição e, posteriormente, a análise e resolução de conflitos deve ser realizada. 
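O esboço a seguir ilustra, com dados fictícios, a matriz de relacionamento descrita no item i) acima e a heurística de apontar como candidatos a transversais os interesses que entrecortam vários pontos de vista; não se trata da ferramenta ARCADE nem de código das abordagens originais, e o limiar usado é uma suposição apenas para o exemplo.

# Matriz de relacionamento fictícia: para cada interesse, os pontos de vista cujos requisitos ele afeta.
matriz = {
    "Seguranca":    {"Cliente", "Operador", "Administrador"},
    "Persistencia": {"Cliente", "Administrador"},
    "Cadastro":     {"Operador"},
}

def candidatos_a_transversais(matriz, limiar=2):
    """Interesses que entrecortam pelo menos 'limiar' pontos de vista são candidatos a transversais."""
    return {interesse for interesse, pontos_de_vista in matriz.items()
            if len(pontos_de_vista) >= limiar}

print(sorted(candidatos_a_transversais(matriz)))  # ['Persistencia', 'Seguranca']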
A definição das regras de composição e da atividade de análise e resolução de conflitos segue a mesma ideia da abordagem “Separação Multidimensional de Interesses” [8]. F. Aspect-oriented Component Requirements Engineering (AOCRE) Whittle e Araújo [11] desenvolveram um processo de alto nível para criar e validar interesses transversais e não transversais. O processo se inicia com um conjunto de requisitos adquiridos pela aplicação de técnicas usuais para este fim. i) Identificação e Representação de Interesses. Os interesses funcionais e não funcionais são identificados a partir dos requisitos do sistema. Os interesses funcionais são representados por meio de casos de uso da UML e os interesses não funcionais, por um template específico com as informações: i) fonte do interesse (stakeholders, documentos, entre outros); ii) requisitos a partir dos quais ele foi identificado; iii) sua prioridade; iv) sua contribuição para outro interesse não funcional; e v) os casos de uso (interesses funcionais) afetados por ele. Com base na análise do relacionamento entre interesses funcionais e não funcionais, os candidatos a interesses transversais são identificados e, posteriormente, refinados em um conjunto de cenários. Cenários transversais (derivados dos interesses transversais) são representados por IPSs (Interaction Pattern Specifications) e cenários não transversais são representados por diagramas de sequência da UML. IPS é um tipo de Pattern Specifications (PSs) [23], um modo de se representar formalmente características estruturais e comportamentais de um determinado padrão. PSs são definidas por um conjunto de papéis (roles) da UML e suas respectivas propriedades. Dado um modelo qualquer, diz-se que ele está em conformidade com uma PS se os elementos desse modelo, que desempenham os papéis definidos na PS, satisfazem a todas as propriedades definidas para esses papéis. IPSs servem para especificar formalmente a interação entre papéis de um software. ii) Composição de Interesses. Cenários transversais são compostos com cenários não transversais. A partir desse conjunto de cenários compostos e de um algoritmo desenvolvido pelos autores da abordagem, é gerado um conjunto de máquinas de estados executáveis que podem ser simuladas em ferramentas CASE para validar tal composição. iii) Análise e Resolução de Conflitos. Os trabalhos analisados sobre essa abordagem não apresentaram detalhes sobre a realização dessa atividade. IV. CRITÉRIOS PARA COMPARAÇÃO DE ABORDAGENS PARA EROA O conjunto de critérios apresentado nesta seção foi elaborado de acordo com: i) a experiência dos autores deste trabalho que conduziram o processo de RS; ii) os trabalhos relacionados à avaliação de abordagens para identificação de interesses transversais [1][2][15][16][17][19][20]; e iii) os trabalhos originais que descrevem as abordagens selecionadas para comparação [4][5][7][8][9][10][11][13][14]. A confecção desse conjunto de critérios seguiu o seguinte procedimento: i) a partir da leitura dos trabalhos relacionados à avaliação de abordagens para identificação de interesses transversais (obtidos por meio da RS) foi criado um conjunto inicial de critérios; ii) esse conjunto foi verificado pelos autores deste trabalho e aprimorado com novos critérios ou adaptado com os já elencados; e iii) os critérios elencados foram aplicados às abordagens apresentadas na Seção 3. A. 
Tipo de Simetria: Assimétrica ou Simétrica Abordagens para EROA podem ser classificadas como; a) assimétricas – quando há distinção e tratamento explícitos para os interesses transversais e não transversais; b) simétricas – quando todos os interesses são tratados da mesma maneira. É importante conhecer tal característica das abordagens para EROA, pois ela fornece indícios sobre: a representatividade da abordagem em questão: em geral, abordagens assimétricas possuem melhor representatividade, uma vez que os modelos gerados por meio delas possuem elementos que fazem distinção explícita entre interesses transversais e não transversais. Isso pode favorecer o entendimento desses modelos e consequentemente do software sob análise; e a compatibilidade com outras abordagens para EROA: conhecer se uma abordagem é simétrica ou não pode auxiliar pesquisadores e profissionais a refletirem sobre o esforço necessário para adaptar essa abordagem as suas necessidades. Por exemplo, criando mecanismos para integrá-la com outras abordagens já existentes. Para cada abordagem analisada com esse critério as seguintes informações devem ser coletadas: nome da abordagem em questão, tipo de simetria (simétrica ou assimétrica) e descrição. Essa última informação especifica os elementos de abstração utilizados para tratar com interesses transversais e não transversais, o que explica a sua classificação como simétrica ou assimétrica. Para todos os critérios mencionados nas próximas subseções, o nome da abordagem em análise foi uma das informações coletadas e não será comentada. B. Cobertura: Completa ou Parcial Com esse critério pretende-se responder à seguinte questão: “A abordagem contempla as principais atividades preconizadas pela EROA?” Neste trabalho, considera-se como completa a abordagem que engloba as principais atividades descritas no processo genérico para EROA apresentado na Figura 1, isto é, “Identificação”, “Representação” e “Composição” de interesses e “Análise e Resolução de Conflitos”. Uma abordagem parcial é aquela que trata apenas com um subconjunto (não vazio) dessas atividades. Para cada abordagem analisada com esse critério deve-se obter o tipo de cobertura. Se for cobertura parcial, deve-se destacar as atividades contempladas pela abordagem. C. Propósito: Geral ou Específico Este critério tem a finalidade de avaliar uma abordagem quanto ao seu propósito, ou seja, se é específica para algum tipo de interesse (por exemplo, interesses transversais funcionais, interesses transversais não funcionais, interesses de persistência, segurança, entre outros) ou se é de propósito geral. Se o propósito da abordagem for específico, deve-se destacar os tipos de interesses contemplados por ela. D. Técnicas Utilizadas Este critério elenca as técnicas utilizadas pela abordagem para realização de suas atividades. Por exemplo, para a atividade de identificação de interesses transversais não funcionais, uma abordagem A pode utilizar técnicas de PLN, juntamente com um conjunto de palavras-chave, enquanto que outra abordagem B pode utilizar apenas catálogos de requisitos não funcionais e análise manual dos engenheiros de software. Para esse critério, as seguintes informações são obtidas: i) atividade da EROA contemplada pela abordagem; e ii) tipo de técnicas utilizadas para realização dessa atividade. E. 
Nível de Envolvimento do Usuário: Amplo ou Pontual O envolvimento do usuário é amplo quando há participação efetiva do usuário na maior parte das atividades propostas pela abordagem, sem que ele seja auxiliado por qualquer tipo de recurso ou artefato que vise a facilitar o seu trabalho. Essa participação efetiva pode ocorrer por meio da: i) inclusão de informações extras; ii) realização de análises sobre artefatos de entrada e/ou saída; e iii) tradução de informações de um formato para outro. Um exemplo de participação efetiva do usuário ocorre quando ele deve fornecer informações adicionais, além daquelas constantes no documento de requisitos do sistema para identificação dos interesses do sistema (por exemplo, um conjunto de palavras-chave a ser confrontado com o texto do documento de requisitos). Outro exemplo seria se a representação de interesses do sistema fosse feita manualmente, pelo usuário, de acordo com algum template pré-estabelecido (em um arquivo XML ou diagrama da UML). Um envolvimento pontual significa que o usuário pode intervir no processo da EROA para tomar certos tipos de decisões. Por exemplo, resolver um conflito entre dois interesses que se relacionam. Sua participação, porém, tem a finalidade de realizar atividades de níveis mais altos de abstração, que dificilmente poderiam ser automatizadas. É importante analisar tal critério, pois o tipo de envolvimento do usuário pode impactar diretamente na escalabilidade da abordagem e na produtividade proporcionada pela mesma. O envolvimento excessivo do usuário pode tornar a abordagem mais dependente da sua experiência e propensa a erros. Para comparação das abordagens com base neste critério, deve-se observar: i) o tipo de envolvimento do usuário exigido pela abordagem; e ii) a descrição das atividades que o usuário deve desempenhar. F. Escalabilidade Com esse critério, pretende-se conhecer qual é o porte dos sistemas para os quais a abordagem em análise tem sido aplicada. Embora algumas abordagens atendam satisfatoriamente a sistemas de pequeno porte, não há garantias que elas sejam eficientes para sistemas de médio e grande porte. Os problemas que podem surgir quando o tamanho do sistema cresce muito, em geral, estão relacionados: i) à complexidade dos algoritmos utilizados pela abordagem; ii) à necessidade de envolvimento do usuário, que dependendo do esforço requisitado, pode tornar impraticável a aplicação da abordagem em sistema de maior porte; e iii) à degradação da cobertura e precisão da abordagem; entre outros. Para esse critério as seguintes informações devem ser coletadas: i) o nome do sistema utilizado no estudo de caso em que a abordagem foi avaliada; ii) os tipos de documentos utilizados; iii) as medidas de tamanho/complexidade do sistema (em geral, quando se trata de documentos textuais, os tamanhos são apresentados em números de páginas e/ou palavras); e iv) a referência da publicação na qual foi relatada a aplicação desse sistema à abordagem em questão. G. Apoio Computacional Para quais de suas atividades a abordagem em análise oferece apoio computacional? Essa informação é importante, principalmente, se o tipo de envolvimento dos usuários exigido pela abordagem for amplo. Em muitos casos, durante a avaliação de uma abordagem para EROA, percebe-se um relacionamento direto entre os critérios “Tipo de Envolvimento do Usuário” e “Apoio Computacional”. 
Se a abordagem exige envolvimento amplo do usuário, consequentemente, ele deve possuir fraco apoio computacional; se exige envolvimento pontual, possivelmente deve oferecer apoio computacional adequado. Porém, essa relação precisa ser observada com cuidado, pois pode haver casos em que o fato de uma atividade exigir envolvimento pontual do usuário para sua execução não esteja diretamente ligado à execução automática da mesma. Por exemplo, sejam A e B duas abordagens para EROA que exijam que o usuário informe um conjunto de palavraschave para identificação de interesses em um documento de requisitos. A abordagem A possui um apoio computacional que varre o texto do documento de requisitos, selecionando algumas palavras mais relevantes (utilizando-se de técnicas de PLN) que possam ser utilizadas pelo engenheiro de software como palavras-chave. A abordagem B não possui apoio computacional algum, porém disponibiliza uma ontologia com termos do domínio do sistema em análise e um dicionário de sinônimos desses termos, que podem ser utilizados pelo engenheiro de software como diretrizes para elencar o conjunto de palavras-chave exigido pela abordagem. Neste caso, as duas abordagens poderiam ser classificadas como pontuais com relação ao critério “Tipo de Envolvimento do Usuário”, mesmo B não possuindo apoio computacional. Entretanto, um engenheiro de software que esteja utilizando a abordagem A, provavelmente, terminará a tarefa de definição do conjunto de palavras-chave em menor tempo do que outro que esteja utilizando a abordagem B. Assim, deve-se conhecer quais atividades da abordagem para EROA são automatizadas. Para cada abordagem comparada com esse critério deve-se obter: i) as atividades da EROA contempladas pela abordagem em questão; ii) os nomes dos apoios computacionais utilizados para automatização dessas atividades; e iii) a referência da publicação, na qual o apoio computacional foi proposto/apresentado. Uma abordagem pode oferecer mais de um apoio computacional para uma mesma atividade. H. Tipo de Avaliação da Abordagem A quais tipos de avaliação a abordagem em questão para EROA têm sido submetida? Para as avaliações realizadas, há um relatório adequado sobre a acurácia da abordagem, ressaltando detalhes importantes como cobertura, precisão e tempo necessário para execução das atividades dessa abordagem? Para Wohlin et al. [22], a avaliação qualitativa está relacionada à pesquisa sobre o objeto de estudo, sendo os resultados apresentados por meio de informações descritas em linguagem natural, como neste artigo. A avaliação quantitativa, geralmente, é conduzida por meio de estudos de caso e experimentos controlados, e os dados obtidos podem ser comparados e analisados estatisticamente. Estudos de caso e experimentos visam a observar um atributo específico do objeto de estudo e estabelecer o relacionamento entre atributos diferentes, porém, em experimentos controlados o nível de controle é maior do que nos estudos de caso. Para este critério deve-se destacar: i) o(s) tipo(s) de avaliação(ões) realizada(s) sobre a abordagem, listando a referência da avaliação conduzida (os tipos de avaliação são: qualitativa, estudo de caso e experimento controlado); e ii) os resultados obtidos com essa(s) avaliação(ões) realizada(s). Para o item (ii) sugere-se a coleta dos valores médios obtidos para as seguintes métricas: cobertura, precisão e tempo de aplicação da abordagem. 
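A título de referência, assumem-se aqui as definições usuais dessas métricas em recuperação da informação; a notação I_rel (conjunto de interesses de referência, o oráculo) e I_id (conjunto de interesses identificados pela abordagem) é introduzida apenas para esta ilustração:

\[
\text{cobertura} = \frac{|I_{rel} \cap I_{id}|}{|I_{rel}|}, \qquad
\text{precis\~ao} = \frac{|I_{rel} \cap I_{id}|}{|I_{id}|}
\]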
Tais métricas foram sugeridas, pois são amplamente utilizadas para medição da eficácia de produtos e processos em diversas áreas de pesquisa, tais como recuperação da informação e processamento de linguagem natural, entre outras. Na área de EROA, essas métricas têm sido utilizadas em trabalhos relacionados à identificação de interesses tanto em nível de código [21], quanto em nível de requisitos [2][15]. A análise conjunta dos dados deste critério com os do critério "Escalabilidade" pode revelar informações importantes sobre a eficácia e eficiência de uma abordagem para EROA.
V. AVALIAÇÃO DAS ABORDAGENS PARA EROA
As abordagens para EROA apresentadas na Seção 3 foram comparadas com base nos critérios apresentados na Seção 4. As siglas utilizadas para o nome das abordagens são: i) SMI - Separação Multidimensional de Interesses; ii) EA-Miner - Early-Aspect Mining; iii) Theme - Abordagem Theme; iv) EROA/XML - Processo baseado em XML para Especificação e Composição de Interesses Transversais; v) EROA/PV - EROA baseada em Pontos de Vista; e vi) AOCRE - Aspect-Oriented Component Requirements Engineering. Na Tabela 1 encontra-se a avaliação dessas abordagens quanto ao tipo de simetria, com breve justificativa para o tipo escolhido.
Tabela 1. TIPO DE SIMETRIA DAS ABORDAGENS PARA EROA.
| Abordagem | Tipo de Simetria | Descrição |
| SMI | Simétrica | Tanto os interesses transversais quanto os não transversais são tratados de modo uniforme. Todos são denominados "interesses" e podem influenciar/restringir uns aos outros. |
| EA-Miner | Assimétrica | Os interesses transversais são tratados como aspectos e os não transversais, como pontos de vista. |
| Theme | Assimétrica | Os interesses transversais são tratados como temas transversais e os não transversais, como temas base. |
| EROA/XML | Simétrica | Tanto os interesses transversais quanto os não transversais são tratados apenas como interesses (concerns). |
| EROA/PV | Assimétrica | Os interesses transversais são tratados como aspectos e os não transversais, como pontos de vista. |
| AOCRE | Assimétrica | Os interesses transversais são tratados como IPSs e os não transversais, como diagramas de sequência. |
Todas as abordagens analisadas foram consideradas como de propósito geral, pois contemplam tanto interesses funcionais quanto não funcionais. Quanto à cobertura, SMI, EA-Miner e EROA/PV são completas, uma vez que atendem às principais atividades da EROA definidas no processo da Figura 1. As abordagens Theme, EROA/XML e AOCRE foram consideradas parciais, uma vez que não apresentam apoio à atividade de Análise e Resolução de Conflitos. As técnicas utilizadas por cada atividade das abordagens comparadas são descritas na Tabela 2. Nota-se que as técnicas mais utilizadas para a atividade "Identificação de Interesses" são o uso de palavras-chave e catálogos para interesses não funcionais. A técnica baseada em palavras-chave é fortemente dependente da experiência dos engenheiros de software que a aplicam. Por exemplo, um profissional com pouca experiência no domínio do software em análise ou sobre os conceitos de interesses transversais pode gerar conjuntos vagos de palavras-chave, que podem gerar muitos falsos positivos/negativos. Além disso, técnicas como essas são ineficazes para detecção de interesses implícitos, isto é, interesses que não aparecem descritos no texto do documento de requisitos. Já para "Representação" e "Composição" de interesses, a maioria das abordagens optou por criar seus próprios modelos de representação e composição de interesses utilizando para isso a linguagem XML.
O uso de XML é, muitas vezes, justificado pelos autores das abordagens por permitir a definição/representação de qualquer tipo de informação de forma estruturada e por ser uma linguagem robusta e flexível. Outra forma de representação, utilizada pelas abordagens Theme e AOCRE, ocorre por meio de modelos bem conhecidos da UML, como diagramas de sequência e de estados, para a realização dessas atividades. Para a atividade "Análise e Resolução de Conflitos", também parece haver um consenso na utilização de matrizes de contribuição e templates XML.
Tabela 2. TÉCNICAS UTILIZADAS PARA REALIZAÇÃO DAS ATIVIDADES CONTEMPLADAS PELAS ABORDAGENS PARA EROA.
| Abordagem | Identificação de Interesses | Representação de Interesses | Composição de Interesses | Análise e Resolução de Conflitos |
| 1) Theme | Palavras-chave e técnicas de visualização | Temas e diagramas UML | Temas e templates UML | - |
| 2) EA-Miner | Palavras-chave e catálogo de INF | Templates XML | Regras de composição e templates XML | Matriz de contribuição e templates XML |
| 3) SMI | Catálogo de INF estendido | Template XML | Regras de composição e templates XML | Matriz de contribuição e templates XML |
| 4) EROA/XML | Catálogo de INF | Template XML | Regras de composição e templates XML | - |
| 5) EROA/PV | Pontos de vista e matriz de relacionamento | Templates XML | Regras de composição e templates XML | Matriz de contribuição e templates XML |
| 6) AOCRE | Casos de uso e template específico para INF | Diagramas de sequência e IPSs | Diagramas de sequência, IPSs e máquinas de estado | - |
Legenda: 1) Theme; 2) EA-Miner; 3) SMI; 4) EROA/XML; 5) EROA/PV; 6) AOCRE; INF: interesses não funcionais.
O tipo de envolvimento do usuário requerido pelas abordagens analisadas é apresentado na Tabela 3. A maioria das abordagens (SMI, Theme, EROA/XML, EROA/PV e AOCRE) foi classificada como as que exigem envolvimento amplo de seus usuários. Isto pode ser um fator impactante na escalabilidade e na acurácia da abordagem, quando sistemas de larga escala forem analisados com elas. A abordagem EA-Miner, entretanto, requer interferência pontual do usuário, sendo a sua participação em atividades mais estratégicas do que mecânicas. Isso se deve, em parte, à utilização de uma suíte de ferramentas computacionais de apoio à execução desta abordagem.
Tabela 3. TIPO DE ENVOLVIMENTO DO USUÁRIO REQUERIDO PELAS ABORDAGENS PARA EROA.
| Abordagem | Envolvimento | Atividades Desenvolvidas |
| SMI | A | Especificação dos interesses concretos do sistema a partir de um conjunto de metainteresses; representação dos interesses em templates de arquivos XML; definição de regras de composição; definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s). |
| EA-Miner | P | Tomada de decisão com relação às palavras ambíguas detectadas no documento de requisitos; definição de regras de composição; definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s). |
| Theme | A | Definição de um conjunto de ações e entidades-chave; análise manual das visualizações geradas pela abordagem com o objetivo de encontrar interesses base e transversais; construção manual dos temas a partir das visualizações geradas pela abordagem; definição de regras de composição. |
| EROA/XML | A | Identificação manual dos interesses do sistema; representação dos interesses em templates de arquivos XML; definição de regras de composição. |
| EROA/PV | A | Identificação manual dos interesses do sistema; representação dos interesses em templates de arquivos XML; definição de regras de composição; definição da contribuição (positiva ou negativa) e das prioridades de um interesse sobre o(s) outro(s). |
| AOCRE | A | Identificação manual dos interesses do sistema; representação dos interesses em cenários; definição de diagramas de sequência e IPSs. |
Legenda: A (Amplo); P (Pontual).
Na Tabela 4 estão descritas as ferramentas disponibilizadas por cada abordagem para automatização de suas atividades. Nota-se que a abordagem mais completa em termos de apoio computacional é a EA-Miner, pois todas as suas atividades são automatizadas em partes ou por completo. Por exemplo, a atividade de composição de interesses é totalmente automatizada pela ferramenta ARCADE. O usuário precisa apenas selecionar os interesses a serem compostos e toda regra de composição é gerada automaticamente. ARCADE trabalha com base nos conceitos da abordagem SMI, automatizando as suas atividades. Nota-se ainda que as atividades melhor contempladas com recursos computacionais são "Representação" e "Composição de Interesses". Dessa forma, as atividades para EROA que exigem maior atenção da comunidade científica para confecção de apoios computacionais são "Identificação de Interesses" e "Análise e Resolução de Conflitos". A aplicação dos critérios escalabilidade e tipo de avaliação às abordagens analisadas é apresentada nas Tabelas 5 e 6.
Tabela 4. APOIO COMPUTACIONAL DAS ABORDAGENS PARA EROA.
| Abordagem | Atividade | Apoio Computacional | Ref |
| Theme | Representação de Interesses | Plugin Eclipse Theme/UML | [9] |
| EA-Miner | Identificação de Interesses | EA-Miner, WMATRIX e RAT (Requirement Analysis Tool) | [4] |
| EA-Miner | Triagem | KWIP (Key Word In Phrase) | [4] |
| EA-Miner | Representação e Composição de Interesses e Análise e Resolução de Conflitos | ARCADE | [4] |
| SMI | Representação e Composição de Interesses e Análise e Resolução de Conflitos | ARCADE | [4] |
| EROA/XML | Representação e Composição de Interesses | APOR (AsPect-Oriented Requirements tool) | [7] |
| EROA/PV | Representação e Composição de Interesses e Análise e Resolução de Conflitos | ARCADE | [4] |
| AOCRE | Composição de Interesses | Algoritmo proposto pelos autores | [11] |
Tabela 5. ESCALABILIDADE DAS ABORDAGENS PARA EROA.
| Abordagem | Sistema | Documentos Utilizados | Tamanho | Ref |
| SMI, EROA/XML e EROA/PV | Health Watcher | Documento de Requisitos e Casos de Uso | 19 páginas; 3.900 palavras | [2] |
| EA-Miner e Theme | Complaint System | Documento de Requisitos | 230 páginas | [15] |
| EA-Miner e Theme | ATM System | Documento de Requisitos | 65 páginas | [15] |
SMI, Theme e EROA/PV são as abordagens mais avaliadas, tanto qualitativamente quanto quantitativamente. Isso ocorre, pois essas foram algumas das primeiras abordagens para EROA. Outros pontos interessantes são que: i) EROA/XML não havia ainda sido avaliada qualitativamente, de acordo com a revisão de literatura realizada neste trabalho; e ii) não foram encontrados estudos quantitativos que contemplassem a abordagem AOCRE. Quanto à escalabilidade, ressalta-se que a maioria delas, com exceção da AOCRE, foi avaliada com documentos de requisitos de médio e grande porte. EA-Miner e Theme foram avaliadas com documentos de requisitos mais robustos (295 páginas de documentos, no total). Com os valores presentes na Tabela 6 percebe-se que as abordagens SMI, Theme, EROA/XML e EROA/PV apresentaram os maiores tempos para execução das atividades da EROA em proporção ao tamanho do documento de requisitos. EA-Miner foi classificada neste trabalho como a única abordagem que exige envolvimento pontual de seus usuários.
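A título de ilustração (valores aproximados, derivados das Tabelas 5 e 6 deste trabalho, considerando o Complaint System para EA-Miner e Theme e o Health Watcher para as demais), a normalização do tempo pelo tamanho do documento de requisitos explicita essa diferença:

\[
\text{EA-Miner: } \frac{70}{230} \approx 0{,}3 \;\text{min/p\'agina}; \quad
\text{Theme: } \frac{760}{230} \approx 3{,}3; \quad
\text{SMI: } \frac{104}{19} \approx 5{,}5; \quad
\text{EROA/XML: } \frac{173}{19} \approx 9{,}1; \quad
\text{EROA/PV: } \frac{62}{19} \approx 3{,}3
\]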
Infere-se que, pelo envolvimento pontual de seus usuários, ela apresentou os melhores resultados com relação ao tempo para realização das atividades da EROA. Com base nos estudos de casos realizados observa-se que quanto à cobertura e à precisão das abordagens, a identificação de interesses base é melhor do que a de interesses transversais. A justificativa para isso é que interesses base são mais conhecidos e entendidos pela comunidade científica [2]. Além disso, tais requisitos aparecem no documento de requisitos de forma explícita, mais bem localizada e isolada, facilitando sua identificação. Dessa forma, a atividade de identificação de interesses em documentos de requisitos configura-se ainda um problema de pesquisa relevante e desafiador e que merece a atenção da comunidade científica. VI. TRABALHOS RELACIONADOS A literatura contém diversos trabalhos com o objetivo de avaliar qualitativa ou quantitativamente as abordagens para EROA. Herrera et al. [15] apresentaram uma análise quanto à acurácia das abordagens EA-Miner e Theme, quando são utilizados documentos de requisitos de dois sistemas de software reais. As métricas relacionadas à eficácia e à eficiência das abordagens, como cobertura, precisão e tempo, foram as que receberam maior enfoque. Sendo assim, poucos aspectos qualitativos das abordagens analisadas foram levantados, como tipo de simetria, cobertura, entre outros. Nessa mesma linha, Sampaio et al. [2] apresentaram um estudo quantitativo para as abordagens: EROA/PV, SMI, EROA/XML e Goal-based AORE. Foi avaliada a acurácia e a eficiência dessas abordagens. Por se tratarem de abordagens com características bem distintas, os autores elaboraram também um mapeamento entre os principais conceitos delas e criaram um esquema de nomenclatura comum para EROA. Outros trabalhos, em formato de surveys [1][16][17] [19][20], foram propostos com o intuito de comparar abordagens para EROA, descrevendo as principais características de cada abordagem. Entretanto, cada um desses trabalhos considerou apenas um conjunto restrito e distinto de características dessas abordagens, criando assim, um gap que dificulta a compreensão mais abrangente das características comuns e específicas de cada abordagem. Singh e Gill [20] e Chitchyan et al. [1] fizeram a caracterização de algumas abordagens para EROA, sem utilizar um conjunto de critérios. Bakker et al. [19] compararam algumas abordagens com relação: i) ao objetivo da abordagem; ii) às atividades contempladas; iii) ao apoio computacional oferecido; iv) aos artefatos utilizados; e v) à rastreabilidade. Porém, não há informações sobre a acurácia dessas abordagens, nem sobre os estudos avaliativos realizados com elas. Bombonatti e Melnikoff [16] compararam as abordagens considerando apenas os tipos de interesses (funcionais ou não funcionais) e atividades da EROA contemplados por essas abordagens. Rashid et al. [17] comparam as abordagens para EROA sob o ponto de vista dos objetivos da Engenharia de Requisitos, separação de interesses, rastreabilidade, apoio à verificação de consistência, entre outros. A principal diferença deste trabalho em relação aos demais comentados anteriormente está no fato de que o conjunto de critérios proposto contempla não apenas os pontos qualitativos comuns e específicos das abordagens para EROA analisadas, mas proporciona um vínculo com informações quantitativas obtidas por outros pesquisadores em trabalhos relacionados. Tabela 6. TIPOS DE AVALIAÇÃO REALIZADOS COM AS ABORDAGENS PARA EROA. 
| Abordagem | Q | EC | EXC | Cobertura | Precisão | Tempo |
| SMI | [1][16][17][20] | [2] | - | IB: 100%; ITF: 50%; ITNF: 70% | IB: 88%; ITF: 100%; ITNF: 77% | 104 min |
| EA-Miner | [1][17] | [15] | - | Complaint System: IB: 64%; ITF: 64%; ITNF: 45%. ATM System: IB: 86%; ITF: 80%; ITNF: 100% | Complaint System: IB: 31%; ITF: 78%; ITNF: 71%. ATM System: IB: 35%; ITF: 63%; ITNF: 71% | Complaint System: 70 min; ATM System: 140 min |
| Theme | [1][17][19][20] | [15] | - | Complaint System: IB: 73%; ITF: 55%; ITNF: 73%. ATM System: IB: 86%; ITF: 73%; ITNF: 40% | Complaint System: IB: 48%; ITF: 86%; ITNF: 80%. ATM System: IB: 50%; ITF: 91%; ITNF: 50% | Complaint System: 760 min; ATM System: 214 min |
| EROA/XML | - | [2] | - | IB: 100%; ITF: 50%; ITNF: 55% | IB: 88%; ITF: 100%; ITNF: 100% | 173 min |
| EROA/PV | [1][16][17][19][20] | [2] | - | IB: 100%; ITF: 0%; ITNF: 100% | IB: 70%; ITF: 0%; ITNF: 83% | 62 min |
| AOCRE | [1][17][20] | - | - | - | - | - |
Legenda: IB: Interesses Base; ITF: Interesses Transversais Funcionais; ITNF: Interesses Transversais Não Funcionais. Q: Qualitativa; EC: Estudo de Caso; EXC: Experimento Controlado.
Além disso, tais critérios compreendem um framework comparativo que pode ser estendido para contemplar outros tipos de critérios relacionados à área de ER.
VII. CONSIDERAÇÕES FINAIS
A grande variedade de abordagens para EROA existentes na literatura, com características diferentes, tem tornado difícil a escolha da mais adequada às necessidades dos usuários. Este trabalho apresentou um conjunto de critérios para comparação de abordagens para EROA, concebidos com base nas características comuns e especificidades das principais abordagens disponíveis na literatura, bem como em trabalhos científicos que avaliaram algumas dessas abordagens. Além disso, realizou-se a aplicação desses critérios sobre seis abordagens bem conhecidas. Essa comparação pode servir como guia para que o engenheiro de software escolha a abordagem para EROA mais adequada às suas necessidades. Também foram destacados alguns dos pontos fracos das abordagens analisadas, como, por exemplo, a baixa precisão e cobertura para interesses transversais não funcionais. Como trabalhos futuros, pretende-se: i) expandir o conjunto de critérios aqui apresentado a fim de se contemplar características específicas para cada uma das fases da EROA; ii) aplicar o conjunto de critérios expandido às abordagens já analisadas, com o intuito de se obter novas informações sobre elas, bem como a novos tipos de abordagens existentes na literatura; iii) desenvolver uma aplicação web que permita aos engenheiros de software e pesquisadores da área de EROA pesquisarem e/ou divulgarem seus trabalhos utilizando o conjunto de critérios elaborados; e iv) por último, propor uma nova abordagem que reutilize os pontos fortes e aprimore os pontos fracos de cada abordagem analisada.
REFERÊNCIAS
[1] Chitchyan, R.; Rashid, A.; Sawyer, P.; Garcia, A.; Alarcon, M. P.; Bakker, J.; Tekinerdogan, B.; Clarke, S.; Jackson, A. "Report synthesizing state-of-the-art in aspect-oriented requirements engineering, architectures and design". Technical Report, Lancaster University: Lancaster, p. 1-259, 2005.
[2] Sampaio, A.; Greenwood, P.; Garcia, A. F.; Rashid, A. "A Comparative Study of Aspect-Oriented Requirements Engineering Approaches". In: 1st International Symposium on Empirical Software Engineering and Measurement (ESEM '07), p. 166-175, 2007.
[3] Dijkstra, E. W. "A Discipline of Programming". Pearson Prentice Hall, 217 p., ISBN: 978-0132158718, 1976.
[4] Chitchyan, R.; Sampaio, A.; Rashid, A.; Rayson, P. "A tool suite for aspect-oriented requirements engineering". In: International Workshop on Early Aspects at ICSE, ACM, p. 19-26, 2006.
[5] Sampaio, A.; Chitchyan, R.; Rashid, A.; Rayson, P. "EA-Miner: a Tool for Automating Aspect-Oriented Requirements Identification". In: International Conference on Automated Software Engineering (ASE), ACM, p. 353-355, 2005.
[6] WMATRIX. Corpus Analysis and Comparison Tool. Disponível em: http://ucrel.lancs.ac.uk/wmatrix/. Acessado em: abril de 2013.
[7] Soeiro, E.; Brito, I. S.; Moreira, A. "An XML-Based Language for Specification and Composition of Aspectual Concerns". In: 8th International Conference on Enterprise Information Systems (ICEIS), 2006.
[8] Moreira, A.; Rashid, A.; Araújo, J. "Multi-Dimensional Separation of Concerns in Requirements Engineering". In: 13th International Conference on Requirements Engineering (RE), p. 285-296, 2005.
[9] Baniassad, E.; Clarke, S. "Theme: An approach for aspect-oriented analysis and design". In: 26th International Conference on Software Engineering (ICSE'04), 2004.
[10] Clarke, S.; Baniassad, E. "Aspect-Oriented Analysis and Design: The Theme Approach". Addison-Wesley, 2005.
[11] Whittle, J.; Araújo, J. "Scenario Modeling with Aspects". IEE Proceedings - Software, v. 151(4), p. 157-172, 2004.
[12] Yu, Y.; Leite, J. C. S. P.; Mylopoulos, J. "From Goals to Aspects: Discovering Aspects from Requirements Goal Models". In: International Conference on Requirements Engineering (RE), 2004.
[13] Rashid, A.; Moreira, A.; Araújo, J. "Modularisation and composition of aspectual requirements". In: 2nd International Conference on Aspect-Oriented Software Development (AOSD'03), ACM, 2003.
[14] Rashid, A.; Sawyer, P.; Moreira, A.; Araújo, J. "Early Aspects: a Model for Aspect-Oriented Requirements Engineering". In: International Conference on Requirements Engineering (RE), 2002.
[15] Herrera, J. et al. "Revealing Crosscutting Concerns in Textual Requirements Documents: An Exploratory Study with Industry Systems". In: 26th Brazilian Symposium on Software Engineering, p. 111-120, 2012.
[16] Bombonatti, D. L. G.; Melnikoff, S. S. S. "Survey on early aspects approaches: non-functional crosscutting concerns integration in software systems". In: 4th World Scientific and Engineering Academy and Society (WSEAS), Wisconsin, USA, p. 137-142, 2010.
[17] Rashid, A.; Chitchyan, R. "Aspect-oriented requirements engineering: a roadmap". In: 13th International Workshop on Early Aspects (EA), p. 35-41, 2008.
[18] Chung, L.; Leite, J. C. S. P. "Non-Functional Requirements in Software Engineering". Springer, 441 p., 2000.
[19] Bakker, J.; Tekinerdoğan, B.; Aksit, M. "Characterization of Early Aspects Approaches". In: Early Aspects: Aspect-Oriented Requirements Engineering and Architecture Design, 2005.
[20] Singh, N.; Gill, N. S. "Aspect-Oriented Requirements Engineering for Advanced Separation of Concerns: A Review". International Journal of Computer Science Issues (IJCSI), v. 8(5), 2011.
[21] Kellens, A.; Mens, K.; Tonella, P. "A survey of automated code-level aspect mining techniques". Transactions on Aspect-Oriented Software Development IV, v. 4640, p. 143-162, 2007.
[22] Wohlin, C.; Runeson, P.; Höst, M.; Regnell, B.; Wesslén, A. "Experimentation in Software Engineering: an Introduction". 2000.
[23] France, R.; Kim, D.; Ghosh, S.; Song, E. "A UML-based pattern specification technique". IEEE Transactions on Software Engineering, v. 30(3), p. 193-206, 2004.
Using Transformation Rules to Align Requirements and Architectural Models
Monique Soares, Carla Silva, Gabriela Guedes, Jaelson Castro, Cleice Souza, Tarcisio Pereira
Centro de Informática
Universidade Federal de Pernambuco - UFPE
Recife, Brasil
{mcs4, ctlls, ggs, jbc, tcp}@cin.ufpe.br
Abstract: In previous works we have defined the STREAM strategy to align requirements and architectural models. It includes four activities and several transformation rules that could be used to support the systematic generation of a structural architectural model from goal-oriented requirements models. The activities include the Preparation of Requirements Models, Generation of Architectural Solutions, Selection of Architectural Solution and Refinement of the Architecture. The first two activities are time consuming and rely on four horizontal and four vertical transformation rules which are currently performed manually, requiring much attention from the analyst. For example, the first activity consists of the refactoring of the goal models, while the second one derives architectural models from the refactored i* (iStar) models. In this paper we automate seven out of the eight transformation rules of the first two activities of the STREAM approach. The transformation language used to implement the rules was QVTO. We rely on a running example to illustrate the use of the automated rules. Hence, our approach has the potential to improve the process productivity and the quality of the models produced.
Keywords: Requirements Engineering, Software Architecture, Transformation Rules, Automation
I. INTRODUCTION
STREAM (A STrategy for Transition between REquirements Models and Architectural Models) is a systematic approach to integrate requirements engineering and architectural design activities, based on model transformation, to generate architectural models from requirements models [1]. It generates structural architectural models, described in Acme [4] (the target language), from goal-oriented requirements models, expressed in i* (iStar) [3] (i.e. the source language). This approach has four activities, namely: Prepare Requirements Models, Generate Architectural Solutions, Select Architectural Solution and Refine Architecture. The first two activities are time consuming and rely on horizontal and vertical transformation rules (HTRs and VTRs), respectively. Currently, these transformation rules are applied manually, requiring much attention from the analyst. However, they are amenable to automation, which could not only reduce the human effort required to generate the target models, but also minimize the number of errors produced during the process. Hence, our proposal is to use the QVT [2] transformation language to properly define the rules, and also to develop tool support to execute them. Therefore, two research questions are addressed by this paper: Is it possible to automate the transformation rules defined in the first two STREAM activities, namely Prepare Requirements Models and Generate Architectural Solutions? And, if so, how could these rules be automated? Hence, the main objective of this paper is to automate the transformation rules defined by the first two phases of the STREAM process 1. To achieve this goal it is necessary to: describe the transformation rules using a suitable transformation language; make the vertical and horizontal transformation rules compatible with the modeling environment used to create the goal-oriented requirements models, i.e.
the iStarTool [6]; and make the vertical transformation rules compatible with the modeling environment used to create the structural architectural models, i.e. the AcmeStudio [4]. In order to automate the HTRs and VTRs proposed by the STREAM process, it was necessary to choose a language that could properly describe the transformation rules and transform the models used in the STREAM approach. We opted for the QVTO (Query/View/Transformation Operational) language [2], a transformation language that is integrated with the Eclipse environment [16] and that is well supported and maintained. Note that the input of the first activity of the STREAM process is an i* goal model. The iStarTool [6] is used to generate the XMI file of the goal-oriented requirements model. This file is read by the Eclipse QVTO plugin, which generates the XMI file of the Acme architectural model. Note that this file is consistent with the metamodel created in accordance with the AcmeStudio tool. The rest of the paper is organized as follows. Section II presents the theoretical background. Section III describes the horizontal transformation rules in QVTO. In Section IV, we present the vertical transformation rules in QVTO. In order to illustrate our approach, in Section V we use the BTW example [10]. Section VI presents some related works. Finally, Section VII concludes the paper with a brief explanation of the contributions achieved and the proposal of future work.
1 Note that it is out of the scope of this paper to support the other two phases of the approach (Select Architectural Solution, Refine Architecture).
TABLE I. EXAMPLE OF HORIZONTAL TRANSFORMATION RULES ADAPTED FROM [8]. (For each rule, HTR1 to HTR4, the table shows the original i* model and the resulting model after applying the rule; the graphical i* models are not reproduced here.)
II. BACKGROUND
In this section we present the baseline of this research: the original rules from the STREAM approach and the model transformation language (QVT) used to implement the HTRs and VTRs of STREAM.
A. STREAM
STREAM is a systematic approach to generate architectural models from requirements models based on model transformation [1]. The source and target modelling languages are i* for requirements modelling and Acme for architectural description, respectively. The STREAM process consists of the following activities: 1) Prepare requirements models, 2) Generate architectural solutions, 3) Choose an architectural solution and 4) Derive architecture. Horizontal Transformation Rules (HTRs) are part of the first activity. They are useful to increase the modularity of the i* requirements models. Vertical Transformation Rules (VTRs) are proposed in the second activity. They are used to derive architectural models from the modularized i* requirements model. Non-functional requirements (NFRs) are used in the third activity to select one of the possible architectural descriptions obtained. Depending on the NFR to be satisfied, some architectural patterns can be applied in activity 4. The first STREAM activity is concerned with improving the modularity of the expanded system actor. It allows the delegation of different parts of a problem to different software actors (instead of having a unique software actor).
In particular, it is sub-divided into three steps: (i) analysis of internal elements (identify which internal elements can be extracted from the original software actor and relocated to a new software actor); (ii) application of horizontal transformation rules (the actual extraction and relocation of the identified internal elements); and (iii) evaluation of the i* model (checking whether the model needs to be modularized again, i.e., returning to step 1). In order to develop these steps, it is necessary to use, respectively:
• Heuristics to guide the decomposition of the software actor;
• A set of rules to transform i* models;
• Metrics for assessing the degree of modularization of both the initial and the modularized i* models.
This is a semi-automatic process, since not all the activities can be automated. For example, step 1 of the first activity cannot be automated because the analyst is the one in charge of choosing the sub-graph to be moved to another actor. The Horizontal Transformation Rule 1 (HTR1) moves a previously selected sub-graph. Hence, HTR1 cannot be fully automated because it always depends on the sub-graph chosen by the analyst. Observe that, after applying HTR1, the resulting model may not be in compliance with the i* syntax. The next HTRs are intended to correct possible syntax errors. The Horizontal Transformation Rule 2 (HTR2) moves a means-end link crossing the actor's boundary. HTR2 considers the situation where the sub-graph moved to another actor has the root element as a "means" in a means-end relationship. The Horizontal Transformation Rule 3 (HTR3) moves a contribution link crossing the actor's boundary. HTR3 considers the situation where the sub-graph moved to another actor has a contribution relationship with other elements that were not moved. The Horizontal Transformation Rule 4 (HTR4) moves a task-decomposition link crossing the actor's boundary. HTR4 considers the situation where the moved sub-graph has a task-decomposition relationship with other elements that were not moved. Table 1 shows examples of these rules. The graph to be moved in HTR1 is highlighted with a dashed line and labelled with G. The transformation rules are intended to delegate internal elements of the software actor to other (new) software actors. This delegation must ensure that the new actors have a dependency relationship with the original actor. Thus, the original model and the final model are supposed to be semantically equivalent. At the end of the first activity, the actors representing the software are easier to understand and maintain, since there are more actors with fewer internal elements. In the second STREAM activity (Derive Architectural Solutions), transformation rules are used to transform an i* requirements model into an initial Acme architectural model. In this case, we use the VTRs. In order to facilitate the understanding, we have separated the vertical transformation rules into four rules. VTR1 maps the i* actors into Acme components. VTR2 maps the i* dependencies into Acme connectors. VTR3 maps a depender actor as a required port of an Acme connector. And, last but not least, VTR4 maps the dependee actor to a provided port of an Acme connector.
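To make the overall shape of such an automation concrete, the sketch below outlines how VTR1 and VTR2 could be organized as a single operational QVTO transformation. It is only an illustrative sketch under assumed, simplified i* and Acme metamodels (the metamodel URIs and the class names Model, Actor, DependencyLink, Component, Connector and Role are placeholders); it is not the authors' implementation, which is detailed in Sections III and IV.

modeltype ISTAR uses 'http://example.org/istar';   -- placeholder URI for an i* metamodel
modeltype ACME uses 'http://example.org/acme';     -- placeholder URI for an Acme metamodel

transformation IStar2Acme(in src : ISTAR, out tgt : ACME);

main() {
    -- VTR1: each i* actor becomes an Acme component
    src.rootObjects()[Model].actors->map toComponent();
    -- VTR2: each i* dependency becomes an Acme connector with a depender and a dependee role
    src.rootObjects()[Model].links->map toConnector();
    -- VTR3/VTR4 (adding required/provided ports for depender/dependee actors) are omitted here
}

mapping Actor::toComponent() : Component {
    name := self.name;
}

mapping DependencyLink::toConnector() : Connector {
    name := self.name;
    roles += object Role { name := 'dependerRole'; };
    roles += object Role { name := 'dependeeRole'; };
}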
Note the goal of this paper is to fully automate three HTRs (HTR2, HTR3 and HTR4) and all VTRs proposed by the STREAM. HTR1 is not amenable to automation. First, we specify them in QVTO [2]. It is worth noting that to create the i* models, we have relied on the iStarTool tool [6]. B. QVT The QVT language has a hybrid declarative/imperative nature. The declarative part is divided into a two-tier architecture, which forms the framework for the execution semantics of the imperative part [5]. It has the following layers: • A user-friendly Relations metamodel and language that supports standard combination of complex object and create the template object. • A Core metamodel and language defined using minimal extensions to EMOF and OCL. In addition to the declarative languages (Relations and Core), there are two mechanisms for invoking imperative implementations of Relations or Core transformations: a standard language (Operational Mappings) as well as nonstandard implementations (Black-box MOF Operation). The QVT Operational Mapping language allows both the definition of transformations using a complete imperative approach (i.e. operational transformations) or it lets hybrid approach in which the transformations are complemented with imperatives operations (which implements the relationships). The operational transformation represents the definition of a unidirectional transformation that is expressed imperatively. This defines a signature indicating the models involved in the transformation and defines an input operation for its implementation (called main). An operational transformation can be instantiated as an entity with properties and operations, such as a class. III. AUTOMATION OF HORIZONTAL TRANSFORMATION The first activity of the STREAM process presents some transformation rules that can be defined precisely using the QVT (Query / View / Transformation) transformation language [5], in conjunction with OCL (Object Constraint Language) [9] to represent constraints. The transformation process requires the definition of transformation rules and metamodels for the source and target languages. The first STREAM activity uses the HTRs, which aim to improve the modularity of the i* models and have the i* language as source and target language. The rules were defined in QVTO and executed through a plugin for the Eclipse platform. Transformations were specified based on the i* language metamodel considered by the iStarTool. In QVT, it is necessary to establish a reference to the metamodel to be used. As explained in section II, the steps of the first activity of the STREAM process (Refactor Models Requirements) are: Analysis of internal elements; Application of horizontal transformation rules, and Evaluation of i* models. The Horizontal Transformation Rules activity takes as input two artefacts: the i* model and the selection of internal elements. The former is the i* system model, and the latter is the choice of elements to be modularized made by Engineer Requirements. The output artefact produced by the activity is refactored and more modularized i* model. Modularization is performed by a set of horizontal transformation rules. Each rule performs a small and located transformation that produces a new model that decomposes the original model. Both the original and the produced model are described in i*. Thus, the four horizontal transformation rules proposed by [8] are capable of implementation. First the analyst uses the iStarTool to produce the i* requirements model. 
Then HTR1 can be performed manually by the analyst, also using the iStarTool. The analyst may choose to move the sub-graph to a new actor or to an existing one, and then moves the sub-graph. This delegation must ensure that the new actors and the original actor have a dependency relationship. Thus, the original model and the final model are supposed to be semantically equivalent. Upon completion of HTR1, the generated artefact is used as input to the automated transformations, which perform all the other HTRs at once whenever the obtained model is syntactically wrong. Table 1 describes the different types of relationship between the components that have been moved to another actor and a member of the actor to which they belonged. If the relationship is a means-end link, HTR2 should be applied; if the relationship is a contribution, HTR3 is used; and where a task-decomposition is present, HTR4 is recommended. In the following subsections we detail how each of these HTRs was implemented in QVTO.
A. HTR2 - Move a means-end link across the actor's boundary
If, after applying HTR1, there is a means-end link crossing the actors' boundaries, HTR2 corrects this syntax error, since means-end links can exist only inside the actor's boundary. The means-end link is usually used to connect a task (means) to a goal (end). Thus, HTR2 makes a copy of the task inside the actor that has the goal, in such a way that the means-end link is now inside the boundary of the actor that has the goal (Actor X in Table 1). After that, the rule establishes a dependency from that copied task to the task inside the new actor (Actor Z in Table 1). To accomplish this rule, HTR2 checks whether there is at least one means-end link crossing the actors' boundaries (line 7 of the code present in Table 2). If so, it then checks whether this means-end link has, as its source and target attributes, elements present in the boundaries of different actors. If this condition holds (line 10), HTR2 creates a copy of the source element inside the boundary of the actor which possesses the target element of the means-end link (atorDaHora variable in line 19). A copy of the same source element is placed outside the actors' boundaries to become a dependum (line 18). Then, a dependency is created from the element copied inside the actor to the dependum element (line 20) and from the dependum element to the original source element of the means-end link that remained inside the other actor (line 21). The result is the same as presented in Table 1 for HTR2. The source code in QVTO for HTR2 is presented in Table 2. TABLE II.
HTR2 DESCRIBED IN QVTO
1 actorResultAmount := oriModel.rootObjects()[Model].actors.name->size();
2 while(actorResultAmount > 0){
3   if(self.actors->at(actorResultAmount).type.=(ActorType::ACTORBOUNDARY)) then {
4     atoresBoundary += self.actors->at(actorResultAmount);
5     var meansend := self.actors->at(actorResultAmount).meansEnd->size();
6     var atorDaHora := self.actors->at(actorResultAmount);
7     while(meansend > 0) {
8       var sourceDaHora := atorDaHora.meansEnd->at(meansend).source.actor;
9       var targetDaHora := atorDaHora.meansEnd->at(meansend).target.actor;
10      if(sourceDaHora <> targetDaHora) then {
11        var atoresBoundarySize := atoresBoundary->size();
12        var otherActor : Actor;
13        while(atoresBoundarySize > 0) {
14          if(atoresBoundary->at(atoresBoundarySize).name <> atorDaHora.name) then {
15            otherActor := atoresBoundary->at(atoresBoundarySize); } else {
16            otherActor := atorDaHora; } endif;
17          atoresBoundarySize := atoresBoundarySize - 1;
18        }; self.elements += object Element{ name := atorDaHora.meansEnd->at(meansend).source.name; type := atorDaHora.meansEnd->at(meansend).source.type;
19        }; atorDaHora.elements += object Element{ name := atorDaHora.meansEnd->at(meansend).source.name; type := atorDaHora.meansEnd->at(meansend).source.type;
20          actor := atorDaHora.name; }; self.links += object DependencyLink { source := atorDaHora.elements->last(); target := self.elements->last(); name := "M"; type := DependencyLinkType::COMMITED;
21        }; self.links += object DependencyLink { source := self.elements->last(); target := otherActor.meansEnd->at(meansend).source; name := "M"; type := DependencyLinkType::COMMITED; }; } endif;
22        meansend := meansend - 1; }; } endif;
23      actorResultAmount := actorResultAmount - 1; };
B. HTR3 - Move a contribution link across the actor's boundary
HTR3 copies a softgoal that was moved back into its source actor, whenever this softgoal is the target of a contribution link coming from some element of its initial actor. The target of the link is moved from the softgoal to its copy in the initial actor. The softgoal is also replicated as the dependum of a dependency link from the original softgoal to its copy. In other words, if an element of some actor has a contribution link to a softgoal that is inside the actor that was created or received elements in HTR1, then this softgoal is copied into the actor that owns the contributing element, and the target of the contribution link becomes that copy. The softgoal is also copied as the dependum of a softgoal dependency between the copy and the original. To accomplish this rule, we analyse whether any contribution link has source and target attributes referring to elements present in different actors. If the actor of the element present in the source or the target is different from the actor referenced in the element's attribute, then this element (a softgoal) is copied to the actor, present in the source or the target, whose name differs from that of the actor under analysis. The target attribute of the contribution link is then made to refer to this copy. The same softgoal is also copied to the modelling stage, and a dependency is created from the softgoal copy to the original softgoal, with the softgoal copied to the stage as the dependum.
C. HTR4 - Move a task-decomposition link across the actor's boundary
HTR4 replicates an element that is a decomposition of a task in another actor, as the dependum of a dependency link between this element and the task, and removes the decomposition link.
If an some actor's element is task decomposition within the actor that was created or received elements in HTR1, then that decomposition link is removed, and a copy of this element will be created and placed during the modelling stage as a dependum of a dependency between the task in the actor created or received elements in HTR1 and the element present in another actor that was the decomposition of this task. The target of this dependence is the element and the source is the task. In order to accomplish this rule, we analyse if any decomposition task link has source and target attributes with elements present in different actors. If the actor of the element present in the source or target is different from the referenced actor in the moves attribute of element, then that element is copied during the modelling stage to create a dependency from the referenced element as the source of decomposition link to the element referenced as the target, i.e., a dependency from the original element of the task, with the copied element to the stage as a dependum. The target of this dependence is the task and the source is the element. The decomposition link is removed. IV. AUTOMATION OF VERTICAL TRANSFORMATIONS The second STREAM activity (Generate Architectural Solutions) uses transformation rules to map i* requirements models into an initial architecture in Acme. As these transformations have different source and target languages, they are called vertical transformations. In order to facilitate the understanding of de VTRs as well as the description of them, we separate the vertical transformation rules in four rules [14]. Below we detail how each of these VTRs was implemented. A. VTR1- Mapping the i* actors in Acme components In order to describe this first VTR it is necessary to obtain the quantity of actors present in the artefact developed in the first activity. From this, we create the same quantity of Acme components (line 3 of code present in Figure 1), giving the same actors name. The Figure 1 shows an excerpt of QVTO code for VTR1. 1 while(actorsAmount > 0) { 2 result.acmeElements += object Component{ 3 name := self.actors.name->at(actorsAmount); } 4 actorsAmount := actorsAmount - 1; } Figure 1. Excerpt of the QVTO code for VTR1 The XMI output file will contain the Acme components within the system represented by the acmeElements tag (line 1 of code present in Figure 2), an attribute of that tag, xsi:type (line 1), that contain information that is a component element and the attribute name the element name, as depicted in Figure 2. Figure 3 shows graphically a component. 1 <acmeElements xsi:type="Acme:Component" name="Advice Receiver"> … 2 </acmeElements> Figure 2. XMI tag of component Figure 3. Acme components linked by connector However, an Acme component has other attributes, not just the name, so it is also necessary to perform the VTR3 and VTR4 rules to obtain the other necessary component attributes. B. VTR2- Mapping the i* dependencies in Acme connectors Each i* dependency creates two XMI tags. One captures the link from depender to the dependum and the other defines the link from the dependum to the dependee. 
1 while(dependencyAmount > 0) { 2 if(self.links.source>includes(self.links.target->at(dependencyAmount)) and self.actors.name->excludes(self.links.target>at(dependencyAmount).name)) then { 3 result.acmeElements += object Connector{ 4 name := self.links.target.name>at(dependencyAmount); 5 roles += object Role{ 6 name := "dependerRole"; }; 7 roles += object Role{ 8 name := "dependeeRole"; }; }; } endif; 9 dependencyAmount := dependencyAmount - 1; }; Figure 4. Excerpt of the QVTO code for VTR2 As seen in Figure 4, for the second vertical rule, which transforms i* dependencies to Acme connectors (line 3 of code present in Figure 4), each i* dependency creates two tags in XMI, one captures the connection from the depender to the dependum (line 5) and another defines the connection from the dependee to the dependum (line 7). In order to map these dependencies into Acme connectors it is necessary to recover the two dependencies tags, observing that the have the same dependum, i.e., the target of a tag must be equal to the source of another tag, which can characterize the dependum. However, they should not consider the actor which plays the role of depender (source) in some dependency and dependee (target) in another. Once this is performed, there are only dependums (intentional elements) left. For each dependum, one Acme connector is created (line 1 of code present in Figure 5). The connector created receives the name of the intentional element that represents the dependum of the dependency link. Two roles are created within the connector, one named dependerRole and another named dependeeRole. The XMI output file will contain the connectors represented by tags (see Figure 5). 1 <acmeElements xsi:type="Acme:Connector" name="Connector"> 2 <roles name="dependerRole"/> 3 <roles name="dependeeRole"/> 4 </acmeElements> Figure 5. Connector in XMI C. VTR3- Mapping depender actors as required port of Acme connector With the VTR3, we map all depender actors (source from some dependency) into a required port of an Acme connector. Thus, we list all model‟s actors that are source in some dependency (line 2 of code present in Figure 6). Furthermore, we create an Acme port for each depender actor (line 3). Each port created has a name and a property (lines 4 and 5), the name is assigned randomly, just to help to control them. The property must have a name and a value, the property name is “Required” once we are creating the required port, as figured in Figure 6. 1 while(dependencyAmount > 0) { 2 if(self.actors.name>includes(self.links.source>at(dependencyAmount).name) and self.actors.name>at(actorsAmount).=(self.links.source>at(dependencyAmount).name)) then { 3 ports += object Port{ 4 name := "port"+countPort.toString(); 5 properties := object Property { 6 name := "Required"; 7 value := "true" }; }; } endif; 8 dependencyAmount := dependencyAmount - 1; 9 countPort := countPort + 1; }; Figure 6. Excerpt of the QVTO code for VTR3 The XMI output file will contain within the component tag (line 1 of code present in Figure 7) the tags of the ports. Inside the port‟s tag there will be a property tag with the name attribute assigned as Required and the attribute value set true (lines 2 to 4). Figure 7 presents an example of a required port in XMI, while Figure 3 shows the graphic representation of the required port (white). 1 <acmeElements xsi:type="Acme:Component" name="Component"> 2 <ports name="port8"> 3 <properties name="Required" value="true"/> 4 </ports> 5 </acmeElements> Figure 7. 
Example of a required port in XMI D. VTR4- Mapping dependee actors as provided port of Acme connector VTR4 is similar to VTR3. We map all dependee actors (target from some dependency) as a provided Acme port of a connector. Thus, we list all model‟s actors that are target in some dependency (line 2 of code present in Figure 8). We create an Acme port for each dependee actor. It has a name and property, the name is assigned randomly, simply to control them (line 4). The property must have a name and a value. The property name is set to “Provided” once we are creating the provided port. Figure 8 presents an QVTO excerpt code for the provided port. The XMI output file will contain within the component a tag to capture the ports. Inside the port‟s tag that are provided there will be a property tag with the name attribute assigned as Provided (line 3 of code present in Figure 9). While the value attribute is set to true and the type attribute as boolean. 1 <acmeElements xsi:type="Acme:Component" name="Advice Giver"> 2 <ports name="port17"> 3 <properties name="Provided" value="true" type="boolean"/> 5 </ports> 6 </acmeElements> Figure 9. Provided port in XMI Figure 3 shows the graphic representation of the provided port (black). V. RUNNING EXAMPLE BTW (By The Way) [10] is a route planning system that helps users with advices on a specific route searched by them. The information is posted by other users and can be filtered for the user to provide only the relevant information 1 while(dependencyAmount > 0) { 2 if(self.actors.name>includes(self.links.target>at(dependencyAmount).name) and self.actors.name>at(actorsAmount).=(self.links.target>at(dependencyAmount).name)) then { 3 ports += object Port{ 4 name := "port"+countPort.toString(); 5 properties := object Property{ 6 name := "Provided"; 7 value := "true"; 8 type := "boolean"; }; }; } endif; 9 countPort := countPort + 1; }; Figure 8. Creation of Provided Port about the place he wants to visit. BTW was an awarded projected presented at the ICSE SCORE competition held in 2009 [11]. In order to apply the automated rules in i* models of this example, it necessary to perform the following steps: 1. Create the i* requirements model using the iStarTool; 2. Use the three heuristics defined by STREAM to guide the selection of the sub-graph to be moved from an actor to another; 3. Manually apply the HTR1, but with support of the iStarTool. The result is an i* model with syntax errors that must be corrected using the automated transformation rules; 4. Apply the automated HTR2, HTR3 and HTR4 rules. After step 1, we identified some elements inside the BTW software actor that are not entirely related to the application domain of the software and these elements can be placed on other (new) software actors. In fact, the sub-graphs that contain the "Map to be Handled", "User Access be Controlled", and "Information be published in map" elements can be viewed as independents of the application domain. To facilitate the reuse of the first sub-graph, it will be moved to a new actor named "Handler Mapping". The second sub-graph will be moved to a new actor named "User decomposition and contribution links crossing the actors‟ boundaries, meaning that the model is syntactically incorrect and must be corrected by the automated HTRs. In order to apply the HTR2, HTR3 and HTR4, we only need to execute a single QVTO file. 
V. RUNNING EXAMPLE
BTW (By The Way) [10] is a route planning system that helps users with advice on a specific route they search for. The information is posted by other users and can be filtered so that only the relevant information about the place the user wants to visit is provided. BTW was an awarded project presented at the ICSE SCORE competition held in 2009 [11]. In order to apply the automated rules to the i* models of this example, it is necessary to perform the following steps: 1. Create the i* requirements model using the iStarTool; 2. Use the three heuristics defined by STREAM to guide the selection of the sub-graph to be moved from one actor to another; 3. Manually apply HTR1, with the support of the iStarTool. The result is an i* model with syntax errors that must be corrected using the automated transformation rules; 4. Apply the automated HTR2, HTR3 and HTR4 rules.
After step 1, we identified some elements inside the BTW software actor that are not entirely related to the application domain of the software; these elements can be placed in other (new) software actors. In fact, the sub-graphs that contain the "Map to be Handled", "User Access be Controlled", and "Information be published in map" elements can be viewed as independent of the application domain. To facilitate the reuse of the first sub-graph, it will be moved to a new actor named "Handler Mapping". The second sub-graph will be moved to a new actor named "User Access Controller", while the third sub-graph will be moved to a new actor called "Information Publisher".
Steps 1 and 2 are performed using the iStarTool. This tool generates two types of files (extensions): the "istar_diagram" file has information related to the i* requirements model; the "istar" file has information related to the i* modelling language metamodel. Since the "istar" file is an XMI file, we changed its type (extension) to "xmi". XMI files are used as input and output files by the automated rules (HTR2, HTR3 and HTR4). The BTW i* model and the elements to be moved to other actors are shown in Figure 10. Figure 11 depicts the BTW i* model after applying HTR1. Note that there are some task-decomposition and contribution links crossing the actors' boundaries, meaning that the model is syntactically incorrect and must be corrected by the automated HTRs. In order to apply HTR2, HTR3 and HTR4, we only need to execute a single QVTO file. Thus, with Eclipse configured for QVT transformations, along with the metamodel of the i* language referenced in the project and the input files referenced in the run settings, the automated rules are applied all at once by executing the QVTO project.
Figure 10. BTW i* diagram and selected elements
Figure 11. BTW i* SR diagram and selected elements
After applying the HTRs, a syntactically correct i* model is produced. In this model, the actors are expanded, but in order to apply the vertical transformation rules, it is necessary to contract all the actors (as shown in Figure 12) to be used as input in the second STREAM activity (Generate Architectural Solutions). Moreover, when applying the VTRs, we only need to execute a single QVTO file. The VTRs are executed sequentially and the analyst visualizes just the resulting model [15]. Figure 13 presents the graphical representation of the XMI model generated after the application of the VTRs. This XMI is compatible with the Acme metamodel.
Figure 11. BTW i* model after performing HTR1
Figure 12. BTW i* model after applying all HTRs
Figure 13. BTW Acme Model obtained from the i* model
VI. RELATED WORK
Coelho et al. propose an approach to relate aspect-oriented goal models (described in PL-AOV-Graph) and architectural models (described in PL-AspectualACME) [12]. It defines the mapping process between these models and a set of transformation rules between their elements. The MaRiPLA (Mapping Requirements to Product Line Architecture) tool automates this transformation, which is implemented using the Atlas Transformation Language (ATL). Medeiros et al. present MaRiSA-MDD, a model-based strategy that integrates aspect-oriented requirements, architectural and detailed design, using the languages AOV-graph, AspectualACME and aSideML, respectively [13]. MaRiSA-MDD defines, for each activity, models (and metamodels) and a number of model transformation rules. These transformations were also specified and implemented in ATL. However, none of these works relied on i*, our source language, which has a much larger community of adopters than AOV-Graph.
VII. CONCLUSION
This paper presented the automation of most of the transformation rules that support the first and second STREAM activities, namely Refactor Requirements Models and Derive Architectural Solutions [1]. In order to decrease the time and effort required to perform these STREAM activities, as well as to minimize the errors introduced by the manual execution of the rules, we proposed the use of the QVTO language to automate the execution of seven out of the eight STREAM transformation rules. The input and output models of the Refactor Requirements Models activity are compatible with the iStarTool,
while the ones generated by the Derive Architectural Solutions activity are compatible with the AcmeStudio tool. The iStarTool was used to create the i* model and to perform HTR1 manually. The result is the input file to be processed by the automated transformation rules (HTR2, HTR3 and HTR4). Both the input and output files handled by the transformation process are XMI files. The STREAM transformation rules were defined in QVTO and an Eclipse-based tool support was provided to enable their execution. In order to illustrate the use of the automated transformation rules, they were applied to the BTW example [10]. The output of the execution of the VTRs is an XMI file with the initial Acme architectural model. Currently, the AcmeStudio tool is not capable of reading XMI files, since it was designed to only process files described using the Acme textual language. As a consequence, the XMI file produced by the VTRs currently cannot be graphically displayed. Hence, we still need to define new transformation rules to generate a description in the Acme textual language from the XMI file already produced. Moreover, more case studies are still required to assess the benefits and identify the limitations of our approach. For example, we plan to run an experiment to compare the time necessary to perform the first two STREAM activities automatically against performing them in an ad-hoc way.
VIII. ACKNOWLEDGEMENTS
This work has been supported by CNPq, CAPES and FACEPE.
REFERENCES
[1] J. Castro, M. Lucena, C. Silva, F. Alencar, E. Santos, J. Pimentel, "Changing Attitudes Towards the Generation of Architectural Models", Journal of Systems and Software, vol. 85, pp. 463-479, March 2012.
[2] Object Management Group, QVT 1.1. Meta Object Facility (MOF) 2.0 Query/View/Transformation Specification, January 2011. Available in: <http://www.omg.org/spec/QVT/1.1/>. Accessed: April 2013.
[3] E. Yu, "Modelling Strategic Relationships for Process Reengineering", PhD Thesis, University of Toronto, Department of Computer Science, 1995.
[4] ACME. Acme - The Acme Architectural Description Language and Design Environment, 2009. Available in: <http://www.cs.cmu.edu/~acme/index.html>. Accessed: April 2013.
[5] OMG, QVT 1.1. Meta Object Facility (MOF) 2.0 Query/View/Transformation Specification, 01 January 2011. Available in: <http://www.omg.org/spec/QVT/1.1/>. Accessed: April 2013.
[6] A. Malta, M. Soares, E. Santos, J. Paes, F. Alencar and J. Castro, "iStarTool: Modeling requirements using the i* framework", iStar 11, August 2011.
[7] ECLIPSE GMF. GMF - Graphical Modelling Framework. Available in: <http://www.eclipse.org/modeling/gmf/>. Accessed: April 2013.
[8] J. Pimentel, M. Lucena, J. Castro, C. Silva, E. Santos, and F. Alencar, "Deriving software architectural models from requirements models for adaptive systems: the STREAM-A approach", Requirements Engineering, vol. 17, no. 4, pp. 259-281, June 2012.
[9] OMG, OCL 2.0. Object Constraint Language: OMG Available Specification, 2001. Available in: <http://www.omg.org/spec/OCL/2.0/>. Accessed: April 2013.
[10] J. Pimentel, C. Borba and L. Xavier, "BTW: if you go, my advice to you Project", 2010. Available in: <http://jaqueira.cin.ufpe.br/jhcp/docs>. Accessed: April 2013.
[11] SCORE 2009. The Student Contest on Software Engineering SCORE 2009, 2009. Available in: <http://score.elet.polimi.it/index.html>. Accessed: April 2013.
[12] K.
Coelho, „From Requirements to Architecture for Software Product Lines: a strategy of models transformations” (In Portuguese: Dos Requisitos à Arquitetura em Linhas de Produtos de Software: Uma Estratégia de Transformações entre Modelo). Dissertation (M.Sc.). Centro de Ciências Exatas e da Terra: UFRN, Brazil, 2012.. A. Medeiros, “MARISA-MDD: An Approach to Transformations between Oriented Aspects Models: from requirements to Detailed Project” (In Portuguese: MARISA-MDD: Uma Abordagem para Transformações entre Modelos Orientados a Aspectos: dos Requisitos ao Projeto Detalhado). Dissertation (M.S.c). Center for Science and Earth: UFRN, Brazil, 2008. M. Soares, “Automatization of the Transformation Rules on the STREAM process” (In Portuguese: Automatização das Regras de Transformação do Processo STREAM). Dissertation (M.Sc.). Center of Informatic: UFPE, Brazil, 2012. M. Soares, J. Pimentel, J. Castro, C. Silva, C. Talitha, G. Guedes, D. Dermeval, “Automatic Generation of Architectural Models From Goals Models”, SEKE 2012: 444-447. Eclipse. Available in: <http://eclipse.org/>. Acessed: April 2013. An automatic approach to detect traceability links using fuzzy logic Andre Di Thommazo Thiago Ribeiro, Guilherme Olivatto Instituto Federal de São Paulo, IFSP Universidade Federal de São Carlos, UFSCar São Carlos, Brazil [email protected] Instituto Federal de São Paulo, IFSP São Carlos, Brazil {guilhermeribeiro.olivatto, thiagoribeiro.d.o} @gmail.com Vera Werneck Universidade do Estado do Rio de Janeiro, UERJ Rio de Janeiro, Brazil [email protected] Abstract – Background: The Requirements Traceability Matrix (RTM) is one of the most commonly used ways to represent requirements traceability. Nevertheless, the difficulty of manually creating such a matrix motivates the investigation into alternatives to generate it automatically. Objective: This article presents one approach to automatically create the RTM based on fuzzy logic, called RTM-Fuzzy, which combines two other approaches, one based on functional requirements' entry data – called RTM-E – and the other based on natural language processing – called RTM-NLP. Method: To create the RTM based on fuzzy logic, the RTM-E and RTM-NLP approaches were used as entry data for the fuzzy system rules. Aimed at evaluating these approaches, an experimental study was conducted where the RTMs created automatically were compared to the reference RTM (oracle) created manually based on stakeholder knowledge. Results: On average the approaches matched the following results in relation to the reference RTM: RTM-E achieved 78% effectiveness, RTM-NLP 76% effectiveness and the RTM-Fuzzy 83% effectiveness. Conclusions: The results show that using fuzzy logic to combine and generate a new RTM offered an enhanced effectiveness for determining the requirement’s dependencies and consequently the requirement’s traceability links. Keywords- component; requirements traceability; fuzzy logic; requirements traceability matrix. I. INTRODUCTION Nowadays, the software industry is still challenged to develop products that meet client expectations and yet respect delivery schedules, costs and quality criteria. Studies performed by the Standish Group [1] showed that the quantity of projects in 2010 which finished successfully whilst respecting the schedule, budget and, principally, meeting the client’s expectations is only 37%. 
Another study performed previously by the same institute [2] found that the three most important factors to define whether a software project was successful or not are: user specification gaps, incomplete requirements, and constant changes in requirements. Duly noted, these factors are SandraFabbri Universidade Federal de São Carlos, UFSCar São Carlos, Brazil [email protected] directly related to requirements management. According to Salem [3], the majority of software errors found are derived from errors in the requirements gathering and on keeping pace with their evolution throughout the software development process. One of the main features of requirements management is the requirements traceability matrix (RTM), which is able to record the existing relationship between the requirements on a system and, due to its importance, is the main focus of many studies. Sundaram, Hayes, Dekhtyar and Holbrook [4], for instance, consider traceability determination essential in many software engineering activities. Nevertheless, such determination is a time consuming and error prone task, which can be facilitated if computational support is provided. The authors claim that the use of such automatic tools can significantly reduce the effort and costs required to elaborate and maintain requirements traceability and the RTM, and go further to state that such support is still very limited in existing tools. Among the ways to automate traceability, Wang, Lai and Liu [5] highlight that current studies make use of a spatial vector model, semantic indexing or probability network models. Regarding spatial vector modeling, the work of Hayes, Dekhtyar and Osborne [6] can be highlighted and it is going to be presented in detail in Section IV-B. Related to semantic indexing, Hayes, Dekhtyar and Sundaram [7] used the ideas proposed by Deerwester, Dumais, Furnas, Landauer and Harshman from Latentic Semantic Indexing (LSI) [8] in order to also automatically identify traceability. When LSI is in use, not only is the word frequency taken into consideration, but in addition the meaning and context used in their construction. With respect to the Network Probability model, Baeza-Yates, Berthier and Ribeiro-Neto [9] use ProbIR (Probabilistic Information Retrieval) to create a matrix in which the dependency between each term is mapped in relation to the other document terms. All the quoted proposals are also detailed by Cleland-Huamg, Gotel, Zisman [10] as possible approaches for traceability detection. As traceability determination involves many uncertainties this activity poses not to be trivial, not even to the team involved in the requirements gathering. Therefore, it is possible to achieve better effectiveness in the traceability link identification if we can use a technique that can handle uncertainties, like fuzzy logic. Given the aforementioned context, the focus of this paper is to present one approach to automatically create the RTM based on fuzzy logic, called RTM-Fuzzy. This approach combines two approaches: one based on functional requirements (FR) entry data – called RTM-E – that is effective on CRUD FR traceability and other based on natural language processing – called RTM-NLP – that is effective on more descriptive FRs. The motivation of the RTM-Fuzzy is to join the good features of the two others approaches. The main contribution of this paper is to present the fuzzy logic approach, once it has equal or better effectiveness than the other ones (RTM-E and RTMNLP) singly. 
The three proposed approaches were evaluated by an experimental study to quantify the effectiveness of each. It is worth mentioning that the RTM-E and RTM-NLP approaches had already provided satisfactory results in a previous experimental study [11]. In the re-evaluation in this paper, RTM-E had similar results and RTM-NLP (that was modified and improved) had a better effectiveness than the results of the first evaluation [11]. To make the experiment possible, the three RTM automatic generation approaches were implemented in the COCAR tool [12]. This article is organized as follows: in Section II the requirements management, traceability and RMT are introduced; Section III presents a brief definition of fuzzy logic theory; in Section IV, the three RMT automatic creation approaches are presented and exemplified by use of the COCAR tool; Section V shows the experimental study performed to evaluate the effectiveness of the approaches; conclusions and future work are discussed in Section VI. II. REQUIREMENTS MANAGEMENT TECHNIQUES Requirements management is an activity that should be performed throughout the whole development process, with the main objective of organizing and storing all requirements as well as managing any changes to them [13][14]. As requirements are constantly changing, managing them usually becomes a laborious and extensive task, thus making relevant the use of support tools to conduct it [5]. According to the Standish Group [15], only 5% of all developed software makes use of any requirements management tool, which can partially explain the huge problems that large software companies face when implementing effective requirements management and maintaining its traceability. Various authors emphasize the importance of tools for this purpose [13][14][16][17]. Zisman and Spanoudakis [14], for instance, consider the use of requirements management tools to be the only way for successful requirements management. Two important concepts for requirements management are requirements traceability and a traceability matrix, which are explained next. A. Requirements traceability Requirements traceability concerns the ability to describe and monitor a requirement throughout its lifecycle [18]. Such requirement control must cover all its existence from its source – when the requirement was identified, specified and validated – through to the project phase, implementation and ending at the product’s test phase. Thus traceability is a technique that allows identifying and visualizing the dependency relationship between the identified requirements and the other requirements and artifacts generated throughout the software’s development process. The dependency concept does not mean, necessarily, a precedence relationship between requirements but, instead, how coupled they are to each other with respect to data, functionality, or any other perspective. According to Guo, Yang, Wang, Yang and Li [18], requirements traceability is an important requirements management activity as it can provide the basis to requirements evolutional changes, besides directly acting on the quality assurance of the software development process. Zisman and Spanadousk [14] consider two kinds of traceability: • Horizontal: when the relationships occur between requirements from different artifacts. This kind of traceability links a FR to a model or a source code, for example. • Vertical: when the traceability is analyzed within the same artifact, like the RD for instance. 
By analyzing the FRs of this artifact it is possible to identify their relationships and generate the RTM. This type of traceability is the focus of this paper. Munson and Nguyen [19] state that traceability techniques will only be better when supported by tools that diminish the effort required to execute them. B. Requirement Traceability Matrix - RTM According to Goknil, Kurtev, Van den Berg and Veldhuis [17], despite existing various estudies treating traceability between requirements and other artifacts (horizontal traceability), only minor attention is given to the requirements relationship between themselves, i.e. their vertical traceability. The authors also state that this relationship influences various activities within the software development process, such as requirements consistency verification and change management. A method of mapping such a relationship between requirements is RTM creation. In addition, Cuddeback, Dekhtyar and Hayes [20] state that a RTM supports many software engineering validation and verification activities, like change impact analysis, reverse engineering, reuse, and regression tests. In addition, they state that RTM generation is laborious and error prone, a fact that means, in general, it is not generated or updated. Overall, RTM is constructed as follows: each FR is represented in the i-eseme line and in the i-eseme column of the RTM, and the dependency between them is recorded in the cell corresponding to each FR intersection [13]. Several authors [13] [17] [18] [19] [21] debate the importance and need of the RTM in the software development process, once such matrix allows predicting the impact that a change (or the insertion of a new requirement) has on the system as a whole. Sommerville [13] emphasizes the difficulty of obtaining such a matrix and goes further by proposing a way to subjectively indicate not only whether the requirements are dependent but how strong such a dependency is. III. FUZZY LOGIC Fuzzy logic was developed by Zadeh [22], and proposes, instead of simply using true or false, the use of a variation of values between a complete false and an absolute true statement. In classic set theory there are only two pertinence possibilities for an element in relation to a set as a whole: the element pertains or does not pertain to a set [23]. In fuzzy logic the pertinence is given by a function whose values pertain to the real closed interval between 0 and 1. The process of converting a real number into its fuzzy representation is called “Fuzzyfication”. Another important concept in fuzzy logic is related to the rules that use linguistic variables in the execution of the decision support process. The linguistic variables are identified by names, have a variable content and assume linguistic values, which are the names of the fuzzy sets [23]. In the context of this work, the linguistic variables are the traceability obtained by the three RTM generation approaches and may assume the values (nebulous sets) “non-existent”, “weak” or “strong”, which will be represented by a pertinence function. This process is detailed in Section IV C. Fuzzy logic has been used in many software engineering areas and, specifically in the requirements engineering area, the work of Ramzan, Jaffar, Iqbal, Anwar, and Shahid [24] and Yen, Tiao and Yin [25] can be highlighted. The former conducts requirements prioritization based on fuzzy logic and the later uses fuzzy logic to aid the collected requirements’ precision analysis. 
In the metrics area, MacDonell, Gray and Calvet [26] also use fuzzy logic to propose metrics to the software development process and in the reuse area, Sandhu and Singh [27] likewise use fuzzy logic to analyze the quality of the created components. IV. APPROACHES TO RTM GENERATION The three approaches were developed aiming to generate the RTM automatically. The first one - RTM-E – is effective to detect traceability links in FRs that have the same entry data, specially like CRUDs FRs. The second one – RTM-NLP – is appropriate to detect traceability links in FRs that have a lot of knowledge in the text description. The third one – RTM-Fuzzy – combines the previous approaches trying to extract the best of each one. These approaches only take into consideration the software FRs and establish the relationship degree between each pair of them. The RTM names were based on each approach taken. The approach called RTM-E had its traceability matrix named RTMe, the RTM-NLP’s RTM was called RTMnlp, whereas the RTM-Fuzzy’s RTM was called RTMfuzzy. The quoted approaches are implemented in the COCAR tool, which uses a template [28] for requirements data entry. The RD formed in the tool can provide all data necessary to evaluate the approaches. The main objective of such a template is to standardize the FR records, thus avoiding inconsistencies, omissions and ambiguities. One of the fields found in this template (which makes the RTM implementation feasible) is called Entry, and it records in a structured and organized way the data used in each FR. It is important noting that entry data should be included with the full description, i.e. “client name” or “user name” and not only “name”, avoiding ambiguity. Worth mentioning here is the work of Kannenberg and Saiedian [16], which considers the use of a tool to automate the requirements recording task highly desirable. Following, the approaches are presented. Aiming to exemplify the RTM generation according to them, a real application developed for a private aviation company was used in a case study. An example of how each approach calculates the dependence between a pair of FRs is presented at the end of each sub-session (IV-A, IV-B and IV-C). The system’s purpose is to automate the company’s stock control. As the company is located in several cities, the system manages the various stock locations and the products being inserted, retrieved or reallocated between each location. The system has a total of 14 FRs, that are referrenced in the first line and first column of the RTMs generated. In Section V the results of this case study ARE compared with the results of the experimental study. A. RMT-E Approach: RTM generation based on input data In the RMT-E approach, the dependency relationship between FRs is determined by the percentage of common data between FR pairs. This value is obtained through the Jaccard Index calculation [29], which compares the similarity and/or diversity degree between the data sets of each pair. Equation 1 represents this calculation. (1) The equation numerator is given by the quantity of data intersecting both sets (A and B), whereas the denominator corresponds to the quantity associated to the union between those sets. 
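As a minimal sketch of how this calculation can be applied to the entry-data sets of two FRs, the Python fragment below implements the Jaccard index of Equation 1 and the dependency levels described later in this subsection (no dependence at 0%, weak up to 50%, strong above 50%), using the FR3/FR5 entry data discussed below; the function names are illustrative and are not taken from the COCAR implementation.

    def jaccard(a, b):
        # Jaccard index between two entry-data sets (Equation 1): |A n B| / |A u B|
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def rtm_e_level(fr_a, fr_b):
        # Dependency level between two FRs in the RTM-E approach.
        dep = jaccard(fr_a, fr_b) * 100
        if dep == 0:
            return "no dependence"
        return "weak" if dep <= 50 else "strong"

    # Example from the text: FR3 and FR5 share 6 of 9 distinct entry-data items.
    fr3 = {"Contact", "Transaction Date", "Warehouse", "Quantity", "Unit Price", "User"}
    fr5 = fr3 | {"Origin Warehouse", "Destination Warehouse", "Status"}
    print(round(jaccard(fr3, fr5) * 100, 2))  # 66.67
    print(rtm_e_level(fr3, fr5))              # strong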
The RTM-E approach defines the dependency between two FRs, according to the following: considering FRa the data set entries for a functional requirement A and FRb the data set entries for a functional requirement B, their dependency level can be calculated by Equation 2: (2) Thus according to the RTM-E approach, each position (i,j) of the traceability matrix RTM(i,j) corresponds to values from Equation 3: (3) Positions on the matrix’s main diagonal are not calculated once they indicate the dependency of the FRs to themselves. Besides, the generated RTM is symmetrical, i.e. RTM(i,j) has the same value as RTM(j,i). Implementing such an approach in COCAR was possible because the requirements data are stored in an atomic and separated way, according to the template mentioned before. Each time entry data is inserted in a requirement data set it is automatically available and can be used as entry data for another FR. Such implementation avoids data ambiguity and data inconsistency. It is worth noting that initiatives using FR data entries to automatically determine the RTM were not found in the literature. Similar initiatives do exist to help determine traceability links between other artifacts, mainly models (for UML diagrams) and source codes, like those found in Cysneiros and Zisman [30]. In Figure 1, the matrix cell highlighted by the rectangle indicates the level of dependency between FR3 and FR5. In this case, FR3 is related to products going in and out from a company’s stock (warehouse) and FR5 is related to an item transfer from one stock location to another. As both FRs deal with stock items, it is expected that they present a relationship. The input data of FR3 are: Contact, Transaction Date, Warehouse, Quantity, Unit Price and User. The input data of FR5 are: Contact, Transaction Date, Warehouse, Quantity, Unit Price, User, Origin Warehouse, Destination warehouse and Status. As the quantity of elements of the intersection between the input data of FR3 and FR5 (n(FR3 ∩ FR5)) is equal to 6, and the quantity of elements of the union set (n(FR3 ∪ FR5)) is equal to 9, the value obtained from Equation 4, that establishes the dependency relationship between FR3 and FR5 is: (4) The 66.67% dependency rate is highlighted in Figure 1, which is the RTM-E built using the aforementioned approach. It is worth mentioning that the colors indicate each FR dependency level as follows: green for a weak dependency level and red for a strong dependency level. Where there is no relationship between the FRs, no color is used in the cell. Figure 2 illustrates the intersection and union sets when the RTM-E approach is applied to FR3 and FR5 used so far as example. Also worth mentioning is that the COCAR tool presents a list of all input data entries already in place, in order to minimize requirements input errors such as the same input data entry with different names. The determination of the dependency levels was carried out based on the application taken as an example (stock control), and from two further RDs from a different scope. Such a process was performed in an interactive and iterative way, adjusting the values according to the detected traceability between the three RD. The levels obtained were: “no dependence” where the calculated value was 0%; “weak” for values between 0% and 50%; and “strong” for values above 50%. B. 
RMT-NLP approach: RTM generation based on natural language processing Even though there are many initiatives that make use of NLP to determine traceability in the software development process, as mentioned previously few of them consider traceability inside the same artifact [17]. In addition, the proposals found in the literature do not use a requirements description template and do not determine dependency levels as in this work. According to Deeptimahanti and Sanyal [31], the use of NLP in requirements engineering does not aim at text comprehension itself but, instead, at extracting embedded RD concepts. This way, the second approach to establish the RTM uses concepts extracted from FRs using NLP techniques to determine the FR’s traceability. Initially, NLP was only applied to the field that describes the processing (actions) of the FR, and such a proposal was evaluated using an experimental study [11]. It was noted that many times, as there is a field for entry data input in the template (as already pointed in the RTME proposal), the analysts did not record the entry data once again in the processing field, thus reducing the similarity value. With such a fact in mind, this approach has been improved and all text fields contained in the template were used. This way, the template forces the requirements engineer to gather and enter all required data, and all this information is used by the algorithm that performs the similarity calculation. As it will be shown in this work’s sequence, this modification had a positive impact on the approach’s effectiveness. To determine the dependency level between the processing fields of two FRs, the Frequency Vector and Cosine Similarity methods [32] are used. Such a method is able to return a similarity percentage between two text excerpts. Figure 1 – Resultant RTM generated using the RTM-E approach. Figure 2 – An example of the union and intersection of entry data between two FRs. With the intention of improving the process’ efficiency, text pre-processing is performed before applying the Frequency Vector and Cosine Similarity methods in order to eliminate all words that might be considered irrelevant, like articles, prepositions and conjunctions (also called stopwords) (Figure 3-A). Then, a process known as stemming (Figure 3-B) is applied to reduce all words to their originating radicals, leveling their weights in the text similarity determination. After the two aforementioned steps, the method calculates the similarity between two FR texts (Figure 3-C) using the template’s processing fields, thus identifying, according to the technique, similar ones. The first step for Vector Frequency and Cosine Similarity calculation is to represent each sentence in a vector with each position containing one word from the sentence. The cosine calculation between them will determine the similarity value. As an example, two FRs (FR1 and FR2) described respectively by sentences S1 and S2 are taken. The similarity calculation is carried out as follows: 1) S1 is represented in vector x and S2 in vector y. Each word will use one position in each vector. If S1 has p words, vector x will also initially have p positions. In the same way, if S2 has q words, vector y will also have q positions. 3) All vectors are alphabetically reordered. 4) Vectors have their terms searched for matches on the other and, when the search fails, the word is included in the “faulting” vector with 0 as its frequency. 
At the end of this step, both vectors will have not matched words included and the same number of positions. 5) With the adjusted vectors, the similarity equation – sim(x,y) (Equation 5) – must be applied between vectors x and y, considering n as the number of positions found in the vectors. (5) Considering the same example used to illustrate the RTM-E approach (private aviation stock system) the RTMnlp was generated (Figure 4), evaluating the similarity between FR functionalities inserted into COCAR inside the “processing” attribute of the already mentioned template. After applying pre-processing (stopwords removal and stemming), and the steps depicted earlier for calculating the Frequency Vector and Cosine Similarity, the textual similarity between FR3 and FR5 (related to product receive in stock and product transfer between stocks, respectively) was determined as 88.63% (Figure 3-D). This high value does make sense in this relationship, once the texts describing both requirements are indeed very similar. 2) As the vector cannot have repeated words, occurrences are counted to determine each word’s frequency to be included in the vector. At the end, the vector should contain a single occurrence of each word followed by the frequency that such word appears in the text. Entry FR3 Text FR5 Text Pre-processing Remove Stopwords (articles, prepositions, conjunctions) Stemming – reduce used words to their radicals Frequency Vector and Cosine Similarity [32] Dependency between FR3 and FR5 (88.63%) A B C D Figure 3 – Steps to apply the RTM-NLP approach. Figure 4 – Resultant RTM generated using the RTM-NLP approach. As in the RTM-E, the dependency level values had been chosen in an interactive and iterative way based on the data provided by the example application (stock control) and two more RDs from different scopes. The levels obtained were: “no dependence” where the value was between 0% and 40%; “weak” for values between 40% and 70%; and “strong” for values above 70%. C. RTM-Fuzzy approach: RTM generation based on fuzzy logic The purpose of this approach is to combine those detailed previously using fuzzy logic, so that we can consider both aspects explored so far – the relationship between the entry data manipulated by the FRs (RTM-E) and the text informed in the FRs (RTM-NLP) – to create the RTM. In the previously presented approaches, the dependency classification between two FRs of “no dependence”, “weak”, and “strong” is determined according to the approach’s generated value related to the values set for each of the created levels. One of the problems with this approach is that the difference between the classification in one level and another can be miniscule. For instance, if the RTM-NLP approach generates a value of 39.5% for the dependency between 2 FRs, this would not indicate any dependency between the FRs, whereas a value of 40.5% would already indicate a weak dependency. Using the fuzzy logic, this problem is minimized as it is possible to work with a nebulous level between those intervals through the use of a pertinence function. As seen earlier, this conversion from absolute values to its fuzzy representation is called fuzzification, used for creating the pertinence functions. In the pertinence functions, the X axis represents the dependency percentage between FRs (from 0% to 100%), and the Y axis represents the pertinence level, i.e. the probability of belonging to a certain fuzzy set (“no dependence”, “weak” or “strong”), which can vary from 0 to 1. 
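Before turning to the fuzzy combination, the frequency-vector and cosine-similarity computation of the RTM-NLP approach (Section IV-B, Equation 5) can be sketched as follows. This is a simplified illustration: the stop-word list is only an example and stemming is omitted, since the paper does not state which stemmer is used; the threshold values restate the levels given above.

    import math
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # illustrative list only

    def preprocess(text):
        # Steps A/B of Figure 3: remove stop words; stemming is omitted in this sketch.
        return [w for w in text.lower().split() if w not in STOPWORDS]

    def cosine_similarity(text_a, text_b):
        # Frequency vectors + cosine similarity (Equation 5).
        fa, fb = Counter(preprocess(text_a)), Counter(preprocess(text_b))
        words = set(fa) | set(fb)                       # union of the two vocabularies
        dot = sum(fa[w] * fb[w] for w in words)
        norm = math.sqrt(sum(v * v for v in fa.values())) * \
               math.sqrt(sum(v * v for v in fb.values()))
        return dot / norm if norm else 0.0

    def rtm_nlp_level(text_a, text_b):
        sim = cosine_similarity(text_a, text_b) * 100
        if sim <= 40:
            return "no dependence"
        return "weak" if sim <= 70 else "strong"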
Figure 5 illustrates the fuzzy system adopted, with RTMe and RTMnlp as the entry data. Figures 6 and 7 present, respectively, the pertinence functions of the RTM-E and RTM-NLP approaches; the X axis indicates the dependency percentage calculated in each approach and the Y axis indicates the pertinence degree, ranging from 0 to 1. The higher the pertinence value, the bigger the chance of belonging to one of the possible sets ("no dependence", "weak", or "strong"). There are ranges of values in which the pertinence percentage can be higher for one set and lower for the other (for example, the range with a dependence percentage between 35% and 55% in Figure 6). Table I indicates the rules created for the fuzzy system. Such rules are used to calculate the output value, i.e. the RTMfuzzy determination. These rules were derived from the authors' experience through an interactive and iterative process.
Figure 5 – Fuzzy System
Figure 6 – Pertinence function for RTM-E (fuzzy sets: "no dependence", "weak", "strong")
Figure 7 – Pertinence function for RTM-NLP (fuzzy sets: "no dependence", "weak", "strong")
TABLE I – RULES USED IN FUZZY SYSTEM
1. if RTM-E = "no dependence" AND RTM-NLP = "no dependence" then "no dependence"
2. if RTM-E = "weak" AND RTM-NLP = "weak" then "weak dependence"
3. if RTM-E = "no dependence" AND RTM-NLP = "strong" then "weak dependence"
4. if RTM-E = "strong" AND RTM-NLP = "strong" then "strong dependence"
5. if RTM-E = "no dependence" AND RTM-NLP = "weak" then "no dependence"
6. if RTM-E = "weak" AND RTM-NLP = "no dependence" then "weak dependence"
7. if RTM-E = "no dependence" AND RTM-NLP = "strong" then "weak dependence"
8. if RTM-E = "strong" AND RTM-NLP = "weak" then "strong dependence"
9. if RTM-E = "strong" AND RTM-NLP = "no dependence" then "weak dependence"
Figure 8 shows the output pertinence function in the same way as Figures 6 and 7, where the X axis indicates the RTMfuzzy dependence percentage and the Y axis indicates the pertinence degree between 0 and 1.
To exemplify the RTM-Fuzzy approach, the same aviation company stock system used in the other approaches is considered. The FRs selected for the example are FR3, related to data insertion in a stock (and already used in the other examples), and FR7, related to report generation on the stock. Such a combination was chosen due to the fact that they do not have common entry data and, therefore, there is no dependency between them. Despite this, RTM-NLP indicates a strong dependency (75.3%) between these requirements. This occurs because both FRs deal with the same set of data (although they do not have common entry data) and have a similar scope, which explains their textual similarity. It can be observed in Figure 8 that RTM-E shows no dependency, whereas RTM-NLP shows a strong dependency (treated in the third rule). In the fuzzy logic processing (presented in Figure 9), after applying Mamdani's inference technique, the resulting value for the entries is 42.5.
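To make the rule evaluation more concrete, the sketch below implements a toy Mamdani-style combination with triangular membership functions and a weighted-average defuzzification. The membership breakpoints are guesses, since the actual functions are only given graphically in Figures 6-8, so the numeric result will not reproduce the authors' value of 42.5; it only illustrates the qualitative behaviour for the FR3/FR7 example (no entry-data overlap, strong textual similarity).

    # Toy Mamdani-style combination of RTM-E and RTM-NLP values (0-100).
    # The membership breakpoints below are assumptions, not the ones used in the paper.
    def tri(x, a, b, c):
        # Triangular membership function with peak at b.
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    SETS = {  # assumed fuzzy sets over the 0-100 dependency scale
        "no dependence": lambda x: tri(x, -1, 0, 45),
        "weak":          lambda x: tri(x, 30, 50, 70),
        "strong":        lambda x: tri(x, 55, 100, 101),
    }
    OUTPUT_CENTER = {"no dependence": 0.0, "weak": 50.0, "strong": 100.0}

    RULES = [  # (RTM-E set, RTM-NLP set, consequent), following Table I (duplicate row listed once)
        ("no dependence", "no dependence", "no dependence"),
        ("weak", "weak", "weak"),
        ("no dependence", "strong", "weak"),
        ("strong", "strong", "strong"),
        ("no dependence", "weak", "no dependence"),
        ("weak", "no dependence", "weak"),
        ("strong", "weak", "strong"),
        ("strong", "no dependence", "weak"),
    ]

    def rtm_fuzzy(rtm_e, rtm_nlp):
        num = den = 0.0
        for e_set, nlp_set, out in RULES:
            strength = min(SETS[e_set](rtm_e), SETS[nlp_set](rtm_nlp))  # AND = min
            num += strength * OUTPUT_CENTER[out]
            den += strength
        return num / den if den else 0.0   # weighted-average defuzzification

    print(rtm_fuzzy(0.0, 75.3))  # FR3/FR7 example: falls in the "weak" region of the output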
Looking at Figure 8, it can be concluded that this value (42.5) corresponds to a "weak" dependence, with 1 as the pertinence level. In this way, the cell corresponding to the intersection of FR3 and FR7 in the RTMfuzzy has the value "weak".
Figure 8 – Pertinence functions for the Fuzzy System output (fuzzy sets: "no dependence", "weak", "strong"; the example output value is 42.5)
Figure 9 – RTM-Fuzzy calculation
V. EXPERIMENTAL STUDY
To evaluate the effectiveness of the proposed approaches, an experimental study has been conducted following the guidelines below:
- Context: The experiment has been conducted in the context of the Software Engineering class at UFSCar, Federal University of São Carlos, as a volunteer extra activity. The experiment consisted of each pair of students conducting requirements gathering on a system involving a real stakeholder. The final RD had to be created in the COCAR tool.
- Objective: Evaluate the effectiveness of the RTM-E, RTM-NLP, and RTM-Fuzzy approaches in comparison to a reference RTM (called RTM-Ref) constructed by the detailed analysis of the RD. The RTM-Ref creation is detailed next.
- Participants: 28 undergraduate students of the Bachelor in Computer Science course at UFSCar.
- Artifacts utilized: RD, with the following characteristics:
• produced by a pair of students on their own;
• related to a real application, with the participation of a stakeholder with broad experience in the application domain;
• related to the information systems domain, with basic creation, retrieval, updating and deletion of data;
• inspected by a different pair of students in order to identify and eliminate possible defects;
• included in the COCAR tool after the identified defects are removed.
- RTM-Ref:
• created from the RD input into the COCAR tool;
• built based on the detailed reading and analysis of each FR pair, determining the dependency between them as "no dependence", "weak", or "strong";
• recorded in a spreadsheet so that the RTM-Ref created beforehand could be compared to the RTMe, RTMnlp and RTMfuzzy for each system;
• built by this work's authors, who were always in touch with the RD's authors whenever a doubt was found. Every dependency (data, functionality or predecessor) was considered as a link.
- Metrics: the metric used was the effectiveness of the three approaches with regard to the coincidental dependencies found by each approach in relation to the RTM-Ref. The effectiveness is calculated by the relation between the quantity of dependencies correctly found by each approach and the total of all dependencies that can be found between the FRs. Considering a system consisting of n FRs, the total quantity of all possible dependencies (T) is given by Equation 6:
T = n(n - 1) / 2     (6)
Therefore, the effectiveness rate is given by Equation 7:
effectiveness = (number of correct dependencies) / T     (7)
- Results: The results of the comparison between the data in RTMe, RTMnlp, and RTMfuzzy are presented in Table II. The first column contains the name of the specified system; the second column contains the FR quantity; the third provides the total number of possible dependencies between FRs that may exist (being "strong", "weak" or "no dependence"), whose formula was shown in Equation 6. The fourth, sixth and eighth columns contain the total number of coincidental dependencies between the RTMe, RTMnlp and RTMfuzzy matrices and the RTM-Ref. Exemplifying: if the RTM-Ref has determined a "strong" dependency in a cell and the RTM-E approach also registered the dependency as "strong" in the same position, a correct relationship is determined. The fifth, seventh and ninth columns represent the effectiveness of the RTM-E, RTM-NLP, and RTM-Fuzzy approaches, respectively, calculated by the relation between the quantity of correct dependencies found by the approach and the total number of dependencies that could be found (third column).
- Threats to validity: The experimental study conducted poses some threats to validity, mainly in terms of the students' inexperience in identifying the requirements with the stakeholders. In an attempt to minimize this risk, known domain systems were used as well as RD inspection activities. The latter were conducted based on a defect taxonomy commonly adopted in this context, which considers inconsistencies, omissions, and ambiguities, among others. Another risk is the fact that the RTM-Ref had been built by people who did not have direct contact with the stakeholder, and therefore this matrix could be influenced by eventual problems in their RDs. To minimize this risk, whenever any doubt was found when determining whether a relationship occurred or not, the students' help was solicited. In some cases a better comprehension of the requirements along with the stakeholder was necessary, which certainly minimized errors when creating the RTM-Ref.
TABLE II – EXPERIMENTAL STUDY RESULTS
System | Req Qty | # of possible dependencies | RTM-E correct | RTM-E effect. | RTM-NLP correct | RTM-NLP effect. | RTM-Fuzzy correct | RTM-Fuzzy effect.
1 Zoo | 19 | 171 | 131 | 77% | 138 | 81% | 143 | 84%
2 Habitation | 24 | 276 | 233 | 84% | 205 | 74% | 241 | 87%
3 Student Flat | 28 | 378 | 295 | 78% | 325 | 86% | 342 | 90%
4 Taxi | 15 | 105 | 82 | 78% | 77 | 73% | 85 | 81%
5 Clothing Store | 27 | 351 | 295 | 84% | 253 | 72% | 296 | 84%
6 Freight | 16 | 120 | 98 | 82% | 85 | 71% | 102 | 85%
7 Court | 24 | 276 | 204 | 74% | 181 | 66% | 212 | 77%
8 Financial Control | 17 | 136 | 94 | 69% | 101 | 74% | 101 | 74%
9 Administration | 19 | 171 | 134 | 78% | 129 | 75% | 137 | 80%
10 Book Store | 19 | 171 | 129 | 75% | 145 | 85% | 147 | 86%
11 Ticket | 15 | 105 | 88 | 84% | 91 | 87% | 94 | 90%
12 Movies | 16 | 120 | 88 | 73% | 82 | 68% | 97 | 81%
13 Bus | 15 | 105 | 72 | 69% | 78 | 74% | 81 | 77%
14 School | 15 | 105 | 82 | 78% | 77 | 73% | 84 | 80%
- Analysis of Results: The statistical analysis has been conducted using the SigmaPlot software. By applying the Shapiro-Wilk test it could be verified that the data follow a normal distribution, and the results shown next are in the format average ± standard deviation. To compare the effectiveness of the proposed approaches (RTM-E, RTM-NLP and RTM-Fuzzy), repeated-measures analysis of variance (ANOVA) has been used, with the Holm-Sidak method as post-test. The significance level adopted is 5%. The RTM-Fuzzy approach was found to be the most effective (82.57% ± 4.85), whereas the RTM-E approach offered (77.36% ± 5.05) and the RTM-NLP obtained an effectiveness level of (75.64% ± 6.57). These results are similar to the results of the real case study presented in Section IV (company's stock control). In the case study, the RTM-Fuzzy effectiveness was 81.69%, the RTM-E effectiveness was 78.06% and the RTM-NLP effectiveness was 71.72%. In this experimental study, the results found for the RTM-E approach were similar to those found in a previous study [11]. Despite that, in the previous study the RTM-NLP only presented an effectiveness level of 53%, which led us to analyze and modify this approach; the improvements were already in place when this work was evaluated. Even with such improvements, the approach still generates false positive cases, i.e. non-existing dependencies between FRs. According to Sundaram, Hayes, Dekhtyar and Holbrook [4], the occurrence of false positives is an NLP characteristic, although this type of processing can easily retrieve the relationship between FRs. In the RTM-NLP approach, the reason for generating such false positive cases is the fact that, many times, the same words are used to describe two different FRs, thus indicating a similarity between the FRs which is not a dependency between them. Examples of words that can generate false positives are "set", "fetch" and "list". Solutions to this kind of problem are being studied in order to improve this approach.
One of the alternatives is the use of a Tagger to classify each natural language term in its own grammatical class (article, preposition, conjunction, verb, substantive, or adjective). In this way verbs could receive a different weight from substantives in similarity calculus. A preliminary evaluation of this alternative was manually executed, generating better effectiveness in true relationship determination. In the RTM-E data analysis, false positives did not occur. The dependencies found, even the weak ones, did exist. The errors influencing this approach were due to relationships that should have been counted as “strong” being counted as “weak”. This occurred because many times the dependency between two FRs was related to data manipulated by both, regardless of them being entry or output data. This way, the RTM-E approach is also being evaluated with the objective to incorporate improvements that can make it more effective. As previously mentioned, if a relation was found as “strong” in RTM-Ref and the proposed approach indicated that the relation was “weak”, an error in the experiment’s traceability was counted. In the case relationships indicating only if “there is” or “there is not” a traceability link were generated, i.e. without using the “weak” or “strong” labels, the effectiveness determined would be higher. In such a case the Precision and Recall [10] metrics could be used, given that such metrics only take in account the fact that a dependency exists and not their level (“weak” or “strong”). In relation to the RTM-Fuzzy approach, the results generated by it were always the same or higher than the results found by the RMT-E and RTM-NLP approaches alone. Nevertheless, with some adjustments in the fuzzy system pertinence functions, better results could be found. Such an adjustment is an iterative process, depending on an evaluation each time a change is done. A more broadened research could, for instance, introduce intermediary levels between linguistic variables as a way to map concepts that are hard to precisely consider in the RTM. To improve the results, genetic algorithms can make a more precise determination of the parameters involved in the pertinence functions. VI. CONCLUSIONS AND FUTURE WORK This paper presented an approach based on fuzzy logic – RTM-Fuzzy – to automatically generate the requirements traceability matrix. Fuzzy logic was used to treat those uncertainties that might negatively interfere in the requirements traceability determination. The RTM-Fuzzy approach was defined based on two other approaches also presented in this paper: RTM-E, which is based on the percentage of entry data that two FRs have in common, and RTM-NLP, which uses NLP to determine the level of dependency between requirements. From the three approaches presented, it is worth pointing that there are already some reported proposals in the literature using NLP for traceability link determination, mainly involving different artifacts (requirements and models, models and source-code, or requirements and test cases). Such a situation is not found in RTM-E, for which no similar attempt was found in the literature. All approaches were implemented in the COCAR environment, so that the experimental study could be performed to evaluate the effectiveness of each approach. The results showed that RTM-Fuzzy presented a superior effectiveness compared to the other two. 
This transpired because the RTM-Fuzzy uses the results presented in the other two approaches but adds a diffuse treatment in order to perform more flexible traceability matrix generation. Hence the consideration of traceability matrix determination is a difficult task, even for specialists, and using the uncertainties treatment provided by fuzzy logic has shown to be a good solution to automatically determine traceability links with enhanced effectiveness. The results motivate the continuity of this research, as well as further investigation into how better to combine the approaches for RTM creation using fuzzy logic. The main contributions of this particular work are the incorporation of the COCAR environment, and correspond to the automatic relationship determination between FRs. This facilitates the evaluation of the impact that a change in a requirement can generate on the others. New studies are being conducted to improve the effectiveness of the approaches. As future work, it is intended to improve the NLP techniques used by considering the use of a tagger and the incorporation of a glossary for synonym treatment. Another investigation to be done regards how an RTM can aid the software maintenance process, more specifically, offer support for regression test generation. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] Standish Group, CHAOS Report 2011, 2011. Available at: http://www1.standishgroup.com/newsroom/chaos_2011.php Last accessed March 2012. Standish Group, CHAOS Report 1994, 1994. Available at: http://www.standishgroup.com/sample_research/chaos_1994_2.ph p Last accessed February 2007. A.M. Salem, "Improving Software Quality through Requirements Traceability Models", 4th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2006), Dubai, Sharjah, UAE, 2006. S.K.A. Sundaram, J.H.B. Hayes, A.C. Dekhtyar, E.A.D. Holbrook, "Assessing Traceability of Software Engineering Artifacts", 18th International IEEE Requirements Engineering Conference, Sydney, Australia, 2010. X. Wang, G. Lai, C. Liu, "Recovering Relationships between Documentation and Source Code based on the Characteristics of Software Engineering", Electronic Notes in Theoretical Computer Science, 2009. J.H. Hayes, A. Dekhtyar, J. Osborne, "Improving Requirements Tracing via Information Retrieval", Proceedings of 11th IEEE International Requirements Engineering Conference, IEEE CS Press, Monterey, CA, 2003, pp. 138–147. J.H. Hayes, A. Dekhtyar, S. Sundaram, "Advancing Candidate Link Generation for Requirements Tracing: The Study of Methods", IEEE Transactions on Software Engineering, vol. 32, no. 1, January 2006, 4–19. S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, vol. 41, no. 6, 1990, pp. 391–407. R. Baeza-Yates, A..Berthier, A. Ribeiro-Neto, Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. J. Cleland-Huang, O. Gotel, A. Zisman, Software and Systems Traceability. Springer, 2012, 491 p. A. Di Thommazo, G. Malimpensa, G. Olivatto, T. Ribeiro, S. Fabbri, "Requirements Traceability Matrix: Automatic Generation and Visualization", Proceedings of the 26th Brazilian Symposium on Software Engineering, Natal, Brazil, 2012. A. Di Thommazo, M.D.C. Martins, S.C.P.F. Fabbri, “Requirements Management in COCAR Enviroment” (in Portuguese), WER 07: Workshop de Engenharia de Requisitos, Toronto, Canada, 2007. I. Sommerville, Software Engineering. 
9th ed. New York, Addison Wesley, 2010. A. Zisman, G. Spanoudakis, "Software Traceability: Past, Present, and Future", Newsletter of the Requirements Engineering Specialist Group of the British Computer Society, September 2004. Standish Group, CHAOS Report 2005, 2005. Available at: http://www.standishgroup.com/sample_research/PDFpages/q3spotlight.pdf Last accessed February 2007. A. Kannenberg, H. Saiedian, "Why Software Requirements Traceability Remains a Challenge", CrossTalk: The Journal of Defense Software Engineering, July/August 2009. [17] A. Goknil, I. Kurtev, K. Van den Berg, J.W. Veldhuis, "Semantics of Trace Relations in Requirements Models for Consistency Checking and Inferencing", Software and Systems Modeling, vol. 10, iss. 1, February 2011. [18] Y. Guo, M. Yang, J. Wang, P. Yang, F. Li, "An Ontology based Improved Software Requirement Traceability Matrix", 2nd International Symposium on Knowledge Acquisition and Modeling, KAM, Wuhan, China, 2009. [19] E.V. Munson, T.N. Nguyen, "Concordance, Conformance, Versions, and Traceability", Proceedings of the 3rd International Workshop on Traceability in Emerging Forms of Software Engineering, Long Beach, California, 2005. [20] D. Cuddeback, A. Dekhtyar, J.H. Hayes, "Automated Requirements Traceability: The Study of Human Analysts", Proceedings of the 2010 18th IEEE International Requirements Engineering Conference (RE2010), Sydney, Australia, 2010. [21] IBM, Ten Steps to Better Requirements Management. Available at: http://public.dhe.ibm.com/common/ssi/ecm/en/raw14059usen/RA W14059USEN.PDF Last accessed March 2012. [22] L.A. Zadeh, "Fuzzy Sets", Information Control, vol. 8, pp. 338– 353, 1965. [23] A.O. Artero, "Artificial Intelligence - Theory and Practice", Livraria Fisica, 2009, 230 p. [24] M. Ramzan, M.A. Jaffar, M.A. Iqbal, S. Anwar, A.A. Shahid, "Value based Fuzzy Requirement Prioritization and its Evaluation Framework", 4th International Conference on Innovative Computing, Information and Control (ICICIC), 2009. [25] J. Yen, W.A. Tiao, "Formal Framework for the Impacts of Design Strategies on Requirements", Proceedings of the Asian Fuzzy Systems Symposium, 1996. [26] S.G. MacDonell, A.R. Gray, J.M. Calvert, "FULSOME: A Fuzzy Logic Modeling Tool for Software Metricians", Annual Conference of the North American Fuzzy Information Processing Society (NAFIPS), 1999. [27] P.S. Sandhu, H. Singh, "A Fuzzy-Inference System based Approach for the Prediction of Quality of Reusable Software Components", Proceedings of the 14th International Conference on Advanced Computing and Communications (ADCOM), 2006. [28] K.K. Kawai, "Guidelines for Preparation of Requirements Document with Emphasis on the Functional Requirements" (in Portuguese), Master Thesis, Universidade Federal de São Carlos, São Carlos, 2005. [29] R. Real, J.M. Vargas, “The Probabilistic Basis of Jaccard's Index of Similarity”, Systematic Biology, vol. 45, no. 3, pp.380-385, 1996. Avalilable at: http://sysbio.oxfordjournals.org/content/45/3/380.full.pdf Last accessed November 2012. [30] G. Cysneiros, A. Zisman, "Traceability and Completeness Checking for Agent Oriented Systems", Proceedings of the 2008 ACM Symposium on Applied Computing, New York, USA, 2008. [31] D.K. Deeptimahanti, R. Sanyal, "Semi-automatic Generation of UML Models from Natural Language Requirements", Proceedings of the 4th India Software Engineering Conference 2011 (ISEC'11), Kerala, India, 2011. [32] G. Salton, J. 
Allan, "Text Retrieval Using the Vector Processing Model", 3rd Symposium on Document Analysis and Information Retrieval, University of Nevada, Las Vegas, 1994. Determining Integration and Test Orders in the Presence of Modularization Restrictions Wesley Klewerton Guez Assunção1,2 , Thelma Elita Colanzi1,3 , Silvia Regina Vergilio1 , Aurora Pozo1 1 DINF - Federal University of Paraná (UFPR), CP: 19081, CEP: 81.531-980, Curitiba, Brazil COINF - Technological Federal University of Paraná (UTFPR), CEP: 85.902-490, Toledo, Brazil 3 DIN - Computer Science Department - State University of Maringá (UEM), CEP: 87.020-900, Maringá, Brazil Email: {wesleyk, thelmae, silvia, aurora}@inf.ufpr.br 2 Abstract—The Integration and Test Order problem is very known in the software testing area. It is related to the determination of a test order of modules that minimizes stub creation effort, and consequently testing costs. A solution approach based on Multi-Objective and Evolutionary Algorithms (MOEAs) achieved promising results, since these algorithms allow the use of different factors and measures that can affect the stubbing process, such as number of attributes and operations to be simulated by the stub. However, works based on such approach do not consider different modularization restrictions related to the software development environment. For example, the fact that some modules can be grouped into clusters to be developed and tested by independent teams. This is a very common practice in most organizations, particularly in that ones that adopt a distributed development process. Considering this fact, this paper introduces an evolutionary and multi-objective strategy to deal with such restrictions. The strategy was implemented and evaluated with real systems and three MOEAs. The results are analysed in order to compare the algorithms performance, and to better understand the problem in the presence of modularization restrictions. We observe an impact in the costs and a more complex search, when restrictions are considered. The obtained solutions are very useful and the strategy is applicable in practice. Index Terms—Software testing; multi-objective evolutionary algorithms; distributed development. I. I NTRODUCTION The Integration and Test Order problem is concerning to the determination of a test sequence of modules that minimizes stubbing costs in the integration testing. The test is generally conducted in different phases. For example, the unit testing searches for faults in the smallest part to be tested, the module. In the integration test phase the goal is to find interaction faults between the units. In many cases, there are dependency relations between the modules, that is, to test a module A another module B needs to be available. When dependency cycles among modules exist it is necessary to break the cycle and to construct a stub for B. However, the stubbing process may be expensive and to reduce stubbing costs we can find in the literature several approaches. This is an active research topic that was recently addressed in a survey [1]. The most promising results were found by the search-based approach with Multi-Objective and Evolutionary Algorithms (MOEAs) [2]–[7]. These algorithms offer a multi-objective treatment to the problem. 
They use Pareto dominance concepts to provide the tester with a set of good solutions (orders) that represent the best trade-off among different factors (objectives) used to measure the stubbing costs, such as the number of operations, attributes, method parameters and return values that need to be emulated by the stub.

The use of MOEAs to solve the Integration and Test Order problem in the object-oriented context was introduced in our previous work [2]. After achieving satisfactory results, we applied MOEAs in the aspect-oriented context with different numbers of objectives [4], [6] and using different strategies to test aspects and classes [7] (based on the study of Ré and Masiero [8]). In [3], the approach was generalized and named MOCAITO (Multi-objective Optimization and Coupling-based Approach for the Integration and Test Order problem). MOCAITO solves the referred problem by using MOEAs and coupling measures. It is suitable for any type of unit to be integrated and for different contexts, including object-oriented, aspect-oriented, component-driven, software product line, and service-oriented software. The units to be tested can be components, classes, aspects, services and so on. The steps include: i) the construction of a model to represent the dependencies between the units; ii) the definition of a cost model related to the fitness functions and objectives; iii) the multi-objective optimization, i.e., the application of the algorithms; and iv) the selection of a test order to be used by the tester. MOCAITO was implemented and evaluated in the object and aspect-oriented contexts and presented better results when compared with other existing approaches.

However, we observe a limitation in MOCAITO and in all approaches found in the literature: in practice there may be different restrictions related to software development that they do not consider. For example, some relevant restrictions are related to software modularization. Modularity is an important design principle that allows the division of the software into modules. Modularity is useful for dealing with complexity; it improves comprehension, eases reuse, and reduces development effort. Furthermore, it facilitates management in distributed development [9], [10]. In this kind of development, clusters of related modules are generally developed and tested at separate locations by different teams. In a later stage all the clusters are then integrated. In some cases these teams may be members of the same organization; in other cases, collaboration or outsourcing involving different organizations may exist. The dependencies between modules across different clusters make integration testing more difficult to perform. Determining an order that implies a minimum cost is, in most situations, a very hard task for software engineers without an automated strategy.

Considering this fact, this paper introduces a strategy to help in this task and to determine the best module orders for the Integration and Test Order problem in the presence of modularization restrictions. The strategy is based on evolutionary optimization algorithms and is implemented in the MOCAITO approach. We implemented the MOEAs NSGA-II, SPEA2 and PAES, traditionally used in related work, and introduced evolutionary operators that take into account that some modules are developed and tested together and thus need to appear as a cluster in the solution (order). Moreover, four measures (objectives) are used.
Determining the orders in the presence of restrictions imposes limitations on the search space and consequently impacts the performance of the MOEAs. So, it is important to evaluate the impact of modularization restrictions on integration testing. To evaluate this impact, we conducted experiments applying two strategies: one with and another without software modularization restrictions. The experiment uses eight real systems and two development contexts: object and aspect-oriented ones.

The paper is organized as follows. Section II reviews related research, including the MOCAITO approach. Section III introduces the proposed strategy for the integration problem in the presence of modularization restrictions and shows how the algorithms were implemented. Section IV describes the experimental setup. Section V presents and analyses the obtained results. Finally, Section VI concludes the paper and points out future research.

II. RELATED WORK

The Integration and Test Order problem has been addressed in the literature by many works [1] in different software development contexts: object and aspect-oriented software, component-driven development, software product lines and service-oriented systems. The existing approaches are generally based on graphs where the nodes represent the units to be integrated and the edges the relationships between them [11]. The goal is to find an order for integrating and testing the units that minimizes stubbing costs. To this end, several optimization algorithms have been applied, as well as different cost measures. The so-called traditional approaches [11]–[15] are based on classical algorithms, which provide solutions that are not necessarily optimal. Metaheuristic, search-based techniques, such as Genetic Algorithms (GAs), provide better solutions since they avoid local optima [16]. Multi-objective algorithms offer a better treatment of the problem, which in fact depends on different and conflicting measures [2], [4].

However, we observe that all of the existing approaches and studies have a limitation: they do not consider, and were not evaluated with, real restrictions associated with software development, such as modularization restrictions and groups of modules that are developed together. Introducing a strategy that considers the problem in the presence of such restrictions is the goal of this paper. To this end, the strategy was proposed and implemented to be used with the multi-objective approach. This choice is justified by the studies described above, which show that, independently of the development context, multi-objective approaches present better results. Pareto concepts are used to determine a set of good, non-dominated solutions to the problem. A solution is non-dominated when no other solution is at least as good in all objectives and strictly better in at least one of them.

The solution that deals with modularization restrictions is proposed to be used with MOCAITO [3], mainly because this approach is generic and can be used in different software development contexts. MOCAITO is based on the multi-objective optimization of coupling measures, which are used as objectives by the algorithms. The steps of the approach are presented in Figure 1.
First of all, a dependency model that represents the dependency relations among the units to be integrated is built. This allows applying MOCAITO in different development contexts, with different kinds of units to be integrated. An example of such a model, used in our work, is the ORD (Object Relation Diagram) [11] and its extension for the aspect-oriented context [12]. In these diagrams, the vertices represent the modules (classes or aspects), and the edges represent relations such as association, crosscutting association, use, aggregation, composition, inheritance, and inter-type declarations.

Another step is the definition of a cost model. This model is generally based on software measures, used as fitness functions (objectives) by the optimization algorithms. Such measures are related to the costs of the stubbing process. MOCAITO was evaluated with different numbers of objectives, traditionally considered in the literature and based on four coupling measures. Considering that mi and mj are two coupled modules and that, for a given test order t with n modules and a set d of dependencies to be broken, k denotes any module included before module i, the coupling measures are defined as follows:

- Attribute Coupling (A): the maximum number of attributes to be emulated in stubs related to the broken dependencies [16]. A is represented by a matrix AM(i, j), where rows and columns are modules and i depends on j. A is calculated as $A(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} AM(i, j)$, with $j \neq k$.

- Operation Coupling (O): the maximum number of operations to be emulated in stubs related to the broken dependencies [16]. O is represented by a matrix OM(i, j), where rows and columns are modules and i depends on j. O is computed as $O(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} OM(i, j)$, with $j \neq k$.

- Number of distinct return types (R): the number of distinct return types of the operations locally declared in module mj that are called by operations of module mi. Returns of type void are not counted, since they represent the absence of a return value. Similarly to the previous measures, R is given by $R(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} RM(i, j)$, with $j \neq k$.

- Number of distinct parameter types (P): the number of distinct parameters of the operations locally declared in mj that are called by operations of mi. When there is operation overloading, the number of parameters is equal to the sum of all distinct parameter types among all implementations of each overloaded operation. Thus the worst case is considered, represented by situations in which the coupling consists of calls to all implementations of a given operation. P is given by $P(t) = \sum_{i=1}^{n} \sum_{j=1}^{n} PM(i, j)$, with $j \neq k$.

[Fig. 1. MOCAITO Steps (extracted from [3]): construction of the dependency model, definition of the cost model, multi-objective optimization, and order selection, together with their artifacts (dependency information, dependency model, cost information, cost model, constraints, rules, test orders and the selected test order).]
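To make the evaluation of such an objective concrete, the following sketch (ours, not MOCAITO code; the matrix layout and the class name are assumptions) computes the attribute coupling A(t) of a candidate order by summing the entries of AM for the dependencies that remain broken, i.e., whose target module has not been placed before the dependent module.

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (not MOCAITO code): evaluating the attribute
// coupling A(t) of a test order. AM[i][j] holds the number of attributes
// that a stub of module j must emulate when i depends on j.
public final class AttributeCoupling {

    public static int evaluate(int[] order, int[][] AM) {
        Set<Integer> placed = new HashSet<>();   // modules already integrated
        int total = 0;
        for (int i : order) {
            for (int j = 0; j < AM[i].length; j++) {
                // The dependency i -> j is broken only if j was not placed
                // before i; in that case a stub for j is required.
                if (AM[i][j] > 0 && !placed.contains(j)) {
                    total += AM[i][j];
                }
            }
            placed.add(i);
        }
        return total;
    }
}

The other three measures (O, R and P) follow the same pattern over their respective matrices.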
After this, the multi-objective algorithms are applied. They can work with constraints given by the tester, which can make an order invalid. Some constraints, adopted in the approach evaluation [3], are not to break inheritance and inter-type declaration dependencies. These dependencies are complex to simulate, so, to deal with these types of constraints, the dependent modules are preceded by the required modules. The treatment involves checking the test order from the first to the last module; if a precedence constraint is broken, the module in question is placed at the end of the order. As output, the algorithms generate a set of solutions for the problem that have the best trade-off among the objectives. The orders can be ranked according to some priorities (rules) of the tester.

In [3], MOCAITO was evaluated in the object and aspect-oriented contexts with different numbers of objectives and with three MOEAs: NSGA-II [17], SPEA2 [18] and PAES [19]. These three MOEAs were chosen because they are well known, largely applied, and adopt different evolution and diversification strategies [20]. Moreover, knowing which algorithm is more suitable to solve a particular problem is a question that needs to be answered by means of experimental results. The main results found are: there is no difference among the algorithms for simple systems and contexts; SPEA2 was the most expensive, with the greatest runtime; NSGA-II was the most suitable in the general case (considering different quality indicators and all systems); and PAES presented better performance for more complex systems.

However, the approach was not evaluated taking into account some real restrictions generally associated with software development, mentioned in the last section. Most organizations nowadays have adopted distributed development but, in spite of this, we observe in the survey of the literature [1] that the related restrictions are not considered by studies that deal with the Integration and Test Order problem. In this sense, the main contribution of this paper is to introduce and evaluate a strategy based on optimization algorithms to solve the referred problem in the presence of modularization restrictions. In fact, in the presence of such restrictions a new problem (a variant of the original one) emerges, which presents new challenges related to task allocation versus the establishment of an integration and test order. For this, a novel solution strategy is necessary and proposed, including a new representation and new genetic operators. This new solution strategy is evaluated with the MOCAITO approach, using its cost and dependency models. However, it could be applied with other approaches. The next section describes implementation aspects of the strategy to deal with such modularization restrictions.

III. WORKING WITH MODULARIZATION RESTRICTIONS

A restriction is a condition that a solution is required to satisfy. In the context of this study, restrictions are related to modularization. We consider that some modules are grouped by the software engineer, forming a cluster. Modules in a cluster must be developed and tested together. Figure 2 presents an example of modularization restrictions, considering a system with twelve modules identified from 1 to 12. Due to a distributed development environment, the software engineer determines three clusters (groups), identified by A1, A2 and A3. Mixing modules of distinct clusters is not valid, as happens in Order C. Using Order C, the developers of A1 need to wait for the developers of A3 to finish some modules before they can finish and test their own modules. Orders A and B are examples of valid orders. These orders allow the teams to work independently. In Order A the modules of cluster A1 are firstly developed, integrated and tested.
Once the team responsible for the modules of A1 finishes its work, the development of the modules of cluster A3 can start, with all the modules of A1 available to be used in the integration test. Similarly, when the team responsible for cluster A2 starts its work, the modules of A1 and A3 are already available. The independence of the development of each cluster by different teams also occurs in Order B, since the modules of each cluster are in sequence to be developed, integrated and tested. Although Figure 2 shows the modules in sequence, when there are no dependencies between some clusters the development may be performed in a parallel way. In this case, each team could develop and test the modules of its cluster according to the test order, and later the modules would be integrated, also in accordance with the same test order.

[Fig. 2. Example of modularization restrictions: the twelve system modules, their grouping into clusters A1, A2 and A3, and three integration and test orders, in which Orders A and B keep the modules of each cluster together while Order C mixes modules of distinct clusters.]

A. Problem Representation

To implement a solution to the problem with restrictions, the first point refers to the problem representation. The traditional way to deal with the problem uses as representation an array of integers, where each number in the array corresponds to a module identifier [3]. However, a more elaborate representation is needed to consider the modules grouped into clusters. A class Cluster, as presented in Figure 3, was implemented. An object of the class Cluster is composed of two attributes: (i) a cluster identifier (id) of type integer; and (ii) an array of integers (modules) that represents the modules in the cluster. An individual is composed of n Cluster objects, where n is the number of clusters.

[Fig. 3. Class Cluster, with the attributes +id: int and +modules: ArrayList<Integer>.]
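A minimal sketch of this representation, written by us from the description above (only the names Cluster, id and modules come from the paper; everything else is illustrative), could look as follows.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the representation of Section III-A: an individual
// is a sequence of clusters, and each cluster holds the identifiers of the
// modules that must stay together in the test order.
class Cluster {
    int id;                                          // cluster identifier
    ArrayList<Integer> modules = new ArrayList<>();  // modules of the cluster
}

class Individual {
    // The order of the clusters and the order of the modules inside each
    // cluster together define the integration and test order.
    List<Cluster> clusters = new ArrayList<>();

    // Flattens the individual into a plain test order of module ids.
    List<Integer> toTestOrder() {
        List<Integer> order = new ArrayList<>();
        for (Cluster c : clusters) {
            order.addAll(c.modules);
        }
        return order;
    }
}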
B. Evolutionary Operators

The traditional way [2]–[5] to apply the evolutionary operators to the Integration and Test Order problem is the same adopted in a permutation problem. However, with the modularization restrictions, a new way to generate and deal with the individuals (solutions) is required. Next, we present the implemented crossover and mutation operators.

1) Crossover: The MOCAITO approach applies the two-point crossover [3]. However, a simple random selection of two points to perform the crossover could disperse the modules of a cluster across the order. So, considering the modularization restrictions, two types of crossover were implemented: (i) Inter Cluster; and (ii) Intra Cluster, which are depicted in Figure 4.

The goal of the Inter Cluster crossover is to generate children by exchanging complete clusters between their parents. As illustrated in the example of Figure 4(a), after the random selection of the cluster to be exchanged, Child1 receives Cluster2 and Cluster3 from Parent1, and Cluster1 from Parent2. In the same way, Child2 receives Cluster2 and Cluster3 from Parent2, and Cluster1 from Parent1.

The Intra Cluster crossover aims at creating new solutions that receive clusters generated with the two-point crossover of a specific cluster. After the random selection of one cluster, the traditional two-point crossover for permutation problems is applied to it. The other clusters, which do not participate in the crossover, are copied from the parents to the children. As illustrated in Figure 4(b), Cluster1 was randomly selected and two crossover points were defined; Cluster2 and Cluster3 from Parent1 are just copied to Child1 and, in the same way, Cluster2 and Cluster3 from Parent2 are just copied to Child2. During the evolutionary process, two parents are selected in each crossover and four children are generated: two using the Inter Cluster crossover and two using the Intra Cluster crossover.

[Fig. 4. Crossover Operator (genetic operators in the MECBA-Cluster combined strategy): (a) Inter Cluster; (b) Intra Cluster.]

2) Mutation: The MOCAITO approach applies the traditional swap mutation [3], swapping module positions in the order. But, again, the simple application of this mutation could disperse the modules of a cluster across the order. So, two different types of mutation are implemented: (i) Inter Cluster; and (ii) Intra Cluster, which are presented in Figure 5.

Both types of mutation are simple. While the Inter Cluster mutation swaps cluster positions in the order, the Intra Cluster mutation swaps module positions within a cluster. Figure 5(a) illustrates the Inter Cluster mutation, where the positions of Cluster1 and Cluster3 are swapped. Figure 5(b) presents an example of the Intra Cluster mutation, where, after the random selection of Cluster1, the positions of Modules 1 and 3 are swapped. During the evolutionary process both kinds of mutation have a 50% probability of being chosen.

[Fig. 5. Mutation Operator: (a) Inter Cluster; (b) Intra Cluster.]
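The two cluster-preserving mutations can be sketched as below (our illustration, reusing the Individual and Cluster types of the earlier sketch; java.util.Random merely stands in for whatever random source the algorithms actually use).

import java.util.Collections;
import java.util.Random;

// Illustrative sketch of the Inter/Intra Cluster mutations of Section III-B.
final class ClusterMutation {
    private final Random random = new Random();

    // Each application chooses one of the two mutations with 50% probability.
    void mutate(Individual ind) {
        if (random.nextBoolean()) {
            interCluster(ind);
        } else {
            intraCluster(ind);
        }
    }

    // Inter Cluster: swap the positions of two clusters in the order.
    private void interCluster(Individual ind) {
        int a = random.nextInt(ind.clusters.size());
        int b = random.nextInt(ind.clusters.size());
        Collections.swap(ind.clusters, a, b);
    }

    // Intra Cluster: pick one cluster and swap two of its modules.
    private void intraCluster(Individual ind) {
        Cluster c = ind.clusters.get(random.nextInt(ind.clusters.size()));
        if (c.modules.size() < 2) return;
        int a = random.nextInt(c.modules.size());
        int b = random.nextInt(c.modules.size());
        Collections.swap(c.modules, a, b);
    }
}

The Inter and Intra Cluster crossovers described above can be implemented analogously, always exchanging or recombining whole clusters so that no cluster is ever split across the order.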
C. Repairing Broken Dependencies

There are two types of treatment for repairing orders that break the dependency constraints (inheritance and inter-type declarations). In the Intra Cluster treatment, the constraints between modules in the same cluster are checked and the precedence is corrected by placing the corresponding module at the end of the cluster. After all the precedences of the modules inside the clusters are correct, the constraints between modules of different clusters are checked during the Inter Cluster treatment. The precedence is corrected by placing the cluster at the end of the order, so that the dependent cluster becomes the last one of the order.

IV. EXPERIMENTAL SETUP

The goal of the conducted experiment is to evaluate the solution of the Integration and Test Order problem in the presence of modularization restrictions and to answer questions such as: "How does the use of the modularization restrictions impact the performance of the algorithms?" and "What are the usefulness and the applicability of the solutions obtained by the proposed strategy?". For the first research question we followed the GQM method [21]¹. In the case of the second question, a qualitative analysis was performed. The experiment was conducted using a methodology similar to, and the same systems of, related work [3]. Two strategies were applied and compared: a strategy named here MC, which deals with modularization restrictions using clusters, according to the implementation described in the last section, and a strategy M, applied according to [3], without using modularization restrictions.

¹ Due to lack of space, the GQM table is available at: https://dl.dropboxusercontent.com/u/28909248/GQM-Method.pdf.

A. Used Systems

The study was conducted with eight real systems, the same ones used in [3]. Table I presents information about these systems, such as the number of modules (classes for Java programs; classes and aspects for AspectJ programs), dependencies, LOC (Lines of Code) and clusters.

TABLE I. USED SYSTEMS
System          Language  Modules  Dependencies  LOC    Clusters
BCEL            Java      45       289           2999   3
JBoss           Java      150      367           8434   8
JHotDraw        Java      197      809           20273  13
MyBatis         Java      331      1271          23535  24
AJHotDraw       AspectJ   321      1592          18586  12
AJHSQLDB        AspectJ   301      1338          68550  15
HealthWatcher   AspectJ   117      399           5479   7
Toll System     AspectJ   77       188           2496   7

B. Clusters Definition

To define the clusters of modules, the separation of concerns principle [22] was followed. Considering this principle, the effort to develop the software, and consequently to test it, becomes smaller. Following the separation of concerns, the modules of a cluster should be interconnected in a relatively simple manner, presenting low coupling to other clusters. Hence, this procedure benefits distributed development, since it decreases the interdependence between the teams. In this way, each system was divided into clusters according to the concerns that they realize, so that each team should develop, integrate and test one cluster that deals with one concern present in the system.
Aiming at confirming the interdependencies between the modules of the clusters, we checked such division by constructing directed graphs and considering the inheritance and inter-type declaration dependencies, the ones that should not be broken. The number of clusters for each system is presented in the last column of Table I.

C. Obtaining the Solution Sets

To analyze the results we use different sets of solutions, obtained in different ways. Below, we describe how we obtained each solution set used:
• PFapprox: one set PFapprox for a system was obtained in each run of an algorithm. Each MOEA was executed 30 times for each system in order to know the behavior of each algorithm when solving the problem. So, at the end, 30 sets PFapprox were obtained.
• PFknown: this set was obtained for each system through the union of the 30 sets PFapprox, removing dominated and repeated solutions. PFknown represents the best solutions found by each MOEA.
• PFtrue: this set was obtained for each system through the union of the sets PFknown, removing dominated and repeated solutions. PFtrue represents the best solutions known for the problem. This procedure to obtain the best solutions to a problem is recommended when the ideal set of solutions is not known [23].

D. Quality Indicators

To compare the results presented by the MOEAs, we used two quality indicators commonly used in the MOEA literature: (i) Coverage and (ii) Euclidean Distance from an Ideal Solution (ED). The Coverage (C) [19] calculates the proportion of solutions in one Pareto front that are dominated by the solutions of another front. The function C(PFa, PFb) maps the ordered pair (PFa, PFb) into the range [0,1] according to the proportion of solutions in PFb that are dominated by PFa. Similarly, we compute C(PFb, PFa) to obtain the proportion of solutions in PFa that are dominated by PFb. The value 0 for C indicates that the solutions of the former set do not dominate any element of the latter set; on the other hand, the value 1 indicates that all elements of the latter set are dominated by elements of the former set. The Euclidean Distance from an Ideal Solution (ED) is used to find the solutions closest to the best objectives. It is based on Compromise Programming [24], a technique used to support the decision maker when a set of good solutions is available. An Ideal Solution has the minimum value of each objective of PFtrue, considering a minimization problem.
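A minimal sketch of the Coverage indicator, under our own assumptions about the data layout (each solution is its vector of objective values, and all objectives are minimized), is shown below; it is illustrative and not the evaluation code used in the experiments.

import java.util.List;

// Illustrative sketch of the Coverage indicator of Section IV-D for
// minimization objectives. Solutions are plain objective vectors.
final class Indicators {

    // Pareto dominance: 'a' dominates 'b' if it is no worse in every
    // objective and strictly better in at least one.
    static boolean dominates(double[] a, double[] b) {
        boolean better = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) better = true;
        }
        return better;
    }

    // Coverage C(PFa, PFb): proportion of solutions in PFb dominated by
    // at least one solution of PFa.
    static double coverage(List<double[]> pfA, List<double[]> pfB) {
        int covered = 0;
        for (double[] b : pfB) {
            for (double[] a : pfA) {
                if (dominates(a, b)) { covered++; break; }
            }
        }
        return (double) covered / pfB.size();
    }
}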
E. Parameters of the Algorithms

The same methodology adopted in [2]–[4] was adopted to configure the algorithms. The parameters are shown in Table II. The number of fitness evaluations was used as the stop criterion for the algorithms, which allows comparing solutions obtained with similar computational cost. Moreover, the algorithms were executed on the same computer and the runtime was recorded.

TABLE II. MOEA PARAMETERS
Strategy MC:
Parameter                     NSGA-II  PAES   SPEA2
Population Size               300      300    300
Fitness Evaluations           60000    60000  60000
Archive Size                  -        250    250
Crossover Rate                0.95     -      0.95
Inter Cluster Crossover Rate  1.0      -      1.0
Intra Cluster Crossover Rate  1.0      -      1.0
Mutation Rate                 0.02     1      0.02
Inter Cluster Mutation Rate   0.5      0.5    0.5
Intra Cluster Mutation Rate   0.5      0.5    0.5
Strategy M:
Parameter                     NSGA-II  PAES   SPEA2
Population Size               300      300    300
Fitness Evaluations           60000    60000  60000
Archive Size                  -        250    250
Crossover Rate                0.95     -      0.95
Mutation Rate                 0.02     1      0.02

F. Threats to Validity

The main threats to our work are related to the evaluation of the proposed solution. In fact, an ideal evaluation should consider similar strategies and different kinds of algorithms, including the traditional ones. However, we have not found a similar strategy in the literature. A random strategy could be used; however, such a strategy has been shown to present the worst results in the related literature, and the obtained results would be obvious. Besides, the traditional approaches, not based on evolutionary algorithms, are very difficult (some of them impossible) to adapt to consider the modularization restrictions and different cost measures. Hence, we think that addressing such restrictions with multi-objective and evolutionary approaches is more promising and practical. In addition, a comparison with a strategy that does not consider the restrictions can provide insights about the impact of using them.

Another threat is related to the clusters and systems used. An ideal scenario would consider clusters used in a real context of distributed development. To mitigate this threat we adopted, as the criterion to compose the clusters, the separation of concerns, which we think is implicitly considered in team allocation. A further threat is the reduced number of systems used, which can influence the generalization of the obtained results. To reduce this influence we selected object and aspect-oriented systems with different sizes and complexities, given by the number of modules and dependencies.

V. RESULTS AND ANALYSIS

In this section the results are presented and evaluated to answer the research questions. The impact of using restrictions is analysed and the practical use of MC is addressed.

A. On the Impact of Using Modularization Restrictions

In this section the impact of the restrictions is analysed in two ways (subsections): (i) evaluating the performance of the MOEAs using MC, and (ii) comparing the strategies M and MC. At the end, a synthesis of the impact of the modularization restrictions is presented.

1) Performance of the MOEAs using MC: The analysis conducted in this section allows evaluating the performance of the MOEAs when the modularization restrictions are considered. It is based on the quality indicators described previously. Table III presents the values of the indicator C for the sets PFknown of each MOEA. The results show differences for five systems: BCEL, MyBatis, AJHotDraw, AJHSQLDB and Toll System. For BCEL, the NSGA-II solutions dominate 75% of the PAES solutions and around 60% of the SPEA2 solutions. The SPEA2 solutions also dominate 75% of the PAES solutions. For MyBatis, the PAES solutions dominated 100% of the NSGA-II and SPEA2 solutions, and the NSGA-II solutions dominated around 73% of the SPEA2 solutions. For AJHotDraw, PAES was also better, but SPEA2 was better than NSGA-II. For AJHSQLDB, a similar behaviour was observed. For Toll System, the NSGA-II and SPEA2 solutions dominate 50% of the PAES solutions. Hence, NSGA-II and SPEA2 presented the best results.

TABLE III. COVERAGE VALUES (C) - STRATEGY MC
(each cell shows C(PFknown of the row MOEA, PFknown of the column MOEA))
System          Row MOEA  NSGA-II   PAES      SPEA2
BCEL            NSGA-II   -         0.75      0.578947
                PAES      0         -         0
                SPEA2     0.181818  0.75      -
JBoss           NSGA-II   -         0         0
                PAES      0         -         0
                SPEA2     0         0         -
JHotDraw        NSGA-II   -         0.027972  0.166667
                PAES      0.427273  -         0.45098
                SPEA2     0.345455  0.020979  -
MyBatis         NSGA-II   -         0         0.729167
                PAES      1         -         1
                SPEA2     0.349515  0         -
AJHotDraw       NSGA-II   -         0         0.16129
                PAES      1         -         1
                SPEA2     0.666667  0         -
AJHSQLDB        NSGA-II   -         0         0.117647
                PAES      1         -         1
                SPEA2     0.540816  0         -
Health Watcher  NSGA-II   -         0.166667  0
                PAES      0         -         0
                SPEA2     0         0.166667  -
Toll System     NSGA-II   -         0.5       0
                PAES      0         -         0
                SPEA2     0         0.5       -
Table IV contains the results obtained for the indicator ED. The second column presents the cost of the ideal solutions. Such costs were obtained by taking the lowest value of each objective from all solutions of the PFtrue of each system, independently of the solution in which they were achieved. The other columns present, for each MOEA, the solution closest to the ideal solution and its cost in terms of each objective.

TABLE IV. COST OF THE IDEAL SOLUTION AND LOWEST ED FOUND - STRATEGY MC
System          Ideal Solution      NSGA-II: lowest ED / cost      PAES: lowest ED / cost         SPEA2: lowest ED / cost
BCEL            (40,54,33,59)       24.5764 / (57,59,50,60)        74.0000 / (51,59,34,132)       23.4094 / (45,63,52,68)
JBoss           (25,17,4,14)        2.0000 / (25,17,6,14)          2.0000 / (25,17,6,14)          2.0000 / (25,17,6,14)
JHotDraw        (283,258,92,140)    63.2297 / (301,274,105,197)    63.2297 / (301,274,105,197)    63.2297 / (301,274,105,197)
MyBatis         (259,148,57,145)    203.2855 / (1709,204,81,191)   147.5263 / (282,235,78,260)    221.4746 / (386,267,97,276)
AJHotDraw       (190,100,40,62)     51.6817 / (196,105,43,113)     49.1325 / (197,106,45,110)     49.6488 / (200,106,45,110)
AJHSQLDB        (3732,737,312,393)  526.5302 / (4217,879,415,499)  167.2692 / (3836,810,365,488)  403.7809 / (4069,879,403,538)
Health Watcher  (115,149,49,52)     39.7869 / (138,166,67,73)      39.7869 / (138,166,67,73)      39.7869 / (138,166,67,73)
Toll System     (68,41,18,16)       5.4772 / (68,42,20,21)         5.4772 / (68,42,20,21)         5.4772 / (68,42,20,21)

For the systems JBoss, JHotDraw, Health Watcher and Toll System, all MOEAs found the same solution with the lowest value of ED. For BCEL, SPEA2 found the solution with the lowest ED. Finally, PAES obtained the solutions with the lowest ED for MyBatis, AJHotDraw and AJHSQLDB.

From the results of both indicators it is possible to see that, in the context of our study, PAES is the best MOEA, since it obtained the best results for six systems: JBoss, JHotDraw, MyBatis, AJHotDraw, AJHSQLDB and Health Watcher. Such systems have the greatest numbers of modules and clusters (Table I). NSGA-II is the second best MOEA, since it found the best results for five systems: BCEL, JBoss, JHotDraw, Health Watcher and Toll System. SPEA2 also obtained the best results for four systems: BCEL, JBoss, Health Watcher and Toll System. NSGA-II and SPEA2 have similar behavior, presenting satisfactory results for systems with few modules and few clusters (Table I).
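As an illustration of how the solution closest to the ideal one can be identified, the sketch below (ours, not the experiment code) builds the ideal point from the minima of each objective over PFtrue and returns the solution of a front with the smallest Euclidean distance to it.

import java.util.List;

// Illustrative sketch of the ED indicator: build the ideal point from
// PFtrue and pick the solution of a front closest to it.
final class EuclideanDistanceSelector {

    static double[] idealPoint(List<double[]> pfTrue) {
        double[] ideal = pfTrue.get(0).clone();
        for (double[] s : pfTrue) {
            for (int i = 0; i < ideal.length; i++) {
                ideal[i] = Math.min(ideal[i], s[i]);  // minimization objectives
            }
        }
        return ideal;
    }

    static double[] closestToIdeal(List<double[]> front, double[] ideal) {
        double best = Double.POSITIVE_INFINITY;
        double[] chosen = null;
        for (double[] s : front) {
            double sum = 0.0;
            for (int i = 0; i < s.length; i++) {
                double d = s[i] - ideal[i];
                sum += d * d;
            }
            double ed = Math.sqrt(sum);
            if (ed < best) {
                best = ed;
                chosen = s;
            }
        }
        return chosen;
    }
}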
2) Comparing the strategies M and MC: Aiming at analysing the impact of using restrictions, two pieces of information were collected for the strategies M and MC: the number of obtained solutions and the runtime. These numbers are presented in Table V. The third and the sixth columns contain the cardinality of PFtrue. The fourth and the seventh columns present the mean number of solutions of the sets PFapprox and, in parentheses, the cardinality of PFknown. The fifth and eighth columns present the mean runtime (in seconds) used to obtain each PFapprox and, in parentheses, the standard deviation.

Verifying the number of solutions of PFtrue, it can be noticed that for BCEL and MyBatis the number of solutions found by MC was lower than that found by M. On the other hand, for JBoss and JHotDraw this number was greater in MC than in M. So, it can be observed that the systems with more solutions found by M have fewer solutions found by MC, and vice-versa.

Although the strategies M and MC involve the same effort in terms of the number of fitness evaluations, their runtimes differ greatly (Figure 6 and Table V). For all systems, NSGA-II, PAES and SPEA2 spent more runtime with strategy MC. The single exception was SPEA2, which spent less time with strategy MC for JHotDraw. Among the three MOEAs, SPEA2 spent the greatest runtime. This fact allows us to infer that, in the presence of several restrictions in the search space, the SPEA2 behavior may become random.

Figure 7 presents the solutions in the objective space. Due to the dimension limitation of the graphics, only three measures are presented in the pictures. In the case of JHotDraw (Figure 7(a)), the solutions of M are closer to the minimum objectives (A=0, O=0, R=0, P=0). These solutions are not feasible for the strategy MC due to the restrictions, which force the MOEAs to find solutions in other places of the search space, where a greater number of solutions are feasible but more expensive. MyBatis illustrates this point well. Figure 7(b) shows that the M solutions for MyBatis are in the same area, next to the minimum objectives. The restrictions force the MOEAs to explore other areas of the search space and, in this case, a lower number of solutions is found. These solutions are more expensive. From the results, it is possible to state that the restrictions imply a more complex search, limiting the search space and imposing a greater stubbing cost.

To better evaluate the impact on the cost of the solutions obtained by both strategies, we use the indicator ED. The solutions closest to the ideal solution are the ones with the best trade-off among the objectives and are good candidates to be adopted by the tester. We compare the cost of the ideal solutions with the cost of the solutions obtained by a MOEA. In our comparison we chose the PAES solutions, since this algorithm presented the best performance, with the lowest ED values for six systems. These costs are presented in Table VI.

TABLE VI. COST OF THE SOLUTIONS IN BOTH STRATEGIES
                Strategy M                              Strategy MC
System          Ideal Solution      PAES Solution       Ideal Solution      PAES Solution
BCEL            (45,24,0,96)        (64,39,15,111)      (40,54,33,59)       (51,59,34,132)
JBoss           (10,6,2,9)          (10,6,2,9)          (25,17,4,14)        (25,17,6,14)
JHotDraw        (27,10,1,12)        (30,12,1,18)        (283,258,92,140)    (301,274,105,197)
MyBatis         (203,70,13,47)      (265,172,49,184)    (259,148,57,145)    (282,235,78,260)
AJHotDraw       (39,12,0,18)        (46,19,1,34)        (190,100,40,62)     (197,106,45,110)
AJHSQLDB        (1263,203,91,138)   (1314,316,138,236)  (3732,737,312,393)  (3836,810,365,488)
Health Watcher  (9,2,0,1)           (9,2,0,1)           (115,149,49,52)     (138,166,67,73)
Toll System     (0,0,0,0)           (0,0,0,0)           (68,41,18,16)       (68,42,20,21)

We can observe that, except for BCEL, the costs of the MC solutions are notably greater than the costs of the M solutions.
In most cases the MC cost is two or three times greater, depending on the measure. The greatest differences were obtained for the programs Health Watcher, Toll System and JHotDraw. In the two first cases, optimal solutions were found by all the algorithms with the strategy M. These solutions are not feasible when the restrictions are considered.

TABLE V. NUMBER OF SOLUTIONS AND RUNTIME
                          Strategy M                                        Strategy MC
System          MOEA      #PFtrue  Solutions (PFknown)  Runtime (s)         #PFtrue  Solutions (PFknown)  Runtime (s)
BCEL            NSGA-II   37       37.43 (37)           5.91 (0.05)         15       7.57 (11)            8.61 (0.11)
                PAES               39.30 (37)           6.58 (1.25)                  3.40 (8)             29.89 (22.25)
                SPEA2              36.70 (37)           123.07 (18.84)               8.53 (19)            3786.79 (476.23)
JBoss           NSGA-II   1        1.00 (1)             18.73 (0.20)        2        1.97 (2)             42.50 (0.47)
                PAES               1.13 (1)             10.69 (0.62)                 2.87 (2)             56.15 (12.50)
                SPEA2              1.00 (1)             2455.35 (612.18)             2.17 (2)             3536.01 (335.97)
JHotDraw        NSGA-II   11       8.40 (10)            29.85 (0.34)        153      45.80 (110)          71.90 (0.45)
                PAES               10.47 (19)           24.29 (1.50)                 85.47 (143)          51.18 (2.82)
                SPEA2              9.63 (9)             922.99 (373.98)              49.17 (102)          532.83 (81.93)
MyBatis         NSGA-II   789      276.37 (941)         74.03 (0.87)        200      72.60 (103)          189.91 (0.83)
                PAES               243.60 (679)         104.30 (7.91)                108.43 (200)         132.37 (3.91)
                SPEA2              248.77 (690)         128.88 (2.65)                64.33 (144)          517.52 (67.52)
AJHotDraw       NSGA-II   94       70.03 (79)           75.05 (0.57)        31       16.30 (36)           194.34 (0.83)
                PAES               40.73 (84)           62.07 (2.16)                 26.57 (31)           115.12 (2.82)
                SPEA2              68.87 (78)           195.56 (28.22)               17.53 (31)           1005.36 (268.37)
AJHSQLDB        NSGA-II   266      156.63 (360)         62.34 (0.53)        240      62.07 (196)          160.38 (1.64)
                PAES               145.97 (266)         75.62 (5.27)                 122.57 (240)         122.01 (4.92)
                SPEA2              119.10 (52)          104.29 (0.68)                58.30 (170)          505.11 (101.90)
Health Watcher  NSGA-II   1        1.00 (1)             12.72 (0.15)        11       10.70 (11)           27.52 (0.10)
                PAES               1.07 (1)             8.27 (0.58)                  7.47 (12)            46.98 (5.34)
                SPEA2              1.00 (1)             2580.39 (596.29)             10.20 (11)           990.19 (95.94)
Toll System     NSGA-II   1        1.00 (1)             7.33 (0.09)         4        4.27 (4)             13.23 (0.09)
                PAES               1.07 (1)             4.10 (0.75)                  3.50 (4)             31.13 (16.23)
                SPEA2              1.00 (1)             3516.71 (570.76)             4.00 (4)             2229.26 (271.47)

[Fig. 6. Runtime of the strategies M and MC for each system: (a) NSGA-II; (b) PAES; (c) SPEA2.]

[Fig. 7. PFtrue with and without modularization restrictions: (a) JHotDraw; (b) MyBatis.]

3) Summarizing impact results: Based on the results, it is clear that the modularization restrictions increase the integration testing costs. Hence, the strategy MC can also be used in the modularization task as a simulation and decision-support tool. For example, in a distributed software development, the strategy MC can be used to allocate the modules to the different teams so as to ensure lower testing costs. Furthermore, all the implemented algorithms can be used and present good results, solving the problem efficiently. However, we observe that for the most complex systems PAES is the best choice.
B. Practical Use of MC

This section evaluates, through an example, the usefulness and applicability of the proposed strategy. We showed in the last section that the strategy MC implies greater costs than M. However, the automatic determination of such orders in the presence of restrictions is fundamental: when the restrictions are considered, a huge additional effort is necessary. The usefulness of the proposed strategy relies on the infeasibility of manually obtaining a satisfactory solution for the problem. To illustrate this, consider the smallest system used in the experiment, BCEL, with 3 clusters and 45 modules. For it, there are 1.22E+47 different permutations of the clusters and of the modules inside the clusters to be analysed. For the other systems this effort is even higher. Since the task of determining a test order is delegated to a MOEA, the tester only needs to concentrate his/her effort on choosing an order produced by the algorithm, as explained in the example of how to use the proposed strategy presented next.

1) Example of Practical Use of MC: Table VII presents some solutions from the set of non-dominated solutions achieved by PAES for JHotDraw. The first column presents the cost of the solutions (measures A, O, R, P) and the second column presents the order of the modules in the clusters. JHotDraw is the fourth largest system (197 modules) and the third largest considering the clusters (13 clusters). For this system PAES found 143 solutions. Therefore, the software engineer needs to choose which of these orders will be used.

To demonstrate how each solution should be analysed, we use the first solution from the table, whose cost is (A=283, O=292, R=102, P=206). The order shown in the second column, {87, 9, 196, ...}, ..., {..., 120, 194, 141}, indicates the sequence in which the modules must be developed, integrated and tested. Using this order, performing the integration testing of the system will require the construction of stubs to simulate 283 attributes; 292 operations, which may be class methods, aspect methods or aspect advices; 102 distinct return types; and 206 distinct parameter types.

To choose among the solutions presented in Table VII, the rule of the lowest cost for a given measure could be used. The lowest cost is highlighted in bold; therefore, the first solution has the lowest cost for measure A, the second solution has the lowest cost for measure O, and so on. The fifth solution provides the best balance of cost among the four measures and was selected based on the indicator ED (Table IV). So, if the system under development presents attributes that are complex to construct, then the first solution should be used; if the system presents operation parameters that are difficult to simulate, the fourth solution should be used. However, if the software tester chooses to prioritize all of the measures, the fifth solution is the best option since it is closer to the minimum cost for all of the measures. This diversity of solutions with different trade-offs among the measures is one of the great advantages of using multi-objective optimization, easing the selection of an order of modules that meets the needs of the tester.
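A simple illustration of such a selection rule (our sketch, not part of the approach's tooling) is to pick, from the non-dominated set, the order with the lowest value for the measure the tester prioritizes.

import java.util.List;

// Illustrative sketch: selecting an order from the non-dominated set by
// prioritizing one coupling measure.
final class OrderSelection {

    // costs.get(i)[m] is the value of measure m (e.g. 0=A, 1=O, 2=R, 3=P)
    // for the i-th non-dominated order.
    static int selectByMeasure(List<double[]> costs, int prioritizedMeasure) {
        int best = 0;
        for (int i = 1; i < costs.size(); i++) {
            if (costs.get(i)[prioritizedMeasure] < costs.get(best)[prioritizedMeasure]) {
                best = i;
            }
        }
        return best;   // index of the chosen order
    }
}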
VI. CONCLUDING REMARKS

This work described a strategy to solve the Integration and Test Order problem in the presence of modularization restrictions. The strategy is based on multi-objective evolutionary optimization algorithms and generates orders considering that some modules are grouped and need to be developed and tested together due, for instance, to a distributed development. Specific evolutionary operators were proposed to allow mutation and crossover inside a cluster of modules. To evaluate the impact of such restrictions, the strategy, named MC, was applied using three different multi-objective evolutionary algorithms and eight real systems. During the evaluation, the results obtained from the application of MC were compared with another strategy without restrictions.

With respect to our first research question, all the MOEAs achieved similar results, so they are able to satisfactorily solve the referred problem. The results point out that the modularization restrictions impact significantly on the optimization process. The search becomes more complex, since the restrictions limit the search space, and the stubbing cost increases. Therefore, as the modularization restrictions impact the costs, the proposed strategy can be used as a decision-support tool during the cluster composition task, helping, for example, in the allocation of modules to the different teams in a distributed development, aiming at minimizing integration testing costs.

Regarding the second question, the usefulness of the strategy MC is supported by the difficulty of manually obtaining solutions with a satisfactory trade-off among the objectives. The application of MC provides the tester with a set of solutions, allowing him/her to prioritize some coupling measures, reducing testing efforts and costs.

MOCAITO adopts only coupling measures; although such measures are the most used in the literature, we are aware that other factors can impact the integration testing cost. This limitation could be eliminated by using other measures during the optimization process, which should be evaluated in future experiments. Another future work we intend to perform is to conduct experiments involving systems with a greater number of clusters and dependencies, as well as using other algorithms. In further experiments, we also intend to use a systematic way to group the modules of a system, such as a clustering algorithm. As MOCAITO is a generic approach, it is also possible to explore other development contexts and kinds of restrictions besides modularization.

ACKNOWLEDGMENTS

We would like to thank CNPq for financial support.
TABLE VII S OME SOLUTIONS OF PAES FOR THE SYSTEM JH OT D RAW Solution Cost Order {87, 9, 196, 187, 67, 185}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104, 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {84, 85, 58, 0, 79, 66, 98, 7, 111}, {96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 179, 83, 122, 16, 129, 182, 27, 45, 176, 159, 5, 31, 52, 156, 165, 166, 32, 4, 150, 192, 54, (283,292,102,206) 110, 137, 1, 8, 151, 113, 65, 95, 135, 132, 130}, {128, 181, 147, 125, 169, 124, 61}, {138, 108, 15, 33, 29, 154, 153}, {56, 6, 139, 42, 39, 41, 44, 40, 172, 43}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148}, {10, 90, 195, 78, 88, 183, 81}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 112, 168, 142, 127, 149, 92, 143, 193, 162, 80}, {131, 13, 120, 194, 141} {87, 9, 196, 187, 67, 185}, {84, 85, 0, 58, 66, 98, 7, 111, 79}, {138, 108, 15, 33, 29, 154, 153}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104, 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 16, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 132, 110, 179, 83, 122, 32, 129, 182, 27, 45, 176, (322,258,103,192) 159, 5, 31, 52, 156, 165, 166, 135, 4, 150, 192,54, 137, 1, 8, 151, 113, 130, 65, 95}, {56, 6, 139, 42, 39, 41, 44, 40, 172, 43}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 112, 168, 142, 127, 149, 92, 143, 193, 162, 80}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148}, {128, 181, 147, 125, 169, 124, 61}, {131, 13, 120, 194, 141}, {10, 90, 195, 78, 88, 183, 81} {96, 180, 123, 114, 101, 53, 12, 75, 36, 140, 107, 144, 145, 47, 62, 11, 34, 77, 97, 146, 57, 3, 73, 103, 152, 28, 158, 2, 179, 83, 122, 16, 129, 182, 27, 45, 176, 159, 5, 31, 52, 156, 165, 166, 32, 4, 150, 192, 54, 110, 137, 1, 8, 151, 113, 65, 95, 135, 132, 130}, {138, 108, 15, 33, 29, 154, 153}, {87, 9, 196, 187, 67, 185}, {84, 85, 58, 0, 79, 66, 98, 7, 111}, {128, 181, 147, 125, 169, 124, 61}, {68, 86, 173, 105, 118, 48, 24, 170, 93, 99, 30, 104, (2918,326,92,201) 102, 91, 116, 23, 82, 190, 186, 46, 134, 121, 49, 89, 164, 188, 25, 115, 117, 189, 100, 50, 157, 69, 155, 26, 38, 191}, {56, 6, 139, 42, 39, 41, 44, 40, 172, 43}, {184, 160, 76, 119, 18, 70, 178, 55, 35, 71, 64, 59, 175, 74, 126, 167, 72, 177, 174}, {136, 94, 60, 171}, {133, 51, 14, 109, 21, 148}, {10, 90, 195, 78, 88, 183, 81}, {17, 63, 19, 20, 22, 161, 37, 106, 163, 143, 168, 142, 127, 149, 92, 112, 193, 162, 80}, {131, 13, 120, 194, 141} {138, 108, 15, 29, 154, 33, 153}, {56, 6, 42, 39, 139, 43, 40, 172, 44, 41}, {96, 103, 114, 123, 180, 101, 165, 12, 97, 47, 34, 36, 146, 140, 107, 77, 32, 45, 53, 75, 57, 145, 83, 11, 16, 156, 3, 95, 73, 152, 158, 192, 4, 28, 113, 144, 166, 110, 137, 27, 5, 159, 52, 62, 54, 2, 182, 179, 122, 31, 129, 150, 135, 132, 130, 1, 8, 65, 151, 176}, {187, 87, 9, 67, 196, 185}, {84, 85, 0, 58, 66, 7, 98, 79, 111}, {128, 181, 124, 147, 169, 61, 125}, {17, 19, 37, (3423,313,103,140) 168, 127, 161, 20, 163, 106, 22, 63, 112, 142, 143, 149, 193, 92, 80, 162}, {10, 88, 195, 78, 183, 81, 90}, {136, 171, 94, 60}, {133, 51, 14, 21, 148, 109}, {134, 157, 24, 25, 102, 191, 89, 26, 115, 173, 46, 104, 49, 30, 91, 100, 170, 82, 116, 
164, 105, 121, 68, 93, 38, 99, 190, 117, 50, 48, 69, 86, 189, 155, 118, 186, 188, 23}, {131, 141, 13, 194, 120}, {184, 119, 76, 160, 18, 35, 71, 178, 70, 55, 64, 59, 74, 175, 167, 177, 174, 72, 126} {138, 29, 108, 15, 33, 154, 153}, {187, 87, 9, 196, 185, 67}, {24, 116, 93, 82, 25, 190, 49, 91, 30, 99, 170, 104, 105, 26, 115, 173, 191, 164, 121, 86, 50, 189, 46, 69, 186, 134, 38, 48, 155, 102, 100, 188, 117, 89, 23, 118, 157, 68}, {101, 96, 114, 12, 53, 180, 123, 62, 182, 16, 156, 140, 107, 103, 145, 45, 75, 144, 34, 36, 146, 11, 97, 3, 152, 158, 95, 73, 2, 122, 179, 176, 47, 28, 27, 31, 5, 165, 54, 77, 4, 57, 159, 113, 150, 52, 129, 166, 192, (301,274,105,197) 83, 110, 32, 137, 135, 1, 8, 151, 65, 132, 130}, {128, 147, 124, 125, 169, 181, 61}, {131, 13, 141, 120, 194}, {58, 84, 66, 0, 7, 98, 111, 85, 79}, {10, 81, 78, 88, 195, 90, 183}, {133, 51, 14, 148, 109, 21}, {56, 6, 139, 42, 40, 39, 44, 43, 172, 41}, {136, 60, 94, 171}, {184, 76, 160, 119, 18, 64, 55, 74, 59, 35, 70, 126, 178, 175, 177, 71, 174, 167, 72}, {17, 161, 163, 20, 63, 142, 168, 19, 149, 106, 112, 193, 22, 143, 37, 127, 92, 162, 80} R EFERENCES [1] Z. Wang, B. Li, L. Wang, and Q. Li, “A brief survey on automatic integration test order generation,” in Software Engineering and Knowledge Engineering Conference (SEKE), 2011, pp. 254–257. [2] W. K. G. Assunção, T. E. Colanzi, A. T. R. Pozo, and S. R. Vergilio, “Establishing integration test orders of classes with several coupling measures,” in 13th Genetic and Evolutionary Computation Conference (GECCO), 2011, pp. 1867–1874. [3] W. K. G. Assunção, T. E. Colanzi, S. R. Vergilio, and A. T. R. Pozo, “A multi-objective optimization approach for the integration and test order problem,” Information Sciences, 2012, submitted. [4] T. E. Colanzi, W. K. G. Assunção, A. T. R. Pozo, and S. R. Vergilio, “Integration testing of classes and aspects with a multi-evolutioanry and coupling-based approach,” in 3th International Symposium on Search Based Software Engineering (SSBSE). Springer Verlag, 2011, pp. 188– 203. [5] S. Vergilio, A. Pozo, J. Árias, R. Cabral, and T. Nobre, “Multiobjective optimization algorithms applied to the class integration and test order problem,” International Journal on Software Tools for Technology Transfer, vol. 14, no. 4, pp. 461–475, 2012. [6] T. E. Colanzi, W. K. G. Assunção, S. R. Vergilio, and A. T. R. Pozo, “Generating integration test orders for aspect oriented software with multi-objective algorithms,” in Proceedings of the Latin-American Workshop on Aspect Oriented Software (LA-WASP), 2011. [7] W. Assunção, T. Colanzi, S. Vergilio, and A. Pozo, “Evaluating different strategies for integration testing of aspect-oriented programs,” in Proceedings of the Latin-American Workshop on Aspect Oriented Software (LA-WASP), 2012. [8] R. Ré and P. C. Masiero, “Integration testing of aspect-oriented programs: a characterization study to evaluate how to minimize the number of stubs,” in Brazilian Symposium on Software Engineering (SBES), 2007, pp. 411–426. [9] E. Carmel and R. Agarwal, “Tactical approaches for alleviating distance in global software development,” Software, IEEE, vol. 18, no. 2, pp. 22 –29, mar/apr 2001. [10] J. Noll, S. Beecham, and I. Richardson, “Global software development and collaboration: barriers and solutions,” ACM Inroads, vol. 1, no. 3, pp. 66–78, Sep. 2011. [11] D. C. Kung, J. Gao, P. Hsia, J. Lin, and Y. Toyoshima, “Class firewall, test order and regression testing of object-oriented programs,” Journal of Object-Oriented Program, vol. 8, no. 
2, pp. 51–65, 1995. [12] R. Ré, O. A. L. Lemos, and P. C. Masiero, “Minimizing stub creation during integration test of aspect-oriented programs,” in 3rd Workshop on Testing Aspect-Oriented Programs (WTAOP), Vancouver, British Columbia, Canada, 2007, pp. 1–6. [13] L. C. Briand, Y. Labiche, and Y. Wang, “An investigation of graph-based class integration test order strategies,” IEEE Transactions on Software Engineering, vol. 29, no. 7, pp. 594–607, 2003. [14] Y. L. Traon, T. Jéron, J.-M. Jézéquel, and P. Morel, “Efficient object-oriented integration and regression testing,” IEEE Transactions on Reliability, pp. 12–25, 2000. [15] A. Abdurazik and J. Offutt, “Coupling-based class integration and test order,” in International Workshop on Automation of Software Test (AST). Shanghai, China: ACM, 2006. [16] L. C. Briand, J. Feng, and Y. Labiche, “Using genetic algorithms and coupling measures to devise optimal integration test orders,” in Software Engineering and Knowledge Engineering Conference (SEKE), 2002. [17] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002. [18] E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the Strength Pareto Evolutionary Algorithm,” Swiss Federal Institute of Technology (ETH) Zurich, Gloriastrasse 35, CH-8092 Zurich, Switzerland, Tech. Rep. 103, 2001. [19] J. D. Knowles and D. W. Corne, “Approximating the nondominated front using the Pareto archived evolution strategy,” Evolutionary Computation, vol. 8, pp. 149–172, 2000. [20] C. A. C. Coello, G. B. Lamont, and D. A. van Veldhuizen, Evolutionary Algorithms for Solving Multi-Objective Problems (Genetic and Evolutionary Computation). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006. [21] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering: An Introduction. Kluwer Academic Publishers, Norwell, MA, USA, 2000. [22] R. S. Pressman, Software Engineering: A Practitioner’s Approach. NY: McGraw Hill, 2001. [23] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. da Fonseca, “Performance assessment of multiobjective optimizers: An analysis and review,” IEEE Transactions on Evolutionary Computation, vol. 7, pp. 117–132, 2003. [24] J. L. Cochrane and M. Zeleny, Multiple Criteria Decision Making. University of South Carolina Press, Columbia, 1973.

Functional Validation Driven by Automated Tests
Validação Funcional Dirigida por Testes Automatizados

Thiago Delgado Pinto
Departamento de Informática
Centro Federal de Educação Tecnológica, CEFET/RJ
Nova Friburgo, Brasil
[email protected]

Arndt von Staa
Departamento de Informática
Pontifícia Universidade Católica, PUC-Rio
Rio de Janeiro, Brasil
[email protected]

Resumo—The functional quality of a software system can be evaluated by how well it meets its functional requirements. These requirements are often described by means of use cases and verified by functional tests that check their correspondence with the functionality observed through the user interface. However, creating, maintaining and executing these tests is laborious and expensive, emphasizing the need for tools that support and perform this form of quality control. In this context, this paper presents a fully automated approach for the generation, execution and analysis of functional tests from the textual description of use cases.
A ferramenta construída para comprovar sua viabilidade, chamada de FunTester, é capaz de gerar casos de teste valorados junto com os correspondentes oráculos, transformá-los em código-fonte, executá-los, coletar os resultados e analisar se o software está de acordo com os requisitos funcionais definidos. Avaliações preliminares demonstraram que a ferramenta é capaz de detectar eficazmente desvios de implementação e descobrir defeitos no software sob teste.
Abstract – The functional quality of a software system can be evaluated by how well it conforms to its functional requirements. These requirements are often described as use cases and verified by functional tests that check whether the system under test (SUT) behaves as specified. Creating, maintaining, and executing such tests is laborious and expensive, which stresses the need for tools that support and automate this form of quality control. This paper presents a fully automated approach for generating, executing, and analyzing functional tests from the textual description of use cases. A tool called FunTester was built to demonstrate its feasibility: it generates valued test cases with their corresponding oracles, transforms them into source code, runs them, collects the results, and checks whether the software conforms to the defined functional requirements. Preliminary evaluations showed that the tool can effectively detect implementation deviations and uncover defects in the SUT.
Keywords – functional validation; automated functional tests; use cases; business rules; test data generation; test oracle generation; test case generation and execution
I. INTRODUÇÃO
A fase de testes é sabidamente uma das mais caras da construção de um software, correspondendo a 35 a 50% de seu custo total quando feito da forma tradicional [1] e de 15 a 25% quando desenvolvido com uso de técnicas formais leves [2]. Quando feita de forma manual, a atividade de teste se torna ineficiente e tediosa [3], usualmente apoiada em práticas ad hoc e dependente da habilidade de seus criadores. Assim, torna-se valioso o uso de ferramentas que possam automatizar esta atividade, diminuindo os custos envolvidos e aumentando as chances de se entregar um software com menor quantidade de defeitos remanescentes.
Em geral, é entendido que um software de qualidade atende exatamente aos requisitos definidos em sua especificação [4]. Para verificar este atendimento, geralmente são realizados testes funcionais que observam a interface (gráfica) do software visando determinar se este realmente executa tal como especificado. Evidentemente supõe-se que os requisitos estejam de acordo com as necessidades e expectativas dos usuários. Como isso nem sempre é verdade, torna-se necessária a possibilidade de redefinir a baixo custo os testes a serem realizados.
Para simular a execução destes testes, é possível imitar a operação de um usuário sobre a interface, entrando com ações e dados, e verificar se o software se comporta da maneira especificada. Esta simulação pode ser realizada via código, com a combinação de arcabouços de teste unitário e arcabouços de teste de interface com o usuário. Entretanto, para gerar o código de teste de forma automática, é preciso que a especificação do software seja descrita com mais formalidade e de maneira estruturada ou, pelo menos, semiestruturada. Como casos de uso são largamente utilizados para documentar requisitos de um software, torna-se interessante adaptar sua descrição textual para este fim.
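Para ilustrar a ideia de combinar um arcabouço de teste unitário com um arcabouço de teste de interface para simular a operação de um usuário, segue um esboço mínimo em Java com TestNG e FEST-Swing. Trata-se apenas de uma ilustração hipotética: a janela LoginFrame e os nomes dos widgets (usuario, senha, entrar, msg) não fazem parte deste artigo.

import org.fest.swing.fixture.FrameFixture;
import org.testng.annotations.AfterMethod;
import org.testng.annotations.BeforeMethod;
import org.testng.annotations.Test;

public class LoginUITest {

    private FrameFixture janela; // fixture do FEST que simula o usuário sobre a janela Swing

    @BeforeMethod
    public void abrirJanela() {
        // LoginFrame é uma classe hipotética do software sob teste (SST)
        janela = new FrameFixture(new LoginFrame());
        janela.show();
    }

    @Test
    public void deveRejeitarSenhaVazia() {
        janela.textBox("usuario").enterText("maria"); // passo: ator preenche o widget "usuario"
        janela.button("entrar").click();              // passo: ator aciona o widget "entrar"
        // oráculo: sistema exibe a mensagem esperada pela regra de negócio
        janela.label("msg").requireText("Informe a senha.");
    }

    @AfterMethod
    public void fecharJanela() {
        janela.cleanUp();
    }
}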
A descrição textual de casos de uso, num estilo similar ao usado por Cockburn [5], pode ser descrita numa linguagem restrita e semiestruturada, como o adotado por Días, Losavio, Matteo e Pastor [6] para a língua espanhola. Esta formalidade reduz o número de variações na interpretação da descrição, facilitando a sua transformação em testes. Trabalhos como [7, 8, 9, 10, 11, 12] construíram soluções para apoiar processos automatizados ou semiautomatizados para a geração dos casos de teste. Entretanto, alguns aspectos importantes não foram abordados, deixando de lado, por exemplo, a geração dos valores utilizados nos testes, a geração dos oráculos e a combinação de cenários entre múltiplos casos de uso, que são essenciais para sua efetiva aplicação prática.
No processo de geração de casos de teste automatizados criam-se primeiro os casos de teste abstratos. Estes determinam as condições que cada caso de teste deve satisfazer (por exemplo, os caminhos a serem percorridos). A partir deles determinam-se os casos de teste semânticos, isto é, casos de teste independentes de arcabouço de testes. Estes levam em conta as condições que os dados de entrada devem satisfazer de modo que os testes abstratos sejam realizados. A seguir selecionam-se os valores dos dados de entrada, gerando os casos de teste valorados. Aplicando a especificação aos casos de teste valorados determinam-se os oráculos, obtendo-se assim os casos de teste úteis. Estes, finalmente, são traduzidos para scripts ou código a ser usado por ferramentas ou arcabouços de teste automatizado.
Este artigo descreve um processo totalmente automatizado que trata muitos dos problemas não resolvidos por trabalhos anteriores (como a geração automática de dados de teste, oráculos e cenários que combinam mais de um caso de uso) e introduz novas abordagens para aumentar sua aplicação prática e realizar uma validação funcional de alta eficácia. As próximas seções são organizadas da seguinte forma: a Seção II apresenta trabalhos correlatos; a Seção III detalha o novo processo definido; a Seção IV expõe brevemente a arquitetura da solução; a Seção V retrata uma avaliação preliminar da ferramenta; por fim, a Seção VI apresenta as conclusões do trabalho.
II. TRABALHOS CORRELATOS
Esta seção realiza uma avaliação de alguns trabalhos correlatos, com foco na descrição textual de casos de uso como principal fonte para a geração dos testes. A Tabela I apresenta um panorama sobre os trabalhos que construíram ferramentas para este propósito, incluindo a ferramenta que materializa a abordagem discutida neste artigo, chamada de FunTester (acrônimo para Funcional Tester). Nela, é possível observar que FunTester apresenta uma solução mais completa, implementando avanços que permitem sua aplicação prática em projetos reais.

TABELA I. PANORAMA SOBRE AS FERRAMENTAS

#  | Questão                                                                              | [9] | [11] | [12] | [7] | [10] | FunTester
1  | Usa somente casos de uso como fonte para os testes?                                  | sim | sim  | sim  | sim | sim  | sim¹
2  | Qual a forma de documentação dos casos de uso?                                       | PRS | IRS  | IRS  | IRS | UCML | VRS
3  | Controla a declaração de casos de uso?                                               | sim | não  | não  | não | não  | sim
4  | Dispensa a declaração de fluxos alternativos que avaliam regras de negócio?          | não | não  | não  | não | não  | sim
5  | Gera cenários automaticamente?                                                       | sim | sim  | sim  | sim | sim  | sim
6  | Há um cenário cobrindo cada fluxo?                                                   | sim | sim  | sim  | sim | sim  | sim
7  | Há cenários que verifiquem regras de negócio para um mesmo fluxo?                    | não | sim  | não  | não | não  | sim
8  | Há cenários que combinam fluxos?                                                     | não | sim  | sim  | sim | sim  | sim
9  | Há cenários que incluem mais de um caso de uso?                                      | não | não  | não  | não | não  | sim
10 | Há métricas para cobertura dos cenários?                                             | não | não  | não  | sim | sim  | sim
11 | Gera casos de teste semânticos?                                                      | não | sim  | não  | não | sim  | sim
12 | Gera valores para os casos de teste automaticamente?                                 | não | não  | não  | não | não  | sim
13 | Gera oráculos automaticamente?                                                       | não | não  | não  | não | não  | sim
14 | Casos de teste são gerados para um formato independente de linguagem ou framework?   | não | sim  | não  | não | sim  | sim
15 | Gera código de teste?                                                                | sim | não  | sim  | sim | não  | sim
16 | Os resultados da execução do código gerado são rastreados?                           | sim | N/A  | não  | não | N/A  | sim

¹ As regras de negócio, descritas adicionalmente em relação às outras soluções, ainda pertencem aos casos de uso.
N/A=Não se aplica; PRS=Português Restrito Semiestruturado; IRS=Inglês Restrito Semiestruturado; UCML=Use Case Markup Language; VRS=Vocabulário Restrito Semiestruturado independente de idioma.

III. PROCESSO
A Figura 1 apresenta o processo realizado na abordagem apresentada e seguido pela ferramenta construída.
Fig. 1. Processo seguido pela ferramenta
Neste processo, o usuário participa apenas das etapas de descrição dos casos de uso e de suas regras de negócio, sendo as demais totalmente automatizadas. As etapas 1, 2, 3, 4, 5 e 9 são realizadas pela ferramenta em si, enquanto as etapas 6, 7 e 8 são realizadas por extensões da ferramenta, para a linguagem e arcabouço de testes alvo. A seguir, será realizada uma descrição de cada uma delas.

A. Descrição textual de casos de uso (Etapa 1)
Nesta etapa, o usuário realiza a especificação do software através de casos de uso, auxiliado pela ferramenta. A descrição textual segue um modelo similar ao de Cockburn [5]. A ferramenta usa somente alguns dos campos desta descrição para a geração de testes. As pré-condições e pós-condições são usadas para estabelecer as dependências entre casos de uso, numa espécie de máquina de estados. Dos fluxos (disparador, principal e alternativos) são obtidos os cenários de execução. De seus passos são extraídas as ações executadas pelo ator e pelo sistema, que, junto a outras informações do caso de uso (como a indicação se ele pode ser disparado somente através de outro caso de uso), são usadas para a geração dos testes úteis, na etapa 5.
A ferramenta permite definir um vocabulário composto pelos termos esperados por sua extensão para a transformação dos testes úteis em código-fonte e pelos termos correspondentes, usados na descrição textual dos casos de uso. Isto permite tanto documentar o software usando palavras ou até idiomas diferentes do vocabulário usado para geração dos testes quanto adaptar esse último para diferentes arcabouços de teste.
A sintaxe de um passo, numa gramática livre de contexto (GLC) similar à Backus-Naur Form (BNF), é descrita a seguir:

<passo>         ::= <disparador> <ação> <alvo>+
                  | <disparador> <documentação>
<disparador>    ::= "ator" | "sistema"
<alvo>          ::= <elemento> | <caso-de-uso>
<elemento>      ::= <widget> | <URL> | <comando>
                  | <tecla> | <tempo>
<ação>          ::= string
<documentação>  ::= string
<caso-de-uso>   ::= string
<widget>        ::= string
<URL>           ::= string
<comando>       ::= string
<tecla>         ::= string
<tempo>         ::= integer

O ator ou o sistema dispara uma ação sobre um ou mais alvos ou sobre uma documentação. Cada alvo pode ser um elemento ou um caso de uso. Um elemento pode ser um widget, uma URL, um comando, uma tecla ou um tempo (em milissegundos, que é geralmente usado para aguardar um processamento). O tipo de alvo e o número de alvos possíveis para uma ação podem variar conforme a configuração do vocabulário usado. O tipo de elemento também pode variar conforme a ação escolhida.
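A título de ilustração, segue um esboço (hipotético, fora da ferramenta) de como a estrutura de um passo definida pela GLC acima poderia ser representada em Java:

import java.util.List;

// Esboço hipotético: representação em Java da estrutura de um passo da GLC acima.
enum Disparador { ATOR, SISTEMA }
enum TipoDeElemento { WIDGET, URL, COMANDO, TECLA, TEMPO }

// Um alvo é um elemento (com tipo e valor) ou uma referência a outro caso de uso.
record Alvo(TipoDeElemento tipoDeElemento, String valor, String casoDeUso) { }

// <passo> ::= <disparador> <ação> <alvo>+ | <disparador> <documentação>
record Passo(Disparador disparador, String acao, List<Alvo> alvos, String documentacao) {
    // Exemplo: "ator preenche o widget nomeDoCliente"
    static Passo exemplo() {
        return new Passo(Disparador.ATOR, "preenche",
                List.of(new Alvo(TipoDeElemento.WIDGET, "nomeDoCliente", null)), null);
    }
}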
B. Detalhamento das regras de negócio (Etapa 2)
Esta etapa é um dos importantes diferenciais da ferramenta e permite que o usuário detalhe os elementos descritos ao preencher os passos dos fluxos. Este detalhamento possibilita saber o que representa cada elemento, extrair as informações necessárias para sua conversão em widgets, inferir seus possíveis valores e formatos e extrair as informações necessárias para a geração de oráculos.
A introdução das regras de negócio também permite reduzir o número de fluxos alternativos necessários para tratar erros de uso, uma vez que a ferramenta gerará automaticamente casos de teste para isto. Isto permite reduzir consideravelmente o número de caminhos no uso do software (introduzidos pela combinação dos fluxos), diminuindo o número de cenários, casos de teste e, consequentemente, o tempo de execução dos testes.
A sintaxe definida para as regras de negócio permite determinar regras tanto para dados contínuos quanto para dados discretos. O detalhamento de um elemento e de suas regras é exposto a seguir (em GLC):

<elemento>      ::= <nome> <tipo> <nome-interno>
                  | <nome> <tipo> <nome-interno> <regra>+
<tipo>          ::= "widget" | "url" | "comando"
                  | "teclas" | "tempo"
<nome>          ::= string
<nome-interno>  ::= string
<regra>         ::= <tipo-dado> <espec-valor>+
<tipo-dado>     ::= "string" | "integer" | "double"
                  | "date" | "time" | "datetime"
<espec-valor>   ::= <tipo-espec> <mensagem>
<tipo-espec>    ::= "valor-min" <ref-valor>
                  | "valor-max" <ref-valor>
                  | "comprimento-min" <ref-valor>
                  | "comprimento-max" <ref-valor>
                  | "formato" <ref-valor>
                  | "igual-a" <ref-valor>+
                  | "diferente-de" <ref-valor>+
<mensagem>      ::= string
<ref-valor>     ::= <valor>+ | <elemento>

Um elemento – que seria equivalente ao widget da descrição do passo – possui um nome (que é o exibido para o usuário na documentação), um tipo (ex.: uma janela, um botão, uma caixa de texto, etc.) e um nome interno (que é usado internamente para identificar o widget no SST). Se for definido como editável (se recebe entradas de dados do usuário), pode conter uma ou mais regras de negócio. Cada regra define o tipo de dado admitido pelo elemento e uma ou mais especificações de valor, que permitem definir valores limítrofes, formatos, lista de valores admissíveis ou não admissíveis, além (opcionalmente) da mensagem esperada do SST caso alguma destas definições seja violada (ex.: valor acima do limítrofe). Cada especificação de valor pode ser proveniente de definição manual, de definição a partir de outro elemento ou a partir de consulta parametrizável a um banco de dados (obtendo seus parâmetros de outra definição, se preciso), fornecendo flexibilidade para a construção das regras.
A definição de regras com valores obtidos através de consulta a banco de dados permite utilizar dados criados com o propósito de teste. Estes dados podem conter valores idênticos aos esperados pelo sistema, permitindo simular condições de uso real, o que é desejável em ferramentas de teste.

C. Geração de cenários para cada caso de uso (Etapa 3)
Nesta etapa, a ferramenta combina os fluxos de cada caso de uso, gerando cenários. Cada cenário parte do fluxo principal, possivelmente passando por fluxos alternativos, retornando ao fluxo principal ou caindo em recursão (repetindo a passagem por um ou mais fluxos).
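Como esboço hipotético (não é o código da ferramenta), a combinação de fluxos poderia ser feita por uma busca em profundidade com limite de repetições por fluxo — limite que, como discutido a seguir, é parametrizável:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Esboço hipotético: enumera cenários (sequências de fluxos) a partir do fluxo principal,
// limitando quantas vezes cada fluxo pode se repetir (recursão).
class GeradorDeCenarios {
    private final Map<String, List<String>> desviosPorFluxo; // fluxo -> fluxos alternativos alcançáveis
    private final int maxRepeticoesPorFluxo;

    GeradorDeCenarios(Map<String, List<String>> desviosPorFluxo, int maxRepeticoesPorFluxo) {
        this.desviosPorFluxo = desviosPorFluxo;
        this.maxRepeticoesPorFluxo = maxRepeticoesPorFluxo;
    }

    List<List<String>> gerar(String fluxoPrincipal) {
        List<List<String>> cenarios = new ArrayList<>();
        explorar(fluxoPrincipal, new ArrayList<>(), new HashMap<>(), cenarios);
        return cenarios;
    }

    private void explorar(String fluxo, List<String> caminho, Map<String, Integer> usos,
                          List<List<String>> cenarios) {
        if (usos.getOrDefault(fluxo, 0) >= maxRepeticoesPorFluxo) return; // corta a recursão
        usos.merge(fluxo, 1, Integer::sum);
        caminho.add(fluxo);
        cenarios.add(new ArrayList<>(caminho)); // cada caminho parcial é um cenário candidato
        for (String desvio : desviosPorFluxo.getOrDefault(fluxo, List.of())) {
            explorar(desvio, caminho, usos, cenarios);
        }
        caminho.remove(caminho.size() - 1);
        usos.merge(fluxo, -1, Integer::sum);
    }
}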
Como os casos de recursão potencializam o número de combinações entre fluxos, o número de recursões deve ser mantido baixo, para não inviabilizar a aplicação prática da geração de cenários. Para isso, a ferramenta permite parametrizar o número máximo de recursões, limitando a quantidade de cenários gerados.
A geração de cenários realizada cobre a passagem por todos os fluxos do caso de uso, bem como a combinação entre todos eles, pelo menos uma vez. Em cada fluxo, todos os passos são cobertos. Isto garante observar defeitos relacionados à passagem por determinados fluxos ou ocasionados por sua combinação. Segundo o critério de cálculo de cobertura de caso de uso adotado por Hassan e Yousif [13], que divide o total de passos cobertos pelo teste pelo total de passos do caso de uso, a cobertura atingida é de 100%.

D. Combinação de cenários entre casos de uso (Etapa 4)
Esta etapa realiza a combinação entre cenários, levando em conta os estados definidos nas pré-condições e pós-condições, bem como as chamadas a casos de uso, que podem ocorrer em passos de certos fluxos. Quando uma pré-condição referencia uma pós-condição gerada por outro caso de uso, é estabelecida uma relação de dependência de estado. Logo, os cenários do caso de uso do qual se depende devem ser gerados primeiro, para então realizar a combinação. O mesmo ocorre quando existe uma chamada para outro caso de uso. Para representar a rede de dependências entre os casos de uso do SST e obter a ordem correta para a geração dos cenários, é gerado um grafo acíclico dirigido dos casos de uso e então aplicada uma ordenação topológica [14].
Antes de combinar dois cenários, entretanto, é preciso verificar se os fluxos de um geram os estados esperados pelo outro. Caso não gerem, os cenários não são combinados, uma vez que a combinação poderá gerar um novo cenário incompleto ou incorreto, podendo impedir que a execução alcance o caso de uso alvo do teste. Assim, na versão atual da ferramenta, para combinar dois casos de uso A e B, onde A depende de B, são selecionados de B somente os cenários que terminem com sucesso, isto é, que não impeçam a execução do caso de uso A conforme previsto.
Para garantir a correta combinação dos cenários de um caso de uso, sem que se gerem cenários incompletos ou incorretos, realiza-se primeiro a combinação com os cenários de casos de uso chamados em passos; depois com os cenários de fluxos disparadores do próprio caso de uso; e só então com cenários de casos de uso de pré-condições.
Dados dois casos de uso quaisquer A e B do conjunto de casos de uso do software, sejam N_A o número de cenários do caso de uso A, N_B o número de cenários do caso de uso B e S_B o número de cenários de sucesso de B. Se A depende de B, então a cobertura dos cenários de A, cob(A), pode ser calculada como:

cob(A) = (N_A × S_B) / (N_A × N_B) = S_B / N_B

Se, por exemplo, um caso de uso B tiver 5 cenários, sendo 3 de sucesso, e outro caso de uso A, que depende de B, tiver 8 cenários, a cobertura total seria de 40 combinações, enquanto a cobertura alcançada seria de 24 combinações, ou 60% do total. Apesar de esta cobertura não ser total (100%), acredita-se que ela seja eficaz para testes de casos de uso, uma vez que a geração de cenários incorretos ou incompletos pode impedir o teste do caso de uso alvo.
É importante notar que a combinação entre cenários é multiplicativa, ou seja, a cada caso de uso adicionado, seu conjunto de cenários é multiplicado pelos atuais. Foram consideradas algumas soluções para este problema, cuja implementação está entre os projetos futuros, discutidos na Seção VI.
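A ordenação topológica citada [14] pode ser obtida, por exemplo, com o algoritmo de Kahn. O esboço a seguir é apenas ilustrativo (nomes hipotéticos, não é o código da ferramenta):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Esboço: ordenação topológica (Kahn) — casos de uso sem dependências vêm primeiro.
class OrdenadorDeCasosDeUso {

    // dependencias.get(a) = casos de uso dos quais "a" depende (pré-condições e chamadas)
    static List<String> ordenar(Map<String, List<String>> dependencias) {
        Map<String, Integer> grauEntrada = new HashMap<>();
        Map<String, List<String>> dependentes = new HashMap<>();
        for (var e : dependencias.entrySet()) {
            grauEntrada.putIfAbsent(e.getKey(), 0);
            for (String dep : e.getValue()) {
                grauEntrada.merge(e.getKey(), 1, Integer::sum);
                grauEntrada.putIfAbsent(dep, 0);
                dependentes.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> prontos = new ArrayDeque<>();
        grauEntrada.forEach((uc, grau) -> { if (grau == 0) prontos.add(uc); });
        List<String> ordem = new ArrayList<>();
        while (!prontos.isEmpty()) {
            String uc = prontos.remove();
            ordem.add(uc); // os cenários deste caso de uso já podem ser gerados
            for (String dependente : dependentes.getOrDefault(uc, List.of())) {
                if (grauEntrada.merge(dependente, -1, Integer::sum) == 0) prontos.add(dependente);
            }
        }
        return ordem; // se faltar algum caso de uso, há ciclo de dependências
    }
}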
E. Geração de casos de teste úteis (Etapa 5)
Nesta etapa, a ferramenta gera os casos de teste úteis (formados por comandos, dados valorados e oráculos) utilizando os cenários e as regras de negócio. Estas regras permitem inferir os valores válidos e não válidos para cada elemento e gerá-los automaticamente, de acordo com o tipo de teste. Os oráculos são gerados com uso da definição das mensagens esperadas para quando são fornecidos valores não válidos, de acordo com o tipo de verificação desejada para o teste. Como cada descrição de expectativa de comportamento do sistema é transformada em comandos semânticos e estes (posteriormente) em comandos na linguagem usada pelo arcabouço de testes, quando um comando não é executado corretamente por motivo do SST não corresponder à sua expectativa, o teste automaticamente falhará. Assim, não é necessário haver oráculos que verifiquem a existência de elementos de interface, exceto a exibição de mensagens.
Segundo Meyers [15], o teste de software torna-se mais eficaz se os valores de teste são gerados baseados na análise das "condições de contorno" ou "condições limite". Ao utilizar valores acima e abaixo deste limite, os casos de teste exploram as condições que aumentam a chance de encontrar defeitos. De acordo com o British Standard 7925-1 [16], o teste de software torna-se mais eficaz se os valores são particionados ou divididos de alguma maneira, como, por exemplo, ao meio. Além disto, é interessante a inclusão do valor zero (0), que permite testar casos de divisão por zero, bem como o uso de valores aleatórios, que podem fazer o software atingir condições imprevistas.
Portanto, levando em consideração as regras de negócio definidas, para a geração de valores considerados válidos, independentes do tipo, são adotados os critérios de: (a) valor mínimo; (b) valor imediatamente posterior ao mínimo; (c) valor máximo; (d) valor imediatamente anterior ao máximo; (e) valor intermediário; (f) zero, se dentro da faixa permitida; (g) valor aleatório, dentro da faixa permitida. E para a geração de valores considerados não válidos, os critérios de: (a) valor imediatamente anterior ao mínimo; (b) valor aleatório anterior ao mínimo; (c) valor imediatamente posterior ao máximo; (d) valor aleatório posterior ao máximo; (e) formato de valor incorreto.
A Tabela II exibe os tipos de teste gerados na versão atual da ferramenta, que visam cobrir as regras de negócio definidas. Baseado neles, podemos calcular a quantidade mínima de casos de teste úteis gerados por cenário, QM, como:

QM = 8 + (4 × E) + O + F

onde E é o número de elementos editáveis, O é o número de elementos editáveis obrigatórios e F é o número de elementos editáveis com formato definido. Para cada cenário de um caso de uso simples, por exemplo, com 5 elementos, sendo 3 obrigatórios e 1 com formatação, existirão 8 + (4 × 5) + 3 + 1 = 32 casos de teste.
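Um esboço hipotético (simplificado para valores numéricos; o critério de formato incorreto se aplica a dados formatados) da derivação de valores válidos e não válidos a partir dos limites de uma regra de negócio:

import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

// Esboço hipotético: valores de teste derivados de uma regra com valor-min e valor-max.
class GeradorDeValores {

    private final Random aleatorio = new Random();

    Set<Long> validos(long min, long max) {
        Set<Long> v = new LinkedHashSet<>();
        v.add(min);                          // (a) valor mínimo
        v.add(min + 1);                      // (b) imediatamente posterior ao mínimo
        v.add(max);                          // (c) valor máximo
        v.add(max - 1);                      // (d) imediatamente anterior ao máximo
        v.add(min + (max - min) / 2);        // (e) valor intermediário
        if (min <= 0 && 0 <= max) v.add(0L); // (f) zero, se dentro da faixa
        v.add(min + (long) (aleatorio.nextDouble() * (max - min))); // (g) aleatório na faixa
        return v;
    }

    Set<Long> naoValidos(long min, long max) {
        Set<Long> v = new LinkedHashSet<>();
        v.add(min - 1);                           // (a) imediatamente anterior ao mínimo
        v.add(min - 1 - aleatorio.nextInt(1000)); // (b) aleatório anterior ao mínimo
        v.add(max + 1);                           // (c) imediatamente posterior ao máximo
        v.add(max + 1 + aleatorio.nextInt(1000)); // (d) aleatório posterior ao máximo
        return v;
    }
}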
TABELA II. TIPOS DE TESTE GERADOS

Descrição                                                                                      | Conclui caso de uso | Testes
Somente obrigatórios                                                                           | Sim | 1
Todos os obrigatórios exceto um                                                                | Não | 1 por elemento editável obrigatório
Todos com valor/tamanho mínimo                                                                 | Sim | 1
Todos com valor/tamanho posterior ao mínimo                                                    | Sim | 1
Todos com valor/tamanho máximo                                                                 | Sim | 1
Todos com valor/tamanho anterior ao máximo                                                     | Sim | 1
Todos com o valor intermediário, dentro da faixa                                               | Sim | 1
Todos com zero, ou um valor aleatório dentro da faixa, se zero não for permitido               | Sim | 1
Todos com valores aleatórios dentro da faixa                                                   | Sim | 1
Todos com valores aleatórios dentro da faixa, exceto um, com valor imediatamente anterior ao mínimo   | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor aleatório anterior ao mínimo       | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor imediatamente posterior ao máximo  | Não | 1 por elemento editável
Todos com valores aleatórios dentro da faixa, exceto um, com valor aleatório posterior ao máximo      | Não | 1 por elemento editável
Todos com formato permitido, exceto um                                                         | Não | 1 por elemento com formato definido

Os testes gerados cobrem todas as regras de negócio definidas, explorando seus valores limítrofes e outros. Como o total de testes possíveis para um caso de uso não é conhecido, acredita-se que, com a cobertura acumulada, unindo todas as coberturas descritas até aqui, o SST seja exercitado consideravelmente na busca por defeitos, obtendo-se alta eficácia.
Atualmente os casos de teste úteis são exportados para a JavaScript Object Notation (JSON), por ser compacta, independente de linguagem de programação e fácil de analisar gramaticalmente.

F. Transformação em código-fonte (Etapa 6)
Esta etapa e as duas seguintes utilizam a extensão da ferramenta escolhida pelo usuário, de acordo com o software a ser testado. A extensão da ferramenta lê o arquivo JSON contendo os testes úteis e os transforma em código-fonte, para a linguagem e arcabouços de teste disponibilizados por ela. Atualmente, há uma extensão que os transforma em código Java para os arcabouços TestNG2 e FEST3, visando o teste de aplicações com interface gráfica Swing. Para facilitar o rastreamento de falhas ou erros nos testes, a extensão construída realiza a instrumentação do código-fonte gerado, indicando, com comentários de linha, o passo semântico correspondente à linha de código. Esta instrumentação será utilizada na Etapa 8, para pré-análise dos resultados.

G. Execução do código-fonte (Etapa 7)
Nesta etapa, a extensão da ferramenta executa o código-fonte de testes gerado. Para isto, ela usa linhas de comando configuradas pelo usuário, que podem incluir a chamada a um compilador, ligador (linker), interpretador, navegador web, ou qualquer outra aplicação ou arquivo de script que dispare a execução dos testes. Durante a execução, o arcabouço de testes utilizado gera um arquivo com o log de execução dos testes. Este arquivo será lido e analisado na próxima etapa do processo.

H.
Conversão e pré-análise dos resultados de execução (Etapa 8) Nesta etapa, a extensão da ferramenta lê o log de execução dos testes e analisa os testes que falharam ou obtiveram erro, investigando: (a) a mensagem de exceção gerada, para detectar o tipo de problema ocorrido; (b) o rastro da pilha de execução, para detectar o arquivo e a linha do código-fonte onde a exceção ocorreu e obter a identificação do passo semântico correspondente (definida pela instrumentação realizada na Etapa 6), possibilitando rastrear o passo, fluxo e cenário correspondentes; (c) comparar o resultado esperado pelo teste semântico com o obtido. O log de execução e as informações obtidas da pré-análise são convertidos para um formato independente de arcabouço de testes e exportados para um arquivo JSON, que será lido e analisado pela ferramenta na próxima etapa. I. Análise e apresentação dos resultados (Etapa 9) Por fim, a ferramenta realiza a leitura do arquivo com os resultados da execução dos testes e procura analisá-los para rastrear as causas de cada problema encontrado (se houverem). Nesta etapa, o resultado da execução é confrontado com a especificação do software, visando identificar possíveis problemas. 2 3 http://testng.org http://fest.easytesting.org Fig. 2. Arquitetura da solução IV. ARQUITETURA A Figura 2 apresenta a arquitetura da solução construída, indicando seus componentes e a ordem de interação entre os mesmos, de forma a fornecer uma visão geral sobre como o processo descrito é praticado. É interessante observar que, em geral, o código gerado fará uso de dois arcabouços de teste: um para automatizar a execução dos testes e outro para testar especificamente o tipo de interface (com o usuário) desejada. Esse arcabouço de automação dos testes que irá gerar os resultados da execução dos testes lidos pela extensão da ferramenta. V. AVALIAÇÃO Uma avaliação preliminar da eficácia da ferramenta foi realizada com um software construído por terceiros, coletado da Internet. O software avaliado contém uma especificação de requisitos por descrição textual de casos de uso, realiza acesso a um banco de dados MySQL4 e possui interface Swing. A especificação encontrada no software avaliado estava incompleta, faltando, por exemplo, alguns fluxos alternativos e regras de negócio. Quando isso ocorre, dá-se margem para que a equipe de desenvolvedores do software complete a especificação conforme sua intuição (e criatividade), muitas vezes divergindo da intenção original do projetista, que tenta mapear o problema real. Como isto acabou ocorrendo no software avaliado, optou-se por coletar os detalhes não 4 http://www.mysql.com/ presentes na especificação a partir da implementação do software. Desta forma, são aumentadas as chances da especificação estar próxima da implementação, acusando menos defeitos deste tipo. Para testar a eficácia da ferramenta, foram geradas: (a) uma versão da especificação funcional do SST com divergências; (b) duas versões modificadas do SST, com emprego de mutantes. Para gerar as versões com empregos de mutantes, levou-se em consideração, assim como no trabalho de Gutiérrez et al. [7], o modelo de defeitos de caso de uso introduzido por Binder [17], que define operadores mutantes para casos de uso. O uso desses operadores é considerado mais apropriado para testes funcionais (do que os operadores "clássicos"), uma vez que seu objetivo não é testar a cobertura do código do SST em si, mas gerar mudanças no comportamento do SST que possam ser observadas por testes funcionais. 
Nesse contexto, o termo "mutante" possui uma conotação diferente do mesmo termo aplicado a código. Um mutante da especificação funcional tem o poder de gerar uma variedade de casos de teste que reportarão falhas. Da mesma forma, um mutante de código tem o poder de gerar falhas em uma variedade de casos de teste gerados a partir da especificação funcional.
Na versão da especificação do SST com divergências, foram introduzidas novas regras de negócio, visando verificar se os testes gerados pela ferramenta seriam capazes de identificar as respectivas diferenças em relação ao SST. Na primeira versão modificada do SST, foi usado o operador mutante para casos de uso "substituição de regras de validação ou da admissão de um dado como correto" (SRV). Com ele, operadores condicionais usados no código de validação de dados do SST foram invertidos, de forma a não serem admitidos como corretos. E na segunda versão do SST, foi utilizado o operador mutante de casos de uso "informação incompleta ou incorreta mostrada pelo sistema" (I3). Com ele, as mensagens mostradas pelo sistema quando detectado um dado inválido foram modificadas, de forma a divergirem da esperada pela especificação.
A Tabela III apresenta a quantidade de modificações realizadas nos três principais cenários analisados e a Tabela IV, o número de testes que obtiveram falha, aplicando-se estas modificações. Com o total de 7 modificações na especificação original, mais 9 testes obtiveram falha em relação ao resultado original. Com o emprego de 22 mutações com o primeiro mutante (SRV), mais 51 testes obtiveram falha. E com o emprego de 10 mutações com o segundo mutante (I3), mais 12 testes falharam.

TABELA III. MODIFICAÇÕES NOS TRÊS PRINCIPAIS CENÁRIOS

Cenário   | Modificações na especificação funcional | Mutações com mutante SRV | Mutações com mutante I3
Cenário 1 | 2  | 2  | 4
Cenário 2 | 3  | 14 | 4
Cenário 3 | 2  | 6  | 2
TOTAL     | 7  | 22 | 10

TABELA IV. NÚMERO DE TESTES QUE FALHARAM

Cenário   | SST original | SST frente à especificação funcional com divergências | SST com mutante SRV | SST com mutante I3
Cenário 1 | 4  | 10 | 8   | 10
Cenário 2 | 35 | 40 | 59  | 37
Cenário 3 | 37 | 35 | 60  | 41
TOTAL     | 76 | 85 | 127 | 88

Assim, além de corretamente detectar defeitos na versão original, gerados por entradas de dados não válidos e por diferenças na especificação, os testes gerados pela ferramenta foram capazes de observar as mudanças realizadas, tanto em relação à especificação quanto em relação à implementação do SST.

VI. CONCLUSÕES
O presente artigo apresentou uma nova abordagem para a geração e execução automática de testes funcionais, baseada na especificação de requisitos através da descrição textual de casos de uso e do detalhamento de suas regras de negócio. Os resultados preliminares obtidos com o emprego da ferramenta foram bastante promissores, uma vez que se pôde perceber que ela é capaz de atingir alta eficácia em seus testes, encontrando corretamente diferenças entre a especificação e a implementação do SST. Além disso, seu uso permite estabelecer um meio de aplicar Test-Driven Development no nível de especificação, de forma que o software seja construído, incrementalmente, para passar nos testes gerados pela especificação construída. Ou seja, uma equipe de desenvolvimento pode criar a especificação funcional de uma parte do sistema, gerar os testes funcionais a partir desta especificação, implementar a funcionalidade correspondente e executar os testes gerados para verificar se a funcionalidade corresponde à especificação.
Isto, inclusive, pode motivar a criação de especificações mais completas e corretas, uma vez que compensará fazê-lo. A. Principais contribuições Dentre as principais contribuições da abordagem construída destacam-se: (a) Apresentação de um processo completo e totalmente automatizado; (b) Uso da descrição de regras de negócio na especificação dos casos de uso, permitindo gerar os valores e oráculos dos testes, bem como tornar desnecessário descrever uma parcela significativa dos fluxos alternativos; (c) Uso de fontes de dados externas (ex.: banco de dados) na composição das regras de negócio, permitindo a simulação de condições reais de uso; (d) Geração de cenários que envolvem repetições de fluxos (loops) com número de repetições parametrizável; (e) Geração de cenários que envolvem mais de um caso de uso; (f) Geração de testes semânticos com nomes correlatos ao tipo de verificação a ser realizada, permitindo que o desenvolvedor entenda o que cada teste verifica, facilitando sua manutenção; e (g) Uso de vocabulário configurável, permitindo realizar a descrição textual em qualquer idioma e diferentes arcabouços de teste. B. Principais restrições As principais restrições da abordagem apresentada atualmente são: (a) Não simulação de testes para fluxos que tratam exceções (ex.: falha de comunicação via rede, falha da mídia de armazenamento, etc.), uma vez que exceções tendem a ser de difícil (senão impossível) simulação através de testes funcionais realizados através da interface com o usuário; (b) As regras de negócio atualmente não suportam o uso de expressões que envolvam cálculos, fórmulas matemáticas ou o uso de expressões condicionais (ex.: if-then-else); (c) A abrangência dos tipos de interface gráfica passíveis de teste pela ferramenta é proporcional aos arcabouços de testes de interface utilizados. Assim, é possível que determinados tipos de interface com o usuário, como as criadas para games ou aplicações multimídia, possam não ser testadas por completo, se o arcabouço de testes escolhido não suportá-los, ou não suportar determinadas operações necessárias para seu teste. C. Trabalhos em andamento Atualmente outros tipos de teste estão sendo acrescentados à ferramenta, aumentando seu potencial de verificação. A capacidade de análise automática dos problemas ocorridos, de acordo com os resultados fornecidos pelo arcabouço de testes alvo, também está sendo ampliada. Testes mais rigorosos da ferramenta estão sendo elaborados para verificar a flexibilidade e eficácia da ferramenta em diferentes situações de uso prático. D. Trabalhos futuros A atual geração de cenários combina todos os fluxos do caso de uso pelo menos uma vez, garantindo um bom nível de cobertura para a geração de testes. Isto é um atributo desejável para verificar o SST antes da liberação de uma versão para o usuário final. Durante seu desenvolvimento, entretanto, pode ser interessante diminuir a cobertura realizada, para que o processo de execução de testes ocorra em menor tempo. Para isto, propõe-se o uso de duas técnicas: (a) atribuir um valor de importância para cada fluxo, que servirá como como filtro para a seleção dos cenários desejados para teste; e (b) realizar a indicação da não influência de certos fluxos em outros, evitando gerar cenários que os combinem. Esta última técnica deve ser empregada com cuidado, para evitar a indicação de falsos negativos (fluxos que se pensa não influenciar o estado de outros, mas na verdade influencia). 
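Apenas como esboço hipotético do filtro por importância proposto (ainda não implementado na ferramenta), a seleção de cenários poderia ser feita assim:

import java.util.List;
import java.util.Map;

// Esboço hipotético: seleção de cenários pelo valor de importância de seus fluxos.
class FiltroPorImportancia {

    // Mantém apenas os cenários cujo fluxo menos importante atinge o limiar.
    static List<List<String>> filtrar(List<List<String>> cenarios,
                                      Map<String, Integer> importanciaPorFluxo,
                                      int limiar) {
        return cenarios.stream()
                .filter(c -> c.stream()
                        .mapToInt(f -> importanciaPorFluxo.getOrDefault(f, 0))
                        .min().orElse(0) >= limiar)
                .toList();
    }
}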
Para diminuir a cobertura realizada pela combinação de cenários, com intuito de acelerar a geração dos cenários para fins de testes rápidos, pode-se empregar a técnica de uso do valor de importância, descrita anteriormente, além de uma seleção aleatória e baseada em histórico. Nesta seleção, o combinador de cenários realizaria a combinação entre dois casos de uso escolhendo, pseudoaleatoriamente, um cenário compatível (de acordo com as regras de combinação descritas na Etapa 4) de cada um. Para esta seleção não se repetir, seria guardado um histórico das combinações anteriores. Dessa forma, a cobertura seria atingida gradualmente, alcançando a cobertura completa ao longo do tempo. Para reduzir o número de testes gerados, também se pode atribuir um valor de importância para as regras de negócio, de forma a gerar testes apenas para as mais relevantes. Também se poderia adotar a técnica de seleção gradual aleatória (descrita anteriormente) para as demais regras, a fim de que a cobertura total das regras de negócio fosse atingida ao longo do tempo. Por fim, pretende-se criar versões dos algoritmos construídos (não discutidos neste artigo) que executassem em paralelo, acelerando o processo. Obviamente, também serão criadas extensões para outros arcabouços de teste, como Selenium5 ou JWebUnit6 (que visam aplicações web), aumentando a abrangência do uso da ferramenta. REFERÊNCIAS [1] MILER, Keith W., MORELL, Larry J., NOONAN, Robert E., PARK, Stephen K., NICOL, David M., MURRIL, Branson W., VOAS, Jeffrey M., "Estimating the probability of failure when testing reveals no failures", IEEE Transactions on Software Engineering, n. 18, 1992, pp 33-43. BENDER, Richard, "Proposed software evaluation and test KPA", n. 4, 1996, Disponível em: http://www.uml.org.cn/test/12/softwaretestingmaturitymodel.pdf MAGALHÃES, João Alfredo P. "Recovery oriented software", Tese de Doutorado, PUC-Rio, Rio de Janeiro, 2009. CROSBY, Philip B., "Quality is free", McGraw-Hill, New-York, 1979. COCKBURN, Alistar. "Writing effective use cases", Addison-Wesley, 2000. [2] [3] [4] [5] 5 6 http://seleniumhq.org http://jwebunit.sourceforge.net/ [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] DÍAS, Isabel, LOSAVIO, Francisca, MATTEO, Alfredo, PASTOR, Oscar, "A specification pattern for use cases", Information & Management, n. 41, 2004, pp. 961-975. GUTIÉRREZ, Javier J., ESCALONA, Maria J., MEJÍAS, Manuel, TORRES, Jesús, CENTENO, Arturo H, "A case study for generating test cases from use cases", University of Sevilla, Sevilla, Spain, 2008. CALDEIRA, Luiz Rodolfo N., "Geração semi-automática de massas de testes funcionais a partir da composição de casos de uso e tabelas de decisão", Dissertação de Mestrado, PUC-Rio, Rio de Janeiro, 2010. PESSOA, Marcos B., "Geração e execução automática de scripts de teste para aplicações web a partir de casos de uso direcionados por comportamento", Dissertação de Mestrado, PUC-Rio, Rio de Janeiro, 2011. KASSEL, Neil W., "An approach to automate test case generation from structured use cases", Tese de Doutorado, Clemson University, 2006. JIANG, Mingyue, DING, Zuohua, "Automation of test case generation from textual use cases", Zhejiang Sci-Tech University, Hangzhou, China, 2011. BERTOLINI, Cristiano, MOTA, Alexandre, "A framework for GUI testing based on use case design", Universidade Federal de Pernambuco, Recife, Brazil, 2010. 
HASSAN, Hesham A., YOUSIF, Zahraa E., "Generating test cases for platform independent model by using use case model", International Journal of Engineering Science and Technology, vol. 2, 2010. KAHN, Arthur B., "Topological sorting of large networks", Communications of the ACM 5, 1962, pp. 558–562. MEYERS, Glenford J., "The art of software testing", John Wiely & Sons, New York, 1979. BRITISH STANDARD 7925-1, "Software testing: vocabulary", 1998. BINDER, Robert V., "Testing object-oriented systems: models, patterns and tools", Addison-Wesley, 2000. Visualization, Analysis, and Testing of Java and AspectJ Programs with Multi-Level System Graphs Otávio Augusto Lazzarini Lemos∗ , Felipe Capodifoglio Zanichelli∗ , Robson Rigatto∗ , Fabiano Ferrari† , and Sudipto Ghosh‡ ∗ Science and Technology Department – Federal University of São Paulo at S. J. dos Campos – Brazil {otavio.lemos, felipe.zanichelli, robson.rigatto}@unifesp.br † Computing Department – Federal University of Sao Carlos – Brazil [email protected] ‡ Department of Computer Science, Colorado State University, Fort Collins, CO, USA [email protected] Abstract—Several software development techniques involve the generation of graph-based representations of a program created via static analysis. Some tasks, such as integration testing, require the creation of models that represent several parts of the system, and not just a single component or unit (e.g., unit testing). Besides being a basis for testing and other analysis techniques, an interesting feature of these models is that they can be used for visual navigation and understanding of the software system. However, the generation of such models – henceforth called system graphs – is usually costly, because it involves the reverse engineering and analysis of the whole system, many times done upfront. A possible solution for such a problem is to generate the graph on demand, that is, to postpone detailed analyses to when the user really needs it. The main idea is to start from the package structure of the system, representing dependencies at a high level, and to make control flow and other detailed analysis interactively and on demand. In this paper we use this idea to define a model for the visualization, analysis, and structural testing of objectoriented (OO) and aspect-oriented (AO) programs. The model is called Multi-Level System Graph (MLSG), and is implemented in a prototype tool based on Java and AspectJ named SysGraph4AJ (for Multi-Level System Graphs for AspectJ). To evaluate the applicability of SysGraph4AJ with respect to performance, we performed a study with three AspectJ programs, comparing SysGraph4AJ with a similar tool. Results indicate the feasibility of the approach, and its potential in helping developers better understand and test OO and AO Java programs. In particular, SysGraph4AJ performed around an order of magnitude faster than the other tool. I. I NTRODUCTION Several software engineering tasks require the representation of source code in models suitable for analysis, visualization, and testing [1]. For instance, control flow graphs can be used for structural testing [2], and call graphs can be used for compiler optimization [3]. Some techniques require the generation of graphs that represent the structure of multiple modules or whole systems. For instance, structural integration testing may require the generation of control flow graphs for several units that interact with each other in a program [2, 4]. 
Most testing tools focus only on the representation of local parts of the systems, outside their contexts, or do not even support the visualization of the underlying models. For instance, JaBUTi, a family of tools for testing Java Object Oriented (OO) and Aspect-Oriented (AO) programs, supports the visualization of models of single units [5], pairs of units [4], subsets of units [2], or, at most, a cluster of units of the system [6]. On the other hand, tools like Cobertura [7] and EMMA [8], which support branch and statement coverage analysis of Java programs, do not support visualization of the underlying control flow models. Being able to view these models in a broader context is important to improve understanding of the system as a whole, especially for testers and white-box testing researchers and educators. Nevertheless, the generation of such system models is usually costly because it requires the reverse engineering and analysis of whole systems. Such a task may affect the performance of tools, a feature very much valued by developers nowadays. For instance, recently Eclipse, the leading Java IDE, has been criticized for performance issues [9]. To make the construction of these models less expensive, analyses can be made at different levels of abstraction, on demand, and interactively. This strategy also supports a visual navigation of the system, where parts considered more relevant can be targeted. The visualization itself might also help discovering faults, since it is a form of inspection, but more visually appealing. The strategy of analyzing systems incrementally is also important because studies indicate that the distribution of faults in software systems follow the Pareto principle; that is, a small number of modules tend to present the majority of faults [10]. In this way, it makes more sense to be able to analyze systems in parts (but also within their contexts), providing a focused testing activity and thus saving time. In this paper we apply this strategy for the representation of Java and AspectJ programs. The type of model we propose – called Multi-Level System Graph (MLSG) – supports the visualization, analysis, and structural testing of systems written in those languages. Since researchers commonly argue that Aspect-Oriented Programming (AOP) introduces uncertainty about module interactions [11], we believe the MLSG is particularly useful in this context. The first level of the MLSG shows the system’s package structure. As the user chooses which packages to explore, classes are then analyzed and shown in a second level, and methods, fields, pieces of advice, and pointcuts can be viewed in a third level. From then on, the user can explore low level control-flow and call chain graphs built from specific units (methods or pieces of advice). At this level, dynamic coverage analysis can also be performed, by supporting the execution of test cases. It is important to note that the analysis itself is only done when the user selects to explore a particular structure of the system; that is, it is not only a matter of expanding or collapsing the view. To automate the visualization, analysis, and testing based on MLSGs, we implemented a tool called SysGraph4AJ (MultiLevel System Graphs for AspectJ). The tool implements the MLSG model and supports its visualization and navigation. Currently the tool also supports statement, branch, and cyclomatic complexity coverage analysis at unit level within MLSGs. 
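As a rough sketch of this on-demand strategy (illustrative names only, not the tool's actual classes), a model node can postpone the analysis of its children until it is first expanded by the user:

import java.util.ArrayList;
import java.util.List;

// Sketch: a model node whose children are only analyzed when first expanded.
abstract class LazyNode {
    private final String name;
    private List<LazyNode> children; // null until analyzed

    LazyNode(String name) { this.name = name; }

    // Performs the (potentially costly) static analysis for this node only.
    protected abstract List<LazyNode> analyzeChildren();

    // Called when the user expands the node in the visualization.
    List<LazyNode> expand() {
        if (children == null) {
            children = new ArrayList<>(analyzeChildren());
        }
        return children;
    }

    boolean isAnalyzed() { return children != null; }

    String name() { return name; }
}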
We have also conducted an initial evaluation of the tool with three large AspectJ systems. Results show evidence of the feasibility of the tool for the analysis and testing of OO and AO programs. The remainder of this paper is structured as follows. Section II discuss background concepts related to the presented approach and Section III presents the proposed approach. Section IV introduces the SysGraph4AJ tool. Section V discusses an initial evaluation of the approach and Section VI presents related work. Finally, Section VII presents the conclusions and future work. II. BACKGROUND To help understanding the approach proposed in this paper, we briefly introduce key concepts related to Software testing and AOP. A. Software testing and White-box testing Software testing can be defined as the execution of a program against test cases with the intent of revealing faults [12]. The different testing techniques are defined based on the artifact used to derive test cases. Two of the most wellknown techniques are functional – or black-box – testing, and structural – or white-box – testing. Black-box testing derives test cases from the specification or description of a program, while white-box testing derives test cases from the internal representation of a program [12]. Some of the well-known structural testing criteria are statement, branch, or def-use [13] coverage, which require that all commands, decisions, or pairs of assignment and use locations of a variable be covered by test cases. In this paper we consider method and advice (see next section) as the smallest units to be tested, i.e., the units targeted by unit testing. In white-box testing, the control-flow graph (CFG) is used to represent the flow of control of a program, where nodes represent a block of statements executed sequentially, and edges represent the flow of control from one block to another [13]. White-box testing is usually supported by tools, as manually deriving CFGs and applying testing criteria is unreliable and uneconomical [14]. However, most open and professional coverage analysis tools do not support the visualization of the CFGs (such as Cobertura, commented in Section I, Emma1 , and Clover2 ). Some prototypical academic tools do support the visualization of CFG, but mostly separated from its context (i.e., users select to view the model from code, and only the CFG for that unit is shown, apart from any other representation). For instance, JaBUTi supports the visualization of CFGs or CFG clusters apart from the system overall structure. In this paper we propose a multi-level model for visualization, analysis, and testing of OO and AO programs that is built interactively. CFGs are shown within the larger model, so testing can be done with a broader understanding of the whole system. B. AOP and AspectJ AOP supports the implementation of separate modules called aspects that contribute to the implementation of other modules of the system. General-purpose AOP languages define four features: (1) a join point model that describes hooks in the program where additional behavior may be defined; (2) a mechanism for identifying these join points; (3) modules that encapsulate both join point specifications and behavior enhancement; and (4) a weaving process to combine both base code and aspects [15]. AspectJ [16], the most popular AOP language to date, is an extension of the Java language to support general-purpose AOP. 
In AspectJ, aspects are modules that combine join point specifications – pointcuts or, more precisely, pointcut designators (PCDs3 ); pieces of advice, which implement the desired behavior to be added at join points; and regular OO structures such as methods and fields. Advice can be executed before, after, or around join points selected by the corresponding pointcut, and are implemented as method-like constructs. Advice can also pick context information from the join point that caused them to execute. Aspects can also declare members – fields and methods – to be owned by other types, i.e., inter-type declarations. AspectJ also supports declarations of warnings and errors that arise when certain join points are identified at compile time, or reached at execution. Consider the partial AspectJ implementation of an Online Music Service shown in Figure 1, where songs from a database can be played and have their information shown to the user (adapted from an example presented by Bodkin and Laddad [17]). Each user has an account with credit to access songs available on the database. At a certain price, given for each song, users can play songs. The song price is debited from the user account whenever the user plays it. Reading lyrics of a song should be available to users at no charge. If a 1 http://emma.sourceforge.net/ - 01/14/2013. 2 http://www.atlassian.com/software/clover/overview - 01/14/2013. pointcut is the set of selected join points itself and the PCD is usually a language construct that defines pointcuts. For simplicity, we use these terms interchangeably. 3A public class Song implements Playable { private String name; public Song(String name) { ... } public String getName() { ... } public void play() { ... } public void showLyrics() { ... } public boolean equals(Object o) { ... } public int hashCode() { ... } } public aspect BillingPolicy { public pointcut useTitle() : execution(* Playable.play(..)) || execution(* Song.showLyrics(..)); public pointcut topLevelUseTitle(): useTitle() && !cflowbelow(useTitle()); after(Playable playable) returning : topLevelUseTitle() && this(playable) { User user = (User)Session.instance().getValue("currentUser"); int amount = playable.getName().length(); user.getAccount().bill(amount); System.out.println("Charge: " + user + " " + amount); } } public aspect AccountSuspension { private boolean Account.suspended = false; public boolean Account.isSuspended() { ... } after(Account account) returning: set(int Account.owed) && this(account) { ... } before() : BillingPolicy.topLevelUseTitle() { User user = (User) Session.instance().getValue("currentUser"); if (user.getAccount().isSuspended()) { throw new IllegalArgumentException(); } } } Fig. 1. Partial source code of the Online Music Service [17]. In the AO code example presented in Figure 1 there are two faults. The first is related to the useTitle pointcut, which selects the execution of the showLyrics method as a join point. This makes the user be charged when accessing songs’ lyrics. However, according to the specification of the program, reading lyrics should not be charged. As commented in Section I, AOP tends to introduce uncertainty about module interactions, when only the code is inspected. In particular, for this example scenario, the explicit visualization of where the pieces of advice affect the system is important to discover the fault4 . We believe the model we propose would help the tester in such a task. 
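For illustration, a functional test along the following lines could expose this fault by checking that reading lyrics leaves the user's balance untouched. This is only a sketch: TestFixtures, getBalance(), and the User accessor shown here are assumed helpers, not part of the code in Figure 1.

import static org.testng.Assert.assertEquals;

import org.testng.annotations.Test;

public class BillingPolicyTest {

    @Test
    public void showingLyricsShouldNotBeCharged() {
        // Assumed test helper: a logged-in user with known credit (not in the paper's code).
        User user = TestFixtures.loggedInUserWithCredit(100);
        int balanceBefore = user.getAccount().getBalance(); // getBalance() is assumed here

        new Song("Imagine").showLyrics(); // reading lyrics should be free of charge

        // With the faulty useTitle pointcut, the billing advice also runs for showLyrics()
        // and this assertion fails, revealing the defect.
        assertEquals(user.getAccount().getBalance(), balanceBefore);
    }
}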
The second fault present in the AO code example in Figure 1 is the throwing of an inadequate exception in AccountSuspension’s before advice – IllegalArgumentException instead of AccountSuspensionException. Such fault would not be easily spotted via inspection: only by executing the true part of the branch, would the fault most probably be revealed. For this case, structural testing using branch coverage analysis would be more adequate than the system’s inspection5 . Based on this motivating example, we believe the interplay between visualization and structural testing is a promising approach, specially for AO programs. Therefore, in this section we define a model to support such approach, also keeping in mind the performance issue when dealing with large models. The incremental strategy is intended to keep an adequate performance while deriving the model. A. The Multi-Level System Graph user tries to play a song without enough credits, the system yields an adequate failure. The system also manages the user access sessions. In particular, Figure 1 shows the Song class that represents songs that can be played; the BillingPolicy aspect, that implements the billing policy of the system; and the AccountSuspension aspect, which implements the account suspension behavior of the system. Note that the after returning advice of the BillingPolicy aspect and the before advice of the AccountSuspension aspect affect the execution of some of the system’s methods, according to the topLevelUseTitle pointcut. III. I NCREMENTAL V ISUALIZATION , A NALYSIS , AND T ESTING OF OO AND AO PROGRAMS To support the visualization, analysis, and structural testing of OO and AO programs, an interesting approach is to derive the underlying model by levels and interactively. Such a model could then be used as a means to navigate through the system incrementally and apply structural testing criteria to test the program as it is analyzed. The visualization of the system using this model would also serve itself as a visually appealing inspection of the system’s structure. This type of inspection could help discovering faults statically, while the structural testing functionality would support the dynamic verification of the system. The model we propose is called Multi-Level System Graph (MLSG), and it represents the high-level package structure of the system all the way down to the control flow of its units. The MLSG is a composition of known models – such as call and control flow graphs – that nevertheless, to the best of our knowledge, have not been combined in the way we propose in this paper. The MSLG can be formally defined as a directed graph M LSG = N, E where: • N = P ∪ C ∪ A ∪ M ∪ Ad ∪ F ∪ P c ∪ F l, where: – P is the set of package nodes, C is the set of class nodes, A is the set of aspect nodes, M is the set of method nodes, Ad is the set of advice nodes, F is the set of field nodes, P c is the set of pointcut nodes, and F l is the set of control flow nodes that represent blocks of code statements. 4 There are some features presented by modern IDEs that also help discovering such a fault. For instance, Eclipse AJDT [18] shows where each advice affects the system. This could help the developer notice that the showLyrics method should not be affected. However, we believe the model we present makes the join points more explicit, further facilitating the inspection of such types of fault. 5 The example we present is only an illustration. 
It is clear that the application of graph inspections and structural testing could help revealing other types of faults. However, we believe the example serves as a motivation for the approach we propose in this paper, since it shows two instances of faults that could be found using an approach that supports both techniques.
• E = Co ∪ Ca ∪ I ∪ Fe, where:
– Co is the set of contains edges. A contains edge (N1, N2) represents that the structure represented by node N1 contains the structure represented by node N2;
– Ca is the set of call edges (N1, N2), N1 ∈ M, N2 ∈ M, which represent that the method represented by N1 calls the method represented by N2;
– I is the set of interception edges. An interception edge (N1, N2), N1 ∈ Ad, N2 ∈ (M ∪ Ad), represents that the method or advice represented by N2 is intercepted by the advice represented by N1;
– Fe is the set of flow edges (N1, N2), N1 ∈ Fl, N2 ∈ Fl, which represent that control may flow from the block of code represented by N1 to the block of code represented by N2.
The edges' types are defined by the types of their source and target nodes (e.g., interception edges must have advice nodes as source nodes and advice or method nodes as target nodes).
Fig. 2. An example MLSG for the Music Online program (package, class/aspect, method/advice, and control flow levels; the legend distinguishes package, class, aspect, method, field, advice, pointcut, and control flow nodes; contains, call/interception, and control flow edges; and unanalyzed nodes).
An example of a partial MLSG of the example AO system discussed in Section II is presented in Figure 2. Note that there are parts of the system that were not fully analyzed (Unanalyzed Nodes). This is because the model shows a scenario where the user chose to expand some of the packages, modules (classes/aspects), and units (methods/pieces of advice), but not all. By looking at the model we can quickly see that the BillingPolicy after returning advice and the AccountSuspension before advice are affecting the Song's play and showLyrics methods. As commented before, the interception of the showLyrics method shown by crossed interception edges is a fault, since only the play operation should be charged. This is an example of how the inspection of the MLSG can support discovering faults.
In the same model we can also see the control-flow graph of the AccountSuspension before advice (note that the control flow graphs of the other units could also be expanded once the user selects this operation). Coverage analysis can be performed by using the MLSG. For instance, by executing test cases against the program, with the model we could analyze how many of the statement blocks or branches of the AccountSuspension before advice would be covered. While doing this the fault present in this unit could be uncovered.
Another type of model that is interesting for the visualization and testing of systems is the call chain graph (CCG), obtained from the analysis of the call hierarchy (such as done in [19]). The CCG shows the sequence of calls (and, in our case, advice interactions as well) that happen from a given unit. The same information is available at the MLSG; however, the CCG shows a more vertical view of the system, according to the call hierarchy, while the MLSG shows a more horizontal view of the system, according to its structure, with methods and pieces of advice at the same level. Figure 3 shows an example CCG built from the Playlist's play method. Numbers indicate the order in which the methods are called or in which control flow is passed to pieces of advice.
Fig. 3. An example CCG for the Music Online program, built from the Playlist's play method.
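A minimal sketch (ours, not SysGraph4AJ's actual model classes) of how the MLSG's typed nodes and edges could be represented:

import java.util.ArrayList;
import java.util.List;

// Sketch of the MLSG as a directed graph with typed nodes and edges.
class Mlsg {

    enum NodeKind { PACKAGE, CLASS, ASPECT, METHOD, ADVICE, FIELD, POINTCUT, FLOW_BLOCK }

    enum EdgeKind { CONTAINS, CALL, INTERCEPTION, FLOW }

    record Node(NodeKind kind, String name) { }

    record Edge(EdgeKind kind, Node source, Node target) { }

    private final List<Node> nodes = new ArrayList<>();
    private final List<Edge> edges = new ArrayList<>();

    Node addNode(NodeKind kind, String name) {
        Node n = new Node(kind, name);
        nodes.add(n);
        return n;
    }

    // Edge types constrain source/target kinds, e.g. interception edges start at advice nodes.
    Edge addEdge(EdgeKind kind, Node source, Node target) {
        if (kind == EdgeKind.INTERCEPTION && source.kind() != NodeKind.ADVICE) {
            throw new IllegalArgumentException("interception edges must start at an advice node");
        }
        Edge e = new Edge(kind, source, target);
        edges.add(e);
        return e;
    }
}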
Another type of model that is interesting for the visualization and testing of systems is the call chain graph (CCG), obtained from the analysis of the call hierarchy (such as done in [19]). The CCG shows the sequence of calls (and, in our case, advice interactions as well) that happen from a given unit. The same information is available in the MLSG; however, the CCG shows a more vertical view of the system, according to the call hierarchy, while the MLSG shows a more horizontal view of the system, according to its structure, with methods and pieces of advice at the same level. Figure 3 shows an example CCG built from the Playlist's play method. Numbers indicate the order in which the methods are called or in which control flow is passed to pieces of advice.
Fig. 3. An example CCG for the Music Online program, built from the Playlist's play method.
IV. IMPLEMENTATION: SYSGRAPH4AJ
We developed a prototype tool named SysGraph4AJ (from Multi-Level System Graphs for AspectJ) that implements the proposed MLSG model. The tool is a standalone application written in Java. The main features we are keeping in mind while developing SysGraph4AJ are its visualization capability (we want the models shown by the tool to be intuitive and useful at the same time) and its performance. In particular, we started developing SysGraph4AJ also because of some performance issues we have observed in other testing tools (such as JaBUTi). We believe this feature is much valued by developers nowadays, as commented in Section I.
The architecture of the tool is divided into the following five packages: model, analysis, gui, visualization, and graph. The model package contains classes that represent the MLSG constituents (each node and edge type); the analysis package is responsible for the analysis of the object code so that it can be represented by an MLSG; the gui package is responsible for the Graphical User Interface; the visualization package is responsible for implementing the visualization of the MLSGs; and the graph package is responsible for the control flow graph implementation (we decided to separate it from the model package because this part of the MLSG is more complex than the others).
For the analysis part of the tool, we used both the Java API (to load classes, for instance) and the Apache Byte Code Engineering Library (BCEL6). BCEL was used to analyze methods (for instance, to check their visibility, parameter and return types) and aspects. We decided to perform the analysis on bytecode for three reasons: (1) AspectJ classes and aspects are compiled into common Java bytecode and can be analyzed with BCEL, so no weaving of the source code is required (advice interactions can be more easily identified due to implementation conventions adopted by AspectJ [20]); (2) the analysis can be made even in the absence of the source code; and (3) in comparison with source code, the bytecode represents more faithfully the interactions that actually occur in the system (so the model is more realistic).
6 http://commons.apache.org/bcel/ - last accessed on 01/15/2013.
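As an illustration of this kind of bytecode-level analysis, the sketch below uses the BCEL classfile API to list the visibility, return type, and parameter types of each method in a compiled class. It is a simplified example of our own, not code taken from SysGraph4AJ, and the class file path is hypothetical.

// Minimal BCEL-based method analysis (illustrative; the path is hypothetical).
import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.Method;

public class MethodLister {
    public static void main(String[] args) throws Exception {
        // AspectJ aspects are compiled to regular .class files, so the same
        // parser works for classes and aspects.
        JavaClass clazz = new ClassParser("bin/musiconline/Song.class").parse();
        for (Method m : clazz.getMethods()) {
            String visibility = m.isPublic() ? "public"
                              : m.isProtected() ? "protected"
                              : m.isPrivate() ? "private" : "package";
            System.out.printf("%s %s %s(%s)%n",
                    visibility,
                    m.getReturnType(),                        // e.g. void, int, java.lang.String
                    m.getName(),
                    java.util.Arrays.toString(m.getArgumentTypes()));
        }
    }
}

In SysGraph4AJ, information of this kind is what populates the method and advice nodes of the MLSG; advice can be recognized through the compilation conventions adopted by AspectJ [20].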
The analysis package also contains a class responsible for the coverage analysis. Currently we are using the Java Code Coverage Library (JaCoCo7) to implement class instrumentation and statement and branch coverage analysis. We decided to use JaCoCo because it is free and provided good performance in our initial experiments.
7 http://www.eclemma.org/jacoco/trunk/index.html - last accessed on 01/30/2013.
For the visualization functionality, we used the Java Universal Network/Graph Framework (JUNG8). We decided to use JUNG because it is open source, provides adequate documentation, and is straightforward to use. Moreover, we made performance tests with JUNG and other graph libraries and noted that JUNG was the fastest.
8 http://jung.sourceforge.net/ - last accessed on 01/31/2013.
The visualization package is the one that actually uses the JUNG API. This package is responsible for converting the underlying model represented in our system – built using the analysis package – to a graph in the JUNG representation. This package also implements the creation of the panel that shows the graphical representation of the graphs, as well as coloring and mouse functionalities. To convert our model to a JUNG graph we begin by laying it out as a tree, where the root is the "bin" folder of the project being analyzed, and the methods, pieces of advice, fields and pointcuts are the leaves. This strategy makes the graph present the layered structure that we desire, such as in the example presented in Figure 2. Since the model is initially a tree, there are no call, interception, or flow edges, only contains edges. Information about the additional edges is stored in a table, and later added to the model.
The visualization package also contains a class to construct CCGs from MLSGs. The construction of the CCG is straightforward because all call and interception information is available in the MLSG. Control flow graphs are implemented inside the graph package. It contains a model subpackage that implements the underlying model, and classes to generate the graphs from a unit's bytecode.
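The conversion step described above can be pictured with the small example below, which builds a directed JUNG graph whose vertices are names of MLSG structures, whose edge labels carry the edge kind, and which is displayed in a Swing window. It assumes the JUNG 2.x API, the graph contents are hypothetical, and it is only an approximation of what the visualization package actually does.

// Illustrative conversion of MLSG contents to a JUNG graph (JUNG 2.x API assumed).
import edu.uci.ics.jung.algorithms.layout.CircleLayout;
import edu.uci.ics.jung.graph.DirectedSparseGraph;
import edu.uci.ics.jung.visualization.VisualizationViewer;
import javax.swing.JFrame;
import java.awt.Dimension;

public class MlsgJungView {
    public static void main(String[] args) {
        DirectedSparseGraph<String, String> g = new DirectedSparseGraph<>();
        // Contains edges only: the initial MLSG is a tree rooted at "bin".
        g.addVertex("bin"); g.addVertex("billing"); g.addVertex("BillingPolicy");
        g.addEdge("contains:bin->billing", "bin", "billing");
        g.addEdge("contains:billing->BillingPolicy", "billing", "BillingPolicy");
        // Call/interception edges would be added later, once units are analyzed.

        VisualizationViewer<String, String> vv =
                new VisualizationViewer<>(new CircleLayout<>(g), new Dimension(400, 300));
        JFrame frame = new JFrame("MLSG (illustrative)");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.getContentPane().add(vv);
        frame.pack();
        frame.setVisible(true);
    }
}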
A. Tool's Usage
When the user starts SysGraph4AJ, he must first decide which Java project to analyze. Currently this is done by selecting a "bin" directory which contains all bytecode classes of the system. Initially the tool shows the root package (which represents the "bin" directory) and all system packages. Each package is analyzed until a class or aspect is found (that is, if there are packages and subpackages with no classes, the lower-level packages are analyzed until a module is reached). From there on, the user can double-click on the desired classes or aspects to see their structural constituents (methods, pieces of advice, fields, and pointcuts). When a user double-clicks a method, dependence analysis is performed, and call and interception edges are added to the MLSG, according to the system's interactions.
Figure 4 shows a screenshot of SysGraph4AJ with an MLSG similar to the one presented in Figure 2, for the same example system. In SysGraph4AJ, we decided to use colors instead of letter labels to differentiate node types. We can see, for instance, that aspects are represented by pink nodes, methods are represented by light blue nodes, and pieces of advice are represented by white nodes. Contains edges are represented by dashed lines, call/interception edges are represented by dashed lines with larger dashes, and control-flow edges are represented by solid lines. Within CFGs, yellow nodes represent regular blocks while black nodes represent blocks that contain return statements (exit nodes). Control flow graphs can be shown by left-clicking on a unit and choosing the "View control flow graph" option. Call chain graphs can also be shown by choosing the "View call chain" option or the "View recursive call chain" option. The second option performs the analysis recursively down to the lowest-level calls and advice interceptions, and shows the corresponding graph. The first option shows only the units that were already analyzed in the corresponding MLSG. The CCG is shown in a separate window, as its view is essentially different from that of the MLSG (as commented in Section III).
Fig. 4. A screenshot of the SysGraph4AJ tool.
Figure 5 shows a screenshot of SysGraph4AJ with a CCG similar to the one presented in Figure 3. The sequence of calls skips number 2 because it refers to a library or synthetic method call. We decided to exclude those to provide a leaner model containing only interactions within the system. Such calls are not represented in the MLSG either.
Fig. 5. An example CCG for the Music Online program, built from the Playlist's play method, generated with SysGraph4AJ.
Code coverage is performed by importing JUnit test cases. Currently, this is done via a menu option on SysGraph4AJ's main window menu bar. The user chooses which JUnit test class to execute, and the tool automatically runs the test cases and calculates instruction and branch coverage. Coverage is shown in a separate window, but we are currently implementing the visualization of coverage on the MLSG itself. For instance, when a JUnit test class is run, we want to show on the MLSG which classes were covered and the coverage percentages close to the corresponding classes and units.
V. EVALUATION
As an initial evaluation of the SysGraph4AJ tool, we decided to measure its performance while analyzing realistic software systems. It is important to evaluate performance because this was one of our main expected quality attributes while developing the tool. For this study, we selected three medium-sized AspectJ applications. The first is an AspectJ version of a Java-based object-relational data mapping framework called iBATIS9. The second system is an aspect-oriented version of HealthWatcher [21], a typical Java web-based information system. The third target application is a software product line for mobile devices, called MobileMedia [22]. The three systems were used in several evaluation studies [2, 11, 23, 24]. To give an idea of size, the iBATIS version used in our study has approximately 15 KLOC within 220 modules (classes and aspects) and 1330 units (methods and pieces of advice); HealthWatcher, 5 KLOC within 115 modules and 512 units; and MobileMedia, 3 KLOC within 60 modules and 280 units.
9 http://ibatis.apache.org/ - last accessed on 02/05/2013.
Besides measuring the performance of our tool by itself while analyzing the three systems, we also wanted to have a basis for comparison. In our case, we believe the JaBUTi tool [5] is the most similar in functionality to SysGraph4AJ. In particular, it also applies control-flow testing criteria and supports the visualization of CFGs (although not interactively and not within a multi-level system graph such as in SysGraph4AJ). Moreover, JaBUTi is an open source tool, so we have access to its code. Therefore, we also evaluated JaBUTi while performing similar analyses of the same systems and selected modules.
TABLE I. RESULTS OF THE EXPLORATORY EVALUATION (times in ms).
System          Class – Method/Advice                                          SG4AJ – u   SG4AJ – C   JBT – C
iBATIS          DynamicTagHandler – doStartFragment                                 24          25         150
                ScriptRunner – runScript                                            22          53         895
                BeanProbe – getReadablePropertyNames                                24          56         766
HealthWatcher   HWServerDistribution – around execution(...)                       109         113         489
                HWTimestamp – updateTimestamp                                       63         155         711
                SearchComplaintData – executeCommand                                55          56         395
MobileMedia     MediaUtil – readMediaAsByteArray                                    36          52         216
                UnavailablePhotoAlbumException – getCause                           18          19         139
                PersisteFavoritesAspect – around getBytesFromImageInfo(...)         19          20         324
Avg.                                                                             41.12       61.00      453.89
Legend: SG4AJ – SysGraph4AJ; JBT – JaBUTi; u – analysis of a single unit; C – analysis of whole classes.
Since JaBUTi also performs data-flow analysis, to make a fair comparison, we removed this functionality from the system
before running the experiment. The null hypothesis H1_0 of our experiment is that there is no difference in performance between SysGraph4AJ and JaBUTi while performing analyses of methods; and the alternative hypothesis H1_a is that SysGraph4AJ presents better performance than JaBUTi while performing analyses of methods. We randomly selected three units (method/advice) inside three modules (class/aspect) of each of the target systems. The idea was to simulate a situation where the tester would choose a single unit to be tested. We made sure that the analyzed structures were among the largest. The time taken to analyze and instrument each unit – including generating its CFG – was measured in milliseconds (ms). Since the generation of the model in SysGraph4AJ is interactive, to measure only the analysis and instrumentation performance, we registered the time taken by the specific operations that perform these functions (i.e., we recorded the time before and after each operation and summarized the results). With respect to JaBUTi, in order to test a method with this tool, the class that contains the method must be selected. When this is done, all methods contained in the class are analyzed. Therefore, in this evaluation we were in fact comparing the interactive analysis strategy adopted by SysGraph4AJ against the upfront analysis strategy adopted by JaBUTi.
Table I shows the results of our study. Note that the analysis of methods and pieces of advice in SysGraph4AJ (column SG4AJ – u) is very fast (it takes much less than a second, on average, to analyze and instrument each unit and generate the corresponding model). In JaBUTi, on the other hand, the performance is around 10 times worse. This is somewhat expected, since JaBUTi analyzes all methods within the class. To check whether the difference was statistically significant, we ran a Wilcoxon/Mann-Whitney paired test, since the observations did not seem to follow a normal distribution (according to a Shapiro-Wilk normality test). The statistical test revealed a significant difference at the 95% confidence level (p-value = 0.0004094, with Bonferroni correction since we are performing two tests with the same data). It is clear, however, that this comparison is in fact measuring how much faster it is to analyze methods interactively rather than all upfront. To compare the performance of SysGraph4AJ with JaBUTi only with respect to their instrumentation and analysis methods, regardless of the adopted strategy, we also measured how long SysGraph4AJ took to analyze all methods within the target classes. Column SG4AJ – C of Table I shows these observations. The difference was again statistically significant (p-value = 0.0007874, with Bonferroni correction). Both statistical tests support the alternative hypothesis that SysGraph4AJ is faster than JaBUTi in the analyses of methods.
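For readers who wish to reproduce this kind of comparison, the sketch below runs a Wilcoxon signed-rank test on the paired per-unit observations of Table I using Apache Commons Math 3 and applies a Bonferroni correction for two tests. This is our own illustration of the procedure, not the authors' analysis script, and the resulting p-value may differ slightly from the reported one depending on how ties and the normal approximation are handled by the library.

// Illustrative re-analysis of the Table I data (Apache Commons Math 3 assumed).
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class TableIStats {
    public static void main(String[] args) {
        // Per-unit times (ms): SysGraph4AJ single-unit analysis vs. JaBUTi whole-class analysis.
        double[] sg4ajUnit  = { 24, 22, 24, 109, 63, 55, 36, 18, 19 };
        double[] jabutiClass = { 150, 895, 766, 489, 711, 395, 216, 139, 324 };

        double p = new WilcoxonSignedRankTest()
                .wilcoxonSignedRankTest(sg4ajUnit, jabutiClass, false); // normal approximation
        double pBonferroni = Math.min(1.0, 2 * p);                      // two tests on the same data

        System.out.printf("raw p = %.6f, Bonferroni-corrected p = %.6f%n", p, pBonferroni);
    }
}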
Although SysGraph4AJ appears to perform better than JaBUTi while analyzing and instrumenting modules and units, the figures shown even for JaBUTi are still small (i.e., waiting half a second for an analysis to be done might not affect the user's experience). However, we must note that these figures are for the analysis of single modules, and therefore represent only a part of the startup operations performed by JaBUTi. To have an idea of how long it would take for JaBUTi to analyze a whole system, we can estimate the total time to analyze iBATIS, the largest system in our sample. Note that even if the tester were interested in testing a single method from each class, the whole system would have to be analyzed, because of the strategy adopted by the tool. iBATIS contains around 220 modules (classes + aspects). Therefore, we can multiply the number of modules by the average time JaBUTi took to analyze the nine target modules (453.89 ms). This adds up to 99,855.80 ms, more than 1.5 minutes. Having to wait more than a minute and a half before starting to use the tool might annoy users, considering, for instance, the recent performance critiques the Eclipse IDE has received [9]. Also note that iBATIS is a medium-sized system; for larger systems the startup time could be even greater. It is important to note, however, that JaBUTi implements many other features that are not addressed by SysGraph4AJ (e.g., data flow-based testing and metrics gathering). This might also explain why it performs worse than SysGraph4AJ: the design goals of its developers were broader than ours. Moreover, we believe that the use of consolidated libraries in the implementation of core functions of SysGraph4AJ, such as BCEL for analysis, JaCoCo for instrumentation and coverage analysis, and JUNG for visualization, helped improve its performance. JaBUTi's implementation also relies on some libraries (such as BCEL), but many other parts of the tool were implemented by the original developers themselves (such as class instrumentation), which might explain in part its weaker performance (i.e., it is hard to be an expert in the implementation of a diversity of features and preserve their good performance, especially when the developers are academics more interested in proof-of-concept prototypes).
With respect to the coverage analysis performance, we have not yet been able to measure it for the target systems, because they require a configured environment that was unavailable at the time. However, since the observed execution time overhead for applications instrumented with JaCoCo is typically less than 10% [25], we believe the coverage analysis performance will also be adequate. In any case, to have an initial idea of the performance of the coverage analysis part of SysGraph4AJ, we recorded the time taken to execute 12 test cases against the Music Online example application shown in Section II. It took only 71 ms to run the tests and collect coverage analysis information for the BillingPolicy aspect.
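To give an idea of what this coverage step involves, the sketch below reads a JaCoCo execution data file and prints instruction and branch coverage per class using the JaCoCo core API. It is only a sketch of one possible offline-analysis setup, not the SysGraph4AJ implementation; the file and directory names are hypothetical, and the exact API calls should be checked against the JaCoCo version in use.

// Illustrative offline coverage analysis with the JaCoCo core API (names hypothetical).
import java.io.File;
import org.jacoco.core.analysis.Analyzer;
import org.jacoco.core.analysis.CoverageBuilder;
import org.jacoco.core.analysis.IClassCoverage;
import org.jacoco.core.tools.ExecFileLoader;

public class CoverageReport {
    public static void main(String[] args) throws Exception {
        ExecFileLoader loader = new ExecFileLoader();
        loader.load(new File("jacoco.exec"));          // execution data written by the JaCoCo agent

        CoverageBuilder builder = new CoverageBuilder();
        Analyzer analyzer = new Analyzer(loader.getExecutionDataStore(), builder);
        analyzer.analyzeAll(new File("bin"));          // the compiled classes that were executed

        for (IClassCoverage cc : builder.getClasses()) {
            System.out.printf("%s: instructions %d/%d, branches %d/%d%n",
                    cc.getName(),
                    cc.getInstructionCounter().getCoveredCount(),
                    cc.getInstructionCounter().getTotalCount(),
                    cc.getBranchCounter().getCoveredCount(),
                    cc.getBranchCounter().getTotalCount());
        }
    }
}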
VI. RELATED WORK
Research related to this work addresses: (i) challenges and approaches for testing AO programs, with a focus on structural testing; and (ii) tools to support structural testing of Java and AspectJ programs. Both categories are described next.
A. Challenges and Approaches for AO Testing
Alexander et al. [26] were the first to discuss the challenges for testing AO programs. They described potential sources of faults and fault types which are directly related to the control and data flow of such programs (e.g., incorrect aspect precedence, incorrect focus on control flow, and incorrect changes in control dependencies). Since then, several refinements and amendments to Alexander et al.'s taxonomy have been proposed [24, 27, 28, 29]. To deal with challenges such as the ones described by Alexander et al., researchers have been investigating varied testing approaches. With respect to structural testing, Lemos et al. [30] devised a graph-based representation, named AODU, which includes crosscutting nodes, that is, nodes that represent control flow and data flow information about the advised join points in the base code. Evolutions of Lemos et al.'s work comprise the integration of unit graphs to support pairwise [4], pointcut-based [2], and multi-level integration testing of Java and AspectJ programs [6]. These approaches are supported by the JaBUTi family of tools, which is described in the next section. The main difference between the AODU graph and the MLSG introduced in this paper is that the former is constructed only for a single unit and is displayed out of its context. The MLSG shows the CFGs of units within the system's context. On the other hand, the AODU contains data-flow information, which is not yet present in our approach. Other structural testing approaches for AO programs at the integration level can also be found in the literature. For example, Zhao [31] proposes integrated graphs that include particular groups of communicating class methods and aspect advices. Another example is the approach of Wedyan and Ghosh [32], who represent a whole AO system through a data flow graph, named ICFG, upon which test requirements are derived. Our approach differs from Zhao's and Wedyan and Ghosh's approaches again with respect to the broader system context in which the MLSG is built. Moreover, to the best of our knowledge, none of the described related approaches were implemented.
B. Tools
JaBUTi [5] was developed to support unit-level, control flow and data flow-based testing of Java programs. The tool is capable of deriving and tracking the coverage of test requirements for single units (i.e., class methods) based on bytecode-level instrumentation. In subsequent versions, JaBUTi was extended to support the testing of AO programs written in AspectJ at the unit [30] and integration levels [2, 4, 6]. The main difference between the JaBUTi tools and SysGraph4AJ (described in this paper) lies in the flexibility the latter offers to the user. SysGraph4AJ enables the user to expand the program graph up to a full view of the system in terms of packages, modules, units, and internal unit CFGs. That is, SysGraph4AJ provides a comprehensive representation of the system, and derives test requirements from this overall representation. The JaBUTi family members, on the other hand, provide more restricted views of the system structure. For example, the latest JaBUTi/AJ version automates multi-level integration testing [6]. In this case, the tester selects a unit from a preselected set of modules, and then an integrated CFG is built up to a prespecified depth level. Once the CFG is built, the test requirements are derived and the tester cannot modify the set of modules and units under test. DCT-AJ [32] is another recent tool that automates data flow-based criteria for Java and AspectJ programs. Differently from JaBUTi/AJ and SysGraph4AJ, DCT-AJ builds an integrated CFG to represent all interacting system modules, which, however, is only used as the underlying model to derive test requirements. That is, the CFG created by DCT-AJ cannot be visualized by the user. Other open-source and professional coverage analysis tools, such as Cobertura [7], EMMA [8] and Clover [33], do not support the visualization of CFGs. They automate control flow-based criteria like statement and branch coverage, and create highlighted code coverage views the user can browse through.
Finally, widely used IDEs such as Eclipse10 and NetBeans11 offer facilities related to method call hierarchy browsing. This enables the programmer to visualize method call chains in a tree-based representation that can be expanded or collapsed through mouse clicks. However, these IDEs provide neither native coverage analysis nor a program graph representation as rich in detail as the MLSG model.
10 http://www.eclipse.org/ - last accessed on 17/04/2013.
11 http://netbeans.org/ - last accessed on 17/04/2013.
VII. CONCLUSIONS
In this paper we have presented an approach for the visualization, analysis, and structural testing of Java and AspectJ programs. We have defined a model called the Multi-Level System Graph (MLSG) that represents the structure of a system and can be constructed interactively. We have also implemented the proposed approach in a tool, and provided initial evidence of its good performance. Currently, the tool supports visualization of the system's structure and structural testing at the unit level. However, we intend to make SysGraph4AJ a basic framework for implementing other structural testing approaches, such as integration testing. Since, in general, most professional developers do not have time to invest in understanding whole systems with the type of approach presented in this paper, we believe the MLSG can be especially useful for testers at this moment. However, we also believe that if the MLSG could be seamlessly integrated with development environments, the approach would also be interesting for other types of users. For instance, by providing direct links from the MLSG nodes to the source code of the related structures, users could navigate through the system and also easily edit its code. In the future we intend to extend our tool to provide this type of functionality.
ACKNOWLEDGEMENTS
The authors would like to thank FAPESP (Otavio Lemos, grant 2010/15540-2) for financial support.
REFERENCES
[1] E. S. F. Najumudheen, R. Mall, and D. Samanta, "A dependence representation for coverage testing of object-oriented programs," Journal of Object Technology, vol. 9, no. 4, pp. 1–23, 2010.
[2] O. A. L. Lemos and P. C. Masiero, "A pointcut-based coverage analysis approach for aspect-oriented programs," Inf. Sci., vol. 181, no. 13, pp. 2721–2746, Jul. 2011.
[3] D. Grove, G. DeFouw, J. Dean, and C. Chambers, "Call graph construction in object-oriented languages," in Proc. of the 12th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, ser. OOPSLA '97. New York, NY, USA: ACM, 1997, pp. 108–124.
[4] O. A. L. Lemos, I. G. Franchin, and P. C. Masiero, "Integration testing of object-oriented and aspect-oriented programs: A structural pairwise approach for Java," Sci. Comput. Program., vol. 74, no. 10, pp. 861–878, Aug. 2009.
[5] A. M. R. Vincenzi, J. C. Maldonado, W. E. Wong, and M. E. Delamaro, "Coverage testing of Java programs and components," Science of Computer Programming, vol. 56, no. 1-2, pp. 211–230, Apr. 2005.
[6] B. B. P. Cafeo and P. C. Masiero, "Contextual integration testing of object-oriented and aspect-oriented programs: A structural approach for Java and AspectJ," in Proc. of the 2011 25th Brazilian Symposium on Software Engineering, ser. SBES '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 214–223.
[7] M. Doliner, "Cobertura tool website," Online, 2006, http://cobertura.sourceforge.net/index.html - last accessed on 06/02/2013.
[8] V. Roubtsov, "EMMA: A free Java code coverage tool," Online, 2005, http://emma.sourceforge.net/ - last accessed on 06/02/2013.
[9] The H Open, "Weak performance of Eclipse 4.2 criticised," Online, 2013, http://www.h-online.com/open/news/item/Weak-performance-of-Eclipse-4-2-criticised-1702921.html - last accessed on 19/04/2013.
[10] C. Andersson and P. Runeson, "A replicated quantitative analysis of fault distributions in complex software systems," IEEE Trans. Softw. Eng., vol. 33, no. 5, pp. 273–286, May 2007.
[11] F. Ferrari, R. Burrows, O. Lemos, A. Garcia, E. Figueiredo, N. Cacho, F. Lopes, N. Temudo, L. Silva, S. Soares, A. Rashid, P. Masiero, T. Batista, and J. Maldonado, "An exploratory study of fault-proneness in evolving aspect-oriented programs," in Proc. of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ser. ICSE '10. New York, NY, USA: ACM, 2010, pp. 65–74.
[12] G. J. Myers, C. Sandler, T. Badgett, and T. M. Thomas, The Art of Software Testing, 2nd ed. John Wiley & Sons, 2004.
[13] S. Rapps and E. J. Weyuker, "Selecting software test data using data flow information," IEEE Trans. Softw. Eng., vol. 11, no. 4, pp. 367–375, 1985.
[14] IEEE, "IEEE standard for software unit testing," Institute of Electrical and Electronics Engineers, Standard 1008-1987, 1987.
[15] T. Elrad, R. E. Filman, and A. Bader, "Aspect-oriented programming: Introduction," Communications of the ACM, vol. 44, no. 10, pp. 29–32, 2001.
[16] G. Kiczales, J. Irwin, J. Lamping, J.-M. Loingtier, C. Lopes, C. Maeda, and A. Menhdhekar, "Aspect-oriented programming," in Proc. of the European Conference on Object-Oriented Programming, M. Akşit and S. Matsuoka, Eds., vol. 1241. Berlin, Heidelberg, and New York: Springer-Verlag, 1997, pp. 220–242.
[17] R. Bodkin and R. Laddad, "Enterprise AspectJ tutorial using Eclipse," Online, 2005, EclipseCon 2005. Available from: http://www.eclipsecon.org/2005/presentations/EclipseCon2005 EnterpriseAspectJTutorial9.pdf (accessed 12/3/2007).
[18] The Eclipse Foundation, "AJDT: AspectJ Development Tools," Online, 2013, http://www.eclipse.org/ajdt/ - last accessed on 19/04/2013.
[19] A. Rountev, S. Kagan, and M. Gibas, "Static and dynamic analysis of call chains in Java," in Proc. of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA '04. New York, NY, USA: ACM, 2004, pp. 1–11.
[20] E. Hilsdale and J. Hugunin, "Advice weaving in AspectJ," in Proc. of the 3rd International Conference on Aspect-Oriented Software Development, ser. AOSD '04. New York, NY, USA: ACM, 2004, pp. 26–35.
[21] P. Greenwood, T. Bartolomei, E. Figueiredo, M. Dosea, A. Garcia, N. Cacho, C. Sant'Anna, S. Soares, P. Borba, U. Kulesza, and A. Rashid, "On the impact of aspectual decompositions on design stability: an empirical study," in Proc. of the 21st European Conference on Object-Oriented Programming, ser. ECOOP '07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 176–200.
[22] E. Figueiredo, N. Cacho, C. Sant'Anna, M. Monteiro, U. Kulesza, A. Garcia, S. Soares, F. Ferrari, S. Khan, F. Castor Filho, and F. Dantas, "Evolving software product lines with aspects: an empirical study on design stability," in Proc. of the 30th International Conference on Software Engineering, ser. ICSE '08. New York, NY, USA: ACM, 2008, pp. 261–270.
[23] F. C. Filho, N. Cacho, E. Figueiredo, R. Maranhão, A. Garcia, and C. M. F. Rubira, "Exceptions and aspects: the devil is in the details," in Proc. of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. SIGSOFT '06/FSE-14.
New York, NY, USA: ACM, 2006, pp. 152–162.
[24] F. C. Ferrari, J. C. Maldonado, and A. Rashid, "Mutation testing for aspect-oriented programs," in Proc. of the 2008 International Conference on Software Testing, Verification, and Validation, ser. ICST '08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 52–61.
[25] Mountainminds GmbH & Co. KG and Contributors, "Control flow analysis for Java methods," Online, 2013. Available from: http://www.eclemma.org/jacoco/trunk/doc/flow.html (accessed 02/04/2013).
[26] R. T. Alexander, J. M. Bieman, and A. A. Andrews, "Towards the systematic testing of aspect-oriented programs," Dept. of Computer Science, Colorado State University, Tech. Report CS-04-105, 2004.
[27] M. Ceccato, P. Tonella, and F. Ricca, "Is AOP code easier or harder to test than OOP code?" in Proc. of the 1st Workshop on Testing Aspect Oriented Programs (WTAOP), held in conjunction with AOSD, Chicago/IL, USA, 2005.
[28] A. van Deursen, M. Marin, and L. Moonen, "A systematic aspect-oriented refactoring and testing strategy, and its application to JHotDraw," Stichting Centrum voor Wiskunde en Informatica, Tech. Report SEN-R0507, 2005.
[29] O. A. L. Lemos, F. C. Ferrari, P. C. Masiero, and C. V. Lopes, "Testing aspect-oriented programming pointcut descriptors," in Proc. of the 2nd Workshop on Testing Aspect Oriented Programs (WTAOP). Portland/Maine, USA: ACM Press, 2006, pp. 33–38.
[30] O. A. L. Lemos, A. M. R. Vincenzi, J. C. Maldonado, and P. C. Masiero, "Control and data flow structural testing criteria for aspect-oriented programs," The Journal of Systems and Software, vol. 80, no. 6, pp. 862–882, 2007.
[31] J. Zhao, "Data-flow-based unit testing of aspect-oriented programs," in Proc. of the 27th Annual IEEE International Computer Software and Applications Conference (COMPSAC). IEEE Computer Society, 2003, pp. 188–197.
[32] F. Wedyan and S. Ghosh, "A dataflow testing approach for aspect-oriented programs," in Proc. of the 12th IEEE International High Assurance Systems Engineering Symposium (HASE). San Jose/CA, USA: IEEE Computer Society, 2010, pp. 64–73.
[33] Atlassian, Inc., "Clover: Java and Groovy code coverage," Online, http://www.atlassian.com/software/clover/overview - last accessed on 06/02/2013.
A Method for Model Checking Context-Aware Exception Handling
Lincoln S. Rocha, Grupo de Pesquisa GREat, UFC, Quixadá-CE, Brasil, Email: [email protected]
Rossana M. C. Andrade, Grupo de Pesquisa GREat, UFC, Fortaleza-CE, Brasil, Email: [email protected]
Alessandro F. Garcia, Grupo de Pesquisa OPUS, PUC-Rio, Rio de Janeiro-RJ, Brasil, Email: [email protected]
Resumo—Context-aware exception handling (CAEH) is an error recovery technique employed to improve the robustness of ubiquitous systems. In the CAEH design, designers specify context conditions that are used to characterize abnormal situations and establish criteria for the selection of handling measures. The erroneous specification of these conditions represents a critical design fault. Such faults can prevent the CAEH mechanism, at runtime, from identifying the desired abnormal situations or from reacting adequately when they are detected. Therefore, so that the reliability of CAEH is not compromised, these design faults must be rigorously identified and removed in the early stages of development.
However, although there are formal approaches for verifying context-aware ubiquitous systems, they do not provide appropriate support for the verification of CAEH. In this scenario, this work proposes a method for model checking CAEH. The abstractions proposed by the method allow designers to model behavioral aspects of CAEH and, based on a set of pre-established properties, to identify the existence of design faults. In order to assess the feasibility of the method: (i) a verification support tool was developed; and (ii) recurring CAEH fault scenarios were injected into the models of a system so that they could be analyzed with the proposed verification approach.
Index Terms—Ubiquitous Systems, Exception Handling, Model Checking
Abstract—Context-aware exception handling (CAEH) is an error recovery technique employed to improve the robustness of ubiquitous software. In the design of CAEH, context conditions are specified to characterize abnormal situations and used to select the proper handlers. The erroneous specification of such conditions represents a critical design fault that can lead the CAEH mechanism to behave erroneously or improperly at runtime (e.g., abnormal situations may not be recognized and the system's reaction may deviate from what is expected). Thus, in order to improve the CAEH reliability, this kind of design fault must be rigorously identified and eliminated from the design in the early stages of development. However, despite the existence of formal approaches to verify context-aware ubiquitous systems, such approaches lack specific support to verify the CAEH behavior. This work proposes a method for model checking CAEH. The method provides a set of modeling abstractions and three formally defined properties that can be used to identify existing design faults in the CAEH design. In order to assess the method's feasibility: (i) a support tool was developed; and (ii) fault scenarios that are recurring in CAEH were injected into a correct model and verified using the proposed approach.
Index Terms—Ubiquitous Systems, Exception Handling, Model Checking
I. INTRODUCTION
Context-aware ubiquitous systems are systems capable of observing the context in which they are embedded and reacting appropriately, adapting their structure and behavior or executing tasks automatically [1]. In these systems, the context represents a set of information about the environment (physical or logical, including the users and the system itself) that can be used for the purpose of improving the interaction between the user and the system or of keeping its execution correct, stable, and optimized [2]. Due to their broad application domain (e.g., smart homes, mobile visiting guides, games, and healthcare) and because they make decisions autonomously on behalf of people, context-aware ubiquitous systems need to be reliable in order to fulfill their function. Such reliability requires that they be robust (i.e., capable of dealing with abnormal situations) [3]. Context-aware exception handling (CAEH) is an error recovery approach that has been explored as an alternative to improve the robustness levels of this kind of system [4][3][5][6][7][8].
In CAEH, context is used to characterize abnormal situations in the system (e.g., a software/hardware failure or the violation of some system invariant), called contextual exceptions, and to structure the handling activities, establishing criteria for the selection and execution of handlers. In general, the occurrence of a contextual exception requires that the normal flow of the system be diverted so that the appropriate handling can be conducted. However, depending on the situation and on the CAEH mechanism adopted, the control flow may or may not be resumed after the handling finishes. In the design of context-aware ubiquitous systems, the use of appropriate formalisms and abstractions is necessary to deal, at design time, with complex issues inherent to these systems (e.g., openness, uncertainty, adaptability, volatility, and context management) [9][10][11][12][13][14]. In particular, the CAEH designer is responsible for specifying the context conditions used [5][3]: (i) in the definition of the contextual exceptions; and (ii) in the selection of the handling measures. In case (i), the specified context conditions are used by the CAEH mechanism to detect the occurrence of contextual exceptions at runtime. In case (ii), on the other hand, the context conditions work as preconditions established for each possible handler of a contextual exception. Thus, when an exception is detected, the mechanism selects, among the possible handlers of that exception, those whose preconditions are satisfied in the current context of the system. However, human fallibility and partial knowledge about how the system's context evolves can lead designers to make specification mistakes, called design faults. For example, due to negligence or lapses of attention, contradictions can easily be introduced by designers into the specification of the context conditions or, even when no contradictions exist, these conditions may represent context situations that never occur at runtime, due to the way the context evolves. From this perspective, the introduction of design faults, and their eventual propagation to the coding phase, can cause the CAEH mechanism to be configured improperly, compromising its reliability in detecting contextual exceptions or in selecting the appropriate handlers. Recent studies report that validating the exception handling code of systems that use unreliable external resources, such as context-aware ubiquitous systems, is a difficult and extremely challenging activity [15]. This stems from the fact that, to test the whole space of exceptions raised in the system, it is necessary to systematically generate all the context that triggers those exceptions. Besides being complex, this activity may become prohibitive in some cases due to the high associated costs. Hence, a rigorous analysis of the CAEH design, seeking to identify and remove design faults, can contribute to improving the reliability levels of CAEH and to reducing the costs associated with identifying and correcting defects that result from the introduction of design faults. However, although there are formal, model-based approaches aimed at analyzing the behavior of context-aware ubiquitous systems [16][11][17], such approaches address only the adaptive behavior.
They do not provide abstractions and appropriate support for modeling and analyzing the CAEH behavior, making this activity even more complex and prone to the introduction of faults. In this scenario, this work proposes a model checking based method to support the automatic identification of design faults in CAEH (Section IV). The proposed method provides a set of abstractions that allow designers to model behavioral aspects of CAEH and to map them to a formal behavioral model (a Kripke structure) compatible with the model checking technique [18]. The adopted formalism is based on states, transitions, and temporal logic due to the particular needs of designing and model checking CAEH (Section III). A set of behavioral properties is established and formally defined with temporal logic in order to help designers identify certain types of design faults (Section II). In addition, in order to assess the feasibility of the method (Section V): (i) a verification support tool was developed; and (ii) recurring CAEH fault scenarios were injected into the models of a system so that they could be analyzed with the proposed verification approach. Finally, Section VI describes related work and Section VII concludes the paper.
II. CONTEXT-AWARE EXCEPTION HANDLING
In context-aware exception handling (CAEH), context and context awareness are used by the exception handling mechanism to define, detect, and handle abnormal conditions in ubiquitous systems, called contextual exceptions. Section II-A describes the main types of contextual exceptions. In addition, a discussion of where and how design faults1 can be committed in the CAEH design is offered in Section II-B.
A. Types of Contextual Exceptions
Contextual exceptions represent abnormal situations that require a deviation in the execution flow so that the handling of the abnormality can be conducted. The detection of the occurrence of a contextual exception may indicate a possible failure in one of the elements (hardware or software) that make up the system, or that some context invariant, necessary for the execution of some system activity, has been violated. In this work, contextual exceptions were grouped into three categories: infrastructure, context invalidation, and security. They are described in the next subsections.
1) Infrastructure Contextual Exceptions: This type of contextual exception is related to the detection of context situations indicating that some hardware or software failure has occurred in one of the elements that constitute the ubiquitous system. An example of this type of contextual exception is described in [4] in the scope of a "smart" heating system. The main function of that system is to adjust the room temperature to the users' preferences. In that system, an exceptional situation is characterized when the room temperature reaches a value above the limit established by the user's preferences. This type of contextual exception helps to identify, indirectly, the occurrence of failures whose origin may be the system that controls the heating equipment (a software failure) or the equipment itself (a hardware failure). The malfunction of the heating system is considered an abnormal situation, since it may put the users' health at risk.
Note that, in order to detect this contextual exception, it is necessary to have access to context information about the room temperature and the users' preferences.
2) Context Invalidation Contextual Exceptions: This type of contextual exception is related to the violation of certain context conditions during the execution of some system task. These context conditions work as invariants of the task and, when violated, characterize an abnormal situation. For example, the authors of [3] describe this type of exception in a context-aware music player application. The music player runs on the user's mobile device, sending a continuous sound stream to the device's audio output. However, when the user enters an empty room, the application searches for an audio device available in the environment and transfers the sound stream to that device. In this application, the need for the user to be alone in the room is established as a context invariant. For the authors of [3], the violation of this invariant is considered an exceptional situation, since not respecting it may bring discomfort or annoyance to the other people present in the room. Note that the detection of this exception depends on context information about the user's location and the number of people in the same room.
1 This work adopts the nomenclature of [19], in which a fault is the physical or algorithmic cause of an error which, if propagated to the service interface of the component or system, characterizes a failure.
3) Security Contextual Exceptions: This type of exception is related to context situations that help identify violations of security policies (e.g., authentication, authorization, and privacy) and other situations that may put the physical or financial integrity of the system's users at risk. For example, the context-aware medical record system presented in [3] describes this type of exception. In that system there are three kinds of users: patients, nurses, and physicians. Physicians can make records about their patients, and nurses can read and update those records while assisting the patients. However, nurses may only access the records if they are inside the ward where the patient is and if the responsible physician is present. In that system, when a nurse tries to access a patient's records but is not in the same ward as the patient, or is in the ward but the responsible physician is not present, an exceptional situation is characterized. Note that the detection of this type of exception depends on context information about the location and profile of the patient, the nurse, and the physician.
B. Proneness to Design Faults
Based on existing works in the literature [4][3][5][6][7][8], it is possible to divide the CAEH design into two major activities: (i) specification of the exceptional context; and (ii) specification of the context-aware handling. In activity (i), designers specify the context conditions that characterize the abnormal situations identified in the system. Thus, at runtime, when one of these situations is detected by the CAEH mechanism, an occurrence of the associated contextual exception is said to have been identified.
In activity (ii), on the other hand, designers specify the handling actions to be executed to mitigate the detected exceptional situation. However, depending on the current context of the system when the exception is detected, one set of handling actions may be more appropriate than another to handle that particular exceptional occurrence. Thus, it is part of the designer's work in activity (ii) to group the handling actions and to establish context conditions that help the CAEH mechanism, at runtime, to select the appropriate set of handling actions to deal with an exceptional occurrence as a function of the current context. The fallibility of designers, partial knowledge about how the system context evolves at runtime, the lack of an appropriate notation, and the lack of tool support make the CAEH design an activity extremely prone to design faults. For example, due to negligence or lapses of attention, contradictions can easily be introduced by designers during the specification of the context conditions built in activities (i) and (ii) of the CAEH design. Moreover, even if designers create specifications free of contradictions, these may represent context situations that will never occur at runtime due to the way the system and its context evolve. Design faults like these can cause the CAEH mechanism to be misconfigured, compromising its reliability in detecting the desired abnormal situations and in selecting the adequate handling actions to deal with specific exceptional occurrences. Additionally, there is another type of design fault that can easily be committed by designers. For example, consider the CAEH design for a contextual exception in which the specifications of the context conditions built in activities (i) and (ii) are free of design faults such as those described above. Note that, even in this case, the designer may specify the context condition that characterizes the abnormal situation and the selection conditions of the handling actions in such a way that they are never satisfied simultaneously at runtime. This can happen when these context conditions contradict each other or when it is not possible for the system to reach a state in which its context satisfies both at the same time. Hence, given this proneness to design faults, a rigorous approach must be employed by designers so that design faults are identified and removed before they are propagated to the coding phase.
III. MODEL CHECKING
Model checking is a formal method employed in the automatic verification of concurrent reactive systems with a finite number of states [18]. In this approach, the behavior of the system is modeled by means of some formalism based on states and transitions, and the properties to be verified are specified using temporal logics. The verification of the behavioral properties is performed through an exhaustive enumeration (implicit or explicit) of all reachable states of the system model. The Kripke structure (Definition 1) is a formalism for behavior modeling in which the states, rather than the transitions, are labeled. In a Kripke structure, each label represents a snapshot (state) of the system's execution.
This characteristic was decisive for its choice in this work, since the behavioral aspects of the CAEH design that one wishes to analyze are influenced by the observation of the state of the system's context, and not by the actions that led it to reach a particular context state.
Definition 1 (Kripke Structure). A Kripke structure K = ⟨S, I, L, →⟩ over a finite set of atomic propositions AP is given by a finite set of states S, a set of initial states I ⊆ S, a labeling function L : S → 2^AP, which maps each state to the set of atomic propositions that are true in that state, and a total transition relation → ⊆ S × S, that is, one that satisfies the constraint ∀s ∈ S ∃s′ ∈ S such that (s, s′) ∈ →.
Usually, the properties one wishes to verify are divided into two types: (i) safety properties, which seek to express that "nothing bad will happen" during the execution of the system; and (ii) liveness (progress) properties, which seek to express that, eventually, "something good will happen" during the execution of the system. These properties are expressed using temporal logics that are interpreted over a Kripke structure. Thus, given a Kripke structure K and a temporal formula ϕ, a general formulation of the model checking problem consists of verifying whether ϕ is satisfied in the structure K, formally K |= ϕ. In this case, K represents the system model and ϕ the property to be verified. In this work, due to its expressiveness and to the model checker used, the temporal logic CTL (Computation Tree Logic) was chosen for the specification of properties about the CAEH behavior. CTL is a branching-time temporal logic that allows expressing properties about states. CTL formulas are built over atomic propositions using propositional operators (¬, ∧, ∨, → and ↔) and temporal operators (EX, EF, EG, EU, AX, AF, AG and AU). Let φ and ϕ be CTL formulas; the intuition for the temporal operators is given in Table I. For more details on CTL and model checking, see [18].
TABLE I. INTUITION FOR THE CTL TEMPORAL OPERATORS.
EX φ      "there is a path such that, in the next state, φ is true."
EF φ      "there is a path such that, in the future, φ will be true."
EG φ      "there is a path such that φ is always true."
EU(φ, ϕ)  "there is a path such that φ is true until ϕ becomes true."
AX φ      "for every path, in the next state φ is true."
AF φ      "for every path, φ is true in the future."
AG φ      "for every path, φ is always true."
AU(φ, ϕ)  "for every path, φ is true until ϕ becomes true."
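As a purely illustrative complement to Definition 1 and Table I (this small example is ours, not taken from the paper), the LaTeX fragment below writes down a two-state Kripke structure over AP = {raining, umbrellaOpen} and a CTL property that holds in it.

% Illustrative two-state Kripke structure and a CTL property (not from the paper).
\[
K = \langle S, I, L, \rightarrow \rangle, \qquad
S = \{s_0, s_1\}, \qquad I = \{s_0\},
\]
\[
L(s_0) = \{\mathit{raining}\}, \qquad
L(s_1) = \{\mathit{raining}, \mathit{umbrellaOpen}\},
\]
\[
\rightarrow \; = \{(s_0, s_1),\, (s_1, s_1)\}.
\]
% Every state has a successor, so the relation is total. From s_0 every path
% eventually reaches a state where umbrellaOpen holds, hence:
\[
K \models \mathrm{AF}\, \mathit{umbrellaOpen}.
\]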
IV. THE PROPOSED METHOD
This section presents the proposed method for model checking CAEH. An overview of the method is given in Section IV-A. Section IV-B addresses the way in which the state space to be explored is derived. In addition, the modeling activity (Section IV-C), the derivation of the Kripke structure (Section IV-D), and the specification activity (Section IV-E) of the method are detailed.
A. Overview
The proposed method provides a set of abstractions and conventions that allow designers to rigorously express the context-aware exceptional behavior and to map it to a particular Kripke structure, the formalism presented in Section III that underlies the model checking technique. In addition, the method offers a list of behavioral properties, to be verified over the context-aware exceptional behavior, with the aim of helping designers discover certain types of design faults in CAEH. The method is decomposed into two activities: modeling and specification. In the modeling activity (Section IV-C), the CAEH behavior is modeled using a set of dedicated constructs that help define the contextual exceptions and structure the handling actions. In the specification activity (Section IV-E), a set of properties that make it possible to identify a well-defined set of CAEH design faults is presented and formalized using CTL. Nevertheless, the fact that the method is able to represent the CAEH behavioral model as a Kripke structure allows other types of behavioral properties to be defined by the designers using CTL.
B. Determining the State Space
In the early stages of the CAEH design, one of the main efforts lies in identifying the context information that may be useful for designing the CAEH. At these stages, since there is no detailed knowledge about the type, origin, and structure of this information, it is pertinent to abstract these details and deal with higher-level context information. This high-level context information can be seen as propositions about the system's context, which receive, at runtime, an interpretation, true or false, according to the valuation assumed by the lower-level context variables observed by the system. In the proposed method, these propositions are called contextual propositions and make up the set CP, which represents the knowledge base used to characterize context situations relevant to the CAEH design. In this scenario, in order to obtain the state space to be explored, it is necessary to create a valuation function that assigns values to the contextual propositions in CP. However, building this function is not a trivial activity, because these contextual propositions represent low-level context information that takes on valuations nondeterministically, following laws that go beyond the boundary and the control of the system (e.g., time, weather conditions, and mobility), and that may be related to one another in dependent or conflicting ways. Thus, although contextual propositions make it possible to abstract away the details of the low-level context variables, they bring with them semantic dependency problems that make the construction of a valuation function difficult. To deal with this issue, the proposed method adopts the constraint programming technique [20] as the valuation function for the contextual propositions. This technique allows the designer to establish semantic constraints (Definition 2) over CP, guaranteeing that all generated solutions (the state space to be explored) satisfy the established constraints. By convention, the function csp(CP, C) will be used to designate the state space derived from the set C of constraints defined over CP.
Definition 2 (Constraint). A constraint is defined as a logical formula over the set CP of contextual propositions as described by the grammar in (1), where p ∈ CP is a contextual proposition, φ and ϕ are logical formulas, and ¬ (negation), ∧ (conjunction), ∨ (disjunction), ⊕ (exclusive disjunction), → (implication) and ↔ (double implication) are logical operators.
φ, ϕ ::= p | ¬φ | φ ∧ ϕ | φ ∨ ϕ | φ ⊕ ϕ | φ → ϕ | φ ↔ ϕ    (1)
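The derivation of csp(CP, C) can be pictured with the brute-force sketch below, which enumerates every truth assignment over a small, invented set of contextual propositions and keeps only those satisfying the constraints. The propositions and constraints are hypothetical, and a real implementation would rely on a constraint solver rather than on exhaustive enumeration.

// Illustrative derivation of csp(CP, C) by exhaustive enumeration (hypothetical propositions).
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StateSpace {
    public static void main(String[] args) {
        String[] cp = { "userInRoom", "roomEmpty", "audioDeviceAvailable" };
        List<Map<String, Boolean>> states = new ArrayList<>();

        for (int mask = 0; mask < (1 << cp.length); mask++) {
            Map<String, Boolean> s = new LinkedHashMap<>();
            for (int i = 0; i < cp.length; i++) s.put(cp[i], (mask & (1 << i)) != 0);

            // Constraint set C (illustrative): a room with the user in it cannot be empty.
            boolean satisfiesC = !(s.get("userInRoom") && s.get("roomEmpty"));
            if (satisfiesC) states.add(s);
        }
        // 'states' plays the role of S = csp(CP, C); each entry is the labeling of one state.
        states.forEach(System.out::println);
    }
}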
C. Modeling Activity
When modeling the CAEH behavior, some design issues related to the definition and detection of contextual exceptions and to the grouping, selection, and execution of the handling measures need to be considered. In the proposed method, the modeling activity aims to address these issues and to make it possible to map the CAEH behavioral model to a Kripke structure. For that, the abstractions of contextual exceptions, handling cases, and handling scopes are proposed.
1) Contextual Exceptions: In the proposed method, a contextual exception is defined by a name and a logical formula used to characterize its exceptional context (Definition 3). A contextual exception is detected when the formula ecs is satisfied in some given context state of the system. At that moment, the contextual exception is said to have been raised. By convention, given a contextual exception e = ⟨name, ecs⟩, the function ecs(e) is defined to retrieve the exceptional context specification (ecs) of the exception e.
Definition 3 (Contextual Exception). Given a set of contextual propositions CP, a contextual exception is defined by the tuple ⟨name, ecs⟩, where name is the name of the contextual exception and ecs is a logical formula defined over CP that specifies the exceptional context of detection.
2) Handling Cases: As discussed earlier, a contextual exception can be handled in different ways depending on the context in which the system finds itself. Handling cases (Definition 4) define the different strategies that can be employed to handle a contextual exception as a function of the current context of the system. A handling case is composed of a selection condition and a set of logical formulas that are used to describe the context situation expected after the execution of each handling action (or block of actions), sequentially. By convention, the constituents of a handling case will be referred to from now on as the selection condition and the set of handling measures, respectively.
Definition 4 (Handling Case). Given a set of contextual propositions CP, a handling case is defined as a tuple hcase = ⟨α, H⟩, where α is a logical formula defined over CP and H is an ordered set of logical formulas defined over CP.
3) Handling Scopes: Typically, exception handlers are bound to specific areas of the system code where exceptions may occur. This strategy helps delimit the scope of action of a handler during the handling activity. In the proposed method, the concept of handling scopes (Definition 5) is created to delimit the scope of the handling cases and to establish a precedence relation among them. This precedence relation is essential to resolve overlapping situations between selection conditions of handling cases (i.e., situations in which more than one handling case could be selected in the same context state). Thus, in the proposed method, the handling case with the highest precedence is evaluated first; if its selection condition is not satisfied, the next handling case with the highest precedence is evaluated, and so on.
Definition 5 (Handling Scope). Given a set of contextual propositions CP, a handling scope is defined by the tuple ⟨e, HCASE⟩, where e is a contextual exception and HCASE is an ordered set of handling cases.
The notion of ordered set mentioned in Definition 5 is related to the existence of an order relation among the handling cases. This relation makes it possible to establish the precedence order in which each handling case will be evaluated when a contextual exception is raised. In the proposed method, the evaluation order takes into account the position occupied by each handling case within the set HCASE. Therefore, for handling cases hcase_i and hcase_j, if i < j, then hcase_i takes precedence over hcase_j (i.e., hcase_i ≺ hcase_j). This order relation is, however, not fixed, although it is mandatory, and it can be changed by the designer in order to obtain some kind of benefit.
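As an illustration of Definitions 3 to 5 (using our own hypothetical names, not the notation of the supporting tool), the sketch below encodes contextual exceptions, handling cases, and handling scopes as plain Java types, with formulas represented as predicates over a context state; the example propositions are invented, loosely inspired by the heating scenario of Section II.

// Illustrative encoding of Definitions 3-5 (hypothetical types; formulas as predicates).
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class ContextualException {                          // Definition 3: <name, ecs>
    final String name;
    final Predicate<Map<String, Boolean>> ecs;              // exceptional context specification over CP
    ContextualException(String name, Predicate<Map<String, Boolean>> ecs) {
        this.name = name; this.ecs = ecs;
    }
}

final class HandlingCase {                                  // Definition 4: <alpha, H>
    final Predicate<Map<String, Boolean>> selection;        // alpha
    final List<Predicate<Map<String, Boolean>>> measures;   // H, ordered
    HandlingCase(Predicate<Map<String, Boolean>> selection,
                 List<Predicate<Map<String, Boolean>>> measures) {
        this.selection = selection; this.measures = measures;
    }
}

final class HandlingScope {                                 // Definition 5: <e, HCASE>
    final ContextualException exception;
    final List<HandlingCase> cases;                         // ordered: lower index, higher precedence
    HandlingScope(ContextualException exception, List<HandlingCase> cases) {
        this.exception = exception; this.cases = cases;
    }

    public static void main(String[] args) {
        ContextualException highTemp = new ContextualException("HighTemperature",
                ctx -> ctx.getOrDefault("tempAboveLimit", false));
        HandlingCase shutDownHeater = new HandlingCase(
                ctx -> ctx.getOrDefault("heaterReachable", false),
                List.of(ctx -> !ctx.getOrDefault("heaterOn", true)));
        System.out.println(new HandlingScope(highTemp, List.of(shutDownHeater)).exception.name);
    }
}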
3) Escopos de Tratamento: Tipicamente, os tratadores de exceção encontram-se vinculados a áreas específicas do código do sistema onde exceções podem ocorrer. Essa estratégia ajuda a delimitar o escopo de atuação de um tratador durante a atividade de tratamento. No método proposto, o conceito de escopos de tratamento (Definição 5) é criado para delimitar a atuação dos casos de tratamento e estabelecer uma relação de precedência entre estes. Essa relação de precedência é essencial para resolver situações de sobreposição entre condições de seleção de casos de tratamento (i.e., situações em que mais de um caso de tratamento pode ser selecionado num mesmo estado do contexto). Dessa forma, no método proposto, o caso de tratamento de maior precedência é avaliado primeiro; se este não tiver a sua condição de seleção satisfeita, o próximo caso de tratamento com maior precedência é avaliado, e assim por diante.

Definição 5 (Escopo de Tratamento). Dado um conjunto de proposições contextuais CP, um escopo de tratamento é definido pela tupla ⟨e, HCASE⟩, onde e é uma exceção contextual e HCASE é um conjunto ordenado de casos de tratamento.

A noção de conjunto ordenado, mencionada na Definição 5, está relacionada com a existência de uma relação de ordem entre os casos de tratamento. Essa relação permite estabelecer a ordem de precedência em que cada caso de tratamento será avaliado quando uma exceção contextual for levantada. No método proposto, a ordem de avaliação utilizada leva em consideração a posição ocupada por cada caso de tratamento dentro do conjunto HCASE. Portanto, para os casos de tratamento hcasei e hcasej, se i < j, então hcasei tem precedência sobre hcasej (i.e., hcasei ≺ hcasej). Essa relação de ordem é obrigatória, mas não é fixa, podendo ser alterada pelo projetista com o propósito de obter algum tipo de benefício.

D. Derivando a Estrutura de Kripke

Como apresentado na Seção III, uma estrutura de Kripke é uma tupla K = ⟨S, I, L, →⟩ definida sobre um conjunto finito de proposições atômicas AP. Desse modo, o processo de derivação de uma estrutura de Kripke consiste em estabelecer os elementos que a constituem, observando todas as restrições impostas pela sua definição, quais sejam: (i) o conjunto S de estados deve ser finito; e (ii) a relação de transição → deve ser total. Ao longo desta seção são descritos os procedimentos adotados pelo método para obter cada um dos constituintes da estrutura de Kripke que representa o TESC, chamada de EK. Inicialmente, de forma direta, o conjunto AP de proposições atômicas sobre o qual EK é definida é formado pelo conjunto CP de proposições contextuais (i.e., AP = CP). Além disso, o conjunto S de estados de EK é obtido a partir dos conjuntos CP de proposições contextuais e G de restrições estabelecidas sobre CP por meio da função S = csp(CP, G). Já o conjunto I de estados iniciais é estabelecido como segue: I = {s | s ∈ S, e ∈ E, val(s) |= ecs(e)}, onde S é o conjunto de estados, E é o conjunto de todas as exceções contextuais modeladas no sistema e val(s) denota a valoração das proposições contextuais no estado s. Desse modo, os estados iniciais são os estados em que a valoração das proposições contextuais (val(s)) satisfaz (|=) as especificações de contexto excepcional das exceções modeladas (ecs(e)), i.e., o conjunto I é composto pelos estados excepcionais do modelo. Já a função de rótulos L é composta pela valoração de todos os estados do sistema: L = {val(s) | s ∈ S}.
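Um esboço (igualmente hipotético, reutilizando os tipos ilustrativos anteriores) da derivação do conjunto I: dado o conjunto S de estados gerado por csp(CP, G) e o conjunto E de exceções contextuais modeladas, os estados iniciais são exatamente os estados cuja valoração satisfaz alguma especificação de contexto excepcional.

import java.util.*;
import java.util.function.Predicate;

public class DerivacaoEK {
    interface Formula extends Predicate<Map<String, Boolean>> {}
    record ExcecaoContextual(String name, Formula ecs) {}

    // I = {s em S | existe e em E tal que val(s) |= ecs(e)}: os estados excepcionais.
    static List<Map<String, Boolean>> estadosIniciais(List<Map<String, Boolean>> s,
                                                      List<ExcecaoContextual> e) {
        List<Map<String, Boolean>> iniciais = new ArrayList<>();
        for (Map<String, Boolean> estado : s) {
            if (e.stream().anyMatch(ex -> ex.ecs().test(estado))) {
                iniciais.add(estado);
            }
        }
        return iniciais; // a função de rótulos L é a própria valoração: L(s) = val(s)
    }

    public static void main(String[] args) {
        ExcecaoContextual fire =
                new ExcecaoContextual("Fire", v -> v.get("hasSmoke") && v.get("isHot"));
        List<Map<String, Boolean>> s = List.of(
                Map.of("hasSmoke", true, "isHot", true),
                Map.of("hasSmoke", false, "isHot", true));
        System.out.println(estadosIniciais(s, List.of(fire))); // apenas o primeiro estado
    }
}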
Antes de apresentar a forma como a relação de transição → de EK é derivada, duas funções auxiliares são introduzidas. O objetivo dessas funções é construir um conjunto de transições entre pares de estados. Em (2a), as transições são construídas a partir de um dado estado s e uma fórmula lógica φ, onde o estado de partida é o estado s e os estados de destino são todos aqueles cujos rótulos satisfazem (|=) a fórmula φ. Já em (2b), as transições são construídas a partir de um par de fórmulas, φ e ϕ, onde os estados de partida são todos aqueles que satisfazem φ e os de destino são os que satisfazem ϕ.

ST(s, φ, S) = {(s, r) | r ∈ S, L(r) |= φ} (2a)
FT(φ, ϕ, S) = {(s, r) | s, r ∈ S, L(s) |= φ, L(r) |= ϕ} (2b)

A relação de transição → de EK representa a sequência de ações realizadas durante a atividade de tratamento para cada exceção contextual detectada e tratada pelo mecanismo de TESC. Essas transições entre estados iniciam em um estado excepcional e terminam em um estado caracterizado pela última medida de tratamento do caso de tratamento selecionado para tratar aquela exceção. O Algoritmo 1 descreve como as transições em EK são geradas, recebendo como entrada os conjuntos: Γ, de escopos de tratamento; I, de estados iniciais (excepcionais); e S, de todos os estados. Desse modo, para cada escopo de tratamento ⟨e, HCASE⟩ (linha 4) e para cada estado em I (linha 5), verifica-se se a exceção e do escopo de tratamento corrente pode ser levantada no estado s (linha 6). Em caso afirmativo, para cada caso de tratamento ⟨α, H⟩ (linha 7), verifica-se se este pode ser selecionado (linha 8). As transições entre o estado excepcional e os estados que satisfazem a primeira medida de tratamento do caso de tratamento são criadas por meio de uma chamada à função ST(s, H(0), S) (linha 9), sendo armazenadas em um conjunto auxiliar (AUX). Caso este conjunto auxiliar seja não vazio (linha 10), essas transições são guardadas no conjunto de retorno TR (linha 11). Adicionalmente, o mesmo é feito para cada par de medidas de tratamento por meio de chamadas à função FT(H(i − 1), H(i), S) (linhas 13 e 15). Perceba que os laços mais interno e intermediário são interrompidos (linhas 17 e 22) quando não é possível realizar transições entre estados. Além disso, o comando break (linha 20) garante que apenas um caso de tratamento é selecionado para tratar a exceção e, levando em consideração a relação de ordem baseada nos índices. Por fim, antes de retornar o conjunto final de relações de transição (linha 36), o fragmento de código compreendido entre as linhas 29 e 35 adiciona uma auto-transição (transição de loop) nos estados terminais (i.e., nos estados que não possuem sucessores) para garantir a restrição de totalidade imposta pela definição de estrutura de Kripke.

Algoritmo 1: Geração da Relação de Transição → de EK.
1: function TRANSICAO_EK(Γ, I, S)
2:   TR = ∅
3:   AUX = ∅
4:   for all ⟨e, HCASE⟩ ∈ Γ do
5:     for all s ∈ I do
6:       if L(s) |= ecs(e) then
7:         for all ⟨α, H⟩ ∈ HCASE do
8:           if L(s) |= α then
9:             AUX = ST(s, H(0), S)
10:            if AUX ≠ ∅ then
11:              TR = TR ∪ AUX
12:              for i = 1, |H| do
13:                AUX = FT(H(i − 1), H(i), S)
14:                if AUX ≠ ∅ then
15:                  TR = TR ∪ AUX
16:                else
17:                  break
18:                end if
19:              end for
20:              break
21:            else
22:              break
23:            end if
24:          end if
25:        end for
26:      end if
27:    end for
28:  end for
29:  if TR ≠ ∅ then
30:    for all s ∈ S do
31:      if ∄t ∈ S, (s, t) ∈ TR then
32:        TR = TR ∪ {(s, s)}
33:      end if
34:    end for
35:  end if
36:  return TR
37: end function
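A título de ilustração, segue um esboço Java do Algoritmo 1 (hipotético e simplificado, reutilizando os tipos ilustrativos das seções anteriores): as funções ST e FT são filtros sobre pares de estados e os laços aninhados reproduzem a seleção de um único caso de tratamento por exceção, além da inclusão das auto-transições nos estados terminais.

import java.util.*;
import java.util.function.Predicate;

public class GeracaoTransicoes {
    interface Formula extends Predicate<Map<String, Boolean>> {}
    record Excecao(String name, Formula ecs) {}
    record Caso(Formula alpha, List<Formula> medidas) {}
    record Escopo(Excecao e, List<Caso> casos) {}
    record Transicao(Map<String, Boolean> de, Map<String, Boolean> para) {}

    // ST(s, φ, S): transições de s para todo estado cujo rótulo satisfaz φ.
    static Set<Transicao> st(Map<String, Boolean> s, Formula phi,
                             List<Map<String, Boolean>> estados) {
        Set<Transicao> t = new LinkedHashSet<>();
        for (Map<String, Boolean> r : estados) {
            if (phi.test(r)) t.add(new Transicao(s, r));
        }
        return t;
    }

    // FT(φ, ϕ, S): transições de todo estado que satisfaz φ para todo estado que satisfaz ϕ.
    static Set<Transicao> ft(Formula phi, Formula psi, List<Map<String, Boolean>> estados) {
        Set<Transicao> t = new LinkedHashSet<>();
        for (Map<String, Boolean> s : estados) {
            if (!phi.test(s)) continue;
            for (Map<String, Boolean> r : estados) {
                if (psi.test(r)) t.add(new Transicao(s, r));
            }
        }
        return t;
    }

    static Set<Transicao> transicaoEK(List<Escopo> gama,
                                      List<Map<String, Boolean>> iniciais,
                                      List<Map<String, Boolean>> estados) {
        Set<Transicao> tr = new LinkedHashSet<>();
        for (Escopo escopo : gama) {
            for (Map<String, Boolean> s : iniciais) {
                if (!escopo.e().ecs().test(s)) continue;           // linha 6
                for (Caso caso : escopo.casos()) {                  // linha 7
                    if (caso.alpha().test(s)) {                     // linha 8
                        Set<Transicao> aux = st(s, caso.medidas().get(0), estados);
                        if (!aux.isEmpty()) {
                            tr.addAll(aux);
                            for (int i = 1; i < caso.medidas().size(); i++) {
                                aux = ft(caso.medidas().get(i - 1), caso.medidas().get(i), estados);
                                if (aux.isEmpty()) break;           // linha 17
                                tr.addAll(aux);
                            }
                        }
                        break; // apenas um caso de tratamento é usado (o de maior precedência)
                    }
                }
            }
        }
        // Auto-transições nos estados sem sucessores, garantindo a totalidade da relação.
        if (!tr.isEmpty()) {
            for (Map<String, Boolean> s : estados) {
                if (tr.stream().noneMatch(t -> t.de().equals(s))) {
                    tr.add(new Transicao(s, s));
                }
            }
        }
        return tr;
    }

    public static void main(String[] args) {
        // Exemplo mínimo: uma exceção hipotética sobre duas proposições p e q.
        List<Map<String, Boolean>> s = List.of(
                Map.of("p", true, "q", false), Map.of("p", false, "q", true));
        Escopo escopo = new Escopo(new Excecao("E", v -> v.get("p")),
                List.of(new Caso(v -> v.get("p"), List.of(v -> v.get("q")))));
        System.out.println(transicaoEK(List.of(escopo), s, s));
    }
}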
E. Atividade de Especificação

A atividade de especificação consiste na determinação de propriedades sobre o comportamento do TESC com o intuito de encontrar faltas de projeto. Neste trabalho foram catalogadas 3 propriedades comportamentais que, se violadas, indicam a existência de faltas de projeto no TESC; são elas: progresso de detecção, progresso de captura e progresso de tratador. Cada uma dessas propriedades é apresentada a seguir.

1) Progresso de Detecção: Essa propriedade determina que, na estrutura de Kripke do contexto, deve existir pelo menos um estado onde cada exceção contextual é detectada. A violação dessa propriedade indica a existência de exceções contextuais que não são detectadas. Esse tipo de falta de projeto é denominado de exceção morta. Essa propriedade deve ser verificada para cada uma das exceções contextuais modeladas no sistema. Desse modo, seja e uma exceção contextual, a fórmula (3), escrita em CTL, especifica formalmente essa propriedade.

EF(ecs(e)) (3)

2) Progresso de Captura: Essa propriedade estabelece que, para cada exceção contextual levantada, deve existir pelo menos um caso de tratamento habilitado a capturar aquela exceção. A violação dessa propriedade indica que existem estados do contexto onde exceções contextuais são levantadas, mas não podem ser capturadas e, consequentemente, tratadas. Esse tipo de falta de projeto é denominado de tratamento nulo. É importante observar que, mesmo existindo situações de contexto onde o sistema não pode tratar aquela exceção, o projetista deve estar ciente de que esse fenômeno ocorre no seu modelo. Sendo assim, seja ⟨e, HCASE⟩ um escopo de tratamento com casos de tratamento ⟨α0, H0⟩, ⟨α1, H1⟩, . . . , ⟨αn, Hn⟩ ∈ HCASE, a fórmula (4), escrita em CTL, especifica formalmente essa propriedade.

⋁_{i=0}^{i<|HCASE|} EF(ecs(e) ∧ αi) (4)

3) Progresso de Tratador: Essa propriedade determina que, dentre os estados do contexto onde uma exceção contextual é levantada, deve existir pelo menos um estado em que cada caso de tratamento é selecionado para tratar aquela exceção. A violação dessa propriedade indica que existem casos de tratamento, definidos em um escopo de tratamento de uma exceção contextual particular, que nunca serão selecionados. Esse tipo de falta de projeto é denominado de tratador morto. Desse modo, seja ⟨e, HCASE⟩ um escopo de tratamento com casos de tratamento ⟨α0, H0⟩, ⟨α1, H1⟩, . . . , ⟨αn, Hn⟩ ∈ HCASE, a fórmula (5), escrita em CTL, especifica formalmente essa propriedade.

⋀_{i=0}^{i<|HCASE|} EF(ecs(e) ∧ αi) (5)
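Como ilustração, o esboço abaixo (hipotético; a sintaxe concreta aceita pelo verificador de modelos utilizado pela ferramenta pode diferir) mostra como as fórmulas (3), (4) e (5) poderiam ser geradas automaticamente, em forma textual, a partir de ecs(e) e das condições de seleção α0..αn de um escopo de tratamento.

import java.util.List;
import java.util.stream.Collectors;

public class GeradorPropriedadesCTL {
    // Fórmula (3): progresso de detecção para uma exceção com especificação ecs.
    static String progressoDeteccao(String ecs) {
        return "EF (" + ecs + ")";
    }

    // Fórmula (4): progresso de captura, a disjunção das obrigações EF sobre cada αi.
    static String progressoCaptura(String ecs, List<String> alphas) {
        return alphas.stream()
                .map(a -> "EF (" + ecs + " & " + a + ")")
                .collect(Collectors.joining(" | "));
    }

    // Fórmula (5): progresso de tratador, uma obrigação EF por caso de tratamento
    // (verificar todas equivale a verificar a conjunção).
    static List<String> progressoTratador(String ecs, List<String> alphas) {
        return alphas.stream()
                .map(a -> "EF (" + ecs + " & " + a + ")")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String ecsFire = "hasSmoke & isHot";
        List<String> alphas = List.of("inMovement & atParkEntrance", "inMovement & atParkPlace");
        System.out.println(progressoDeteccao(ecsFire));
        System.out.println(progressoCaptura(ecsFire, alphas));
        progressoTratador(ecsFire, alphas).forEach(System.out::println);
    }
}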
V. AVALIAÇÃO

Nesta seção é feita uma avaliação do método proposto. Na Seção V-A é apresentada a ferramenta desenvolvida de suporte ao método. A Seção V-B descreve o sistema exemplo utilizado na avaliação. Na Seção V-C o projeto do TESC de duas exceções do sistema exemplo é detalhado. Por fim, na Seção V-D os cenários de injeção de faltas são descritos e um sumário dos resultados é oferecido na Seção V-E.

A. A Ferramenta

A ferramenta2 foi implementada na plataforma Java e provê uma API para que o projetista especifique as proposições de contexto (Seção IV-B), as restrições semânticas (Seção IV-B), as exceções contextuais (Seção IV-C1), os casos de tratamento (Seção IV-C2) e os escopos de tratamento (Seção IV-C3). Essas especificações são enviadas ao módulo conversor, que gera os estados do contexto e constrói o modelo de comportamento do TESC e o conjunto de propriedades descritas pelo método. É importante mencionar que o projetista pode informar propriedades adicionais a serem verificadas sobre o modelo, além daquelas já predefinidas por nosso método (Seção IV-E). De posse do modelo de comportamento e das propriedades, a ferramenta submete estas entradas ao módulo de verificação de modelos, o qual executa o processo de verificação e gera um relatório de saída contendo os resultados da verificação. Para fazer a geração dos estados do contexto, a ferramenta faz uso do Choco Solver3, uma das implementações de referência da JSR 331: Constraint Programming API4. Já no processo de verificação, a ferramenta utiliza o verificador de modelos MCiE, desenvolvido no projeto The Model Checking in Education5. Esse verificador foi escolhido, principalmente, pelo fato de ser implementado na plataforma Java, o que facilitou a sua integração com a ferramenta desenvolvida.

2 http://www.great.ufc.br/~lincoln/JCAEHV/JCAEHV.zip
3 http://www.emn.fr/z-info/choco-solver
4 http://jcp.org/en/jsr/detail?id=331
5 http://www2.cs.uni-paderborn.de/cs/kindler/Lehre/MCiE/

B. A Aplicação UbiParking

O UbiParking é uma aplicação baseada no conceito de "Cidades Ubíquas". A ideia por trás desse conceito é o provimento de serviços ubíquos relacionados com o cotidiano das cidades e das pessoas que nelas habitam, com o propósito de melhorar a convivência urbana sob diversos aspectos, tais como trânsito, segurança e atendimento aos cidadãos. O objetivo do UbiParking é auxiliar motoristas na atividade de estacionar seus veículos. Nesse sentido, o UbiParking disponibiliza um mapa plotado com todas as vagas de estacionamento disponíveis por região. Este mapa de vagas livres é atualizado com base em informações coletadas por meio de sensores implantados nos acostamentos das vias e nos estacionamentos públicos. Os sensores detectam quando uma vaga de estacionamento está ocupada ou livre, enviando esta informação para o sistema. Desse modo, utilizando o UbiParking em seus dispositivos móveis ou no computador de bordo dos seus veículos, os cidadãos podem obter informações sobre a distribuição das vagas livres por região, podendo reservar uma vaga e solicitar ao sistema uma rota mais apropriada com base em algum critério de sua preferência (e.g., menor distância, maior número de vagas livres ou menor preço). Chegando ao estacionamento escolhido, o UbiParking conduz o motorista até a vaga reservada ou à vaga livre mais próxima, considerando os casos onde a vaga reservada é ocupada de forma imprevisível por outro veículo. Do mesmo modo, quando o motorista retorna ao seu veículo, o UbiParking o conduz até a saída mais próxima, poupando-lhe tempo. Os estacionamentos do UbiParking possuem uma disposição espacial composta por entradas, pátio de vagas e saídas. Além disso, o estacionamento ubíquo é equipado com sensores de temperatura, detectores de fumaça e aspersores controlados automaticamente, para o caso de incêndio.

C. Projeto do TESC para o UbiParking

Nesta seção é descrita a utilização do método no projeto do TESC de duas exceções contextuais da aplicação UbiParking: a exceção de incêndio e a exceção de vaga indisponível. A exceção de incêndio modela uma condição de incêndio dentro do estacionamento. Por meio das informações de contexto coletadas pelos sensores de fumaça e temperatura, o UbiParking consegue detectar a ocorrência desse tipo de exceção contextual dentro do estacionamento. Para lidar com essa exceção, o sistema aciona os aspersores e conduz os motoristas até o lado de fora do estacionamento. Já a exceção de vaga indisponível modela uma situação em que o veículo está em movimento dentro do pátio de vagas, indo em direção à sua vaga reservada, porém outro veículo ocupa aquela vaga. Nesse caso, se a vaga for a última vaga livre disponível no estacionamento, fica caracterizada a situação de anormalidade. Essa exceção contextual é detectada pelo sistema através de informações de contexto sobre as reservas de vagas, a localização do veículo e os dados que vêm dos sensores de detecção de vaga ocupada.
Como forma de tratar essa exceção contextual, o UbiParking conduz o veículo até o lado de fora do estacionamento, onde outra vaga livre em outro estacionamento pode ser reservada. Com base nesses dois cenários de exceção, as proposições descritas na Tabela II foram estabelecidas.

Tabela II. Proposições Contextuais do UbiParking.
Proposição: Significado
inMovement: "O veículo está em movimento?"
atParkEntrance: "O veículo está na entrada do estacionamento?"
atParkPlace: "O veículo está no pátio de vagas do estacionamento?"
atParkExit: "O veículo está na saída do estacionamento?"
hasSpace: "Há vaga livre no estacionamento?"
isHot: "Está quente no estacionamento?"
hasSmoke: "Há fumaça no estacionamento?"
isSprinklerOn: "Os aspersores estão ligados?"

Perceba que as proposições contextuais atParkPlace, atParkExit e atParkEntrance (Tabela II) possuem uma relação semântica particular. No UbiParking, o veículo, do ponto de vista espaço-temporal, só pode estar fora ou dentro do estacionamento em um dado instante. Caso ele esteja fora do estacionamento, as três proposições devem assumir valor-verdade falso. Por outro lado, se o veículo estiver no estacionamento, ele só poderá estar em um dos seguintes lugares: na entrada, no pátio de vagas ou na saída do estacionamento, mas não em mais de um local simultaneamente. Esse tipo de relação semântica entre proposições contextuais deve ser levado em consideração no momento da modelagem. Desse modo, a seguinte restrição deve ser derivada durante a modelagem do UbiParking para garantir a consistência semântica: (atParkEntrance ⊕ atParkPlace ⊕ atParkExit) ∨ (¬atParkEntrance ∧ ¬atParkPlace ∧ ¬atParkExit). A exceção contextual de incêndio é descrita pela tupla ⟨"Fire", hasSmoke ∧ isHot⟩. Dois casos de tratamento foram identificados para tratar essa exceção contextual em função do contexto do veículo. Para a situação de contexto em que o veículo encontra-se na entrada do estacionamento, o seguinte caso de tratamento pode ser formulado: hcaser0 = ⟨α0r, H0r⟩, onde α0r = inMovement ∧ atParkEntrance e H0r = {isSprinklerOn ∧ (¬atParkEntrance ∧ ¬atParkPlace ∧ ¬atParkExit)}. O caso de tratamento hcaser0 é selecionado quando o veículo encontra-se entrando no estacionamento. Dessa forma, se ele é selecionado, o efeito esperado após a execução do tratamento (H0r) é que o sistema atinja um estado em que os aspersores estejam ligados e o veículo esteja fora do estacionamento. Por outro lado, na situação em que o veículo encontra-se dentro do pátio de vagas do estacionamento, outro caso de tratamento pode ser derivado: hcaser1 = ⟨α1r, H1r⟩, onde α1r = inMovement ∧ atParkPlace e H1r = {isSprinklerOn ∧ atParkExit, isSprinklerOn ∧ (¬atParkEntrance ∧ ¬atParkPlace ∧ ¬atParkExit)}. No hcaser1, o veículo encontra-se em movimento dentro do pátio de vagas do estacionamento. Nesse caso de tratamento, espera-se que duas medidas de tratamento ocorram sequencialmente (H1r). A primeira consiste em levar o sistema a um estado em que o veículo esteja na saída do estacionamento e os aspersores encontrem-se ligados. Já a segunda consiste em levar o sistema a um estado no qual os aspersores continuem ligados e o veículo encontre-se fora do estacionamento. O escopo de tratamento dessa exceção é dado por ⟨"Fire", {hcaser0, hcaser1}⟩.
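Apenas como ilustração (reutilizando os tipos hipotéticos dos esboços anteriores; não se trata da API real da ferramenta), o cenário de incêndio descrito acima poderia ser instanciado da seguinte forma:

import java.util.*;
import java.util.function.Predicate;

public class CenarioIncendio {
    interface Formula extends Predicate<Map<String, Boolean>> {}
    record Excecao(String name, Formula ecs) {}
    record Caso(Formula alpha, List<Formula> medidas) {}
    record Escopo(Excecao e, List<Caso> casos) {}

    public static void main(String[] args) {
        Formula foraDoEstacionamento =
                v -> !v.get("atParkEntrance") && !v.get("atParkPlace") && !v.get("atParkExit");

        // Exceção contextual de incêndio: hasSmoke ∧ isHot.
        Excecao fire = new Excecao("Fire", v -> v.get("hasSmoke") && v.get("isHot"));

        // hcaser0: veículo em movimento na entrada; a única medida leva o sistema a um
        // estado com os aspersores ligados e o veículo fora do estacionamento.
        Caso caso0 = new Caso(
                v -> v.get("inMovement") && v.get("atParkEntrance"),
                List.of(v -> v.get("isSprinklerOn") && foraDoEstacionamento.test(v)));

        // hcaser1: veículo em movimento no pátio de vagas; duas medidas sequenciais
        // (primeiro a saída do estacionamento, depois o lado de fora).
        Caso caso1 = new Caso(
                v -> v.get("inMovement") && v.get("atParkPlace"),
                List.of(v -> v.get("isSprinklerOn") && v.get("atParkExit"),
                        v -> v.get("isSprinklerOn") && foraDoEstacionamento.test(v)));

        Escopo escopoFire = new Escopo(fire, List.of(caso0, caso1));
        System.out.println(escopoFire.e().name() + ": "
                + escopoFire.casos().size() + " casos de tratamento");
    }
}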
A exceção contextual de vaga indisponível é descrita pela tupla ⟨"NoFreeSpace", inMovement ∧ atParkPlace ∧ (¬hasSpace)⟩. Para essa exceção, apenas um caso de tratamento foi definido: hcasen0 = ⟨α0n, H0n⟩, onde α0n = inMovement ∧ atParkPlace e H0n = {inMovement ∧ atParkExit}. A condição desse caso de tratamento estabelece que ele só é selecionado se o veículo estiver em movimento dentro do pátio de vagas. A medida de tratamento associada a esse caso de tratamento define que, após o tratamento, o veículo deve encontrar-se em movimento na saída do estacionamento. O escopo de tratamento dessa exceção é dado por ⟨"NoFreeSpace", {hcasen0}⟩.

D. Cenários de Injeção de Faltas

A injeção de faltas (fault injection) é uma técnica empregada na avaliação da confiabilidade de sistemas computacionais [21]. Ela consiste na inserção controlada de faltas em um modelo ou sistema computacional com o propósito de avaliar aspectos de robustez e dependabilidade. Essa técnica foi utilizada neste trabalho como forma de avaliar a eficácia do método proposto. Para isso, o projeto do TESC das exceções de incêndio e vaga indisponível (Seção V-C) do UbiParking foi modelado utilizando a ferramenta de suporte ao método. Esse projeto foi submetido ao verificador de modelos da ferramenta e nenhuma das faltas de projeto estabelecidas pelo método foi encontrada (i.e., exceção morta, tratamento nulo e tratador morto); portanto, trata-se de um modelo correto. A partir desse modelo correto, para cada propriedade que se deseja verificar (i.e., progresso de detecção, progresso de captura e progresso de tratador), foi feita uma alteração deliberada no modelo com o propósito de violá-la. Essas alterações representam faltas de projeto similares àquelas descritas na Seção II-B, as quais os projetistas estão sujeitos a cometer.

1) Cenário 1: Violando o Progresso de Detecção: Essa propriedade é violada quando não existe pelo menos um estado de contexto do sistema onde a exceção em questão pode ser detectada. Isso pode ocorrer quando: (i) o projetista insere uma contradição na especificação do contexto excepcional; ou (ii) a especificação representa uma situação de contexto que nunca ocorrerá em tempo de execução. Embora essas duas faltas de projeto sejam diferentes do ponto de vista de significado, elas representam a mesma situação para o modelo de comportamento do TESC: uma expressão que não pode ser satisfeita dentro do modelo do sistema. Dessa forma, para injetar uma falta de projeto que viole essa propriedade é suficiente garantir que a especificação do contexto excepcional seja insatisfazível no modelo do TESC, independentemente de ser provocada por uma falta do tipo (i) ou (ii). Neste trabalho, optou-se por utilizar faltas do tipo (i). Desse modo, as contradições foram construídas a partir da conjunção da especificação do contexto excepcional e a sua negação.
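Um esboço (hipotético) da injeção de uma falta do tipo (i), que torna a especificação do contexto excepcional insatisfazível por meio da conjunção com a sua própria negação:

import java.util.Map;
import java.util.function.Predicate;

public class InjecaoFalta {
    interface Formula extends Predicate<Map<String, Boolean>> {}

    // Cenário 1: ecs(e) é substituída por ecs(e) ∧ ¬ecs(e), uma contradição
    // que nunca é satisfeita e, portanto, viola o progresso de detecção.
    static Formula injetarContradicao(Formula ecs) {
        return v -> ecs.test(v) && !ecs.test(v); // sempre falso
    }

    public static void main(String[] args) {
        Formula ecsFire = v -> v.get("hasSmoke") && v.get("isHot");
        Formula comFalta = injetarContradicao(ecsFire);
        Map<String, Boolean> estado = Map.of("hasSmoke", true, "isHot", true);
        System.out.println(ecsFire.test(estado));  // true
        System.out.println(comFalta.test(estado)); // false em qualquer estado
    }
}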
2) Cenário 2: Violando o Progresso de Captura: Essa propriedade é violada quando não é possível selecionar pelo menos um caso de tratamento quando uma exceção é detectada. Isso pode ocorrer quando a condição de seleção dos casos de tratamento representa: (i) uma contradição; (ii) uma situação de contexto que nunca ocorre; ou (iii) uma contradição entre a condição de seleção e a especificação do contexto excepcional da exceção contextual associada. Embora essas faltas de projeto sejam diferentes do ponto de vista de significado, elas representam a mesma situação para o modelo de comportamento do TESC: uma expressão que não pode ser satisfeita dentro do modelo do sistema ou quando uma exceção contextual é detectada. Assim, para injetar uma falta de projeto que viole essa propriedade é suficiente garantir que a condição de seleção dos casos de tratamento seja insatisfazível no modelo do TESC, independentemente de ser provocada por uma falta do tipo (i), (ii) ou (iii). Neste trabalho, foram utilizadas faltas do tipo (i), construídas a partir da conjunção de cada condição de seleção dos casos de tratamento e a sua negação.

3) Cenário 3: Violando o Progresso de Tratador: Essa propriedade é violada quando existe pelo menos um caso de tratamento que nunca é selecionado quando uma exceção é detectada. As situações onde isso pode ocorrer são exatamente as mesmas descritas para a propriedade do Cenário V-D2. A diferença é que, para violar a propriedade de progresso de tratador, basta que apenas um caso de tratamento seja mal projetado (i.e., contenha uma falta de projeto), enquanto, para violar a propriedade de progresso de captura, descrita no Cenário V-D2, existe a necessidade de que todos os casos de tratamento sejam mal projetados. Desse modo, optou-se por utilizar o mesmo tipo de falta de projeto do Cenário V-D2.

E. Sumário dos Resultados

Cada cenário foi executado individualmente e foram considerados 3 (três) tipos de permutações, denominados de rodadas: (i) a injeção de falta apenas na exceção "Fire"; (ii) a injeção de falta apenas na exceção "NoFreeSpace"; e (iii) a injeção de falta em ambas as exceções de forma simultânea. Na primeira rodada do Cenário V-D1, como esperado, a falta injetada foi detectada através da identificação de uma falta de projeto de exceção morta no projeto da exceção de incêndio. Além dessa, outras faltas de projeto foram detectadas: tratamento nulo e tratador morto. O fato de essas outras duas faltas serem detectadas no projeto da exceção de incêndio é compreensível, uma vez que só se pode selecionar um caso de tratamento para tratar uma exceção quando esta é detectada. O mesmo resultado ocorreu na segunda rodada do Cenário V-D1, porém com respeito à exceção de vaga indisponível. Por outro lado, na terceira rodada, nenhuma falta de projeto foi identificada. Nessa rodada, como as faltas foram inseridas em ambas as exceções, nenhum estado excepcional foi gerado; consequentemente, o modelo de comportamento não pôde ser derivado e a sua verificação não pôde ser conduzida. Na primeira rodada do Cenário V-D2, a falta injetada foi detectada através da identificação das faltas de projeto de tratamento nulo e tratador morto. Nenhuma falta de exceção morta foi identificada, uma vez que exceções foram detectadas no modelo. Na segunda rodada do Cenário V-D2, o mesmo resultado foi encontrado para a exceção de vaga indisponível. Por fim, na terceira rodada, como esperado, um par de faltas de projeto de tratamento nulo e tratador morto foi identificado para cada exceção contextual. Com respeito ao Cenário V-D3, na primeira rodada, como esperado, a falta injetada foi detectada através da identificação da falta de projeto de tratador morto. Na segunda rodada do Cenário V-D3, o mesmo resultado foi encontrado para a exceção de vaga indisponível. Por fim, na terceira rodada, como esperado, uma falta de projeto de tratador morto foi identificada para cada exceção contextual.

VI. TRABALHOS RELACIONADOS

No escopo da revisão bibliográfica realizada, não foram encontrados trabalhos que abordem a mesma problemática endereçada neste artigo. Porém, os trabalhos [22][16][11][17] possuem uma relação próxima à solução proposta neste artigo.
Particularmente, [16][11][17] estão relacionados à descoberta de faltas de projeto no mecanismo de adaptação de sistemas ubíquos sensíveis ao contexto. Essa problemática consiste na má especificação das regras de adaptação em tempo de projeto. Essas regras são compostas por uma condição de guarda (antecedente) e um conjunto de ações associadas (consequente). A condição de guarda descreve situações de contexto às quais o sistema deve reagir. Já as ações caracterizam a reação do sistema ao contexto detectado. Dessa forma, a especificação errônea das condições de guarda pode levar o sistema a uma configuração imprópria e, posteriormente, a uma falha. Em [22] é proposta uma forma de especificar a semântica do comportamento adaptativo por meio de fórmulas lógicas temporais; entretanto, o trabalho não provê suporte ferramental para a verificação de propriedades. Já os trabalhos [16][11][17] buscam representar o comportamento adaptativo por meio de algum formalismo baseado em estados e transições. De posse desse modelo, técnicas formais de análise (e.g., algoritmos simbólicos e verificadores de modelos) são empregadas com o intuito de identificar faltas de projeto. Em [16] o foco é dado ao domínio de aplicações sensíveis ao contexto formadas por composições de Web Services. Nesse trabalho, o objetivo é encontrar inconsistências na composição dos serviços e nas suas interações. Para isso, é proposto um mapeamento da especificação da aplicação baseada em BPEL para um modelo formal utilizado para fazer as análises e verificações, chamado de CA-STS (Context-Aware Symbolic Transition System). Por outro lado, [11] busca identificar problemas específicos de má especificação das regras de adaptação. Para isso, os autores propõem um formalismo baseado em máquina de estados, chamado A-FSM (Adaptation Finite-State Machine). Esse formalismo é usado para modelar o comportamento adaptativo e servir como base para a verificação de propriedades e a detecção de faltas de projeto. Em [17] é feita uma extensão de [11], na qual é proposto um método para melhorar a efetividade da A-FSM por meio de técnicas de programação por restrições, mineração de dados e casamento de padrões. Entretanto, é importante mencionar que todos os trabalhos, exceto [22], são limitados com relação ao tipo de propriedades a serem verificadas. Por proporem seus próprios formalismos e implementarem ferramentas que analisam apenas um conjunto particular de propriedades, a sua extensão acaba sendo limitada, diferentemente do método proposto, que permite que novas propriedades sejam incorporadas.

VII. CONCLUSÕES E TRABALHOS FUTUROS

Este trabalho apresentou um método para a verificação do projeto do TESC. As abstrações do método permitem que aspectos importantes do comportamento do TESC sejam modelados e mapeados para uma estrutura de Kripke, permitindo que o modelo seja analisado por um verificador de modelos. Adicionalmente, um conjunto de propriedades que capturam a semântica de determinados tipos de faltas de projeto foi formalmente especificado e apresentado como forma de auxiliar os projetistas na identificação dessas faltas. Além disso, a ferramenta de suporte e os cenários de injeção de faltas, utilizados para avaliar o método, apresentam resultados que demonstram a viabilidade da proposta. Como trabalhos futuros, pretende-se tratar questões relacionadas ao tratamento de exceções concorrentes no modelo, com a definição de uma função de resolução que permita selecionar as medidas de tratamento mais adequadas face ao conjunto de exceções levantadas.
Além disso, outro direcionamento para trabalhos futuros consiste na extensão do modelo para que seja possível representar a composição dos comportamentos adaptativo e excepcional, com o objetivo de analisar a influência de um sobre o outro. Por fim, outra linha a ser investigada é a criação de uma DSL para o projeto do TESC para que um experimento envolvendo usuários possa ser conduzido. R EFERÊNCIAS [1] S. W. Loke, “Building taskable spaces over ubiquitous services,” IEEE Pervasive Computing, vol. 8, no. 4, pp. 72–78, oct.-dec. 2009. [2] A. K. Dey, “Understanding and using context,” Personal Ubiquitous Computing, vol. 5, no. 1, pp. 4–7, 2001. [3] D. Kulkarni and A. Tripathi, “A framework for programming robust context-aware applications,” IEEE Trans. Softw. Eng., vol. 36, no. 2, pp. 184–197, 2010. [4] K. Damasceno, N. Cacho, A. Garcia, A. Romanovsky, and C. Lucena, “Context-aware exception handling in mobile agent systems: The moca case,” in Proceedings of the 2006 international workshop on Software Engineering for Large-Scale Multi-Agent Systems, ser. SELMAS’06. New York, NY, USA: ACM, 2006, pp. 37–44. [5] J. Mercadal, Q. Enard, C. Consel, and N. Loriant, “A domain-specific approach to architecturing error handling in pervasive computing,” in Proceedings of the ACM international conference on Object oriented programming systems languages and applications, ser. OOPSLA ’10. New York, NY, USA: ACM, 2010, pp. 47–61. [6] D. M. Beder and R. B. de Araujo, “Towards the definition of a context-aware exception handling mechanism,” in Fifth Latin-American Symposium on Dependable Computing Workshops, 2011, pp. 25–28. [7] L. Rocha and R. Andrade, “Towards a formal model to reason about context-aware exception handling,” in 5th International Workshop on Exception Handling (WEH) at ICSE’2012, 2012, pp. 27–33. [8] E.-S. Cho and S. Helal, “Toward efficient detection of semantic exceptions in context-aware systems,” in 9th International Conference on Ubiquitous Intelligence Computing and 9th International Conference on Autonomic Trusted Computing (UIC/ATC), sept. 2012, pp. 826 –831. [9] J. Whittle, P. Sawyer, N. Bencomo, B. H. C. Cheng, and J.-M. Bruel, “Relax: Incorporating uncertainty into the specification of self-adaptive systems,” in Proceedings of the 2009 17th IEEE International Requirements Engineering Conference, RE, ser. RE ’09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 79–88. [10] D. Cassou, B. Bertran, N. Loriant, and C. Consel, “A generative programming approach to developing pervasive computing systems,” in Proceedings of the 8th International Conference on Generative Programming and Component Engineering, ser. GPCE’09. ACM, 2009, pp. 137–146. [11] M. Sama, S. Elbaum, F. Raimondi, D. Rosenblum, and Z. Wang, “Context-aware adaptive applications: Fault patterns and their automated identification,” IEEE Trans. Softw. Eng., vol. 36, no. 5, pp. 644–661, 2010. [12] C. Bettini, O. Brdiczka, K. Henricksen, J. Indulska, D. Nicklas, A. Ranganathan, and D. Riboni, “A survey of context modelling and reasoning techniques,” Pervasive Mob. Comput., vol. 6, pp. 161–180, April 2010. [13] A. Coronato and G. De Pietro, “Formal specification and verification of ubiquitous and pervasive systems,” ACM Transactions on Autonomous and Adaptive Systems, vol. 6, no. 1, pp. 9:1–9:6, Feb. 2011. [14] F. Siewe, H. Zedan, and A. Cau, “The calculus of context-aware ambients,” J. Comput. Syst. Sci., vol. 77, pp. 597–620, Jul. 2011. [15] P. Zhang and S. 
Elbaum, “Amplifying tests to validate exception handling code,” in Proceedings of the 2012 International Conference on Software Engineering, ser. ICSE 2012. Piscataway, NJ, USA: IEEE Press, 2012, pp. 595–605. [16] J. Cubo, M. Sama, F. Raimondi, and D. Rosenblum, “A model to design and verify context-aware adaptive service composition,” in Proceedings of the 2009 IEEE International Conference on Services Computing, ser. SCC ’09. Washington, DC, USA: IEEE, 2009, pp. 184–191. [17] Y. Liu, C. Xu, and S. C. Cheung, “Afchecker: Effective model checking for context-aware adaptive applications,” Journal of Systems and Software, vol. 86, no. 3, pp. 854–867, 2013. [18] E. M. Clarke, Jr., O. Grumberg, and D. A. Peled, Model Checking. Cambridge, MA, USA: MIT Press, 1999. [19] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33, 2004. [20] P. Van Hentenryck and V. Saraswat, “Strategic directions in constraint programming,” ACM Comput. Surv., vol. 28, no. 4, pp. 701–726, 1996. [21] J. Ezekiel and A. Lomuscio, “Combining fault injection and model checking to verify fault tolerance in multi-agent systems,” in Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, Richland, SC, 2009, pp. 113–120. [22] J. Zhang and B. Cheng, “Using temporal logic to specify adaptive program semantics,” J. Syst. Software, vol. 79, no. 10, pp. 1361–1369, 2006. Prioritization of Code Anomalies based on Architecture Sensitiveness Roberta Arcoverde, Everton Guimarães, Isela Macía, Alessandro Garcia Informatics Department, PUC-Rio Rio de Janeiro, Brazil {rarcoverde, eguimaraes, ibertran, afgarcia}@inf.puc-rio.br Abstract— Code anomalies are symptoms of software maintainability problems, particularly harmful when contributing to architectural degradation. Despite the existence of many automated techniques for code anomaly detection, identifying the code anomalies that are more likely to cause architecture problems remains a challenging task. Even when there is tool support for detecting code anomalies, developers often invest a considerable amount of time refactoring those that are not related to architectural problems. In this paper we present and evaluate four different heuristics for helping developers to prioritize code anomalies, based on their potential contribution to the software architecture degradation. Those heuristics exploit different characteristics of a software project, such as change-density and error-density, for automatically ranking code elements that should be refactored more promptly according to their potential architectural relevance. Our evaluation revealed that software maintainers could benefit from the recommended rankings for identifying which code anomalies are harming architecture the most, helping them to invest their refactoring efforts into solving architecturally relevant problems. Keywords— Code anomalies, Architecture degradation and Refactoring; I. INTRODUCTION Code anomalies, commonly referred to as code smells [9], are symptoms in the source code that may indicate deeper maintainability problems. The presence of code anomalies often represents structural problems, which make code harder to read and maintain. Those anomalies can be even more harmful when they impact negatively on the software architecture design [4]. 
When that happens, we call those anomalies architecturally relevant, as they represent symptoms of architecture problems. Moreover, previous studies [14][35] have confirmed that the progressive manifestation of code anomalies is a key symptom of architecture degradation [14]. The term architecture degradation is used to refer to the continuous quality decay of architecture design when evolving software systems. Thus, as the software architecture degrades, the maintainability of software systems can be compromised irreversibly. As examples of architectural problems, we can mention Ambiguous Interface and Component Envy [11], as well as cyclic dependencies between software modules [27]. Yuanfang Cai Department of Computer Science Drexel University Philadelphia, USA [email protected] In order to prevent architecture degradation, software development teams should progressively improve the system maintainability by detecting and removing architecturally relevant code anomalies [13][36]. Such improvement is commonly achieved through refactoring [6][13] - a widely adopted practice [36] with well known benefits [29]. However, developers often focus on removing – or prioritizing - a limited subset of anomalies that affect their projects [1][16]. Furthermore, most of the remaining anomalies are architecturally relevant [20]. Thus, when it is not possible to distinguish which code anomalies are architecturally relevant, developers can spend more time fixing problems that are not harmful to the architecture design. This problem occurs even in situations where refactoring need to be applied in order to improve the adherence of the source code to the intended architecture [1][19][20]. Several code analysis techniques have been proposed for automatically detecting code anomalies [18][25][28][32]. However, none of them help developers to prioritize anomalies with respect to their architectural relevance, as they present the following limitations: first, most of these techniques focus on the extraction and combination of static code measures. The analysis of the source code structure alone is often not enough to reveal whether an anomalous code element is related to the architecture decomposition [19][20]. Second, they do not provide means to support the prioritization or ranking of code anomalies. Finally, most of them disregard: (i) the exploitation of software project factors (i.e. frequency of changes and number of errors) that may be an indicator of the architectural relevance of a module, and (ii) the role that code elements play in the architectural design. In this context, this paper proposes four prioritization heuristics to help identifying and ranking architecturally relevant code anomalies. Moreover, we assessed the accuracy of the proposed heuristics when ranking code anomalies based on their architecture relevance. The assessment was carried out in the context of four target systems from heterogeneous domains, developed by different teams using different programming languages. Our results show that the proposed heuristics were able to accurately prioritize the most code relevant anomalies of the target systems mainly for scenarios where: (i) there were architecture problems involving groups of classes that changed together; (ii) changes were not predominantly perfective; (iii) there were code elements infected by multiple anomalies; and (iv) the architecture roles are well-defined and have distinct architectural relevance. The remainder of this paper is organized as follows. 
Section II introduces the concepts involved in this work, as well as the related work. Section III introduces the study settings. Section IV describes the prioritization heuristics proposed in this paper. Section V presents the evaluation of the proposed heuristics, and Section VI the evaluation results. Finally, Section VII presents the threats to validity, while Section VIII discusses the final remarks and future work. II. BACKGROUND AND RELATED WORK This section introduces basic concepts related to software architecture degradation and code anomalies (Section II.A). It also discusses researches that investigate the interplay between code anomaly and architectural problems (Section II.B). Finally, the section introduces existing ranking systems for code anomalies (Section II.C). A. Basic Concepts One of the causes for architecture degradation [14] on software projects is the continuous occurrence of code anomalies. The term code anomaly or code smell [9] is used to define structures on the source code that usually indicate maintenance problems. As examples of code anomalies we can mention Long Methods and God Classes [9]. In this work, we consider a code anomaly as being architecturally relevant when it has a negative impact in the system architectural design. That is, the anomaly is considered relevant when it is harmful or related to problems in the architecture design. Therefore, the occurrence of an architecturally relevant code anomaly can be observed if the anomalous code structure is directly realizing an architectural element exhibiting an architecture problem [19-21]. Once a code anomaly is identified, the corresponding code may suffer some refactoring operations, so the code anomaly is correctly removed. When those code anomalies are not correctly detected, prioritized and removed in the early stage of software development, the ability of these systems to evolve can be compromised. In some cases, the architecture has to be completely restructured. For this reason, the effectiveness of automatically detected code anomalies using strategies has been studied under different perspectives [16][18][19][26][31]. However, most techniques and tools disregard software project factors that might indicate the relevance of an anomaly in terms of its architecture design, number of errors and frequency of changes. Moreover, those techniques do not help developers to distinguish which anomalous element are architecturally harmful without considering the architectural role that a given code element plays on the architectural design. B. Harmful Impact and Detection of Code Anomalies The negative impact of code anomalies on the system architecture has been investigated by many studies in the stateof-art. For instance, the study developed in [23] reported that the Mozilla’s browser code was overly complex and tightly coupled therefore hindering its maintainability and ability to evolve. This problem was the main cause of its complete reengineering, and developers’ took about five years to rewrite over 7 thousand source files and 2 million lines of code [12]. Another study [7] showed how the architecture design of a large telecommunication system degraded in 7 years. Particularly, the relationship between the system modules increased over the time. This was the main cause why the system modules were not independent anymore, and as consequence, further changes were not possible. Finally, a study performed in [14] investigated the main causes for architecture degradation. 
As a result, the study indicated that refactoring specific code anomalies could help to avoid it. Another study [35] has identified that duplicated code was related to design violations. In this sense, several detection strategies have been proposed in order provide means for the automatic detection of code anomalies [18][25][28]. However, most of them is based on source code information and relies on a combination of static code metrics and thresholds into logical expressions. This is the main limitation of those detection strategies, since they disregard architecture information that could be exploited to reveal architecturally relevant code anomalies. In addition, current detection strategies only consider individual occurrences of code anomalies, instead of analyzing the relationships between them. Such limitations are the main reasons why the current detection strategies are not able to support the detection of code anomalies responsible for inserting architectural problems [19]. Finally, a recent study [19] investigated to what extent the architecture sensitive detection strategies can better identify code anomalies related to architectural problems [22]. C. Ranking Systems for Code Anomalies As previously mentioned, many tools and techniques provide support for automatically detecting code anomalies. However, the number of anomalies tends to increase as the system grows and, in some cases, the high number of anomalies can be unmanageable. Moreover, maintainers are expected to choose which code anomalies should be prioritized. Some of the reasons why this is necessary are (i) time constraints and (ii) attempts to find the correct solution when restricting a large system in order to perform refactoring operations to solve those code anomalies. The problem is that the existing detection strategies do not focus on ranking or prioritizing code anomalies. Nevertheless, there are two tools that provide ranking capabilities for different development platforms: Code Metrics and InFusion. The first tool is a .NET based add-in for the Visual Studio development environment and it is able to calculate a limited set of metrics. Once the metrics are calculated, the tool assigns a “maintainability index” score to each of the analyzed code elements. This score is based on the combination of the metrics for each code element. The second tool is the InFusion, which can be used for analyzing Java, C and C++ systems. Moreover, it allows calculating more than 60 metrics. Besides the statistical analysis for calculating code metrics, it also provides numerical scores in order to detect the code anomalies. Those scores provide means to measure the negative impact of code anomalies in the software system. When combining the scores, a deficit index is calculated for the entire system. The index takes into consideration size, encapsulation, complexity, coupling and cohesion metrics. However, the main concern of using these tools is that the techniques they implement have some limitations: (i) usually it only considers the source code structure as input for detecting code anomalies; (ii) the ranking system disregards the architecture role of the code elements; and (iii) the user cannot define or customize their own criteria for prioritizing code anomalies. In this sense, our study proposes prioritization heuristics for ranking and prioritizing code anomalies. Moreover, our heuristics are not only based on source code information for detecting code anomalies. 
They also consider information about the architecture relevance of the detected code anomalies. For this, we analyze different properties of the source code they affect, such as information about changes on software modules, bugs observed during the system evolution and the responsibility of each module in the system architecture.

III. STUDY SETTINGS

This section describes our study hypotheses and variables selection, as well as the target systems used to evaluate the accuracy of the proposed heuristics. The main goal of this study is to evaluate whether the proposed heuristics for prioritization of architecturally relevant code anomalies can help developers on the ranking and prioritization process. It is important to emphasize that the analysis of the proposed heuristics is carried out in terms of accuracy. Also, Table I defines our study using the GQM format [34].

TABLE I. STUDY DEFINITION USING GQM FORMAT (GQM: Goal, Question, Metric)
Analyze: the proposed set of prioritization heuristics
For the purpose of: understanding their accuracy for ranking code anomalies based on their architecture relevance
With respect to: rankings previously defined by developers or maintainers of each analyzed system
From the viewpoint of: researchers, developers and architects
In the context of: four software systems from different domains with different architectural designs

Our study was basically performed in three phases: (i) as we have proposed prioritization heuristics for ranking code anomalies, in the first phase we performed the detection and classification of code anomalies according to their architecture relevance for each of the target systems. For such detection, we used a semi-automatic process based on strategies and thresholds, which has been broadly used on previous studies [2][11][16][20]; (ii) in the second phase, we applied the proposed heuristics and computed their scores for each detected code anomaly. The output of this phase is an ordered list with the high-priority anomalies; finally, (iii) in the third phase, we compared the heuristics results with rankings previously defined by developers or maintainers of each target system. The ranking list provided by developers represents the "ground truth" data in our analysis, and was produced manually.

A. Hypotheses
In this section, we describe the study hypotheses in order to test the accuracy of the proposed heuristics for ranking code anomalies based on their relevance. First, we have defined some thresholds of what we consider as an acceptable accuracy: (i) low accuracy, 0-40%; (ii) acceptable accuracy, 40-80%; and (iii) high accuracy, 80-100%. These thresholds are based on the ranges defined in [37], where the values are applied in statistical tests (e.g. Pearson's correlation). We adapted these values in order to better evaluate our heuristics, since we are only interested in values that indicate a high correlation. Moreover, we analyzed the three levels of accuracy in order to investigate to what extent the prioritization heuristics would be helpful. For instance, a heuristic with an accuracy level of 50% means the ranking produced by the heuristic should be able to identify at least half of the architecturally relevant code anomalies. In order to test the accuracy of the prioritization heuristics, we have defined 4 hypotheses (see Table II).

TABLE II. STUDY HYPOTHESES
H1.0: The change-density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
H1.1: The change-density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H2.0: The error-density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
H2.1: The error-density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H3.0: The anomaly density heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
H3.1: The anomaly density heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.
H4.0: The architecture role heuristic cannot accurately identify architecturally relevant code anomalies ranked as top ten.
H4.1: The architecture role heuristic can accurately identify architecturally relevant code anomalies ranked as top ten.

B. Target Systems
In order to test the study hypotheses, we selected 4 target systems from different domains: (i) MIDAS [24], a lightweight middleware for distributed sensors application; (ii) Health Watcher [13], a web system used for registering complaints about health issues in public institutions; (iii) PDP, a web application for managing scenographic sets in television productions; and (iv) Mobile Media [6], a software product line that manages different types of media in mobile devices. All the selected target systems have been previously analyzed in other studies that address problems such as architectural degradation and refactoring [11][20]. The target systems were selected based on 4 criteria: (i) the availability of either the architecture specification or the original developers; the architectural information is essential to the application of the architecture role heuristic, which directly depends on architecture information to compute the ranks of code anomalies; (ii) availability of the version control systems of the selected applications; the information from the version control system provides input for the change-density heuristic; (iii) availability of an issue tracking system; although this is not a mandatory criterion, it is highly recommended for providing input for the error-density heuristic; and (iv) the applications should present different design and architectural structures. This restriction allows us to better understand the impact of the proposed heuristics for a diverse set of code anomalies, emerging from different architectural designs.

IV. PRIORITIZATION OF CODE ANOMALIES

In this section, we describe the 4 prioritization heuristics proposed in this work. These heuristics are intentionally simple, in order to be feasible on most software projects. Their main goal is to help developers on identifying and ranking architecturally relevant code anomalies.

A. Change Density Heuristic
This heuristic is based on the idea that anomalies infecting unstable code elements are more likely to be architecturally relevant. An unstable element can be defined as a code element that suffers from multiple changes during the system evolution [15]. In some cases, for instance, changes occur in cascade and affect the same code elements. Those cases are a sign that such changes are violating the "open-closed principle", which according to [27] is the main principle for the preservation of architecture throughout the system evolution.
In this sense, the change-density heuristic calculates the ranking results based on the number of changes performed on the anomalous code element. The change-density heuristics is defined as follows: given a code element c, the heuristic will look for every revision in the software evolution path where the element c has been modified. That is, the number of different revisions represents the number of changes performed in the element. Thus, the higher the number of changes, the higher is the element priority. The only input required for this heuristic is the change sets that occurred during the system evolution. The change sets is composed by the list of existing revisions and the code elements that were modified on each revision. For this heuristic, we are only able to calculate the changes performed to an entire file. For this scoring mechanism, all code anomalies presented in the same file will receive the same score. We adopted this strategy in our heuristics because none of the studied code anomalies emerged as the best indicator of architectural problems across the systems [20]. However, it is possible to differentiate between two classes by ranking those that have changed most as high-priority. In order to calculate the score for each code anomaly, the heuristic assign the number of changes that were performed in the infected class. Once the number of changes was computed, we ordered the list of resources and their respective number of changes, thus producing our final ranking. This information was to extract the change log from the version control systems for each of the target applications. B. Error-Density Heuristic This heuristic is based on the idea that code elements that have a high number of errors observed during the system evolution might be considered high-priority. The error-density heuristic is defined as follows: given a resolved bug b, the heuristic will look for code elements c that was modified in order to solve b. Thus, the higher the number of errors solved as a consequence of changes applied to c, the higher is the position in the prioritization ranking. This heuristic requires two different inputs: (i) change log inspection – our first analysis was based on change log inspection, looking for common terms like bug or fix. Once those terms are found on commit messages, we incremented the scores for the classes involved in a given change. This technique has been successfully applied in other relevant studies [17]; and (ii) bug detection tool – as we could not rely on the change log inspection for all system, we have decided to use a bug detection tool, namely findBugs, for automatically detecting blocks of code that could be related to bugs. Once possible bugs are identified, we collect the code elements causing them and increment their scores. Basically, the heuristic works as follows: (i) firstly, the information of bugs that were fixed is retrieved from the revisions; (i) after that, the heuristic algorithm iterates over all classes changes on those revisions and the score is incremented for each anomaly that infect the classes. In summary, when a given class is related to several bug fixes, the code anomaly will have a high score. C. Anomaly Density Heuristic This heuristic is based on the idea that each code element can be affected by many anomalies. Moreover, a high number of anomalous elements concentrated in a single component indicate a deeper maintainability problem. 
In this sense, the classes internal to a component with a high number of anomalies should be prioritized. Furthermore, it is known that developers seem to care less about classes that present too many code anomalies [27], when they need to modify them. Thus, anomalous classes tend to remain anomalous or get worse as the systems evolve. Thus, prioritizing classes with many anomalies should avoid the propagation of problems. This heuristic might also be worthy when classes have become brittle and hard to maintain due to the number of anomalies infecting them. Computing the scores for this heuristic was rather straightforward. Basically, it calculates the number of anomalies found per code element. Thus, we consider that elements with a high number of anomalies are high-priority targets for refactoring. The anomaly density heuristic is defined as follows: given a code element c, the heuristic will look to the number of code anomalies that c contains. Thus, the higher the number of anomalies found in c, the higher would be the ranking in the prioritization heuristic result. This heuristic requires only one input: the set of detected code anomalies for each code element in the system. Moreover, the heuristic can be customized to compute only architecture relevant anomalies, instead of computing the set of all the anomalies infecting the system. In order to define whether an anomaly is relevant or not, our work relies on the detection mechanisms provided by SCOOP tool [21]. D. Architecture Role Heuristic Finally, this heuristic proposes a ranking mechanism based on the architectural role a given class plays in the system. The fact is that, when the architecture information is available, the architectural role influences the priority level. The architecture role heuristic is defined as follows: given a code element c, this heuristic will examine the architectural role r performed by c. The relevance of the architectural role in the system represents the rank of c. In other words, if r is defined as a relevant architecture role and it is performed by c, the code element c will be ranked as high priority. The architecture role heuristic depends on two kinds of information, regarding the system’s design: (i) which roles each class plays in the architecture; and (ii) how relevant those roles are towards architecture maintainability. For this study setting, we first had to leverage architecture design information in order to map code elements to their architecture roles. Part of this information extraction had already been performed on our previous studies [19][20]. Then, we asked the original architects to assign different levels of importance to those roles, according to the architecture patterns implemented. Moreover, we defined score levels to each architecture role. For doing so, we considered the number of roles identified by the architects, and distributed them according to a fixed interval from 0 to 10. Code anomalies that infected elements playing critical architecture roles were assigned to the highest score. On the other hand, when the code anomaly affected elements related to less critical architecture roles, they would be assigned to lower scores, according to the number architecture roles provided by the original architects. V. EVALUATION This section describes the main steps for evaluating the proposed heuristics, as well as testing the study hypotheses. 
V. EVALUATION
This section describes the main steps for evaluating the proposed heuristics, as well as for testing the study hypotheses. The evaluation is organized into three main activities: (i) detecting code anomalies; (ii) identifying the rankings that represent the ground truth; and (iii) collecting the scores of each anomaly under the perspective of the prioritization heuristics.
A. Detecting Code Anomalies
The first step was the automatic identification of code anomalies in each of the 4 target systems by using well-known detection strategies and thresholds [16][31]. The detection strategies and thresholds used in our study had been applied previously in other studies [6][19][20]. The metrics required by the detection strategies are mostly collected with existing tools [30][33]. After that, the list of code anomalies was checked and refined by the original developers and architects of each target system. Through this validation we can make sure that the results produced by the detection tools do not include false positives [19]. We also defined a ground truth ranking in order to compare the analysis provided by the software architects with the ranking produced by each of the proposed heuristics. The ground truth ranking is a list of anomalous elements in the source code ordered by their architectural relevance, as defined by the original architects of each target application. Basically, the architects were asked to provide an ordered list of the top ten classes that, in their view, represented the main sources of maintainability problems in those systems. Besides providing a list of the high-priority code elements, the architects were also asked to provide information regarding the architectural design of each target system. That is, they should provide a list of the architectural roles present in each target system, ordered by their relevance from the architecture perspective.
B. Analysis Method
After applying the heuristics, we compared the rankings produced by each of them with the ground truth ranking. We decided to analyze only the top ten ranked code elements, for three main reasons: (i) it would be unviable to ask developers to rank an extensive list of elements; (ii) we wanted to evaluate our prioritization heuristics mainly for their ability to improve refactoring effectiveness, and the top ten anomalous code elements represent a significant sample of elements that could possibly cause architecture problems; and (iii) we focused on analyzing the top 10 code elements in order to assess whether they represent a useful subset of the sources of architecturally relevant anomalies. In order to analyze the rankings provided by the heuristics, we considered three measures: (i) Size of overlap – measures the number of elements that appear both in the ground truth ranking and in the heuristic ranking. It is fairly simple to calculate and tells us whether the prioritization heuristics are accurately distinguishing the top k items from the others; (ii) Spearman's footrule [5] – a well-known metric for permutations. It measures the distance between two ranked lists by computing the differences in the rankings of each item; and (iii) Fagin's extension to Spearman's footrule [8] – an extension of Spearman's footrule to top k lists. Fagin extended Spearman's footrule by assigning an arbitrary placement to elements that belong to one of the lists but not to the other. Such placement represents the position, in the resulting ranking, of all the items that do not overlap when comparing both lists.
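As a concrete illustration, the sketch below computes the three measures for two top-k lists of class names. It is only our reading of the definitions above: non-overlapping items are placed at position k+1, in line with the k+1 placement mentioned later in the paper, and the normalization constant used to obtain the NSF/NF values in the result tables that follow is left as a parameter, since it is not spelled out in this excerpt; the reported accuracies are consistent with being the complement of the normalized distance (e.g., an NSF value of 0.62 corresponds to 38% accuracy).

def overlap(ground_truth, heuristic):
    # Size of overlap: elements appearing in both top-k lists.
    return len(set(ground_truth) & set(heuristic))

def spearman_footrule(ground_truth, heuristic):
    # Footrule restricted to overlapping items: re-rank the overlap in each list
    # and sum the absolute rank differences.
    common = set(ground_truth) & set(heuristic)
    gt = [e for e in ground_truth if e in common]
    h = [e for e in heuristic if e in common]
    pos_gt = {e: i for i, e in enumerate(gt)}
    pos_h = {e: i for i, e in enumerate(h)}
    return sum(abs(pos_gt[e] - pos_h[e]) for e in common)

def fagin_footrule(ground_truth, heuristic, k):
    # Fagin's extension: items missing from one list are placed at position k+1.
    pos_gt = {e: i + 1 for i, e in enumerate(ground_truth)}
    pos_h = {e: i + 1 for i, e in enumerate(heuristic)}
    universe = set(ground_truth) | set(heuristic)
    return sum(abs(pos_gt.get(e, k + 1) - pos_h.get(e, k + 1)) for e in universe)

def accuracy(distance, max_distance):
    # Similarity accuracy as the complement of the normalized distance.
    return 1.0 - (distance / max_distance)

gt, hr = ["A", "B", "C"], ["B", "A", "D"]
print(overlap(gt, hr), spearman_footrule(gt, hr), fagin_footrule(gt, hr, k=3))  # 2 2 4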
It is important to notice the main differences between the three measures. The number of overlaps indicates how effectively our prioritization heuristics are capable of identifying a set of k relevant code elements, disregarding the rank differences between them. This measure becomes more important as the number of elements under analysis grows. Thus, the number of overlaps might give us a good hint of the heuristics' capability of identifying good refactoring candidates, regardless of their relative positions. The purpose of the other two measures is to analyze the similarity between two rankings. Unlike the number of overlaps, they take into consideration the position each item has in the compared rankings. It is important to mention the main difference between these two measures: when calculating Spearman's footrule, we only consider the overlapping items; when the lists are disjoint, the original ranks are lost and a new ranking is produced. On the other hand, Fagin's measure takes into consideration the positions of the overlapping elements in the original lists. Finally, we used the results of these measures to calculate the similarity accuracy, as defined in our hypotheses.
VI. EVALUATING THE PROPOSED PRIORITIZATION HEURISTICS
The evaluation of the proposed heuristics involved two separate activities: (i) a quantitative analysis of the similarity results; and (ii) a quantitative evaluation of the results regarding their relation to actual architecture problems.
A. Change-Density Heuristic
Evaluation. This heuristic was applied to 3 out of the 4 target applications selected in our study. Our analysis was based on different versions of Health Watcher (10 versions), Mobile Media (8 versions) and PDP (409 versions). Our goal was to check whether the prioritization heuristic performed well on systems with both shorter and longer histories. Additionally, it was not a requirement to only embrace projects with long histories, since we also wanted to evaluate whether the heuristics would be more effective in preliminary versions of a software system. Table III shows the evolution characteristics analyzed for each system.
TABLE III. CHANGE CHARACTERISTICS FOR EACH TARGET APPLICATION
Name            CE    N-Revisions   M-Revisions   AVG
Health Watcher  137   10            9             1.5
Mobile Media    82    9             8             2.6
PDP             97    409           74            8.8
As we can observe, Mobile Media and Health Watcher presented similar evolution behaviors. As the maximum number of revisions of a code element (M-Revisions) is limited by the total number of revisions of the system (N-Revisions), neither Health Watcher nor Mobile Media could have a code element with 10 or more revisions. We can also observe that Health Watcher had more revisions than Mobile Media; however, those changes were scattered across more files. Due to the reduced number of revisions available for both systems, we established a criterion for selecting items when there were ties in the top 10 rankings. For instance, we can use alphabetical order when the elements in the ground truth are ranked as equally harmful.
TABLE IV. RESULTS FOR THE CHANGE-DENSITY HEURISTIC
Name   Overlap (Value / Accuracy)   NSF (Value / Accuracy)   NF (Value / Accuracy)
HW     8 / 57%                      0.62 / 38%               0.87 / 13%
MM     5 / 50%                      1 / 0%                   0.89 / 11%
PDP    6 / 60%                      0.44 / 56%               0.54 / 46%
Table IV shows the results observed when analyzing the change-density heuristic. As we can observe, the highest absolute overlap value was obtained for Health Watcher. It can be explained by the fact that the Health Watcher system has many files with the same number of changes.
In this sense, when computing the scores we did not consider only the 10 most changed files, as that approach would discard files with as many changes as the ones selected. Therefore, we decided to select 14 files, where the last 5 presented exactly the same number of changes. Moreover, Health Watcher presented the highest number of code elements, with a total of 137 items (see Table III) that could appear in the ranking produced by applying the heuristic. Another interesting finding was observed in the Mobile Media system. Although the change-density heuristic detected 5 overlaps, all of them were shifted by exactly two positions, thus resulting in a value of 1 for the NSF measure. On the other hand, when we considered the non-overlapping items, the position of one item matched. Moreover, these results show us that the NSF measure is not adequate when the number of overlaps is small. When we compare the results of Mobile Media and Health Watcher to those obtained by PDP, we observe a significant difference. All PDP measures performed above our acceptable similarity threshold, which means a similarity value higher than 45%. In this case, we observed that the similarity was related to a set of classes that were deeply coupled: an interface acting as a Facade and three realizations of this interface, implementing a server module, a client module and a proxy. When changes were performed on the interface, many other changes were triggered in those three classes. For this reason, they underwent many other modifications during the system evolution. Moreover, the nature of the changes that the target applications underwent is different. For instance, in Health Watcher most of the changes were perfective (changes made aiming to improve the overall structure of the application). On the other hand, in Mobile Media most of the changes were related to the addition of new functionalities, which was also the case for PDP. However, we observed that Mobile Media also had low accuracy rates. In summary, the results of applying the change-density heuristic showed us that it could be useful for detecting and prioritizing architecturally relevant anomalies in the following scenarios: (i) there are architecture problems involving groups of classes changing together; (ii) there are problems in the architecture related to Facade or communication classes; and (iii) changes were predominantly perfective. In this sense, from the results observed in the analysis, we can reject the null hypothesis H1, as the change-density heuristic was able to produce rankings for PDP with at least acceptable accuracy in all the analyzed measures.
Correlation with Architectural Problems. Based on the results produced by the change-density heuristic, we also needed to evaluate whether there is a correlation between the rankings and architectural problems. In this sense, we performed the analysis by observing which ranked elements are related to actual architectural problems (see Table V). We can observe that elements containing architecturally relevant anomalies (Arch-Relevant) were likely to be change-prone. For the PDP system, all of the top 10 most changed elements were related to architectural problems. Also, if we consider that PDP has 97 code elements, and 37 of them are related to architectural problems, the results give us a hint that change-density is a good heuristic for detecting them.
TABLE V. RESULTS FOR THE CHANGE-DENSITY HEURISTIC VS. ARCHITECTURAL PROBLEMS
Name   N-ranked CE   Arch-Relevant   % of Arch-Relevant
HW     14            10              71%
MM     10            7               70%
PDP    10            10              100%
B. Error-Density Heuristic
Evaluation. This heuristic is based on the assessment of the bugs that are introduced by a code element. So, the higher the number of bugs observed in a code element, the higher its priority. Thus, in order to correctly evaluate the results produced by the error-density heuristic, a reliable set of detected bugs should be available for each target system. This was the case for the PDP system, where the set of bugs was well documented. On the other hand, for Mobile Media and Health Watcher, where the documentation of bugs was not available, we relied on the analysis of bug detection tools. The results of applying the error-density heuristic are shown in Table VI. It is important to highlight that for Health Watcher there were 14 ranked items, due to ties between some of them. Nevertheless, Health Watcher presented the highest overlap measures. That happens because the detected bugs were related to the behavior observed in every class implementing the Command pattern, and each of the classes implementing this pattern was listed as high priority in the ground truth ranking.
TABLE VI. RESULTS FOR THE ERROR-DENSITY HEURISTIC
Name   Overlap (Value / Accuracy)   NSF (Value / Accuracy)   NF (Value / Accuracy)
HW     10 / 71%                     0 / 100%                 0.74 / 26%
MM     3 / 30%                      0 / 100%                 0.76 / 24%
PDP    5 / 30%                      0.83 / 17%               0.74 / 26%
Another interesting finding we observed was that the priority order of the overlapping elements was exactly the same as the one pointed out in the ground truth. However, the 4 remaining non-overlapping elements were precisely the top 4 elements in the ground truth ranking. The fact that these top 4 elements are not listed in the ranking produced by the heuristic resulted in a low accuracy for the NF measure. For Mobile Media, we applied the same strategy, but all the measures also presented low accuracies. Due to the small number of overlaps, the results for NSF may not confidently represent the heuristic's accuracy. Finally, for PDP the results were evaluated from a different perspective, since we considered the manually detected bugs. That is, the bugs were collected through its issue tracking system, instead of using automatic bug detection tools. However, even when we performed the analysis using a reliable set of bugs, the overall results presented low accuracy. That is, from the 5 non-overlapping items, 4 of them were related to bugs in utility classes. Since those classes were neither related to any particular architectural role nor implementing an architecture component, they were not considered architecturally relevant.
Correlation with Architectural Problems. Based on the results produced by the error-density heuristic, we could investigate the correlation between the rankings and actual architectural problems. That is, we could analyze whether the error-density heuristic presented better results towards detecting architecturally relevant anomalies. Table VII presents the results from applying this heuristic. As we can see, at least 80% of the ranked elements were related to architecture problems for all the analyzed systems. Moreover, the Health Watcher system reached the most significant results, with 85% of the ranked elements related to architectural problems. When we take into consideration that the ranking for Health Watcher was composed of 14 code elements (instead of 10), this result is even more significant. As mentioned before, the rankings for Health Watcher and Mobile Media were built over automatically detected bugs. It means that even when formal bug reports are not available, the use of static analysis tools [3] for predicting possible bugs might be useful.
TABLE VII. RESULTS FOR THE ERROR-DENSITY HEURISTIC VS. ACTUAL ARCHITECTURAL PROBLEMS
Name   N-ranked CE   Arch-Relevant   % of Arch-Relevant
HW     14            10              85%
MM     10            8               80%
PDP    10            8               80%
On the other hand, for the PDP system, where we considered actual bug reports, the results were also promising. From the top 10 ranked elements, 8 were related to architecture problems. When we consider that the PDP system has 97 code elements, with 37 of them related to architecture problems, it means that the remaining 29 were distributed among the 87 bottom-ranked elements. Moreover, if we extend the analysis over the top 20 elements, we observe a better correlation factor. That is, in this case the correlation showed us that around 85% of the top 20 most error-prone elements were related to architecture problems.
C. Anomaly Density Heuristic
Evaluation. The anomaly density heuristic was applied to the 4 target systems selected in our study. We observed good results in terms of accuracy in ranking the architecturally relevant anomalies. As we can see in Table VIII, good results were obtained not only in ranking the top 10 anomalies, but also in defining their positions. We observed that only 2 of the 8 measures had low accuracy when compared to the thresholds defined in our work. Furthermore, the number of overlaps achieved by this heuristic can be considered highly accurate in 3 of the 4 target systems. This indicates that code elements affected by multiple code anomalies are often perceived as high priority. The only exception was Health Watcher, where we observed only 5 overlaps. When analyzing the number of anomalies for each element in the ranking representing the ground truth, we observed that many of them had exactly the same number of code anomalies, namely 8. Also, it is important to mention that for this heuristic, in contrast to the change-density and error-density heuristics, we only considered the top 10 elements for the Health Watcher system, since there were no ties to be taken into consideration. When analyzing the MIDAS system, we could not consider the number of overlaps significant, even though 9 out of 10 elements appeared in both rankings; this high overlap was expected, as the system is composed of only 21 code elements. Nevertheless, we observed that both NSF and NF presented a high accuracy, which means that the rankings were similarly ordered. Moreover, the NF measure presented a better result, which was influenced by the fact that the only mismatched element was ranked in the 10th position. On the other hand, when analyzing Mobile Media we observed discrepant results regarding the two ranking measures: we found 59% accuracy for the NSF measure and 30% for the NF measure. This difference is also related to the positions of the non-overlapping elements in the ranking generated by the heuristic. The ranks for those elements were assigned to k+1 in the developers' list, which resulted in a large distance from their original positions. It is also important to mention that those elements comprised a data model class, a utility class and a base class for controllers.
TABLE VIII. RESULTS FOR THE ANOMALY DENSITY HEURISTIC
Name    Overlap (Value / Accuracy)   NSF (Value / Accuracy)   NF (Value / Accuracy)
HW      5 / 50%                      0.66 / 34%               0.54 / 46%
MM      7 / 70%                      0.41 / 59%               0.7 / 30%
PDP     8 / 80%                      0.37 / 63%               0.36 / 64%
MIDAS   9 / 90%                      0.4 / 60%                0.20 / 80%
By analyzing the results for this heuristic, we observed that code elements infected by multiple code anomalies are often perceived as high priority. We also identified that many false positives could arise from utility classes, as those classes are often large and not cohesive. Finally, the results obtained in this analysis also helped us reject the null hypothesis H3, as the anomaly density heuristic was able to produce rankings with at least acceptable accuracy, for at least one measure, in all of the systems we analyzed. Furthermore, we obtained a high accuracy rate for the MIDAS system in 2 out of 3 measures, namely 90% for the overlaps and 80% for NF.
Correlation with Architectural Problems. We also performed an analysis in order to evaluate whether the rankings produced by the anomaly density heuristic are related to actual architectural problems. However, when evaluating the results produced by this heuristic, we observed that they were not consistent when compared with the architecturally relevant anomalies. This is a valid conclusion for all target systems.
TABLE IX. RESULTS FOR THE ANOMALY DENSITY HEURISTIC VS. ACTUAL ARCHITECTURAL PROBLEMS
Name    N-ranked CE   Arch-Relevant   % of Arch-Relevant
HW      10            5               50%
MM      10            9               90%
PDP     10            8               80%
MIDAS   10            6*              60%*
For instance, Table IX shows that for the Health Watcher system only 5 out of the top 10 ranked elements were related to architectural problems. The 5 code elements related to architectural problems are exactly the same overlapping items between the compared ranks. This happens due to a high number of anomalies concentrated in a small number of elements that are not architecturally relevant. Moreover, all the 5 non-architecturally relevant elements were data access classes responsible for communicating with the database. For the MIDAS system, we observed that, among the top 10 code elements with the highest number of anomalies, 6 were architecturally relevant. In addition, the MIDAS system has exactly 6 elements that contribute to the occurrence of architecture problems. So, we can say that the anomaly density heuristic correctly identified all of them in the top 10 ranking.
D. Architecture Role Heuristic
Evaluation. We analyzed 3 of the 4 systems in order to evaluate the architecture role heuristic. As we can observe (see Table X), PDP achieved the most consistent results regarding the three similarity measures. The heuristic achieved around 60% accuracy when comparing the similarity between the rankings. Also, PDP is the only system where it was possible to divide classes and interfaces into more than three levels when analyzing the architectural roles. For instance, Table XI illustrates the four different architectural roles defined for the PDP system.
TABLE X. RESULTS FOR THE ARCHITECTURE ROLE HEURISTIC
Name   Overlap (Value / Accuracy)   NSF (Value / Accuracy)   NF (Value / Accuracy)
HW     4 / 40%                      0.5 / 50%                0.72 / 28%
MM     6 / 60%                      0.22 / 78%               0.41 / 59%
PDP    6 / 60%                      0.33 / 67%               0.41 / 59%
TABLE XI. ARCHITECTURE ROLES IN PDP
Architecture Roles                                   Score   # of CE
Utility and Internal Classes                         1       23
Presentation and Data Access Classes                 2       28
Domain Model and Business Classes                    4       24
Public Interfaces, Communication Classes, Facades    8       6
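The sketch below illustrates how a ranking like the one discussed next can be derived from role scores such as those in Table XI, including the alphabetical tie-breaking used throughout the study. The class names and role assignments are invented for illustration; only the scores (8, 4, 2, 1) echo Table XI.

# Hypothetical role assignment for a handful of classes; scores follow Table XI.
role_score = {"PublicInterface": 8, "DomainModel": 4, "Presentation": 2, "Utility": 1}
classes = {
    "ServerFacade": "PublicInterface", "ClientFacade": "PublicInterface",
    "Portfolio": "DomainModel", "Asset": "DomainModel", "Report": "DomainModel",
    "DateUtils": "Utility",
}

def top_k_by_role(classes, k):
    # Sort by descending score; ties are broken by alphabetical order of the class name.
    ranked = sorted(classes, key=lambda c: (-role_score[classes[c]], c))
    return ranked[:k]

print(top_k_by_role(classes, 4))
# ['ClientFacade', 'ServerFacade', 'Asset', 'Portfolio']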
Based on the classification provided in Table XI, we can draw the architecture role heuristic ranking for PDP. As we can see, the ranking contains all of the 6 code elements (# of CE) from the highest category and 4 elements from the domain model and business classes. We ordered the elements alphabetically to break ties. Therefore, although 23 classes obtained the same score, we are only comparing 4 of them. However, it is important to mention that some of the elements ranked by the original architects belonged to the group of discarded elements. Had we chosen a different approach, such as keeping all the tied elements, we would have turned our top ten ranking into a list of 30 items and obtained a 100% overlap rate. On the other hand, we decided to follow a different score approach for Mobile Media and Health Watcher, by consulting the original architects of each of the target applications. The architects provided us the architecture roles and their relevance in the system architecture. Once we identified which classes were directly implementing which roles, we were able to produce the rankings for this heuristic. The worst results were observed in the Health Watcher system, where almost 20 elements were tied with the same score. So, we first selected the top 10 elements and broke the ties according to alphabetical order. This led us to an unrealistically low number of overlaps, as some of the discarded items were present in the ground truth ranking. In fact, due to the low number of overlaps, it would not be fair to evaluate the NSF measure either. Thus, we performed a second analysis, considering the top 20 items instead of the top 10, in order to analyze the whole set of elements that had the same score. In this second analysis, we observed that the number of overlaps went up to 6, but the accuracy for the NSF measure decreased to 17%, which indicates a larger distance between the compared rankings. In addition, this also shows us that the 50% accuracy for NSF obtained in the first comparison round was misleading, as expected, due to the low number of overlaps. For the Mobile Media system, we observed high accuracy rates for both the NSF and NF measures. Furthermore, we observed that several elements of Mobile Media were documented as being of high priority for the implementation of architectural components. More specifically, there were 8 architecture components described in that documentation directly related to 9 out of the top 10 high-priority classes. It is important to notice that the results for this heuristic depend on the quality of the architecture roles defined by the software architects. Moreover, we observed that the PDP system achieved the best results, even with multiple architecture roles defined, as well as different levels of relevance. Finally, we conclude that the results of applying the architecture role heuristic helped to reject the null hypothesis H4. In other words, the heuristic was able to produce rankings with at least acceptable accuracy in all of the target applications.
Correlation with Architectural Problems. Similarly to the other heuristics, we also evaluated whether the rankings produced by the architecture role heuristic are related to actual architectural problems for each of the target applications (see Table XII). As we can observe, the results are discrepant between Health Watcher and the other two systems. However, the problem in this analysis is related to the analyzed data. We identified two different groups of architecture roles among the top 10 elements for Health Watcher, ranked as equally relevant.
That is, 6 of the related elements were playing the role of repository interfaces. The 4 remaining elements were Facades [10] or elements responsible for communicating different architecture components. We then asked the original architects to elaborate on the relevance of those roles, as we suspected they were unequal. They decided to differentiate the relevance between them and considered the repository role as less relevant. This refinement led to a completely different ranking, which went up from 4 to 7 elements related to architecture problems.
TABLE XII. ARCHITECTURE ROLE HEURISTIC AND ACTUAL ARCHITECTURAL PROBLEMS
Name   # of ranked CE   Arch-Relevant   % of Arch-Relevant
HW     10               4               40%
MM     10               9               90%
PDP    10               10              100%
The results obtained for Health Watcher show us the importance of correctly identifying the architecture roles and their relevance for improving the accuracy of this heuristic. When that information is accurate, the results for this heuristic are highly positive. Furthermore, the other proposed prioritization heuristics could benefit from information regarding architecture roles in order to minimize the number of false positives, such as utility classes. This indicates the need to further analyze different combinations of prioritization heuristics.
VII. THREATS TO VALIDITY
This section describes some threats to validity observed in our study. The first threat is related to possible errors in the detection of anomalies in each of the selected target systems. As the proposed heuristics consist of ranking previously detected code anomalies, the method for detecting these anomalies must be trustworthy. Although there are several kinds of detection strategies in the state of the art, many studies have shown that they are inefficient for detecting architecturally relevant code anomalies [19]. In order to reduce the risk of imprecision when detecting code anomalies: (i) the original developers and architects were involved in this process; and (ii) we used well-known metrics and thresholds for constructing our detection strategies [16][31]. The second threat is related to how we identified errors in the software systems in order to apply the error-density heuristic. Firstly, we relied on commit messages for identifying classes related to bug fixes, which implies that some errors might have been missed. In order to mitigate this threat, we also investigated issue-tracking systems. Basically, we looked for error reports and traces between these errors and the code changed to fix them. Furthermore, we investigated test reports in order to identify the causes of eventual broken tests. Finally, for the cases where this information was not available, we relied on the use of static analysis methods for identifying bugs [3]. The third threat is related to the identification of the architectural roles for each of the target systems. The architecture role heuristic is based on identifying the relevance of code elements regarding the system's architectural design. Thus, in order to compute the scores for this heuristic, we needed to assess the roles that each code element plays in the system architecture. In this sense, we considered the identification of architectural roles as a threat to construct validity, because the information regarding the architectural roles was extracted differently depending on the target system. Furthermore, we understand that the absence of architecture documentation reflects a common situation that might be inevitable when analyzing real-world systems.
Finally, the fourth threat to validity is an external threat and it is related to the choice of the target systems. The problem here is that our results are limited to the scope of the 4 target systems. In order to minimize this threat, we selected systems developed by different programmers, with different domains, programming languages, environments and architectural styles. In order to generalize our results, further empirical investigation is still required; in this sense, our study should be replicated with other applications, from different domains.
VIII. FINAL REMARKS AND FUTURE WORK
The presence of architecturally relevant code anomalies often leads to the decline of the software architecture quality. Furthermore, the removal of those critical anomalies is not properly prioritized, mainly due to the inability of current tools to identify and rank architecturally relevant code anomalies. Moreover, there is not sufficient empirical knowledge about factors that could ease the prioritization process. In this sense, our work has shown that developers can be guided in the prioritization of code anomalies according to their architectural relevance. The main contributions of this work are: (i) four prioritization heuristics based on architecture relevance; and (ii) the evaluation of the proposed heuristics on four different software systems. In addition, during the evaluation of the proposed heuristics, we found that they were mostly useful in scenarios where: (i) there are architectural problems involving groups of classes that change together; (ii) there are architecture problems related to Facades or classes responsible for communicating different modules; (iii) changes are not predominantly perfective; (iv) there are code elements infected by multiple anomalies; and (v) the architecture roles are well defined in the software system and have distinct architecture relevance. Finally, in this work we evaluated the proposed heuristics individually; thus, we have not evaluated how their combinations could benefit the prioritization results. As future work, we aim to investigate whether the combination of two or more heuristics would improve the efficiency of the ranking results. We also intend to apply different weights when combining the heuristics, enriching the possible results and looking for an optimal combination.
REFERENCES
[1] R. Arcoverde, A. Garcia and E. Figueiredo, "Understanding the Longevity of Code Smells: Preliminary Results of an Explanatory Survey," in Proc. of 4th Int'l Workshop on Refactoring Tools, May 2011.
[2] R. Arcoverde et al., "Automatically Detecting Architecturally-Relevant Code Anomalies," in Proc. of 3rd Int'l Workshop on Recommendation Systems for Soft. Eng., June 2012.
[3] N. Ayewah et al., "Using Static Analysis to Find Bugs," IEEE Software, Vol. 25, Issue 5, pp. 22-29, September 2008.
[4] L. Bass, P. Clements and R. Kazman, "Software Architecture in Practice", Second Edition, Addison-Wesley Professional, 2003.
[5] P. Diaconis and R. Graham, "Spearman's Footrule as a Measure of Disarray", Journal of the Royal Statistical Society, Series B, Vol. 39, pp. 262-268, 1977.
[6] E. Figueiredo et al., "Evolving Software Product Lines with Aspects: An Empirical Study on Design Stability," in Proc. of 30th Int'l Conf. on Software Engineering, New York, USA, 2008.
[7] S. Eick, T. Graves and A. Karr, "Does Code Decay? Assessing the Evidence from Change Management Data", IEEE Transactions on Soft. Eng., Vol. 27, Issue 1, pp. 1-12, 2001.
[8] R. Fagin, R. Kumar and D. Sivakumar, "Comparing Top k Lists", in Proc. of 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, pp. 28-36, USA, 2003.
[9] M. Fowler, "Refactoring: Improving the Design of Existing Code," Addison-Wesley, 1999.
[10] E. Gamma et al., "Design Patterns: Elements of Reusable Object-Oriented Software", Addison-Wesley, Boston, USA, 1995.
[11] J. Garcia, D. Popescu, G. Edwards and N. Medvidovic, "Identifying Architectural Bad Smells," in Proc. of CSMR, Washington, USA, 2009.
[12] M. Godfrey and E. Lee, "Secrets from the Monster: Extracting Mozilla's Software Architecture", in Proc. of 2nd Int'l Symp. on Constructing Software Engineering Tools, 2000.
[13] P. Greenwood et al., "On the Impact of Aspectual Decompositions on Design Stability: An Empirical Study," in Proc. of 21st European Conf. on Object-Oriented Programming, Springer, pp. 176-200, 2007.
[14] L. Hochstein and M. Lindvall, "Combating Architectural Degeneration: A Survey," Information and Software Technology, Vol. 47, July 2005.
[15] D. Kelly, "A Study of Design Characteristics in Evolving Software Using Stability as a Criterion," IEEE Transactions on Software Engineering, Vol. 32, Issue 5, pp. 315-329, 2006.
[16] F. Khomh, M. Di Penta and Y.-G. Guéhéneuc, "An Exploratory Study of the Impact of Code Smells on Software Change-Proneness," in Proc. of 16th Working Conf. on Reverse Eng., pp. 75-84, 2009.
[17] M. Kim, D. Cai and S. Kim, "An Empirical Investigation into the Role of API-Level Refactorings during Software Evolution," in Proc. of 33rd Int'l Conf. on Software Engineering, USA, 2011.
[18] M. Lanza and R. Marinescu, "Object-Oriented Metrics in Practice," Springer-Verlag, New York, USA, 2006.
[19] I. Macia et al., "Are Automatically-Detected Code Anomalies Relevant to Architectural Modularity? An Exploratory Analysis of Evolving Systems," in Proc. of 11th AOSD, pp. 167-178, Germany, 2012.
[20] I. Macia et al., "On the Relevance of Code Anomalies for Identifying Architecture Degradation Symptoms", in Proc. of 16th CSMR, Hungary, March 2012.
[21] I. Macia et al., "Supporting the Identification of Architecturally-Relevant Code Anomalies", in Proc. of 28th IEEE Int'l Conf. on Soft. Maint., Italy, 2012.
[22] I. Macia et al., "Enhancing the Detection of Code Anomalies with Architecture-Sensitive Strategies", in Proc. of 17th CSMR, Italy, March 2013.
[23] A. MacCormack, J. Rusnak and C. Baldwin, "Exploring the Structure of Complex Software Design: An Empirical Study of Open Source and Proprietary Code", Management Science, Vol. 52, Issue 7, pp. 1015-1030, 2006.
[24] S. Malek et al., "Reconceptualizing a Family of Heterogeneous Embedded Systems via Explicit Architectural Support", in Proc. of 29th Int'l Conf. on Soft. Eng., IEEE Computer Society, USA, 2007.
[25] M. Mantyla and C. Lassenius, "Subjective Evaluation of Software Evolvability Using Code Smells: An Empirical Study," Empirical Software Engineering, Vol. 11, pp. 395-431, 2006.
[26] R. Marinescu, "Detection Strategies: Metrics-Based Rules for Detecting Design Flaws," in Proc. of Int'l Conf. on Soft. Maint., pp. 350-359, 2004.
[27] R. Martin, "Agile Software Development: Principles, Patterns, and Practices," Prentice Hall, 2002.
[28] M. J. Munro, "Product Metrics for Automatic Identification of 'Bad Smell' Design Problems in Java Source-Code", in Proc. of 11th Int'l Symposium on Soft. Metrics, p. 15, September 2005.
[29] E. Murphy-Hill, C. Parnin and A. Black, "How We Refactor, and How We Know It," in Proc. of 31st Int'l Conf. on Software Engineering, 2009.
[30] NDepend. Available at http://www.ndepend.com, 2013.
[31] S. Olbrich, D. Cruzes and D. Sjoberg, "Are Code Smells Harmful? A Study of God Class and Brain Class in the Evolution of Three Open Source Systems," in Proc. of 26th Int'l Conf. on Soft. Maint., 2010.
[32] J. Ratzinger, M. Fischer and H. Gall, "Improving Evolvability through Refactoring," in Proc. of 2nd Int'l Workshop on Mining Soft. Repositories, ACM Press, pp. 69-73, New York, 2005.
[33] Understand. Available at http://www.scitools.com/, 2013.
[34] C. Wohlin et al., "Experimentation in Software Engineering – An Introduction", Kluwer Academic Publishers, 2000.
[35] S. Wong, Y. Cai and M. Dalton, "Detecting Design Defects Caused by Design Rule Violations," in Proc. of 18th ESEC/Foundations of Software Engineering, 2010.
[36] Z. Xing and E. Stroulia, "Refactoring Practice: How it is and How it should be Supported: An Eclipse Study," in Proc. of 22nd IEEE Int'l Conf. on Software Maintenance, pp. 458-468, 2006.
[37] D. Sheskin, "Handbook of Parametric and Nonparametric Statistical Procedures", Chapman & Hall, 4th Edition, 2007.

Are domain-specific detection strategies for code anomalies reusable? An industry multi-project study
Reuso de Estratégias Sensíveis a Domínio para Detecção de Anomalias de Código: Um Estudo de Múltiplos Casos
Alexandre Leite Silva, Alessandro Garcia, Elder José Reioli, Carlos José Pereira de Lucena
Opus Group, Laboratório de Engenharia de Software, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), Rio de Janeiro/RJ, Brasil
{aleite, afgarcia, ecirilo, lucena}@inf.puc-rio.br
Resumo— Para promover a longevidade de sistemas de software, estratégias de detecção são reutilizadas para identificar anomalias relacionadas a problemas de manutenção, tais como classes grandes, métodos longos ou mudanças espalhadas. Uma estratégia de detecção é uma heurística composta por métricas de software e limiares, combinados por operadores lógicos, cujo objetivo é detectar um tipo de anomalia. Estratégias pré-definidas são usualmente aplicadas globalmente no programa na tentativa de revelar onde se encontram os problemas críticos de manutenção. A eficiência de uma estratégia de detecção está relacionada ao seu reuso, dado o conjunto de projetos de uma organização. Caso haja necessidade de definir limiares e métricas para cada projeto, o uso das estratégias consumirá muito tempo e será negligenciado. Estudos recentes sugerem que o reuso das estratégias convencionais de detecção não é usualmente possível se aplicadas de forma universal a programas de diferentes domínios. Dessa forma, conduzimos um estudo exploratório em vários projetos de um domínio comum para avaliar o reuso de estratégias de detecção. Também avaliamos o reuso de estratégias conhecidas, com calibragem inicial de limiares a partir do conhecimento e análise de especialistas do domínio. O estudo revelou que, mesmo que o reuso de estratégias aumente quando definidas e aplicadas para um domínio específico, em alguns casos o reuso é limitado pela variação das características dos elementos identificados por uma estratégia de detecção. No entanto, o estudo também revelou que o reuso pode ser significativamente melhorado quando as estratégias consideram peculiaridades dos interesses recorrentes no domínio ao invés de serem aplicadas no programa como um todo.
Palavras-chave— anomalias; detecção; reuso; acurácia
Abstract— To prevent quality decay, detection strategies are reused to identify symptoms of maintainability problems in the entire program. A detection strategy is a heuristic composed of the following elements: software metrics, thresholds, and logical operators combining them. The adoption of detection strategies is largely dependent on their reuse across the portfolio of an organization's software projects. If developers need to define or tailor those strategy elements for each project, their use will become time-consuming and neglected. Nevertheless, there is no evidence about efficient reuse of detection strategies across multiple software projects. Therefore, we conducted an industry multi-project study to evaluate the reusability of detection strategies in a critical domain. We assessed the degree of accurate reuse of previously-proposed detection strategies based on the judgment of domain specialists. The study revealed that even though the reuse of strategies in a specific domain should be encouraged, their accuracy is still limited when they are holistically applied to all the modules of a program. However, the accuracy and reuse were both significantly improved when the metrics, thresholds and logical operators were tailored to each recurring concern of the domain.
I. INTRODUÇÃO
Na medida em que sistemas de software são alterados, mudanças não planejadas podem introduzir problemas estruturais no código fonte. Estes problemas representam sintomas de manutenibilidade pobre do programa e, portanto, podem dificultar as atividades subsequentes de manutenção e evolução do programa [1]. Tais problemas são chamados de anomalias de código ou, popularmente, de bad smells [1]. Segundo estudos empíricos, módulos de programas com anomalias recorrentes, tais como métodos longos [1] e mudanças espalhadas [1], estão usualmente relacionados com a introdução de falhas [17][25][26] e com sintomas de degeneração de projeto [17][20][27]. Quando tais anomalias não são identificadas e removidas, é frequente a ocorrência de degradação parcial ou total do sistema [21]. À medida que um sistema cresce, identificar anomalias de código manualmente fica ainda mais difícil ou impeditivo. A automação do processo de detecção de anomalias em programas é usualmente suportada através de métricas [2][11]. Cada métrica quantifica um atributo de elementos do código fonte, tais como acoplamento [23], coesão [24] e complexidade ciclomática [22]. A partir das métricas é possível identificar uma relação entre os valores de atributos e um sintoma de problema no código. Através dessa relação é possível definir uma estratégia de detecção para apoiar a descoberta de anomalias automaticamente [1][2]. Uma estratégia de detecção é uma condição composta por métricas e limiares, combinados através de operadores lógicos. Através desta condição é possível filtrar um conjunto específico de elementos do programa. Este conjunto de elementos representa candidatos a anomalias de código nocivas à manutenibilidade do sistema [2]. Mesmo assim, nem todo sintoma representa necessariamente um problema relevante para o desenvolvedor do sistema [8]. Para facilitar a identificação das anomalias, algumas ferramentas foram propostas a partir das estratégias de detecção conhecidas: [3], [4], [5], [6] e [7]. Mesmo com o apoio de ferramentas, detectar anomalias é difícil e custoso [8]. Além disso, a eficiência de uma estratégia de detecção está relacionada à facilidade do seu reuso dado o conjunto de projetos de uma organização.
Em um extremo negativo, os desenvolvedores precisariam definir uma estratégia de detecção para cada tipo possível de anomalia, para cada projeto. Para isso, seria preciso rever as métricas e limiares apropriados, além das ocorrências identificadas pelas ferramentas que não representam necessariamente problemas no código. Essa tarefa, ao ser executada especificamente para cada projeto, custará muito tempo e fatalmente será negligenciada. Além disso, existem evidências empíricas de que o reuso das estratégias de detecção não é possível quando elas são aplicadas a vários projetos de software de domínios totalmente diferentes. Para que fosse possível investigar o reuso de estratégias de detecção em vários projetos de software do mesmo domínio, este artigo apresenta um estudo de múltiplos casos da indústria. O estudo investigou o reuso de sete estratégias de detecção, relacionadas a três anomalias, em seis projetos de um domínio específico. O reuso das estratégias foi avaliado a partir do percentual de falsos positivos, classificados segundo a análise de três especialistas do domínio sobre as ocorrências encontradas pelas estratégias de detecção de anomalias. A partir do grau de reuso das estratégias, foram investigadas as situações em que seria possível aumentar esse grau de reuso, tendo em vista os sistemas escolhidos para o estudo. Dessa forma, o estudo revelou que, mesmo que o reuso de estratégias de detecção em um domínio específico seja incentivado, em alguns casos o reuso é limitado devido à variação das características dos elementos identificados por uma estratégia de detecção. No entanto, a acurácia e o reuso foram ambos significativamente melhorados quando os limiares foram adaptados para certos interesses recorrentes no domínio. Nesse sentido, foi observado que os melhores resultados se deram nos interesses em que as características dos elementos variaram menos. Assim, o presente trabalho inicia a investigação de estratégias de detecção tendo em vista conjuntos de elementos com responsabilidades bem definidas. O artigo está estruturado da seguinte maneira. Na seção II é apresentada a terminologia relacionada ao trabalho. Na seção III é apresentada a definição do estudo de caso. Na seção IV são apresentados os resultados e as discussões e, na seção V, as conclusões.
II. TERMINOLOGIA
Esta seção apresenta conceitos associados a anomalias de código (seção II.A) e estratégias de detecção de anomalias (seção II.B).
A. Anomalias de código
Segundo Fowler, uma anomalia de código é um sintoma de manutenibilidade pobre do programa que pode dificultar futuras atividades de correção e evolução do código fonte [1]. Por exemplo, um sintoma que precisa ser evitado é a existência de classes que centralizam muito conhecimento sobre as funcionalidades do sistema. Este sintoma é muito conhecido como God Class e tem um grande potencial de impacto negativo no perfeito entendimento do sistema [2]. Outro sintoma que deve ser evitado é Long Method [1]: quanto maior um método é, mais difícil será entender o que ele se propõe a fazer. Espera-se, então, uma maior longevidade de programas com métodos curtos de código [1]. Estas anomalias estão relacionadas de uma forma ou de outra a fatos sobre um único elemento do código. Por outro lado, certas anomalias procuram correlacionar fatos sobre diversos elementos do código com possíveis problemas de manutenção, como é o caso da Shotgun Surgery.
Esta anomalia identifica métodos que podem provocar muitas alterações em cascata, isto é, manutenções no código que levam a diversas mudanças pequenas em outras classes. Quando essas alterações estão espalhadas pelo código, elas são difíceis de encontrar e, nesse caso, é fácil para o desenvolvedor esquecer alguma mudança importante [1]. Para apoiar a descoberta de anomalias, Fowler propôs 22 metáforas de sintomas que indicam problemas no código, sendo que cada metáfora está relacionada a uma anomalia de código [1][10].
B. Estratégias de detecção
A detecção de anomalias oferece aos desenvolvedores a oportunidade de reestruturação do código para uma nova estrutura que facilite manutenções futuras. Um mecanismo bastante utilizado para detecção de anomalias é a descrição das mesmas através da composição de métricas associadas aos atributos dos elementos de código [2]. Uma composição de métricas descreve uma estratégia de detecção. A partir de uma estratégia de detecção é possível filtrar então um conjunto específico de elementos do programa. Este conjunto de elementos representa potenciais candidatos a anomalias de código [2][12][13]. A Fig. 1 descreve, de maneira sucinta, o processo de formação de uma estratégia de detecção, segundo [2], para reuso em diferentes sistemas. Primeiro, um conjunto de métricas relacionadas a sintomas que indicam um determinado problema é identificado (Fig. 1–a). Em um segundo passo, as métricas identificadas são associadas a limiares, para que seja possível filtrar os elementos de código. Uma métrica associada a um dado limiar permite selecionar os elementos cujos valores excedem esse limiar (Fig. 1–b). Para a formação final de uma estratégia de detecção, as métricas e limiares são combinados entre si através de operadores lógicos (e.g., AND, OR) (Fig. 1–c e d).
[Fig. 1. Processo de formação de uma estratégia de detecção [2] e seu uso em diversos sistemas – adaptado de [25]. Legenda: Mi: resultado da métrica i; Li: limiar associado à métrica i; ED: estratégia de detecção.]
Como se pode observar, uma estratégia de detecção codifica o conhecimento a respeito das características de uma determinada anomalia. Logo, escolher métricas e limiares apropriados é determinante para o sucesso da estratégia no apoio à descoberta de sintomas de problemas no código [2][8][14]. Com isso, a principal intenção com esta abordagem é permitir que uma estratégia de detecção possa ser posteriormente aplicada em diversos sistemas (Fig. 1–e), isto é, espera-se que as características de uma anomalia se mantenham entre diferentes sistemas. No entanto, observa-se que, em determinados contextos em que as estratégias são aplicadas, algumas ocorrências não são necessariamente sintomas de problemas, isto é, elas indicam, na realidade, falsos positivos [8][9].
III. DEFINIÇÃO DO ESTUDO
O estudo objetiva investigar a viabilidade de reuso de estratégias de detecção de anomalias em vários sistemas do mesmo domínio. Portanto, a seção III.A descreve o objetivo do estudo em mais detalhes. A seção III.B descreve o contexto em que o estudo foi conduzido. A seção III.C descreve o projeto do estudo. A.
Objetivo do estudo De acordo com o formato proposto por Wohlin (1999), o objetivo deste trabalho pode ser caracterizado da seguinte forma: O objetivo é analisar a generalidade das estratégias de detecção de anomalias de código para o propósito de reuso das mesmas com respeito à diminuição da ocorrência de falsos positivos do ponto de vista de mantenedores de software no contexto de sistemas web de apoio à tomada de decisão. O contexto desse estudo é formado por seis sistemas web de apoio à tomada de decisão. Esse conjunto de sistemas opera em um domínio crítico, pois realiza a análise de indicadores para o mercado financeiro (seção III.B). Em uma primeira etapa, busca-se calibrar ou definir estratégias de detecção para sistemas desse domínio, a partir de características conhecidas e observadas pelos desenvolvedores de um subconjunto de sistemas neste domínio. Esta fase tem o objetivo de calibrar estratégias existentes ou definir novas estratégias para serem utilizadas em sistemas do domínio alvo. Portanto, o conhecimento dos especialistas do domínio sobre o código fonte foi utilizado primeiro para calibrar os limiares de métricas usadas em estratégias convencionais existentes (ex. [2][13][16]). Tal conhecimento do especialista sobre o código foi também usado para definir novas estratégias com métricas não exploradas em tais estratégias convencionais. Em uma segunda etapa, avalia-se o reuso e a acurácia das estratégias em uma família de outros sistemas do mesmo domínio. Além do grau de reuso, a acurácia das estratégias é avaliada através da quantidade de falsos positivos encontrados. Falsos positivos são indicações errôneas de anomalias detectadas pela estratégia. Nossa pressuposição é que o reuso das estratégias aumenta na medida em que as mesmas são definidas em função de características de sistemas do mesmo domínio. Além disso, certos conjuntos recorrentes de classes, que implementam um mesmo interesse (responsabilidade) bem definido, em sistemas de um mesmo domínio, tendem a possuir características estruturais semelhantes. De fato, em sistemas web de apoio à tomada de decisão, foco do nosso estudo (seção III.B), alguns conjuntos de classes possuem responsabilidades semelhantes e bem definidas. Portanto, também estudamos se o reuso e a eficácia poderiam ser melhorados se estratégias fossem aplicadas a classes com uma responsabilidade recorrente do domínio. Por exemplo, um conjunto de classes desse domínio é formado por classes que recebem as requisições do usuário e iniciam a geração de indicadores financeiros. Essas classes recebem os parâmetros necessários, calculam uma grande quantidade de informações e geram os resultados para serem exibidos na interface. Além disso, essas classes desempenham o papel de intermediário entre a interface do usuário e as classes de negócio. Mesmo assim, é preciso evitar que fiquem muito grandes. Além disso, é preciso evitar que o acoplamento dessas classes fique muito disperso com relação a outras classes da aplicação. Outro conjunto desse domínio é formado por classes responsáveis pela persistência dos dados. Assim, essas classes são formadas por muitos métodos de atribuição e leitura de valores de atributos (getters e setters). As classes de persistência devem evitar métodos muito longos que possam incorporar também a lógica da aplicação de forma indesejável. Uma classe da camada de persistência com essas características pode indicar um sintoma de problemas para a compreensão dos métodos, bem como acoplamentos nocivos à manutenção do programa. 
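Apenas como ilustração da ideia de estratégia de detecção composta por métricas, limiares e operadores lógicos (seção II.B), aplicada a classes de um mesmo interesse do domínio, segue um esboço mínimo e hipotético: as métricas, os limiares e os nomes usados abaixo são suposições nossas e não correspondem necessariamente às estratégias calibradas neste estudo.

# Esboço ilustrativo: uma estratégia de detecção é um predicado sobre as métricas
# de um elemento de código; os limiares abaixo são hipotéticos.
def god_class(m):
    # Classe grande, complexa e que acessa muitos dados de outras classes (composição com AND)
    return m["loc"] > 500 and m["wmc"] > 47 and m["atfd"] > 5

def metodo_longo_em_persistencia(m):
    # Estratégia específica do interesse de persistência: métodos longos que
    # incorporam lógica de negócio em classes de acesso a dados
    return m["interesse"] == "persistencia" and m["maior_metodo_loc"] > 80

classes = [
    {"nome": "GeradorRelatorio", "interesse": "controle", "loc": 620, "wmc": 55, "atfd": 9, "maior_metodo_loc": 40},
    {"nome": "CarteiraDAO", "interesse": "persistencia", "loc": 300, "wmc": 20, "atfd": 2, "maior_metodo_loc": 120},
]
suspeitas = [c["nome"] for c in classes if god_class(c) or metodo_longo_em_persistencia(c)]
print(suspeitas)  # ['GeradorRelatorio', 'CarteiraDAO']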
Nesse sentido, esse trabalho visa responder às seguintes questões de pesquisa:
(1) É possível reusar estratégias de detecção de anomalias de forma eficaz em um conjunto de sistemas de um mesmo domínio? A partir de estratégias calibradas ou definidas com o apoio do especialista do domínio, faz-se necessário avaliar o reuso das estratégias em outros sistemas do mesmo domínio. Entretanto, o reuso de cada estratégia só é eficaz se a mesma é aplicada em um novo programa do mesmo domínio com baixa incidência de falsos positivos. Em nosso estudo, consideramos que a estratégia foi eficaz se o seu uso não resulta em mais que 33% de falsos positivos. Mais à frente justificamos o uso deste procedimento.
(2) É possível diminuir a ocorrência de falsos positivos ao considerar as características de classes com responsabilidade bem definida do domínio? Como justificamos com os exemplos acima, observa-se que certos elementos do programa implementam um interesse recorrente de um domínio de aplicações; estes elementos podem apresentar características estruturais parecidas, que não são aplicáveis aos outros elementos do programa como um todo. Portanto, também verificamos se estratégias específicas para classes de um mesmo interesse seriam mais reutilizáveis do que as mesmas estratégias definidas para o programa como um todo.
Para responder essas questões, foi conduzido um estudo com múltiplos casos de programas do mesmo domínio. Esse estudo foi realizado para avaliar o reuso de sete estratégias de detecção, relacionadas a três tipos de anomalias recorrentes em um domínio específico.
B. Contexto de aplicação do estudo
O presente estudo foi conduzido em uma empresa de consultoria e desenvolvimento em sistemas de missão crítica. A empresa é dirigida por doutores e mestres em informática e foi fundada em 2000. Em 2010, a empresa absorveu um conjunto de sistemas web de apoio à tomada de decisão, originalmente desenvolvidos por outra empresa. Esse conjunto de sistemas opera em um domínio crítico, pois realiza a análise de indicadores para o mercado financeiro. O tempo de resposta e a precisão dos dados são importantes, pois a apresentação de uma análise errada pode gerar uma decisão errada e a consequente perda de valores financeiros. De forma a propiciar a confiabilidade deste sistema em longo prazo, o mesmo também deve permanecer manutenível. Caso contrário, as dificuldades de manutenção facilitarão a introdução de faltas nos programas ao longo do histórico do projeto. Além disso, a baixa manutenibilidade dificulta que a empresa se adapte a mudanças nas regras de negócio ou incorpore inovações, perdendo, assim, competitividade no mercado. A seguir, apresentamos várias características destes programas, algumas delas sinalizando a importância de manter a manutenibilidade dos mesmos através, por exemplo, da detecção de anomalias de código. Os seis sistemas escolhidos para o estudo estão divididos entre três equipes distintas. Segundo a Tabela I, cada equipe é responsável por dois sistemas e é representada por um líder. Cada líder participa do estudo como especialista do domínio (E1, E2 e E3). Além disso, oito programadores distintos compõem as três equipes que mantêm os seis sistemas. Os sistemas que fazem parte desse estudo possuem uma estrutura direcionada à operação de grande quantidade de dados. A partir desses dados é possível gerar indicadores para a tomada de decisão no mercado financeiro.
Os dados estão relacionados, por exemplo, com informações históricas de ativos financeiros e informações relacionadas à configuração e armazenamento de estruturas utilizadas pelos usuários. A partir das estruturas utilizadas pelos usuários é possível controlar: carteiras de ativos financeiros, tipos de relatório, variáveis utilizadas nos cálculos de indicadores, modos de interpolação de dados, entre outras informações.
TABELA I. COMPOSIÇÃO DAS EQUIPES QUE MANTÊM OS SISTEMAS USADOS NO ESTUDO
Sistemas        A e B          C e D      E e F
Especialistas   E1             E2         E3
Programadores   P1, P2 e P3    P4 e P5    P6, P7 e P8
Nesses sistemas, como a interface do usuário é bastante rica, também existem muitas classes que compõem os elementos visuais. Esses elementos recebem as requisições do usuário e dão início à geração de informações e ao processamento de dados. Ao final das operações necessárias, os dados são mostrados na interface e o usuário pode analisá-los através de gráficos e relatórios em diferentes formatos. O tempo de resposta das solicitações é fundamental para a tomada de decisões. Desse modo, algumas operações realizadas por esses sistemas utilizam tecnologias assíncronas e client-side – operações executadas diretamente no navegador do cliente como, por exemplo, JavaScript e jQuery. A manutenibilidade das classes destes programas também é importante para não acarretar potenciais efeitos colaterais ao desempenho. Ainda, existe um conjunto de classes que garante o controle de acesso às informações por meio de autenticação. A autenticação é necessária, pois existem restrições para os diferentes perfis de usuários. Além disso, um grande conjunto de classes é usado para refletir o modelo do banco de dados. Da mesma forma que em vários outros sistemas existentes, essas classes são necessárias para garantir a integridade das informações. Ainda, nesses sistemas, é importante garantir a frequente comunicação com serviços de terceiros. Esses serviços fornecem dados provenientes de algumas fontes de dados financeiros como, por exemplo, Bloomberg (www.bloomberg.com). Outro ponto importante para a escolha destes sistemas é a recorrência de conjuntos de elementos com responsabilidades bem definidas. Dessa forma, é possível garantir a proximidade estrutural dos conjuntos de classes dos sistemas em estudo, o que é fundamental para avaliar o reuso das estratégias e responder nossas duas questões de pesquisa (seção III.A). Além disso, através da recorrência desses conjuntos de elementos é possível avaliar o percentual de falsos positivos das estratégias, considerando as características específicas dos conjuntos de elementos.
C. Projeto do estudo
Segundo [13], um bom índice de acurácia de uma estratégia de detecção deveria estar acima dos 60%. De qualquer forma, o índice usado nesse estudo foi mais rigoroso e está um pouco acima do sugerido na literatura: 66%, isto é, dois terços de acertos nas detecções feitas por cada estratégia. A escolha do índice de acurácia de 66% também se deu pelo fato de que é possível garantir que, a cada três ocorrências identificadas pelas estratégias de detecção, apenas uma é classificada como falso positivo. Se o desenvolvedor encontra um número de erros (falsos positivos) maior que um terço das ocorrências, ele será desencorajado a reusar a estratégia em outro programa. Sendo assim, para avaliar se as estratégias de detecção de anomalias escolhidas podem ser reusadas com, no máximo, 33% de ocorrências de falsos positivos, foram definidas três etapas.
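A título de ilustração do critério acima, o esboço a seguir calcula o percentual de falsos positivos de uma estratégia a partir da classificação do especialista e verifica o limite de 33%; os números usados são apenas um exemplo hipotético.

# Esboço: percentual de falsos positivos = falsos positivos / ocorrências apontadas pela ferramenta.
def percentual_falsos_positivos(ocorrencias, falsos_positivos):
    return 100.0 * falsos_positivos / ocorrencias

def estrategia_eficaz(ocorrencias, falsos_positivos, limite=33.0):
    # A estratégia é considerada eficaz no sistema se o percentual não excede 33%.
    return percentual_falsos_positivos(ocorrencias, falsos_positivos) <= limite

print(percentual_falsos_positivos(12, 3), estrategia_eficaz(12, 3))  # 25.0 True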
O objetivo da primeira etapa, chamada de etapa de ajustes, é definir estratégias de detecção de anomalias que tenham percentual de falsos positivos abaixo de 33% para duas aplicações do domínio em estudo. A segunda etapa, chamada de etapa de reuso, tem por objetivo avaliar se as estratégias definidas na etapa de ajustes podem ser reusadas em outros quatro sistemas do mesmo domínio, com o resultado de falsos positivos ainda abaixo de 33%. Finalmente, a última etapa é chamada de etapa de análise por interesses do domínio. Esta tem como objetivo verificar se o percentual de falsos positivos das estratégias pode ser melhorado tendo em vista a aplicação das estratégias apenas em classes de um mesmo interesse recorrente nos programas do mesmo domínio. Nesse estudo, o percentual de falsos positivos é definido através da avaliação do especialista do domínio. Essa avaliação é realizada durante uma “sessão de investigação”. Em cada sessão de investigação realizada, as estratégias de detecção de anomalias são aplicadas a um dos sistemas do domínio. Assim, a partir de cada ocorrência indicada pela ferramenta de detecção, o especialista faz uma avaliação qualitativa, para indicar se a ocorrência é um falso positivo ou se realmente é um sintoma de problema para o domínio das aplicações em estudo. Dessa forma, o percentual de falsos positivos de uma estratégia de detecção é definido pelo nº de ocorrências classificadas como falso positivo pelo especialista do domínio, em relação ao nº de ocorrências identificadas pela ferramenta de detecção. Etapa de Ajustes. Na etapa de ajustes, os especialistas do domínio apoiaram as atividades de: (i) definição do domínio em estudo, para que fosse possível caracterizar os sistemas para os quais faria sentido avaliar o reuso de estratégias; (ii) escolha dos sistemas que caracterizam o domínio, para que fosse possível considerar sistemas que representam o domínio em estudo; e (iii) identificação dos interesses (responsabilidades) recorrentes do domínio, bem como do conjunto de classes que contribuem para a implementação de cada interesse. Em seguida, nessa mesma etapa, as definições de anomalias que são recorrentes na literatura [19] foram apresentadas aos especialistas do domínio. Isso foi feito para que fosse possível avaliar as anomalias que seriam interessantes investigar no domínio alvo, do ponto de vista dos especialistas. A partir da escolha das anomalias, foram definidas as estratégias de detecção de anomalias que seriam utilizadas. Conforme mencionado anteriormente, foram utilizadas estratégias definidas a partir da sugestão dos especialistas do domínio, além de estratégias conhecidas da literatura [2][13][16]. Neste último caso, os especialistas sugeriram refinamentos de limiares de acordo com experiências e observações feitas na etapa de ajustes. Ainda nesta etapa de ajustes, foi escolhida uma ferramenta de detecção de anomalias de código em que fosse possível avaliar as ocorrências identificadas pelas estratégias, tendo em vista o mapeamento das classes que implementavam cada interesse do domínio (conforme discutido acima). Também na etapa de ajustes, foram realizadas duas sessões de investigação, com a participação do especialista do domínio, para os dois sistemas escolhidos para essa etapa. A partir da classificação do especialista, foi possível definir o percentual de falsos positivos para cada uma das estratégias escolhidas, para cada um dos dois sistemas. 
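Apenas para ilustrar o critério descrito acima, segue um esboço mínimo em Java (nomes hipotéticos, sem relação com a implementação da ferramenta de detecção utilizada no estudo) do cálculo do percentual de falsos positivos de uma estratégia e da verificação do limiar de 33%:

// Esboço ilustrativo: percentual de falsos positivos de uma estratégia em um sistema,
// a partir da classificação feita pelo especialista do domínio.
class AvaliacaoEstrategia {
    // nOcorrencias: nº de ocorrências apontadas pela ferramenta de detecção
    // nFalsosPositivos: nº de ocorrências classificadas como falso positivo pelo especialista
    static double percentualFalsosPositivos(int nOcorrencias, int nFalsosPositivos) {
        if (nOcorrencias == 0) return 0.0;
        return 100.0 * nFalsosPositivos / nOcorrencias;
    }

    // Critério adotado no estudo: a estratégia é eficaz se gera no máximo 33% de falsos positivos.
    static boolean reusoEficaz(int nOcorrencias, int nFalsosPositivos) {
        return percentualFalsosPositivos(nOcorrencias, nFalsosPositivos) <= 33.0;
    }
}

Por exemplo, 6 falsos positivos em 27 ocorrências correspondem a cerca de 22%, abaixo do limiar adotado.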
Finalizando a etapa de ajustes, verificamos as estratégias que resultaram em no máximo 33% de falsos positivos (na média). Dessa forma, as estratégias que não excederam esse limiar para os dois sistemas foram aplicadas na etapa seguinte, chamada etapa de reuso.
Etapa de Reuso. Como mencionamos, o objetivo da etapa de reuso é avaliar se as estratégias definidas na etapa de ajustes podem ser reusadas em outros quatro sistemas do mesmo domínio, com o resultado de falsos positivos ainda abaixo de 33%. Nessa etapa, o reuso das estratégias é definido através dos seguintes critérios:
• Reuso total: a estratégia foi aplicada nos sistemas do domínio e resultou diretamente em no máximo 33% de falsos positivos, em todos os sistemas;
• Reuso parcial: a estratégia foi aplicada nos sistemas do domínio, porém o percentual de falsos positivos excedeu 33% em um ou dois sistemas; isto é, a estratégia foi reusada de forma eficaz em, pelo menos, metade dos programas;
• Nenhum reuso: nesse caso, a estratégia foi aplicada nos sistemas do domínio e o percentual de falsos positivos excedeu 33% em mais de dois sistemas; isto é, a estratégia foi reusada de forma eficaz em menos da metade dos programas.
Da mesma forma, na etapa de reuso, o percentual de falsos positivos para todas as estratégias de detecção é determinado pela avaliação qualitativa do especialista do domínio. Dessa forma, o percentual de falsos positivos para cada estratégia de detecção é definido pelo nº de ocorrências classificadas como falso positivo pelo especialista do domínio, em relação ao nº total de ocorrências identificadas pela ferramenta de detecção. Assim, foram realizadas quatro sessões de investigação, sendo uma para cada um dos quatro sistemas escolhidos para essa etapa. A partir dos resultados das quatro sessões de investigação, procurou-se identificar quais estratégias tiveram um reuso total. Dessa forma, essa etapa procura indícios das estratégias que tiveram bons resultados, considerando o domínio das aplicações em estudo. Depois, em um segundo momento, foram investigados os casos em que o percentual de falsos positivos esteve acima de 33%. Nesses casos, procuramos entender quais fatores influenciaram a alta ocorrência de falsos positivos. Nesse sentido, foi realizada uma investigação nos valores das métricas dos elementos identificados, para observar quais fatores desmotivaram o reuso da estratégia para os seis sistemas do domínio em estudo.
Etapa de Interesses do Domínio. Para finalizar o estudo, a etapa chamada de etapa de interesses tem como objetivo verificar se o percentual de falsos positivos das estratégias diminui ao considerar a aplicação das estratégias apenas em elementos de cada interesse recorrente nos sistemas do mesmo domínio. Através da última etapa, investigamos se seria possível diminuir o percentual de falsos positivos ao aplicar as estratégias de detecção a um conjunto de elementos com responsabilidades bem definidas.
Anomalias Investigadas. Nesse estudo as anomalias investigadas foram definidas juntamente com o especialista do domínio. Dessa forma, é possível investigar anomalias que são interessantes do ponto de vista de quem acompanha o dia a dia do desenvolvimento dos sistemas do domínio. Nesse estudo foram investigadas: uma anomalia em nível de classe, uma anomalia em nível de método e uma anomalia relacionada a mudanças. São elas, nessa ordem:
1) God Class (seção II.A): com o passar do tempo, é mais cômodo colocar apenas um método em uma classe que já existe do que criar uma classe nova.
Dessa forma, é preciso evitar classes que concentram muito conhecimento, isto é, classes com várias responsabilidades distintas, chamadas de God Classes;
2) Long Method (seção II.A): da mesma forma que a anomalia anterior, existem métodos que acabam concentrando muita lógica do domínio. Assim, é importante identificar métodos que concentram muito conhecimento e dificultam a compreensão e manutenção do programa. Ocorrências de anomalias como estas são chamadas de Long Methods;
3) Shotgun Surgery (seção II.A): para prevenir que uma mudança em um método possa gerar várias pequenas mudanças em outros elementos do código, é preciso evitar que um método possua relação com vários outros métodos dispersos na aplicação. Caso esse relacionamento disperso ocorra, podemos ter ocorrências de anomalias chamadas de Shotgun Surgeries.
Estratégias de Detecção Escolhidas. A partir das anomalias escolhidas, as estratégias de detecção definidas para o estudo foram concebidas em conjunto com o especialista. A partir da discussão com o especialista, foram escolhidas e calibradas estratégias de detecção conhecidas da literatura (Seção III.A). Foram formadas também novas estratégias de detecção, tendo como orientação o processo de formação de estratégias de detecção proposto por [2] (seção II.B). Para definir as estratégias em conjunto com os especialistas, foi necessário decidir quais métricas identificam os sintomas que devem ser evitados, tendo em vista as características do domínio em estudo. Dessa forma, para definir as estratégias que avaliam God Class, os especialistas sugeriram uma métrica relacionada ao tamanho e uma métrica relacionada ao acoplamento. Além disso, os especialistas sugeriram que fosse possível variar a métrica de tamanho para avaliar qual estratégia poderia apresentar melhores resultados, tendo em vista os sistemas do domínio em estudo. Depois, para identificar Long Method, os especialistas do domínio sugeriram que fossem usadas uma métrica de tamanho e uma métrica de complexidade. Por último, para identificar Shotgun Surgery, os especialistas sugeriram uma métrica de complexidade e uma métrica de acoplamento. Depois de definir estratégias de detecção em conjunto com os especialistas do domínio, foram escolhidas três estratégias de detecção, a partir da literatura. Dessa forma, cada uma das estratégias da literatura está relacionada a uma das anomalias escolhidas para o estudo. Além disso, os limiares usados para todas as estratégias escolhidas para o estudo foram definidos segundo as opiniões dos três especialistas. Assim, as estratégias escolhidas na fase de ajustes, para detecção de anomalias definidas pelos especialistas, para o domínio em estudo, são apresentadas nas Tabelas II e III.
Ferramenta de Detecção Escolhida. Entre as ferramentas disponíveis para a detecção de anomalias de código, diversas são baseadas nas estratégias de detecção propostas em [2]. Mesmo assim, para que fosse possível realizar um estudo de estratégias através do mapeamento de interesses, foi necessário escolher uma ferramenta que possibilitasse essa análise. Assim, a ferramenta escolhida foi SCOOP (Smells Co-Occurrences Pattern Analyzer) [15]. Além disso, SCOOP já foi usada com sucesso em estudos empíricos anteriores, tais como aqueles reportados em [16][17].
TABELA II. ESTRATÉGIAS DE DETECÇÃO SUGERIDAS INTEIRAMENTE PELOS ESPECIALISTAS
Anomalia | Estratégia
God Class EspLoc | (LOC > 150) and (CBO > 6)
God Class EspNom | (NOM > 15) and (CBO > 6)
Long Method Esp | (LOC > 50) and (CC > 5)
Shotgun Surgery Esp | (CC > 7) and (AM > 7)
a. Accessed Methods (AM) representa a quantidade de métodos externos utilizados por um método [2].
TABELA III. ESTRATÉGIAS DE DETECÇÃO SUGERIDAS NA LITERATURA COM LIMIARES AJUSTADOS PELOS ESPECIALISTAS
Anomalia | Estratégia
God Class Lit | (ATFD > 5) and (WMC > 46) and (TCC < 33)
Long Method Lit | (LOC > 50) and (CC > 6) and (MaxNesting > 5) and (NOAV > 3)
Shotgun Surgery Lit | (FanOut > 16)
b. Access to Foreign Data (ATFD) representa o nº de atributos de classes externas, que são acessados diretamente ou através de métodos de acesso [12].
c. Weighted Method Count (WMC) representa a soma da complexidade ciclomática de todos os métodos de uma classe [22][23].
d. Tight Class Cohesion (TCC) representa o nº relativo de pares de métodos de uma classe que acessam em comum pelo menos um atributo da classe avaliada [24].
e. Number of Accessed Variables (NOAV) representa o nº total de variáveis acessadas diretamente pelo método avaliado [2].
Interesses Mapeados nos Sistemas em Estudo. Para avaliar se é possível diminuir a ocorrência de falsos positivos, ao considerar as características de conjuntos de elementos com responsabilidades bem definidas, foi necessário realizar o mapeamento dos interesses em classes dos seis sistemas, através do acompanhamento dos especialistas do domínio. Durante o mapeamento dos interesses, primeiro foram observados os interesses mais gerais, como, por exemplo, interface, persistência e recursos auxiliares. Em seguida, foi realizado o mapeamento dos interesses relacionados especificamente ao domínio das aplicações. Através do acompanhamento dos especialistas do domínio pôde-se garantir a identificação dos elementos de código para cada um dos interesses mapeados para o domínio. Segundo os especialistas do domínio, existia, de fato, um conjunto razoável de interesses recorrentes do domínio. A partir da Tabela IV é possível observar o grau de recorrência dos interesses mapeados nos sistemas escolhidos para o estudo. Os interesses escolhidos são representados por letras minúsculas na tabela, mas nomeados em próximas subseções do artigo. A Tabela V descreve o tamanho dos sistemas escolhidos para o estudo, em número de linhas de código (NLOC) e número de classes. Mesmo que os sistemas variem em tamanho, segundo os especialistas do domínio, a proximidade estrutural dos sistemas é observada nos conjuntos de classes mapeadas para interesses recorrentes nos sistemas.
TABELA IV. RELAÇÃO DOS INTERESSES MAPEADOS DOS SISTEMAS USADOS NO ESTUDO
(interesses a–f mapeados em todos os seis sistemas A–F; interesses g–j mapeados apenas em parte dos sistemas)
TABELA V. DESCRIÇÃO DO TAMANHO DOS SISTEMAS USADOS NO ESTUDO
Sistema | NLOC | Nº de classes
A | 21599 | 161
B | 10011 | 81
C | 12504 | 130
D | 5935 | 41
E | 31766 | 150
F | 21602 | 149
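Apenas como ilustração do formato das estratégias das Tabelas II e III, segue um esboço mínimo em Java (nomes e estrutura hipotéticos, sem relação com a implementação da ferramenta SCOOP): uma estratégia de detecção pode ser vista como um predicado sobre as métricas de uma classe.

// Esboço ilustrativo: uma estratégia de detecção expressa como predicado sobre métricas.
// Os limiares reproduzem a estratégia God Class EspLoc da Tabela II; a classe Metricas é hipotética.
class Metricas {
    int loc;  // linhas de código (LOC)
    int cbo;  // acoplamento entre objetos (CBO)
    Metricas(int loc, int cbo) { this.loc = loc; this.cbo = cbo; }
}

interface Estrategia {
    boolean detecta(Metricas m);
}

class GodClassEspLoc implements Estrategia {
    public boolean detecta(Metricas m) {
        return m.loc > 150 && m.cbo > 6; // (LOC > 150) and (CBO > 6)
    }
}

Nesse esboço, uma ocorrência seria reportada, por exemplo, para uma classe com LOC = 200 e CBO = 8.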
IV. RESULTADOS E DISCUSSÕES
Essa seção apresenta os resultados do estudo. A seção IV.A apresenta os resultados da etapa de ajustes. A seção IV.B apresenta os resultados sobre o reuso das estratégias e a seção IV.C apresenta os resultados sobre o percentual de falsos positivos ao considerar o mapeamento de interesses. Nas tabelas a seguir, a coluna “NO/FP” indica os valores de: nº de ocorrências de anomalias encontradas pelas estratégias / nº de falsos positivos identificados pelo especialista do domínio. A coluna “%FP” indica o percentual de falsos positivos identificados pelo especialista do domínio, em relação ao total de ocorrências de anomalias encontradas pelas estratégias. Destacamos em negrito os percentuais de falsos positivos acima de 33% para facilitar a identificação dos casos em que o resultado da estratégia não foi eficaz.
A. Resultado da fase de ajustes
Como mencionado, nesta fase avaliamos o percentual de falsos positivos de cada estratégia, de acordo com o julgamento do especialista. A Tabela VI apresenta o número e o percentual de falsos positivos indicados pelo especialista para cada estratégia, quando aplicadas aos sistemas A e B. A escolha desses sistemas para a fase de ajustes se deve especificamente à disponibilidade imediata do especialista E1 (Tabela I). Através da Tabela VI é possível perceber que apenas uma das estratégias (God Class Lit) excedeu, na média dos dois sistemas, o percentual de falsos positivos (33%) proposto para o estudo. Isso significa que, embora as métricas utilizadas sejam recorrentes da literatura, os limiares propostos pelos especialistas não foram muito bons.
TABELA VI. RESULTADO DA FASE DE AJUSTES
Estratégia | Sistema A NO/FP | %FP | Sistema B NO/FP | %FP
God Class EspLoc | 27/6 | 22% | 17/7 | 41%
God Class EspNom | 15/3 | 20% | 5/1 | 20%
God Class Lit | 4/3 | 75% | 2/0 | 0%
Long Method Esp | 30/0 | 0% | 19/3 | 16%
Long Method Lit | 1/0 | 0% | 0/0 | 0%
Shotgun Surgery Lit | 61/12 | 20% | 25/6 | 24%
Shotgun Surgery Esp | 21/0 | 0% | 7/1 | 14%
TABELA VII. ESTRATÉGIAS DE DETECÇÃO DA FASE DE REUSO
Anomalia | Estratégia
God Class EspLoc | (LOC > 150) and (CBO > 6)
God Class EspNom | (NOM > 15) and (CBO > 6)
God Class Lit | (ATFD > 6) and (WMC > 46) and (TCC < 11)
Long Method Esp | (LOC > 50) and (CC > 5)
Long Method Lit | (LOC > 50) and (CC > 6) and (MaxNesting > 5) and (NOAV > 3)
Shotgun Surgery Lit | (FanOut > 16)
Shotgun Surgery Esp | (CC > 7) and (AM > 7)
Como um exercício, porém, investigamos se seria possível reduzir o percentual de falsos positivos para a estratégia God Class Lit apenas alterando os limiares dos seus componentes, preferencialmente sem criar novos falsos negativos. Neste caso, observamos que sim, realmente foi possível reduzir para 0% o percentual de falsos positivos da anomalia God Class Lit. Na fase de reuso, como veremos a seguir, as estratégias foram então reaplicadas a outros quatro sistemas do mesmo domínio. Objetivamos na fase de reuso observar se é possível reusar as estratégias diretamente (sem modificação nos limiares) em outros sistemas do mesmo domínio.
B. Resultado da etapa de reuso
Na segunda fase, as estratégias apresentadas anteriormente (Tabela VII) foram aplicadas aos sistemas C, D, E e F. As estratégias God Class EspLoc e God Class EspNom, quando aplicadas ao sistema D, resultaram em um percentual de falsos positivos de 80%. A estratégia Shotgun Surgery Lit, quando aplicada ao sistema C, resultou em 76% de falsos positivos. Mesmo assim, nenhuma das estratégias definidas para a segunda fase resultou em mais do que 30% de falsos positivos, quando aplicadas aos sistemas A e E. A partir da Tabela VIII, é importante observar então que God Class Lit e Long Method Lit mantiveram os resultados abaixo de 33% para todos os sistemas avaliados. As estratégias que não sofreram qualquer adaptação, por outro lado, variaram um pouco em termos do percentual de falsos positivos.
De forma geral, é possível perceber que houve um reuso satisfatório (83%) tanto das estratégias definidas em conjunto com os especialistas (God Class EspLoc e EspNom, Long Method Esp e Shotgun Surgery Esp) quanto das estratégias com limiares definidos na literatura (God Class Lit, Long Method Lit e Shotgun Surgery Lit). Pode-se concluir pelos resultados desta fase de análise do reuso que existe certa tendência de comportamento padrão entre sistemas de um mesmo domínio, apesar de uns poucos casos peculiares que encorajaram e desencorajaram futuras adaptações nos limiares.
TABELA VIII. OCORRÊNCIAS DE FALSOS POSITIVOS E ANOMALIAS NA SEGUNDA FASE (NO/FP e %FP por sistema)
Estratégia | A | B | C | D | E | F
God Class EspLoc | 27/6 (22%) | 17/7 (41%) | 17/7 (41%) | 10/8 (80%) | 30/5 (17%) | 24/7 (29%)
God Class EspNom | 15/3 (20%) | 5/1 (20%) | 6/2 (33%) | 5/4 (80%) | 10/3 (30%) | 8/4 (50%)
God Class Lit | 4/3 (0%) | 2/0 (0%) | 0/0 (0%) | 0/0 (0%) | 3/0 (0%) | 2/0 (0%)
Long Method Esp | 30/0 (0%) | 19/3 (16%) | 6/1 (17%) | 5/2 (40%) | 40/2 (5%) | 26/3 (12%)
Long Method Lit | 1/0 (0%) | 0/0 (0%) | 0/0 (0%) | 0/0 (0%) | 4/0 (0%) | 0/0 (0%)
Shotgun Surgery Lit | 61/12 (20%) | 25/6 (24%) | 17/13 (76%) | 13/1 (8%) | 48/1 (2%) | 44/2 (5%)
Shotgun Surgery Esp | 21/0 (0%) | 7/2 (28%) | 0/0 (0%) | 1/0 (0%) | 12/0 (0%) | 9/0 (0%)
Aplicando novas adaptações nos limiares, observamos que certas características comuns entre os sistemas certamente podem influenciar positivamente no grau de reuso das estratégias. Por exemplo, a estratégia God Class EspNom, quando aplicada ao sistema F, gerou um conjunto de falsos positivos em que, em 75% dos casos, o valor do componente CBO é igual a 10. Neste caso, alterando o limiar do componente CBO para 10, o percentual de falsos positivos cai para 20% no caso do sistema F e aumenta para 27% no caso do sistema A. No entanto, para o sistema B o percentual de falsos positivos cai para 0%, para o sistema E cai para 12% e para o sistema D cai para 50%. Mesmo com uma piora no caso do sistema C, para 40%, um pequeno ajuste nos limiares mostrou um melhor equilíbrio entre os sistemas de um mesmo domínio. Outro exemplo exige um ajuste mais criterioso. Considerando a estratégia God Class EspNom, quando aplicada ao sistema D, ela gerou um conjunto de falsos positivos em que, em 100% dos casos, o valor do componente CBO é menor que 18. Neste caso, alterando o limiar do componente CBO para 18, o percentual de falsos positivos cai para 0% nos sistemas B, D, E e F. Mesmo assim, o percentual de falsos positivos se mantém no sistema C e aumenta para 25% no sistema A. Dessa forma, com um ajuste mais criterioso é possível diminuir para 0% o percentual de falsos positivos em quatro dos seis sistemas. Em um segundo caso, analisando o resultado da aplicação da estratégia God Class EspLoc nos sistemas C e D, constatamos um número de falsos positivos relativamente maior do que para os demais sistemas (E e F), nos quais ela apresentou valores de falsos positivos menores que 33%.
C. Resultados da etapa de interesses
Ao avaliar a segunda questão de pesquisa, investigou-se a possibilidade de diminuir a ocorrência de falsos positivos das estratégias de detecção. Observamos se tal diminuição pode ocorrer caso fossem definidas estratégias para as classes de cada interesse do domínio. Com este propósito, nós aplicamos cada uma das estratégias de detecção, apresentadas anteriormente, em classes de cada interesse. As mesmas métricas e limiares foram mantidos. Desta forma, conseguimos observar se: (i) haveria potencial benefício em utilizar estratégias de detecção para cada interesse do domínio, caso observado quando as estratégias tiveram um percentual de falsos positivos maior do que 33%; e (ii) foi suficiente o uso de estratégias no programa como um todo, caso observado quando as estratégias tiveram um percentual de falsos positivos menor do que 33%.
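A título de ilustração desse procedimento (esboço hipotético em Java, sem relação com a implementação da SCOOP), aplicar uma estratégia apenas às classes de um interesse equivale a filtrar o conjunto de classes antes de avaliar o predicado da estratégia:

// Esboço ilustrativo: aplica uma estratégia somente às classes mapeadas para um interesse.
// ClasseInfo e o mapeamento de interesses são hipotéticos; os limiares são os da estratégia
// God Class EspNom (Tabela VII): (NOM > 15) and (CBO > 6).
import java.util.List;
import java.util.stream.Collectors;

class ClasseInfo {
    String nome;
    String interesse; // ex.: "Persistência", "Interface"
    int nom;          // número de métodos (NOM)
    int cbo;          // acoplamento (CBO)
}

class AnalisePorInteresse {
    static List<ClasseInfo> ocorrencias(List<ClasseInfo> classes, String interesse) {
        return classes.stream()
                .filter(c -> c.interesse.equals(interesse))
                .filter(c -> c.nom > 15 && c.cbo > 6)
                .collect(Collectors.toList());
    }
}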
As tabelas a seguir apresentam o número de ocorrências (NO) e a percentagem de falsos positivos (FP) para cinco estratégias da fase anterior. Duas delas tiveram 0% de falsos positivos na ampla maioria dos casos. A partir dos resultados apresentados nas Tabelas IX a XIII, percebe-se que não haveria necessidade de especialização das estratégias para cada interesse: (i) tanto para os casos dos interesses Autenticação/Segurança e Auxiliar, que são mais gerais (isto é, podem ocorrer frequentemente em aplicações de outros domínios), (ii) como para os interesses Ações, Engine e Serviços, que são características mais específicas deste domínio. Nesse sentido, ajustar os limiares para as estratégias considerando o mapeamento de interesses não seria benéfico para reduzir significativamente o percentual de falsos positivos nos casos acima. Por outro lado, note que o contrário pode ser dito para o caso dos interesses Persistência, Interface, Indicadores e Tarefas. Em todos estes casos de interesses, nota-se nas tabelas que os números de falsos positivos, independentemente da anomalia analisada, estão bem acima do limiar de 33% em vários casos.
TABELA IX. OCORRÊNCIAS DA ESTRATÉGIA SHOTGUN SURGERY LIT VISANDO O MAPEAMENTO DE INTERESSES
Interesse | NO/FP | %FP
Ações | 83/15 | 18%
Autenticação/segurança | 1/0 | 0%
Auxiliar | 81/12 | 15%
Engine | 7/1 | 14%
Exceção | 1/0 | 0%
Indicadores | 1/1 | 100%
Interface | 5/2 | 40%
Persistência | 16/5 | 31%
Serviços | 11/2 | 18%
Tarefas | 2/1 | 50%
TABELA X. OCORRÊNCIAS DA ESTRATÉGIA GOD CLASS ESPLOC VISANDO O MAPEAMENTO DE INTERESSES
Interesse | NO/FP | %FP
Ações | 30/7 | 23%
Autenticação/segurança | 2/0 | 0%
Auxiliar | 52/12 | 23%
Engine | 1/0 | 0%
Interface | 10/6 | 60%
Persistência | 18/11 | 61%
Serviços | 10/3 | 30%
Tarefas | 2/1 | 50%
TABELA XI. OCORRÊNCIAS DA ESTRATÉGIA LONG METHOD ESP VISANDO O MAPEAMENTO DE INTERESSES
Interesse | NO/FP | %FP
Ações | 46/3 | 7%
Autenticação/segurança | 2/0 | 0%
Auxiliar | 52/5 | 10%
Engine | 4/0 | 0%
Persistência | 13/5 | 38%
Serviços | 7/0 | 0%
Tarefas | 2/0 | 0%
TABELA XII. OCORRÊNCIAS DA ESTRATÉGIA GOD CLASS ESPNOM VISANDO O MAPEAMENTO DE INTERESSES
Interesse | NO/FP | %FP
Ações | 6/2 | 33%
Auxiliar | 19/3 | 16%
Indicadores | 2/1 | 50%
Interface | 10/6 | 60%
Persistência | 7/6 | 86%
Serviços | 4/0 | 0%
Tarefas | 1/0 | 0%
TABELA XIII. OCORRÊNCIAS DA ESTRATÉGIA SHOTGUN SURGERY ESP VISANDO O MAPEAMENTO DE INTERESSES
Interesse | NO/FP | %FP
Ações | 24/0 | 0%
Auxiliar | 16/2 | 13%
Engine | 3/0 | 0%
Persistência | 2/0 | 0%
Serviços | 5/0 | 0%
D. Trabalhos relacionados
Em 2011 [19], Zhang, Hall e Baddoo realizaram uma revisão sistemática da literatura, para descrever o estado da arte sobre anomalias de código e refatoração. Esse trabalho foi baseado em artigos de conferências e revistas publicados entre 2000 e junho de 2009. Segundo os autores, poucos trabalhos relatam estudos empíricos sobre a detecção de anomalias. A grande maioria dos trabalhos tem o objetivo de mostrar novas ferramentas e métodos para apoiar a detecção de anomalias. Em 2010 [18], Guo, Seaman, Zazworka e Shull propuseram a análise de características do domínio, para a adaptação das estratégias de detecção de anomalias. Esse trabalho foi realizado em um ambiente real de manutenção de sistemas. Além disso, a adaptação dos limiares das estratégias foi apoiada pela análise de especialistas do domínio.
Mesmo assim, esse trabalho não avalia o reuso das estratégias de detecção para outras aplicações do mesmo domínio. Em 2012 [28], Ferreira, Bigonha, Bigonha, Mendes e Almeida identificaram limiares para métricas de software orientado a objetos. Esse trabalho foi realizado em 40 sistemas Java, baixados a partir do SourceForge (www.sourceforge.net). Nesse trabalho foram identificados limiares para seis métricas, para onze domínios de aplicações. A partir desse trabalho é necessário investigar o reuso desses limiares em projetos da indústria. Em 2012 [29], Fontana, Braione e Zanoni revisaram o cenário atual das ferramentas de detecção automática de anomalias. Para isso, realizaram a comparação de quatro ferramentas de detecção, em seis versões de projetos de software de tamanho médio. Segundo os autores, é interessante refinar o uso das estratégias, considerando informações do domínio dos sistemas analisados. Ainda, existe um esforço manual para avaliar as anomalias que são caracterizadas como falsos positivos. Nesse sentido, percebe-se o esforço investido na adaptação das estratégias de detecção. Dessa forma, torna-se motivador investigar estratégias de detecção que possam ser reusadas com sucesso.
E. Ameaças à validade
Ameaças à Validade de Construto. Durante o experimento, os três especialistas do domínio participaram da definição das características do domínio em estudo, da escolha dos seis sistemas, do mapeamento de interesses de cada sistema, da escolha das anomalias, da definição das estratégias e dos limiares e da classificação das ocorrências de anomalias. Ao avaliar um domínio específico, é necessária a participação de alguém que vive o desenvolvimento neste domínio no seu dia a dia. Além disso, os especialistas possuem conhecimento sobre boas práticas e mais de dois anos de experiência profissional prévia no domínio escolhido.
Validade de Conclusão e Validade Externa. Para a conclusão do estudo, o percentual de falsos positivos das estratégias é avaliado a partir da relação entre a quantidade de falsos positivos classificados pelos especialistas e a quantidade de ocorrências identificadas pela ferramenta. O limiar que define o reuso das estratégias é de 33% de falsos positivos. Dessa forma, é possível garantir que a estratégia identifica, no máximo, um falso positivo a cada três ocorrências. Para amenizar as ameaças à validade externa, é importante ratificar que os seis sistemas em estudo foram escolhidos a partir da especificação do domínio em estudo. Ainda, a escolha dos sistemas teve o apoio de especialistas que possuem mais de dois anos de experiência no domínio.
V. CONCLUSÕES
Para que fosse possível investigar o reuso de estratégias de detecção em vários projetos de software do mesmo domínio, foi conduzido um estudo de múltiplos casos da indústria. O estudo investigou o reuso de sete estratégias de detecção, relacionadas a três anomalias, em seis projetos de um domínio específico. Segundo o nosso estudo, em alguns casos, o reuso das estratégias de detecção pode ser melhorado, se aplicadas a programas do mesmo domínio, sem gerar um efeito colateral. Mesmo assim, em outros casos, para realizar uma melhoria no reuso das estratégias, é possível que sejam criados falsos negativos. No total, dos sete casos que excederam o limiar de 33%, em quatro casos existe pelo menos um cenário onde duas classes com estruturas similares foram classificadas uma como anomalia e outra como falso positivo.
Isso mostrou que em certos casos é impossível definir um limiar que elimine boa parte dos falsos positivos sem gerar falsos negativos. Como uma consequência direta, pode-se afirmar que existe um limite no grau de reuso das estratégias, isto é, uma nova adaptação na tentativa de diminuir o percentual de falsos positivos pode aumentar o número de falsos negativos. Além disso, percebe-se que não existe a necessidade de especialização das estratégias tanto para interesses mais gerais, como Autenticação/Segurança e Auxiliar, quanto para interesses que são características mais específicas deste domínio, como Ações, Engine e Serviços. Nesse sentido, ajustar os limiares para as estratégias considerando o mapeamento de interesses não seria benéfico para reduzir significativamente o percentual de falsos positivos nos casos acima. Por outro lado, o contrário pode ser dito para o caso dos interesses Persistência, Interface, Indicadores e Tarefas. Ainda, a partir dos resultados, percebeu-se que duas estratégias de detecção de anomalias escolhidas a partir da literatura resultaram em 0% de falsos positivos em todos os casos em que encontraram ocorrências. Mesmo assim, essas estratégias não detectaram algumas ocorrências identificadas pelas estratégias mais simples, para a mesma anomalia. Essas ocorrências das estratégias mais simples já haviam sido classificadas pelo especialista do domínio e não eram falsos positivos. Essa evidência motiva trabalhos futuros sobre a variedade da complexidade das estratégias de detecção de anomalias. Ainda, como trabalho futuro, o presente trabalho pode ser estendido a outros cenários (porém não limitado a eles), como: (i) a investigação de estratégias de detecção com reuso em outros domínios e (ii) a investigação de outras estratégias de detecção neste e em outros domínios.
REFERÊNCIAS
[1] M. Fowler: “Refactoring: Improving the Design of Existing Code”. New Jersey: Addison Wesley, 1999. 464 p.
[2] R. Marinescu, M. Lanza: “Object-Oriented Metrics in Practice”. Springer, 2006. 206 p.
[3] N. Tsantalis, T. Chaikalis, A. Chatzigeorgiou: “JDeodorant: Identification and removal of typechecking bad smells”. In Proceedings of CSMR 2008, pp. 329–331.
[4] PMD. Disponível em http://pmd.sourceforge.net/.
[5] iPlasma. Disponível em http://loose.upt.ro/reengineering/research/iplasma
[6] InFusion. Disponível em http://www.intooitus.com/inFusion.html.
[7] E. Murphy-Hill, A. Black: “An interactive ambient visualization for code smells”. In Proceedings of SOFTVIS '10, USA, October 2010.
[8] F. Fontana, E. Mariani, A. Morniroli, R. Sormani, A. Tonello: “An Experience Report on Using Code Smells Detection Tools”. IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2011.
[9] F. Fontana, V. Ferme, S. Spinelli: “Investigating the impact of code smells debt on quality code evaluation”. Third International Workshop on Managing Technical Debt, 2012.
[10] E. Emden, L. Moonen: “Java quality assurance by detecting code smells”. In Proceedings of the 9th Working Conference on Reverse Engineering, 2002.
[11] N. Fenton, S. Pfleeger: “Software metrics: a rigorous and practical approach”. PWS Publishing Co., 1998.
[12] R. Marinescu: “Measurement and Quality in Object-Oriented Design”. Proceedings of the 21st IEEE International Conference on Software Maintenance, 2005.
[13] R. Marinescu: “Detection strategies: Metrics-based rules for detecting design flaws”. Proceedings of the 20th IEEE International Conference on Software Maintenance, 2004.
[14] N. Moha, Y.
Guéhéneuc, A. Meur, L. Duchien, A. Tiberghien: “From a domain analysis to the specification and detection of code and design smells”. Formal Aspects of Computing, 2009.
[15] SCOOP. Disponível em: http://www.inf.puc-rio.br/~ibertran/SCOOP/
[16] I. Macia, J. Garcia, D. Popescu, A. Garcia, N. Medvidovic, A. von Staa: “Are Automatically-Detected Code Anomalies Relevant to Architectural Modularity? An Exploratory Analysis of Evolving Systems”. In Proceedings of the 11th International Conference on Aspect-Oriented Software Development (AOSD'12), Potsdam, Germany, March 2012.
[17] I. Macia, A. Garcia, A. von Staa: “An Exploratory Study of Code Smells in Evolving Aspect-Oriented Systems”. Proceedings of the 10th International Conference on Aspect-Oriented Software Development, 2011.
[18] Y. Guo, C. Seaman, N. Zazworka, F. Shull: “Domain-specific tailoring of code smells: an empirical study”. Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, 2010.
[19] M. Zhang, T. Hall e N. Baddoo: “Code bad smells: a review of current knowledge”. Journal of Software Maintenance and Evolution: Research and Practice, 23(3), 179-202, 2011.
[20] I. Macia, R. Arcoverde, A. Garcia, C. Chavez e A. von Staa: “On the Relevance of Code Anomalies for Identifying Architecture Degradation Symptoms”. In Software Maintenance and Reengineering (CSMR), 2012 16th European Conference on (pp. 277-286). IEEE.
[21] L. Hochstein e M. Lindvall: “Combating architectural degeneration: a survey”. Information and Software Technology 47.10 (2005): 643-656.
[22] T. J. McCabe: “A Complexity Measure”. IEEE Transactions on Software Engineering, 2(4):308–320, 1976.
[23] S. R. Chidamber e C. F. Kemerer: “A metrics suite for object oriented design”. Software Engineering, IEEE Transactions on, v. 20, n. 6, p. 476-493, 1994.
[24] J. Bieman e B. Kang: “Cohesion and reuse in an object-oriented system”. In Proceedings ACM Symposium on Software Reusability, 1995.
[25] S. Olbrich, D. S. Cruzes, V. Basili e N. Zazworka: “The evolution and impact of code smells: A case study of two open source systems”. In Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement (pp. 390-400). IEEE Computer Society, 2009.
[26] F. Khomh, M. Di Penta e Y. G. Guéhéneuc: “An exploratory study of the impact of code smells on software change-proneness”. In Reverse Engineering, WCRE'09, 16th Working Conference on (pp. 75-84). IEEE, 2009.
[27] A. Lozano, M. Wermelinger e B. Nuseibeh: “Assessing the impact of bad smells using historical information”. In Ninth International Workshop on Principles of Software Evolution: in conjunction with the 6th ESEC/FSE joint meeting (pp. 31-34). ACM, 2007.
[28] K. A. Ferreira, M. A. Bigonha, R. S. Bigonha, L. F. Mendes e H. C. Almeida: “Identifying thresholds for object-oriented software metrics”. Journal of Systems and Software, 85(2), 244-257, 2012.
[29] F. Fontana, P. Braione, M. Zanoni: “Automatic detection of bad smells in code: An experimental assessment”. Publicação eletrônica em JOT: Journal of Object Technology, v. 11, n. 2, ago. 2012.
F3T: From Features to Frameworks Tool
Matheus Viana, Rosângela Penteado, Antônio do Prado
Department of Computing, Federal University of São Carlos, São Carlos, SP, Brazil
Email: {matheus viana, rosangela, prado}@dc.ufscar.br
Rafael Durelli
Institute of Mathematical and Computer Sciences, University of São Paulo, São Carlos, SP, Brazil
[email protected]
Abstract—Frameworks are used to enhance the quality of applications and the productivity of the development process, since applications can be designed and implemented by reusing framework classes. However, frameworks are hard to develop, learn and reuse, due to their adaptive nature. In this paper we present the From Features to Framework Tool (F3T), which supports the development of frameworks in two steps: Domain Modeling, in which the features of the framework domain are modeled; and Framework Construction, in which the source-code and the Domain-Specific Modeling Language (DSML) of the framework are generated from the features. In addition, the F3T also supports the use of the framework DSML to model applications and generate their source-code. The F3T has been evaluated in an experiment that is also presented in this paper.
I. INTRODUCTION
Frameworks are reusable software composed of abstract classes implementing the basic functionality of a domain. When an application is developed through framework reuse, the functionality provided by the framework classes is complemented with the application requirements. As the application is not developed from scratch, the time spent in its development is reduced and its quality is improved [1]–[3]. Frameworks are often used in the implementation of common application requirements, such as persistence [4] and user interfaces [5]. Moreover, a framework is used as a core asset when many closely related applications are developed in a Software Product Line (SPL) [6], [7]. Common features of the SPL domain are implemented in the framework and applications implement these features reusing framework classes.
However, frameworks are hard to develop, learn and reuse. Their classes must be abstract enough to be reused by applications that are unknown beforehand. Framework developers must define the domain of applications for which the framework is able to be instantiated, how the framework is reused by these applications and how it accesses application-specific classes, among other things [7], [8]. Frameworks have a steep learning curve, since application developers must understand their complex design. Some framework rules may not be apparent in the framework interface [9]. A framework may contain so many classes and operations that even developers who are conversant with it may make mistakes while they are reusing this framework to develop an application.
In a previous paper we presented an approach for building Domain-Specific Modeling Languages (DSML) to support framework reuse [10]. A DSML can be built by identifying framework features and the information required to instantiate them. Thus, application models created with a DSML can be used to generate application source-code. Experiments have shown that DSMLs protect developers from framework complexities, reduce the occurrence of mistakes made by developers when they are instantiating frameworks to develop applications and reduce the time spent in this instantiation. In another paper we presented the From Features to Framework (F3) approach, which aims to reduce framework development complexities [11].
In this approach the domain of a framework is defined in an F3 model, which is an extended version of the feature model. Then a set of patterns, named F3 patterns, guides the developer to design and implement a white box framework according to its domain. One of the advantages of this approach is that, besides showing how developers can proceed, the F3 patterns systematize the process of framework development. This systematization allows the development of frameworks to be automated by a tool. Therefore, in this paper we present the From Features to Framework Tool (F3T), which is a plug-in for the Eclipse IDE that supports the use of the F3 approach to develop and reuse frameworks. This tool provides an editor for developers to create an F3 model of a domain. Then, the source-code and the DSML of a framework can be generated from the domain defined in this model. The source-code of the framework is generated as a Java project, while the DSML is generated as a set of Eclipse IDE plug-ins. After being installed, a DSML can be used to model applications. Then, the F3T can be used again to generate the application source-code from models created with the framework DSML. This application reuses the framework previously generated. We also have carried out an experiment in order to evaluate whether the F3T facilitates framework development or not. The experiment analyzed the time spent in framework development and the number of problems found in the source-code of the outcome frameworks.
The remainder of this paper is organized as follows: background concepts are discussed in Section II; the F3 approach is described in Section III; the F3T is presented in Section IV; an experiment that has evaluated the F3T is presented in Section V; related works are discussed in Section VI; and conclusions and future works are presented in Section VII.
II. BACKGROUND
The basic concepts applied in the F3T and its approach are presented in this section. All these concepts have reuse as their basic principle. Reuse is a practice that aims to reduce the time spent in a development process, because the software is not developed from scratch, and to increase the quality of the software, since the reusable practices, models or code were previously tested and proven successful [12]. Reuse can occur at different levels: executing simple copy/paste commands; referencing operations, classes, modules and other blocks in programming languages; or applying more sophisticated concepts, such as patterns, frameworks, generators and domain engineering [13].
Patterns are successful solutions that can be reapplied to different contexts [3]. They provide reuse of experience, helping developers to solve common problems [14]. The documentation of a pattern mainly contains its name, the context in which it can be applied, the problem it is intended to solve, the solution it proposes, illustrative class models and examples of use. There are patterns for several purposes, such as design, analysis, architectural, implementation, process and organizational patterns [15].
Frameworks act like skeletons that can be instantiated to implement applications [3]. Their classes embody an abstract design to provide solutions for domains of applications [9]. Applications are connected to a framework by reusing its classes. Unlike library classes, whose execution flow is controlled by applications, frameworks control the execution flow, accessing the application-specific code [15].
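As a minimal illustration of this inversion of control (the class and method names below are hypothetical and not taken from the F3 approach), a framework class typically fixes the control flow and calls back into application code through an abstract hook:

// Illustrative sketch only: the framework controls the flow and delegates
// application-specific behavior to an abstract hook.
abstract class ReportFramework {
    // Fixed control flow reused by every application (frozen spot).
    public final void generate() {
        String data = loadData();                 // call into application-specific code
        System.out.println("Report: " + data);
    }
    // Application-specific behavior (hot spot).
    protected abstract String loadData();
}

class SalesReport extends ReportFramework {
    @Override
    protected String loadData() { return "sales figures"; }
}

Calling new SalesReport().generate() would then run the framework's fixed flow while delegating data loading to the application class.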
The fixed parts of the frameworks, known as frozen spots, implement common functionality of the domain that is reused by all applications. The variable parts, known as hot spots, can change according to the specifications of the desired application [9]. According to the way they are reused, frameworks can be classified as: white box, which are reused by class specialization; black box, which work like a set of components; and gray box, which are reused in the two previous ways [2].
Generators are tools that transform an artifact into another [16], [17]. There are many types of generators. The most common are Model-to-Model (M2M), Model-to-Text (M2T) and programming language translators [18]. Like frameworks, generators are related to domains. However, some generators are configurable, being able to change their domain [19]. In this case, templates are used to define the artifacts that can be generated.
A domain of software consists of a set of applications that share common features. A feature is a distinguishing characteristic that aggregates value to applications [20]–[22]. For example, Rental Transaction, Destination Party and Resource could be features of the domain of rental applications. Different domain engineering approaches can be found in the literature [20], [22]–[24]. Although there are differences between them, their basic idea is to model the features of a domain and develop the components that implement these features and are reused in application engineering. The features of a domain are defined in a feature model, in which they are arranged in a tree-view notation. They can be mandatory or optional, have variations and require or exclude other features. The feature that most represents the purpose of the domain is put in the root and a top-down approach is applied to add the other features. For example, the main purpose of the domain of rental applications is to perform rentals, so Rental is supposed to be the root feature. The other features are arranged following it.
Domains can also be modeled with metamodel languages, which are used to create Domain-Specific Modeling Languages (DSML). Metamodels, such as those defined in the MetaObject Facility (MOF) [25], are similar to class models, which makes them more appropriate for developers accustomed to the UML. While in feature models only features and their constraints are defined, metaclasses in metamodels can contain attributes and operations. On the other hand, feature models can define dependencies between features, while metamodels depend on declarative languages to do it [18]. A generator can be used along with a DSML to transform models created with this DSML into code. When these models represent applications, the generators are called application generators.
III. THE F3 APPROACH
The F3 is a Domain Engineering approach that aims to develop frameworks for domains of applications. It has two steps: 1) Domain Modeling, in which the framework domain is determined; and 2) Framework Construction, in which the framework is designed and implemented according to the features of its domain.
In the Domain Modeling step the domain is defined in a feature model. However, an extended version of the feature model is used in the F3 approach, because feature models are too abstract to contain enough information for framework development and metamodels depend on other languages to define dependencies and constraints. This extended version, called the F3 model, incorporates characteristics of both feature models and metamodels. As in conventional feature models, the features in F3 models can also be arranged in a tree-view, in which the root feature is decomposed into other features. However, the features in F3 models do not necessarily form a tree, since a feature can have a relationship targeting a sibling or even itself, as in metamodels. The elements and relationships in F3 models are:
• Feature: graphically represented by a rounded square, it must have a name and it can contain any number of attributes and operations;
• Decomposition: relationship that indicates that a feature is composed of another feature. This relationship specifies a minimum and a maximum multiplicity. The minimum multiplicity indicates whether the target feature is optional (0) or mandatory (1). The maximum multiplicity indicates how many instances of the target feature can be associated to each instance of the source feature. The valid values for the maximum multiplicity are: 1 (simple), for a single feature instance; * (multiple), for a list of instances of a single feature; and ** (variant), for any number of feature instances;
• Generalization: relationship that indicates that a feature is a variation generalized by another feature;
• Dependency: relationship that defines a condition for a feature to be instantiated. There are two types of dependency: requires, when the A feature requires the B feature, an application that contains the A feature also has to include the B feature; and excludes, when the A feature excludes the B feature, no application can include both features.
The Framework Construction step has as output a white box framework for the domain defined in the previous step. The F3 approach defines a set of patterns to assist developers to design and implement frameworks from F3 models. The patterns treat problems that go from the creation of classes for the features to the definition of the framework interface. Some of the F3 patterns are presented in Table I.
TABLE I: Some of the F3 patterns.
Pattern | Purpose
Domain Feature | Indicates structures that should be created for a feature.
Mandatory Decomposition | Indicates code units that should be created when there is a mandatory decomposition linking two features.
Optional Decomposition | Indicates code units that should be created when there is an optional decomposition linking two features.
Simple Decomposition | Indicates code units that should be created when there is a simple decomposition linking two features.
Multiple Decomposition | Indicates code units that should be created when there is a multiple decomposition linking two features.
Variant Decomposition | Indicates code units that should be created when there is a variant decomposition linking two features.
Variant Feature | Defines a class hierarchy for features with variants.
Modular Hierarchy | Defines a class hierarchy for features with common attributes and operations.
Requiring Dependency | Indicates code units that should be created when a feature requires another one.
Excluding Dependency | Indicates code units that should be created when a feature excludes another one.
In addition to indicating the code units that should be created to implement the framework functionality, the F3 patterns also determine how the framework can be reused by the applications. For example, some patterns suggest including abstract operations in the framework classes that allow it to access application-specific information.
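As a rough, hypothetical illustration of how the decomposition multiplicities above could surface in code (the feature names come from the rental-domain example; the exact generated code is defined by the F3 patterns themselves and is shown later for the Resource feature), the three maximum multiplicities might map to class members roughly as follows:

// Illustrative sketch only: possible mapping of decomposition multiplicities to fields.
import java.util.List;

class RentalSketch {
    DestinationPartySketch customer;        // 1 (simple): a single instance of the target feature
    List<ItemSketch> items;                 // * (multiple): a list of instances of a single feature
    List<ResourceVariantSketch> resources;  // ** (variant): any number of instances of feature variants
}

class DestinationPartySketch {}
class ItemSketch {}                          // hypothetical feature, used only for this sketch
abstract class ResourceVariantSketch {}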
In addition, the F3 patterns make the development of frameworks systematic, allowing it to be automated. Thus, the F3T tool was created to automate the use of the F3 approach, enhancing the process of framework development.
IV. THE F3T
The F3T assists developers to apply the F3 approach in the development of white box frameworks and to reuse these frameworks through their DSMLs. The F3T is a plug-in for the Eclipse IDE, so developers can make use of the F3T resources, such as domain modeling, framework construction, application modeling through the framework DSML and application construction, as well as the other resources provided by the IDE. The F3T is composed of three modules, as seen in Figure 1: 1) Domain Module (DM); 2) Framework Module (FM); and 3) Application Module (AM).
Fig. 1: Modules of the F3T.
A. Domain Module
The DM provides an F3 model editor for developers to define domain features. This module has been developed with the support of the Eclipse Modeling Framework (EMF) and the Graphical Modeling Framework (GMF) [18]. The EMF was used to create a metamodel, in which the elements, relationships and rules of the F3 models were defined as described in Section III. The metamodel of F3 models is shown in Figure 2. From this metamodel, the EMF generated the source-code of the Model and the Controller layers of the F3 model editor. The GMF has been used to define the graphical notation of the F3 models. This graphical notation can also be seen as the View layer of the F3 model editor. With the GMF, the graphical figures and the menu bar of the editor were defined and linked to the elements and relationships defined in the metamodel of the F3 models. Then, the GMF generates the source-code of the graphical notation. The F3 model editor is shown in Figure 3 with an example of an F3 model for the domain of trade and rental transactions.
Fig. 2: Metamodel containing elements, relationships and rules of F3 models.
Fig. 3: F3 model for the domain of trade and rental transactions.
B. Framework Module
The FM is a M2T generator that transforms F3 models into framework source-code and DSML. Despite their graphical notation, F3 models are actually XML files. This makes them more accessible to other tools, such as a generator. The FM was developed with the support of the Java Emitter Templates (JET) in the Eclipse IDE [26]. The JET plug-in contains a framework that is a generic generator and a compiler that translates templates into Java files. These templates are XML files, in which tags are instructions to generate an output based on information in the input and text is fixed content inserted in the output independently of the input. The Java files generated from the JET templates reuse the JET framework to compose a domain-specific generator. Thus, the FM depends on the JET plug-in to work. The templates of the FM are organized in two groups: one related to framework source-code; and another related to the framework DSML. Both groups are invoked from the main template of the DM generator.
Part of the JET template that generates Java classes in the framework source-code from the features found in the F3 models can be seen as follows:

public <c:if test="($feature/@abstract)">abstract </c:if>
class <c:get select="$feature/@name"/> extends
<c:choose select="$feature/@variation">
  <c:when test="'true'">DVariation</c:when>
  <c:otherwise>
    <c:choose>
      <c:when test="$feature/dSuperFeature">
        <c:get select="$feature/dSuperFeature/@name"/>
      </c:when>
      <c:otherwise>DObject</c:otherwise>
    </c:choose>
  </c:otherwise>
</c:choose> { ... }

The framework source-code that is generated by the FM is organized in a Java project identified by the domain name and the suffix ".framework". The framework source-code is generated according to the patterns defined by the F3 approach. For example, the FM generates a class for each feature found in an F3 model. These classes contain the attributes and operations defined in their original feature. All generated classes also, directly or indirectly, extend the DObject class, which implements non-functional requirements, such as persistence and logging. Generalization relationships result in inheritances and decomposition relationships result in associations between the involved classes. Additional operations are included in the framework classes to treat feature variations and constraints of the domains defined in the F3 models. For example, according to the Variant Decomposition F3 pattern, the getResourceTypeClasses operation was included in the code of the Resource class so that the framework can recognize which classes implement the ResourceType feature in the applications. Part of the code of the Resource class is presented as follows:

/** @generated */
public abstract class Resource extends DObject {
    /** @generated */
    private int id;
    /** @generated */
    private String name;
    /** @generated */
    private List<ResourceType> types;
    /** @generated */
    public abstract Class<?>[] getResourceTypeClasses();
    // ...
}

The framework DSML is generated as an EMF/GMF project identified only by the domain name. The FM generates the EMF/GMF models of the DSML, as seen in Figure 4.a, which were generated from the F3 model shown in Figure 3. Then, the source-code of the DSML must be generated by using the generators provided by the EMF/GMF in three steps: 1) using the EMF generator from the genmodel file (Figure 4.a); 2) using the GMF generator from the gmfmap file (Figure 4.b); and 3) using the GMF generator from the gmfgen file (Figure 4.c). After this, the DSML will be composed of 5 plug-in projects in the Eclipse IDE. The projects that contain the source-code and the DSML plug-ins of the framework for the trade and rental transactions domain are shown in Figure 4.d.
Fig. 4: Generation of the DSML plug-ins.
C. Application Module
The AM has also been developed with the support of JET. It generates application source-code from an application model based on a framework DSML. The templates of the AM generate classes that extend framework classes and override operations that configure framework hot spots. After the DSML plug-ins are installed in the Eclipse IDE, the AM recognizes the model files created from the DSML. An application model created with the DSML of the framework for the domain of trade and rental transactions is shown in Figure 5.
Fig. 5: Application model created with the framework DSML.
Application source-code is generated in the source folder of the project where the application model is. The AM generates a class for each feature instantiated in the application model. Since the framework is white box, the application classes extend the framework classes indicated by the stereotypes in the model. It is expected that many class attributes requested by the application requirements have been defined in the domain. Thus, these attributes are in the framework source-code and they must not be defined in the application classes again. Part of the code of the Product class is presented as follows:

public class Product extends Resource {
    /** @generated */
    private float value;
    /** @generated */
    public Class<?>[] getResourceTypeClasses() {
        return new Class<?>[] { Category.class, Manufacturer.class };
    }
    // ...
}

V. EVALUATION
In this section we present an experiment in which we evaluated the use of the F3T to develop frameworks, since the use of DSMLs to support framework reuse has been evaluated in a previous paper [10]. The experiment was conducted following all steps described by Wohlin et al. (2000) and it can be summarized as: (i) analyse the F3T, described in Section IV; (ii) for the purpose of evaluation; (iii) with respect to time spent and number of problems; (iv) from the point of view of the developer; and (v) in the context of MSc and PhD Computer Science students.
A. Planning
The experiment was planned to answer two research questions: RQ1: "Does the F3T reduce the effort to develop a framework?"; and RQ2: "Does the F3T result in an outcome framework with a fewer number of problems?". All subjects had to develop two frameworks, both applying the F3 approach, but one manually and the other with the support of the F3T. The context of our study corresponds to a multi-test within object study [27], hence the experiment consisted of experimental tests executed by a group of subjects to study a single tool. In order to answer the first question, we measured the time spent to develop each framework. Then, to answer the second question, we analyzed the frameworks developed by the subjects, and we identified and classified the problems found in the source-code. The planning phase was divided into seven parts, which are described in the next subsections.
1. Context Selection
26 MSc and PhD students of Computer Science participated in the experiment, which was carried out in an off-line situation. All participants had prior experience in software development, Java programming, patterns and framework reuse.
2. Formulation of Hypotheses
The experiment questions have been formalized as follows:
RQ1, Null hypothesis, H0: Considering the F3 approach, there is no significant difference, in terms of time, between developing frameworks with the support of the F3T and doing it manually. Thus, the F3T does not reduce the time spent to develop frameworks. This hypothesis can be formalized as: H0: µF3T = µmanual
RQ1, Alternative hypothesis, H1: Considering the F3 approach, there is a significant difference, in terms of time, between developing frameworks with the support of the F3T and doing it manually. Thus, the F3T reduces the time spent to develop frameworks. This hypothesis can be formalized as: H1: µF3T ≠ µmanual
RQ2, Null hypothesis, H0: Considering the F3 approach, there is no significant difference, in terms of problems found in the outcome frameworks, between developing frameworks using the F3T and doing it manually. Thus, the F3T does not reduce the mistakes made by subjects while they are developing frameworks.
V. EVALUATION
In this section we present an experiment in which we evaluated the use of the F3T to develop frameworks, since the use of DSMLs to support framework reuse has been evaluated in a previous paper [10]. The experiment was conducted following all steps described by Wohlin et al. (2000) [27] and can be summarized as: (i) analyse the F3T, described in Section IV; (ii) for the purpose of evaluation; (iii) with respect to time spent and number of problems; (iv) from the point of view of the developer; and (v) in the context of MSc and PhD Computer Science students.

A. Planning
The experiment was planned to answer two research questions: RQ1: "Does the F3T reduce the effort to develop a framework?"; and RQ2: "Does the F3T result in an outcome framework with fewer problems?". All subjects had to develop two frameworks, both applying the F3 approach, but one manually and the other with the support of the F3T. The context of our study corresponds to a multi-test within object study [27]; hence the experiment consisted of experimental tests executed by a group of subjects to study a single tool. In order to answer the first question, we measured the time spent to develop each framework. Then, to answer the second question, we analyzed the frameworks developed by the subjects and identified and classified the problems found in their source-code. The planning phase was divided into seven parts, which are described in the next subsections.

1. Context Selection
26 MSc and PhD students of Computer Science participated in the experiment, which was run in an off-line situation. All participants had prior experience in software development, Java programming, patterns and framework reuse.

2. Formulation of Hypotheses
The experiment questions have been formalized as follows:
RQ1, Null hypothesis, H0: Considering the F3 approach, there is no significant difference, in terms of time, between developing frameworks with the support of the F3T and doing it manually. Thus, the F3T does not reduce the time spent to develop frameworks. This hypothesis can be formalized as: H0: µF3T = µmanual.
RQ1, Alternative hypothesis, H1: Considering the F3 approach, there is a significant difference, in terms of time, between developing frameworks with the support of the F3T and doing it manually. Thus, the F3T reduces the time spent to develop frameworks. This hypothesis can be formalized as: H1: µF3T ≠ µmanual.
RQ2, Null hypothesis, H0: Considering the F3 approach, there is no significant difference, in terms of problems found in the outcome frameworks, between developing frameworks using the F3T and doing it manually. Thus, the F3T does not reduce the mistakes made by subjects while they are developing frameworks. This hypothesis can be formalized as: H0: µF3T = µmanual.
RQ2, Alternative hypothesis, H1: Considering the F3 approach, there is a significant difference, in terms of problems found in the outcome frameworks, between developing frameworks using the F3T and doing it manually. Thus, the F3T reduces the mistakes made by subjects while they are developing frameworks. This hypothesis can be formalized as: H1: µF3T ≠ µmanual.
3. Variables Selection
The dependent variables of this experiment were "time spent to develop a framework" and "number of problems found in the outcome frameworks". The independent variables were as follows:
• Application: each subject had to develop two frameworks: one (Fw1) for the domain of trade and rental transactions and the other (Fw2) for the domain of automatic vehicles. Both Fw1 and Fw2 had 10 features.
• Development Environment: Eclipse 4.2.1, Astah Community 6.4, F3T.
• Technologies: Java version 6.

4. Selection of Subjects
The subjects were selected through a non-probabilistic approach by convenience, i.e., the probability of all population elements belonging to the same sample is unknown.

5. Experiment Design
The subjects were divided into two blocks of 13 subjects:
• Block 1: development of Fw1 manually and development of Fw2 with the support of the F3T;
• Block 2: development of Fw2 manually and development of Fw1 with the support of the F3T.
We chose to use blocks to reduce the effect of the experience of the students, which was measured through a form in which the students answered about their level of experience in software development. This form was given to the subjects one week before the pilot experiment described below. The goal of this pilot experiment was to ensure that the experiment environment and materials were adequate and that the tasks could be properly executed.

6. Design Types
The design type of this experiment was one factor with two treatments, paired [27]. The factor in this experiment is the manner in which the F3 approach was used to develop a framework, and the treatments are the support of the F3T against manual development.

7. Instrumentation
All materials necessary to assist the subjects during the execution of this experiment were devised beforehand. These materials consisted of forms for collecting experiment data, for instance, the time spent to develop the frameworks and a list of the problems found in the outcome frameworks developed by each subject. At the end of the experiment, all subjects received a questionnaire to report on the F3 approach and the F3T.

B. Operation
The operation phase was divided into two parts, as described in the next subsections:

1. Preparation
Firstly, the subjects received a characterization form, containing questions regarding their knowledge of Java programming, the Eclipse IDE, patterns and frameworks. Then, the subjects were introduced to the F3 approach and the F3T.

2. Execution
Initially, the subjects signed a consent form and then answered the characterization form. After this, they watched a presentation about frameworks, which included the description of some known examples and their hot spots. The subjects were also trained on how to develop frameworks using the F3 approach with or without the support of the F3T. Following the training, the pilot experiment was executed. The subjects were split into two groups considering the results of the characterization forms. Subjects were not told about the nature of the experiment, but were verbally instructed on the F3 approach and its tool. The pilot experiment was intended to simulate the real experiment, except that the applications were different, but equivalent. Beforehand, all subjects were given ample time to read about the approach and to ask questions on the experimental process. Since this could affect the experiment validity, the data from this activity was only used to balance the groups. When the subjects understood what they had to do, they received the description of the domains and started timing the development of the frameworks. Each subject had to develop the frameworks applying the F3 approach, i.e., creating an F3 model from a document describing the domain features and then applying the F3 patterns to implement it.
C. Analysis of Data
This section presents the experimental findings. The analysis is divided into two subsections: (1) Descriptive Statistics and (2) Hypotheses Testing.

1. Descriptive Statistics
The time spent by each subject to develop a framework and the number of problems found in the outcome frameworks are shown in Table II. From this table, it can be seen that the subjects spent more time to develop the frameworks when they were doing it manually, approximately 72.5% against 27.5%. This result was expected, since the F3T generates framework source-code from F3 models. However, it is worth highlighting that most of the time spent in the manual framework development was due to framework implementation and the effort to fix the problems found in the frameworks, while most of the time spent in the framework development supported by the F3T was due to domain modeling. The dispersion of the time spent by the subjects is also represented graphically in a boxplot on the left side of Figure 6.

TABLE II: Development timings and number of problems.
Fig. 6: Dispersion of the total time and number of problems.

In Table II it is also possible to visualize four types of problems that we analyzed in the outcome frameworks: (i) incoherence, (ii) structure, (iii) bad smells, and (iv) interface. The problem of incoherence indicates that, during the experiment, the subjects did not model the domain of the framework as expected. Consequently, the subjects did not develop the frameworks with the correct domain features and constraints (mandatory, optional, and alternative features). As the capacity to model the framework domains depends more on the subjects' skills than on tool support, incoherence problems were found in equivalent proportions, approximately 50%, when the framework was developed either manually or with the support of the F3T.
The problem of structure indicates that the subjects did not implement the frameworks properly during the experiment. For example, they implemented classes with no constructor and incorrect relationships, or they forgot to declare the classes as abstract. This kind of problem occurred when the subjects did not properly follow the instructions provided by the F3 patterns. In Table II it can be seen that the F3T helped the subjects to develop frameworks with fewer structure problems, i.e., 10% in opposition to 90%.
The problem of bad smells indicates design weaknesses that do not affect functionality, but make the frameworks harder to maintain. In the experiment, this kind of problem occurred when the subjects forgot to apply some F3 patterns related to the organization of the framework classes, such as the Modular Hierarchy F3 pattern. By observing Table II we can remark that the F3T produced a design of higher quality than the manual approach, i.e., 0% against 100%, because the F3T automatically identified which patterns should be applied from the F3 models.
The problem of interface indicates the absence of getter/setter operations, the lack of operations that allow the applications to reuse the framework, and so on. Usually, this kind of problem is a consequence of problems of structure, hence the numbers of problems of these two types are quite similar. As can be observed in Table II, the F3T helped the subjects to design a better framework interface than when they developed the framework manually, i.e., 8.6% against 91.4%.
In the last two columns of Table II it can be seen that the F3T reduced the total number of problems found in the frameworks developed by the subjects. This is also graphically represented in the boxplot on the right side of Figure 6.
2. Testing the Hypotheses
The objective of this section is to verify, based on the data obtained in the experiment, whether it is possible to reject the null hypotheses in favor of the alternative hypotheses. Since some statistical tests are applicable only if the population follows a normal distribution, we applied the Shapiro-Wilk test and created a Q-Q chart to verify whether or not the experiment data departs from linearity before choosing a proper statistical test.

Fig. 7: Normality tests.

The tests have been carried out as follows:
1) Time: We applied the Shapiro-Wilk test to the experiment data that represents the time spent by each subject to develop a framework manually or using the F3T, as shown in Table II. Considering α = 0.05, the p-values are 0.878 and 0.6002 and the W statistics are 0.9802 and 0.9691, respectively, for each approach. The test results confirmed that the experiment data related to the time spent in framework development is normally distributed, as can be seen in the Q-Q charts (a) and (b) in Figure 7. Thus, we decided to apply the Paired T-Test to these data. Assuming a Paired T-Test, we can reject H0 if |t0| > t(α/2, n−1). In this case, t(α, f) is the upper α percentage point of the t-distribution with f degrees of freedom. Therefore, based on the samples, n = 26 and d = {46, 42, 52, 49, 41, 49, 55, 50, 53, 42, 42, 52, 48, 43, 45, 42, 47, 48, 44, 49, 51, 48, 52, 51, 48, 45}. The average values of each data set are µmanual = 76.42 and µF3T = 28.96, so the mean difference is d̄ = 76.42 − 28.96 = 47.46, which implies Sd = 3.982 and t0 = 60.7760. The number of degrees of freedom is f = n − 1 = 26 − 1 = 25. We take α = 0.025. Thus, according to StatSoft1, t(0.025, 25) = 2.05954. Since |t0| > t(0.025, 25), it is possible to reject the null hypothesis with a two-sided test at the 0.025 level. Therefore, statistically, we can assume that, when the F3 approach is applied, the time needed to develop a framework using the F3T is less than doing it manually.
1 http://www.statsoft.com/textbook/distribution-tables/#t
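For readability, the computation reported above can be restated in the standard paired t-test form (LaTeX notation; the numbers are the paper's own, and the result matches the reported t0 up to rounding):

\[
t_0 = \frac{\bar{d}}{S_d/\sqrt{n}} = \frac{47.46}{3.982/\sqrt{26}} \approx 60.78,
\qquad |t_0| > t_{0.025,\,25} = 2.05954 .
\]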
2) Problems: Similarly, we applied the Shapiro-Wilk test to the experiment data shown in the last two columns of Table II, which represent the total number of problems found in the outcome frameworks that were developed either manually or using the F3T. Considering α = 0.05, the p-values are 0.1522 and 0.007469, and the W statistics are 0.9423 and 0.8853, respectively, for each approach. As can be seen in the Q-Q charts (c) and (d) in Figure 7, the test results confirmed that the data related to manual development is normally distributed, but the data related to the F3T cannot be considered normally distributed. Therefore we applied a non-parametric test, the Wilcoxon signed-rank test, to these data. The signed ranks of these data are S/R of |problems_manual − problems_F3T| = {+3.5, +7.5, +7.5, +16.5, -3.5, +23, +3.5, +3.5, +10.5, +10.5, +3.5, +18.5, +10.5, +14, +24, +18.5, +3.5, +21, +21, +14, +21, +10.5, +14, +16.5}, where S/R stands for "signed rank". As a result we got a p-value = 0.001078 with a significance level of 1%. Based on these data, we conclude that there is a considerable difference between the means of the two treatments. We were able to reject H0 at the 1% significance level. The p-value is very close to zero, which further emphasizes that the F3T reduces the number of problems found in the outcome frameworks.

D. Opinion of the Subjects
We analyzed the opinion of the subjects in order to evaluate the impact of using the approaches considered in the experiment. After the experiment operation, all subjects received a questionnaire, in which they could report their perception about applying the F3 approach manually or with the support of the F3T. The answers in the questionnaire have been analyzed in order to identify the difficulties in the use of the F3 approach and its tool. As can be seen in Figure 8, when asked if they encountered difficulties in the development of the frameworks by applying the F3 approach manually, approximately 52% of the subjects reported having significant difficulty, 29% mentioned partial difficulty and 19% had no difficulty. In contrast, when asked the same question with respect to the use of the F3T, 73% of the subjects reported having no difficulty, 16% mentioned partial difficulty and only 11% had significant difficulty.

Fig. 8: Level of difficulty of the subjects.

The reduction of the difficulty to develop the frameworks, shown in Figure 8, reveals that the F3T assisted the subjects in this task. The subjects also answered in the questionnaire about the difficulties they found during framework development. The most common difficulties pointed out by the subjects when they developed the frameworks manually were: 1) too much effort spent on coding; 2) mistakes made due to lack of attention; 3) lack of experience in developing frameworks; and 4) time spent identifying the F3 patterns in the F3 models. In contrast, the most common difficulties faced by the subjects when they used the F3T were: 1) lack of practice with the tool; and 2) some actions in the tool interface, for instance opening the F3 model editor, take many steps to be executed. The subjects said that the F3 patterns helped them to identify which structures were necessary to implement the frameworks in the manual development. They also said the F3T automated the tasks of identifying which F3 patterns should be used and of implementing the framework source-code. Then, they could keep their focus on domain modeling.

E. Threats to Validity
Internal Validity:
• Experience level of the subjects: the subjects had different levels of knowledge and this could affect the collected data. To mitigate this threat, we divided the subjects into two balanced blocks considering their level of knowledge and rebalanced the groups considering the preliminary results. Moreover, all subjects had prior experience in application development reusing frameworks, but not in developing frameworks. Thus, the subjects were trained in common framework implementation techniques and in how to use the F3 approach and the F3T.
• Productivity under evaluation: there was a possibility that this might influence the experiment results, because subjects often tend to think they are being evaluated by experiment results. In order to mitigate this, we explained to the subjects that no one was being evaluated and that their participation was anonymous.
• Facilities used during the study: different computers and installations could affect the recorded timings. Thus, the subjects used the same hardware configuration and operating system.
Validity by Construction:
• Hypothesis expectations: the subjects already knew the researchers and knew that the F3T was supposed to ease framework development, which reflects one of our hypotheses. These issues could affect the collected data and cause the experiment to be less impartial. In order to keep impartiality, we enforced that the participants had to keep a steady pace during the whole study.
External Validity:
• Interaction between configuration and treatment: it is possible that the exercises performed in the experiment are not accurate for every framework development for real-world applications. Only two frameworks were developed and they had the same complexity. To mitigate this threat, the exercises were designed considering framework domains based on the real world.
Conclusion Validity:
• Measure reliability: this refers to the metrics used to measure the development effort. To mitigate this threat, we used only the time spent, which was captured in forms filled in by the subjects;
• Low statistic power: the ability of a statistical test to reveal reliable data. To mitigate this threat, we applied two tests: T-Tests to statistically analyze the time spent to develop the frameworks and the Wilcoxon signed-rank test to statistically analyze the number of problems found in the outcome frameworks.
VI. RELATED WORKS
In this section some works related to the F3T and the F3 approach are presented.
Amatriain and Arumi [28] proposed a method for the development of a framework and its DSL through iterative and incremental activities. In this method, the framework has its domain defined from a set of applications and it is implemented by applying a series of refactorings in the source-code of these applications. The advantage of this method is the small initial investment and the reuse of the applications. Although it is not mandatory, the F3 approach can also be applied in iterative and incremental activities, starting from a small domain and then adding features. Applications can also be used to facilitate the identification of the features of the framework domain. However, the advantage of the F3 approach is the fact that the design and the implementation of the frameworks are supported by the F3 patterns and automated by the F3T.
Oliveira et al. [29] presented the ReuseTool, which assists framework reuse by manipulating UML diagrams. The ReuseTool is based on the Reuse Description Language (RDL), a language created by these authors to facilitate the description of framework instantiation processes. Framework hot spots can be registered in the ReuseTool with the use of the RDL. In order to instantiate the framework, application models can be created based on the framework description. Application source-code is generated from these models. Thus, the RDL works as a meta-language that registers framework hot spots, and the ReuseTool provides a friendlier interface for developers to develop applications reusing the frameworks. In comparison, the F3T supports framework development through domain modeling and application development through the framework DSML.
Pure::variants [30] is a tool that supports the development of applications by modeling domain features (Feature Diagram) and the components that implement these features (Family Diagram). The applications are then developed by selecting a set of features of the domain. Pure::variants generates only application source-code, maintaining all domain artifacts at model level. Besides, this tool has a private license and its free version (Community) has limitations in its functionality. In comparison, the F3T is free, uses only one type of domain model (the F3 model) and generates frameworks as domain artifacts. Moreover, the frameworks developed with the support of the F3T can be reused in the development of applications with or without the support of the F3T.

VII. CONCLUSIONS
The F3T supports framework development and reuse through code generation from models. This tool provides an F3 model editor for developers to define the features of the framework domain. Then, framework source-code and a DSML can be generated from the F3 models. The framework DSML can be installed in the F3T to allow developers to model and to generate the source-code of applications that reuse the framework. The F3T is a free software available at: http://www.dc.ufscar.br/∼matheus viana.
The F3T was created to semi-automate the application of the F3 approach. In this approach, domain features are defined in F3 models in order to separate the elements of the framework from the complexities of developing them. F3 models incorporate elements and relationships from feature models and properties and operations from metamodels. Framework source-code is generated based on patterns that are solutions to design and implement the domain features defined in F3 models. A DSML is generated along with the source-code and includes all features of the framework domain, and in the models created with it developers can insert application specifications to configure framework hot spots. Thus, the F3T supports both Domain Engineering and Application Engineering, improving their productivity and the quality of the outcome frameworks and applications. The F3T can be used to help the construction of software product lines, providing an environment to model domains and create frameworks to be used as core assets for application development.
The experiment presented in this paper has shown that, besides the gain of efficiency, the F3T reduces the complexities surrounding framework development, because, by using this tool, developers are more concerned with defining framework features in a graphical model. All code units that compose these features, provide flexibility to the framework and allow it to be instantiated in several applications are properly generated by the F3T.
The current version of the F3T generates only the model layer of the frameworks and applications. In future works we intend to include the generation of a complete multi-portable Model-View-Controller architecture.

ACKNOWLEDGMENT
The authors would like to thank CAPES and FAPESP for sponsoring our research.
REFERENCES
[1] V. Stanojevic, S. Vlajic, M. Milic, and M. Ognjanovic. Guidelines for Framework Development Process. In 7th Central and Eastern European Software Engineering Conference, pages 1–9, Nov 2011.
[2] M. Abi-Antoun. Making Frameworks Work: a Project Retrospective. In ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications, 2007.
[3] R. E. Johnson. Frameworks = (Components + Patterns). Communications of the ACM, 40(10):39–42, Oct 1997.
[4] JBoss Community. Hibernate. http://www.hibernate.org, Jan 2013.
[5] Spring Source Community. Spring Framework. http://www.springsource.org/spring-framework, Jan 2013.
[6] S. D. Kim, S. H. Chang, and C. W. Chang. A Systematic Method to Instantiate Core Assets in Product Line Engineering. In 11th Asia-Pacific Conference on Software Engineering, pages 92–98, Nov 2004.
[7] David M. Weiss and Chi Tau Robert Lai. Software Product Line Engineering: A Family-Based Software Development Process. Addison-Wesley, 1999.
[8] D. Parsons, A. Rashid, A. Speck, and A. Telea. A Framework for Object Oriented Frameworks Design. In Technology of Object-Oriented Languages and Systems, pages 141–151, Jul 1999.
[9] S. Srinivasan. Design Patterns in Object-Oriented Frameworks. ACM Computer, 32(2):24–32, Feb 1999.
[10] M. Viana, R. Penteado, and A. do Prado. Generating Applications: Framework Reuse Supported by Domain-Specific Modeling Languages. In 14th International Conference on Enterprise Information Systems, Jun 2012.
[11] M. Viana, R. Durelli, R. Penteado, and A. do Prado. F3: From Features to Frameworks. In 15th International Conference on Enterprise Information Systems, Jul 2013.
[12] Sajjan G. Shiva and Lubna Abou Shala. Software Reuse: Research and Practice. In Fourth International Conference on Information Technology, pages 603–609, Apr 2007.
[13] W. Frakes and K. Kang. Software Reuse Research: Status and Future. IEEE Transactions on Software Engineering, 31(7):529–536, Jul 2005.
[14] M. Fowler. Patterns. IEEE Software, 20(2):56–57, 2003.
[15] R. S. Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill Science, 7th edition, 2009.
[16] A. Sarasa-Cabezuelo, B. Temprado-Battad, D. Rodríguez-Cerezo, and J. L. Sierra. Building XML-Driven Application Generators with Compiler Construction. Computer Science and Information Systems, 9(2):485–504, 2012.
[17] S. Lolong and A. I. Kistijantoro. Domain Specific Language (DSL) Development for Desktop-Based Database Application Generator. In International Conference on Electrical Engineering and Informatics (ICEEI), pages 1–6, Jul 2011.
[18] R. C. Gronback. Eclipse Modeling Project: A Domain-Specific Language (DSL) Toolkit. Addison-Wesley, 2009.
[19] I. Liem and Y. Nugroho. An Application Generator Framelet. In 9th International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD'08), pages 794–799, Aug 2008.
[20] J. M. Jezequel. Model-Driven Engineering for Software Product Lines. ISRN Software Engineering, 2012, 2012.
[21] K. Lee, K. C. Kang, and J. Lee. Concepts and Guidelines of Feature Modeling for Product Line Software Engineering. In 7th International Conference on Software Reuse: Methods, Techniques and Tools, pages 62–77, London, UK, 2002. Springer-Verlag.
[22] K. C. Kang, S. G. Cohen, J. A. Hess, W. E. Novak, and A. S. Peterson. Feature-Oriented Domain Analysis (FODA): Feasibility Study. Technical report, Carnegie-Mellon University Software Engineering Institute, Nov 1990.
[23] H. Gomaa. Designing Software Product Lines with UML: From Use Cases to Pattern-Based Software Architectures. Addison-Wesley, 2004.
[24] J. Bayer, O. Flege, P. Knauber, R. Laqua, D. Muthig, K. Schmid, T. Widen, and J. DeBaud. PuLSE: a Methodology to Develop Software Product Lines. In Symposium on Software Reusability, pages 122–131. ACM, 1999.
[25] OMG. OMG's MetaObject Facility. http://www.omg.org/mof, Jan 2013.
[26] The Eclipse Foundation. Eclipse Modeling Project, Jan 2013.
[27] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén. Experimentation in Software Engineering: an Introduction. Kluwer Academic Publishers, Norwell, MA, USA, 2000.
[28] X. Amatriain and P. Arumi. Frameworks Generate Domain-Specific Languages: A Case Study in the Multimedia Domain. IEEE Transactions on Software Engineering, 37(4):544–558, Jul-Aug 2011.
[29] T. C. Oliveira, P. Alencar, and D. Cowan. ReuseTool: An Extensible Tool Support for Object-Oriented Framework Reuse, 84(12):2234–2252, Dec 2011.
[30] Pure Systems. Pure::Variants. http://www.puresystems.com/pure variants.49.0.html, Feb 2013.

A Metric of Software Size as a Tool for IT Governance
Marcus Vinícius Borela de Castro, Carlos Alberto Mamede Hernandes
Tribunal de Contas da União (TCU), Brasília, Brazil
{borela, carlosmh}@tcu.gov.br
This work has been supported by the Brazilian Court of Audit (TCU).

Abstract— This paper proposes a new metric for software functional size, which is derived from Function Point Analysis (FPA) but overcomes some of its known deficiencies. The statistical results show that the new metric, Functional Elements (EF), and its submetric, Functional Elements of Transaction (EFt), have a higher correlation with the effort in software development than FPA in the context of the analyzed data. The paper illustrates the application of the new metric as a tool to improve IT governance, specifically in assessment, monitoring, and giving directions to the software development area.

Index Terms—Function Points, IT governance, performance, Software engineering, Software metrics.

I. INTRODUCTION
Organizations need to leverage their technology to create new opportunities and produce change in their capabilities [1, p. 473]. According to ITGI [2, p. 7], information technology (IT) has become an integral part of business for many companies, with a key role in supporting and promoting their growth. In this context, IT governance fulfills an important role of directing and boosting IT in order to achieve its goals aligned with the company's strategy. In order for IT governance to foster the success of IT and of the organization, ISO 38500 [3, p. 7] proposes three main activities: to assess the current and future use of IT; to direct the preparation and implementation of plans and policies to ensure that IT achieves organizational goals; and to monitor performance and compliance with those policies (Fig. 1).

Fig. 1. Cycle Assess-Direct-Monitor of IT Governance. Source: ISO 38500 [3, p. 7]

A metric of software size can compose several indicators that help reveal the real situation of the systems development area to the senior management of an organization, directly or through IT governance structures (e.g., an IT steering committee). Measures such as the production of software in a period (e.g., measure of software size per month) and the productivity of an area (e.g., measure of software size per effort) are examples of indicators that can support the three activities of governance proposed by ISO 38500. For the formation of these indicators, one can use Function Point Analysis (FPA) to obtain function points (FP) as a metric of software size. Created by Albrecht [4], FPA has become an international standard for measuring the functional size of software, with the ISO 20926 [5] designation. Its rules are maintained and enhanced by a nonprofit international group of users called the International Function Point Users Group (IFPUG), responsible for publishing the Counting Practices Manual (CPM), now in version 4.3.1 [6].
Because it has a direct correlation with the effort expended in software development [7]-[8], FPA has been used as a tool for information technology management, not only in Brazil but worldwide. As identified in the Quality Research in the Brazilian Software Industry report, 2009 [9, p. 93], FPA is the most widely used metric to evaluate the size of software among software companies in Brazil, used by 34.5% of the companies. According to a survey carried out by Dekkers and Bundschuh [10, p. 393], 80% of the projects registered in the International Software Benchmarking Standards Group (ISBSG) repository, release 10, that applied a metric used FPA. The FPA metric is considered a highly effective instrument to measure contracts [11, p. 191]. However, it has the limitation of not treating non-functional requirements, such as quality criteria and response-time constraints. Brazilian federal government institutions also use FPA for the procurement of development and maintenance of systems. The Brazilian Federal Court of Audit (TCU) points out FPA as an example of a metric to be used in contracts.2 The metrics roadmap of SISP [12], a federal manual for software procurement, recommends its application to federal agencies.
Despite the extensive use of the FPA metric, a large number of criticisms about its validity and applicability, described in Section II-B, put in doubt the correctness of its use in contracts and the reliability of its application as a tool for IT management and IT governance. So the research question arises: is it possible to propose a metric for software development, with the acceptance and practicality of FPA, that is, based on its concepts already widely known, without some of the identified flaws, in order to maximize its use as a tool for IT governance, focusing on systems development and maintenance?
The specific objectives of this paper are: 1) to present an overview of software metrics and FPA; 2) to present the criticisms of the FPA technique that motivated the proposal of a new metric; 3) to derive a new metric based on FPA; 4) to evaluate the new metric against FPA in the correlation with effort; and 5) to illustrate the use of the proposed metric in IT governance in the context of systems development and maintenance. In the following, each objective is covered in a specific section.

II. DEVELOPMENT
A. Software Metrics
1) Conceptualization, categorization, and application
Dekkers and Bundschuh [10, p. 180-181] describe various interpretations for metric, measure, and indicator found in the literature. Concerning this study, no distinction is made among these three terms. We used Fenton and Pfleeger's definition [13, p. 5] for measure: a number or symbol that characterizes an attribute of a real-world entity, object or event, from formally defined rules. Kitchenham et al. [14] present a framework for software metrics with concepts related to the formal model on which a metric is based, for example, the type of scale used. According to Fenton and Pfleeger [13, p. 74], software metrics can be applied to three types of entities: processes, products, and resources.
The authors also differentiate direct metrics when only one attribute of an entity is used, from indirect metrics, the other way around [13, p. 39]. Indirect metrics are derived by rules based on other metrics. The speed of delivery of a team (entity type: resource) is an example of indirect metric because it is calculated from the ratio of two measures: size of developed software (product) development and elapsed time (process). The elapsed time is an example of direct metric. Moser [15, p. 32] differentiates size metrics from quality metrics: size metrics distinguish between the smallest and the largest whereas quality metrics distinguish between good and bad. Table I consolidates the mentioned categories of software metrics. 2 There are several rulings on the subject: 1.782/2007, 1.910/2007, 2.024/2007, 1.125/2009, 1.784/2009, 2.348/2009, 1.274/2010, 1.647/2010, all of the Plenary of the TCU. Moser [15, p.31] notes that, given the relationship between a product and the process that produced it, a product measure can be assigned to a process, and vice versa. For example, the percentage of effort in testing, which is a development process attribute, can be associated with the generated product as an indicator of its quality. And the number of errors in production in the first three months, a system attribute (product), can be associated to the development process as an indicative of its quality. Fenton and Pfleeger [13, p. 12] set three goals for software metric: to understand, to control, and to improve the targeted entity. They call our attention to the fact that the definition of the metrics to be used depends on the maturity level of the process being measured: the more mature, more visible, and therefore more measurable [13, p. 83]. Chikofsky and Rubin [16, p. 76] highlight that an initial measurement program for a development and maintenance area should cover five key dimensions that address core attributes for planning, controlling, and improvement of products and processes: size, effort, time, quality, and rework. The authors remind us that what matters are not the metric itself, but the decisions that will be taken from them, refuting the possibility of measuring without foreseeing the goal [16, p. 75]. According to Beyers [17, p. 337], the use of estimates in metric (e.g., size, time, cost, effort, quality, and allocation of people) can help in decision making related to software development and to the planning of software projects. 2) FPA overview According to the categorization of in previous section, FPA is an indirect measure of product size. It measures the functional size of an application (system) as a gauge of the functionality requested and delivered to the user of the software. 3 This is a metric understood by users, regardless of the technology used. According to Gencel and Demirors [18, p. 4], all functional metrics ISO standards estimate software size based on the functionality delivered to users, 4 differing in the considered objects and how they are measured. TABLE I EXAMPLES OF CATEGORIES OF SOFTWARE METRICS Criterion Category Source Entity Of process [13, p. 74] Of product Of resource Number of attributes Direct [13, p. 39\ involved Indirect Target of Size [15, p. 32] differentiation Quality 3 The overview presented results from the experience of the author Castro with FPA. In 1993, he coordinated the implementation of FPA in the area of systems development at the Brazilian Superior Labor Court (TST). 
At TCU, he works with metrics, albeit sporadically, without exclusive dedication.
4 Besides FPA, there are four other functional metrics that are ISO standards, as they meet the requirements defined in the six standards of ISO 14143: MKII FPA, COSMIC-FFP, FISMA, and NESMA.
Non-functional attributes of a development process (e.g., development team experience, chosen methodology) are not in the scope of functional metrics. Functional requirements are only one of several dimensions impacting the effort, and all of them have to be taken into account in estimates. Estimates and non-functional requirements evaluations are not the goal of this paper.
Functionalities can be of two types: transactions, which implement data exchanges with users and other systems, and data files, which indicate the structure of stored data. There are three types of transactions: external inquiries (EQ), external outputs (EO), and external inputs (EI), according to whether the primary intent of the transaction is, respectively, a simple query, a more elaborate query (e.g., with calculated totals) or a data update. There are two types of logical data files: internal logical files (ILF) and external interface files (EIF), according to whether their data are, respectively, updated or just referenced (accessed) in the context of the application. Fig. 2 illustrates these five function types graphically. To facilitate understanding, we can consider an example of an EI as an employee inclusion form, which includes information in the employees data file (ILF) and validates the tax code (CPF) informed by the user by accessing the external file of taxpayers (EIF), external to the application. Also in the application we could have, hypothetically, an employee report, a simple query containing the names of the employees of a given organizational unit (EQ), and a more complex report with the number of employees per unit (EO).

Fig. 2. Visualization of the five types of functions in FPA (figure labels: Employee Inclusion (EI), Taxpayer (EIF), Employee Report (EQ), Employee (ILF), Totals per Unit (EO), Application Boundary, User or External System).

In the FPA calculation rule, each function is evaluated for its complexity and takes one of three classifications: low, medium or high complexity. Each level of complexity is associated with a size in function points. Table II illustrates the derivation rule for external inquiries, according to the number of files accessed (File Type Referenced - FTR) and the number of fields that cross the boundary of the application (Data Element Type - DET).

TABLE II
DERIVATION RULE FOR COMPLEXITY AND SIZE IN FUNCTION POINTS OF AN EXTERNAL INQUIRY
FTR (file) \ DET (field):    1-5           6-19          20 or more
1                            low (3)       low (3)       medium (4)
2-3                          low (3)       medium (4)    high (6)
4 or more                    medium (4)    high (6)      high (6)

As for EQ, each type of functionality (EO, EI, ILF, and EIF) has its specific rules for the derivation of complexity and size, similar to Table II. Table III summarizes the categories of attributes used for calculating function points according to each type of functionality. The software size is the sum of the sizes of its functionalities. This paper is not an in-depth presentation of the concepts associated with FPA. Details can be obtained in the Counting Practices Manual, version 4.3.1 [6].
B.
Criticisms to the FPA technique that motivated the proposal of a new metric Despite the extensive use of the metric FPA, mentioned in Section I, there are a lot of criticism about its validity and applicability that call into question the correctness of its use in contracts and the reliability of its application as a tool for IT management and governance ( [19], [13], [20], [21], [14], [22]; [23], [24], [25]). Several metrics have been proposed taking FPA as a basis for their derivation, either to adapt it to particular models, or to improve it, fixing some known bugs. To illustrate, there is Antoniol et al. [26] work proposing a metric for objectoriented model and Kralj et al. [22] work proposing a change in FPA to measure more accurately high complexity functions (item 4 below). The objective of the metric proposed in this paper is not to solve all faults of FPA, but to help to reduce the following problems related to its definition: 1) low representation: the metric restricts the size of a function to only three possible values, according to its complexity (low, medium, or high). But there is no limit on the number of possible combinations of functional elements considered in calculating the complexity of a function in FPA; 2) functions with different functional complexities have the same size: as a consequence of the low representation. Pfleeger et al. [23, p. 36] say that if H is a measure of size, and if A is greater than B, then HA should be greater than HB. Otherwise, the metric would be invalid, failing to capture in the mathematical world the behavior we perceive in the empirical world. Xia et al. [25, p. 3] show examples of functions with different complexities that were improperly assigned the same value in function points because they fall into the same complexity classification, thus exposing the problem of ambiguous classification; 3) abrupt transition between functional element ranges: Xia et al. [25, p. 4] introduced this problem. They present two logical files, B and C, with apparent similar complexities, differing only in the number of fields: B has 20 fields and C has 19 fields. The two files are classified as low (7 fp, function points) and medium complexity (10 fp), respectively. The difference lies in the transition of the two ranges in the complexity derivation table: up to 19 fields, it is considered low complexity; from 20 fields, it is considered medium complexity. The addition of only one field leading to an increase in 3 pf is inconsistent, since varying from 1 to 19 fields does not involve any change in the function point size. A similar result occurs in other ranges of transitions; 4) limited sizing of high (and low) complexity functions: FPA sets an upper (and a lower) limit for the size of a function TABLE III CATEGORIES OF FUNCTIONAL ATTRIBUTES FOR EACH TYPE OF FUNCTIONALITY Function Functional Attributes Transactions: EQ, EO, EI referenced files (FTR) and fields (DET) Logical files: ILF, EIF logical registers (Record Element Type - RET) and fields (DET) in 6, 7, 10 or 15, according to its type. Kralj et al. [22, p. 83] describe high complexity functions with improper sizes in FPA. They propose a change in the calculation of FPA to support larger sizes for greater complexity; 5) undue operation on ordinal scale: as previously seen, FPA involves classifying the complexity of functions in low, medium or high complexity, as a ordinal scale. These labels in the calculated process are substituted by numbers. 
An internal logical file, for example, receives 7, 10 or 15 function points, as its complexity is low, medium or high, respectively. Kitchenham [20, p. 29] criticizes the inadequacy of adding up values of ordinal scale in FPA . He argues that it makes no sense to add the complex label with the simple label, even if using 7 as a synonym for simple and 15 as a synonym for complex; 6) inability to measure changes in parts of the function: this characteristic, for example, does not allow to measure function points of part of a functionality that needs to be changed in one maintenance operation. Thus, a function addressed in several iterations in an agile method or other iterative process is always measured with full size, even if the change is considered small in each of them. For example, consider three maintenance requests at different moments for a report already with the maximum size of 7 fp, which initially showed 50 distinct fields. Suppose each request adds a single field. The three requests would be dimensioned with 7 fp each, the same size of the request that created the report, and would total 21 fp. Aware of this limitation, PFA [6, vol. 4, p. 94] points to the Netherlands Software Metrics Association (NESMA) metric as an alternative for measuring maintenance requests. NESMA presents an approach to solve this problem. According to the Function Point Analysis for Software Enhancement [27], NESMA measures a maintenance request as the multiplication of the original size of a function by a factor of impact of the change. The impact factor is the ratio of the number of attributes (e.g., fields and files) included, changed or deleted by the original number of attributes of the function. The adjustment factor assumes multiple values of 25%, varying up to a maximum of 150%. Given the deficiencies reported, the correlation between the size in function points of software and the effort required for the development tends not to be appropriate, since FPA has these deficiencies in the representation of the real functional size of software. If there are inaccuracies in the measuring of the size of what must be done, it is impossible to expect a proper definition of the effort and therefore accuracy in defining the cost of development and maintenance. The mentioned problems motivated the development of this work, in order to propose a quantitative metric, with infinite values, called Functional Elements (EF). C. Derivation process of the new metric The proposed metric, Functional Elements, adopts the same concepts of FPA but changes the mechanism to derive the size of functions. The use of concepts widely known to metric specialists will enable acceptance and adoption of the new metric among these professionals. The reasoning process for deriving the new metric, as described in the following sections, implements linear regression similar to that seen in Fig. 3. The objective is to derive a formula for calculating the number of EF for each type of function (Table VII in Section II-C-4) from the number of functional attributes considered in the derivation of its complexity, as indicated in Table III in Section II-A-2. In this paper, these attributes correspond to the concept of functional elements, which is the name of the metric proposed. The marked points in Fig. 3 indicate the size in fp (Z axis) of an external inquiry derived from the number of files (X axis) and the number of fields (Y axis), which are the attributes used in the derivation of its complexity (see Table II in Section II-A-2). 
The grid is the result of a linear regression of these points, and represents the new value of the metric. 1) Step 1 - definition of the constants If the values associated with the two categories of functional attributes are zero, the EF metric assumes the value of a constant. Attributes can be assigned value zero, for example, in the case of maintenance limited to the algorithm of a function not involving changes in the number of fields and files involved. The values assigned to these constants come from the NESMA functional metric mentioned in Section 2-B. This metric was chosen because it is an ISO standard and supports the maintenance case with zero-value attributes. For each type of functionality, the proposed metric uses the smallest possible value by applying NESMA, that is, 25% of the number of fp of a low complexity function of each type: EIF - 1.25 (25% of 5); ILF - 1.75 (25% of 7); EQ - 0.75 (25% of 3); EI - 0.75 (25% of 3), and EO - 1 (25% of 4). Fig. 3. Derivation of number of fp of an external inquiry from the attributes used in the calculation 2) Step 2 - treatment of ranges with unlimited number of elements In FPA, each type of function has its own table to derive the complexity of a function. Table II in Section II-A-2 presents the values of the ranges of functional attributes for the derivation of the complexity of external inquiries. The third and last range of values of each functional element of all types of functions is unlimited. We see 20 or more TD in the first cell of the fourth column of the same table, and 4 or more ALR in the last cell of the first column. The number of elements in the greater range, that is, the highest value among the first two ranges, was chosen for setting a upper limit for the third range. In the case of ranges for external inquiries, the number of fields was limited to 33, having 14 elements (20 to 33) as the second range (6 to 19), the largest one. The number of referenced files was limited to 5, using the same reasoning. The limitation of the ranges is a mathematical artifice to prevent imposing an upper limit for the new metric (4th criticism in Section II-B). 3) Step 3 - generation of points for regression The objective of this phase was to generate, for each type of function, a set of data records with three values: the values of the functional attributes and the derived fp, already decreased from the value of the constant in step 1. Table IV illustrates some points generated for the external inquiry. An application developed in MS Access generated a dataset with all possible points for the five types of functions, based on the tables of complexity with bounded ranges developed in the previous section. Table V shows all considered combinations of ranges for EQ. 4) Step 4 - linear regression The several points obtained by the procedure described in the previous section were imported into MS Excel for linear regression using the ordinary least squares method (OLS). The regression between the size fp, which is the dependent variable, and the functional attributes, which are the dependent variables, held constant with value zero, since these constants were already defined in step 1 and decreased from the expected value in step 3. The statistical results of the regression are shown in Table VI for each type of function. Table VII shows the derived formula for each type of function with coefficient values rounded to two decimal place values. 
Each formula calculates the number of functional elements, which is the proposed metric, based on the functional attributes impacting the calculation and the constants indicated in step 1. The acronym EFt and EFd represent the functional elements associated with transactions (EQ, EI, and EO) and data (ILF and EIF), respectively. The functional elements metric, EF, is the sum of the functional elements transaction, EFT, with the functional TABLE IV PARTIAL EXTRACT OF THE DATASET FOR EXTERNAL INQUIRY FTR DET PF (decreased of constant of step 1) 1 1 2.25 1 2 2.25 (...) 1 33 3.25 2 1 2.25 (...) TABLE V COMBINATIONS OF RANGES FOR CALCULATING FP OF EQ Function Initial Final Initial Final Original PF decreased type FTR FTR DET DET FP of constant EQ 1 1 1 5 3 2.25 EQ 1 1 6 19 3 2.25 EQ 1 1 20 33 4 3.25 EQ 2 3 1 5 3 2.25 EQ 2 3 6 19 4 3.25 EQ 2 3 20 33 6 5.25 EQ 4 5 1 5 4 3.25 EQ 4 5 6 19 6 5.25 EQ 4 5 20 33 6 5.25 TABLE VI STATISTICAL REGRESSION - COMPARING RESULTS PER TYPES OF FUNCTIONS ILF 2 R Records Coefficient pvalue (FTR or RET) Coefficient pvalue (DET) EIF EO EI EQ 0.96363 729 0.96261 729 0.95171 198 0.95664 130 0.96849 165 3.00E-21 1.17E-21 7.65E-57 1.70E-43 4.30E-60 2.28E-23 2.71E-22 1.44E-59 2.76E-39 2.95E-45 TABLE VII CALCULATION FORMULAS OF FUNCTIONAL ELEMENTS BY TYPE OF FUNCTION5 Function type Formula ILF 𝐸𝐹𝑑 = 1.75 + 0.96 ∗ 𝑅𝐸𝑇 + 0.12 ∗ 𝐷𝐸𝑇 EIF 𝐸𝐹𝑑 = 1.25 + 0.65 ∗ 𝑅𝐸𝑇 + 0.08 ∗ 𝐷𝐸𝑇 EO 𝐸𝐹𝑡 = 1.00 + 0.81 ∗ 𝐹𝑇𝑅 + 0.13 ∗ 𝐷𝐸𝑇 EI 𝐸𝐹𝑡 = 0.75 + 0.91 ∗ 𝐹𝑇𝑅 + 0.13 ∗ 𝐷𝐸𝑇 EQ 𝐸𝐹𝑡 = 0.75 + 0.76 ∗ 𝐹𝑇𝑅 + 0.10 ∗ 𝐷𝐸𝑇 elements of data, EFd, as explained in the formulas of Table VII. So the proposed metric is: EF = EFt + EFd. The EFt submetric considers logical files (ILF and EIF) as they are referenced in the context of transactions. Files are not counted in separate as in the EFd submetric. Similar to two other ISO standard metrics of functional size [10, p. 388], MKII FPA [28] and COSMIC-FFP [29], EFt does not take into account logical files. EFt is indicated for the cases where the effort of dealing with data structures (EFd) is not subject to evaluation or procurement. In the next section, the EF and EFt metrics were tested, counting and not counting logical files, respectively. Results show stronger correlation with effort for EFt. Although not evaluated, the EFd submetric has its role as it reflects the structural complexity of the data of an application. D. Evaluation of the new metric The new EF metric and its submetric EFt were evaluated for their correlation with effort in comparison to the FPA metric.6 The goal was not to evaluate the quality of these correlations, but to compare their ability to explain the effort. We obtained a spreadsheet from a federal government agency with records of Service Orders (OS) contracted with private companies for coding and testing activities. An OS 5 The size of a request for deleting a function is equal to the constant value, since no specific attributes are impacted by this operation. 6 Kemerer [8, p. 421] justified linear regression as a means of measuring this correlation. contained one or more requests for maintenance or development of functions of one system, such as: create a report, change a transaction. The spreadsheet showed for each OS the real allocated effort and, for each request, the size of the function handled. The only fictitious data were the system IDs, functionality IDs and OS IDs, as they were not relevant to the scope of this paper. Each system was implemented in a single platform: Java, DotNet or Natural. 
The spreadsheet showed the time spent in hours and the number of people allocated for each OS. The OS effort, in man-hours, was derived from the product of time by team size. Table VIII presents the structure of the received data. Data from 183 Service Orders were obtained. However, 12 were discarded for having dubious information, for example, undefined values for function type, number of fields, and operation type. The remaining 171 service orders were related to 14 systems and involved 505 requests that dealt with 358 different functions. To achieve higher quality in the correlation with effort, we decided to consider only the four systems associated with at least fifteen OS, namely, systems H, B, C, and D. Table IX indicates the number of OS and requests for each system selected. The data were imported into MS Excel to perform the linear regression using the ordinary least squares method after calculating the size in EF and EFt metrics for each request in an MS-Access application developed by the authors.7 The regression considered the effort as the independent variable and the size calculated in the PF, EF, and EFT metrics as the dependent ones. As there is no effort if there is no size, the regression considered the constant with value zero, that is, the straight line crosses the origin of the axes. Independent regressions were performed for each system, since the variability of the factors that influence the effort is low within a single system, because the programming language is the same and the technical staff is generally also the same.8 Fig. 4 illustrates the dispersion of points (OS) on the correlation between size and effort in EFt (man-hour) and the line derived by linear regression in the context of system H. The coefficient of determination R2 was used to represent the degree of correlation between effort and size calculated for each of the evaluated metrics. According to Sartoris [30, p. 244], R2 indicates, in a linear regression, the percentage of the variation of a dependent variable Y that is explained by the variation of a second independent variable X. Table IX shows the results of the linear regressions performed. From the results presented on Table IX, comparing the correlation of the metrics with effort, we observed that: 1) correlations of the new metrics (EF, EFt) were considered significant at a confidence level of 95% for all 7 A logistic nonlinear regression with constant was also performed using Gretl, a free open source tool (http://gretl.sourceforge.net). However, the R2 factor proved that this alternative was worse than the linear regression for all metrics. 8 The factors that influence the effort and the degree of this correlation are discussed in several articles. We suggest the articles available in the BestWeb database (http://www.simula.no/BESTweb), created as a result of the research of Jorgensen and Shepperd [31]. 
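As a concrete illustration of the sizing step that precedes the regression, the sketch below shows how the EFt of a single transaction request could be computed from the Table VII coefficients. It is a minimal sketch, not the authors' MS Access application; the class and method names are ours, and the example values (1 FTR, 10 DET) are hypothetical. Per footnote 5, a deletion request would be sized with the constant alone.

    /** Illustrative sketch: EFt for transaction functions, using the Table VII coefficients. */
    public class FunctionalElements {
       /** EFt of an external inquiry (EQ): 0.75 + 0.76*FTR + 0.10*DET. */
       public static double eftExternalInquiry(int ftr, int det) {
          return 0.75 + 0.76 * ftr + 0.10 * det;
       }
       /** EFt of an external input (EI): 0.75 + 0.91*FTR + 0.13*DET. */
       public static double eftExternalInput(int ftr, int det) {
          return 0.75 + 0.91 * ftr + 0.13 * det;
       }
       /** EFt of an external output (EO): 1.00 + 0.81*FTR + 0.13*DET. */
       public static double eftExternalOutput(int ftr, int det) {
          return 1.00 + 0.81 * ftr + 0.13 * det;
       }
       public static void main(String[] args) {
          // An EQ referencing 1 file and crossing 10 fields: 0.75 + 0.76*1 + 0.10*10
          System.out.println(eftExternalInquiry(1, 10)); // prints approximately 2.51
       }
    }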
TABLE VIII STRUCTURE OF THE RECEIVED DATA TO EVALUATE THE METRIC Abbreviation Description Domain OS Function Type Operation Final FTR RET Operation FTR RET Original FTR RET Final DET Operation DET Original TD FP %Impact PM System Hours Team Identification Number of a service order Identification Number of a function Type (categorization) of a functionality according to FPA Operation performed, which may be inclusion (I) of a new feature or change (A) of a function (maintenance) Value at the conclusion of the request implementation: if the function is a transaction, indicates the number of referenced logical files (FTR); if it is a logical file, indicates the number of logical records (RET) Number of FTR or RET that were included, changed or deleted in the scope of a maintenance of a functionality (only in change operation) Number of FTR or RET originally found in the functionality (only in change operation) Number of DET at the conclusion of the request implementation Number of DET included, changed or deleted in the scope of a functionality maintenance (only in change operation) Number of DET originally found in a functionality (only in change operation) Number of function points of the functionality at the conclusion of the request Percentage of the original function impacted by the maintenance, as measured by NESMA [27] Number of maintenance points of the functionality handled, as measured by NESMA [27] Identification of a system Hours dedicated by the team to implement the OS Number of team members responsible for the implementation of the OS up to 10 numbers up to 10 numbers ALI, AIE, EE, SE or CE I or A up to 3 numbers up to 3 numbers up to 3 numbers up to 3 numbers up to 3 numbers up to 3 numbers up to 2 numbers 25, 50, 75, 100, 125, 150 up to 4 numbers one char up to 5 numbers up to 2 numbers systems (p-value less than 0.05).9 However, the correlation of FPA was not significant for system B (p-value 0.088 > 0.05); 2) correlations of the new metrics were higher in both systems with the highest number of OS (H and B). A better result in larger samples is an advantage, because the larger the sample size, the greater the reliability of the results, since the p-value has reached the lowest values for these systems; 3) although no metric got a high coefficient of determination (R2 > 0.8), the new metrics achieved medium correlation (0.8 > R2 > 0.5) in the four systems evaluated, whereas FPA obtained weak correlation (0.2 > R2) in system B. We considered the confidence level of 91.2% in this correlation (p-value 0.88); 4) the correlation of the new metrics was superior in three out of the four systems (H, B, and D). (A correlation C1 is classified as higher than a correlation C2 if C1 is significant and C2 is not significant or if both correlations are significant and C1 has a higher R2 than C2.) 9 To be considered a statistically significant correlation at a confidence level of X%, the p-value must be less than 1 - X [30, p.11]. For a 95% confidence level, the p-value must be less than 0.05. OS Effort (mh) 1500 1000 500 0 0 10 20 30 40 OS size (EFt) Fig.4. 
TABLE IX. RESULTS OF LINEAR REGRESSIONS - EFFORT VERSUS SIZE METRICS
System | Quantity of OS | Quantity of Requests | FP R2 | FP p-value (F-test) | EF R2 | EF p-value (F-test) | EF proportion to FP's R2 | EFt R2 | EFt p-value (F-test) | EFt proportion to FP's R2
H | 45 | 245 | 59.3% | 4.6E-10 | 65.1% | 1.5E-11 | +10% | 66.1% | 8.5E-12 | +11%
B | 25 | 44 | 11.2% | 8.8E-02 | 60.3% | 2.3E-06 | +438% | 60.3% | 2.3E-06 | +438%
C | 21 | 60 | 67.7% | 3.3E-06 | 53.0% | 1.4E-04 | -22% | 53.0% | 1.4E-04 | -22%
D | 15 | 20 | 51.8% | 1.9E-03 | 54.7% | 1.2E-03 | +5% | 54.7% | 1.2E-03 | +5%

Given the observations listed above, we conclude, for the analyzed data, that the proposed metrics, EF and EFt, have a better correlation with effort than FPA. A higher correlation of the EFt metric in comparison with EF was perceived for system H. Only system H allowed a differentiation of the results for the two metrics, by presenting requests for changing logical files in its service orders. Therefore, we see that the EFt submetric tends to yield better correlations than EF. This result reinforces the hypothesis that the EFd submetric, which composes the EF metric, does not impact the effort, at least not for coding and testing, which are the tasks addressed in the evaluated service orders. Table X explains how the proposed metrics, EF and EFt, address the criticisms presented in Section II-B.

E. Illustration of the use of the new metrics in IT governance
Kaplan and Norton [33, p. 71] claim that what you measure is what you get. According to COBIT 5 [34, p. 13], governance aims to create value by obtaining benefits with optimized risks and costs. In relation to IT governance, the metrics proposed in this paper not only help to assess the capacity of IT but also enable the optimization of its processes to achieve the desired results. Metrics support the communication between the different actors of IT governance (see Fig. 5) by enabling the translation of objectives and results into numbers. The quality of a process can be increased by stipulating objectives and by measuring results through metrics [15, p. 19]. Thus, the production capacity of the information systems development process can be enhanced to achieve the strategic objectives with the appropriate use of metrics and estimates.

Fig. 5. Roles, activities and relationships of IT governance (owners and stakeholders, governing body, management, operations; delegate/accountable, set direction/monitor, instruct/report). Source: ISACA [35, p. 24]

TABLE X. JUSTIFICATIONS OF HOW THE NEW METRICS ADDRESS THE CRITIQUES PRESENTED IN SECTION II-B
Critique | Solution
Low representation | Each possible combination of the functional attributes considered in deriving the complexity in FPA is associated with a distinct value.
Functions with different complexities have the same size | Functionalities with different complexities, as determined by the number of functional attributes, assume a different size.
Abrupt transition between functional element ranges | By applying the calculation formulas described in Section II-C-4, the variation in size is uniform for each variation of the number of functional attributes, according to its coefficients.
Limited sizing of high (and low) complexity functions | There is no limit on the size assigned to a function by applying the calculation formulas described in Section II-C-4.
Undue operation on ordinal scale | The metrics do not have an ordinal scale with finite values, but rather a quantitative scale with infinite discrete values, which provides greater reliability in operations with values.
Inability to measure changes in parts of the function | The metrics enable the measurement of changes in part of a functionality by considering in the calculation only the functional attributes impacted by the amendment.

Software metrics contribute to the three IT governance activities proposed by ISO 38500, mentioned in Section I: to assess, to direct and to monitor. These activities correspond, respectively, to the goals of software metrics mentioned in Section II-A-1: to understand, to improve, and to control the targeted entity of a measurement. Regarding the direction of the IT area, Weill and Ross [36, p. 188] state that the creation of metrics for the formalization of strategic choices is one of four management principles that summarize how IT governance helps companies achieve their strategic objectives. Metrics must capture the progress toward strategic goals and thus indicate whether IT governance is working or not [36, p. 188]. Kaplan and Norton [37, pp. 75-76] claim that strategies need to be translated into a set of goals and metrics in order to obtain everyone's commitment. They claim that the Balanced Scorecard (BSC) is a tool which provides knowledge of long-term strategies at all levels of the organization and also promotes the alignment of department and individual goals with those strategies. According to ITGI [2, p. 29], the BSC, besides providing a holistic view of business operations, also contributes to connecting long-term strategic objectives with short-term actions. To adapt the concepts of the BSC to the IT function, the perspectives of a BSC were re-established [38, p. 3]. Table XI presents the perspectives of a BSC-IT and their base questions.

TABLE XI. PERSPECTIVES OF A BSC-IT
Perspective | Base question | Corresponding corporate BSC perspective
Contribution to the business | How do business executives see the IT area? | Financial
Customer orientation | How do customers see the IT area? | Customer
Operational excellence | How effective and efficient are the IT processes? | Internal Processes
Future orientation | How is IT prepared for future needs? | Learning
Source: inspired by ITGI [2, p. 31]

According to ITGI [2, p. 30], the BSC-IT effectively helps the governing body to achieve alignment between IT and the business, and it is one of the best practices for measuring performance [2, p. 46]. The BSC-IT is a tool that organizes information for the governance committee, creates consensus among the stakeholders about the strategic objectives of IT, demonstrates the effectiveness and the value added by IT, and communicates information about capacity, performance and risks [2, p. 30]. Van Grembergen [39, p. 2] states that the relationship between IT and the business can be expressed more explicitly through a cascade of scorecards, and divides the BSC-IT into two: BSC-IT-Development and BSC-IT-Operations. Rohm and Malinoski [40], members of the Balanced Scorecard Institute, present a nine-step process to build and implement strategies based on scorecards. Bostelman and Becker [41] present a method to derive objectives and metrics from the combination of the BSC and the Goal Question Metric (GQM) technique proposed by Basili and Weiss [42]. The association with the GQM method is consistent with what ISACA [43, p. 74] says: good strategies start with the right questions.
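As a purely hypothetical illustration of such a GQM-style derivation (the goal, questions, and metric names below are ours, anticipating the indicators of Table XII, and are not taken from the cited works), a mapping could look like this:

# Hypothetical GQM-style mapping for the Operational Excellence perspective
# of a BSC-IT-Development; all names are illustrative only.
gqm = {
    "goal": "Increase the production capacity of the systems development process",
    "questions": {
        "How much functionality is delivered per period?": ["Production in the period (EF)"],
        "How efficiently is it delivered?": ["Productivity (EF/man-hour)", "Delivery speed (EF/hour)"],
        "How much of the delivery is rework?": ["Production on rework (EF)"],
    },
}

for question, metrics in gqm["questions"].items():
    print(f"{question} -> {', '.join(metrics)}")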
The metric proposed in this paper can compose several indicators that can be used in a BSC-IT-Development. Regarding the activities of IT monitoring and assessment [3, p. 7], metrics enable the monitoring of the improvement rate of organizations toward a mature and improved process [1, p. 473]. Performance measurement, which is the object of monitoring and assessment, is one of the five focus areas of IT governance, and it is classified as a driver to achieve results [2, p. 19]. To complement the illustration of the applicability of the new metric for IT governance, Table XII shows some indicators based on EF (Footnote 10: the illustration is not restricted to EF, as the indicators could use other software size metrics). The same indicator can be used in different perspectives of a BSC-IT-Development, depending on the targeted entity and the objective of the measurement, as in the following examples. The productivity of a resource (e.g., staff, technology) may be associated with the Future Orientation perspective, as it helps to answer whether IT is prepared for future needs. The same indicator, if associated with an internal process, encoding for example, reflects a vision of its production capacity, in the Operational Excellence perspective. In the Customer Orientation perspective, production can be broken down by client, showing the proportion of IT production delivered to each business area. The evaluation of the variation in IT production in contrast to the production of the business would be an example of using the indicator in the Contribution to the Business perspective. The choice of indicators aimed to encompass the five fundamental dimensions mentioned in Section II-A-1: size, effort, time, quality, and rework. A sixth dimension was added: the expected benefit. According to Rubin [44, p. 1], every investment in IT, from a simple training to the creation of a corporate system, should be aligned to a priority of the business whose success must be measured in terms of a specific value. Investigating the concepts and processes associated with the determination of the value of a function (or of a system, or of the IT area) is not part of the scope of this work, as it is a complex and still immature subject. The dimension of each indicator is shown in the third column of Table XII. Some measurements were normalized by dividing them by the number of functional elements of the product or process, a tactic used to allow comparisons across projects and systems of different sizes. The ability to standardize comparisons, as in a BSC, is one of the key features of software metrics [45, p. 493]. It is similar to normalizing construction metrics by square meter, a common practice [46, p. 161]. As Dennis argues [47, p. 302], one should not make decisions based on a single indicator, but from a vision formed by several complementary indicators. As IT has assumed greater prominence as a facilitator of the achievement of business strategy, the use of dashboards to monitor its performance, under appropriate criteria, has become popular among company managers [43, p. 74]. Abreu and Fernandes [48, p. 167] propose some topics that may compose such strategic and tactical IT control panels.
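To make the size-based normalization concrete, the sketch below computes three of the indicators defined in Table XII below for a single system and period; all figures are invented for illustration and are not taken from the evaluated organization.

# Hypothetical period data for one system (invented figures).
implemented_size_ef = 120.0   # functional size of the requests implemented in the period (EF)
effort_man_hours = 480.0      # total effort of the team in the period (man-hours)
failures = 9                  # failures reported from use of the system in the period
system_size_ef = 800.0        # functional size of the system at the end of the period (EF)
elapsed_hours = 720.0         # elapsed time of the period, in hours

productivity = implemented_size_ef / effort_man_hours    # EF per man-hour
error_density = failures / system_size_ef                # failures per EF
delivery_speed = implemented_size_ef / elapsed_hours     # EF per elapsed hour

print(f"productivity = {productivity:.3f} EF/man-hour")
print(f"error density = {error_density:.4f} failures/EF")
print(f"delivery speed = {delivery_speed:.3f} EF/hour")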
TABLE XII. DESCRIPTION OF ILLUSTRATIVE INDICATORS
Metric | Unit | Dimension | Description of the calculation for a system
Functional size | EF | Size | sum of the functional size of the functionalities that compose the system at the end of the period
Production in the period | EF | Effort | sum of the functional size of the requests for inclusion, deletion, and change implemented in the period
Production on rework | EF | Rework | sum of the functional size of the requests for deletion and change implemented in the period
Productivity | Functional Elements / man-hour | Effort | sum of the functional size of the requests implemented in the period / sum of the efforts of all persons allocated to the system activities in the period
Error density | Failures / Functional Element | Quality | number of failures resulting from the use of the system in a period / size of the system at the end of the period
Delivery speed | Functional Elements / hour | Time | sum of the size of the features implemented in the period / elapsed time
Density of the expected benefit | $ / EF | Expected benefit | benefit expected from the system in the period / system size

Fig. 6 illustrates the behavior of the indicators shown in Table XII, with annual verification, for hypothetical systems A, B, C, and D (Footnote 11: the fictitious values associated with the indicators were adjusted so that all vertical columns had the same maximum width; the adjustment was done by correlating the maximum value of each indicator with the width defined for the column, and the other values were derived by a simple rule of three). The vertical solid line indicates the value of the indicator for the system in the previous period, allowing a view of how much the values increased or decreased over the period. In the productivity column (column 4), a short line at its base indicates, for example, a reference value obtained by benchmark. The vertical dashed line associated with the production in the period (column 2) indicates the target set for each system in the period: system A reached it, system D exceeded it, and systems B and C failed to meet it. In an illustrative and superficial analysis of the indicators for system C, one can associate the failure to achieve the production goal in that period (2) with the decrease in delivery speed (6) and the increase in production on rework (3), which most likely resulted from the growth in error density (5). The reduction in delivery speed (6), which can be associated with decreased productivity (4), led to a low growth of the functional size of the system (1) during that period. These negative results led to a decrease in the density of the expected benefit (7). Fig. 6 represents one option for visualizing the governance indicators shown in Table XII: a multi-metric chart of multiple instances of a targeted entity or a targeted attribute. The width of each vertical column varies according to the values of the indicators (horizontal axis) associated with the different instances of the entities or attributes of interest (vertical axis). The same vertical space is allocated to each entity instance, and the width of the colored area, which is traced from left to right, indicates graphically the value of the indicator for that instance. In the hands of the governance committee, the right indicators can help senior management, directly or through a governance structure, to identify how IT management is behaving and to identify problems and the appropriate course of action when necessary.

Fig. 6. Annual indicators of systems A, B, C and D (columns: 1 functional size, 2 production in the period, 3 production on rework, 4 productivity, 5 error density, 6 delivery speed, 7 density of the expected benefit; rows: systems A to D).
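A minimal sketch of the width adjustment described in footnote 11, assuming hypothetical indicator values and an arbitrary full column width: each value is scaled by a simple rule of three so that the largest value fills the column.

# Invented values of one indicator for systems A-D and a hypothetical column width.
values = {"A": 35.0, "B": 12.0, "C": 8.0, "D": 50.0}
full_width = 100.0  # width reserved for the column, in arbitrary drawing units

max_value = max(values.values())
widths = {system: value / max_value * full_width for system, value in values.items()}

for system in sorted(widths):
    print(f"system {system}: width = {widths[system]:.1f}")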
III. FINAL CONSIDERATIONS
The five specific objectives proposed for this work in Section I were achieved, albeit with limitations and with possibilities for improvement that translate into proposals for future work. The main result was the proposition of the new metric EF and its submetric EFt. The new metrics, free of some deficiencies of the FPA technique taken as the basis for their derivation, reached a higher correlation with effort than the FPA metric in the context of the analyzed data. The paper also illustrated the connection found between metrics and IT governance activities, either in assessment and monitoring, through use in dashboards, or in giving direction, through use in a BSC-IT.

There are possibilities for future work in relation to each of the five specific objectives. Regarding the conceptualization and the categorization of software metrics, a comprehensive literature review is necessary for the construction of a wider and updated categorization of software metrics. Regarding the presentation of the criticisms of FPA, only the criticisms addressed by the newly proposed metrics were presented; research on the theme, such as a bibliographic study cataloguing the criticisms, would serve to encourage other propositions of software metrics. Regarding the process of creating the new metric, it could be improved, or it could be applied to other metrics of any area of knowledge that are based on ordinal values derived from complexity tables, as in FPA (e.g., the Use Case Points metric proposed by Karner [49]). Future work may also propose and evaluate changes in the rules and in the scope of the new metrics. Regarding the evaluation of the new metric, the limitation of using data from only one organization could be overcome in new studies. Practical applications of the metric could also be illustrated. New studies could compare the results of EF with those of the EFt submetric, as well as compare both with other software metrics. Different statistical models could be used to evaluate their correlation with effort, including in specific contexts (e.g., development, maintenance, development platforms). We expect the new metric to achieve a higher correlation with effort than FPA in agile methods, considering its capacity for sizing partial functionalities (sixth criticism in Section II-B). Regarding the connection with IT governance, a study about the use of metrics in all IT governance activities is promising. The graph proposed for the visualization of multiple indicators of multiple instances, through columns with varying widths along their length, can also be standardized and improved in future work. Footnote 12: at http://learnr.wordpress.com (accessed on 04 November 2012) there is a graph that functionally resembles the proposed one: heatmap plotting. However, it differs in format and in the possibilities of evolution. As we did not find any similar graph, we presume this to be a new format for viewing the behavior of Multiple Indicators about Multiple Instances through Columns with Varying Widths along their Extension (MIMICoVaWE). An example of evolution would be a variation in the color tone of a cell according to a specific criterion (e.g., in relation to the achievement of a specified goal). A suggestion for future work is noteworthy: the definition
of an indicator that shows the level of maturity of a company regarding the use of metrics in IT governance. Among other aspects, it could consider: the breadth of the entities evaluated (e.g., systems, projects, processes, teams), the dimensions treated (e.g., size, rework, quality, effectiveness) and the effective use of the indicators (e.g., monitoring, assessment). Finally, we expect the new metric EF and its submetric EFt to help increase the contribution of IT to the business in an objective, reliable, and visible way.

REFERENCES
[1] H. A. Rubin, "Software process maturity: measuring its impact on productivity and quality," in Proc. of the 15th Int. Conf. on Softw. Eng., IEEE Computer Society Press, pp. 468-476, 1993.
[2] ITGI - IT Governance Institute, Board Briefing on IT Governance, 2nd ed., Isaca, 2007.
[3] ISO/IEC, 38500: Corporate governance of information technology, 2008.
[4] A. J. Albrecht, "Measuring application development productivity," in Guide/Share Application Develop. Symp. Proc., pp. 83-92, 1979.
[5] ISO/IEC, 20926: Software measurement - IFPUG functional size measurement method, 2009.
[6] IFPUG - International Function Point Users Group, Counting Practices Manual, Version 4.3.1, IFPUG, 2010.
[7] A. Albrecht and J. Gaffney Jr., "Software function, source lines of code, and development effort prediction: A software science validation," IEEE Trans. Softw. Eng., vol. 9, pp. 639-648, 1983.
[8] C. F. Kemerer, "An empirical validation of software cost estimation models," Communications of the ACM, vol. 30, no. 5, pp. 416-429, 1987.
[9] Brazil, MCT - Ministério da Ciência e Tecnologia, "Quality Research in the Brazilian Software Industry; Pesquisa de Qualidade no Setor de Software Brasileiro - 2009," Brasília, 204 p. [Online]. Available: http://www.mct.gov.br/upd_blob/0214/214567.pdf
[10] M. Bundschuh and C. Dekkers, The IT Measurement Compendium: Estimating and Benchmarking Success with Functional Size Measurement, Springer, 2008.
[11] C. E. Vazquez, G. S. Simões and R. M. Albert, Function Point Analysis: Measurement, Estimates and Software Project Management; Análise de Pontos de Função: Medição, Estimativas e Gerenciamento de Projetos de Software, Editora Érica, São Paulo, 2005.
[12] Brazil, SISP - Sistema de Administração dos Recursos de Tecnologia da Informação, "Metrics Roadmap of SISP - Version 2.0; Roteiro de Métricas de Software do SISP - Versão 2.0," Brasília: Ministério do Planejamento, Orçamento e Gestão, Secretaria de Logística e Tecnologia da Informação, 2012. [Online]. Available: http://www.sisp.gov.br/ctgcie/download/file/Roteiro_de_Metricas_de_Software_do_SISP__v2.0.pdf
[13] N. E. Fenton and S. L. Pfleeger, Software Metrics: A Rigorous and Practical Approach, PWS Publishing Co., 1998.
[14] B. Kitchenham, S. L. Pfleeger and N. Fenton, "Towards a framework for software measurement validation," IEEE Trans. Softw. Eng., vol. 21, no. 12, pp. 929-944, 1995.
[15] S. Moser, "Measurement and estimation of software and software processes," Ph.D. dissertation, University of Berne, Switzerland, 1996.
[16] E. Chikofsky and H. A. Rubin, "Using metrics to justify investment in IT," IT Professional, vol. 1, no. 2, pp. 75-77, 1999.
[17] C. P. Beyers, "Estimating software development projects," in IT Measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 337-362, 2002.
[18] C. Gencel and O. Demirors, "Functional size measurement revisited," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 17, no. 3, p. 15, 2008.
[19] A. Abran and P. N. Robillard, "Function Points: A Study of Their Measurement Processes and Scale Transformations," Journal of Systems and Software, vol. 25, pp. 171-184, 1994.
[20] B. Kitchenham, "The problem with function points," IEEE Software, vol. 14, no. 2, pp. 29-31, 1997.
[21] B. Kitchenham and K. Känsälä, "Inter-item correlations among function points," in Proc. of the 15th Int. Conf. on Softw. Eng., IEEE Computer Society Press, pp. 477-480, 1993.
[22] T. Kralj, I. Rozman, M. Heričko and A. Živkovič, "Improved standard FPA method - resolving problems with upper boundaries in the rating complexity process," Journal of Systems and Software, vol. 77, no. 2, pp. 81-90, 2005.
[23] S. L. Pfleeger, R. Jeffery, B. Curtis and B. Kitchenham, "Status report on software measurement," IEEE Software, vol. 14, no. 2, pp. 33-43, 1997.
[24] O. Turetken, O. Demirors, C. Gencel, O. O. Top and B. Ozkan, "The Effect of Entity Generalization on Software Functional Sizing: A Case Study," in Product-Focused Software Process Improvement, Springer Berlin Heidelberg, pp. 105-116, 2008.
[25] W. Xia, D. Ho, L. F. Capretz and F. Ahmed, "Updating weight values for function point counting," International Journal of Hybrid Intelligent Systems, vol. 6, no. 1, pp. 1-14, 2009.
[26] G. Antoniol, R. Fiutem and C. Lokan, "Object-Oriented Function Points: An Empirical Validation," Empirical Software Engineering, vol. 8, no. 3, pp. 225-254, 2003.
[27] NESMA - Netherlands Software Metrics Association, "Function Point Analysis for Software Enhancement." [Online]. Available: http://www.nesma.nl/download/boeken_NESMA/N13_FPA_for_Software_Enhancement_(v2.2.1).pdf
[28] ISO/IEC, 20968: MkII Function Point Analysis - Counting Practices Manual, 2002.
[29] ISO/IEC, 19761: COSMIC: a functional size measurement method, 2011.
[30] A. Sartoris, Estatística e Introdução à Econometria; Statistics and Introduction to Econometrics, Saraiva S/A Livreiros Editores, 2008.
[31] M. Jorgensen and M. Shepperd, "A systematic review of software development cost estimation studies," IEEE Trans. Softw. Eng., vol. 33, no. 1, pp. 33-53, 2007.
[32] M. L. Orlov, "Multiple Linear Regression Analysis Using Microsoft Excel," Chemistry Department, Oregon State University, 1996.
[33] R. S. Kaplan and D. P. Norton, "The balanced scorecard - measures that drive performance," Harvard Business Review, vol. 70, no. 1, pp. 71-79, 1992.
[34] Isaca, COBIT 5: Enabling Processes, Isaca, 2012.
[35] Isaca, COBIT 5: A Business Framework for the Governance and Management of IT, Isaca, 2012.
[36] P. Weill and J. W. Ross, IT Governance: How Top Performers Manage IT Decision Rights for Superior Results, Harvard Business Press, 2004.
[37] R. S. Kaplan and D. P. Norton, "Using the balanced scorecard as a strategic management system," Harvard Business Review, vol. 74, no. 1, pp. 75-85, 1996.
[38] W. Van Grembergen and R. Van Bruggen, "Measuring and improving corporate information technology through the balanced scorecard," The Electronic Journal of Information Systems Evaluation, vol. 1, no. 1, 1997.
[39] W. Van Grembergen, "The balanced scorecard and IT governance," Information Systems Control Journal, vol. 2, pp. 40-43, 2000.
[40] H. Rohm and M. Malinoski, "Strategy-Based Balanced Scorecards for Technology," Balanced Scorecard Institute, 2010.
[41] S. A. Becker and M. L. Bostelman, "Aligning strategic and project measurement systems," IEEE Software, vol. 16, no. 3, pp. 46-51, May/Jun 1999.
[42] V. R. Basili and D. M.
Weiss, "A Methodology for Collecting Valid Software Engineering Data," IEEE Trans. Softw. Eng., vol. SE-10, no. 6, pp. 728-738, Nov. 1984. [43] Isaca,. CGEIT Review Manual 2010, ISACA. [44] H. A. Rubin, ― How to Measure IT Value,‖ CIO insight, 2003. [45] B. Hufschmidt, ― Software balanced scorecards: the icing on the cake,‖ in IT measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 491-502. 2002. [46] C. A. Dekkers, ― How and when can functional size fit with a measurement program?," in IT measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 161-170, 2002. [47] S. P. Dennis, ― Avoiding obstacles and common pitfalls in the building of an effective metrics program,‖ in IT measurement, Addison-Wesley Longman Publishing Co., Inc., pp. 295-304, 2002. [48] A. A. Fernandes and V. F. Abreu, Deploying IT governance: from strategy to process and services management; Implantando a governança de TI: da estratégia à gestão de processos e serviços, Brasport, 2009. [49] G. Karner. ― Metrics for Objectory,‖ Diploma thesis, University of Link ping, Sweden, No. LiTH-IDA-Ex9344:21, December 1993. An Approach to Business Processes Decomposition for Cloud Deployment Uma Abordagem para Decomposição de Processos de Negócio para Execução em Nuvens Computacionais Lucas Venezian Povoa, Wanderley Lopes de Souza, Antonio Francisco do Prado Departamento de Computação (DC) Universidade Federal de São Carlos (UFSCar) São Carlos, São Paulo - Brazil {lucas.povoa, desouza, prado}@dc.ufscar.br Resumo—Devido a requisitos de segurança, certos dados ou atividades de um processo de negócio devem ser mantidos nas premissas do usuário, enquanto outros podem ser alocados numa nuvem computacional. Este artigo apresenta uma abordagem genérica para a decomposição de processos de negócio que considera a alocação de atividades e dados. Foram desenvolvidas transformações para decompor processos representados em WS-BPEL em subprocessos a serem implantados nas premissas do usuário e numa nuvem computacional. Essa abordagem foi demonstrada com um estudo de caso no domínio da Saúde. Palavras-chave—Gerenciamento de Processos de Negócio; Computação em Nuvem; Decomposição de Processos; WSBPEL; Modelo Baseado em Grafos. Abstract—Due to safety requirements, certain data or activities of a business process should be kept within the user premises, while others can be allocated to a cloud environment. This paper presents a generic approach to business processes decomposition taking into account the allocation of activities and data. We designed transformations to decompose business processes represented in WS-BPEL into sub-processes to be deployed on the user premise and in the cloud. We demonstrate our approach with a case study from the healthcare domain. Keywords—Business Process Management; Cloud Computing; Process Decomposition; WS-BPEL; Graph-based model. I. INTRODUÇÃO Atualmente várias organizações dispõem de grandes sistemas computacionais a fim de atenderem à crescente demanda por processamento e armazenamento de um volume cada vez maior de dados. Enquanto na indústria grandes companhias constroem centros de dados em larga escala, para fornecerem serviços Web rápidos e confiáveis, na academia muitos projetos de pesquisa envolvem conjuntos de dados em larga escala e alto poder de processamento, geralmente providos por supercomputadores. Dessa demanda por enormes centros de dados emergiu o conceito de Computação em Nuvem [1], onde tecnologias de Luís Ferreira Pires, Evert F. 
Duipmans Faculty of Electrical Engineering, Mathematics and Computing Science (EEMCS) University of Twente (UT) Enschede, Overijssel - The Netherlands [email protected], [email protected] informação e comunicação são oferecidas como serviços via Internet. Google App Engine, Amazon Elastic Compute Cloud (EC2), Manjrasoft Aneka e Microsoft Azure são alguns exemplos de nuvens computacionais [2]. O cerne da Computação em Nuvem é oferecer recursos computacionais, de forma que seus usuários paguem somente pelo seu uso e tendo a percepção de que estes são ilimitados. O National Institute of Standards and Technology (NIST) identifica três modelos de serviço [3]: (a) Softwareas-a-Service (SaaS), um software hospedado num servidor é oferecido e usuários acessam-no via alguma interface através de uma rede local ou Internet (e.g., Facebook, Gmail); (b) Platform-as-a-Service (PaaS), uma plataforma é oferecida, usuários implantam suas aplicações na mesma e esta oferece recursos como servidor Web e bases de dados (e.g., Windows Azure, Google AppEngine); (c) Infrastructure-asa-Service (IaaS), uma máquina virtual com certa capacidade de armazenamento é oferecida e usuários alugam esses recursos (e.g., Amazon EC2, GoGrid). Embora muito promissora, a Computação em Nuvem enfrenta obstáculos que devem ser transpostos para que não impeçam o seu rápido crescimento. Segurança dos dados é uma grande preocupação dos usuários, quando estes armazenam informações confidenciais nos servidores das nuvens computacionais. Isto porque geralmente esses servidores são operados por fornecedores comerciais, nos quais os usuários não depositam total confiança [4]. Em alguns domínios de aplicação, a confidencialidade não é só uma questão de segurança ou privacidade, mas também uma questão jurídica. A Saúde é um desses domínios, já que a divulgação de informações devem satisfazer requisitos legais, tais como os presentes no Health Insurance Portability and Accountability Act (HIPAA) [5]. Business Process Management (BPM) tem sido bastante empregado por diversas empresas, nesta última década, para gerenciar e aperfeiçoar seus processos de negócio [6]. Um processo de negócio consiste de atividades exercidas por humanos ou sistemas e um Business Process Management System (BPMS) dispõe de um executor (engine), no qual instâncias de um processo de negócio são coordenadas e monitoradas. A compra de um BPMS pode ser um alto investimento para uma empresa, já que software e hardware precisam ser adquiridos e profissionais qualificados contratados. Escalabilidade também pode ser um problema, já que um executor é somente capaz de coordenar um número limitado de instâncias de um processo simultaneamente, sendo necessária a compra de servidores adicionais para lidar com situações de pico de carga. BPMSs baseados em nuvens computacionais e oferecidos como SaaS via Internet podem ser uma solução para o problema de escalabilidade. Entretanto, o medo de perder ou expor dados confidenciais é um dos maiores obstáculos para a implantação de BPMSs em nuvens computacionais, além do que há atividades num processo de negócio que podem não se beneficiar dessas nuvens. Por exemplo, uma atividade que não exige intensa computação pode tornar-se mais onerosa se colocada numa nuvem, já que os dados a serem processados por essa atividade devem ser enviados à nuvem, o que pode levar mais tempo para a sua execução e custar mais caro, uma vez que transferência de dados é um dos fatores de faturamento das nuvens computacionais [7]. 
Outros modelos de seviço em nuvens computacionais, além dos identificados pelo NIST, são encontrados na literatura. Por exemplo, no modelo Process-as-a-Service um processo de negócio é executado parcial ou totalmente numa nuvem computacional [8]. Devido a requisitos de segurança, nesse modelo certos dados ou atividades devem ser mantidos nas premissas do usuário enquanto outros podem ser alocados numa nuvem, o que requer uma decomposição desse processo. Neste sentido, este artigo apresenta uma abordagem genérica para a decomposição de processos de negócio, oferecendo uma solução técnica para esse problema. A sequência do mesmo está assim organizada: a Seção II discorre sobre BPM; a Seção III apresenta a abordagem proposta; a Seção IV descreve um estudo de caso acompanhado de análises de desempenho e custo; a Seção V trata de trabalhos correlatos; e a Seção VI expõe as considerações finais apontando para trabalhos futuros. II. BUSINESS PROCESS MANAGEMENT BPM parte do princípio que cada produto oferecido por uma empresa é o resultado de um determinado número de atividades desempenhadas por humanos, sistemas ou ambos, e as metas do BPM são identificar, modelar, monitorar, aperfeiçoar e revisar processos de negócio dessa empresa. Identificando essas atividades via workflows, a empresa tem uma visão de seus processos, e monitorando e revisando os mesmos esta pode detectar problemas e realizar melhorias. O ciclo de vida de um processo de negócio possui as fases: Projeto, os processos de negócio são identificados e capturados em modelos geralmente gráficos, possibilitando aos stakeholders entendê-los e refinálos com certa facilidade. As atividades de um processo são identificadas supervisionando o processo existente e considerando a estrutura da empresa e os seus recursos técnicos, sendo que Business Process Model and Notation (BPMN)[9] é a linguagem mais usada nessa fase. Uma vez capturados nos modelos, os processos podem ser simulados e validados, fornecendo aos stakeholders uma visão da sua correção e adequação; Implementação, um modelo de processo de negócio é implementado manual, semi-automática ou automaticamente. Quando automação não é requerida ou possível, listas de trabalho são criadas com tarefas bem definidas, as quais são atribuídas a funcionários da empresa. O problema é que não há um sistema central para o monitoramento das instâncias do processo, devendo este ser realizado por cada funcionário envolvido. Com a participação de sistemas de informação, um BPMS pode usar o modelo desse processo e criar instâncias do mesmo, sendo capaz de monitorar cada uma destas e fornecer uma visão das atividades realizadas, do tempo consumido e da sua conclusão ou falha; Promulgação, o processo de negócio é executado e para cada iniciação uma instância do mesmo é criada. Tais instâncias são gerenciadas por um BPMS, que as acompanha via um monitor, fornecendo um quadro das que estão em execução e das que terminaram, e detectando eventuais problemas que podem ocorrer com essas instâncias; e Avaliação, a informação monitorada e coletada pelo BPMS é usada para revisar o processo de negócio, sendo que as conclusões obtidas nessa fase serão as entradas da próxima interação no ciclo de vida. A. WS-BPEL BPMSs precisam de linguagens executáveis, sobretudo nas três últimas fases, e uma vez que as usadas na fase de projeto são geralmente muito abstratas, linguagens tais como Web Services Business Process Execution Language (WSBPEL) [10] tornam-se necessárias. 
Concebida pela Organization for the Advancement of Structured Information Standards (OASIS) para a descrição de processos de negócio e de seus protocolos, WS-BPEL foi definida a partir dos padrões Web WSDL 1.1, XML Schema, XPath 1.0, XSLT 1.0 e Infoset. As suas principais construções serão ilustradas com o exemplo do Picture Archiving and Communication System (PACS) [11], um sistema de arquivamento e comunicação para diagnóstico por imagem, cujo workflow é apresentado na Fig. 1. Fig. 1. Workflow do PACS descrito como um processo monolítico. Um processo descrito em WS-BPEL é um container, onde são declaradas as atividades a serem executadas, dados, tipos de manipuladores (handlers) e as relações com parceiros externos. PACS pode ter sua descrição em WSBPEL iniciada por <process name="PACSBusinessProcess" targetNamespace="http://example.com" xmlns="http://docs.oasisopen.org/wsbpel/2.0/process/executable"> WS-BPEL permite agregar Web Services, definir a lógica de cada interação de serviço e orquestrar essas interações. Uma interação envolve dois lados (o processo e um parceiro), é descrita via o partnerLink, que é um canal de comunicação caracterizado por um partnerLinkType, myRole e partnerRole, sendo que essas informações identificam a funcionalidade a ser provida por ambos os lados. Em PACS pode ser definido um canal de comunicação entre esse processo e um cliente como <partnerLinks> <partnerLink name="client" partnerLinkType="tns:PACSBusinessProcess" myRole="PACSBusinessProcessProvider" partnerRole="PACSBusinessProcessRequester" /> </partnerLinks> Para a troca de mensagens emprega-se receive, reply e invoke. As duas primeiras permitem a um processo acessar componentes externos através de um protocolo de comunicação (e.g., SOAP), sendo que receive permite ao processo captar requisições desses componentes. Em PACS, a requisição de um radiologista pela persistência e detecção automática de nódulos de uma tomografia de pulmão, pode ser captada por <receive name=”ImagePersistenceAndAnalysisReq” partnerLink="processProvider" operation="initiate" variable="input" createInstance="yes"/> Para que um processo possa emitir uma resposta a um solicitante é necessário um reply relacionado a um receive. Um possível reply para o receive acima é <reply name=”ImagePersistenceAndAnalysisResponse” partnerLink="processProvider" operation="initiate" variable="output"/> Um processo requisita uma operação oferecida por um Web Service através de invoke. A operação de persistência de imagem médica pode ser requisitada por Em geral um processo de negócio contém desvios condicionados a critérios. Em PACS, imageResp determina a invocação da função de detecção automática de nódulo ou o disparo de uma exceção. Esse desvio pode ser descrito como <if> <condition>imageResp</condition> <invoke name=”AutomaticAnalysis” … /> <else> <throw faultName=”PersistenceException”/> </else> </if> Atividades executadas iterativamente devem ser declaradas via while, onde é realizada uma avaliação para uma posterior execução, ou via repeat until, onde a avaliação sucede a execução. Em PACS, a persistência de várias imagens pode ser descrita como <while> <condition> currentImageNumber <= numberOfImages </condition> <invoke name=”persistImage” … /> <assign> <copy> <from>$currentImageNumber + 1</from> <to>$currentImageNumber</to> </copy> </assign> </while> Atividades executadas paralelamente devem ser declaradas via flow. 
Em PACS, as operações de persistência de uma imagem e de análise desta podem ser declaradas para execução em paralelo como <flow name=”parallelRequest”> <invoke name=”MedicalImagePersistence” … /> <invoke name=”AutomaticAnalysis … /> </flow> B. BPM em Nuvens Computacionais O modelo Process enactment, Activity execution and Data (PAD) é apresentado em [7], onde são investigadas possíveis distribuições de um BPM entre as premissas e uma nuvem, considerando a partição de atividades e dados, mas não considerando a partição do executor de processo. Em [12] o PAD é estendido, possibilitando também a partição do executor, conforme ilustrado na Fig. 2. <invoke name=”ImagePersistence” partnerLink="ImagPL" operation="persistImage" inputVariable="imageVar" outputVariable=”imageResp”/> É comum um processo de negócio conter atividades a serem executadas em sequência. Em PACS, a solicitação de persistência de imagem médica, a execução dessa tarefa e a emissão da resposta ao solicitante, podem ser descritas como <sequence name=”ImagePersistenceSequence”> <receive name=”ImagePersistenceRequest” … /> <invoke name=”ImagePersistence” … /> <reply name=”ImagePersistenceResponse … /> </sequence> Fig. 2. Possibilidades de partição e distribuição de BPM. Processos de negócio definem fluxos de controle, que regulam as atividades e a sequência das mesmas, e fluxos de dados, que determinam como estes são transferidos de uma atividade a outra. Um executor tem que lidar com ambos os tipos e, se dados sensíveis estiverem presentes, os fluxos de dados devem ser protegidos. Na dissertação de mestrado [13] é proposto um framework para a decomposição de um processo em dois processos colaborativos, com base numa lista de distribuição de atividades e dados, onde restrições relativas aos dados podem ser definidas, para assegurar que dados sensíveis permaneçam nas premissas. A Fig. 3 ilustra essa decomposição. base na avaliação de uma condição; mixagem simples, que une múltiplos ramos alternativos para um único desses ser executado; e ciclos arbitrários, que modela comportamento recursivo. Essa RI suporta também: dependência de dados, que representa explicitamente as dependências de dados entre os nós, que é necessária pois o processo original é decomposto em processos colaborativos e dados sensíveis podem estar presentes; e comunicação, que permite descrever como um processo invoca outro. A RI emprega um modelo baseado em grafos para representar processos, onde um nó representa uma atividade ou um elemento de controle e uma aresta representa uma relação entre dois nós. Esses nós e arestas foram especializados, definindo-se uma representação gráfica para cada especialização: Atividade, cada nó tem em geral uma aresta de controle de entrada e uma de saída; Comportamento paralelo, ilustrado na Fig. 5 (a), é modelado com nós flow e eflow. O primeiro divide um ramo de execução em vários ramos paralelos e possui no mínimo duas arestas de controle de saída. O segundo junta vários ramos paralelos num único ramo e possui duas ou mais arestas de controle de entrada e no máximo uma de saída; Fig. 3. Exemplo de decomposição. III. ABORDAGEM PROPOSTA O framework apresentado em [13], cujas fases são ilustradas na Fig. 4, contém uma Representação Intermediária (RI) baseada em grafos, na qual conceitos de processos de negócio são capturados. A decomposição de um processo passa pela RI, sendo que a adoção de uma linguagem de processos de negócio requer transformações da linguagem para a RI (lifting) e vice-versa (grounding). 
Em [13] foi adotada a linguagem abstrata Amber [14], efetuada uma análise para definir as regras de decomposição suportadas pelo framework, concebidos algoritmos para a sua implementação, os quais realizam transformações em grafos, e concebido um algoritmo para verificar se restrições relativas aos dados são violadas pela decomposição. Comportamento condicional, ilustrado na Fig. 5 (b), é modelado com nós if e eif. O primeiro possui duas arestas de controle de saída, uma rotulada true a outra false, e após a avaliação da condição somente uma destas é tomada. O segundo junta ramos condicionais, convertendo-os num único ramo de saída; Comportamento repetitivo, ilustrado nas Fig. 5 (c) e (d), é modelado com um único nó loop e, após a avaliação da condição, o ramo do comportamento repetitivo é tomado ou abandonado. Esse nó pode estar antes ou depois do comportamento, sendo que no primeiro caso resulta em zero ou mais execuções e no segundo em pelo menos uma execução; Fig. 4. Etapas envolvidas no framework. A. Representação Intermediária Para definir os requisitos da RI, foram adotados os seguintes padrões de workflow [15]: sequência, que modela fluxos de controle e expressa a sequência de execução de atividades num processo; divisão paralela, que divide um processo em dois ou mais ramos para execução simultânea; sincronização, que junta múltiplos ramos num único ramo de execução; escolha condicional, que executa um ramo com Fig. 5. Construções para comportamentos paralelo (a), condicional (b) e repetitivo com loop antes (c) e loop depois (d). Comunicações síncrona e assíncrona são ilustradas na Fig. 6 (a) e Fig. 6 (b) respectivamente. Por exemplo, a síncrona é modelada com os nós ireq, ires, rec e rep, através dos quais dois subprocessos, partes do processo global, se comunicam; Ecom é o conjunto de arestas de comunicação, onde e = (n1, Communication, n2) com n1, n2 ∈ C; L é um conjunto de rótulos textuais que podem ser atribuídos aos nós e arestas; nlabel : N → L, onde N = A∪ C∪ S atribui um rótulo textual a um nó; elabel : E → L atribui um rótulo textual a uma aresta; Fig. 6. Comunicações síncrona (a) e assíncrona (b). Arestas de controle, representadas por setas sólidas, modelam o fluxo de controle. Uma aresta de controle é disparada pelo seu nó de origem, tão logo a ação associada ao mesmo termina, e o nó terminal dessa aresta aguarda pelo seu disparo para iniciar a sua ação associada. Caso o nó de origem seja if, essa aresta é rotulada true ou false, e caso a condição avaliada corresponda a esse rótulo, esta é disparada pelo nó; Arestas de dados possibilitam investigar os efeitos na troca de dados causados pelas mudanças das atividades de um processo a outro, permitindo verificar se alguma restrição aos dados foi violada durante a partição do processo original. Uma aresta de dados é representada por uma seta tracejada. Uma aresta de dados do nó de origem ao nó terminal implica que os dados definidos no primeiro são usados pelo segundo. Cada aresta possui um rótulo, que define o nome dos dados compartilhados; Arestas de comunicação permitem enviar controle e dados a diferentes processos e são rotuladas com nomes de itens de dados enviados via as mesmas. Formalmente, um grafo na RI é uma tupla (A, C, S, ctype, stype, E, L, nlabel, elabel), onde: A é um conjunto de nós de atividade; C é um conjunto de nós de comunicação; S é um conjunto de nós estruturais ∈ {flow, eflow, if, eif, loop}; Os conjuntos par a par A, C e S são disjuntos; Os conjuntos N e E são disjuntos. B. 
Decomposição Em [13], para cada construção da RI foram identificadas decomposições para processos situados nas premissas, que possuem atividades a serem alocadas na nuvem, e vice-versa. A Fig. 7 ilustra um conjunto de atividades sequenciais, marcado para a nuvem, sendo alocado num único processo e substituído, no processo original, por nós de invocação síncrona. Fig. 7. Conjunto de atividades sequenciais movido como um bloco. Embora semanticamente diferentes, as construções paralelas e condicionais são generalizadas como compostas, pois possuem a mesma estrutura sintática, podendo ser decompostas de várias formas. Neste trabalho os nós de início e fim devem ter a mesma alocação e as atividades de um ramo, com a mesma alocação desses nós, permanecem com os mesmos. Se uma determinada construção é toda marcada para a nuvem, a decomposição é semelhante a das atividades sequenciais. Na Fig. 8, os nós de início e fim são marcados para a nuvem, e um ramo permanece nas premissas, sendo que a atividade desse ramo é colocada num novo processo nas premissas, o qual é invocado pelo processo na nuvem. ctype : C → {InvokeRequest, InvokeResponse, Receive, Reply} atribui um tipo comunicador a um nó de comunicação; stype : S → {Flow, EndFlow, If, EndIf, Loop} atribui um tipo nó controle a um nó de controle; E = Ectrl ∪ Edata ∪ Ecom é o conjunto de arestas no grafo, sendo que uma aresta é definida como (n1, etype, n2), onde etype ∈ {Control, Data, Communication} é o tipo da aresta e n1, n2 ∈ A ∪ C ∪ S; Ectrl é o conjunto de arestas de fluxo de controle, onde e=(n1, Control, n2) com n1, n2 ∈ A ∪ C ∪ S; Edata é o conjunto de arestas de dados, onde e = (n1, Data, n2) com n1, n2 ∈ A ∪ C ∪ S; Fig. 8. Um ramo da construção composta permanece nas premissas. Na Fig. 9, os nós de início e fim são marcados para a nuvem e os ramos permanecem nas premissas, sendo criado para cada ramo um novo processo nas premissas. Fig. 11. Ramos iterativos. Como já mencionado, a abordagem de decomposição aqui descrita emprega uma lista de distribuição de atividades e dados, que determina o que deve ser alocado nas premissas e numa nuvem computacional. Embora a definição dessa lista esteja fora do escopo deste trabalho, parte-se do princípio que esta é elaborada manual ou automaticamente de acordo com os seguintes critérios: Atividades sigilosas ou que contenham dados sigilosos devem ser alocadas nas premissas; Fig. 9. Os ramos da construção composta permanecem nas premissas. Na Fig. 10, os nós de início e fim permanecem nas premissas e os ramos são marcados para a nuvem, sendo criado para cada ramo um processo na nuvem. Fig. 10. Os nós início e fim permanecem nas premissas. Laços usam o loop, e se um laço é todo marcado para a nuvem, a decomposição é semelhante a das atividades sequenciais. Quando loop e comportamento são marcados com alocações distintas, este último é tratado como um processo separado. A Fig. 11 ilustra um laço onde o nó loop é marcado para a nuvem e a atividade iterativa fica nas premissas. Em função da complexidade da decomposição, os algoritmos para a sua implementação foram concebidos em quatro etapas consecutivas: identificação, partição, criação de nós de comunicação e criação de coreografia. Tais algoritmos, apresentados em [13], foram omitidos aqui devido às limitações de espaço. 
Atividades com baixo custo computacional e volume de dados devem ser alocadas nas premissas; e Atividades com alto custo computacional, com uma alta relação entre tempo de processamento e tempo de transferência de dados e que não se enquadrem no primeiro critério, devem ser alocadas na nuvem. C. Lifting e Grounding Devido à base XML de WS-BPEL, o lifting e o grounding convertem estruturas de árvores em grafos e viceversa, sendo que lifting possui um algoritmo para cada tipo de construção WS-BPEL e grounding um algoritmo para cada tipo de elemento da RI. Dessa forma, os principais mapeamentos foram: assign e throw para nós Atividade; flow para Comportamento paralelo; if para Comportamento condicional, onde construções com mais de uma condição são mapeadas para Comportamentos condicionais aninhados junto ao ramo false; while e repeatUntil para Comportamento repetitivo; receive e reply para Comunicação com nós rec e res; sequence para um conjunto de nós que representam construções aninhadas interconectados por arestas de controle; invoke assíncrono para o nó ireq e síncrono para os nós ireq e ires. Os algoritmos para o lifting e o grouding foram implementados em Java 7 usando a API para XML, baseada nas especificações do W3C, e o framework para testes JUnit. Por exemplo, as estruturas de árvore e grafo para o if, apresentadas na Fig. 12, tiveram seus lifting e grounding implementados a partir dos Algoritmos 1 e 2 respectivamente, cujas entrada e saída estão ilustradas na Fig. 12. Os algoritmos para o lifting e o grounding de outras estruturas da RI e de WS-BPEL foram omitidos devido a limitações de espaço. enviado a FalseGenerator. Caso contrário, se todas as condições foram assumidas false e havendo atividade para execução, um nó else é adicionado à árvore. Algoritmo 2 Grounding para o grafo if function IfGenerator(g) t ← IfTree() t.children ← t.children ∪ {CondGenerator(g.cond)} t.children ← t.children ∪ {Generator(g.true)} t.children ← t.children ∪ {FalseGenerator(g.false)} end function Fig. 12. Estruturas de árvore e grafo para a construção if. IfParser caminha nos nós aninhados da árvore verificando a condição e construindo o ramo true do grafo if com as atividades relacionadas, sendo que as construções restantes são enviadas a FalseParser para que o ramo false seja construído. Caso a árvore tenha mais de uma condição, o ramo false conterá um grafo if para a segunda condição, esse grafo terá um ramo false que conterá outro grafo if para a terceira condição, e assim sucessivamente. 
Algoritmo 1 Lifting para a árvore da construção if function IfParser(t) cond ← {} if t of type IfTree then cond ← IfGraph() for all c ∈ t.children do if c type of Condition then cond.cond ← CondParser(c) else if c type of ElseTree ∨ c type of ElseIfTree then cond.false ← FalseParser(t.children) return cond else if c type of Tree then cond.true ← Parser(c) end if t.children ← t.children – {c} end for end if return cond end function function FalseParser(s) if s = {} then return s end if falseBranch ← Graph() if s.first of type ElseIfTree then cond ← IfBranch() cond.true ← ElseParse(s.first) cond.false ← FalseParse(s-{s.first}) falseBranch.nodes ← {cond} else if s.first of type ElseTree then falseBranch.nodes←{ElseParse(s.first)} else return FalseParse(s-{s.first}) end if return falseBranch end function IfGenarator caminha no ramo true do grafo verificando e adicionando à árvore if a condição junto com as atividades relacionadas, sendo que o ramo false é enviado à FalseGenerator que verifica se há um nó if aninhado. Caso exista uma construção elseif, com a condição e as atividades relacionadas, esta é adicionada à árvore e seu ramo false é function FalseGenerator(f) r ← {} while f ≠ {} do if # of f.nodes = 1 ^ f of type ElseIfTree then t ← ElseIfTree() t.children ← CondGenerator(f.cond) ∪ Generator(f.true) r ← r ∪ t else r ← r ∪ ElseGenerator(f) end if f ← f.false end while return r end function IV. ESTUDO DE CASO O estudo de caso para validar a decomposição foi baseado no PACS, um processo na Saúde que tem por objetivo persistir diagnósticos e tomografias mamárias e aplicar uma função para a detecção de possíveis nódulos nas mesmas. O PACS aceita um conjunto de imagens e seus respectivos pré-diagnósticos e identificadores, efetua a persistência de cada imagem e diagnóstico, executa a função para detecção automática de nódulos sobre as tomografias mamárias e emite um vetor contendo os identificadores das imagens com nódulos em potencial. No workflow do processo monolítico do PACS, ilustrado na Fig. 1, as construções marcadas para alocação na nuvem estão com um fundo destacado. A Fig. 13(a) ilustra a RI do PACS monolítico após o lifting, enquanto a Fig. 13(b) ilustra a RI após a execução da decomposição. A Fig. 14 ilustra o PACS decomposto após o grounding com a adição de dois observadores: um externo, cuja visão é a mesma do observador do PACS monolítico, ou seja, só enxerga as interações entre Cliente e PACS; um interno que, além dessas interações, enxerga também as interações entre os processos nas premissas e na nuvem. A Fig. 15 ilustra, via diagramas UML de comunicação, exemplos de traços obtidos executando o processo monolítico (a) e o processo decomposto (b), sendo que as interações destacadas neste último são visíveis somente ao observador interno. Se ocultadas tais interações, ambos os traços passam a ser equivalentes em observação para o observador externo. A. Análise de Desempenho A fim de comparar o desempenho entre os processos monolítico e decomposto, estes foram implementados empregando-se as seguintes ferramentas: sistema operacional Debian 6; servidor de aplicação Apache Tomcat 6; Java 6; mecanismo de processos BPEL Apache ODE; e o framework para disponibilizar os Web Services Apache AXIS 2. O processo monolítico e a parte nas premissas do processo decomposto foram executados sobre uma infraestrutura com 1GB de RAM, 20 GB de disco e 1 núcleo virtual com 2,6 GHz. 
A parte na nuvem do processo decomposto foi executada sobre um modelo IaaS, em uma nuvem privada gerenciada pelo software OpenStack, com as diferentes configurações descritas na Tabela I.

TABELA I. CONFIGURAÇÕES DAS INSTÂNCIAS NA NUVEM
Código | Memória | HD | Núcleos | Frequência
conf#1 | 2 GB | 20 GB | 1 | 2.6 GHz
conf#2 | 2 GB | 20 GB | 2 | 2.6 GHz
conf#3 | 4 GB | 20 GB | 1 | 2.6 GHz
conf#4 | 4 GB | 20 GB | 2 | 2.6 GHz
conf#5 | 6 GB | 20 GB | 1 | 2.6 GHz
conf#6 | 6 GB | 20 GB | 2 | 2.6 GHz

Fig. 13. RIs dos processos monolítico (a) e decomposto (b).

As execuções dos processos empregaram uma carga de trabalho composta por duas tuplas na forma <id, diagnostic, image>, onde id é um identificador de 4 bytes, diagnostic é um texto de 40 bytes e image é uma tomografia mamária de 11,1 MB. Foram coletadas 100 amostras dos tempos de resposta dos processos monolítico e decomposto para cada configuração i. De acordo com [16], o percentual Pi de desempenho ganho do processo decomposto em relação ao monolítico para a i-ésima configuração pode ser definido como

P_i = 1 - T_{decomposto,i} / T_{monolítico}   (1)

onde T_{decomposto,i} é o tempo de resposta médio do processo decomposto na configuração i e T_{monolítico} é o tempo de resposta médio do processo monolítico. O tempo de comunicação adicional foi desconsiderado, pois essa medida é relativa a cada recurso disponível e ao tamanho da carga de trabalho. A Fig. 16 ilustra o percentual de ganho de desempenho do processo decomposto em relação ao monolítico para cada uma das configurações, sendo que o percentual mínimo é superior a 10%.

Fig. 14. PACS decomposto com observadores externo e interno.
Fig. 15. Diagramas UML de comunicação dos processos monolítico (a) e decomposto (b): em (a), WebServiceCliente interage com ProcessoMonolítico via request/response; em (b), WebServiceCliente interage com ProcessoPremissa, que invoca ProcessoNuvem via cloudRequest/cloudResponse.

Para verificar a hipótese de que as médias dos tempos de resposta do processo decomposto foram significativamente menores que a do processo monolítico, foi empregada a estatística do teste t [17] a um nível de significância de 5%. Os testes resultaram em valores de p-value na ordem de 2,2 × 10^-16, confirmando essa hipótese.

Fig. 16. Percentual de desempenho ganho do processo decomposto.

B. Custos Relativos à Nuvem
Para determinar os custos adicionais agregados a esses ganhos de desempenho, foi criado um modelo de regressão linear [18] com os dados obtidos via 45 observações de preços de três grandes provedores de IaaS, o qual emprega as seguintes variáveis independentes: quantidade de RAM em MB; quantidade de disco em GB; o número de núcleos virtuais; e a frequência de cada um desses núcleos. Dessa forma, o valor estimado ŷ do preço em dólar/hora do recurso alocado na nuvem é definido como

ŷ = α + β·X   (2)

onde: α = -2,4882 × 10^-16 é o intercepto do modelo; β = [0.013506, 0.072481, 0.083593, 0.000092282] é o vetor de coeficientes de regressão; e X = [memory_in_gb, number_of_virtual_cores, ghz_by_core, hard_disk_in_gb] é o vetor de variáveis independentes. Esse modelo possui coeficiente de determinação R2 de 89,62% e erro aleatório médio de US$ 0,0827, o qual foi determinado com a técnica de validação cruzada leave-one-out [19]. A Fig. 17 ilustra a aderência dos valores estimados, via a Equação (2), aos valores observados. Já a Fig. 18 ilustra a relação entre o custo adicional de cada configuração, definida via a Equação (2), e a porcentagem de desempenho ganho através da mesma.

Fig. 17. Aderência dos valores estimados aos valores observados.
Fig. 18. Percentual de desempenho ganho e custo/hora do recurso na nuvem.
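Apenas como esboço ilustrativo (com tempos fictícios, que não fazem parte do experimento original), o cálculo do percentual de ganho Pi definido na Equação (1) poderia ser reproduzido assim:

# Esboço com valores fictícios: percentual de ganho de desempenho P_i
# do processo decomposto em relação ao monolítico (Equação 1).
tempo_monolitico = 52.0   # tempo de resposta médio fictício do processo monolítico, em segundos
tempos_decomposto = {     # tempos de resposta médios fictícios por configuração
    "conf#1": 46.3, "conf#2": 45.8, "conf#3": 45.1,
    "conf#4": 44.9, "conf#5": 44.6, "conf#6": 44.2,
}

for conf, tempo in tempos_decomposto.items():
    p_i = 1 - tempo / tempo_monolitico
    print(f"{conf}: ganho de desempenho = {p_i:.1%}")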
Observa-se na Fig. 18 que o maior ganho de desempenho é obtido com a conf#6, a qual proporciona uma redução maior que 12% no tempo de resposta do processo de negócio, acompanhada de um custo adicional de aproximadamente US$ 0,20/hora do recurso alocado na nuvem.

V. TRABALHOS CORRELATOS

Em [20], novas orquestrações são criadas para cada serviço usado por um processo de negócio, resultando em comunicação direta entre eles em vez de uma coordenação única. O processo WS-BPEL é convertido para um grafo de fluxo de controle, que gera um Program Dependency Graph (PDG), a partir do qual são realizadas as transformações, e os novos grafos gerados são reconvertidos para WS-BPEL. Como, nesse algoritmo, cada serviço no processo corresponde a um nó fixo para o qual uma partição é gerada, esse trabalho não é adequado para a abordagem aqui proposta, pois esta visa à criação de processos nos quais múltiplos serviços possam ser usados.

Os resultados descritos em [21] focam na descentralização de orquestrações de processos WS-BPEL, usando Dead Path Elimination (DPE) para garantir a conclusão da execução de processos descentralizados; entretanto, DPE também torna a abordagem muito dependente da linguagem empregada na especificação do processo de negócio. A RI aqui apresentada é independente dessa linguagem e, consequentemente, também a decomposição, bastando o desenvolvimento das transformações de lifting e grounding apropriadas.

Em [22] é reportado que a maioria das pesquisas em descentralização de orquestrações foca em demasia em linguagens de processos de negócio específicas. Não focar tanto nessas linguagens foi um dos principais desafios da pesquisa aqui apresentada; outro desafio foi não se preocupar somente com problemas de desempenho, mas também com medidas de segurança reguladas por governos ou organizações. Consequentemente, a decisão de executar uma atividade nas premissas ou na nuvem, neste trabalho, já é tomada na fase de projeto do ciclo de vida do BPM.

Fig. 17. Aderência dos valores estimados aos valores observados.

VI. CONSIDERAÇÕES FINAIS E TRABALHOS FUTUROS

Este trabalho é uma continuação do apresentado na dissertação de mestrado [13] e focou nas regras de decomposição de processos de negócio, sendo que as seguintes contribuições adicionais merecem destaque:
• Para demonstrar a generalidade da abordagem proposta, ao invés da linguagem Amber usada em [13], foi utilizada WS-BPEL para a especificação de processos de negócio;
• Para que essa abordagem pudesse ser empregada, transformações de lifting e grounding tiveram que ser desenvolvidas para WS-BPEL;
• O fato de WS-BPEL ser executável possibilitou a implementação dos processos criados e a comparação de seus comportamentos ao comportamento do processo original, validando assim a abordagem proposta; e
• Essas implementações possibilitaram também a realização de uma análise comparativa de desempenho entre os processos original e decomposto e uma avaliação dos custos inerentes à alocação de parte do processo decomposto na nuvem.
Os resultados obtidos com esse trabalho indicam que a abordagem proposta é genérica, viável e eficaz tanto do ponto de vista de desempenho quanto do financeiro. Atualmente, a RI está sendo estendida para suportar mais padrões de workflow e para modelar comportamentos de exceção de WS-BPEL.
Num futuro próximo, esta pesquisa continuará nas seguintes direções: complementar as regras de decomposição para suportar construções compostas, nas quais os nós de início e fim tenham diferentes localizações, e para possibilitar a extensão do número de localizações, já que múltiplas nuvens podem ser usadas e/ou múltiplos locais nas premissas podem existir nas organizações; e desenvolver um framework de cálculo, que leve em consideração os custos reais do processo original e dos processos criados, visando recomendar quais atividades e dados devem ser alocados em quais localizações. AGRADECIMENTOS [5] D. L. Banks, "The Health Insurance Portability and Accountability Act: Does It Live Up to the Promise?," Journal of Medical Systems, vol. 30, no. 1, pp. 45-50, February 2006. [6] R. K. L. Ko, "A computer scientist's introductory guide to business process management (BPM)," Crossroads, vol. 15, no. 4, pp. 11-18, June 2009. [7] Y.-B. Han, J.-Y. Sun, G.-L. Wang and H.-F. Li, "A Cloud-Based BPM Architecture with User-End Distribution of Non-ComputeIntensive Activities and Sensitive Data," Journal of Computer Science and Technology, vol. 25, no. 6, pp. 1157-1167, 2010. [8] D. S. Linthicum, Cloud Computing and SOA Convergence in Your Enterprise: A Step-by-Step Guide, Boston, MA, USA: Pearson Education Inc., 2009. [9] OMG, "Business Process Model and Notation (BPMN) Version 2.0," January 2011. [Online]. Available: http://goo.gl/k2pvi. [Accessed 17 março 2013]. [10] A. Alves, A. Arkin, S. Askary, C. Barreto, B. Bloch, F. Curbera, M. Ford, Y. Goland, A. Guízar, N. Kartha, C. K. Liu, R. Khalaf, D. König, M. Marin, V. Mehta, S. Thatte, D. van der Rijn, P. Yendluri and A. Yiu, "Web Services Business Process Execution Language Version 2.0," OASIS Standard, 11 April 2007. [Online]. Available: http://goo.gl/MTrpo. [Accessed 1 Março 2013]. [11] P. M. d. Azevedo-Marques and S. C. Salomão, "PACS: Sistemas de Arquivamento e Distribuição de Imagens," Revista Brasileira de Física Médica, vol. 3, no. 1, pp. 131-139, 2009. [12] E. Duipmans, L. F. Pires and L. da Silva Santos, "Towards a BPM Cloud Architecture with Data and Activity Distribution," Enterprise Distributed Object Computing Conference Workshops (EDOCW), 2012 IEEE 16th International, pp. 165-171, 2012. [13] E. F. Duipmans, Business Process Management in the Cloud with Data and Activity Distribution, master's thesis, Enschede, The Netherlands: Faculty of EEMCS, University of Twente, 2012. [14] H. Eertink, W. Janssen, P. O. Luttighuis, W. Teeuw and C. Vissers, "A business process design language," World Congress on Formal Methods, vol. I, pp. 76-95, 1999. [15] W. v. d. Aalst, A. t. Hofstede, B. Kiepuszewski and A. Barros., "Workflow Patterns," Distributed and Parallel Databases, vol. 3, no. 14, pp. 5-51, 2003. [16] R. Jain, The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling, Wiley, 1991, pp. 1-685. Os autores agradecem ao suporte do CNPq através do INCT-MACC. [17] R Core Team, "R: A Language and Environment for Statistical Computing," 2013. [Online]. Available: http://www.R-project.org/. [Accessed 5 Abril 2013]. REFERÊNCIAS [18] J. D.Kloke and J. W.McKean, "Rfit: Rank-based Estimation for Linear Models," The R Journal, vol. 4, no. 2, pp. 57-64, 2012. [1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica and M. 
Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," EECS Department, University of California, Berkeley, 2009. [19] R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," in Proceedings of the 14th international joint conference on Artificial intelligence, vol. 2, San Francisco, CA: Morgan Kaufmann Publishers Inc., 1995, pp. 11371143. [2] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, vol. 25, no. 6, pp. 599-616, June 2009. [20] M. G. Nanda, S. Chandra and V. Sarkar, "Decentralizing execution of composite web services," SIGNPLAN Notices, vol. 39, no. 10, pp. 170-187, October 2004. [3] P. Mell and T. Grance, "The NIST Definition of Cloud Computing," National Institute of Standards and Technology, vol. 53, no. 6, pp. 150, 2009. [4] S. Yu, C. Wang, K. Ren and W. Lou, "Achieving secure, scalable, and fine-grained data access control in cloud computing," in Proceedings of the 29th conference on Information communications, Piscataway, NJ: IEEE Press, 2010, pp. 534-542. [21] O. Kopp, R. Khalaf and F. Leymann, "Deriving Explicit Data Links in WS-BPEL Processes," Services Computing, 2008. SCC '08, vol. 2, pp. 367-376, July 2008. [22] W. Fdhila, U. Yildiz and C. Godart, "A Flexible Approach for Automatic Process Decentralization Using Dependency Tables," Web Services, 2009. ICWS 2009, pp. 847-855, 2009. On the Influence of Model Structure and Test Case Profile on the Prioritization of Test Cases in the Context of Model-based Testing João Felipe S. Ouriques∗ , Emanuela G. Cartaxo∗ , Patrı́cia D. L. Machado∗ ∗ Software Practices Laboratory/UFCG, Campina Grande, PB, Brazil Email: {jfelipe, emanuela}@copin.ufcg.edu.br, [email protected] Abstract—Test case prioritization techniques aim at defining an ordering of test cases that favor the achievement of a goal during test execution, such as revealing faults as earlier as possible. A number of techniques have already been proposed and investigated in the literature and experimental results have discussed whether a technique is more successful than others. However, in the context of model-based testing, only a few attempts have been made towards either proposing or experimenting test case prioritization techniques. Moreover, a number of factors that may influence on the results obtained still need to be investigated before more general conclusions can be reached. In this paper, we present empirical studies that focus on observing the effects of two factors: the structure of the model and the profile of the test case that fails. Results show that the profile of the test case that fails may have a definite influence on the performance of the techniques investigated. context or in a more specific context, such as regression testing, depending on the information that is considered by the techniques [4]. Moreover, both code-based and specificationbased test suites can be handled, although most techniques presented in the literature have been defined and evaluated for code-based suites in the context of regression testing [5] [6]. Keywords—Experimental Software Engineering, Software Testing, Model-Based Testing, Test Case Prioritization. 
Techniques for ordering the test cases may be required to support test case selection, for instance, to address constrained costs of running and analysing the complete test suite and also to improve the rate of fault detection. However, to the best of our knowledge, there are only few attempts presented in the literature to define test case prioritization techniques based on model information [10] [11]. Generally, empirical studies are preliminary, making it difficult to assess current limitations and applicability of the techniques in the MBT context. I. I NTRODUCTION The artifacts produced and the modifications applied during software development and evolution are validated by the execution of test cases. Often, the produced test suites are also subject to extensions and modifications, making management a difficult task. Moreover, their use can become increasingly less effective due to the difficulty to abstract and obtain information from test execution. For instance, if test cases that fail are either run too late or are difficult to locate due to the size and complexity of the suite. To cope with this problem, a number of techniques have been presented in the literature. These techniques can be classified as: test case selection, test suite reduction and test case prioritization. The general test case selection problem is concerned with selecting a subset of the test cases according to a specific (stop) criterion, whereas test suite reduction techniques focus on selecting a subset of the test cases, but the selected subset must provide the same coverage as the original suite [1]. While the goal of selection and reduction is to produce a more cost-effective test suite, studies presented in the literature have shown that the techniques may not work effectively, since some test cases are discarded and consequently, some failures may not be revealed [2]. On the other hand, test case prioritization techniques have been investigated in order to address the problem of defining an execution order of the test cases according to a given testing goal, particularly detecting faults as early as possible [3]. These techniques can be applied either in a general development Model-based Testing (MBT) is an approach to automate the design and generation of black-box test cases from specification models, together with all oracle information needed [7]. MBT can be applied to any model with different purposes, from which specification-based test cases are derived, and also at different testing levels. As usually, automatic generation produces a big number of test cases that may also have a considerable degree of redundancy [8] [9]. To provide useful information that may influence on the development of prioritization techniques, empirical studies must focus on controlling and/or observing factors that may determine the success of a given technique. Given the goals of prioritization in the context of MBT, a number of factors can be determinant such as the size and the coverage of the suite, the structure of the model (that may determine the size and structure of test cases), the amount and distribution of failures and the degree of redundancy of test cases. In this paper, we investigate mainly the influence of two factors: the structure of the model and the profile of the test cases that fail. For this, we conduct 3 empirical studies, where real application models, as well as automatically generated ones, are considered. 
The focus is on general prioritization techniques that can be applied to MBT test suites. The purpose of the first study was to acquire preliminary observations by considering real application models. From this study, we concluded that a number of different factors may influence on the performance of the techniques. Therefore, the purpose of the second and third studies, the main contribution of this paper, was to investigate on specific factors by controlling them through the use of generated models. Results from these studies show that, despite the fact that the structure of the models may present or not certain constructions (for instance the presence of loops1 ), it is not possible to differentiate the performance of the techniques when focusing on the presence of the construction investigated. On the other hand, depending on the profile of the test case that fails (longest, shortest, essential, and so on), one technique may perform better than the other. of the source code/model. Then, a set of permutations P T S is obtained and the T S0 that has the highest value of f is chosen. In the studies presented in this paper, we focus on system level models, that can be represented as activity diagrams and/or as labelled transition systems with inputs and outputs as transitions. Models are generated according to the strategy presented by Oliveira Neto et al. [12]. Test cases are sequences of transitions extracted from a model by a depth-search algorithm as presented by Cartaxo et al. [9] and Sapna and Mohanty [11]. Prioritization techniques receive as input a test suite and produces as output an ordering for the test cases. When the goal is to increase fault detection, the Average Percentage of Fault Detection (APFD) metric has been largely used in the literature. The highest the APFD value is, the faster and the better the fault detection rates are [14]. The paper is structured as follows. Section II presents fundamental concepts along with a quick definition of the prioritization techniques considered in this paper. Section III discusses related works. Section IV presents a preliminary study where techniques are investigated in the context of two real applications, varying the ammount of faults. Sections V and VI presents the main empirical studies conducted: the former reports a study with automatically generated models where the presence of certain structural constructions is controlled, whereas the latter depicts a study with automatically generated models that are investigated for different profiles of the test case that fails. Section VII presents concluding remarks about the results obtained and pointers for further research. Details on the input models and data collected in the studies can be found at the project site2 . Empirical studies have been defined according to the general framework proposed by Wohlin [13] and the R tool3 has been used to support data analysis. II. BACKGROUND This section presents the test case prioritization concept (subsection II-A) and the techniques considered in this paper (subsection II-B). A. Test Case Prioritization Test case prioritization is a technique that orders test cases in an attempt to maximize an objective function. This problem was defined by Elbaum et al. as follows [14]: Given: T S, a test suite; P T S, a set of permutations of T S; and, f , a function that maps P T S to real numbers (f : P T S → R). 
Problem: Find a T S0 ∈ P T S | ∀ T S00 (T S00 ∈ P T S) (T S00 6= T S0) · f (T S0) ≥ f (T S00) The objective function is defined according to the goal of the test case prioritization. For instance, the manager may need to quickly increase the rate of fault detection or the coverage 1 A number of loops distributed in a model may lead to huge test suites with a certain degree of redundancy between the test cases even if they are traversed only once for each test case. 2 https://sites.google.com/a/computacao.ufcg.edu.br/mb-tcp/ 3 http://www.r-project.org/ Note that the key point for the test case prioritization is the goal, and the success of the prioritization is measured by this goal. However, it is necessary to have some data (according to the defined goal) to calculate the function for each permutation. Then, for each test case, a priority is assigned and test cases with the highest priority are scheduled to execute first. Test case prioritization can be applied in code-based and specification-based contexts, but it has been more applied in the code-based context and it is often related to regression testing. This way, Rothermel et al. [4] has proposed the following classification: • General test case prioritization - test case prioritization is applied any time in the software development process, even in the initial testing activities; • Regression testing prioritization - test case prioritization is performed after a set of changes was made. Therefore, test case prioritization can use information gathered in previous runs of existing test cases to help prioritize the test cases for subsequent runs. B. Techniques This subsection presents general test case prioritization techniques that will be approached in this paper. Optimal. This technique is largely used in experiments as upper bound on the effectiveness of the other techniques. This technique presents the best result that can be obtained. To obtain the best result, it is necessary to have, for example, the faults (if the goal is to increase fault detection) as input that are not available in practice (so the technique is not feasible in practice). For this, we can only use applications with known faults. This let us determine the ordering of test cases that maximizes a test suite’s rate of fault detection. Random. This technique is largely used in experiments as lower bound on the effectiveness of the other techniques [6], based on a random choice strategy. Adaptive Random Testing (ART). This technique distributes the selected test case as spaced out as possible based on a distance function [15]. To apply this technique, two sets of test cases are required: the executed set (the set of distinct test cases that have been executed but without revealing any failure) and the candidate set (the set of test cases that are randomly selected without replacement). Initially, the executed set is empty and the first test case is randomly chosen from the input domain. The executed set is then updated with the selected element from the candidate set. From the candidate set, an element that is farthest away from all executed test cases, is selected as the next one. There are several ways to implement the concept of farthest away. In this paper, we will consider: • Jaccard distance: The use of this function in the prioritization context was proposed by Jiang et al. [6]. It calculates the distance between two sets and it is defined as 1 minus the size of the intersection divided by the size of the union of the sample sets. 
In our context, we consider a test case as an ordered set of edges (that represent transitions). Considering p and c as test cases and B(p) and B(c) as the sets of branches covered by p and c, respectively, the distance between them can be defined as follows:

J(p, c) = 1 − |B(p) ∩ B(c)| / |B(p) ∪ B(c)|

• Manhattan distance: This distance, proposed by Zhou [16], is calculated by using two arrays. Each array has its size equal to the number of branches in the model. Since this function is used to evaluate the distance between two test cases, each test case is associated with one array. For each position of the array, 1 is assigned if the test case covers the corresponding branch and 0 otherwise.

Fixed Weights. This technique was proposed by Sapna and Mohanty [11] and is a prioritization technique based on UML activity diagrams. The structures of the activity diagram are used to prioritize the test cases. First of all, the activity diagram is converted into a tree structure. Then, weights are assigned according to the structure of the activity diagram (3 for fork-join nodes, 2 for branch-merge nodes, 1 for action/activity nodes). Lastly, the weight for each path is calculated (the sum of the weights assigned to its nodes and edges) and the test cases are prioritized according to the weight sums obtained.

STOOP. This technique was proposed by Kundu et al. [17]. The inputs are sequence diagrams. These diagrams are converted into a graph representation called a sequence graph (SG). After this, the SGs are merged. From the merged sequence graph, the test cases are generated. Lastly, the set of test cases is prioritized: the test cases are sorted in descending order of the average weighted path length (AWPL) metric, defined as

AWPL(pk) = ( Σ_{i=1..m} eWeight(ei) ) / m

where pk = ⟨e1, e2, ..., em⟩ is a test case and eWeight(ei) is the number of test cases that contain the edge ei.

III. RELATED WORK

Several test case prioritization techniques have been proposed and investigated in the literature. Most of them focus on code-based test suites and the regression testing context [18], [19]. The experimental studies presented have discussed whether a technique is more effective than others, comparing them mainly by the APFD metric, and, so far, no experiment has presented general results. This evidences the need for further investigation and empirical studies that can contribute to advances in the state of the art.

Regarding code-based prioritization, Zhou et al. [20] compared the fault-detection capabilities of Jaccard-distance-based ART and Manhattan-distance-based ART. Branch coverage information was used for test case prioritization, and the results showed that the Manhattan distance is more effective than the Jaccard distance in the context considered [20]. Also, Jeffrey and Gupta [21] proposed an algorithm that prioritizes test cases based on coverage of statements in relevant slices and discussed insights from an experimental study that also considers total coverage. Moreover, Do et al. [22] presented a series of controlled experiments evaluating the effects of time constraints and faultiness levels on the costs and benefits of test case prioritization techniques. The results showed that time constraints can significantly influence both the cost and effectiveness; moreover, when there are time constraints, the effects of increased faultiness are stronger. Furthermore, Elbaum et al.
[5] compared the performance of 5 prioritization techniques in terms of effectiveness, and showed how the results of this comparison can be used to select a technique (regression testing) [18]. They applied the prioritization techniques to 8 programs. Characteristics of each program (such as: number of versions, KLOC, number and size of the test suites, and average number of faults) were taken into account. By considering the use of models in the regression testing context, Korel et al. [10], [19], [23] presented two model-based test prioritization methods: selective test prioritization and model dependence-based test prioritization. Both techniques focus on modifications made to the system and models. The inputs are the original EFSM system model and the modified EFSM. Models are run to perform the prioritization. On the other hand, our focus is on general prioritization techniques were modifications are not considered. Generally, in the MBT context, we can find proposals to apply general test case prioritization from UML diagrams, such as: i) the technique proposed by Kundu [17] et al. where sequence diagrams are used as input; and ii) the technique proposed by Sapna and Mohanty [11] where activity diagrams are used as input. Both techniques are investigated in this paper. In summary, the original contribution of this paper is to present empirical studies in the context of MBT that consider different techniques and factors that may influence on their performance such as the structure of the model and the profile of the test case that fails. IV. F IRST E MPIRICAL S TUDY The main goal of this study is to “analyze general prioritization techniques for the purpose of comparing their performances, observing the impact of the number of test cases that fail, with respect to their ability to reveal failures earlier, from the point of view of the tester and in the context of MBT”. We worked with the following research hypothesis: “The general test case prioritization techniques present different abilities of revealing failures, considering different amount of failing test cases in the test suite”. In the next sections, we present the study design and analysis on data collected. A. Planning We conducted this experiment in a research laboratory – a controlled environment. This characteristic leads to an offline study. Moreover, all the techniques involved in the study only require the test cases to prioritize and the mapping from the branches of the system model to the test cases that satisfy each branch. Thus, no human intervention is required, eliminating the “expertise” influence. As objects, from which system models were derived, we considered real systems. Despite the fact that the applications are real ones, they do not compose a representative sample of the whole set of applications and, thereby, this experiment deal with a specific context. In order to analyze the performance of the techniques, observing the influence of the number of test cases that fail, we defined the following variables: Independent variables and factors • • General prioritization techniques: Techniques defined in Section II. 
We will consider the following short-names for the sake of simplicity: optimal, random, ARTjac (Adaptive Random Testing with Jaccard distance), ARTman (Adaptive Random Testing with Manhattan distance), fixedweights, stoop; Number of test cases that fail: low (lower than 5% of the total), medium (between 5% and 15% of the total), high (higher than 15% of the total); Dependent variable • Average Percentage of Fault Detection - APFD In this study, we used two system models from two realworld applications: i) Labelled Transition System-Based Tool – LTS-BT [24] – a MBT activities support tool developed in the context of our research group and ii) PDF Split and Merge - PDFsam4 – a tool for PDF files manipulation. They were modelled by UML Activity Diagram, using the provided use cases documents and the applications themselves. From this diagram a graph model was obtained for each application, from which test cases were generated by using a depth search-based algorithm proposed by Sapna and Mohanty [11] where each loop is considered two times at most. Table I shows some structural properties from the models and the test cases that were generated from them to be used as input to the techniques. It is important to remark that test cases for all techniques were obtained from the same model using a single algorithm. Also, even though the STOOP technique has been generally proposed to be applied from sequence diagrams, the technique itself works on an internal model that combines the diagrams. Therefore, it is reasonable to apply STOOP in the context of this experiment. TABLE I. S TRUCTURAL PROPERTIES OF THE MODELS IN THE EXPERIMENT. Property Branching Nodes Loops Join Nodes Test Cases Shortest Test Case Longest Test Case Defects TC reveal failures LTS-BT 26 0 7 53 10 34 4 14 PDFSam 11 5 6 87 17 43 5 32 The number of test cases that fail variable was defined considering real and known defects in the models and allocated as shown in Table II. 4 Project’s site: http://www.pdfsam.org TABLE II. Level low medium high D EFINITION OF THE T EST C ASES THAT FAIL VARIABLE Failures in LTS-BT 2 test cases → 3,77% 4 test cases → 7,54% 8 test cases → 15,09% Failures in PDFSam 4 test cases → 4,59% 7 test cases → 8,04% 16 test cases → 18,39% The relationship between a defect (associated with a specific edge in the model) and a failure (a test case that fails) is that when a test case exercises the edge, it reveals the failure. For each different level, we considered a different set of defects of each model, and in the high level, two defects originate the failures. Moreover, these test cases do not reveal the two defects at the same time for the two models. By using the defined variables and detailing the informal hypothesis, we postulated eight pairs of statistical hypotheses (null and alternative): three pairs evaluating the techniques at each level of number of test cases that fail (e.g. H0 : AP F D(low,i) = AP F D(low,j) and H1 : AP F D(low,i) 6= AP F D(low,j) , for techniques i and j, with i 6= j) and five pairs evaluating the levels for each technique (e.g. H0 : AP F D(random,k) = AP F D(random,l) and H1 : AP F D(random,k) 6= AP F D(random,l) , for levels k and l, with k 6= l), excluding the optimal technique. For the lack of space, the hypotheses pairs are not written here. Based on the elements already detailed, the experimental design for this study is One-factor-at-a-time [25]. 
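Since all of these hypotheses are stated over APFD values, the sketch below makes the dependent variable concrete; it uses the usual APFD formula from the prioritization literature (1 − ΣTFi/(n·m) + 1/(2n)), and the function names and the tiny example are ours, not part of the study's tooling.

def apfd(ordering, failing_sets):
    # ordering: test case ids in execution order (n test cases)
    # failing_sets: one set per fault, holding the ids of the test cases that reveal it
    n, m = len(ordering), len(failing_sets)
    position = {tc: i + 1 for i, tc in enumerate(ordering)}  # 1-based positions
    # TF_i: position of the first test case that reveals fault i
    tf = [min(position[tc] for tc in fault if tc in position) for fault in failing_sets]
    return 1.0 - sum(tf) / (n * m) + 1.0 / (2 * n)

# Hypothetical example: 4 test cases, a single fault revealed only by "t3":
# apfd(["t1", "t2", "t3", "t4"], [{"t3"}]) == 0.375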
The data analysis for the hypothesis pairs is based on two-way ANOVA [26] [27], after checking the assumptions of normality of residuals and equality of variances; if any assumption is not satisfied, a nonparametric analysis is performed instead. We calculated the number of replications based on a pilot sample, using the sample-size formula proposed by Jain [27], and obtained 815 replications for a precision (r) of 2% of the sample mean and a significance (α) of 5%.

The following steps were executed to perform the experiment: 1) instantiate lists for data collection for each replication needed; 2) instantiate the failure models to be considered; 3) generate test cases; 4) map branches to test cases; 5) execute each technique for each object considering the replications needed; 6) collect data and compute the dependent variable; 7) record and analyse results. All techniques were automatically executed.

B. Data Analysis

When analysing the collected data, we must first verify the ANOVA assumptions. Figure 1 shows that the residuals are not normally distributed: the solid line formed by the sample quantiles should lie close to the straight line of the normal distribution, but it does not. Thus, we proceeded with a nonparametric analysis. An analysis of the 95% confidence intervals of the pseudomedians5 of the collected APFD values, shown in Table III, gives a first insight into the rejection of some of the null hypotheses.

5 The pseudomedian is a nonparametric estimator for the median of a population [28].

Fig. 1. QQ-Plot of the residuals and the normal distribution.

TABLE III. CONFIDENCE INTERVALS OF THE PSEUDOMEDIANS.
Technique      Low               Medium            High
optimal        [0.992, 0.992]    [0.992, 0.992]    [0.992, 0.992]
random         [0.807, 0.829]    [0.864, 0.876]    [0.834, 0.847]
ARTJac         [0.902, 0.906]    [0.888, 0.900]    [0.877, 0.885]
ARTMan         [0.808, 0.830]    [0.863, 0.876]    [0.839, 0.850]
fixedweights   [0.540, 0.543]    [0.436, 0.439]    [0.679, 0.679]
stoop          [0.244, 0.244]    [0.319, 0.319]    [0.560, 0.560]

The set of hypotheses defined for this experiment compares the techniques from two points of view: i) the whole set of techniques at each single level, and ii) each technique in isolation across the different levels.
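Before looking at the results, a simplified version of the chain of tests applied across the studies (a normality check followed by a parametric or nonparametric two-sample comparison at 5% significance) can be sketched with SciPy; the studies themselves were analysed with the R tool, so this is only an illustrative equivalent and the helper names are ours.

from scipy import stats

def is_normal_5pct(sample):
    # Anderson-Darling normality test; index 2 of critical_values is the 5% level
    result = stats.anderson(sample, dist="norm")
    return result.statistic < result.critical_values[2]

def compare_apfd_samples(apfd_a, apfd_b, alpha=0.05):
    if is_normal_5pct(apfd_a) and is_normal_5pct(apfd_b):
        _, p = stats.ttest_ind(apfd_a, apfd_b, equal_var=False)  # Welch's t-test
    else:
        _, p = stats.mannwhitneyu(apfd_a, apfd_b, alternative="two-sided")
    return p, p < alpha  # p-value and whether H0 (equal performance) is rejected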
For the first set of hypotheses, when considering the levels of the number of test cases that fail separately (the columns of Table III), some confidence intervals do not overlap, and therefore the corresponding null hypotheses of equality must be rejected. However, at all three levels there is an overlap between random and ARTMan, and the p-values of Mann-Whitney tests between the two techniques are 0.9516, 0.9399 and 0.4476 for low, medium and high, respectively. These p-values are greater than the significance of 5%, thus the performance of these techniques is statistically similar at this significance.

For the second set of hypotheses, analyzing each technique separately (the rows of Table III), all the null hypotheses of equality must be rejected, since no technique presents overlapping confidence intervals across the levels. This means that the performance of the techniques can vary when more or fewer test cases fail.

As general observations, ARTJac presented the best performance for the three levels. Moreover, the techniques presented slight variations when considering the three levels (by increasing or decreasing), except for fixedweights and stoop, which increase more than the other techniques. These techniques, which are mostly based on structural elements of the test cases, may be more affected by the number of test cases that fail than the random-based ones. Furthermore, by increasing the level of the number of test cases that fail, different evolution patterns in the techniques' performance arise; e.g., stoop increases its performance with the growth of the level, while fixedweights decreases its performance when the level goes from low to medium and increases when it goes from medium to high. These different patterns are evidence of the influence of other factors on the studied techniques, which motivated the experiments presented in Sections V and VI.

C. Threats to Validity

As a controlled experiment with statistical analysis, measures were rigorously taken to address conclusion validity regarding data treatment and assumptions, and the number of replications and tests needed. For the internal validity of this experiment, it is often difficult to represent a defect at a high abstraction level, since a code defect may refer to detailed contents; an abstract defect may therefore correspond to one or more defects at code level, and so on. To mitigate this threat, we considered test cases that fail as the measure instead of counting defects (even though we had data on the real defects). This decision suits our experiment, since the APFD metric focuses on failures rather than defects. The construct validity, regarding the set of techniques and the evaluation metric chosen for the study, was supported by a systematic review [29] that revealed suitable techniques and evaluation metrics, properly representing the research context. The low number of system models used in this experiment threatens its external validity, since two models do not represent the whole universe of applications. However, as a preliminary study, it aimed only at observing a specific context.

V. SECOND EMPIRICAL STUDY

Motivated by the study reported in Section IV, this section contains a report of an empirical study that aims to “analyze general prioritization techniques for the purpose of observing the influence of the model structure over the studied techniques, with respect to their ability to reveal failures earlier, from the point of view of the tester and in the context of Model-Based Testing”. Complementing the definition, we postulated the following research hypothesis: “The general test case prioritization techniques present different abilities to reveal failures, considering models with different structures”.

A.
Planning We also conducted this experiment in a research environment and the techniques involved in the study need the same artifacts from the first experiment – the test suite generated through a MBT test case generation algorithm. Thus, the execution of the techniques does not need human intervention, what eliminates the factor “experience level” from the experiment. The models that originate the test suites processed in the experiment were generated randomly using a parametrized graph generator (Section V-B). Thus, the models do not represent real application models. For this study, we defined the following variables: Independent variables • General prioritization techniques (factor): ARTJac, stoop; • Number of branch constructions to be generated in the input models (factor): 10, 30, 80; • Number of join constructions to be generate in the input models (factor): 10, 20, 50; • Number of loop constructions to be generate in the input models (factor): 1, 3, 9; • Maximum depth of the generated models (fixed value equals to 25); • Rate of test cases that fail (fixed value equals to 10%); Dependent variable • Average Percentage of Fault Detection - APFD. For the sake of simplicity of the experimental design required when considering all techniques and variables, in this study, we decided to focus only on two techniques among the ones considered in Section IV – ARTJac and stoop – particularly the ones with best and worst performance, respectively. They can be seen as representatives of the random-based and structural based techniques considered respectively. Moreover, we defined the values for the variables that shape the models based on the structural properties from the models considered in the motivational experiment reported in the Section IV. In this experiment, we do not desire the effect of the failures location over the techniques, thus we selected failures randomly. To mitigate the effect of the number of test cases that fail, we assign a constant rate of 10% of the test cases to reveal failure. In order to evaluate the model structure, we defined three different experimental designs and according to Wu and Hamada [25], each one is a one-factor-at-a-time. The designs are described in the next subsections. 1) Branches Evaluation: In order to evaluate the impact of the number of branches in the capacity of revealing failures, we defined three levels for this factor and fixed the number of joins and branches in zero. For each considered level of number of branches with the another parameters fixed, 31 models were generated by the parameterized generator. For each model, the techniques were executed with 31 different random failure attributions and we gathered the APFD value of each execution. We postulated five pairs of statistical hypotheses: three analyzing each level of the branches with the null hypothesis of equality between the techniques and the alternative indicating they have a different performance (e.g. H0 : AP F D(ART Jac,10 branch) = AP F D(Stoop,10 branch) and H1 : AP F D(ART Jac,10 branch) 6= AP F D(Stoop,10 branch) ) and two related to each technique isolately, comparing the performance in the three levels with the null hypotheses of equality and alternative indicating some difference (e.g. H0 : AP F D(ART Jac,10 branch) = AP F D(ART Jac,30 branch) = AP F D(ART Jac,80 branch) and H1 : AP F D(ART Jac,10 branch) 6= AP F D(ART Jac,30 branch) 6= AP F D(ART Jac,80 branch) ). 
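A minimal sketch of the data-collection loop behind this design is shown below; generate_model, generate_test_suite and prioritize are hypothetical placeholders for the corresponding steps of the study, and apfd is the helper sketched earlier, so the snippet only makes the 31 x 31 replication scheme and the fixed 10% failure rate concrete.

import random

def collect_apfd_samples(technique, n_branches, n_joins=0, n_loops=0,
                         n_models=31, n_assignments=31, fail_rate=0.10):
    samples = []
    for _ in range(n_models):
        # hypothetical helpers standing for the study's generation and prioritization steps
        model = generate_model(n_branches, n_joins, n_loops, max_depth=25)
        suite = generate_test_suite(model)
        n_failing = max(1, round(fail_rate * len(suite)))
        for _ in range(n_assignments):
            failing = random.sample(suite, n_failing)   # random failure attribution
            ordering = prioritize(technique, suite, model)
            samples.append(apfd(ordering, [{tc} for tc in failing]))
    return samples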
2) Joins Evaluation: In the number of joins evaluation, we proposed a similar design, but varying just the number of joins and fixing the another variables. We fixed the number of branches in 50, loops in zero and all the details that were exposed in the branch evaluation are applied for this design. The reason for allowing 50 branches is that branches may be part of a join. Therefore, we cannot consider 0 branches. The corresponding set of hypotheses follows the same structure of the branch evaluation, but considering the number of joins. 3) Loops Evaluation: In the number of loops evaluation, once again, we proposed a similar design, but varying only the number of loops and fixing the number of branches in 30 and the joins in 15 (again, this structures are commonly parte of a loop, so it is not reasonable to consider 0 branches and 0 joins). We structured a similar set of hypotheses as in the branch evaluation, but considering the three levels of the number of loops variable. The following steps were executed to perform the experiment: 1) Generate test models as described in Section V-B; 2) Instantiate lists for data collection for each replication needed; 3) Instantiate the failure models to be considered; 4) Generate test cases; 5) Map branches to test cases; 6) Execute each technique for each object considering the replications needed; 7) Collect data and compute dependent variable; 8) Record and analyse results. All techniques were automatically executed and test cases were generated by using the same algorithm as in Section IV. B. Model Generation The considered objects for this study are the randomly generated models. The generator receives five parameters: 1) 2) 3) 4) 5) Number of branch constructions; Number of join constructions; Number of loop constructions; The maximum depth of the graphs; The number of graphs to generate. The graph is created by executing operations to include the constructions in sequences of transitions (edges). The first step is to create an initial sequence using the forth parameter, e.g. let a maximum depth be equal to five, so a sequence with five edges is created, as in the Figure 2. Fig. 2. Initial configuration of a graph with maximum depth equals to 5. Over this initial configuration, the generator executes the operations. To increase the probability of generating structurally different graphs, the generator executes operations randomly, but respecting the number passed as parameter. The generator perform the operations of adding branching, joining, and looping in the following way: • Branching: from a non-leaf random node x, create two more new nodes y and z and create two new edges (x, y) and (x, z) (Figure 3a); • Joining: from two non-leaf different random nodes x and y, create a new node z and create two new edges (x, z) and (y, z) (Figure 3b); • Looping: from two non-leaf different random nodes x and y, with depth(x) > depth(y), create a new edge (x, y) (Figure 3c). Following the analysis, we performed three tests, as summarized in the Table V. We chose the test according to the normality from the samples: for normal samples, we performed the T-test and for non-normal samples, the Mann-Whitney test. TABLE V. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE FIRST EXPERIMENTAL DESIGN SAMPLES . 10 Branches 0.7497 (a) Branching the node 4 to nodes 7 and 8. (b) Joining the nodes 2 and 5 to node 7. (c) Looping the node 4 to 2. Fig. 3. Examples of operations performed by the parametrized graph generator. 
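A compact sketch of such a generator is given below, under the assumption that the graph is kept as a plain edge list with per-node depths; it mirrors the three operations described above and in Fig. 3, but it is not the authors' implementation.

import random

def generate_graph(n_branches, n_joins, n_loops, max_depth):
    # Illustrative sketch: apply branching/joining/looping operations, in random
    # order, to an initial chain of max_depth edges (depth bookkeeping is an assumption).
    edges = [(i, i + 1) for i in range(max_depth)]   # initial sequence of edges
    depth = {i: i for i in range(max_depth + 1)}     # node id -> depth

    def non_leaf_nodes():
        with_children = {src for src, _ in edges}
        return [n for n in depth if n in with_children]

    next_id = max_depth + 1
    operations = ["branch"] * n_branches + ["join"] * n_joins + ["loop"] * n_loops
    random.shuffle(operations)
    for op in operations:
        if op == "branch":                           # x -> y and x -> z (two new nodes)
            x = random.choice(non_leaf_nodes())
            for _ in range(2):
                edges.append((x, next_id))
                depth[next_id] = depth[x] + 1
                next_id += 1
        elif op == "join":                           # x -> z and y -> z (one new node)
            x, y = random.sample(non_leaf_nodes(), 2)
            edges += [(x, next_id), (y, next_id)]
            depth[next_id] = max(depth[x], depth[y]) + 1
            next_id += 1
        else:                                        # loop: edge from the deeper node back
            x, y = random.sample(non_leaf_nodes(), 2)
            if depth[x] < depth[y]:
                x, y = y, x
            edges.append((x, y))
    return edges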
The generator execute the same process as many times as the number of graphs to generate parameter indicates. C. Data Analysis 80 Branches 0.1745 All the p-values on the Table V are greater than the defined significance of 5%, so the null hypothesis of equality of the techniques cannot be rejected, at the defined significance level, in other words, the two techniques presented similar performance at each level separately. The next step of the analysis is to evaluate each technique separately through the levels and we proceeded a nonparametric test of Kruskal-Wallis to test their correspondent hypothesis. The tests calculated for ARTJac and stoop p-value equals to 0.6059 and 0.854 respectively. Comparing the pvalues against the significance level of 5%, we cannot reject the null hypothesis of equality between the levels for each technique, so the performance is similar, at this significance level. 2) Joins Analysis: Following the same approach from the first experimental design, we can see on Table VI the p-values of the normality tests. The bold face p-values indicate the samples normally distributed, at the considered significance. TABLE VI. As we divided the whole experiment into three experimental designs, the data analysis will respect the division. Basically, we followed the same chain of tests for the three designs. Firstly, we tested the normality assumptions over the samples using the Anderson-Darling test and the equality of variances through F-test. Depending on the result of these tests, we chose the next one, that evaluate the equality of the samples, Mann-Whitney or T-Test. After evaluate the levels separately, we tested the techniques separately through the three levels using ANOVA or Kruskal-Wallis test. We considered for each test the significance level of 5%. The objective in this work is to expose influences of the studied structural aspects of the models on the performance of the techniques, thus if the p-value analysis in a hypothesis testing suggests that the null hypothesis of equality may not be rejected, this is an evidence that the variable considered alone does not affect the performance of the techniques. On the other hand, if the null hypothesis must be rejected, it represents a evidence of some influence. 1) Branches Analysis: The first activity for the analysis is the normality test and Table IV summarizes this step. The two samples from the low level had the null hypotheses of normality rejected. TABLE IV. P-VALUES FOR THE A NDERSON -DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE FIRST EXPERIMENTAL DESIGN SAMPLES . N ORMAL SAMPLES ARE IN BOLD FACE . ART Jaccard Stoop 30 Branches 0.9565 10 Branches 30 Branches 80 Branches 3.569 · 10−15 2.207 · 10−13 0.3406 0.273 0.3566 0.06543 P-VALUES FOR THE A NDERSON -DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE SECOND EXPERIMENTAL DESIGN SAMPLES . N ORMAL SAMPLES ARE IN BOLD FACE . ART Jaccard Stoop 10 Joins 0.9394 0.5039 20 Joins 0.8015 0.5157 50 Joins 0.6733 0.05941 Based on these normality tests, we tested the equality of the performance of the techniques at each level and, according to the Table VII, the techniques performs statistically in a similar way at all levels. TABLE VII. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE SECOND EXPERIMENTAL DESIGN SAMPLES . 10 Joins 0.9816 20 Joins 0.62 50 Joins 0.06659 The next step is to assess each technique separately. 
We executed a Kruskal-Wallis test comparing the three samples for ARTJac and stoop and the p-value was 0.4418 and 0.3671, respectively. Comparing with the significance level considered of 5%, both null hypothesis of equality was not rejected, what means the techniques behave similarly through the levels. 3) Loops Analysis: Following the same line of argumentation, the first step is to evaluate the normality of the measured data and the Table VIII summarizes these tests. According to the results of the normality tests, we tested the equality of the techniques at each level of this experimental design. As we can see on Table IX, the null hypotheses for 1 Loop, 3 Loops and 9 loops cannot be rejected because they have p-value greater than 5%, thus the techniques present similar behaviour for all levels of the factor. TABLE VIII. P-VALUES FOR THE A NDERSON -DARLING NORMALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE THIRD EXPERIMENTAL DESIGN SAMPLES . N ORMAL SAMPLES ARE IN BOLD FACE . ART Jaccard Stoop 1 Loop 3 Loops 9 Loops 0.07034 0.985 0.02681 0.08882 2.75 · 10−10 9.743 · 10−11 TABLE IX. P-VALUES FOR THE SAMPLES EQUALITY TESTS WITH 5% OF SIGNIFICANCE FROM THE THIRD EXPERIMENTAL DESIGN SAMPLES . 1 Loop 0.141 3 Loops 0.6049 9 Loops 0.07042 Analyzing the two techniques separately through the levels, we performed the non-parameric Kruskal-Wallis test and the p-values were 0.9838 and 0.3046 for ARTJac and stoop, respectively. These p-values, compared with the significance level of 5%, indicate that the null hypotheses of the considered pairs cannot be rejected, in another words, the techniques perform statistically similar through the different levels of the number of looping operations. A. Planning We performed the current experiment in the same environment of the previous ones and the application models used in this experiment are the same used in Section V. Since we do not aim to observing variations of model structure, we considered the 31 models that were generated with 30 branches, 15 joins, 1 loop and maximum depth 25. For this experiment, we defined these variables: Independent variables • General prioritization techniques (factor): ARTJac, stoop; • Failure profiles, i.e., characteristics of the test cases that fail (factor); ◦ Long test cases – with many steps (longTC); ◦ Short test cases – with few steps (shortTC); ◦ Test cases that contains many branches (manyBR); ◦ Test cases that contains few branches (fewBR); ◦ Test cases that contains many joins (manyJOIN); ◦ Test cases that contains few joins (fewJOIN); ◦ Essential test cases (ESSENTIAL) (the ones that uniquely covers a given edge in the model); • Number of test cases that fail: fixed value equals to 1; D. Threats to Validity About the validity of the experiment, we can point some threats. To the internal validity, we defined different designs to evaluate separately the factors, therefore, it is not possible analyze the interaction between the number of joins and branches, for example. We did it because some of the combinations between the three variables might be unfeasible, e.g. a model with many joins and without any branch. Moreover, we did not calculate the number of replications in order to achieve a defined precision because the execution would be infeasible (conclusion validity). The executed configuration took several days because some test suites were huge. 
To deal with this limitation, we limited the generation to 31 graphs for each experimental design and 31 failure attributions for each graph, keeping the balancing principle [13] and samples with size greater than, or equal to, 31 are wide enough to test for normality with confidence [26], [27]. Furthermore, the application models were generated randomly to deal with the problem of lack of application models, but, at the same time, this reduces the capability of represent the reality, threatening the external validity. To deal with this, we used structural properties, e.g. depth and number of branches, from existent models. VI. T HIRD E MPIRICAL S TUDY This section contains a report of an experiment that aims to “analyze general prioritization techniques for the purpose of observing the failure profile influence over the studied techniques, with respect to their ability to reveal failures earlier, from the point of view of the tester and in the context of Model-Based Testing”. Complementing the definition, we postulated the following research hypothesis: “The general test case prioritization techniques present different abilities to reveal failures, considering that the test cases that fail have different profiles”. We are considering profiles as the characteristics of the test cases that reveal failures. Dependent variable • Average Percentage of Fault Detection - APFD. A special step is the failure assignment, according to the profile. As the first step, the algorithm sorts the test cases according to the profile. For instance, for the longTC profile, the test cases are sorted decreasingly by the length or number of steps. If there are more than one with the biggest length (same profile), one of them is chosen randomly. For example, if the maximum size of the test cases is 15, the algorithm selects randomly one of the test cases with size equals to 15. Considering the factors, this experiment is a one-factor-ata-time, and we might proceed analysis between the techniques at each failure profile and between the levels at each technique. In the execution of the experiment, each one of the 31 models were executed with 31 different and random failure assigned to each profile, with just one failure at once (a total of 961 executions for each technique). This number of replications keeps the design balanced and gives confidence for testing normality [27]. Based on these variables and in the design, we defined the correspondent pairs of statistical hypotheses: i) to analyse each profile with the null hypothesis of equality between the techniques and the alternative indicating they have a different performance (e.g. H0 : AP F D(ART Jac,longT C) = AP F D(stoop,longT C) and H1 : AP F D(ART Jac,longT C) 6= AP F D(stoop,longT C) ), and also ii) to analyse each technique with the null hypothesis of equality between the profiles( ∀f1 , f2 ∈ {longTC, shortTC, manyBR, fewBR, manyJOIN, fewJOIN, ESSENTIAL}, f1 6= f2 · H0 : AP F D(ART Jac,f1 ) = AP F D(ART Jac,f 2) , and H1 : AP F D(ART Jac,f1 ) 6= AP F D(ART Jac,f 2) ). If the tests reject null hypotheses, this fact will be considered as an evidence of the influence of the failure profile over the techniques. Experiment execution followed the same steps defined in Section V. However, as mentioned before, each technique was run by considering one failure profile at a time. B. Data Analysis The boxplots from Figures 4 and 5 summarizes the trends of the data collected. 
The notches in the boxplots are a graphical representation of the confidence interval calculated by the R software. When these notches overlap, it suggests the better and deeper investigation of the statistical similarity of the samples. techniques, because frequently a test case among the longest ones are among the ones with the biggest number of branches. The same happens with the profiles ShortT C and F ewBR, by the same reasoning. There is a relationship between the profiles F ewJOIN and ESSEN T IAL as we can see in Figures 4 and 5. The essential test cases are the ones that contains some requirement uniquely, in this case a branch, only covered by itself, and by this definition, the test cases among the ones with least joins frequently are essentials. In summary, rejection of null hypothesis are a strong evidence of the influence of the failure profiles over the performance of the general prioritization techniques. Furthermore, data suggests that ARTJac may not have a good performance when the test case that fails is either long or with many branches. In this case, stoop has a slightly better performance. In the other cases, ARTJac has a better performance, similarly to results obtained in the first experiment with real applications (Section IV). C. Threats to Validity Regarding conclusion validity, we did not calculate the number of replications needed To deal with this threat of precision, we limited the random failure attributions at each profile for each graph in 31, keeping the balancing principle [13] and samples with size greater than, or equal to, 31 are wide enough to test for normality with confidence [26], [27]. Fig. 4. Boxplot with the samples from ARTJac. Construct validity is threatened by the definition of the failure profiles. We chose the profiles based on data and observations from previous studies, not necessarily the specific results. Thus, we defined them according to our experience and there might be other profiles not investigated yet. This threat is reduced by the experiment’s objective, that is to expose the influence of different profiles on the prioritization techniques performance, and not to show all the possible profiles. VII. C ONCLUDING R EMARKS This paper presents and discusses the results obtained from empirical studies on the use of test case prioritization techniques in the context of MBT. It is widely accepted that a number of factors may influence on the performance of the techniques, particularly due to the fact that the techniques can be based on different aspects and strategies, including or not random choice. Fig. 5. Boxplot with the samples from stoop. For testing the performance between the two techniques at every failure profile from a visual analysis of the boxplots of the samples, seen in the Figures 4 and 5, we can see that there are no overlaps between the techniques in any profile (the notches in the box plot do not overlap), in another words, at 5% of significance, ARTJac and stoop perform statistically different in every researched profile. Comparing each technique separately through the failure profiles, both of them present differences between the profiles, enough condition to also reject the null hypothesis of equality. By observing the profiles longT C and manyBR, in Figures 4 and 5, they incur in similar performances for the two In this sense, the main contribution of this paper is to investigate on the influence of two factors: the structure of the model and the profile of the test case that fails. 
The intuition behind this choice is that the structure of the model may determine the size of the generated test suites and the degree of redundancy among their test cases. Therefore, this factor may affect all of the techniques involved in the experiment, due to either the use of distance functions or the fact that the techniques consider certain structures explicitly. On the other hand, depending on the selection strategy, the techniques may favor the selection of certain profiles of test cases over others. Therefore, whether the test cases that fail have a certain structural property may also determine the success of a technique. To the best of our knowledge, there are no similar studies in the literature.

In summary, in the first study, performed with real applications in a specific context, the different growth patterns of APFD for the techniques can be considered evidence that factors other than the number of test cases that fail influence the performance of the general prioritization techniques. This result motivated the execution of the other studies. On one hand, the second study, aimed at investigating the influence of the number of occurrences of branches, joins and loops on the performance of the techniques, showed that there is no statistical difference in the performance of the studied techniques at a significance level of 5%. On the other hand, in the third study, based on the profile of the test case that fails, the fact that all of the null hypotheses were rejected may indicate a high influence of the failure profile on the performance of the general prioritization techniques. Moreover, from the perspective of the techniques, this study exposed weaknesses associated with these profiles. For instance, ARTJac presented low performance when long test cases (and/or test cases with many branches) reveal failures, and high performance when short test cases (and/or test cases with few branches) reveal failures. On the other hand, stoop showed low performance for almost all profiles. From these results, testers may opt for one technique or the other based on failure prediction and the profile of the test cases.

As future work, we will perform a more complex factorial experiment, calculating the interaction between the factors analyzed separately in the experiments reported in this paper. Moreover, we plan an extension of the third experiment to consider other techniques and also to investigate other profiles of test cases that may be of interest. From the analysis of the results obtained, new (possibly hybrid) techniques may emerge.

ACKNOWLEDGMENT

This work was supported by CNPq grants 484643/2011-8 and 560014/2010-4. This work was also partially supported by the National Institute of Science and Technology for Software Engineering (www.ines.org.br), funded by CNPq/Brasil, grant 573964/2008-4. The first author was also supported by CNPq.

REFERENCES

[1] M. J. Harrold, R. Gupta, and M. L. Soffa, "A methodology for controlling the size of a test suite," ACM Trans. Softw. Eng. Methodol., vol. 2, no. 3, pp. 270–285, Jul. 1993.
[2] D. Jeffrey and R. Gupta, "Improving fault detection capability by selectively retaining test cases during test suite reduction," Software Engineering, IEEE Transactions on, vol. 33, no. 2, pp. 108–123, 2007.
[3] G. Rothermel, R. Untch, C. Chu, and M. Harrold, "Test case prioritization: an empirical study," in Software Maintenance, 1999 (ICSM '99) Proceedings, IEEE International Conference on, 1999, pp.
179–188.
[4] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Prioritizing test cases for regression testing," IEEE Transactions on Software Engineering, vol. 27, pp. 929–948, 2001.
[5] S. G. Elbaum, A. G. Malishevsky, and G. Rothermel, "Test case prioritization: A family of empirical studies," IEEE Transactions on Software Engineering, February 2002.
[6] B. Jiang, Z. Zhang, W. K. Chan, and T. H. Tse, "Adaptive random test case prioritization," in ASE, 2009, pp. 233–244.
[7] M. Utting and B. Legeard, Practical Model-Based Testing: A Tools Approach, 1st ed. Morgan Kauffman, 2007.
[8] E. G. Cartaxo, P. D. L. Machado, and F. G. O. Neto, "Seleção automática de casos de teste baseada em funções de similaridade," in XXIII Simpósio Brasileiro de Engenharia de Software, 2008, pp. 1–16.
[9] E. G. Cartaxo, P. D. L. Machado, and F. G. Oliveira, "On the use of a similarity function for test case selection in the context of model-based testing," Software Testing, Verification and Reliability, vol. 21, no. 2, pp. 75–100, 2011.
[10] B. Korel, G. Koutsogiannakis, and L. Tahat, "Application of system models in regression test suite prioritization," in IEEE International Conference on Software Maintenance, 2008, pp. 247–256.
[11] S. P. G. and H. Mohanty, "Prioritization of scenarios based on uml activity diagrams," in CICSyN, 2009, pp. 271–276.
[12] F. G. O. Neto, R. Feldt, R. Torkar, and P. D. L. Machado, "Searching for models to test software technology," in Proc. of the First International Workshop on Combining Modelling and Search-Based Software Engineering (CMSBSE/ICSE 2013), 2013.
[13] C. Wohlin, P. Runeson, M. Host, M. C. Ohlsson, B. Regnell, and A. Wesslen, Experimentation in Software Engineering: An Introduction. Norwell, MA, USA: Kluwer Academic Publishers, 2000.
[14] S. Elbaum, A. G. Malishevsky, and G. Rothermel, "Prioritizing test cases for regression testing," in Proc. of the Int. Symposium on Software Testing and Analysis. ACM Press, 2000, pp. 102–112.
[15] T. Y. Chen, H. Leung, and I. K. Mak, "Adaptive random testing," in Advances in Computer Science - ASIAN 2004, ser. Lecture Notes in Computer Science, vol. 3321/2005. Springer, 2004, pp. 320–329.
[16] Z. Q. Zhou, "Using coverage information to guide test case selection in adaptive random testing," in IEEE 34th Annual COMPSACW, July 2010, pp. 208–213.
[17] D. Kundu, M. Sarma, D. Samanta, and R. Mall, "System testing for object-oriented systems with test case prioritization," Softw. Test. Verif. Reliab., vol. 19, no. 4, pp. 297–333, Dec. 2009.
[18] S. Elbaum, G. Rothermel, S. K., and A. G. Malishevsky, "Selecting a cost-effective test case prioritization technique," Software Quality Journal, vol. 12, 2004.
[19] B. Korel, L. Tahat, and M. Harman, "Test prioritization using system models," in Software Maintenance, 2005 (ICSM'05), Proceedings of the 21st IEEE International Conference on, 2005, pp. 559–568.
[20] Z. Q. Zhou, A. Sinaga, and W. Susilo, "On the fault-detection capabilities of adaptive random test case prioritization: Case studies with large test suites," in HICSS, 2012, pp. 5584–5593.
[21] D. Jeffrey, "Test case prioritization using relevant slices," in Intl. Computer Software and Applications Conf., 2006, pp. 411–418.
[22] H. Do, S. Mirarab, L. Tahvildari, and G. Rothermel, "The effects of time constraints on test case prioritization: A series of controlled experiments," IEEE Transactions on Software Engineering, vol. 36, no. 5, pp. 593–617, 2010.
[23] B. Korel, G. Koutsogiannakis, and L. H.
Tahat, "Model-based test prioritization heuristic methods and their evaluation," in Proceedings of the 3rd International Workshop on Advances in Model-Based Testing, ser. A-MOST '07. New York, NY, USA: ACM, 2007, pp. 34–43. [Online]. Available: http://doi.acm.org/10.1145/1291535.1291539
[24] E. G. Cartaxo, W. L. Andrade, F. G. O. Neto, and P. D. L. Machado, "LTS-BT: a tool to generate and select functional test cases for embedded systems," in Proc. of the 2008 ACM Symposium on Applied Computing, vol. 2. ACM, 2008, pp. 1540–1544.
[25] C. F. J. Wu and M. S. Hamada, Experiments: Planning, Analysis, and Optimization, 2nd ed. John Wiley and Sons, 2009.
[26] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers. John Wiley and Sons, 2003.
[27] R. K. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1991.
[28] E. Lehmann, Nonparametrics, ser. Holden-Day Series in Probability and Statistics, H. D'Abrera, Ed. San Francisco: Holden-Day, 1975.
[29] J. F. S. Ouriques, "Análise comparativa entre técnicas de priorização geral de casos de teste no contexto do teste baseado em especificação," Master's thesis, UFCG, January 2012.

The Impact of Scrum on Customer Satisfaction: An Empirical Study

Bruno Cartaxo1, Allan Araújo1,2, Antonio Sá Barreto1 and Sérgio Soares1
Informatics Center - CIn / Federal University of Pernambuco - UFPE1
Recife Center for Advanced Studies and Systems - C.E.S.A.R2
Recife, Pernambuco - Brazil
Email: {arsa,bfsc,acsbn,scbs}@cin.ufpe.br

Abstract—In the beginning of the last decade, agile methodologies emerged as a response to software development processes based on rigid approaches. In fact, the flexible characteristics of agile methods are expected to suit the less-defined and uncertain nature of software development. However, many studies in this area lack the empirical evaluation needed to provide more confident evidence about the contexts in which their claims hold. This paper reports an empirical study performed to analyze the impact of Scrum adoption on customer satisfaction as an external success perspective for software development projects in a software-intensive organization. The study uses data from real-life projects executed in a major software-intensive organization located in a nationwide software ecosystem. The empirical method applied was a cross-sectional survey using a sample of 19 real-life software development projects involving 156 developers. The survey aimed to determine whether there is any impact on customer satisfaction caused by the adoption of Scrum. Considering that sample, our results indicate that it was not possible to establish any evidence that using Scrum helps to achieve customer satisfaction and, consequently, to increase success rates in software projects, contrary to general claims made by Scrum advocates.

I. INTRODUCTION

Since the term software engineering emerged in 1968 [1], it has motivated a tremendous amount of discussion, work, and research on processes, methods, techniques, and tools for supporting high-quality software development on a wide, industrial scale. Initially, industrial work — based on manufacturing — introduced several contributions to the software engineering body of knowledge. Many software processes have been supported by industrial work concepts such as functional decomposition and localized labor [2].
During the last decades, techniques and tools has been created as an analogy to the production lines. The first generation of software processes family was based on the waterfall life cycle assuming that the software development life cycle was a linear and sequential similar to a production line [3]. Then, in the early 90’s, other initiatives were responsible for creating iterative and incremental processes such as the Unified Process [4]. Despite these efforts and investments, software projects success rate has presented a dramatic situation in which less than 40% of projects achieve success (Figure 1). Obviously, these numbers may not be compared to other profitable industries [5]. Fig. 1. 2011 Chaos Report - Extracted from [5] Some specialists argue that software development is different from the traditional industrial work in respect to its nature. Software engineering may be described as knowledge work which is “focused on information and collaboration rather than manufacturing placing value on the ownership of knowledge and the ability to use that knowledge to create or improve goods and services” [2]. There are several differences between these two kinds of work. While the work is visible and stable in industrial work; it is invisible and changing in knowledge work. Considering that knowledge work (including software development) is more uncertain and less defined than the industrial work that is based on predictability, the application of industrial work techniques on knowledge work may lead to projects with increased failure rates. Since 2001, agile methods have emerged as a response for overcoming the difficulties related to the software development. Some preliminary results shown that agile methodologies may increase success rates as shown in Figure 2 [5]: Although some results may indicate that agile methodologies help to achieve success in software development, many of these researches fail to present evidence through empirical evaluation. Only through these evaluation it is possible to establish whether and in which context the proposed method or technique is efficient, effective, and can be applied [6] [7] [8]. In particular, for the agile context, a minor part of studies contains an empirical evaluation as shown in Figure 3 [9]. results obtained from this research. Fig. 2. Waterfall vs. Agile - Extracted from [5] Hence, this paper is organized as following. Sections 2 and 3 present the definition for this study, including the conceptual model and the research method used for the survey, respectively. Section 4 is aimed to find out the results obtained from the survey execution. Limitations of this study as well as possible future studies are discussed at Section 5. Section 6 introduces some related studies and, finally, Section 7 presents the conclusion. Additionally, we present the applied questionnaire as well as the used likert anchoring scheme at the appendix. II. C ONCEPTUAL M ODEL OF C USTOMER S ATISFACTION The research model presented by this study verifies the impact of an independent variable (software development approach) on the project’s success indexes considering the customer point of view. This independent variable may be assigned with two different values: Scrum and not Scrum (traditional approaches for software project management). Fig. 3. Agile empirical evaluation rate - Extracted from [9] Thus, the scope for this work was defined intending to provide a comparison between agile methods and traditional software development approaches. 
First, it necessary to point out that there are several agile methodologies such as Scrum, Extreme Programming (XP), Feature-Driven Development, Dynamic Systems Development Method (DSDM), Lean Software Development that are intended to support knowledge work (less defined and more uncertain) [2]. In parallel, it also exist many traditional approaches that are intended to support industrial work (more defined and less uncertain). These methods and processes are usually based on the remarkable frameworks such as PMBoK (Project Management Body of Knowledge) [10] and Unified Process [4]. These methods may include several perspectives such software engineering, project management, design and so on. For an objective analysis, it was chosen the project management perspective. On one hand, for agile methods, it was selected Scrum (project management based); on the other hand, it was chosen any traditional approach that include a perspective for the project management. In this context, a survey was executed in the C.E.S.A.R (Recife Center for Advanced Studies and Systems) using a random sample containing 19 different projects adopting Scrum or any other traditional approach for managing the initiative involving 156 developers. The main expected contributions by this study are listed below: • • Increase the body of knowledge about Scrum and agile methods using a systematic approach through evidences within an industrial environment. In particular it is intended to reduce the lack of empirical evaluation in software development discussions. Help the organization to understand how to increase internal success rates by analyzing and discussing the In particular, it is necessary to recognize that customers probably have different definitions for “success” within a software project. In order to establish an external perspective, the model assumes seven critical factors for customer satisfaction (dependent variables), and consequently, for project success: time, goals, quality, communication and transparency, agility, innovation and benchmark. The next subsections provide more details for each one. A. Time In general, “time to market” is a critical variable within a software project. Thus, we define a project as successful if agreed and negotiated deadlines are met. Since Scrum is based on small iterations, it is expected anticipated delivery of valuable software [11] and also short time-to-market. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates regarding to the time constraints by meeting the agreed and negotiated deadlines. Hypotheses 1: Scrum-based projects provide increased customer satisfaction from the time perspective. B. Goals Software projects are launched for strategic purposes, such as costs reduction, legal compliance, market-share increase, etc. Thus, we define a project as successful if the goals that motivated the endeavor are met. Since Scrum considers a deeper and frequent stakeholder participation and collaboration, it is expected a continuous goals adjustment [11]. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding to the defined goals within a project. Hypotheses 2: Scrum-based projects provide increased customer satisfaction from the goals perspective. C. Quality By definition, “quality is the degree to which a set of inherent characteristics fulfill requirements” [10]. 
Product and process quality depend on the software project criticality demanded by the customers. Thus, we define a project as a successful if the required quality standards for that specific situation are met. So, regular inspections (one of the Scrum pillars) are one of most effective quality tools within a software development project [2]. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding to the defined quality standards within a project. Hypotheses 3: Scrum-based projects provide increased customer satisfaction from the quality perspective. D. Communication and Transparency Software projects are expected to create intangible products under a dynamic and uncertain environment. Therefore, frequent and continuous communication is required in order to provide confidence to the stakeholders regarding to the work progress. One of the Scrum pillars is transparency [11]. Thus, we define a project as successful if the customers feel themselves confident as a result of the communication and transparency. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer needs regarding to the expected level of communication and transparency within a project. Hypotheses 4: Scrum-based projects provide increased customer satisfaction from the communication and transparency perspective. G. Benchmark Usually, software projects are launched as a procurement initiative in which an organization (buyer) hires a development organization (seller) to create a product or service that may be developed by several companies. It is natural that seller organizations do comparison between their suppliers. In this sense, we consider ”benchmark” as a comparison between organizations that develop software. Thus, we define a project as successful if customers may recommend a development organization when comparing its project results to other organizations project results. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by comparing a project executed by a specific organization with other ones. Hypotheses 7: Scrumbased projects provide increased customer satisfaction from the benchmark perspective. III. In order to define a methodology to guide this study, we have chosen an approach based on surveys; and selected five of six recommended steps by Kitchenham [12], as below: • Setting the objectives: This study investigates the relationship between the Scrum adoption (as a software development approach) and the customer satisfaction; • Survey design: Cross-sectional, since the survey instrument was applied only once at a fixed point in time. It is not intended to promote a forward-looking to provide information about changes in the specific population through time; • Developing the survey instrument: It was based on a questionnaire designed to identify the customer satisfaction within a particular project which determines its success degree from the external point of view; • Obtaining valid data: The questionnaire was sent through e-mail for each customer business representatives (e.g. sponsor, product or project managers); • Analyzing the data: Finally, the data analysis was executed using techniques from descriptive and inferential statistics. E. Agility Some projects occurring in a fast-moving or timeconstrained environments, call for an agile approach [2]. 
The main characteristics of an agile software project are the “early and continuous delivery of valuable software” and “ability to provide fast response to changes”. Thus, we define a project as successful if the agility expected by the customers is met. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the agility demanded by the customer. Hypotheses 5: Scrum-based projects provide increased customer satisfaction from the agility perspective. F. Innovation Software projects are expected to deliver new softwarebased products and services for users/customers existing and emerging needs. Therefore, the innovation comes through new ways of work, study, entertainment, healthcare, etc. supported by software. Since Scrum also supports the principle of “early and continuous delivery of valuable software” it is expected that Scrum software development might help to create innovative products and services for the customer business. Thus, we define a project as successful if the innovation expected by the customer is met. Hence, we argue that a software project in which Scrum is adopted is able to provide higher customer satisfaction rates by addressing the customer expectation through innovative products and services generated by the project. Hypotheses 6: Scrum-based projects provide increased customer satisfaction from the innovation perspective. R ESEARCH M ETHOD The following subsections present discussions related to the population, sample, variables, data collection procedure, and data analysis techniques used for this study. A. Population The population for this study is targeted on software intensive organizations, including companies of different sizes, developing several software-based solutions for a wide variety of markets. B. Sample It was selected a random sample of projects executed by C.E.S.A.R - Recife Center for Advanced Studies and Systems1 which belongs to the target population. C.E.S.A.R is an innovation institute which has more than 500 employees working 1 http://www.cesar.org.br/site/ TABLE I. S TUDY HYPOTESES Null Hypotheses (NH) Alternative Hypotheses (AH) (NH1) Ts = Tns (AH1) Ts ̸= Tns (NH2) Gs = Gns (AH2) Gs ̸= Gns (NH3) Qs = Qns (AH3) Qs ̸= Qns (NH4) CTs = CTns (AH4) CTs ̸= CTns (NH5) As = Ans (AH5) As ̸= Ans (NH6) Is = Ins (AH6) Is ̸= Ins (NH7) Bs = Bns (AH7) Bs ̸= Bns the customer satisfaction the Likert scale was used assuming values from 1 (poor) to 5 (excellent) values. • Fig. 4. Contextual variables on projects from different business domains (e.g. finance, thirdsector, manufacture, service, energy, government, telecommunication, etc.), creating solutions for several platforms (mobile, embedded, web, etc.). The number of projects may vary from 70 to 100 in a year. Initially, the sample contained 27 projects, but it was reduced to 19 projects because incomplete questionnaires responses were eliminated from the sample. Even though, it represents an effective response rate of 70.3%, which is above the minimum norm of 40% suggested by [13] for academic studies. Furthermore, it was collected additional information related to each project, including project type, team size as below (Figure 4): • Project type: 5 private and 14 public/brazilian tax incentives law. • Team size: From 4 to 21. • Project nature: Consulting: 4; Information Systems: 3; Telecommunications: 4; Maintenance: 1; Research & Development (R&D): 6; Embedded Systems: 4. Notice that one project may have different natures. 
Due to this reason, the number may be slightly different from the sample size. Contextual Variables: Project type, team size, and project nature were identified as variables that may potentially influence the results. Project type and nature categorization was previously defined. The team size was the number of people involved during the development, including engineers, designers and manager. D. Data Collection Procedure First, the questionnaires were sent to customer business representatives through e-mail in a Microsoft Excel spreadsheet format. Each document contained the project categorization regarding to the contextual variables (project type, nature, and team size) and to the independent variable (Scrum/NonScrum). Thus, the customer business representatives were responsible for answering the questionnaire and then sending it back to the C.E.S.A.R project management office (PMO). E. Data Analysis Techniques The data analysis considered two different techniques. First, it was executed an exploratory data analysis (descriptive statistics) using tools such as barplots and boxplots in order to identify the preliminary insights about the data characteristics regarding to measures such as mean, position and variation. Then, hypotheses tests (inferential statistics) were conducted to provide more robust information for the data analysis process as shown in Table I. After the exploratory data analysis, it was not found apparent relevant difference in the obtained results. Thus, the alternative hypotheses were modified to verify the inequality, instead of the superiority. IV. C. Variables This study contains several variables as following: • Independent Variable: The software process is the independent variable and may assume two different values: Scrum (agile method) and Non-Scrum (any traditional approach). • Dependent Variables: The success of a software project is the result of customer satisfaction from an external point of view considering several aspects: time, goals, quality, communication and transparency, agility, innovation and benchmark. In order to measure R ESULTS A. Descriptive Statistics - Exploratory Data Analysis Initially, the final sample - the one containing 19 projects - was divided into two groups (Scrum and Non-Scrum). Then, some exploratory data analysis techniques (descriptive statistics) were applied in order to find out central tendency, position and dispersion related to the data set. On one hand, barplots (Figure 5) helped to identify the means (central tendency) for each variable representing different aspects of customer satisfaction. On the other hand, boxplots (Figure 6) helped to reveal the data dispersion and position [14]. According to the barplots in Figure 5, we can notice that the projects using Scrum presented better results considering the was a lot of data dispersion from grade one to five; and three was the mode. Fig. 5. Dependent variables means • Communication and Transparency (CT): For the Scrum group, there was a variation (data dispersion) from grade two to five without a predominance of any value. For the Non-Scrum group, the grades were more concentrated from grade four to five and the mode was five. • Agility (A): Both boxplots (Scrum and Non-Scrum groups) for the agility variable were extremely similar presenting a variation from grade three to five and the mode was the grade four. • Innovation (I): For the Scrum group the variation was from grade four to five with an outlier (the grade three). 
For the Non-Scrum group the grades presented a dispersion from grade two to five. • Benchmark (B): For both groups, the variation was the same: from grade three to five without any additional information. Finally, it is not possible to determine a relevant difference between the results from the groups considering the seven dependent variables as aspects of customer satisfaction. Therefore, there is no evidence about an advantage for the projects in which Scrum was applied. B. Inferential Statistics - Hypotheses Tests Fig. 6. Dependent variables boxplots following aspects: time, communication and transparency and agility. The projects that did not use Scrum presented better results for quality, goals, innovation and benchmark aspects. Despite these results, it not possible to assume that any group (Scrum and Non-Scrum) has an absolute advantage. According to the boxplot in Figure 6, it is possible to make some comments about each aspect of customer satisfaction considering the grades obtained from the sample observations: • • • Time (T): For the Scrum projects groups, the grades presented a dispersion from two to five; and the second and third quartiles are coincident, showing that many grades four were given by the customers. For the Non-Scrum projects, the grades presented a more concentrated behavior with a dispersion from three to five; and a the first and second quartiles are coincident. Goals (G): For both groups, it was possible to identify a more concentrated data dispersion: from four to five in the Scrum projects; and three to four in the NonScrum projects. Besides, there are many occurrences of grades four in both groups. In particular, for the Non-Scrum group, it may be seen an outlier (the grade five). Quality (Q): For the Scrum group, the variation (dispersion) was from three to four and the mode (most frequent value) was four with two outliers (the grades two and five). For the Non-Scrum group, there Since the exploratory data analysis (descriptive statistics) was not able to provide any conclusion within this study, it was decided to go ahead through another method. Hypotheses tests (inferential statistics) was then used intending to establish a systematic basis for a decision about the data set behavior. First, the same previous segmentation was handled separating the sample into two groups: Scrum (seven elements) and Non-Scrum (12 elements) projects. Thus, we assumed both as independent samples containing ordinal data. In this case, it is recommended using nonparametric test for ordinal variables. In particular, it was chosen the Mann-Whitney’s U test [15]. When performing nonparametric (or distribution free), there is no need to perform any kind of normality test (goodness of fit). The choice of U Mann-Whitney test did not bring harm to problem analysis, as in situations where the data are normal, the loss of efficiency compared to using the Student’s t test is only 5%; in other situations where the data distribution has a “heavier” tail than normal, the U test will be more efficient [14]. Thus, hypotheses tests were performed (using the U test) through R language2 to determine equality or inequality considering the samples means for each group (Scrum and Non-Scrum) from the perspective of each aspect (dependent variable). According to the previous hypothesis definitions, the equality was supposed to be accepted if the null hypothesis could not be rejected. Instead (in case of null hypothesis rejection) 2 http://www.r-project.org/ TABLE II. 
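Before presenting the tabulated results, the comparison just described can be reproduced with a few lines of code. The sketch below is a minimal illustration in Python with scipy; the Likert grades are made-up placeholders (only the group sizes, 7 Scrum and 12 Non-Scrum projects, mirror the study), and the two-sided Mann-Whitney U test is compared against the 0.05 threshold used in the paper.

```python
from scipy.stats import mannwhitneyu

ALPHA = 0.05  # significance threshold adopted in the paper

# Hypothetical Likert grades (1-5) for one satisfaction aspect, e.g. Time (T).
scrum     = [4, 4, 5, 3, 2, 4, 5]                 # 7 Scrum projects
non_scrum = [3, 4, 4, 5, 4, 3, 4, 5, 4, 3, 4, 5]  # 12 Non-Scrum projects

# Two-sided U test for two independent samples of ordinal data.
stat, p_value = mannwhitneyu(scrum, non_scrum, alternative="two-sided")
print(f"U = {stat}, p-value = {p_value:.4f}")
print("reject H0" if p_value < ALPHA
      else "fail to reject H0 (no evidence of a difference)")
```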
TABLE II. HYPOTHESES TEST RESULTS
Criterion | NH         | AH         | P-Value | W
T         | Ts = Tns   | Ts ≠ Tns   | 0.09736 | 60
G         | Gs = Gns   | Gs ≠ Gns   | 0.1137  | 26
Q         | Qs = Qns   | Qs ≠ Qns   | 0.7911  | 39
CT        | CTs = CTns | CTs ≠ CTns | 0.4849  | 49.5
A         | As = Ans   | As ≠ Ans   | 0.7126  | 46
I         | Is = Ins   | Is ≠ Ins   | 0.4681  | 34
B         | Bs = Bns   | Bs ≠ Bns   | 0.8216  | 39.5

In case of null hypothesis rejection, we were supposed to recognize a difference between the sample means of the two groups and assume the alternative hypothesis. The results obtained are presented in Table II.

The reference parameter used to decide on the acceptance or rejection of each null hypothesis was the p-value, i.e. the test significance level. The p-value obtained in each test was compared to Fisher's scale [14], which states that any p-value below 0.05 should cause the rejection of the null hypothesis. The obtained p-values were all clearly above 0.05; in this case, no null hypothesis should be rejected. Therefore, there is no evidence that the Scrum group results were higher than the Non-Scrum group results. Thus, it is not possible to infer that the adoption of Scrum increases customer satisfaction (and, consequently, project success) within the scope of this research work.

V. LIMITATIONS AND FUTURE WORK

In this research work, internal validity was reduced at the expense of external validity. On one side, the data was collected from real-life industrial software development projects, which helped to increase the study's external validity. On the other hand, there are several contextual variables (e.g. organizational culture and environmental factors) that were not controlled and may influence the results, harming the internal validity. For studies with reduced internal validity, it is not possible to determine causality or to generalize to other contexts. Furthermore, it is important to point out that no test was executed to evaluate the questionnaire's psychometric properties, which may jeopardize the construct validity of this research. In spite of these limitations, the study is expected to contribute to the agile methodologies body of knowledge and to the Scrum discussion in particular, since it is supported by real-life experience and an empirical evaluation. Thus, some refinements are listed as possible future work below:
• Increase the sample size to obtain more robust results. The larger the sample, the stronger the inferences about the behavior of the data population.
• Investigate perspectives other than customer satisfaction, such as team satisfaction, related to the definition of success within a software project.
• Execute different empirical evaluation techniques. An experiment would be conducted in order to determine causality relationships. Additionally, a case study would be included, intended to uncover behaviors and phenomena that may lead to increased customer satisfaction.

VI. RELATED WORK

França et al. [16] conducted a survey aimed at investigating the relationship between the usage of agile practices and the success of projects using Scrum. The context of that research was similar to the one considered for this work: software development companies located in the Porto Digital initiative, Recife, Pernambuco, Brazil. Among the 25 attributes of agile methodologies, only 8 (32%) correlated with the success of the projects. Thus, as in our study, the agile practices do not seem to show evidence of being decisive for project success.
Otherwise, a longitudinal case study conducted for 2 (two) years by Mann [17] obtained quantitative indications that Scrum adoption may lead to increased customer satisfaction and overtime reduction. Begel [18] also presented an industrial survey with Microsoft employees about the use of agile practices. In this context, it was reported improved communication, quick releases, and flexibility/rapid response to changes as the main benefits. On the other side, it was also reported disadvantages including excessive number of meetings, difficulty to scale-up for large projects; and buy-in decisions management. VII. C ONCLUSION This paper has described an empirical evaluation designed to provide insights for the question: “What is the impact of Scrum on the Customer Satisfaction”? In general, people who are enthusiastic of agile methods (including Scrum) argue that these approaches are more suitable for software development that is uncertain and requires flexibility to accommodate changes. In this context, we aimed to investigate the relationship between the adoption of agile methodologies and increased success rates in software development projects. In order to provide an accurate comparison, we defined the scope as considering an external perspective for success based on the customer satisfaction according to several aspects including time, goals, quality, communication and transparency, agility, innovation and benchmark (dependent variables). Thus, other perspectives and aspects were considered out of scope for this research. Additionally, for a proper comparison the study focused on the project management property for software development approaches. We chose a cross-sectional survey using a real-life project sample as our empirical evaluation method. The sample was separated into two groups, Scrum and Non-Scrum (independent variable). This segmentation was intended to allow a comparison between projects using Scrum and those using other traditional approach for managing software development approaches. In particular, the comparison was performed for each dependent variable, intending to promote a detailed analysis instead of an overall comparison. The preliminary results from the exploratory analysis showed no differences regarding to the data behavior fo both groups (Scrum and Non-Scrum), considering several properties such as central tendency, position, and dispersion. Then, quantitative analysis using a Mann-Whitney hypothesis test (U test) also showed no relevant difference between both groups results. Therefore, it was not possible to establish any superiority associated with the use of Scrum in software development projects. met are less priority. 3. Fair: Some important goals were not met according to the customer expectations. 2. Unsatisfactory: Not meeting several important goals. 1. Poor: The executing organization staff showed lack of ability to identify customer needs and care of the goals was very unsatisfactory. We recognize some limitations for this study. First, the internal validity might be threatened since we did not control any contextual variable. Then, the construct validity might be harmed because we were not able to verify: the psychometric properties related to the questionnaire and also the standard application of Scrum practices and guidelines. In spite of these limitations, we expect this research can help industry and academia in developing the software development body of knowledge by combining scientific rigor with industry experience. 
We also expect to contribute to the organization (C.E.S.A.R) to understand how to increase success rates internally. C. Quality: What is the perception about the quality of the project and its products and services? In the future, we intend to execute another survey with an increased size sample (containing projects from several organizations) considering the contextual variables as criteria for data categorization. By promoting these refinements, we aimed to figure out patterns of data behavior for specific groups. In addition, other empirical evaluation techniques (experiments, case study) might be applied in order to overcome the limitations mentioned previously. A PPENDIX - Q UESTIONNAIRE This appendix describes the questionnaire used as instrument to measure each specific aspect of customer satisfaction, as well as the likert scale anchoring approach as recommend by Uebersax [19]. It is aimed to provide a common understanding about the concept model and their qualitative values. A. Time: What is the customer feeling regarding to the project deadlines? 5. Excellent: All deadlines defined or negotiated with customer have been achieved. Deadlines adjusted due to external dependencies to the customer must be considered here. 4. Good: All deadlines, including the ones were negotiated due to internal technical problem within executing organization were met. In this classification each deadline may not have been rescheduled more than once. 3. Fair: Existence of negotiated deadlines more than once, due to problems with the executing organization, but were met. 2. Unsatisfactory: Existence of some deadlines that were not met and the deliveries occurred late. 1. Poor: Much of time constraints were not met, or there is delay (s) that seriously impacted the customer. 5. Excellent: It was found no defect or only a few minor ones. 4. Good: Some low severity defects were found and they were resolved in a satisfactory manner and within the agreed time. 3. Fair: Few moderate severity defects were detected and they have been resolved in a satisfactory manner and within the agreed time. 2. Unsatisfactory: Various defects of low severity were identified. Or defects in general were not resolved within the time agreed with the client. 1. Poor: Critical severity defects were identified at the stage of acceptance tests. D. Communication and Transparency: Does the customer feel comfortable due to the information provided on the progress of the project? 5. Excellent: Very effective communication between the executing organization and the customer is performed proactively, without the client request, providing the proper level of information. 4. Good: Continuous transparency to the project through its execution. Communication is established when requested by the customer. 3. Fair: Existence of some problems related to one of them: information display; lack of information, form of presentation and data confusion. 2: Unsatisfactory: At various times it was not easy to see the actual project progress. Information was not available when they should be. 1. Poor: Transparency about the project progress was nonexistent throughout its execution. E. Agility: What is the customer perception about the organization agility within a specific project? B. Goals: Does the customer think the project objectives were met? 5. Excellent: Expectations exceeded and high level of professionalism. 4. Good: There was a satisfactory flexibility. 3. Fair: There was flexibility but sometimes expectations were not met. 2. 
Unsatisfactory: There was some problem that not impacted the project execution. However, there are several improvement areas. 1. Poor: Existence of major problems that introduced impacts the project execution, including unresolved and controversial issues. 5. Excellent: All agreed objectives were met. 4. Good: Nearly all agreed objective were met. Goals not F. Innovation: What is the customer perception about the team capacity of bring innovation and innovative solutions? 5. Excellent: Team with excellent ability to present innovative and efficient solutions, beyond expectations. 4. Good: Team presented satisfactory innovative solutions, meeting expectations. 3. Fair: Team presented some innovative solutions, but not all expectations were met. 2. Unsatisfactory: Team with low capacity to present innovative solutions to the tasks. Several problems were faced when trying to resolve more complex requirements. 1. Poor: Lack of ideas / innovative solutions, not meeting the expectations. G. Benchmark. What is the organization performance compared to other suppliers considering the project execution? 5. 4. 3. 2. 1. Excellent Good Fair Unsatisfactory Poor ACKNOWLEDGMENT The authors would like to thank C.E.S.A.R (Recife Center for Advanced Studies and Systems) for kindly providing reallife projects data obtained from its PMO (Project Management Office); We would also like to thank Federal University of Pernambuco (UFPE) and Informatics Center (CIn) for supporting this research work. This work was partially supported by the National Institute of Science and Technology for Software Engineering (INES), grants 5739642008-4 (CNPq) and APQ-1037-1.03/08 (FACEPE). Bruno Cartaxo and Antonio Sa Barreto is supported by FACEPE, Sérgio Soares is partially supported by CNPq grant 3050852010-7. R EFERENCES [1] [2] [3] [4] [5] [6] [7] [8] P. Naur and B. Randell, Eds., Software Engineering: Report of a conference sponsored by the NATO Science Committee, Garmisch, Germany, 7-11 Oct. 1968, Brussels, Scientific Affairs Division, NATO, 1969. M. Griffiths, PMI-ACP Exam Prep: Rapid Learning to Pass the Pmi Agile Certified Practitioner (Pmi-acp) Exam - on Your First Try!: Premier Edition. Rmc Publications Incorporated, 2012. [Online]. Available: http://books.google.com.ar/books?id=mM6rtgAACAAJ I. Sommerville, Software Engineering, 9th ed. Harlow, England: Addison-Wesley, 2010. P. Kruchten, The Rational Unified Process: An Introduction, 3rd ed. Boston: Addison-Wesley, 2003. “2001 chaos report,” Tech. Rep. B. Cartaxo, I. Costa, D. Abrantes, A. Santos, S. Soares, and V. Garcia, “Eseml: empirical software engineering modeling language,” in Proceedings of the 2012 workshop on Domain-specific modeling, ser. DSM ’12. New York, NY, USA: ACM, 2012, pp. 55–60. [Online]. Available: http://doi.acm.org/10.1145/2420918.2420933 D. I. K. Sjoberg, T. Dyba, and M. Jorgensen, “The future of empirical methods in software engineering research,” in 2007 Future of Software Engineering, ser. FOSE ’07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 358–378. [Online]. Available: http://dx.doi.org/10.1109/FOSE.2007.30 W. F. Tichy, “Should computer scientists experiment more?” Computer, vol. 31, no. 5, pp. 32–40, May 1998. [Online]. Available: http://dx.doi.org/10.1109/2.675631 [9] T. Dyba and T. Dingsoyr, “What do we know about agile software development?” Software, IEEE, vol. 26, no. 5, pp. 6–9, 2009. [10] PMI, Ed., A Guide to the Project Management Body of Knowledge (PMBOK Guide): An American National Standard ANSI/PMI 99-0012008, 4th ed. 
Newtown Square, PA: Project Management Institute, 2008. [11] K. Schwaber, Agile Project Management With Scrum. Redmond, WA, USA: Microsoft Press, 2004. [12] B. A. Kitchenham and S. L. Pfleeger, “Personal Opinion Surveys,” in Guide to Advanced Empirical Software Engineering, F. Shull, J. Singer, and D. I. K. Sjøberg, Eds. Springer, 2008, pp. 63–92+. [13] Y. Baruch, “Response Rate in Academic Studies-A Comparative Analysis,” Human Relations, vol. 52, no. 4, pp. 421–438, Apr. 1999. [Online]. Available: http://dx.doi.org/10.1177/001872679905200401 [14] W. O. Bussab and P. A. Morettin, Estatı́stica Básica, 6th ed. Saraiva, 2010. [15] S. Siegel and N. Castellan, Nonparametric statistics for the behavioral sciences, 2nd ed. McGraw–Hill, Inc., 1988. [16] A. C. C. França, F. Q. B. da Silva, and L. M. R. de Sousa Mariz, “An empirical study on the relationship between the use of agile practices and the success of scrum projects,” in Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’10. New York, NY, USA: ACM, 2010, pp. 37:1–37:4. [Online]. Available: http://doi.acm.org/10.1145/1852786.1852835 [17] C. Mann and F. Maurer, “A case study on the impact of scrum on overtime and customer satisfaction,” in Proceedings of the Agile Development Conference, ser. ADC ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 70–79. [Online]. Available: http://dx.doi.org/10.1109/ADC.2005.1 [18] A. Begel and N. Nagappan, “Usage and perceptions of agile software development in an industrial context: An exploratory study,” in Proceedings of the First International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 255–264. [Online]. Available: http://dx.doi.org/10.1109/ESEM.2007.85 [19] J. S. Uebersax, “Likert scales: Dispelling the confusion,” april 2013. [Online]. Available: http: //www.john-uebersax.com/stat/likert.htm Identifying a Subset of TMMi Practices to Establish a Streamlined Software Testing Process Kamilla Gomes Camargo∗ , Fabiano Cutigi Ferrari∗ , Sandra Camargo Pinto Ferraz Fabbri∗ ∗ Computing Department – Federal University of São Carlos – Brazil Email: {kamilla camargo, fabiano, sfabbri}@dc.ufscar.br Abstract—Context: Testing is one of the most important phases of software development. However, in industry this phase is usually compromised by the lack of planning and resources. Due to it, the adoption of a streamlined testing process can lead to the construction of software products with desirable quality levels. Objective: Presenting the results of a survey conducted to identify a set of key practices to support the definition of a generic, streamlined software testing process, based on the practices described in the TMMi (Test Maturity Model integration). Method: Based on the TMMi, we have performed a survey among software testing professionals who work in both academia and industry. Their responses were analysed quantitatively and qualitatively in order to identify priority practices to build the intended generic process. Results: The analysis enabled us to identify practices that were ranked as mandatory, those that are essential and should be implemented in all cases. This set of practices (33 in total) represents only 40% of the TMMi’s full set of practices, which sums up to 81 items related to the steps of a testing process. 
Conclusion: The results show that there is consensus on a subset of practices that can guide the definition of a lean testing process when compared to a process that includes all TMMi practices. It is expected that such a process encourages a wider adoption of testing activities in software development. I. I NTRODUCTION Since software had become widely used, it has played an important role in the people’s daily lives. Consequently, its reliability cannot be ignored [1]. In this context, quality assurance (QA) activities should monitor the whole development process, promoting the improvement of the final product quality, and hence making it more reliable. One of the main QA activities is software testing that, when well executed, may deliver a final product with a low number of defects. Despite the importance of software testing, many companies face difficulties in devising a software testing process and customising it to their reality. A major barrier is the difficulty in adapting testing maturity models for the specific environment of the organisation [2]. Many organisations realise that process improvement initiatives can solve these problems. However, in practice, defining the steps that can be taken to improve and control the testing process phases and the order they should be implemented is, in general, a difficult task [3]. Reference models, such as TMMi [4], point out what should be done for the improvement of the software testing process. However, such models do not indicate how to do it. Despite the model organisation in levels (such as in CMMI [5]), which suggests an incremental implementation from the lowest level, TMMi has a large number of practices that must be satisfied, though not all of them are feasible for all sizes of companies and teams. In addition, the establishment of a testing process relying on a reference model becomes a hard task due to the difficulty for model comprehension. Moreover, the models do not define priorities in case of lack of time and/or resources, thus hindering the whole model adoption. According to Purper [6], the team responsible to define the testing process usually outlines a mind map of the model requirements in relation to the desired testing process. During the elaboration of the real testing process, this team manually verifies whether the mandatory practices, required by the model, are addressed. In general, these models indicate some prioritisation through their levels; however, within each level, it is not clear what should be satisfied at first. For better results, the testing process should include all phases of software testing. However, the process should be as minimal as possible, according to the reality of the company and the model used for software development. This adequacy can make the testing process easier to be applied and does not require many resources or a large team. This shall help the company achieve the goal of improving the product quality. Based on this scenario, we conducted a survey in order to identify which are the practices of TMMi that should be always present in a testing process. Our goal was to characterise the context of Brazilian companies to provide them with a direction on how to define a lightweight, still complete testing process. Therefore, the survey results reflect the point of view of Brazilian testing professionals. 
Given that a generic testing process encompasses phases such as planning, test case design, execution and analysis, and monitoring [7, 8], we expected the survey could indicate which are the essential practices for each phase. The assumption was that there are basic activities related to each of these phases that should never be put aside, even though budget, time or staff are scarce. The remainder of this paper is organised as follows: Section II describes the underlying concepts of this research. Section III presents the survey planning, how the participants were invited, and the data evaluation methods. Section IV shows the survey results and the participant’s profile. Section V discusses these results for each stage of the generic testing process. Finally, Section VI presents possible threats to the validity of the survey, and Section VII presents the conclusions. II. BACKGROUND TMMi [4] is a reference model that complements CMMI [5] and was established to guide the implementation and improvement of testing processes. It is similar to CMMI in structure, because it includes maturity levels that are reached through the achievement of goals and practices. For TMMi, a process evolves from a chaotic initial state (Level 1), to a state in which the process is managed, controlled and optimised (Level 5). Each specific goal indicates a single characteristic that must be present in order to satisfy the corresponding process area. A specific goal is divided into specific practices that describe which activities are important and can be performed to achieve the goal. Generic goals are related to more than one process area and describe features which may be used to institutionalise the testing process. Figure 1 illustrates the structure of TMMi. The survey questionnaire of this study was developed based on the TMMi specific goals and practices. Each goal was represented by a question and each practice represented by a sub question. Fig. 1. TMMi structure and components [4] Höhn [9] has defined a mind map of TMMi. The map distributes process areas, specific goals and their practices throughout phases of a generic testing process. This map is called KITMap and was developed to facilitate the TMMi understanding and to share information. In the map, the root node is the name of the treated theme, i.e. the testing process. Nodes of the second level are the phases of a generic testing process. Such phases guided the grouping of the survey questions. They are: Planning, Test Case Design, Setup of Test Environment and Data, Execution and Evaluation, and Monitoring and Control. At the third level of KITMap are process areas of TMMi. Höhn [9] organised the process areas of TMMi according to their relation to each phase of the generic testing process. Figure 2 illustrates, from the left side, (i) the phase of the generic testing process (i.e. Test Case Design); (ii) two process areas that are related to that phase; (iii) the specific goal related to the first process area; and (iv) the various specific practices related to the specific goal. Note that process areas from different TMMi levels may be associated to the same phase of the generic testing process. This can be observed in Figure 2, in which one process area is from Level 2 of TMMi while the other is from Level 3. Despite this, both are associated to the same phase (Test Case Design). III. S URVEY P LANNING A. 
Survey Goals This study was performed with the aim of identifying which are the most important practices of TMMi, and hence should be prioritised during the testing process execution, according to the opinion of who works with software testing for three years or more. This study is motivated by our experience and, equally important, real life observations that some testing-related practices should never be put aside, even though time, budget or human resources are scarce. B. Survey Design The survey was developed using the Lime Survey tool [10]. Lime Survey allows one to organise survey questions in groups and visualise them in separate web pages. The questionnaire is based on TMMi Version 3.1 [4]. The questions were split into six groups. The first group aims to characterise the subject profiles; it includes questions related to the level of knowledge on software quality reference models, namely, CMMI [5], MR-MPS [11] and TMMi. The remaining groups of questions (i.e. 2 to 6) each focuses on a phase of a generic testing process, as defined by Höhn [9]. The phases are: (1) Planning (2) Test Case Design (3) Setup of Testing Environment and Data (4) Execution and Evaluation (5) Monitoring and Control. Each questionnaire page includes a single group of questions. The first page also brings some directions regarding how to fill in the forms, including a table to describe the values the subjects could use to assign each TMMi practice a level of importance. The values are described in Table I. Note that we decided not to include a neutral value for the scale of importance. This was intended to make the subject decide between practices that should be classified as priority (i.e. levels 4 or 3 of importance) or not (i.e. levels 2 or 1). All questions related to testing practices (i.e. groups 2 to 6) are required; otherwise, a subject cannot go ahead to the next group of questions. TABLE I L EVELS OF IMPORTANCE FOR SURVEYED PRACTICES . 1- Dispensable Dispensable activity; does not need to be performed. 2- Optional Activity that not necessarily needs to be performed. 3- Desirable Activity that should be implemented, though may be put aside. 4- Mandatory Essential activity that must always be performed. We highlight two key points in this survey: (1) none of the subjects were told the questionnaire was based on the TMMi structure – this intended to avoid bias introduced by knowledge on the process maturity model; and (2) the subjects should answer the questionnaire according to their personal opinion – this intended to avoid bias introduced by the company or institution context. To build the questionnaire, which is available online1 , we translated TMMi goals, practices and other items of interest to Portuguese, since there is no official translation of TMMi to languages other than English. The translation took into account technical vocabulary in the target language (i.e. Portuguese). In the questionnaire, every question within a given group (i.e. within a specific testing process phase) regards a TMMi Specific Goal. Each question includes a set of sub-questions regarding the TMMi Specific Practices (SPs). Note that in TMMi a Specific Goal is achieved when the associated SPs are performed in a testing process. Therefore, assigning a 1 http://amon.dc.ufscar.br/limesurvey/index.php?sid=47762&lang=pt-BR – accessed on 17/04/2013. 
Figure 3 illustrates a question related to the Planning phase. This question addresses the Perform a Product Risk Assessment Specific Goal, and includes three sub-questions regarding the associated SPs. As previously described, the subject should assign a level of importance ranging from 1 to 4 to each SP (see Table I), according to his or her opinion about the relevance of the SP to achieving the goal defined in the question. Note that all questions bring a side note to help the subject understand and properly answer the question. This help note can be seen at the bottom of Figure 3.

Fig. 3. Example of question, structured according to a testing process phase and TMMi Specific Goals and Practices.

Characterising the Profiles: The first group of questions aims to characterise the profile of the subjects taking into account their work environment. Figure 4 shows part of the profile form. To design the profile form, we considered that the subject's experience and the process maturity level of his or her institution or company impact the subject's knowledge of testing. Therefore, the following information is required:
• Experience with software testing (research, industry and teaching): it is well known that tacit knowledge is different from explicit knowledge. Due to this, this information aims to characterise different types of knowledge, acquired either with industrial, research or teaching experience.
• Testing process in the company: this information is required only for those who report experience in industry, in order to characterise their work environment.
• Certification in a process maturity model: this information is required for those who report that their companies have any certification in maturity models; if applicable, the subject is required to inform which maturity model (namely, MR-MPS, CMMI, TMMi or any other) and the corresponding level. This might have an impact on the subject's personal maturity regarding the model.
• Knowledge of TMMi and MR-MPS: knowledge of reference models, especially TMMi, grants the subject a higher maturity regarding testing processes.

Fig. 4. Part of the profile characterisation form (translated into English).

C. Obtained Sample

For this survey, a personal e-mail announcement was sent to Brazilian software testing professionals from both academia and industry. It was also announced in a mailing list (http://br.dir.groups.yahoo.com/group/DFTestes – accessed on 17/04/2013) that includes more than 3,000 subscribers from Brazil. Furthermore, we invited professionals who work for a pool of IT companies named PISO (Pólo Industrial de Software, http://www.piso.org.br/ – accessed on 17/04/2013) from the city of Ribeirão Preto, Brazil. The questionnaire was made available in December 2011 and remained open for a period of 45 days. In total, we registered 113 visits, from which 39 resulted in fully answered questionnaires that were considered for data analysis.
Even though the sample is not large, these 39 answers allowed us to analyse the data statistically, although with less rigour than in analyses applied to large samples. The analysis procedures are described in the following section.

D. Data Analysis Procedures

Initial Analysis: An initial data analysis revealed that the practices were mostly ranked as 3 and 4 with regard to their level of importance. This is depicted in Figure 5, which groups the answers of all subjects for all questions according to the assigned levels of importance (note that we had a total of 37 sets of answers for 81 questions; thus, the four groups shown in Figure 5 sum up to 2,997 individual answers). This initial analysis also allowed us to identify two outliers, which were removed from the dataset: the first regards a subject who assigned level 4 to all practices, while the second inverted all values in his/her answers (i.e. he/she interpreted the value 4 as the lowest level of importance and the value 1 as the highest). Therefore, the final dataset, depicted in Figure 5, comprises 37 fully filled in questionnaires.

Fig. 5. Frequency distribution of levels of importance, considering all answers of all subjects (levels 3 and 4 concentrate 1,225 and 1,309 of the 2,997 answers, while levels 1 and 2 account for 108 and 355).

In this survey, we considered the following independent variables: (i) industrial experience with testing process; (ii) knowledge of and usage experience with MR-MPS; and (iii) knowledge of TMMi. The dependent variable is the level of importance assigned to each practice. The scale used for the dependent variable characterises data with an ordinal measurement level, i.e. we were dealing with discrete values. Besides, the data distribution was non-symmetric, since the vast majority of practices were ranked as 3 and 4, as shown in Figure 5. The characteristics of the data led us to use the nonparametric Sign Test [12]. This test evaluates whether the median, for a given set of values (in our case, for each practice), is higher than a fixed value. We used the fixed value of 3.5, which would allow us to identify which practices were indeed classified as mandatory (i.e. with the maximum level of importance), since more than 50% of the subjects would have ranked those practices as mandatory. Due to the size of our sample, we adopted a p-value of 0.15 to draw conclusions on the executed tests. Even though this is not a widely adopted level of confidence, some other exploratory studies [13, 14], which dealt with similarly small samples, also adopted relaxed levels of confidence instead of the traditional p-value of 0.01 or 0.05.
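A minimal sketch of this per-practice Sign Test is shown below, using Python and scipy (an assumption on our part; the paper does not state which statistical package was used). With the threshold fixed at 3.5, an answer lies above the hypothesised median only when it equals 4, so the test reduces to a one-sided binomial test on the proportion of level-4 answers; the counts used here are the ones later reported in Table II.

```python
from scipy.stats import binomtest

# Importance levels assigned to one practice by the 37 subjects
# (counts from Table II: 14 x level 4, 19 x level 3, 2 x level 2, 2 x level 1).
answers = [4] * 14 + [3] * 19 + [2] * 2 + [1] * 2

# Sign test against the fixed median of 3.5: an answer "above" 3.5 can only be a 4,
# so we check whether significantly more than 50% of the subjects chose level 4.
above = sum(1 for a in answers if a > 3.5)
n = len(answers)  # no ties are possible with a threshold of 3.5

result = binomtest(above, n, p=0.5, alternative="greater")
print(f"{above}/{n} answers above 3.5, one-sided p-value = {result.pvalue:.3f}")
```

With only 14 of the 37 answers above the threshold, the one-sided p-value is far above the adopted cut-off of 0.15, which matches the observation below that this practice did not reach statistical significance.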
The results of this analysis did not show statistical significance for some practices, even when the majority of subjects assigned levels 3 or 4 to them. For instance, the Identify and prioritise test cases practice was ranked as mandatory by most of the subjects (19 out of 37); however, the Sign Test did not show statistical significance. Obviously, the sample size may have impacted the sensitivity of the statistical test, leading to inconclusive results even in cases where the majority of answers ranged from 3 to 4. This is the case of the Identify and prioritise test conditions practice. The answer distribution for this practice is summarised in Table II. The figures show that the number of subjects that assigned this practice level 3 of importance is higher than the number of subjects that assigned it level 4; despite this, we could not observe any statistically significant difference in favour of the former (i.e. level 3).

TABLE II. Levels of importance assigned to the Identify and prioritise test conditions practice.
Level 4: 14 answers; Level 3: 19 answers; Level 2: 2 answers; Level 1: 2 answers.

After this first analysis, we elaborated a new set of three questions to help us clarify some open issues. These new questions, which require simple "Yes" or "No" answers, aimed to resolve some dependencies observed in the results. Such dependencies were identified by Höhn [9] and indicate that the implementation of some testing-related practices requires the previous implementation of others. This new questionnaire was announced by e-mail to all subjects who answered the first one, and remained open for a period of 14 days. We had feedback from 14 subjects. The results of this new round of questions are discussed in Section V.

A new analysis, based on the frequency of answers in the first set of questions, indicated some trends the statistical tests did not allow for. It consisted of a descriptive analysis of the data, since we were unable to draw conclusions on some practices even when they were ranked as mandatory by many subjects. In short, we identified the practices that were mostly ranked as mandatory when compared to the other values of the scale (desirable, optional and dispensable – see Table I). In spite of the weak confidence such an analysis may represent, the identified subset of practices was similar to the subset obtained solely on a statistical basis. In fact, this set of practices included all practices identified through the aforementioned statistical procedures. A summary of the results is depicted in the Venn diagram of Figure 7. Details are discussed in the next section.

IV. RESULTS

The results of the survey are described in this section. Initially, Section IV-A defines some profiles, each representing a group of subjects, based on the experience reported in the profile characterisation form. Then, Section IV-B shows the results with respect to the level of importance of TMMi practices according to each profile.

A. Profile Definition

Figure 6 summarises the level of knowledge of both the subjects and their institutions according to the profile characterisation questions. A description of the charts comes in the sequence.

Fig. 6. Summary of profile characterisation.

a) Experience: this chart shows that 46% of the subjects (17 out of 37) have more than three years of experience in testing either in industry or academia; only 11% (4 out of 37) have less than one year of experience.
b) Testing Process: this chart shows that 65% of the subjects (24 out of 37) work (or have worked) in a company that has a testing process officially implemented (i.e. an explicit testing process). From the remaining subjects, 22% (8 out of 37) do not (or have not) worked in a company with an explicit testing process, while around 14% (5 out of 37) did not answer this question.
c) Certification: this chart shows that 59% of the subjects (22 out of 37) work (or have worked) in a company that has been certified with respect to a software process maturity model (e.g. CMMI, MR-MPS). The remaining subjects have never worked in a certified company (24%) or did not answer this question (16%).
d) Type of Certification: of the subjects who reported working (or having worked) in a certified company – chart (c) of Figure 6 – half of them (i.e. 11 subjects) are (or were) in a CMMI-certified company, while the remaining are (or were) in an MR-MPS-certified company.
e) TMMi: this chart reveals that only 8% of the subjects (3 out of 37) have had any practical experience with TMMi. Besides this, 59% of the subjects (22 out of 37) stated that they have only theoretical knowledge of TMMi, whereas 32% (12 out of 37) do not know this reference model.

Based on the results depicted in Figure 6, we concluded that the sample is relevant with respect to the goals established for this work. This conclusion relies on the fact that, amongst the 37 subjects who fully answered the questionnaire, (i) 89% have good to high knowledge of software testing (i.e. more than one year of experience); (ii) 65% work (or have worked) in companies that officially have a software testing process; (iii) 59% work (or have worked) in a CMMI- or MR-MPS-certified company; and (iv) 67% are knowledgeable of TMMi, at least in theory. For the CMMI-certified companies, the maturity levels vary from 2 to 5 (i.e. from Managed to Optimising). For the MR-MPS-certified companies, the maturity levels range from G to E (i.e. from Partially Managed to Partially Defined).

To analyse the results regarding the level of importance of TMMi practices according to the subjects' personal opinion, we defined three different profiles, as follows (a sketch of this grouping is given after the list):
• Profile-Specialist: composed of 12 subjects who have at least three years of experience with software testing and work (or have worked) in a company that has a formally implemented software testing process.
• Profile-MR-MPS: composed of 20 subjects who are knowledgeable of MR-MPS and use this reference model in practice.
• Profile-TMMi: composed of 25 subjects who are knowledgeable of TMMi.
The choice of an MPS.BR-related profile was motivated by the close relationship between this reference model and the context of Brazilian software companies. Furthermore, these three specific profiles were defined because we believe the associated subjects' tacit knowledge is very representative. Note that the opinion of experts in CMMI was not overlooked at all; instead, such experts' opinions are spread over the analysed profiles. Finally, we also considered the answers of all subjects, in a group named Complete Set.
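The following sketch shows one way these overlapping groups could be derived from the profile characterisation data. The column names and the three example rows are purely illustrative and do not correspond to the actual Lime Survey export.

```python
import pandas as pd

# Hypothetical subject-level data derived from the profile characterisation form.
subjects = pd.DataFrame({
    "id": [1, 2, 3],
    "years_testing": [5, 2, 4],                  # experience with software testing
    "explicit_test_process": [True, False, True],
    "knows_mr_mps_in_practice": [True, True, False],
    "knows_tmmi": [True, False, True],           # theoretical or practical knowledge
})

# The three (overlapping) profiles used in the analysis, plus the Complete Set.
profile_specialist = subjects[(subjects.years_testing >= 3) & subjects.explicit_test_process]
profile_mr_mps = subjects[subjects.knows_mr_mps_in_practice]
profile_tmmi = subjects[subjects.knows_tmmi]
complete_set = subjects

print(len(profile_specialist), len(profile_mr_mps), len(profile_tmmi), len(complete_set))
```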
B. Characterising the Importance of TMMi Practices

As previously mentioned, the results described herein are based on the three profiles (namely, Profile-Specialist, Profile-MR-MPS and Profile-TMMi) as well as on the whole survey sample. Within each profile, we identified which practices were mostly ranked as mandatory. The Venn diagram depicted in Figure 7 includes all mandatory practices, according to each profile. The practices are represented by numbers and are listed in the table shown together with the diagram. In Figure 7, the practices with grey background are also present in the set obtained solely from the statistical analysis described in Section III-D. As the reader can notice, this set of practices appears in the intersection of all profiles. Furthermore, practices with bold labels (e.g. practices 5, 7, 22, 31 etc.) are present in the set aimed at composing a lean testing process (this is discussed in detail in Section V).

Fig. 7. Venn diagram that shows the intersections of results with respect to the practices ranked as mandatory by the majority of subjects within each profile. The numbered practices are: 2 Identify product risks; 3 Analyse product risks; 4 Identify items and features to be tested; 5 Define the test approach; 7 Define exit criteria; 9 Establish a top-level work breakdown structure; 10 Define test lifecycle; 11 Determine estimates for test effort and cost; 12 Establish the test schedule; 13 Plan for test staffing; 15 Identify test project risks; 16 Establish the test plan; 17 Review test plan; 19 Obtain test plan commitments; 20 Elicit test environment needs; 21 Develop the test environment requirements; 22 Analyse the test environment requirements; 23 Identify non-functional product risks; 25 Identify non-functional features to be tested; 26 Define the non-functional test approach; 27 Define non-functional exit criteria; 28 Identify work products to be reviewed; 29 Define peer review criteria; 30 Identify and prioritise test conditions; 31 Identify and prioritise test cases; 32 Identify necessary specific test data; 33 Maintain horizontal traceability with requirements; 38 Develop and prioritise test procedures; 41 Develop test execution schedule; 42 Implement the test environment; 45 Perform test environment intake test; 46 Develop and prioritise non-functional test procedures; 49 Execute test cases; 50 Report test incidents; 51 Write test log; 52 Decide disposition of test incidents in configuration control board; 53 Perform appropriate action to close the test incident; 54 Track the status of test incidents; 55 Execute non-functional test cases; 56 Report non-functional test incidents; 57 Write test log (non-functional); 58 Conduct peer reviews; 60 Analyse peer review data; 63 Monitor test commitments; 66 Conduct test progress reviews; 67 Conduct test progress milestone reviews; 69 Monitor defects; 71 Monitor exit criteria; 72 Monitor suspension and resumption criteria; 73 Conduct product quality reviews; 74 Conduct product quality milestone reviews; 75 Analyse issues; 76 Take corrective action; 77 Manage corrective action; 79 Perform test data management; 80 Co-ordinate the availability and usage of the test environments; 81 Report and manage test environment incidents.

Next we describe the results depicted in Figure 7.
• Complete Set: taking the full sample into account, 31 practices were assigned level 4 of importance (i.e. ranked as mandatory) by most of the subjects. The majority of them are also present in the other profile-specific sets, as shown in Figure 7. The reduced set of practices to compose a lean testing process includes these 31 items, and is complemented with practices 5 and 7 (the justification is presented in Section V).
• Profile-Specialist: 49 practices were ranked as mandatory by most of the subjects within this profile. Of these, 27 practices appear in the intersection with at least one other set.
• Profile-MR-MPS: subjects of this profile ranked 33 practices as mandatory, of which 30 are in intersections with the other profiles; only 3 practices are considered mandatory exclusively by subjects of this profile.
• Profile-TMMi: for those who know TMMi, 42 practices are mandatory, of which 41 appear in the intersections with the other profiles.
V. ANALYSIS AND DISCUSSION

Before the definition of the aimed reduced set of practices, we analysed the results of the second questionnaire, which was designed to resolve some dependencies observed in the initial dataset (i.e. based on the 37 analysed answers). The dependencies were identified by Höhn [9], who pointed out some practices that must be implemented before the implementation of others. Based on the feedback of 14 subjects, all included in the initial sample, we were able to resolve the observed dependencies, which are related to the following practices: Analyse product risks, Define the test approach, and Define exit criteria.

Regarding Analyse product risks, the subjects were asked if this task should be done as part of the testing process. We got 12 positive answers, thus indicating this practice is relevant, for example, to support the prioritisation of test cases. In fact, the Analyse product risks practice was already present in the reduced set of practices identified from the first part of the survey. In spite of this, we wanted to make sure the subjects had a clear comprehension that it should be performed as part of the testing process.

The subjects were also asked whether a testing approach could be considered fully defined when the product risks were already analysed and the items and features to be tested were already defined. This question was motivated by the fact that Define the test approach (practice 5 in Figure 7) was not present in the reduced set of practices derived from the initial questionnaire. For this question, we received 10 negative answers; that is, one cannot consider the testing approach fully defined only by analysing product risks and defining items and features to be tested. Therefore, we included practice 5 in the final set, thus resolving a dependency reported by Höhn [9].

The third question of the second questionnaire addressed the Define exit criteria practice (#7 in Figure 7), since it was not identified as mandatory after the first data analysis. Subjects were asked whether it is possible to run a testing process without explicit exit criteria (i.e. information about when testing should stop). Based on 9 negative answers (i.e. around 64%), this practice was also included in the reduced set. This second analysis helped us to either clarify or resolve the aforementioned dependencies amongst TMMi practices.
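As an illustration of what resolving such dependencies means, the sketch below checks whether a candidate reduced set of practices is closed under a "requires" relation and adds any missing prerequisites. The relation shown is purely hypothetical; Höhn's actual dependency map is not reproduced here.

```python
# Hypothetical dependency map: practice -> practices it requires (illustrative only;
# the real dependencies are defined in Höhn's KITest work, not reproduced here).
requires = {
    "Establish the test plan": {"Define the test approach", "Define exit criteria"},
    "Define the test approach": {"Analyse product risks"},
    "Identify and prioritise test cases": {"Identify items and features to be tested"},
}

def close_under_dependencies(selected: set[str]) -> set[str]:
    """Return the selected practices plus every prerequisite they (transitively) require."""
    closed = set(selected)
    changed = True
    while changed:
        changed = False
        for practice in list(closed):
            missing = requires.get(practice, set()) - closed
            if missing:
                closed |= missing
                changed = True
    return closed

reduced_set = {"Establish the test plan", "Identify and prioritise test cases",
               "Identify items and features to be tested", "Analyse product risks"}
print(sorted(close_under_dependencies(reduced_set) - reduced_set))
# -> ['Define exit criteria', 'Define the test approach'], i.e. the two practices that the
#    second questionnaire ended up adding to the reduced set.
```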
In the next sections we analyse and discuss the survey results. For this, we adapted Höhn's mind map [9] (Figures 8–12) according to each phase of a generic testing process. Practices highlighted in grey are identified as mandatory and should be implemented in any testing process.

A. Planning

Planning the testing activity is definitely one of the most important process phases. It comprises the definition of how testing will be performed and what will be tested; it enables proper activity monitoring, control and measurement. The derived test plan includes details of the schedule, team, items to be tested, and the approach to be applied [15]. In TMMi, planning-related practices also comprise non-functional testing, the definition of the test environment and peer reviews. In total, 29 practices are related to planning (see Figure 8), spread over nine specific goals (labelled with SG in the figure). To achieve these goals, the organisation must fulfil all the practices shown in Figure 8. Despite this, our results show that only 8 out of these 29 practices are mandatory, according to the Complete Set subject group. According to Höhn's analysis, TMMi has internal dependencies amongst practices, some of which are related to the Planning phase. Therefore, 2 other practices are necessary to resolve such dependencies (as discussed above). Thus, the final set of 10 mandatory practices for the Planning phase is shown with a grey background in Figure 8.

Fig. 8. TMMi practices related to Planning: the specific goals and practices of PA 2.2 Test Planning and PA 2.5 Test Environment (Level 2), and of PA 3.4 Non-Functional Testing and PA 3.5 Peer Reviews (Level 3).

Amongst these practices, Identify product risks and Analyse product risks demonstrate the relevance of evaluating product risks. Their output plays a key role in the definition of the testing approach and in test case prioritisation. The product risks consist of a list of potential problems that should be considered while defining the test plan. Figure 7 shows that these two practices were mostly ranked as mandatory considering all profiles. According to the IEEE-829 Standard for Software and System Test Documentation [15], a test plan shall include: a list of what will be and will not be tested; the approach to be used; the schedule; the testing team; test classes and conditions; exit criteria; etc. In our survey, the Identify items and features to be tested, Establish the test schedule and Plan for test staffing practices were mostly ranked as mandatory. They are directly related to Establish the test plan, and address the definition of most of the items listed in the IEEE-829 Standard. This is complemented by Define exit criteria, selected after the dependency resolution. This evinces the coherence of the survey subjects' choices of mandatory practices with respect to the Planning phase. The Planning phase also includes practices that address the definition of the test environment.
In regard to this, Elicit test environment needs and Analyse the test environment requirements are ranked as mandatory and are clearly inter-related. To conclude this analysis regarding the Planning phase, note that not all TMMi specific goals are achieved with the execution of this selection of mandatory practices alone. Despite this, the selected practices are able to yield a feasible test plan and make the process clear, managed and measurable. After Planning, the next phase is related to Test Case Design. The input to this phase is the test plan, which includes some essential definitions such as the risk analysis, the items which will be tested and the adopted approach.

B. Test Case Design

Figure 9 summarises the results of our survey for this phase, based on the set of TMMi practices identified by Höhn [9]. As the reader can notice, only two practices were mostly ranked as mandatory by the Complete Set group of subjects: Identify and prioritise test cases and Identify necessary specific test data (both shown with a grey background in Figure 9).

Fig. 9. TMMi practices related to Test Case Design: the specific goals and practices of PA 2.4 Test Design and Execution (Level 2) and PA 3.4 Non-Functional Testing (Level 3).

According to the IEEE-829 Standard, the test plan encompasses some items related to test case design, such as the definition of test classes and conditions [15]. Due to this, it is likely that part of the subjects consider that the test plan itself already fulfils the needs regarding test case design, and thus that most of the practices are not really necessary. For instance, if we considered solely the Profile-MR-MPS, none of the practices within this phase would appear in the results (see Figure 7 to crosscheck this finding). On the other hand, subjects of the other profiles consider that some other practices of this phase should be explicitly performed in a testing process. For instance, subjects of the Profile-Specialist profile ranked Identify and prioritise test conditions, Identify necessary specific test data and Maintain horizontal traceability with requirements as mandatory. For the Profile-TMMi subjects, Identify and prioritise test cases and Maintain horizontal traceability with requirements should be mandatory. From these results, we can conclude that there is uncertainty about what should indeed be done during the test case design phase. Moreover, this uncertainty may also indicate that test cases are not always documented separately from the test plan; the plan itself includes the testing approach (and its underlying conditions) and the exit criteria. Thus, the two selected practices for this phase complement the needs to compose a feasible, streamlined testing process.

C. Setup of Test Environment and Data

As discussed in Section V-A, in the Planning phase the test environment requirements are identified and described. The Setup of Test Environment and Data phase addresses the prioritisation and implementation of such requirements. Figure 10 shows the TMMi specific goals and practices for this phase.
Fig. 10. TMMi practices related to Setup of Test Environment and Data: the specific goals and practices of PA 2.4 Test Design and Execution and PA 2.5 Test Environment (Level 2) and PA 3.4 Non-Functional Testing (Level 3).

According to TMMi, Develop and prioritise test procedures consists in determining the order in which test cases will be executed. Such order is defined in accordance with the product risks. The classification of this practice as mandatory is aligned with the practices selected for the Planning phase, some of which are related to risk analysis. Another practice ranked as mandatory is Develop test execution schedule, which is directly related to the prioritisation of test case execution. The other two practices (i.e. Implement the test environment and Perform test environment intake test) address the implementation of the environment and ensuring it is operational, respectively. The conclusion regarding this phase is that these four practices are sufficient to create an adequate environment to run the tests.

D. Execution and Evaluation

The next phase of a generic testing process consists of test case execution and evaluation. At this point, the team runs the tests and, eventually, creates the defect reports. The evaluation aims to assure that the test goals were achieved and to inform the results to the stakeholders [8]. For this phase, Höhn [9] identified 13 TMMi practices, which are related to test execution goals, management of incidents, non-functional test execution and peer reviews. This can be seen in Figure 11. As the reader can notice, only four practices were not ranked as mandatory. This makes evident the relevance of this phase, since it encompasses the activities related to test execution and management of incidents.
Fig. 11. TMMi practices related to Execution and Evaluation.

The results summarised in Figure 11 include practices that regard the execution of non-functional tests. However, in the Planning and Test Case Design phases, the selected practices do not address the definition of such type of tests. Although this sounds incoherent, it may indicate that, from the planning and design viewpoints, there is not a clear separation between functional and non-functional testing. The separation is a characteristic of the TMMi structure, but for the testing community these two types of testing are performed in conjunction, since the associated practices as described in TMMi are very similar in both cases.

E. Monitoring and Control

The execution of the four phases of a generic testing process yields a substantial amount of information. Such information needs to be organised and consolidated to enable rapid status checking and, if necessary, corrective actions. This is addressed during the Monitoring and Control phase [7]. Figure 12 depicts the TMMi practices with respect to this phase. Again, the practices ranked as mandatory by most of the subjects are highlighted in grey. Note that there is consensus amongst all profile groups (i.e. Profile-Specialist, Profile-MR-MPS, Profile-TMMi and the Complete Set) about what is mandatory regarding Monitoring and Control. This can be crosschecked in Figure 7.

Fig. 12. TMMi practices related to Monitoring and Control.

Performing the Conduct test progress reviews and Conduct product quality reviews practices means keeping track of the testing process status and of the product quality, respectively. Monitor defects addresses gathering metrics that concern incidents (also referred to as issues), while Analyse issues, Take corrective action and Manage corrective action are clearly inter-related practices.
The two other practices considered mandatory within this phase are Co-ordinate the availability and usage of the test environments and Report and manage test environment incidents. Both are important, since either unavailability of or incidents in the test environment may compromise the activity as a whole. As a final note with respect to the survey results, we emphasise that the subjects were not provided with any information about dependencies amongst TMMi practices. Besides this, we were aware that the inclusion of practices not mostly ranked as mandatory might have created new broken dependencies. Despite this, the analysis of the final set of mandatory practices shows that all dependencies are resolved.

VI. VALIDITY THREATS

This section describes some issues that may threaten the validity of our results. Despite them, the study limitations did not prevent the achievement of significant results with respect to software testing process definition, based on the opinion of software testing professionals. A first limitation concerns the questionnaire design. The questions were based on the TMMi structure, and so were the help notes provided together with the questions. Even though the intent of the help notes was to facilitate the subjects' understanding of the questions, they might not have been enough to allow for correct comprehension. Although the TMMi structure is very detailed, aiming to facilitate its implementation, this structure can become confusing for readers, who may not comprehend the difference between some activities. For instance, in this survey it was clear that the practices related to functional and non-functional testing were not understood as distinct activities, since they were ranked as mandatory only in the Execution and Evaluation phase. Another threat regards the scale of values used in the first questionnaire. The answer scale was composed of four values. This represented a limitation for the statistical analysis, since the responses were mostly concentrated in values 3 and 4. If a wider scale had been used, e.g. from 1 to 10, this could have yielded a better distribution of answers, thus enabling us to apply a more adequate interpretation model. The sample size was also a limitation of the study. In practice, although the sample includes only software testing professionals, its size is small in the face of the real population. Perhaps the way the call for participation was announced and the time it was available limited the sample.

VII. CONCLUSIONS AND FUTURE WORK

This paper described a survey that was conducted in two stages and investigated whether there is a subset of TMMi practices that can be considered essential for a generic testing process. The survey was applied amongst professionals who work with software testing. The analysis led us to conclude that, from the set of 81 TMMi practices distributed by Höhn [9] across the phases of a generic testing process, 33 are considered essential for maintaining consistency when such a process is defined. This represents a reduction of around 60% in the number of TMMi practices. Note that the other TMMi practices are not disposable; however, when the goal is to implement a streamlined process, or even when the company does not have the necessary know-how to implement its own testing process, it can use this reduced set of practices to do so. Thus, the results reported in this paper represent a simplified way to create or improve testing processes, which is based on a recognised reference model.
The practices highlighted in Figures 8–12 can also indicate the priority of implementation for a company that is using TMMi as a reference for its testing process. The model itself does not indicate what can be implemented first, nor the possible dependencies amongst the process areas. Nonetheless, the results of this study point out a set of activities that can be implemented as a priority. At a later stage, the company may decide to continue to deploy the remaining practices required by the model in order to obtain the TMMi certification. TMMi is fine-grained in terms of practices and their distribution across the specific goals and process areas. Even though this may ease the implementation of practices, it makes the model complex and difficult to understand. Once a company is willing to build a testing process based on a reference model, this process must be in accordance with its reality. Not all TMMi practices are feasible for all sizes of companies and teams. Thus, it is important to be aware of a basic set of practices that, if not performed, may compromise the quality of the process, and hence the quality of the product under test. In this context, we hope the results of this work can support small and medium companies that wish to implement a new testing process, or even improve their current processes.

ACKNOWLEDGEMENTS

We thank CAPES and CNPq for their financial support.

REFERENCES
[1] P. Cao, Z. Dong, and K. Liu, "An optimal release policy for software testing process," in 29th Chinese Control Conference, 2010, pp. 6037–6042.
[2] A. Rodrigues, P. R. Pinheiro, and A. Albuquerque, "The definition of a testing process to small-sized companies: The Brazilian scenario," in QUATIC'10. IEEE, 2010, pp. 298–303.
[3] J. Andersin, "TPI – a model for test process improvement," University of Helsinki, Helsinki, Finland, Seminar, 2004.
[4] TMMi Foundation, "Test Maturity Model integration (TMMi), Version 3.1," pp. 1–181, 2010.
[5] SEI, "Capability Maturity Model Integration Version 1.2 (CMMI-SE/SW, V1.2 – Continuous Representation)," Carnegie Mellon University, Tech. Report CMU/SEI-2006-TR-001, 2006.
[6] C. B. Purper, "Transcribing process model standards into meta-processes," in EWSPT'00. London, UK: Springer-Verlag, 2000, pp. 55–68.
[7] A. N. Crespo, M. Jino, M. Argollo, P. M. S. Bueno, and C. P. Barros, "Generic process model for software testing," Online, 2010, http://www.softwarepublico.gov.br/5cqualibr/xowiki/Teste-item13 – accessed on 16/04/2013 (in Portuguese).
[8] A. M. J. Hass, "Testing processes," in ICSTW'08. IEEE, 2008, pp. 321–327.
[9] E. N. Höhn, "KITest: A framework of knowledge and improvement of testing process," Ph.D. dissertation, University of São Paulo, São Carlos, SP, Brazil, Jun. 2011 (in Portuguese).
[10] "LimeSurvey," http://www.limesurvey.org/, Apr. 2011. [Online]. Available: http://www.limesurvey.org/
[11] Softex, Improvement of Brazilian Software Process – General Guide (in Portuguese), Online, Softex – Association for Promoting Excellence in Brazilian Software, 2011.
[12] E. Whitley and J. Ball, "Statistics review 6: Nonparametric methods," Critical Care, vol. 6, no. 6, p. 509, Sep. 2002.
[13] J. Miller, "Statistical significance testing – a panacea for software technology experiments?" Journal of Systems and Software, vol. 73, no. 2, pp. 183–192, 2004.
[14] V. Basili and R. W. Reiter, "A controlled experiment quantitatively comparing software development approaches," IEEE Trans. Soft. Engineering, vol. SE-7, no. 3, pp. 299–320, 1981.
[15] "IEEE standard for software and system test documentation," IEEE Std 829-2008, pp. 1–118, 2008.

On the Relationship between Features Granularity and Non-conformities in Software Product Lines: An Exploratory Study

Iuri Santos Souza1,2, Rosemeire Fiaccone1, Raphael Pereira de Oliveira1,2, Eduardo Santana de Almeida1,2,3
1 Federal University of Bahia (UFBA), Salvador, BA, Brazil
2 Reuse in Software Engineering (RiSE), Recife, PE, Brazil
3 Fraunhofer Project Center (FPC) for Software and Systems Engineering, Brazil
Email: {iurisin,esa,raphaeloliveira}@dcc.ufba.br, r [email protected]

Abstract—Within Software Product Lines (SPL), features are well understood and facilitate the communication among SPL developers and domain experts. However, the feature specification task is usually based on natural language, which can present lack of clarity, non-conformities and defects. In order to understand feature non-conformity in SPL, this paper presents an empirical study to investigate the possible correlation between feature granularity and feature non-conformity, based on an SPL industrial project in the medical domain. The investigation aims at exploring the feature non-conformities and their likely root causes using results from a previous study, which captured and classified 137 feature non-conformities identified in 92 features. The findings indicated that there is a significant association between the variables feature interaction and feature granularity. In predictive models built to estimate feature non-conformities based on feature granularity and feature interaction values, the variable feature interaction presented a positive influence on feature non-conformity, whereas the variable feature granularity presented a negative influence on feature non-conformity.

Keywords—Software Product Lines; Feature Non-Conformity; Features Granularity; Exploratory Study

I. INTRODUCTION

In Software Product Lines (SPL), mass customization is a crucial aspect, and different products can be tailored to cover distinct customers by selecting a particular set or subset of features [1]. A feature can be defined as "a prominent user-visible aspect, quality, or characteristic of a software system or systems" [2]. The feature concept has been successfully applied in product portfolio, domain analysis, and product derivation in the context of product lines [3]. Feature granularity can be defined as the level or degree of extension in the source code necessary to implement a given feature [4]. Kästner et al. classified the granularity of features into coarse granularity, which represents code extensions such as the addition of new classes or methods to the source code or the addition of source code at explicit extension points; and fine granularity, which represents code extensions such as the addition of new statements into existing methods, expressions or even method signatures [4]. They also discussed the effects of feature granularity in different approaches to SPL development and identified challenges to handle feature granularity, mainly when creating an SPL by decomposing a legacy application. During the SPL Scoping phase, features are specified to define the capabilities of a product line [5]. The feature specification document is composed of feature information, such as feature type (mandatory or optional), feature granularity (coarse or fine-grained), feature priority, binding time, parent feature, required feature, excluded feature, and so on [2].
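To make the coarse versus fine granularity distinction above concrete, the sketch below shows, in Python, two hypothetical ways of extending an existing class to implement a feature. The example and class names are invented for illustration only and do not come from the studied project (whose features were implemented in PowerBuilder).

```python
FEATURE_ANONYMISE = True  # variability decision taken at product-derivation time


class ExamReport:
    """Existing core asset of a (hypothetical) clinical product."""
    def __init__(self, patient: str, results: list[str]):
        self.patient = patient
        self.results = results

    def render(self) -> str:
        header = f"Report for {self.patient}"
        # Fine-grained extension: the 'anonymise patient data' feature adds a statement
        # *inside* an existing method, changing behaviour at a very small scale.
        if FEATURE_ANONYMISE:
            header = "Report for [anonymised]"
        return "\n".join([header, *self.results])


# Coarse-grained extension: the 'export report' feature adds a whole new class,
# plugged in at an explicit extension point, without touching existing method bodies.
class ReportExporter:
    def export(self, report: ExamReport) -> bytes:
        return report.render().encode("utf-8")  # placeholder for a real export format


if __name__ == "__main__":
    report = ExamReport("Alice", ["Haemoglobin: ok"])
    print(report.render())
    print(len(ReportExporter().export(report)), "bytes exported")
```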
The feature specification task is usually carried out in natural language, which can present lack of clarity, non-conformities and defects. Consequently, scoping analysts can introduce ambiguity, inconsistency, and non-conformities. A feature non-conformity is an undesirable occurrence identified in the feature specification, meaning the absence of compliance with the required quality attributes of a feature specification document [6]. In the SPL context, quality assurance techniques, such as inspections and testing, have a fundamental role [7][8], since the assets developed can be reused in several products. Nevertheless, the literature has shown that this aspect is poorly investigated [7][8][9], mainly considering data analysis after performing quality assurance techniques. In this context, this paper presents an exploratory study to investigate the influence of feature granularity on feature non-conformity information (amount, types, and occurrences). The study object (dataset) is based on an SPL industrial project in the health information systems domain. We believe that the findings and insights discussed in this work can be useful to SPL researchers and practitioners to realize which software project variables can influence feature non-conformities.

The remainder of this paper is organized as follows: Section II presents the related work. Section III details the context and the design of the empirical study carried out in this work. Section IV presents the analysis and the results of this work, grouped by research questions. Section V discusses the main findings of the study. Section VI discusses the threats to validity and, finally, Section VII describes the conclusions and future directions.

II. RELATED WORK

This section presents four studies similar to our proposal in terms of exploring the effects of feature granularity in the SPL context and of analysing data from software quality assurance techniques. Murphy et al. presented an exploratory study that characterized the effects of applying mechanisms for separation of concerns (or features) in codebases of object-oriented systems (within methods and among classes) from two perspectives [10]. In the first perspective, they observed the effect of the mechanisms on the structure of the codebases. In the second perspective, they characterized the restructuring process required to perform the separation. The study applied three different separation-of-concerns mechanisms: Hyper/J, a tool that supports the concept of hyperspaces [11]; AspectJ, a tool that supports the concept of aspect-oriented programming [12]; and a lightweight lexically-based approach [13], which considers that the separation is possible without advanced tool support. The study concluded that manual restructuring is time-consuming and error-prone, and that automated support would ease the problems of restructuring codebases and separating concerns. Moreover, Murphy et al. argued that the exploratory study provides a guideline to help practitioners choose an appropriate target structure, prepare their codebase for separation of concerns, and perform the necessary restructurings. Kästner et al. explored the effects of feature granularity (coarse and fine-grained) in two types of SPL development approaches (compositional and annotative) [4]. The study identified that compositional approaches do not support fine-grained extensions, and workarounds are required, which raises the implementation complexity.
On the other hand, annotative approaches can implement fine-grained extensions but introduce readability problems by obfuscating the source code. Thus, they concluded that compositional and annotative approaches are not able to implement fine-grained extensions satisfactorily, and analysed possible solutions that allow implementing SPLs without sacrificing understandability. Furthermore, the work presents a tool (Colored Integrated Development Environment – CIDE) that intends to avoid the cited problems, supporting the development of SPLs with fine-grained extensions to implement features. Based on two case studies carried out to evaluate the tool, Kästner et al. argue that CIDE allows implementing fine-grained features, including statement and expression extensions and even signature changes, without workarounds.

In [14], Kalinowski et al. described the concepts incorporated into an evolved approach for software process improvement based on defect data, called Defect Prevention-Based Process Improvement (DPPI). The DPPI approach was assembled based on Defect Causal Analysis (DCA) guidance [15] obtained from a systematic review in the DCA area and on feedback gathered from experts in the field [16]. DPPI provides a framework for conducting, measuring and controlling DCA in order to use it efficiently for process improvement. This approach integrates cause-effect learning mechanisms into DCA meetings. The learning mechanisms consider the characteristics of the product and the defects introduced in its artifacts to enable the construction of a causal model for the organization using a Bayesian network. Kalinowski et al. argue that the possibility of using the resulting Bayesian network for defect prediction can support the definition of risk mitigation strategies by performing "what-if" scenario simulations.

In the work that investigated the relationship between inspection and evolution within SPL [17], we presented an empirical study searching for evidence relating information from feature non-conformities to data from corrective maintenance. The study sample was analyzed using statistical techniques, such as the Spearman rank correlation and Poisson regression models. The findings indicated that there is a significant positive correlation between feature non-conformities and corrective maintenance. Also, sub-domains with a high number of feature non-conformities had a higher number of corrective maintenance actions, and sub-domains qualified as high risk also had a positive correlation with corrective maintenance. This correlation allowed us to build predictive models to estimate corrective maintenance based on the values of the risk sub-domain attribute.

In a previous work [6], we performed an empirical study investigating the effects of applying an inspection approach to feature specification. Our data were gathered from an industrial SPL project. The study sample was analyzed using statistical and economical techniques, such as (i) the Pareto principle, which showed that incompleteness and ambiguity reported higher non-conformity occurrences; (ii) the Spearman rank correlation, which showed that sub-domain risk information can be a good indicator for the prioritization of sub-domains in the inspection activity; and (iii) Poisson regression models, which enabled us to build a predictive model for estimating non-conformities in feature specifications using the risk attribute. Besides, the analysis identified that optional features presented a higher non-conformity density than mandatory features.
Although our two previous works used feature non-conformity data and the same industrial SPL project, the research in this paper aims at investigating the influences on feature non-conformity and their likely root causes from a different perspective, using feature granularity information. To the best of our knowledge, we did not find a work exploring or investigating empirical data and evidence from inspection results and feature granularity information in the SPL context. Thus, the main contribution of this work is to analyze, simultaneously, software inspection data gathered from a previous empirical study and feature granularity information in an SPL industrial project.

III. THE STUDY

A. Background

The SPL industrial project has been conducted in partnership with a company which has developed information systems in the medical domain for almost twenty years. The company has more than 50 customers and a total of 51 staff members, distributed across different areas. It has four main products, comprising a total set of 42 modules, which are responsible for specific functions in different sub-domains (e.g., financial, inventory control, nutritional control, home care, nursing and medical assistance, and so on). Each one of the four products, with their sub-domains, is described below:
• SmartDoctor: a web-based product composed of 11 sub-domains. Its goal is to manage the tasks and routines of a doctor's office.
• SmartClin: a desktop-based product composed of 28 sub-domains. It performs clinical management support activities (e.g., medical exams, diagnostics and so on).
• SmartLab: a desktop-based product composed of 28 sub-domains. It integrates a set of features to manage clinical pathology labs.
• SmartHealth: a desktop-based product composed of 35 sub-domains. It manages the whole area of a hospital, from financial to patient issues.
Some sub-domains are common to all products (present in all products), others are variable (present in two or more products) and some are specific (present in just one product). Market trends, technical constraints and competitiveness motivated the company to migrate its products from single-system development to an SPL approach. In the Scoping phase [18], based on the list of products previously developed by the company, the scoping analysts identified the domains with better market potential and selected the products, sub-domains and features for the product line. After that, the scoping analysts collected feature information (e.g. feature type, feature hierarchy, and feature granularity). The feature granularity definition is associated with the Object-Oriented paradigm. As output of the Scoping phase, some documents were developed: the product map document, composed of the products and their respective features, and the feature specification documents organised by sub-domain. The features were implemented in PowerBuilder (http://goo.gl/flmo3), an Object-Oriented language. The assets (e.g., product map and feature specifications) built in the Scoping phase [19] of the SPL project were reviewed by inspection activities, which were performed in order to assess the quality of the artifacts. For example, the feature specification documents from the first two iterations of the project (9 sub-domains with 92 features) were inspected and, as a result, we found 137 feature non-conformities, which were fixed by the scoping analysts.
The non-conformities were classified into nine types, as defined by van Lamsweerde [20], who proposes a classification of non-conformities based on requirements specification. As the literature does not present a similar classification related to feature specification, we believe that this more general definition can be used in this context. The types are described next:
• Incompleteness or omission: absence or omission of information necessary to specify the domain and sub-domain features, e.g. the feature document template contains incomplete or partially complete items and entries.
• Ambiguity: presence of a specification item that allows more than one interpretation or understanding, e.g. an ambiguous term or statement.
• Incorrectness or inadequacy: presence of a specification item that is incorrectly or inappropriately described, e.g. feature specifications that do not justify their presence in products of the SPL.
• Inconsistency or contradiction: a situation in which a specified feature contains constraints, priority or composition rules that are in conflict with other features and/or work products, such as requirements or use case documents.
• Non-traceability or opacity: presence of features that do not specify, or wrongly specify, their identifier or interaction with other features, e.g. a feature that does not have a unique identifier, or a child feature that does not specify its respective parent feature.
• Incomprehensibility or unintelligibility: a situation in which a specification item is stated in such a way that it is incomprehensible to the target stakeholders.
• Non-organization or poor structuring: when the specified features do not facilitate reading and understanding and do not clearly state their relationships, e.g. a feature does not specify the name of the respective sub-domain, or the feature specification document is not organized by sub-domain.
• Unnecessary information or over-specification: when the specification provides more details than required, e.g. when it brings information regarding later phases of the development cycle of the product line (anticipating decisions).
• Business rule: situation where the definition of the domain business rules is incorrectly specified.

B. Empirical Study Definition

1) Data Collection: For the exploratory study, we applied interview and archival data methods [21][22] to collect data and information. The collected inspection data were treated as strictly confidential, in order to assure the anonymity of the company. Archival data refers to documents from different development phases, organizational charts, financial records, and previously collected measurements in an organization [21]. For this study, we used archival data to collect feature non-conformity information, gathered from the software inspection activity, and feature information (feature type, feature hierarchy, feature interaction and feature granularity) related to all the features in the dataset.

2) Analysis Procedure: This phase comprises the qualitative and quantitative analysis of the collected data. We performed quantitative data analysis based on descriptive statistics, correlation analysis [23], and the development of predictive models [24]. The objective of using qualitative analysis is to sketch conclusions, based on the amount of collected data, which may lead us to a clear chain of evidence [25].
Moreover, the relevant data from documents, assets, and extracted statements, as well as the observations, were grouped and stored in the study database in order to optimize the exploration of the sources of evidence in this study [25].

IV. ANALYSIS AND RESULTS OF THE STUDY

In this section, the analysis and the results are grouped by research question. The investigation was guided by four research questions. The data needed to answer them were organized by sub-domain, considering the number of features, feature non-conformities, feature granularity, feature type, feature hierarchical profile, and feature interaction, as shown in Table I. In order to better understand and answer the research questions, some statistical techniques were used. The first research question is related to the distribution of non-conformities per feature granularity, interaction, hierarchy, and type; to answer it, we performed a descriptive analysis over box plots. The second question used a statistical test to investigate the presence of an association between feature information (feature type, feature hierarchy profile, and feature interaction profile) and feature granularity. The Pearson chi-square test (or Fisher exact test) [26] is the most widely used method to detect association between categorical variables based on the frequencies in two-way tables (also known as contingency tables). These tables provide a foundation for statistical inference, where statistical tests check the relationship between the variables in the observed dataset. Once an association was detected, the odds ratio was calculated as a measure of association (effect size) between two binary categorical variables.

To answer the third and fourth questions, a regression model [27] was fitted; more precisely, a truncated Poisson regression model [26] in which the dependent variable (outcome) was the number of non-conformities. This approach assumes that the logarithm of its expected value (the average number of non-conformities) can be modeled by a linear combination of unknown parameters. The linear combination can be composed of one or more independent variables. Modeling count variables is a common task in several sciences, such as health, micro-econometrics, and the social and political sciences. The classical Poisson regression model [27] for count data is often of limited use in these disciplines because empirical count datasets typically exhibit over-dispersion and/or an excess number of zeros, or structurally exclude zero counts (a left-truncated count component). Examples of truncated counts include the number of bus trips made per week in surveys taken on buses, the number of shopping trips made by individuals sampled at a mall, and the number of unemployment spells among a pool of unemployed. The most common form of truncation in count models is left truncation at zero: the observation apparatus is activated only by the occurrence of an event. Equation 1 represents the probability distribution for this special case of a Poisson without zeros:

P(Y_i = y_i \mid x_i, Y_i > 0) = \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!\,(1 - e^{-\lambda_i})}, \quad y_i = 1, 2, \ldots   (1)

Let Y_1, \ldots, Y_{N_{obs}} be a random sample from the zero-truncated Poisson distribution with parameter \lambda_i, i = 1, \ldots, N_{obs}. Considering the regression model [27] and Equation 2,

\log(\lambda_i) = \beta^T x_i   (2)

where \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T and x_i is the vector of covariate values for subject i, that is, x_i = (1, x_{i1}, \ldots, x_{ip})^T. Furthermore, the above model can be fitted by maximizing the likelihood.
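As an illustration of Equation 1, the sketch below implements the zero-truncated Poisson probability mass function and checks that it is a proper distribution over positive counts. This is a minimal sketch for exposition only, not the authors' Stata code, and the names (zt_poisson_pmf, lam) are our own.

import numpy as np
from scipy.special import gammaln

def zt_poisson_pmf(y, lam):
    # Equation 1: P(Y = y | Y > 0) for a zero-truncated Poisson with rate lam, y = 1, 2, ...
    log_p = -lam + y * np.log(lam) - gammaln(y + 1) - np.log(1.0 - np.exp(-lam))
    return np.exp(log_p)

# Sanity check: the truncated probabilities over y = 1, 2, ... sum to (approximately) one.
lam = 1.5
print(zt_poisson_pmf(np.arange(1, 50), lam).sum())   # ~1.0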
A. RQ1: What is the distribution of feature non-conformities per feature granularity, feature interaction, feature type and feature hierarchy?

In order to answer this question, samples for each variable, according to Table I, are shown in box plots and then analyzed. To compare the two samples in each box plot (feature granularity, interaction, type, and hierarchy), the medians, the interquartile ranges (the box lengths), the overall spreads (distances between adjacent values), and the skewness were analyzed in order to draw some possible conclusions.

According to the feature non-conformities per granularity box plot (Figure 1), the medians are separated, meaning that there is a significant difference between the medians of the two boxes, with the median for coarse-grained features being higher. The length of the fine-grained feature box is slightly bigger than that of the coarse-grained one. The overall spreads are different, being smaller for coarse-grained features. The box plot for coarse-grained features shows a slight upper skew: the upper whisker is longer than the lower one. The box plot for fine-grained features also shows an upper skew. The median for fine-grained features is lower than the median for coarse-grained features, which indicates that the number of non-conformities varies according to the feature granularity. For the shown samples, coarse-grained features have more feature non-conformities than fine-grained ones.

Fig. 1. Boxplot - Feature non-conformities per feature granularity

TABLE I
DATASET SUMMARY BOARD BY SUB-DOMAIN

Sub-domain  Features  Granularity (Fine/Coarse)  Type (Mandatory/Optional)  Hierarchy* (yes/no)  Interaction** (yes/no)  Non-conformities
A           4         0 / 4                      4 / 0                      0 / 4                3 / 1                   4
B           22        1 / 21                     21 / 1                     5 / 17               15 / 7                  33
C           23        6 / 17                     22 / 1                     15 / 8               9 / 14                  31
D           8         1 / 7                      8 / 0                      0 / 8                6 / 2                   20
E           4         1 / 3                      4 / 0                      2 / 2                4 / 0                   8
F           11        1 / 10                     11 / 0                     1 / 10               11 / 0                  13
G           3         0 / 3                      3 / 0                      3 / 0                2 / 1                   3
H           8         2 / 6                      8 / 0                      0 / 8                3 / 5                   7
I           9         3 / 6                      5 / 4                      9 / 0                6 / 3                   18
Total       92        15 / 77                    86 / 6                     35 / 57              59 / 33                 137

* a feature with at least one hierarchy profile has the value "yes", otherwise it has the value "no"
** a feature with at least one interaction profile has the value "yes", otherwise it has the value "no"

For the feature non-conformities per feature interaction box plot (Figure 2), the medians are more separated, with the median for features with interaction being higher. The length of the box corresponding to the features without interaction is bigger than that of the box corresponding to the features with interaction. The overall spreads are different, being smaller for features with interaction. The box plot for features with interaction shows an upper skew: the upper whisker is longer than the lower one. On the other hand, the box plot for features without interaction shows a lower skew: the lower whisker is longer than the upper one. The median for the features without interaction is close to the lower adjacent value for the features with interaction, which indicates that the number of non-conformities varies according to the feature interactions. According to this box plot, the number of non-conformities for features with interaction is bigger than the number of non-conformities for features without interaction.

Fig. 2. Boxplot - Feature non-conformities per feature interaction
Analyzing the feature non-conformities per feature hierarchy box plot (Figure 3), the medians are the same. The lengths of the boxes are different: the box for features with hierarchy is much bigger than the box for features without hierarchy. The overall spreads are also different, being bigger for the features with hierarchy, even considering the two outliers from the features without hierarchy. The outliers from Figure 3 are related to features without hierarchy. The upper outlier represents the number of non-conformities for Domain D per features without hierarchy in the same domain (20/8 = 2.5). The lower outlier from Figure 3 represents the number of non-conformities for Domain G per features without hierarchy in the same domain (0), and also the number of non-conformities for Domain I per features without hierarchy in the same domain (0). The box plot for features with hierarchy shows a lower skew: the lower whisker is longer than the upper one. On the other hand, the box plot for features without hierarchy shows an upper skew: the upper whisker is longer than the lower one. The medians of both boxes are the same; however, disregarding the two outliers from the features without hierarchy, the number of non-conformities for features with hierarchy is bigger than the number of non-conformities for features without hierarchy.

Fig. 3. Boxplot - Feature non-conformities per feature hierarchy

Analyzing the feature non-conformities per feature type box plot (Figure 4), the medians are well separated, with the median for mandatory features being higher and very distant from the optional feature median. The mandatory feature box is bigger than the optional feature one. The overall spreads are also different, being bigger for the mandatory features, even considering the two outliers from the optional features. The outliers from Figure 4 are related to optional features. The upper outlier represents the number of non-conformities for Domain I per optional features in the same domain (9/4 = 2.25). The lower outlier from Figure 4 represents the number of non-conformities for Domain B per optional features in the same domain (1/1 = 1). The box plot for mandatory features has lower and upper whiskers of the same size. For the optional feature box plot, almost all the non-conformity values were zero, turning the box into a line. The median for mandatory features is bigger than the upper quartile of the optional features, which leads to the conclusion that the number of non-conformities varies according to the feature type. The number of non-conformities for mandatory features is bigger than the number of non-conformities for optional features in these samples.

Fig. 4. Boxplot - Feature non-conformities per feature type

According to the box plots (feature granularity, interaction, hierarchy, and type), it can be observed that the feature non-conformities are more concentrated in coarse-grained features, features with interactions, features with hierarchy, and mandatory features.
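For readers who want to reproduce the kind of comparison behind Figures 1-4, the sketch below plots non-conformities normalized by the number of features of each granularity class in each sub-domain. This normalization is one plausible reading of how the per-sub-domain values were derived (mirroring the 20/8 = 2.5 arithmetic used for the outliers above); it is our assumption, not a description of the authors' exact procedure.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Per-sub-domain counts transcribed from Table I (granularity columns only).
table1 = pd.DataFrame({
    "sub_domain":       list("ABCDEFGHI"),
    "fine":             [0, 1, 6, 1, 1, 1, 0, 2, 3],
    "coarse":           [4, 21, 17, 7, 3, 10, 3, 6, 6],
    "non_conformities": [4, 33, 31, 20, 8, 13, 3, 7, 18],
})

# Hypothetical normalization: non-conformities per feature of each granularity class,
# dropping sub-domains that have no feature of that class.
ratio_fine = (table1["non_conformities"] / table1["fine"].replace(0, np.nan)).dropna()
ratio_coarse = table1["non_conformities"] / table1["coarse"]

plt.boxplot([ratio_coarse, ratio_fine], labels=["Coarse", "Fine"])
plt.ylabel("Non-conformities per feature")
plt.title("Feature non-conformities per granularity (illustrative)")
plt.show()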
During the feature specification activity, this feature information should be taken into account in order to achieve the required quality [6].

B. RQ2: Is there any association between feature information and feature granularity?

This question aims to investigate whether there is a significant association between feature information and feature granularity from the perspectives described next:
• Feature hierarchy and feature granularity: In a feature model, the feature immediately above any feature is its parent feature, and the feature immediately below a parent feature is its child feature [2]. This perspective investigates whether there is a correlation between hierarchical feature information and feature granularity.
• Feature type and feature granularity: The feature type specifies whether a feature is contained in all products of the product line (mandatory feature) or only in some products (optional feature). Furthermore, when child features have the same parent and only one of them can be chosen for a product configuration, they are called alternative features [2]. This perspective investigates whether there is a correlation between feature type information and feature granularity.
• Feature interaction and feature granularity: The feature interaction specifies whether a feature has any dependency relationship with another one. Thus, (i) a feature can require the existence of another feature, because they are interdependent (the second feature is a required feature), and (ii) a feature can be mutually exclusive with another, so that they cannot coexist (excluded features) [2]. This perspective investigates the possible correlation between feature interaction and feature granularity.

To better understand and answer the issues raised by these perspectives, this question was split into three sub-questions.

1) RQ2.1: Is there any association between feature type and feature granularity?: The investigated feature sample is composed of 6 optional features and 86 mandatory features. Analyzing the relationship between feature type and feature granularity (Table II), we observed that, among the optional features, 2 are fine-grained and 4 are coarse-grained; among the mandatory features, 13 are fine-grained and 73 are coarse-grained. In this sample, 66.7% of the optional features are coarse-grained, and among the mandatory features this percentage grows to 84% (coarse-grained features). Using the Fisher exact test [26], there was not enough evidence to state that there is a statistical association between feature type and feature granularity (Table II).

2) RQ2.2: Is there any association between feature hierarchy profiles and feature granularity?: Considering features with hierarchical profiles, the sample is composed of 5 parent features, 19 child features, 2 features that take the parent and child profiles simultaneously, and 66 features without any hierarchical profile: they are at the root level of the feature model without any child feature.
Analyzing the relationship between the feature hierarchy profiles and feature granularity (Table II), we observed that all 5 parent features are coarse-grained; among the child features, 4 are fine-grained and 15 are coarse-grained; and both parent-child features are coarse-grained. In this sample, 83% of the features without any hierarchical profile are coarse-grained; for the features with at least one hierarchical profile, this percentage grows to 85% (coarse-grained features). Using a generalization of the Fisher exact test, there was not enough evidence to state that there is a statistical association between the variables feature hierarchy and feature granularity (Table II).

TABLE II
DATA FROM FEATURE INFORMATION AND FEATURE GRANULARITY PERSPECTIVES

Feature information                     Fine   %      Coarse   %      p-value
Feature type                                                          0.252*
  Mandatory Features                    13     15.1   73       84.9
  Optional Features                     2      33.3   4        66.7
Feature hierarchy                                                     0.784**
  Parent Features                       0      0      5        100
  Children Features                     4      21.1   15       78.9
  Parent-Children Features              0      0      2        100
  Feature without hierarchy profile     11     16.7   55       83.3
Feature interaction                                                   0.010**
  Required Features                     3      10     27       90
  Requesting Features                   0      0      17       100
  Requesting-Required Features          1      8.3    11       91.7
  Feature without interaction profile   11     33.3   22       66.7

* Result from the Fisher exact test
** Results from a generalization of the Fisher exact test

3) RQ2.3: Is there any association between feature interaction and feature granularity?: Considering features with interactions (dependencies), the sample is composed of 30 features that require another one, 17 features that are requested by another one, 12 features that simultaneously require another feature and are requested by another feature, and 33 features without any interaction profile. Analyzing the relationship between the feature interaction profiles and feature granularity (Table II), we observed that, among the features that require another one, 3 are fine-grained and 27 are coarse-grained; all 17 features that are requested by another one are coarse-grained; and among the features that simultaneously require and are requested by another feature, 1 is fine-grained and 11 are coarse-grained. Among the features without any interaction profile, 67% are coarse-grained; among the features with at least one interaction profile, this percentage grows to 93% (coarse-grained features). Using a generalization of the Fisher exact test, we obtained relevant evidence, within the significance level of 5%, of an association between the variables feature interaction and feature granularity (Table II). Moreover, using the odds ratio to measure the ratio of the odds that an event occurs to the odds that it does not occur, we observed in our data that the odds of being coarse-grained among the features with at least some interaction are 6.9 times higher than among the features without any interaction profile. Thus, the variable feature interaction presented a significant statistical association with the variable feature granularity.
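The 2x2 collapse of Table II that underlies the reported odds ratio can be checked directly. The sketch below, a minimal illustration using SciPy rather than the authors' original tooling, aggregates the interaction rows of Table II into "at least one interaction profile" versus "no interaction profile" and recovers the 6.9 figure (the paper's p-value of 0.010 comes from the generalized test on the full table, not from this collapsed one).

from scipy.stats import fisher_exact

# Rows: features with >= 1 interaction profile, features with none.
# Columns: coarse-grained, fine-grained (counts aggregated from Table II).
table = [[27 + 17 + 11, 3 + 0 + 1],   # with interaction: 55 coarse, 4 fine
         [22,           11]]          # without interaction: 22 coarse, 11 fine

odds_ratio, p_value = fisher_exact(table)
print(round(odds_ratio, 1))   # 6.9 -> (55/4) / (22/11)
print(p_value)                # two-sided Fisher p-value for the collapsed 2x2 table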
C. RQ3: Is there any influence of feature granularity on the feature non-conformity data?

Given the importance of the variable feature granularity [4], and in order to explore the variables that influence the feature non-conformity values, this question investigates whether feature granularity has any influence on the values collected for the variable feature non-conformity. Considering the studied sample of 92 features, the non-conformity average was slightly lower for coarse-grained features (1.47) than for fine-grained features (1.60). The same behavior was observed for the dispersion of the feature non-conformities (Table III).

TABLE III
DESCRIPTIVE STATISTICAL ANALYSIS - FEATURE GRANULARITY AND FEATURE NON-CONFORMITY DATA

Feature Granularity   Feature amount   Mean   Std. Deviation   Std. error mean
Fine                  15               1.60   0.986            0.254
Coarse                77               1.47   1.165            0.133

In order to assess the influence of feature granularity (X) on the feature non-conformity data, some models were fitted to estimate feature non-conformities through a truncated Poisson regression. In this case, 72 observations were considered, representing all features with positive counts, i.e., features that presented non-conformities after inspection (Y_1, \ldots, Y_{N_{obs}}). Based on Equation 3, four univariate regression models were fitted in an attempt to answer this research question:

\log(\lambda_i) = \beta_0 + \beta_1 X   (3)

In Equation 3, \lambda_i represents the average intensity of feature non-conformities, and \beta_0 and \beta_1 are the unknown parameters. The estimates were obtained with the Stata2 software. According to the results (Table IV), the log of the average intensity of feature non-conformities increases by 0.067 when coarse-grained features are compared with fine-grained features. For the investigated sample, this result means that feature granularity presented a small influence on feature non-conformity; however, this finding has a low significance level, with p-value = 0.785 (Table IV).

2 http://www.stata.com

TABLE IV
ZERO-TRUNCATED SIMPLE POISSON REGRESSION - FEATURE INFORMATION AND FEATURE NON-CONFORMITY

Parameters            Estimate   Std Error   z-value   p-value
Intercept             0.324      0.217       1.50      0.134
Feature Granularity   0.067      0.245       0.27      0.785
Intercept             0.373      0.107       3.48      0
Feature Type          0.933      0.356       0.26      0.793
Intercept             0.416      0.120       3.49      0.183
Feature Hierarchy     −0.139     0.229       −0.61     0.543
Intercept             0.027      0.264       0.10      0.919
Feature Interaction   0.491      0.283       1.74      0.082
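A fit of the univariate model in Equation 3 can be reproduced outside Stata by maximizing the zero-truncated Poisson likelihood directly. The sketch below is a minimal illustration with SciPy, not the authors' actual script; y and coarse stand for the 72 positive non-conformity counts and the granularity dummy, and the values shown are placeholders, not the study data.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_loglik(beta, X, y):
    # Negative log-likelihood of the zero-truncated Poisson regression (Equations 1-3).
    lam = np.exp(X @ beta)
    ll = -lam + y * np.log(lam) - gammaln(y + 1) - np.log(1.0 - np.exp(-lam))
    return -ll.sum()

# Hypothetical inputs: y holds positive non-conformity counts per feature,
# coarse is 1 for coarse-grained features and 0 for fine-grained ones.
y = np.array([1, 2, 1, 3, 1, 2])          # placeholder values, not the study data
coarse = np.array([1, 1, 0, 1, 0, 1])     # placeholder values, not the study data

X = np.column_stack([np.ones_like(coarse), coarse])   # design matrix (1, X) as in Equation 3
fit = minimize(neg_loglik, x0=np.zeros(X.shape[1]), args=(X, y), method="BFGS")
beta0, beta1 = fit.x   # on the study data these would correspond to the Table IV estimates (0.324, 0.067)
print(beta0, beta1)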
D. RQ4: Can feature information and feature granularity simultaneously influence the feature non-conformity data?

To answer this question, a regression model was used to assess the effects (influence) of all the independent variables simultaneously on the positive cases of feature non-conformities. A truncated Poisson model was fitted to predict the average intensity of feature non-conformities as a function of the feature information, namely feature type (X_1), feature interaction (X_2), feature hierarchy (X_3), and feature granularity (X_4) simultaneously (Equation 4). The regression parameters were estimated by maximum likelihood (Table V(a)).

\log(\lambda_i) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4   (4)

Comparing Table IV and Table V(a), we can observe that the estimated influence of the variable feature interaction on the average intensity of feature non-conformities increased; the estimated influence values of the variables feature hierarchy and feature granularity remained negative; and the estimated influence of the variable feature type decreased. Despite the estimated value of the variable feature interaction, its final influence in the presence of the other independent variables in the multivariate Poisson regression model (Equation 4) was reduced due to the negative values of the variables feature hierarchy and feature granularity. In addition, a reduced model was fitted to take into account the influence of the independent variables feature interaction (X̂_1) and feature granularity (X̂_2) on the average intensity of feature non-conformities (Equation 5):

\log(\lambda_i) = \beta_0 + \beta_1 \hat{X}_1 + \beta_2 \hat{X}_2   (5)

TABLE V
ZERO-TRUNCATED MULTIPLE POISSON REGRESSION MODELS TO ESTIMATE FEATURE NON-CONFORMITY

(a) Candidate model
Parameters            Estimate   Std Error   z-value   p-value
Intercept             0.110      0.309       0.36      0.722
Feature Interaction   0.530      0.324       1.63      0.102
Feature Hierarchy     −0.013     0.228       −0.06     0.954
Feature Type          0.1        0.292       0.34      0.733
Feature Granularity   −0.139     0.302       −0.46     0.646
AIC: 185.9765

(b) Selected model
Parameters            Estimate   Std Error   z-value   p-value
Intercept             0.119      0.281       0.42      0.673
Feature Interaction   0.534      0.309       1.73      0.084
Feature Granularity   −0.148     0.291       −0.51     0.61
AIC: 182.0345

The Akaike information criterion (AIC) is a measure of the relative goodness of fit of a statistical model.

From Table V(b), we observed that the log of the average intensity of feature non-conformities decreases when coarse-grained features are compared with fine-grained features; however, this effect was not significant at the 10% level. On the other hand, the features with an interaction profile presented a significant effect on the average intensity of feature non-conformities in comparison with the features without an interaction profile.

Considering the data from Table V and the selected multiple Poisson regression model to estimate feature non-conformities (Equation 5), we could compute the predicted values of feature non-conformities for each combination of values of the model parameters, feature interaction and feature granularity (Table VI). These results can then be used to make predictions. For example, the expected average intensity of feature non-conformities for features without interaction and with fine granularity would be 1.126; for features without interaction and with coarse granularity, 0.971; for features with at least one interaction and fine granularity, 1.921; and for features with at least one interaction and coarse granularity, 1.657.

TABLE VI
VALUES FOR THE FEATURE NON-CONFORMITY PREDICTION MODEL - EQUATION 5

Feature Interaction   Feature Granularity   λ̂_i
No                    Fine                  1.126
No                    Coarse                0.971
Yes                   Fine                  1.921
Yes                   Coarse                1.657

These results highlight that fine-grained features had larger predicted values (feature non-conformities) than coarse-grained features, and that features with interactions (dependencies) with another feature had larger estimated values than features without interactions.
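The Table VI values follow directly from the selected model's coefficients in Table V(b). The short check below is our own illustration, with the granularity dummy assumed to be 1 for coarse-grained features and the interaction dummy 1 for features with at least one interaction; under those assumptions it reproduces the four predicted intensities.

import math

# Coefficients of the selected model (Table V(b)): intercept, feature interaction, feature granularity.
b0, b_inter, b_gran = 0.119, 0.534, -0.148

def predicted_intensity(has_interaction: int, is_coarse: int) -> float:
    # Expected average intensity of non-conformities under Equation 5 (log link).
    return math.exp(b0 + b_inter * has_interaction + b_gran * is_coarse)

for inter in (0, 1):
    for coarse in (0, 1):
        print(inter, coarse, round(predicted_intensity(inter, coarse), 3))
# -> 1.126, 0.971, 1.921, 1.657, matching Table VI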
V. MAIN FINDINGS OF THE STUDY

The main findings of this study can be summarized as follows:
a) During the feature specification activity, the scoping analysts should give more attention to coarse-grained features, features with interactions, features with hierarchy, and mandatory features in order to achieve better quality in the documents, since these indicated a higher number of non-conformities.
b) The variables feature type and feature hierarchy did not present a significant statistical association with the variable feature granularity, using the Fisher exact test and a generalization of the Fisher exact test, respectively.
c) The variable feature interaction presented a significant statistical association with the variable feature granularity using the generalization of the Fisher exact test (p-value = 0.010). In addition, the odds of being coarse-grained among the features with at least some interaction are 6.9 times higher than among the features without any interaction profile.
d) The variable feature granularity alone presents a small influence on the variable feature non-conformity (estimated value = 0.067); however, the model returned a low significance level (p-value = 0.785) using the zero-truncated simple Poisson regression model.
e) In the zero-truncated multiple Poisson regression model, the variable feature interaction presented a relevant positive influence on feature non-conformity (estimated value = 0.534) within the significance level of 10% (p-value = 0.084). On the other hand, the variable feature granularity presented a negative influence on the variable feature non-conformity (estimated value = −0.148) with a low significance level (p-value = 0.61).
f) Features with at least one interaction had larger estimated feature non-conformities than features without interactions. This finding needs further investigation (empirical studies) to observe whether it is reproduced and to understand its possible reasons.
g) Fine-grained features had larger estimated feature non-conformities than coarse-grained features. Kästner et al. [4] also reported problems in handling the implementation of fine-grained features in the SPL context. In this study, fine-grained features stood out because they estimated more feature non-conformities (RQ4).

VI. THREATS TO VALIDITY

In order to reduce the threats to validity, countermeasures were taken during the whole study. The countermeasures followed the quality criteria in terms of construct, external, and internal validity, as discussed in [25]. Moreover, we also briefly describe the mitigation strategy for the research question definition and for negative results.

Construct validity: Two strategies were used:
• Longstanding involvement: The researchers had a long involvement with the object of study, allowing them to gather tacit knowledge, which helped to avoid misunderstandings and misinterpretations [28].
• Peer debriefing: This strategy recommends that the analysis and conclusions be shared and reviewed by other researchers [28]. This was achieved by conducting the analysis with three researchers, by holding discussion groups in which the analysis and conclusions were discussed, and by the supervision of a statistics researcher (the second author of this paper).
Internal validity: This threat was mitigated by ensuring the anonymity of the company and free access to the company for the research team.

External validity: Although the study was conducted in only one company, our intention is to build a knowledge base that enables future analytical generalization, in which the results are extended to cases with similar characteristics.

Reliability: This aspect was addressed using two tactics: a detailed empirical study protocol and a structured study database containing all relevant raw data, such as interview and meeting tapes, transcripts, documents, and the outlines of the statistical models.

Research questions: The set of questions might not have properly covered all aspects of the relationship between SPL inspection and feature information. As this was considered a plausible threat, discussions among the authors of this work and some members of the research group (RiSE Labs3) were conducted in order to calibrate the questions.

3 http://labs.rise.com.br/

Negative results: Some of the correlation and prediction analyses presented negative results. However, these should not be discarded immediately, and future analyses of these cases must be provided to increase the validity of the conclusions.

VII. CONCLUSIONS

The use of the feature concept is a key aspect for organizations interested in achieving improvements in software reuse, productivity, quality, and cost reduction [2]. Software product lines, as a software reuse approach, have proven their benefits in different industrial environments and domains [29][30]. To achieve these benefits, quality assurance techniques, such as software inspection, should be applied to the feature specification artifacts, since feature units can be considered the starting point for reusing assets in different products. In this exploratory study, we investigated relationships between feature granularity data and feature non-conformity data based on a sample of 92 features and 137 feature non-conformities. Although there were some negative results, we believe that some variables need further empirical investigation to validate the results of this study. Based on the dataset, we identified that there was no significant statistical association between the variables feature type and feature granularity, nor between the variables feature hierarchy and feature granularity. On the other hand, there was a significant statistical association between the variables feature interaction and feature granularity. The variable feature granularity did not present a significant statistical influence on the variable feature non-conformity. Investigating the simultaneous influence of the variables feature interaction and feature granularity on the variable feature non-conformity, the outcome was that the variable feature interaction presented a positive influence on feature non-conformity, whereas the variable feature granularity presented a negative influence, with a low significance level. Also, during the feature specification activity, coarse-grained features, features with interactions, features with hierarchy, and mandatory features should receive more attention, since they revealed more non-conformities. Furthermore, this work can be seen as a further step towards understanding which variables of software projects influence the feature non-conformity data in the SPL context.
As future work, we are planning to replicate this study in another company within the financial domain. R EFERENCES [1] P. Clements and L. Northrop, Software Product Lines: Practices and Patterns. Boston, MA, USA: Addison-Wesley, 2001. [2] K. Kang, S. Cohen, J. Hess, W. Nowak, and S. Peterson, FeatureOriented Domain Analysis (FODA) Feasibility Study. Technical Report CMU/SEI-90-TR-21, 1990. [3] T. Von Der Massen and H. Lichter, “Deficiencies in feature models,” in Workshop on Software Variability Management for Product Derivation Towards Tool Support, SPLC 2004, T. Mannisto and J. Bosch, Eds. Springer Verlag, 2004. [4] C. Kästner, S. Apel, and M. Kuhlemann, “Granularity in software product lines,” Proceedings of the 30th international conference on Software engineering, pp. 311–320, 2008. [5] I. John and M. Eisenbarth, “A decade of scoping: a survey,” Proceedings of the 13th International Software Product Line Conference, pp. 31–40, 2009. [6] I. S. Souza, G. S. S. Gomes, P. A. M. S. Neto, I. C. Machado, E. S. Almeida, and S. R. L. Meira, “Evidence of software inspection on feature specification for software product lines,” Journal of Systems and Software, vol. 86, no. 5, pp. 1172–1190, 2013. [7] P. A. da Mota Silveira Neto, I. do Carmo Machado, J. D. McGregor, E. S. de Almeida, and S. R. de Lemos Meira, “A systematic mapping study of software product lines testing,” Information & Software Technology, vol. 53, no. 5, pp. 407–423, 2011. [8] E. Engström and P. Runeson, “Software product line testing - a systematic mapping study,” Information & Software Technology, vol. 53, no. 1, pp. 2–13, 2011. [9] M. A. Babar, L. Chen, and F. Shull, “Managing variability in software product lines,” IEEE Software, vol. 27, no. 3, pp. 89–91, 94, 2010. [10] G. C. Murphy, A. Lai, R. J. Walker, and M. P. Robillard, “Separating features in source code: an exploratory study,” Proc. 23rd Int. Conf. on Soft. Eng. ICSE 2001, vol. 12, pp. 275–284, 2001. [11] H. Ossher and P. Tarr, “Hyper/j: Multi-dimensional separation of concerns for java,” Proceedings of the 23rd International Conference on Software Engineering, pp. 821–822, 2000. [12] G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. V. Lopes, J.-M. Loingtier, and J. Irwin, “Aspect-oriented programming,” 11th European Conference on Object-Oriented Programming, pp. 220–242, 1997. [13] M. P. Robillard and G. C. Murphy, An Exploration of a Lightweight Means of Concern Separation. Aspects and Dimensions of Concern Workshop, 2000, pp. 1–6. [14] M. Kalinowski, E. Mendes, D. N. Card, and G. H. Travassos, “Applying dppi: A defect causal analysis approach using bayesian networks,” Proceedings of the 11th international conference on Product-Focused Software Process Improvement, pp. 92–106, 2010. [15] M. Kalinowski, D. N. Card, and G. H. Travassos, “Evidence-based guidelines to defect causal analysis,” Software, IEEE, vol. 29, no. 4, pp. 16–18, 2012. [16] M. Kalinowski, G. Travassos, and D. Card, “Towards a defect prevention based process improvement approach,” Software Engineering and Advanced Applications, SEAA ’08., pp. 199–206, 2008. [17] I. S. Souza, R. P. de Oliveira, G. Gomes, and E. S. de Almeida, “On the relationship between inspection and evolution in software product lines: An exploratory study,” in 26th Brazilian Symposium on Software Engineering. IEEE, 2012, pp. 131–140. [18] M. S. M. Balbino, E. S. Almeida, and . R. L. M. Meira, “A scoping process for software product lines,” 23rd International Conference on Software Engineering and Knowledge Engineering, pp. 
717–722, 2011. [19] I. John, “Using documentation for product line scoping,” IEEE Software, vol. 27, pp. 42–47, 2010. [20] A. van Lamsweerde, Requirements Engineering: From System Goals to UML Models to Software Specifications. Wiley, March 2009. [21] P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empirical Software Engineering, vol. 14, pp. 131–164, April 2009. [22] C. B. Seaman, “Qualitative methods in empirical studies of software engineering,” IEEE Trans. Softw. Eng., vol. 25, pp. 557–572, July 1999. [23] T. Gauthier, “Detecting trends using spearman’s rank correlation coefficient,” Environmental Forensics, vol. 2, no. 4, pp. 359–362, 2001. [24] T. M. Khoshgoftaar, K. Gao, and R. M. Szabo, “An application of zeroinflated poisson regression for software fault prediction,” Proc. 12th Int. Symp. on Soft. Reliability Eng., pp. 66–73, 2001. [25] R. K. Yin, Case Study Research: Design and Methods. Sage Publications, 2008. [26] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers. Wiley, 2006. [27] A. C. Cameron and P. K. Trivedi, Regression Analysis of Count Data. Cambridge: Cambridge University Press, September 1998. [28] D. Karlström and P. Runeson, “Integrating agile software development into stage-gate managed product development,” Empirical Software Engineering, vol. 11, pp. 203–225, June 2006. [29] F. Ahmed, L. Capretz, and S. Sheikh, “Institutionalization of software product line: An empirical investigation of key organizational factors,” Journal of Systems and Software, vol. 80, no. 6, pp. 836–849, 2007. [30] J. F. Bastos, P. A. M. Silveira, E. S. Almeida, and . R. L. M. Meira, “Adopting software product lines: A systematic mapping study,” 15th International Conference on Evaluation and Assessment in Software Engineering, pp. 11–20, 2011. An Extended Assessment of Data-driven Bayesian Networks in Software Effort Prediction Ivan A. P. Tierno and Daltro J. Nunes Instituto de Informática UFRGS Porto Alegre, Brazil Email: {iaptierno,daltro}@inf.ufrgs.br Abstract—Software prediction unveils itself as a difficult but important task which can aid the manager on decision making, possibly allowing for time and resources sparing, achieving higher software quality among other benefits. Bayesian Networks are one of the machine learning techniques proposed to perform this task. However, the data pre-processing procedures related to their application remain scarcely investigated in this field. In this context, this study extends a previously published paper, benchmarking data-driven Bayesian Networks against mean and median baseline models and also against ordinary least squares regression with a logarithmic transformation across three public datasets. The results were obtained through a 10-fold cross validation procedure and measured by five accuracy metrics. Some current limitations of Bayesian Networks are highlighted and possible improvements are discussed. Furthermore, we assess the effectiveness of some pre-processing procedures and bring forward some guidelines on the exploration of data prior to Bayesian Networks’ model learning. These guidelines can be useful to any Bayesian Networks that use data for model learning. Finally, this study also confirms the potential benefits of feature selection in software effort prediction. I. 
INTRODUCTION

Accurate software predictions can provide significant advantages in project planning and are essential for effective project management, being strongly linked to the success of software projects. Underestimating the effort can cause delays, degrade software quality, and bring about increased costs and dissatisfied customers. On the other hand, overestimating the project's effort can lose a contract bid or waste resources that could be allocated elsewhere. Although the primary objective of software effort prediction is budgeting, there are also other important objectives. Boehm et al. [1] mention tradeoff and risk analysis, project planning and control, and software improvement investment analysis.

In the nineties, researchers began applying machine learning techniques to software effort prediction [2] [3] [4]. Ever since, studies on machine learning techniques for software prediction have grown more and more common. Currently this is visibly a thriving trend, with many empirical studies being published regularly, comprising a very active research field. In a systematic review, Wen et al. [5] identified eight machine learning techniques employed in software prediction, including CART (a type of decision tree) [3], Case-Based Reasoning (CBR) [4], Artificial Neural Networks, Genetic Algorithms, and Support Vector Regression, among others. CBR, Artificial Neural Networks, and Decision Trees were considered by Wen et al. [5] the most popular machine learning techniques in software development effort prediction research. One of these machine learning techniques is Bayesian Networks (henceforth BNs), which is the technique we assess in this study. BNs were initially proposed, and are generally more common, in software quality prediction. Since then, there has been a steady increase of efforts towards BNs in software effort prediction and in software project management in general. Wen et al. ranked BNs fourth in popularity among the machine learning techniques in software development effort prediction. This technique has some distinguishing features that make it look suitable to deal with the uncertainties prevalent in this field. BNs will be discussed briefly in the next section.

This research field has suffered from contradictions and few clear conclusions. In spite of the large number of empirical studies, there are conflicting results and conclusion instability [6] [7] [8]. Shepperd and MacDonell [9] state that 'empirical evaluation has not led to consistent or easy to interpret results'. This matters because it is hard to know what advice to offer to practitioners. There are many examples of contradictions in comparisons among different machine learning and statistical techniques, as described, for instance, in [7] and [9]. Part of these inconsistencies stem from differences in the experiments and sometimes from errors in the procedures, as discussed in [9] and [10]; the latter study points out mistakes in the application of regression models. Myrtveit, Stensrud and Shepperd [11] discuss the reasons for the unreliability of conclusions in detail, chiefly focusing on validation and measurement, and conclude that more reliable research procedures are necessary. Several other researchers have made suggestions about the validation of results in comparative studies, e.g., [12], [13] and [9]. With regard to BNs, details on their employment and on the preparation and pre-processing prior to model learning remain scarcely investigated.
There is some uncertainty about its effectiveness and about the pre-processing procedures applied prior to model learning. Given the relevancy of BNs in software prediction research, investigations on its employment and effectiveness are necessary. In this context, this study strives to assess the employment of data-driven BNs in software effort prediction through extensive validation procedures, including analyses on data preprocessing, providing guidelines on how to best explore data, and discussing BNs’ current limitations and possibilities of improvements. The investigation of data-driven BNs matters because even if this might not become the best way to apply them, the optimization of data exploration is an important direction of development for this technique. By finding ways to optimize the exploration of data there can be benefits to any BNs that use data. This paper extends a preliminary work [14] by assessing other pre-processing steps, and extending significantly the validation by including other metrics and another dataset, and also by refining the observations on the results. This paper is organized as follows. In section II we present a brief overview on BNs. In section III we mention some closely related studies. In section IV we bring forward the empirical procedures, datasets used, and how we compared the prediction systems. In section V we analyze and discuss the results and finally put forth the conclusions in the last section. II. BAYESIAN N ETWORKS BNs [15] [16] are a modeling technique which boasts some distinguishing characteristics. A striking feature of this modelling approach is the possibility, through application of probability theory, to model uncertainty or subjectivity. The probability distributions allow for the integration of objective evaluations, learned from data, with subjective evaluations defined by experts. Furthermore, this allows the model to output several possible outcomes with varying degrees of certainty, unlike deterministic models like linear regression which simply output a single possible outcome, i.e., a numeric value. BNs comprise a qualitative part, i.e., the graph structure that models the dependencies among a set of variables, and a quantitative part made up of node probability tables (NPT’s) which contain the probability distributions for each node. The graph structure is a directed acyclic graph (DAG) encoding the dependencies among the variables. The nodes represent the relevant variables or factors in the domain being modeled, and each directed arc depicts the dependencies among these factors which can be causality relationships. The NPT’s contain the prior probabilities (in case the variables has no parents) or conditional probabilities (in case the variable has one or more parents). The conditional probabilities define the state of a variable given the combination of states of its parents. With the definition of these probabilities during the training phase a test record can later be classified. These components are illustrated on a simple example in Fig. 1. Fig. 1. A simple Bayesian Network. BNs can be modeled fully based on data, through a hybrid approach, i.e., integrating data modeling and experts knowledge or fully expert-based. When the BNs are learnt from data, the learning algorithm strives to identify the dependencies among the variables and thus making up the network structure, i.e., the DAG. 
The algorithm will identify a model that best fits the relationship between the attribute set and the response variable on the input data (training data). Thereafter, the probability distributions are learned for every combination of variables. This happens during the so called training or learning phase. The BNs found in this research field most frequently consist of discrete variables. The tool used in this study currently does not support continuous variables. Although some tools offer support to continuous variables, this support has limitations, e.g., imposing restrictions in the relationships among the variables or making assumptions about the distributions of the continuous variables. There are progresses concerning continuous variables in machine learning research and there are also constant developments in the BNs tools, so these limitations could be overcome in the future. For a more detailed review on BNs we refer the reader to other works in the field, e.g., [17], [18], [19] and to data mining literature [15], [16]. III. R ELATED WORK In this section we describe some closely related studies. Radlinski and Hoffman [20] carried out a comprehensive benchmarking study comparing 23 classifiers in WEKA over four public datasets. The authors state their main research question is: “Is it possible to easily predict software development effort from local data?”. So, they establish two specific constraints: easy predictions and using local data, i.e., data from a single company. This paper focused more on the practitioners viewpoint, trying to avoid complex and time-consuming procedures. So, the authors do not address specific details of the techniques but provide a wide-ranging assessment of easy-to-use machine learning classifiers. By comparing so many classifiers this study illustrates very well the lack of stability of the ranking of the techniques across different datasets. They mentioned that due to the ranking instability it is difficult to recommend practitioners with a particular model even though they did conclude that K* technique with feature selection was the most accurate overall. BNs were among the most accurate predictors in two of the four datasets but did not particularly stand out. They also demonstrate that most techniques achieve higher accuracy by performing feature selection. Mendes and Mosley [13] outline thorough experiments comparing BNs, CBR, manual stepwise regression and simple mean and median based models for web effort prediction using Tukutuku, a proprietary cross-company dataset. The study compares four automatic and four hybrid BN models. The results were unfavourable to BNs, with most of the models being more inaccurate than the median model and two of them barely matching it. The authors conclude that manual stepwise regression may be the only effective technique for web effort estimation. Furthermore, they recommend that researchers benchmark proposed models against mean and median based models as they show these can be more effective than more complex models. One of the last investigations can be found in [21] wherein comprehensive experiments are laid out yielding a benchmark of some statistical and data mining techniques, not including however, BNs. This study benchmarks numeric predictors, as opposed to [20] which assesses classifiers, i.e., discrete class predictors. This study included thirteen techniques over eight public and private datasets. 
Their results “indicate that ordinary least squares regression with a logarithmic transformation performs best”. They also investigate feature subset selection with a wrapper approach confirming the improvements brought by this technique. The authors also discuss appropriate procedures and address efforts towards statistically rigorous analyses. A survey covering BNs for software development effort prediction can be found in [19]. IV. E XPERIMENTS SETUP We assess data-driven BNs by comparing them to ordinary least squares regression with a logarithmic transformation, which was found in [21] to be invariably among the most accurate predictors. We remind the reader once again that we are comparing a classifier, i.e., a discrete class predictor, to a regression technique, i.e., a numerical predictor. We do this by converting the BN’s class predictions to numeric ones by means of a variant of the method originally proposed in [18] which will be explained in subsection C. We decided to experiment performing a logarithmic transformation on the data prior to BNs’ building. So, this variant is included in the comparison amounting so far to three prediction systems. Furthermore, we also assess the effectiveness of feature subset selection [22] [15] as a pre-processing step. This technique has been employed with good results in this field, e.g., [23], [21], [20]. So, for each of the aforementioned models there is a variant with the application of feature selection prior to model building which multiplies by two the number of prediction systems. So, there are four variants of BNs and two variants of OLS regression amounting so far to six prediction systems. Finally, we include in the comparison mean and median based models like proposed in [13]. These models simply use the mean and median of all projects effort as a constant prediction. These are very simple benchmark models and an effective model should be able to be more accurate than them. The comparison with such models allows us to better assess the effectiveness of the other techniques by establishing a minimum benchmark of accuracy. The inclusion of such benchmark models is another recent trend proposed in several studies like [13] and [9], with the goal of verifying whether the models are effectively predicting and therefore bringing clarity to the results. So, with these two benchmark models we have in total eight prediction systems. An abstract outline of the experiments we carried out is shown in Fig. 2. We omitted the different versions of the dataset and the two models of BNs on log-transformed data to avoid cluttering up the figure, for intuitiveness’ sake. So, prepared data is an abstract entity which represents any of the datasets versions (log-transformed or not, and discretized or not) and besides the six prediction systems depicted in this figure there are two BNs on log-transformed data (with and without FSS) which are not shown. We will explain these procedures in the next sections. These experiments were carried out in the WEKA data mining tool [24]. The next subsections describe briefly the datasets, the conversion method necessary to compare the techniques and the metrics used to assess accuracy. Fig. 2. Experiments outline. A. Datasets A significant barrier for analysis of findings and replication of experiments has been the lack of publicly available datasets since the employment of proprietary datasets inhibits the replication of experiments and confirmation of results. 
The PROMISE repository [25] is an initiative that attempts to counter, to some extent, the lack of transparency that pervades this research field. Datasets are made available, allowing for replication and scrutiny of findings, with the intent of improving research efforts and stirring up analyses and discussion. In this work, we used three widely studied datasets available in the PROMISE repository [25]: the Desharnais, Maxwell and Cocomo81 datasets. These datasets are relatively clean in comparison to other datasets we have checked. They are local datasets, i.e., data was collected within a single company. Table I describes basic information on the datasets.

TABLE I
BASIC INFORMATION ON DATASETS

Data set      Local data   Domain            Effort unit     Range of years
Desharnais    Yes          Unknown           Person-Hours    1981-1988
Maxwell       Yes          Finnish bank      Person-Hours    1985-1993
Cocomo81      Yes          Various domains   Person-Months   1970-1981

The histograms in Fig. 3, Fig. 4 and Fig. 5 illustrate the distribution of data over effort, the dependent variable. Effort is measured in person-hours on Desharnais and Maxwell and in person-months of 152 hours on Cocomo81. In all three cases the variable is positively skewed, i.e., most records are situated towards lower values with a few very high outlying values. Desharnais is the least skewed of the three at 2.00, Maxwell is significantly more skewed at 3.35, and Cocomo81 is the most skewed at 4.48. Skewness is a very common characteristic of software project datasets, and it poses some hindrances for modeling. In order to carry out linear regression, these variables must be transformed so as to approximate a Gaussian distribution. With regard to BNs, this is also a problem, since the discretization could yield very uneven class intervals. In such a scenario, the equal-width discretization technique [26] [15] can produce empty classes and place most of the dataset population within just a couple of classes, making the validation highly dubious. If almost all of the data is within just a couple of classes, the model can hardly predict wrong or find meaningful patterns: a very high hit-rate would not be surprising, but the predictions would be meaningless.

Fig. 3. Desharnais data set.
Fig. 4. Maxwell data set.
Fig. 5. Cocomo81 data set.

When software managers carry out effort predictions, they do not know, for instance, how long a project will last, even though they may have an estimate. Therefore, variables whose values are unknown at the time the prediction is to be performed must be removed, e.g., 'Duration' and 'Defects'. This is standard practice in the software prediction field. On the other hand, when a sizing variable is quantified in function points it is usually included, since it can be obtained in the specification phase, depending on the process model. On the Desharnais dataset three variables were removed: the ID variable, 'YearEnd' and 'Length'. On the Maxwell dataset three variables were removed: 'Duration', 'Time' and 'Syear'. Finally, no variables were removed from the Cocomo81 dataset. In order to carry out OLS regression, we removed records with missing values. This amounts to four records on the Desharnais dataset and two records on the Maxwell dataset; there were no missing values on the Cocomo81 dataset. For the BN models all the records were kept. We also experimented with not removing the missing values for the OLS regression model by performing median imputation, and the difference on the Desharnais dataset, which is the one with more missing values, was minimal.
So, we decided to show the results on the dataset without the records with missing values, because these are the same data we used in our previous paper [14]. On the Maxwell dataset there were two records with missing values and on the Cocomo81 dataset there were none. The categorical variables were coded as dummy variables for the linear regression model, following good statistical practice [10]. That study also suggests the removal of outliers. Although this is standard practice for statistical procedures, we decided to keep the outliers for both models for two reasons: to keep the same conditions for both models, and chiefly because these outliers are actual projects which are rare but can happen; they are not noisy or irrelevant entries. Other studies in software prediction also keep the outliers, e.g., [21], [20]. For more detailed information on these datasets we refer the reader to [20] and to the original works referenced in the PROMISE repository [25].

B. Comparing the Predictions

The prediction systems are compared through numerical metrics. This has been another controversial topic, and there is no consensus on what is the most reliable metric [11]. The standard metric some years ago used to be MMRE [27], but due to some flaws it lost popularity. MMRE, like the other numerical metrics used in this study, is based on the magnitude of relative error (MRE). MRE is a measure of the relative error of the actual effort e_i against the predicted effort ê_i:

MRE_i = \frac{|e_i - \hat{e}_i|}{e_i}   (1)

MMRE measures the mean of all the predictions' MREs. This metric has not passed without criticism [27] [6], for it is highly affected by outliers and it favours models that underestimate. MMRE is biased towards underestimates because the magnitude of the error is unbounded when overestimating and limited to at most 1 (or 100%) when underestimating. This is well explained by means of a didactic example in [9]. This bias entails that models that tend to underestimate are likely to have smaller MREs overall, therefore performing better according to MRE-based metrics. Even though this bias is present in all MRE-based metrics, it is especially so in MMRE. MdMRE is the median of the MREs. It smoothes out MMRE's bias, for it is more robust to outliers. Wildly inaccurate predictions do not affect MdMRE as they do MMRE. So, on the one hand it shows which models are generally more accurate, but on the other hand it conceals which models can be occasionally very inaccurate. This effect is even more pronounced for the Pred metric, because it completely ignores the predictions with large errors. Pred measures how frequently predictions fall within a specified percentage of the actual effort; e.g., Pred25 tells us how often the predicted effort is within 25% of the project's actual effort (25 is a common parameter value for this metric). Therefore, this metric ignores the predictions whose errors are in excess of 25% magnitude, i.e., it does not matter for this metric whether the error is 30% or 200% (assuming Pred25). This is a limitation which we criticize about these metrics: obtaining a model whose predictions rarely lie too far from the actual value is certainly advantageous; this is a desirable quality in a model, and these metrics overlook this aspect. Several studies have proposed new metrics and discussed their characteristics, but none of them was widely adopted in the research field. MdMRE and Pred appear to still be the most popular.
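To make the metric definitions concrete, the sketch below (our own illustration, not the authors' evaluation code) computes MRE, MMRE, MdMRE and Pred25 for a pair of hypothetical actual/predicted effort vectors.

import numpy as np

def mre(actual, predicted):
    # Magnitude of relative error for each prediction (Equation 1).
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.abs(actual - predicted) / actual

def mmre(actual, predicted):
    return mre(actual, predicted).mean()

def mdmre(actual, predicted):
    return np.median(mre(actual, predicted))

def pred(actual, predicted, level=0.25):
    # Fraction of predictions within `level` (e.g. 25%) of the actual effort.
    return np.mean(mre(actual, predicted) <= level)

# Hypothetical effort values in person-hours (not taken from the studied datasets).
actual = [3000, 5200, 800, 12000]
predicted = [2700, 6500, 790, 7000]
print(mmre(actual, predicted), mdmre(actual, predicted), pred(actual, predicted))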
Foss et al. [27] concluded that every metric studied has flaws or limitations and that it is unlikely that a single, entirely reliable metric will be found. So, the use of complementary metrics is recommended. Miyazaki et al. [28], being the first to observe MRE's bias towards underestimates, proposed MBRE (Mean Balanced Relative Error). This metric addresses the flaw because it makes the relative error unbounded towards both underestimates and overestimates. By making the ratio relative to the lowest value (between the actual and the predicted value), the bias of MRE-based metrics is eliminated, therefore avoiding favouring models that underestimate.

BREi = |ei − êi| / min(ei, êi)    (2)

However, it has a flaw in that it does not account for negative predictions. Linear regression models can at times predict a negative number and therefore slightly distort the results under MBRE. Kitchenham et al. [12] propose the use of the absolute residuals as another alternative to bypass these problems of MRE-based metrics. MAR (Mean Absolute Residuals), being an absolute measure, also avoids this bias of ratio metrics like MRE. MAR has the disadvantage of not being comparable across different datasets.

MAR = (1/n) Σi |ei − êi|    (3)

TABLE II
NUMERICAL CONVERSION FOR BNs ON DESHARNAIS DATA SET

Prediction System                  MMRE    MdMRE   Pred    MAR
Bayesian Networks (mean)           70      35.65   38.27   2556.98
Bayesian Networks (median)         57.23   32.66   33.33   2153.52

TABLE III
NUMERICAL CONVERSION FOR BNs WITH FSS ON DESHARNAIS DATA SET

Prediction System                  MMRE    MdMRE   Pred    MAR
Bayesian Networks + FSS (mean)     68.94   35.49   39.5    2509.52
Bayesian Networks + FSS (median)   56.18   34.16   39.5    2133.84

We consider our selection of metrics to be robust, with MAR and MBRE being complementary to the MRE-based metrics and making the evaluation more reliable. Higher accuracy in MMRE, MdMRE, MAR and MBRE is inferred from lower values, whereas for the Pred metric, the higher the value the more accurate the model. In our result tables, the results under MMRE, MdMRE, Pred and MBRE are multiplied by 100 to keep them in a percentage perspective, e.g., 0.253 turns into 25.3.

C. Comparing the Bayesian Classifier to regression techniques

In order to compare BNs' results to linear regression we used a variant of the conversion method first proposed in [18], and also used in [13], in which the numerical prediction is the sum of the multiplication of each class' mean by its respective class probability, after the probabilities are normalized so that their sum equals one. Instead of using the mean, however, we used the median. Each class' median value Md is multiplied by its respective normalized class probability ρ, output in the probability distributions of the BN's predictions. See the formula below.

Effort = ρclass1 · Mdclass1 + ... + ρclassN · MdclassN    (4)

Like the aforementioned studies, we used the mean in a preliminary study [14]. We report here accuracy improvements under the MdMRE and Pred metrics and significant and consistent improvements in the MMRE and MAR results when using the median for the numerical conversion. This modification increased accuracy and lessened the amount of outliers, i.e., wildly inaccurate predictions. This happens because the mean of each class is more affected by outliers than the median. These datasets are positively skewed, therefore each class's mean value (and especially that of the highest effort class) will be closer to where the outliers are and farther from the majority of the data, pushing the numerical conversion of the output towards higher values. Therefore, when skewness is present the median is a more faithful and accurate representative of the data which makes up each class. Evidence supporting this reasoning is that the larger improvements were achieved on the Maxwell dataset, which is the more skewed of the two.
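As a minimal illustration of this conversion (Eq. 4), the Python sketch below applies it to a hypothetical probability distribution output by a BN for one project; the class names, medians and probabilities are ours, chosen for illustration only.

```python
def bn_to_point_estimate(class_probs, class_medians):
    """Numerical conversion of a BN's class-probability output (Eq. 4):
    normalize the probabilities, then sum each class's median effort
    weighted by its normalized probability."""
    total = sum(class_probs.values())
    return sum((p / total) * class_medians[c] for c, p in class_probs.items())

# Hypothetical discretized effort classes and their median effort (person-hours).
class_medians = {"Low": 1500.0, "Medium": 4000.0, "High": 9000.0, "Very High": 20000.0}
# Hypothetical probability distribution predicted by the BN for one project.
class_probs = {"Low": 0.10, "Medium": 0.60, "High": 0.25, "Very High": 0.05}

print(bn_to_point_estimate(class_probs, class_medians))  # 5800.0 person-hours
# Replacing the medians with the class means would push this estimate upwards on
# positively skewed data, since each class's mean sits closer to the outliers.
```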
The effectiveness of this modification can be seen in the tables. Table II shows the results for BNs on the Desharnais dataset. Table III shows the results for BNs with the employment of feature subset selection on the same dataset. Tables IV and V show the results on the Maxwell dataset. These tables compare the conversion with the mean against the conversion with the median. BNs with and without feature subset selection are different prediction systems, so the effect of the conversion method can be assessed by comparing the results within the same prediction system. Comparisons between the two prediction systems do not belong in this section and will be discussed in the results section. Here we are discussing only the improvements provided by this adaptation to the method proposed in [18].

TABLE IV
NUMERICAL CONVERSION FOR BNs ON MAXWELL DATA SET

Prediction System                  MMRE     MdMRE   Pred    MAR
Bayesian Networks (mean)           132.69   64.44   22.58   6726.23
Bayesian Networks (median)         86.18    58.77   24.19   4655.29

TABLE V
NUMERICAL CONVERSION FOR BNs WITH FSS ON MAXWELL DATA SET

Prediction System                  MMRE     MdMRE   Pred    MAR
Bayesian Networks + FSS (mean)     163.53   67.74   19.35   6281.83
Bayesian Networks + FSS (median)   97.50    55.99   27.42   4854.74

The effect of using the median in the conversion is quite clear for both prediction systems and in both datasets. However, on the Desharnais dataset under the Pred metric there is no improvement. This can be ascribed to the limitation of Pred discussed in the previous section: this metric ignores predictions whose errors are larger than the parameter used, i.e., 25. All errors over this threshold are ignored. So, an error that is reduced from 100% MRE to 50% MRE will not affect this metric despite being a valuable improvement. We can infer from this that the improvements happened in the predictions that lie outside the 25% error range, since all the other metrics clearly show there were improvements. We can see that the impact of this adaptation is quite significant on the Maxwell dataset, which is the more skewed one. This result can probably be more easily grasped in all detail by the reader after reading the analysis and discussion of results in the next section.

V. RESULTS AND ANALYSIS

Table VI reports on the results for the Desharnais dataset according to the continuous metrics previously exposed. On the Desharnais dataset there is an obvious improvement in the BNs' hit-rates when applying feature selection. The hit-rates are simply the percentage of times the classifier predicted the right class (therefore, the higher the value the more precise the model). However, when we consider the continuous metrics there were generally no improvements except under Pred. The Pred metric resembles the hit-rates in its characteristic of only considering the accurate predictions and ignoring predictions lying far from the actual value. This shows there were more accurate predictions, but also more wrong predictions, since the other metrics do not show improvements. This illustrates the limitation of the Pred metric that we highlighted in subsection B of the previous section. For OLS regression, there is a small improvement under MMRE, MdMRE, MAR and MBRE and a marginal degradation under the Pred metric.
The improvements were relatively small because the number of variables is already small in this case, and the feature selection technique cannot find much improvement by further decreasing it. The accuracy of the BNs on log-transformed data was about the same as on the non-transformed data; the log transformation did not bring improvements to the BNs' predictions. The BNs' performance was very consistent regardless of data pre-processing. So, on this dataset, BNs performed relatively well but were more prone to large inaccuracies. Finally, BNs clearly overcame the baseline models.

TABLE VI
MODELS PERFORMANCE ON DESHARNAIS DATA SET

Predictor       Hit-rate   MMRE     MdMRE   Pred    MAR       MBRE
BNs             46.91%     57.23    32.66   33.33   2153.52   65.83
BNs+FSS         54.32%     56.18    34.16   39.5    2133.84   64.52
BNs+log         44.44%     56.37    33.61   34.57   2128.65   67.38
BNs+log+FSS     48.15%     57.64    36.42   38.27   2165.47   72.19
OLS+log         -          37.62    29.19   46.75   1731.53   48.04
OLS+log+FSS     -          34.24    27.66   45.45   1567.93   42.54
Mean model      -          121.66   59.49   18.51   3161.52   140.04
Median model    -          78.46    42.03   29.62   2861.53   120.42

Table VII reports on the results for the Maxwell dataset. Being the dataset with the largest number of variables in this study, it is likely to contain irrelevant variables and to benefit the most from feature selection. This expectation is fulfilled for OLS regression: feature selection reduced the mean of the residuals by half, and all the other metrics show large improvements as well. But again, as on the Desharnais dataset, the BNs' performance did not improve convincingly with the application of feature selection. There is a clear improvement in the hit-rates and an improvement under the Pred metric, but the other metrics show that the increase in good predictions (i.e., predictions close to the actual value) was offset by larger errors. It is interesting to observe that when the data did not undergo feature selection, the performance of BNs is comparable to the performance of OLS regression. But with the application of feature selection OLS regression gains a large improvement in accuracy, as opposed to BNs, which do not gain any improvement. This highlights that the BN models are missing very significant improvements in accuracy which are expected with the application of feature selection. With regard to the logarithmic transformation, the results show small improvements for BNs under all metrics but Pred, as opposed to the Desharnais dataset, on which there was no effect. As on the Desharnais dataset, BNs clearly overcame the baseline models. In our view, an important observation on this dataset is the improvement with feature selection that is being missed by BNs. We will discuss the reasons for this after presenting all results.

TABLE VII
MODELS PERFORMANCE ON MAXWELL DATA SET

Predictor       Hit-rate   MMRE     MdMRE   Pred    MAR       MBRE
BNs             40.32%     86.18    58.77   24.19   4655.29   110.48
BNs+FSS         51.61%     97.5     55.99   27.41   4854.74   122.19
BNs+log         40.32%     73.41    52.72   19.35   4550.90   104.54
BNs+log+FSS     51.61%     70.67    53.88   25.81   4576.05   106.44
OLS+log         -          76.86    43.78   30      4932.6    101.19
OLS+log+FSS     -          42.57    28.62   40      2500.04   52.28
Mean model      -          119.67   52.96   19.35   5616.54   225.64
Median model    -          108.95   66.28   20.97   5654.11   180.91

Table VIII reports on the results for Cocomo81. On this dataset, the logarithmic transformation did yield an observable improvement in the BNs' predictions, especially under MMRE. This suggests a decrease in large overestimates.
We can observe that the difference in performance compared to OLS regression grew in comparison to the previous datasets, even though this effect can be slightly reduced by the application of the logarithmic transformation. Feature selection brought an improvement for OLS regression, though not as pronounced as on Maxwell. For BNs, the same pattern of improved hit-rates and no improvements under the other metrics that was observed on the other datasets holds on this dataset. This appears to be related to the skewness of the datasets and the loss of precision brought about by the discretization process. Skewness increases this imprecision because it makes the classes more uneven. The logarithmic transformation is only able to reduce this effect to some extent. Nevertheless, even on this very skewed dataset the BNs were able to overcome both baseline models.

Table IX shows the frequency of underestimates and overestimates for each model over the three datasets. The OLS models have a tendency to underestimate, which is considered less desirable than a tendency to overestimate. The variables most frequently identified by the feature selection algorithm were related to ‘Size’. In all datasets studied here, a size variable was selected. This variable appears to be frequently the one with the highest predictive value for effort estimation.

TABLE VIII
MODELS PERFORMANCE ON COCOMO81 DATA SET

Predictor       Hit-rate   MMRE      MdMRE    Pred    MAR      MBRE
BNs             50.79%     134.85    58.64    25.81   551.95   197.82
BNs+FSS         55.56%     270.64    130.37   9.68    606.22   336.39
BNs+log         52.38%     91.19     53.64    19.35   536.54   233.15
BNs+log+FSS     55.56%     76.94     64.93    25.81   530.61   212.73
OLS+log         -          46.6      30.49    44.44   278      61.83
OLS+log+FSS     -          44.28     22.98    53.96   297.47   55.97
Mean model      -          1775.35   571.16   4.76    891.64   1905.81
Median model    -          235.42    86.25    15.87   642.63   842.24

We can observe in all of these results that feature selection clearly and consistently improved the hit-rates of BNs and the accuracy of linear regression over all datasets. This effect is very pronounced on the Maxwell dataset, which is the one with the highest number of variables. Such improvements are expected because the larger the number of variables in a dataset, the more likely it is for the dataset to contain irrelevant or redundant variables. This emphasizes the importance of applying feature selection, especially on datasets with many variables. It also highlights the fact that many variables in software project datasets have small predictive value and can actually make the models less accurate. Therefore, collecting a smaller number of variables while focusing on high data quality may be more interesting for data-based predictions. This finding confirms the findings of previous studies, e.g., [23], [21] and [20].

Fig. 6. BNs missing expected improvements from FSS.

In spite of these clear improvements, however, we can see that the improvement of the BNs' predictions when measured by the continuous metrics was small, or at times the accuracy even worsened. This is especially the case on the Maxwell and Cocomo81 datasets, on which the predictions were significantly less accurate than without feature selection, contrary to what one would expect. This contradiction is illustrated in Fig. 6, where we can see improvements in hit-rates and a degradation according to MBRE. According to the data mining literature, wrapper approaches like the one applied here use the algorithm's own accuracy measure to assess the feature subset [22] [15]. And it is obvious that the BN algorithm is not using this numerical conversion to measure accuracy.
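To make this contrast concrete, the following purely hypothetical Python outline (not WEKA's implementation) shows how a wrapper-style subset evaluation could score candidate feature subsets either by hit-rate or by the error of the numerically converted predictions; `build_and_predict` and the feature names are placeholders for the real pipeline.

```python
from statistics import mean

def hit_rate(true_classes, predicted_classes):
    """Share of exactly-right class predictions (the criterion the wrapper optimizes here)."""
    hits = sum(t == p for t, p in zip(true_classes, predicted_classes))
    return hits / len(true_classes)

def mar_after_conversion(true_efforts, converted_efforts):
    """Mean absolute residual of the numerically converted predictions
    (the alternative criterion suggested in the text)."""
    return mean(abs(t - p) for t, p in zip(true_efforts, converted_efforts))

def evaluate_subset(feature_subset, build_and_predict, score):
    """Wrapper idea: train and validate the classifier on `feature_subset`
    and score the resulting predictions with the supplied criterion."""
    true_values, predicted_values = build_and_predict(feature_subset)
    return score(true_values, predicted_values)

# Toy stand-in for the real train/validate pipeline (ignores the subset).
def fake_pipeline(feature_subset):
    return [2000, 3500, 8000], [1800, 5000, 7000]

print(evaluate_subset({"Size", "Complexity"}, fake_pipeline, mar_after_conversion))  # 900.0
# A BestFirst-style search would call evaluate_subset(...) for every candidate
# subset; swapping hit_rate (maximized) for mar_after_conversion (minimized)
# is the modification discussed above.
```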
The model selection is clearly favouring the hit-rates. This brings into question the validity of hit-rates as an accuracy measure, or at least highlights their limitation. Improved hit-rates were offset by larger-magnitude errors, i.e., fewer wrong predictions, but when the predictions were wrong they were wrong by a larger margin. This could also be seen in the confusion matrices, but they were omitted due to lack of space. So, does the improved hit-rate really reflect a more accurate model?

In all these experiments, BNs ended up missing the improvements expected from feature selection. This could make a significant difference on the Maxwell and Cocomo81 datasets, which are the ones with the larger numbers of variables. It follows from this observation that an interesting development for BNs would be to investigate the feasibility of incorporating this numerical conversion into the BN algorithms and tools, using it as a measure of accuracy instead of the hit-rates or error-rates. This modification could bring some improvements in the predictions and also in the effect of the feature selection technique. The application of feature selection would find improvements in overall accuracy even if with lower hit-rates. As it is, the potential improvements expected from feature selection are being wasted in the pursuit of higher hit-rates. Alternatively, a suggestion for future research is to experiment with other BN search algorithms, score types and CPT estimators and check whether these bypass this focus on hit-rates. In this study we restricted ourselves to the K2 search algorithm [29] with the Bayes method for scoring the networks and the Simple estimator to estimate the NPTs.

TABLE IX
FREQUENCY OF UNDERESTIMATES AND OVERESTIMATES

Prediction System   Overestimates (count)   Underestimates (count)
BNs                 110                     96
BNs + FSS           127                     79
BNs + log           99                      107
BNs + log + FSS     104                     102
OLS + log           86                      114
OLS + log + FSS     91                      109

We can also observe a trend in these results: BNs' accuracy degrades with the datasets' skewness. As skewness increases, BNs struggle to predict accurately. The BNs' best performance in these experiments was achieved on the least skewed dataset, i.e., Desharnais. When the data is too skewed the discretized classes become too uneven and there is an increased loss of precision in the largest discretized intervals. The highest effort classes tend to be very sparse. An example is the highest effort class defined for the Maxwell dataset, which spans a wider interval than all the others put together (it ranges from 10000 to 64000 person-hours), thus being very imprecise. Besides the effect on the discretization, there is also an effect on the numerical conversion, because even a small probability of the highest effort class (Very High) affects the conversion quite significantly. In Fig. 7 we illustrate this degradation by dividing the error margin of BNs by the error of OLS, for each dataset and according to two metrics. We can see that the BNs' error margin increases significantly in comparison to OLS as the skewness of the dataset increases, under both metrics (datasets are sorted from left to right according to skewness).

Fig. 7. Accuracy degradation of BNs according to dataset skewness.

Much of the imprecision of the BNs can be ascribed to the discretization process. This subject has been neglected to some extent in this research field, and the establishment of guidelines on it could benefit research initiatives.
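A minimal Python sketch of this effect on a small, hypothetical and positively skewed effort sample (the values are illustrative only): equal-width bins leave the upper classes nearly empty, while equal-frequency bins keep every class populated at the cost of a very wide top interval.

```python
def equal_width_edges(values, k):
    """Cut points that split the value range into k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_edges(values, k):
    """Cut points chosen so that each class holds roughly the same number of records."""
    ordered = sorted(values)
    step = len(ordered) / k
    return [ordered[int(i * step)] for i in range(1, k)]

def class_counts(values, edges):
    """How many records fall into each of the classes defined by the cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1
    return counts

# Hypothetical, positively skewed effort values (person-hours).
efforts = [300, 450, 500, 600, 700, 800, 900, 1100, 1500, 2200, 4000, 12000]

print(class_counts(efforts, equal_width_edges(efforts, 4)))      # [10, 1, 0, 1]: nearly everything in the first class
print(class_counts(efforts, equal_frequency_edges(efforts, 4)))  # [4, 3, 3, 2]: balanced, but the top class spans 2200-12000
```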
The imprecision brought about by the discretization process is directly related to the skewness of the datasets. In this scenario of highly skewed datasets, the equal-frequencies discretization generates class intervals of very different widths, and the numerical conversion will show larger error margins. The alternative of equal-widths discretization produces meaningless results, for there will be empty or nearly empty classes and the learned model will simply state the obvious, predicting nearly always the same class, namely the lowest effort class, since it contains most of the records. High hit-rates are not only unsurprising but very likely when using equal-widths on very skewed datasets. Unless a log transformation is applied to the data, predictions based on skewed data discretized with the equal-widths method yield misleading results. Related to these findings are the results of [30], which compared equal-widths, equal-frequencies and k-means discretization on a subset of a well-known dataset and concluded that equal-frequencies with a log transformation can improve the accuracy results according to most evaluation criteria. Further investigations of discretization methods are necessary.

An interesting undertaking was to investigate the effect of the log transformation on the Bayesian classifier. Even though a couple of studies used this transformation, we are not aware of studies assessing its effects. The log transformation was able to provide only slight improvements in accuracy. The results show that for very skewed datasets, transforming the data can be beneficial. Fig. 8 illustrates this improvement according to the MdMRE metric. As another suggestion for future research, we observe that it would be interesting to try out this data transformation with BNs that support continuous variables, since in these experiments much of the benefit of performing this transformation appears to have been lost in the discretization.

Fig. 8. Effect of log transformation on BNs (MdMRE).

These experiments on data-driven BNs are relevant because the way data is explored can have a significant impact on the model's performance. Much of the excitement over BNs revolves around their capability to integrate objective and subjective knowledge. Therefore, learning how to optimize the use of data (i.e., the objective part) can improve the performance not only of data-driven BNs, but also of hybrid BNs, which appear to be the most promising for this research field. Even though BNs solely based on data may not become the most accurate approach in software effort prediction, improvements in the use of data for BNs benefit the technique as a whole and, given its relevance in software engineering, these investigations are necessary. Optimizing the performance of the data mining capabilities of BNs is an essential part of the development of this modelling technique.

Our results on these datasets are more optimistic for BNs than the ones reported in [13], which were obtained on another dataset. Our experiments show that the BN models struggle on very skewed datasets but are still capable of achieving a minimum standard of accuracy. In [13], most BNs, including hybrid BNs, performed worse than the baseline models. Fig. 9 compares the BN prediction systems to the baseline models according to the MBRE metric.

Fig. 9. Comparison between BNs and baseline models (MBRE).

From our study of the literature and our own experiments, we observe that it appears to be hard to overcome OLS regression when it is properly applied.
Our results on OLS regression confirm the results of [21] and of [13]. While OLS regression does perform better with regard to accuracy, one must observe that OLS regression, as a well-established statistical technique, is already highly optimized. On the other hand, we have shown in this study that techniques like BNs still have room for improvement and are under constant development. As BN theory evolves and the tools catch up with these developments, more accurate predictions will be possible. Ideally, if data-driven BNs catch up with OLS regression, they will be very advantageous due to their flexibility and powerful experimentation features. When such a standard is achieved, BN users will be able to trust that this technique explores data as well as the most accurate data-based models do.

Specifically, we have observed room for improvement for BNs with regard to discretization techniques and to experimenting with different model selection methods, which could provide improvements in accuracy under metrics other than the hit-rates and also optimize the effect of feature selection. This appears to be a fundamental problem. Furthermore, there are developments in data mining research concerning support for ordinal and continuous variables. These could also bring further improvements in accuracy. And besides these improvements in BNs' data mining capabilities, there are also improvements concerning support for experts' model building. The BN tools are currently a limitation [19]: the latest developments are not available in most of the tools. In these experiments we did not have the opportunity to experiment with continuous variables nor with dynamic discretization. It would be interesting to verify the improvements that techniques like the dynamic discretization proposed in [31] could bring. Although WEKA offers validation advantages over other tools, it does not include other developments from BN theory. As we already mentioned, an interesting development would be the incorporation of the numerical conversion method. This conversion is not automated in the tools and it can be somewhat cumbersome to perform, which may hinder its adoption. Having this conversion automated in the tools would be valuable.

Some studies on BNs indicate that BNs' main strength for the software prediction area lies in their ability to incorporate domain knowledge and qualitative factors, therefore favouring hybrid or expert-driven approaches. Currently, an advantage of data-driven models like these, as pointed out in [20], is that by owning a projects dataset it is possible to obtain quick predictions as supporting evidence for the expert's prediction, as opposed to expert-based networks, which take much more effort to build and to have their NPTs elicited. The employment of data-based models to support expert estimates has been recommended to practitioners as a means to increase the safety and reliability of experts' estimates, since the situation with expert-based estimation has been no better than the one seen in this research field. Finally, an observation arising from this study and from the difficulties in the field is that it is important to show faithful and realistic results, even if they are not positive towards a particular technique. This research field has suffered in the last twenty years from over-optimism towards some techniques.
In recent years, efforts towards correcting inconsistencies and addressing the reasons for conflicting results are on the rise, even if they reveal a less than flattering state of affairs in the field. To move forward it is important to recognize the actual situation, paving the way for improvements and solutions.

VI. CONCLUSION

This study provided a sound assessment of automatically built BNs by means of a comparison with a well-established statistical technique and with benchmark models, thereby illustrating their current limitations and possibilities for improvement. BNs' limitations are discussed and some guidelines on their employment are provided. Specifically, the skewness of the datasets prevalent in this research field and the discretization are shown to bring about inaccuracies that limit BNs' effectiveness. One suggestion arising from these observations and set forth to the research community is to investigate the feasibility of incorporating the numerical conversion into BN model building, as we consider it portrays accuracy more faithfully than the basic hit-rates. This could make BN models generally more accurate even if they achieve lower hit-rates. Also, the inclusion of this conversion in the tools would be helpful for research undertakings.

We consider that this study discusses important matters that are scarcely discussed in software prediction studies and that can be a source of confusion. Most studies have not paid much attention to dataset properties and their implications for the models' functioning. Shedding light on these somewhat neglected topics is an important step towards addressing some of the current difficulties in the field. This study showed some of the problems arising from the datasets in the field and the constraints they impose, especially on classifiers. Much of this is related to the discretization process and the uneven classes it generates. We brought forward some points concerning the exploration of data which we believe to be important for the development of BNs. There is a limit on how accurate data-driven prediction techniques can be depending on the data used. Therefore, more effort should be devoted to studying the properties of software prediction datasets and data pre-processing in order to increase prediction accuracy. The performance of these models is highly dependent on data quality, a subject that has not received sufficient attention. Significant improvements could come from investigations on this.

Our observations indicate that BNs have potential for data-based predictions but still need improvements to catch up with the most accurate data-based models. In spite of the apparent advantage of linear models in this scenario, i.e., data-driven modeling, it must be observed that this is only part of the potential of BNs. BNs offer experimentation possibilities beyond those of linear regression. The linear regression method can only provide a point estimate, whereas BNs meet other requirements expected from a prediction model. Furthermore, due to the human factors and inherent uncertainties in software projects, the capability to incorporate experts' subjective knowledge can provide an advantage over models solely based on data. Bayesian Networks appear to be one of the most suitable techniques for future progress in this respect. BN theory and tools are under constant development, and some technical breakthroughs regarding discretization and NPT elicitation appear to herald progress for BNs in software prediction and software project management in general.
A. Future Work

A topic that could provide some improvements for the software prediction field and that warrants investigation is data pre-processing. While carrying out this work we observed the impact that discretization, data transformations and feature selection can have on the models' performance. Moreover, we observed the implications of and hindrances posed by the characteristics of software project datasets. In our view, discretization is a topic that needs thorough investigation, as there are currently no guidelines on it.

In this work we applied a specific feature subset selection technique (a wrapper approach with the BestFirst algorithm). It would be interesting to assess whether other feature selection techniques can bypass the focus on hit-rates that this wrapper approach demonstrated. Good improvements could be obtained if BNs could better extract the accuracy improvements expected from feature selection. Another suggestion is to experiment with other learning and selection algorithms, as in this work we restricted ourselves to the K2 search algorithm with the Bayes method for scoring the networks and the Simple estimator to estimate the NPTs. We expect that other algorithms could assess accuracy in a different way, as in this study the algorithms clearly favoured the hit-rates, which we questioned as an accuracy measure. Furthermore, investigating BNs with continuous variables and the related pre-processing procedures could yield interesting results. Also, statistical significance tests could be performed to strengthen the validation of the results.

REFERENCES

[1] B. W. Boehm, Software Engineering Economics. Englewood Cliffs, NJ: Prentice Hall, 1981.
[2] N. E. Fenton and M. Neil, “A critique of software defect prediction models,” IEEE Trans. Softw. Eng., vol. 25, no. 5, pp. 675–689, 1999.
[3] G. R. Finnie, G. E. Wittig, and J.-M. Desharnais, “A comparison of software effort estimation techniques: using function points with neural networks, case-based reasoning and regression models,” J. Syst. Softw., vol. 39, no. 3, pp. 281–289, Dec. 1997. [Online]. Available: http://dx.doi.org/10.1016/S0164-1212(97)00055-1
[4] M. Shepperd and C. Schofield, “Estimating software project effort using analogies,” IEEE Trans. Softw. Eng., vol. 23, pp. 736–743, Nov. 1997. [Online]. Available: http://dl.acm.org/citation.cfm?id=269857.269863
[5] J. Wen, S. Li, Z. Lin, Y. Hu, and C. Huang, “Systematic literature review of machine learning based software development effort estimation models,” Inf. Softw. Technol., vol. 54, no. 1, pp. 41–59, Jan. 2012. [Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.09.002
[6] M. Korte and D. Port, “Confidence in software cost estimation results based on mmre and pred,” in Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, ser. PROMISE ’08. New York, NY, USA: ACM, 2008, pp. 63–70. [Online]. Available: http://doi.acm.org/10.1145/1370788.1370804
[7] C. Mair and M. J. Shepperd, “The consistency of empirical comparisons of regression and analogy-based software project cost prediction,” in Proceedings ISESE’05, 2005, pp. 509–518.
[8] T. Menzies, O. Jalali, J. Hihn, D. Baker, and K. Lum, “Stable rankings for different effort models,” Automated Software Engg., vol. 17, pp. 409–437, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1007/s10515-010-0070-z
[9] M. Shepperd and S. MacDonell, “Evaluating prediction systems in software project estimation,” Inf. Softw. Technol., vol. 54, no. 8, pp. 820–827, Aug. 2012.
[Online]. Available: http://dx.doi.org/10.1016/j.infsof.2011.12.008
[10] B. Kitchenham and E. Mendes, “Why comparative effort prediction studies may be invalid,” in Proceedings of the 5th International Conference on Predictor Models in Software Engineering, PROMISE ’09. New York, NY, USA: ACM, 2009, pp. 1–5.
[11] I. Myrtveit, E. Stensrud, and M. Shepperd, “Reliability and validity in comparative studies of software prediction models,” IEEE Trans. Softw. Eng., vol. 31, no. 5, pp. 380–391, May 2005. [Online]. Available: http://dx.doi.org/10.1109/TSE.2005.58
[12] B. Kitchenham, L. Pickard, S. G. MacDonell, and M. J. Shepperd, “What accuracy statistics really measure,” IEE Proceedings - Software, vol. 148, no. 3, pp. 81–85, 2001.
[13] E. Mendes and N. Mosley, “Bayesian network models for web effort prediction: A comparative study,” IEEE Transactions on Software Engineering, vol. 34, no. 6, pp. 723–737, 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4589218
[14] I. A. Tierno and D. J. Nunes, “Assessment of automatically built bayesian networks in software effort prediction,” Ibero-American Conference on Software Engineering, Buenos Aires, Argentina, pp. 196–209, Apr. 2012.
[15] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
[16] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
[17] M. N. N. Fenton and L. Radlinski, “Software project and quality modelling using bayesian networks,” in Artificial Intelligence Applications for Improved Software Engineering Development: New Prospects (part of the Advances in Intelligent Information Technologies (AIIT) Book Series). Information Science Reference, ISBN: 978-1-60566-758-4, 2009, pp. 223–231, edited by F. Meziane and S. Vadera.
[18] P. C. Pendharkar, G. H. Subramanian, and J. A. Rodger, “A probabilistic model for predicting software development effort,” IEEE Trans. Softw. Eng., vol. 31, no. 7, pp. 615–624, 2005.
[19] L. Radlinski, “A survey of bayesian net models for software development effort prediction,” International Journal of Software Engineering and Computing, vol. 2, no. 2, pp. 95–109, 2010.
[20] L. Radlinski and W. Hoffmann, “On predicting software development effort using machine learning techniques and local data,” International Journal of Software Engineering and Computing, vol. 2, no. 2, pp. 123–136, 2010.
[21] K. Dejaeger, W. Verbeke, D. Martens, and B. Baesens, “Data mining techniques for software effort estimation: A comparative study,” IEEE Trans. Software Eng., vol. 38, no. 2, pp. 375–397, 2012.
[22] M. A. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class data mining,” IEEE Trans. on Knowl. and Data Eng., vol. 15, no. 6, pp. 1437–1447, 2003.
[23] Z. Chen, B. Boehm, T. Menzies, and D. Port, “Finding the right data for software cost modeling,” IEEE Softw., vol. 22, no. 6, pp. 38–46, 2005.
[24] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The weka data mining software: an update,” SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, 2009. [Online]. Available: http://dx.doi.org/10.1145/1656274.1656278
[25] T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan, “The promise repository of empirical software engineering data. Available at: <http://promisedata.googlecode.com>,
viewed on Apr. 16, 2013,” June 2012. [Online]. Available: http://promisedata.googlecode.com
[26] H. Liu, F. Hussain, C. L. Tan, and M. Dash, “Discretization: An enabling technique,” Data Min. Knowl. Discov., vol. 6, pp. 393–423, Oct. 2002. [Online]. Available: http://dl.acm.org/citation.cfm?id=593435.593535
[27] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, “A simulation study of the model evaluation criterion mmre,” IEEE Trans. Softw. Eng., vol. 29, pp. 985–995, Nov. 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=951850.951936
[28] Y. Miyazaki, A. Takanou, H. Nozaki, N. Nakagawa, and K. Okada, “Method to estimate parameter values in software prediction models,” Inf. Softw. Technol., vol. 33, no. 3, pp. 239–243, Apr. 1991. [Online]. Available: http://dx.doi.org/10.1016/0950-5849(91)90139-3
[29] G. F. Cooper and E. Herskovits, “A bayesian method for the induction of probabilistic networks from data,” Mach. Learn., vol. 9, pp. 309–347, Oct. 1992. [Online]. Available: http://dl.acm.org/citation.cfm?id=145254.145259
[30] M. Fernández-Diego and J.-M. Torralba-Martínez, “Discretization methods for nbc in effort estimation: an empirical comparison based on isbsg projects,” in Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ser. ESEM ’12. New York, NY, USA: ACM, 2012, pp. 103–106. [Online]. Available: http://doi.acm.org/10.1145/2372251.2372268
[31] M. Neil, M. Tailor, and D. Marquez, “Inference in hybrid bayesian networks using dynamic discretization,” Statistics and Computing, vol. 17, pp. 219–233, 2007. [Online]. Available: http://dl.acm.org/citation.cfm?id=1285820.1285821