Mining Source Code Repositories with Solr

Exploring Full-Text Search

www.garysieling.com/jug

In this talk I'm going to discuss full-text indexing in Solr, which is an open-source Java tool that wraps Lucene. I'll show you how to build up an example project which indexes the contents of git source code repositories.

Some of you may be familiar with products like Atlassian's Fisheye, code search engines like Krugle, or GitHub's search. The nice thing about these is that they let you do more than the out-of-the-box 'blame' tool in your version control software, such as finding out who ported a feature between branches, or searching both git and CVS repositories.

Since there are mature products for this type of thing, this talk isn't intended to replace them; rather, it's a well-contained way to understand full-text search and to explore the design challenges in building ETL processes.

Overview

Demo
How-to
Inspiration

Motivation

Which engineer worked on a particular client/project/technology?

Project Structure

Search for Experts In...

Corporate Open Source...

Motivation

Who is an expert in technology X?

In some large companies, it's also helpful to know how many engineers have worked with a particular technology, and what the trends are. For instance, Microsoft's MSDN program awards points based on certifications of developers in the organization, so it's useful to know if you have a lot of certified C# developers, or people who could go get a certification if needed.

You can answer these types of questions by counting the number of commits a person has that reference a client or tool, which is essentially faceted search. If you imagine the UI for Amazon, the left-hand side shows you categories and category counts for fields that apply to items in your search results, such as the genre or studio for a movie. What we are going to do here is build a facet on the author or other commit attributes.

This is different from a typical search, where you're looking for individual commits, and it serves different use cases. This technique works especially well if your company is diligent about what goes into each commit; counting commits clearly will not help if your average commit is infrequent and touches hundreds of files.
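To make the idea of a facet concrete, here is a minimal plain-Java sketch of what the facet computes: keep the commits whose message mentions a term, then count them per author. This is not how Solr does it internally; the Commit class and the countByAuthor method are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class AuthorFacet {
  /** Minimal stand-in for an indexed commit (hypothetical, for illustration only). */
  static final class Commit {
    final String author;
    final String message;

    Commit(String author, String message) {
      this.author = author;
      this.message = message;
    }
  }

  /** Counts, per author, the commits whose message mentions the term (case-insensitive). */
  static Map<String, Integer> countByAuthor(List<Commit> commits, String term) {
    String needle = term.toLowerCase(Locale.ROOT);
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Commit c : commits) {
      if (c.message.toLowerCase(Locale.ROOT).contains(needle)) {
        Integer current = counts.get(c.author);
        counts.put(c.author, current == null ? 1 : current + 1);
      }
    }
    return counts;
  }
}
```

The result maps each author to a count, which is the same shape of data a facet on the author field gives back for a search.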

Solr - Schema


<field name="id" type="string" indexed="true" stored="false" required="true" /> 
<field name="author" type="git_author" indexed="true" stored="true" required="true" /> 
<field name="company" type="string" indexed="true" stored="true" required="true" /> 
<field name="year" type="string" indexed="true" stored="true" required="true" /> 
<field name="email" type="string" indexed="true" stored="false" required="true" /> 
<field name="message" type="string" indexed="true" stored="false" required="true" /> 
<field name="search" type="text_general" indexed="true" stored="false" required="false" />

Stats for Live Demo

2.1 GB Git History => 132 MB Solr Repository
Index of 232,839 commits
Drupal, Git, Lucene, Lucene.NET, Solr, Mono, Node.JS, PHP, Postgres, Vagrant

Wingspan Source Code

2,000 MB Git Archive => 90 MB Solr Index
Works well for finding project leads
Works well for finding technology experts

Github Archive

202 GB Git Archive => 1 GB Solr Index
18,000 repositories
4,554,502
2-3 hour conversion process
Developer workstation (i7 / 16 GB RAM)

What is Full Text Search?

The dogs ran home => dog run home

The dog runs home => dog run home
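As a rough sketch of why both sentences end up as the same tokens, the toy normalizer below lowercases the text, drops a few stop words, maps one irregular verb form by hand, and strips a trailing "s". The word lists and the ToyAnalyzer name are assumptions made for this example; Solr's real analyzers are configured in schema.xml and are far more thorough.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class ToyAnalyzer {
  // Tiny hand-written lists, just enough for the two example sentences.
  private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("the", "a", "an"));
  private static final Map<String, String> IRREGULAR = new HashMap<>();

  static {
    IRREGULAR.put("ran", "run");
  }

  /** Lowercases, removes stop words, and reduces each word to a crude base form. */
  static String normalize(String text) {
    List<String> out = new ArrayList<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
      if (token.isEmpty() || STOP_WORDS.contains(token)) {
        continue;
      }
      String base = IRREGULAR.containsKey(token) ? IRREGULAR.get(token) : token;
      // Very naive plural / third-person handling: drop a trailing "s".
      if (base.length() > 3 && base.endsWith("s")) {
        base = base.substring(0, base.length() - 1);
      }
      out.add(base);
    }
    return String.join(" ", out);
  }
}
```

Because the query is normalized the same way as the indexed text, a search for "dogs running" style variants can match documents that only contain "dog" or "run".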

Full Text Search vs. Google

What is Solr?

How this is built

Front-end

Back-end

JSON API

/select/?q=search:sql

&version=2.2

&start=0&rows=0

&indent=on

&facet=on

&facet.field=author

&facet.method=fc

&facet.limit=30

&wt=json

Shown here is an example URL for Solr's REST API with faceting turned on, and an example of what the join syntax looks like. Solr comes with several query parsers. This enables some interesting functionality: rather than having a single, detailed query language like SQL, some applications nest multiple query parsers.

The join case allows you to join the table to itself - if you wanted to use this to join in an ACL, for instance, you could add new columns for the ACL options, and an identifier which tells you which type of object a row is, essentially using a partitioning concept to build sub-tables.

facet.method has three options, which control the execution path. One (enum) starts with an enumerable field and counts, for each term, its intersection with the documents that match the query. The default (fc) loops through the documents that match the query, summing as it goes. The third (fcs) splits that work across sub-indexes and then adds the counts back together.

Return types - csv, xml, json, php, python, ruby, javabin
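The request parameters from this slide can also be assembled in Java. The sketch below only builds the URL string with the standard library; the class and method names, and the base URL in the usage note, are assumptions for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FacetQueryUrl {
  /** Builds a facet-only select URL: no rows returned, just counts per facet value. */
  static String build(String baseUrl, String term, String facetField, int limit) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("q", "search:" + term);
    params.put("start", "0");
    params.put("rows", "0"); // we only want the facet counts
    params.put("facet", "on");
    params.put("facet.field", facetField);
    params.put("facet.method", "fc");
    params.put("facet.limit", String.valueOf(limit));
    params.put("wt", "json");

    StringBuilder url = new StringBuilder(baseUrl).append("/select/?");
    boolean first = true;
    for (Map.Entry<String, String> e : params.entrySet()) {
      if (!first) {
        url.append('&');
      }
      url.append(e.getKey()).append('=').append(encode(e.getValue()));
      first = false;
    }
    return url.toString();
  }

  private static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always available", e);
    }
  }
}
```

For example, FacetQueryUrl.build("http://localhost:8080/solr", "sql", "author", 30) produces the same parameters as the slide, with the colon in the query percent-encoded.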

Building a facet UI


var url = 'http://solr/core1/select/?q=search:jug';
$.ajax({
  type: "GET",
  url: url,
  success: function(response){
    // display search results
  }
});
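One detail to handle when displaying facets: as far as I know, with wt=json Solr returns each facet field by default as a flat list that alternates value and count (for example "alice", 12, "bob", 7). The sketch below, written in Java to match the back-end code in this talk, pairs such a list up into a map. It assumes the JSON has already been parsed into a List by whatever JSON library you use; FacetCounts is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FacetCounts {
  /** Turns [value, count, value, count, ...] into an ordered value -> count map. */
  static Map<String, Long> fromFlatList(List<?> flat) {
    Map<String, Long> counts = new LinkedHashMap<>();
    // Step two at a time; a dangling last element without a count is ignored.
    for (int i = 0; i + 1 < flat.size(); i += 2) {
      String value = String.valueOf(flat.get(i));
      long count = ((Number) flat.get(i + 1)).longValue();
      counts.put(value, count);
    }
    return counts;
  }
}
```

The map keeps the order Solr returned, so the most frequent authors stay at the top when you render the list.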

Backend ETL

Extracting Git Commits
Transforming Commits
Loading data to Solr

Why do you need Java code?

Extract data (e.g. PDF contents)
Data transformations (email -> company)

Denormalized data

Row = document

Document comes from PDF / Office / etc

Must eval document yourself

Fiddle with data as needed

Solr repositories resemble a single, denormalized database table, which Solr calls a core. Each row in the table is referred to as a "document", which comes from the idea that you may have indexed the contents of a PDF, HTML file, or Word document, although there is no particular requirement on what a document must be.

Even though Solr refers to rows as documents, a Solr index just stores text, so if you started with a bunch of PDFs you'd have to build your own processing pipeline to extract and process text from them.

Some customizations can be done with configurations in Solr - if these work, they tend to be pretty easy to figure out, and when you can't figure them out quickly, you usually need to resort to code.

Extracting Data from Git


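import java.io.File;

import org.eclipse.jgit.lib.Ref;
import org.eclipse.jgit.lib.Repository;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.revwalk.RevWalk;
import org.eclipse.jgit.storage.file.FileRepositoryBuilder;

// Open the repository; path points at its .git directory
FileRepositoryBuilder builder = new FileRepositoryBuilder();
Repository repository = 
  builder.setGitDir(new File(path))
    .build();

RevWalk walk = new RevWalk(repository);

// Start the walk at HEAD
for (Ref ref : repository.getAllRefs().values()) {
  if ("HEAD".equals(ref.getName())) {
    walk.markStart(walk.parseCommit(ref.getObjectId()));
    break;
  }
}

// Visit every commit reachable from HEAD
for (RevCommit commit : walk) {
  ...
}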

Extracting Data - Patches


// To fetch file diffs, we must provide a stream, where they will be written:
ByteArrayOutputStream out = new ByteArrayOutputStream();
DiffFormatter df = new DiffFormatter(out);
df.setRepository(repository);

// parent is the commit being compared against, typically the first parent of commit
List<DiffEntry> diffs = df.scan(parent.getTree(), commit.getTree());

// Each of these objects is one file in the commit
for (DiffEntry diff : diffs) {
  FileHeader fh = df.toFileHeader(diff);
  df.format(diff);
  String diffText = out.toString("UTF-8");

  // Reset the stream, so we can get each patch separately
  out.reset();

  ...
}

Extracting Commit Metadata


// For this application, only find new or modified files
if (diff.getChangeType() == DiffEntry.ChangeType.MODIFY ||
    diff.getChangeType() == DiffEntry.ChangeType.ADD) {

  // We have enough information now to get commit messages, 
  // file names, and author information
  System.out.println(diff.getNewPath());
  System.out.println(commit.getFullMessage());
  System.out.println(commit.getAuthorIdent().getName());
  System.out.println(commit.getAuthorIdent().getEmailAddress());
}

Transformations - Search Data


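// Match each lower-case letter that is directly followed by an upper-case letter
Pattern capitals = Pattern.compile("([a-z])([A-Z])");
Matcher m = capitals.matcher(file);

// myAbstractFactory => my Abstract Factory
// (Java uses $1 and $2 for group references in the replacement string)
String fileNameTokens = 
  m.replaceAll("$1 $2");

// /project/src/large_grid.js => project src large grid js
fileNameTokens = fileNameTokens
   .replace("/", " ")
   .replace("-", " ")
   .replace(".", " ")
   .replace("_", " ");
  
// Index the tokenized file name alongside the commit message
String search =
   commit.getFullMessage() + 
   " " + fileNameTokens;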

Transformations - Companies


// x.y@google.com => google.com
String company = emailAddress.split("@")[1];

if (company.contains(".")) {
  // google.com => google
  company = company.substring(0, company.lastIndexOf("."));
}

return company;

Transformations - Commit Footers

Signed-off-by
Acked-by
Reported-by
Tested-by
CC, Cc
Bug

Loading Data in Solr


// Connections happen over HTTP:
HttpSolrServer server = new HttpSolrServer(
     "http://localhost:8080/solr");

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

SolrInputDocument doc = new SolrInputDocument();

// ID for the document contains enough information to find it in the source data
// (getName() gives the commit's hex SHA-1):
doc.addField("id", remoteUrl + "." + commit.getId().getName());

doc.addField("author", commit.getAuthorIdent().getName());
doc.addField("email", commit.getAuthorIdent().getEmailAddress());
doc.addField("message", commit.getFullMessage());

// Any data we let the user search against is included in this value:
doc.addField("search", search);

// The schema also marks "company" and "year" as required; set them the same way.

// Queue the document, send the batch, and make it searchable:
docs.add(doc);
server.add(docs);
server.commit();

Loading Data - Threading


File[] files = new File("repositories\\").listFiles();
String lastRepository = getLastRepository(); // resume after failure

int i = 0;
int numThreads = Runtime.getRuntime().availableProcessors();

// _myIndex is this thread's number, from 0 to numThreads - 1
for (File f : files) {
  if (i++ % numThreads == _myIndex) { // load every nth repository
    String filename = f.getAbsolutePath() + "\\.git";

    convertRepo(server, filename);
  }
}
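The "every nth repository" split can be tried out on its own. The sketch below is a self-contained version under my own assumptions: RepositoryPartitioner, slice, and runInParallel are hypothetical names, and the convert callback stands in for whatever does the real work per repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class RepositoryPartitioner {
  /** Returns every workerCount-th item, starting at position workerIndex. */
  static <T> List<T> slice(List<T> items, int workerIndex, int workerCount) {
    List<T> mine = new ArrayList<>();
    for (int i = 0; i < items.size(); i++) {
      if (i % workerCount == workerIndex) {
        mine.add(items.get(i));
      }
    }
    return mine;
  }

  /** Runs convert over all items, giving each worker thread its own slice. */
  static <T> void runInParallel(List<T> items, int workerCount, Consumer<T> convert) {
    ExecutorService pool = Executors.newFixedThreadPool(workerCount);
    for (int w = 0; w < workerCount; w++) {
      final List<T> mine = slice(items, w, workerCount);
      pool.execute(() -> mine.forEach(convert));
    }
    pool.shutdown();
    try {
      // Wait for all workers; large imports can take hours.
      pool.awaitTermination(1, TimeUnit.DAYS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

A fixed split like this is simple, but one very large repository can leave a single thread busy long after the others finish; a shared work queue is the usual alternative if that becomes a problem.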

Setting up Solr

Download Solr
Modify schema.xml
Start Jetty:
java -jar start.jar

Solr - Limitations

Don't expose Solr publicly
Don't rely on joins
Roll your own ACLs, if needed

Lessons Learned

Parsing Code Comments
Testing
High Volume Indexing
Inspiration for similar tools

Parsing Comments

        
@@ -1,9 +1,9 @@
/* 
* Here is a multi-line comment.
*/
String a = "x"; // here is a comment at the end of a line

-String b /* a comment in a line */ = "y"; 
+String b /* a long comment in a line */ = "y"; 

// Commented out code:
// String c = "d";

Haven't tried to parse code yet - a couple of example challenges

Comments show how hard it is to attribute the work a person did within a file to what they were really working on

Would have to build a parse tree for the code and compare before/after

At this point, you may have noticed that I haven't attempted to parse code. I have, however, given a lot of thought to why I don't want to parse code, or would at least handle it as a special case of unstructured text. What I want to show now is a couple of interesting theoretical challenges, which I think illustrate why it can be really hard to make an ETL process or a data migration project successful.

Code comments also provide good insight into who worked on what. Obviously these aren't code, but they often contain valuable information and should be relatively easy to detect, at least in most languages that descend from C. In a source file, you could detect them by looking for lines that contain two slashes, or the slash-star form for multi-line comments.
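Here is a naive sketch of that detection applied to a patch: it collects comment text only from added lines, while still tracking block comments that open on unchanged lines. It is deliberately simplistic (it will misread "//" inside a string literal, and it treats any line starting with "+++" or "---" as a file header), which is part of the point about how quickly this gets hard. PatchComments is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class PatchComments {
  /** Returns the comment text found on added ("+") lines of a unified diff. */
  static List<String> addedCommentText(String patch) {
    List<String> comments = new ArrayList<>();
    boolean inBlock = false; // inside a /* ... */ comment in the new version of the file
    for (String line : patch.split("\\r?\\n")) {
      if (line.isEmpty() || line.startsWith("+++") || line.startsWith("---")) {
        continue; // blank line or file header
      }
      char marker = line.charAt(0);
      if (marker != '+' && marker != ' ') {
        continue; // removed lines and hunk headers are not part of the new file
      }
      boolean added = marker == '+';
      String content = line.substring(1);
      int pos = 0;
      while (pos <= content.length()) {
        if (inBlock) {
          int end = content.indexOf("*/", pos);
          collect(comments, added, end >= 0 ? content.substring(pos, end) : content.substring(pos));
          if (end < 0) {
            break; // block comment continues on the next line
          }
          inBlock = false;
          pos = end + 2;
        } else {
          int slash = content.indexOf("//", pos);
          int block = content.indexOf("/*", pos);
          if (slash < 0 && block < 0) {
            break; // no comment on the rest of this line
          }
          if (slash >= 0 && (block < 0 || slash < block)) {
            collect(comments, added, content.substring(slash + 2));
            break; // a line comment runs to the end of the line
          }
          inBlock = true;
          pos = block + 2;
        }
      }
    }
    return comments;
  }

  private static void collect(List<String> out, boolean added, String text) {
    String trimmed = text.trim();
    if (added && !trimmed.isEmpty()) {
      out.add(trimmed);
    }
  }
}
```

Note that a one-word change inside a comment credits the whole comment to the committer, which is exactly the attribution problem described above.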

JSX


var MarkdownEditor = React.createClass({
  render: function() {
    return (
      <div className="MarkdownEditor">
        <h3>Input</h3>
        <div
          className="content"
          dangerouslySetInnerHTML={{
            __html: converter.makeHtml(this.state.value)
          }}
        />
      </div>
    );
  }
});

Testing ETL processes

Indexing Many Repositories

Constraints on ETL processes

Real-time updates are hard
Plan for power failures
Plan for reloading data
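One simple way to plan for power failures is to record the last repository that finished loading and skip past it on restart. The sketch below is one possible shape for that, not the code behind this talk: the Checkpoint class, its methods, and the idea of processing repositories in a fixed order are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

final class Checkpoint {
  private final Path file;

  Checkpoint(Path file) {
    this.file = file;
  }

  /** Returns the last repository recorded as finished, or null if there is none. */
  String last() {
    try {
      if (!Files.exists(file)) {
        return null;
      }
      return new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Records a finished repository; writes a temp file first to avoid a half-written checkpoint. */
  void record(String repository) {
    try {
      Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
      Files.write(tmp, repository.getBytes(StandardCharsets.UTF_8));
      Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Given repositories in a stable order, returns the ones still to be processed. */
  static List<String> remaining(List<String> orderedRepositories, String last) {
    if (last == null) {
      return orderedRepositories;
    }
    int index = orderedRepositories.indexOf(last);
    return index < 0
        ? orderedRepositories
        : orderedRepositories.subList(index + 1, orderedRepositories.size());
  }
}
```

This only works if a repository that was interrupted halfway can be safely loaded again, which is one reason to give every Solr document a deterministic id: re-adding the same commit overwrites it instead of duplicating it.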

Variations

Lots of free data, especially Government - PACER extracts, U.S. Code, Philly Code
MS Exchange / Sharepoint
D3.js, Tilemill

References

https://github.com/garysieling/git-solr-talk

https://github.com/garysieling/solr-git

THE END

BY Gary Sieling / @garysieling

gsieling@wingspan.com